Big Data Interview Questions and Answers
Question - 31 : - What is the use of jps command in Hadoop?
Answer - 31 : - The jps command is used to check whether the Hadoop daemons are running properly. It lists all the Hadoop daemon processes running on a machine, i.e. NameNode, DataNode, NodeManager, ResourceManager, etc.
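For example, running jps on a single-node cluster prints the process ID and name of each daemon JVM. The output looks something like this (the PIDs here are illustrative):
$ jps
4567 NameNode
4789 DataNode
5012 ResourceManager
5234 NodeManager
5456 SecondaryNameNode
5678 Jps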
Question - 32 : - Explain the process that overwrites the replication factors in HDFS.
Answer - 32 : -
There are two methods to overwrite the replication factors in HDFS –
Method 1: On File Basis
In this method, the replication factor is changed on a per-file basis using the Hadoop FS shell. The command used for this is:
$ hadoop fs -setrep -w 2 /my/test_file
Here, test_file is the file whose replication factor will be set to 2.
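The change can be verified with the -stat command, whose %r format option prints a file's replication factor; once the command above completes, this should print 2:
$ hadoop fs -stat %r /my/test_file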
Method 2: On Directory Basis
In this method, the replication factor is changed on a per-directory basis, i.e. the replication factor of all the files under a given directory is modified.
$ hadoop fs -setrep -w 5 /my/test_dir
Here, test_dir is the name of the directory; the replication factor for the directory and all the files in it will be set to 5.
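Listing the directory afterwards shows the new replication factor in the second column of the output, for example (a purely illustrative listing):
$ hadoop fs -ls /my/test_dir
-rw-r--r--   5 hadoop supergroup    1048576 2024-01-01 10:00 /my/test_dir/part-00000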
Question - 33 : - What will happen with a NameNode that doesn’t have any data?
Answer - 33 : - A NameNode without any data doesn’t exist in Hadoop. A running NameNode always holds the metadata of the HDFS namespace; if there is a NameNode, it contains data, otherwise it won’t exist.
Question - 34 : - How Is Hadoop CLASSPATH essential to start or stop Hadoop daemons?
Answer - 34 : -
CLASSPATH includes the directories containing the jar files required to start or stop the Hadoop daemons. Hence, setting CLASSPATH is essential to start or stop them.
However, setting CLASSPATH manually every time is not the standard practice. Usually CLASSPATH is written inside the /etc/hadoop/hadoop-env.sh file, so once we run Hadoop, it loads the CLASSPATH automatically.
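A minimal sketch of the kind of entry hadoop-env.sh carries (the extra jar directory here is a placeholder):
# Append site-specific jars to the classpath Hadoop builds at startup
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/extra-jars/*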
Question - 35 : - Why is HDFS only suitable for large data sets and not the correct tool to use for many small files?
Answer - 35 : -
This is due to a performance limitation of the NameNode. The NameNode keeps the metadata of every file, directory, and block in memory, and each of these objects consumes a small, fixed amount of heap (roughly 150 bytes is the commonly cited figure). A large data set stored as a few large files needs very little metadata, but the same data stored as many small files multiplies the number of objects the NameNode must track, so its memory is exhausted long before the cluster's storage capacity is used.
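A rough worked example, assuming a 128 MB block size and the commonly cited ~150 bytes of NameNode heap per metadata object:
1 GB stored as one file: 1 file + 8 blocks = 9 objects, roughly 1.4 KB of heap.
1 GB stored as 10,000 files of 100 KB each: 10,000 files + 10,000 blocks = 20,000 objects, roughly 3 MB of heap.
The same volume of data costs the NameNode about 2,000 times more memory when it arrives as small files.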
Question - 36 : - DFS can handle a large volume of data then why do we need Hadoop framework?
Answer - 36 : -
Hadoop is not only for storing big data but also for processing it. Though a DFS (Distributed File System) can also store the data, it lacks the features below:
It is not fault tolerant.
Data movement over the network depends on bandwidth, as there is no provision for moving computation to the data.
Question - 37 : - What is Sequencefileinputformat?
Answer - 37 : -
Hadoop uses a specific file format known as a sequence file, which stores data as serialized key-value pairs. SequenceFileInputFormat is an input format for reading sequence files.
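Sequence files are binary, but their contents can be inspected from the shell: the hadoop fs -text command decodes a sequence file into plain-text key-value pairs (the path below is a placeholder):
$ hadoop fs -text /my/input.seq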