Hadoop Interview Questions and Answers
Question - 71 : - What are the three modes in which Hadoop can run?
Answer - 71 : -
The three modes in which Hadoop can run are:
- Standalone mode: This is the default mode. It uses the local FileSystem and a single Java process to run the Hadoop services.
- Pseudo-distributed mode: All Hadoop daemons run on a single node, each in its own Java process (a minimal configuration sketch follows this list).
- Fully-distributed mode: This uses separate nodes to run Hadoop master and slave services.
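For reference, here is a minimal sketch of bringing up a pseudo-distributed deployment. It assumes a Hadoop 2.x+ layout; fs.defaultFS and dfs.replication are the standard property names, and the port and paths are only illustrative.
# In etc/hadoop/core-site.xml, point fs.defaultFS at hdfs://localhost:9000
# In etc/hadoop/hdfs-site.xml, set dfs.replication to 1 (single node, so no extra replicas)
hdfs namenode -format        # format the local NameNode once
./sbin/start-dfs.sh          # start NameNode, DataNode and SecondaryNameNode on this machine
jps                          # list the running daemons, each in its own JVM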
Question - 72 : - What are the differences between regular FileSystem and HDFS?
Answer - 72 : -
- Regular FileSystem: Data is maintained on a single system. If the machine crashes, data recovery is difficult because fault tolerance is low, and processing large volumes of data takes longer because all the I/O happens on one machine.
- HDFS: Data is distributed and maintained across multiple systems. If a DataNode crashes, data can still be recovered from replicas on other nodes in the cluster, so fault tolerance is high. Reading a small amount of data can take comparatively longer, because the read goes to disk on the DataNodes and has to be coordinated across multiple machines, but throughput on large datasets is much higher.
Question - 73 : - What are the two types of metadata that a NameNode server holds?
Answer - 73 : -
The two types of metadata that a NameNode server holds are:
- Metadata on disk - This contains the edit log and the FSImage
- Metadata in RAM - This contains information about the DataNodes and the block locations they report
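A quick, sketched way to see both kinds of metadata on a running cluster (here /data/namenode stands in for whatever dfs.namenode.name.dir points to; the path is hypothetical):
ls /data/namenode/current    # on-disk metadata: fsimage_*, edits_* and VERSION files
hdfs dfsadmin -report        # in-RAM view: live/dead DataNodes, capacity and block counts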
Question - 74 : - How can you restart NameNode and all the daemons in Hadoop?
Answer - 74 : -
The following commands will help you restart NameNode and all the daemons:
You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again with the ./sbin/hadoop-daemon.sh start namenode command.
You can stop all the daemons with the ./sbin/stop-all.sh command and then start them again with the ./sbin/start-all.sh command.
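On Hadoop 3.x the per-daemon scripts are deprecated in favor of the --daemon form; if your cluster runs 3.x, the equivalents are roughly:
hdfs --daemon stop namenode                  # stop only the NameNode
hdfs --daemon start namenode                 # start it again
./sbin/stop-dfs.sh; ./sbin/stop-yarn.sh      # stop-all.sh is deprecated; stop HDFS and YARN separately
./sbin/start-dfs.sh; ./sbin/start-yarn.sh    # and bring them back up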
Question - 75 : - Which command will help you find the status of blocks and FileSystem health?
Answer - 75 : -
To check the status of the blocks, use the command:
hdfs fsck <path> -files -blocks
To check the health status of the FileSystem, use the command:
hdfs fsck / -files -blocks -locations > dfs-fsck.log
The -locations option also prints the DataNodes holding each block, and redirecting the output to dfs-fsck.log saves the report for review.
Question - 76 : - How do you copy data from the local system onto HDFS?
Answer - 76 : -
The following command will copy data from the local file system onto HDFS:
hadoop fs -copyFromLocal [source] [destination]
Example: hadoop fs -copyFromLocal /tmp/data.csv /user/test/data.csv
In the above syntax, the source is the local path and the destination is the HDFS path. Use the -f (force) option to overwrite the destination file if it already exists on HDFS, as shown below.
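For example, re-copying the same file with the overwrite flag (same paths as the example above):
hadoop fs -copyFromLocal -f /tmp/data.csv /user/test/data.csv    # -f overwrites /user/test/data.csv if it already exists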
Question - 77 : - When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?
Answer - 77 : -
These commands are used to refresh the node information, either while commissioning new nodes or after decommissioning of nodes has completed.
dfsadmin -refreshNodes
This is run through the HDFS admin client and makes the NameNode re-read its node include/exclude configuration.
rmadmin -refreshNodes
This is run through the YARN admin client and performs the same refresh for the ResourceManager.
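As typically run from the admin clients, the full commands look like this (a sketch; the include/exclude host lists are what these commands re-read):
hdfs dfsadmin -refreshNodes    # NameNode re-reads the dfs.hosts / dfs.hosts.exclude files
yarn rmadmin -refreshNodes     # ResourceManager re-reads its NodeManager include/exclude lists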
Question - 78 : - Is there any way to change the replication of files on HDFS after they are already written to HDFS?
Answer - 78 : -
Yes, the following are ways to change the replication of files on HDFS:
We can change the dfs.replication value to a particular number in the hdfs-site.xml file under $HADOOP_HOME/etc/hadoop (conf/hadoop-site.xml in very old releases). Only files written after the change pick up the new replication factor; existing files keep theirs.
If you want to change the replication factor for a particular file or directory, use:
$HADOOP_HOME/bin/hadoop fs -setrep -w 4 /path/to/file
Example: $HADOOP_HOME/bin/hadoop fs -setrep -w 4 /user/temp/test.csv
The -w flag makes the command wait until the replication change is complete.
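A sketch of the first method, assuming dfs.replication lives in hdfs-site.xml on your cluster, followed by a command to verify the replication factor of an existing file:
#   <property>
#     <name>dfs.replication</name>
#     <value>2</value>
#   </property>
hdfs dfs -stat "%r" /user/temp/test.csv    # %r prints the file's current replication factor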
Question - 79 : - Explain the process of spilling in MapReduce.
Answer - 79 : -
Spilling is the process of copying data from the in-memory buffer to disk when buffer usage reaches a threshold. It happens when the mapper output is too large to fit in the memory buffer. By default, a background thread starts spilling the buffer contents to disk once the buffer is 80 percent full.
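The spill behavior is controlled by two properties (Hadoop 2.x+ names). The jar, class, and paths in the example are hypothetical, and passing them with -D assumes the driver uses ToolRunner:
# mapreduce.task.io.sort.mb        - size of the in-memory sort buffer, default 100 MB
# mapreduce.map.sort.spill.percent - fill ratio that triggers a spill, default 0.80
hadoop jar myjob.jar MyDriver -D mapreduce.task.io.sort.mb=200 -D mapreduce.map.sort.spill.percent=0.90 /input /output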
Question - 80 : - How can you set the mappers and reducers for a MapReduce job?
Answer - 80 : -
The number of mappers and reducers can be set in the command line using:
-D mapred.map.tasks=5 -D mapred.reduce.tasks=2
In the code, one can configure JobConf variables:
job.setNumMapTasks(5); // 5 mappers
job.setNumReduceTasks(2); // 2 reducers
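Note that the number of map tasks is only a hint to the framework (the actual count is driven by the number of input splits), while the number of reduce tasks is honored exactly. A full command-line sketch, with a hypothetical jar, class, and paths, assuming the driver uses ToolRunner so -D is parsed:
hadoop jar myjob.jar MyDriver -D mapred.map.tasks=5 -D mapred.reduce.tasks=2 /input /output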