Hadoop Interview Questions and Answers
Question - 61 : - What is speculative execution in Hadoop?
Answer - 61 : -
One limitation of Hadoop is that, because tasks are distributed across several nodes, a few slow nodes can hold back the rest of the job. Tasks can be slow for various reasons that are not always easy to detect. Instead of identifying and fixing slow-running tasks, Hadoop detects when a task is running slower than expected and launches an equivalent task on another node as a backup. This backup mechanism in Hadoop is speculative execution.
Speculative execution creates a duplicate task on another node, so the same input can be processed more than once in parallel. When most tasks in a job are close to completion, the speculative execution mechanism schedules duplicate copies of the remaining slower tasks on nodes that are currently free. Whichever copy of a task finishes first reports its completion to the JobTracker; if other copies are still executing speculatively, Hadoop tells the TaskTrackers to kill those tasks and discard their output.
Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.
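As a hedged sketch of how this might look in a job driver, assuming the Hadoop 2.x property names mapreduce.map.speculative and mapreduce.reduce.speculative (the older mapred.* keys above remain as deprecated aliases):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
// Inside the driver's main() method:
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.speculative", false);     // no speculative map tasks
conf.setBoolean("mapreduce.reduce.speculative", false);  // no speculative reduce tasks
Job job = Job.getInstance(conf, "job-without-speculation");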
Question - 62 : - What is Apache Oozie?
Answer - 62 : -
Apache Oozie is a workflow scheduler for Hadoop: it schedules Hadoop jobs and bundles them into a single logical unit of work. Oozie jobs can largely be divided into the following two categories:
Oozie Workflow: These jobs are a set of sequential actions that need to be executed (a minimal workflow definition is sketched below).
Oozie Coordinator: These jobs are triggered when the data they depend on becomes available; until then, they remain idle.
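As an illustrative sketch, a minimal Oozie workflow definition that runs a single MapReduce action might look roughly like this (the workflow name, action name, and ${...} parameters are hypothetical placeholders):
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>MapReduce action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>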
Question - 63 : - What happens if you try to run a Hadoop job with an output directory that is already present?
Answer - 63 : -
It will throw an exception saying that the output directory already exists.
Before running a MapReduce job, ensure that the output directory does not already exist in HDFS.
The directory can be deleted from the shell before running the job:
hadoop fs -rm -r /path/to/your/output/
Or the Java API:
FileSystem.get(conf).delete(outputDir, true);
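A slightly fuller hedged sketch of a driver that removes the output directory (if present) before submitting the job; the path used here is illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
// Inside the driver, before job submission:
Configuration conf = new Configuration();
Path outputDir = new Path("/path/to/your/output");  // illustrative output path
FileSystem fs = FileSystem.get(conf);
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true);  // true = delete recursively
}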
Question - 64 : - How to configure the replication factor in HDFS?
Answer - 64 : -
The hdfs-site.xml file is used to configure HDFS. Changing the dfs.replication property in hdfs-site.xml will change the default replication for all the files placed in HDFS.
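For example, a minimal hdfs-site.xml entry that sets the default replication factor (the value 3 shown here is the usual default and purely illustrative):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>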
The replication factor can also be modified on a per-file basis using the Hadoop FS shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Similarly, the replication factor of all the files under a directory can be changed in a single command:
[training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir
Question - 65 : - How to compress a mapper output not touching reducer output?
Answer - 65 : -
To achieve this compression, the following should be set:
conf.set("mapreduce.map.output.compress", true)
conf.set("mapreduce.output.fileoutputformat.compress", false)
Question - 66 : - What are the basic parameters of a mapper?
Answer - 66 : -
A mapper is declared with four basic type parameters: the input key/value types and the output key/value types. For a typical text-processing job these are (as shown in the sketch below):
- Input: LongWritable (key) and Text (value)
- Output: Text (key) and IntWritable (value)
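For instance, a standard word count mapper (a common illustrative example) declares these four type parameters explicitly:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Mapper<input key, input value, output key, output value>
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (token, 1) for every whitespace-separated token in the line.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}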
Question - 67 : - How can you transfer data from Hive to HDFS?
Answer - 67 : -
By writing the query:
hive> insert overwrite directory '/user/hive/output' select * from emp;
The selected data is exported from Hive and stored as part files under the specified HDFS directory.
Question - 68 : - Which companies use Hadoop?
Answer - 68 : -
- Yahoo!: the biggest contributor to the creation of Hadoop; its search engine runs on Hadoop
- Facebook: developed Hive for analysis
- Amazon
- Netflix
- Adobe
- eBay
- Spotify
- Twitter
Question - 69 : - What are the different vendor-specific distributions of Hadoop?
Answer - 69 : -
The main vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure HDInsight, IBM InfoSphere BigInsights, and Hortonworks (now merged into Cloudera).
Question - 70 : - What are the different Hadoop configuration files?
Answer - 70 : -
The different Hadoop configuration files include:
- hadoop-env.sh
- mapred-site.xml
- core-site.xml
- yarn-site.xml
- hdfs-site.xml
- masters and slaves files
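For instance, a minimal core-site.xml usually just points clients at the NameNode; the host and port below are illustrative:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>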