Hadoop Interview Questions and Answers
Question - 21 : - Explain the major difference between HDFS block and InputSplit
Answer - 21 : -
In simple terms, an HDFS block is the physical representation of data, while an InputSplit is the logical representation of the data present in the blocks. The InputSplit acts as an intermediary between the block and the mapper.
Suppose there are two blocks:
Block 1: intel
Block 2: ipat
The mapper will read Block 1 from "i" to "l" but does not know how to process Block 2 at the same time. This is where InputSplit comes into play: it forms a logical group of Block 1 and Block 2 that is treated as a single block.
It then forms key-value pairs using the InputFormat and its RecordReader and sends them to the mapper for further processing. If you have limited resources, you can increase the split size to limit the number of map tasks. For instance, if a 640 MB file is stored as 10 blocks of 64 MB each and resources are limited, you can set the split size to 128 MB. Each logical group is then 128 MB, so only five map tasks run instead of ten.
However, if splitting is disabled (for example, when the InputFormat's isSplitable() method returns false), the whole file forms one InputSplit and is processed by a single map task, which takes longer as the file gets bigger.
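The arithmetic above can be checked with a small Python sketch. It is an illustration only, not Hadoop code: the function names are hypothetical, and the formula max(min_size, min(max_size, block_size)) is the commonly described way FileInputFormat picks a split size. The sketch ignores the small slack Hadoop allows on the last split.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    # Assumed formula: the split size is the block size, clamped
    # between the configured minimum and maximum split sizes.
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, split_size, splittable=True):
    # A non-splittable file always becomes a single InputSplit.
    if not splittable:
        return 1
    # Ceiling division: a partial trailing chunk still needs its own split.
    return -(-file_size // split_size)

file_size = 640 * MB
block_size = 64 * MB
no_limit = 2**63 - 1

default_size = compute_split_size(block_size, 1, no_limit)
raised_size = compute_split_size(block_size, 128 * MB, no_limit)

print(count_splits(file_size, default_size))                    # 10
print(count_splits(file_size, raised_size))                     # 5
print(count_splits(file_size, default_size, splittable=False))  # 1
```

Raising the minimum split size above the block size is what halves the number of map tasks in this example.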
Question - 22 : - How is Hadoop different from other parallel computing systems?
Answer - 22 : -
Hadoop is a framework built around a distributed file system (HDFS) that lets you store and handle large amounts of data on a cluster of machines while handling data redundancy.
The primary benefit of this is that since the data is stored in several nodes, it is better to process it in a distributed manner. Each node can process the data stored on it, instead of spending time moving the data over the network.
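The idea of moving the computation to the data can be illustrated with a toy Python sketch. Everything here is hypothetical (node names, block IDs, the assign_tasks helper); it only shows the preference for a node that already holds a replica of the block, not how Hadoop's schedulers are actually implemented.

```python
def assign_tasks(block_locations, available_nodes):
    """Pick a node for each block, preferring one that already stores it."""
    assignments = {}
    for block, replicas in block_locations.items():
        local_nodes = [node for node in replicas if node in available_nodes]
        if local_nodes:
            # Data-local: the task reads the block from the node's own disk.
            assignments[block] = (local_nodes[0], "local")
        else:
            # No replica holder is available: the block must cross the network.
            assignments[block] = (sorted(available_nodes)[0], "remote")
    return assignments

# Hypothetical cluster state: which nodes hold a replica of each block.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node3"],
}
for block, (node, kind) in assign_tasks(block_locations, {"node1", "node2"}).items():
    print(block, node, kind)
```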
In contrast, a relational database lets you query the data in real time, but storing the data in tables, records, and columns becomes inefficient when the data is large.
Hadoop also provides a way to build a column-oriented database with HBase for runtime queries on rows.
Listed below are the main components of Hadoop:
- HDFS: HDFS is Hadoop’s storage unit.
- MapReduce: MapReduce is Hadoop’s processing unit.
- YARN: YARN is the resource management unit of Apache Hadoop.
Question - 23 : - Can you list the limitations of Hadoop?
Answer - 23 : -
Hadoop is considered a very important Big Data management tool. However, like other tools, it also has some limitations of its own. They are as below:
- In Hadoop, you can configure only one NameNode.
- Hadoop is suitable only for the batch processing of a large amount of data.
- Hadoop can run only MapReduce jobs.
- Hadoop supports only one NameNode and one namespace for each cluster.
- Hadoop does not facilitate horizontal scalability of NameNode.
- An hourly backup of the metadata from the NameNode needs to be given to the Secondary NameNode.
- Hadoop can support only up to 4000 nodes per cluster.
- In Hadoop 1.x, a single component, the JobTracker, performs the majority of the activities, such as managing Hadoop resources, scheduling jobs, monitoring jobs, and rescheduling jobs.
- Because the JobTracker carries so many responsibilities, it is a single point of failure in Hadoop.
- Real-time data processing is not possible with Hadoop.
Question - 24 : - Name the different configuration files in Hadoop
Answer - 24 : -
The different configuration files in Hadoop are:
- mapred-site.xml
- core-site.xml
- hdfs-site.xml
- yarn-site.xml
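Each of these files holds a list of name/value properties inside a configuration element. As a rough illustration, the Python sketch below parses a sample core-site.xml fragment with the standard library. The fs.defaultFS property is a real core-site.xml setting, but the host name and port are placeholders, and read_properties is a hypothetical helper, not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Sample file content; the host and port are made-up placeholders.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>
"""

def read_properties(xml_text):
    """Return the name -> value pairs found in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }

print(read_properties(SAMPLE_CORE_SITE))
```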
Question - 25 : - Can you skip the bad records in Hadoop? How?
Answer - 25 : -
In Hadoop, there is an option where sets of input records can be skipped while processing map inputs. This feature is managed by the applications through the SkipBadRecords class.
The SkipBadRecords class is commonly used when map tasks fail repeatedly on certain input records, typically because of bugs in the map function. Using this class, such bad records can be skipped so that the rest of the input is still processed.
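The behaviour can be pictured with a toy Python simulation. This is not the Hadoop API (SkipBadRecords is a Java class) and the function names are hypothetical. It is also simplified: it skips exactly the record that crashed the previous attempt, whereas Hadoop works with ranges of records around the failure.

```python
def run_attempt(records, map_fn, skipped):
    """Run one task attempt; return (output, None) or (None, failing_index)."""
    output = []
    for index, record in enumerate(records):
        if index in skipped:
            continue  # known bad record from an earlier attempt
        try:
            output.append(map_fn(record))
        except Exception:
            return None, index  # the whole attempt fails at this record
    return output, None

def run_with_skipping(records, map_fn, max_attempts=4):
    """Retry the task, skipping each record that crashed a previous attempt."""
    skipped = set()
    for _ in range(max_attempts):
        output, bad_index = run_attempt(records, map_fn, skipped)
        if bad_index is None:
            return output, sorted(skipped)
        skipped.add(bad_index)
    raise RuntimeError("task still failing after %d attempts" % max_attempts)

# The third record cannot be parsed by int(), so it is skipped on the retry.
print(run_with_skipping(["1", "2", "oops", "4"], int))  # ([1, 2, 4], [2])
```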
Question - 26 : - What are the various components of Apache HBase?
Answer - 26 : -
Apache HBase has three main components:
- HMaster: It manages and coordinates the region servers, just like the NameNode manages DataNodes in HDFS.
- Region Server: A table can be divided into multiple regions, and a region server serves a group of regions to the clients.
- ZooKeeper: ZooKeeper is the coordinator in the distributed environment of HBase. It communicates through sessions to maintain the state of the servers in the cluster.
Question - 27 : - What is the syntax to run a MapReduce program?
Answer - 27 : -
The syntax used to run a MapReduce program is hadoop jar hadoop_jar_file.jar /input_path /output_path.
Question - 28 : - Which command will you give to copy data from the local system onto HDFS?
Answer - 28 : -
hadoop fs -copyFromLocal [source] [destination]
Question - 29 : - What are the components of Apache HBase’s Region Server?
Answer - 29 : -
The following are the components of HBase’s region server:
- BlockCache: It resides on the region server and keeps frequently read data in memory.
- WAL: The write-ahead log (WAL) is a file attached to each region server in the distributed environment.
- MemStore: MemStore is the write cache that holds incoming data before it is written to disk or permanent storage.
- HFile: The HFile is stored in HDFS and holds the cells on disk.
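How these four components cooperate can be sketched with a toy Python class. This is a heavily simplified model with hypothetical names (ToyRegionServer, flush_threshold), not HBase code: a write goes to the WAL and then the MemStore, a full MemStore is flushed to a new HFile, and a read consults the in-memory structures before falling back to the HFiles.

```python
class ToyRegionServer:
    def __init__(self, flush_threshold=2):
        self.wal = []            # append-only log of every write
        self.memstore = {}       # in-memory write cache
        self.hfiles = []         # flushed, immutable files (oldest first)
        self.block_cache = {}    # in-memory read cache
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))    # 1. log the write first
        self.memstore[row] = value       # 2. then buffer it in memory
        self.block_cache.pop(row, None)  # drop any stale cached value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the MemStore as a new sorted HFile and start a fresh one.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row):
        if row in self.memstore:         # newest, not yet flushed data
            return self.memstore[row]
        if row in self.block_cache:      # recently read data
            return self.block_cache[row]
        for hfile in reversed(self.hfiles):  # newest file first
            if row in hfile:
                self.block_cache[row] = hfile[row]
                return hfile[row]
        return None

server = ToyRegionServer()
server.put("row1", "a")
server.put("row2", "b")      # second write triggers a flush
print(server.get("row1"))    # a  (read from the HFile, then cached)
```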
Question - 30 : - What are the various schedulers in YARN?
Answer - 30 : -
The following schedulers are available in YARN:
- FIFO Scheduler: The first-in-first-out (FIFO) scheduler places all the applications in a single queue and executes them in the order of their submission. Because long-running applications can block short ones, the FIFO scheduler is less efficient and less desirable in practice.
- Capacity Scheduler: A separate, dedicated queue makes it possible to start executing short jobs as soon as they are submitted. The trade-off is that long-running jobs finish later than they would under the FIFO scheduler.
- Fair Scheduler: The fair scheduler, as the name suggests, shares resources fairly. It balances the resources dynamically between all the running jobs and does not need to reserve a specific capacity for them.
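The difference between FIFO and fair sharing can be seen in a toy Python simulation. It is a sketch under strong assumptions, not YARN code: all jobs are submitted at time 0, the cluster is modelled as one unit of capacity, and the job names and work sizes are made up.

```python
def fifo_completion_times(jobs):
    """jobs: list of (name, work_units) in submission order."""
    finished, clock = {}, 0
    for name, units in jobs:
        clock += units           # each job waits for all jobs before it
        finished[name] = clock
    return finished

def fair_completion_times(jobs):
    """All running jobs get an equal share of the capacity at any moment."""
    remaining = dict(jobs)
    finished, clock = {}, 0
    while remaining:
        smallest = min(remaining.values())
        # With n jobs sharing the capacity, finishing `smallest` units
        # of work on every job takes smallest * n time units.
        clock += smallest * len(remaining)
        for name in list(remaining):
            remaining[name] -= smallest
            if remaining[name] == 0:
                finished[name] = clock
                del remaining[name]
    return finished

jobs = [("long", 10), ("short", 1)]
print(fifo_completion_times(jobs))  # {'long': 10, 'short': 11}
print(fair_completion_times(jobs))  # {'short': 2, 'long': 11}
```

In this example the short job finishes at time 2 under fair sharing instead of time 11 under FIFO, while the long job finishes at time 11 instead of 10.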