Big Data Interview Questions and Answers
Question - 41 : - What are the three modes in which Hadoop can run?
Answer - 41 : -
Local Mode or Standalone Mode:
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. Instead of HDFS, this mode uses the local file system. It is most helpful for debugging, and there is no need to configure core-site.xml, hdfs-site.xml, mapred-site.xml, or the masters and slaves files. Standalone mode is ordinarily the fastest mode in Hadoop.
Pseudo-distributed Mode:
In this mode, all daemons run on a single machine, each in a separate Java process. This mode requires custom configuration (core-site.xml, hdfs-site.xml, mapred-site.xml). HDFS is used for input and output. This deployment mode is useful for testing and debugging.
Fully Distributed Mode:
This is the production mode of Hadoop. One machine in the cluster is designated exclusively as the NameNode and another as the ResourceManager; these are the masters. The remaining nodes act as DataNodes and NodeManagers; these are the slaves. Configuration parameters and the environment need to be defined for the Hadoop daemons. This mode provides fully distributed computing capacity, security, fault tolerance, and scalability.
Question - 42 : - Mention the common input formats in Hadoop.
Answer - 42 : -
The common input formats in Hadoop are -
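- Text Input Format: This is the default input format in Hadoop. Each line of a plain text file becomes a record; the key is the byte offset of the line and the value is the line's content.
- Key-Value Input Format: Used to read plain text files in which each line is split into a key and a value by a separator (a tab by default).
- Sequence File Input Format: Used to read sequence files, Hadoop's binary file format for storing key-value pairs.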
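The difference between the first two formats is easiest to see on a small input. The following is a minimal Python sketch, not Hadoop code: the function names `text_input_records` and `key_value_records` are made up for illustration, and they only imitate how the text-based formats are understood to turn lines into (key, value) records.

```python
def text_input_records(data: bytes):
    """Imitate Text Input Format: key = byte offset of the line, value = the line."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


def key_value_records(data: bytes, separator: str = "\t"):
    """Imitate Key-Value Input Format: split each line at the first separator."""
    for raw_line in data.splitlines():
        line = raw_line.decode("utf-8")
        # partition() leaves the value empty when the separator is missing.
        key, _, value = line.partition(separator)
        yield key, value


sample = b"alice\t30\nbob\t25\n"
text_records = list(text_input_records(sample))
kv_records = list(key_value_records(sample))
print(text_records)
print(kv_records)
```

The same two lines yield offset-keyed records in the first case and name-keyed records in the second, which is why the choice of input format changes what the mapper receives.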
Question - 43 : - What are the different Output formats in Hadoop?
Answer - 43 : -
The different Output formats in Hadoop are -
- TextOutputFormat: TextOutputFormat is the default output format in Hadoop.
- MapFileOutputFormat: MapFileOutputFormat is used to write the output as map files in Hadoop.
- DBOutputFormat: DBOutputFormat is used for writing output to relational databases (SQL tables); HBase output normally goes through HBase's own TableOutputFormat instead.
- SequenceFileOutputFormat: SequenceFileOutputFormat is used for writing sequence files.
- SequenceFileAsBinaryOutputFormat: SequenceFileAsBinaryOutputFormat is used to write keys and values to a sequence file in raw binary format.
Question - 44 : - What are the different big data processing techniques?
Answer - 44 : -
Big Data processing methods analyze big data sets at a massive scale. Offline batch processing typically runs at full power and full scale, tackling arbitrary BI scenarios. In contrast, real-time stream processing works on the most recent slice of data, for tasks such as data profiling to pick out outliers, fraud detection, and safety monitoring. The most challenging task, however, is fast or real-time ad-hoc analytics on a large, comprehensive data set. In essence, this means scanning huge volumes of data within seconds, which is only possible when the data is processed with high parallelism.
Different techniques of Big Data Processing are:
- Batch Processing of Big Data
- Big Data Stream Processing
- Real-Time Big Data Processing
- MapReduce
Question - 45 : - What is MapReduce in Hadoop?
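The contrast between batch and stream processing can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not a production technique: the function names, the window size, and the "twice the recent mean" outlier rule are all invented for the example.

```python
from collections import deque
from statistics import mean


def batch_average(values):
    """Batch processing: scan the complete data set once and summarise it."""
    return mean(values)


def stream_outliers(values, window=3, factor=2.0):
    """Stream processing: compare each new value with a short window of
    recent values and flag it when it exceeds `factor` times their mean."""
    recent = deque(maxlen=window)
    flagged = []
    for value in values:
        if len(recent) == window and value > factor * mean(recent):
            flagged.append(value)
        recent.append(value)
    return flagged


transactions = [10, 12, 11, 95, 13, 12, 11]
print(batch_average(transactions))
print(stream_outliers(transactions))
```

The batch function needs the whole data set before it can answer, while the stream function keeps only a small window in memory and can react as each value arrives.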
Answer - 45 : -
Hadoop MapReduce is a software framework for processing enormous data sets. It is the main component for data processing in the Hadoop framework. It divides the input data into several parts and runs a program on every part in parallel. The word MapReduce refers to two separate and distinct tasks. The first is the map operation, which takes a set of data and transforms it into another set of data in which individual elements are broken down into key-value tuples. The reduce operation then takes the tuples that share a key and combines their values into a smaller set of results for that key.
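The two operations can be illustrated with the classic word count. The sketch below is plain Python running in a single process, not the Hadoop API; `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names, and the grouping step stands in for the work the framework does between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) tuple for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share a key into one result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in sorted(shuffle(mapped).items()))
print(counts)
```

In a real cluster each line (or block of lines) would be mapped on a different node, and each key would be reduced on whichever node receives that key's group.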
Question - 46 : - When should you use MapReduce with Big Data?
Answer - 46 : -
MapReduce is a programming model created for distributed, parallel computation on big data sets. A MapReduce model has a map function, which performs filtering and sorting, and a reduce function, which serves as a summary operation.
MapReduce is an important part of the Apache Hadoop open-source ecosystem, and it’s extensively used for querying and selecting data in the Hadoop Distributed File System (HDFS). A wide range of queries can be expressed using the broad spectrum of MapReduce algorithms available for selecting data. In addition, MapReduce is fit for iterative computation involving large quantities of data requiring parallel processing. This is because it represents a data flow rather than a procedure.
The more data we produce and accumulate, the greater the need to process all that data to make it usable. MapReduce’s iterative, parallel processing programming model is a good tool for making sense of big data.
Question - 47 : - Mention the core methods of Reducer.
Answer - 47 : -
The core methods of a Reducer are:
- setup(): called once at the start of the task, before any keys are processed, to configure the parameters the reducer needs.
- reduce(): the primary operation of the reducer. It is called once per key and defines the work to be done on the set of values that share that key.
- cleanup(): called once at the end of the task, after the last reduce() call, to clean up or delete any temporary files or data.
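The order in which these three methods are called can be sketched as follows. This is a Python imitation of the lifecycle, not Hadoop's Java Reducer class: `MaxTemperatureReducer`, `run_reducer`, and the `threshold` parameter are hypothetical names chosen for the example.

```python
class MaxTemperatureReducer:
    """Hypothetical reducer mirroring the setup/reduce/cleanup lifecycle."""

    def setup(self, config):
        # Called once before any keys are processed: read parameters.
        self.threshold = config.get("threshold", 0)
        self.keys_seen = 0
        self.output = []

    def reduce(self, key, values):
        # Called once per key with all values that share that key.
        self.keys_seen += 1
        highest = max(values)
        if highest >= self.threshold:
            self.output.append((key, highest))

    def cleanup(self):
        # Called once after the last key: report and reset temporary state.
        summary = {"keys_seen": self.keys_seen}
        self.keys_seen = 0
        return summary


def run_reducer(reducer, grouped, config):
    """Drive the lifecycle in the order the framework would."""
    reducer.setup(config)
    for key in sorted(grouped):
        reducer.reduce(key, grouped[key])
    return reducer.output, reducer.cleanup()


grouped = {"1950": [0, 22, -11], "1949": [111, 78]}
output, summary = run_reducer(MaxTemperatureReducer(), grouped, {"threshold": 50})
print(output, summary)
```

Note that setup() and cleanup() each run once per task, while reduce() runs once for every distinct key.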
Question - 48 : - Explain the distributed Cache in the MapReduce framework.
Answer - 48 : -
Distributed Cache is a significant feature of the MapReduce framework, used when you want to share files across all nodes in a Hadoop cluster. These files can be jar files or simple properties files. The framework can cache small to moderately sized read-only files, such as text files, zip files, and jar files, and distribute them to all the DataNodes (worker nodes) on which MapReduce tasks are running. Each DataNode receives its own local copy of the file from the Distributed Cache.
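A typical use is enriching records with a small lookup table that every worker holds locally. The sketch below imitates that idea in plain Python and does not use Hadoop's API: the worker names, the file contents, and the functions `load_cached_lookup` and `map_with_cache` are assumptions made for illustration.

```python
def load_cached_lookup(cached_lines):
    """Parse the small read-only file that every worker received a copy of."""
    lookup = {}
    for line in cached_lines:
        code, name = line.strip().split(",", 1)
        lookup[code] = name
    return lookup


def map_with_cache(record, lookup):
    """Enrich one input record using the local copy of the lookup table."""
    user, country_code = record.split(",")
    return user, lookup.get(country_code, "UNKNOWN")


cached_file = ["DE,Germany", "IN,India"]
partitions = {"worker-1": ["ann,DE", "raj,IN"], "worker-2": ["li,CN"]}

results = []
for worker, records in sorted(partitions.items()):
    # Each worker loads its own local copy instead of fetching it per record.
    local_lookup = load_cached_lookup(cached_file)
    results.extend(map_with_cache(record, local_lookup) for record in records)
print(results)
```

Because the lookup file is small and read-only, copying it to every worker once is cheaper than shipping the large input data to wherever the lookup table lives.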
Question - 49 : - Explain overfitting in big data. How can it be avoided?
Answer - 49 : -
Overfitting is a modeling error in which a model is fitted too tightly to the data, i.e., when a modeling function follows a limited data set too closely. Overfitting reduces the predictive power of such models: they fail to generalize when applied outside the sample data.
There are several methods to avoid overfitting; some of them are:
- Cross-validation: The data is divided into multiple small subsets (folds); the model is trained on some folds and evaluated on the held-out fold, and the results are used to tune the model.
- Early stopping: After a certain number of training iterations, the model's ability to generalize starts to weaken. Early stopping halts training before the model crosses that point.
- Regularization: This method penalizes all the parameters except the intercept so that the model generalizes from the data instead of overfitting.
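Cross-validation and regularization are often used together: the penalty strength is chosen by comparing held-out error across folds. The following is a minimal sketch for a one-feature linear model, assuming an L2 (ridge) penalty on the slope only; the function names and the sample data are invented for the example.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, validation) index pairs."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), validation))
    return splits


def ridge_fit(xs, ys, penalty):
    """Fit y = intercept + slope * x, penalising the slope but not the intercept."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    return y_mean - slope * x_mean, slope


def cross_validated_error(xs, ys, penalty, k=3):
    """Mean squared error on held-out folds for a given penalty."""
    errors = []
    for train, validation in k_fold_indices(len(xs), k):
        intercept, slope = ridge_fit(
            [xs[i] for i in train], [ys[i] for i in train], penalty
        )
        errors.extend((ys[i] - (intercept + slope * xs[i])) ** 2 for i in validation)
    return sum(errors) / len(errors)


xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
for penalty in (0.0, 1.0, 10.0):
    print(penalty, cross_validated_error(xs, ys, penalty))
```

A larger penalty shrinks the slope toward zero; the penalty with the lowest held-out error is the one that balances fitting the sample against generalizing beyond it.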
Question - 50 : - What is ZooKeeper? What are the benefits of using ZooKeeper?
Answer - 50 : -
Hadoop’s most notable technique for addressing big data challenges is its ability to divide and conquer. After the problem has been divided, the conquering relies on distributed and parallel processing across the Hadoop cluster.
For some big data problems, interactive tools cannot provide the insights or timeliness needed to make business decisions. In those cases, you need to build distributed applications to solve the problem. ZooKeeper is Hadoop’s way of coordinating all the elements of these distributed applications.
ZooKeeper as a technology is simple, but its features are powerful. Arguably, it would be difficult, if not impossible, to create resilient, fault-tolerant distributed Hadoop applications without it.
Benefits of using ZooKeeper are:
- Simple distributed coordination process: The coordination process among all nodes in ZooKeeper is straightforward.
- Synchronization: Mutual exclusion and co-operation among server processes.
- Ordered Messages: ZooKeeper stamps each update with a number that denotes its order, so messages are kept in order.
- Serialization: Data is encoded according to specific rules, which helps ensure your application runs consistently.
- Reliability: ZooKeeper is very reliable. Once an update has been applied, it persists until a client overwrites it.
- Atomicity: Data transfer either succeeds or fails completely; no transaction is partial.
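The ordering and atomicity guarantees can be illustrated with a toy, in-memory model. This is not ZooKeeper and does not use its client API: the class `TinyCoordinator`, its methods, and the paths below are hypothetical, and the sketch only shows the idea of stamping every update with an increasing number and applying a group of updates entirely or not at all.

```python
class TinyCoordinator:
    """In-memory stand-in for a coordination service (not ZooKeeper itself)."""

    def __init__(self):
        self.data = {}
        self.log = []      # (sequence_number, path, value) in commit order
        self.next_id = 1

    def version(self, path):
        """Number of committed updates to this path."""
        return sum(1 for _, logged_path, _ in self.log if logged_path == path)

    def set_many(self, updates, expected_versions=None):
        """Apply all updates or none of them."""
        expected_versions = expected_versions or {}
        # Validate everything first so a failure leaves no partial change.
        for path, expected in expected_versions.items():
            if self.version(path) != expected:
                return False
        for path, value in updates.items():
            self.data[path] = value
            self.log.append((self.next_id, path, value))
            self.next_id += 1
        return True


coordinator = TinyCoordinator()
first = coordinator.set_many({"/config/mode": "a"})
second = coordinator.set_many(
    {"/config/mode": "b", "/leader": "n1"}, {"/config/mode": 1}
)
# This update expects "/leader" to be unset, so it is rejected as a whole.
third = coordinator.set_many({"/leader": "n2"}, {"/leader": 0})
print(coordinator.log)
```

Every committed update carries a unique, increasing sequence number, so any two clients reading the log agree on the order in which changes happened.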