Big Data Interview Questions and Answers

Question - 81 : - What are some of the data management tools used with Edge Nodes in Hadoop?

Answer - 81 : -

This Big Data interview question aims to test your awareness regarding various tools and frameworks.

Oozie, Ambari, Pig and Flume are the most common data management tools that work with Edge Nodes in Hadoop.

Question - 82 : - Talk about the different tombstone markers used for deletion purposes in HBase.

Answer - 82 : -

This Big Data interview question dives into your knowledge of HBase and how it works.
There are three main tombstone markers used for deletion in HBase (see the sketch after this list). They are:
  • Family Delete Marker – For marking all the columns of a column family.
  • Version Delete Marker – For marking a single version of a single column.
  • Column Delete Marker – For marking all the versions of a single column.
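For illustration, here is a minimal sketch using the standard HBase Java client that shows how each marker is produced by a Delete operation. The table, row, family, and column names are hypothetical, and in practice you would use only the call that matches the scope you want to delete.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneMarkersSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("example_table"))) { // hypothetical table

            Delete delete = new Delete(Bytes.toBytes("row-1"));

            // Family Delete Marker: marks all columns of the column family "cf".
            delete.addFamily(Bytes.toBytes("cf"));

            // Version Delete Marker: marks a single version of one column (the latest, unless a timestamp is given).
            delete.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"));

            // Column Delete Marker: marks all versions of one column.
            delete.addColumns(Bytes.toBytes("cf"), Bytes.toBytes("col"));

            table.delete(delete);
        }
    }
}

Note that HBase does not remove the data immediately: the markers hide the affected cells, and the cells and markers are physically removed later during a major compaction.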


Question - 83 : - How do you deploy a Big Data solution?

Answer - 83 : -

You can deploy a Big Data solution in three steps:

  • Data Ingestion – This is the first step in the deployment of a Big Data solution. You begin by collecting data from multiple sources, be it social media platforms, log files, business documents, or anything else relevant to your business. Data can be ingested either through real-time streaming or in batch jobs.
  • Data Storage – Once the data is ingested, you must store it, typically in HDFS or HBase. HDFS storage is well suited to sequential access, while HBase is ideal for random read/write access (see the HDFS sketch after this list).
  • Data Processing – The last step in the deployment of the solution is data processing. Usually, data processing is done via frameworks like Hadoop, Spark, MapReduce, Flink, and Pig, to name a few.
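As an illustration of the storage step, here is a minimal sketch that writes an ingested record into HDFS through the standard Hadoop FileSystem API. The NameNode address and the file path are hypothetical.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStorageSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml instead.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/ingest/events/part-0001"))) {
            // Write one ingested record; a real pipeline would stream batches here.
            out.write("event-id,timestamp,payload\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}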

Question - 84 : - Name the three modes in which you can run Hadoop.

Answer - 84 : -

One of the most common questions in any Big Data interview. The three modes are:

  • Standalone mode – This is Hadoop’s default mode, and it uses the local file system for both input and output operations. Its main purpose is debugging. It does not use HDFS, and it requires no custom configuration in the mapred-site.xml, core-site.xml, and hdfs-site.xml files (see the configuration check after this list).
  • Pseudo-distributed mode – Also known as the single-node cluster, the pseudo-distributed mode includes both NameNode and DataNode within the same machine. In this mode, all the Hadoop daemons will run on a single node, and hence, the Master and Slave nodes are the same.
  • Fully distributed mode – This mode is known as the multi-node cluster wherein multiple nodes function simultaneously to execute Hadoop jobs. Here, all the Hadoop daemons run on different nodes. So, the Master and Slave nodes run separately.
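A quick way to see which mode a client is configured for is to read the fs.defaultFS property: in standalone mode it resolves to the local file system (file:///), while in the pseudo-distributed and fully distributed modes it points at an HDFS NameNode. A minimal sketch, assuming the Hadoop client libraries and the site XML files are on the classpath:

import org.apache.hadoop.conf.Configuration;

public class HadoopModeCheckSketch {
    public static void main(String[] args) {
        // Loads core-site.xml (and related files) from the classpath, if present.
        Configuration conf = new Configuration();

        // "file:///" indicates standalone mode; an hdfs://localhost URI indicates pseudo-distributed;
        // an hdfs:// URI pointing at a remote NameNode indicates fully distributed mode.
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        System.out.println("fs.defaultFS = " + defaultFs);
    }
}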

Question - 85 : - What is Feature Selection?

Answer - 85 : -

Feature selection refers to the process of extracting only the required features from a specific dataset. When data is extracted from disparate sources, not all data is useful at all times – different business needs call for different data insights. This is where feature selection comes in to identify and select only those features that are relevant for a particular business requirement or stage of data processing.

The main goal of feature selection is to simplify ML models and make their analysis and interpretation easier. Feature selection enhances the generalization abilities of a model and reduces the problems of dimensionality, thereby preventing overfitting. Thus, feature selection provides a better understanding of the data under study, improves the prediction performance of the model, and reduces the computation time significantly.

Feature selection can be done via three techniques:

Filters method
In this method, the features selected are not dependent on the designated classifiers. A variable ranking technique is used to order the variables, and this ranking takes into account the importance and usefulness of each feature before any classification is performed. The Chi-Square Test, Variance Threshold, and Information Gain are some examples of the filters method.
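As a concrete illustration of a filter, the sketch below implements a simple variance threshold from scratch: features whose variance across the samples falls below a chosen cutoff are dropped before any classifier is trained. The data and the threshold are made up for illustration.

import java.util.ArrayList;
import java.util.List;

public class VarianceThresholdSketch {
    // Returns the indices of features whose variance is at least `threshold`.
    static List<Integer> selectFeatures(double[][] samples, double threshold) {
        int numFeatures = samples[0].length;
        List<Integer> kept = new ArrayList<>();
        for (int f = 0; f < numFeatures; f++) {
            double mean = 0.0;
            for (double[] row : samples) mean += row[f];
            mean /= samples.length;

            double variance = 0.0;
            for (double[] row : samples) variance += (row[f] - mean) * (row[f] - mean);
            variance /= samples.length;

            if (variance >= threshold) kept.add(f);
        }
        return kept;
    }

    public static void main(String[] args) {
        double[][] samples = {
            {1.0, 0.0, 3.1},
            {1.0, 0.1, 2.9},
            {1.0, 0.0, 3.0},
        };
        // Feature 0 is constant and is dropped; the other two features are kept. Prints [1, 2].
        System.out.println(selectFeatures(samples, 0.001));
    }
}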

Wrappers method
In this method, the algorithm used for feature subset selection exists as a ‘wrapper’ around the induction algorithm. The induction algorithm functions like a ‘black box’ that produces a classifier, which is then used to evaluate candidate feature subsets. The major limitation of the wrappers method is that obtaining the feature subset requires heavy computation. Genetic Algorithms, Sequential Feature Selection, and Recursive Feature Elimination are examples of the wrappers method.

Embedded method 
The embedded method combines the best of both worlds – it includes the best features of the filters and wrappers methods. In this method, the variable selection is done during the training process, thereby allowing you to identify the features that are the most accurate for a given model. L1 Regularisation Technique and Ridge Regression are two popular examples of the embedded method.

Question - 86 : - Name some outlier detection techniques.

Answer - 86 : -

Again, one of the most important big data interview questions. Here are six outlier detection methods:

  • Extreme Value Analysis – This method determines the statistical tails of the data distribution. Statistical methods like ‘z-scores’ on univariate data are a perfect example of extreme value analysis (see the sketch after this list).
  • Probabilistic and Statistical Models – This method determines the ‘unlikely instances’ from a ‘probabilistic model’ of data. A good example is the optimization of Gaussian mixture models using ‘expectation-maximization’.
  • Linear Models – This method models the data into lower dimensions.
  • Proximity-Based Models – In this approach, the data instances that are isolated from the data group are determined by cluster, density, or nearest-neighbour analysis.
  • Information-Theoretic Models – This approach seeks to detect outliers as the bad data instances that increase the complexity of the dataset.
  • High-Dimensional Outlier Detection – This method identifies the subspaces for the outliers according to the distance measures in higher dimensions.
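As an illustration of extreme value analysis, here is a minimal sketch that flags univariate values whose absolute z-score exceeds a cutoff. The data and the cutoff are made up for illustration.

public class ZScoreOutlierSketch {
    public static void main(String[] args) {
        double[] values = {10.1, 9.8, 10.3, 9.9, 10.0, 25.0, 10.2};

        double mean = 0.0;
        for (double v : values) mean += v;
        mean /= values.length;

        double variance = 0.0;
        for (double v : values) variance += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(variance / values.length);

        // Flag values whose absolute z-score exceeds the cutoff.
        // A cutoff of 3 is common; 2 is used here because the sample is tiny.
        double cutoff = 2.0;
        for (double v : values) {
            double z = (v - mean) / stdDev;
            if (Math.abs(z) > cutoff) {
                System.out.println("Outlier: " + v + " (z = " + z + ")");
            }
        }
    }
}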

Question - 87 : - Explain Rack Awareness in Hadoop.

Answer - 87 : -

This is one of the popular Big Data interview questions. Rack Awareness is an algorithm that identifies and selects DataNodes closer to the NameNode based on their rack information. It is applied on the NameNode to determine how data blocks and their replicas will be placed. During the installation process, the default assumption is that all nodes belong to the same rack.

Rack awareness helps to:

  • Improve data reliability and accessibility.
  • Improve cluster performance.
  • Use network bandwidth more efficiently by reducing cross-rack traffic.
  • Keep bulk data flows within a rack whenever possible.
  • Prevent data loss in case of a complete rack failure.

Question - 88 : - Can you recover a NameNode when it is down? If so, how?

Answer - 88 : -

Yes, it is possible to recover a NameNode when it is down. Here’s how you can do it:

  • Use the FsImage (the file system metadata replica) to launch a new NameNode. 
  • Configure the DataNodes along with the clients so that they acknowledge and refer to the newly started NameNode.
  • Once the new NameNode has finished loading the last checkpoint from the FsImage and has received enough block reports from the DataNodes, it is ready to start serving clients.
However, the recovery process of a NameNode is feasible only for smaller clusters. For large Hadoop clusters, the recovery process usually consumes a substantial amount of time, thereby making it quite a challenging task. 

Question - 89 : - Name the configuration parameters of a MapReduce framework.

Answer - 89 : -

The configuration parameters in the MapReduce framework include the following (a minimal driver sketch follows the list):

  • The input format of data.
  • The output format of data.
  • The input location of jobs in the distributed file system.
  • The output location of jobs in the distributed file system.
  • The class containing the map function.
  • The class containing the reduce function.
  • The JAR file containing the mapper, reducer, and driver classes.
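Here is a minimal driver sketch showing where each of these parameters is set on a Job. The input and output paths are hypothetical, and the identity Mapper and Reducer stand in for real map and reduce classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapReduceDriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "config-parameters-sketch");

        // JAR containing the mapper, reducer, and driver classes.
        job.setJarByClass(MapReduceDriverSketch.class);

        // Classes containing the map and reduce functions
        // (the identity Mapper/Reducer are placeholders for real classes).
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Input and output formats of the data.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Output key/value types produced by the job.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input and output locations of the job in the distributed file system (hypothetical paths).
        FileInputFormat.addInputPath(job, new Path("/user/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/data/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}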

Question - 90 : - Name the common input formats in Hadoop.

Answer - 90 : -

Hadoop has three common input formats (a selection sketch follows the list):

  • Text Input Format – This is the default input format in Hadoop.
  • Sequence File Input Format – This input format is used to read sequence files, i.e. binary files that store sequences of key-value pairs.
  • Key-Value Input Format – This input format is used for plain text files in which each line is split into a key and a value by a separator (a tab by default).
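The input format is chosen on the Job before submission. A minimal sketch using the standard classes from org.apache.hadoop.mapreduce.lib.input (the format picked in main is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatSelectionSketch {
    // Configures the job's input format; exactly one of these calls would be used in a real job.
    static void configure(Job job, String kind) {
        switch (kind) {
            case "text":      // default: (byte offset, line text) records
                job.setInputFormatClass(TextInputFormat.class);
                break;
            case "sequence":  // binary sequence files of key-value pairs
                job.setInputFormatClass(SequenceFileInputFormat.class);
                break;
            case "keyvalue":  // plain text, each line split into key and value at a separator (tab by default)
                job.setInputFormatClass(KeyValueTextInputFormat.class);
                break;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-sketch");
        configure(job, "text");
    }
}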

