Big Data Interview Questions and Answers
Question - 91 : - What is the need for Data Locality in Hadoop?
Answer - 91 : -
In HDFS, datasets are stored as blocks in DataNodes across the Hadoop cluster. When a MapReduce job executes, each Mapper processes a data block (Input Split). If the data is not present on the node where the Mapper runs, it must be copied over the network from the DataNode where it resides to the Mapper's DataNode.
When a MapReduce job has over a hundred Mappers and each Mapper's DataNode tries to copy data from other DataNodes in the cluster simultaneously, the result is network congestion and a negative impact on the system's overall performance. This is where Data Locality enters the scenario. Instead of moving large chunks of data to the computation, Data Locality moves the computation to the DataNode where the data actually resides. This improves the overall performance of the system without causing unnecessary delay, as illustrated in the sketch below.
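A minimal, conceptual sketch of the idea (not Hadoop's actual scheduler code; the node and rack names are made up): the scheduler prefers a node-local placement, then rack-local, and only falls back to off-rack when no closer replica exists.

```python
# Conceptual sketch of locality-aware task placement; not Hadoop's real scheduler.
def pick_locality(block_replicas, mapper_node, mapper_rack):
    """block_replicas: list of (node, rack) pairs that hold a replica of the split."""
    if any(node == mapper_node for node, _ in block_replicas):
        return "node-local"   # data is already on the Mapper's node: no network copy
    if any(rack == mapper_rack for _, rack in block_replicas):
        return "rack-local"   # copy stays behind a single rack switch
    return "off-rack"         # worst case: data crosses the cluster network

# A split replicated on d1 (rack r1) and d7 (rack r3), scheduled on node d1 -> node-local.
print(pick_locality([("d1", "r1"), ("d7", "r3")], mapper_node="d1", mapper_rack="r1"))
```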
Question - 92 : - What are the steps to achieve security in Hadoop?
Answer - 92 : -
In Hadoop, Kerberos – a network authentication protocol – is used to achieve security. Kerberos is designed to offer robust authentication for client/server applications via secret-key cryptography.
When you use Kerberos to access a service, you have to undergo three steps, each of which involves a message exchange with a server. The steps are as follows:
- Authentication – This is the first step wherein the client is authenticated via the authentication server, after which a time-stamped TGT (Ticket Granting Ticket) is given to the client.
- Authorization – In the second step, the client uses the TGT for requesting a service ticket from the TGS (Ticket Granting Server).
- Service Request – In the final step, the client uses the service ticket to authenticate itself to the server.
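The snippet below is a purely illustrative simulation of these three exchanges, not a real Kerberos client (in practice you would use kinit or a GSSAPI library); all class, principal, and service names are made up.

```python
import time

class AuthenticationServer:
    def issue_tgt(self, principal):
        # Step 1: Authentication - verify the client and return a time-stamped TGT.
        return {"type": "TGT", "principal": principal, "issued_at": time.time()}

class TicketGrantingServer:
    def issue_service_ticket(self, tgt, service):
        # Step 2: Authorization - exchange a valid TGT for a service ticket.
        assert tgt["type"] == "TGT"
        return {"type": "service_ticket", "principal": tgt["principal"], "service": service}

class HadoopService:
    def accept(self, ticket, expected_service):
        # Step 3: Service Request - the service trusts the ticket, not a password.
        return ticket["type"] == "service_ticket" and ticket["service"] == expected_service

tgt = AuthenticationServer().issue_tgt("alice@EXAMPLE.COM")
ticket = TicketGrantingServer().issue_service_ticket(tgt, "namenode")
print(HadoopService().accept(ticket, "namenode"))   # True: client is authenticated
```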
Question - 93 : - How can you handle missing values in Big Data?
Answer - 93 : -
Missing values refer to values that are not present in a column. They occur when there is no data value for a variable in an observation. If missing values are not handled properly, they are bound to lead to erroneous data, which in turn generates incorrect outcomes. Thus, it is highly recommended to treat missing values correctly before processing the datasets. Usually, if the number of missing values is small, the affected records are dropped, but if a large share of values is missing, data imputation is the preferred course of action.
In Statistics, there are different ways to estimate the missing values. These include regression, multiple data imputation, listwise/pairwise deletion, maximum likelihood estimation, and approximate Bayesian bootstrap.
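As a minimal sketch of the drop-versus-impute decision using pandas (the DataFrame, column names, and the 5% threshold are all illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 32, np.nan, 41, np.nan, 29],
                   "salary": [50000, 64000, 58000, np.nan, 61000, 55000]})

missing_ratio = df["age"].isna().mean()          # share of missing values in the column
if missing_ratio < 0.05:
    df = df.dropna(subset=["age"])               # few missing values: drop those rows
else:
    df["age"] = df["age"].fillna(df["age"].median())  # many missing values: impute

print(df)
```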
Question - 94 : - Are you open to gaining additional learning and qualifications that could help you advance your career with us?
Answer - 94 : -
Here's your chance to demonstrate your enthusiasm and career ambitions. Of course, your answer will depend on your current level of academic qualifications and certifications, as well as your personal circumstances, which might include family responsibilities and financial considerations. Therefore, respond forthrightly and honestly. Bear in mind that many courses and learning modules are readily available online. Moreover, analytics vendors have established training courses aimed at those seeking to upskill themselves in this domain. You can also inquire about the company's policy on mentoring and coaching.
Question - 95 : - What is the FSCK command used for?
Answer - 95 : -
FSCK, which stands for file system consistency check, is an HDFS filesystem checking utility that generates a summary report about the file system's status. However, the report merely identifies the presence of errors; it does not correct them. The FSCK command can be executed against the entire file system or a subset of files.
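In practice, fsck is invoked through the hdfs CLI. The sketch below shells out to it from Python (it assumes the hdfs command is on PATH, that /user/data exists, and that the user can reach the NameNode; the flags shown are common fsck options, so check the fsck help on your distribution):

```python
import subprocess

# Run fsck against a subtree of HDFS and print its summary report.
result = subprocess.run(
    ["hdfs", "fsck", "/user/data", "-files", "-blocks", "-locations"],
    capture_output=True, text=True,
)
print(result.stdout)   # total size, block counts, replication, corrupt/missing blocks
```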
Question - 96 : - What are two common techniques for detecting outliers?
Answer - 96 : -
Analysts often use the following two techniques to detect outliers:
- Extreme value analysis. This is the most basic form of outlier detection and is limited to one-dimensional data. Extreme value analysis determines the statistical tails of the data distribution. The Z-score is a good example of extreme value analysis (a short sketch appears below).
- Probabilistic and statistical models. These models determine unlikely instances from a probabilistic model of the data. Data points with a low probability of membership are marked as outliers. However, these models assume that the data adheres to specific distributions. A common example of this type of outlier detection is the Bayesian probabilistic model.
These are only two of the core methods used to detect outliers. Other techniques include linear regression models, information theoretic models, and high-dimensional outlier detection methods.
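A minimal sketch of extreme value analysis on one-dimensional data using the standard Z-score (the data is synthetic and the threshold of 3 is a common rule of thumb, not a fixed rule):

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(10, 0.5, size=50), [34.7]])  # 50 typical points + 1 outlier

z_scores = (values - values.mean()) / values.std()   # distance from the mean in std deviations
outliers = values[np.abs(z_scores) > 3]               # points deep in the statistical tails

print(outliers)   # flags 34.7 as abnormally distant from the rest
```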
Question - 97 : - What is an "outlier" in the context of big data?
Answer - 97 : -
An outlier is a data point that's abnormally distant from the others in a group of random samples. The presence of outliers can mislead the machine learning process and result in inaccurate models or substandard outcomes. In fact, a single outlier can bias an entire result set. That said, outliers can sometimes contain nuggets of valuable information.
Question - 98 : - What is feature selection in big data?
Answer - 98 : -
Feature selection refers to the process of selecting from a data set only the information that is relevant to the analysis. This reduces the amount of data that needs to be analyzed while improving the quality of the data used for analysis. Feature selection lets data scientists refine the input variables they use to model and analyze the data, leading to more accurate results while reducing the computational overhead.
Data scientists use sophisticated algorithms for feature selection, which usually fall into one of the following three categories:
- Filter methods. A subset of input variables is selected during a preprocessing stage by ranking the data based on such factors as importance and relevance.
- Wrapper methods. This approach is a resource-intensive operation that uses machine learning and predictive analytics to try to determine which input variables to keep, usually providing better results than filter methods.
- Embedded methods. Embedded methods combine attributes of both the filter and wrapper methods, using fewer computational resources than wrapper methods while providing better results than filter methods. However, embedded methods are not always as effective as wrapper methods.
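As a minimal sketch of the filter methods described above, the example below ranks features with a univariate statistic and keeps only the top k using scikit-learn's SelectKBest (the dataset and the choice of k = 10 are purely illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)       # 30 input features
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)          # keep the 10 highest-scoring features

print(X.shape, "->", X_reduced.shape)             # (569, 30) -> (569, 10)
```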
Question - 99 : - What are the two main phases of a MapReduce operation?
Answer - 99 : -
A MapReduce operation can be divided into the following two primary phases:
- Map phase. MapReduce processes the input data, splits it into chunks and maps those chunks in preparation for analysis. MapReduce runs these processes in parallel.
- Reduce phase. MapReduce processes the mapped chunks, aggregating the data based on the defined logic. The output of this phase is then written to HDFS.
MapReduce operations are sometimes divided into phases other than these two. For example, the Reduce phase might be split into the Shuffle phase and the Reduce phase. In some cases, you might also see a Combiner phase, which is an optional phase used to optimize MapReduce operations.
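A minimal in-memory sketch of the two phases for a word count (a real job runs the same logic distributed across the cluster; the intermediate grouping step corresponds to the Shuffle phase mentioned above):

```python
from collections import defaultdict

lines = ["big data needs big clusters", "data locality reduces data movement"]

# Map phase: process the input chunks and emit (key, value) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group intermediate values by key before reduction.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the grouped values according to the defined logic.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'big': 2, 'data': 3, 'needs': 1, ...}
```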
Question - 100 : - What is MapReduce?
Answer - 100 : -
MapReduce is a software framework in Hadoop that's used for processing large data sets across a cluster of computers in which each node includes its own storage. MapReduce can process data in parallel on these nodes, making it possible to distribute input data and collate the results. In this way, Hadoop can run jobs split across a massive number of servers. MapReduce also provides its own level of fault tolerance, with each node periodically reporting its status to a primary node. In addition, MapReduce offers native support for writing Java applications, although you can also write MapReduce applications in other programming languages.