
Big Data Interview Questions and Answers

Question - 101 : - What are the key differences between NFS and HDFS?

Answer - 101 : -

NFS, which stands for Network File System, is a distributed file system protocol widely used in network-attached storage (NAS) systems. It is one of the oldest distributed file storage protocols and is well suited to smaller data sets. NAS makes data available over a network while presenting it to clients as if it were stored on a local machine.

HDFS is a more recent technology. It is designed for handling big data workloads, providing high throughput and high capacity, far beyond the capabilities of an NFS-based system. HDFS also offers integrated data protections that safeguard against node failures. NFS is typically implemented on single systems that do not include the inherent fault tolerance that comes with HDFS. However, NFS-based systems are usually much less complicated to deploy and maintain than HDFS-based systems.

Question - 102 : - What is an edge node in Hadoop?

Answer - 102 : -

An edge node is a computer that acts as an end-user portal for communicating with other nodes in a Hadoop cluster. An edge node provides an interface between the Hadoop cluster and an outside network. For this reason, it is also referred to as a gateway node or edge communication node. Edge nodes are often used to run administration tools or client applications. They typically do not run any Hadoop services.

Question - 103 : - How does Hadoop protect data against unauthorized access?

Answer - 103 : -

Hadoop uses the Kerberos network authentication protocol to protect data from unauthorized access. Kerberos uses secret-key cryptography to provide strong authentication for client/server applications. A client must undergo the following three basic steps to prove its identity to a server (each of which involves message exchanges with the server):

  • Authentication. The client sends an authentication request to the Kerberos authentication server. The server verifies the client and sends the client a ticket granting ticket (TGT) and a session key.
  • Authorization. Once authenticated, the client requests a service ticket from the ticket granting server (TGS). The TGT must be included with the request. If the TGS can authenticate the client, it sends the service ticket and credentials necessary to access the requested resource.
  • Service request. The client sends its request to the Hadoop resource it is trying to access. The request must include the service ticket issued by the TGS.
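Kerberos authentication is enabled through Hadoop's core configuration. As a minimal sketch (the property names below are the standard Hadoop security settings; the surrounding deployment details such as keytabs and principal names are omitted), core-site.xml would include:

```xml
<!-- core-site.xml: switch authentication from the default "simple"
     mode to Kerberos and enforce service-level authorization -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```

With this in place, a client typically obtains a TGT first (for example, `kinit alice@EXAMPLE.COM`, where the principal and realm are hypothetical) before issuing HDFS or YARN commands.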

Question - 104 : - What is rack awareness in Hadoop clusters?

Answer - 104 : -

Rack awareness is one of the mechanisms used by Hadoop to optimize data access when processing client read and write requests. When a request comes in, the NameNode identifies and selects the nearest DataNodes, preferably those on the same rack or on nearby racks. Rack awareness can help improve performance and reliability, while reducing network traffic. Rack awareness can also play a role in fault tolerance. For example, the NameNode might place data block replicas on separate racks to help ensure availability in case a network switch fails or a rack becomes unavailable for other reasons.
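The NameNode learns rack locations from an administrator-supplied topology script, configured via the `net.topology.script.file.name` property in core-site.xml. Hadoop invokes the script with one or more DataNode IPs or hostnames as arguments and expects one rack path per argument on stdout. The sketch below illustrates the contract; the IP-to-rack table is entirely hypothetical:

```python
#!/usr/bin/env python3
"""Hypothetical rack topology script for Hadoop rack awareness.

Hadoop calls the script named by net.topology.script.file.name with
DataNode IPs/hostnames as arguments and reads one rack path per
argument from stdout. The mapping table below is made up."""
import sys

# Example mapping; a real deployment would derive this from the
# data center's network layout (e.g. subnet -> rack).
RACK_MAP = {
    "10.0.1.": "/dc1/rack1",
    "10.0.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"  # fallback rack path

def resolve(host):
    """Map a host/IP to a rack path by matching a known prefix."""
    for prefix, rack in RACK_MAP.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK

if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(resolve(arg))
```

Given this mapping, the NameNode would treat `10.0.1.17` and `10.0.1.18` as rack-local to each other and place replicas accordingly.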

Question - 105 : - What makes an HDFS environment fault-tolerant?

Answer - 105 : -

HDFS can be easily set up to replicate data to different DataNodes. HDFS breaks files down into blocks that are distributed across nodes in the cluster. Each block is also replicated to other nodes. If one node fails, the other nodes take over, allowing applications to access the data through one of the backup nodes.
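The failover behavior described above can be illustrated with a toy model. This is not how HDFS is implemented (replica placement is decided by the NameNode using rack awareness and other policies); it only shows the principle that a client falls back to another replica when a node fails:

```python
"""Toy model of HDFS block replication and client failover
(illustrative only; the placement policy here is a simple round-robin,
not the NameNode's real algorithm)."""

REPLICATION = 3  # HDFS default replication factor

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

def read_block(block, placement, live_nodes):
    """Return a live node holding the block, mimicking client failover."""
    for node in placement[block]:
        if node in live_nodes:
            return node
    raise IOError(f"all replicas of {block} unavailable")

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(["blk_0", "blk_1"], nodes)
# Simulate dn1 failing: the client reads blk_0 from another replica.
survivor = read_block("blk_0", placement, live_nodes={"dn2", "dn3", "dn4"})
# survivor == "dn2"
```

Because each block lives on three nodes, losing any single node leaves at least two readable replicas, and the NameNode can re-replicate the affected blocks to restore the target replication factor.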

Question - 106 : - What are three common input formats in Hadoop?

Answer - 106 : -

Hadoop supports multiple input formats, which define how input data is read and presented to jobs as records when it enters the Hadoop platform. The following input formats are three of the most common:

  • Text. This is the default input format. Each line within a file is treated as a separate record. The records are saved as key/value pairs, with the line of text treated as the value.
  • Key-Value Text. This input format is similar to the Text format, breaking each line into separate records. Unlike the Text format, which treats the entire line as the value, the Key-Value Text format breaks the line itself into a key and a value, using the tab character as a separator.
  • Sequence File. This format reads binary files that store sequences of user-defined key-value pairs as individual records.
Hadoop supports other input formats as well, so be prepared to discuss those in addition to the three described here.
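The Key-Value Text semantics above can be sketched in a few lines. This mimics the splitting rule of Hadoop's KeyValueTextInputFormat (separator defaults to a tab; when the separator is absent, the entire line becomes the key and the value is empty), without any Hadoop dependencies:

```python
def split_key_value(line, separator="\t"):
    """Split a line into a (key, value) pair at the first separator,
    mimicking KeyValueTextInputFormat: if the separator is absent,
    the whole line is the key and the value is empty."""
    key, _sep, value = line.partition(separator)
    return key, value

# A tab-separated line yields a key and a value:
split_key_value("user42\tclicked")   # -> ("user42", "clicked")
# A line with no tab yields the whole line as the key:
split_key_value("no-tab-here")       # -> ("no-tab-here", "")
```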

Question - 107 : - What are Hadoop's primary operational modes?

Answer - 107 : -

Hadoop supports three primary operational modes:

  • Standalone. Also referred to as Local mode, the Standalone mode is the default mode. It runs as a single Java process on a single node. It also uses the local file system and requires no configuration changes. The Standalone mode is used primarily for debugging purposes.
  • Pseudo-distributed. Also referred to as a single-node cluster, the Pseudo-distributed mode runs on a single machine, but each Hadoop daemon runs in a separate Java process. This mode also uses HDFS, rather than the local file system, and it requires configuration changes. This mode is often used for debugging and testing purposes.
  • Fully distributed. This is the full production mode, with all daemons running on separate nodes in a primary/secondary configuration. Data is distributed across the cluster, which can range from a few nodes to thousands of nodes. This mode requires configuration changes but offers the scalability, reliability and fault tolerance expected of a production system.
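As an example of the configuration changes the Pseudo-distributed mode requires, a minimal single-node setup (following the standard Hadoop single-node configuration; the port shown is the conventional default) points the file system at a local HDFS instance and drops replication to 1:

```xml
<!-- core-site.xml: use HDFS on the local machine
     instead of the local file system -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml: only one node, so only one replica per block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

Standalone mode needs none of this, while Fully distributed mode replaces `localhost` with the NameNode's address and restores a higher replication factor.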

Question - 108 : - What is Hadoop YARN and what are its main components?

Answer - 108 : -

Hadoop YARN manages resources and provides an execution environment for required processes, while allocating system resources to the applications running in the cluster. It also handles job scheduling and monitoring. YARN decouples resource management and scheduling from the data processing component in MapReduce.

YARN separates resource management and job scheduling into the following two daemons:

  • ResourceManager. This daemon arbitrates resources for the cluster's applications. It includes two main components: Scheduler and ApplicationsManager. The Scheduler allocates resources to running applications. The ApplicationsManager has multiple roles: accepting job submissions, negotiating the execution of the application-specific ApplicationMaster and providing a service for restarting the ApplicationMaster container on failure.
  • NodeManager. This daemon launches and manages containers on a node and uses them to run specified tasks. NodeManager also runs services that determine the health of the node, such as performing disk checks. Moreover, NodeManager can execute user-specified tasks.

Question - 109 : - What is HDFS and what are its main components?

Answer - 109 : -

HDFS is a distributed file system that serves as Hadoop's default storage environment. It can run on low-cost commodity hardware, while providing a high degree of fault tolerance. HDFS stores the various types of data in a distributed environment that offers high throughput to applications with large data sets. HDFS is deployed in a primary/secondary architecture, with each cluster supporting the following two primary node types:

  • NameNode. A single primary node that manages the file system namespace, regulates client access to files and processes the metadata information for all the data blocks in the HDFS.
  • DataNode. A secondary node that manages the storage attached to each node in the cluster. A cluster typically contains many DataNode instances, but there is usually only one DataNode per physical node. Each DataNode serves read and write requests from the file system's clients.

Question - 110 : - What are some of the main configuration files used in Hadoop?

Answer - 110 : -

The Hadoop platform provides multiple configuration files for controlling cluster settings, including the following:

  • hadoop-env.sh. Site-specific environmental variables for controlling Hadoop scripts in the bin directory.
  • yarn-env.sh. Site-specific environmental variables for controlling YARN scripts in the bin directory.
  • mapred-site.xml. Configuration settings specific to MapReduce, such as the mapreduce.framework.name setting.
  • core-site.xml. Core configuration settings, such as the I/O configurations common to HDFS and MapReduce.
  • yarn-site.xml. Configuration settings specific to YARN's ResourceManager and NodeManager.
  • hdfs-site.xml. Configuration settings specific to HDFS, such as the file path where the NameNode stores the namespace and transactions logs.
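For instance, directing MapReduce jobs to run on YARN rather than locally takes a single property in mapred-site.xml (this is the standard property named above; values other than `yarn` include `local`):

```xml
<!-- mapred-site.xml: run MapReduce jobs on the YARN framework -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```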
