
Big Data Interview Questions and Answers

Question - 51 : - What is the default replication factor in HDFS?

Answer - 51 : -

By default, the replication factor is 3. No two copies of a block are ever placed on the same DataNode. Usually, the first two copies are placed on the same rack, and the third copy on a different rack. It is advised to keep the replication factor at three or higher so that at least one copy is always safe, even if something happens to an entire rack.

We can set a default replication factor for the file system as a whole and also override it for each file and directory individually. Files that are not essential can use a lower replication factor, while critical files should have a higher one, as the sketch below shows.
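
A minimal sketch of both approaches from the command line, assuming a standard Hadoop installation; the paths under /data are hypothetical:

# Cluster-wide default: the dfs.replication property in hdfs-site.xml
# Per-file override: lower replication to 2 for a less essential file
hdfs dfs -setrep -w 2 /data/noncritical.log
# Raise replication to 5 for a critical file
hdfs dfs -setrep -w 5 /data/critical.db
# The second column of the listing shows each file's current replication factor
hdfs dfs -ls /data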

Question - 52 : - Write the command used to copy data from the local system onto HDFS.

Answer - 52 : -

The command used for copying data from the Local system to HDFS is:
hadoop fs -copyFromLocal [source] [destination]
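
For example, assuming a local file sales.csv and a hypothetical HDFS target directory /user/hadoop/data:

hadoop fs -copyFromLocal sales.csv /user/hadoop/data/
# hadoop fs -put does the same job and also accepts multiple sources
hadoop fs -put sales.csv /user/hadoop/data/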

Question - 53 : - What is partitioning in Hive?

Answer - 53 : -

Partitioning in Hive is a logical division of a table into related parts based on the values of partition columns such as date, city, or department. These partitions can be subdivided further into buckets, which provide extra structure to the data that may be used for more efficient querying.
Let's look at data partitioning in Hive with an example. Consider a table named Table1 that contains client details such as id, name, dept, and year of joining. Assume we need to retrieve the details of all the clients who joined in 2014.

Without partitioning, the query scans the whole table for the required data. But if we partition the client data by year and save each year in a separate file, the query reads only the relevant partition, which decreases the query processing time, as the sketch below shows.
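
A minimal HiveQL sketch of this example, run through the hive CLI; the table name clients and the file path are hypothetical stand-ins for Table1 above:

hive -e "
CREATE TABLE clients (id INT, name STRING, dept STRING)
PARTITIONED BY (year_of_joining INT);

-- Load the 2014 records into their own partition
LOAD DATA LOCAL INPATH '/tmp/clients_2014.csv'
INTO TABLE clients PARTITION (year_of_joining = 2014);

-- The predicate on the partition column prunes all other partitions
SELECT * FROM clients WHERE year_of_joining = 2014;
"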

Question - 54 : - Explain Features Selection.

Answer - 54 : -

During processing, Big Data may contain a large amount of data that is not required at a particular time, so we may need to select only the specific features we are interested in. The process of extracting only the needed features from Big Data is called feature selection.

Feature selection methods are:

  • Filter method: a variable-ranking approach in which we consider only the importance and usefulness of each feature.
  • Wrapper method: in this method, an induction algorithm is used, which can produce a classifier for evaluating feature subsets.
  • Embedded method: this method combines the efficiencies of both the filter and wrapper methods.

Question - 55 : - What is the use of the -compress-codec parameter?

Answer - 55 : -

The -compress-codec parameter (written --compression-codec in the Sqoop CLI) is generally used to get the output files of a sqoop import in a compression format other than the default .gz.
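
A sketch of such an import; the JDBC URL, table name, and target directory are hypothetical:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --target-dir /user/hadoop/orders \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.BZip2Codec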

Question - 56 : - Mention the main configuration parameters that have to be specified by the user to run MapReduce.

Answer - 56 : -

The chief configuration parameters that the user of the MapReduce framework needs to specify are listed below; the sketch after the list shows how they map onto job properties.

  • The job's input location
  • The job's output location
  • The input format
  • The output format
  • The class containing the map function
  • The class containing the reduce function
  • The JAR file, which includes the mapper, the reducer, and the driver classes
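
One way to supply these from the command line, assuming Hadoop 2.x property names; the JAR name, class names, and paths are hypothetical, and the -D options take effect when the driver uses ToolRunner (more commonly the driver sets the same values through the Job API):

hadoop jar wordcount.jar com.example.WordCountDriver \
  -D mapreduce.input.fileinputformat.inputdir=/user/hadoop/input \
  -D mapreduce.output.fileoutputformat.outputdir=/user/hadoop/output \
  -D mapreduce.job.inputformat.class=org.apache.hadoop.mapreduce.lib.input.TextInputFormat \
  -D mapreduce.job.outputformat.class=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat \
  -D mapreduce.job.map.class=com.example.WordCountMapper \
  -D mapreduce.job.reduce.class=com.example.WordCountReducer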

Question - 57 : - How can you skip bad records in Hadoop?

Answer - 57 : -

Hadoop provides an option wherein a particular set of bad input records can be skipped while processing map inputs. The SkipBadRecords class offers an optional mode of execution in which bad records are detected and skipped after multiple failed attempts. Such failures may happen because of bugs in the map function, and fixing them manually may not always be possible, for example when the bug lies in a third-party library. With the help of this feature, only a small amount of data is lost, which may be acceptable because we are dealing with a large amount of data. A sketch of enabling this mode follows.
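
One way to turn on skipping mode from the command line, assuming Hadoop 2.x property names (older releases use the mapred.skip.* equivalents); the JAR, driver class, and paths are hypothetical, and a driver can set the same values with SkipBadRecords.setMapperMaxSkipRecords:

hadoop jar myjob.jar com.example.MyDriver \
  -D mapreduce.task.skip.start.attempts=2 \
  -D mapreduce.map.skip.maxrecords=1 \
  /user/hadoop/input /user/hadoop/output

Here skipping mode starts after two failed task attempts, and at most one record around each bad record may be skipped.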

Question - 58 : - Explain Outliers.

Answer - 58 : -

Outliers are data points that lie very far from the rest of the group and are not part of any cluster. They can affect the behavior of a model: predictions may be wrong, or accuracy may be very low. Outliers must therefore be handled carefully, as they may also contain some helpful information. The presence of outliers can mislead a Big Data model or a machine learning model, with results such as:

  • Poor results
  • Lower accuracy
  • Longer training time

Question - 59 : - Explain Persistent, Ephemeral and Sequential Znodes.

Answer - 59 : -

Persistent znodes: The default znode type in ZooKeeper. A persistent znode stays on the ZooKeeper server until a client explicitly deletes it.
Ephemeral znodes: These are temporary znodes. An ephemeral znode is destroyed whenever the session of the client that created it ends. For example, assume client1 created eznode1; once client1 disconnects from the ZooKeeper server, eznode1 is deleted.
Sequential znodes: A sequential znode gets a 10-digit, monotonically increasing number appended to the end of its name. Assume client1 creates a sequential znode named sznode; on the ZooKeeper server it will be named something like
sznode0000000001.
If client1 creates another sequential znode under the same parent, it receives the next number in the sequence, e.g. 0000000002. The sketch below shows all three types.
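
A minimal sketch inside an interactive zkCli.sh session; the znode names are hypothetical, and the parenthetical notes are annotations rather than client input:

create /config "settings"     (persistent znode, the default)
create -e /lock "owner1"      (ephemeral: deleted when this client's session ends)
create -s /task "payload"     (sequential: the server appends a counter, e.g. /task0000000001)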

Question - 60 : - Explain the pros and cons of Big Data.

Answer - 60 : -

Pros of Big Data are:

  • Increased productivity: Recently, it was found that 59.9% of businesses use big data tools like Hadoop and Spark to grow their sales. Current big data tools let analysts examine data almost instantly, which enhances their productivity. The insights inferred from big data analysis can also be used by organizations to increase productivity in different forms throughout the company.
  • Reduced costs: Big data analytics helps businesses reduce their costs. Many companies report that big data tools have improved operational performance and lowered costs, while others have adopted big data specifically to cut expenses. Interestingly, very few companies name cost reduction as their primary goal for big data analytics, suggesting that for many it is merely a very welcome side benefit.
  • Improved customer service: Improving customer service has always been one of the primary goals of big data analytics projects, and many companies have succeeded at it with this approach. Customer contact points such as social media and customer relationship management systems carry a lot of information about customers, and analyzing this data is used to improve services for them.
  • Fraud detection: A primary use of big data analytics in the financial services industry is detecting fraud. Because big data analytics systems rely on machine learning, they are good at recognizing patterns and irregularities. These techniques can give banks and credit card companies the capacity to detect stolen credit cards or fraudulent purchases, usually before the cardholder even knows that something is wrong.
  • Greater innovation: A few companies invest in analytics with the sole purpose of bringing new things to market and disrupting their markets. The reasoning is that if insights let them see where the market is heading before their competitors do, they can emerge strong with new goods and services and capture the market quickly.
On the other hand, implementing big data analytics is not as easy as it seems; there are a few difficulties when it comes to implementing it.

Cons of Big Data are:

  • Need for talent: The number one big data challenge of the past few years has been the skill set it requires. Many companies also face difficulty when designing a data lake. Hiring or training staff increases costs considerably, and acquiring big data skills takes a lot of time.
  • Cybersecurity risks: Storing big data, especially sensitive data, makes businesses a prime target for cyberattackers. Security is one of the top big data challenges, and cybersecurity breaches are among the greatest data threats enterprises encounter.
  • Hardware needs: Another critical concern for businesses is the IT infrastructure needed to support big data analytics initiatives. Storage space for holding the data, network bandwidth for moving it to and from analytics systems, and compute resources for running those analytics are all costly to buy and maintain.
  • Data quality: A notable disadvantage of working with big data is the need to address data quality problems. Before companies can use big data for analytics, data scientists and analysts must ensure that the data they are working with is accurate, relevant, and in the proper format for analysis. This slows the process, but if companies don't take care of data quality issues, they may find that the insights produced by their analytics are useless or even harmful.

