Data Science Interview Questions and Answers

Question - 41 : - Explain bagging in Data Science.

Answer - 41 : -

Bagging, short for bootstrap aggregating, is an ensemble learning method. Starting from an existing dataset of N observations, we use the bootstrap method to draw multiple samples of size N with replacement. Each bootstrapped sample is used to train a separate model, and because the models do not depend on one another, they can be trained in parallel. Combining them reduces variance, which makes the bagged model more robust than a single model.

Once all the models are trained, we make a prediction with each of them and combine the results: for regression we average the predictions, and for classification we choose the class that the models predict most often (a majority vote).
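The procedure above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the base learner `fit_mean`, the toy `points` data, and all function names are assumptions made for the example, not part of any particular library.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def bagged_regression(data, fit, x, n_models=25, seed=0):
    """Train n_models on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    models = [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return mean(model(x) for model in models)


def majority_vote(predictions):
    """Return the most frequent class label among the model predictions."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical base learner: always predicts the mean target of its sample.
def fit_mean(sample):
    avg = mean(y for _, y in sample)
    return lambda x: avg


points = [(1, 2.0), (2, 4.0), (3, 6.0), (4, 8.0)]  # toy (x, y) pairs
print(bagged_regression(points, fit_mean, x=2.5))   # regression: averaged prediction
print(majority_vote(["spam", "ham", "spam"]))       # classification: majority vote
```

In practice the base learner would be a real model such as a decision tree; the mean predictor is used here only to keep the sketch self-contained.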

Question - 42 : - Explain boosting in Data Science.

Answer - 42 : -

Boosting is another ensemble learning method. Unlike bagging, it does not train models in parallel. Instead, we train multiple weak models sequentially, so that each new model depends on the models trained before it, and combine them iteratively into a stronger one.

When training a new model, we evaluate the previous models on the dataset and give more importance to the observations that they handled or predicted incorrectly. Each iteration therefore concentrates on the mistakes made so far. Boosting is also useful for reducing bias in models.
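One way to picture the re-weighting step is the AdaBoost-style update below. It is a sketch of a single boosting round under the assumption that every observation carries a weight and that we already know which observations the current weak model got right; the function name `reweight` and the sample values are illustrative.

```python
import math


def reweight(weights, is_correct):
    """One AdaBoost-style round: return (model_weight, new_sample_weights)."""
    # Weighted error of the current weak model.
    error = sum(w for w, ok in zip(weights, is_correct) if not ok) / sum(weights)
    # Clamp the error to avoid division by zero or log(0).
    error = min(max(error, 1e-10), 1 - 1e-10)
    # Models with lower error get a larger say in the final ensemble.
    alpha = 0.5 * math.log((1 - error) / error)
    # Raise the weights of misclassified observations, lower the others.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, is_correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = reweight(weights, [True, True, True, False])
print(alpha, new_weights)
```

After this round the single misclassified observation carries half of the total weight, so the next weak model is pushed to get it right.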

Question - 43 : - Build a confusion matrix for the model where the threshold value for the probability of predicted values is 0.6, and also find the accuracy of the model.

Answer - 43 : -

Accuracy is calculated as:

Accuracy = (True positives + true negatives)/(True positives+ true negatives + false positives + false negatives)

To build a confusion matrix in R, we will use the table function:

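table(test$target,pred_heart>0.6)

Here, we set the probability threshold to 0.6. Wherever pred_heart is greater than 0.6, the comparison is TRUE and the observation is classified as 1; wherever it is 0.6 or less, the comparison is FALSE and the observation is classified as 0. The table cross-tabulates the actual values in test$target (rows) against these predictions (columns), and the accuracy is the number of observations where the actual and predicted classes agree, divided by the total number of observations.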
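The same calculation can be written out by hand in Python, which makes the four cells of the confusion matrix explicit. This is a sketch on made-up data: the lists `actual` and `probabilities` are hypothetical and stand in for test$target and pred_heart, with labels assumed to be coded as 0 and 1.

```python
def confusion_matrix(actual, probabilities, threshold=0.6):
    """Count TP, TN, FP, FN for binary labels (0/1) at the given threshold."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for y, p in zip(actual, probabilities):
        predicted = 1 if p > threshold else 0
        if predicted == 1:
            counts["TP" if y == 1 else "FP"] += 1
        else:
            counts["TN" if y == 0 else "FN"] += 1
    return counts


def accuracy(counts):
    """(TP + TN) divided by the total number of observations."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Hypothetical labels and predicted probabilities.
actual = [1, 0, 1, 1, 0, 0]
probabilities = [0.9, 0.2, 0.55, 0.7, 0.65, 0.1]

cm = confusion_matrix(actual, probabilities)
print(cm)
print(accuracy(cm))
```

With this toy data, four of the six observations are classified correctly at the 0.6 threshold.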

Question - 44 : - Make a scatter plot between ‘price’ and ‘carat’ using ggplot. ‘Price’ should be on the y-axis, ’carat’ should be on the x-axis, and the ‘color’ of the points should be determined by ‘cut.’

Answer - 44 : -

We will implement the scatter plot using ggplot.

The ggplot2 package is based on the grammar of graphics, which lets us build a plot by stacking multiple layers on top of each other.

So, we will start with the data layer, and on top of the data layer we will stack the aesthetic layer. Finally, on top of the aesthetic layer we will stack the geometry layer.

Code:

library(ggplot2)  # provides ggplot() and the diamonds dataset
ggplot(data = diamonds, aes(x = carat, y = price, col = cut)) + geom_point()

Question - 45 : - What does the word ‘Naive’ mean in Naive Bayes?

Answer - 45 : -

Naive Bayes is a classification algorithm used in Data Science. It has the word ‘Bayes’ in its name because it is based on Bayes’ theorem, which gives the probability of an event occurring given that another event has already occurred.

It has ‘naive’ in its name because it assumes that each variable (feature) in the dataset is independent of the others, given the class. This assumption is rarely realistic for real-world data. Even so, the algorithm is very useful for solving a range of complicated problems, e.g., spam email classification.
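The independence assumption is what lets the algorithm multiply per-feature probabilities together. The sketch below shows that calculation on invented numbers: the priors, the likelihood tables, and the function name `naive_bayes_posterior` are all assumptions made for illustration.

```python
def naive_bayes_posterior(priors, likelihoods, features):
    """Return P(class | features), treating features as independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        # The 'naive' step: multiply the per-feature likelihoods.
        for feature in features:
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical probabilities for a tiny spam filter.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free"])
print(posterior)
```

With these numbers, an email containing the word "free" is considerably more likely to be spam than ham.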

Question - 46 : - Explain how Machine Learning is different from Deep Learning.

Answer - 46 : -

Machine Learning is a field of computer science, and a core part of Data Science, that deals with using existing data to help systems automatically learn to perform different tasks without the rules being explicitly programmed.

Deep Learning, on the other hand, is a subfield of Machine Learning that builds models using algorithms loosely inspired by how the human brain learns from information. In Deep Learning, we make heavy use of neural networks with many layers.

Question - 47 : - Write a function to calculate the Euclidean distance between two points.

Answer - 47 : -

The formula for calculating the Euclidean distance between two points (x1, y1) and (x2, y2) is as follows:

√((x1 - x2)^2 + (y1 - y2)^2)

Code for calculating the Euclidean distance is as given below:

def euclidean_distance(P1, P2):
    # P1 and P2 are (x, y) pairs
    return (((P1[0] - P2[0]) ** 2) + ((P1[1] - P2[1]) ** 2)) ** .5

Question - 48 : - Write code to calculate the root mean square error (RMSE) given the lists of values as actual and predicted.

Answer - 48 : -

To calculate the root mean square error (RMSE), we have to:

  • Calculate the errors, i.e., the differences between the actual and the predicted values
  • Square each of these errors
  • Calculate the mean of these squared errors
  • Return the square root of the mean

The code in Python for calculating RMSE is given below:

def rmse(actual, predicted):
    # Errors: differences between the actual and the predicted values
    errors = [a - p for a, p in zip(actual, predicted)]
    squared_errors = [x ** 2 for x in errors]
    mean = sum(squared_errors) / len(squared_errors)
    return mean ** .5

Question - 49 : - Mention the different kernel functions that can be used in SVM.

Answer - 49 : -

In SVM, four kernel functions are commonly used:

  • Linear kernel
  • Polynomial kernel
  • Radial basis kernel
  • Sigmoid kernel
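The four kernels listed above can be written as plain Python functions. This is a sketch using their usual textbook forms; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions chosen for the example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # Radial basis function: depends only on the squared distance.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))
print(polynomial_kernel(u, v, degree=2))
print(rbf_kernel(u, u))
print(sigmoid_kernel(u, v, gamma=0.1))
```

Each function takes two vectors and returns a single similarity value, which is what an SVM uses in place of an explicit dot product.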

Question - 50 : - How to detect if the time series data is stationary?

Answer - 50 : -

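Time series data is considered stationary when its statistical properties, such as the mean and the variance, stay constant over time. If neither the mean nor the variance changes over a period of time in the dataset, we can conclude that the data is stationary for that period. To detect this, we can plot the series together with its rolling mean and rolling variance and check that both stay roughly flat, or apply a statistical test such as the Augmented Dickey-Fuller (ADF) test.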
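A rough way to check for a constant mean and variance is to split the series into consecutive windows and compare their statistics. The sketch below is only a heuristic, not a statistical test: the function names, the number of windows, and the two tolerance parameters are assumptions chosen for illustration.

```python
from statistics import mean, pvariance


def window_stats(series, n_windows=4):
    """Split the series into equal consecutive windows; return (mean, variance) per window."""
    size = len(series) // n_windows
    windows = [series[i * size:(i + 1) * size] for i in range(n_windows)]
    return [(mean(w), pvariance(w)) for w in windows]


def looks_stationary(series, n_windows=4, mean_tolerance=0.5, max_variance_ratio=2.0):
    """Heuristic check that window means and variances stay roughly constant."""
    stats = window_stats(series, n_windows)
    means = [m for m, _ in stats]
    variances = [v for _, v in stats]
    overall_sd = pvariance(series) ** 0.5
    # The mean should not drift by more than a fraction of the overall spread.
    mean_ok = (max(means) - min(means)) <= mean_tolerance * overall_sd
    # The largest window variance should not dwarf the smallest one.
    if min(variances) > 0:
        variance_ok = max(variances) / min(variances) <= max_variance_ratio
    else:
        variance_ok = max(variances) == 0
    return mean_ok and variance_ok


flat = [1, -1] * 20       # oscillates around a constant mean
trend = list(range(40))   # mean grows steadily over time

print(looks_stationary(flat))
print(looks_stationary(trend))
```

For real analysis, a formal test such as the ADF test (available, for example, as `adfuller` in the statsmodels library) is the more reliable choice.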

