Data Science Interview Questions and Answers
Question - 61 : - Differentiate between univariate, bivariate, and multivariate analysis.
Answer - 61 : -
Univariate data, as the name suggests, contains only one variable. Univariate analysis describes that variable and finds patterns within it, for example its mean, spread, and distribution.
Bivariate data contains two different variables. Bivariate analysis examines the relationship between those two variables, for example through correlation or a scatter plot.
Multivariate data contains three or more variables. Multivariate analysis extends bivariate analysis to study the relationships among several variables at once, and it may involve more than one dependent variable.
Question - 62 : - How is random forest different from decision trees?
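Answer - 62 : - A decision tree is a single model that splits the data on one feature at a time. A random forest is an ensemble of many decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split. It combines their outputs by majority vote (classification) or by averaging (regression), which usually reduces overfitting compared with a single tree, at the cost of some interpretability.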
Question - 63 : - What is dimensionality reduction? What are its benefits?
Answer - 63 : -
Dimensionality reduction is the process of converting a data set with a large number of dimensions into one with fewer dimensions that conveys similar information more concisely.
It compresses the data and reduces storage space, and it shortens computation time because there are fewer dimensions to process. It also helps remove redundant features: for instance, there is no need to store the same value in two different units (meters and inches).
In short, dimensionality reduction reduces the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
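The redundant-feature case (the same length stored in meters and in inches) can be illustrated with a small feature selection sketch. The correlation filter, the 0.999 threshold, the column names, and the sample values below are all assumptions made for illustration; this is one simple way to drop a redundant column, not a general-purpose method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_redundant(features, threshold=0.999):
    """Keep a feature only if it is not almost perfectly correlated
    with a feature that has already been kept."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical data: length_in is just length_m expressed in inches.
length_m = [1.0, 2.0, 3.5, 5.0]
features = {
    "length_m": length_m,
    "length_in": [m * 39.3701 for m in length_m],
    "weight_kg": [10.0, 7.0, 12.0, 9.0],
}

kept = drop_redundant(features)
print(list(kept))
```

Here the inches column carries no information beyond the meters column, so it is dropped, while the weight column is kept. Feature extraction methods such as PCA instead build new variables from combinations of the original ones.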
Question - 64 : - For the given points, how will you calculate the Euclidean distance in Python? plot1 = [1, 3]; plot2 = [2, 5]
Answer - 64 : -
import math
# The two points from the question, in 2-dimensional space
plot1 = [1, 3]
plot2 = [2, 5]
# Square root of the sum of squared coordinate differences
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(plot1, plot2)))
print("Euclidean distance from plot1 to plot2:", distance)
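As an alternative, assuming Python 3.8 or later, the standard library's math.dist computes the same Euclidean distance directly. The sketch below uses the two points from the question.

```python
import math

plot1 = [1, 3]
plot2 = [2, 5]

# math.dist returns the Euclidean distance between two points
# given as sequences of coordinates with the same dimension.
distance = math.dist(plot1, plot2)
print("Euclidean distance:", distance)
```

Both approaches give the square root of 5, since the coordinate differences are 1 and 2.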
Question - 65 : - How should you maintain a deployed model?
Answer - 65 : - A deployed model needs to be monitored and periodically retrained so that its performance does not degrade. From the moment it is deployed, keep a record of the predictions it makes alongside the true values once they become known. That record can later be used to retrain the model on the new data. Wrong predictions should also go through root cause analysis.
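The record-keeping described above might be sketched as follows. The PredictionMonitor class, its method names, the window size, and the accuracy threshold are all hypothetical choices for illustration; a real deployment would pick its own metric and retraining policy.

```python
from collections import deque


class PredictionMonitor:
    """Hypothetical helper that tracks recent predictions against true values."""

    def __init__(self, window_size=100, min_accuracy=0.9):
        # Only the most recent window_size records are kept.
        self.records = deque(maxlen=window_size)
        self.min_accuracy = min_accuracy

    def log(self, predicted, actual):
        self.records.append((predicted, actual))

    def accuracy(self):
        if not self.records:
            return None
        correct = sum(1 for p, a in self.records if p == a)
        return correct / len(self.records)

    def needs_retraining(self):
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy

    def wrong_predictions(self):
        # Candidates for root cause analysis.
        return [(p, a) for p, a in self.records if p != a]


monitor = PredictionMonitor(window_size=4, min_accuracy=0.75)
for predicted, actual in [(1, 1), (0, 1), (1, 1), (0, 0)]:
    monitor.log(predicted, actual)

print(monitor.accuracy())
print(monitor.needs_retraining())
print(monitor.wrong_predictions())
```

The logged pairs double as new labeled data for retraining, and the list of wrong predictions is a starting point for root cause analysis.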
Question - 66 : - How do you find RMSE and MSE in a linear regression model?
Answer - 66 : - Mean squared error (MSE) is the average of the squared differences between the actual and predicted values over all n data points: MSE = (1/n) * sum of (actual value - predicted value)^2. Root mean squared error (RMSE) is the square root of the MSE, which expresses the error in the same units as the target variable.
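A minimal sketch of both metrics in plain Python follows. The function names and the sample values are assumptions made for illustration.

```python
import math


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical actual values and model predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

mse_value = mse(actual, predicted)
rmse_value = rmse(actual, predicted)
print("MSE:", mse_value)
print("RMSE:", rmse_value)
```

With these values the squared errors are 1, 0, 1, and 4, so the MSE is 6 / 4 = 1.5 and the RMSE is the square root of 1.5.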
Question - 67 : - Can you cite some examples where a false negative holds more importance than a false positive?
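Answer - 67 : - A false negative matters more when missing a real case is costlier than raising a false alarm. Disease prediction from symptoms is the classic example: for a disease such as cancer, telling a sick patient that they are healthy delays treatment, whereas a false positive typically leads only to further testing.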
Question - 68 : - How can outlier values be treated?
Answer - 68 : - Outlier values can be treated by replacing them with the mean, median, or mode, or by capping them at a chosen threshold value. Alternatively, rows that contain outliers can be removed if they make up only a small proportion of the data. A data transformation, such as a log transform, can also be applied to the variable to reduce the influence of outliers.
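Capping and removal can be sketched as below. The source does not say how outliers are detected, so the common 1.5 x IQR rule is used here as an assumption, and the sample values are made up for illustration.

```python
import statistics

# Hypothetical measurements with one obvious outlier (95)
values = [10, 12, 11, 13, 12, 11, 95]

# Quartiles via the standard library (default "exclusive" method)
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed detection rule: anything beyond 1.5 * IQR from the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: cap outliers at the bounds
capped = [min(max(v, lower), upper) for v in values]

# Option 2: remove the rows that contain outliers
filtered = [v for v in values if lower <= v <= upper]

print("bounds:", lower, upper)
print("capped:", capped)
print("filtered:", filtered)
```

Capping keeps every row but pulls the extreme value back to the upper bound, while filtering drops it entirely, which is reasonable only when outliers are a small share of the data.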
Question - 69 : - How can you calculate accuracy using a confusion matrix?
Answer - 69 : - The accuracy score can be calculated with the formula (TP + TN) / (TP + TN + FP + FN), where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
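The formula translates directly into code. The function name and the counts below are hypothetical values chosen for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical confusion matrix counts
acc = accuracy(tp=50, tn=35, fp=10, fn=5)
print("Accuracy:", acc)
```

Here 85 of the 100 predictions are correct, giving an accuracy of 0.85.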
Question - 70 : - What is the difference between “long” and “wide” format data?
Answer - 70 : - In wide format, each data point has a single row, with multiple columns holding the values of its attributes. In long format, each data point has as many rows as it has attributes, and each row holds the value of one attribute for that data point.
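The difference can be illustrated by reshaping a small wide table into long format. The sketch below uses plain Python lists of dictionaries; the wide_to_long helper, the column names, and the data are assumptions made for illustration.

```python
# Wide format: one row per data point, one column per attribute
wide_rows = [
    {"id": 1, "height": 170, "weight": 65},
    {"id": 2, "height": 180, "weight": 80},
]


def wide_to_long(rows, id_key):
    """Emit one row per (data point, attribute) pair."""
    long_rows = []
    for row in rows:
        for attribute, value in row.items():
            if attribute == id_key:
                continue
            long_rows.append(
                {id_key: row[id_key], "attribute": attribute, "value": value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, "id")
for row in long_rows:
    print(row)
```

Two data points with two attributes each become four long-format rows. In pandas, the same reshaping is commonly done with DataFrame.melt (wide to long) and DataFrame.pivot (long to wide).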