Top 70+ Machine Learning Interview Questions and Answers

Machine Learning is causing a revolution in industries by making systems learn from data to make intelligent decisions. With over 235,000 job openings worldwide, machine learning skills are in high demand in the data science field. The following are some of the important questions that you need to prepare to land your dream job.

Table of contents

Basic Machine Learning Interview Questions
Machine Learning Interview Questions for Freshers
Machine Learning Interview Questions for Experienced
Machine Learning Engineer Interview Questions
Machine Learning Algorithms Interview Questions

Top 10 Frequently Asked Machine Learning Interview Questions

1. What is Bias and Variance in Machine Learning?
2. How will you know which machine learning algorithm to choose for your classification problem?
3. What is the difference between Correlation and Causality?
4. When should Classification be used over Regression?
5. What is Clustering in Machine Learning?
6. What is Linear Regression in Machine Learning?
7. What is a Decision Tree in Machine Learning?
8. What are the types of Machine Learning?
9. What is Bayes’s Theorem in Machine Learning?
10. What is PCA in Machine Learning?

Basic Machine Learning Interview Questions

1. What is Bias and Variance in Machine Learning?

Bias is a statistical measure that indicates the difference between actual and predicted values—the more the difference, the higher the error and the bias.

Variance is another statistical measure that indicates how sensitive a model is to the particular data it was trained on. When the variance is too high, the model performs very well on training data but poorly on unseen data (overfitting). When the variance is very low but the bias is high, the model is too simple, ignores crucial data points and patterns, and underfits.

Keeping all these possibilities in mind, a model with low bias and low variance is the ideal situation: it is accurate on both training data and unseen data. In practice, reducing one tends to increase the other, so the goal is a good trade-off between the two.


2. How will you know which machine learning algorithm to choose for your classification problem?

There are no fixed rules for choosing a machine learning algorithm for a classification problem. However, to reduce the number of algorithms, we can use the following guidelines:

For small training datasets, use a model with high bias and low variance.
For large training datasets, use a model with high variance and low bias.

Lastly, if accuracy is something that you are looking for, then you have to individually test the models.

3. What is the difference between Correlation and Causality?

Correlation and Causality are two widely used concepts for describing relationships between variables. Here are the differences between Correlation and Causality:

Correlation	Causality
Correlation states the relation of one action(X) to the other action(Y), but it doesn’t imply that action(X) influences the change in action(Y)	Causality states a direct relationship between actions where the action(X) influences the change in action(Y)


4. When should Classification be used over Regression?

Both classification and regression are associated with prediction. Classification involves identifying which group a value or entity belongs to. Regression entails predicting a response value on a continuous scale. Classification is chosen over regression when the model needs to output the category that a data point belongs to. For example, if you want to predict the price of a house, you should use regression since it is a numerical variable. However, if you are trying to predict whether a house situated in a particular area is going to be high-, medium-, or low-priced, then a classification model should be used.

5. What is Clustering in Machine Learning?

Clustering is a machine learning technique used to group similar data together. It is helpful for exploratory data analysis, finding patterns and trends in the data and performing unsupervised tasks.

If this seems a bit overwhelming, let’s take an example to understand it:

Refer to the below image where we have a bunch of different animals and you want to organise them. Clustering algorithms allow you to organise similar animals together. For instance, they might cluster based on size, i.e. large animals and small animals, or based on looks, i.e. birds, snakes, lizards, etc. In this case, the animals are clustered based on type, i.e. wild animals and birds. Here we don’t need any names to group the animals. If we see any similarity between them, we group them together.

Here is a list of the most popular clustering machine-learning algorithms:

K means clustering
Hierarchical Clustering
DBSCAN – Density-Based Spatial Clustering of Applications with Noise
Fuzzy Clustering
Spectral Clustering
OPTICS – Ordering Points To Identify the Clustering Structure

6. What is Linear Regression in Machine Learning?

Linear Regression is a supervised machine learning algorithm that defines the relationship between a dependent variable and one or more independent variables.

These terms might be a bit confusing. Let’s take an example to understand it:

Imagine you have multiple jars of different sizes as shown in the picture below. You are required to fit candies into the jars. The bigger the jar, the more candies you can fit in it. Here the size of the jar is the independent variable and the number of candies is the dependent variable.

For a small jar like Jar A or B, you can fit a few candies.

For a big jar like Jar D or E, you can fit more candies.

If we plot this relationship on a graph, with one axis representing the size of the jar and the other representing the number of candies, we can draw a straight line that best fits the data points. This line will help us predict how many candies Jar C can store.

Here is an equation representing linear regression:

Y = a + bX

Where,

X is the input or independent variable
Y is the output or dependent variable
a is the intercept, and b is the coefficient of X
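
The jar example can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the jar sizes and candy counts are made-up numbers chosen only so that the fitted line is easy to check, and `np.polyfit` is used here as one convenient way to obtain a least-squares line.

```python
import numpy as np

# Hypothetical data: sizes (in litres) of jars A, B, D, E and the candies each holds.
jar_sizes = np.array([1.0, 2.0, 4.0, 5.0])
candies = np.array([12.0, 22.0, 42.0, 52.0])

# Fit Y = a + bX by least squares. polyfit returns the highest degree first,
# so for a straight line it returns (b, a).
b, a = np.polyfit(jar_sizes, candies, deg=1)

# Use the fitted line to predict the candies for jar C (assumed size: 3 litres).
predicted_c = a + b * 3.0

print(f"intercept a = {a:.2f}, coefficient b = {b:.2f}")
print(f"predicted candies in jar C = {predicted_c:.1f}")
```

Because the made-up points lie exactly on a line, the fit recovers that line; with real, noisy data the line would only approximate the points.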


7. What is a Decision Tree in Machine Learning?

A Decision Tree is a machine learning algorithm that models decisions and their outcomes in a hierarchical structure. Given below is a diagram of a decision tree.

The structure of a decision tree consists of:

Nodes: Each node in a decision tree represents a decision point

Branches: These are the outcomes of the decision that lead to further nodes or leaf nodes.

Leaf Nodes: Also known as terminal nodes, they represent the outcome of the decision tree.

Let’s understand it with the help of an example. We are trying to find out if a person is fit or not. Here is a decision tree representing the same. In this decision tree, there are a series of questions and outcomes based on which we can finally conclude whether a person is fit.


8. What are the types of Machine Learning?

There are three types of machine learning algorithms:

Supervised Machine Learning: In Supervised machine learning algorithms both the input data and the respective output data are provided to the model. It is just like a teacher asking a question and answering the same to make a student learn. Algorithms like Linear Regression, Logistic Regression, Decision Tree, and Random Forest are examples of supervised machine learning.

Unsupervised Machine Learning: Unsupervised machine learning is the opposite of supervised machine learning: only the input data is provided, without the respective output data, and the model has to find structure on its own. It is like a teacher showing pictures of different animals and leaving it to the students to explore the answers. The students might group the animals by their size and further do their due diligence to get an answer. Algorithms like K-means clustering, Hierarchical Clustering, DBSCAN, and Principal Component Analysis (PCA) are a few examples of unsupervised machine learning.

Reinforcement Learning: In reinforcement machine learning, the model learns from its past decisions and feedback. It is just like playing a game, where one gets points for doing well, while losing points for every mistake and learns to do better each time. Algorithms like Proximal Policy Optimization (PPO), Q-learning, Policy Gradients, and Deep Q-Networks (DQN) are examples of reinforcement machine learning.

9. What is Bayes’s Theorem in Machine Learning?

Bayes’s Theorem is used to find the probability of an event, given that another, related event has already occurred. Bayes’s theorem states, “The conditional probability of an event A, given the occurrence of another event B, is equal to the product of the probability of B given A and the probability of A, divided by the probability of event B.” In symbols: P(A|B) = P(B|A) × P(A) / P(B).

Seems a bit overwhelming, but let’s understand it with the help of an example:

Imagine today’s weather is cloudy which suggests that it might rain. However, it is not necessary that because it is cloudy it will rain. You have to decide whether to take an umbrella or not. Bayes’s theorem will help you in calculating the probability of rain today given that it is cloudy, so that you can make a more informed decision.

Two of the most significant applications of Bayes’s theorem in Machine Learning are Bayesian optimisation and Bayesian belief networks. This theorem is also the foundation of the branch of Machine Learning that includes the Naive Bayes classifier.
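
The umbrella example can be worked through numerically. The sketch below is only an illustration: all three input probabilities are hypothetical values chosen for the example, not measured weather statistics.

```python
# Hypothetical inputs (assumptions for illustration only).
p_rain = 0.2                  # P(rain): prior probability of rain on any day
p_cloudy_given_rain = 0.9     # P(cloudy | rain)
p_cloudy_given_no_rain = 0.3  # P(cloudy | no rain)

# Total probability of a cloudy day, P(cloudy).
p_cloudy = (p_cloudy_given_rain * p_rain
            + p_cloudy_given_no_rain * (1 - p_rain))

# Bayes's theorem: P(rain | cloudy) = P(cloudy | rain) * P(rain) / P(cloudy)
p_rain_given_cloudy = p_cloudy_given_rain * p_rain / p_cloudy

print(f"P(rain | cloudy) = {p_rain_given_cloudy:.3f}")
```

With these assumed numbers, seeing clouds roughly doubles the probability of rain compared with the prior, which is the kind of update the theorem formalises.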

10. What is PCA in Machine Learning?

Principal Component Analysis, or PCA, is an unsupervised machine learning method widely used for dimensionality reduction. Its primary objective is to transform a high-dimensional dataset into low-dimensional data while preserving as much of the variance of the data as possible.

The dimensions of data have to be reduced to analyse and visualise it easily. This is done by:

Removing irrelevant dimensions
Keeping only the most relevant dimensions

Mechanism of PCA:

Compute the covariance matrix of the (mean-centred) data objects
Compute the eigenvectors and eigenvalues of the covariance matrix and sort them by eigenvalue in descending order
Select the first N eigenvectors to get the new dimensions
Finally, project the initial n-dimensional data objects onto these N dimensions
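
The mechanism above can be sketched directly with NumPy. This is a minimal illustration under stated assumptions, not a replacement for a library implementation: the data is synthetic (three features, of which the third nearly duplicates the first), and the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-dimensional data; the third feature is almost a copy of the first.
x = rng.normal(size=(200, 1))
data = np.hstack([x, rng.normal(size=(200, 1)), x + 0.01 * rng.normal(size=(200, 1))])

# Step 1: centre the data and compute its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Step 2: eigen-decomposition, sorted by eigenvalue in descending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 3: keep the first N eigenvectors (here N = 2).
n_components = 2
components = eigenvectors[:, :n_components]

# Step 4: project the 3-dimensional data onto the 2 new dimensions.
reduced = centered @ components

explained = eigenvalues[:n_components].sum() / eigenvalues.sum()
print(reduced.shape, f"variance kept: {explained:.4f}")
```

Because one feature is nearly redundant, two components keep almost all of the variance, which is the situation where dropping a direction costs little.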

Example: Below are two graphs showing data points or objects and two directions, one is green and the other is yellow. Graph 2 is arrived at by rotating Graph 1 so that the x-axis and y-axis represent the green and yellow direction respectively.

After the rotation of data points, it can be inferred that the green direction, the x-axis, gives the line that best fits the data points.

Here, two-dimensional data is being represented; but in real life, the data would be multidimensional and complex. So, after recognizing the importance of each direction, the number of dimensions to analyse can be reduced by cutting off the less significant directions.


11. Differentiate between Classification and Regression in Machine Learning

Classification and Regression are both supervised machine learning tasks. Here are a few differences between them:

Classification	Regression
Classification Algorithms are used with discrete data.	Regression Algorithms are used with continuous data.
The target variables are always discrete.	The target variables are always continuous.
Metrics like Precision, Recall, and F1-Score are used to evaluate such algorithms.	Metrics like Mean Squared Error, R2-Score, and MAPE are used to evaluate such algorithms.
It can serve cases like spam detection and sentiment analysis.	It can serve use cases like Stock price prediction and house price prediction.

12. What is a Confusion Matrix?

A confusion matrix is used to explain a model’s performance and gives a summary of the predictions on a classification problem. It helps identify which classes the model confuses with one another.

A confusion matrix gives the counts of correct and incorrect predictions, broken down by error type. The accuracy of the model is:

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

For example, consider the following confusion matrix. It consists of the true positive, true negative, false positive, and false negative counts for a classification model. The accuracy of the model follows by substituting these counts into the formula above.

So, in the example:

Accuracy = (200 + 50) / (200 + 50 + 10 + 60) = 0.78

This means that the model accuracy is 0.78, corresponding to its True Positive, True Negative, False Positive, and False Negative values.
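
The calculation can be reproduced in a few lines of Python. The text gives the four counts but not which error count is which, so the assignment of 10 to false positives and 60 to false negatives below is an assumption; accuracy is the same either way.

```python
# Counts taken from the worked example above.
true_positives = 200
true_negatives = 50
# Assumption: the text does not say which error count is which.
false_positives = 10
false_negatives = 60

correct = true_positives + true_negatives
total = correct + false_positives + false_negatives
accuracy = correct / total

print(f"accuracy = {accuracy:.2f}")
```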

13. Explain Logistic Regression

Logistic regression is the appropriate regression analysis to use when the dependent variable is categorical, most commonly binary. Like all regression analyses, logistic regression is a technique for predictive analysis. Logistic regression is used to explain data and the relationship between one dependent binary variable and one or more independent variables. Logistic regression is also employed to predict the probability of categorical dependent variables.

Logistic regression can be used in the following scenarios:

To predict whether a citizen is a Senior Citizen (1) or not (0)
To check whether a person has a disease (Yes) or not (No)

There are three types of logistic regression:

Binary logistic regression: In this type of logistic regression, there are only two outcomes possible.

Example: To predict whether it will rain (1) or not (0)

Multinomial logistic regression: In this type of logistic regression, the output consists of three or more unordered categories.

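Example: Predicting whether a house is an apartment, a villa, or a townhouse. (Predicting whether the price of a house is high, medium, or low has ordered categories, so it fits ordinal logistic regression better.)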

Ordinal logistic regression: In this type of logistic regression, the output consists of three or more ordered categories.

Example: Rating an Android application from one to five stars.

Machine Learning Interview Questions for Freshers

14. What is Dimensionality Reduction?

In the real world, Machine Learning models are built on top of features and parameters. These features can be multidimensional and large in number. Sometimes, the features may be irrelevant and it becomes a difficult task to visualize them.

This is where dimensionality reduction is used to cut down irrelevant and redundant features with the help of principal variables. These principal variables are a smaller set derived from the original variables that preserves most of the information in the features.

15. Outlier Values can be Discovered from which Tools?

The various tools that can be used to discover outlier values are scatterplots, boxplots, Z-score, etc.

16. What are Type I and Type II Errors?

Type I Error: A Type I Error, or false positive, is an error where the outcome of a test rejects a condition (the null hypothesis) that is actually true.

For example, if a person gets diagnosed with depression even though they are not suffering from it, it is a case of false positive.

Type II Error: A Type II Error, or false negative, is an error where the outcome of a test accepts a condition (the null hypothesis) that is actually false.

For example, the CT scan of a person shows that they do not have a disease but in fact they do have the disease. Here, the test accepts the false condition that the person does not have the disease. This is a case of false negative.

17. How to handle Missing or Corrupted Data in a Dataset?

In Python pandas, the following methods can be used to locate missing or corrupted data and to discard or replace those values:

isnull(): It can be used for detecting the missing values.
dropna(): It can be used for removing columns or rows with null values.
fillna(): It can be used to fill the missing values with placeholder values.
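
A short sketch of these three pandas methods on a toy DataFrame follows. The column names and values are hypothetical, invented only to show the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in "age" and one in "city".
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "city": ["Pune", "Delhi", None],
    "score": [1, 2, 3],
})

# isnull(): boolean mask of missing cells; summing counts them per column.
missing_per_column = df.isnull().sum()

# dropna(): drop rows (default) or columns (axis=1) that contain nulls.
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)

# fillna(): replace nulls, here with a per-column placeholder.
filled = df.fillna({"age": df["age"].median(), "city": "unknown"})

print(missing_per_column)
print(filled)
```

Dropping is simplest but discards data; filling keeps every row at the cost of introducing placeholder values, so the choice depends on how much data is missing.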

18. When to use mean and when to use median to handle a missing numeric value?

We choose the mean to impute missing values when the data distribution is normal and there are no significant outliers, as the mean is sensitive to both. In contrast, we use the median in cases of skewed distributions or when outliers are present, because the median is more robust to these factors and provides a better central tendency measure under these conditions.
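
The effect of an outlier on the two statistics can be seen with a tiny example. The income figures below are hypothetical and include one deliberately extreme value.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands); the last value is an outlier.
incomes = [30, 32, 35, 38, 40, 400]

mean_value = mean(incomes)      # pulled upwards by the outlier
median_value = median(incomes)  # stays near the bulk of the data

print(f"mean = {mean_value:.2f}, median = {median_value}")
```

Imputing a missing income with the mean here would insert a value larger than five of the six observed incomes, whereas the median stays representative.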

19. In Machine Learning, for how many classes can Logistic Regression be used?

On its own, logistic regression is a binary classifier, so by default it handles two classes. However, it can be extended to solve multi-class classification problems, for example with multinomial logistic regression.

20. What is Overfitting in Machine Learning and how can it be avoided?

Overfitting happens when a model learns the noise and quirks of its training data instead of the underlying pattern, so it performs well on the training data but poorly on new data. It is more likely when the dataset is small relative to the complexity of the model, so the risk of overfitting generally falls as the amount of data grows.

For small datasets, overfitting can be detected and limited with the cross-validation method. In its simplest (holdout) form, a dataset is divided into two sections: a training dataset and a testing dataset. The training dataset is used to train the model, and the testing dataset is used to check how the model behaves on new inputs. Other common remedies are regularization, using a simpler model, and gathering more data.

21. What is ROC Curve and what does it represent?

ROC stands for receiver operating characteristic. ROC Curve is used to graphically represent the trade-off between true and false-positive rates.

In ROC, the area under the curve (AUC) gives an idea of how well the model separates the two classes.

The above graph shows a ROC curve. The greater the AUC, the better the performance of the model.
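
To make the two rates and the AUC concrete, here is a small pure-Python sketch. The labels and scores are hypothetical, and the AUC is computed from its ranking interpretation (the chance that a randomly chosen positive gets a higher score than a randomly chosen negative) rather than by integrating the curve.

```python
def roc_point(labels, scores, threshold):
    """Return (false_positive_rate, true_positive_rate) at one threshold."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_point(labels, scores, threshold=0.5)
area = auc(labels, scores)
print(f"FPR = {fpr}, TPR = {tpr}, AUC = {area}")
```

Sweeping the threshold from high to low and plotting each (FPR, TPR) pair traces out the ROC curve.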


22. What do you understand about the P-value?

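P-value is used in decision-making while testing a hypothesis. It is the probability of observing results at least as extreme as those actually obtained, assuming the null hypothesis is true; equivalently, it is the smallest significance level at which the null hypothesis can be rejected. A lower P-value is stronger evidence against the null hypothesis.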

23. What is meant by Correlation and Covariance?

Correlation is a statistical concept that measures the strength and direction of the linear relationship between two variables. It is a standardized quantity that always lies between -1 and 1, which makes it easy to compare across different pairs of variables.

Covariance is a related concept that measures how two variables vary together: a positive covariance means they tend to increase together, and a negative covariance means one tends to decrease as the other increases. Unlike correlation, its magnitude depends on the units of the variables; correlation is the covariance divided by the product of the two standard deviations.

24. What is semi-supervised learning?

Semi-supervised machine learning algorithms are such algorithms that use a small amount of labelled data and a large amount of unlabelled data to train the model.

This is done when obtaining labelled data is difficult or time-consuming but unlabelled data is abundant. It is often used in projects that involve data classification, voice recognition, etc., where acquiring labelled data can be complex.

25. Can logistic regression be applied to more than two classes?

No, a simple logistic regression can’t be applied to more than two classes. However, with the help of a multinomial logistic regression, one can handle such situations.

26. Why are Validation and Test Datasets Needed?

Data is split into three different categories while creating a model:

Training dataset: Training dataset is used for building a model and adjusting its variables. The correctness of the model built on the training dataset cannot be relied on as the model might give incorrect outputs after being fed new inputs.
Validation dataset: Validation dataset is used to look into a model’s response. After this, the hyperparameters are tuned on the basis of the model’s measured performance on the validation dataset. When a model’s response is evaluated by using the validation dataset, the model is indirectly trained with the validation set. This may lead to the overfitting of the model to specific data. So, this model will not be strong enough to give the desired response to real-world data.
Test dataset: Test dataset is the subset of the actual dataset, which is not yet used to train the model. The model is unaware of this dataset. So, by using the test dataset, the response of the created model can be computed on hidden data. The model’s performance is tested on the basis of the test dataset. Note: The model is always exposed to the test dataset after tuning the hyperparameters on top of the validation dataset.

As we know, the evaluation of the model on the basis of the validation dataset would not be enough. Thus, the test dataset is used for computing the efficiency of the model.

27. Explain the difference between KNN and K-means Clustering

K-nearest neighbours (KNN): It is a supervised Machine Learning algorithm. In KNN, identified or labelled data is given to the model. The model then matches the points based on the distance from the closest points.

K-means clustering: It is an unsupervised Machine Learning algorithm. In K-means clustering, unidentified or unlabelled data is given to the model. The algorithm then groups the points into k clusters by assigning each point to the nearest cluster centre, which is the mean of the points in that cluster.

28. What is meant by Parametric and Non-parametric Models?

Parametric models refer to the models having a limited number of parameters. In case of parametric models, only the parameter of a model is needed to be known to make predictions regarding the new data.

Non-parametric models do not have any restrictions on the number of parameters, which makes new data predictions more flexible. In case of non-parametric models, the knowledge of model parameters and the state of the data needs to be known to make predictions.

Machine Learning Interview Questions for Experienced

29. What is Support Vector Machine (SVM) in Machine Learning?

SVM is a Machine Learning algorithm that is mainly used for classification. It works well when the feature vector is high-dimensional.

The following is the code for SVM classifier:

# Importing required libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Loading the Iris dataset
iris = datasets.load_iris()
# A -> features and B -> label
A = iris.data
B = iris.target
# Breaking A and B into train and test data
A_train, A_test, B_train, B_test = train_test_split(A, B, random_state=0)
# Training a linear SVM classifier
svm_model_linear = SVC(kernel='linear', C=1).fit(A_train, B_train)
svm_predictions = svm_model_linear.predict(A_test)
# Model accuracy for A_test
accuracy = svm_model_linear.score(A_test, B_test)
# Creating a confusion matrix
cm = confusion_matrix(B_test, svm_predictions)

30. What is Cross-validation in Machine Learning?

Cross-validation is a technique for estimating how well a given Machine Learning algorithm will perform on unseen data, using several samples drawn from the dataset. The dataset is broken into smaller parts that have roughly the same number of rows; one part is selected as the test set and the rest are kept as the training set, and the process is repeated so that each part serves as the test set in turn. Cross-validation consists of the following techniques:

Holdout method
K-fold cross-validation
Stratified k-fold cross-validation
Leave p-out cross-validation
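
As an illustration of one of these techniques, the sketch below runs stratified 5-fold cross-validation with scikit-learn on the Iris dataset. The choice of logistic regression as the model, five folds, and the random seed are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

features, labels = load_iris(return_X_y=True)

# Stratified k-fold keeps the class proportions the same in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is used once as the test set while the other four train the model.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, features, labels, cv=folds)

print("accuracy per fold:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

Reporting the mean and spread of the per-fold scores gives a more stable estimate than a single train/test split.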

31. What is Entropy in Machine Learning?

Entropy in Machine Learning measures the randomness in the data that needs to be processed. The more entropy in the given data, the more difficult it becomes to draw any useful conclusion from the data. For example, let us take the flipping of a coin. The result of this act is random as it does not favour heads or tails. Here, the result for any number of tosses cannot be predicted easily as there is no definite relationship between the action of flipping and the possible outcomes.
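
The coin example can be quantified with Shannon entropy, measured in bits. The sketch below is a minimal illustration; the biased-coin probabilities are hypothetical.

```python
import math


def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = shannon_entropy([0.5, 0.5])    # most unpredictable two-outcome case
biased_coin = shannon_entropy([0.9, 0.1])  # hypothetical biased coin
certain = shannon_entropy([1.0, 0.0])      # outcome known in advance

print(fair_coin, round(biased_coin, 3), certain)
```

The fair coin has the highest entropy of the three, matching the intuition that its outcome is the hardest to predict.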

32. What is Epoch in Machine Learning?

Epoch in Machine Learning indicates the number of complete passes that the Machine Learning algorithm has made over the training dataset. Generally, when there is a large chunk of data, it is grouped into several batches. Each batch is passed through the given model, and processing one batch is referred to as an iteration. Now, if the batch size comprises the complete training dataset, then the count of iterations is the same as that of epochs.
In case there is more than one batch, d*e=i*b is the formula used, wherein d is the number of samples in the dataset, e is the number of epochs, i is the total number of iterations, and b is the batch size.
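
The relationship between these quantities can be checked with a quick calculation. The dataset size, batch size, and epoch count below are hypothetical; the identity holds exactly only when the batch size divides the dataset size evenly.

```python
import math

# Hypothetical training setup.
dataset_size = 1000   # d: number of samples
batch_size = 100      # b
epochs = 5            # e

# One iteration processes one batch; a partial final batch still counts as one.
iterations_per_epoch = math.ceil(dataset_size / batch_size)
total_iterations = iterations_per_epoch * epochs  # i

print(iterations_per_epoch, total_iterations)
# d * e == i * b holds here because 100 divides 1000 evenly.
print(dataset_size * epochs == total_iterations * batch_size)
```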

33. What are the Two Main Types of Filtering in Machine Learning? Explain.

The two types of filtering are:

Collaborative filtering
Content-based filtering

Collaborative filtering refers to a recommender system where the interests of the individual user are matched with preferences of multiple users to predict new content. Content-based filtering is a recommender system where the focus is only on the preferences of the individual user and not on multiple users.

34. What is meant by Ensemble Learning?

Ensemble learning refers to the combination of multiple Machine Learning models to create more powerful models. The primary techniques involved in ensemble learning are bagging and boosting.

35. What are the Various Kernels that are present in SVM?

The various kernels that are present in SVM are:

Linear
Polynomial
Radial Basis
Sigmoid

36. What is Rescaling of Data and how is it done?

In real-world scenarios, the attributes present in data are in a varying pattern. So, rescaling the characteristics to a common scale is beneficial for algorithms to process data efficiently. We can rescale data using Scikit-learn. The code for rescaling the data using MinMaxScaler is as follows:

# Rescaling data
import numpy
import pandas
from sklearn.preprocessing import MinMaxScaler
# Placeholder: replace with the location of your CSV file with nine columns
url = "path/to/data.csv"
names = ['Abhi', 'Piyush', 'Pranay', 'Sourav', 'Sid', 'Mike', 'pedi', 'Jack', 'Tim']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# Splitting the array into input and output
X = array[:, 0:8]
Y = array[:, 8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# Summarizing the modified data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])

Apart from the theoretical concepts, some interviewers also focus on the implementation of Machine Learning topics. The following ML Interview Questions are related to the implementation of theoretical concepts.

37. What is the difference between StandardScaler and MinMaxScaler?

StandardScaler and Min-Max scaling are two common data preprocessing techniques used in machine learning. The key differences are:

StandardScaler (Z-score normalization):
- Scales data to have a mean of 0 and a standard deviation of 1.
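- Suitable for algorithms assuming normally distributed features; less distorted by outliers than Min-Max scaling, although the mean and standard deviation are still affected by them.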
Min-Max Scaling:
- Scales data to a specific range, often between 0 and 1.
- Useful for models sensitive to feature magnitudes, but can be influenced by outliers.
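
The difference is easy to see by applying both formulas to the same values. The sketch below uses plain NumPy rather than scikit-learn so the arithmetic is visible; the values are hypothetical and include one outlier. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values with one outlier (100).
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Z-score standardization: subtract the mean, divide by the standard deviation.
standardized = (values - values.mean()) / values.std()

# Min-max scaling to the range [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

print("standardized:", standardized.round(3))
print("min-max:     ", min_max.round(3))
```

Under min-max scaling the outlier pins the top of the range, so the four ordinary values are squeezed into a narrow band near 0.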

38. How to Standardize Data?

Standardization is a method used for rescaling data attributes so that each attribute has a mean of 0 and a standard deviation of 1. Putting all attributes on this common scale is the main objective of standardization. Data can be standardized using Scikit-learn. The code for standardizing the data using StandardScaler is as follows:

# Python code to Standardize data (0 mean, 1 stdev)
import numpy
import pandas
from sklearn.preprocessing import StandardScaler
# Placeholder: replace with the location of your CSV file with nine columns
url = "path/to/data.csv"
names = ['Abhi', 'Piyush', 'Pranay', 'Sourav', 'Sid', 'Mike', 'pedi', 'Jack', 'Tim']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# Separate the array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# Summarize the transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5, :])

39. Executing a binary classification tree algorithm is a simple task. But how does tree splitting take place? How does the tree determine which variable to break at the root node and which at its child nodes?

Gini index and Node Entropy help the binary classification tree make decisions. Basically, the tree algorithm chooses the feature that splits the data into the most homogeneous (pure) child nodes. According to the Gini index, if we arbitrarily pick a pair of objects from a pure group, they will be of identical class with probability 1. The following are the steps to compute the Gini index:

Compute Gini for sub-nodes with the formula: The sum of the square of probability for success and failure (p^2 + q^2)
Compute Gini for split by weighted Gini rate of every node of the split

Now, Entropy is the degree of impurity (disorder) in a node and is given by the following:

Entropy = -a log2(a) - b log2(b)

Where a and b are the probabilities of success and failure in the node.

When Entropy = 0, the node is homogeneous.
When Entropy is at its maximum, both groups are present at 50–50 percent in the node.

Finally, to determine the suitability of a feature for splitting at the root node, the entropy of the resulting child nodes should be as low as possible.
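
A small sketch can show how these two scores compare candidate splits. It follows the Gini formulation used above (p^2 + q^2, where a higher value means a purer node); the two candidate splits and their sample counts are hypothetical.

```python
import math


def gini_score(p):
    """Gini score as defined above: p^2 + q^2 (1.0 for a pure node)."""
    return p ** 2 + (1 - p) ** 2


def node_entropy(p):
    """Binary entropy in bits (0.0 for a pure node, 1.0 at 50-50)."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)


def weighted_score(score_fn, nodes):
    """Weight each child node's score by its share of the samples."""
    total = sum(n for n, _ in nodes)
    return sum(n / total * score_fn(p) for n, p in nodes)


# Each child node is (number of samples, probability of success); hypothetical.
split_a = [(10, 0.9), (10, 0.1)]  # separates the classes well
split_b = [(10, 0.6), (10, 0.4)]  # barely separates them

gini_a = weighted_score(gini_score, split_a)
gini_b = weighted_score(gini_score, split_b)
entropy_a = weighted_score(node_entropy, split_a)
entropy_b = weighted_score(node_entropy, split_b)

print(f"Gini:    A = {gini_a:.2f}, B = {gini_b:.2f}")
print(f"Entropy: A = {entropy_a:.3f}, B = {entropy_b:.3f}")
```

Both criteria prefer split A: it has the higher weighted Gini score and the lower weighted entropy, so the tree would choose the feature behind split A.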

40. What is F1-score and How Is It Used?

F-score or F1-score is a measure of the overall accuracy of a binary classification model. Before understanding F1-score, it is crucial to understand two more measures of accuracy, i.e., precision and recall.

Precision is defined as the ratio of True Positives to the total number of positive classifications predicted by the model. In other words,

Precision = No. of True Positives / (No. of True Positives + No. of False Positives)

Recall is defined as the ratio of True Positives to the total number of actual positive labelled data passed to the model. In other words,

Recall = No. of True Positives / (No. of True Positives + No. of False Negatives)

Both precision and recall are partial measures of the accuracy of a model. F1-score combines precision and recall (it is their harmonic mean) and provides an overall score to measure a model’s accuracy.

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

This is why F1-score is one of the most popular measures of accuracy for Machine-Learning-based binary classification models, particularly when the classes are imbalanced.
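
The three quantities can be computed directly from confusion-matrix counts. The counts in this sketch are hypothetical, chosen only to give round numbers.

```python
# Hypothetical counts from a binary classifier.
true_positives = 80
false_positives = 20
false_negatives = 40

# Precision: of everything predicted positive, how much was correct.
precision = true_positives / (true_positives + false_positives)

# Recall: of everything actually positive, how much was found.
recall = true_positives / (true_positives + false_negatives)

# F1-score: harmonic mean of precision and recall.
f1_score = 2 * (precision * recall) / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1_score:.3f}")
```

Because it is a harmonic mean, the F1-score sits closer to the lower of the two values, so a model cannot score well by maximising only precision or only recall.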

41. How to Implement the KNN Classification Algorithm?

The Iris dataset is used for implementing the KNN classification algorithm.

# KNN classification algorithm
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
iris_dataset = load_iris()
A_train, A_test, B_train, B_test = train_test_split(iris_dataset["data"], iris_dataset["target"], random_state=0)
# Train a 1-nearest-neighbour classifier
kn = KNeighborsClassifier(n_neighbors=1)
kn.fit(A_train, B_train)
# Classify a new, unseen flower measurement
A_new = np.array([[8, 2.5, 1, 1.2]])
prediction = kn.predict(A_new)
print("Predicted target value: {}\n".format(prediction))
print("Predicted feature name: {}\n".format(iris_dataset["target_names"][prediction]))
print("Test score: {:.2f}".format(kn.score(A_test, B_test)))

Output:

Predicted Target Name: [0]
Predicted Feature Name: ['Setosa']
Test Score: 0.92

42. What is Hypothesis in Machine Learning?

Machine Learning allows the use of available dataset to understand a specific function that maps input to output in the best possible way. This problem is known as function approximation. Here, approximation needs to be used for the unknown target function that maps all plausible observations based on the given problem in the best manner. Hypothesis in Machine learning is a model that helps in approximating the target function and performing the necessary input-to-output mappings. The choice and configuration of algorithms allow defining the space of plausible hypotheses that may be represented by a model. In the hypothesis, lowercase h (h) is used for a specific hypothesis, while uppercase h (H) is used for the hypothesis space that is being searched. Let us briefly understand these notations:

Hypothesis (h): A hypothesis is a specific model that helps in mapping input to output; the mapping can further be used for evaluation and prediction.
Hypothesis set (H): Hypothesis set consists of a space of hypotheses that can be used to map inputs to outputs, which can be searched. The general constraints include the choice of problem framing, the model, and the model configuration.

Machine Learning Engineer Interview Questions

43. What are the Various Tests for Checking the Normality of a Dataset?

In Machine Learning, checking the normality of a dataset is very important. Hence, certain tests are performed on a dataset to check its normality. Some of them are:

D’Agostino Skewness Test
Shapiro-Wilk Test
Anderson-Darling Test
Jarque-Bera Test
Kolmogorov-Smirnov Test
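
As an illustration of how one of these tests works, here is a minimal pure-Python sketch of the Jarque-Bera statistic, which combines sample skewness and kurtosis. The function name and the sample data are hypothetical, and the sketch only computes the statistic (not a p-value). In practice, SciPy offers ready-made versions of these tests, such as scipy.stats.shapiro, scipy.stats.anderson, scipy.stats.jarque_bera, scipy.stats.kstest, and scipy.stats.skewtest.

```python
def jarque_bera(values):
    """Return the Jarque-Bera statistic for a list of numbers.

    Values near 0 are consistent with a normal distribution;
    large values indicate skewness and/or non-normal kurtosis.
    """
    n = len(values)
    mean = sum(values) / n
    # Central moments (population form, dividing by n)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)


# Hypothetical samples: one symmetric, one heavily right-skewed
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 1, 1, 1, 2, 2, 3, 10, 50]
print(jarque_bera(symmetric))
print(jarque_bera(skewed))
```

The skewed sample produces a much larger statistic than the symmetric one, which is the signal the test uses to reject normality.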

44. What are Different Kernels in SVM?

There are four commonly used types of kernels in SVM. Choosing the correct kernel can significantly impact the performance of the model, so it is essential to understand which kernel suits which kind of data. In practice, the choice is often made using cross-validation.

Here are the different kernels in SVM:

Linear Kernel: It is suitable for linearly separable data.
Polynomial Kernel: It is ideal for data with a curved, polynomial relationship.
Radial Basis Function (RBF)/Gaussian Kernel: It is ideal for non-linear data and is a common default choice.
Sigmoid Kernel: It behaves like the activation function of a neural network and is occasionally used as a proxy for one.
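
The four kernels above can be sketched as plain Python functions. This is only an illustration of the formulas; the parameter names (gamma, coef0, degree) and their default values are assumptions chosen for the example, not tuned settings.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    # Similarity decays with the squared Euclidean distance
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


a, b = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(a, b))
print(polynomial_kernel(a, b, degree=2))
print(rbf_kernel(a, b))
print(sigmoid_kernel(a, b))
```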

45. Why do we perform normalisation?

Data is normalised to bring all features onto a comparable scale, so that features with large numeric ranges do not dominate those with small ranges. If normalisation is not done, gradient-based training can converge slowly or fail to converge to a minimum, making the model unstable.
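
Two common ways of rescaling a feature are min-max scaling and standardisation (z-scores). The sketch below is a minimal pure-Python illustration; the function names and the income values are hypothetical, and it assumes the feature is not constant (otherwise the denominators would be zero).

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


incomes = [20000, 35000, 50000, 80000]  # hypothetical feature values
print(min_max_scale(incomes))
print(standardize(incomes))
```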

46. How is a decision tree pruned?

The branches with weak predictive power are removed to prune a decision tree. This reduces the tree’s complexity and can improve its accuracy on unseen data. Approaches like cost complexity pruning and reduced error pruning are used to prune a decision tree.

47. Which is more important to you: model accuracy or model performance?

Model accuracy is just one aspect of model performance. Overall performance, which covers accuracy along with the F1 score, AUC-ROC, and other metrics, is therefore more important for understanding how well a model is performing.

48. Both being Tree-based Algorithms, how is Random Forest different from Gradient Boosting Machine (GBM)?

The main difference between a random forest and GBM lies in the ensemble technique they use. Random forest builds its predictions using a technique called bagging. On the other hand, Gradient Boosting Machine builds its predictions with the help of a technique called boosting.

Bagging: In bagging, we draw N random samples (with replacement) from the dataset. After that, we build a model on each sample using a single training algorithm. Following that, we combine the final predictions by voting (or averaging, for regression). Bagging helps to improve a model’s generalisation by decreasing the variance and so limiting overfitting.
Boosting: In boosting, models are trained sequentially, and each new iteration tries to correct the wrong predictions made in the earlier iterations. This sequence of corrections continues until the desired prediction quality is reached. Boosting mainly reduces bias, and often variance as well, by combining weak learners into a strong one.

49. Differentiate between Sigmoid and Softmax Functions

Sigmoid and Softmax functions differ in the classification tasks they are used for. The Sigmoid function is used for binary classification, where it maps a single score to a probability between 0 and 1, while the Softmax function is used for multi-class classification, where it turns a vector of scores into probabilities that sum to 1.
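
A minimal pure-Python sketch of the two functions follows; the input scores are hypothetical. It also shows that a two-class softmax over the scores [z, 0] gives the same probability as sigmoid(z), which is why sigmoid is enough for the binary case.

```python
import math


def sigmoid(z):
    """Map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map a list of scores to probabilities that sum to 1."""
    top = max(scores)
    # Subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(1.5))               # binary case: one probability
print(softmax([2.0, 1.0, 0.1]))   # multi-class case: one probability per class
print(softmax([1.5, 0.0])[0])     # matches sigmoid(1.5)
```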

50. Suppose you found that your model is suffering from high variance. Which algorithm do you think could handle this situation and why?

Handling High Variance

For handling issues of high variance, we should use a bagging algorithm, such as random forest.
The bagging algorithm splits the data into subgroups by randomly sampling the data with replacement.
Once the data is split, a model is trained on each random subgroup using a particular training algorithm.
After that, the predictions of the individual models are combined by voting (or averaging), which cancels out much of their individual variance.
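
The steps above can be sketched in a few lines of pure Python. This is a toy illustration, not a production implementation: the base learner is a hypothetical one-feature decision stump, the data is made up, and the helper names (fit_stump, bagging_fit, bagging_predict) are assumptions for the example.

```python
import random


def fit_stump(samples):
    """Pick the threshold t that minimises errors for the rule: predict 1 if x > t."""
    best_t, best_errors = None, None
    for t in sorted({x for x, _ in samples}):
        errors = sum((x > t) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def bagging_fit(samples, n_models=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(samples) points with replacement
        bootstrap = rng.choices(samples, k=len(samples))
        models.append(fit_stump(bootstrap))
    return models


def bagging_predict(models, x):
    """Combine the stumps by majority vote."""
    votes = sum(1 for t in models if x > t)
    return 1 if votes > len(models) / 2 else 0


# Hypothetical 1-D data: small x -> class 0, large x -> class 1
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0), bagging_predict(models, 10))  # prints: 0 1
```

Each stump sees a slightly different bootstrap sample and so picks a slightly different threshold; the majority vote smooths out these differences.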

51. What is Binarizing of Data? How to Binarize?

Converting data into binary values on the basis of a threshold is known as binarizing of data. Values less than or equal to the threshold are set to 0, and values greater than the threshold are set to 1. This process is useful during feature engineering, and it can also be used to add new indicator features. Data can be binarized using Scikit-learn. The code for binarizing data using Binarizer is as follows:

from sklearn.preprocessing import Binarizer
import pandas
import numpy

# `url` is assumed to hold the path or URL of a CSV file with nine numeric columns
names = ['Abhi', 'Piyush', 'Pranay', 'Sourav', 'Sid', 'Mike', 'pedi', 'Jack', 'Tim']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values

# Splitting the array into input (first eight columns) and output (last column)
X = array[:, 0:8]
Y = array[:, 8]

# Values greater than the threshold become 1; all others become 0
binarizer = Binarizer(threshold=0.0).fit(X)
binaryX = binarizer.transform(X)

# Summarizing the modified data
numpy.set_printoptions(precision=3)
print(binaryX[0:5, :])
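
For a self-contained illustration that needs no external data, the same thresholding rule can be written in plain Python. The binarize helper and the small matrix X below are hypothetical and only mirror the rule described above (values greater than the threshold become 1, everything else becomes 0).

```python
def binarize(rows, threshold=0.0):
    """Return a copy of rows with each value mapped to 1 (> threshold) or 0."""
    return [[1 if value > threshold else 0 for value in row] for row in rows]


X = [[-1.5, 0.0, 2.3],
     [0.7, -0.2, 0.0]]
print(binarize(X))                 # [[0, 0, 1], [1, 0, 0]]
print(binarize(X, threshold=-1))   # [[0, 1, 1], [1, 1, 1]]
```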

52. We know that one-hot encoding increases the dimensionality of a dataset, but label encoding doesn’t. How?

When one-hot encoding is used, there is an increase in the dimensionality of a dataset. The reason is that every class of a categorical variable becomes a separate binary variable.

Example: Suppose there is a variable “Color.” It has three levels, “Yellow,” “Purple,” and “Orange.” So, one-hot encoding “Color” will create three different variables: Color.Yellow, Color.Purple, and Color.Orange.

In label encoding, each class of a variable is replaced by an integer (0, 1, 2, and so on) within the same column, so no new variables are created. Because these integers imply an order, label encoding is best suited to binary or ordinal variables.

This is why one-hot encoding increases the dimensionality of data and label encoding does not.
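
The difference can be seen in a short pure-Python sketch using the “Color” example. The variable names are hypothetical, and the alphabetical ordering of the categories is just one possible convention.

```python
colors = ["Yellow", "Purple", "Orange", "Purple"]
categories = sorted(set(colors))  # ['Orange', 'Purple', 'Yellow']

# Label encoding: one integer per category, still a single column
label_index = {category: i for i, category in enumerate(categories)}
label_encoded = [label_index[c] for c in colors]

# One-hot encoding: one 0/1 column per category
one_hot_encoded = [[1 if c == category else 0 for category in categories]
                   for c in colors]

print(label_encoded)     # [2, 1, 0, 1]
print(one_hot_encoded)   # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

The label-encoded result has one value per row, while the one-hot result has three values per row, one for each category.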

53. Imagine you are given a dataset consisting of variables having more than 30% missing values. Let’s say, out of 50 variables, 16 variables have missing values, which is higher than 30%. How will you deal with them?

To deal with the missing values, we will do the following:

We will assign the missing values to a separate class of their own, since the fact that a value is missing may itself carry information.
Next, we will check the distribution of the missing values and keep the variables whose missing values show a pattern.
Then, we will treat the missing values in those variables as a distinct category and eliminate the remaining variables.

54. How come logistic regression is labelled as a regression method when it is primarily used for classification tasks?

Logistic regression earns its name primarily from its historical connection with linear regression: it fits a linear model, but to the log-odds of an outcome rather than to the outcome itself. Its main use, however, is in classification tasks, because it models the probability of an observation belonging to a specific class. In practice, it estimates the likelihood of an event’s occurrence, and a threshold on that probability turns it into a class label, as in spam detection or medical diagnosis. Thus, despite being called “regression,” it is predominantly used for classification, which explains its frequent association with classification algorithms.

55. How is the suitability of a Machine Learning Algorithm determined for a particular problem?

To identify a Machine Learning Algorithm for a particular problem, the following steps should be followed:

Step 1: Problem classification: Classification of the problem depends on the classification of input and output:

Classifying the input: Classification of the input depends on whether there is data labelled (supervised learning) or unlabelled (unsupervised learning), or whether a model has to be created that interacts with the environment and improves itself (reinforcement learning.)
Classifying the output: If the output of a model is required as a class, then classification techniques need to be used. If the output is a number, then regression techniques must be used; if the output is a grouping of the inputs, then clustering techniques should be used.

Step 2: Checking the algorithms in hand: After classifying the problem, the available algorithms that can be deployed for solving the classified problem should be considered.

Step 3: Implementing the algorithms: If there are multiple algorithms available, then all of them are to be implemented. Finally, the algorithm that gives the best performance is selected.

56. What is the Variance Inflation Factor?

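Variance inflation factor (VIF) measures how much multicollinearity there is among the independent variables of a regression model.

VIF = Variance of a coefficient in the full model / Variance of that coefficient in a model with only that independent variable

Equivalently, VIF = 1 / (1 − R²), where R² is obtained by regressing that independent variable on all the other independent variables. This ratio has to be calculated for every independent variable. If VIF is high (values above 5 or 10 are common rules of thumb), the variable is highly collinear with the other independent variables.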
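
As a small worked example, the sketch below uses the standard formulation VIF = 1 / (1 − R²) for the special case of two predictors, where R² is simply the squared Pearson correlation between them. The helper names and the data are hypothetical.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF of x1 (and of x2) when the model has exactly these two predictors."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)


x1 = [1.0, 2.0, 3.0, 4.0]
uncorrelated = [1.0, -1.0, -1.0, 1.0]   # correlation with x1 is 0
nearly_copy = [1.1, 1.9, 3.2, 3.9]      # almost the same as x1

print(vif_two_predictors(x1, uncorrelated))  # 1.0: no inflation
print(vif_two_predictors(x1, nearly_copy))   # large: strong collinearity
```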

Machine Learning Algorithms Interview Questions

57. Why is rotation required in PCA? What will happen if the components are not rotated?

Rotation is a significant step in principal component analysis (PCA). Rotation maximizes the differences between the variances captured by the components. This makes the interpretation of the components easier.

The motive behind conducting PCA is to choose fewer components that can explain the greatest variance in a dataset. When rotation is performed, the original coordinates of the points get changed. However, there is no change in the relative position of the components.

If the components are not rotated, the effect of PCA diminishes, and more components are needed to describe the same variance.

58. What is the difference between stochastic gradient descent (SGD) and gradient descent (GD)?

Here are the differences between stochastic gradient descent (SGD) and gradient descent (GD):

Stochastic Gradient Descent	Gradient Descent
SGD uses randomly selected sample data from the dataset to compute the gradient and update the model parameter	GD uses an entire dataset to compute the gradient and update the model parameter
Good for large datasets	Good for small datasets
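
The difference in how the gradient is computed can be sketched on a toy problem: fitting the single weight w in y = w * x by minimising the squared error. The data, learning rate, and function names below are assumptions chosen for the illustration.

```python
import random


def full_gradient(w, data):
    """Gradient of the mean squared error over the entire dataset."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def gradient_descent(data, lr=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        # One update per pass, using every sample
        w -= lr * full_gradient(w, data)
    return w


def stochastic_gradient_descent(data, lr=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        # One update per sample, visiting samples in random order
        for x, y in rng.sample(data, len(data)):
            w -= lr * 2 * (w * x - y) * x
    return w


# Hypothetical noise-free data generated with the true weight w = 3
data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
print(gradient_descent(data))
print(stochastic_gradient_descent(data))
```

Both versions approach w = 3 here. GD makes one update per pass over the data, whereas SGD makes one update per sample, which is why SGD scales better to large datasets.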

59. Why do we need to convert categorical variables into factors? Which functions are used to perform the conversion?

Machine Learning models usually take numeric input. That is the major reason why categorical variables are converted into factors, which store each category as an integer level. In R, functions like as.factor() and factor() are used to perform this conversion.

60. Explain False Negative, False Positive, True Negative, and True Positive with a simple example.

Consider a spam filter in which “spam” is the positive class.

True Positive (TP): The model predicts the positive class and the actual class is positive. Example: a spam email is flagged as spam.

True Negative (TN): The model predicts the negative class and the actual class is negative. Example: a genuine email is delivered to the inbox.

False Positive (FP): The model predicts the positive class but the actual class is negative. Example: a genuine email is wrongly flagged as spam.

False Negative (FN): The model predicts the negative class but the actual class is positive. Example: a spam email is wrongly delivered to the inbox.
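
The four outcomes can be counted directly from lists of actual and predicted labels. The sketch below uses hypothetical labels in which 1 stands for the positive class and 0 for the negative class; the function name is an assumption for the example.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, and FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(a == 1 and p == 1 for a, p in pairs),
        "TN": sum(a == 0 and p == 0 for a, p in pairs),
        "FP": sum(a == 0 and p == 1 for a, p in pairs),
        "FN": sum(a == 1 and p == 0 for a, p in pairs),
    }


actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```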

61. What do you mean by the term Overfitting, and How can you avoid It?

Overfitting is a situation in which the model fits the training data too closely, including its noise, so that it performs well on the training set but gives low accuracy on unseen data.

To avoid this situation, we make use of:

Regularization
Making a simple model
Making use of cross-validation methods

62. What are the ‘training set’ and ‘test sets’? How much data will you allocate for your training, validation, and test sets?

The training set is the dataset on which you train your machine learning model. The test set is used to check whether the model performs well on data it has not seen before. A validation set, when used, is held out from the training data to tune hyperparameters and compare models.

Usually, we make a 70:30 split of the existing dataset into training and test datasets. For example, if we have 100 records, then 70 random records from the dataset will be used to train the model, while the remaining 30 records will be used to test the model. When a separate validation set is needed, a split such as 60:20:20 is also common.

63. What are the three stages of building a model in machine learning?

The three stages of building the machine learning model are:

Development
Testing
Deployment

64. What do you understand by the F1 score?

It is an evaluation metric for a classification model. It is the harmonic mean of precision (P) and recall (R), so it is high only when both are high: F1 = 2 * (P * R) / (P + R)

65. What do eigenvalues and eigenvectors mean in PCA?

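In Principal Component Analysis (PCA), the eigenvectors of the covariance matrix give the directions of the principal components, and each eigenvalue gives the amount of variance (information) explained along its eigenvector. Components with larger eigenvalues are therefore kept first.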

66. Define precision and recall.

Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)
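
As a quick worked example, the sketch below computes precision, recall, and the F1 score from hypothetical counts of 8 true positives, 2 false positives, and 8 false negatives. The function names are assumptions for the illustration.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


tp, fp, fn = 8, 2, 8  # hypothetical counts
p = precision(tp, fp)   # 8 / 10 = 0.8
r = recall(tp, fn)      # 8 / 16 = 0.5
print(p, r, f1_score(p, r))
```

Here the F1 score (about 0.62) sits closer to the lower of the two values, which is the point of using a harmonic mean.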

67. What do you mean by the term Kernel SVM?

Kernel methods are a class of algorithms used for pattern analysis. They implicitly map data into a higher-dimensional space, where a linear model can separate patterns that are not linearly separable in the original space. Kernel SVM is short for Kernel Support Vector Machine, one of the most common kernel methods, and it is used for solving both classification and regression problems.

68. What does the “minus” in cross-entropy mean?

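Cross-entropy sums the logarithms of predicted probabilities, and the logarithm of a probability (a value between 0 and 1) is never positive. The “minus” flips the sign so that cross-entropy becomes a non-negative loss function, where the higher the number, the worse the model is, while the lower the number, the better the model is. The goal is to minimize this loss during the training of a model to improve the model’s predictive accuracy.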
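
The effect of the sign can be seen in a minimal pure-Python sketch of cross-entropy for a single observation. The function name and the probability values are hypothetical; the true distribution is given as a one-hot list.

```python
import math


def cross_entropy(true_dist, predicted_probs):
    """Cross-entropy between a true distribution and predicted probabilities."""
    # log(p) <= 0 for 0 < p <= 1, so the leading minus makes the loss non-negative
    return -sum(t * math.log(p)
                for t, p in zip(true_dist, predicted_probs) if t > 0)


true_class = [0, 1, 0]  # the second class is the correct one
confident_and_right = [0.1, 0.8, 0.1]
confident_and_wrong = [0.8, 0.1, 0.1]

print(cross_entropy(true_class, confident_and_right))  # small loss
print(cross_entropy(true_class, confident_and_wrong))  # large loss
```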

69. What do L1 and L2 regularization mean and when would you use L1 vs. L2? Can you use them both?

L1 Regularization adds the absolute magnitude of the coefficients as a penalty to the loss function.

L2 Regularization adds the squared magnitude of the coefficients as a penalty to the loss function.

The choice between L1 and L2 depends on your modelling goal and the data at hand. L1 is used when you suspect that many features are irrelevant and you want a simple model with built-in feature selection, because it can shrink some coefficients to exactly zero. L2 is used when all features are relevant and you want to control the magnitudes of the weights to prevent them from becoming too large.

Yes, you can use both L1 and L2 together; the combination is known as Elastic Net. It can be a good choice when you want a trade-off between sparsity (from L1) and weight shrinkage (from L2).
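
The three penalties can be written out in a few lines. This is a sketch under one simple parameterisation, where lam is the regularisation strength and l1_ratio mixes the two penalties; libraries may scale the terms differently. All names and values here are hypothetical.

```python
def l1_penalty(weights, lam):
    """lam times the sum of absolute weights (Lasso-style penalty)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam times the sum of squared weights (Ridge-style penalty)."""
    return lam * sum(w * w for w in weights)


def elastic_net_penalty(weights, lam, l1_ratio):
    """Weighted mix of the L1 and L2 penalties."""
    return (l1_ratio * l1_penalty(weights, lam)
            + (1 - l1_ratio) * l2_penalty(weights, lam))


weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, lam=0.1))                         # 0.1 * 2.5
print(l2_penalty(weights, lam=0.1))                         # 0.1 * 4.25
print(elastic_net_penalty(weights, lam=0.1, l1_ratio=0.5))
```

Note how the large weight (-2.0) dominates the L2 penalty far more than the L1 penalty, which is why L2 mainly discourages large weights while L1 pushes small weights to zero.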

70. How is Adam Optimizer different from RMSprop?

Adam (short for Adaptive Moment Estimation) and RMSprop (Root Mean Square Propagation) are optimization algorithms used to train neural networks. The differences between them are:

Adam Optimizer	RMSprop Optimizer
For every parameter, Adam keeps track of two moving averages: the mean of the gradients (first moment) and the uncentered variance of the gradients (second moment).	RMSprop maintains only one moving average for each parameter: a running average of the squared gradients.
Adam blends the ideas of adaptive learning rates with momentum.	RMSprop adapts the learning rate for each parameter based on the magnitude of the recent gradients.
Adam performs bias correction for the moving averages.	RMSprop does not typically use bias correction.
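
The differences in the table can be sketched as single-parameter update steps. This is a simplified illustration using commonly quoted default hyperparameters; the function names and the dictionary used to hold optimizer state are assumptions for the example.

```python
import math


def rmsprop_step(w, grad, state, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: scale the step by a running average of squared gradients."""
    state["v"] = decay * state["v"] + (1 - decay) * grad ** 2
    return w - lr * grad / (math.sqrt(state["v"]) + eps)


def adam_step(w, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus adaptive scaling (v), both bias-corrected."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


# One step from w = 1.0 with gradient 2.0
w_rmsprop = rmsprop_step(1.0, 2.0, {"v": 0.0})
w_adam = adam_step(1.0, 2.0, {"m": 0.0, "v": 0.0, "t": 0})
print(w_rmsprop, w_adam)
```

On the first step, Adam's bias correction makes the update almost exactly equal to the learning rate, while RMSprop's uncorrected average produces a larger first step.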

71. What are the different types of activation functions and explain the vanishing gradient problem?

Activation functions are functions applied to the weighted sum of a neuron’s inputs and bias; the result decides whether, and how strongly, the neuron is activated.

There are multiple types of activation functions present, each with its characteristics. A few are listed below:

Sigmoid Function:

- Output values between 0 and 1.
- Commonly used in the output layer of binary classification models.

Hyperbolic Tangent Function (tanh):

- Output values between -1 and 1.
- Similar to the sigmoid, but it has a wider output range.

Rectified Linear Unit (ReLU):

- Outputs the input for positive values, zero otherwise.
- Simple and computationally efficient, commonly used in hidden layers.

Leaky ReLU:

- Similar to ReLU but allows a small, non-zero gradient for negative values (α is a small positive constant).

Parametric ReLU (PReLU):

- Similar to leaky ReLU, the negative slope (α) is learned during training.

Exponential Linear Unit (ELU):

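- Smooth for negative values, allowing for improved learning.

Vanishing gradient problem: During backpropagation, gradients are multiplied layer by layer. Saturating functions such as sigmoid and tanh have small derivatives (at most 0.25 for sigmoid), so in deep networks the gradients reaching the early layers can shrink towards zero, and those layers learn very slowly. ReLU and its variants reduce this problem because their gradient is 1 for positive inputs.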
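
The activation functions listed above, and the reason sigmoid contributes to vanishing gradients, can be sketched in pure Python. The default values of alpha and the ten-layer example are assumptions for the illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z


def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1)


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)


for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, f(-2.0), f(2.0))

# The sigmoid derivative peaks at 0.25, so even in the best case the
# gradient passed back through 10 sigmoid layers is scaled by 0.25 ** 10
print(sigmoid_derivative(0.0) ** 10)
```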

72. What is an activation function in machine learning?

In layman’s terms, an activation function decides whether a neuron should be activated or not. It helps the neural network pass on the important signals while suppressing the irrelevant ones.

More formally, an activation function is applied to the weighted sum of a neuron’s inputs and bias, and its output decides whether the neuron is activated. Because most activation functions are non-linear, they also allow the network to learn non-linear relationships.

How to Prepare for the Machine Learning Interview

Most companies conduct several rounds of interviews. Following are some of the interview rounds that you can expect:

On-call Assessment Round
Technical Assessment Round
Machine Learning Theory Round
Machine Learning System Design Round
Case Study Round
Behavioral Round

Machine Learning Salary Trends based on Experience in 2024

The average salary for an entry-level machine learning engineer is ₹12,32,000 per year in India and $152,360 per year in the United States. The average additional cash compensation for a machine learning engineer is ₹1,32,000 in India (range: ₹55,000 – ₹2,50,000) and $26,243 in the United States (range: $19,682 – $36,741).

Job Role	Experience	Salary Range
Machine Learning Engineer	0 – 2 years	₹8L – ₹13L /yr
Senior Machine Learning Engineer	2 – 4 years	₹14L – ₹17L /yr
Lead Machine Learning Engineer	5 – 7 years	₹14L – ₹37L /yr
Principal Machine Learning Engineer	8+ years	₹30L – ₹47L /yr

Machine Learning Trends in 2024

Global Demand: According to LinkedIn, there are currently more than 80,000 open positions for machine learning engineers in the United States.
Projected Growth: As per the Future of Jobs Report 2023, there is a very high demand for machine learning engineers, and it is expected to grow by 40%, or 1 million jobs per year, globally.
Regional Trends: According to LinkedIn, there are currently more than 18,000 open positions for machine learning engineers in India, and the hiring trend is set to increase by 8.3% in 2024.

Job Opportunities in Machine Learning

Multiple job roles in the industry require machine learning. Here are a few of them:

Job Role	Description
Machine Learning Engineer	They are responsible for designing, building, and developing machine learning models.
Machine Learning Developer	They are responsible for building and implementing machine learning models and algorithms into various applications.
Artificial Intelligence Engineer	They are responsible for building and developing AI models, which can include machine learning models, NLP-based models, and computer vision models.
Computer Vision Engineer	They specifically work in the field of computer vision, which involves interpreting visual data by the computer.
Research Scientist	They research and develop new machine-learning models and algorithms.

Roles and Responsibilities of a Machine Learning Engineer

A machine learning engineer is responsible for creating deep learning and machine learning models, implementing the right machine learning algorithms into practice, and conducting experiments and tests to check their accuracy for the given problem statement.

According to a job description posted by Siemens on LinkedIn:

Job Role: AI /ML Engineer

Responsibilities:

Knowledge of various data mining and machine learning techniques to extract valuable insights from large datasets and communicate the findings to stakeholders.
Should know how to evaluate the effectiveness of models and algorithms using statistical tests.
Should understand the basic deployment process using DevOps and any of the public cloud services (AWS, Azure, or GCP).

Technical Skills:

Strong programming skills in languages such as Python, R, or C++.
Should be familiar with libraries and concepts such as TensorFlow, PyTorch, Keras, Sklearn, Statsmodels, Pandas, NumPy, SciPy, OpenCV, PIL, SkImage, SQL, HQL, or similar.
Proficient in the use of computer vision tools
Excellent verbal, written, and presentation skills.

I hope this set of interview questions on Machine Learning will help you prepare for your interviews. Best of luck!
