20 Essential Machine Learning Interview Questions

Must-Know Questions for Data Science and ML Interviews

Computer scientist and ML expert Santiago Valdarrama (@svpino on Twitter) recently tweeted a list of 20 fundamental questions that you need to ace before getting a Machine Learning job, claiming:

“Almost every company will ask these to weed out non-prepared candidates. You don’t want to show up unless you are comfortable having a discussion about all of these.”

Santiago Valdarrama — @svpino

1. Explain the difference between Supervised and Unsupervised methods.

When we train machine learning models, we use data that is either labeled or unlabeled. In Supervised learning, the data we use to train the model is labeled.

  • Example: If we’re building a classifier to tell if an animal is a cat or a dog, we would train the model on a dataset of dog and cat images correctly tagged as such. Then we can get predictions on new unlabeled images! Supervised learning allows us to produce a data output from previous experience.

But when we train a machine learning model on unlabeled data, this is called Unsupervised learning. This allows the model to work on its own to discover new information about the dataset and can help us find unknown patterns in data.

  • Example: Returning to the previous dog and cat image classifier: if all of our dog/cat image data were unlabeled, we could use unsupervised learning to find similarities between the different images. We could use an unsupervised learning technique called clustering to find out which images are likely to be of dogs or cats!

Use of a ground truth (prior knowledge of what the output values for our samples should be, i.e. ‘labels’) is the largest difference between the two types of learning.

Unsupervised vs Supervised methods applied to data

Supervised:

  • Used on labeled data

  • Allows us to produce a data output from previous experience or examples

  • Most practical machine learning applications use supervised learning

Unsupervised:

  • Used mainly on unlabeled data

  • Allows us to learn the inherent structure of data without providing labels

  • Finds unknown discoveries and patterns about data
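
To make the contrast concrete, here is a minimal sketch (assuming scikit-learn and NumPy are installed, with made-up 2-D points) of the same data handled both ways: a supervised classifier learns from labels, while K-means clustering discovers the groups without them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy 2-D points forming two obvious groups (hypothetical data).
X = np.array([[1.0, 1.2], [0.9, 1.0], [1.1, 0.8],
              [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, e.g. 0 = cat, 1 = dog

# Supervised: learn from (X, y) pairs, then predict labels for new points.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.0, 1.1], [5.1, 5.0]]))  # -> [0 1]

# Unsupervised: only X is given; K-means groups the points on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster ids, e.g. [0 0 0 1 1 1] (ids are arbitrary)
```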

2. What’s your favorite algorithm? Can you explain how it works?

My favorite machine learning algorithm is Naïve Bayes!

In probability theory and statistics, Bayes’ theorem (alternatively Bayes’s theorem, Bayes’s law, or Bayes’s rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event.

For example, if the risk of developing health problems is known to increase with age, Bayes’s theorem allows the risk to an individual of a known age to be assessed more accurately than simply assuming that the individual is typical of the population as a whole.

A Naïve Bayes Classifier is a probabilistic classifier that uses Bayes’ theorem with strong independence (naïve) assumptions between features.

  • Probabilistic classifier: a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to.

  • Independence: Two events are independent if the occurrence of one does not affect the probability of occurrence of the other (equivalently, does not affect the odds). That assumption of independence between features is what makes Naïve Bayes naive! In the real world, the independence assumption is often violated, but Naïve Bayes classifiers still tend to perform very well.

For a deeper dive into Naïve Bayes check out my blog post on how to build an email spam filter from scratch with multinomial Naïve Bayes!
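
As a hedged illustration of the idea (not the exact spam filter from the blog post above), here is a rough multinomial Naïve Bayes sketch using scikit-learn on a handful of made-up messages:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, hypothetical training set: 1 = spam, 0 = not spam.
messages = ["win a free prize now", "free money click now",
            "meeting at noon tomorrow", "lunch with the team today"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # word-count features

model = MultinomialNB().fit(X, labels)   # learns P(word | class) under the naive assumption
test = vectorizer.transform(["free prize tomorrow"])
print(model.predict(test))               # predicted class
print(model.predict_proba(test))         # probability distribution over the classes
```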

3. Given a specific dataset, how do you decide which is the best algorithm to use?

In ML and data science there is no one-size-fits-all algorithm. The answer depends on a myriad of factors like the number of features in the data, the kind of output you want, size of the dataset, available computation time/resources, and many others.

  • The type of problem:
    Input: Is the input data labeled? If so, it’s a supervised learning problem. If it’s unlabeled data with the purpose of finding structure, it’s an unsupervised learning problem. If the solution involves optimizing an objective function by interacting with an environment, it’s a reinforcement learning problem.
    Output: What should the model output be? If it’s a number, that would be a regression problem. (Linear, lasso, SVM regression, etc.) If the output is a class, then it would be a classification problem. (Unless the output is a set of input groups. Then it would be a clustering problem.)
    After categorizing the problem and understanding the data, the next milestone is identifying the algorithms that are applicable and practical to implement in a reasonable time. Some of the elements affecting the choice of a model are:

  • The size of the training dataset. Is the training dataset small (fewer observations and a higher number of features)? If so, algorithms with high bias and low variance like linear regression, Naïve Bayes, or a linear SVM would be preferable.

  • The accuracy of the model vs. the interpretability of the model.

  • The complexity/implementability of the model. Do we have the time and computational resources to train the model? Are the gains in accuracy high enough to justify the costs and engineering effort needed to bring them into a production environment?

  • The scalability of the model. Will it need to scale, vertically or horizontally, as the data grows?

  • Does the model meet the business goal?

4. When should you use classification over regression?

Like the question above about algorithm selection, the choice between classification and regression depends on the available data, problem statement, and expected output.

Classification: Hotdog or not hotdog?

Regression:

  • If your expected output is a real or continuous value.

  • Example: Predicting the increase or decrease in value of apartment buildings over time.

Classification:

  • If your expected outcome is a discrete or categorical value.

  • Used to predict class membership. (i.e. hotdog or not hotdog, dog or cat)

  • Example: Predict whether or not a user is expected to purchase something when they visit your website or online store. (Classes: likely conversion, possible conversion, unlikely conversion)
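
A small sketch of the distinction, assuming scikit-learn and entirely made-up numbers: the same one-feature input feeds a regression model (continuous output) and a classification model (discrete output).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])  # hypothetical single feature

# Regression: predict a continuous value (e.g. a property price in $100k).
prices = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.2])
print(LinearRegression().fit(X, prices).predict([[7]]))       # a real-valued estimate

# Classification: predict a discrete class (e.g. 0 = no purchase, 1 = purchase).
purchased = np.array([0, 0, 0, 1, 1, 1])
print(LogisticRegression().fit(X, purchased).predict([[7]]))  # -> [1]
```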

5. Can you explain how Logistic Regression works?

Logistic Regression is a machine learning algorithm used for classification problems. It is a predictive analysis algorithm based on the concept of probability, and it is used to assign observations to a discrete set of classes.

Example: Do workers’ education levels and time on the job affect promotions? The independent variables would be education levels and time on the job, and the levels of the dependent variable might be promotion to team-leader roles, sales positions, or management positions.

Logistic regression transforms its output using the logistic sigmoid function to return a probability value. The sigmoid function (as depicted below) maps any real-valued input to a value between 0 and 1, so the hypothesis of logistic regression is always limited to that range. We can then use a decision boundary to classify data points based on the probability that they belong to a certain class. (Example: If Cats = 0 and Dogs = 1, then any predicted value greater than 0.5 would be classified as a dog.)

Sigmoid function with decision boundary of 0.5
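
A bare-bones NumPy sketch of that idea, with hypothetical weights (w and b are made up here, not learned): the sigmoid squashes a linear score into a probability, and the 0.5 boundary turns it into a class.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 1.5, -4.0                # hypothetical learned weight and bias
x = np.array([1.0, 2.5, 4.0])   # feature values for three animals

probabilities = sigmoid(w * x + b)                # each value lies between 0 and 1
predictions = (probabilities >= 0.5).astype(int)  # decision boundary at 0.5

print(probabilities)            # e.g. [0.08 0.44 0.88]
print(predictions)              # -> [0 0 1]  (0 = cat, 1 = dog)
```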

Bonus: What is Gradient Descent and why is it important in logistic regression?

6. What are the advantages and disadvantages of decision trees?

Advantages:

  • Easily understandable and explainable to stakeholders

  • Doesn’t require the data to be normalized or scaled

  • No need to impute missing data because null values don’t affect the tree-building process (in many implementations)

  • Requires less data preprocessing than other algorithms (Good baseline)

Disadvantages:

  • Prone to overfitting the data, causing incorrect predictions

  • Noise — Does not work well if you have too many uncorrelated variables

  • High variance — small changes early in the tree can have a large impact on the outcome

Decision Tree about taking a new job

Bonus: What is a random forest? When should you use it over a decision tree?
Hint: Does size (of the dataset) matter?
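
As a rough answer to the bonus, here is a quick sketch (scikit-learn assumed, using its built-in iris dataset) comparing a single decision tree to a random forest, which averages many trees to tame the variance listed above:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Single tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
```

On a tiny, clean dataset like iris the two often tie; the forest’s edge usually shows up on larger, noisier data.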

7. Can you compare K-means with KNN?

“The ‘K’ in K-Means Clustering has nothing to do with the ‘K’ in KNN algorithm”

K-Means Clustering:

  • Used for clustering (K = number of clusters)

  • Unsupervised learning algorithm

  • Takes unlabeled data points and groups them into “k” number of clusters

  • Often uses the elbow method to choose “k” and recalculates cluster centroids until they converge (typically to a local optimum)

K-Nearest Neighbor (KNN):

  • Used for classification (K = number of neighbors)

  • Supervised learning algorithm

  • Takes labeled data points and uses them to learn how to label other points

  • To label a new point, it looks at its “nearest neighbors” (the k labeled points closest to the new point)

  • Neighbors vote on how to label a new point
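
A short side-by-side sketch (scikit-learn assumed, toy points made up) of the two “K”s: K-means groups unlabeled points into k clusters, while KNN labels a new point by a vote among its k nearest labeled neighbors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])

# K-means: no labels given; k = number of clusters to find.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                 # discovered cluster ids

# KNN: labels given; k = number of neighbors that vote.
y = np.array([0, 0, 0, 1, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))  # -> [0 1]
```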

8. How much data would you allocate for your training, validation, and test sets?

Train / Test / Validation Split

There is no exact percentage for how you should allocate your data, but a convention in machine learning is to use an 80/20 or 70/30 train/test split. After the initial split, the training set can be further split to create a validation set. Again, this is a general rule and a great starting point, but the best way to determine how to allocate your data is to experiment with different split sizes.
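
A minimal sketch of one common allocation (scikit-learn assumed, using its iris dataset): an 80/20 train/test split, with the training portion split again to carve out a validation set, giving roughly 60/20/20 overall.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80% train / 20% test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the training data again: 25% of it becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # -> 90 30 30
```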

9. Can you explain the “Curse of Dimensionality”?

This scary term refers to the difficulty of using brute force (grid search) to optimize a function with too many input variables. In plain English, this means that when our data has too many features (columns) compared to the number of observations (rows), we risk overfitting our model, resulting in false and unreliable predictions. If there are a large number of features (compared to the observations) it also becomes harder to make meaningful clusters, because too many dimensions cause every observation to appear roughly equidistant from every other data point. Luckily there are some techniques to reduce this and we’ll cover those in the next question. (A tiny numerical demonstration follows the figure below.)

Dimensionality Reduction Visualized
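
Here is that NumPy-only demonstration of the “everything looks equidistant” effect, using random points (the exact numbers will vary, but the trend holds): the ratio between the nearest and farthest distance creeps toward 1 as the number of dimensions grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 1000):
    X = rng.random((200, d))                       # 200 random points in d dimensions
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point to all others
    print(d, round(dists.min() / dists.max(), 2))  # ratio approaches 1 as d grows
```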

10. What are some methods to reduce dimensionality?

There are many ways to reduce dimensionality, ranging from intuitive feature selection to linear, non-linear, and auto-encoder methods. Here are some of the most popular.

Feature Engineering / Selection:

  • If the need for dimensionality reduction comes from having too many features, let’s get rid of some! We can use heatmaps, visualizations, or even domain knowledge to find which features are contributing to the accuracy of the model and which features are not.

  • We can also combine different features or create entirely new features based on some insight about the data to reduce the number of features but preserve their impact on model accuracy.

Principal Component Analysis (PCA):

  • Another way to find the most important structure in your crowded dataset is PCA. Used on continuous data, this method projects the data onto the directions of greatest variance; those directions are the ‘principal’ components. We can keep just the first few components to reduce dimensionality while preserving most of the information in the data. (See the sketch after this list.)

Auto-encoders:

  • An unsupervised neural network that compresses data down to a lower dimension and then reconstructs it from the most important features. This gets rid of noise and redundancy in the data. Auto-encoders can be linear or non-linear depending on the activation function.
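
Here is the PCA sketch promised above (scikit-learn assumed, using its iris dataset): four features are compressed down to two principal components while keeping most of the variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2): 4 features reduced to 2 components
print(pca.explained_variance_ratio_)    # share of the variance kept by each component
```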

11. How would you handle an imbalanced dataset?

Evaluation Metric:

  • In many cases as a machine learning engineer you’ll have to deal with imbalanced data. In anomaly detection (used for credit card fraud, geological events, etc.), it is unlikely that more than 1% of the data will be classified as an anomaly. You could classify every instance as non-anomalous and get an accuracy of 99%, but that wouldn’t be good enough in this case. Instead, we can use a confusion matrix to calculate precision, recall, and F1 scores to get a better idea of how our model performs on imbalanced data.

Algorithm:

  • We can experiment with the type of algorithm that we are using for our model, as different algorithms perform better on different types of problems. (i.e. Random Forest instead of Decision Tree)

Resampling — Oversampling and Undersampling:

  • Undersampling: When there is a sufficient amount of data, this is used to balance the dataset by reducing the size of the abundant class. By keeping all samples in the rare class and randomly selecting an equal number of samples in the abundant class, a balanced new dataset can be retrieved for further modelling.

  • Oversampling: When there is an insufficient amount of data, this method is used to balance the dataset by increasing the number of rare samples. Rather than getting rid of abundant samples, new rare samples are generated.

We could also use K-fold Cross Validation, resample the data with different split ratios, cluster the abundant class, and many other methods!
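
A rough sketch of random oversampling (scikit-learn’s resample utility assumed, with a made-up 18-vs-2 class split): the rare class is sampled with replacement until the two classes are balanced.

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(-1, 1)   # hypothetical feature values
y = np.array([0] * 18 + [1] * 2)   # heavily imbalanced labels: 18 vs 2

X_rare, y_rare = X[y == 1], y[y == 1]
X_abundant, y_abundant = X[y == 0], y[y == 0]

# Draw rare-class samples with replacement up to the abundant class size.
X_up, y_up = resample(X_rare, y_rare, replace=True,
                      n_samples=len(y_abundant), random_state=0)

X_balanced = np.vstack([X_abundant, X_up])
y_balanced = np.concatenate([y_abundant, y_up])
print(np.bincount(y_balanced))     # -> [18 18]
```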

12. Can you explain the trade-off between bias and variance?

  • The goal is to get the algorithm to generalize, but not oversimplify

Bias:

  • Can cause a model to miss relevant or important relationships between features and its target output. Algorithms with high bias error tend to be underfit.

Variance:

  • How sensitive a machine learning model is to small changes in the training data. Models with high variance tend to focus on the random noise in training data rather than important relationships between features resulting in overfitting.

Tradeoff:

  • One of the biggest problems in supervised learning, the bias-variance tradeoff aims to choose a model optimized for accurately capturing regularities in its training data, but also generalizing well on unseen data. Sadly, it is typically impossible to do both at the same time.

  • High bias, low variance — Consistent but inaccurate.

  • High variance, low bias — Accurate but inconsistent.
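
A small NumPy illustration of the trade-off, fitting polynomials of different degrees to noisy samples of a sine curve (all numbers here are synthetic, and exact errors will vary with the random seed): the low-degree fit is poor everywhere (high bias), while the high-degree fit looks great on the training points but much worse on fresh data (high variance).

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sine(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_train, y_train = noisy_sine(20)   # small training sample
x_test, y_test = noisy_sine(200)    # fresh data from the same process

for degree in (1, 4, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```

Typically the degree-1 fit has high error on both sets (underfitting), degree 4 does well on both, and degree 10 has a tiny training error but a noticeably larger test error (overfitting).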

13. Can you define and explain the differences between precision and recall?

Precision:

  • Classification evaluation method with the goal of answering:
    “What proportion of positive predictions are actually correct?”

  • Example: Imagine a case where you are asked to build an email spam filter. It’s not a big deal if we accidentally classify an advertisement (spam) as a genuine email. It IS a big deal if a new job offer from your dream company is classified as spam. In this case we want to focus on precision and maximize the ratio of true positives to total predicted positives.

Recall:

  • Classification evaluation method with the goal of answering:
    “What proportion of actual positives are correctly predicted?”

  • Example: Imagine a case where you are asked to develop a predictive model to classify people as positive or negative for cancer. We REALLY don’t want people with cancer being given a false negative, because it’s possible they will go even longer without treatment. But there’s not nearly as much downside in telling a healthy person that they might have cancer, since follow-up testing can catch the mistake. In this case we want to focus on recall and minimize false negatives.
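
A tiny sketch of the two ratios computed straight from hypothetical prediction counts for a spam filter (all numbers invented for illustration):

```python
true_positives = 40    # spam correctly flagged as spam
false_positives = 10   # genuine email wrongly flagged as spam
false_negatives = 5    # spam that slipped through as genuine email

precision = true_positives / (true_positives + false_positives)  # 40 / 50 = 0.8
recall = true_positives / (true_positives + false_negatives)     # 40 / 45 ≈ 0.89
print(precision, recall)
```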

14. How do you define the F1 score and why is it useful?

F1 is the harmonic mean (one of the Pythagorean means, appropriate when averaging rates) of precision and recall. It is typically used as a best practice when there is no specific reason to highly value either precision or recall (like in the examples in the previous question). Typically calculated from a confusion matrix, its formula is below:

F1 = 2TP / (2TP + FP + FN)

Where:
TP = True positive
FP = False positive
FN = False negative
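
Continuing the hypothetical counts from the previous question, a quick check that the rate-based and count-based forms of F1 agree:

```python
precision = 40 / (40 + 10)                   # 0.8
recall = 40 / (40 + 5)                       # ≈ 0.889

f1_from_rates = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * 40 / (2 * 40 + 10 + 5)  # 2TP / (2TP + FP + FN)
print(round(f1_from_rates, 3), round(f1_from_counts, 3))  # both ≈ 0.842
```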

15. How do you ensure you’re not overfitting? Can you explain some techniques to reduce overfitting?

After splitting our dataset into train and testing sets, if our model does a better job on the training set than the testing set, it is likely overfit. We can take steps to reduce this:

  • Cross-validation: This could be as simple as using a train, test, validation split on your data, or something more complex like K-fold cross-validation (where the data is split into K sections, or folds, and each fold takes a turn as the testing set).

  • Early Stopping: Each training epoch gives the model more opportunities to fit the data, but after a while it begins to overfit the training set. We can monitor performance on a validation dataset and stop training as soon as the validation performance starts to decrease (compared to prior epochs), even if the training loss is still improving.

  • Regularization: Refers to adding a penalty on the model’s parameters that shrinks the coefficients closer to zero. This method stops the model from getting overly complicated, allowing it to generalize better.

  • Weight Constraints: Checks the weights of a network and, if their size exceeds a certain limit, rescales them to be back below the limit (or within the range). This prevents single features from dominating the model.

  • Dropout: Essentially “drops out” randomly selected neurons in a neural network during the training process, leading to significantly lower generalization error rates. (A rough sketch combining dropout and early stopping follows the figure below.)

Neural Network with Dropout Layer
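
Here is the sketch mentioned above: a rough Keras example (TensorFlow assumed; the layer sizes, toy data, and patience value are placeholders, not recommendations) combining a Dropout layer with an EarlyStopping callback that watches the validation loss.

```python
import numpy as np
import tensorflow as tf

# Toy, synthetic binary-classification data.
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # randomly drops half the units each training step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once the validation loss has not improved for 5 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100, callbacks=[early_stop], verbose=0)
```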

16. Can you explain what is cross-validation and how is it useful?

Cross-validation is a resampling method used to evaluate how well a machine learning model fits a limited sample of data that is independent of the data we used to train the model (i.e. holding out data from the training set to test the model on later). This is especially useful when we are training a model with a limited data set. There are many forms of cross-validation including exhaustive, non-exhaustive, and nested methods.
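
A short sketch of K-fold cross-validation with scikit-learn (iris data used as a stand-in): the model is trained and scored five times, each time holding out a different fifth of the data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # five held-out accuracy scores
print(scores.mean())   # their average is the cross-validated estimate
```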

17. Can you explain the difference between L1 and L2 regularization?

A regression model that uses L1 regularization is called Lasso Regression.
A model that uses L2 is called Ridge Regression.

The main difference is that Lasso Regression (L1) adds the absolute value of the coefficients as the penalty term, which can shrink the coefficients of less important features all the way to zero, removing some features entirely (helping with feature selection). Ridge Regression (L2) adds the “squared magnitude” of the coefficients as the penalty term, which shrinks coefficients but rarely makes them exactly zero.

Lasso vs. Ridge Regression
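
A compact sketch of the practical difference (scikit-learn assumed, on synthetic data where only two of ten features matter): Lasso tends to drive the irrelevant coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter; the other eight are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("Lasso zeroed coefficients:", int(np.sum(lasso.coef_ == 0)))  # usually most of the 8 noise features
print("Ridge zeroed coefficients:", int(np.sum(ridge.coef_ == 0)))  # usually 0: shrunk, not eliminated
```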

18. What is the ROC Curve?

The ROC (receiver operating characteristic) curve is a graph that shows the performance of a binary classification model at every classification threshold. AUC-ROC (area under the ROC curve), also written AUROC, represents the degree or measure of separability: it tells how capable the model is of distinguishing between the classes. A higher AUC means the model is better at predicting each observation as its own class. (i.e. 0 as 0 and 1 as 1)
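
A brief sketch (scikit-learn assumed, using its breast-cancer dataset as a stand-in for any binary problem): the ROC curve sweeps the decision threshold over the predicted probabilities, and the AUC summarizes it in one number.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)  # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_test, probs))      # close to 1.0 for a well-separated problem
```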

19. What is a Confusion Matrix and how is it useful?

A confusion matrix is a table used to represent the performance of a classification model where the output can be two or more classes. It contains the true positives, false positives, true negatives, and false negatives. Using a confusion matrix can help with calculating precision, recall, F1, and AUC-ROC, as discussed in earlier questions.
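
A tiny sketch with scikit-learn and hypothetical labels, showing the 2x2 layout the metrics above are read from:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predicted labels

print(confusion_matrix(y_true, y_pred))
# [[3 1]   rows = actual class, columns = predicted class
#  [1 3]]  e.g. one actual positive was predicted negative (a false negative)
```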

20. Which is more important: model accuracy or model performance?

Model accuracy is the more important of the two. Once a model is deployed in production, the quality of its output is what matters most, and retraining happens far less often than scoring the outputs.

As for performance, this depends on what we mean by “performance”. If it is model training performance, we can upgrade our hardware or use distributed computing and parallelization to speed up training time. If we’re referring to model scoring performance, then it would depend on the type of data we are using.
