Here are 100 machine learning algorithms interview questions and answers, covering fundamentals, supervised and unsupervised learning, ensemble methods, deep learning basics, and scenario-based problem solving.
Fundamentals & Core Concepts
1. What is the difference between supervised, unsupervised, and reinforcement learning?
Supervised learning uses labeled data to learn a mapping from inputs to outputs. Unsupervised learning finds hidden patterns in unlabeled data. Reinforcement learning learns to make decisions by interacting with an environment, receiving rewards or penalties.
2. What is the bias-variance tradeoff?
Bias is error from overly simplistic models (underfitting). Variance is error from excessive sensitivity to training data (overfitting). The tradeoff: decreasing bias increases variance, and the goal is to find the sweet spot minimizing total error.
3. What is overfitting and how can you prevent it?
When a model learns the noise and details of the training data so well that it performs poorly on new data. Prevent it by: cross-validation, regularization (L1/L2), pruning (trees), early stopping, dropout (neural networks), and using more training data.
4. What is cross-validation? Explain k-fold cross-validation.
A resampling technique to evaluate model performance on unseen data. In k-fold, data is split into k equal folds; the model trains on k-1 folds and validates on the remaining fold, repeated k times. Results are averaged.
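A minimal sketch of 5-fold cross-validation with scikit-learn; the dataset and model here are purely illustrative choices.

```python
# 5-fold cross-validation: each fold serves once as the validation set.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # one score per fold
print(scores.mean(), scores.std())            # averaged estimate of performance
```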
5. What are the different types of data in machine learning?
Numerical (continuous, discrete), Categorical (nominal, ordinal), Text, Image, Time-series, and Sequential data. Preprocessing differs for each.
6. What is the difference between parametric and non-parametric models?
Parametric models assume a fixed number of parameters and a specific functional form (e.g., linear regression). Non-parametric models do not make strong assumptions about the function form; they grow in complexity with data (e.g., k-NN, decision trees).
7. What is the curse of dimensionality?
As the number of features increases, the volume of the space increases so fast that data becomes sparse, making distance measures meaningless. It degrades model performance and requires more data or dimensionality reduction.
8. Explain the difference between generative and discriminative models.
Generative models learn the joint probability distribution P(X, Y) and can generate new samples (e.g., Naive Bayes, GANs). Discriminative models learn the decision boundary P(Y|X) directly (e.g., logistic regression, SVM).
9. What is the No Free Lunch theorem?
No single algorithm works best for every problem. Performance depends on the problem structure, data distribution, and evaluation metric. You must try multiple models.
10. What is feature scaling and when is it necessary?
Normalizing or standardizing features to a common scale. Essential for distance-based algorithms (k-NN, SVM, k-means) and gradient descent-based algorithms (neural networks, linear/logistic regression).
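A minimal sketch of standardization (zero mean, unit variance) with scikit-learn; the key point is to fit the scaler on the training data only, then reuse its statistics on the test data to avoid leakage.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # apply the same statistics to test data
```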
Supervised Learning – Regression
11. What is linear regression?
A linear approach to modeling the relationship between a dependent variable and one or more independent variables. Assumes linearity, independence, homoscedasticity, and normality of errors.
12. What are the assumptions of linear regression?
Linearity, independence of errors, homoscedasticity (constant variance of errors), normality of error distribution, and no multicollinearity among predictors.
13. What is the difference between simple and multiple linear regression?
Simple regression has one independent variable; multiple has two or more.
14. What is the cost function used in linear regression?
Mean Squared Error (MSE) or Sum of Squared Errors (SSE). The goal is to minimize this by adjusting coefficients.
15. How do you interpret the coefficients in linear regression?
Each coefficient is the expected change in the target variable for a one-unit increase in the corresponding predictor, holding all other predictors constant.
16. What is R-squared and adjusted R-squared?
R-squared measures the proportion of variance in the target explained by the model. Adjusted R-squared penalizes adding unnecessary predictors, preventing artificial inflation.
17. What is multicollinearity? How do you detect and fix it?
When independent variables are highly correlated. Detect via Variance Inflation Factor (VIF). Fix by removing variables, combining them (PCA), or using regularization.
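A minimal sketch of computing VIF per feature with statsmodels, assuming statsmodels is installed; the synthetic data (with x2 nearly collinear with x1) is illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=100)   # nearly collinear with x1
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": rng.normal(size=100)})

X_const = sm.add_constant(X)                        # add intercept column
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif)  # a common rule of thumb flags VIF above roughly 5-10 as problematic
```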
18. What is polynomial regression?
Extends linear regression by adding polynomial terms of the predictors (e.g., x², x³) to model nonlinear relationships, while still being linear in parameters.
19. What is ridge regression (L2 regularization)?
Linear regression with L2 penalty: adds sum of squared coefficients to the cost function. It shrinks coefficients toward zero but never to zero. Reduces overfitting and handles multicollinearity.
20. What is lasso regression (L1 regularization)?
Linear regression with L1 penalty: adds sum of absolute values of coefficients. It can shrink some coefficients to exactly zero, performing feature selection.
21. What is elastic net regression?
Combines L1 and L2 penalties to get the benefits of both: feature selection from L1, and stable shrinkage from L2. Controlled by mixing parameter.
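A minimal sketch comparing ridge, lasso, and elastic net in scikit-learn on synthetic data; the alpha and l1_ratio values are illustrative, not tuned.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)   # l1_ratio mixes L1 and L2

# Lasso and elastic net can zero out coefficients; ridge only shrinks them.
print((lasso.coef_ == 0).sum(), (enet.coef_ == 0).sum(), (ridge.coef_ == 0).sum())
```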
22. How do you choose between ridge and lasso?
If you suspect only a few features are important, use lasso for automatic feature selection. If you think all features contribute somewhat, use ridge.
23. What is the difference between gradient descent and ordinary least squares (OLS)?
OLS has a closed-form analytical solution (for linear regression). Gradient descent is an iterative optimization algorithm used when closed-form is infeasible (large datasets, many features).
24. What is stochastic gradient descent (SGD)?
A variant of gradient descent that updates parameters using a single random sample or mini-batch instead of the entire dataset, making it faster and scalable but noisier.
25. Explain the concept of regularization. Why is it used?
Adding a penalty term to the cost function to constrain model complexity and prevent overfitting. Types: L1 (lasso), L2 (ridge), Elastic Net.
Supervised Learning – Classification
26. What is logistic regression?
A classification algorithm that estimates the probability of an instance belonging to a class using the sigmoid function. Despite its name, it is used for classification, most commonly binary classification.
27. What is the sigmoid function?
σ(z) = 1 / (1 + e⁻ᶻ). It maps any real value to (0,1), representing a probability.
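A minimal sketch of the sigmoid in NumPy; the inputs are illustrative and simply show how real values map into (0, 1).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5: the decision boundary for logistic regression
print(sigmoid(4))    # ~0.982: large positive scores map near 1
print(sigmoid(-4))   # ~0.018: large negative scores map near 0
```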
28. What is the cost function for logistic regression?
Log loss (cross-entropy loss): it heavily penalizes confident incorrect predictions. It’s convex, allowing gradient descent to find the global minimum.
29. What is a decision tree?
A tree-structured model that splits data on features to make predictions. Internal nodes test features, branches represent outcomes, leaves are predictions. Interpretable but prone to overfitting.
30. How does a decision tree choose where to split?
Using impurity measures: Gini impurity or entropy (classification) and variance reduction (regression). The split that maximizes information gain (or minimizes impurity) is selected.
31. What is information gain and entropy?
Entropy measures uncertainty in a dataset. Information gain is the reduction in entropy after a split. The tree picks the split with the highest information gain.
32. What is Gini impurity?
A measure of how often a randomly chosen element would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the node. Lower is better.
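A minimal sketch of Gini impurity, computed directly from the formula 1 − Σ p_k² over the class proportions p_k; the label vectors are illustrative.

```python
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))   # 0.0: a pure node
print(gini_impurity([0, 0, 1, 1]))   # 0.5: maximally mixed for two classes
```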
33. What is pruning in decision trees?
Reducing the size of a tree by removing branches that have little predictive power. Pre-pruning (stopping growth early) and post-pruning (removing branches after growing the full tree) both prevent overfitting.
34. What is random forest?
An ensemble of decision trees trained on random subsets of data and features (bagging). Predictions are aggregated (majority vote or average). Reduces overfitting and improves accuracy.
35. What is the difference between bagging and boosting?
Bagging trains models in parallel on bootstrapped samples (e.g., Random Forest) to reduce variance. Boosting trains models sequentially, each correcting the previous ones (e.g., AdaBoost, XGBoost), typically reducing bias.
36. What is AdaBoost?
An adaptive boosting algorithm that sequentially trains weak learners, giving higher weight to misclassified instances so subsequent learners focus on harder examples.
37. What is Gradient Boosting?
Boosting that trains trees sequentially, each new tree fit to the negative gradient (pseudo-residuals) of the loss function of the current ensemble.
38. What is XGBoost and why is it popular?
An optimized, scalable implementation of gradient boosting with regularization, tree pruning, and handling missing values. It’s fast and often wins competitions.
39. What is Support Vector Machine (SVM)?
A classifier that finds the optimal hyperplane separating classes with maximum margin. Can use kernel tricks for non-linear data.
40. What is the kernel trick in SVM?
Mapping data into a higher-dimensional space where it’s linearly separable without explicitly computing the transformation, using kernel functions (RBF, polynomial).
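A minimal sketch of an RBF-kernel SVM on data that is not linearly separable; the make_moons dataset and hyperparameter values are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# gamma controls the width of the RBF kernel; C trades margin size against errors
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```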
41. What is k-Nearest Neighbors (k-NN)?
A lazy, non-parametric algorithm that classifies a point by the majority class among its k nearest neighbors in the feature space. No training phase; all computation at prediction.
42. How do you choose the value of k in k-NN?
Using cross-validation. A small k can lead to high variance (noisy), a large k can over-smooth and increase bias. Rule of thumb: k = sqrt(n).
43. What is Naive Bayes?
A probabilistic classifier based on Bayes’ theorem with the “naive” assumption of feature independence given the class. Fast and works well with text data.
44. What are the variants of Naive Bayes?
Gaussian (continuous data), Multinomial (discrete counts, e.g., word frequencies), Bernoulli (binary features).
45. Why is Naive Bayes “naive”?
Because it assumes all features are conditionally independent given the class, which is rarely true in real-world data, yet it often works well.
Unsupervised Learning – Clustering
46. What is clustering?
Grouping a set of objects so that objects in the same cluster are more similar to each other than to those in other clusters. No labels provided.
47. What is k-means clustering?
Partitions data into k clusters by iteratively assigning points to the nearest centroid and updating centroids. Sensitive to initial centroids and requires k pre-specified.
48. How do you select the optimal number of clusters in k-means?
Elbow method (plot within-cluster sum of squares vs. k), silhouette score, gap statistic.
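A minimal sketch of scanning k and comparing inertia (for the elbow method) and the silhouette score; the blob data is synthetic and illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_, silhouette_score(X, km.labels_))
# Look for the "elbow" in inertia and the k with the highest silhouette score.
```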
49. What are the limitations of k-means?
Assumes spherical clusters, sensitive to outliers and scale, requires k pre-defined, may converge to local optima depending on initialization.
50. What is hierarchical clustering?
Creates a tree of clusters (dendrogram), either agglomerative (bottom-up, merging clusters) or divisive (top-down, splitting clusters). No need to pre-specify k.
51. What is DBSCAN?
Density-Based Spatial Clustering of Applications with Noise. Groups points that are closely packed and marks low-density points as outliers. Doesn’t need k, handles arbitrary shapes.
52. What are the parameters of DBSCAN?
Epsilon (ε): maximum distance between two points to be considered neighbors. MinPts: minimum points to form a dense region.
53. What is the difference between k-means and DBSCAN?
k-means requires specifying k, assumes spherical clusters, can’t handle noise well. DBSCAN automatically finds number of clusters, handles arbitrary shapes, and identifies noise.
54. What is a Gaussian Mixture Model (GMM)?
A probabilistic model that assumes data is generated from a mixture of several Gaussian distributions. Provides soft clustering (probability of belonging to each cluster).
55. What is the expectation-maximization (EM) algorithm?
Iterative method used by GMM to find maximum likelihood estimates: E-step computes responsibilities (probabilities of cluster assignments), M-step updates parameters.
Dimensionality Reduction & Association
56. What is dimensionality reduction and why use it?
Reducing the number of features while preserving important information. Benefits: reduces computational cost, mitigates curse of dimensionality, removes noise, aids visualization.
57. What is Principal Component Analysis (PCA)?
A linear technique transforming data to a new coordinate system where the greatest variance lies along the first principal component, the second greatest along the second, and so on. Uncovers latent structure.
58. How do you choose the number of principal components?
By looking at the explained variance ratio cumulatively; keep enough components to explain a desired percentage (e.g., 95%) of total variance.
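A minimal sketch of picking the number of components from the cumulative explained variance ratio; the digits dataset and 95% threshold are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumvar >= 0.95) + 1   # smallest k explaining >= 95% of variance
print(n_components)

# Alternatively, PCA(n_components=0.95) selects this number automatically.
```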
59. What is the difference between PCA and t-SNE?
PCA is linear, deterministic, and focuses on preserving global structure. t-SNE is non-linear, stochastic, and excels at preserving local structure for visualization (2D/3D), but not for feature generation for downstream models.
60. What is an autoencoder?
A neural network that learns to copy its input to its output with a bottleneck. Used for dimensionality reduction and feature learning. Encoder compresses, decoder reconstructs.
61. What is the Apriori algorithm?
An algorithm for frequent itemset mining and association rule learning. It identifies items frequently purchased together and generates rules like {bread} -> {butter} with support, confidence, and lift.
62. What are support, confidence, and lift in association rules?
Support: frequency of itemset in all transactions. Confidence: P(Y|X) = support(X∪Y)/support(X). Lift: how much more likely Y is purchased when X is purchased compared to baseline.
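A minimal sketch computing support, confidence, and lift for the rule {bread} -> {butter} from a toy transaction list in plain Python; the transactions are made up for illustration.

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

support_bread = sum("bread" in t for t in transactions) / n
support_butter = sum("butter" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_both / support_bread   # P(butter | bread)
lift = confidence / support_butter          # > 1 indicates a positive association
print(support_both, confidence, lift)
```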
63. What is the Eclat algorithm?
Similar to Apriori but uses a depth-first search and vertical data format to find frequent itemsets more efficiently.
Ensemble Methods & Model Selection
64. What is an ensemble model?
Combining predictions from multiple base models to improve overall performance, reducing variance (bagging) or bias (boosting), or improving predictions via stacking.
65. What is stacking?
An ensemble technique where predictions from multiple different models (base learners) are used as inputs to a meta-model (blender) that makes the final prediction.
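A minimal sketch of stacking with scikit-learn’s StackingClassifier; the base learners, meta-model, and dataset are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]
# The meta-model (final_estimator) learns from the base learners' predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(stack, X, y, cv=5).mean())
```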
66. What is the difference between hard voting and soft voting?
Hard voting: each classifier votes for a class, majority wins. Soft voting: each classifier provides probabilities, and the class with the highest average probability is chosen (often better if classifiers are well-calibrated).
67. What is LightGBM?
A gradient boosting framework that uses histogram-based algorithms and leaf-wise tree growth, making it faster and more memory-efficient than XGBoost, especially on large datasets.
68. What is CatBoost?
A gradient boosting library that handles categorical features automatically, uses ordered boosting to reduce overfitting, and is robust to hyperparameters.
69. What is the bias-variance tradeoff in ensemble methods?
Bagging reduces variance (e.g., Random Forest). Boosting can reduce bias significantly but may increase variance if overdone.
70. How do you prevent overfitting in boosting algorithms?
Use early stopping, set a low learning rate with more estimators, tree-specific parameters (max_depth, min_child_weight), subsampling, and regularization parameters.
71. What is out-of-bag (OOB) error?
In bagging, each tree is trained on a bootstrap sample; the remaining ~37% of data not used (out-of-bag) can be used as a validation set, providing an unbiased error estimate.
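A minimal sketch of using the out-of-bag score in a random forest as a built-in validation estimate; the dataset is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # accuracy on out-of-bag samples, no separate holdout needed
```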
72. What is bootstrapping?
Sampling with replacement from the original dataset to create multiple training sets of the same size. Base for bagging.
Model Evaluation & Validation
73. What is a confusion matrix?
A table describing the performance of a classification model: True Positives, False Positives, True Negatives, False Negatives.
74. What is accuracy, and when is it not a good metric?
Accuracy = (TP+TN) / Total. It’s misleading when classes are imbalanced; a model predicting the majority class can have high accuracy.
75. What are precision and recall?
Precision = TP / (TP+FP) (exactness). Recall = TP / (TP+FN) (completeness). Trade-off; important for tasks like disease detection or spam filtering.
76. What is F1-score?
Harmonic mean of precision and recall: 2 * (Precision*Recall) / (Precision+Recall). Balances both, suitable for imbalanced datasets.
77. What are the ROC curve and AUC?
The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at various thresholds. AUC is the area under this curve; it measures the model’s ability to discriminate between classes. 1 = perfect, 0.5 = random.
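A minimal sketch of computing precision, recall, F1, and AUC with scikit-learn; the labels and predicted probabilities are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.3, 0.2]   # predicted P(class=1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]        # hard labels at a 0.5 threshold

print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_prob))  # AUC uses probabilities, not hard labels
```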
78. What is log loss?
Logarithmic loss measures the uncertainty of probabilistic predictions. Heavily penalizes confident wrong predictions. Used as cost function in logistic regression.
79. What is mean squared error (MSE) vs. mean absolute error (MAE)?
MSE = (1/n) Σ(y-ŷ)², penalizes larger errors more (due to square). MAE = (1/n) Σ|y-ŷ|, more robust to outliers. Choice depends on desired sensitivity to outliers.
80. What is R² (coefficient of determination)?
Proportion of variance in the target explained by the model. Range (-∞, 1], with 1 being perfect. A negative value means the model performs worse than simply predicting the mean.
81. What is a Type I and Type II error?
Type I: false positive (rejecting a true null hypothesis). Type II: false negative (failing to reject a false null hypothesis). The cost of each determines which metric to prioritize.
82. What is the difference between training error and test error?
Training error measures fit on training data. Test error measures generalization. A large gap suggests overfitting.
83. How do you handle imbalanced datasets?
Resampling (oversampling minority with SMOTE, undersampling majority), using appropriate metrics (F1, AUC-ROC, precision-recall), cost-sensitive learning, anomaly detection approaches.
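A minimal sketch of oversampling the minority class with SMOTE, assuming the imbalanced-learn package is installed; the skewed dataset is synthetic and illustrative.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                                # heavily skewed toward class 0

# In practice, apply SMOTE only to the training split, never to the test set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                            # classes balanced via synthetic samples
```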
Feature Engineering & Selection
84. What is feature engineering?
Creating new features or transforming existing ones to improve model performance. Involves domain knowledge, polynomial features, log transforms, binning, encoding categorical variables, text vectorization, etc.
85. How do you handle missing values?
Removing rows/columns (if few missing), imputation (mean/median/mode for numerical, “missing” category for categorical), model-based imputation, or using algorithms that handle missing data natively (XGBoost, LightGBM).
86. What is one-hot encoding?
Converts categorical variables into binary vectors representing each category. Avoids implying ordinality but can create high dimensionality.
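A minimal sketch of one-hot encoding a single categorical column, shown with both pandas and scikit-learn; the column and values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))   # one binary column per category

# OneHotEncoder does the same and can ignore categories unseen during fitting.
enc = OneHotEncoder(handle_unknown="ignore")
print(enc.fit_transform(df[["color"]]).toarray())
```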
87. What is label encoding?
Assigning each category a unique integer. Suitable for ordinal categories, but may introduce false ordinality for nominal variables.
88. What is target encoding?
Replacing a categorical value with the mean of the target variable for that category. Can cause overfitting, so regularization (smoothing) is used.
89. What are feature selection techniques?
Filter methods (correlation, chi-square, mutual information), wrapper methods (recursive feature elimination, forward/backward selection), embedded methods (Lasso regularization, tree-based feature importance).
90. What is the difference between feature selection and dimensionality reduction?
Feature selection keeps a subset of original features (interpretability preserved). Dimensionality reduction creates new composite features (e.g., PCA components) that may lose original meaning but can capture latent relationships.
Deep Learning & Advanced Algorithms (Basics)
91. What is a neural network?
A computing system inspired by biological neural networks. Consists of layers of neurons (nodes) with weighted connections. Learning occurs by adjusting weights via backpropagation.
92. What is an activation function? Why is ReLU used?
Introduces non-linearity into the network. ReLU (f(x)=max(0,x)) is simple, speeds up convergence, and mitigates vanishing gradient.
93. What is the vanishing gradient problem?
In deep networks, gradients can become extremely small in early layers during backpropagation, slowing learning. Solved by ReLU, batch normalization, and skip connections.
94. What is a convolutional neural network (CNN)?
Specialized for grid-like data (images). Uses convolutional layers to extract spatial features, pooling to reduce dimensionality, and fully connected layers for classification.
95. What is a recurrent neural network (RNN)?
Designed for sequential data (text, time series). Has hidden state that captures information about previous inputs. Suffers from vanishing gradient; LSTMs and GRUs address this.
96. What is LSTM?
Long Short-Term Memory: a type of RNN with gates (input, forget, output) that control information flow, learning long-term dependencies.
97. What is a generative adversarial network (GAN)?
Two networks, generator and discriminator, competing: generator creates fake data, discriminator tries to distinguish real from fake. Generator improves to create realistic data.
98. What is transfer learning?
Using a pre-trained model (on a large dataset) as a starting point for a related task, fine-tuning on the target data. Common in computer vision (ResNet) and NLP (BERT).
99. What is dropout?
Regularization technique where randomly selected neurons are dropped out (ignored) during training, preventing co-adaptation and reducing overfitting.
100. You have a dataset with highly imbalanced classes and need to detect fraud. Which algorithm would you consider and why?
I’d start with an ensemble method like XGBoost or LightGBM with scale_pos_weight to handle imbalance. For severe imbalance, I’d also try anomaly detection (Isolation Forest, Autoencoders) treating fraud as rare outliers. Evaluation would focus on precision-recall or F1, not accuracy. If interpretability is crucial, I’d use logistic regression with careful threshold tuning.