Here is a collection of data science interview questions and answers covering statistics, probability, data manipulation, machine learning, model evaluation, SQL, data visualization, feature engineering, experimentation, and case studies. Each question is followed by a detailed answer, with short code sketches where an example makes the idea concrete.
What is data science and what are the key stages of a data science project?
Answer: Data science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data. Key stages: problem definition, data collection, data cleaning and preprocessing, exploratory data analysis (EDA), feature engineering, model building, model evaluation, deployment, and monitoring.
What is the difference between data science, data analytics, and machine learning?
Answer: Data science is a broad field encompassing data analytics and machine learning. Data analytics focuses on descriptive and diagnostic analysis (what happened, why). Machine learning is a subset focused on predictive and prescriptive models (what will happen, what to do). Data science includes data engineering, visualization, and business intelligence.
What is the difference between supervised and unsupervised learning? Give examples.
Answer: Supervised learning uses labeled data (input-output pairs) to learn a mapping. Examples: classification (spam detection, image recognition), regression (house price prediction). Unsupervised learning finds hidden patterns in unlabeled data. Examples: clustering (customer segmentation), association (market basket analysis), dimensionality reduction (PCA).
Explain bias-variance tradeoff.
Answer: Bias is error due to overly simplistic assumptions (underfitting). Variance is error due to sensitivity to fluctuations in the training data (overfitting). Total error = bias² + variance + irreducible error. Increasing model complexity reduces bias but increases variance. The ideal model balances both to minimize total error.
What is overfitting and how can you prevent it?
Answer: Overfitting occurs when a model learns the training data too well, including noise, and fails to generalize to unseen data. Prevention: use more training data, simplify the model (fewer features, lower polynomial degree), regularization (L1, L2), cross-validation, early stopping (for iterative algorithms), pruning (decision trees), ensemble methods (bagging).
What is cross-validation? Why use it?
Answer: Cross-validation is a resampling technique to evaluate model performance on unseen data. K-fold cross-validation splits data into k folds; trains on k-1 folds, validates on the remaining fold, repeats k times. It reduces variance in performance estimate, helps detect overfitting, and makes efficient use of data.
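A minimal sketch of k-fold cross-validation with scikit-learn; the synthetic dataset and logistic regression model are placeholders for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; swap in your own X and y.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# cv=5: train on 4 folds, validate on the 5th, repeat 5 times.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())  # performance estimate and its spread
```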
What is the difference between training error and test error?
Answer: Training error is the error on the data used to train the model. Test error is the error on a separate hold-out set not used in training. A large gap between low training error and high test error indicates overfitting. Test error estimates generalization performance.
What is regularization? Explain L1 and L2 regularization.
Answer: Regularization adds a penalty term to the loss function to discourage large coefficients, reducing overfitting. L1 regularization (Lasso) adds absolute value of coefficients, can shrink some coefficients to zero (feature selection). L2 regularization (Ridge) adds squared coefficients, shrinks coefficients but not to zero. Elastic Net combines both.
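A quick sketch contrasting Ridge (L2) and Lasso (L1) in scikit-learn, on synthetic data with only a few informative features; note how Lasso zeroes out coefficients while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Ten features, only three actually informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically several
```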
What is a confusion matrix? How do you compute precision, recall, and F1 score?
Answer: A confusion matrix tabulates true positives (TP), true negatives (TN), false positives (FP), false negatives (FN). Precision = TP/(TP+FP) (accuracy of positive predictions). Recall = TP/(TP+FN) (coverage of actual positives). F1 = 2 * (Precision * Recall) / (Precision + Recall), the harmonic mean.
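A small worked example computing these metrics by hand from the confusion matrix cells and checking them against scikit-learn; the toy labels are arbitrary.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# ravel() flattens the binary 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                          # 3 / 4 = 0.75
recall = tp / (tp + fn)                             # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# The hand-computed values match the library's.
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```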
What is ROC curve and AUC?
Answer: The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds. AUC (Area Under the ROC Curve) measures the model’s ability to rank positives above negatives: 0.5 is random, 1.0 is perfect. Higher AUC indicates better discrimination, and AUC is less sensitive to class imbalance than accuracy.
Explain the difference between precision and recall. When would you prioritize one?
Answer: Precision measures how many selected items are relevant; recall measures how many relevant items are selected. Prioritize precision when false positives are costly, as in spam detection (flagging legitimate email annoys users); prioritize recall when false negatives are costly, as in cancer screening (a missed case is dangerous). F1 balances both.
What is the difference between classification and regression?
Answer: Classification predicts discrete class labels (e.g., cat/dog, spam/ham). Regression predicts continuous values (e.g., house price, temperature). Evaluation metrics differ: accuracy, confusion matrix, F1 for classification; MSE, MAE, R² for regression.
What is logistic regression? How does it differ from linear regression?
Answer: Logistic regression is a classification algorithm modeling the probability of a binary outcome using the sigmoid function. Unlike linear regression (continuous output, least squares), logistic regression outputs probabilities between 0 and 1 and uses maximum likelihood estimation.
What is the sigmoid function? Why is it used?
Answer: Sigmoid σ(z)=1/(1+e⁻ᶻ) maps any real number to the interval (0,1), interpretable as probability. It is used as the activation in logistic regression and output layer of neural networks for binary classification.
What is linear regression? What are its assumptions?
Answer: Linear regression models the relationship between dependent and independent variables as a linear equation. Assumptions: linearity, independence of errors, homoscedasticity (constant variance), normality of residuals, no perfect multicollinearity.
How do you assess a linear regression model?
Answer: Metrics: R² (proportion of variance explained), Adjusted R² (penalizes extra predictors), RMSE, MAE. Also residual plots to check assumptions, F-test for overall significance, p-values for individual coefficients.
What is the difference between RMSE and MAE?
Answer: RMSE squares errors, penalizing larger errors more, sensitive to outliers. MAE treats errors linearly, more robust. Use RMSE if large errors are especially undesirable; MAE for robust evaluation.
What is gradient descent? Explain batch, stochastic, and mini-batch.
Answer: Gradient descent optimizes loss by updating parameters opposite the gradient direction. Batch GD uses entire dataset per update (accurate but slow). Stochastic GD uses one sample per update (noisy, faster). Mini-batch uses a small batch (balance).
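A from-scratch NumPy sketch of mini-batch gradient descent on linear regression with MSE loss, to make the update rule concrete; the data and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]  # indices of one mini-batch
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # MSE gradient
        w -= lr * grad                     # step opposite the gradient
print(w)  # close to [2.0, -1.0, 0.5]
```

Batch gradient descent is the special case where the batch is the whole dataset; stochastic gradient descent is the case where the batch size is 1.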
What is the learning rate in gradient descent? How to choose it?
Answer: Learning rate controls step size towards the minimum. Too high → divergence, too low → slow convergence. Choose via learning rate schedules, line search, or adaptive methods (Adam, RMSprop). Typically start with 0.001 or 0.01.
What is the curse of dimensionality?
Answer: As feature dimensionality increases, data becomes sparse, distance metrics become less meaningful, and the volume of space grows exponentially. Models require exponentially more data to avoid overfitting. Mitigate with feature selection, PCA, or regularized models.
What is feature scaling? Why is it important?
Answer: Feature scaling transforms numerical features to similar ranges, preventing features with larger scales from dominating distance-based algorithms (KNN, SVM, PCA). Common methods: standardization (zero mean, unit variance) and min-max scaling (range 0-1). Tree-based models are generally unaffected.
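A short sketch of both methods in scikit-learn; fitting the scaler on the training split only is also how you avoid the data leakage discussed later in this list.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

std = StandardScaler().fit(X_train)  # learns mean and variance from train only
print(std.transform(X_test))         # zero mean, unit variance scale

mm = MinMaxScaler().fit(X_train)     # learns min/max from train only
print(mm.transform(X_test))          # maps the training range to [0, 1]
```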
What is PCA (Principal Component Analysis)?
Answer: PCA is a linear dimensionality reduction technique that projects data onto directions (principal components) that maximize variance. It reduces features while preserving most information. Used for visualization, noise reduction, and speeding up algorithms.
What is the difference between PCA and t-SNE?
Answer: PCA is linear, deterministic, preserves global variance structure (good for preprocessing). t-SNE is non-linear, stochastic, preserves local distances (excellent for visualization but not for general feature reduction). t-SNE is sensitive to perplexity.
What is clustering? Name algorithms.
Answer: Clustering groups similar data points without labels. Algorithms: K-means (centroid-based), hierarchical (agglomerative), DBSCAN (density-based), Gaussian Mixture Models (probabilistic), spectral clustering.
How does K-means work and how to choose K?
Answer: K-means initializes K centroids; iteratively assigns points to nearest centroid, recomputes centroids as means. Converges to local optimum. Choose K via elbow method (plot within-cluster sum of squares), silhouette score, or gap statistic.
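A minimal elbow-method sketch using scikit-learn, where inertia_ is the within-cluster sum of squares; the blob data with four true centers is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia drops sharply until k = 4, then flattens
```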
What is DBSCAN and when would you use it?
Answer: DBSCAN is a density-based clustering algorithm that groups points with many neighbors, marking low-density regions as outliers. Does not require specifying number of clusters, handles arbitrary shapes, robust to noise. Use for spatial data or when outliers are expected.
What is a decision tree? How does it split nodes?
Answer: A decision tree recursively splits data based on feature values, forming a tree structure. Splits are chosen to maximize impurity reduction (classification: Gini impurity, entropy; regression: variance reduction). Splitting stops when the maximum depth is reached, the minimum samples per node is met, or no split improves purity.
What are impurity measures in decision trees?
Answer: Gini impurity = 1 − Σ pᵢ² (the probability of misclassifying a randomly drawn sample), entropy = −Σ pᵢ log₂ pᵢ (expected information content). Both favor pure nodes. Misclassification error is a third option, but it is less sensitive to changes in node probabilities and is rarely used for growing trees.
How do you prevent overfitting in decision trees?
Answer: Pruning (reducing tree size), setting minimum samples per leaf, maximum depth, minimum impurity decrease, or using ensemble methods (random forest, gradient boosting).
What is random forest?
Answer: Random forest is an ensemble of decision trees each trained on a bootstrap sample (bagging) and random subset of features (feature bagging). Aggregates predictions by majority vote (classification) or average (regression). Reduces overfitting and variance.
What is the difference between bagging and boosting?
Answer: Bagging trains models in parallel on bootstrapped samples, averaging predictions to reduce variance (random forest). Boosting trains models sequentially, each correcting previous errors, to reduce bias (AdaBoost, gradient boosting). Bagging more robust to outliers, boosting can overfit.
Explain gradient boosting. How does XGBoost improve it?
Answer: Gradient boosting builds an ensemble of weak learners (usually shallow trees) sequentially, each new tree fitting the residuals of previous ensemble using gradient descent. XGBoost adds regularization (L1/L2), parallel processing, tree pruning, handling missing values, and built-in cross-validation.
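To make the mechanism concrete, here is a from-scratch sketch of the core boosting loop for squared loss, where the negative gradient is simply the residual. This is plain gradient boosting; XGBoost layers the regularization and engineering improvements listed above on top of it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

pred = np.full_like(y, y.mean())  # start from a constant prediction
lr, trees = 0.1, []
for _ in range(100):
    residuals = y - pred                         # negative gradient of MSE
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)                           # keep the ensemble
    pred += lr * tree.predict(X)                 # shrink each tree's contribution

print(np.mean((y - pred) ** 2))  # training MSE shrinks as trees are added
```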
What is the difference between GBM and LightGBM?
Answer: LightGBM grows trees leaf-wise (expanding the leaf with the largest loss reduction), whereas traditional GBM implementations such as XGBoost grow level-wise. Leaf-wise growth can be faster and more accurate but is prone to overfitting on small datasets. LightGBM also uses histogram-based binning, which lowers memory use. Both are strong choices; benchmark on your data.
What is a support vector machine (SVM)?
Answer: SVM finds the hyperplane that maximizes the margin between classes. Kernel trick maps data to higher dimensions to make it linearly separable. Effective for high-dimensional spaces, memory efficient, but less interpretable.
What is the kernel trick in SVM? Give examples.
Answer: The kernel trick implicitly maps data to a higher-dimensional space without explicit transformation using kernel functions. Examples: polynomial kernel, radial basis function (RBF) kernel, sigmoid kernel. Allows SVM to handle non-linear boundaries.
What is the role of C parameter in SVM?
Answer: C is the regularization parameter; large C gives hard margin (fewer misclassifications, risk overfitting), small C allows more misclassifications (softer margin, better generalization). Tuned via cross-validation.
Explain Naive Bayes classifier.
Answer: Naive Bayes applies Bayes’ theorem with the “naive” assumption that features are conditionally independent given the class. Fast, works with small data, good for text classification (spam detection). Despite independence violation, often performs well.
What is a neural network? What is a perceptron?
Answer: A neural network is a set of interconnected neurons (units) in layers (input, hidden, output). A perceptron is a single-layer neural network with binary output. Modern deep networks have multiple hidden layers, non-linear activations, and are trained with backpropagation.
What is an activation function? Name common ones.
Answer: Activation functions introduce non-linearity. Common: ReLU (max(0,x)), Sigmoid (1/(1+e⁻ˣ)), Tanh, Leaky ReLU, Swish, Softmax (output layer for classification). ReLU is the default for hidden layers because it is cheap to compute and does not saturate for positive inputs, which mitigates vanishing gradients.
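A tiny NumPy sketch of three of these activations; subtracting the max inside softmax is the standard numerical-stability trick.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # zero for negatives, identity otherwise

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes to (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))            # shift by max for numerical stability
    return e / e.sum()                   # normalizes to a probability vector

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), sigmoid(z), softmax(z))
```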
What is backpropagation?
Answer: Backpropagation computes gradients of the loss function with respect to each weight using the chain rule. Gradients flow backward from output to input; used by optimizers (SGD, Adam) to update weights. Essential for training deep networks.
What is the vanishing gradient problem?
Answer: Vanishing gradient occurs when gradients become exponentially small as they propagate backward, causing early layers to learn very slowly or not at all. Common with sigmoid/tanh activations. Mitigation: ReLU, batch normalization, residual connections, careful weight initialization.
What is batch normalization?
Answer: Batch normalization normalizes layer activations to zero mean and unit variance across a mini-batch. It speeds up training, allows higher learning rates, reduces internal covariate shift, and has a slight regularizing effect.
What is dropout?
Answer: Dropout randomly drops a fraction of neurons during training, preventing co-adaptation and reducing overfitting. At test time, all neurons are used with scaled outputs. Dropout is a regularization technique.
What is a convolutional neural network (CNN)?
Answer: CNN is designed for grid-like data (images). Uses convolutional layers (learn spatial filters), pooling layers (downsample), and fully connected layers. Shares weights across space, reducing parameters, and provides translation invariance.
What is pooling in CNNs?
Answer: Pooling (max, average) downsamples feature maps, reducing spatial dimensions, controlling overfitting, and providing a degree of translational invariance. Max pooling takes the maximum value in each window.
What is a recurrent neural network (RNN)?
Answer: RNN is designed for sequential data (time series, text). It has loops allowing information to persist. Suffers from vanishing/exploding gradients; solutions: LSTM, GRU, attention.
What is LSTM? How does it solve vanishing gradient?
Answer: LSTM (Long Short-Term Memory) is a gated RNN with forget, input, output gates and a cell state. Gates control information flow, allowing gradients to flow unchanged for long sequences, mitigating vanishing gradient. LSTM can learn long-range dependencies.
What is attention mechanism?
Answer: Attention weighs the importance of different input parts when generating output. In transformers, self-attention computes relationships between all positions in a sequence. Attention enables capturing long-range dependencies and is the foundation of modern NLP (BERT, GPT).
What is transfer learning?
Answer: Transfer learning reuses a pre-trained model (trained on large dataset like ImageNet) as a starting point for a new task. Freeze early layers (feature extractors), fine-tune later layers. Saves time and data, improves performance.
What is data augmentation?
Answer: Data augmentation creates new training samples by applying transformations (rotation, flip, crop, noise, color adjustments) to existing data. Commonly used in computer vision to reduce overfitting and improve generalization.
What is an imbalanced dataset? How to handle it?
Answer: Imbalanced dataset has unequal class distribution. Handling: resampling (oversample minority – SMOTE, undersample majority), use class weights, choose appropriate metrics (precision, recall, F1, AUC), or use anomaly detection algorithms.
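A sketch of the class-weight approach in scikit-learn: class_weight="balanced" reweights the loss inversely to class frequencies. The 95/5 synthetic imbalance is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic dataset with a 95/5 class split.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))  # inspect minority-class recall
```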
What is SMOTE?
Answer: Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic minority class examples by interpolating between existing minority samples. It reduces overfitting compared to random oversampling and is often combined with undersampling of the majority class.
What is the difference between bagging and pasting?
Answer: Bagging uses bootstrapped samples (sampling with replacement). Pasting uses sampling without replacement. Bagging introduces more diversity; pasting is more deterministic. Random forest uses bagging.
What is a baseline model?
Answer: A baseline model is a simple, interpretable benchmark to compare against more complex models. Examples: majority class classifier, mean predictor, linear regression. Helps measure improvement and detect data leakage.
What is data leakage?
Answer: Data leakage occurs when information from outside the training dataset (e.g., future data, test data, target information) is used to train the model, leading to overly optimistic performance. Examples: scaling before split, using features derived from future dates, duplicate rows.
How do you split data for time series?
Answer: Do not use random shuffle; use chronological split (e.g., train on earlier time periods, validate on later). Also use time-series cross-validation (rolling window or expanding window) to maintain temporal order.
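A minimal sketch with scikit-learn's TimeSeriesSplit, which produces expanding-window splits where training indices always precede validation indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "validate:", val_idx)
# Every training window ends before its validation window begins,
# so no future data leaks into training.
```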
What is a recommendation system? Types?
Answer: Recommender systems suggest items to users. Types: collaborative filtering (user-item interactions), content-based filtering (item features), and hybrid models. Matrix factorization (SVD, ALS) is common. Evaluation metrics: RMSE (ratings), precision@k, recall@k.
What is the cold start problem in recommender systems?
Answer: Cold start occurs when a new user or new item has no interaction data. Solutions: use content-based features, demographics, popularity bias, or hybrid models. Also employ active learning to query new users for preferences.
What is A/B testing? How do you determine sample size?
Answer: A/B testing compares two versions (A and B) to determine which performs better. Sample size depends on desired statistical power (typically 80%), significance level (α=0.05), minimum detectable effect, and baseline conversion rate. Use online calculators or formulas.
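One way to run the calculation in code, assuming the statsmodels library: a 10% baseline conversion rate, a 2-percentage-point minimum detectable effect, α = 0.05, and 80% power (all illustrative numbers).

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h for lifting conversion from 10% to 12%.
effect = proportion_effectsize(0.12, 0.10)

n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8,
                                 alternative="two-sided")
print(round(n))  # required sample size per variant
```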
What is p-value and what does it mean?
Answer: P-value is the probability of observing the data (or more extreme) if the null hypothesis is true. A small p-value (<0.05) suggests rejecting the null hypothesis. It does not measure effect size or probability that the null is false.
What is the difference between correlation and causation?
Answer: Correlation measures statistical association (two variables move together). Causation means one variable directly affects another. Correlation does not imply causation due to confounding variables, reverse causality, or coincidence.
What is Simpson’s paradox?
Answer: Simpson’s paradox occurs when a trend appears in several groups but disappears or reverses when the groups are combined. It highlights the importance of controlling for confounding variables. Classic example: the UC Berkeley admissions data, where overall admission rates appeared biased against women even though most individual departments showed no such bias.
What is a confounding variable?
Answer: A confounding variable is a third variable that influences both the independent and dependent variables, creating a spurious association. Proper study design (randomization, stratification) or statistical adjustment (regression) can control for confounders.
What is maximum likelihood estimation (MLE)?
Answer: MLE is a method for estimating parameters of a statistical model by maximizing the likelihood function (probability of observing the given data under the model). It is used in linear regression (normal errors → least squares) and logistic regression.
What is the difference between a parametric and non-parametric model?
Answer: Parametric models assume a fixed number of parameters (e.g., linear regression, normal distribution). Non-parametric models do not assume a fixed functional form (e.g., K-NN, decision trees, kernel density estimation). Non-parametric more flexible but need more data.
What is the Central Limit Theorem?
Answer: The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population distribution (provided variance is finite). Enables confidence intervals and hypothesis tests.
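A quick simulation sketch: even though the exponential distribution is strongly skewed, the distribution of its sample means is approximately normal.

```python
import numpy as np

rng = np.random.default_rng(0)
# 10,000 samples of size 50 from a skewed Exp(1) distribution.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# Exp(1) has mean 1 and sd 1, so the CLT predicts the sample means
# cluster around 1 with sd roughly 1/sqrt(50) ~ 0.141.
print(sample_means.mean(), sample_means.std())
```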
What is the difference between a histogram and a box plot?
Answer: Histogram shows distribution shape, bins, frequency. Box plot shows median, quartiles, range, and outliers. Box plots are better for comparing multiple distributions.
What is a Q-Q plot?
Answer: A Q-Q (quantile-quantile) plot compares quantiles of two distributions. Deviations from the diagonal indicate differences. Commonly used to check normality (plot against theoretical normal quantiles).
What is overfitting in model selection?
Answer: Overfitting in model selection occurs when the model is too complex and learns noise. Use cross-validation, regularization, and hold-out test set to detect and prevent. Information criteria (AIC, BIC) also penalize complexity.
What is the difference between AIC and BIC?
Answer: Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) both penalize model complexity. AIC = −2 log L + 2k; BIC = −2 log L + k log n, where L is the maximized likelihood, k the number of parameters, and n the sample size. BIC penalizes complexity more heavily for large n. Lower values indicate better models.
What is a random variable?
Answer: A random variable is a variable whose possible values are outcomes of a random phenomenon. Discrete (countable values) or continuous (uncountable). Described by probability distribution (pmf or pdf).
What is the difference between probability and likelihood?
Answer: Probability measures the chance of observing data given fixed parameters. Likelihood measures how well parameters explain observed data, proportional to probability but viewed as function of parameters.
What is Bayes’ theorem?
Answer: P(A|B) = P(B|A) * P(A) / P(B). Updates belief (prior) with evidence (likelihood) to obtain posterior. Foundation of Bayesian statistics and Naive Bayes classifier.
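A classic worked example: a disease with 1% prevalence and a test with 99% sensitivity and a 5% false-positive rate. Bayes’ theorem shows the probability of disease given a positive test is far lower than intuition suggests.

```python
p_d = 0.01                    # prior P(disease)
p_pos_given_d = 0.99          # sensitivity, P(positive | disease)
p_pos_given_not_d = 0.05      # false-positive rate, P(positive | no disease)

# Total probability of a positive test.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

posterior = p_pos_given_d * p_d / p_pos  # Bayes' theorem
print(round(posterior, 3))  # ~0.167, i.e. only ~17% despite the "99% accurate" test
```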
How do you handle missing values?
Answer: Options: deletion (if few missing), imputation (mean, median, mode, regression, KNN), using algorithms that handle missing values natively (XGBoost, LightGBM), or creating indicator columns. For categorical data, impute with the mode or a “missing” category.
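A sketch of median imputation with an added missing-indicator column, using scikit-learn's SimpleImputer; the tiny array is a stand-in for real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [4.0, 40.0]])

# add_indicator=True appends 0/1 columns flagging which values were missing.
imp = SimpleImputer(strategy="median", add_indicator=True)
print(imp.fit_transform(X))  # medians filled in, plus indicator columns
```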
What is feature importance in tree-based models?
Answer: Feature importance measures the reduction in impurity (Gini, entropy) or MSE contributed by each feature across all splits. Used for feature selection and interpretability. Random forests provide built-in importance (permutation importance also available).
What is SHAP?
Answer: SHAP (SHapley Additive exPlanations) is a game-theoretic method to explain individual predictions by attributing contributions of each feature. It provides consistent, locally accurate explanations and is model-agnostic.
What is LIME?
Answer: LIME (Local Interpretable Model-agnostic Explanations) explains individual predictions by approximating the complex model locally with an interpretable linear model. Useful for debugging and trust.
What is the difference between L1 and L2 loss (absolute vs squared)?
Answer: L1 loss (absolute deviation) is robust to outliers but not differentiable at zero. L2 loss (squared) penalizes large errors more heavily, is differentiable everywhere, but sensitive to outliers. Huber loss combines both.
What is the concept of “no free lunch” in machine learning?
Answer: No single algorithm works best for all problems. Performance depends on the dataset and task. Experimentation and domain knowledge are required to choose appropriate models.
What is the difference between online learning and batch learning?
Answer: Batch learning trains on the entire dataset at once (offline). Online learning updates the model incrementally as new data arrives; useful for streaming data and large datasets that don’t fit in memory.
What is a pipeline in machine learning?
Answer: A pipeline chains together preprocessing steps (scaling, imputation, encoding) and a model, ensuring consistent transformations and simplifying cross-validation. Libraries: scikit-learn Pipeline, Spark ML.
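A minimal scikit-learn Pipeline sketch: because the imputer and scaler live inside the pipeline, cross-validation refits them within each fold, preventing leakage.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # step 1: fill missing values
    ("scale", StandardScaler()),                   # step 2: standardize
    ("model", LogisticRegression(max_iter=1000)),  # step 3: fit the classifier
])
print(cross_val_score(pipe, X, y, cv=5).mean())  # each fold refits all steps
```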
What is the difference between a statistical model and a machine learning model?
Answer: Statistical models emphasize inference (understanding relationships, confidence intervals), often with parametric assumptions. Machine learning models emphasize prediction accuracy and can be non-parametric or black-box. The boundaries are blurring.
What is regularization in linear models?
Answer: Regularization adds a penalty to loss function to shrink coefficients, reducing overfitting. Ridge (L2) shrinks but doesn’t eliminate; Lasso (L1) can set coefficients to zero (feature selection). Elastic net balances.
What is the difference between cross-validation and hold-out validation?
Answer: Hold-out validation uses a single split (train/validation/test). Cross-validation uses multiple splits (k-fold), giving a more stable performance estimate, especially with small data. It has lower variance but higher computational cost.
What is bootstrapping?
Answer: Bootstrapping resamples the dataset with replacement to create many simulated datasets. Used for estimating standard errors, confidence intervals (percentile method), and ensemble learning (bagging).
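A short NumPy sketch of a 95% percentile bootstrap confidence interval for a sample mean; the normal data and 5,000 resamples are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)

# Resample with replacement many times, recording the mean each time.
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(5000)
])

# Percentile method: take the 2.5th and 97.5th percentiles.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```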
What is the difference between type I and type II error?
Answer: Type I error (false positive) – rejecting true null hypothesis. Type II error (false negative) – failing to reject false null hypothesis. α (significance level) controls type I, β controls type II. Power = 1 – β.
What is the curse of dimensionality in clustering?
Answer: In high dimensions, distances become less discriminative; all points appear approximately equally distant. Clustering algorithms (K-means, DBSCAN) degrade. Mitigation: dimensionality reduction, feature selection, or subspace clustering.
What is the difference between Euclidean distance and Manhattan distance?
Answer: Euclidean (L2) is straight line distance; Manhattan (L1) is sum of absolute differences along axes. Euclidean sensitive to scale, Manhattan more robust in high dimensions (less affected by curse).
What is a reference group in logistic regression?
Answer: In logistic regression with categorical predictors, one category is treated as reference (baseline). Coefficients for other categories represent log-odds relative to reference. The reference should be meaningful.
What is a loss function? Name common ones.
Answer: Loss function measures discrepancy between predicted and true values. Classification: binary cross-entropy, categorical cross-entropy, hinge loss. Regression: MSE, MAE, Huber. Also custom losses.
What is the difference between LDA and PCA?
Answer: PCA is unsupervised, finds directions maximizing variance. LDA (Linear Discriminant Analysis) is supervised, finds directions maximizing class separability. LDA reduces dimensions while preserving class information.
What is a recurrent neural network (RNN) vs feedforward NN?
Answer: RNN has loops, allowing information persistence across time steps. Feedforward NN has no memory. RNN suitable for sequences; feedforward for i.i.d. data.
What is a transformer?
Answer: Transformer architecture relies on self-attention, replacing recurrence and convolutions. It processes all positions in parallel, enabling long-range dependencies. Foundation of BERT, GPT, and modern NLP.
What is BERT?
Answer: BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model using masked language modeling and next sentence prediction. It provides deep bidirectional context. Fine-tuned for classification, QA, NER.
What is GPT?
Answer: GPT (Generative Pre-trained Transformer) is an autoregressive language model trained to predict next token. It is unidirectional (left-to-right). Used for text generation, summarization, and in-context learning.
What is an embedding?
Answer: An embedding is a dense, low-dimensional vector representation of categorical data (words, products, users). Learned via neural networks or matrix factorization. Captures semantic similarity. Word2Vec, GloVe, BERT embeddings.
What is dimensionality reduction? Name techniques.
Answer: Dimensionality reduction reduces number of features while preserving important information. Techniques: PCA (linear), t-SNE (non-linear, visualization), UMAP, autoencoders (neural), LDA (supervised).
What is the difference between a generative and discriminative model?
Answer: Generative models learn the joint probability P(X,Y) and can generate new samples (Naive Bayes, GANs). Discriminative models learn the conditional probability P(Y|X) (logistic regression, SVM, neural networks). Generative models can handle missing data more naturally; discriminative models typically achieve better predictive accuracy.
What is a variational autoencoder (VAE)?
Answer: VAE is a generative model that learns a latent probabilistic representation of data. Encoder outputs mean and variance; decoder samples and reconstructs. Enables interpolation and generation of new samples.
What is a GAN (Generative Adversarial Network)?
Answer: GAN consists of a generator (creates fake data) and a discriminator (distinguishes real from fake). They compete; generator learns to produce realistic samples. Used for image generation, style transfer, super-resolution.
What is reinforcement learning (RL)?
Answer: RL is learning by interacting with an environment: agent takes actions, receives rewards, learns policy to maximize cumulative reward. Key components: state, action, reward, discount factor. Applications: game playing, robotics, recommendation.
What is Q-learning?
Answer: Q-learning is a model-free, off-policy RL algorithm that learns Q-values (expected future reward) for state-action pairs. Uses Bellman equation to update. Converges to optimal policy under certain conditions.
What is the difference between on-policy and off-policy RL?
Answer: On-policy learns from actions taken by the current policy (e.g., SARSA). Off-policy learns from actions taken by a different policy (e.g., Q-learning from experience replay). Off-policy more sample efficient but can be unstable.
What is a Markov Decision Process (MDP)?
Answer: MDP is a mathematical framework for modeling decision-making with states, actions, rewards, transition probabilities, and discount factor. Assumes Markov property (future depends only on present). Foundation of RL.
What is the exploration-exploitation dilemma?
Answer: Exploration tries new actions to discover better rewards; exploitation chooses known high-reward actions. Balance via epsilon-greedy (random action with probability ε), UCB, Thompson sampling.
What is Bayesian optimization?
Answer: Bayesian optimization is a sequential method for optimizing expensive black-box functions (hyperparameters). Uses a surrogate model (Gaussian process) and acquisition function to choose next point. Efficient for small number of evaluations.
What is the difference between grid search and random search?
Answer: Grid search exhaustively evaluates all combinations of specified hyperparameter values; expensive if many parameters. Random search samples random combinations; often more efficient, especially when only few hyperparameters matter.
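A side-by-side sketch in scikit-learn, additionally assuming scipy for the log-uniform sampling distribution; grid search tries every listed value of C, while randomized search caps the budget at n_iter draws.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

# Grid search: all 5 listed values of C, each cross-validated.
grid = GridSearchCV(model, {"C": [0.01, 0.1, 1, 10, 100]}, cv=5).fit(X, y)

# Random search: 5 draws from a continuous log-uniform distribution.
rand = RandomizedSearchCV(model, {"C": loguniform(1e-2, 1e2)}, n_iter=5,
                          cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```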
What is the difference between a validation set and a test set?
Answer: Validation set is used for hyperparameter tuning and model selection; it guides development. Test set is used only once at the end to evaluate final model’s generalization performance. Never use test set for tuning.
What is a p-value and what are its limitations?
Answer: The p-value is the probability of observing data as extreme or more extreme than what was observed, assuming the null hypothesis is true. Limitations: it depends on sample size (a large n yields small p-values even for trivial effects), it does not measure practical significance, and its misuse leads to p-hacking.
What is a confidence interval?
Answer: A confidence interval (CI) is a range of values likely to contain the true population parameter with a given confidence level (e.g., 95%). CI provides information about precision and uncertainty, unlike point estimates.
What is the difference between a predictive model and an inferential model?
Answer: Predictive model focuses on accuracy of predictions (black-box allowed). Inferential model focuses on understanding relationships, estimating effects, and hypothesis testing (interpretability required).
What is the difference between a decision tree and a random forest?
Answer: Decision tree is a single tree; prone to overfitting, high variance. Random forest averages many trees trained on bootstrapped samples and random features; reduces variance, more robust, less interpretable.
Why should we hire you as a data scientist?
Answer: I have strong statistical and machine learning fundamentals, practical experience with real-world messy data, and proficiency in Python (pandas, scikit-learn, TensorFlow) and SQL. I prioritize clean code, reproducibility, and collaboration. I can translate business problems into data solutions and communicate insights effectively. I continuously learn new techniques and care about ethical implications and model interpretability.
Conclusion
You’ve now worked through questions spanning statistics, probability, machine learning fundamentals, deep learning, model evaluation, and experimentation. Interviewers rarely expect flawless answers to every one of them; they look for clear reasoning, honesty about trade-offs, and the ability to connect methods to real business problems. Review the topics where you hesitated, practice explaining a few answers out loud, and walk into the interview with the confidence that comes from thorough preparation. Good luck.