Key Concepts & Definitions
Core terms and ideas from 101 Data Science Drawings, with short definitions and context for when you’ll encounter them.
Supervised Learning

| Concept | Definition | When It Matters |
|---|---|---|
| Bias-Variance Tradeoff | The tension between a model that’s too simple (high bias, underfits) and one that’s too complex (high variance, overfits). The goal is the level of complexity that minimises error on unseen data. | Model selection, hyperparameter tuning, comparing algorithms |
| Overfitting vs Underfitting | Overfitting: the model memorises training data, including its noise, and fails on new data. Underfitting: the model is too simple to capture the underlying patterns, so it performs poorly even on training data. | Diagnosing poor model performance, choosing model complexity |
| Gradient Descent | An optimisation algorithm that iteratively adjusts model parameters by stepping in the direction of steepest decrease in the loss function (the negative gradient), with the step size set by a learning rate. | Training neural networks, logistic regression, any iterative model fitting |
| Decision Trees | A model that splits data into branches based on feature thresholds, creating a tree of if-then rules. Easy to interpret but prone to overfitting. | Classification and regression tasks, feature importance analysis |
| Naive Bayes | A probabilistic classifier that applies Bayes’ theorem under the assumption that features are independent given the class label. Fast and often effective for text classification. | Spam filtering, sentiment analysis, document categorisation |

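To make the gradient descent entry concrete, here is a minimal sketch in plain Python. It fits a straight line by minimising mean squared error. The function name `gradient_descent`, the learning rate, the step count, and the toy data are illustrative assumptions, not something taken from the book.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ≈ w * x + b by minimising mean squared error (illustrative)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
```

On this noise-free data the fitted `w` and `b` should approach 2 and 1. A learning rate that is too large would make the updates diverge instead, which is why it is usually tuned.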
Unsupervised Learning

| Concept | Definition | When It Matters |
|---|---|---|
| K-Means Clustering | Partitions data into K groups by minimising the sum of squared distances between points and their assigned cluster centre. Requires choosing K in advance. | Customer segmentation, grouping survey responses, geographic clustering |
| Principal Component Analysis (PCA) | Reduces the number of variables by finding new, uncorrelated axes (components), ordered by how much of the data’s variance each one captures. | Dimensionality reduction, visualising high-dimensional data, preprocessing |

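The K-means entry can be sketched with the standard alternating procedure (often called Lloyd's algorithm): assign each point to its nearest centre, then move each centre to the mean of its points. The function name `k_means`, the fixed starting centres, and the toy points below are assumptions made for illustration. Real implementations usually pick starting centres more carefully.

```python
def k_means(points, centres, iterations=10):
    """Alternate assignment and centre-update steps (illustrative sketch)."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            # Squared Euclidean distance from this point to each centre.
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[dists.index(min(dists))].append(p)
        # Move each centre to the mean of its cluster (keep it if empty).
        centres = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centre
            for cluster, centre in zip(clusters, centres)
        ]
    # clusters reflects the final assignment step.
    return centres, clusters


# Two visually separate groups; K = 2 is chosen in advance.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centres, clusters = k_means(points, centres=[(1, 1), (8, 8)])
```

Here each centre ends up at the mean of one of the two groups, with three points assigned to each.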
Probability & Statistics

| Concept | Definition | When It Matters |
|---|---|---|
| OLS Regression | Ordinary Least Squares: fits a linear model by minimising the sum of squared differences between observed and predicted values. The workhorse of quantitative research. | Impact evaluation, econometric analysis, any linear relationship modelling |
| Confidence Intervals | A range of values, computed from the sample, produced by a procedure that captures the true population parameter a stated share of the time under repeated sampling (e.g., “we are 95% confident the mean is between 3.2 and 4.8”). | Reporting results, policy briefs, uncertainty communication |
| p-values | The probability of observing your result (or something more extreme) if the null hypothesis were true. Lower values suggest stronger evidence against the null. It is not the probability that the null hypothesis is true. | Hypothesis testing, academic publishing, programme evaluation |

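For the OLS entry, the one-predictor case has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The sketch below assumes that simple case. The function name `ols_fit` and the data are made up for illustration.

```python
def ols_fit(xs, ys):
    """Closed-form OLS for a single predictor: y ≈ intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Illustrative data lying roughly along y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = ols_fit(xs, ys)

# Sum of squared residuals: the quantity OLS minimises.
ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
```

Any other choice of intercept and slope would give a larger `ssr` on the same data, which is what "least squares" refers to.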
Econometrics

| Concept | Definition | When It Matters |
|---|---|---|
| Instrumental Variables | A technique to estimate causal effects when there’s endogeneity (the predictor is correlated with the error term). Uses a third variable (the instrument) that is correlated with the predictor and affects the outcome only through that predictor. | Causal inference when randomisation isn’t possible |
| Difference-in-Differences | Compares the change in outcomes over time between a treatment group and a control group. Controls for time-invariant unobserved differences, provided the two groups would have followed parallel trends without the treatment. | Policy evaluation, natural experiments, programme impact studies |

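The difference-in-differences entry reduces to simple arithmetic in the two-group, two-period case: subtract the control group's change from the treatment group's change. The sketch below assumes that basic setup. The function name `diff_in_diff` and the outcome values are hypothetical.

```python
from statistics import mean


def diff_in_diff(treat_pre, treat_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences estimate."""
    # Change over time within each group.
    treat_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    # The control change stands in for what would have happened anyway.
    return treat_change - control_change


# Hypothetical outcomes before and after a programme.
effect = diff_in_diff(
    treat_pre=[10.0, 12.0, 11.0],
    treat_post=[16.0, 18.0, 17.0],
    control_pre=[9.0, 11.0, 10.0],
    control_post=[11.0, 13.0, 12.0],
)
```

With these numbers the treatment group rises by 6 and the control group by 2, so the estimate is 4. In applied work the same quantity is usually obtained from a regression with an interaction term, which also gives standard errors.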
SQL

| Concept | Definition | When It Matters |
|---|---|---|
| Joins | Combine rows from two or more tables based on a related column. Types: INNER (matching rows only), LEFT (all rows from the left table, plus matches from the right), RIGHT (all rows from the right table, plus matches from the left), FULL (all rows from both tables, matched where possible). | Any multi-table database query, merging datasets |
| Window Functions | Perform calculations across a set of rows related to the current row without collapsing those rows into one, unlike GROUP BY. Examples: ROW_NUMBER(), RANK(), LAG(), LEAD(). | Rankings, running totals, comparing rows to their neighbours |

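Both SQL entries can be tried out with Python's built-in `sqlite3` module and an in-memory database. This is a sketch under assumptions: the `customers` and `orders` tables and their contents are invented for illustration, and the window function query needs the bundled SQLite to be version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 20.0), (3, 2, 35.0);
""")

# LEFT JOIN keeps every customer, even one with no orders (amount is NULL).
left_join_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

# RANK() adds a ranking column without collapsing the order rows.
ranked_rows = conn.execute("""
    SELECT customer_id, amount,
           RANK() OVER (ORDER BY amount DESC) AS amount_rank
    FROM orders
    ORDER BY amount_rank
""").fetchall()

conn.close()
```

The customer with no orders still appears in `left_join_rows` with `None` as the amount, whereas an INNER JOIN would drop that row. `ranked_rows` keeps all three orders and adds a rank to each.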
Definitions inspired by the visual approach in Raymond Lim’s 101 Data Science Drawings.