Applications in Practice

How the concepts from 101 Data Science Drawings show up in real-world data science work — from job interviews to production systems.

Cross-validation

What it is: Splitting your dataset into K folds, then training on K-1 folds and testing on the remaining one. Repeat K times so that each fold serves as the test set exactly once, then average the results.

Where you’ll use it:

Comparing models before selecting one for production
Tuning hyperparameters (e.g., regularisation strength, tree depth)
Estimating how well a model will generalise to unseen data

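Common pitfall: Running feature selection or fitted preprocessing (e.g., scaling, imputation) on the full dataset before cross-validating leaks information from the test folds into training. Always cross-validate the entire pipeline, so that every fitted step is re-fit on the training folds only.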
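The sketch below illustrates the pitfall using only the Python standard library. The helper names (k_fold_splits, fit_pipeline, cross_validate) and the toy data are hypothetical; the "pipeline" is assumed to be standardisation followed by a one-feature least-squares line. The point is that both steps are fitted inside each fold, on training rows only. In practice you would usually shuffle the rows first and use a library implementation.

```python
from statistics import mean, pstdev


def k_fold_splits(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds.
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


def fit_pipeline(xs, ys):
    """Fit scaling plus a least-squares line, using the given rows only."""
    mu, sd = mean(xs), pstdev(xs)            # scaling parameters from training rows
    zs = [(x - mu) / sd for x in xs]
    z_bar, y_bar = mean(zs), mean(ys)
    sxy = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
    szz = sum((z - z_bar) ** 2 for z in zs)
    slope = sxy / szz
    intercept = y_bar - slope * z_bar
    # The returned predictor reuses the training-fold mu and sd on new data.
    return lambda x: intercept + slope * (x - mu) / sd


def cross_validate(xs, ys, k=5):
    """Return the average of the per-fold mean squared errors."""
    fold_mse = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        predict = fit_pipeline([xs[i] for i in train_idx],
                               [ys[i] for i in train_idx])
        fold_mse.append(mean([(ys[i] - predict(xs[i])) ** 2 for i in test_idx]))
    return mean(fold_mse)


# Hypothetical noiseless toy data: y = 2x + 1.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
cv_mse = cross_validate(xs, ys, k=5)
```

Because the scaling parameters are recomputed in every fold, the held-out rows never influence anything the model is trained on.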

Propensity Score Matching

What it is: A technique for estimating treatment effects in observational studies. Each unit gets a “propensity score” — the predicted probability of receiving treatment — and treated units are matched to control units with similar scores.

Where you’ll use it:

Evaluating programme impact when randomised trials aren’t feasible
Health policy research (e.g., comparing outcomes for patients who received an intervention vs. those who didn’t)
Development economics — estimating the effect of a cash transfer programme

Common pitfall: PSM only controls for observed confounders. Unobserved differences between groups can still bias results.
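As a rough illustration of the matching step, the sketch below assumes the propensity scores have already been estimated (for example with a logistic regression) and pairs each treated unit with its nearest unused control. The function names, the caliper of 0.05, and the data are all hypothetical. Greedy 1:1 matching without replacement is only one of several matching schemes.

```python
def match_on_propensity(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    treated, controls: lists of (unit_id, propensity_score, outcome).
    Treated units with no control within the caliper stay unmatched.
    """
    available = list(controls)
    pairs = []
    for unit in sorted(treated, key=lambda u: u[1]):
        if not available:
            break
        nearest = min(available, key=lambda c: abs(c[1] - unit[1]))
        if abs(nearest[1] - unit[1]) <= caliper:
            pairs.append((unit, nearest))
            available.remove(nearest)   # each control is used at most once
    return pairs


def effect_on_matched_treated(pairs):
    """Mean outcome difference (treated minus control) over matched pairs."""
    return sum(t[2] - c[2] for t, c in pairs) / len(pairs)


# Hypothetical units: (id, propensity score, outcome).
treated = [("t1", 0.80, 12.0), ("t2", 0.55, 9.0), ("t3", 0.30, 7.0)]
controls = [("c1", 0.78, 10.0), ("c2", 0.52, 8.0),
            ("c3", 0.10, 5.0), ("c4", 0.57, 8.5)]

pairs = match_on_propensity(treated, controls)
estimate = effect_on_matched_treated(pairs)
```

Here t3 has no control within the caliper and is dropped, so the estimate describes only the matched treated units. It is still subject to the pitfall above: matching balances the observed scores, not unobserved confounders.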

Lag and Lead in SQL

What it is: Window functions that access data from previous rows (LAG) or subsequent rows (LEAD) in the result set without self-joining.

Where you’ll use it:

Calculating month-over-month or year-over-year changes
Detecting trends in time-series data (e.g., “did this metric increase from last quarter?”)
Building dashboards that show deltas and growth rates

Example:

-- Month-over-month change. The first month has no previous row, so its revenue_change is NULL.
SELECT
  month,
  revenue,
  revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_change
FROM monthly_sales;
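To try the query without a database server, the sketch below runs it against an in-memory SQLite database from Python and adds a LEAD column for comparison. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The three rows of data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE monthly_sales (month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO monthly_sales VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 120.0), ("2024-03", 90.0)],  # hypothetical data
)

rows = conn.execute(
    """
    SELECT
      month,
      revenue - LAG(revenue) OVER (ORDER BY month) AS revenue_change,
      LEAD(revenue) OVER (ORDER BY month) AS next_month_revenue
    FROM monthly_sales
    ORDER BY month
    """
).fetchall()

for row in rows:
    print(row)  # NULL values come back as None
conn.close()
```

The first row has no previous month and the last row has no next month, so LAG and LEAD return NULL there. Both functions accept an optional default value if NULL is not wanted.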

Log Transformation

What it is: Applying a logarithm to right-skewed, positive data to make its distribution more symmetric (closer to normal) and to stabilise variance.

Where you’ll use it:

Modelling income, population, or any right-skewed variable
Interpreting regression coefficients as percentage changes (log-log or log-linear models)
Satisfying normality assumptions in statistical tests

Common pitfall: Log of zero is undefined, and so is the log of a negative number. Handle zeros before transforming (e.g., log(x + 1)), and be explicit about which base you’re using.
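A small sketch of the zero-handling advice, using Python's math module. The income figures are invented. math.log1p(x) computes the natural log of (1 + x), and math.expm1 inverts it, so values can be mapped back to the original scale.

```python
import math

incomes = [0.0, 1_000.0, 25_000.0, 1_000_000.0]  # hypothetical, right-skewed

# Natural log of (1 + x): defined at zero, where it returns 0.0.
log_incomes = [math.log1p(x) for x in incomes]

# Invert the transform to get back to the original units.
recovered = [math.expm1(v) for v in log_incomes]

# Same idea in base 10; the two differ only by the constant factor ln(10).
log10_incomes = [math.log10(x + 1) for x in incomes]
```

Stating the base matters when someone else has to back-transform the values or interpret a coefficient, since natural-log and base-10 results differ by a factor of about 2.303.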

Feature Engineering

What it is: Creating new input variables from raw data to improve model performance.

Where you’ll use it:

Extracting day-of-week, month, or hour from timestamps
Creating interaction terms (e.g., age × income)
Encoding categorical variables (one-hot, target encoding)

Why it matters: In many real-world projects, feature engineering has more impact on model quality than algorithm choice.
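The three kinds of feature listed above can be sketched in a few lines of plain Python. The record layout, the field names (timestamp, age, income, region), and the function engineer_features are assumptions made for this example, not part of any particular library.

```python
from datetime import datetime


def engineer_features(record, categories):
    """Derive timestamp parts, an interaction term, and one-hot columns."""
    ts = datetime.fromisoformat(record["timestamp"])
    features = {
        "day_of_week": ts.weekday(),   # Monday = 0 ... Sunday = 6
        "month": ts.month,
        "hour": ts.hour,
        "age_x_income": record["age"] * record["income"],  # interaction term
    }
    # One-hot encoding: one 0/1 column per known category.
    for category in categories:
        features[f"region_{category}"] = int(record["region"] == category)
    return features


# Hypothetical raw record.
record = {
    "timestamp": "2024-03-15T14:30:00",
    "age": 30,
    "income": 50_000,
    "region": "north",
}
features = engineer_features(record, categories=["north", "south"])
```

Passing the category list explicitly keeps the output columns identical for training and prediction data, even when a batch happens not to contain every category.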

A/B Testing

What it is: A randomised experiment comparing two variants (A and B) to determine which performs better on a defined metric.

Where you’ll use it:

Product decisions (e.g., which button colour gets more clicks)
Policy pilots (e.g., do SMS reminders improve clinic attendance)
Email campaigns and marketing optimisation

Common pitfall: Stopping the test too early (“peeking”) inflates false positive rates. Define sample size and duration before starting.
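One way to fix the sample size in advance is the usual normal-approximation formula for comparing two proportions, sketched below with the standard library. The baseline and target rates, the 5% significance level, and the 80% power are illustrative assumptions. The z-test shown is meant to be run once, after the planned sample has been collected.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate units needed per variant to detect p_a vs p_b (two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Planning: detect a lift from a 10% to a 12% conversion rate (hypothetical).
n_required = sample_size_per_arm(0.10, 0.12)

# Analysis, once per test: hypothetical counts for variants A and B.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
```

Smaller expected differences need far larger samples, since the required size grows with the inverse square of the gap between the two rates.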

See key_concepts.md for definitions of the underlying terms.