Open Source · MIT License

Practice datasets for development economics

36 generators producing large, realistic datasets across every major development sector — from impact evaluation and poverty analysis to gender, climate, WASH, humanitarian response, animal welfare, governance, and more. Built for students and practitioners learning data work in global development.

36 generators · 840k+ rows per run · 1,300+ variables · 25 countries · MIT license

The Datasets

Each generator produces a large, realistic dataset with correlated variables, proper distributions, and realistic missing-data patterns.

Household Survey

~75k rows · 27 cols · LSMS-style

Multi-module household survey with demographics, per-capita consumption (log-normal), asset ownership, housing quality, food security (FIES), and subjective well-being. Individual-level (members nested in households).

Engel curves · Asset index · MNAR missingness · Intra-HH correlation

RCT Experiment

25k rows · 15 cols · Multi-arm trial

Randomized controlled trial with stratified assignment, 4 treatment arms, partial compliance, differential attrition, and spillover flags. Baseline + endline consumption and food insecurity.

Stratification · Compliance · Attrition · Spillovers · Lee bounds

Panel Data

625 rows · 21 cols · 25 countries × 25 years

Balanced country-year panel with 20+ development indicators including GDP, life expectancy, mortality, enrollment, fertility, poverty, and Gini. AR(1) persistence, COVID shock, WDI-like gaps.

Cross-indicator correlation · Structural breaks · Sparse missingness

Agricultural Survey

~31k rows · 27 cols · Plot-level

Plot-level crop production data with Cobb-Douglas yields, input use (fertilizer, improved seed, irrigation), rainfall shocks, distance to market, and profit calculation.

Cobb-Douglas · Weather shocks · Technology adoption · Market access

Health & Nutrition

35k rows · 34 cols · DHS-style

Under-5 child health data with WHO z-scores (HAZ/WAZ/WHZ), vaccination schedules, maternal health (ANC, delivery), feeding practices, morbidity, and age heaping.

Anthropometrics · Vaccination · Wealth gradient · Age heaping

Education Outcomes

~30k rows · 23 cols · Multi-level

Student-level data nested in 500 schools with math/reading scores, attendance, teacher quality, school resources, SES, grade repetition, and dropout. ICC ~0.20.

Multi-level · Gender gaps · School effects · Dropout

Labor Market

40k rows · 24 cols · Labor force survey

Working-age adults with Mincer wage equation, formal/informal sector, employment status, hours, migration, remittances, social protection, and underemployment.

Mincer returns · Gender gap · Formality · Migration

Microfinance

30k rows · 28 cols · MFI loan records

Loan-level records with group/individual lending, repayment rates, default prediction, repeat borrowers, collateral, and internal credit scoring. Realistic MFI portfolio.

Group lending · Default risk · Repeat borrowers · Credit scoring

Program Targeting

20k rows · 37 cols · PMT data

Proxy means test targeting with true vs. predicted consumption, inclusion/exclusion errors, community-based targeting comparison, and categorical eligibility. Ready for targeting accuracy analysis.

PMT score · Type I/II errors · Community targeting · Benefit calc

Trade & Market Prices

~67k rows · 10 cols · 80 markets × 104 weeks

Weekly staple commodity prices across 80 markets in 16 countries with seasonality, spatial price correlation, transport cost wedges, border effects, and AR(1) persistence.

Seasonality · Spatial correlation · Border effects · Cointegration

Sector-Specific Datasets

Gender Programme

25k rows · 34 cols · Empowerment & GBV

Women's empowerment (WEAI-like), decision-making, economic empowerment, time use, GBV prevalence with underreporting, SRH indicators, and programme intervention effects.

WEAI · GBV · Decision-making · Time use · SRH

Girls' Education

20k rows · 36 cols · Enrollment & barriers

Longitudinal girls' education data with enrollment, MHM, safety (SRGBV), gendered barriers (marriage, pregnancy, cost), primary-secondary transition rates, and scholarship effects.

MHM · Dropout barriers · Safety · Transition · Scholarships

Climate & Resilience

20k rows · 41 cols · RIMA-style resilience

Climate shocks (drought, flood, cyclone), coping strategies, adaptation practices, RIMA resilience index (absorptive, adaptive, transformative capacity), and carbon footprint.

Shocks · Coping · RIMA index · Adaptation · Carbon

Agriculture & Value Chain

25k rows · 25 cols · Farm-to-market

Multi-node value chain (producer→aggregator→processor→retailer) with margins, quality grading, post-harvest losses, contract farming, and cooperative membership effects.

Value chain · Margins · PHL · Quality grading · Cooperatives

Animal Welfare

15k rows · 32 cols · Five Freedoms

Animal welfare assessment using Five Freedoms framework, body condition scoring, working animal welfare (donkeys, horses), veterinary access, rabies vaccination, and programme effects.

Five Freedoms · BCS · Working animals · Vet access · Rabies

Public Health

20k rows · 35 cols · Epidemiology & systems

Disease surveillance (malaria, TB, HIV), health facility visits, out-of-pocket spending, CHW contact, health insurance, NCD risk factors, mental health (PHQ-9), and COVID vaccination.

Disease surveillance · CHW · Insurance · Mental health · NCDs

Livelihoods

20k rows · 57 cols · Economic strengthening

Income diversification, VSLA/savings groups, enterprise development, vocational training, asset accumulation, food consumption score, financial inclusion, and youth employment.

VSLA · Enterprise · FCS · Financial inclusion · Youth

Advocacy & Rights

15k rows · 37 cols · Legal empowerment

Legal identity, land tenure, access to justice, rights awareness (CEDAW, child rights), dispute resolution, civic participation, and advocacy campaign effectiveness.

Legal aid · Land tenure · Access to justice · Civic space · RTI

WASH

18k rows · 35 cols · JMP & water quality

JMP service ladders, water quality testing (E.coli, turbidity), sanitation ladders, CLTS/ODF status, handwashing observation, MHM, and diarrhea linked to WASH conditions.

JMP ladders · E. coli · CLTS · Handwashing · Diarrhea

Humanitarian Response

18k rows · 43 cols · Emergency needs

Displacement status (IDP/refugee/host), multi-sector needs assessment, aid distribution, protection concerns (GBV, child protection), SADD, CwC, and accountability.

SADD · Displacement · Protection · Aid modality · CwC

Social Protection

20k rows · 40 cols · Cash transfers

Beneficiary registry with cash transfer disbursement, conditionality compliance, payment modalities, FCS/rCSI outcomes, asset graduation model, and dependency metrics.

Cash transfers · Conditionality · Graduation · FCS · Modalities

Governance & Accountability

15k rows · 39 cols · Service delivery

Citizen satisfaction with public services, trust in institutions, corruption experience (bribery), budget transparency, social accountability, community scorecards, and RTI.

Service delivery · Corruption · Trust · Scorecards · Budget

Cross-cutting & Methodological Datasets

Behaviour Change (KAP)

20k rows · 35 cols · BCC survey

Knowledge-Attitude-Practice surveys for health/WASH/nutrition. Campaign exposure and dose, KAP cascade (knowledge > attitude > practice), self-efficacy, and social norms.

KAP · BCC · Campaign · Self-efficacy · Norms

Cost-Effectiveness Analysis

15k rows · 30 cols · Programme costing

Programme cost data across sectors: personnel, materials, transport, overhead. Outcomes, effect sizes, ICER, DALYs averted, QALYs gained. Pilot vs. at-scale economies.

CEA · ICER · DALY · Costing · VfM

Decent Work (ILO)

25k rows · 35 cols · Labour standards

ILO decent work framework: formal/informal employment, earnings with gender pay gap, social protection, working conditions, union membership, and work-life balance.

ILO · Informal · Wages · Unions · Gender gap

Care Economy & Time Use

20k rows · 35 cols · Unpaid care

Time use diary data: paid work, unpaid care, domestic work, leisure. 3x gender care gap. Care infrastructure effects, opportunity cost, and time poverty indicators.

Time use · Care work · Gender gap · Time poverty · Infrastructure

Intersectional Inequality

25k rows · 31 cols · Caste, gender, disability

Intersectional analysis: caste, religion, disability, sexuality with socioeconomic outcomes. Multiplicative disadvantage, discrimination experience, and access to services.

Caste · Disability · Intersectionality · Discrimination · Access

Environmental Justice

20k rows · 32 cols · Pollution & health

Pollution exposure (PM2.5, indoor air, water contamination), environmental hazards, health outcomes, cooking fuel, green space access, climate vulnerability, and environmental rights.

Pollution · PM2.5 · Health · Climate · Rights

Community Development

18k rows · 36 cols · Social capital

Social capital measurement: bonding vs. bridging ties, trust, collective action, participatory governance, community assets, social cohesion, and CDD programmes.

Social capital · Trust · CDD · Participation · Cohesion

Digital Access & Literacy

20k rows · 44 cols · Digital divide

Device ownership, connectivity, digital literacy (7 hierarchical skills), social media use, misinformation exposure, privacy awareness, and barriers. Gender and age divides embedded.

Digital divide · Literacy · Misinformation · Privacy · Barriers

Social-Emotional Learning

15k rows · 33 cols · Youth SEL

CASEL framework: self-awareness, self-management, social awareness, relationship skills, responsible decision-making. Academic scores, wellbeing, bullying, and prosocial behaviour.

CASEL · Wellbeing · Bullying · Prosocial · Academics

NGO Programme Finance

10k rows · 36 cols · Budgets & donors

NGO financial management: budget breakdowns, donor types, funding gaps, burn rates, reserves, compliance audits, value-for-money scoring, and the overhead debate.

Budgets · Donors · VfM · Compliance · Overhead

Aid Effectiveness (ODA)

12k rows · 30 cols · Paris Declaration

Official Development Assistance flows: donor-recipient pairs, Paris Declaration indicators (ownership, alignment, harmonization), tied aid, fragmentation, and conditionality.

ODA · Paris · Fragmentation · Tied aid · Coordination

Media & Information Ecosystems

18k rows · 42 cols · Media development

Media access and consumption, news source trust, media literacy, development communication exposure, misinformation encounters, press freedom perceptions, and language barriers.

Media · Literacy · Misinformation · Press freedom · Dev comm

IRT Psychometric Assessment

20k rows · 132 cols · Item Response Theory

3PL IRT model with 30 items: item difficulty, discrimination, guessing. Latent ability (theta), response times, Differential Item Functioning (DIF) by gender. Full item-level data.

IRT · 3PL · DIF · Psychometrics · Response time

Field Survey Quality (Paradata)

20k rows · 41 cols · Survey QC

Survey process data: interview timing, GPS validation, response patterns (straightlining, acquiescence), back-checks, enumerator fatigue, and fabrication detection. ~5% fabricators embedded.

Paradata · Straightlining · GPS · Back-check · Fabrication

Getting Started

Generate all datasets in under 30 seconds. Requires only Python 3.8+ and four pip packages.

1. Install & Generate All

Clone the repo, install dependencies, and generate every dataset as a CSV file.

# Clone
git clone https://github.com/Varnasr/devdata-practice.git
cd devdata-practice

# Install (numpy, pandas, scipy, pyarrow)
pip install -r requirements.txt

# Generate all datasets
python generate.py

Outputs ~35 MB of CSVs to ./output/

2. Customize

Generate specific datasets, change sizes, set seeds for reproducibility, or export as Parquet.

# Generate just two datasets
python generate.py rct_experiment labor_market

# Larger datasets (override default size)
python generate.py household_survey --rows 50000

# Set seed for exact reproducibility
python generate.py --seed 42

# Export as Parquet instead of CSV
python generate.py --format parquet

# List all available generators
python generate.py --list
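
Once the generator has run, a quick sanity check is to count rows and columns in each file. The sketch below uses only the standard library and assumes the CSVs sit directly in ./output/ with a header row; the function name `summarize_csvs` is illustrative and not part of the repository.

```python
import csv
from pathlib import Path


def summarize_csvs(output_dir="output"):
    """Return {file name: (n_rows, n_columns)} for every CSV in output_dir."""
    summary = {}
    for path in sorted(Path(output_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first line is assumed to be the header
            n_rows = sum(1 for _ in reader)    # remaining lines are data rows
        summary[path.name] = (n_rows, len(header))
    return summary
```

Calling `summarize_csvs()` from the repository root and printing the result should give counts close to the row and column figures listed for each dataset, if the layout assumption holds.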

Documentation

Each dataset below is documented with its key columns, realistic features, and methodological notes.

Household Survey (LSMS-style)
~75k rows

Structure

One row per household member. ~15,000 households with 2-8 members each, expanded to ~75k individual-level rows. Household-level variables are repeated across members.

Key Columns

  • individual_id, household_id — Unique identifiers
  • country, district, urban — Geography
  • relationship, age, female, education_years — Demographics
  • monthly_pce_usd — Monthly per-capita expenditure (USD PPP, log-normal)
  • food_share — Engel curve: food expenditure as share of total
  • food_insecurity_score — FIES-like 0-8 scale
  • owns_radio through owns_improved_stove — 8 binary asset indicators
  • wall_material, rooms, water_source, toilet_type — Housing
  • life_satisfaction — Subjective well-being (1-10)

Realistic Features

  • Latent wealth factor drives correlated consumption, assets, and housing
  • Engel curve: food share declines with wealth (0.70 base, -0.08 per SD)
  • Urban premium on wealth (+0.6 SD)
  • MNAR missingness: richer households less likely to report income
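
To make the Engel-curve and urban-premium parameters concrete, here is a minimal sketch of how a latent wealth factor could drive food share. It is not the repository's generator: the urban share (35%), the noise level, and the clipping bounds are assumptions chosen for illustration, and only the 0.70 base, the -0.08 slope per SD, and the +0.6 SD urban premium come from the description above.

```python
import random
import statistics


def simulate_households(n=5000, seed=42):
    """Draw latent wealth and derive a food share that declines with wealth."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        urban = rng.random() < 0.35                               # assumed urban share
        wealth = rng.gauss(0.0, 1.0) + (0.6 if urban else 0.0)    # urban premium of +0.6 SD
        noise = rng.gauss(0.0, 0.05)                              # assumed idiosyncratic noise
        food_share = 0.70 - 0.08 * wealth + noise                 # Engel curve from the docs
        food_share = min(max(food_share, 0.05), 0.95)             # keep the share plausible
        rows.append({"urban": urban, "wealth": wealth, "food_share": food_share})
    return rows


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = sum((a - mean_x) ** 2 for a in x)
    return numerator / denominator
```

Regressing `food_share` on `wealth` in the simulated rows should recover a slope near -0.08. In the real dataset the wealth factor is latent, so an analyst would use an asset index or log consumption as the regressor instead.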
RCT Experiment
25k rows

Structure

One row per participant. 4 arms: control, cash transfer, cash + training, training only. Stratified by district × gender.

Key Columns

  • treatment_arm — Randomized assignment
  • actually_treated — Compliance indicator (65-85% take-up)
  • baseline_consumption_usd, endline_consumption_usd — Primary outcomes
  • baseline_food_insecurity, endline_food_insecurity — HFIAS 0-27
  • attrited — Endline attrition (differential by arm)
  • spillover_risk — Flag for control units in treated villages

Embedded Effects

  • Cash: +15%, Cash+Training: +22%, Training: +8% consumption increase
  • Heterogeneous effects: larger for women and poorer baseline
  • Attrition: 8% base + 3% higher in control + rural premium
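
With partial compliance, the intention-to-treat (ITT) effect is smaller than the effect on those actually treated. The sketch below computes the ITT and the Wald (instrumental-variable) estimate for one treatment arm against control. It takes plain lists, such as endline consumption and the `actually_treated` flag split by arm; the function name is hypothetical, and the source does not say whether the embedded percentage effects apply to assigned or to treated units.

```python
from statistics import fmean


def itt_and_wald(y_assigned, y_control, took_up_assigned, took_up_control):
    """Return (ITT, Wald estimate) for one treatment arm versus control.

    ITT  = difference in mean outcomes by assignment.
    Wald = ITT divided by the difference in take-up rates (the first stage).
    """
    itt = fmean(y_assigned) - fmean(y_control)
    first_stage = fmean(took_up_assigned) - fmean(took_up_control)
    if first_stage == 0:
        raise ValueError("no difference in take-up between arms")
    return itt, itt / first_stage


# Toy example: 3 of 4 assigned units take up and gain 15 over a baseline of 100.
itt, wald = itt_and_wald(
    y_assigned=[115, 115, 115, 100],
    y_control=[100, 100, 100, 100],
    took_up_assigned=[1, 1, 1, 0],
    took_up_control=[0, 0, 0, 0],
)
```

In the toy example the ITT is 11.25 and the Wald estimate is 15, which is the 75% take-up rate scaling the effect on compliers. Differential attrition is ignored here; in the dataset it would need separate handling, for example with Lee bounds.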
Panel Data (WDI-style)
625 rows

Structure

Balanced panel: 25 developing countries × 25 years (2000-2024). 20+ indicators per country-year.

Key Columns

  • gdp_per_capita_usd, log_gdp_per_capita — Income
  • life_expectancy, infant_mortality_per_1000, under5_mortality_per_1000 — Health
  • primary_enrollment_pct, secondary_enrollment_pct, adult_literacy_pct — Education
  • fertility_rate, electricity_access_pct, sanitation_access_pct — Development
  • poverty_headcount_215, gini_coefficient — Welfare (sparse)

Realistic Features

  • AR(1) shock persistence + country-specific growth trends
  • COVID-2020 GDP shock (-8%) and 2021 partial recovery
  • WDI-like missingness: poverty and Gini ~30% missing, GDP near-complete
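
One way to picture AR(1) persistence combined with a one-off shock is to add the 2020 shock to the persistent component, so that only part of it has faded by 2021. The sketch below does this for log GDP per capita. The starting level, trend, persistence (`rho`), and noise scale are illustrative assumptions, and the repository may implement the shock and recovery differently.

```python
import math
import random


def simulate_log_gdp(start_year=2000, end_year=2024, log_gdp0=math.log(1500.0),
                     trend=0.03, rho=0.6, sigma=0.02, covid_shock=-0.08, seed=0):
    """Return {year: log GDP per capita} with a trend plus AR(1) deviations."""
    rng = random.Random(seed)
    deviation = 0.0
    series = {}
    for t, year in enumerate(range(start_year, end_year + 1)):
        innovation = rng.gauss(0.0, sigma)
        if year == 2020:
            innovation += math.log(1.0 + covid_shock)   # one-off -8% level shock
        deviation = rho * deviation + innovation         # AR(1): shocks decay at rate rho
        series[year] = log_gdp0 + trend * t + deviation
    return series
```

With `rho=0.6`, 60% of the 2020 shock is still present in 2021, which is one reading of "partial recovery". Setting `sigma=0.0` isolates the shock path from the random noise.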
Agricultural Survey
~31k rows

Structure

One row per plot. ~15k farm households with 1-4 plots each. 7 crops, full input-output accounting.

Key Columns

  • plot_size_acres, crop, soil_quality — Plot characteristics
  • improved_seed, fertilizer_kg, irrigation, pesticide_used — Inputs
  • rainfall_deviation_sd — Weather shock (SD from normal)
  • harvest_kg, price_per_kg_usd, revenue_usd, profit_usd — Outputs
  • extension_contact, distance_to_market_km — Access

Production Function

Yields follow a Cobb-Douglas form: Y = A · L^0.50 · Lab^0.25 · F^0.15, with TFP shifters for improved seed (+15%), irrigation (+10%), and soil quality. Rainfall has an inverted-U effect.
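
The production function can be written as a short function to see how the exponents behave. This sketch assumes L is land, Lab is labor, and F is fertilizer, and that the seed and irrigation shifters multiply TFP. Soil quality and the rainfall term are left out because the source does not give their functional form.

```python
def cobb_douglas_yield(land, labor, fertilizer, tfp=1.0,
                       improved_seed=False, irrigation=False):
    """Output under Y = A * L^0.50 * Lab^0.25 * F^0.15 with TFP shifters."""
    shifter = (1.15 if improved_seed else 1.0) * (1.10 if irrigation else 1.0)
    return tfp * shifter * land ** 0.50 * labor ** 0.25 * fertilizer ** 0.15


base = cobb_douglas_yield(land=2.0, labor=10.0, fertilizer=50.0)
doubled = cobb_douglas_yield(land=4.0, labor=20.0, fertilizer=100.0)
returns_to_scale_ratio = doubled / base   # equals 2 ** 0.90, roughly 1.87
```

The exponents sum to 0.90, so doubling every input raises output by less than double. A log-log regression of harvest on the inputs should recover elasticities near 0.50, 0.25, and 0.15 if the data follow this form.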

Health & Nutrition (DHS-style)
35k rows

Structure

One row per child under 5 with linked mother characteristics. Wealth quintile drives most health gradients.

Key Columns

  • height_for_age_z, weight_for_age_z, weight_for_height_z — WHO z-scores
  • stunted, underweight, wasted — Binary flags (z < -2)
  • bcg_vaccine, dpt1_vaccine, dpt3_vaccine, measles_vaccine, fully_vaccinated
  • anc_visits, facility_delivery, skilled_birth_attendant — Maternal health
  • diarrhea_2wk, fever_2wk, cough_2wk, sought_treatment — Morbidity

Realistic Features

  • Growth faltering after 6 months (HAZ declines with age)
  • Age-appropriate vaccination (can't get measles before 9 months)
  • Age heaping at 6, 12, 24, 36, 48 months (interviewer rounding)
  • Birth weight MAR: heavier babies more likely to be weighed
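
The binary flags and the age-heaping pattern are simple to express in code. The sketch below derives the three anthropometric flags from z-scores using the z < -2 rule stated above, and shows one possible heaping mechanism in which reported ages near a heap point are rounded to it with some probability. The heaping probability and the two-month window are assumptions, not values taken from the generator.

```python
import random

HEAP_AGES = (6, 12, 24, 36, 48)   # heap points listed in the documentation


def anthro_flags(haz, waz, whz):
    """Binary malnutrition flags using the z < -2 cutoff."""
    return {"stunted": haz < -2, "underweight": waz < -2, "wasted": whz < -2}


def heap_age(age_months, rng, heap_prob=0.15, window=2):
    """Round an age to a nearby heap point with probability heap_prob."""
    for target in HEAP_AGES:
        if abs(age_months - target) <= window and rng.random() < heap_prob:
            return target
    return age_months


rng = random.Random(0)
reported = [heap_age(age, rng) for age in range(0, 60)]
```

A histogram of reported ages generated this way shows spikes at the heap points and dips on either side, which is the signature to look for in `age`-type columns of the dataset.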
Education Outcomes
~30k rows

Structure

Students nested in 500 schools (grades 3-8). Multi-level data for HLM analysis.

Key Columns

  • math_score, reading_score — 0-100 test scores
  • attendance_rate, distance_to_school_km — Student access
  • ses_score, school_type, school_meal_program — SES & resources
  • pupil_teacher_ratio, pct_trained_teachers, has_library — School quality
  • repeated_grade, dropped_out — Persistence

Embedded Effects

  • School random effect (ICC ~0.20): 20% of score variance is between-school
  • Girls outperform in reading (+2 pts), boys in math (+2 pts)
  • Students who dropped out have missing endline scores (structural missingness)
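
An ICC of about 0.20 means one fifth of the score variance lies between schools. The sketch below simulates balanced school clusters with that variance split and recovers the ICC with the one-way ANOVA estimator. The cluster size, score mean, and standard deviation are illustrative assumptions, and scores are not clipped to 0-100 here.

```python
import random
import statistics


def simulate_scores(n_schools=500, per_school=60, icc=0.20,
                    sd_total=15.0, mean=50.0, seed=1):
    """Return a list of per-school score lists with the requested ICC."""
    rng = random.Random(seed)
    sd_between = (icc ** 0.5) * sd_total           # school random effect
    sd_within = ((1.0 - icc) ** 0.5) * sd_total    # student-level residual
    schools = []
    for _ in range(n_schools):
        school_effect = rng.gauss(0.0, sd_between)
        schools.append([mean + school_effect + rng.gauss(0.0, sd_within)
                        for _ in range(per_school)])
    return schools


def anova_icc(groups):
    """One-way ANOVA estimator of the ICC for equal-sized groups."""
    k = len(groups[0])                             # students per school
    g = len(groups)                                # number of schools
    grand = statistics.fmean(x for grp in groups for x in grp)
    means = [statistics.fmean(grp) for grp in groups]
    msb = k * sum((m - grand) ** 2 for m in means) / (g - 1)
    msw = sum((x - m) ** 2 for grp, m in zip(groups, means) for x in grp) / (g * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

The real dataset has unequal school sizes, so a mixed model (for example a random-intercept HLM) is the more appropriate tool there; the ANOVA estimator is used here only because it fits in a few lines.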
Labor Market
40k rows

Structure

Working-age adults (15-64) with employment status, wages, sector, and social protection.

Wage Equation (Mincer)

ln(wage) = 1.8 + 0.08·educ + 0.04·exp - 0.0006·exp² - 0.18·female + 0.25·urban + 0.35·formal + ε

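  • Return to education of 0.08 log points per year (about 8%), concave experience profile
  • Gender wage gap of 0.18 log points, or about 16% lower wages for women (suitable for Oaxaca-Blinder decomposition)
  • Formality premium of 0.35 log points (about 42%)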
  • Employment: wage, self-employed, unpaid family, unemployed
  • Migration and remittances with realistic amounts
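
Because the wage equation is in logs, each coefficient is a log-point effect, and the exact percentage effect is exp(b) - 1. The sketch below encodes the stated coefficients, converts them to percentages, and finds the experience level at which the concave profile peaks. The function names are illustrative.

```python
import math

COEF = {"const": 1.8, "educ": 0.08, "exp": 0.04, "exp2": -0.0006,
        "female": -0.18, "urban": 0.25, "formal": 0.35}


def predicted_log_wage(educ, exp, female, urban, formal):
    """Deterministic part of the Mincer equation (error term omitted)."""
    return (COEF["const"] + COEF["educ"] * educ + COEF["exp"] * exp
            + COEF["exp2"] * exp ** 2 + COEF["female"] * female
            + COEF["urban"] * urban + COEF["formal"] * formal)


def pct_effect(coef):
    """Exact percentage change in the wage implied by a log-point coefficient."""
    return (math.exp(coef) - 1.0) * 100.0


def peak_experience():
    """Years of experience where the quadratic profile reaches its maximum."""
    return -COEF["exp"] / (2.0 * COEF["exp2"])
```

Here `pct_effect(COEF["female"])` is about -16.5, `pct_effect(COEF["formal"])` is about 41.9, and `peak_experience()` is about 33.3 years. The gap between 0.18 log points and 16.5% is a useful reminder when interpreting regression output from this dataset.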
Microfinance
30k rows

Structure

Loan-level records from a microfinance institution. Group and individual lending products.

Key Columns

  • loan_amount_usd, interest_rate_annual_pct, term_months — Terms
  • loan_product, loan_purpose — Product classification
  • cycle_number — Repeat borrower indicator (graduation)
  • defaulted, days_past_due, repayment_rate — Performance
  • internal_credit_score — 300-850 score

Default Model

Default probability via logistic: higher for larger loans, consumption purpose, first-cycle borrowers, uncollateralized. Women default less. Repeat borrowers graduate to larger amounts.
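
The direction of each effect in the default model can be sketched with a logistic function. Every coefficient below is hypothetical and chosen only so the signs match the description (larger loans, consumption purpose, first cycle, and no collateral raise risk; women default less). The repository's actual values are not documented here.

```python
import math


def default_probability(loan_amount_usd, consumption_purpose, first_cycle,
                        has_collateral, female):
    """Logistic default probability with illustrative (assumed) coefficients."""
    z = (-2.5                                        # assumed baseline log-odds
         + 0.3 * math.log(loan_amount_usd / 500.0)   # larger loans are riskier
         + 0.4 * consumption_purpose                 # consumption loans are riskier
         + 0.5 * first_cycle                         # first-cycle borrowers are riskier
         - 0.6 * has_collateral                      # collateral lowers risk
         - 0.3 * female)                             # women default less
    return 1.0 / (1.0 + math.exp(-z))
```

Fitting a logistic regression of `defaulted` on these characteristics in the generated data is the natural way to check which signs and magnitudes the generator actually embeds.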

Program Targeting (PMT)
20k rows

Structure

Household-level data with both true consumption (survey) and PMT-predicted consumption. Built for targeting accuracy analysis.

Key Columns

  • true_monthly_pce_usd — Actual per-capita consumption (with residual noise)
  • pmt_predicted_pce_usd — Fitted values from proxy means formula
  • truly_poor, pmt_classified_poor — Binary poverty flags
  • exclusion_error, inclusion_error — Targeting mistakes
  • community_selected — Community-based targeting comparison
  • monthly_benefit_usd — Calculated transfer amount

Why It's Useful

The gap between true and predicted consumption creates realistic targeting errors. Students can compare PMT, community, and categorical targeting; compute leakage and undercoverage; and simulate benefit reforms.
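
Targeting accuracy reduces to a two-by-two comparison of true and predicted poverty status. The sketch below counts exclusion and inclusion errors and derives undercoverage (share of the truly poor who are missed) and leakage (share of selected households that are not poor). It takes two parallel lists of flags, such as `truly_poor` and `pmt_classified_poor`; the function name is illustrative.

```python
def targeting_errors(truly_poor, classified_poor):
    """Count targeting mistakes and compute undercoverage and leakage rates."""
    pairs = list(zip(truly_poor, classified_poor))
    n_poor = sum(1 for poor, _ in pairs if poor)
    n_selected = sum(1 for _, selected in pairs if selected)
    exclusion = sum(1 for poor, selected in pairs if poor and not selected)
    inclusion = sum(1 for poor, selected in pairs if selected and not poor)
    return {
        "exclusion_errors": exclusion,
        "inclusion_errors": inclusion,
        "undercoverage": exclusion / n_poor if n_poor else 0.0,
        "leakage": inclusion / n_selected if n_selected else 0.0,
    }


result = targeting_errors(
    truly_poor=[1, 1, 1, 0, 0, 0, 0, 0],
    classified_poor=[1, 1, 0, 1, 0, 0, 0, 0],
)
```

In the toy example one poor household is missed and one non-poor household is selected, so both undercoverage and leakage equal one third. Passing `community_selected` as the second argument gives the same metrics for community-based targeting.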

Trade & Market Prices
~67k rows

Structure

80 markets × 104 weeks × 8 staple commodities. 2 years of weekly price data across 16 countries.

Key Columns

  • market, country, date, commodity — Identifiers
  • price_per_kg_usd — Commodity price with seasonality
  • volume_traded_kg — Market volume
  • transport_cost_pct — Distance-based cost wedge

Price Dynamics

  • Seasonal cycles: lean-season highs, post-harvest lows
  • AR(1) persistence in price levels
  • Spatial correlation: nearby markets move together
  • Border effect: +8% premium for cross-country market pairs
  • Suitable for cointegration / market integration analysis

Sector-Specific Datasets

Gender Programme
25k rows

Structure

One row per woman/girl. 25,000 individuals across programme and non-programme areas with empowerment, GBV, and SRH indicators.

Key Columns

  • individual_id, household_id, country, district — Identifiers
  • programme_participant, programme_type — Cash transfers, skills training, savings groups, awareness, legal aid
  • decides_own_healthcare through decides_own_earnings — 5 WEAI decision-making domains
  • decision_making_score, empowerment_index — Composite empowerment scores
  • owns_land, owns_house, has_bank_account, has_mobile_money — Economic empowerment
  • care_work_hours_day, productive_work_hours_day, leisure_hours_day — Time use diary
  • reported_physical_gbv, reported_emotional_gbv, reported_economic_gbv — GBV with underreporting
  • using_modern_contraception, unmet_need_family_planning — SRH indicators

Realistic Features

  • WEAI-like empowerment domains with intra-household bargaining
  • GBV prevalence modelled with 40–60% underreporting of true cases
  • Time-use data summing to realistic daily totals
  • Programme effects vary by type (cash vs. training vs. awareness)
Girls’ Education
20k rows

Structure

One row per girl (grades 1–12). Tracks enrollment, learning outcomes, safety, menstrual hygiene, and dropout barriers.

Key Columns

  • girl_id, grade, age, wealth_quintile — Identifiers & demographics
  • enrolled, attendance_rate, dropped_out — Enrollment status
  • math_score, literacy_score — Learning outcomes (0–100)
  • barrier_marriage, barrier_pregnancy, barrier_cost, barrier_distance — Dropout barriers
  • has_menstruated, mhm_knowledge, has_sanitary_products, missed_school_menstruation — MHM
  • srgbv_experienced, feels_safe_route_to_school — Safety indicators
  • receives_scholarship, receives_school_meals — Programme support
  • at_primary_secondary_transition, transitioned_to_secondary — Transition tracking

Realistic Features

  • Gendered dropout barriers: early marriage, pregnancy, household chores, cost, distance
  • MHM affects attendance — girls without sanitary products miss more school
  • School-related GBV (SRGBV) linked to dropout
  • Primary–secondary transition rates driven by scholarships and parental attitudes
Climate & Resilience
20k rows

Structure

One row per household. Climate shock exposure, coping strategies, adaptation practices, and RIMA-like resilience indices.

Key Columns

  • agroecological_zone — Arid, semi-arid, sub-humid, humid, highland
  • experienced_drought, experienced_flood, experienced_cyclone, experienced_pest_outbreak — Shock exposure
  • crop_loss_pct, livestock_loss_pct, income_loss_pct — Shock losses
  • cs_reduced_meals, cs_sold_assets, cs_borrowed_money, cs_migration — Coping strategies
  • adopted_drought_resistant_crop, adopted_irrigation, has_crop_insurance — Adaptation
  • absorptive_capacity, adaptive_capacity, transformative_capacity — RIMA pillars
  • resilience_index — Composite resilience score
  • carbon_footprint_tco2_yr — Household emissions proxy

Realistic Features

  • Shock exposure varies by agro-ecological zone (drought in arid, floods in humid)
  • RIMA-like resilience measurement: absorptive, adaptive, transformative pillars
  • Coping strategy severity ladder from consumption smoothing to asset depletion
  • Early warning system access improves preparedness outcomes
Agriculture & Value Chain
25k rows

Structure

One row per transaction across a 4-node value chain: producer → aggregator → processor → retailer. 8 commodities.

Key Columns

  • chain_node — Position in value chain (producer/aggregator/processor/retailer)
  • commodity — Maize, coffee, dairy, poultry, horticulture, rice, groundnuts, cassava
  • quality_grade — A/B/C affecting price premiums
  • volume_kg, price_per_kg_usd, revenue_usd — Transaction values
  • total_cost_usd, margin_usd, margin_pct — Profitability
  • post_harvest_loss_pct — Loss rates by chain node
  • in_cooperative, has_contract, has_certification — Market linkage
  • buyer_type, season — Market context

Realistic Features

  • Value addition markups: 15% aggregator, 45% processor, 80% retailer
  • Quality grading premiums: A = +25%, C = −25%
  • Post-harvest losses decline along the chain (30% farm → 8% retail)
  • Contract farming and cooperative membership yield price premiums
Animal Welfare
15k rows

Structure

One row per animal/household. Covers livestock, working animals, and companion animals with Five Freedoms welfare assessment.

Key Columns

  • animal_type — Cattle, goats, sheep, poultry, donkey, horse, pig, camel, dog, cat
  • freedom_hunger_thirst through freedom_fear_distress — Five Freedoms (1–5 each)
  • welfare_score_avg, body_condition_score, shelter_score — Composite welfare
  • distance_to_vet_km, accessed_vet_last_year, vaccinated, dewormed — Veterinary access
  • working_hours_daily, working_has_wounds, working_proper_harness — Working animal indicators
  • companion_rabies_vaccinated, companion_sterilized — Companion animal health
  • hh_consumes_animal_source_food — Nutrition linkage

Realistic Features

  • Five Freedoms framework with inter-correlated domain scores
  • Body Condition Score (1–5) driven by wealth and training
  • Working animal data (donkeys, horses, camels) with wound prevalence and harness quality
  • Rabies vaccination coverage for companion animals
Public Health & Epidemiology
20k rows

Structure

One row per individual. Disease surveillance, health facility use, insurance, NCDs, mental health (PHQ-9), and COVID-19 vaccination.

Key Columns

  • malaria_tested, malaria_rdt_positive — Malaria RDT cascade
  • tb_ever_diagnosed, tb_on_treatment — TB cascade
  • hiv_tested_ever, hiv_positive, hiv_on_art — HIV cascade
  • facility_visits_12m, oop_health_spending_usd, catastrophic_health_expenditure — Utilization
  • chw_contact_6m, chw_referred, chw_referral_completed — CHW referral cascade
  • health_insurance_type — None, CBHI, NHIF, private, employer
  • hypertension_diagnosed, hypertension_controlled, diabetes_diagnosed — NCD screening
  • phq9_score, depression_moderate, depression_severe — Mental health
  • covid_vaccine_doses — 0/1/2/booster doses with wealth gradient

Realistic Features

  • Disease cascades (tested → diagnosed → treated) with realistic drop-offs
  • Catastrophic health expenditure flag (>10% of household consumption)
  • PHQ-9 depression score (0–27) correlated with poverty and gender
  • CHW referral completion rates driven by distance and wealth
Livelihoods & Economic Strengthening
20k rows

Structure

One row per household. Three treatment arms (control, livelihoods only, livelihoods + savings). Covers income, VSLA, training, assets, and food security.

Key Columns

  • treatment_arm — Control, livelihoods_only, livelihoods_plus_savings
  • primary_income_source, n_income_sources, income_diversification_index — Income
  • owns_enterprise, enterprise_type, enterprise_monthly_revenue_usd — Enterprise
  • vsla_member, vsla_savings_usd, vsla_shareout_usd, vsla_loan_usd — Savings groups
  • received_vocational_training, training_type, completed_apprenticeship — Skills
  • baseline_asset_index, endline_asset_index — 7-asset accumulation
  • food_consumption_score, fcs_category, reduced_coping_strategies_index — Food security
  • has_mobile_money, has_bank_account, accessed_credit_12m — Financial inclusion

Realistic Features

  • Shannon-like income diversification index
  • VSLA cycle: savings → share-out → borrowing with realistic interest
  • Baseline–endline asset change driven by treatment arm
  • Youth (15–35) employment, NEET, and training indicators
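
A Shannon-type diversification index rises with both the number of income sources and how evenly income is spread across them. The sketch below uses the standard Shannon entropy of income shares; the source only says "Shannon-like", so the exact formula used for `income_diversification_index` is an assumption.

```python
import math


def shannon_diversification(incomes):
    """Shannon entropy of income shares: 0 for one source, ln(n) for n equal sources."""
    positive = [x for x in incomes if x > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    return -sum((x / total) * math.log(x / total) for x in positive)


single_source = shannon_diversification([120.0])
four_equal = shannon_diversification([30.0, 30.0, 30.0, 30.0])
```

A household with one source scores 0 and a household with four equal sources scores ln(4), about 1.39. Dividing by ln(n) gives a normalized 0 to 1 version if sources differ in number across households.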
Advocacy, Rights & Legal Empowerment
15k rows

Structure

One row per individual. Legal identity, land tenure, dispute resolution, rights awareness, civic participation, and advocacy campaigns.

Key Columns

  • has_birth_certificate, has_national_id — Legal identity documentation
  • owns_land, has_land_title, land_dispute_experienced — Land tenure security
  • rights_awareness_score, knows_cedaw, knows_child_rights — Rights knowledge
  • experienced_dispute, dispute_type, resolution_mechanism, dispute_resolved — Justice
  • barrier_cost, barrier_distance, barrier_fear, barrier_distrust — Justice barriers
  • voted_last_election, attended_community_meeting, feels_can_influence_decisions — Civic participation
  • exposed_to_advocacy_campaign, campaign_channel, changed_behavior_post_campaign — Campaigns

Realistic Features

  • Legal identity gaps driven by wealth and rural status
  • Dispute resolution pathways: formal courts, customary leaders, legal aid, mediation
  • Justice barriers — cost, distance, fear, distrust — vary by gender and wealth
  • Advocacy campaign reach and self-reported behavior change
WASH (Water, Sanitation & Hygiene)
18k rows

Structure

One row per household. JMP service ladders for water, sanitation, and hygiene. Water quality testing, CLTS, MHM, and school WASH.

Key Columns

  • water_source, water_improved, jmp_water_service_level — JMP water ladder
  • ecoli_cfu_100ml, ecoli_risk_category, turbidity_ntu — Water quality testing
  • liters_per_person_day, sufficient_water_15lpd — Water quantity (Sphere)
  • sanitation_facility, jmp_sanitation_service_level, open_defecation — JMP sanitation
  • clts_triggered, community_odf_declared, odf_slippage — CLTS programme
  • hw_water_and_soap, hygiene_service_level — Observed handwashing
  • mhm_private_space, mhm_materials_available — Menstrual hygiene
  • child_diarrhea_2wk, diarrhea_ors_used, diarrhea_zinc_used — Child health
  • school_separate_toilets_girls, school_has_mhm_facility, school_pupil_toilet_ratio — School WASH

Realistic Features

  • JMP service ladders: safely managed → basic → limited → unimproved → surface water / open defecation
  • E. coli and turbidity correlated with source type
  • Handwashing observation vs. self-reported discrepancy (social desirability bias)
  • Child diarrhoea prevalence linked to WASH conditions via logistic model
  • CLTS triggering → ODF declaration → verification → slippage cascade
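
The JMP drinking-water ladder can be expressed as a small classifier. The sketch below follows the standard JMP definitions (safely managed requires an improved source on premises, available when needed, and free of contamination; basic allows a round trip of up to 30 minutes). The parameter names are hypothetical and do not correspond to specific columns in the dataset.

```python
def jmp_water_level(improved, surface_water=False, on_premises=False,
                    available_when_needed=False, free_of_contamination=False,
                    round_trip_minutes=None):
    """Classify a household's drinking-water service level on the JMP ladder."""
    if surface_water:
        return "surface water"
    if not improved:
        return "unimproved"
    if on_premises and available_when_needed and free_of_contamination:
        return "safely managed"
    # A source on premises has no collection time; otherwise use the reported trip.
    trip = 0 if on_premises else round_trip_minutes
    if trip is not None and trip <= 30:
        return "basic"
    # Improved source with a long or unknown collection time.
    return "limited"
```

Comparing the output of a rule like this with `jmp_water_service_level` is a useful exercise, since it forces the analyst to find which columns carry accessibility, availability, and quality information.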
Humanitarian & Disaster Response
18k rows

Structure

One row per individual with SADD (Sex and Age Disaggregated Data). Displacement, multi-sector needs, aid distribution, protection, and accountability.

Key Columns

  • displacement_status — IDP, refugee, returnee, host community
  • crisis_type, months_displaced, times_displaced — Displacement profile
  • need_food through need_livelihoods — 7 sector needs scored 0–5 (JIAF-like)
  • overall_severity, people_in_need — Composite need assessment
  • meets_sphere_water, meets_sphere_shelter, meets_sphere_food — Sphere standards
  • received_aid, aid_modality, aid_amount_usd — Aid distribution
  • gbv_risk_reported, child_protection_concern, mine_uxo_awareness — Protection
  • knows_feedback_mechanism, filed_complaint, complaint_resolved — Accountability
  • movement_intention — Stay, return, relocate, seek asylum, undecided

Realistic Features

  • SADD throughout: 5 standard age groups with sex disaggregation
  • Multi-sector severity scoring following JIAF methodology
  • Sphere minimum standards compliance (water 15 L/p/d, shelter 3.5 m²/p, food 2100 kcal/p/d)
  • Communication with Communities (CwC): information access, preferred channels, feedback loops
  • Vulnerability markers: unaccompanied minors, pregnant/lactating, disability, elderly alone
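
The Sphere thresholds quoted above translate directly into compliance flags. The sketch below checks the three minimums and returns flags named after the `meets_sphere_*` columns. The input parameter names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
SPHERE_MINIMUMS = {
    "water_lpd": 15.0,        # litres per person per day
    "shelter_m2_pp": 3.5,     # covered square metres per person
    "food_kcal_pd": 2100.0,   # kilocalories per person per day
}


def sphere_compliance(water_lpd, shelter_m2_pp, food_kcal_pd):
    """Return the three Sphere compliance flags for one household or individual."""
    return {
        "meets_sphere_water": water_lpd >= SPHERE_MINIMUMS["water_lpd"],
        "meets_sphere_shelter": shelter_m2_pp >= SPHERE_MINIMUMS["shelter_m2_pp"],
        "meets_sphere_food": food_kcal_pd >= SPHERE_MINIMUMS["food_kcal_pd"],
    }


flags = sphere_compliance(water_lpd=12.0, shelter_m2_pp=3.5, food_kcal_pd=2300.0)
```

In this example the water flag is false while the shelter and food flags are true. Cross-tabulating such flags by `displacement_status` is a typical first cut of a needs analysis.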
Social Protection & Cash Transfers
20k rows

Structure

One row per beneficiary household. Programme registry with transfer tracking, conditionality compliance, and graduation model.

Key Columns

  • programme_type — Unconditional cash, conditional cash, public works, cash plus, food vouchers, school feeding
  • transfer_modality — Mobile money, cash-in-hand, bank transfer, voucher, in-kind
  • monthly_transfer_usd, total_received_usd, pct_payments_received — Transfer tracking
  • has_conditionality, conditionality_compliant — Compliance monitoring
  • baseline_consumption_usd, endline_consumption_usd — Impact measurement
  • fcs_baseline, fcs_endline, rcsi_baseline, rcsi_endline — Food security change
  • graduation_score, graduated, would_cope_without_transfer — Graduation model

Realistic Features

  • Graduation model with thresholds on food security, assets, and savings
  • Payment regularity metrics (% received, delays) affecting outcomes
  • Baseline → endline change in consumption, food security, and asset accumulation
  • Dependency indicator: “would cope without transfer”
Governance & Accountability
15k rows

Structure

One row per citizen. Service delivery satisfaction, trust in institutions, corruption experience, budget transparency, and social accountability.

Key Columns

  • satisfaction_health through satisfaction_police, overall_service_satisfaction — Service delivery (1–5)
  • trust_local_govt through trust_ngos — Institutional trust (1–5)
  • bribery_experience, bribery_context, bribe_amount_usd, reported_bribery — Corruption
  • aware_of_local_budget, budget_literacy_score — Budget transparency
  • in_social_accountability_prog, attended_scorecard_session — Programme participation
  • scorecard_health, scorecard_education, scorecard_water — Community scorecards (0–100)
  • knows_rti_law, gets_info_radio, gets_info_social_media — Information access

Realistic Features

  • Bribery reporting rate very low (8–13%) — realistic underreporting
  • Social accountability programme effects on trust and budget awareness
  • Community scorecard scores varying across service sectors
  • MNAR missingness: bribe amounts more likely missing for larger bribes
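
To see why this kind of missingness matters, the sketch below simulates one possible MNAR mechanism: the chance that a bribe amount goes unrecorded rises with the amount itself. The lognormal parameters, the logistic form, and the function name `p_missing` are all assumptions made for illustration, not the generator's actual code.

```python
import math
import random

random.seed(0)


def p_missing(amount_usd, midpoint_usd=20.0, slope=1.0):
    """Assumed logistic missingness probability that rises with log(amount)."""
    z = slope * (math.log(amount_usd) - math.log(midpoint_usd))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical "true" bribe amounts (lognormal, median 20 USD).
true_amounts = [random.lognormvariate(math.log(20.0), 0.8) for _ in range(5000)]

# A value is observed only if it is not dropped by the MNAR mechanism.
observed = [a for a in true_amounts if random.random() > p_missing(a)]

true_mean = sum(true_amounts) / len(true_amounts)
observed_mean = sum(observed) / len(observed)
print(round(true_mean, 2), round(observed_mean, 2))
```

Because larger amounts are dropped more often, the mean of the observed values understates the true mean, which is why a complete-case analysis of `bribe_amount_usd` would be biased downward under a mechanism like this.
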

Cross-cutting & Methodological Datasets

Behaviour Change (KAP)
20k rows

Structure

One row per individual. Knowledge-Attitude-Practice survey for health, WASH, and nutrition behaviours with campaign exposure tracking.

Key Columns

  • exposed_to_campaign, campaign_type, campaign_doses — Campaign exposure (radio, community drama, peer education, SMS, poster, social media)
  • knowledge_score, knows_handwashing_times, knows_ors_for_diarrhea, knows_exclusive_breastfeeding — Knowledge (0–10)
  • attitude_score, approves_family_planning, gender_equitable_attitude, stigma_hiv — Attitudes (0–10)
  • practice_score, practices_handwashing, uses_treated_water, uses_mosquito_net — Practice (0–10)
  • knowledge_practice_gap, attitude_practice_gap — KAP cascade gaps
  • self_efficacy_score, perceives_community_support, discussed_with_peers — Social norms

Realistic Features

  • KAP cascade: campaigns improve knowledge most, attitudes less, and practice least (realistic drop-off)
  • Campaign dose-response: more exposures yield stronger effects
  • Self-efficacy mediates the attitude–practice gap
Cost-Effectiveness Analysis
15k rows

Structure

One row per programme. Cost breakdowns, beneficiary data, outcomes, and CEA metrics across health, education, nutrition, WASH, livelihoods, and social protection sectors.

Key Columns

  • sector, programme_type, implementer_type — Programme classification
  • total_cost_usd, personnel_cost_usd, materials_cost_usd, overhead_cost_usd — Cost breakdown
  • cost_per_beneficiary_usd, cost_per_outcome_usd, overhead_ratio, personnel_ratio — Cost ratios
  • effect_size, icer, daly_averted, qaly_gained — CEA metrics
  • is_pilot — Pilot vs. at-scale with economies of scale

Realistic Features

  • Lognormal cost distributions; personnel 40–65%, overhead 8–25%
  • DALYs and QALYs only for health sector; ICER for all
  • Economies of scale: at-scale programmes have lower unit costs
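
For readers new to the `icer` column, the incremental cost-effectiveness ratio is the extra cost divided by the extra effect relative to a comparator. The sketch below uses the standard definition with made-up numbers; how the generator chooses the comparator is not stated here, so treat the inputs as hypothetical.

```python
def icer(cost_new, cost_ref, effect_new, effect_ref):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    delta_effect = effect_new - effect_ref
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the two effects are equal")
    return (cost_new - cost_ref) / delta_effect


# Hypothetical programme vs. comparator: costs in USD, effects in DALYs averted.
print(icer(cost_new=120_000, cost_ref=80_000, effect_new=500, effect_ref=300))
# prints 200.0 (USD per additional DALY averted)
```
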
Decent Work (ILO)
25k rows

Structure

One row per worker. ILO decent work framework covering formal/informal employment, earnings, social protection, working conditions, freedom of association, and the informal economy.

Key Columns

  • employment_status — Formal wage, informal wage, self-employed, casual daily, unpaid family
  • monthly_earnings_usd, hourly_wage_usd, below_minimum_wage — Earnings with embedded gender gap (~0.82 ratio)
  • has_written_contract, has_social_security, has_health_insurance, has_pension — Social protection
  • occupational_safety_training, experienced_injury_12m, workplace_harassment — Conditions
  • member_of_union, freedom_of_association, collective_bargaining_covered — Labour rights
  • operates_without_registration, no_bookkeeping — Informality indicators

Realistic Features

  • Gender pay gap embedded in earnings (female/male ratio ~0.82)
  • Social protection strongly linked to formal employment status
  • Occupational segregation: different sector distributions by gender
Care Economy & Time Use
20k rows

Structure

One row per individual (mixed gender). Time use diary data with care breakdown, care infrastructure, opportunity cost, and time poverty indicators.

Key Columns

  • sleep_hours, paid_work_hours, unpaid_care_hours, domestic_work_hours, leisure_hours — Time diary (~24h/day)
  • childcare_hours, eldercare_hours, cooking_hours, water_collection_hours — Care breakdown
  • has_childcare_access, has_electricity, has_improved_cookstove — Care infrastructure
  • forgone_earnings_usd, reduced_labor_participation — Opportunity cost
  • time_poor — Flag for >10.5 hours/day on paid + unpaid work

Realistic Features

  • Women do ~3.4x as much unpaid care as men (4.3h vs. 1.3h daily)
  • Care infrastructure reduces care burden (water access, cookstoves, childcare)
  • Time poverty at ~19.5% of population, higher for women
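
The `time_poor` flag described above can be reproduced with a one-line rule. This sketch assumes that "unpaid work" means `unpaid_care_hours` plus `domestic_work_hours`; the source gives only the 10.5-hour threshold, so the exact components are an assumption.

```python
TIME_POVERTY_THRESHOLD_HOURS = 10.5  # threshold quoted in the column list above


def is_time_poor(paid_work_hours, unpaid_care_hours, domestic_work_hours):
    """Flag a person whose paid + unpaid work exceeds the daily threshold.

    Treating unpaid work as care + domestic hours is an assumption.
    """
    total_work = paid_work_hours + unpaid_care_hours + domestic_work_hours
    return total_work > TIME_POVERTY_THRESHOLD_HOURS


print(is_time_poor(8.0, 2.0, 1.0))  # 11.0 hours of work
print(is_time_poor(6.0, 2.0, 2.0))  # 10.0 hours of work
```
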
Intersectional Inequality
25k rows

Structure

One row per individual. Multiple identity dimensions (caste, religion, disability, sexuality) with socioeconomic outcomes showing multiplicative intersectional disadvantage.

Key Columns

  • caste_category (general/OBC/SC/ST), religion, has_disability, disability_type, sexual_minority, indigenous
  • monthly_income_usd, employed, housing_quality_score, food_security_score — Outcomes
  • experienced_discrimination, discrimination_basis, discrimination_context — Discrimination
  • accessed_education through accessed_justice — Service access

Realistic Features

  • Multiplicative disadvantage: SC + female + disability worse than sum of individual effects
  • Intersectional penalty increases with each additional axis of marginalisation
  • Discrimination basis and context vary by identity combination
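
One simple way to check for the "worse than the sum" pattern is to compare the combined income gap with the sum of the single-axis gaps. The sketch below does this for two axes only, using invented group means; in practice the same idea is usually tested with interaction terms in a regression.

```python
# Hypothetical mean monthly incomes (USD) by group, for illustration only.
mean_income = {
    ("general", "male"): 100.0,
    ("SC", "male"): 85.0,
    ("general", "female"): 80.0,
    ("SC", "female"): 55.0,
}

reference = mean_income[("general", "male")]
caste_gap = reference - mean_income[("SC", "male")]
gender_gap = reference - mean_income[("general", "female")]
combined_gap = reference - mean_income[("SC", "female")]

# A positive excess means the combined penalty exceeds the sum of the
# single-axis gaps, i.e. the disadvantage is more than additive.
excess_penalty = combined_gap - (caste_gap + gender_gap)
print(excess_penalty)
```
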
Environmental Justice
20k rows

Structure

One row per household. Pollution exposure, environmental hazards, health outcomes, cooking fuel, green space, climate vulnerability, and environmental governance.

Key Columns

  • air_quality_pm25, indoor_air_pollution, water_contamination_score, noise_pollution_level — Pollution
  • proximity_to_industrial_site_km, proximity_to_waste_dump_km, flood_risk_zone — Hazards
  • respiratory_illness_12m, waterborne_illness_12m, child_blood_lead_elevated — Health outcomes
  • cooking_fuel_type, cooking_location — Indoor air quality determinant
  • carbon_footprint_tco2, climate_vulnerability_score — Climate justice

Realistic Features

  • Environmental racism/classism: poorer communities have higher pollution exposure
  • Health outcomes linked to pollution load via logistic model
  • Indoor air pollution driven by cooking fuel type (firewood > LPG)
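
A logistic link between pollution and illness might look like the sketch below. The coefficients and the choice of predictors are hypothetical, chosen only to show the shape of such a model, and are not the generator's parameters.

```python
import math


def illness_probability(pm25, cooks_with_firewood,
                        b0=-3.0, b_pm25=0.02, b_firewood=0.8):
    """Assumed logistic model: P(illness) = sigmoid(b0 + b1*PM2.5 + b2*firewood)."""
    z = b0 + b_pm25 * pm25 + b_firewood * (1 if cooks_with_firewood else 0)
    return 1.0 / (1.0 + math.exp(-z))


# Higher PM2.5 and firewood cooking both raise the predicted probability.
for pm25 in (10, 50, 100):
    print(pm25,
          round(illness_probability(pm25, False), 3),
          round(illness_probability(pm25, True), 3))
```
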
Community Development
18k rows

Structure

One row per individual. Social capital measurement with bonding vs. bridging ties, collective action, participatory governance, and community-driven development.

Key Columns

  • n_group_memberships, primary_group_type — Group membership
  • bonding_social_capital_score, bridging_social_capital_score — Bonding vs. bridging
  • trust_neighbors, trust_strangers, trust_local_leaders — Trust (1–5)
  • participated_in_collective_action, collective_action_type — Collective action
  • attended_village_assembly, voiced_opinion_in_meeting — Participatory governance
  • in_cdd_programme, contributed_to_project, satisfied_with_project — CDD

Realistic Features

  • Bonding social capital higher in rural areas; bridging higher for educated
  • Free-rider perception inversely related to trust
  • CDD programme effects on participation and community asset satisfaction
Digital Access & Literacy
20k rows

Structure

One row per individual. Device ownership, connectivity, 7-level digital skills hierarchy, usage patterns, misinformation, privacy, and barriers to access.

Key Columns

  • owns_smartphone, owns_computer, shared_device_only — Device access
  • has_internet_access, internet_type, monthly_data_cost_usd — Connectivity
  • can_make_call through can_use_govt_services_online — 7 hierarchical skills
  • digital_literacy_score — Composite (0–10)
  • encountered_misinformation, can_identify_misinformation, shared_unverified_info
  • barrier_cost, barrier_literacy, barrier_language, barrier_infrastructure

Realistic Features

  • Gender digital divide: women have lower access and literacy scores
  • Age divide: youth more digitally literate, elderly less connected
  • Digital literacy is hierarchical: basic skills prerequisite for advanced ones
Social-Emotional Learning
15k rows

Structure

One row per student (ages 6–18). CASEL framework with 5 SEL domains, academic outcomes, wellbeing, bullying, prosocial behaviour, and teacher/parent ratings.

Key Columns

  • self_awareness, self_management, social_awareness, relationship_skills, responsible_decision_making — CASEL domains (1–5)
  • sel_composite_score — Overall SEL (1–5)
  • math_score, reading_score, attendance_rate — Academic outcomes
  • life_satisfaction, bullying_experienced, bullying_perpetrated — Wellbeing
  • in_sel_programme, programme_duration_months, teacher_trained_in_sel — Programme

Realistic Features

  • SEL programme improves all 5 domains; bigger effects with longer duration
  • Girls score higher on social awareness and relationship skills
  • SEL composite correlates with academic performance and lower bullying
NGO Programme Finance
10k rows

Structure

One row per programme. Organisation characteristics, funding, budget breakdowns, financial health, compliance, effectiveness, and partnerships.

Key Columns

  • org_type — Local NGO, INGO, CBO, faith-based, social enterprise
  • annual_budget_usd, n_donors, primary_donor_type, funding_gap_usd — Funding
  • personnel_pct, programme_activities_pct, admin_pct, indirect_cost_pct — Budget
  • burn_rate_pct, months_of_reserves, sustainability_score — Financial health
  • audit_completed, audit_qualified, vfm_score — Compliance & VfM

Realistic Features

  • Admin costs 10–25%, with ~25% of orgs under-reporting overhead (real-world pressure)
  • Larger orgs have better compliance and financial health scores
  • Percentage of funding secured varies by donor type and org capacity
Aid Effectiveness (ODA)
12k rows

Structure

One row per aid flow (2010–2025). Donor-recipient pairs with Paris Declaration indicators, fragmentation, conditionality, and coordination.

Key Columns

  • donor_country, recipient_country, year — Flow identifiers
  • oda_amount_usd, disbursement_type, channel — Aid modality
  • country_ownership_score, alignment_with_national_plan, uses_country_systems — Paris principles
  • donor_concentration_index, is_tied_aid — Fragmentation
  • has_conditionality, conditionality_type, conditionality_met — Conditionality

Realistic Features

  • Aid flows more to poorer, larger countries (with geopolitical weighting)
  • Paris indicators improve slightly over time (2010–2025 trend)
  • Budget support improves Paris scores; humanitarian aid bypasses systems
Media & Information Ecosystems
18k rows

Structure

One row per individual. Media access, consumption, source trust, media literacy, development communication, misinformation, and press freedom perceptions.

Key Columns

  • has_radio, has_tv, has_smartphone, has_internet — Media access
  • radio_hours, tv_hours, social_media_hours — Consumption (hours/week)
  • primary_news_source, trusts_primary_source — Information sources
  • media_literacy_score, can_identify_fake_news — Media literacy (0–10)
  • exposed_to_dev_content, dev_content_topic — Development communication
  • encountered_health_misinformation, shared_misinformation — Misinformation

Realistic Features

  • Urban–rural divide: urban = more digital, rural = more radio
  • Youth lean to social media; elderly to radio and TV
  • Media literacy reduces misinformation sharing
IRT Psychometric Assessment
20k rows

Structure

One row per respondent with 30 item responses. 3-Parameter Logistic (3PL) IRT model with item difficulty, discrimination, guessing, and response times.

Key Columns

  • theta — Latent ability (standard normal, driven by education/wealth/urban)
  • item_1 through item_30 — Binary responses (0/1 correct)
  • Item parameters: difficulty (b, −3 to 3), discrimination (a, 0.5–2.5), guessing (c, ~0.25)
  • rt_item_1 through rt_item_30 — Response times (seconds, lognormal)
  • total_score, pct_correct — Test-level summaries

Realistic Features

  • 3PL model: P = c + (1−c) / (1 + exp(−a × (theta − b)))
  • Differential Item Functioning: items 5, 12, 18 favor females; items 8, 22, 27 favor males
  • Response times: slower for harder items, faster for higher-ability respondents
  • Theta–total_score correlation ~0.87
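
The 3PL formula above translates directly into code. The sketch below evaluates the response probability and draws one simulated 0/1 response per item; the item parameters in `items` are made-up values within the ranges listed above, not the generator's actual items.

```python
import math
import random


def p_correct(theta, a, b, c):
    """3PL probability: P = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical items as (discrimination a, difficulty b, guessing c).
items = [(1.2, -1.0, 0.25), (0.8, 0.0, 0.25), (2.0, 1.5, 0.25)]

random.seed(1)
theta = 0.5  # one respondent's latent ability
responses = [int(random.random() < p_correct(theta, a, b, c)) for a, b, c in items]
print(responses)

# When theta equals the item difficulty, P is halfway between c and 1.
print(p_correct(0.0, 1.2, 0.0, 0.25))  # prints 0.625
```
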
Field Survey Quality (Paradata)
20k rows

Structure

One row per survey interview. 200 enumerators with timing, GPS, response pattern, back-check, and enumerator-level quality data. ~5% of enumerators are fabricators.

Key Columns

  • interview_duration_min, travel_time_min, too_short, too_long, outside_working_hours — Timing
  • gps_latitude, gps_longitude, gps_accuracy_m, gps_suspicious — GPS validation
  • straightlining_score, acquiescence_score, digit_preference_score — Response patterns
  • back_checked, back_check_match_rate — Verification
  • enumerator_experience_months, surveys_completed_today, fatigue_flag — Enumerator
  • missing_rate, dont_know_rate, refused_rate, outlier_count — Data quality

Realistic Features

  • ~5% fabricating enumerators: short duration + straightlining + GPS issues
  • Fatigue effect: quality degrades after 6+ surveys per day
  • Suitable for training data quality auditors and building fraud detection models
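
Before building a full classifier, a simple red-flag count per interview can serve as a baseline. The cutoffs below (15 minutes, a straightlining score above 0.8 on an assumed 0 to 1 scale) are hypothetical and would need tuning against back-check results.

```python
def red_flag_count(interview, min_duration_min=15.0, straightlining_cutoff=0.8):
    """Count the fabrication warning signs present in one interview record.

    Cutoffs and the 0-1 straightlining scale are illustrative assumptions.
    """
    flags = [
        interview["interview_duration_min"] < min_duration_min,
        interview["straightlining_score"] > straightlining_cutoff,
        bool(interview["gps_suspicious"]),
    ]
    return sum(flags)


# Hypothetical interview records.
suspicious = {"interview_duration_min": 9.0, "straightlining_score": 0.95,
              "gps_suspicious": 1}
ordinary = {"interview_duration_min": 42.0, "straightlining_score": 0.30,
            "gps_suspicious": 0}
print(red_flag_count(suspicious), red_flag_count(ordinary))
```

Averaging this count by enumerator gives a quick ranking of whom to back-check first.
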

Practice Exercises

Suggested exercises for each dataset, organized by difficulty. Perfect for coursework and self-study.

Introductory

Poverty Profile

Using the household survey, calculate headcount poverty rates by district, urban/rural, and household head gender. Plot the Lorenz curve and compute the Gini coefficient.

Dataset: household_survey
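
As a starting point for this exercise, the sketch below computes a headcount ratio and a Gini coefficient from a plain list of per-capita consumption values. The poverty line and the sample values are hypothetical, and the calculation is unweighted; a real analysis would apply survey weights.

```python
def headcount_ratio(consumption, poverty_line):
    """Share of people whose consumption is below the poverty line."""
    return sum(1 for c in consumption if c < poverty_line) / len(consumption)


def gini(values):
    """Unweighted Gini coefficient from the rank-weighted sum of sorted values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-capita consumption values (USD/day).
sample = [1.0, 2.0, 3.0, 4.0]
print(headcount_ratio(sample, poverty_line=2.5))  # prints 0.5
print(gini([1, 1, 1, 1]))                         # perfect equality: 0.0
print(gini([0, 0, 0, 10]))                        # prints 0.75
```
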
Introductory

Descriptive Health Statistics

Compute stunting, underweight, and wasting prevalence by wealth quintile. Plot vaccination coverage by age. Detect the age heaping pattern.

Dataset: health_nutrition
Intermediate

ITT & LATE Estimation

Estimate intent-to-treat effects for each arm. Use compliance data to compute LATE via IV/2SLS. Test for differential attrition and compute Lee bounds.

Dataset: rct_experiment
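
With a single binary assignment and no covariates, the IV/2SLS estimate of LATE reduces to the Wald ratio: the ITT effect divided by the difference in take-up. The sketch below shows this for one treatment arm against control, using invented data; variable names are illustrative and not the dataset's column names.

```python
def mean(xs):
    return sum(xs) / len(xs)


def wald_late(assigned, took_up, outcome):
    """Return (ITT, first stage, LATE) for one arm vs. control."""
    y1 = [y for z, y in zip(assigned, outcome) if z == 1]
    y0 = [y for z, y in zip(assigned, outcome) if z == 0]
    d1 = [d for z, d in zip(assigned, took_up) if z == 1]
    d0 = [d for z, d in zip(assigned, took_up) if z == 0]
    itt = mean(y1) - mean(y0)          # effect of being assigned
    first_stage = mean(d1) - mean(d0)  # effect of assignment on take-up
    return itt, first_stage, itt / first_stage


# Hypothetical data: 4 assigned to treatment (3 comply), 4 controls.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
took_up = [1, 1, 0, 1, 0, 0, 0, 0]
outcome = [12.0, 14.0, 10.0, 12.0, 9.0, 9.0, 9.0, 9.0]

itt, first_stage, late = wald_late(assigned, took_up, outcome)
print(itt, first_stage, late)  # prints 3.0 0.75 4.0
```
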
Intermediate

Mincer Wage Regression

Estimate returns to education and experience. Perform Oaxaca-Blinder decomposition of the gender wage gap. Compare formal vs. informal sector returns.

Dataset: labor_market
Intermediate

Agricultural Productivity

Estimate the Cobb-Douglas production function. Test for technology adoption effects. Analyze how rainfall shocks affect yields and whether irrigation mitigates damage.

Dataset: agriculture
Intermediate

Multi-level Education Model

Fit a hierarchical linear model with student and school levels. Estimate the ICC. Test whether school meal programs improve test scores controlling for SES.

Dataset: education
Advanced

Targeting Accuracy Analysis

Compare PMT, community-based, and categorical targeting. Compute undercoverage, leakage, and total error. Simulate moving the poverty line and observe the error trade-off.

Dataset: targeting
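
The error measures in this exercise can be computed from two 0/1 vectors: true poverty status and selection by the targeting method. The sketch below uses the common definitions (undercoverage as the share of the poor who are excluded, leakage as the share of beneficiaries who are non-poor); taking total error as all misclassified households over the population is one convention among several. The data are invented.

```python
def targeting_errors(is_poor, is_selected):
    """Compute undercoverage, leakage, and total error from 0/1 lists."""
    pairs = list(zip(is_poor, is_selected))
    excluded_poor = sum(1 for p, s in pairs if p and not s)
    included_nonpoor = sum(1 for p, s in pairs if s and not p)
    return {
        "undercoverage": excluded_poor / sum(is_poor),
        "leakage": included_nonpoor / sum(is_selected),
        # Assumed convention: all misclassified households over the population.
        "total_error": (excluded_poor + included_nonpoor) / len(pairs),
    }


# Hypothetical households: 4 poor, 4 non-poor; 4 selected by the method.
is_poor = [1, 1, 1, 1, 0, 0, 0, 0]
is_selected = [1, 1, 1, 0, 1, 0, 0, 0]
print(targeting_errors(is_poor, is_selected))
```
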
Advanced

Credit Risk & Default Prediction

Build a logistic regression model predicting default. Compute ROC/AUC. Analyze whether group lending reduces default vs. individual lending, controlling for observables.

Dataset: microfinance
Advanced

Market Integration

Test for cointegration between market pairs using the Engle-Granger method. Test whether the law of one price holds. Analyze how transport costs and borders affect price transmission.

Dataset: trade_markets
Advanced

Growth & Development Regressions

Estimate the relationship between GDP growth and poverty reduction. Test convergence across countries. Use fixed effects to control for unobserved country heterogeneity.

Dataset: panel_data
Intermediate

Women’s Empowerment Index

Construct a WEAI-like composite index from the 5 decision-making domains. Analyze how programme type (cash vs. training vs. awareness) differentially affects empowerment. Examine GBV underreporting patterns.

Dataset: gender_programme
Intermediate

Girls’ Dropout Analysis

Build a logistic regression predicting dropout. Quantify the relative contribution of each barrier (marriage, cost, distance, MHM). Estimate scholarship programme effects on primary–secondary transition.

Dataset: girls_education
Intermediate

Climate Resilience Measurement

Reconstruct the RIMA resilience index from its three pillars. Compare resilience across agro-ecological zones. Analyze whether early warning access reduces crop losses from drought.

Dataset: climate_resilience
Intermediate

Value Chain Margin Analysis

Calculate value addition at each node. Test whether cooperative membership and quality certification improve producer margins. Analyze how post-harvest losses vary by chain node and storage type.

Dataset: agri_value_chain
Introductory

Animal Welfare Assessment

Compute mean Five Freedoms scores by animal type. Visualize how welfare training shifts body condition scores. Compare veterinary access between working and companion animals.

Dataset: animal_welfare
Intermediate

Disease Cascade Analysis

Map the testing–diagnosis–treatment cascade for malaria, TB, and HIV. Identify drop-off points. Estimate the equity gap in catastrophic health expenditure by wealth quintile.

Dataset: public_health
Intermediate

VSLA Impact Evaluation

Compare asset accumulation (baseline vs. endline) across treatment arms. Estimate the effect of savings group membership on food consumption score. Analyze financial inclusion disparities by gender.

Dataset: livelihoods
Intermediate

Access to Justice Barriers

Model the determinants of dispute resolution. Estimate how legal aid affects outcomes. Analyze whether rights awareness (CEDAW, child rights) translates into civic participation and advocacy behavior change.

Dataset: advocacy_rights
Advanced

JMP Service Ladder Analysis

Classify households across JMP water and sanitation ladders. Correlate E. coli contamination with source type. Test whether the gap between observed and self-reported handwashing indicates social desirability bias. Link WASH conditions to child diarrhoea.

Dataset: wash
Advanced

Humanitarian Needs Assessment

Construct multi-sector severity scores (JIAF-like). Profile vulnerability by displacement status. Analyze Sphere standards compliance gaps. Assess whether feedback mechanisms improve aid satisfaction.

Dataset: humanitarian
Advanced

Graduation Model Evaluation

Evaluate the social protection graduation model: which thresholds (food security, assets, savings) best predict self-sufficiency? Compare programme types. Estimate consumption smoothing effects of regular transfers.

Dataset: social_protection
Advanced

Governance & Corruption Analysis

Estimate the determinants of bribery experience. Test whether social accountability programmes improve trust and budget awareness. Analyze the gap between bribery experience and reporting rates. Build community scorecards from the data.

Dataset: governance
Intermediate

KAP Cascade Analysis

Measure the knowledge–attitude–practice cascade gap. Test whether campaign dose-response is linear. Model the role of self-efficacy in closing the attitude–practice gap.

Dataset: behaviour_change
Advanced

Cost-Effectiveness Comparison

Calculate cost-per-beneficiary and ICER across sectors. Compare pilot vs. at-scale programmes for economies of scale. Build a CEA league table ranking interventions by DALYs averted per USD.

Dataset: cost_effectiveness
Intermediate

Decent Work & Gender Pay Gap

Decompose the gender wage gap using Oaxaca-Blinder. Compare social protection coverage across formal and informal workers. Estimate the incidence of below-minimum-wage employment by sector.

Dataset: decent_work
Intermediate

Care Work Gender Gap

Calculate the gender care gap in hours/day. Estimate how care infrastructure (water access, cookstoves, childcare) reduces women’s care burden. Compute time poverty rates by gender and wealth quintile.

Dataset: care_economy
Advanced

Intersectional Disadvantage

Test for multiplicative (vs. additive) intersectional penalties on income. Compare outcomes for SC women with disabilities against single-axis disadvantage. Map service access gaps across identity combinations.

Dataset: intersectionality
Intermediate

Pollution & Health Inequality

Correlate PM2.5 exposure with wealth quintile to test environmental injustice. Model respiratory illness as a function of pollution exposure and cooking fuel. Compare green space access by income.

Dataset: environmental_justice
Introductory

Social Capital Measurement

Compare bonding vs. bridging social capital across urban/rural areas. Visualize trust levels by education. Test whether CDD programme participants report higher satisfaction and community participation.

Dataset: community_development
Intermediate

Digital Divide Analysis

Map the gender and age digital divide. Test whether digital literacy (hierarchical skills) predicts misinformation resilience. Analyze barriers to internet access by wealth and geography.

Dataset: digital_access
Introductory

SEL Programme Impact

Compare CASEL domain scores between programme and non-programme students. Test whether SEL composite score correlates with academic performance and lower bullying. Visualize gender differences in SEL domains.

Dataset: social_emotional_learning
Intermediate

NGO Financial Health

Analyze the overhead debate: do lower admin costs predict better outcomes? Compare financial health metrics across org types. Build a value-for-money composite. Identify under-reporting patterns in admin costs.

Dataset: ngo_finance
Advanced

Aid Effectiveness & Paris Principles

Analyze Paris Declaration compliance over time (2010–2025). Test whether budget support improves country ownership scores. Compute donor fragmentation (HHI) by sector. Evaluate tied aid trends.

Dataset: aid_effectiveness
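
Donor fragmentation in this exercise is measured with the Herfindahl-Hirschman Index, the sum of squared funding shares. The sketch below computes it for one recipient or sector from a list of donor amounts; the amounts are hypothetical.

```python
def hhi(amounts):
    """Herfindahl-Hirschman Index: sum of squared shares (1/n to 1)."""
    total = sum(amounts)
    return sum((a / total) ** 2 for a in amounts)


# Hypothetical ODA amounts (USD millions) from three donors in one sector.
print(round(hhi([50.0, 30.0, 20.0]), 2))  # prints 0.38
print(hhi([100.0]))                       # single donor: 1.0
```

Lower values indicate more fragmented aid, with many donors holding small shares; values near 1 indicate concentration in a single donor.
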
Introductory

Media Landscape Mapping

Profile media consumption by age group and urban/rural status. Test whether media literacy score predicts ability to identify fake news. Analyze which channels deliver development communication content most effectively.

Dataset: media_development
Advanced

IRT Model Estimation

Estimate item parameters (difficulty, discrimination) from the response data. Detect DIF items by gender. Compare 1PL, 2PL, and 3PL model fits. Analyze response time patterns by item difficulty and respondent ability.

Dataset: irt_assessment
Advanced

Survey Fraud Detection

Build a classifier to detect fabricating enumerators from paradata (duration, straightlining, GPS, back-checks). Calculate false positive and negative rates. Recommend a quality threshold for field team supervision.

Dataset: field_survey_quality