36 generators producing large, realistic datasets across every major development sector — from impact evaluation and poverty analysis to gender, climate, WASH, humanitarian response, animal welfare, governance, and more. Built for students and practitioners learning data work in global development.
Each generator produces a large, realistic dataset with correlated variables, proper distributions, and realistic missing-data patterns.
Multi-module household survey with demographics, per-capita consumption (log-normal), asset ownership, housing quality, food security (FIES), and subjective well-being. Individual-level (members nested in households).
Randomized controlled trial with stratified assignment, 4 treatment arms, partial compliance, differential attrition, and spillover flags. Baseline + endline consumption and food insecurity.
Balanced country-year panel with 20+ development indicators including GDP, life expectancy, mortality, enrollment, fertility, poverty, and Gini. AR(1) persistence, COVID shock, WDI-like gaps.
Plot-level crop production data with Cobb-Douglas yields, input use (fertilizer, improved seed, irrigation), rainfall shocks, distance to market, and profit calculation.
Under-5 child health data with WHO z-scores (HAZ/WAZ/WHZ), vaccination schedules, maternal health (ANC, delivery), feeding practices, morbidity, and age heaping.
Student-level data nested in 500 schools with math/reading scores, attendance, teacher quality, school resources, SES, grade repetition, and dropout. ICC ~0.20.
Working-age adults with Mincer wage equation, formal/informal sector, employment status, hours, migration, remittances, social protection, and underemployment.
Loan-level records with group/individual lending, repayment rates, default prediction, repeat borrowers, collateral, and internal credit scoring. Realistic MFI portfolio.
Proxy means test targeting with true vs. predicted consumption, inclusion/exclusion errors, community-based targeting comparison, and categorical eligibility. Ready for targeting accuracy analysis.
Weekly staple commodity prices across 80 markets in 16 countries with seasonality, spatial price correlation, transport cost wedges, border effects, and AR(1) persistence.
Women's empowerment (WEAI-like), decision-making, economic empowerment, time use, GBV prevalence with underreporting, SRH indicators, and programme intervention effects.
Longitudinal girls' education data with enrollment, MHM, safety (SRGBV), gendered barriers (marriage, pregnancy, cost), primary-secondary transition rates, and scholarship effects.
Climate shocks (drought, flood, cyclone), coping strategies, adaptation practices, RIMA resilience index (absorptive, adaptive, transformative capacity), and carbon footprint.
Multi-node value chain (producer→aggregator→processor→retailer) with margins, quality grading, post-harvest losses, contract farming, and cooperative membership effects.
Animal welfare assessment using Five Freedoms framework, body condition scoring, working animal welfare (donkeys, horses), veterinary access, rabies vaccination, and programme effects.
Disease surveillance (malaria, TB, HIV), health facility visits, out-of-pocket spending, CHW contact, health insurance, NCD risk factors, mental health (PHQ-9), and COVID vaccination.
Income diversification, VSLA/savings groups, enterprise development, vocational training, asset accumulation, food consumption score, financial inclusion, and youth employment.
Legal identity, land tenure, access to justice, rights awareness (CEDAW, child rights), dispute resolution, civic participation, and advocacy campaign effectiveness.
JMP service ladders, water quality testing (E. coli, turbidity), sanitation ladders, CLTS/ODF status, handwashing observation, MHM, and diarrhea linked to WASH conditions.
Displacement status (IDP/refugee/host), multi-sector needs assessment, aid distribution, protection concerns (GBV, child protection), SADD, CwC, and accountability.
Beneficiary registry with cash transfer disbursement, conditionality compliance, payment modalities, FCS/rCSI outcomes, asset graduation model, and dependency metrics.
Citizen satisfaction with public services, trust in institutions, corruption experience (bribery), budget transparency, social accountability, community scorecards, and RTI.
Knowledge-Attitude-Practice surveys for health/WASH/nutrition. Campaign exposure and dose, KAP cascade (knowledge > attitude > practice), self-efficacy, and social norms.
Programme cost data across sectors: personnel, materials, transport, overhead. Outcomes, effect sizes, ICER, DALYs averted, QALYs gained. Pilot vs. at-scale economies.
ILO decent work framework: formal/informal employment, earnings with gender pay gap, social protection, working conditions, union membership, and work-life balance.
Time use diary data: paid work, unpaid care, domestic work, leisure. 3x gender care gap. Care infrastructure effects, opportunity cost, and time poverty indicators.
Intersectional analysis: caste, religion, disability, sexuality with socioeconomic outcomes. Multiplicative disadvantage, discrimination experience, and access to services.
Pollution exposure (PM2.5, indoor air, water contamination), environmental hazards, health outcomes, cooking fuel, green space access, climate vulnerability, and environmental rights.
Social capital measurement: bonding vs. bridging ties, trust, collective action, participatory governance, community assets, social cohesion, and CDD programmes.
Device ownership, connectivity, digital literacy (7 hierarchical skills), social media use, misinformation exposure, privacy awareness, and barriers. Gender and age divides embedded.
CASEL framework: self-awareness, self-management, social awareness, relationship skills, responsible decision-making. Academic scores, wellbeing, bullying, and prosocial behaviour.
NGO financial management: budget breakdowns, donor types, funding gaps, burn rates, reserves, compliance audits, value-for-money scoring, and the overhead debate.
Official Development Assistance flows: donor-recipient pairs, Paris Declaration indicators (ownership, alignment, harmonization), tied aid, fragmentation, and conditionality.
Media access and consumption, news source trust, media literacy, development communication exposure, misinformation encounters, press freedom perceptions, and language barriers.
3PL IRT model with 30 items: item difficulty, discrimination, guessing. Latent ability (theta), response times, Differential Item Functioning (DIF) by gender. Full item-level data.
Survey process data: interview timing, GPS validation, response patterns (straightlining, acquiescence), back-checks, enumerator fatigue, and fabrication detection. ~5% fabricators embedded.
Generate all datasets in under 30 seconds. Only requires Python 3.8+ and four pip packages.
Clone the repo, install dependencies, and generate all 36 datasets as CSV files.
# Clone
git clone https://github.com/Varnasr/devdata-practice.git
cd devdata-practice
# Install (numpy, pandas, scipy, pyarrow)
pip install -r requirements.txt
# Generate all 36 datasets
python generate.py
Outputs ~35 MB of CSVs to ./output/
Generate specific datasets, change sizes, set seeds for reproducibility, or export as Parquet.
# Generate just two datasets
python generate.py rct_experiment labor_market
# Larger datasets (override default size)
python generate.py household_survey --rows 50000
# Set seed for exact reproducibility
python generate.py --seed 42
# Export as Parquet instead of CSV
python generate.py --format parquet
# List all available generators
python generate.py --list
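Once the generators have run, a quick sanity check is to count rows and columns in each file. The sketch below assumes only what is stated above, namely that CSV files land in ./output/. The helper name summarize_outputs is hypothetical and not part of the repo.

```python
import csv
from pathlib import Path


def summarize_outputs(output_dir="output"):
    """Return {file name: (data row count, column count)} for each CSV found."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return {}
    summary = {}
    for path in sorted(directory.glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])          # first line holds column names
            n_rows = sum(1 for _ in reader)    # remaining lines are data rows
        summary[path.name] = (n_rows, len(header))
    return summary


for name, (n_rows, n_cols) in summarize_outputs().items():
    print(f"{name}: {n_rows} rows, {n_cols} columns")
```

This uses only the standard library, so it runs before pandas is imported for the actual analysis.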
Expand any dataset below for the full column dictionary, realistic features, and methodological notes.
One row per household member. ~15,000 households with 2-8 members each, expanded to ~75k individual-level rows. Household-level variables are repeated across members.
individual_id, household_id — Unique identifiers
country, district, urban — Geography
relationship, age, female, education_years — Demographics
monthly_pce_usd — Monthly per-capita expenditure (USD PPP, log-normal)
food_share — Engel curve: food expenditure as share of total
food_insecurity_score — FIES-like 0-8 scale
owns_radio through owns_improved_stove — 8 binary asset indicators
wall_material, rooms, water_source, toilet_type — Housing
life_satisfaction — Subjective well-being (1-10)

One row per participant. 4 arms: control, cash transfer, cash + training, training only. Stratified by district × gender.
treatment_arm — Randomized assignment
actually_treated — Compliance indicator (65-85% take-up)
baseline_consumption_usd, endline_consumption_usd — Primary outcomes
baseline_food_insecurity, endline_food_insecurity — HFIAS 0-27
attrited — Endline attrition (differential by arm)
spillover_risk — Flag for control units in treated villages

Balanced panel: 25 developing countries × 25 years (2000-2024). 20+ indicators per country-year.
gdp_per_capita_usd, log_gdp_per_capita — Income
life_expectancy, infant_mortality_per_1000, under5_mortality_per_1000 — Health
primary_enrollment_pct, secondary_enrollment_pct, adult_literacy_pct — Education
fertility_rate, electricity_access_pct, sanitation_access_pct — Development
poverty_headcount_215, gini_coefficient — Welfare (sparse)

One row per plot. ~15k farm households with 1-4 plots each. 7 crops, full input-output accounting.
plot_size_acres, crop, soil_quality — Plot characteristics
improved_seed, fertilizer_kg, irrigation, pesticide_used — Inputs
rainfall_deviation_sd — Weather shock (SD from normal)
harvest_kg, price_per_kg_usd, revenue_usd, profit_usd — Outputs
extension_contact, distance_to_market_km — Access

Yields follow a Cobb-Douglas: Y = A · L^0.50 · Lab^0.25 · F^0.15 with TFP shifters for improved seed (+15%), irrigation (+10%), and soil quality. Rainfall has an inverted-U effect.
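A minimal sketch of how a yield equation of this shape can be simulated and then recovered by log-linear OLS. It is not the generator's actual code. Reading L as land, Lab as labour, and F as fertilizer is an assumption, as are the baseline productivity of 100, the input distributions, and the noise level.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical inputs (assumed meanings: L = land, Lab = labour, F = fertilizer)
land = rng.lognormal(mean=0.5, sigma=0.5, size=n)
labour = rng.lognormal(mean=3.0, sigma=0.4, size=n)
fertilizer = rng.lognormal(mean=3.5, sigma=0.6, size=n)
improved_seed = rng.binomial(1, 0.3, size=n)

# Y = A * L^0.50 * Lab^0.25 * F^0.15, with a +15% TFP shift for improved seed
tfp = 100.0 * (1.0 + 0.15 * improved_seed)
noise = rng.lognormal(mean=0.0, sigma=0.2, size=n)
harvest = tfp * land**0.50 * labour**0.25 * fertilizer**0.15 * noise

# Taking logs makes the model linear, so OLS recovers the elasticities
X = np.column_stack([
    np.ones(n), np.log(land), np.log(labour), np.log(fertilizer), improved_seed,
])
coef, *_ = np.linalg.lstsq(X, np.log(harvest), rcond=None)
elasticities = dict(zip(["land", "labour", "fertilizer"], coef[1:4]))
seed_premium = np.exp(coef[4]) - 1.0   # convert the log-point shift back to a percentage
print(elasticities, round(seed_premium, 3))
```

The estimated elasticities should land close to 0.50, 0.25, and 0.15, and the seed premium close to 15%, which is the pattern the production-function exercise asks students to find in the real file.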
One row per child under 5 with linked mother characteristics. Wealth quintile drives most health gradients.
height_for_age_z, weight_for_age_z, weight_for_height_z — WHO z-scores
stunted, underweight, wasted — Binary flags (z < -2)
bcg_vaccine, dpt1_vaccine, dpt3_vaccine, measles_vaccine, fully_vaccinated
anc_visits, facility_delivery, skilled_birth_attendant — Maternal health
diarrhea_2wk, fever_2wk, cough_2wk, sought_treatment — Morbidity

Students nested in 500 schools (grades 3-8). Multi-level data for HLM analysis.
math_score, reading_score — 0-100 test scores
attendance_rate, distance_to_school_km — Student access
ses_score, school_type, school_meal_program — SES & resources
pupil_teacher_ratio, pct_trained_teachers, has_library — School quality
repeated_grade, dropped_out — Persistence

Working-age adults (15-64) with employment status, wages, sector, and social protection.
ln(wage) = 1.8 + 0.08·educ + 0.04·exp - 0.0006·exp² - 0.18·female + 0.25·urban + 0.35·formal + ε
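To see what this wage equation implies, the sketch below simulates log wages with the coefficients above and re-estimates them by OLS. The regressor distributions and the error standard deviation of 0.5 are assumptions made for illustration. Only the coefficients come from the equation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Assumed regressor distributions (illustrative only)
educ = rng.integers(0, 17, size=n)       # years of schooling
exper = rng.integers(0, 41, size=n)      # years of experience
female = rng.binomial(1, 0.5, size=n)
urban = rng.binomial(1, 0.4, size=n)
formal = rng.binomial(1, 0.3, size=n)
eps = rng.normal(0.0, 0.5, size=n)       # assumed error spread

ln_wage = (1.8 + 0.08 * educ + 0.04 * exper - 0.0006 * exper**2
           - 0.18 * female + 0.25 * urban + 0.35 * formal + eps)

# OLS on the same specification
X = np.column_stack([np.ones(n), educ, exper, exper**2, female, urban, formal])
beta, *_ = np.linalg.lstsq(X, ln_wage, rcond=None)
names = ["const", "educ", "exper", "exper_sq", "female", "urban", "formal"]
estimates = dict(zip(names, beta))

# Experience at which predicted log wage peaks: -b_exper / (2 * b_exper_sq)
peak_experience = -estimates["exper"] / (2 * estimates["exper_sq"])
print({k: round(v, 4) for k, v in estimates.items()}, round(peak_experience, 1))
```

With the stated coefficients, each year of schooling raises wages by about 8%, and the experience profile peaks at 0.04 / (2 × 0.0006) ≈ 33 years.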
Loan-level records from a microfinance institution. Group and individual lending products.
loan_amount_usd, interest_rate_annual_pct, term_months — Terms
loan_product, loan_purpose — Product classification
cycle_number — Repeat borrower indicator (graduation)
defaulted, days_past_due, repayment_rate — Performance
internal_credit_score — 300-850 score

Default probability via logistic: higher for larger loans, consumption purpose, first-cycle borrowers, uncollateralized. Women default less. Repeat borrowers graduate to larger amounts.
Household-level data with both true consumption (survey) and PMT-predicted consumption. Built for targeting accuracy analysis.
true_monthly_pce_usd — Actual per-capita consumption (with residual noise)
pmt_predicted_pce_usd — Fitted values from proxy means formula
truly_poor, pmt_classified_poor — Binary poverty flags
exclusion_error, inclusion_error — Targeting mistakes
community_selected — Community-based targeting comparison
monthly_benefit_usd — Calculated transfer amount

The gap between true and predicted consumption creates realistic targeting errors. Students can compare PMT, community, and categorical targeting; compute leakage and undercoverage; and simulate benefit reforms.
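The two error rates can be computed directly from the true and predicted consumption values. The sketch below uses the common definitions: undercoverage is the share of truly poor households that the PMT misses, and leakage is the share of selected households that are not poor. The function name, the toy values, and the poverty line of 65 are hypothetical.

```python
def targeting_errors(true_pce, predicted_pce, poverty_line):
    """Return undercoverage (exclusion) and leakage (inclusion) error rates."""
    truly_poor = [c < poverty_line for c in true_pce]
    classified_poor = [c < poverty_line for c in predicted_pce]

    # Exclusion error: truly poor but not selected by the PMT
    excluded = sum(t and not p for t, p in zip(truly_poor, classified_poor))
    # Inclusion error: selected by the PMT but not truly poor
    included = sum(p and not t for t, p in zip(truly_poor, classified_poor))

    n_poor = sum(truly_poor)
    n_selected = sum(classified_poor)
    return {
        "undercoverage": excluded / n_poor if n_poor else 0.0,
        "leakage": included / n_selected if n_selected else 0.0,
    }


# Toy data: three truly poor households, three selected by the PMT
true_pce = [30, 45, 60, 80, 120, 200]
pmt_pce = [40, 70, 55, 90, 50, 180]
result = targeting_errors(true_pce, pmt_pce, poverty_line=65)
print(result)
```

Here one poor household is missed and one non-poor household is selected, so both rates equal one third. Re-running with a different poverty_line shows how the two errors move as the line shifts.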
80 markets × 104 weeks × 8 staple commodities. 2 years of weekly price data across 16 countries.
market, country, date, commodity — Identifiers
price_per_kg_usd — Commodity price with seasonality
volume_traded_kg — Market volume
transport_cost_pct — Distance-based cost wedge

One row per woman/girl. 25,000 individuals across programme and non-programme areas with empowerment, GBV, and SRH indicators.
individual_id, household_id, country, district — Identifiers
programme_participant, programme_type — Cash transfers, skills training, savings groups, awareness, legal aid
decides_own_healthcare through decides_own_earnings — 5 WEAI decision-making domains
decision_making_score, empowerment_index — Composite empowerment scores
owns_land, owns_house, has_bank_account, has_mobile_money — Economic empowerment
care_work_hours_day, productive_work_hours_day, leisure_hours_day — Time use diary
reported_physical_gbv, reported_emotional_gbv, reported_economic_gbv — GBV with underreporting
using_modern_contraception, unmet_need_family_planning — SRH indicators

One row per girl (grades 1–12). Tracks enrollment, learning outcomes, safety, menstrual hygiene, and dropout barriers.
girl_id, grade, age, wealth_quintile — Identifiers & demographics
enrolled, attendance_rate, dropped_out — Enrollment status
math_score, literacy_score — Learning outcomes (0–100)
barrier_marriage, barrier_pregnancy, barrier_cost, barrier_distance — Dropout barriers
has_menstruated, mhm_knowledge, has_sanitary_products, missed_school_menstruation — MHM
srgbv_experienced, feels_safe_route_to_school — Safety indicators
receives_scholarship, receives_school_meals — Programme support
at_primary_secondary_transition, transitioned_to_secondary — Transition tracking

One row per household. Climate shock exposure, coping strategies, adaptation practices, and RIMA-like resilience indices.
agroecological_zone — Arid, semi-arid, sub-humid, humid, highland
experienced_drought, experienced_flood, experienced_cyclone, experienced_pest_outbreak — Shock exposure
crop_loss_pct, livestock_loss_pct, income_loss_pct — Shock losses
cs_reduced_meals, cs_sold_assets, cs_borrowed_money, cs_migration — Coping strategies
adopted_drought_resistant_crop, adopted_irrigation, has_crop_insurance — Adaptation
absorptive_capacity, adaptive_capacity, transformative_capacity — RIMA pillars
resilience_index — Composite resilience score
carbon_footprint_tco2_yr — Household emissions proxy

One row per transaction across a 4-node value chain: producer → aggregator → processor → retailer. 8 commodities.
chain_node — Position in value chain (producer/aggregator/processor/retailer)
commodity — Maize, coffee, dairy, poultry, horticulture, rice, groundnuts, cassava
quality_grade — A/B/C affecting price premiums
volume_kg, price_per_kg_usd, revenue_usd — Transaction values
total_cost_usd, margin_usd, margin_pct — Profitability
post_harvest_loss_pct — Loss rates by chain node
in_cooperative, has_contract, has_certification — Market linkage
buyer_type, season — Market context

One row per animal/household. Covers livestock, working animals, and companion animals with Five Freedoms welfare assessment.
animal_type — Cattle, goats, sheep, poultry, donkey, horse, pig, camel, dog, cat
freedom_hunger_thirst through freedom_fear_distress — Five Freedoms (1–5 each)
welfare_score_avg, body_condition_score, shelter_score — Composite welfare
distance_to_vet_km, accessed_vet_last_year, vaccinated, dewormed — Veterinary access
working_hours_daily, working_has_wounds, working_proper_harness — Working animal indicators
companion_rabies_vaccinated, companion_sterilized — Companion animal health
hh_consumes_animal_source_food — Nutrition linkage

One row per individual. Disease surveillance, health facility use, insurance, NCDs, mental health (PHQ-9), and COVID-19 vaccination.
malaria_tested, malaria_rdt_positive — Malaria RDT cascade
tb_ever_diagnosed, tb_on_treatment — TB cascade
hiv_tested_ever, hiv_positive, hiv_on_art — HIV cascade
facility_visits_12m, oop_health_spending_usd, catastrophic_health_expenditure — Utilization
chw_contact_6m, chw_referred, chw_referral_completed — CHW referral cascade
health_insurance_type — None, CBHI, NHIF, private, employer
hypertension_diagnosed, hypertension_controlled, diabetes_diagnosed — NCD screening
phq9_score, depression_moderate, depression_severe — Mental health
covid_vaccine_doses — 0/1/2/booster doses with wealth gradient

One row per household. Three treatment arms (control, livelihoods only, livelihoods + savings). Covers income, VSLA, training, assets, and food security.
treatment_arm — Control, livelihoods_only, livelihoods_plus_savings
primary_income_source, n_income_sources, income_diversification_index — Income
owns_enterprise, enterprise_type, enterprise_monthly_revenue_usd — Enterprise
vsla_member, vsla_savings_usd, vsla_shareout_usd, vsla_loan_usd — Savings groups
received_vocational_training, training_type, completed_apprenticeship — Skills
baseline_asset_index, endline_asset_index — 7-asset accumulation
food_consumption_score, fcs_category, reduced_coping_strategies_index — Food security
has_mobile_money, has_bank_account, accessed_credit_12m — Financial inclusion

One row per individual. Legal identity, land tenure, dispute resolution, rights awareness, civic participation, and advocacy campaigns.
has_birth_certificate, has_national_id — Legal identity documentation
owns_land, has_land_title, land_dispute_experienced — Land tenure security
rights_awareness_score, knows_cedaw, knows_child_rights — Rights knowledge
experienced_dispute, dispute_type, resolution_mechanism, dispute_resolved — Justice
barrier_cost, barrier_distance, barrier_fear, barrier_distrust — Justice barriers
voted_last_election, attended_community_meeting, feels_can_influence_decisions — Civic participation
exposed_to_advocacy_campaign, campaign_channel, changed_behavior_post_campaign — Campaigns

One row per household. JMP service ladders for water, sanitation, and hygiene. Water quality testing, CLTS, MHM, and school WASH.
water_source, water_improved, jmp_water_service_level — JMP water ladder
ecoli_cfu_100ml, ecoli_risk_category, turbidity_ntu — Water quality testing
liters_per_person_day, sufficient_water_15lpd — Water quantity (Sphere)
sanitation_facility, jmp_sanitation_service_level, open_defecation — JMP sanitation
clts_triggered, community_odf_declared, odf_slippage — CLTS programme
hw_water_and_soap, hygiene_service_level — Observed handwashing
mhm_private_space, mhm_materials_available — Menstrual hygiene
child_diarrhea_2wk, diarrhea_ors_used, diarrhea_zinc_used — Child health
school_separate_toilets_girls, school_has_mhm_facility, school_pupil_toilet_ratio — School WASH

One row per individual with SADD (Sex and Age Disaggregated Data). Displacement, multi-sector needs, aid distribution, protection, and accountability.
displacement_status — IDP, refugee, returnee, host community
crisis_type, months_displaced, times_displaced — Displacement profile
need_food through need_livelihoods — 7 sector needs scored 0–5 (JIAF-like)
overall_severity, people_in_need — Composite need assessment
meets_sphere_water, meets_sphere_shelter, meets_sphere_food — Sphere standards
received_aid, aid_modality, aid_amount_usd — Aid distribution
gbv_risk_reported, child_protection_concern, mine_uxo_awareness — Protection
knows_feedback_mechanism, filed_complaint, complaint_resolved — Accountability
movement_intention — Stay, return, relocate, seek asylum, undecided

One row per beneficiary household. Programme registry with transfer tracking, conditionality compliance, and graduation model.
programme_type — Unconditional cash, conditional cash, public works, cash plus, food vouchers, school feeding
transfer_modality — Mobile money, cash-in-hand, bank transfer, voucher, in-kind
monthly_transfer_usd, total_received_usd, pct_payments_received — Transfer tracking
has_conditionality, conditionality_compliant — Compliance monitoring
baseline_consumption_usd, endline_consumption_usd — Impact measurement
fcs_baseline, fcs_endline, rcsi_baseline, rcsi_endline — Food security change
graduation_score, graduated, would_cope_without_transfer — Graduation model

One row per citizen. Service delivery satisfaction, trust in institutions, corruption experience, budget transparency, and social accountability.
satisfaction_health through satisfaction_police, overall_service_satisfaction — Service delivery (1–5)
trust_local_govt through trust_ngos — Institutional trust (1–5)
bribery_experience, bribery_context, bribe_amount_usd, reported_bribery — Corruption
aware_of_local_budget, budget_literacy_score — Budget transparency
in_social_accountability_prog, attended_scorecard_session — Programme participation
scorecard_health, scorecard_education, scorecard_water — Community scorecards (0–100)
knows_rti_law, gets_info_radio, gets_info_social_media — Information access

One row per individual. Knowledge-Attitude-Practice survey for health, WASH, and nutrition behaviours with campaign exposure tracking.
exposed_to_campaign, campaign_type, campaign_doses — Campaign exposure (radio, community drama, peer education, SMS, poster, social media)
knowledge_score, knows_handwashing_times, knows_ors_for_diarrhea, knows_exclusive_breastfeeding — Knowledge (0–10)
attitude_score, approves_family_planning, gender_equitable_attitude, stigma_hiv — Attitudes (0–10)
practice_score, practices_handwashing, uses_treated_water, uses_mosquito_net — Practice (0–10)
knowledge_practice_gap, attitude_practice_gap — KAP cascade gaps
self_efficacy_score, perceives_community_support, discussed_with_peers — Social norms

One row per programme. Cost breakdowns, beneficiary data, outcomes, and CEA metrics across health, education, nutrition, WASH, livelihoods, and social protection sectors.
sector, programme_type, implementer_type — Programme classification
total_cost_usd, personnel_cost_usd, materials_cost_usd, overhead_cost_usd — Cost breakdown
cost_per_beneficiary_usd, cost_per_outcome_usd, overhead_ratio, personnel_ratio — Cost ratios
effect_size, icer, daly_averted, qaly_gained — CEA metrics
is_pilot — Pilot vs. at-scale with economies of scale

One row per worker. ILO decent work framework covering formal/informal employment, earnings, social protection, working conditions, freedom of association, and the informal economy.
employment_status — Formal wage, informal wage, self-employed, casual daily, unpaid family
monthly_earnings_usd, hourly_wage_usd, below_minimum_wage — Earnings with embedded gender gap (~0.82 ratio)
has_written_contract, has_social_security, has_health_insurance, has_pension — Social protection
occupational_safety_training, experienced_injury_12m, workplace_harassment — Conditions
member_of_union, freedom_of_association, collective_bargaining_covered — Labour rights
operates_without_registration, no_bookkeeping — Informality indicators

One row per individual (mixed gender). Time use diary data with care breakdown, care infrastructure, opportunity cost, and time poverty indicators.
sleep_hours, paid_work_hours, unpaid_care_hours, domestic_work_hours, leisure_hours — Time diary (~24h/day)
childcare_hours, eldercare_hours, cooking_hours, water_collection_hours — Care breakdown
has_childcare_access, has_electricity, has_improved_cookstove — Care infrastructure
forgone_earnings_usd, reduced_labor_participation — Opportunity cost
time_poor — Flag for >10.5 hours/day on paid + unpaid work

One row per individual. Multiple identity dimensions (caste, religion, disability, sexuality) with socioeconomic outcomes showing multiplicative intersectional disadvantage.
caste_category (general/OBC/SC/ST), religion, has_disability, disability_type, sexual_minority, indigenous
monthly_income_usd, employed, housing_quality_score, food_security_score — Outcomes
experienced_discrimination, discrimination_basis, discrimination_context — Discrimination
accessed_education through accessed_justice — Service access

One row per household. Pollution exposure, environmental hazards, health outcomes, cooking fuel, green space, climate vulnerability, and environmental governance.
air_quality_pm25, indoor_air_pollution, water_contamination_score, noise_pollution_level — Pollution
proximity_to_industrial_site_km, proximity_to_waste_dump_km, flood_risk_zone — Hazards
respiratory_illness_12m, waterborne_illness_12m, child_blood_lead_elevated — Health outcomes
cooking_fuel_type, cooking_location — Indoor air quality determinant
carbon_footprint_tco2, climate_vulnerability_score — Climate justice

One row per individual. Social capital measurement with bonding vs. bridging ties, collective action, participatory governance, and community-driven development.
n_group_memberships, primary_group_type — Group membership
bonding_social_capital_score, bridging_social_capital_score — Bonding vs. bridging
trust_neighbors, trust_strangers, trust_local_leaders — Trust (1–5)
participated_in_collective_action, collective_action_type — Collective action
attended_village_assembly, voiced_opinion_in_meeting — Participatory governance
in_cdd_programme, contributed_to_project, satisfied_with_project — CDD

One row per individual. Device ownership, connectivity, 7-level digital skills hierarchy, usage patterns, misinformation, privacy, and barriers to access.
owns_smartphone, owns_computer, shared_device_only — Device access
has_internet_access, internet_type, monthly_data_cost_usd — Connectivity
can_make_call through can_use_govt_services_online — 7 hierarchical skills
digital_literacy_score — Composite (0–10)
encountered_misinformation, can_identify_misinformation, shared_unverified_info
barrier_cost, barrier_literacy, barrier_language, barrier_infrastructure

One row per student (ages 6–18). CASEL framework with 5 SEL domains, academic outcomes, wellbeing, bullying, prosocial behaviour, and teacher/parent ratings.
self_awareness, self_management, social_awareness, relationship_skills, responsible_decision_making — CASEL domains (1–5)
sel_composite_score — Overall SEL (1–5)
math_score, reading_score, attendance_rate — Academic outcomes
life_satisfaction, bullying_experienced, bullying_perpetrated — Wellbeing
in_sel_programme, programme_duration_months, teacher_trained_in_sel — Programme

One row per programme. Organisation characteristics, funding, budget breakdowns, financial health, compliance, effectiveness, and partnerships.
org_type — Local NGO, INGO, CBO, faith-based, social enterprise
annual_budget_usd, n_donors, primary_donor_type, funding_gap_usd — Funding
personnel_pct, programme_activities_pct, admin_pct, indirect_cost_pct — Budget
burn_rate_pct, months_of_reserves, sustainability_score — Financial health
audit_completed, audit_qualified, vfm_score — Compliance & VfM

One row per aid flow (2010–2025). Donor-recipient pairs with Paris Declaration indicators, fragmentation, conditionality, and coordination.
donor_country, recipient_country, year — Flow identifiers
oda_amount_usd, disbursement_type, channel — Aid modality
country_ownership_score, alignment_with_national_plan, uses_country_systems — Paris principles
donor_concentration_index, is_tied_aid — Fragmentation
has_conditionality, conditionality_type, conditionality_met — Conditionality

One row per individual. Media access, consumption, source trust, media literacy, development communication, misinformation, and press freedom perceptions.
has_radio, has_tv, has_smartphone, has_internet — Media access
radio_hours, tv_hours, social_media_hours — Consumption (hours/week)
primary_news_source, trusts_primary_source — Information sources
media_literacy_score, can_identify_fake_news — Media literacy (0–10)
exposed_to_dev_content, dev_content_topic — Development communication
encountered_health_misinformation, shared_misinformation — Misinformation

One row per respondent with 30 item responses. 3-Parameter Logistic (3PL) IRT model with item difficulty, discrimination, guessing, and response times.
theta — Latent ability (standard normal, driven by education/wealth/urban)
item_1 through item_30 — Binary responses (0/1 correct)
difficulty (b, −3 to 3), discrimination (a, 0.5–2.5), guessing (c, ~0.25)
rt_item_1 through rt_item_30 — Response times (seconds, lognormal)
total_score, pct_correct — Test-level summaries

One row per survey interview. 200 enumerators with timing, GPS, response pattern, back-check, and enumerator-level quality data. ~5% of enumerators are fabricators.
interview_duration_min, travel_time_min, too_short, too_long, outside_working_hours — Timing
gps_latitude, gps_longitude, gps_accuracy_m, gps_suspicious — GPS validation
straightlining_score, acquiescence_score, digit_preference_score — Response patterns
back_checked, back_check_match_rate — Verification
enumerator_experience_months, surveys_completed_today, fatigue_flag — Enumerator
missing_rate, dont_know_rate, refused_rate, outlier_count — Data quality

Suggested exercises for each dataset, organized by difficulty. Perfect for coursework and self-study.
Using the household survey, calculate headcount poverty rates by district, urban/rural, and household head gender. Plot the Lorenz curve and compute the Gini coefficient.
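As a starting point for this exercise, here is a minimal sketch of the two core measures in plain Python. The function names are hypothetical. In practice the input would be the monthly_pce_usd column, and the poverty line is the analyst's choice.

```python
def headcount_ratio(consumption, poverty_line):
    """Share of observations with consumption below the poverty line."""
    return sum(c < poverty_line for c in consumption) / len(consumption)


def gini(consumption):
    """Gini coefficient from the sorted-rank formula:
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending."""
    xs = sorted(consumption)
    n = len(xs)
    total = sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


print(headcount_ratio([1, 2, 3, 4], poverty_line=2.5))  # 0.5
print(gini([5, 5, 5, 5]))                               # 0.0 (perfect equality)
print(gini([0, 0, 0, 1]))                               # 0.75 (one person holds everything)
```

The same functions can be applied within groups (district, urban/rural, gender of household head) to build the comparison table the exercise asks for.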
Compute stunting, underweight, and wasting prevalence by wealth quintile. Plot vaccination coverage by age. Detect the age heaping pattern.
Estimate intent-to-treat effects for each arm. Use compliance data to compute LATE via IV/2SLS. Test for differential attrition and compute Lee bounds.
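The relationship between ITT and LATE can be sketched with a simplified two-arm simulation. With a single binary instrument and no covariates, the Wald ratio below equals the IV/2SLS estimate. The 75% take-up sits inside the 65-85% range described for the dataset. The effect size, outcome scale, and one-sided noncompliance are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

assigned = rng.binomial(1, 0.5, size=n)     # random assignment (the instrument)
complier = rng.binomial(1, 0.75, size=n)    # assumed 75% take-up
treated = assigned * complier               # one-sided noncompliance: controls never treated

true_effect = 10.0                          # hypothetical effect on those actually treated
outcome = 100.0 + true_effect * treated + rng.normal(0.0, 20.0, size=n)

# ITT: compare outcomes by assignment, ignoring actual take-up
itt = outcome[assigned == 1].mean() - outcome[assigned == 0].mean()

# First stage: how much assignment moves actual treatment
first_stage = treated[assigned == 1].mean() - treated[assigned == 0].mean()

# Wald estimator of the LATE
late = itt / first_stage
print(round(itt, 2), round(first_stage, 3), round(late, 2))
```

The ITT is diluted by non-compliers (roughly 0.75 × 10), while the Wald ratio scales it back up to the effect on compliers.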
Estimate returns to education and experience. Perform Oaxaca-Blinder decomposition of the gender wage gap. Compare formal vs. informal sector returns.
Estimate the Cobb-Douglas production function. Test for technology adoption effects. Analyze how rainfall shocks affect yields and whether irrigation mitigates damage.
Fit a hierarchical linear model with student and school levels. Estimate the ICC. Test whether school meal programs improve test scores controlling for SES.
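Before fitting a full HLM, the ICC can be estimated from a one-way ANOVA decomposition. The sketch below simulates a balanced design with 500 schools and a target ICC of 0.20, matching the figures quoted for the education dataset. The 20 students per school and the unit total variance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools, n_per_school = 500, 20   # students per school is an assumption
target_icc = 0.20

# Total variance is 1: a share target_icc lies between schools, the rest within
school_effect = rng.normal(0.0, np.sqrt(target_icc), size=n_schools)
student_noise = rng.normal(0.0, np.sqrt(1.0 - target_icc), size=(n_schools, n_per_school))
scores = school_effect[:, None] + student_noise

# One-way ANOVA mean squares for a balanced design
school_means = scores.mean(axis=1)
grand_mean = scores.mean()
ms_between = n_per_school * ((school_means - grand_mean) ** 2).sum() / (n_schools - 1)
ms_within = ((scores - school_means[:, None]) ** 2).sum() / (n_schools * (n_per_school - 1))

# ANOVA estimator of the intraclass correlation
icc = (ms_between - ms_within) / (ms_between + (n_per_school - 1) * ms_within)
print(round(icc, 3))
```

On the real file the design is unlikely to be perfectly balanced, so a mixed-model estimate is the more robust choice. This version is mainly useful for building intuition about what the ICC measures.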
Compare PMT, community-based, and categorical targeting. Compute undercoverage, leakage, and total error. Simulate moving the poverty line and observe the error trade-off.
Build a logistic regression model predicting default. Compute ROC/AUC. Analyze whether group lending reduces default vs. individual lending, controlling for observables.
Test for cointegration between market pairs using the Engle-Granger method. Estimate the law of one price. Analyze how transport costs and borders affect price transmission.
Estimate the relationship between GDP growth and poverty reduction. Test convergence across countries. Use fixed effects to control for unobserved country heterogeneity.
Construct a WEAI-like composite index from the 5 decision-making domains. Analyze how programme type (cash vs. training vs. awareness) differentially affects empowerment. Examine GBV underreporting patterns.
Build a logistic regression predicting dropout. Quantify the relative contribution of each barrier (marriage, cost, distance, MHM). Estimate scholarship programme effects on primary–secondary transition.
Reconstruct the RIMA resilience index from its three pillars. Compare resilience across agro-ecological zones. Analyze whether early warning access reduces crop losses from drought.
Calculate value addition at each node. Test whether cooperative membership and quality certification improve producer margins. Analyze how post-harvest losses vary by chain node and storage type.
Compute mean Five Freedoms scores by animal type. Visualize how welfare training shifts body condition scores. Compare veterinary access between working and companion animals.
Map the testing–diagnosis–treatment cascade for malaria, TB, and HIV. Identify drop-off points. Estimate the equity gap in catastrophic health expenditure by wealth quintile.
Compare asset accumulation (baseline vs. endline) across treatment arms. Estimate the effect of savings group membership on food consumption score. Analyze financial inclusion disparities by gender.
Model the determinants of dispute resolution. Estimate how legal aid affects outcomes. Analyze whether rights awareness (CEDAW, child rights) translates into civic participation and advocacy behavior change.
Classify households across JMP water and sanitation ladders. Correlate E. coli contamination with source type. Test whether handwashing observation vs. self-report shows social desirability bias. Link WASH conditions to child diarrhoea.
Construct multi-sector severity scores (JIAF-like). Profile vulnerability by displacement status. Analyze Sphere standards compliance gaps. Assess whether feedback mechanisms improve aid satisfaction.
Evaluate the social protection graduation model: which thresholds (food security, assets, savings) best predict self-sufficiency? Compare programme types. Estimate consumption smoothing effects of regular transfers.
Estimate the determinants of bribery experience. Test whether social accountability programmes improve trust and budget awareness. Analyze the gap between bribery experience and reporting rates. Build community scorecards from the data.
Measure the knowledge–attitude–practice cascade gap. Test whether campaign dose-response is linear. Model the role of self-efficacy in closing the attitude–practice gap.
Calculate cost-per-beneficiary and ICER across sectors. Compare pilot vs. at-scale programmes for economies of scale. Build a CEA league table ranking interventions by DALY averted per USD.
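The arithmetic behind this exercise is short enough to sketch in plain Python. The programme names and figures below are made up for illustration. Only the field names total_cost_usd and daly_averted follow the column dictionary.

```python
def icer(cost_new, cost_old, effect_new, effect_old):
    """Incremental cost-effectiveness ratio: extra cost per extra unit of effect."""
    delta_effect = effect_new - effect_old
    if delta_effect == 0:
        raise ValueError("ICER is undefined when the effect difference is zero")
    return (cost_new - cost_old) / delta_effect


# Hypothetical programmes (illustrative numbers only)
programmes = [
    {"name": "A", "total_cost_usd": 50_000, "daly_averted": 250},
    {"name": "B", "total_cost_usd": 80_000, "daly_averted": 200},
    {"name": "C", "total_cost_usd": 30_000, "daly_averted": 300},
]

# League table: most DALYs averted per USD first
league_table = sorted(
    programmes,
    key=lambda p: p["daly_averted"] / p["total_cost_usd"],
    reverse=True,
)
for p in league_table:
    print(p["name"], round(p["total_cost_usd"] / p["daly_averted"], 2), "USD per DALY averted")

# Moving from a programme costing 100,000 with 200 DALYs averted
# to one costing 120,000 with 300 DALYs averted
print(icer(120_000, 100_000, 300, 200))  # 200.0 USD per additional DALY averted
```

Ranking by DALYs averted per USD in descending order is the same as ranking by cost per DALY in ascending order.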
Decompose the gender wage gap using Oaxaca-Blinder. Compare social protection coverage across formal and informal workers. Estimate the incidence of below-minimum-wage employment by sector.
Calculate the gender care gap in hours/day. Estimate how care infrastructure (water access, cookstoves, childcare) reduces women’s care burden. Compute time poverty rates by gender and wealth quintile.
Test for multiplicative (vs. additive) intersectional penalties on income. Compare outcomes for SC women with disabilities against single-axis disadvantage. Map service access gaps across identity combinations.
Correlate PM2.5 exposure with wealth quintile to test environmental injustice. Model respiratory illness as a function of pollution exposure and cooking fuel. Compare green space access by income.
Compare bonding vs. bridging social capital across urban/rural areas. Visualize trust levels by education. Test whether CDD programme participants report higher satisfaction and community participation.
Map the gender and age digital divide. Test whether digital literacy (hierarchical skills) predicts misinformation resilience. Analyze barriers to internet access by wealth and geography.
Compare CASEL domain scores between programme and non-programme students. Test whether SEL composite score correlates with academic performance and lower bullying. Visualize gender differences in SEL domains.
Analyze the overhead debate: do lower admin costs predict better outcomes? Compare financial health metrics across org types. Build a value-for-money composite. Identify under-reporting patterns in admin costs.
Analyze Paris Declaration compliance over time (2010–2025). Test whether budget support improves country ownership scores. Compute donor fragmentation (HHI) by sector. Evaluate tied aid trends.
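Donor fragmentation is commonly summarised with the Herfindahl-Hirschman Index (HHI), the sum of squared donor shares within a group. The sketch below is one way to compute it. The function name and the toy flows are hypothetical, and the grouping key can be whatever the analysis needs (sector, recipient, year).

```python
from collections import defaultdict


def hhi_by_group(flows):
    """flows: iterable of (group, donor, amount). Returns {group: HHI of donor shares}."""
    totals = defaultdict(lambda: defaultdict(float))
    for group, donor, amount in flows:
        totals[group][donor] += amount   # pool repeated flows from the same donor

    result = {}
    for group, by_donor in totals.items():
        total = sum(by_donor.values())
        result[group] = sum((value / total) ** 2 for value in by_donor.values())
    return result


# Toy flows: two equal donors in one group, a single donor in the other
flows = [
    ("health", "Donor A", 50.0),
    ("health", "Donor B", 50.0),
    ("education", "Donor A", 100.0),
]
print(hhi_by_group(flows))  # {'health': 0.5, 'education': 1.0}
```

An HHI of 1 means a single donor supplies everything. Values near 1/N indicate aid spread evenly across N donors, which signals high fragmentation.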
Profile media consumption by age group and urban/rural status. Test whether media literacy score predicts ability to identify fake news. Analyze which channels deliver development communication content most effectively.
Estimate item parameters (difficulty, discrimination) from the response data. Detect DIF items by gender. Compare 1PL, 2PL, and 3PL model fits. Analyze response time patterns by item difficulty and respondent ability.
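It helps to have the standard 3PL response function in hand before estimating anything: P(correct) = c + (1 − c) / (1 + exp(−a(θ − b))). The sketch below evaluates it and simulates responses for one respondent. The item parameters are drawn from the ranges given in the column dictionary, but the draws, the seed, and the function names are illustrative, not the generator's code.

```python
import math
import random


def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response.
    a = discrimination, b = difficulty, c = guessing (lower asymptote)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(theta, items, rng):
    """Draw one 0/1 response per item for a respondent with ability theta."""
    return [int(rng.random() < p_correct_3pl(theta, a, b, c)) for a, b, c in items]


rng = random.Random(0)
# 30 hypothetical items: a in 0.5-2.5, b in -3 to 3, c fixed at 0.25
items = [(rng.uniform(0.5, 2.5), rng.uniform(-3.0, 3.0), 0.25) for _ in range(30)]

responses = simulate_responses(theta=0.0, items=items, rng=rng)
print(sum(responses), "correct out of", len(responses))

# When ability equals difficulty the probability is halfway between c and 1
print(p_correct_3pl(theta=1.0, a=1.5, b=1.0, c=0.25))  # 0.625
```

Setting c = 0 gives the 2PL model, and additionally fixing a to a common value gives the 1PL, which is the nesting the model-comparison part of the exercise relies on.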
Build a classifier to detect fabricating enumerators from paradata (duration, straightlining, GPS, back-checks). Calculate false positive and negative rates. Recommend a quality threshold for field team supervision.