One Hundred Years of

Method & Sources

How this report was built, what data it draws on, and where the numbers stop being certain.
Section: 06 / 07
Pipeline: 6 scripts
Sources: 5 databases
Updated: May 2024

Confidence system

Every number in this report carries a confidence level, either stated or implied. We use three tiers, applied uniformly across all five tabs.

High confidence
Figures drawn directly from official box scores and confirmed by multiple databases. This tier covers integration dates, team rosters, career totals for MLB players (Lahman), and the Negro Leagues stats ratified by MLB in May 2024. All leaderboard values on Tab 04 are HIGH CONFIDENCE where the source is Lahman or the MLB-ratified Seamheads data.
Candidate
Figures derived from reliable partial data with reasonable interpolation. This tier covers Negro Leagues batting stats from Retrosheet box scores (107,000 game records) and FiveThirtyEight player ratings based on contemporary accounts and surviving statistics. Career WAR estimates for Negro Leagues players (used on Tabs 02 and 03) are CANDIDATE-level.
Speculative
Counterfactual projections. Career trajectory models (quadratic fit to WAR, extrapolated backward to age 21), "lost WAR" totals, and the aggregate 2,063.5 lost-WAR figure are SPECULATIVE. They answer "what if the door had opened earlier?" rather than "what happened." They are the most interesting numbers in this report and the least certain.
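
As an illustration of how the three tiers above could be attached to figures, here is a minimal sketch. The source labels, the `Tier` enum, and the `tier_for` helper are hypothetical names, not taken from the report's scripts, and the rule that a figure inherits the weakest tier among its inputs is an assumption.

```python
from enum import Enum


class Tier(Enum):
    HIGH = "high confidence"
    CANDIDATE = "candidate"
    SPECULATIVE = "speculative"


# Hypothetical source labels; the mapping follows the tier descriptions above.
SOURCE_TIER = {
    "lahman": Tier.HIGH,
    "seamheads_mlb_ratified": Tier.HIGH,
    "retrosheet_negro_leagues": Tier.CANDIDATE,
    "fivethirtyeight": Tier.CANDIDATE,
    "projection": Tier.SPECULATIVE,
}

# Tiers ordered from most to least certain.
_ORDER = [Tier.HIGH, Tier.CANDIDATE, Tier.SPECULATIVE]


def tier_for(sources):
    """Return the weakest tier among the sources behind a figure.

    Assumption: a figure is only as certain as its least certain input.
    """
    return max((SOURCE_TIER[s] for s in sources), key=_ORDER.index)
```

Under this rule, a figure that combines Lahman career totals with a projection would be labeled SPECULATIVE, which matches how the aggregate lost-WAR figure is treated.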

Data sources

Primary databases

The pipeline

Six Python scripts run sequentially. Each reads the output of the previous step. Total runtime is under sixty seconds on commodity hardware.
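
A sequential run of this kind can be sketched as follows. The six script names are the ones listed in the step table; the `script_dir` layout and the `build_commands` and `run_pipeline` helpers are assumptions for illustration, not the report's actual runner.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the step table; the directory layout is assumed.
STEPS = [
    "00_acquire.py",
    "01_leaderboards.py",
    "02_integration_timeline.py",
    "03_lost_seasons.py",
    "04_career_projections.py",
    "05_export.py",
]


def build_commands(script_dir="."):
    """Return one command per step, in execution order."""
    return [[sys.executable, str(Path(script_dir) / name)] for name in STEPS]


def run_pipeline(script_dir="."):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(script_dir):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so no later step runs after a failed one.
        subprocess.run(cmd, check=True)
```

Stopping at the first failure matters here because each step reads the previous step's output.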

Step  Script  Purpose
00 00_acquire.py Load and merge all data sources into a unified parquet file. Standardizes player IDs across Lahman, Retrosheet, and FiveThirtyEight.
01 01_leaderboards.py Compute career and single-season leaderboards. Generates pre-2024 (MLB only) and post-2024 (MLB + Negro Leagues) rankings with league-appropriate AB thresholds.
02 02_integration_timeline.py Team integration dates and yearly performance analysis. Tracks the twelve-year wave from Robinson to Green.
03 03_lost_seasons.py Compute lost seasons for each Negro Leagues player. Peak years, years they could have played in MLB, and the gap between.
04 04_career_projections.py Build career trajectory models for integration-era players. Quadratic fit to actual WAR, extrapolated backward to age 21.
05 05_export.py Combine all outputs into meta.json with headline statistics. Final assembly of the JSON files that feed the five report tabs.
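
The approach named in step 04 (a quadratic fit to actual WAR, extrapolated backward to age 21) can be sketched as below. This is an illustration under stated assumptions, not the code in 04_career_projections.py: the function name `project_early_war`, the use of NumPy's `polyfit`, and the choice to floor negative projected seasons at zero are all assumptions.

```python
import numpy as np


def project_early_war(ages, war, start_age=21):
    """Fit WAR = a*age^2 + b*age + c to observed seasons, then
    extrapolate backward from the first observed age to start_age.

    Returns the unobserved ages, the projected WAR for each, and their sum.
    """
    ages = np.asarray(ages, dtype=float)
    war = np.asarray(war, dtype=float)

    # Least-squares quadratic; coefficients come back highest power first.
    coeffs = np.polyfit(ages, war, deg=2)

    # Ages from start_age up to (but not including) the first observed season.
    missing_ages = np.arange(start_age, int(ages.min()))
    projected = np.polyval(coeffs, missing_ages)

    # Assumption: a projected season cannot contribute negative "lost" WAR.
    projected = np.clip(projected, 0.0, None)

    return missing_ages, projected, float(projected.sum())
```

Given a player's observed ages and season WAR values, the third return value is a per-player "lost WAR" estimate in the SPECULATIVE tier. A parabola extrapolated several years outside the observed range is sensitive to small changes in the fitted seasons, which is one reason these totals carry the lowest confidence.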

Limitations

On the forty-years irony

The Negro Leagues operated for roughly forty years, from the founding of the Negro National League in 1920 to the last competitive season in the late 1950s. The barrier that preceded them held for forty-seven. The gap between the leagues' end and MLB's official recognition of their statistics — from the late 1950s to May 29, 2024 — is another sixty-five years. The players waited longer to be counted than they played. The series is called One Hundred Years. The Negro Leagues did not get one hundred years. They got forty, and then the door opened and the leagues died so the players could live.

This report does not argue that the statistical merger was right or wrong. It treats it as done, and measures what changed. The leaderboards reshuffled. The names appeared. The numbers — the numbers were always true.