Confidence system
Every number in this report carries an implicit confidence level. We use three tiers, applied uniformly across all five tabs. No number is presented without stating or implying which tier it belongs to.
High confidence
Figures drawn directly from official box scores and confirmed by multiple databases: integration dates,
team rosters, career totals for MLB players (Lahman), and Negro Leagues stats ratified by MLB in May 2024.
All leaderboard values on Tab 04 are HIGH CONFIDENCE where the source is Lahman or the MLB-ratified Seamheads data.
Candidate
Figures derived from reliable partial data with reasonable interpolation: Negro Leagues batting stats
from Retrosheet box scores (107,000 game records) and FiveThirtyEight player ratings based on contemporary accounts
and surviving statistics. Career WAR estimates for Negro Leagues players (used on Tabs 02 and 03) are CANDIDATE-level.
Speculative
Counterfactual projections. Career trajectory models (quadratic fit to WAR, extrapolated backward to age 21),
"lost WAR" totals, and the aggregate 2,063.5 lost-WAR figure are SPECULATIVE. They answer "what if the door
had opened earlier?" rather than "what happened." They are the most interesting numbers in this report and the
least certain.
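The three tiers can be read as a lookup from data source to confidence level. The sketch below is illustrative only: the source labels, the `Tier` enum, and the rule that a figure built from several sources takes the least certain tier among them are assumptions, not something the pipeline is known to implement.

```python
from enum import Enum


class Tier(Enum):
    HIGH = "high confidence"
    CANDIDATE = "candidate"
    SPECULATIVE = "speculative"


# Hypothetical source labels, mapped to tiers following the descriptions above.
SOURCE_TIER = {
    "lahman": Tier.HIGH,
    "seamheads_mlb_ratified": Tier.HIGH,
    "retrosheet_negro_leagues": Tier.CANDIDATE,
    "fivethirtyeight_ratings": Tier.CANDIDATE,
    "trajectory_model": Tier.SPECULATIVE,
}

# Ordered from most certain to least certain.
_ORDER = [Tier.HIGH, Tier.CANDIDATE, Tier.SPECULATIVE]


def tier_for(sources):
    """Return the least certain tier among the sources behind a figure.

    Assumption: a number is only as trustworthy as its weakest input.
    """
    if not sources:
        raise ValueError("a figure needs at least one source")
    return max((SOURCE_TIER[s] for s in sources), key=_ORDER.index)
```

Under this rule a career total taken straight from Lahman stays HIGH, while anything that passes through the trajectory model becomes SPECULATIVE regardless of its other inputs.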
Data sources
Primary databases
- Lahman Baseball Database: Complete MLB player statistics, 1871 through 2016. Career batting, pitching, fielding. The backbone of all MLB leaderboard rankings.
- Retrosheet Negro Leagues: 107,000 game records, 1903 through 1962. Box scores, play-by-play where available. The most complete Negro Leagues statistical archive.
- FiveThirtyEight NLB Ratings: 480 Negro Leagues player ratings, including estimated WAR equivalents. Built from contemporary press accounts, surviving statistics, and expert assessment.
- Seamheads Negro Leagues DB: Season-level batting and pitching statistics. Cross-referenced with Retrosheet. Source of the stats MLB officially adopted in May 2024.
- Library of Congress: Public-domain photographs. Player portraits, team photos, and stadium images used on the Lost Seasons cards (Tab 02).
The pipeline
Six Python scripts run sequentially. Each reads the output of the previous step. Total runtime is under sixty seconds on commodity hardware.
| Step | Script | Purpose |
|---|---|---|
| 00 | 00_acquire.py | Load and merge all data sources into a unified parquet file. Standardizes player IDs across Lahman, Retrosheet, and FiveThirtyEight. |
| 01 | 01_leaderboards.py | Compute career and single-season leaderboards. Generates pre-2024 (MLB only) and post-2024 (MLB + Negro Leagues) rankings with league-appropriate AB thresholds. |
| 02 | 02_integration_timeline.py | Team integration dates and yearly performance analysis. Tracks the twelve-year wave from Robinson to Green. |
| 03 | 03_lost_seasons.py | Compute lost seasons for each Negro Leagues player. Peak years, years they could have played in MLB, and the gap between. |
| 04 | 04_career_projections.py | Build career trajectory models for integration-era players. Quadratic fit to actual WAR, extrapolated backward to age 21. |
| 05 | 05_export.py | Combine all outputs into meta.json with headline statistics. Final assembly of the JSON files that feed the five report tabs. |
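A sequential run of the six scripts could be driven by a small wrapper like the one below. This is a sketch, not the report's actual runner: the script names come from the table, while `build_commands`, `run_pipeline`, and the assumption that all scripts sit in one directory and take no arguments are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the pipeline table, in execution order.
SCRIPTS = [
    "00_acquire.py",
    "01_leaderboards.py",
    "02_integration_timeline.py",
    "03_lost_seasons.py",
    "04_career_projections.py",
    "05_export.py",
]


def build_commands(script_dir="."):
    """Return one interpreter command per script, in pipeline order."""
    return [[sys.executable, str(Path(script_dir) / name)] for name in SCRIPTS]


def run_pipeline(script_dir="."):
    """Run each step in turn; check=True stops at the first failing step."""
    for cmd in build_commands(script_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the directory that holds the scripts would execute steps 00 through 05 in order. Stopping on the first failure matters here because each step reads the previous step's output.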
Limitations
- The Lahman database covers MLB through 2016 only. Post-2016 seasons are not included in career totals or leaderboard rankings. This does not affect the core analysis (integration era is 1947 through 1959) but means active-player career stats are incomplete.
- The Lahman database does not include a race field. Player identification as Black or white relies on the FiveThirtyEight Negro Leagues dataset and cross-referencing with integration records. Some players may be misclassified or omitted.
- Negro Leagues box score coverage is incomplete. Retrosheet contains 107,000 game records, but barnstorming games, exhibition matches, and some regular-season contests were never recorded. Surviving statistics likely undercount Negro Leagues performance.
- The "lost seasons" model assumes a player would have debuted at age 21, which was the median debut age for position players in the relevant era. Individual players may have debuted earlier or later. The model does not account for military service or injury.
- Career WAR projections use a simple quadratic fit. More sophisticated models (aging curves, era adjustments, positional factors) would yield different numbers. The direction of the finding is robust; the precise figures are not.
- Photo coverage on the Lost Seasons cards is limited to 13 of 50 players. Public-domain images of Negro Leagues players are scarce, which is itself a form of the same erasure this report documents.
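To make the projection method concrete, here is a minimal sketch of a quadratic fit to WAR by age, extrapolated backward to age 21. It uses plain least squares with the standard library only. The function names, the summing of projected WAR over the missing ages, and the flooring of negative projections at zero are assumptions for illustration; `04_career_projections.py` may differ on all three.

```python
def fit_quadratic(ages, war):
    """Least-squares fit of WAR as a quadratic in age; returns a predictor."""
    if len(ages) != len(war) or len(set(ages)) < 3:
        raise ValueError("need matching inputs with at least three distinct ages")
    mean_age = sum(ages) / len(ages)
    xs = [a - mean_age for a in ages]  # center ages for numerical stability
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, war)) for k in range(3)]
    # Normal equations for c0 + c1*x + c2*x^2, as an augmented 3x4 matrix.
    m = [
        [s[0], s[1], s[2], t[0]],
        [s[1], s[2], s[3], t[1]],
        [s[2], s[3], s[4], t[2]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    c0, c1, c2 = (m[i][3] / m[i][i] for i in range(3))
    return lambda age: c0 + c1 * (age - mean_age) + c2 * (age - mean_age) ** 2


def lost_war(ages, war, debut_age, start_age=21):
    """Sum projected WAR for the seasons from start_age up to the actual debut.

    Assumption: negative projections are floored at zero.
    """
    predict = fit_quadratic(ages, war)
    return sum(max(0.0, predict(a)) for a in range(start_age, debut_age))


# Synthetic career (not a real player): debut at 28, peak at age 29.
ages = list(range(28, 35))
war = [6.0 - 0.1 * (a - 29) ** 2 for a in ages]
estimate = lost_war(ages, war, debut_age=28)
```

Because a parabola fitted to a handful of late-career seasons is extended seven or more years outside the observed range, small changes in the inputs move the estimate substantially, which is why these figures sit in the SPECULATIVE tier.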
On the forty-years irony
The Negro Leagues operated for roughly forty years, from the founding of the Negro National League in 1920
to the last competitive season in the late 1950s. The barrier that preceded them held for forty-seven.
The gap between the leagues' end and MLB's official recognition of their statistics — from the late 1950s
to May 29, 2024 — is another sixty-five years. The players waited longer to be counted than they played.
The series is called One Hundred Years. The Negro Leagues did not get one hundred years. They got
forty, and then the door opened and the leagues died so the players could live.
This report does not argue that the statistical merger was right or wrong. It treats it as done, and measures what changed. The leaderboards reshuffled. The names appeared. The numbers — the numbers were always true.