Whistlewatch

How we calculate Bias

Plain-English version of the formula, plus its known limits.

The 0–100 Bias Index

The Bias Index is a statistical leaning score, not an accusation. It compares a referee's per-match decision patterns to the league-wide average across the same season, and rolls five sub-scores into one composite 0–100 number.

  • 0–33 · close to or under league average — low.
  • 34–66 · noticeable asymmetry — mid.
  • 67–100 · pronounced asymmetry — high, worth a closer look.
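
As an illustration, the three bands above can be written as a small lookup. This is a minimal sketch, not the site's actual code: the function name bias_band is hypothetical, and it assumes the index is rounded to a whole number before banding, since the list only defines integer boundaries.

```python
def bias_band(score: float) -> str:
    """Map a 0-100 Bias Index to the band labels used in the list above."""
    if not 0 <= score <= 100:
        raise ValueError("Bias Index must lie between 0 and 100")
    rounded = round(score)  # assumption: banding is done on whole numbers
    if rounded <= 33:
        return "low"
    if rounded <= 66:
        return "mid"
    return "high"
```

For example, bias_band(51) falls in the "mid" band.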

High values do not imply wrongdoing. They flag that the data pattern deviates from the league baseline by an amount large enough to be interesting.

Sub-scores & weights (ADR-009)

Each sub-score is a 0–100 normalisation of one decision asymmetry.

Sub-score                        | Weight | Status
Penalty imbalance (home vs away) | 0.55   | live
Card imbalance (home vs away)    | 0.45   | live
VAR overruled rate               | 0.00   | N/A · weighted out
Stoppage-time bias               | 0.00   | N/A · weighted out
Disallowed-goals bias            | 0.00   | N/A · weighted out

The three "N/A" sub-scores have no public data source we trust (ADR-012 logs the eight discovery URLs we tested), so their weight has been redistributed onto the two live sub-scores, which together carry the full weight (0.55 + 0.45 = 1.00). That is statistically cleaner than fixing them at a fake-neutral 50: the score now reflects only what we measured. If a VAR data feed becomes available in Phase 4+, the composite tightens automatically.
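
The composition can be sketched as a weighted average over whichever sub-scores are live. This is an illustration under assumptions, not the production formula: the dictionary keys and the function name composite_bias are hypothetical, and renormalising by the sum of live weights is one plausible way to express "weighted out" (ADR-009 may define it differently).

```python
from typing import Mapping, Optional

# Weights from the table above; the key names are hypothetical.
WEIGHTS = {
    "penalty_imbalance": 0.55,
    "card_imbalance": 0.45,
    "var_overruled_rate": 0.0,
    "stoppage_time_bias": 0.0,
    "disallowed_goals_bias": 0.0,
}


def composite_bias(sub_scores: Mapping[str, Optional[float]]) -> float:
    """Weighted average of the live 0-100 sub-scores.

    Sub-scores with zero weight or no data are ignored, so they never
    pull the composite towards an artificial neutral value.
    """
    live = {
        name: weight
        for name, weight in WEIGHTS.items()
        if weight > 0 and sub_scores.get(name) is not None
    }
    total_weight = sum(live.values())
    if total_weight == 0:
        raise ValueError("no live sub-scores available")
    weighted_sum = sum(weight * sub_scores[name] for name, weight in live.items())
    return weighted_sum / total_weight
```

With a penalty sub-score of 60 and a card sub-score of 40, this gives 0.55 × 60 + 0.45 × 40 = 51.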

Data source & coverage

All match data comes from FBref via the open-source soccerdata library (ADR-008). Currently covered: Bundesliga, Premier League, La Liga, Serie A, Ligue 1 and Primeira Liga, seasons 2024-2025 and 2025-2026.

Only referees with at least 10 matches in a given (league, season) appear in the leaderboard. Smaller samples can produce extreme bias values from a single fluke match and aren't statistically meaningful.
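
The 10-match cut-off amounts to counting matches per (league, season, referee) and dropping the small groups. A minimal sketch, assuming each match is available as a simple tuple; the function name and the referee names in the usage note are hypothetical, and the real pipeline may well do this on a data frame instead.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

MIN_MATCHES = 10  # leaderboard threshold stated above

MatchKey = Tuple[str, str, str]  # (league, season, referee)


def eligible_referees(matches: Iterable[MatchKey]) -> Dict[MatchKey, int]:
    """Return match counts for referees who reach the minimum sample size.

    `matches` holds one (league, season, referee) tuple per officiated match.
    """
    counts = Counter(matches)
    return {key: n for key, n in counts.items() if n >= MIN_MATCHES}
```

A referee with 10 Bundesliga matches in 2024-2025 would be kept, while one with 9 would be left off the leaderboard for that season.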

Why the range? Confidence intervals

Every referee page also shows a 90% confidence interval next to the Bias Index, e.g. "Bias 51, 90% CI 32–70". Read it as the range in which we are 90% confident the "true" bias would fall if the referee officiated infinitely many comparable matches under the same league conditions.

We compute it as a Wald interval from the per-match Poisson standard error of the live sub-scores (penalties and cards), propagated through the weighted Bias Index composition. Wide intervals mean a small sample (~11 matches) and high uncertainty; narrow intervals mean ~25+ matches and a tighter estimate.
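
To make the mechanics concrete, here is a rough sketch of a Wald interval for a weighted composite. It rests on several assumptions that ADR-011 may handle differently: the sub-score standard errors are already expressed on the 0–100 scale, the sub-scores are treated as independent, and the interval is clipped to 0–100. The function names and the standard errors in the usage note are illustrative, not real data.

```python
from math import sqrt
from statistics import NormalDist
from typing import Iterable, Tuple


def poisson_rate_se(event_count: int, matches: int) -> float:
    """Standard error of a per-match rate when the event count is Poisson."""
    return sqrt(event_count) / matches


def wald_interval(
    estimate: float,
    weighted_ses: Iterable[Tuple[float, float]],
    level: float = 0.90,
) -> Tuple[float, float]:
    """Wald interval for a weighted sum of independent sub-scores.

    `weighted_ses` holds (weight, standard_error) pairs, with standard
    errors on the same 0-100 scale as the estimate (an assumption).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% level
    composite_se = sqrt(sum((w * se) ** 2 for w, se in weighted_ses))
    lower = max(0.0, estimate - z * composite_se)   # clip to the index range
    upper = min(100.0, estimate + z * composite_se)
    return lower, upper
```

With illustrative standard errors of 15 (penalties, weight 0.55) and 18 (cards, weight 0.45) around a point estimate of 51, wald_interval(51, [(0.55, 15.0), (0.45, 18.0)]) rounds to 32–70. Because the Poisson standard error of a per-match rate shrinks as the match count grows, more matches give a narrower band.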

90% instead of 95% was a conscious choice: at our typical sample sizes a 95% interval would be ~25 points wide and visually drown out the point estimate. 90% keeps the band readable while still honestly reporting that small samples are noisy. See ADR-011 for the derivation.

Update frequency

The pipeline runs every day at 05:05 UTC via GitHub Actions (ADR-014). New match data therefore reaches whistlewatch.fans within ~24 hours of the FBref upload. If no matches were played the previous day, the deploy step is skipped to keep the git history clean.
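
The skip rule boils down to a date check before deploying. The sketch below is a hypothetical helper, not the actual workflow step from ADR-014; it assumes the pipeline has a list of match dates available and that "previous day" is measured in UTC.

```python
from datetime import date, timedelta
from typing import Iterable


def should_deploy(match_dates: Iterable[date], today: date) -> bool:
    """Return True only if at least one match was played the day before `today`."""
    yesterday = today - timedelta(days=1)
    return any(played == yesterday for played in match_dates)
```

A run on 2 March with a match dated 1 March would deploy; a run whose latest match is several days old would skip the deploy step.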

Known limitations

  • No VAR review counts (FBref does not expose per-match VAR fields)
  • No stoppage-time-per-half breakdown
  • No disallowed-goals counter
  • The weights are provisional — Phase 3+ will empirically calibrate them (ADR-010 planned)