The honest answer is that building a referee database used to require either paid data access or a genuinely painful amount of manual spreadsheet work. LLMs have changed that calculation significantly. Not completely - you still need to put the time in - but the data extraction and structuring work that used to take an evening now takes about thirty minutes once you know what you're doing. The analysis work is still yours to do. The LLM just stops being the bottleneck.
This is the practical implementation guide. Specific prompts, specific data sources, specific database structure. If you follow it, you'll have something usable within a week.
Start With the Database Structure Before You Touch Any Data
The most common mistake is pulling data first and deciding what to do with it afterwards. You end up with inconsistently structured notes that don't compare across officials, and after three or four referees the whole thing becomes unwieldy. Build the structure first, then populate it.
The fields you need for each referee are: total matches officiated in your target competitions over the tracking window, yellow cards per match, red cards per match (which will be a small number - that's fine, it still matters), penalty awards per match, fouls called per match, average added time awarded when a lead is being protected in the final fifteen minutes versus when the match is open, and a qualitative notes field for patterns that don't reduce to numbers cleanly.
That qualitative field is more important than it sounds. Some referees are genuinely hard to reduce to metrics - they manage games contextually in ways that show up as noise in the raw numbers but are identifiable once you've watched enough of their matches and read enough match reports about them. The notes field is where you capture that. "Tends to let physicality go in cup ties," "unusually quick to reach for yellow in derbies," "added time consistently generous when both teams are chasing." These observations have value even when they resist quantification.
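As a sketch of the structure described above, the fields can be captured in a small Python record before any data gets pulled. The field names here are my own illustration, not a fixed standard — adjust them to whatever your spreadsheet uses:

```python
from dataclasses import dataclass, field

@dataclass
class RefereeProfile:
    """One row per official, mirroring the fields described above."""
    name: str
    matches: int                       # matches in target competitions over the window
    yellows_per_match: float
    reds_per_match: float              # small number, still matters
    penalties_per_match: float
    fouls_per_match: float
    added_time_protecting_lead: float  # avg added minutes when a lead is protected late on
    added_time_open_match: float       # avg added minutes when the match is open
    notes: list[str] = field(default_factory=list)  # qualitative patterns

# Example population (illustrative numbers, not real data):
profile = RefereeProfile(
    name="Example Official",
    matches=34,
    yellows_per_match=4.1,
    reds_per_match=0.12,
    penalties_per_match=0.29,
    fouls_per_match=22.5,
    added_time_protecting_lead=4.8,
    added_time_open_match=5.6,
    notes=["Tends to let physicality go in cup ties"],
)
```

A flat structure like this maps one-to-one onto the one-row-per-referee spreadsheet recommended later, which keeps maintenance trivial.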
Set the tracking window at two to three seasons for established officials. One season is too noisy. More than three seasons and you're mixing in data from career phases where the referee's tendencies may have been genuinely different - officials develop and change, sometimes significantly, and pre-2021 data for a referee who's matured considerably since then can actively mislead you.
Where to Get the Raw Data
Three sources, used in combination, cover most of what you need without requiring paid subscriptions.
SofaScore and FlashScore match detail pages. Both platforms carry match-level officiating data - cards by minute, fouls called, and for most competitions the referee's name attached to the fixture. SofaScore's referee profile pages aggregate this automatically for some officials; for others you're pulling match by match. FlashScore's historical depth is slightly better for lower league fixtures. Neither is perfect but together they cover the Premier League, Championship, Scottish Premiership, and most major European leagues back three or four seasons without cost.
Match reports from BBC Sport, The Guardian, and major outlet match centres. These are where the qualitative data lives. A match report that mentions the referee allowing a strong challenge to go unpunished in the sixty-fifth minute, or awarding five minutes of added time in circumstances where three would have been conventional, is giving you exactly the kind of signal that doesn't appear in the raw card and foul tallies. You're not reading these reports for the match analysis. You're reading them for the officiating texture.
WhoScored for aggregate competition-level comparisons. Useful for calibrating your referee-specific data against competition averages. An official with 4.2 yellow cards per match sounds high until you learn the competition average is 3.9, at which point it's barely above normal. Context matters, and WhoScored gives you the competition baseline to compare against.
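The calibration step is simple enough to express directly — compare the referee's rate to the competition baseline as a ratio. The numbers here are the 4.2 versus 3.9 example from the paragraph above:

```python
def relative_rate(referee_rate: float, competition_avg: float) -> float:
    """How far above or below the competition baseline an official sits."""
    return referee_rate / competition_avg

# The 4.2 yellows/match example against a 3.9 competition average:
ratio = relative_rate(4.2, 3.9)
print(f"{(ratio - 1) * 100:.1f}% above competition average")  # prints "7.7% above competition average"
```

Storing the ratio rather than the raw rate makes officials comparable across competitions with different baselines.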
For the Premier League and Championship specifically, the referees' official performance data occasionally surfaces through media coverage, particularly around controversial decisions and accountability pieces. These aren't regular enough to build a systematic data source around but they're worth incorporating when they appear.
The LLM Extraction Workflow
Here's where it gets practical. The process has two stages: bulk data extraction from structured sources, and qualitative signal extraction from match reports.
For the bulk extraction, you're going to a referee's SofaScore or FlashScore profile, copying the match log data - either directly if the platform allows it, or by screenshot and asking the LLM to parse it - and then prompting the model to structure it. The prompt that works:
"I'm going to paste match data for a football referee. For each match, extract: date, competition, teams, yellow cards, red cards, penalties awarded, and result at full time. Structure this as a table. Flag any matches where the data appears incomplete or where you're uncertain about a value."
That last instruction matters. LLMs will fill gaps with plausible-looking numbers if you don't specifically tell them to flag uncertainty instead. The flag instruction forces the model to acknowledge what it doesn't know rather than quietly fabricating it.
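The same flagging discipline can be enforced again on your side once the model returns its table. A minimal sketch that checks parsed rows for gaps before they enter the database — the field names are assumptions matching the extraction prompt above:

```python
REQUIRED = ["date", "competition", "teams", "yellows", "reds", "penalties", "result"]

def flag_incomplete(rows: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (row index, missing fields) for any row the LLM left gaps in."""
    problems = []
    for i, row in enumerate(rows):
        missing = [f for f in REQUIRED if row.get(f) in (None, "", "?")]
        if missing:
            problems.append((i, missing))
    return problems

rows = [
    {"date": "2024-03-02", "competition": "Championship", "teams": "A v B",
     "yellows": 5, "reds": 0, "penalties": 1, "result": "2-1"},
    {"date": "2024-03-09", "competition": "Championship", "teams": "C v D",
     "yellows": None, "reds": 0, "penalties": 0, "result": "0-0"},
]
print(flag_incomplete(rows))  # second row flagged for the missing yellows value
```

This is a second line of defence, not a replacement for the flag instruction in the prompt — it catches gaps, not plausible fabrications, which is exactly why the prompt-level instruction still matters.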
For the qualitative extraction from match reports, the prompt is different. You're not asking the model to summarise the match - you're asking it to extract officiating-specific observations:
"Read this match report and extract only information relevant to the referee's decision-making. Specifically: any mentions of cards and the circumstances around them, any fouls called or not called that the report treats as significant, any added time awarded and whether the report considers it appropriate, any general characterisation of how the referee managed the game. Quote the relevant section of the report for each observation. Do not summarise match events that don't involve officiating."
The "quote the relevant section" instruction is important. It prevents the model from paraphrasing in ways that lose the specific language that tells you something about the context. A report saying a referee "surprisingly allowed a challenge that left the striker requiring treatment" is different from "the referee declined to caution the defender for a late challenge," and the paraphrased version often loses that distinction.
Run this prompt across ten to fifteen match reports for each referee. That's your qualitative layer. It takes about forty minutes per official once you've found the reports - the extraction itself is fast, the sourcing is where the time goes.
Structuring What You've Got
Once the extraction is done, you have raw data that needs structuring before it's usable. The LLM does this too, but with a specific prompt:
"I'm going to paste officiating data I've collected for a specific referee across [X] matches in [competitions] over [seasons]. I want you to calculate: average yellow cards per match, average red cards per match, average penalties per match, average fouls per match where the data exists. Then identify any patterns across the qualitative observations - recurring situations where this referee appears more or less interventionist than average. Present this as a brief referee profile I can use for pre-match analysis. Acknowledge explicitly where the sample size is too small to support a confident conclusion."
That last instruction is the one most people omit and the one that matters most. A referee with four penalty awards in twelve matches has a suggestive rate but not a reliable one. A referee with eleven penalty awards in thirty-eight matches is telling you something real. The model will conflate these if you don't specifically ask it to distinguish them.
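If you'd rather not trust the model's arithmetic at all, the quantitative half of that structuring prompt is easy to do locally, with the sample-size caveat baked in. A sketch, assuming the row format from the extraction stage; the 25-match threshold echoes the reliability figures given in the FAQ at the end of this post:

```python
def summarise(rows: list[dict], min_reliable: int = 25) -> dict:
    """Per-match averages, with an explicit note when the sample is thin."""
    n = len(rows)
    summary = {
        "matches": n,
        "yellows_per_match": round(sum(r["yellows"] for r in rows) / n, 2),
        "reds_per_match": round(sum(r["reds"] for r in rows) / n, 2),
        "penalties_per_match": round(sum(r["penalties"] for r in rows) / n, 2),
    }
    if n < min_reliable:
        summary["caveat"] = f"only {n} matches - treat rates as directional"
    return summary

# Twelve matches of illustrative data: enough for a rate, not for confidence.
rows = [{"yellows": 4, "reds": 0, "penalties": 0}] * 10 \
     + [{"yellows": 5, "reds": 1, "penalties": 1}] * 2
print(summarise(rows))
```

Doing the averages yourself and handing the model only the qualitative pattern-finding is a reasonable split: arithmetic is where LLMs quietly slip, and pattern description is where they're actually useful.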
The output you get is your referee profile. Save it in a consistent format - a simple spreadsheet works, one row per referee with the quantitative fields plus a text cell for the qualitative summary. Anything more complicated than that becomes a maintenance problem.
Building the Fixture Flagging System
This is the part that converts the database from interesting to useful. The point of the database isn't to know which referees have the most cards per match in the abstract. It's to flag upcoming fixtures where the assigned referee's tendencies suggest the market's over/under or card props line needs adjustment.
The prompt for this is:
"I have a referee profile for [official name] and an upcoming fixture. The referee's profile is: [paste your profile summary]. The fixture is [home team] vs [away team] in [competition]. The match context is [brief description - cup tie, relegation six-pointer, top-of-table clash, etc]. Based on the referee's documented tendencies, identify: whether the assigned referee's card rate suggests the player card market needs adjustment, whether the foul rate suggests the over/under line is affected, any specific contexts in the match description that align with documented tendency patterns. Be specific about which tendencies are relevant and which are not."
The "be specific about which tendencies are relevant and which are not" is what stops the model producing generic responses. Without it you get a summary of the referee profile rather than an assessment of this specific fixture.
Practically, you run this once for each weekend's fixtures where you have referee assignment data. Assignments typically appear twenty-four to forty-eight hours before kick-off for Premier League and Championship fixtures. For some competitions they're available earlier. The Thursday line opening article covered the pre-match information timeline - the referee assignment fits into that workflow at the same point as team news.
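Since this prompt runs every weekend across multiple fixtures, it's worth templating so the wording stays identical from run to run. A sketch — the template mirrors the prompt above, and all the parameter names are my own:

```python
FLAG_PROMPT = (
    "I have a referee profile for {name} and an upcoming fixture. "
    "The referee's profile is: {profile}. "
    "The fixture is {home} vs {away} in {competition}. "
    "The match context is {context}. "
    "Based on the referee's documented tendencies, identify: whether the card rate "
    "suggests the player card market needs adjustment, whether the foul rate "
    "suggests the over/under line is affected, any specific contexts in the match "
    "description that align with documented tendency patterns. Be specific about "
    "which tendencies are relevant and which are not."
)

def build_flag_prompt(name: str, profile: str, home: str, away: str,
                      competition: str, context: str) -> str:
    """Fill the fixture-flagging template for one assignment."""
    return FLAG_PROMPT.format(name=name, profile=profile, home=home,
                              away=away, competition=competition, context=context)

prompt = build_flag_prompt("Example Official", "4.1 yellows/match, lenient in cup ties",
                           "Home FC", "Away FC", "Championship", "relegation six-pointer")
print(prompt)
```

A fixed template also makes the outputs comparable week to week, which matters once you start second-guessing whether a changed assessment reflects the referee or just a rephrased prompt.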
What the Database Catches and What It Doesn't
It catches consistent behavioural tendencies that persist across different fixture types - the official who reliably produces high-card matches regardless of teams, the one whose foul rate is low enough to affect total goals markets, the one whose added time is genuinely generous compared to the competition average in specific match states.
It doesn't catch match-specific context that might override tendencies. The referee who's usually permissive managing a fixture between two clubs with a history of genuine nastiness may respond differently to the environment than his card rate predicts. The qualitative notes layer is where you flag these potential overrides, but no database fully resolves this. It's an input, not a conclusion.
It also doesn't catch recent form. A referee who picked up a serious accusation of inconsistency in November and visibly tightened his approach for the next eight matches shows up in your database as one smoothed average that understates the recent change. Running your qualitative extraction across recent matches first, and flagging where the recent reports diverge from the longer-term pattern, is the partial solution. It requires attention to the chronological sequence of the data rather than treating all matches in the window equally.
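The divergence check itself is mechanical once the data is chronological. A sketch of the comparison, assuming a per-match card series ordered oldest first; the eight-match recent window and the 0.5-card divergence threshold are illustrative choices, not established values:

```python
def recency_divergence(per_match_yellows: list[float], recent_n: int = 8,
                       threshold: float = 0.5) -> dict:
    """Compare the last recent_n matches to the full-window average.

    per_match_yellows must be chronological, oldest match first."""
    overall = sum(per_match_yellows) / len(per_match_yellows)
    recent_slice = per_match_yellows[-recent_n:]
    recent = sum(recent_slice) / len(recent_slice)
    return {
        "overall": round(overall, 2),
        "recent": round(recent, 2),
        "diverged": abs(recent - overall) >= threshold,
    }

# Illustrative series: an official whose recent matches diverge from the long run.
history = [3, 2, 4, 3, 3, 2, 3, 4, 5, 6, 5, 5, 6, 4, 5, 5]
print(recency_divergence(history))
```

When the flag trips, that's the cue to re-read the recent match reports before trusting the smoothed database profile for this weekend's fixture.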
Anyway. The database isn't difficult to build once the workflow is set up. It's maybe four hours for the first ten referees, then thirty minutes per official as you add more. The LLM does the extraction and structuring. You do the fixture-specific application. That division of labour is pretty much exactly what the prompt engineering article was describing as the right use of these tools.
Maintaining It
A referee database that gets built once and never updated is worth less than you'd think. Officials' tendencies genuinely shift - sometimes from instruction, sometimes from career stage, sometimes from changed supervisory pressure after a high-profile mistake. The database needs a monthly update pass: pull the last month's matches, run the extraction prompt, compare the recent match average to the database profile, update the profile if there's meaningful divergence.
The monthly update takes about an hour for a database covering twenty to thirty officials. That's the ongoing cost. Whether that's worth it depends entirely on how frequently you're betting markets where referee tendency is a meaningful input.
For card props and player booking markets specifically, it's worth it. For over/under total goals in fixtures between teams whose style matchup already produces a strong prior, the referee layer is probably marginal except for the officials at the extreme ends of the foul-rate distribution. You calibrate your investment in database maintenance to how often you're actually using it.
FAQ
Q: Can I ask the LLM to build the database for me from its own knowledge rather than from data I provide?
Don't. The LLM's knowledge of referee statistics is patchy, outdated, and - importantly - the model will produce confident-sounding data that it's partly fabricated. This is exactly the hallucination problem the betting content article described. The workflow here is built around providing verified data and asking the model to structure and analyse it. Asking the model to generate the data inverts the process in a way that makes it unreliable.
Q: How many matches do I need before a referee's card rate is reliable enough to use?
Roughly twenty-five to thirty matches for a confident quantitative profile. Between fifteen and twenty-five you have directional signal worth knowing but not worth heavily weighting. Under fifteen, the qualitative layer - recurring observations about specific situations - is more reliable than the aggregate numbers. The model's flagging instructions are calibrated to acknowledge this, but it's worth knowing the thresholds yourself so you're not over-relying on a thin sample.
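Those thresholds are worth encoding once so the judgement is consistent every time you consult the database. A trivial sketch using the figures from the answer above:

```python
def sample_reliability(matches: int) -> str:
    """Classify a sample size using the thresholds from the FAQ answer."""
    if matches >= 25:
        return "quantitative profile usable"
    if matches >= 15:
        return "directional signal only"
    return "lean on the qualitative layer"

for n in (12, 20, 38):
    print(n, "->", sample_reliability(n))
```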
Q: Does this work for referee assistants and VAR officials?
Partially. Referee assistant data is rarely captured at match report level unless a specific decision becomes the focal point of coverage. VAR official assignment data is available for some competitions and some of the same qualitative extraction applies - specific officials develop reputations for how they use technology review, particularly around the threshold decisions. The database structure works for both, but the data sourcing is harder and the sample sizes accumulate more slowly. Worth building if you're heavily active in competitions with consistent VAR usage, less worth the effort if your betting is spread across competitions where VAR application is inconsistent.