The Training Data Problem in Football AI: Why Historical Data Produces Systematically Backward-Looking Models

Betting Forum · Administrator · Staff member · Joined Jul 11, 2008
Every machine learning model is a compressed representation of its training data. What the model has learned is what the training data contained. What the training data contained is what football looked like during the period it was collected. If football has changed structurally in ways that aren't reflected proportionally in the training data - if the game played in 2024 differs systematically from the game played in 2012 - then a model trained heavily on historical data is pricing a version of football that no longer fully exists.

This isn't a hypothetical concern. Football has undergone one of the most rapid and widespread tactical evolutions in its history over the last decade. High pressing has moved from tactical novelty to near-universal standard. Positional play concepts have spread from elite clubs to the Championship and below. Goalkeeper sweeping and distribution have transformed from optional attributes to positional requirements. The offside trap is deployed differently. Full-backs function differently. The press trigger mechanic has been systematised across thousands of clubs who didn't employ it at all fifteen years ago.

A model trained on data from 2010 to 2020 is partly learning how football was played in 2010. A model that weights all historical data equally is pricing the 2024-25 season using statistical relationships derived partly from a game that ended years ago. The specific ways this produces systematic directional bias - and which market types are most affected - is what this article examines.

How Training Data Weighting Works and Why It Matters

Machine learning models don't simply memorise training data. They learn statistical relationships between inputs and outputs across the full dataset, weighted by how frequently specific patterns appear. Recent data and older data contribute to the learned relationships in proportion to their representation in the training set and to any explicit recency weighting the model designer applies.

The design choice of how to weight historical data is critical and its implications are rarely discussed publicly. A model trained on ten seasons of data with equal weighting learns relationships that are averages across ten seasons of football - relationships that may accurately reflect the middle of that period but systematically misrepresent both the earliest and most recent periods if the game changed across that time. A model with recency weighting - giving more weight to recent seasons than older ones - corrects some of this distortion but introduces its own tradeoffs: more recent data means smaller sample sizes, and small samples produce noisier learned relationships.
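The weighting choice can be made concrete. A minimal sketch, assuming a ten-season training window and a hypothetical exponential decay factor (both values are illustrative, not any operator's actual configuration), shows how much of the learned relationship each season contributes:

```python
import numpy as np

# Sketch: exponential recency weighting over ten seasons of training data.
# The window, decay factor, and per-season match count are illustrative assumptions.
seasons = np.arange(2015, 2025)      # training window: 2015-16 .. 2024-25
decay = 0.8                          # hypothetical weight decay per season of age

age = seasons.max() - seasons        # 0 for the newest season, 9 for the oldest
weights = decay ** age
weights /= weights.sum()             # normalise to a weight share per season

for s, w in zip(seasons, weights):
    print(f"{s}-{(s + 1) % 100:02d}: {w:.1%} of total training weight")
```

With a decay of 0.8 the newest season carries roughly seven times the weight of the oldest, yet the oldest still contributes: exactly the "diminishing but non-zero" property discussed later in the article.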

The operators who build these models make specific choices about training windows and weighting, and those choices have betting-relevant consequences. An operator who uses five seasons of equally-weighted data is working with a training set that describes football as it was played on average across that period, not football as it's currently played. If football has changed systematically across those five seasons - and it has - the model's learned relationships are already partially outdated at launch.

The structural evolution of football means this isn't a static problem that gets solved once. Even a model that was perfectly calibrated to the game as it was played in 2022 is becoming slightly less accurate every month as tactical evolution continues. The model is always chasing a moving target, and the gap between the target and the model's position is determined by how fast the game is changing relative to how frequently the model is retrained.

The Specific Structural Changes and Their Statistical Consequences

Before turning to the market implications, the specific changes to football's structure over the last decade need to be described precisely, because the direction of the resulting bias depends on what, specifically, changed.

High Pressing Proliferation

In 2012, high pressing was associated with a small number of clubs - Dortmund under Klopp, Southampton under Pochettino, a handful of others. By 2024, high pressing in various forms is deployed by a majority of professional clubs across all tiers of English football and across most major European leagues. The clubs that don't press are now the minority that has specifically chosen a different approach, not the majority default.

This structural change has specific statistical consequences. The relationship between a team's pressing metrics and their defensive performance was learned in an environment where low-pressing teams were the majority and high-pressing teams were the minority. The model learned: teams that press more tend to produce better defensive outcomes. But this relationship was partly driven by the fact that pressing teams in 2012 to 2015 were typically the most sophisticated tactical operations with above-average squad quality across all dimensions. The pressing was correlated with general quality, not just defensive approach.

As pressing has become universal, this correlation has weakened. A Championship club pressing aggressively in 2024 is doing something qualitatively different from Dortmund pressing aggressively in 2012 - they share the tactical approach but not the accompanying quality premium. A model that learned the relationship between pressing and defensive quality in the pre-proliferation era is overestimating the defensive benefit of pressing for average and below-average pressing teams in the current era.
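The quality-confounding mechanism can be illustrated with a small synthetic simulation. All coefficients here are assumptions chosen only to make the mechanism visible: in the training era, pressing intensity is correlated with unobserved squad quality, so a model regressing defensive performance on pressing alone learns an inflated coefficient that no longer holds once pressing decouples from quality:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# --- Training era (hypothetical, c. 2012-15): pressing correlates with squad quality.
quality = rng.normal(0, 1, n)                          # unobserved by the model
press_old = 0.8 * quality + 0.6 * rng.normal(0, 1, n)  # the pressers were the good teams
# Defensive performance: mostly driven by quality, small direct pressing effect (0.2).
defence_old = 1.0 * quality + 0.2 * press_old + rng.normal(0, 1, n)

# The model regresses defence on pressing alone, absorbing the quality premium.
beta_learned = np.cov(press_old, defence_old)[0, 1] / np.var(press_old, ddof=1)

# --- Current era: pressing is universal, no longer a quality signal.
press_new = rng.normal(0, 1, n)                        # independent of quality
defence_new = rng.normal(0, 1, n) + 0.2 * press_new + rng.normal(0, 1, n)
beta_current = np.cov(press_new, defence_new)[0, 1] / np.var(press_new, ddof=1)

print(f"learned pressing coefficient (old era): {beta_learned:.2f}")
print(f"true marginal effect (current era):     {beta_current:.2f}")
```

The learned coefficient comes out around 1.0 while the true marginal effect of pressing is 0.2 by construction - the model attributes the quality premium to the pressing itself, which is precisely the overestimation described above.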

Goalkeeper Distribution and Sweeping

The goalkeeper position has changed more structurally in the last decade than in the preceding fifty years. Ball-playing goalkeepers who initiate build-up through short distribution, who act as sweepers against long balls over the defensive line, and who participate in the pressing trigger from the back - these attributes were optional add-ons for most clubs in 2010 and are near-universal requirements in 2024 in systems that press high.

The statistical consequence: the historical data on goalkeeper distribution quality, passing accuracy, and ball-playing contribution is a dataset of increasing size where the early portions describe an almost entirely different skill demand profile from the later portions. A model that uses goalkeeper distribution statistics from 2012 through 2024 is partly learning about a different role - the traditional last-line-of-defence goalkeeper who punted long and organised defensively - and generalising those relationships to a role that now has fundamentally different demands.

The directional bias: expected goals against models that incorporate goalkeeper contribution using historical distribution relationships are underestimating the defensive value of genuinely ball-playing goalkeepers in systems that require them, and underestimating the defensive vulnerability of traditional goalkeepers placed in high-defensive-line systems that require sweeping and distribution they can't adequately provide.

Full-Back Transformation

The modern full-back in most progressive systems is functionally a wide midfielder when in possession - inverting into central positions to overload midfield, advancing into wide attacking positions, providing crossing threat from advanced areas. The statistical profile of a full-back in a modern system looks almost nothing like the statistical profile of a full-back in 2010. Goals, assists, key passes, progressive carries, crossing volume - all substantially elevated from the historical average for the position.

A model that has learned positional average baselines for full-back contribution using historical data is using baselines that include large amounts of data from a period when full-backs played a fundamentally different role. The expected contribution for a full-back in a modern system is being underestimated relative to a baseline that partly reflects the traditional defensive role. The prop market applications of this are specific: full-back assists, crossing, and attacking involvement props are priced from positional averages that are lower than the current structural role warrants in specific systems.

Set Piece Sophistication

Set piece design has undergone a professionalisation that has changed the statistical relationship between set piece volume and set piece outcomes. In 2012, most clubs used broadly similar set piece approaches - a few fixed routines, zonal or man-marking with modest innovation. By 2024, clubs with dedicated set piece coaches run fifty or more distinct set piece routines designed to create specific matchup advantages, exploit specific defensive vulnerabilities, and account for opponent tracking tendencies.

The statistical consequence: the historical relationship between corner volume and goals from corners has changed. A corner in 2024 at a club with a sophisticated set piece operation is a higher-xG event than a corner in 2012 at a club with basic set piece organisation. A model learning this relationship from a dataset that spans this period is learning an average that understates the current expected value of set pieces at sophisticated clubs and overstates it at clubs who haven't invested in set piece design.

The directional bias is competition-specific and club-specific, which makes it harder to generalise but more exploitable for specialists with the knowledge to identify which specific clubs are above and below the historical average set piece efficiency.

Which Market Types Are Most Affected

The structural change problem doesn't affect all markets equally. The markets most affected are those whose pricing most directly depends on historical relationships that the game's evolution has made less accurate.

Total Goals Markets

The total goals market is the most structurally affected by historical training data problems, because total goals expectation depends on both teams' attacking and defensive quality across the full possession sequence - including the pressing and build-up phases that have changed most dramatically.

The directional bias is specifically toward underestimating total goals in matches between high-pressing, possession-based teams - the match type that has become most common but was least represented in historical training data. A model that learned total goals expectations from a training set where defensive organisation dominated and pressing was unusual will underestimate total goals in the style of match that is now prevalent.

The practical market implication: systematic underpricing of the over in high-press vs high-press fixtures in competitions where the style has become predominant but where historical model training data includes large quantities of lower-press era data. The PPDA matchup matrix from earlier in this series interacts with this historical bias - the model's total goals expectations for high-PPDA vs high-PPDA matchups were learned partly from an era when such matchups were relatively rare, and the expectations are correspondingly imprecise.

Clean Sheet Markets

Clean sheet probability expectations are particularly vulnerable to the goalkeeper distribution change. The expected goals against framework that underpins clean sheet probability assessment was learned partly from an era when goalkeeper sweeping wasn't a standard defensive tool and high defensive lines weren't deployed by average-quality clubs.

The relationship between defensive line height and clean sheet probability has changed. High defensive lines with ball-playing sweeper goalkeepers produce different xG-against distributions from high defensive lines with traditional goalkeepers - the sweeper goalkeeper eliminates the high-ball-over-the-top chance type that the high line creates. A model that learned the defensive line height and clean sheet probability relationship from historical data that included both types of goalkeeping is learning an averaged relationship that misestimates both categories.

Clubs that have specifically invested in ball-playing sweeper goalkeepers for their high-line pressing system have lower clean sheet probability estimates from historically-trained models than their actual defensive performance warrants. The clean sheet market for these clubs is systematically mispriced in the direction of underestimating their clean sheet probability. The direction is consistent because it's structurally driven.

Asian Handicap Lines for Positional Style Matchups

The handicap line for a fixture between a high-possession positional play team and a low-block counter-attacking team is priced partly from historical relationships between these style archetypes and result distributions. The problem: in 2012, the positional play team against the low block produced a specific distribution of results that reflected the tactical dynamic of that era. In 2024, both the positional play sophistication and the low-block defensive sophistication have evolved in ways that change the result distribution from the historical average.

Specifically, the evolution of low-block defending - more sophisticated defensive shape, better organised pressure triggers to prevent build-up, improved transition speed when winning the ball - has partially offset the improvement in positional play attacking quality. The historical relationship between possession-based teams' quality advantage and their expected result margin has compressed as defensive sophistication has improved to match offensive sophistication. A model that learned this relationship from pre-2020 data may be overestimating the quality advantage of possession-based teams against well-organised defensive blocks, producing Asian Handicap lines that give too much credit to the possession-based team.

Prop Markets for Full-Backs and Technical Midfielders

The positional baseline problem for full-backs and technical midfielders affects any prop market priced from positional averages. The historical average for full-back attacking contribution - assists per season, key passes per 90, progressive carries per 90 - was calculated from a dataset that includes large amounts of full-back performance data from the traditional full-back era. Players in modern attacking full-back roles who are being priced against this historical positional average are being systematically undervalued.

This is the progressive carries article's market mispricing viewed through the historical data lens. The mechanism producing the underpricing isn't just that carry data isn't incorporated - it's that the positional baseline the market uses includes historical data from a different role definition, which drags the expected contribution downward relative to what modern attacking full-backs in specific systems actually produce.

The Recency Weighting Solution and Its Limits

The obvious solution to the historical data problem is recency weighting - giving more weight to recent seasons in the training data so the model reflects current football more accurately than historical football.

Recency weighting is widely used and partially effective. It reduces the magnitude of the historical bias for market types where the structural change is gradual and continuous. It doesn't eliminate the problem for two reasons.

The first reason is sample size. Giving more weight to recent data means learning relationships from smaller samples. A model trained primarily on the last two seasons of data is working with roughly 760 Premier League matches in total - 380 per season - useful but not enough to reliably learn subtle statistical relationships that require large samples to distinguish from variance. The recency weighting that would most accurately represent current football leaves the model underpowered for learning the relationships that require large samples.
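The sample-size cost of recency weighting can be quantified with the Kish effective sample size, (Σw)²/Σw². A sketch, using the fixed figure of 380 Premier League matches per season and a hypothetical aggressive decay factor:

```python
import numpy as np

# Sketch: effective sample size under exponential recency weighting.
# 380 matches per Premier League season is fixed; the decay factor is an assumption.
matches_per_season = 380
n_seasons = 10
decay = 0.6                                   # hypothetical aggressive recency weighting

w = decay ** np.arange(n_seasons)             # one weight per season, newest first
w_match = np.repeat(w, matches_per_season)    # one weight per match

# Kish effective sample size: (sum of weights)^2 / sum of squared weights
n_eff = w_match.sum() ** 2 / (w_match @ w_match)
print(f"nominal matches: {w_match.size}, effective sample: {n_eff:.0f}")
```

Ten seasons of data nominally contain 3,800 matches, but at this decay the weighting collapses them to an effective sample of roughly 1,500 - the model pays for currency with statistical power.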

The second reason is that recency weighting helps with gradual continuous change but not with step-change events. A tactical innovation that spreads rapidly across a single season - the way gegenpressing spread across Germany following Dortmund's Champions League runs - produces a step change in the relevant statistical relationships that recency weighting smooths over rather than captures. The model learns an average that includes both before and after the step change, with the relative weighting determined by how much data exists in each phase rather than by when the change actually happened.
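The smoothing-over of a step change can be shown directly. Assuming a hypothetical relationship that jumps once (the goal expectation values and the timing of the jump are invented for illustration), an exponentially recency-weighted model lands between the two regimes rather than at the current one:

```python
import numpy as np

# Sketch: recency weighting smooths a step change instead of capturing it.
# Suppose the true goals-per-match level jumps from 2.5 to 3.0 when a tactical
# innovation spreads in a single season (all values hypothetical).
true_mean = np.array([2.5] * 7 + [3.0] * 3)    # seasons oldest -> newest; step 3 seasons ago

decay = 0.8
w = decay ** np.arange(len(true_mean))[::-1]   # newest season gets weight 1.0
w /= w.sum()

model_estimate = w @ true_mean
print(f"model's learned level: {model_estimate:.2f} (current reality: 3.00)")
```

The weighted estimate comes out around 2.77: still anchored to the pre-change regime three seasons after the game actually moved, because the pre-step data retains nearly half the total weight.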

The result is that even well-designed models with appropriate recency weighting have systematic backward-looking bias in specific market types and for specific structural changes. The bias is smaller than it would be without recency weighting. It's not eliminated.

Identifying When Historical Bias Is Affecting Current Pricing

The practical question is whether the historical training data bias is identifiable in specific fixtures and specific markets in ways that allow targeted betting.

The most reliable identification method is the same cross-line comparison described in the propagating errors article, combined with a specific check for whether the mispriced fixture is one where the historical-versus-current style gap is largest.

High-press versus high-press fixtures in leagues that adopted pressing culture relatively recently - Championship clubs that have implemented pressing systems in the last three seasons - are the clearest target for historical total goals underestimation. The model's training data for these fixtures includes seasons when the same clubs were playing in a significantly lower-press style. The current fixture involves clubs whose style has changed more than their historical data reflects.

Fixtures involving clubs that have recently implemented ball-playing sweeper goalkeepers in high-line systems - particularly in the Championship and League One where this tactical evolution is less universally complete than in the Premier League - are the clearest targets for clean sheet probability underestimation. The model's defensive xG expectations for these clubs include historical data from before the system change. Their current actual clean sheet rate is above what the historical model expects.

Fixtures in competitions that have undergone rapid tactical evolution within the last three to four seasons - leagues that went from predominantly physical, low-press approaches to predominantly high-press approaches as coaching networks imported the style - represent the widest gap between historical model calibration and current reality. The gap is widest in the leagues that changed fastest.

The Calibration Update Problem

One might expect that the historical bias is constantly shrinking as operators retrain their models with new data. This is partly true but the rate of calibration is itself limited by structural factors.

Model retraining is expensive and requires significant validation work before deployment. Operators don't retrain their core pricing models weekly. Most retrain the foundational models annually or seasonally at best, with more frequent updating for specific market calibrations. The foundational model's learned relationships - the deep structural understanding of how football matches produce results - are updated slowly. The surface calibrations - the specific adjustments for current form, recent results, current squad availability - are updated more frequently.

The foundational model retrain is where the historical data problem lives. And because full retraining is expensive, the training dataset for each retrain typically includes all available historical data with recency weighting rather than a pure rolling window that would eliminate older data. Each retrain adds a new season but doesn't remove the oldest data - the oldest data's weight in the overall distribution shrinks but it doesn't disappear. The pre-2015 data that reflects a structurally different game persists in every training set as a diminishing but non-zero contributor to the learned relationships.
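The diminishing-but-persistent weight of old data can be sketched numerically. The decay factor and the 2010 start of the training history are assumptions for illustration:

```python
import numpy as np

# Sketch: each annual retrain adds a season but keeps the full history, so the
# pre-2015 seasons' weight share shrinks without ever reaching zero.
decay = 0.85          # hypothetical recency decay per season of age
first_season = 2010   # assumed start of the accumulated training history

for latest in (2018, 2021, 2024):
    seasons = np.arange(first_season, latest + 1)
    w = decay ** (latest - seasons)
    share_pre2015 = w[seasons < 2015].sum() / w.sum()
    print(f"retrain after {latest}: pre-2015 data carries {share_pre2015:.1%} of weight")
```

Under these assumptions the pre-2015 share falls from roughly a third of total weight to around a tenth across three retrains - shrinking, as the text describes, but never eliminated.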

Completely eliminating the historical data problem would require a model trained exclusively on post-2020 data - a five-season rolling window - with sufficient samples to learn the necessary relationships from that compressed history. This approach would produce better calibration to current football at the cost of smaller samples and noisier learned relationships for the tail events that require large historical samples to estimate reliably. No operator has publicly adopted this approach, and the commercial incentives don't strongly favour it because the historical data problem is not visible to recreational bettors who represent most of the revenue.

The Bettor's Calibration Advantage

The historical data problem creates a specific and durable advantage for bettors who watch current football carefully and calibrate their assessments from what they observe rather than from historical statistical relationships.

A bettor who watches thirty Championship matches per season and builds their quality assessments from current tactical observation - who sees how high the pressing lines are now, how the ball-playing goalkeepers function in current systems, how modern full-backs contribute in specific formations - is working from a sample of current football that's calibrated to the game as it's actually played. Their mental model of what to expect in specific fixture types is updated continuously by what they observe.

The pricing model is updated annually from a training set that includes substantial quantities of historically-different football. The bettor's observational model is updated weekly from current football. For the specific market types where the historical-versus-current gap is largest - total goals in high-press matchups, clean sheet probability for evolved defensive systems - the bettor's continuously-updated observational model is more accurate than the historically-anchored pricing model.

This is the most concrete version of the durable human edge from the AI pricing problem article. It's not that humans are smarter than models in the abstract. It's that humans who watch current football carefully have an observational sample that's better calibrated to current football than a model trained on a dataset that includes the pre-transformation game. The edge is specific, it's structural, and it's durably present for as long as football continues evolving faster than models can be retrained to reflect the evolution.

FAQ

Q1: Is there a way to test empirically whether historical training data bias is affecting a specific market, or is the identification necessarily qualitative?
There's a partially quantitative approach that doesn't require model access. Compare the market's expected goals estimates for specific fixture types to your own xG estimates built from current match data. If you're calculating xG from current season footage and statistics for high-press versus high-press Championship fixtures and finding that the market's implied total goals is consistently below your own estimate over twenty or more fixtures, you have evidence of systematic directional bias in that specific fixture type. The comparison requires that your own xG estimates are built from current data rather than from the same historical data the model uses. This means watching matches and calibrating from current observation rather than from historical databases - which is demanding but also the exact source of the advantage the article identifies. The quantitative test is only as good as the independence of your own estimate from the historical data that biases the market's estimate.
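The mechanical part of that test can be sketched in code. The odds, xG figures, and the Poisson assumption for total goals are all illustrative rather than a recommended model; the point is the comparison of a no-vig market probability against an independently derived estimate, repeated across fixtures:

```python
import math

def implied_prob(over_odds: float, under_odds: float) -> float:
    """No-vig implied probability of the over, from decimal over/under odds."""
    p_over, p_under = 1 / over_odds, 1 / under_odds
    return p_over / (p_over + p_under)

def poisson_over_2_5(total_xg: float) -> float:
    """P(3+ goals), treating total match goals as Poisson with mean total_xg.
    The Poisson assumption is a simplification used here for illustration."""
    p_0_to_2 = sum(math.exp(-total_xg) * total_xg ** k / math.factorial(k)
                   for k in range(3))
    return 1 - p_0_to_2

# Hypothetical sample: (market over odds, market under odds, your current-data total xG).
fixtures = [(2.10, 1.80, 3.1), (2.25, 1.70, 2.9), (2.00, 1.90, 3.3)]

edges = [poisson_over_2_5(xg) - implied_prob(o, u) for o, u, xg in fixtures]
positive = sum(e > 0 for e in edges)
print(f"{positive}/{len(fixtures)} fixtures where your estimate exceeds the market's")
```

Over twenty or more fixtures, a consistently one-sided sign on these edges is the evidence of directional bias the answer describes - subject, as the answer stresses, to your xG inputs being independent of the historical data that biases the market's own estimate.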

Q2: Do the newest AI architectures - transformer models, large language models applied to sports data - address the historical data problem better than the gradient boosted trees and neural networks that currently dominate sports pricing?
Partially, in specific ways. Transformer-based architectures with attention mechanisms are better at identifying which historical data is most relevant to a specific current prediction rather than averaging across all historical data equally. In principle, this allows a model to up-weight matches from the most recent and stylistically similar historical period when making specific predictions, reducing the contamination from distant historical data. In practice, the implementation of this capability for sports pricing is in early stages and the publicly available evidence of significant improvement over current approaches for the specific historical bias problem is thin. The more transformative development is the integration of language model capabilities that allow qualitative information - tactical analysis, press conference content, coaching philosophy descriptions - to enter the pricing model directly. This addresses the qualitative information absence problem from the AI pricing problem article rather than the historical data problem specifically. The historical data problem remains fundamentally architectural: no model learns relationships that aren't in the training data, and the training data for football AI is historical by definition.

Q3: Has the structural evolution of football been uniform across major European leagues, or have some leagues evolved faster than others in ways that create different levels of historical bias by competition?
The evolution has been notably non-uniform and the variation creates specific competition-level differences in historical bias severity. The Premier League and Bundesliga adopted high-pressing, positional play systems most rapidly - by 2018 to 2019, both leagues had reached near-saturation of pressing adoption at the upper and middle tiers. Historical data for these leagues from 2015 onward is reasonably representative of current football at the top level. Serie A and La Liga evolved more slowly, with the transition to pressing culture taking longer at the club level outside the elite. Historical data from 2015 to 2019 for these leagues represents a transitional period that's less similar to current play than the equivalent Premier League data. Ligue 1 and the Championship underwent the most rapid and recent transitions, with pressing adoption accelerating sharply from 2019 to 2022. Historical data for these leagues from before 2020 is least representative of current football, and the historical model bias is therefore largest for these competitions. The practical implication: the same model, applied across all competitions, produces more historical bias in Ligue 1 and the Championship than in the Premier League, because the historical data is less representative of current play in those competitions. Bettors who specialise in the competitions with the most rapid recent evolution and calibrate from current observation are furthest ahead of a market using historical model relationships.
 