Datasheet

Documentation for the SmartPlayFPL dataset, following the framework proposed by Gebru et al. (2021). Every dataset used to train a machine learning model should be accompanied by a datasheet documenting its motivation, composition, collection process, and intended uses.

Motivation

For what purpose was the dataset created?

The dataset was created to train and evaluate machine learning models that predict Fantasy Premier League (FPL) player points per fixture. It merges two public but separate data sources — the official FPL API and Understat's advanced football statistics — into a single, model-ready table with consistent player and club identifiers across seasons.

Who created the dataset and on behalf of which entity?

The dataset was created by Qazybek Beken as part of the SmartPlayFPL project at UC Berkeley.

Who funded the creation of the dataset?

Self-funded. No grants or external funding were involved.

Any other comments?

The feature engineering builds on OpenFPL by Daniel Groos (Groos Analytics). SmartPlay adds 16 features on top of OpenFPL's 235-column base.

Composition

What do the instances represent?

Each instance is one player in one Premier League fixture, that is, a single player-fixture pair. A player registered for a gameweek 10 fixture in the 2023-24 season has one row for that fixture, whether they played 90 minutes or none.

How many instances are there in total?

Over 150,000 rows across six seasons (2020-21 through 2025-26), growing each gameweek as the pipeline appends new data.

Does the dataset contain all possible instances?

The dataset aims to contain all FPL-registered players for every fixture in the covered seasons. It is not a sample — it is a census of every player-fixture combination where FPL recorded data.

What data does each instance consist of?

Each instance consists of 115 columns:

Group             Columns  Examples
Identity          8        season, gameweek, fpl_code, player_name, team_name, position
Match context     6        is_home, match_date, opponent_team, fixture
FPL actuals       15       total_points, minutes, goals_scored, assists, clean_sheets, bonus
FPL expected      5        expected_goals, expected_assists, expected_goal_involvements, xP
FPL ICT index     4        ict_index, influence, creativity, threat
FPL market data   5        value, selected, transfers_in, transfers_out
Understat player  13       us_xG, us_xA, us_npxG, us_shots, us_key_passes, us_xGChain
Understat team    17       us_team_xG, us_team_xGA, us_ppda, team_xG_avg, team_xGA_avg
Derived           11       us_xG_per90, points_per_million, goals_vs_xG, assists_vs_xA
Other             26       cache_price, mng_win, clearances_blocks_interceptions, team_a_score

Is there a label or target?

Yes. The primary target is total_points — the FPL points a player scored in that fixture. Secondary targets include minutes (for starting probability models) and the four outcome buckets used by SmartPlay v9 (Zeros/Blanks/Tickers/Haulers).

Is any information missing from individual instances?

Yes. Several columns have systematic NaN values:

  • Understat player data is NaN for players without an Understat mapping (~5% of players)
  • Pipeline snapshot columns are only populated for the current season's active pipeline runs
  • Manager and defensive stats are NaN for incrementally updated rows

The models handle these gaps through XGBoost's native support for missing values, which treats NaN as its own case when choosing tree splits.
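To see how large the gaps listed above are in a given copy of the file, a small audit along the following lines could be run before training. This is a sketch using only the standard library. The sample rows and their values are invented for illustration; only the column names (fpl_code, minutes, us_xG, mng_win) come from the column table above.

```python
import csv
import io
import math


def is_missing(value):
    """Treat None, empty strings, and NaN (float or the text 'nan') as missing."""
    if value is None:
        return True
    if isinstance(value, float):
        return math.isnan(value)
    return str(value).strip().lower() in ("", "nan")


def missing_share(rows, columns):
    """Return the fraction of missing values for each requested column."""
    counts = {col: 0 for col in columns}
    total = 0
    for row in rows:
        total += 1
        for col in columns:
            if is_missing(row.get(col)):
                counts[col] += 1
    return {col: (counts[col] / total if total else 0.0) for col in columns}


# Invented sample data standing in for the real CSV file.
SAMPLE = """fpl_code,minutes,us_xG,mng_win
101,90,0.42,
102,0,,
103,78,nan,
104,90,0.10,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
shares = missing_share(rows, ["us_xG", "mng_win"])
for column, share in shares.items():
    print(f"{column}: {share:.0%} missing")
```

For the real dataset, the in-memory sample would be replaced by an open file handle passed to csv.DictReader.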

Are there recommended data splits?

Yes. Temporal cross-validation — train on earlier seasons, test on a later season the model has never seen:

Split  Train               Test
CV1    2020-21 to 2022-23  2023-24
CV2    2020-21 to 2023-24  2024-25
CV3    2020-21 to 2024-25  2025-26
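The three splits follow an expanding-window pattern, which can be generated rather than hard-coded. The sketch below assumes the season column holds strings such as "2020-21"; the function names and the min_train parameter are illustrative and not taken from the project's code.

```python
SEASONS = ["2020-21", "2021-22", "2022-23", "2023-24", "2024-25", "2025-26"]


def expanding_season_splits(seasons, min_train=3):
    """Yield (name, train_seasons, test_season) with a growing training window.

    Each split trains on every season before the test season, so no
    information from the test season or later leaks into training.
    """
    for i in range(min_train, len(seasons)):
        yield f"CV{i - min_train + 1}", seasons[:i], seasons[i]


def split_rows(rows, train_seasons, test_season):
    """Partition row dicts into train and test lists by their season value."""
    train = [r for r in rows if r["season"] in train_seasons]
    test = [r for r in rows if r["season"] == test_season]
    return train, test


splits = list(expanding_season_splits(SEASONS))
for name, train_seasons, test_season in splits:
    print(name, train_seasons[0], "to", train_seasons[-1], "->", test_season)
```

With the six seasons listed and min_train=3, this reproduces the CV1 to CV3 rows of the table.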

Are there any errors, sources of noise, or redundancies?

  • ~5% of Understat-FPL player mappings are medium or low confidence, which may cause misattribution of xG/xA stats
  • Double gameweeks produce two rows per player for the same gameweek, so any model or evaluation that works at gameweek level must aggregate across fixtures
  • Some columns are near-duplicates kept for compatibility (e.g., is_home vs was_home)
  • Newly promoted teams have no historical Understat season-level stats, causing NaN in early gameweeks
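One possible way to deal with the double-gameweek rows is to sum points per (season, gameweek, fpl_code) before any gameweek-level comparison. The sketch below illustrates that option only; it is not necessarily how SmartPlay or OpenFPL handle it, and the sample rows are invented.

```python
from collections import defaultdict


def points_per_gameweek(rows):
    """Sum total_points over all fixtures a player has in one gameweek.

    Returns a dict keyed by (season, gameweek, fpl_code) holding the summed
    points and the number of fixture rows that contributed to the sum.
    """
    totals = defaultdict(float)
    fixtures = defaultdict(int)
    for row in rows:
        key = (row["season"], row["gameweek"], row["fpl_code"])
        totals[key] += row["total_points"]
        fixtures[key] += 1
    return {
        key: {"total_points": totals[key], "n_fixtures": fixtures[key]}
        for key in totals
    }


# Invented rows: player 101 has two fixtures in gameweek 25, player 102 has one.
sample_rows = [
    {"season": "2023-24", "gameweek": 25, "fpl_code": 101, "total_points": 6},
    {"season": "2023-24", "gameweek": 25, "fpl_code": 101, "total_points": 2},
    {"season": "2023-24", "gameweek": 25, "fpl_code": 102, "total_points": 1},
]

per_gameweek = points_per_gameweek(sample_rows)
for key, summary in per_gameweek.items():
    print(key, summary)
```

The n_fixtures count also makes it easy to locate double gameweeks in the first place.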

Is the dataset self-contained?

The file is self-contained for training and evaluation. To update with new gameweeks, the FPL API and Understat API are required (both public, no authentication needed).

Does the dataset contain sensitive or confidential data?

No. All data is derived from publicly available APIs. Player names and statistics are public information published by the Premier League and Understat. No private individuals are included.

Collection Process

How was the data acquired?

Programmatically from two public APIs:

  • FPL API — player identity, match results, points, market data, ICT index, expected stats
  • Understat API — player-level xG, xA, shot data, and team-level match statistics (xG, xGA, PPDA, deep completions)

Both APIs return structured JSON. Match events and results are directly observed rather than surveyed or self-reported; expected-goals metrics such as xG and xA are model estimates published by the providers.

What mechanisms were used to collect the data?

Custom Python scripts make API calls, parse JSON responses, and merge the two sources using player/club ID mappings. Rate limiting is applied (0.35s between FPL API calls) to avoid overloading the public APIs.
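A fixed delay between calls can be enforced with a small helper like the one below. This is a minimal sketch, assuming a simple minimum-interval policy; the class name, the fetch callable, and the injectable clock are illustrative and not taken from the project's scripts. Only the 0.35-second interval comes from the text above.

```python
import time


class MinIntervalLimiter:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval=0.35, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous permitted call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, fetch, limiter):
    """Call fetch(url) for each URL, pausing between calls as the limiter requires.

    `fetch` is any user-supplied callable that performs the request and
    returns the parsed response.
    """
    results = []
    for url in urls:
        limiter.wait()
        results.append(fetch(url))
    return results
```

Using time.monotonic rather than time.time keeps the spacing correct even if the system clock is adjusted while the script runs.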

Who was involved in the data collection process?

Data collection was fully automated via scripts. No crowdworkers or manual annotators were involved.

Over what timeframe was the data collected?

Six Premier League seasons: 2020-21 through 2025-26 (up to GW24). Data is collected after each gameweek's matches are completed, typically within 24 hours.

Were any ethical review processes conducted?

No formal ethical review was conducted. The dataset uses only publicly available sports statistics from official APIs. No private or sensitive data is collected.

Were the individuals notified about or did they consent to the data collection?

The individuals are professional footballers whose performance statistics are published by the Premier League as part of their public-facing Fantasy Premier League game. The data is inherently public. No additional notification or consent was sought.

Preprocessing

Was any preprocessing/cleaning/labelling done?

Yes:

  1. ID mapping — FPL player codes mapped to Understat IDs via golden record CSVs
  2. Club name normalisation — e.g., "Man City" to "Manchester City"
  3. Position mapping — FPL's "GKP" mapped to "GK"
  4. Deduplication — keyed on (season, fixture, fpl_code)
  5. Per-90 derivation — Understat stats normalised to per-90-minute rates
  6. Team season aggregates — rolling team-level averages from Understat match data
  7. Derived metrics — points_per_million, goals_vs_xG, assists_vs_xA
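Steps 4, 5, and 7 can be sketched as follows. The exact formulas are not given in this datasheet, so the ones below are assumptions: value is taken to be in units of £0.1m (the usual FPL convention), points_per_million is points divided by price in millions, and goals_vs_xG is goals minus Understat xG. The function names and the sample rows are illustrative.

```python
NAN = float("nan")


def dedupe(rows):
    """Keep the last row seen for each (season, fixture, fpl_code) key."""
    latest = {}
    for row in rows:
        latest[(row["season"], row["fixture"], row["fpl_code"])] = row
    return list(latest.values())


def per90(stat, minutes):
    """Scale a raw match stat to a per-90-minute rate; NaN if no minutes played."""
    if minutes is None or minutes <= 0:
        return NAN
    return stat * 90.0 / minutes


def add_derived(row):
    """Return a copy of the row with the derived columns added."""
    out = dict(row)
    xg = row.get("us_xG")
    if xg is None:
        xg = NAN  # players without an Understat mapping stay NaN
    out["us_xG_per90"] = per90(xg, row.get("minutes"))
    price_m = row["value"] / 10.0  # assumption: value is stored in £0.1m units
    out["points_per_million"] = row["total_points"] / price_m if price_m > 0 else NAN
    out["goals_vs_xG"] = row["goals_scored"] - xg  # assumption: sign and xG source
    return out


# Invented rows: the second repeats the first key with corrected points.
raw = [
    {"season": "2024-25", "fixture": 7, "fpl_code": 101, "minutes": 45,
     "us_xG": 0.5, "value": 100, "total_points": 5, "goals_scored": 1},
    {"season": "2024-25", "fixture": 7, "fpl_code": 101, "minutes": 45,
     "us_xG": 0.5, "value": 100, "total_points": 6, "goals_scored": 1},
]

clean = [add_derived(row) for row in dedupe(raw)]
print(clean[0]["us_xG_per90"], clean[0]["points_per_million"], clean[0]["goals_vs_xG"])
```

Keeping the last row per key means a re-run of the pipeline overwrites stale rows instead of duplicating them.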

Was the raw data saved?

No. The raw API responses are not saved. The preprocessing itself is deterministic, so the dataset can be reproduced by re-running the collection scripts, provided the API endpoints still serve the same data.

Is the preprocessing software available?

Yes. All preprocessing code is included in the public repository.

Uses

Has the dataset been used for any tasks already?

Yes. It trains and evaluates two FPL prediction models:

  • SmartPlay v9 — multi-bucket mixture XGBoost (28 models). Achieves 0.730 Spearman correlation on 2025-26 starters.
  • OpenFPL — ensemble XGBoost reimplementation (~200 models). Achieves 0.696 Spearman on the same test set.

Both models are included in the public repository with pre-trained weights and evaluation scripts.
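The Spearman figures above measure how well predicted points rank players relative to actual points. For readers unfamiliar with the metric, the sketch below computes it from scratch as the Pearson correlation of average ranks. It is an illustration of the metric, not the project's evaluation script, and it does not filter to starters because this datasheet does not define that filter. The example numbers are invented.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(predicted, actual):
    """Spearman rank correlation between predicted and actual values."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(actual))


# Invented predictions and outcomes for four players.
rho = spearman([1.5, 2.0, 4.0, 7.0], [2, 2, 6, 9])
print(round(rho, 3))
```

Because the metric depends only on ranks, a model can score well even if its predicted point values are systematically too high or too low.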

What other tasks could the dataset be used for?

  • Player valuation modelling (predicting FPL price changes)
  • Starting XI prediction
  • Team strength estimation via Understat metrics
  • Fixture difficulty modelling
  • Academic benchmarking of time-series prediction methods

Are there tasks for which the dataset should not be used?

  • Gambling or betting — this dataset should not be used to build systems optimised for sports betting markets. See our Ethics page.
  • Real-world player evaluation — FPL points are an imperfect proxy for actual football performance.
  • Demographic inference — player names are included but should not be used to infer demographic attributes.

Distribution

How is the dataset distributed?

As a CSV file (~20 MB) tracked with Git LFS in a public GitHub repository. Pre-trained model weights and inference code are distributed alongside it.

When was the dataset first distributed?

February 2026.

What licence is it distributed under?

Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Sharing and adaptation for non-commercial purposes with attribution is permitted. Commercial use is prohibited.

Have any third parties imposed restrictions on the data?

The underlying data originates from the FPL API (Premier League retains IP rights), Understat (publicly available statistics), and ChrisMusson/FPL-ID-Map (MIT licence). Users should review the FPL Terms of Service.

Maintenance

Who is maintaining the dataset?

Qazybek Beken. Contact: [email protected].

Will the dataset be updated?

Yes. New gameweeks are appended throughout the active Premier League season (August to May). Users can also update the dataset themselves using the included scripts.

Can others extend or contribute to the dataset?

Yes. The repository includes scripts for appending new gameweeks and updating player/club mappings. Contributions can be submitted via pull requests.

Framework: Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for Datasets. Communications of the ACM, 64(12), 86–92.