Building a dataset for Fantasy Premier League analysis


R


This post is part of a series:

Fantasy Premier League

  1. Building a dataset for Fantasy Premier League analysis
  2. Visualizing Premier League fixture difficulty


People often ask me if I have some sort of artificially intelligent algorithm assisting me with my Fantasy Premier League (FPL) team. Not because I’m particularly good at FPL, but because my day job is to develop data driven solutions to business problems. The answer has always been ‘no’, but the tone within one of my fantasy leagues has become rather sassy as of late, so I figured it’s time to smash out some data science.

This is the first post in a series about Fantasy Premier League, a free online game. Over the course of this series, we’ll use R to build cool stuff on top of Premier League data. In this post, we’ll build a player performance dataset covering not just recent fixtures, but also historical results going all the way back to the 2016 season.

The Fantasy Premier League API

Most match and player stats from the current season are freely available through an API. The downside is that the API only contains detailed data for the ongoing Premier League season; if we want historical data we’ll have to look elsewhere.

The majority of the APIs data can be accessed through its bootstrap-static dataset. You can view this data in your browser by visiting https://fantasy.premierleague.com/api/bootstrap-static/. Note that this dataset gets updated throughout the season, and will disappear to make room for next season’s data sometime during summer 2020. The structure of the data, and even the APIs URL, has been known to change over time, so the code in this post may stop working some day.

The following R code downloads the bootstrap-static dataset into your session:

library(tidyverse)
library(jsonlite)

# Download the `bootstrap-static` dataset into your R session
url <- "https://fantasy.premierleague.com/api/"
bootstrap_static <- fromJSON(paste0(url, "bootstrap-static/"))

The boostrap_static object defined above is a list of lists and tables. We are mostly interested in two items from that list; elements and teams. The following code chunk extracts the teams table from bootstrap_static:

# Get a list of Permier League teams, along with their id values
teams <- bootstrap_static$teams %>%
  select(name, code, id) %>%
  as_tibble()

teams
## # A tibble: 20 x 3
##    name            code    id
##    <chr>          <int> <int>
##  1 Arsenal            3     1
##  2 Aston Villa        7     2
##  3 Bournemouth       91     3
##  4 Brighton          36     4
##  5 Burnley           90     5
##  6 Chelsea            8     6
##  7 Crystal Palace    31     7
##  8 Everton           11     8
##  9 Leicester         13     9
## 10 Liverpool         14    10
## 11 Man City          43    11
## 12 Man Utd            1    12
## 13 Newcastle          4    13
## 14 Norwich           45    14
## 15 Sheffield Utd     49    15
## 16 Southampton       20    16
## 17 Spurs              6    17
## 18 Watford           57    18
## 19 West Ham          21    19
## 20 Wolves            39    20

You may be wondering why each team has two numerical identifiers (id and code). code is a permanent id number that can be used to identify a team across different seasons. id has values from 1 to 20, and is based on the alphabetical order of team names within a season. I don’t know why they need both, our task would be simpler if the APIs data structure was built around code instead of id, but it’s nothing a bunch a code won’t fix.

The next step is to get data about individual players. The following code chunk extracts the elements table from bootstrap_static. Note that the API refers to players as elements.

# Display player data included in `bootstrap_static`
players <- bootstrap_static$elements %>%
  as_tibble()

players
## # A tibble: 529 x 53
##    chance_of_playi… chance_of_playi…   code cost_change_eve… cost_change_eve… cost_change_sta… cost_change_sta… dreamteam_count element_type
##               <int>            <int>  <int>            <int>            <int>            <int>            <int>           <int>        <int>
##  1               NA               NA  69140                0                0                0                0               0            2
##  2                0                0  98745                0                0                0                0               0            2
##  3              100               75 111457                0                0                0                0               0            2
##  4               NA               NA 154043                0                0                0                0               1            2
##  5               NA               NA  39476                0                0                0                0               0            2
##  6               NA               NA  38411                0                0                0                0               0            2
##  7                0                0  51507                0                0                0                0               0            2
##  8                0                0 233963                0                0                0                0               0            2
##  9                0                0  80254                0                0                0                0               0            2
## 10                0                0 156074                0                0                0                0               0            2
## # … with 519 more rows, and 44 more variables: ep_next <chr>, ep_this <chr>, event_points <int>, first_name <chr>, form <chr>, id <int>,
## #   in_dreamteam <lgl>, news <chr>, news_added <chr>, now_cost <int>, photo <chr>, points_per_game <chr>, second_name <chr>,
## #   selected_by_percent <chr>, special <lgl>, squad_number <lgl>, status <chr>, team <int>, team_code <int>, total_points <int>,
## #   transfers_in <int>, transfers_in_event <int>, transfers_out <int>, transfers_out_event <int>, value_form <chr>, value_season <chr>,
## #   web_name <chr>, minutes <int>, goals_scored <int>, assists <int>, clean_sheets <int>, goals_conceded <int>, own_goals <int>,
## #   penalties_saved <int>, penalties_missed <int>, yellow_cards <int>, red_cards <int>, saves <int>, bonus <int>, bps <int>, influence <chr>,
## #   creativity <chr>, threat <chr>, ict_index <chr>

More player data is available through the element-summary subdomain of the API. Each player has a separate URL pointing to their element-summary table. To view a player’s element-summary in your browser, visit https://fantasy.premierleague.com/api/element-summary/PLAYER-ID-VALUE. The following chunk of R code will collect the summaries for all player id values present in the bootstrap_static table. It’s accessing a lot of data through a lot of different sub-domains, so it may take a few minutes to run.

# Use `bootstrap_static` player id's to access richer player data through the API
players <- bootstrap_static$elements %>%
  as_tibble()  %>%
  mutate(
    # Merge the name columns
    name = paste(first_name, second_name),
    # Make the `status` column more interpretable
    status = case_when(
      status == "a" ~ "Available",
      status == "i" ~ "Injured",
      status == "u" ~ "Unavailable",
      status == "d" ~ "Doubtful",
      status == "n" ~ "On loan",
      status == "s" ~ "Suspended"
    ),
    # Download the element-summary dataset for each player
    element_summary = map(id, function(x) {
      paste0(url, "element-summary/", x, "/") %>% 
        fromJSON()
    }),
    # Extract data on upcoming fixtures for each player
    fixtures = map(element_summary, function(x) {
        x$fixtures %>%
        mutate(
          # Fix time format
          kickoff_time = kickoff_time %>% 
            str_replace("T", " ") %>% 
            str_replace("Z", "") %>% 
            as.POSIXct()
        ) %>% 
        as_tibble()
    }),
    # Extract data on match statistics from games played this season
    history = map(element_summary, function(x) {
      x$history %>% 
        mutate(
          # Fix format
          kickoff_time = kickoff_time %>% 
            str_replace("T", " ") %>% 
            str_replace("Z", "") %>% 
            as.POSIXct()
        ) %>%        
        # Fix format issues (numeric variables being read as strings)
        mutate_at(
          vars(influence, creativity, threat, ict_index), 
          funs(as.numeric)
        ) %>% 
        as_tibble()
    })
  ) %>% 
  select(
    name, player_id = id, player_code = code, 
    element_type, status, team_code, fixtures, history
  )    

players
## # A tibble: 529 x 8
##    name                      player_id player_code element_type status      team_code fixtures           history          
##    <chr>                         <int>       <int>        <int> <chr>           <int> <list>             <list>           
##  1 Shkodran Mustafi                  1       69140            2 Available           3 <tibble [36 × 13]> <tibble [2 × 31]>
##  2 Héctor Bellerín                   2       98745            2 Injured             3 <tibble [36 × 13]> <tibble [2 × 31]>
##  3 Sead Kolasinac                    3      111457            2 Available           3 <tibble [36 × 13]> <tibble [2 × 31]>
##  4 Ainsley Maitland-Niles            4      154043            2 Available           3 <tibble [36 × 13]> <tibble [2 × 31]>
##  5 Sokratis Papastathopoulos         5       39476            2 Available           3 <tibble [36 × 13]> <tibble [2 × 31]>
##  6 Nacho Monreal                     6       38411            2 Available           3 <tibble [36 × 13]> <tibble [2 × 31]>
##  7 Laurent Koscielny                 7       51507            2 Unavailable         3 <tibble [36 × 13]> <tibble [2 × 31]>
##  8 Konstantinos Mavropanos           8      233963            2 Injured             3 <tibble [36 × 13]> <tibble [2 × 31]>
##  9 Carl Jenkinson                    9       80254            2 Unavailable         3 <tibble [36 × 13]> <tibble [2 × 31]>
## 10 Rob Holding                      10      156074            2 Injured             3 <tibble [36 × 13]> <tibble [2 × 31]>
## # … with 519 more rows

players is now an up to date overview of each player’s current season. The history column contains nested data on games played this season:

# Display an example of the `history` datasets
players$history[[1]]
## # A tibble: 2 x 31
##   element fixture opponent_team total_points was_home kickoff_time        team_h_score team_a_score round minutes goals_scored assists
##     <int>   <int>         <int>        <int> <lgl>    <dttm>                     <int>        <int> <int>   <int>        <int>   <int>
## 1       1      10            13            0 FALSE    2019-08-11 11:00:00            0            1     1       0            0       0
## 2       1      11             5            0 TRUE     2019-08-17 09:30:00            2            1     2       0            0       0
## # … with 19 more variables: clean_sheets <int>, goals_conceded <int>, own_goals <int>, penalties_saved <int>, penalties_missed <int>,
## #   yellow_cards <int>, red_cards <int>, saves <int>, bonus <int>, bps <int>, influence <dbl>, creativity <dbl>, threat <dbl>, ict_index <dbl>,
## #   value <int>, transfers_balance <int>, selected <int>, transfers_in <int>, transfers_out <int>

The fixtures column contains data on upcoming games:

# Display an example of the `fixtures` datasets
players$fixtures[[1]]
## # A tibble: 36 x 13
##       code team_h team_h_score team_a team_a_score event finished minutes provisional_start_ti… kickoff_time        event_name is_home difficulty
##      <int>  <int> <lgl>         <int> <lgl>        <int> <lgl>      <int> <lgl>                 <dttm>              <chr>      <lgl>        <int>
##  1 1059725     10 NA                1 NA               3 FALSE          0 FALSE                 2019-08-24 14:30:00 Gameweek 3 FALSE            5
##  2 1059732      1 NA               17 NA               4 FALSE          0 FALSE                 2019-09-01 13:30:00 Gameweek 4 TRUE             4
##  3 1059750     18 NA                1 NA               5 FALSE          0 FALSE                 2019-09-15 13:30:00 Gameweek 5 FALSE            3
##  4 1059752      1 NA                2 NA               6 FALSE          0 FALSE                 2019-09-22 13:30:00 Gameweek 6 TRUE             2
##  5 1059768     12 NA                1 NA               7 FALSE          0 FALSE                 2019-09-30 17:00:00 Gameweek 7 FALSE            4
##  6 1059772      1 NA                3 NA               8 FALSE          0 FALSE                 2019-10-06 13:30:00 Gameweek 8 TRUE             2
##  7 1059789     15 NA                1 NA               9 FALSE          0 FALSE                 2019-10-21 17:00:00 Gameweek 9 FALSE            2
##  8 1059792      1 NA                7 NA              10 FALSE          0 FALSE                 2019-10-27 15:30:00 Gameweek … TRUE             3
##  9 1059803      1 NA               20 NA              11 FALSE          0 FALSE                 2019-11-02 14:00:00 Gameweek … TRUE             3
## 10 1059814      9 NA                1 NA              12 FALSE          0 FALSE                 2019-11-10 14:00:00 Gameweek … FALSE            3
## # … with 26 more rows

Note that in history and fixtures, teams are referenced by their season specific id number.

API data from previous seasons

Each player’s nested history table only contains two rows, because at the time of writing, only two rounds of the current season has been played. This isn’t really enough data to do anything interesting. Like I mentioned earlier, we’ll need to look outside the API if we want data from previous seasons. My favorite source for historical data is vaastav’s GitHub repository. Vaastav collects data from the API each year, and puts in on GitHub so people can still access it after they wipe the API. What we want to do, is combine current data from the API with historical data from vaastav’s repository. Why not just the repository instead of the API you might ask, the answer is that it’s missing some information that’s available through the API, and because the API probably updates faster than the repository.

# Download data for the past couple of seasons from vaastav's GitHub repository
vaastav <- tibble(
  gameweek = rep(1:38, 3) %>% paste0("gw", .),
  season = c(rep("2016-17", 38), rep("2017-18", 38), rep("2018-19", 38))
) %>% 
  mutate(
    gw_data = map2(season, gameweek, function(x, y) {
      paste0(
        "https://raw.githubusercontent.com/",
        "vaastav/Fantasy-Premier-League/master/data/",
        x, "/gws/", y, ".csv"
      ) %>% 
      read_csv()
    })
  ) %>% 
  # Merge into one long dataset for each season
  group_by(season) %>% 
  summarize(
    gw_data = list(bind_rows(gw_data))
  ) %>% 
  ungroup() %>% 
  mutate(
    # Download additional data about players from the `players_raw.csv` files
    player_info = map(season, function(x) {
      paste0(
        "https://raw.githubusercontent.com/",
        "vaastav/Fantasy-Premier-League/master/data/",
        x, "/players_raw.csv"
      ) %>% 
        read_csv() %>% 
        mutate(name = paste(first_name, second_name)) %>% 
        select(element = id, player_code = code, name, element_type)
    }),
    # Add useful columns from the `players_raw.csv` files to the first bunch of files we downloaded
    clean_gw_data = map2(gw_data, player_info, function(x, y) {
      x %>% 
        # The `name` variables from the gw files is a bit messed up, so we'll drop it
        select(-name) %>% 
        # ..and replace it with the `name` variable from `player_info`
        left_join(., y, by = "element") %>% 
        # Only keep variables present in the `history` dataset from the API
        # since our aim is to merge all of our data into one big table
        select(name, player_code, element_type, names(players$history[[1]]))
    })
  ) %>% 
  # Combine all the modified `gw` files into one long dataset
  pull(clean_gw_data) %>% 
  map_dfr(., ~bind_rows(.x)) %>% 
  # Nest the data by player_code so the result looks like our `players` dataset
  group_by(name, element_type, player_code) %>% 
  nest(.key = "history")

Sometimes, a player’s name can change between seasons, or at least the way it’s written within the API. A player’s position can also change over time. The vaastav table created above was intentionally nested by name, element_type (position) and player_code to highlight these issues, and the importance of using player_code when combining data from different seasons. Here are some examples of why you shouldn’t attempt to merge historical FPL datasets by player names / positions:

# `player_code` values with multiple associated names / positions
duplicates <- vaastav$player_code[which(vaastav$player_code %>% duplicated)]

# Display the duplicates
vaastav %>%
  filter(player_code %in% duplicates) %>%
  arrange(player_code)
## # A tibble: 58 x 4
##    name                      element_type player_code history           
##    <chr>                            <dbl>       <dbl> <list>            
##  1 James Milner                         3       15157 <tibble [76 × 31]>
##  2 James Milner                         2       15157 <tibble [38 × 31]>
##  3 Ashley Young                         3       18892 <tibble [76 × 31]>
##  4 Ashley Young                         2       18892 <tibble [38 × 31]>
##  5 Chris Brunt                          2       19151 <tibble [38 × 31]>
##  6 Chris Brunt                          3       19151 <tibble [38 × 31]>
##  7 Bojan Krkic                          3       40276 <tibble [38 × 31]>
##  8 Bojan Krkic Perez                    3       40276 <tibble [38 × 31]>
##  9 Eduardo Dos Reis Carvalho            1       41251 <tibble [7 × 31]> 
## 10 Eduardo dos Reis Carvalho            1       41251 <tibble [38 × 31]>
## # … with 48 more rows

Below we fix the name/position issues so the vaastav table can be combined with our players table from the FPL API.

# Merge rows belonging to a single player
vaastav <- vaastav %>%
  mutate(
    # Add `element_type` to the `history` datasets
    history = map2(history, element_type, function(x, y) {
      x$element_type <- y
      return(x)
    })
  ) %>% 
  # Collapse into one row per `player_code`
  group_by(player_code) %>% 
  summarize(
    name = last(name),
    history = list(bind_rows(history))
  ) %>%
  ungroup() %>% 
  arrange(player_code)

Please note that if a player’s position changed during an ongoing season, as opposed to in between seasons, we are not able to capture that with our current data (I don’t know if this actually happens). In the vaastav table, we get element_type from his players_raw.csv files; these files only describe a player’s position at the end of a season. In players, we have an opportunity to capture position changes, but I didn’t bother doing it because I’m not able to correct the problem in vaastav. As a consequence, each player’s position will be fixed within a given season, even if it actually changed in real life. For the time being, this is a potential problem that we’ll just have to live with.

We’re now left with a table, vaastav, where each row represents a player, and contains a nested dataset over his performance going back to the start of the 2016 Premier League season:

vaastav 
## # A tibble: 1,086 x 3
##    player_code name                history           
##          <dbl> <chr>               <list>            
##  1        1243 Robert Green        <tibble [33 × 32]>
##  2        1616 Alexander Manninger <tibble [38 × 32]>
##  3        1632 Gareth Barry        <tibble [76 × 32]>
##  4        1718 John Terry          <tibble [38 × 32]>
##  5        1801 Paul Robinson       <tibble [76 × 32]>
##  6        1822 Shay Given          <tibble [38 × 32]>
##  7        2404 Michael Carrick     <tibble [76 × 32]>
##  8        2513 Jamie Murphy        <tibble [38 × 32]>
##  9        3201 Stuart Taylor       <tibble [67 × 32]>
## 10        3736 John O'Shea         <tibble [38 × 32]>
## # … with 1,076 more rows

Combining data from current and previous seasons

At this stage, we’ve got the players table containing data from the current PL season, and the vaastav table containing data from previous PL seasons. We want to combine these tables, leaving us with a rich description of the real world phenomenon we call the Premier League. With this table, we’ll be able to visualize complex information, make predictions, and beat my friends at a silly online game.

# Combine `vaastav` and `players`
completed_fixtures <- vaastav %>% 
  left_join(
    players %>% 
      select(player_code, current_season_history = history),
    by = "player_code"
  ) %>% 
  mutate(
    # Add a helper variable indicating whether a player still plays in the Premier League
    in_current_season = map_lgl(current_season_history, ~!is.null(.x)),
    # Merge the history datasets
    history = map2(history, current_season_history, function(x, y) {
      if(is.null(y)) return(x)
      else return(bind_rows(x, y))
    })
  ) %>% 
  select(-current_season_history) %>% 
  unnest()

completed_fixtures
## # A tibble: 68,784 x 35
##    player_code name  in_current_seas… element fixture opponent_team total_points was_home kickoff_time        team_h_score team_a_score round
##          <dbl> <chr> <lgl>              <dbl>   <dbl>         <dbl>        <dbl> <lgl>    <dttm>                     <dbl>        <dbl> <dbl>
##  1        1243 Robe… FALSE                547      51             4            0 FALSE    2017-09-23 14:00:00            0            0     6
##  2        1243 Robe… FALSE                547      63            17            0 TRUE     2017-09-30 11:30:00            0            4     7
##  3        1243 Robe… FALSE                547      78            16            0 FALSE    2017-10-14 14:00:00            2            0     8
##  4        1243 Robe… FALSE                547      83            12            0 TRUE     2017-10-21 14:00:00            2            1     9
##  5        1243 Robe… FALSE                547      97            10            0 FALSE    2017-10-28 14:00:00            3            0    10
##  6        1243 Robe… FALSE                547     103            19            0 TRUE     2017-11-04 15:00:00            1            0    11
##  7        1243 Robe… FALSE                547     112             2            0 FALSE    2017-11-18 15:00:00            4            0    12
##  8        1243 Robe… FALSE                547     122            11            0 TRUE     2017-11-26 16:00:00            1            2    13
##  9        1243 Robe… FALSE                547     135             1            0 FALSE    2017-11-29 19:45:00            5            0    14
## 10        1243 Robe… FALSE                547     145             7            0 FALSE    2017-12-02 15:00:00            2            0    15
## # … with 68,774 more rows, and 23 more variables: minutes <dbl>, goals_scored <dbl>, assists <dbl>, clean_sheets <dbl>, goals_conceded <dbl>,
## #   own_goals <dbl>, penalties_saved <dbl>, penalties_missed <dbl>, yellow_cards <dbl>, red_cards <dbl>, saves <dbl>, bonus <dbl>, bps <dbl>,
## #   influence <dbl>, creativity <dbl>, threat <dbl>, ict_index <dbl>, value <dbl>, transfers_balance <dbl>, selected <dbl>, transfers_in <dbl>,
## #   transfers_out <dbl>, element_type <dbl>

The rows in completed_fixtures are a single player’s stats from a single game, let’s call them player-fixtures. Now that we’ve got all of our 68784 player-fixtures stored in a single table, we can perform some simple cleaning operations to make the data a bit more user friendly.

We’ll perform the following cleaning operations:

  • Add a categorical variable indicating which PL season each fixture belongs to.
  • Replace team id values with the team’s name.
  • Replace element_type with a more interpretable position variable.

Adding team names to our completed_fixtures table is a bit of an ordeal. We have our teams table from the API, but this table doesn’t contain information about teams that are not currently playing in the Premier League. Vaastav’s repository does contain team_codes, but it doesn’t contain team names. Furthermore, our completed_fixtures table references teams by their id value (which can change every season), not their team_code.

To get around all this, we’ll start by building an overview of all team codes and team names going back to the 2016 season:

# Get previous season (2018-19) team names from vaastav's repository
team_names <- paste0(
    "https://raw.githubusercontent.com/",
    "vaastav/Fantasy-Premier-League/master/data/2018-19/raw.json"
  ) %>% 
  fromJSON(.) %>% 
  .$teams %>% 
  as_tibble() %>% 
  bind_rows(
    # Get current season (2019-20) team names from the FPL api.
    bootstrap_static$teams
  ) %>% 
  select(code, name) %>% 
  # Add some missing teams manually
  bind_rows(
    tribble(
      ~code, ~name, 
      88,  "Hull",
      25,  "Middlesbrough",
      56,  "Sunderland",
      35,  "West Bromwich Albion",
      80,  "Swansea City",
      110, "Stoke City"
    )
  ) %>% 
  distinct()

team_names
## # A tibble: 29 x 2
##     code name          
##    <dbl> <chr>         
##  1     3 Arsenal       
##  2    91 Bournemouth   
##  3    36 Brighton      
##  4    90 Burnley       
##  5    97 Cardiff       
##  6     8 Chelsea       
##  7    31 Crystal Palace
##  8    11 Everton       
##  9    54 Fulham        
## 10    38 Huddersfield  
## # … with 19 more rows

The next step is to build an overview over team id values and team_code for each season. Vaastav has this covered, so we’ll just use data from his repository, including the latest season (which we’ve been using the API for up to this point).

# Get `id` and `code` values for each team, from each season
team_ids <- tibble(
  season = c("2016-17","2017-18","2018-19", "2019-20")
) %>% 
  mutate(
    player_info = map(season, function(x) {
      paste0(
        "https://raw.githubusercontent.com/",
        "vaastav/Fantasy-Premier-League/master/data/",
        x, "/players_raw.csv"
      ) %>% 
        read_csv() %>% 
        select(team, team_code) %>% 
        distinct()
    })
  ) %>% 
  unnest() %>% 
  as_tibble() %>% 
  select(season, id = team, code = team_code) %>% 
  # Add team names
  left_join(team_names, by = "code") %>% 
  arrange(season, id)

team_ids
## # A tibble: 80 x 4
##    season     id  code name          
##    <chr>   <dbl> <dbl> <chr>         
##  1 2016-17     1     3 Arsenal       
##  2 2016-17     2    91 Bournemouth   
##  3 2016-17     3    90 Burnley       
##  4 2016-17     4     8 Chelsea       
##  5 2016-17     5    31 Crystal Palace
##  6 2016-17     6    11 Everton       
##  7 2016-17     7    88 Hull          
##  8 2016-17     8    13 Leicester     
##  9 2016-17     9    14 Liverpool     
## 10 2016-17    10    43 Man City      
## # … with 70 more rows

Finally, we’ll create the season variable, replace the id values in completed_fixtures with team names, and create the position variable based on element_type:

# Helper function for identifying which PL season a given date belongs to
get_season <- function(x) {
  require(lubridate)
  
  year <- year(x)
  month <- month(x)
  
  case_when(
    month %in% c(1:7) ~ paste(year - 1, str_sub(year, 3, 4), sep = "-"),
    month %in% c(8:12) ~ paste(year, str_sub(year + 1, 3, 4), sep = "-")
  )
}

completed_fixtures <- completed_fixtures %>% 
  # Add a `season` column
  mutate(season = map_chr(kickoff_time, ~get_season(.x))) %>% 
  # Replace `opponent_team` id values with team names
  rename(opponent_team_id = opponent_team) %>% 
  left_join(
    team_ids %>% 
      select(season, opponent_team_id = id, opponent_team = name),
    by = c("season", "opponent_team_id")
  ) %>% 
  select(-opponent_team_id) %>% 
  # Add a `team_name` variable for indicating what team a player belongs to
  left_join(
    {. -> x; x %>% 
      group_by(season, fixture, was_home) %>% 
      summarize(team_name = first(opponent_team)) %>% 
      mutate(was_home = !was_home)},
    by = c("season", "fixture", "was_home")
  ) %>% 
  # Create the `position` variable
  mutate(
    position = case_when(
      element_type == 1 ~ "GK",
      element_type == 2 ~ "DEF",
      element_type == 3 ~ "MID",
      element_type == 4 ~ "FWD"
    )
  ) %>% 
  # Order columns
  select(
    player_code, name, kickoff_time, season, team_name, opponent_team, position,
    everything(), -element_type
  ) %>% 
  # Order rows
  arrange(season, fixture, team_name)

That’s it. Behold our final table in all its glory (only showing 100 of 68822 rows to avoid slowing down your browser):

It’s worth noting that we have limited ourselves to columns that are present in the current version of the API, and all of the csv files in Vaastav. This means that we’ve thrown away a few columns that are only available in one of the two sources; not ideal, but if we didn’t take this approach we’d get a lot of trouble with missing values when we get to the predictive modelling stage of our path to FPL dominance.


Feel free to leave a comment (requires a GitHub account):