# install and load packages
#install.packages("nflreadr")
library('nflreadr')
library("tidyverse")
library('pROC')
library('tidymodels')
Intro
Hello everyone! Today I want to share a tutorial on using nflreadr to read historic NFL results and model game outcomes with tidymodels.
First, a quick introduction of the R packages we'll use. nflreadr is part of the nflverse family of packages, which make it easy and efficient to obtain NFL data, including past game results and statistics. In this post, we'll use its suite of functions to get the data we need to build a simple predictive model. We'll also use the tidymodels package to set up our model and the tidyverse for data cleaning and manipulation.
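If you're new to nflreadr, it helps to know that it exposes a family of load_*() functions, one per data source. A couple of examples beyond the ones used in this post, shown purely for orientation (we won't use these objects below):
# Other nflreadr loaders, shown only as examples
rosters <- load_rosters(2023)            # team rosters for a season
player_stats <- load_player_stats(2023)  # weekly player-level stats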
Load Data
Now that we have the relevant packages loaded, let's start putting our data together. Starting with game data, we'll pull schedules and results from 2011 through 2024. Each row (record) of the schedule represents a game: there's a home and away team, the corresponding scores, and more contextual information for each game.
# Scrape schedule Results
load_schedules(seasons = seq(2011,2024)) -> nfl_game_results
head(nfl_game_results)
# A tibble: 6 × 46
game_id season game_type week gameday weekday gametime away_team away_score
<chr> <int> <chr> <int> <chr> <chr> <chr> <chr> <int>
1 2011_01_… 2011 REG 1 2011-0… Thursd… 20:30 NO 34
2 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 PIT 7
3 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 ATL 12
4 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 CIN 27
5 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 IND 7
6 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 TEN 14
# ℹ 37 more variables: home_team <chr>, home_score <int>, location <chr>,
# result <int>, total <int>, overtime <int>, old_game_id <chr>, gsis <int>,
# nfl_detail_id <chr>, pfr <chr>, pff <int>, espn <chr>, ftn <int>,
# away_rest <int>, home_rest <int>, away_moneyline <int>,
# home_moneyline <int>, spread_line <dbl>, away_spread_odds <int>,
# home_spread_odds <int>, total_line <dbl>, under_odds <int>,
# over_odds <int>, div_game <int>, roof <chr>, surface <chr>, temp <int>, …
We've loaded our schedules with some interesting variables to use in our model. However, the data isn't quite in the format we need. Ideally, we'd like to feed in two teams and have the model give us a winner. Let's use pivot_longer() to rearrange the dataset so that each row is one team's side of a game, and select some simple variables before making the matchup set.
nfl_game_results %>%
# Remove the upcoming season
filter(season < 2024) %>%
pivot_longer(cols = c(away_team,home_team),
names_to = "home_away",
values_to = "team") %>%
mutate(team_score = ifelse(home_away == "home_team", yes = home_score, no = away_score),
opp_score = ifelse(home_away == "home_team", away_score, home_score)) %>%
# Sort for the cumulative averages
arrange(season,week) %>%
select(season,game_id,team,team_score,opp_score,week) -> team_games
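As a quick sanity check (an extra snippet, not strictly required), every game_id in team_games should now appear exactly twice, once for each team:
# Sanity check: each game should contribute exactly two rows (one per team)
team_games %>%
  count(game_id) %>%
  filter(n != 2)
An empty result means the reshape worked as expected.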
Feature Engineering
Our goal is to be able to predict the outcome of each game. To do that, we need to think about what impacts the outcome before the teams even take the field.
In the case of an NFL game, that could be player skill level, how far a team has to travel, injuries, even the food players ate the night before. Using nflreadr, we can see several variables that could plausibly affect a game's outcome, from injuries to previous results.
We'll start by pulling in previous results. By using previous results, we can hopefully capture a team's quality as a predictor of how it will perform in its next game. There are several ways to quantify team strength, some more complex than others; for this tutorial, we'll use cumulative results as a measure of team strength, in the form of cumulative points scored/allowed and winning percentage leading up to each game.
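Before the real calculation below, here is a tiny toy example (made-up numbers) showing how cummean() combined with lag() gives us averages that only use games already played:
# Toy example: entering week 3, the lagged cumulative average reflects weeks 1-2 only
tibble(week = 1:4, team_score = c(20, 30, 10, 40)) %>%
  mutate(cumul_score_mean = cummean(team_score),
         cumul_score_lag_1 = lag(cumul_score_mean, 1))
The lagged column is the one we'll feed the model, since the unlagged value already includes the current game's score.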
team_games %>%
arrange(week) %>%
group_by(season,team) %>%
# For each team-season, calculate the cumulative scoring averages and record after each week
mutate(cumul_score_mean = cummean(team_score),
cumul_score_opp = cummean(opp_score),
cumul_wins = cumsum(team_score > opp_score),
cumul_losses = cumsum(team_score < opp_score),
cumul_ties = cumsum(team_score == opp_score),
cumul_win_pct = cumul_wins / (cumul_wins + cumul_losses),
# Create the lag variable
cumul_win_pct_lag_1 = lag(cumul_win_pct,1),
cumul_score_lag_1 = lag(cumul_score_mean,1),
cumul_opp_lag_1 = lag(cumul_score_opp,1)
) %>%
# Non-lag variables leak info
select(week,game_id,contains('lag_1')) %>%
ungroup() -> cumul_avgs
Let's also calculate each team's winning percentage from the previous season as a feature.
team_games %>%
group_by(season,team) %>%
summarise(wins = sum(team_score > opp_score),
losses = sum(team_score < opp_score),
ties = sum(team_score == opp_score))%>%
ungroup() %>%
arrange(season) %>%
group_by(team) %>%
mutate(win_pct = wins / (wins + losses),
lag1_win_pct = lag(win_pct,1)) %>%
ungroup() -> team_win_pct
This should be a good start, but I still feel like something is missing. Football is a dangerous game and players regularly get injured. Thankfully, nflreadr provides weekly injury reports. Let's try incorporating them into our model.
# Load depth charts and injury reports
dc = load_depth_charts(seq(2011,most_recent_season()))
injuries = load_injuries(seq(2011,most_recent_season()))
injuries %>%
filter(report_status == "Out") -> out_inj
dc %>%
filter(depth_team == 1) -> starters
# Determine roster position of injured players
starters %>%
select(-c(last_name,first_name,position,full_name)) %>%
inner_join(out_inj, by = c('season','club_code' = 'team','gsis_id','game_type','week')) -> injured_starters
# Number of injuries by position
injured_starters %>%
group_by(season,club_code,week,position) %>%
summarise(starters_injured = n()) %>%
ungroup() %>%
pivot_wider(names_from = position, names_prefix = "injured_",values_from = starters_injured) -> injuries_position
head(injuries_position)
# A tibble: 6 × 19
season club_code week injured_S injured_LB injured_RB injured_T injured_C
<dbl> <chr> <dbl> <int> <int> <int> <int> <int>
1 2011 ARI 7 1 NA NA NA NA
2 2011 ARI 8 1 NA NA NA NA
3 2011 ARI 9 NA 1 1 NA NA
4 2011 ARI 10 1 1 NA NA NA
5 2011 ARI 11 1 1 NA NA NA
6 2011 ARI 12 1 1 NA NA NA
# ℹ 11 more variables: injured_DT <int>, injured_WR <int>, injured_CB <int>,
# injured_G <int>, injured_K <int>, injured_TE <int>, injured_QB <int>,
# injured_DE <int>, injured_LS <int>, injured_P <int>, injured_FB <int>
Alright, now we have counts of injured starters at each position. Next, we need to bring all of our new features together.
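Before the joins, one quick aside: if the per-position breakdown ever feels too granular, a simpler alternative (a sketch, not used below) is a single count of starters ruled out per team-week:
# Hypothetical simpler feature: total injured starters per team and week
injured_starters %>%
  group_by(season, club_code, week) %>%
  summarise(n_starters_out = n(), .groups = "drop") -> injuries_total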
Joins
nfl_game_results %>%
inner_join(cumul_avgs, by = c('game_id','season','week','home_team' = 'team')) %>%
inner_join(cumul_avgs, by = c('game_id','season','week','away_team' = 'team'),suffix = c('_home','_away'))-> w_avgs
# Check for stragglers
nfl_game_results %>%
anti_join(cumul_avgs, by = c('game_id','season','home_team' = 'team','week')) -> unplayed_games
# Join previous season's results
#w_avgs %>%
# left_join(team_win_pct,by = c('season','home_team' = 'team')) %>%
# left_join(team_win_pct, by = c('away_team' = 'team','season'),suffix = c('_home','_away')) -> matchups

# Indicate whether home team won
w_avgs %>%
mutate(home_win = as.numeric(result > 0)) -> matchups
Now, let's bring in our injury data.
matchups %>%
left_join(injuries_position,by = c('season','home_team'='club_code','week')) %>%
left_join(injuries_position,by = c('season','away_team'='club_code','week'),suffix = c('_home','_away')) %>%
mutate(across(starts_with('injured_'), ~replace_na(.x, 0))) -> matchup_full
And … BOOM! We have a dataset with game-by-game matchups and some features to start with. Feel free to peruse the data to find potential features to include in our model.
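One quick way to peruse it (just a convenience; both functions come with the tidyverse):
# Skim the columns of the combined dataset and confirm nothing was dropped unexpectedly
matchup_full %>% glimpse()
matchup_full %>% count(season)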
Building the Model
Before we do any preprocessing, let's split the data. First, though, we'll remove columns that might leak information to the model; remember, the model must only use information available prior to kickoff.
matchup_full %>%
select(-c(home_score,away_score,overtime,home_team,away_team,away_qb_name,home_qb_name,referee,stadium,home_coach,away_coach,ftn,espn,old_game_id,gsis,nfl_detail_id,pfr,pff,result)) -> matchup_ready
# Keep only numeric columns (plus game_id)
matchup_ready = matchup_ready%>%
# Transform outcome into factor variable
select(where(is.numeric),game_id) %>%
mutate(home_win = as.factor(home_win))
# Split Data
set.seed(123)
matchups24 = matchup_ready %>% filter(season == 2024)
splits = matchup_ready %>%
filter(season != 2024) %>%
initial_split(prop = 0.7)
train_data <- training(splits)
test_data <- testing(splits)
Let's check out the variables available to us.
colnames(train_data)
 [1] "season" "week"
[3] "total" "away_rest"
[5] "home_rest" "away_moneyline"
[7] "home_moneyline" "spread_line"
[9] "away_spread_odds" "home_spread_odds"
[11] "total_line" "under_odds"
[13] "over_odds" "div_game"
[15] "temp" "wind"
[17] "cumul_win_pct_lag_1_home" "cumul_score_lag_1_home"
[19] "cumul_opp_lag_1_home" "cumul_win_pct_lag_1_away"
[21] "cumul_score_lag_1_away" "cumul_opp_lag_1_away"
[23] "home_win" "injured_S_home"
[25] "injured_LB_home" "injured_RB_home"
[27] "injured_T_home" "injured_C_home"
[29] "injured_DT_home" "injured_WR_home"
[31] "injured_CB_home" "injured_G_home"
[33] "injured_K_home" "injured_TE_home"
[35] "injured_QB_home" "injured_DE_home"
[37] "injured_LS_home" "injured_P_home"
[39] "injured_FB_home" "injured_S_away"
[41] "injured_LB_away" "injured_RB_away"
[43] "injured_T_away" "injured_C_away"
[45] "injured_DT_away" "injured_WR_away"
[47] "injured_CB_away" "injured_G_away"
[49] "injured_K_away" "injured_TE_away"
[51] "injured_QB_away" "injured_DE_away"
[53] "injured_LS_away" "injured_P_away"
[55] "injured_FB_away" "game_id"
Off the bat, we see a lot of the variables we created. Do we really need that many variables tracking injured players? Probably not, but we'll let the model handle that issue later on.
What about NAs? You may have noticed that quite a few variables still contain missing values. We can turn to the naniar package to investigate the missingness.
library(naniar)
gg_miss_var(x = train_data,show_pct = T)
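If you prefer a table to a plot, a dplyr-only alternative (a sketch) surfaces the same information:
# Count missing values per column, most-missing first
train_data %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))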
It looks like our custom variables have some missingness, as do the wind and temperature variables. There are several ways to deal with this; here, we'll use a simple imputation method provided by tidymodels.
library('tidymodels')
rec_impute = recipe(formula = home_win ~ .,
data = train_data) %>%
# Give game_id an ID role so it is kept in the data but not used as a predictor; we'll use it to match predictions to specific games
update_role(game_id, new_role = "ID") %>%
#for each numeric variable/feature, replace any NA's with the median value of that variable/feature
step_impute_median(all_numeric_predictors())
# Evaluate the imputation step
tidy(rec_impute, number = 1)
# A tibble: 1 × 3
terms value id
<chr> <dbl> <chr>
1 all_numeric_predictors() NA impute_median_BeUC1
At this point the imputation step hasn't been trained yet, which is why the value column above shows NA. Let's prep() the recipe on the training data and check that the imputation removes all of the missing values.
imp_models <- rec_impute %>%
check_missing(all_numeric_predictors()) %>%
prep(training = train_data)
# Check if imputation worked
imp_models %>%
bake(train_data) %>%
is.na() %>%
colSums()
season week total
0 0 0
away_rest home_rest away_moneyline
0 0 0
home_moneyline spread_line away_spread_odds
0 0 0
home_spread_odds total_line under_odds
0 0 0
over_odds div_game temp
0 0 0
wind cumul_win_pct_lag_1_home cumul_score_lag_1_home
0 0 0
cumul_opp_lag_1_home cumul_win_pct_lag_1_away cumul_score_lag_1_away
0 0 0
cumul_opp_lag_1_away injured_S_home injured_LB_home
0 0 0
injured_RB_home injured_T_home injured_C_home
0 0 0
injured_DT_home injured_WR_home injured_CB_home
0 0 0
injured_G_home injured_K_home injured_TE_home
0 0 0
injured_QB_home injured_DE_home injured_LS_home
0 0 0
injured_P_home injured_FB_home injured_S_away
0 0 0
injured_LB_away injured_RB_away injured_T_away
0 0 0
injured_C_away injured_DT_away injured_WR_away
0 0 0
injured_CB_away injured_G_away injured_K_away
0 0 0
injured_TE_away injured_QB_away injured_DE_away
0 0 0
injured_LS_away injured_P_away injured_FB_away
0 0 0
game_id home_win
0 0
Here, we can see the median (the NA-filler value) that will be used for each predictor:
tidy(imp_models, number = 1)
# A tibble: 54 × 3
terms value id
<chr> <dbl> <chr>
1 season 2017 impute_median_BeUC1
2 week 10 impute_median_BeUC1
3 total 44 impute_median_BeUC1
4 away_rest 7 impute_median_BeUC1
5 home_rest 7 impute_median_BeUC1
6 away_moneyline 139 impute_median_BeUC1
7 home_moneyline -155 impute_median_BeUC1
8 spread_line 3 impute_median_BeUC1
9 away_spread_odds -105 impute_median_BeUC1
10 home_spread_odds -105 impute_median_BeUC1
# ℹ 44 more rows
Modeling
Ahh, finally, we can get to the actual model building… which we'll do in about three lines of code.
library('glmnet')
# Penalized logistic regression
# mixture = 1 would be pure lasso; 0.05 is mostly ridge with a small lasso component
lr_mod <-
logistic_reg(mixture = 0.05,penalty = 1) %>%
set_engine("glmnet")And that’s it! We’ll start off with a basic logistic regression. There are a few tunable (though we won’t be tuning in this post) parameters, but we’ll manually set them for this exercise.
Next, we want to train our model with cross-validation, so that we can train and test against different resamples and avoid an overfit model as much as we can.
# Create folds for cross-validation
folds <- vfold_cv(train_data)
The tidymodels workflow helps organize the steps of model creation and consolidate model objects. We won't go into the details of workflows in this post, but there is plenty of online documentation.
# create a workflow using recipe and model objects
game_pred_wflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(rec_impute)
fit_cv = game_pred_wflow %>%
fit_resamples(folds)
Check for the best model fit. It looks like ROC-AUC and accuracy agree that the first model is the best.
collect_metrics(fit_cv)
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.623 10 0.00725 Preprocessor1_Model1
2 brier_class binary 0.229 10 0.00100 Preprocessor1_Model1
3 roc_auc binary 0.719 10 0.00806 Preprocessor1_Model1
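If we wanted additional metrics across the resamples, fit_resamples() accepts a yardstick metric_set(); for example (a sketch):
# Sketch: request a custom set of metrics during resampling
fit_cv_more <- game_pred_wflow %>%
  fit_resamples(folds, metrics = metric_set(roc_auc, accuracy, mn_log_loss))
collect_metrics(fit_cv_more)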
Extract best model fit.
final_wf = game_pred_wflow %>%
last_fit(splits)
final_model = extract_workflow(final_wf)
Get variable estimates and penalty terms.
final_model %>%
extract_fit_engine() %>%
tidy() %>%
rename(penalty = lambda)
# A tibble: 3,622 × 5
term step estimate penalty dev.ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1 0.253 3.74 -4.05e-14
2 (Intercept) 2 0.250 3.41 1.76e- 3
3 (Intercept) 3 0.247 3.11 4.56e- 3
4 (Intercept) 4 0.242 2.83 7.95e- 3
5 (Intercept) 5 0.237 2.58 1.15e- 2
6 (Intercept) 6 0.232 2.35 1.51e- 2
7 (Intercept) 7 0.227 2.14 1.88e- 2
8 (Intercept) 8 0.221 1.95 2.26e- 2
9 (Intercept) 9 0.215 1.78 2.65e- 2
10 (Intercept) 10 0.209 1.62 3.04e- 2
# ℹ 3,612 more rows
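The table above is the full glmnet regularization path. To see the coefficients at the penalty we actually set, tidying the parsnip fit should give just that slice:
# Coefficients at our chosen penalty, rather than the whole path
final_model %>%
  extract_fit_parsnip() %>%
  tidy()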
# Align predictions to the test dataset
predicted_df = augment(final_model, test_data)
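Before retraining, it's worth checking held-out performance. Since last_fit() already evaluated on the test split, collect_metrics() reports it directly; we can also compute metrics from predicted_df with yardstick, assuming the usual .pred_class and .pred_0 columns that augment() adds for a 0/1 factor outcome:
# Test-set metrics from the last_fit() object
collect_metrics(final_wf)
# Or from the augmented predictions ("0" is the first factor level, so it is the default event)
predicted_df %>% roc_auc(truth = home_win, .pred_0)
predicted_df %>% accuracy(truth = home_win, estimate = .pred_class)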
Now, so we get a sample boost, let's retrain the model on the full sample (training and test sets combined) with the same parameters.
# Extract model specs
final_model %>%
extract_spec_parsnip() -> final_specs
# Update workflow object and retrain on the full sample with the same parameters
game_pred_wflow %>%
update_model(spec = final_specs) %>%
# Stack the training and test data back together (row-wise)
fit(data = bind_rows(train_data, test_data)) -> final_flow
Let's now save the model for future use and investigation.
library(yaml)
final_flow %>%
extract_fit_parsnip() %>%
saveRDS(file = "gamePred_model2024.rda")
We'll use this model to make predictions on future games and evaluate how it performs in real time.
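Later, the saved object can be reloaded with readRDS(). One caveat: because we saved only the parsnip fit, new data would need to go through the same recipe before predicting; saving final_flow (the whole workflow, recipe included) would let us predict on raw matchup rows directly. A sketch:
# Reload the saved fit for future use
saved_fit <- readRDS("gamePred_model2024.rda")
# Alternatively, save and reload the entire workflow (recipe + model):
# saveRDS(final_flow, file = "gamePred_workflow2024.rds")
# predict(readRDS("gamePred_workflow2024.rds"), new_data = new_matchups)  # new_matchups: hypothetical upcoming games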