# install and load packages
#install.packages("nflreadr")
library('nflreadr')
library("tidyverse")
library('pROC')
library('tidymodels')
Intro
Hello everyone! Today I want to share a tutorial on using nflreadr to read historic NFL results and model game outcomes with tidymodels.
First, a quick introduction of the R packages we'll use. nflreadr is part of the nflverse family of packages, which make it easy and efficient to obtain NFL data, including past game results and statistics. In this post, we'll use its suite of functions to get the data we need to build a simple predictive model. We'll also use the tidymodels package to set up our model and the tidyverse for data cleaning and manipulation.
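If you're new to nflreadr, it helps to know that it exposes a family of load_*() functions, one per data source. A couple of examples beyond the ones used in this post, shown purely for orientation (we won't use these objects below):
# Other nflreadr loaders, shown only as examples
rosters <- load_rosters(2023)            # team rosters for a season
player_stats <- load_player_stats(2023)  # weekly player-level stats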
Load Data
Now that we have the relevant packages loaded, let's start putting our data together. Starting with game data, we'll pull schedules and results from 2011 through 2024. Each row (record) of the schedule represents a game: there's a home and away team, the corresponding scores, and more contextual information for each game.
# Scrape schedule Results
load_schedules(seasons = seq(2011,2024)) -> nfl_game_results
head(nfl_game_results)
# A tibble: 6 × 46
game_id season game_type week gameday weekday gametime away_team away_score
<chr> <int> <chr> <int> <chr> <chr> <chr> <chr> <int>
1 2011_01_… 2011 REG 1 2011-0… Thursd… 20:30 NO 34
2 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 PIT 7
3 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 ATL 12
4 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 CIN 27
5 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 IND 7
6 2011_01_… 2011 REG 1 2011-0… Sunday 13:00 TEN 14
# ℹ 37 more variables: home_team <chr>, home_score <int>, location <chr>,
# result <int>, total <int>, overtime <int>, old_game_id <chr>, gsis <int>,
# nfl_detail_id <chr>, pfr <chr>, pff <int>, espn <chr>, ftn <int>,
# away_rest <int>, home_rest <int>, away_moneyline <int>,
# home_moneyline <int>, spread_line <dbl>, away_spread_odds <int>,
# home_spread_odds <int>, total_line <dbl>, under_odds <int>,
# over_odds <int>, div_game <int>, roof <chr>, surface <chr>, temp <int>, …
We've loaded our schedules with some interesting variables to use in our model. However, the data isn't quite in the format we need. Ideally, we'd like to feed in two teams and have the model give us a winner. Let's use pivot_longer() to rearrange the dataset so that each row is one team's side of a game, and select some simple variables before making the matchup set.
nfl_game_results %>%
# Remove the upcoming season
filter(season < 2024) %>%
pivot_longer(cols = c(away_team,home_team),
names_to = "home_away",
values_to = "team") %>%
mutate(team_score = ifelse(home_away == "home_team", yes = home_score, no = away_score),
opp_score = ifelse(home_away == "home_team", away_score, home_score)) %>%
# Sort for the cumulative averages
arrange(season,week) %>%
select(season,game_id,team,team_score,opp_score,week) -> team_games
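As a quick sanity check (an extra snippet, not strictly required), every game_id in team_games should now appear exactly twice, once for each team:
# Sanity check: each game should contribute exactly two rows (one per team)
team_games %>%
  count(game_id) %>%
  filter(n != 2)
An empty result means the reshape worked as expected.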
Feature Engineering
Our goal is to be able to predict the outcome of each game. To do that, we need to think about what impacts the outcome before the teams even take the field.
In the case of an NFL game, that could be player skill level, how far a team has to travel, injuries, even the food players ate the night before. Using nflreadr, we can see several variables that could plausibly affect a game's outcome, from injuries to previous results.
We'll start by pulling in previous results. By using previous results, we can hopefully capture a team's quality as a predictor of how it will perform in its next game. There are several ways to quantify team strength, some more complex than others; for this tutorial, we'll use cumulative results as a measure of team strength, in the form of cumulative points scored/allowed and winning percentage leading up to each game.
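Before the real calculation below, here is a tiny toy example (made-up numbers) showing how cummean() combined with lag() gives us averages that only use games already played:
# Toy example: entering week 3, the lagged cumulative average reflects weeks 1-2 only
tibble(week = 1:4, team_score = c(20, 30, 10, 40)) %>%
  mutate(cumul_score_mean = cummean(team_score),
         cumul_score_lag_1 = lag(cumul_score_mean, 1))
The lagged column is the one we'll feed the model, since the unlagged value already includes the current game's score.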
team_games %>%
arrange(week) %>%
group_by(season,team) %>%
# For each team-season, calculate the cumulative scoring averages and record after each week
mutate(cumul_score_mean = cummean(team_score),
cumul_score_opp = cummean(opp_score),
cumul_wins = cumsum(team_score > opp_score),
cumul_losses = cumsum(team_score < opp_score),
cumul_ties = cumsum(team_score == opp_score),
cumul_win_pct = cumul_wins / (cumul_wins + cumul_losses),
# Create the lag variable
cumul_win_pct_lag_1 = lag(cumul_win_pct,1),
cumul_score_lag_1 = lag(cumul_score_mean,1),
cumul_opp_lag_1 = lag(cumul_score_opp,1)
) %>%
# Non-lag variables leak info
select(week,game_id,contains('lag_1')) %>%
ungroup() -> cumul_avgs
Let's also calculate each team's winning percentage from the previous season as a feature.
team_games %>%
group_by(season,team) %>%
summarise(wins = sum(team_score > opp_score),
losses = sum(team_score < opp_score),
ties = sum(team_score == opp_score))%>%
ungroup() %>%
arrange(season) %>%
group_by(team) %>%
mutate(win_pct = wins / (wins + losses),
lag1_win_pct = lag(win_pct,1)) %>%
ungroup() -> team_win_pct
This should be a good start, but I still feel like something is missing. Football is a dangerous game and players regularly get injured. Thankfully, nflreadr provides weekly injury reports. Let's try incorporating them into our model.
# Load depth charts and injury reports
dc = load_depth_charts(seq(2011,most_recent_season()))
injuries = load_injuries(seq(2011,most_recent_season()))
injuries %>%
filter(report_status == "Out") -> out_inj
dc %>%
filter(depth_team == 1) -> starters
# Determine roster position of injured players
starters %>%
select(-c(last_name,first_name,position,full_name)) %>%
inner_join(out_inj, by = c('season','club_code' = 'team','gsis_id','game_type','week')) -> injured_starters
# Number of injuries by position
injured_starters %>%
group_by(season,club_code,week,position) %>%
summarise(starters_injured = n()) %>%
ungroup() %>%
pivot_wider(names_from = position, names_prefix = "injured_",values_from = starters_injured) -> injuries_position
head(injuries_position)
# A tibble: 6 × 19
season club_code week injured_S injured_LB injured_RB injured_T injured_C
<dbl> <chr> <dbl> <int> <int> <int> <int> <int>
1 2011 ARI 7 1 NA NA NA NA
2 2011 ARI 8 1 NA NA NA NA
3 2011 ARI 9 NA 1 1 NA NA
4 2011 ARI 10 1 1 NA NA NA
5 2011 ARI 11 1 1 NA NA NA
6 2011 ARI 12 1 1 NA NA NA
# ℹ 11 more variables: injured_DT <int>, injured_WR <int>, injured_CB <int>,
# injured_G <int>, injured_K <int>, injured_TE <int>, injured_QB <int>,
# injured_DE <int>, injured_LS <int>, injured_P <int>, injured_FB <int>
Alright, now we have counts of injured starters at each position. Next, we need to bring all of our new features together.
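Before the joins, one quick aside: if the per-position breakdown ever feels too granular, a simpler alternative (a sketch, not used below) is a single count of starters ruled out per team-week:
# Hypothetical simpler feature: total injured starters per team and week
injured_starters %>%
  group_by(season, club_code, week) %>%
  summarise(n_starters_out = n(), .groups = "drop") -> injuries_total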
Joins
nfl_game_results %>%
inner_join(cumul_avgs, by = c('game_id','season','week','home_team' = 'team')) %>%
inner_join(cumul_avgs, by = c('game_id','season','week','away_team' = 'team'),suffix = c('_home','_away'))-> w_avgs
# Check for stragglers
nfl_game_results %>%
anti_join(cumul_avgs, by = c('game_id','season','home_team' = 'team','week')) -> unplayed_games
# Join previous season's results
#w_avgs %>%
# left_join(team_win_pct,by = c('season','home_team' = 'team')) %>%
# left_join(team_win_pct, by = c('away_team' = 'team','season'),suffix = c('_home','_away')) -> matchups

# Indicate whether home team won
w_avgs %>%
mutate(home_win = as.numeric(result > 0)) -> matchups
Now, let's bring in our injury data.
matchups %>%
left_join(injuries_position,by = c('season','home_team'='club_code','week')) %>%
left_join(injuries_position,by = c('season','away_team'='club_code','week'),suffix = c('_home','_away')) %>%
mutate(across(starts_with('injured_'), ~replace_na(.x, 0))) -> matchup_full
And … BOOM! We have a dataset with game-by-game matchups and some features to start with. Feel free to peruse the data to find potential features to include in our model.
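One quick way to peruse it (just a convenience; both functions come with the tidyverse):
# Skim the columns of the combined dataset and confirm nothing was dropped unexpectedly
matchup_full %>% glimpse()
matchup_full %>% count(season)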
Building the Model
Before we do any preprocessing, let's split the data. First, though, we'll remove columns that might leak information to the model; remember, the model must only use information available prior to kickoff.
matchup_full %>%
select(-c(home_score,away_score,overtime,home_team,away_team,away_qb_name,home_qb_name,referee,stadium,home_coach,away_coach,ftn,espn,old_game_id,gsis,nfl_detail_id,pfr,pff,result)) -> matchup_ready
# Keep only numeric columns (plus game_id)
matchup_ready = matchup_ready%>%
# Transform outcome into factor variable
select(where(is.numeric),game_id) %>%
mutate(home_win = as.factor(home_win))
# Split Data
set.seed(123)
matchups24 = matchup_ready %>% filter(season == 2024)
splits = matchup_ready %>%
filter(season != 2024) %>%
initial_split(prop = 0.7)
train_data <- training(splits)
test_data <- testing(splits)
Let's check out the variables available to us.
colnames(train_data)
 [1] "season" "week"
[3] "total" "away_rest"
[5] "home_rest" "away_moneyline"
[7] "home_moneyline" "spread_line"
[9] "away_spread_odds" "home_spread_odds"
[11] "total_line" "under_odds"
[13] "over_odds" "div_game"
[15] "temp" "wind"
[17] "cumul_win_pct_lag_1_home" "cumul_score_lag_1_home"
[19] "cumul_opp_lag_1_home" "cumul_win_pct_lag_1_away"
[21] "cumul_score_lag_1_away" "cumul_opp_lag_1_away"
[23] "home_win" "injured_S_home"
[25] "injured_LB_home" "injured_RB_home"
[27] "injured_T_home" "injured_C_home"
[29] "injured_DT_home" "injured_WR_home"
[31] "injured_CB_home" "injured_G_home"
[33] "injured_K_home" "injured_TE_home"
[35] "injured_QB_home" "injured_DE_home"
[37] "injured_LS_home" "injured_P_home"
[39] "injured_FB_home" "injured_S_away"
[41] "injured_LB_away" "injured_RB_away"
[43] "injured_T_away" "injured_C_away"
[45] "injured_DT_away" "injured_WR_away"
[47] "injured_CB_away" "injured_G_away"
[49] "injured_K_away" "injured_TE_away"
[51] "injured_QB_away" "injured_DE_away"
[53] "injured_LS_away" "injured_P_away"
[55] "injured_FB_away" "game_id"
Off the bat, we see a lot of the variables we created. Do we really need that many variables tracking injured players? Probably not, but we'll let the model handle that issue later on.
What about NAs? You may have noticed that quite a few variables still contain missing values. We can turn to the naniar package to investigate the missingness.
library(naniar)
gg_miss_var(x = train_data,show_pct = T)
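If you prefer a table to a plot, a dplyr-only alternative (a sketch) surfaces the same information:
# Count missing values per column, most-missing first
train_data %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))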
It looks like our custom variables have some missingness, as do the wind and temperature variables. There are several ways to deal with this; here, we'll use a simple imputation method provided by tidymodels.
library('tidymodels')
rec_impute = recipe(formula = home_win ~ .,
data = train_data) %>%
# Give game_id an ID role so it is kept in the data but not used as a predictor; we'll use it to match predictions to specific games
update_role(game_id, new_role = "ID") %>%
#for each numeric variable/feature, replace any NA's with the median value of that variable/feature
step_impute_median(all_numeric_predictors())
# Evaluate the imputation step
tidy(rec_impute, number = 1)
# A tibble: 1 × 3
terms value id
<chr> <dbl> <chr>
1 all_numeric_predictors() NA impute_median_BeUC1
At this point the imputation step hasn't been trained yet, which is why the value column above shows NA. Let's prep() the recipe on the training data and check that the imputation removes all of the missing values.
imp_models <- rec_impute %>%
check_missing(all_numeric_predictors()) %>%
prep(training = train_data)
# Check if imputation worked
imp_models %>%
bake(train_data) %>%
is.na() %>%
colSums()
season week total
0 0 0
away_rest home_rest away_moneyline
0 0 0
home_moneyline spread_line away_spread_odds
0 0 0
home_spread_odds total_line under_odds
0 0 0
over_odds div_game temp
0 0 0
wind cumul_win_pct_lag_1_home cumul_score_lag_1_home
0 0 0
cumul_opp_lag_1_home cumul_win_pct_lag_1_away cumul_score_lag_1_away
0 0 0
cumul_opp_lag_1_away injured_S_home injured_LB_home
0 0 0
injured_RB_home injured_T_home injured_C_home
0 0 0
injured_DT_home injured_WR_home injured_CB_home
0 0 0
injured_G_home injured_K_home injured_TE_home
0 0 0
injured_QB_home injured_DE_home injured_LS_home
0 0 0
injured_P_home injured_FB_home injured_S_away
0 0 0
injured_LB_away injured_RB_away injured_T_away
0 0 0
injured_C_away injured_DT_away injured_WR_away
0 0 0
injured_CB_away injured_G_away injured_K_away
0 0 0
injured_TE_away injured_QB_away injured_DE_away
0 0 0
injured_LS_away injured_P_away injured_FB_away
0 0 0
game_id home_win
0 0
Here, we can see the median (the NA-filler value) that will be used for each predictor:
tidy(imp_models, number = 1)
# A tibble: 54 × 3
terms value id
<chr> <dbl> <chr>
1 season 2017 impute_median_BeUC1
2 week 10 impute_median_BeUC1
3 total 44 impute_median_BeUC1
4 away_rest 7 impute_median_BeUC1
5 home_rest 7 impute_median_BeUC1
6 away_moneyline 139 impute_median_BeUC1
7 home_moneyline -155 impute_median_BeUC1
8 spread_line 3 impute_median_BeUC1
9 away_spread_odds -105 impute_median_BeUC1
10 home_spread_odds -105 impute_median_BeUC1
# ℹ 44 more rows
Modeling
Ahh, finally, we can get to the actual model building… which we'll do in about three lines of code.
library('glmnet')
# Penalized logistic regression
# mixture = 1 would be pure lasso; 0.05 is mostly ridge with a small lasso component
lr_mod <-
logistic_reg(mixture = 0.05,penalty = 1) %>%
set_engine("glmnet")And that’s it! We’ll start off with a basic logistic regression. There are a few tunable (though we won’t be tuning in this post) parameters, but we’ll manually set them for this exercise.
Next, we want to train our model with cross-validation, so that we can train and test against different resamples and avoid an overfit model as much as we can.
# Create folds for cross-validation
folds <- vfold_cv(train_data)
The tidymodels workflow helps organize the steps of model creation and consolidate model objects. We won't go into the details of workflows in this post, but there is plenty of online documentation.
# create a workflow using recipe and model objects
game_pred_wflow <-
workflow() %>%
add_model(lr_mod) %>%
add_recipe(rec_impute)
fit_cv = game_pred_wflow %>%
fit_resamples(folds)
Check for the best model fit. It looks like ROC-AUC and accuracy agree that the first model is the best.
collect_metrics(fit_cv)
# A tibble: 3 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 accuracy binary 0.623 10 0.00725 Preprocessor1_Model1
2 brier_class binary 0.229 10 0.00100 Preprocessor1_Model1
3 roc_auc binary 0.719 10 0.00806 Preprocessor1_Model1
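If we wanted additional metrics across the resamples, fit_resamples() accepts a yardstick metric_set(); for example (a sketch):
# Sketch: request a custom set of metrics during resampling
fit_cv_more <- game_pred_wflow %>%
  fit_resamples(folds, metrics = metric_set(roc_auc, accuracy, mn_log_loss))
collect_metrics(fit_cv_more)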
Extract best model fit.
final_wf = game_pred_wflow %>%
last_fit(splits)
final_model = extract_workflow(final_wf)
Get variable estimates and penalty terms.
final_model %>%
extract_fit_engine() %>%
tidy() %>%
rename(penalty = lambda)
# A tibble: 3,622 × 5
term step estimate penalty dev.ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1 0.253 3.74 -4.05e-14
2 (Intercept) 2 0.250 3.41 1.76e- 3
3 (Intercept) 3 0.247 3.11 4.56e- 3
4 (Intercept) 4 0.242 2.83 7.95e- 3
5 (Intercept) 5 0.237 2.58 1.15e- 2
6 (Intercept) 6 0.232 2.35 1.51e- 2
7 (Intercept) 7 0.227 2.14 1.88e- 2
8 (Intercept) 8 0.221 1.95 2.26e- 2
9 (Intercept) 9 0.215 1.78 2.65e- 2
10 (Intercept) 10 0.209 1.62 3.04e- 2
# ℹ 3,612 more rows
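The table above is the full glmnet regularization path. To see the coefficients at the penalty we actually set, tidying the parsnip fit should give just that slice:
# Coefficients at our chosen penalty, rather than the whole path
final_model %>%
  extract_fit_parsnip() %>%
  tidy()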
# Align predictions to the test dataset
predicted_df = augment(final_model, test_data)
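Before retraining, it's worth checking held-out performance. Since last_fit() already evaluated on the test split, collect_metrics() reports it directly; we can also compute metrics from predicted_df with yardstick, assuming the usual .pred_class and .pred_0 columns that augment() adds for a 0/1 factor outcome:
# Test-set metrics from the last_fit() object
collect_metrics(final_wf)
# Or from the augmented predictions ("0" is the first factor level, so it is the default event)
predicted_df %>% roc_auc(truth = home_win, .pred_0)
predicted_df %>% accuracy(truth = home_win, estimate = .pred_class)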
Now, so we get a sample boost, let's retrain the model on the full sample (training and test sets combined) with the same parameters.
# Extract model specs
final_model %>%
extract_spec_parsnip() -> final_specs
# Update workflow object and retrain on the full sample with the same parameters
game_pred_wflow %>%
update_model(spec = final_specs) %>%
# Stack the training and test data back together (row-wise)
fit(data = bind_rows(train_data, test_data)) -> final_flow
Let's now save the model for future use and investigation.
library(yaml)
final_flow %>%
extract_fit_parsnip() %>%
saveRDS(file = "gamePred_model2024.rda")
We'll use this model to make predictions on future games and evaluate how it performs in real time.
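Later, the saved object can be reloaded with readRDS(). One caveat: because we saved only the parsnip fit, new data would need to go through the same recipe before predicting; saving final_flow (the whole workflow, recipe included) would let us predict on raw matchup rows directly. A sketch:
# Reload the saved fit for future use
saved_fit <- readRDS("gamePred_model2024.rda")
# Alternatively, save and reload the entire workflow (recipe + model):
# saveRDS(final_flow, file = "gamePred_workflow2024.rds")
# predict(readRDS("gamePred_workflow2024.rds"), new_data = new_matchups)  # new_matchups: hypothetical upcoming games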