Predicting Wide Receiver Fantasy Points w/ Tidymodels

Author

Sameer Sapre

Published

September 16, 2024

Introduction

I’ve played fantasy football for a decade and have always wondered how ESPN (and Yahoo, Sleeper, etc.) make their fantasy projections week to week. Where are they getting their estimates from? Why is Ja’Marr Chase projected to score 16.3 points for me this week? Why does his projection increase by that much (or that little) if Tee Higgins happens to be out injured? Perhaps we won’t get a chance to peek inside ESPN’s crystal ball, but maybe we can try to build our own!

In this post, I’ll attempt to build a predictive model that outputs projections for fantasy football players based on their past performance. We’ll narrow the scope to wide receivers for now, but this process can largely be applied to players of any position.

We’ll go step by step through data cleaning, feature engineering, model selection and training, and model validation before finishing up with conclusions and takeaways.

Setup

library(tidyverse)
library(nflreadr)
library(tidymodels)
library(nflfastR)
library(naniar)
library(TTR)

Data Load

We’ll start by loading in directly relevant player statistics and depth chart information before joining them together.

stats <- load_player_stats(seasons = seq(2006, 2023))

dc <- load_depth_charts(seasons = seq(2016, most_recent_season())) %>%
  filter(position == 'WR', formation == 'Offense') %>%
  select(season, recent_team = club_code, week, season_type = game_type,
         player_id = gsis_id, depth_team)
# Filter for wide receivers
wr_data <- stats %>%
  filter(position == "WR") %>%
  # Select identifying info + receiver-specific metrics
  select(player_id, player_name, position, recent_team, season, week, season_type,
         receptions:fantasy_points_ppr) %>%
  # Add depth chart status since we don't have participation data
  left_join(y = dc,by = c('player_id','week','season','season_type',"recent_team")) %>%
  # Only the first 3 depth spots are listed, so players below 3rd on the depth chart come through as NA
  replace_na(list(depth_team = '4')) %>%
  mutate(depth_team = as.numeric(depth_team))
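As a quick sanity check on the join (a sketch of my own, not part of the original pipeline), we can tabulate depth_team to confirm that unmatched players were coded as 4:

# Sketch: confirm the depth chart coding after the join
# Expect 1-3 from the depth charts and 4 for the players we NA-filled
wr_data %>%
  count(depth_team)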

Missingness

Let’s check if there are any missing values in the dataset we’ve created. We’ll use the naniar package to do so.

# Explore variables patterns of missingness
gg_miss_var(wr_data)

wr_data %>%
  select(racr,air_yards_share,wopr,target_share,receiving_epa) -> df_miss

df_miss %>%
  gg_miss_upset()

# What if we take out racr?

Ideally, we would like to impute these values, but for brevity, we’ll simply filter out rows where ‘racr’ is NA.
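If you’d like to quantify the missingness first, here’s a quick sketch using naniar’s miss_var_summary() on the same columns:

# Sketch: count and rank missing values per variable
wr_data %>%
  select(racr, air_yards_share, wopr, target_share, receiving_epa) %>%
  miss_var_summary()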

Time Weighting

We want our predictions for each player to be based on their past performance. However, we’d like the most recent performances to be weighted most heavily. To do that, we can introduce a simple time-weighting scheme. In the function below, we take a vector of a chosen metric (e.g., receptions) and return a weighted average whose weights increase linearly with recency.

# Create Time Weighting 

weighted_avg = function(metric_vector){
  
  # Take in sliding window of vector of chosen metric
  n = length(metric_vector)

  # Create Weights for each value based on recency
  weights = seq(1,n)
  
  # Calculate weighted average
  w_avg = sum(metric_vector * weights) / sum(weights)
  
  return(w_avg)

}
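As a quick illustration with made-up values: a three-game vector gets weights 1, 2, and 3, so the most recent game counts three times as much as the oldest.

# Example (made-up values): weights are 1, 2, 3
weighted_avg(c(10, 20, 30))
# (10*1 + 20*2 + 30*3) / (1 + 2 + 3) = 140 / 6 ≈ 23.3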

Now that we have our simple time-weighting built, let’s make use of Davis Vaughan’s slider package. It’s a tidy way to calculate sliding window summaries. We’ll use it to create a set of potential numeric predictors.

library(slider)

wr_data %>%
  # Should remove most missingness
  filter(!is.na(racr)) %>%
  group_by(player_id,season) %>%
  arrange(week) %>%
  # Weighted Avg (moving)
  # Take lag so we are not leaking data
  mutate(across(receptions:depth_team,
                ~ lag(slide_dbl(.x, .f = weighted_avg, .before = Inf, .complete = TRUE)),
                .names = "wt_{col}")) %>%
  ungroup() %>%
  # Convert negative fantasy points to 0
  mutate(fantasy_points_target = ifelse(fantasy_points < 0,0,fantasy_points),
         # Take log (more on this later)
         log_fantasy_points = log(fantasy_points_target + 1)) -> ma_wr

Let’s use the example of Brandon Aiyuk as a sanity check.

ma_wr %>%
  filter(season == 2023, player_id == '00-0036261') %>%
  select(player_name,week,targets,wt_targets, receptions,wt_receptions, receiving_yards,wt_receiving_yards)
# A tibble: 19 × 8
   player_name  week targets wt_targets receptions wt_receptions receiving_yards
   <chr>       <int>   <int>      <dbl>      <int>         <dbl>           <dbl>
 1 B.Aiyuk         1       8      NA             8         NA                129
 2 B.Aiyuk         2       6       8             3          8                 43
 3 B.Aiyuk         4       6       6.67          6          4.67             148
 4 B.Aiyuk         5       7       6.33          4          5.33              58
 5 B.Aiyuk         6      10       6.6           4          4.8               76
 6 B.Aiyuk         7       6       7.73          5          4.53              57
 7 B.Aiyuk         8       9       7.24          5          4.67             109
 8 B.Aiyuk        10       3       7.68          3          4.75              55
 9 B.Aiyuk        11       6       6.64          5          4.36             156
10 B.Aiyuk        12       4       6.51          2          4.49              50
11 B.Aiyuk        13       7       6.05          5          4.04              46
12 B.Aiyuk        14       9       6.21          6          4.20             126
13 B.Aiyuk        15       5       6.64          3          4.47              37
14 B.Aiyuk        16       7       6.41          6          4.26             113
15 B.Aiyuk        17       8       6.49          7          4.50             114
16 B.Aiyuk        18       4       6.68          3          4.81              25
17 B.Aiyuk        20       6       6.36          3          4.60              32
18 B.Aiyuk        21       8       6.32          3          4.42              68
19 B.Aiyuk        22       6       6.50          3          4.27              49
# ℹ 1 more variable: wt_receiving_yards <dbl>

A quick scan shows that our time weighted averages are calculated as intended.
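For instance, the week 4 row reproduces by hand from the week 1 and week 2 targets (weights 1 and 2):

# Hand check of the week 4 wt_targets value: 8 targets in week 1, 6 in week 2
(8*1 + 6*2) / (1 + 2)
# [1] 6.666667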

Correlations

Before we start splitting data for the model, let’s take a look at variable correlations. This may influence our model choices.

# Calculate the correlation matrix
cor_matrix <- ma_wr %>% 
  select(starts_with("wt_"), fantasy_points_target) %>%
  cor(use = "complete.obs")

# Reshape the correlation matrix for ggplot
cor_data <- reshape2::melt(cor_matrix)

ggplot(data = cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), name="Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  labs(title = "Correlation Matrix of Dataset", x = "", y = "")

Okay, the heatmap here shows a lot of red. It looks like there is a high degree of multicollinearity among our weighted measures. We’ll touch on this issue, but won’t fully address it in this post. Just be aware that it can make inference and interpretation of a linear model quite difficult.
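If you want to see exactly which pairs are driving the red, here’s a quick sketch that reuses the cor_data frame from above to list the strongest off-diagonal correlations:

# Sketch: list the most highly correlated predictor pairs
cor_data %>%
  filter(as.character(Var1) < as.character(Var2), abs(value) > 0.9) %>%
  arrange(desc(abs(value)))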

In addition, if we look at our target variable, we notice something interesting… it has a skewed distribution. What does this tell us about using a linear model?

ma_wr %>%
  ggplot(aes(x = fantasy_points_target)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s see what happens if we transform our target with a log-transform.

ma_wr %>%
  ggplot(aes(x = log_fantasy_points)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Hmmm, our histogram still looks a bit funky. The distribution has shifted closer to the bell curve we’d want for linear regression, but there are A LOT of 0s (more on that later). For now, we’ll trudge ahead and start preparing the modeling pipeline in tidymodels with this transformed variable as our target.
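Before moving on, we can put a rough number on those 0s with a quick sketch:

# Sketch: what share of player-weeks scored zero (or negative, floored to zero) points?
ma_wr %>%
  summarise(pct_zero = mean(fantasy_points_target == 0, na.rm = TRUE))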

Preprocess

We’ll start by splitting our dataset into training and testing sets, then create a recipe for preprocessing in the following blocks.

library(tidymodels)

# Split

set.seed(222)
# Put 3/4 of the data into the training set 
data_split <- ma_wr %>% 
  # Filter on relevant columns
  select(starts_with('wt_'),fantasy_points_target,player_id,season,week,recent_team,
         fantasy_points,fantasy_points_ppr,log_fantasy_points) %>% 
  # make split
  initial_split(prop = 3/4)

# Create data frames for the two sets:
train_data <- training(data_split)
test_data  <- testing(data_split)
test_data %>%
  filter(!is.na(wt_receptions)) -> test_data

sum(colSums(is.na(test_data)))
[1] 0

Model Building

exp_recipe <- train_data %>%
  recipe(log_fantasy_points ~ .) %>%
  update_role(c(player_id, recent_team, fantasy_points, fantasy_points_ppr, fantasy_points_target),
              new_role = 'ID') %>%
  # Impute missing predictor values with the median rather than dropping rows
  # (an alternative would be step_naomit(all_numeric_predictors()))
  step_impute_median(all_numeric_predictors()) %>%
  # Remove zero-variance predictors (i.e., variables that contribute nothing to prediction)
  step_zv(all_predictors()) %>%
  step_center(all_numeric_predictors())


#summary(exp_recipe)

Cross-Validation + Tuning

GLM

We’ll train a penalized generalized linear model (GLM). This is a type of linear regression that provides built-in variable selection through a process called regularization. The nuts and bolts of this modeling strategy are beyond the scope of this post, but for those interested in learning more, there is plenty of material on the concept online. Here’s a good reference from UW professor Shuai Huang.
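For reference (my own summary of the glmnet documentation, not from the reference above), glmnet minimizes a penalized least-squares objective in which penalty corresponds to λ and mixture to α:

$$
\min_{\beta_0, \beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
$$

Setting α = 0 gives Ridge regression, α = 1 gives the Lasso, and values in between give the Elastic Net.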

library(glmnet)
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-8
glm_spec <- linear_reg(
  penalty = tune(),    # Amount of regularization
  mixture = tune()     # Mixture (0 = Ridge, 1 = Lasso, values in between = Elastic Net)
) %>%
  set_engine("glmnet")

# Setup workflow
glm_wflow <-
  workflow() %>% 
  add_recipe(exp_recipe) %>%
  add_model(glm_spec)

# Create cross validation splits
wr_folds <- vfold_cv(train_data, v = 5)

# Tune the hyperparameters using a grid of values
glm_tune_results <- tune_grid(
  glm_wflow,
  resamples = wr_folds,
  grid = 10   # Number of tuning combinations to evaluate
)

Examine Tuning Results

# Display tuning results

glm_tune_results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean)
# A tibble: 10 × 8
    penalty mixture .metric .estimator  mean     n std_err .config              
      <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1 8.38e- 7   0.469 rmse    standard   0.799     5 0.00167 Preprocessor1_Model05
 2 1.44e- 5   0.744 rmse    standard   0.799     5 0.00165 Preprocessor1_Model08
 3 4.65e- 6   0.670 rmse    standard   0.799     5 0.00164 Preprocessor1_Model07
 4 1.18e- 9   0.352 rmse    standard   0.799     5 0.00167 Preprocessor1_Model04
 5 1.00e- 8   0.287 rmse    standard   0.799     5 0.00167 Preprocessor1_Model03
 6 3.44e- 4   0.125 rmse    standard   0.799     5 0.00167 Preprocessor1_Model01
 7 2.35e-10   0.559 rmse    standard   0.799     5 0.00164 Preprocessor1_Model06
 8 1.43e- 3   0.168 rmse    standard   0.799     5 0.00165 Preprocessor1_Model02
 9 2.42e- 2   0.889 rmse    standard   0.802     5 0.00161 Preprocessor1_Model09
10 2.36e- 1   0.990 rmse    standard   0.840     5 0.00206 Preprocessor1_Model10

Next, we find the best parameters of the group: finalize_workflow() will plug the optimal set of hyperparameters, as identified by select_best(), into the workflow.

# Select the best hyperparameters based on RMSE
best_glm <- select_best(glm_tune_results, metric = 'rmse')

# Finalize the workflow with the best hyperparameters
final_glm_workflow <- finalize_workflow(glm_wflow, best_glm)

best_glm
# A tibble: 1 × 3
      penalty mixture .config              
        <dbl>   <dbl> <chr>                
1 0.000000838   0.469 Preprocessor1_Model05

Above you’ll find the optimal configuration of the model. We won’t get too far into the weeds here, but note that the penalty term is small (low regularization, high complexity) and the mixture (the type of regularization) leans toward Ridge regression (more shrinkage, less variable elimination).

Below, we’ll take the best model and fit it to the entire training data set before validating it against the test set.

# Fit the finalized model on the entire training data
final_glm_fit <- fit(final_glm_workflow, data = train_data)
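If you want to peek at which predictors the fitted model actually kept (a sketch using the standard tidymodels extractors), tidy() on the underlying parsnip fit returns the coefficients at the selected penalty:

# Sketch: inspect the fitted coefficients at the chosen penalty
final_glm_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate)))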

Model Evaluation

Diagnostics

Now that we have our model fit on the full training set, let’s evaluate it and check its reliability and compliance with key assumptions.

# Make predictions on the test set and tidy
glm_predictions <- augment(final_glm_fit, new_data = test_data) %>%
  # Invert the log(y + 1) transform: predicted points = exp(.pred) - 1
  mutate(.pred_fp = exp(.pred) - 1,
         .resid = fantasy_points - .pred_fp)

# Evaluate the model's performance (RMSE)
glm_metrics <- glm_predictions %>%
  metrics(truth = fantasy_points, estimate = .pred_fp)

# Print the evaluation metrics
print(glm_metrics)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       5.45 
2 rsq     standard       0.148
3 mae     standard       4.21 
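Note that this RMSE is on the fantasy-point scale, while the cross-validation RMSE of ~0.80 above was on the log scale. As a quick apples-to-apples sketch (log_fantasy_points was carried through the split, so it’s available here), we can also score the predictions on the log scale:

# Sketch: evaluate on the log scale, matching the CV metric above
glm_predictions %>%
  metrics(truth = log_fantasy_points, estimate = .pred)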
# Distribution of residuals
glm_predictions  %>%
  ggplot(aes(x = .resid)) +
    geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Residuals vs fitted
ggplot(glm_predictions,aes(x = .pred_fp, y=.resid)) +
  geom_point() +
  geom_smooth(se = F) +
  geom_hline(yintercept = 0,linetype = "dashed",color = "red")+
  labs(title = "Residuals vs. Fitted", y= "Residuals",x = "Fitted")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(glm_predictions,aes(sample = .resid)) +
  geom_qq() +
  geom_qq_line() +
  labs(title = "Q-Q Plot of Residuals") +
  theme_minimal() 

While the histogram of residuals looks relatively normal, the other plots draw attention to some potential issues. Starting with the residuals vs. fitted plot, the points tail off on the right side; they should be scattered randomly, but we can see a pattern. Similarly, in the Q-Q plot, the right side tails well off course. This suggests that the model struggles mightily with estimating large fantasy point values, which could be due to outliers and/or a sign that a linear model may not be the best approach for this problem. It also lines up with the inflated number of 0s seen in our target variable. We probably should not trust this model for valid inference or prediction, and should continue to explore non-linear options or modifications to this model (e.g., variable interactions or transformations).

Variable Importance

final_glm_fit %>% 
  vip::vi() %>%
  mutate(Importance = abs(Importance),
         Variable = fct_reorder(Variable,Importance)) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0,0)) +
  labs(x = "Impact")

I’m personally not a huge fan of variable importance plots because they don’t communicate the actual “importance” of a variable to our target outcome; they just measure the impact of the variables on the model’s predictions. In other words, they’re not that useful on their own and can be misleading for inference. I’ll take a few variables on this plot as examples.

In this case, “wt_target_share” is our weighted average of target_share, which is defined as the “player’s share of team receiving targets in this game.”

It makes sense that this has a positive impact on fantasy points scored as it directly describes fantasy opportunity for receivers. All else being equal, it’s likely that increased opportunity results in increased fantasy point value.

However, does “receiving_fumbles_lost” having a positive impact on projected fantasy points seem reasonable? Considering that fantasy points are deducted when a player fumbles the ball away in real life, this doesn’t make much sense. We’d be hard-pressed to convince anyone to target a player who fumbles the ball often. Why is this the case?

It’s likely due to the aforementioned multicollinearity issue. In order to fumble the ball, a player must first have received it. The players with the highest number of fumbles are likely the players who receive the ball most often (players the offense wants to get the ball to) and consequently get increased opportunity to score more points in addition to fumbling. Below we can see “fumbles lost” as a function of receptions and target share, respectively.

ggplot(train_data, aes(x = wt_receptions, y = wt_receiving_fumbles_lost)) + 
  geom_smooth() +
  theme_minimal() +
  labs(y = "Fumbles Lost", x = "Receptions")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2693 rows containing non-finite outside the scale range
(`stat_smooth()`).

ggplot(train_data, aes(x = wt_target_share, y = wt_receiving_fumbles_lost)) + 
  geom_smooth() +
  labs(y = "Fumbles", x= "Target Share")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 2693 rows containing non-finite outside the scale range
(`stat_smooth()`).

As mentioned earlier, the regularized model handles variable selection but does not completely remove the effects of multicollinearity. Multicollinearity makes inference very challenging, and we will likely need a model better suited to non-linear relationships, or to include more interactions, to improve model performance and interpretability.
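As a concrete next step (a sketch only, not fit here), the same workflow could swap the linear spec for a tree-based model, which handles non-linearities and interactions natively:

# Sketch: swap in a random forest (requires the ranger package)
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wflow <- glm_wflow %>%
  update_model(rf_spec)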

Conclusion

This was a good first step in creating a model for my fantasy football needs. There is clearly room for improvement as far as model selection goes, but it was fun to use the slider package for the first time and attempt to estimate a transformed target variable. I have some ideas for additions to and subtractions from this model that will hopefully make it into another blog post. In the meantime, big shoutout to the nflreadr team for making this data easily available.

Ho T, Carl S (2024). nflreadr: Download ‘nflverse’ Data. R package version 1.4.1.05, https://github.com/nflverse/nflreadr, https://nflreadr.nflverse.com.