Predicting Fantasy Football Projections

The most valuable player at his position (Antonio Brown, WR) celebrating yet another highlight-worthy touchdown. Side note: he was on my championship team roster!

For the Jupyter Notebook code, click here.


Can we predict fantasy football projections using a player's previous years' data?

For those unfamiliar with the rules of fantasy football, here's a friendly crash-course:

  1. Teams are picked on "draft day" and consist primarily of offensive "skill positions" (i.e., quarterbacks, running backs, and wide receivers).
  2. Throughout the football season, "fantasy points" are derived from real-life stats like total yards and touchdowns in a given game.
  3. For example, a running back will get 1 fantasy point for every 10 yards gained and 6 points for every touchdown.
  4. A fantasy football match is head-to-head, and the team that accumulates the most points wins that week.
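The scoring rules above boil down to simple arithmetic. Here is a minimal sketch, assuming standard (non-PPR) scoring; the `fantasy_points` helper is hypothetical, not part of any library:

```python
# Sketch of the scoring rules above (standard, non-PPR scoring assumed):
# 1 fantasy point per 10 rushing yards, 6 points per touchdown.

def fantasy_points(rush_yards: float, touchdowns: int) -> float:
    """Convert a running back's game stats into fantasy points."""
    return rush_yards / 10 + touchdowns * 6

# A 120-yard, 2-touchdown game is worth 24 fantasy points.
print(fantasy_points(120, 2))  # 24.0
```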

As a fervid fantasy football fanatic, I wanted to see if I could leverage data science to help me make better draft day decisions. The running back position is not only the scarcest, but also the one that bears the most variance, due to both frequent injuries and the preference for younger backs (running backs have the shortest average career of any position, at only 2.57 years). So it's no surprise that the teams with high-performing running backs are usually among the championship favorites in your league.

With all of this domain knowledge at hand, I wanted to see which feature(s) are the strongest indicators of projected average fantasy football points per game for an NFL running back. After all, when debating which players to pick on draft day, should I care more about last year's total touchdown count or average rushing yards per game? Or maybe it's rushing attempts?

I began by using BeautifulSoup to acquire individual player data from the past twelve years from one of my personal favorite sports stats websites, FFtoday.
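FFtoday's actual markup is more involved, but the scraping step looks roughly like this sketch, which parses a stats table with BeautifulSoup. The table HTML, its class name, and the column names here are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a fetched FFtoday page; the real site's
# markup and column names differ.
html = """
<table class="stats">
  <tr><th>Player</th><th>Att</th><th>Yard</th><th>TD</th></tr>
  <tr><td>Back A</td><td>280</td><td>1250</td><td>11</td></tr>
  <tr><td>Back B</td><td>195</td><td>820</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", class_="stats").find_all("tr")

# First row holds the column headers; each later row is one player.
headers = [th.get_text() for th in rows[0].find_all("th")]
players = [
    dict(zip(headers, (td.get_text() for td in row.find_all("td"))))
    for row in rows[1:]
]
print(players[0])
```

From here, the list of dicts drops straight into a pandas DataFrame for cleaning.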

A snapshot of my final (cleaned) data frame of individual player data from FFtoday.

Using all of the numerical categories (columns) as features (9 total) and FPG_NEXT as the target variable, I started off by experimenting with the tree-based algorithms random forest and gradient boosted trees. In short, what's going on is that I am using a given year's stats to try and predict the following year's average fantasy football points per game.
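The setup can be sketched as follows with scikit-learn. The data here is synthetic (random numbers standing in for the nine FFtoday features and the FPG_NEXT target), so only the shape of the experiment matches my real one:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 9-feature frame; the real columns come from
# FFtoday (attempts, yards, touchdowns, FPTS/G, etc.).
X = rng.normal(size=(500, 9))
# Fake FPG_NEXT target: driven mostly by one feature, plus noise.
y = 10 + 3 * X[:, 0] + rng.normal(scale=1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (RandomForestRegressor(random_state=0),
              GradientBoostingRegressor(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```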

I found that my model was making poor predictions because it was simply guessing that any given player would average about the median, which was roughly ten points. Since ten points is the baseline guess you could make without any data science at all, I couldn't be satisfied with a model that left these nine features essentially unaccounted for.
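That median-guessing behavior is exactly what a naive baseline does, and scikit-learn's `DummyRegressor` makes the comparison explicit. The target values below are made up for illustration, with a median near ten:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical next-year FPTS/G values, with a median near ten.
y = np.array([4.0, 8.0, 10.0, 12.0, 18.0])
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# Always predict the training median -- the "no data science" baseline.
baseline = DummyRegressor(strategy="median").fit(X, y)
preds = baseline.predict(X)
rmse = mean_squared_error(y, preds) ** 0.5
print(preds[0], round(rmse, 3))
```

A real model is only worth keeping if its RMSE beats this baseline by a meaningful margin.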

Taking a further look at the correlation values from my two models, I found that there was an overwhelming feature outlier: fantasy points per game. Its correlation value (0.507) was significantly higher than that of any of the other eight features. This means that my tree-based models were basing their predictions primarily on a single feature and disregarding the others.
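One quick way to spot such an outlier is to rank every feature's correlation with the target. This toy frame uses invented numbers and only three of the nine features, but it is built to mimic the pattern I saw, with FPTS/G dominating:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Toy data: FPTS/G is constructed to drive the target; the other
# features are unrelated noise. Column names mirror my real frame.
fpts_g = rng.normal(10, 3, n)
df = pd.DataFrame({
    "FPTS/G": fpts_g,
    "Att": rng.normal(200, 50, n),
    "TD": rng.poisson(6, n),
    "FPG_NEXT": 0.5 * fpts_g + rng.normal(0, 3, n),
})

# Rank each feature by its correlation with next year's points per game.
print(df.corr()["FPG_NEXT"].drop("FPG_NEXT").sort_values(ascending=False))
```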

The next step of the experiment was to explore this phenomenon. I built a single-variable linear regression model using only the FPTS/G feature. Interestingly, the root-mean-square error of this model (3.022) was in the same ballpark as that of the random forest (3.053) and gradient boosted trees (3.301), which both had more features to learn from.
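The single-variable model itself is only a few lines. The data below is simulated rather than my real FFtoday frame, but the shape of the experiment is the same:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)

# Simulated data: next year's FPTS/G is this year's, scaled, plus noise.
fpts_g = rng.normal(10, 3, 400).reshape(-1, 1)
fpg_next = 4 + 0.6 * fpts_g.ravel() + rng.normal(0, 3, 400)

# One feature in, one target out.
model = LinearRegression().fit(fpts_g, fpg_next)
rmse = mean_squared_error(fpg_next, model.predict(fpts_g)) ** 0.5
print(round(rmse, 3))  # noise scale of 3 -> RMSE near 3, like my real result
```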

A linear regression model that used just fantasy points per game as its only feature (RMSE = 3.022).

I concluded that my original tree-based models with nine features were learning to disproportionately weight the previous year's FPTS/G in their algorithms to predict the following year's FPTS/G. In general, this occurs when a feature is too directly linked with the target variable. This makes sense, considering a current year's fantasy points per game should, on average, be among the strongest indicators of the following year's. In other words, we can simply point to last year's FPTS/G as the strongest predictor of this year's projected FPTS/G.

This is already the consensus for draft day strategy, but the goal has always been to gain a competitive advantage by leveraging data science. Future work could be done with less "direct" features that might also impact a running back's fantasy points production. For example, one could scrape the official NFL website with Selenium for player ages, professional experience, and NFL draft position. I was also forced to drop all rookies (for obvious reasons) from my DataFrame, so using college stats could supplement the data; however, it's worth noting that those stats wouldn't reflect performance against NFL-caliber defenses.

To be even less "direct" about the feature engineering, one could experiment with teammate stats, because there is a ceiling on the total fantasy points that a team can generate on game day. In the fantasy world, we call this the number of "mouths to feed" on an NFL team. As an example, the San Francisco 49ers could have a strong wide receiver corps (not true, unfortunately), in which case any touchdowns in the passing game are taken away from their running back(s). The Niners could also have multiple talented running backs (still not true), which would lower the ceiling for each of them. Finally, one could consider the difficulty of the running back's team schedule versus opposing teams' run-defense rankings.

Happy drafting!