NFL Handicapping Fundamentals

by Quantum Sports September 10, 2018

This is an excerpt from a book I have been working on, although I'm not sure if I'm going to release it at this point.

Before we get into detailed handicapping methodology, we first need to walk through a few key handicapping fundamentals pertinent to the NFL, the sport of football, and sports betting as a whole. The rest of the book will make much more sense if you understand the concepts we discuss here, and even if you decide that you don’t want to handicap the NFL at a high level, they should improve your understanding and enjoyment of the game.

One Game Doesn’t Mean Much

From a handicapping perspective, probably the most critical fact that fans, bettors, and even front offices and coaches don’t realize about the sport of football is that a single football game is almost meaningless when viewed through the lens of measuring the skill level of the teams involved. At its core, sports handicapping is about using all of the information available to form our view of the teams and players involved, then translating this information into a point spread. The way this usually works each week, for most bettors who are at least trying to handicap well, is something like the following:

Bettors develop their own prior views on the strength of the teams based on past statistics, scouting/coaching ratings, and injury or game plan news, and wager accordingly
Bettors watch or read the box scores from the current week’s games
Bettors update their prior views to account for the results of the games in preparation for the following week’s wagers

The part of this process that most get wrong is the final step. Almost regardless of how a bettor sets their prior views, they will over-adjust their future expectations after watching the past week’s games. Regardless of how poorly a team may have played, a single NFL football game is far too short, with too little information included about the relative strength of the teams, to draw many conclusions about how good the teams actually are.

To better understand this concept, suppose we are in 2017 and Tom Brady and the Patriots are taking on the Browns as 16-point favorites. You have laid the number because the Pats are a lock and always cover, but unfortunately on the opening drive, the Browns manage to stop the Patriots, then return the ensuing punt for a touchdown. You are obviously quite concerned about your wager at this point. But has anything changed in your mind about the relative strength of the Patriots vs. the Browns? Probably not. Everyone knows one drive means almost nothing, and a big punt return also tends to be a freak play that is mostly luck.

Now consider that the Browns fight the rest of the first half and go into halftime with a 14-10 lead. At this point, maybe you begin to consider that the Browns are a little better than you thought. After all, even counting out the lucky punt return, the Pats only beat them by 3 points in the entire half, which you expected them to win by more like 10 points. Maybe you think that the Pats really should have only been about a 14-point favorite, or at least something less than a 16-point favorite, in which case you would not have bet on them. Finally, you come to the end of the game, which the Pats barely win 27-24 on a last-second touchdown. You realize your bet was on the “wrong side” at this point, and that the Browns were better than you thought, or perhaps the Patriots were worse.

The problem with this thinking is that you are probably wrong. Barring injuries, the line the following week in an imaginary game between the two teams would be no worse than Pats -14.5, would most likely be Pats -15.5 or -16, and depending on how the game played out, may even feature the Pats as bigger favorites than before. And the reason is that there just aren’t enough plays in a single football game for the result of those plays alone to mean more than all of the other past information we have on the teams – mostly statistics from past games, but also non-statistical information on how talented each team is and how good their coaches and staff are. Even when teams perform far worse than their true talent level, much of the time it is due to reasons that only come into play that given week, such as bad luck, worse than usual preparation, travel issues, a team getting the flu – almost anything you can imagine.

But let’s prove it. We can test how much one game should influence our predictions using a technique called linear regression, which we will use throughout this book. For now, it is enough to know that linear regression is a statistical tool where, given the input variable of statistics from past football games and a set of scores from the relevant current football games, the tool will find how to best predict the score using the statistics you give it.
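As a minimal sketch of what such a fit does, the snippet below estimates an intercept and a single slope by ordinary least squares using NumPy. The function name fit_line and the toy numbers are illustrative assumptions; they are not the data or the code behind the results discussed in this excerpt.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (for the intercept) next to the input.
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    intercept, slope = coef
    return float(intercept), float(slope)

# Toy inputs only (made up for illustration, NOT NFL results).
toy_x = [-10.0, -3.0, 0.0, 7.0, 14.0]
# Built to be exactly linear, so the fit should recover 1.0 and 0.5.
toy_y = [1.0 + 0.5 * v for v in toy_x]
intercept, slope = fit_line(toy_x, toy_y)
```

With real game data, x would hold the input statistic for each game and y the final scoring margin; the fitted slope is the "coefficient" we refer to throughout.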

We start by testing all games since 2010 in our linear regression model. In our model, we will have only one input, the margin by which each team won or lost their past game, which we will ask the model to use to project the difference in score between the home and away team in the following game. We exclude games played in week 1 of the regular season, where there is no previous game to go on. We make one adjustment to this margin to account for home advantage, adding three points if the team was on the road in their past game, or subtracting three points if they were at home. The results are as follows:

For those not familiar with linear regression, it is worth having a brief discussion of what each coefficient means. The coefficient in the first row, corresponding to the intercept, is the average scoring margin across all games, assuming all of the other variables in the linear regression model are set to zero. In the NFL, where there is no bias such that home teams are likely to be better than away teams other than in the playoffs, this value is equal to the home field advantage (in high school and college football, better teams are more likely to be at home, so this may not always be true unless we adjust for team strength).

The second coefficient corresponds to the difference in scoring margin each of the two teams had in the past game. The value of 0.12 tells us that, to predict the result of the following game, the best choice is to take 0.12 times this variable. In other words, we take how many points the home team won or lost by in their last game and subtract how many points the away team won or lost by in the last game, then multiply this value by 0.12. And to come up with our final prediction, we add 2.83 to account for the home field advantage.

The value of 0.12 suggests that we only use 12% of what happened in the last game to predict the future game. Even if we have a team that won by 20 in their last game playing a team who lost their last game by 20, for a difference of 40 points (almost as extreme a case as is possible in the NFL), we’d still only predict a difference of 4.8 points (plus the home advantage). Most of the time this variable is going to make a difference of less than 2 points.
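The arithmetic can be written out as a short sketch. The two coefficients (2.83 and 0.12) and the three-point site adjustment come from the text; the function names are hypothetical, and the example treats the 20-point margins as already adjusted for site.

```python
HOME_FIELD = 2.83       # intercept reported in the text
LAST_GAME_COEF = 0.12   # coefficient on the last-game margin differential
SITE_ADJUSTMENT = 3.0   # points added for a road game, subtracted for a home game

def adjusted_margin(margin, was_home):
    """Last-game margin after the home/road adjustment described in the text."""
    return margin - SITE_ADJUSTMENT if was_home else margin + SITE_ADJUSTMENT

def predict_home_margin(home_last_adj, away_last_adj):
    """Predicted home-minus-away margin from the two adjusted last-game margins."""
    return HOME_FIELD + LAST_GAME_COEF * (home_last_adj - away_last_adj)

# Extreme case from the text: home team won by 20, away team lost by 20.
example = predict_home_margin(20.0, -20.0)   # 2.83 + 0.12 * 40
# A 10-point home win counts as a 7-point margin after the adjustment.
home_win_adjusted = adjusted_margin(10.0, was_home=True)
```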

In reality, the true value of the last game is actually less than 12%. In our model above, we only used what happened in the last game to predict the next game. This means that we essentially placed all our handicapping value solely on the score of the last game, which hopefully we would never do in real life. Using only the past game will cause our model to read more into the results of that game than it otherwise would if we used a larger sample.

Let’s see how valuable one game is later in the season, when we have more past results to go on. In the second version of this simple model, we are going to include two variables. The first variable will be the same last-game scoring margin variable as before, and the second variable will be the difference in average scoring differential per game for both the home and away team, up to that point in the season, including the last game. To ensure that we are only looking at games where we have a well-established view of both teams, we will start at week 11 of the season and continue through week 16 (we exclude week 17 because many teams don’t try in week 17, a factor we will discuss later in the book). We come up with the following result:

We can see that after including it as part of our season average, the scoring margin in the last game makes almost no additional difference in our prediction: we would only use 0.2% of the difference in predicting the next game. What we do use is 72.8% of the difference in scoring margin to date. What this means is that in week 11 and onwards, when teams have played 10 or more games, while the last game matters, it only matters as much as any other game, including the game played in week 1. So, for example, in week 11, assuming the team has not yet had a bye, one game makes up one tenth of the games to date, and multiplying 0.1 times 0.728, we come up with 0.0728, meaning the last game would only make up 7.28% of the prediction, and even less as the season goes on.
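The same calculation for other points in the season can be sketched as follows. The coefficients are the ones quoted above; the function name and the include_direct flag (which adds the separate 0.2% last-game coefficient on top of the season-average share) are assumptions for illustration.

```python
SEASON_COEF = 0.728     # weight on season-to-date margin differential (from the text)
LAST_GAME_COEF = 0.002  # separate weight on the last game alone (from the text)

def last_game_share(games_played, include_direct=False):
    """Share of the prediction attributable to the most recent game."""
    # Each game is 1/games_played of the season average.
    share = SEASON_COEF / games_played
    if include_direct:
        share += LAST_GAME_COEF
    return share

# Share of the last game after 10, 12 and 15 games played.
shares = {n: last_game_share(n) for n in (10, 12, 15)}
```

The share falls as games_played rises, which is the point made above: by late in the season the most recent game is only a small slice of what we know.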

This may come as a surprise – one would think recent games should be weighted more heavily – but the reality is that to the extent that a football team changes in performance level over the course of a season, most of this is determined by injuries or other changes in personnel. And even if there is some reason to weigh recent games more heavily, because one football game is such a small sample size, we almost always want to include as much performance from the current season as possible, including games which happened as long as four months ago.

There is one final concept worthy of mention around the value of a single game in sports handicapping. In general, the less competitive the league or sport in question, the more value the models will place on the outcome of a single game, or any small sample of games. For example, if we did the above analysis on college football, after we adjusted for strength of schedule we’d probably wind up being told to use something like 20-30% of the past game’s scoring margin, instead of 12%. In high school football it would be even higher. There are a few more plays in a college game compared to the NFL, but this is not the reason for the difference; there are probably fewer meaningful plays per game in college when compared to the NFL.

To understand the difference between NFL, college football, and other sports, we have to go back to the reason why we would ever ignore 88% of what happened in the previous game in the first place, a phenomenon known as regression to the mean. In the NFL, the league is structured in a way such that teams tend to be very evenly matched when measured over the course of a season. The salary cap ensures roughly similar talent levels, and market forces ensure the ruthless culling of incompetent coaches and management, at least most of the time. Even the worst teams in the league are usually only around one touchdown below average per game, and often much of that is caused by injuries. In contrast, in college football, there are often teams that are four touchdowns better than an average FBS team.

What this means statistically is that when we see a big blowout in the NFL, while it certainly should influence our future predictions, it is still overwhelmingly more likely that the two teams are roughly evenly matched, meaning we should only adjust our prior expectations very slightly. There is a ton of variance in one football game, and too many years of knowing that teams can never differ by all that much in the NFL, to say anything else. But in college football, because there actually are games where teams are 30 or more points different in talent, the blowout means a lot more. We can’t just wave those types of games aside and say, “no team is this bad in college football, they’ll bounce back.”
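One common way to make this intuition concrete is a simple shrinkage weight. If true team strength varies across a league with standard deviation talent_sd, and a single game result adds noise with standard deviation noise_sd, then the share of one observed margin worth keeping is the talent variance divided by the total variance. This is a textbook simplification, not the regression used earlier, and the input values below are placeholders chosen only to show the direction of the effect, not estimates for any real league.

```python
def shrinkage_weight(talent_sd, noise_sd):
    """Fraction of a single observed result to keep (the rest regresses to the mean)."""
    talent_var = talent_sd ** 2
    noise_var = noise_sd ** 2
    return talent_var / (talent_var + noise_var)

# Placeholder values (assumptions): same game-to-game noise, different talent spread.
tight_league = shrinkage_weight(talent_sd=5.0, noise_sd=13.0)   # teams closely matched
wide_league = shrinkage_weight(talent_sd=12.0, noise_sd=13.0)   # teams far apart
```

Holding the noise fixed, the league with the wider spread of talent keeps more of each result, which is the pattern described above for college football relative to the NFL.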

Now, this is not a college football handicapping book, although I’m sure most readers of this book bet college football as well. And for what it’s worth, the college football market still over-reacts to recent games, even if they do mean more than the NFL. But what is important to note is that the same concept does pop up in the NFL as well. In fact, it was a misapplication of this concept that led to me losing huge in the 2016 NFL season, on the Cleveland Browns.

The 2016 Browns were a favorite of a lot of “sharp” handicappers for the simple reason that they were getting “too many” points each game (I say "sharp" here because while ESPN and a few other media sources said everyone was on the Browns, there were plenty of long-term winners who weren't). In fact, the Browns were getting on average 7 points each game, and even more later in the season. Historically in the NFL, these sorts of big underdogs have done pretty well, because teams are usually just not that bad. Put another way, any kind of statistical model that uses game statistics (regardless of the complexity of the stats used) would have suggested being on the Browns each week, as there was value between the line and what such models would predict. I was using such a model and enjoyed getting great “value” as the Browns finished the year 4-12 ATS.

The problem with the statistical models here is that they were built on a sample of typical NFL teams, who usually try to win every year. With these teams, the statistical models are right; no team that is trying to win should be getting over a touchdown week after week, barring certain injury situations. The models will always tell you to heavily regress performance to the mean, which is another way of saying that you should assume teams will perform like any other. But the Browns were a special case as they were actively tanking. What this meant was that holes in the roster were not filled with NFL-caliber replacement talent, and game plans were constructed in a manner inconsistent with trying to win games. Such a team is going to break any model based on raw game statistics alone. Fortunately, we have other information that we can add to our models that will help us spot truly horrible teams like the Browns, while allowing us to continue to properly regress short-term results to the mean with most other teams.

Context is King

Anyone who watches the NFL will quickly realize that not all plays, or all games, have the same value in predicting future outcomes. For example, a team may roll out to a big lead by dominating the first half, only to turn conservative in the second half. The other team, knowing the opponent is likely to want to run out the clock, will call defenses to stuff the run, which will usually work if the opponent does actually run the ball. They then move the ball well throwing short passes against an overly-conservative defense. At the end of the game the stats, including somewhat more advanced stats like adjusted yards per play, will suggest they were the better team in the game, but the reality is that they were worse, and primarily benefited from the game situation. In the following game, the same team may fall behind due to poor play or bad luck, only to get the same benefit in context and dominate the second half. The advanced statistics will laud their incredible improvement in performance, when in reality they played at the same level in both games.

Even over a sample size of several games, this difference in context can vary widely from team to team, due to factors like strength of schedule, home vs. away, and sheer variance in the game flow. The one constant across all teams in the NFL is that they do better when playing aggressively, even after one adjusts for the increased likelihood of turnovers. Teams who are placed in situations where they are more likely to be aggressive will appear to be a little better than others, even if they are not. The type of situations that cause teams to play more aggressively include being down in the game, 2nd and 3rd and long situations, and not getting the ball near one’s own red zone. On the other side, playing with the lead and having the ball in the red zone or in short-yardage situations tend to cause teams to play more conservatively, which hurts team statistics.

By “aggressive” I don’t necessarily mean throwing only deep balls or running crazy plays on offense, going for every fourth down (although that would beat most teams’ fourth down strategy), or always blitzing on defense. Rather I am talking about simply playing roughly optimally regardless of the score situation, like still trying to score points through the passing game with the lead in the fourth quarter, intelligent decision-making on 2nd and short and 4th downs, and trying to prevent first downs with the lead instead of just preventing big plays on defense. Aggressive strategies in NFL football are so superior that much of what makes some teams better than others is the willingness to play more aggressively in situations where other teams would not.

In any case, to best predict the performance of teams, we must adjust for these contextual differences. The adjustment process is very complex and requires significant statistical modeling to do well. First, one has to even establish what “good” performance is on each play. Is it yards gained / yards allowed per play? Touchdowns scored/allowed per drive following that play? Or something else? The choice is very important in determining how predictive your statistic will be, and there are plenty of good supporting reasons for each of the common choices. After deciding on this, one then has to figure out how well the average team would have performed in each situation. The best way to do this is through a series of statistical models going back through play-by-play data to measure the impact of all the various factors, including strength of schedule, scoring margin, weather, down and distance, and many other factors.

Fortunately, someone has already done all this work for us, and furthermore, they provide all their data for free, or basically for “free” in comparison to how much most people bet on the NFL. While I have no affiliation with their site or service, the Football Outsiders DVOA Ratings do a very good job of adjusting for context, and are highly predictive of performance, particularly later in the NFL season. While it is possible to improve on DVOA, having personally built the same type of model they have, I can assure the reader that one loses very little by just going with what FO provides, and saves a ton of time-consuming database work in the process.

DVOA measures team ability based on how well they perform on each play. The measure Football Outsiders uses is proprietary and uses a combination of yards per play and success, which is defined as the percentage of yards-to-go gained. They also provide bonuses for touchdowns and penalties. Each team’s performance on each play is compared to the league average team, in that season, and adjusted for the situation in which the play took place. The DVOA measure is then provided in terms of a percentage above or below the league average team. On offense and special teams, a higher DVOA is better (more yardage gained), while on defense, a lower DVOA is better (fewer yards given up). In a typical season the league leaders in each category will be around 30%, meaning they gain 30% more yards, or allow 30% fewer yards, than average, with most teams clustered between -10% and +10%.

We can prove the value of DVOA, and the importance of adjusting for context, with another statistical test. In this test we are going to compare two models. One model will be largely the same as the last one we discussed, predicting NFL week 11-16 performance based on past point differential per game. The other will be a DVOA-based model, using the season-to-date offensive, defensive, and special teams DVOA team statistics, over the same time period.

In this comparison we see our old model, our new DVOA model, and a new statistical tool, called the R-squared. Because the DVOA value is an abstract percentage, rather than something we can really define like yards or points in past games, the coefficients in this new model may not mean much to the reader right now, although we can define their unit as points per percentage DVOA differential. But what is important is this new R-squared at the bottom of each chart. R-squared measures the portion of the variance in the variable we are trying to predict, in this case how many points the home team will win by, that is explained by the model we have defined. Comparing the two, the “point differential only” model explains 16.99% of the variance, while the DVOA model explains 19.15%.
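For readers who want to see the definition spelled out, R-squared can be computed from a set of actual results and a model's predictions as one minus the ratio of unexplained to total squared variation. The sketch below uses NumPy; the function name and the toy margins are assumptions for illustration only.

```python
import numpy as np

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)       # what the model leaves unexplained
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Toy home margins (made up; their mean is 1.0).
toy_actual = [3.0, -7.0, 10.0, -2.0]
perfect = r_squared(toy_actual, toy_actual)      # predictions match exactly
mean_only = r_squared(toy_actual, [1.0] * 4)     # always predicting the average
```

A model that always predicts the average margin scores zero and a perfect model scores one, which is why values in the high teens are less discouraging than they first look for something as noisy as a football game.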

This may not seem like much, and in absolute terms it is not, but it is very significant from a betting perspective. Variance here is measured in squared points of scoring margin, so we take the square root to get back to points. Doing so, we find that explaining 2.16% more of this variance winds up making a difference of around 0.2 points per game, and will help us predict the winning team (not ATS, as that is tougher) somewhere around two percent more often.
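One way to reproduce the order of magnitude of that figure is to note that a regression's typical error equals the standard deviation of the outcome times the square root of one minus R-squared. The sketch below applies this to the two R-squared values quoted above. The 13.5-point standard deviation of NFL scoring margin is an assumed round number, not a figure from this excerpt, so the output should be read as a rough check rather than the exact calculation behind the text.

```python
import math

def typical_error(margin_sd, r2):
    """Residual standard deviation implied by an R-squared value."""
    return margin_sd * math.sqrt(1.0 - r2)

ASSUMED_MARGIN_SD = 13.5   # assumption: rough spread of NFL scoring margins, in points

points_only_error = typical_error(ASSUMED_MARGIN_SD, 0.1699)  # point differential model
dvoa_error = typical_error(ASSUMED_MARGIN_SD, 0.1915)         # DVOA model
improvement = points_only_error - dvoa_error                  # points per game gained
```

Under this assumption the improvement comes out at a fraction of a point per game, in line with the roughly 0.2 points mentioned above.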

Improving a betting model even by just a fraction of a point per game can make a big difference on one’s betting results. To understand this idea we can compare our simple models to the best model of all: the closing betting line at Pinnacle Sports.

We can see that our model using DVOA sits about halfway between the simplest possible “model” one could really consider, straightforward scoring margin per game, and the best possible model, the Pinnacle Sports closing line. Our goal as handicappers is to find a way to handicap better than, or about as well as the market, so that when we find plays where our line is far off, it is more likely that our difference comes from our prediction being superior, rather than random noise. Since we have to pick the right side and pay vigorish to do so, if we can’t handicap at least as well as the betting market, we don’t have a chance.

Right now, we are about halfway there, just using one relatively simple stat in DVOA. Fortunately, there are many factors we have not yet considered, some of which do not even come into any type of statistical model but are significant nonetheless. We will explore how to make up this other half throughout the rest of the book.

Early Season and Late Season Require Different Approaches

Late in the NFL season, we usually have enough of a sample size that statistics like DVOA, and even just past game results, give us a pretty good handle on the relative strength of teams. Especially when we adjust for injuries, context, and motivation, the statistics provide most of what we need. But early in the season, the picture is much different. No matter what statistics we use, one or two games worth of numbers do not give us much information.

One possible solution is to use last season’s statistics. This is certainly better than nothing, as after all, most of the same players and coaches come back each year for each team. But it turns out that the previous season’s performance is not all that predictive of the following season. We can see this by testing how effective last season’s stats are in predicting early-season outcomes in the following season. In weeks 1 through 3, if we try to predict the outcome of games using last year’s scoring margin alone, our linear regression model suggests that we only use 47% of the difference in last season's scoring margin. As we showed earlier, in weeks 11-16 we’d use more like 73% of the current season’s scoring margin, and by the end of that period, where we have 14-15 regular season games of sample size, it winds up being closer to 80%. When we compare models using DVOA, we get a similar answer.

Offseason changes are the major driver for this decline in the value of the past season’s stats when compared to stats from the current season. Rosters change due to free agency, bad coaches get fired, teams that were healthy get hurt, and good players tend to get worse as they age, among other reasons. For the most part, the direction of these changes tends to make good teams worse and bad teams better. The worst coaches get fired, not the best coaches. Healthy teams in the past season tend to be more injured in the following season, while teams that were hurt tend to be healthier. Players on good teams, who tend to have put up good stats, tend to get bigger salary increases to switch teams in free agency while players from bad teams who may otherwise be just as good make less. These sorts of factors bring about a natural regression to the mean in NFL team performance, while in the current season, the level of regression is much lower.

The statistical models tell us that, across the league, teams regress to the mean from season to season, and from our knowledge of football we know this happens for several reasons. But instead of relying on statistics across all teams alone to determine how much of this regression we would expect, we can instead focus on the underlying reasons, to improve our early-season predictions. Bad teams tend to improve through hiring better coaches, signing quality free agents, and young players getting better, but then there are the Cleveland Browns. Good teams tend to lose talent to free agency and aging, but some teams manage the cap and their roster well and don’t decline as far. The best way to determine how much a team is likely to improve or decline between seasons is through a careful, detailed study of the quality of players on their likely roster.

While the reader may laugh, it turns out that like with DVOA, there is a high-quality source of talent evaluation that is completely free, in the form of the player ratings published each year as part of an extremely popular football video game, whose name we will exclude for copyright reasons. This may seem like a total joke, but as we will see, the crew they use to rate the players does a pretty damn good job. How do we know this? As always, we can do a test. In our test we have gone back and measured the “video game rating” of the three components of each team’s roster in each game: offense ex-QB, defense, and the starting quarterback. Because we don’t always know who started the game, we use the top 11 ratings of any defensive player on that team who participated in the game, the top rating among all quarterbacks who played in the game, and the top 10 ratings among offensive players to calculate our team rating. We also make a few other adjustments which we will discuss later in the book. We then create a predictive model using the difference in home video game ratings minus away video game ratings at each position to predict the outcome of games.
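The roster aggregation described above can be sketched as follows. The top-11, top-10 and top-quarterback rules come from the text. Everything else is an assumption for illustration: the function names, the (unit, rating) input format, the choice to average the top ratings rather than sum them, and the made-up ratings. The "other adjustments" mentioned above are not included.

```python
def top_mean(ratings, count):
    """Average of the `count` highest ratings (averaging is an assumption)."""
    top = sorted(ratings, reverse=True)[:count]
    return sum(top) / len(top)

def team_components(players):
    """players: (unit, rating) pairs for everyone who appeared in the game.

    unit is "QB", "OFF" (offense excluding QB) or "DEF".
    """
    by_unit = {"QB": [], "OFF": [], "DEF": []}
    for unit, rating in players:
        by_unit[unit].append(rating)
    return {
        "qb": max(by_unit["QB"]),                      # top-rated QB who played
        "offense_ex_qb": top_mean(by_unit["OFF"], 10),
        "defense": top_mean(by_unit["DEF"], 11),
    }

def rating_differentials(home_players, away_players):
    """Home-minus-away rating gap for each component (the model inputs)."""
    home = team_components(home_players)
    away = team_components(away_players)
    return {key: home[key] - away[key] for key in home}

# Made-up rosters for illustration only.
home = [("QB", 90), ("QB", 70)] + [("OFF", 80)] * 10 + [("OFF", 60)] * 2 \
     + [("DEF", 85)] * 11 + [("DEF", 50)]
away = [("QB", 80)] + [("OFF", 75)] * 10 + [("DEF", 80)] * 11
diffs = rating_differentials(home, away)
```

The three differentials would then be the inputs to the same kind of linear regression used earlier, with the home scoring margin as the value to predict.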

We will discuss and improve upon this ratings-based model in the next chapter. But for now, we will compare the strength of this ratings-based model with two other approaches, a DVOA-based model that uses both the current and past season, and the Pinnacle Sports closing line:

We can see that the ratings-based model outperforms both the DVOA approach and the market early in the season, picking around 2% more winners on the money line. While this number may not mean much for now, we will later test and see how well this ratings-based approach would have done versus the final point spread and will find that it is a significant winner. The ratings-based approach peaks mid-season as players tend to round into their true form, but then falls off late in the year, where the DVOA-based approach which uses statistics from past games begins to perform better. The market steadily improves in its predictive strength throughout the year as well – more so than our DVOA approach alone, it is able to effectively weigh the statistics from past games to become much sharper late in the season.

Maybe the most surprising thing about the above chart, besides the fact that the ratings outperform the market early in the year, is that they outperform DVOA even in the middle of the season. These video game ratings are set at the beginning of the year (we do not use the in-season updates in our data) meaning that at mid-season, they completely ignore everything that happened in the first half of the season. But even with an average of 6 games played, they still outperform past statistics when used to predict a football game. Some of this is probably because our ratings-based model accounts for injuries while our DVOA model does not. But it is rare that injuries make a difference of more than a couple of points. Most of this goes back to our first fundamental, that one game does not contain that much information about the strength of football teams, and as we see here, six games do not contain much information either. Even almost halfway through the season, our preseason prior view is still more useful than what has actually happened in the previous games.