Taking a Look at PDO

A common tenet of hockey analytics is that PDO, the sum of a team's shooting and save percentages, converges to 100% over time. A corollary is that a team's past shot differential, or shot attempt differential (Corsi), is a better predictor of winning than the actual results of past games. Goals are relatively rare, small-sample events and are therefore subject to significant luck, whereas there are typically 15 to 20 times more shot attempts than goals in a game, so shot attempts should have more predictive value.
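As a quick illustration of the definition above, here is a minimal sketch of a PDO calculation. The function name and the example numbers are hypothetical, and it assumes the usual convention of counting shots on goal rather than all shot attempts.

```python
def pdo(goals_for, shots_for, goals_against, shots_against):
    """Return PDO on the conventional 100-point scale.

    PDO = shooting percentage + save percentage, both expressed in percent.
    Assumes shots on goal (not all Corsi attempts) are being counted.
    """
    if shots_for <= 0 or shots_against <= 0:
        raise ValueError("shot counts must be positive")
    shooting_pct = 100.0 * goals_for / shots_for
    save_pct = 100.0 * (1.0 - goals_against / shots_against)
    return shooting_pct + save_pct


# Hypothetical team: 9% shooting and a .920 save percentage.
example_pdo = pdo(goals_for=9, shots_for=100, goals_against=8, shots_against=100)
print(round(example_pdo, 1))
```

A team at exactly league-average shooting and goaltending lands at 100, which is why "regressing to 100" is shorthand for assuming no lasting skill in either percentage.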

This is indeed true to a certain extent. As we will see, teams that outshoot their opponents win more games, and past shot attempt differential (particularly when adjusted for situation) is the single best predictor of future goal differential. However, assuming that PDO converges to 100 in the long run vastly underrates the importance of goalie play, and also tends to overrate certain play styles that generate more shots and Corsi events on offense but intrinsically score fewer goals.

From a wagering perspective, as a statistical model-based handicapper, I am looking to incorporate all data possible into my handicapping models. Typically I will read all of the latest sports analytics literature to understand what sort of variables are likely to be most predictive. In the past, I utilized a model that relied almost completely on Corsi statistics to handicap hockey games. When I began placing wagers with the model, I found myself consistently wagering on teams that dominated puck possession but had scored fewer goals than one would expect, and vice versa. In the past, these teams would tend to be undervalued by the market, leading to substantial profits, but in recent seasons the gap had closed, and my results were mediocre as a result. It was only when I incorporated other statistical aspects of winning hockey play into my models that I began to really profit from the sport.

While I am not going to give away my full handicapping models here, as a study of the limitations of relying only on shot differential in handicapping (and in hockey analytics in general), I have created a handicapping model that uses only score-adjusted Corsi in past games to predict the results of future games. The only variable in this model (besides the intercept, which corresponds to the home advantage) is even-strength, score-adjusted Corsi per second, weighted fully for games to date in the current season, plus about four games' worth of this statistic for that team in the past season. Obviously this is not an ideal handicapping model, as it does not account for roster changes, injuries, back-to-backs, etc. But the purpose of this model is not to handicap hockey games; it is to show which teams most outperformed their Corsi-based prediction over the course of a season, and by how much. By looking at which teams had results far from their Corsi-based prediction, we can hope to better understand what factors tend to create higher or lower PDOs over time.

The best Corsi-based statistic I have found for predicting the winner of hockey games is score-adjusted, non-empty net, 5v5 Corsi (although there is really not much difference between Corsi and Fenwick). This article has a great explanation of how to perform score adjustments, and of the predictive power of Corsi in general. After making the adjustments, I find each team's past Corsi rate per second as of the time of each game, using a combination of 100% of the Corsi rate in all past games in the current season and 11000 seconds (roughly 4 games) worth of their Corsi rate in the past season. To give the Corsi statistics time to converge to a reasonable number, I am also excluding the first 200 games of any season in the analysis below (these games will still be used to calculate the Corsi statistic, but will not be part of any of the handicapping models or results).
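One way to read the weighting described above is as a weighted average in which last season's rate counts as if it were 11000 seconds of play. The sketch below assumes that interpretation; the function names and the example team are hypothetical, and the 2800 factor is the one the article uses to convert a per-second rate to a per-game figure.

```python
PRIOR_SECONDS = 11000        # prior-season weight quoted in the text (about 4 games)
SECONDS_5V5_PER_GAME = 2800  # per-game conversion factor quoted in the text


def blended_corsi_rate(current_diff, current_seconds, prior_rate,
                       prior_seconds=PRIOR_SECONDS):
    """Blend this season's Corsi differential with last season's rate.

    current_diff    : score-adjusted 5v5 Corsi differential so far this season
    current_seconds : 5v5 seconds played so far this season
    prior_rate      : last season's Corsi differential per 5v5 second
    Assumption: the prior season is treated as `prior_seconds` of extra play.
    """
    total_seconds = current_seconds + prior_seconds
    return (current_diff + prior_rate * prior_seconds) / total_seconds


def per_game(rate_per_second, seconds_per_game=SECONDS_5V5_PER_GAME):
    """Convert a per-second Corsi rate into a per-game differential."""
    return rate_per_second * seconds_per_game


# Hypothetical team: +112 attempts over 56000 seconds (about 20 games) this
# season, and +0.001 attempts per second last season.
rate = blended_corsi_rate(current_diff=112, current_seconds=56000, prior_rate=0.001)
print(round(per_game(rate), 2))
```

Under this reading the prior season dominates in the first few games and fades as current-season ice time accumulates, which fits the stated goal of letting the statistic settle before it is used.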

To find a Corsi differential per game, we can multiply this rate per second by 2800, roughly the number of 5v5 seconds per game. A histogram of team Corsi differential per game can be seen below:

Most teams cluster between a differential of -10 and 10 per game, as the NHL is a high-parity league and one team is rarely a big favorite over another. The exception is the 2014-2015 Sabres, who gave up an incredible 23 more shot attempts per game than they took (losing by 1.33 goals per game) as they followed "the process" and tanked in the McDavid/Eichel draft year.

Here is the handicapping model, forecasting goal differential in a given game based on this statistic:

 

The intercept represents the home advantage of roughly 0.29 goals per game. The second coefficient represents the predictive value of Corsi rate in previous games. The coefficient implies that for each additional net shot attempt per game a team has averaged over its opponent in previous games, it can be expected to score an additional 0.0532 goals in the following game. The league-wide Corsi shooting percentage is usually somewhere around 4.5%, or 0.045 goals per attempt. When one considers that good Corsi teams are probably also better in the non-even-strength minutes of a game, and are more likely to hold the lead late in the game and score empty-net goals, it is reasonable that the coefficient is slightly higher than 0.045.
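To make the two coefficients concrete, here is a minimal sketch of the prediction they imply. It assumes the predictor is the home team's per-game Corsi differential minus the away team's, and it uses the rounded intercept quoted in the text; the function name and the example matchup are hypothetical.

```python
HOME_ADVANTAGE = 0.29            # intercept quoted in the text ("roughly")
GOALS_PER_NET_ATTEMPT = 0.0532   # slope quoted in the text


def predict_goal_diff(home_corsi_diff, away_corsi_diff):
    """Expected home-minus-away goal differential for one game.

    Both inputs are per-game Corsi differentials from previous games.
    Assumption: the model's single variable is home minus away.
    """
    net_attempts = home_corsi_diff - away_corsi_diff
    return HOME_ADVANTAGE + GOALS_PER_NET_ATTEMPT * net_attempts


# Hypothetical matchup: a +5 home team hosting a -5 road team.
expected_margin = predict_goal_diff(home_corsi_diff=5.0, away_corsi_diff=-5.0)
print(round(expected_margin, 3))
```

With two evenly matched teams the prediction collapses to the home advantage alone.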

The high value of this coefficient suggests that past Corsi success is highly predictive of future success at what really matters: scoring more goals and winning hockey games. Typically in models like this, the coefficients will be regressed to the mean; that is, a team that has had high Corsi success in past games would be expected to carry only a portion of that success forward into future games. You can read more about this concept in my article here. But in this basic model, almost 100% of the past performance is expected to carry forward into future contests. Although I use many other variables in my handicapping models for hockey, shot differential is always a powerful predictor, as traditional hockey analytics would suggest.

In addition to a goal differential model, I have created a second Corsi-based model forecasting total goals per game. This model is identical to the previous goal differential model, except that instead of taking the difference of the teams' Corsi differentials, it takes the sum of the Corsi rates (Corsi for + Corsi against, or Corsi attempts per second). Furthermore, these rates are adjusted to be measured in terms of Corsi attempts above or below the league average. This Corsi pace statistic is more densely clustered than Corsi differential:

The model is as below:

The key aspect of this model is that, unlike goal differential, total goals is more heavily regressed to the mean. One additional shot attempt leads to an expected 0.0532 increase in goal differential, but an increase of only 0.0337 in total goals per game. This is a common phenomenon across sports handicapping: prior statistics are always less predictive for totals than for sides. The basic reason is that the total of a game tends to depend more on game flow, ice conditions, and the other team's style, factors that tend to be less repeatable going forward. For now, the important conclusion is that Corsi rate is also a decent predictor of total goals, as we'd expect: teams that have more attempts on goal in their games tend to play a faster-paced style that leads to higher-scoring games.

Having a prediction for each team of their goal differential and total goals per game from Corsi, we can now find the impact, in goals, of non-Corsi factors for each team, by taking the actual results of their games and subtracting their Corsi prediction. The impacts are as follows:

Goal Differential Above Corsi : (Actual Goal Differential) - (Corsi Goal Differential Prediction)

Total Goals Above Corsi : (Actual Total Goals) - (Corsi Total Goal Prediction)

Goals For Above Corsi : (Total Goals Above Corsi)/2 + (Goal Differential Above Corsi)/2

Goals Against Above Corsi (Positive = fewer goals against or more goals prevented) : (Goal Differential Above Corsi) - (Goals For Above Corsi)
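The four definitions above can be sketched as a single function. The function name, dictionary keys, and example season totals are hypothetical; the arithmetic follows the formulas as written.

```python
def above_corsi(actual_diff, actual_total, pred_diff, pred_total):
    """Split a team's results into the four 'above Corsi' statistics.

    All inputs are in goals over the same set of games.
    For goals_against_above, positive means fewer goals allowed than predicted.
    """
    diff_above = actual_diff - pred_diff
    total_above = actual_total - pred_total
    goals_for_above = total_above / 2 + diff_above / 2
    goals_against_above = diff_above - goals_for_above
    return {
        "goal_diff_above": diff_above,
        "total_goals_above": total_above,
        "goals_for_above": goals_for_above,
        "goals_against_above": goals_against_above,
    }


# Hypothetical season: actual +20 differential on 450 total goals,
# against a Corsi prediction of +5 on 440 total goals.
season = above_corsi(actual_diff=20, actual_total=450, pred_diff=5, pred_total=440)
print(season)
```

In this example the team scored 235 goals against a predicted 222.5 and allowed 215 against a predicted 217.5, so the function credits 12.5 goals to scoring and 2.5 to prevention, which sum to the 15-goal differential above Corsi.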

In this way we can separate the impact of goal prevention occurring outside of Corsi factors, such as goalie play and the penalty kill, from that of goal scoring, such as shooting ability and the power play. First, let's take a look at the highest and lowest ranked teams in goal prevention beyond Corsi between 2014-2016 (I am doing this because these teams are fresher in my mind than those from earlier seasons, when I didn't bet on hockey):

Focusing on the outlier teams, we can see some consistent trends. The teams with the highest ratings in goal prevention are almost all teams with well-respected, long-term proven goalies, while the lower-rated teams all played backup-level goalies or worse (remember that Price was injured in 2015-2016 and Varlamov was injured for most of 2016-2017). On the surface this would suggest that goalie play is a major driver of this statistic, and "good goalie" teams like Montreal, the NY Rangers, and the Capitals ranked highly in this statistic in other seasons as well. Furthermore, the impact of the best goalies is extremely large: as much as half a goal per game over the course of a season.

Moving on to goal scoring above that predicted by Corsi:

In general, the rankings suggest that teams with superstar goal scorers score more than would be expected from Corsi alone. This is already well known in hockey analytics: some players can lift shooting percentages for themselves or their team, and these players are among the most valuable in hockey. At the bottom of the rankings we see a couple of historically bad teams that were likely tanking, along with three teams (New Jersey, Carolina, and last year’s Kings) that were well known as grind-it-out, defensive teams. A more defensive style of hockey does prevent goals and often results in strong possession numbers, but it also leads to fewer strong chances at the other end, as shooting percentages are well known to be higher in breakaway-type situations.

Finally, let’s take a look at a chart that shows Goals For and Goals Against above Corsi, for all teams over the past three seasons:

Probably the most interesting aspect of this graph is the correlation between Goals For and Goals Against above Corsi. Teams that score more than Corsi would suggest also tend to give up fewer goals, and vice versa, with a correlation of around 31%. One driver of this relationship is penalty differential. When a team draws a power play, they are more likely to score in the next two minutes, but nearly as much of the advantage comes from the other team being far less likely to score. However, variance in penalty differential tends to have only a modest impact over the course of a season, perhaps on the order of five to ten goals.

The bigger driver of this correlation is the draft and salary cap. Teams with a good goalie or a superstar goal scorer are more likely to look for the other piece in order to contend for the Cup, or at the very least, if they have a truly bad goalie but a great offense, they will trade for or sign at least a league-average goalie. On the other hand, bad teams are unlikely to make slight upgrades for the sake of finishing 25th in the league rather than last, and may start some truly terrible goalies.

To wrap up this study, we can take a brief look at whether performance in this statistic tends to repeat itself over time, and to what extent. If the statistic does not repeat at all, the theory of PDO is correct: teams that outperform their shot differential will completely regress over time. We will look at two common measures for all four Corsi-outperformance statistics (Goal Differential, Total Goals, Goals For, and Goals Against).

The first measure is year-over-year correlation, where we simply regress each team's performance in year X in each statistic against its performance in the previous season, X-1. The correlation coefficient in this model represents the fraction of performance that historically has been carried forward into the following season. The results are as below:

All statistics lead to a model that is statistically significant at the 95% level, with the exception of Total Goals, which falls just short. Still, the carry-over from past seasons is relatively modest; only around 24% of goal differential performance above expected Corsi can be expected to carry over into the following season. Even in the extreme case of a team that finished 50 goals above its Corsi prediction, we would expect it to be only about 12 goals above in the following season.
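For readers who want to try the year-over-year test themselves, here is a minimal sketch using plain least squares. It is not the author's code: the function names and the five-team sample are made up for illustration, and only the 24% carry-over rate comes from the text. It returns both the regression slope (the fraction carried forward) and the correlation coefficient, which coincide only when the statistic has a similar spread in both seasons.

```python
def slope_and_correlation(prior, current):
    """Least-squares slope and Pearson correlation of current on prior."""
    n = len(prior)
    if n != len(current) or n < 2:
        raise ValueError("need two equal-length samples of at least 2 teams")
    mean_x = sum(prior) / n
    mean_y = sum(current) / n
    sxx = sum((x - mean_x) ** 2 for x in prior)
    syy = sum((y - mean_y) ** 2 for y in current)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(prior, current))
    if sxx == 0 or syy == 0:
        raise ValueError("both samples must vary")
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def carried_forward(prior_above_corsi, carry_rate=0.24):
    """Project next season's goals above Corsi from last season's."""
    return prior_above_corsi * carry_rate


# Made-up goals-above-Corsi figures for five teams in seasons X-1 and X.
prior_season = [-20, -10, 0, 10, 20]
current_season = [-5, -2, 0, 3, 4]
slope, r = slope_and_correlation(prior_season, current_season)
print(round(slope, 2), round(carried_forward(50), 1))
```

The toy sample is far too small to say anything about significance; it only shows the mechanics of the regression and the 50-goal projection from the paragraph above.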

Of course, much of this regression will occur in any statistic, even statistics like Corsi that are not luck-based. This is simply because in modern US team sports, good teams tend to get worse and bad teams tend to get better over time, due to age effects, the draft, and the salary cap. To account for this we can do a second test, split-half correlation, where we perform the same type of regression as the YOY test, but this time compare the results of the second half of each season against the first. Since we have already excluded the first 200 games of the season, we will use games 201 through 710 as the first half and games 711 through 1230 (the end of a 30-team regular season) as the second half. In addition, the lockout-shortened 2012-2013 season has been excluded. While this type of test will have more consistency in team rosters than the YOY test, the downside is that our sample size is half as large, so luck will be more influential. The results are below:
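The split itself is straightforward to sketch. The snippet below assumes each record is a hypothetical `(game_number, team, goals_above_corsi)` tuple for a single season; it drops the excluded early games and accumulates each team's total on either side of game 710.

```python
from collections import defaultdict


def split_half_totals(games, first_used=201, last_first_half=710):
    """Sum a per-game statistic by team for each half of one season.

    games: iterable of (game_number, team, value) tuples (hypothetical layout).
    Games numbered below `first_used` are skipped, matching the exclusion
    of the first 200 games described in the text.
    """
    first_half = defaultdict(float)
    second_half = defaultdict(float)
    for game_number, team, value in games:
        if game_number < first_used:
            continue  # early-season games are not part of the analysis
        half = first_half if game_number <= last_first_half else second_half
        half[team] += value
    return dict(first_half), dict(second_half)


# Made-up records for two teams.
sample_games = [
    (150, "A", 1.0),    # excluded: before game 201
    (300, "A", 0.5),
    (800, "A", -0.25),
    (710, "B", 1.0),    # last game of the first half
    (711, "B", 2.0),    # first game of the second half
]
first, second = split_half_totals(sample_games)
print(first, second)
```

The two resulting dictionaries can then be paired by team and regressed against each other in the same way as the year-over-year test.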

In the split-half test the correlations are lower than the YOY, particularly outside of Goals Against. The smaller sample size of the split-half test causes this statistic to lose more predictive power than what is gained by the increased consistency in roster makeup and coaching. Still, while the impact is fairly small, at least some of the past performance above Corsi can be expected to carry forward into future seasons.

Taking a step back, from the Corsi histogram we can recall that most teams fall in a range of around -10 to 10 shot attempts per game better than their opponent, and each Corsi attempt results in a prediction of around 0.053 goals. So the best teams can gain perhaps half a goal per game over an average team due to puck possession.

In the same manner, most teams fall in a range of -25 to 25 goals per season in both Goals For and Goals Against above Corsi. Using a combination of YOY and year-to-date statistics, we can perhaps carry over as much as 30 percent of each of those statistics. This would suggest that a team can gain around 8 goals a season over an average team, or 0.1 goals per game, from each of goal scoring (shooting and the power play) and goal prevention (goalie play and the penalty kill). The combined impact is as high as 0.2 goals per game, or roughly 40% of what we'd see from Corsi. While modest, that can certainly add up to several wins (or units for the bettor) over the course of a season.
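The back-of-the-envelope arithmetic in the last two paragraphs can be checked with a few lines. This sketch assumes an 82-game season and uses the unrounded inputs quoted above; the names are hypothetical.

```python
GAMES_PER_SEASON = 82  # assumption: a full NHL regular season


def per_game_edge(season_range_goals=25, carry_rate=0.30, games=GAMES_PER_SEASON):
    """Per-game goals a top team keeps from one non-Corsi statistic."""
    return season_range_goals * carry_rate / games


corsi_edge = 10 * 0.053               # best teams: about +10 attempts per game
non_corsi_edge = 2 * per_game_edge()  # scoring and prevention combined
share_of_corsi = non_corsi_edge / corsi_edge

print(round(per_game_edge(), 3), round(non_corsi_edge, 3), round(share_of_corsi, 2))
```

Without intermediate rounding the combined non-Corsi edge comes out a little under 0.2 goals per game and somewhat below 40% of the Corsi edge; the rounded figures in the text (0.1, 0.2, and half a goal) give the 40% quoted there. Either way the conclusion is the same: the effect is modest next to Corsi but far from zero.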

Furthermore, we have not looked at any player-level factors in this extremely basic model, even though our outlier teams suggest that player-level impacts, particularly that of the goalie and elite scorers, are highly significant in figuring out which teams will outscore their Corsi prediction. When considering these other factors, the predictive power of non-Corsi factors will improve even further. We will revisit this topic in a future article on this site.

In the end, shot differential in past games is probably the single biggest predictive factor of future hockey success, but shooting percentage is not even close to being completely driven by luck. Particularly when aspects of a team's roster or play style would lead you to believe that they are not likely to shoot or defend as well as a typical team, you should hesitate to predict that their PDO number will regress to 100. Of course, many teams which are actually very strong at goal scoring or prevention can be very unlucky for extended periods of time. Using a long-term PDO number which is carefully chosen based on the player talent levels and team play style will strongly outperform a simple regression to the league average.