Skip to main content

Bayesian updating and the NFL

It's football season again, hooray! Every year for my friends' football pool I try out a different algorithm. Invariably, my picks are around 60% accurate. Not terrible, but according to NFL Pickwatch (archive, current season), the best pickers get to 68 or 69%. So, an amazing performance—my upper bound—is just under 70%, and the lower bound for a competitive model—the FiveThirtyEight baseline—is 60%.

I've been modeling NFL outcomes for a couple of years, and running linear (predicting point spread) and logistic (predicting win probability) regressions given various team and player data. My best year so far incorporated the Vegas spread into the model, and my biggest disaster so far was an aggressive lasso model on every player in every offensive line, with team defenses lumped as a group. Attempting to track injuries, suspensions, and other changes to the starting lineup was not sustainable for the amount of time I wanted to spend.

Enter Nate Silver's awesome NFL Elo rankings, the aspirational target for this year. What's impressive is that he gets something like 60% accuracy out of literally no information but home field advantage and past scores. I particularly love that it updates weekly to incorporate the new information—this immediately says "Bayesian" and in fact is a lot how people using their intuition are making their picks anyway. A system like his—but with a more straightforward Bayesian model—is the goal of this post.

Bayesian updating for the (NFL) win 🏈

Silver's Elo model maintains a running metric of team ability, $Elo_{tm}$, that updates after every game. Here is his formula and source code. The number is used to estimate win probability in almost the same way as you'd use the result from a logistic regression:

$$ Pr(\text{team A wins}) = \frac{1}{10^{-(Elo_A - Elo_B)/400} + 1} $$

Copy Silver's approach, but use a different model

The FiveThirtyEight Elo approach is intuitive and simple:

  1. Start with a baseline value for team skill
  2. Update the Elo numbers every week
  3. At the end of each season revert the values back to the mean

This post will do the same thing, but will be updating hyperparameters of known distributions rather than the Elo parameters FiveThirtyEight used. I want to only select from distributions that have analytic solutions. This table on Wikipedia lists those in the exponential family, with columns for the likelihood, its conjugate prior, and their posterior distribution. The two models selected for this blog post are:

  • A Bernoulli posterior to predict win probability
    ⇒ A running tally of wins vs. losses for each team is all you need.
  • A Student's T posterior to predict point spread
    ⇒ A running game count, mean spread, and variance are all for this model. The theory depends on every future outcome being independent of previous outcomes, which is true so we're fine.

Data

FiveThirtyEight's GitHub repository contains a cleaned dataset of historical scores that date back to 1920. We can use to develop the algorithm. The column named "neutral" has two values: "neutral == 1" means the game is on neutral ground (no home field advantage), and "neutral == 0" means the home field advantage goes to "team1". The rest of the column names are straightforward. The goal is to make an online-updating model like FiveThirtyEight's Elo using only prior scores, team, and home field advantage.

Initial setup

This blog post is an executable Jupyter notebook. You can reproduce everything by re-running the original notebook. This first cell does the standard imports, then pulls the data from FiveThirtyEight.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd  # at least 0.19.2 to read_csv from url
import seaborn as sns
import scipy.stats as stats
from collections import OrderedDict


# Seaborn setup
sns.set_context("notebook")
sns.set_style("white")

url = "https://github.com/fivethirtyeight/nfl-elo-game/raw/master/data/nfl_games.csv"
df = pd.read_csv(url)
print(df.shape)
df.head(5)
(16007, 12)
Out[1]:
date season neutral playoff team1 team2 elo1 elo2 elo_prob1 score1 score2 result1
0 1920-09-26 1920 0 0 RII STP 1503.947 1300.000 0.824651 48 0 1.0
1 1920-10-03 1920 0 0 AKR WHE 1503.420 1300.000 0.824212 43 0 1.0
2 1920-10-03 1920 0 0 RCH ABU 1503.420 1300.000 0.824212 10 0 1.0
3 1920-10-03 1920 0 0 DAY COL 1493.002 1504.908 0.575819 14 0 1.0
4 1920-10-03 1920 0 0 RII MUN 1516.108 1478.004 0.644171 45 0 1.0

Explore the data

See what's in the dataset. There are over 100 football teams total, but if we cut off at 1980, we get the familiar 32 teams, so that's fine. Also, the FiveThirtyEight folks have made it so teams that changed cities are always named whatever they're called now. Very nice.

For those skeptical about the point spread being normally distributed, you're kind of right, but it's still good enough for me. The plot below contains histograms of each team's point spread at home since 1980. Each bar is 7 points wide, and a normal distribution is superimposed with a dashed line. Some observations:

  • The Normal distribution isn't terrible for most teams. ⇒ Modeling point spread with a T distribution won't be horrible.
  • A disproportionate number of games end with spreads that are within 1 touchdown (±7 points). ⇒ Coaching/performance under pressure?
    • The Browns (CLE), Lions (DET), Chargers (LAC), Rams (LAR), and Jets (NYJ) often lose by 1 touchdown or less.
    • The Cardinals (ARI), Broncos (DEN), Dolphins (MIA), Patriots (NE), and Seahawks (SEA) often win by 1 touchdown or less.
In [2]:
# use toggle

# First create the "spread" column
df['spread'] = df.score1 - df.score2
print("Total teams in the dataset:", len(df.team1.unique()))
print("Total teams after 1980:", len(df[df.season > 1980].team1.unique()))


def color_coded_hist(x, **kwargs):
    """Color the negative and positive bins differently.

    Color scheme from http://colorbrewer2.org/    
    """
    hist_bins = [-75, -42, -35, -28, -21, -14, -7, 0, 7, 14, 21, 28, 35, 42, 75]
    __, __, patches = plt.hist(x, density=True, bins=hist_bins, color="#f1a340")
    # Purple for positive spread:
    i = next(i for i, val in enumerate(hist_bins) if val == 0)
    for p in patches[i:]:
        p.set_facecolor("#998ec3")
        

def best_norm(x, **kwargs):
    """Plot the best normal fit for the data."""
    mu, std = stats.norm.fit(x)
    # Plot the PDF.
    xmin, xmax = min(x), max(x)
    x = np.linspace(xmin, xmax, 100)
    p = stats.norm.pdf(x, mu, std)
    plt.plot(x, p, 'k--', alpha=0.6)


# Plot all of the teams' spreads since 1980
g = sns.FacetGrid(df[df.season > 1980].sort_values('team1'), col="team1", col_wrap=5, height=2)
g = (g.map(color_coded_hist, "spread")
     .map(best_norm, "spread")
     .set_titles("{col_name}")
     .set_axis_labels("spread", "density"))
msg = "Histograms of spread (at home) since 1980. Normal approximation is dashed line."
plt.suptitle(msg, y=1.025, fontsize=14);
Total teams in the dataset: 101
Total teams after 1980: 32

Home field advantage

Nate Silver's model bumps the team's Elo rating by a constant to account for home field advantage. Great idea. I'm copying it.

To avoid using unknown future information, the model home field advantage value will be the average from the past three seasons, rather than the average over the entire dataset, saved in the data frame rolling_avg. The plots below show spread on the left and win percent on the right, year over year. The approximation for home advantage is a dotted black line. The central tendency for the data is a solid blue line, and the 95% confidence interval for each season is the light blue region. Other observations about home field advantage:

  • Starting around the 1970's the overall point spread tightens: home field advantage starts at 10 points on average in 1920 and settles down to 2 to 3 points per game, give or take a couple of points.
  • This translates to about a 16% greater chance of winning at home (58% at home versus 42% away).
  • Approximating home field advantage via the average of the past three years (black dotted line) looks fine: it closely tracks the actual median and is always within the confidence interval. Good enough for me.
In [3]:
# Ignore a future warning that comes from scipy.
import warnings
warnings.filterwarnings("ignore")


# Show some of the values
print("~ Overall ~")
print(df[df.neutral == 0][['spread', 'result1']].mean())
print("\n~ Past few seasons ~")
print(df[(df.neutral == 0) & (df.season > 2011)].groupby('season')[['spread', 'result1']].mean())


# Create a data frame with lagged rolling means and combine with the existing data
rolling_avg = (
    df.groupby('season')[['spread', 'result1']].mean()
    .rolling(3, min_periods=1)
    .mean()
    .shift(1)
)
rolling_avg['win_pct_advantage'] =  rolling_avg.result1 - .5
rolling_avg['spread_advantage'] = rolling_avg.spread
rolling_avg = rolling_avg.drop(columns=['result1', 'spread'])


# Plot everything
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14.0, 4.0)
f, ax = plt.subplots(1, 2)
sns.lineplot(x="season", y="spread", data=df[df.neutral == 0], ax=ax[0])
ax[0].plot(rolling_avg.index, rolling_avg.spread_advantage, 'k:')
ax[0].set_ylabel("home spread")

sns.lineplot(x="season", y="result1", data=df[df.neutral == 0], ax=ax[1])
ax[1].plot(rolling_avg.index, rolling_avg.win_pct_advantage + .5, 'k:')
ax[1].set_ylabel("home win percent")
plt.suptitle(
    'Central tendency and confidence interval of spread (left), '
    'win percent (right) by season — lagged moving average is dotted line',
    fontsize=14);
~ Overall ~
spread     2.976579
result1    0.580278
dtype: float64

~ Past few seasons ~
          spread   result1
season                    
2012    2.615094  0.575472
2013    3.174242  0.600379
2014    2.851711  0.583650
2015    1.475285  0.539924
2016    3.057252  0.585878
2017    2.639847  0.574713

Convert from "by-game" to "by-team" format

The original FiveThirtyEight dataset has one row per game ("by-game" format), with the home team labelled as "team1." If I want to see all of the games for a single team progress I have to bounce between the "team1" and "team2" columns. Reshape the dataset to "by-team" format, meaning there are two rows per game, adding an extra column "home" to identify the home team. This way I can group by "season" and "team1" and have all of the games for each team. The new dataset will be called by_team.

by-game format: one row per game
season date team1 team2 result1
1985 1985-09-15 CHI NE 1
1985 1985-09-19 MIN CHI 0
by-team format: one row per team per game
season date home team1 team2 result1
1985 1985-09-15 1 CHI NE 1
1985 1985-09-19 0 CHI MIN 1
1985 1985-09-15 0 NE CHI 0
1985 1985-09-19 1 MIN CHI 0
In [4]:
# use toggle
home_games = df[[
    'date', 'season', 'neutral', 'playoff', 'team1', 'team2', 'result1', 'spread'
]][df.season > 1975]
home_games['home'] = 1

# Now swap the teams for "away"
away_games = home_games.rename(columns={'team1':'team2', 'team2':'team1'})
away_games['home'] = 1 - home_games.home
# Remember to switch the meaning of the winning and spread columns too
away_games['result1'] = 1 - home_games.result1
away_games['spread'] = - home_games.spread

by_team = (
    pd.concat([home_games, away_games], ignore_index=True)
    .sort_values(by=['season', 'team1', 'date'])
)

Set up for iterative updating

Both the models will follow the same procedure:

  1. Initialize the model with a baseline:
    1. Bernoulli win probability = 50%
    2. T-distribution spread = 0
  2. After every game, update the hyperparameters with the new information as described in their respective formulas
  3. Use the hyperparameters to calculate the posterior predictive distribution

Make a base class that will keep track of all of the hyperparameters for each team and each game. The parameters can be added as columns to the main dataset for backtesting. In practice, the Updater.lookup attribute would be accessing a database, not a dictionary, and the update step would be run weekly to increment the hyperparameters based on the game results. But for this blog post, it will remain a dictionary.

In [5]:
# use toggle
from abc import ABC, abstractmethod
from collections import namedtuple

class Updater(ABC):
    def __init__(self, *hyperparameters):
        # Form of lookup:
        # {season: {team: [{week1 data}, {week2 data}, ... {weekN data}]}}
        self.lookup = {}
        self.Params = namedtuple('Params', ['date'] + list(hyperparameters))
        
    def iterrows(self):
        for season, teams in self.lookup.items():
            for team, results in teams.items():
                for row in results:
                    yield dict(season=season, team1=team, **row._asdict())
        
    def get_rows(self):
        return [r for r in self.iterrows()]
    
    @abstractmethod
    def revert_to_mean(self, season, team, keep=.3, n_obs=8):
        pass
    
    @abstractmethod
    def update(self, row):
        pass

Bernoulli distribution

The probability of winning is equal to:

$$ \text{Pr}(\text{win}) = \frac{\alpha}{\alpha + \beta} \pm \text{home advantage} $$

The hyperparameters for the Bernoulli distribution are:

$$ \begin{aligned} \alpha &= 1 + \text{number of wins} \\ \beta &= 1 + \text{number of losses} \end{aligned} $$

Updates are done by this rule:

  • A win increments $\alpha$ by 1
  • A loss increments $\beta$ by 1
  • A tie increments both $\alpha$ and $\beta$ by 0.5

The default initial value sets $\alpha = \beta$, where you can choose how many "observations" (n_obs) to pretend we have already seen. If n_obs is a large number, then the incremental effect of new games won't be as pronounced. If n_obs is a small number, then the new season's data will quickly overwhelm the past performance. I tuned the parameters for good performance:

  • Weight $\alpha$ and $\beta$ at the beginning of each year as if they count for 4 games.
  • Revert $\alpha$ and $\beta$ back to the mean ($\alpha = \beta$) keeping 80% of last year's final value.
In [6]:
class BernoulliUpdater(Updater):
    def __init__(self):
        super().__init__('alpha1', 'beta1')
        
    def revert_to_mean(self, season, team, keep=.8, n_obs=4):
        # default
        alpha = beta = 1 + n_obs * .5
        # or use existing data        
        if season in self.lookup and team in self.lookup[season]:
            last_entry = self.lookup[season][team].pop()
            date, alpha0, beta0 = last_entry
            p = alpha0 / (alpha0 + beta0)
            alpha = 1 + n_obs * (keep * p + (1 - keep) * .5)
            beta = 1 + n_obs * (keep * (1 - p) + (1 - keep) * .5)
            # push back the reverted value to the list
            self.lookup[season][team].append(self.Params(date, alpha, beta))
        return alpha, beta
            
    def update(self, row):
        if row.season not in self.lookup:
            self.lookup[row.season] = {}
        if row.team1 not in self.lookup[row.season]:
            self.lookup[row.season][row.team1] = []
            alpha, beta = self.revert_to_mean(row.season - 1, row.team1)
        else:
            __, alpha, beta = self.lookup[row.season][row.team1][-1]
        # THE UPDATE STEP:
        # a' = a + 1 if win else 0
        # b' = b + 1 if lose
        if row.result1 == 1:  # Won
            alpha_beta_next = self.Params(row.date, alpha + 1, beta)
        elif row.result1 == 0.5:  # Tie
            alpha_beta_next = self.Params(row.date, alpha + .5, beta + .5)
        else:  # Lost
            alpha_beta_next = self.Params(row.date, alpha, beta + 1)
        self.lookup[row.season][row.team1].append(alpha_beta_next)
        return alpha, beta


bernoulli_updater = BernoulliUpdater()
for i, row in by_team.iterrows():
    bernoulli_updater.update(row)

ab = pd.DataFrame(bernoulli_updater.get_rows()).sort_values(['team1','season'])
g = ab.groupby('team1')
ab = ab.assign(alpha1 = g.alpha1.shift(), beta1=g.beta1.shift())

bernoulli_dataset = (
    by_team[[c for c in by_team.columns if c != 'spread']]
    .merge(ab, on=['season', 'date', 'team1'])
    .reindex(columns=[
        'season', 'date', 'home', 'neutral', 'playoff',
        'team1', 'team2', 'result1', 'alpha1', 'beta1'])
)

print("\n\n☆ Da Bears! ☆")
bernoulli_dataset[(bernoulli_dataset.season == 1985) & (bernoulli_dataset.team1 == 'CHI')]

☆ Da Bears! ☆
Out[6]:
season date home neutral playoff team1 team2 result1 alpha1 beta1
3938 1985 1985-09-08 1 0 0 CHI TB 1.0 3.259186 2.740814
3939 1985 1985-09-15 1 0 0 CHI NE 1.0 4.259186 2.740814
3940 1985 1985-09-19 0 0 0 CHI MIN 1.0 5.259186 2.740814
3941 1985 1985-09-29 1 0 0 CHI WSH 1.0 6.259186 2.740814
3942 1985 1985-10-06 0 0 0 CHI TB 1.0 7.259186 2.740814
3943 1985 1985-10-13 0 0 0 CHI SF 1.0 8.259186 2.740814
3944 1985 1985-10-21 1 0 0 CHI GB 1.0 9.259186 2.740814
3945 1985 1985-10-27 1 0 0 CHI MIN 1.0 10.259186 2.740814
3946 1985 1985-11-03 0 0 0 CHI GB 1.0 11.259186 2.740814
3947 1985 1985-11-10 1 0 0 CHI DET 1.0 12.259186 2.740814
3948 1985 1985-11-17 0 0 0 CHI DAL 1.0 13.259186 2.740814
3949 1985 1985-11-24 1 0 0 CHI ATL 1.0 14.259186 2.740814
3950 1985 1985-12-02 0 0 0 CHI MIA 0.0 15.259186 2.740814
3951 1985 1985-12-08 1 0 0 CHI IND 1.0 15.259186 3.740814
3952 1985 1985-12-14 0 0 0 CHI NYJ 1.0 16.259186 3.740814
3953 1985 1985-12-22 0 0 0 CHI DET 1.0 17.259186 3.740814
3954 1985 1986-01-05 1 0 1 CHI NYG 1.0 18.259186 3.740814
3955 1985 1986-01-12 1 0 1 CHI LAR 1.0 19.259186 3.740814
3956 1985 1986-01-26 0 1 1 CHI NE 1.0 20.259186 3.740814

Convert back from "by-team" to "by-game" format

We now have to convert back to the original layout of one row per game. The goal is to combine each team's performance to predict the outcome of the game: team1's hyperparameters $\alpha_1$ and $\beta_1$ are combined with team2's hyperparameters $\beta_1$ and $\beta_2$, so that overall the formula for win probability will be:

$$ \text{Pr}(\text{team}_1 \text{ wins}) = \frac{\alpha}{\alpha + \beta} + \text{team}_1\text{ home advantage} $$

in which $\alpha = \left(\alpha_1 + \beta_2 - 1\right)$ and $ \beta = \left(\beta_1 + \alpha_2 - 1\right)$.

In [7]:
# use toggle
b = (
    bernoulli_dataset[['season', 'date', 'team1', 'alpha1', 'beta1']]
    .rename(columns=dict(team1='team2', alpha1='alpha2', beta1='beta2'))
    .merge(bernoulli_dataset, on=['season', 'date', 'team2'])
    .join(
        rolling_avg[['win_pct_advantage']]
        .rename(columns={'win_pct_advantage':'home_advantage'})
        , on='season')
)

b = (
    b.assign(
        pwin = 
        (b.alpha1 + b.beta2 - 1) / (b.alpha1 + b.beta1 + b.alpha2 + b.beta2 - 2)
        # if at home and not neutral add home advantage
        + b.home * (1 - b.neutral) * b.home_advantage
        # if away and not neutral subtract home advantage
        - (1 - b.home) * (1 - b.neutral) * b.home_advantage
        ,
        success = lambda row:  row.pwin.round() == row.result1
    )
    .reindex(columns=(
        list(bernoulli_dataset.columns)
        + ['alpha2', 'beta2', 'home_advantage', 'pwin', 'success']
    ))
)

print(b.success.mean())
b.tail()
0.6327005036807439
Out[7]:
season date home neutral playoff team1 team2 result1 alpha1 beta1 alpha2 beta2 home_advantage pwin success
20643 2017 2017-11-30 1 0 0 DAL WSH 1.0 8.555930 8.444070 8.068991 8.931009 0.072409 0.587626 True
20644 2017 2017-12-10 1 0 0 LAC WSH 1.0 8.482438 9.517562 8.068991 9.931009 0.072409 0.584570 True
20645 2017 2017-12-17 0 0 0 ARI WSH 0.0 9.031599 9.968401 8.068991 10.931009 0.072409 0.454330 True
20646 2017 2017-12-24 0 0 0 DEN WSH 0.0 8.258948 11.741052 9.068991 10.931009 0.072409 0.406274 True
20647 2017 2017-12-31 1 0 0 NYG WSH 1.0 5.301124 15.698876 10.068991 10.931009 0.072409 0.453213 False

Check model performance

One way to visualize the performance of a binary classification model is with a Receiver operating characteristic (ROC) plot (Wikipedia page). The total area under the curve (AUC) can be used as a measure of model performance, with larger AUC identifying better models.

No effect
Mild effect
Amazing

Our best model with the this Bernoulli posterior has about 63% accuracy, and an area under the ROC curve of about 0.68. This is as good as my logistic regression models from prior years that incldued moving averages of penalties, turnovers, and passing and rushing yards. Cool.

In [8]:
# use toggle
def plot_roc(predicted, actual, resolution=100, ax=None):
    """'predicted' and 'actual' are pandas Series."""
    ax = ax or plt.gca()
    cutoff = np.linspace(0, 1, resolution)
    total_pos = (actual == 1).sum()
    total_neg = (actual != 1).sum()
    true_positive_rate = np.fromiter(
        map(lambda c: (actual[predicted > c] == 1).sum() / total_pos, cutoff),
        float)
    false_positive_rate = np.fromiter(
        map(lambda c: (actual[predicted > c] != 1).sum() / total_neg, cutoff),
        float)
    ax.plot(
        false_positive_rate, true_positive_rate,
        linestyle='-', color=sns.color_palette()[0], linewidth=3)
    ax.set_xlim([0,1])
    ax.set_ylim([0,1])
    ax.plot([0,1], [0,1], 'k:')
    # Area under the curve
    auc = sum((true_positive_rate[:-1] + true_positive_rate[1:]) / 2
              * (false_positive_rate[:-1] - false_positive_rate[1:]))
    ax.set_title('ROC curve. AUC = {:0.3f}'.format(auc), fontsize=14);


## Start the actual plot
plt.rcParams['figure.figsize'] = (15.0, 3.0)
f, ax = plt.subplots(1, 3)

summary = b.groupby(['team1', 'season'], as_index=False).success.mean()

# Histogram
sns.distplot(summary.success, ax=ax[0], bins=np.linspace(0, 1, 11))
ax[0].axvline(0.5, color='k', linestyle=':')
ax[0].set_ylabel("frequency count")
ax[0].set_title('Model accuracy (grouped by team, season)', fontsize=14)

# Time series
sns.lineplot(x="season", y="success", data=summary, ax=ax[1])
ax[1].set_ylabel("Model success rate")
ax[1].set_title('Accuracy year over year (mean {:0.0%})'.format(b.success.mean()), fontsize=14)

# ROC
plot_roc(b.pwin, b.result1, resolution=100, ax=ax[2])

T distribution

This section is based on the content of this table on the Wikipedia page for conjugate priors. The spread follows a T distribution with parameters:

$$ \text{Pr}(\text{spread}) = t_{2\alpha}\left(\tilde{\text{spread}} | \mu, \sigma^2 = \frac{\beta (\nu + 1)}{\alpha\nu}\right) $$

The hyperparameters for the T distribution are:

$$ \begin{aligned} \nu &= \text{number of games} \\ \mu &= \text{running mean of spread} \\ \alpha &= \nu / 2 \\ \beta &= (\text{running sum of the squared deviations}) / 2 \end{aligned} $$

Updates are done by these rules:

  • Increment $\nu$ each game, $\nu' = \nu + 1$
  • Intermediate variable $\delta = x - \mu$
  • The running mean is $\mu' = \mu + \delta / \nu'$
  • The running sum of squared deviations is $2\beta = 2\beta + \delta * \mu'$

The default initial value sets $\mu = 0$, and the number of observations n_obs to 3. The parameters for mean reversion were tuned for good performance and are:

  • Set $\nu$ to 3 games.
  • Revert $\mu$ back to zero keeping 50% of last year's final value.
  • Revert $\beta$ back to the mean $\beta$ of all teams in the prior year, scaled down to 3 games. Keep 50% of last year's final value.
In [9]:
class TUpdater(Updater):
    def __init__(self):
        super().__init__('nu1', 'mu1', 'alpha1', 'beta1')
        
    def get_mean_beta(self, season):
        mean_beta = 16**2 / 2  # Default
        if season in self.lookup:
            team_sets = self.lookup[season].values()
            mean_beta = (
                sum(ts[-1].beta1 for ts in team_sets)
                / sum(ts[-1].nu1 for ts in team_sets))
        return mean_beta
        
    def revert_to_mean(self, season, team, keep=.5, n_obs=3):
        mean_beta = self.get_mean_beta(season - 1)  # Default
        nu, mu, alpha, beta = n_obs, 0, n_obs / 2, mean_beta * n_obs
        # or use existing data
        if season in self.lookup and team in self.lookup[season]:
            last_entry = self.lookup[season][team].pop()
            date, nu0, mu0, alpha0, beta0 = last_entry
            mu = keep * mu0
            beta = nu * (keep * beta0 / nu0 + (1 - keep) * mean_beta)
            # push back the reverted value to the list
            self.lookup[season][team].append(self.Params(date, nu, mu, alpha, beta))
        return nu, mu, alpha, beta
            
    def update(self, row):
        if row.season not in self.lookup:
            self.lookup[row.season] = {}
        if row.team1 not in self.lookup[row.season]:
            self.lookup[row.season][row.team1] = []
            nu, mu, alpha, beta = self.revert_to_mean(row.season - 1, row.team1)
        else:
            __, nu, mu, alpha, beta = self.lookup[row.season][row.team1][-1]
        # THE UPDATE STEP:
        delta = row.spread - mu
        nu_mu_alpha_beta_next = self.Params(
            row.date,
            nu + 1,                       # nu' = nu + 1
            mu + delta / (nu + 1),        # mu' = mu + delta / (nu + 1)
            alpha + .5,                   # alpha' = alpha + 1/2
            beta + delta * (mu + delta / (nu + 1)) / 2
                                          # beta' = beta + delta * mu' / 2
        )
        self.lookup[row.season][row.team1].append(nu_mu_alpha_beta_next)
        return nu, mu, alpha, beta


t_updater = TUpdater()
for i, row in by_team.iterrows():
    t_updater.update(row)

nmab = pd.DataFrame(t_updater.get_rows()).sort_values(['team1','season'])
g = nmab.groupby('team1')
nmab = nmab.assign(
    nu1 = g.nu1.shift(),
    mu1 = g.mu1.shift(),
    alpha1 = g.alpha1.shift(),
    beta1=g.beta1.shift())

t_dataset = (
    by_team[[c for c in by_team.columns if c != 'result1']]
    .merge(nmab, on=['season', 'date', 'team1'])
    .reindex(columns=[
        'season', 'date', 'home', 'neutral', 'playoff',
        'team1', 'team2', 'spread', 'nu1', 'mu1', 'alpha1', 'beta1'])
)

print("\n\n☆ Da Bears! ☆")
t_dataset[(t_dataset.season == 1985) & (t_dataset.team1 == 'CHI')]

☆ Da Bears! ☆
Out[9]:
season date home neutral playoff team1 team2 spread nu1 mu1 alpha1 beta1
3938 1985 1985-09-08 1 0 0 CHI TB 10 3.0 1.390742 1.5 23.341826
3939 1985 1985-09-15 1 0 0 CHI NE 13 4.0 3.543057 2.0 38.593370
3940 1985 1985-09-19 0 0 0 CHI MIN 9 5.0 5.434445 2.5 64.289990
3941 1985 1985-09-29 1 0 0 CHI WSH 35 6.0 6.028704 3.0 75.037828
3942 1985 1985-10-06 0 0 0 CHI TB 8 7.0 10.167461 3.5 222.320086
3943 1985 1985-10-13 0 0 0 CHI SF 16 8.0 9.896528 4.0 211.594917
3944 1985 1985-10-21 1 0 0 CHI GB 16 9.0 10.574692 4.5 243.866083
3945 1985 1985-10-27 1 0 0 CHI MIN 18 10.0 11.117223 5.0 274.023262
3946 1985 1985-11-03 0 0 0 CHI GB 6 11.0 11.742930 5.5 314.435248
3947 1985 1985-11-10 1 0 0 CHI DET 21 12.0 11.264352 6.0 282.090056
3948 1985 1985-11-17 0 0 0 CHI DAL 44 13.0 12.013248 6.5 340.568433
3949 1985 1985-11-24 1 0 0 CHI ATL 36 14.0 14.298016 7.0 569.241980
3950 1985 1985-12-02 0 0 0 CHI MIA -14 15.0 15.744815 7.5 740.088842
3951 1985 1985-12-08 1 0 0 CHI IND 7 16.0 13.885764 8.0 533.574098
3952 1985 1985-12-14 0 0 0 CHI NYJ 13 17.0 13.480719 8.5 487.161572
3953 1985 1985-12-22 0 0 0 CHI DET 20 18.0 13.454013 9.0 483.927771
3954 1985 1986-01-05 1 0 1 CHI NYG 21 19.0 13.798538 9.5 529.090300
3955 1985 1986-01-12 1 0 1 CHI LAR 24 20.0 14.158611 10.0 580.071649
3956 1985 1986-01-26 0 1 1 CHI NE 36 21.0 14.627249 10.5 652.047869

Convert back from "by-team" to "by-game" format

We will again convert back to the original layout of one row per game. Combining the two teams' hyperparameters is a little more complicated than for the Bernoulli distribution. The distribution of the spread will be:

$$ \text{Pr}(\text{team}_1 \text{ spread}) = t_{2\alpha}\left(\tilde{\text{spread}} | \mu, \sigma^2 = \frac{\beta (\nu + 1)}{\alpha\nu}\right) $$

in which

$$ \begin{aligned} \nu &= \nu_1 + \nu_2 \\ \alpha &= \alpha_1 + \alpha_2 \\ \mu &= \frac{\nu_1 \mu_1 - \nu_2 \mu_2}{\nu_1 + \nu_2} \\ \beta &= \beta_1 + \beta_2 + \frac{\nu_1 \nu_2}{\nu_1 + \nu_2} \frac{(\mu_1 + \mu_2)^2}{2} \end{aligned} $$

In [10]:
t = (
    t_dataset[['season', 'date', 'team1', 'nu1', 'mu1', 'alpha1', 'beta1']]
    .rename(columns=dict(
        team1='team2', nu1='nu2', mu1='mu2', alpha1='alpha2', beta1='beta2'))
    .merge(t_dataset, on=['season', 'date', 'team2'])
    .join(
        rolling_avg[['spread_advantage']]
        .rename(columns={'spread_advantage':'home_advantage'})
        , on='season')
)

t = (
    t.assign(
        pspread =
            (t.nu1 * t.mu1 - t.nu2 * t.mu2) / (t.nu1 + t.nu2)
            # if at home and not neutral add home advantage
            + t.home * (1 - t.neutral) * t.home_advantage
            # if away and not neutral subtract home advantage
            - (1 - t.home) * (1 - t.neutral) * t.home_advantage
        ,
        betaprime =
            t.beta1 + t.beta2
            + (t.nu1 * t.nu2) / (t.nu1 + t.nu2)
            * (t.mu1 + t.mu2)**2 / 2
        ,
        pwin = (
            lambda row: 1 - stats.t.cdf(
                0,
                row.nu1 + row.nu2,
                loc=row.pspread,
                scale=(
                    row.betaprime
                    * (row.nu1 + row.nu2 + 1)
                    / (row.nu1 + row.nu2) / (row.alpha1 + row.alpha2)
                )))
        ,
        success = lambda row: row.pwin.round() == (row.spread > 0)
    )
    .reindex(columns=(
        list(t_dataset.columns)
        + ['nu2', 'mu2', 'alpha2', 'beta2', 'home_advantage', 'pspread', 'pwin', 'success']
    ))
)

print(t.success.mean())
print(t.shape)
t.tail()
0.636139093374661
(20648, 20)
Out[10]:
season date home neutral playoff team1 team2 spread nu1 mu1 alpha1 beta1 nu2 mu2 alpha2 beta2 home_advantage pspread pwin success
20643 2017 2017-11-30 1 0 0 DAL WSH 24 14.0 -1.009843 7.0 122.008579 14.0 -1.220737 7.0 58.007248 2.519351 2.624798 0.570663 True
20644 2017 2017-12-10 1 0 0 LAC WSH 17 15.0 3.632639 7.5 102.490156 15.0 -2.739355 7.5 89.207489 2.519351 5.705348 0.663206 True
20645 2017 2017-12-17 0 0 0 ARI WSH -5 16.0 -5.046426 8.0 133.114383 16.0 -3.630645 8.0 115.095160 2.519351 -3.227241 0.463974 True
20646 2017 2017-12-24 0 0 0 DEN WSH -16 17.0 -4.156688 8.5 131.158625 17.0 -3.122960 8.5 101.618580 2.519351 -3.036215 0.456733 True
20647 2017 2017-12-31 1 0 0 NYG WSH 8 18.0 -8.337091 9.0 261.273841 18.0 -2.060573 9.0 81.916448 2.519351 -0.618908 0.494824 False

Check model performance

So, the model's performance at predicting spread is pretty crappy. But, using a probit link to connect the T distribution to a win/loss prediction, we get performance similar to the Bernoulli distribution, with an AUC of 0.67 under the ROC plot. Not much improvement for all that extra complication.

In [11]:
# use toggle
ss_res = ((t.spread - t.pspread)**2).sum()
ss_tot = ((t.spread - t.spread.mean())**2).sum()
r_squared = 1 -  ss_res/ ss_tot
sns.jointplot("spread", "pspread", data=t, kind="hex", space=0, color="b", ratio=4)
title = "Actual spread vs. mean of  distribution. R squared= {:0.0%}".format(r_squared)
plt.suptitle(title, x=.45, y=1.01, fontsize=14);
In [12]:
# use toggle
plt.rcParams['figure.figsize'] = (15.0, 3.0)
f, ax = plt.subplots(1, 3)

summary = t.groupby(['team1', 'season'], as_index=False).success.mean()

# Histogram
sns.distplot(summary.success, ax=ax[0], bins=np.linspace(0, 1, 11))
ax[0].axvline(0.5, color='k', linestyle=':')
ax[0].set_ylabel("frequency count")
ax[0].set_title('Model accuracy (grouped by team, season)', fontsize=14)

# Time series
sns.lineplot(x="season", y="success", data=summary, ax=ax[1])
ax[1].set_ylabel("Model success rate")
ax[1].set_title('Accuracy year over year (mean {:0.0%})'.format(t.success.mean()), fontsize=14)

# ROC
plot_roc(t.pwin, t.spread > 0, resolution=100, ax=ax[2])

Conclusion

It is possible to use nothing but home advantage and per-team historical win/loss data to achieve better than 60% predictive accuracy (on average) year over year for NFL wins. Further, it can be done using an analytical Bayesian approach, at a computational cost of a handful of arithmetic operations per team per week.

  • It is worth your time to look at simple Bayesian solutions to predictive problems.
  • Not everything needs numerical methods, people have been solving hard statistical problems for centuries without computers.
  • Even if you end up using other methods, the hyperparameters from the Bayesian method can be part of your feature engineering pipeline.
    • Logit / probit link for a regression that adds injuries and rest between games?
    • Computational cost is nearly nothing.
  • The '85 Bears are the best football team ever to have existed. 🐻

    You can track the Bernoulli model's performance under my name in this year's FiveThirtyEight forecasting game leaderboard…or join the fun, and enter the game yourself!

Python libraries used in this notebook

The dataset is from the FiveThirtyEight GitHub repo:

The rest of the dependencies are regular Python packages, listed below:

In [13]:
import sys, platform, IPython, pandas as pd, numpy as np, seaborn as sns
print("This notebook was created on an %s computer running OSX %s and using:\nPython %s\nIPython %s\nNumPy %s\nPandas %s\nSeaborn %s\n" % (platform.machine(), platform.mac_ver()[0], sys.version[:5], IPython.__version__, np.__version__, pd.__version__, sns.__version__))
This notebook was created on an x86_64 computer running OSX 10.13.6 and using:
Python 3.6.6
IPython 6.5.0
NumPy 1.15.1
Pandas 0.23.4
Seaborn 0.9.0

Vocabulary

  • hyperparameter
    The parameters for the conjugate prior are called "prior hyperparameters," and the parameters for the posterior distribution are called "posterior hyperparameters." The only thing called "model parameters" are the parameters of the likelihood. The word "hyperparameter" is used specifically to distinguish between the "model parameters" and everything else. Here is a worked example showing hyperparameters for the Bernoulli distribution.
  • posterior
    In the context of Bayesian probability theory, the posterior is the distribution that is found a posteriori, meaning after you combine your prior knowledge and the likelihood of the observed data using Bayes's theorem: $p(\theta | x) = \frac{p(x|\theta) p(\theta)}{p(x)}$. Wikipedia definition of posterior.
  • recursive Bayesian estimation
    I didn't mention this in the blog, but upon reading the entire Wikipedia page on conjugate priors, I found out there's a name for the online updating method I implemented: recursive Bayesian estimation. It is common in engineering—the Kalman filter is a Bayesian approach used in feedback control that estimates unknown parameters in a linear system. Here is the Wikipedia recursive Bayesian estimation page and the Kalman filter page.
  • support
    In the context of a probability distribution, the support is the set of values for which there is a nonzero probability. For example, the support for a Bernoulli random variable is the set {0, 1}, which contains the only two possible outcomes.

References

All links in this post were accessed on or before September 9, 2018.

Appendix: major changes to NFL rules 🏈

I didn't want to mess up the models with games that are totally different thanks to rules changes, so I made a summary of major NFL rules changes from the official NFL operations site, with additional citations as noted.

  • 1933:
    • Quarterbacks can now throw from anywhere behind the line of scrimmage (previously at least 5 yards back), drastically increasing the prevalence of the passing attack
  • By 1940, to increase scoring for a "maximum of entertainment":
    • No penalties for multiple incomplete passes in the same series of downs
    • Hash marks are at 10 yards, then 15, then 20 yards from the sidelines (Wikipedia)
    • 15-yard penalty for roughing the passer
    • Ball shape is changed to be more pass-friendly
  • 1972:
    • Hash marks move from the sidelines to 70 feet, 9 inches (~23.5 yards) from the sidelines
  • 1974
    • Goal posts move from goal line to the end line
    • All field goals missed beyond the 20-yard line are a turnover
    • Kickoffs are at the 35-yard line (back from 40)
    • Offense can't move on a punt until after the kick
    • Penalty for holding, illegal use of hands, tripping is 10 yards (down from 15 which was almost unrecoverable)
    • A receiver can only be chucked (legal contact? I can't find the definition) once by a given defender after the receiver has gone 3 yards downfield
  • 1978:
    • Illegal contact is now anything beyond 5 yards downfield
    • Offensive linemen can extend arms and open hands on pass plays
  • 1994:
    • Multiple unlisted changes intended to increase scoring (which is at 2 TD/team/game at this point) and reduce injuries
  • 2016:

Actually all of these rules are either for safety or to favor offense and increase scoring. Except the recent change to the distance for an extra point, which just messes up historical rankings of kickers. Since both teams have an offense, and all the data I'm using is win/loss or points, any advantage is a wash. In the end I didn't use this, but didn't want to delete all the typing.