Bayesian updating and the NFL
It's football season again, hooray! Every year for my friends' football pool I try out a different algorithm. Invariably, my picks are around 60% accurate. Not terrible, but according to NFL Pickwatch (archive, current season), the best pickers get to 68 or 69%. So, an amazing performance—my upper bound—is just under 70%, and the lower bound for a competitive model—the FiveThirtyEight baseline—is 60%.
I've been modeling NFL outcomes for a couple of years, and running linear (predicting point spread) and logistic (predicting win probability) regressions given various team and player data. My best year so far incorporated the Vegas spread into the model, and my biggest disaster so far was an aggressive lasso model on every player in every offensive line, with team defenses lumped as a group. Attempting to track injuries, suspensions, and other changes to the starting lineup was not sustainable for the amount of time I wanted to spend.
Enter Nate Silver's awesome NFL Elo rankings, the aspirational target for this year. What's impressive is that he gets something like 60% accuracy out of literally no information but home field advantage and past scores. I particularly love that it updates weekly to incorporate the new information—this immediately says "Bayesian" and in fact is a lot how people using their intuition are making their picks anyway. A system like his—but with a more straightforward Bayesian model—is the goal of this post.
Bayesian updating for the (NFL) win 🏈¶
Silver's Elo model maintains a running metric of team ability, $Elo_{tm}$, that updates after every game. Here is his formula and source code. The number is used to estimate win probability in almost the same way as you'd use the result from a logistic regression:
$$ Pr(\text{team A wins}) = \frac{1}{10^{(Elo_A  Elo_B)/400} + 1} $$
Copy Silver's approach, but use a different model¶
The FiveThirtyEight Elo approach is intuitive and simple:
 Start with a baseline value for team skill
 Update the Elo numbers every week
 At the end of each season revert the values back to the mean
This post will do the same thing, but will be updating hyperparameters of known distributions rather than the Elo parameters FiveThirtyEight used. I want to only select from distributions that have analytic solutions. This table on Wikipedia lists those in the exponential family, with columns for the likelihood, its conjugate prior, and their posterior distribution. The two models selected for this blog post are:

A Bernoulli posterior to predict win probability
⇒ A running tally of wins vs. losses for each team is all you need. 
A Student's T posterior to predict point spread
⇒ A running game count, mean spread, and variance are all for this model. The theory depends on every future outcome being independent of previous outcomes, which is true so we're fine.
Data¶
FiveThirtyEight's GitHub repository contains a cleaned dataset of historical scores that date back to 1920. We can use to develop the algorithm. The column named "neutral" has two values: "neutral == 1" means the game is on neutral ground (no home field advantage), and "neutral == 0" means the home field advantage goes to "team1". The rest of the column names are straightforward. The goal is to make an onlineupdating model like FiveThirtyEight's Elo using only prior scores, team, and home field advantage.
Initial setup¶
This blog post is an executable Jupyter notebook. You can reproduce everything by rerunning the original notebook. This first cell does the standard imports, then pulls the data from FiveThirtyEight.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd # at least 0.19.2 to read_csv from url
import seaborn as sns
import scipy.stats as stats
from collections import OrderedDict
# Seaborn setup
sns.set_context("notebook")
sns.set_style("white")
url = "https://github.com/fivethirtyeight/nflelogame/raw/master/data/nfl_games.csv"
df = pd.read_csv(url)
print(df.shape)
df.head(5)
Explore the data¶
See what's in the dataset. There are over 100 football teams total, but if we cut off at 1980, we get the familiar 32 teams, so that's fine. Also, the FiveThirtyEight folks have made it so teams that changed cities are always named whatever they're called now. Very nice.
For those skeptical about the point spread being normally distributed, you're kind of right, but it's still good enough for me. The plot below contains histograms of each team's point spread at home since 1980. Each bar is 7 points wide, and a normal distribution is superimposed with a dashed line. Some observations:
 The Normal distribution isn't terrible for most teams. ⇒ Modeling point spread with a T distribution won't be horrible.
 A disproportionate number of games end with spreads that are within 1 touchdown (±7 points).
⇒ Coaching/performance under pressure?
 The Browns (CLE), Lions (DET), Chargers (LAC), Rams (LAR), and Jets (NYJ) often lose by 1 touchdown or less.
 The Cardinals (ARI), Broncos (DEN), Dolphins (MIA), Patriots (NE), and Seahawks (SEA) often win by 1 touchdown or less.
# use toggle
# First create the "spread" column
df['spread'] = df.score1  df.score2
print("Total teams in the dataset:", len(df.team1.unique()))
print("Total teams after 1980:", len(df[df.season > 1980].team1.unique()))
def color_coded_hist(x, **kwargs):
"""Color the negative and positive bins differently.
Color scheme from http://colorbrewer2.org/
"""
hist_bins = [75, 42, 35, 28, 21, 14, 7, 0, 7, 14, 21, 28, 35, 42, 75]
__, __, patches = plt.hist(x, density=True, bins=hist_bins, color="#f1a340")
# Purple for positive spread:
i = next(i for i, val in enumerate(hist_bins) if val == 0)
for p in patches[i:]:
p.set_facecolor("#998ec3")
def best_norm(x, **kwargs):
"""Plot the best normal fit for the data."""
mu, std = stats.norm.fit(x)
# Plot the PDF.
xmin, xmax = min(x), max(x)
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, std)
plt.plot(x, p, 'k', alpha=0.6)
# Plot all of the teams' spreads since 1980
g = sns.FacetGrid(df[df.season > 1980].sort_values('team1'), col="team1", col_wrap=5, height=2)
g = (g.map(color_coded_hist, "spread")
.map(best_norm, "spread")
.set_titles("{col_name}")
.set_axis_labels("spread", "density"))
msg = "Histograms of spread (at home) since 1980. Normal approximation is dashed line."
plt.suptitle(msg, y=1.025, fontsize=14);
Home field advantage¶
Nate Silver's model bumps the team's Elo rating by a constant to account for home field advantage. Great idea. I'm copying it.
To avoid using unknown future information, the model home field advantage value will be the average from the past three seasons, rather than the average over the entire dataset, saved in the data frame rolling_avg
. The plots below show spread on the left and win percent on the right, year over year. The approximation for home advantage is a dotted black line. The central tendency for the data is a solid blue line, and the 95% confidence interval for each season is the light blue region. Other observations about home field advantage:
 Starting around the 1970's the overall point spread tightens: home field advantage starts at 10 points on average in 1920 and settles down to 2 to 3 points per game, give or take a couple of points.
 This translates to about a 16% greater chance of winning at home (58% at home versus 42% away).
 Approximating home field advantage via the average of the past three years (black dotted line) looks fine: it closely tracks the actual median and is always within the confidence interval. Good enough for me.
# Ignore a future warning that comes from scipy.
import warnings
warnings.filterwarnings("ignore")
# Show some of the values
print("~ Overall ~")
print(df[df.neutral == 0][['spread', 'result1']].mean())
print("\n~ Past few seasons ~")
print(df[(df.neutral == 0) & (df.season > 2011)].groupby('season')[['spread', 'result1']].mean())
# Create a data frame with lagged rolling means and combine with the existing data
rolling_avg = (
df.groupby('season')[['spread', 'result1']].mean()
.rolling(3, min_periods=1)
.mean()
.shift(1)
)
rolling_avg['win_pct_advantage'] = rolling_avg.result1  .5
rolling_avg['spread_advantage'] = rolling_avg.spread
rolling_avg = rolling_avg.drop(columns=['result1', 'spread'])
# Plot everything
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14.0, 4.0)
f, ax = plt.subplots(1, 2)
sns.lineplot(x="season", y="spread", data=df[df.neutral == 0], ax=ax[0])
ax[0].plot(rolling_avg.index, rolling_avg.spread_advantage, 'k:')
ax[0].set_ylabel("home spread")
sns.lineplot(x="season", y="result1", data=df[df.neutral == 0], ax=ax[1])
ax[1].plot(rolling_avg.index, rolling_avg.win_pct_advantage + .5, 'k:')
ax[1].set_ylabel("home win percent")
plt.suptitle(
'Central tendency and confidence interval of spread (left), '
'win percent (right) by season — lagged moving average is dotted line',
fontsize=14);
Convert from "bygame" to "byteam" format¶
The original FiveThirtyEight dataset has one row per game ("bygame" format), with the home team labelled as "team1." If I want to see all of the games for a single team progress I have to bounce between the "team1" and "team2" columns. Reshape the dataset to "byteam" format, meaning there are two rows per game, adding an extra column "home" to identify the home team. This way I can group by "season" and "team1" and have all of the games for each team. The new dataset will be called by_team
.
season  date  team1  team2  result1 

1985  19850915  CHI  NE  1 
1985  19850919  MIN  CHI  0 
season  date  home  team1  team2  result1 

1985  19850915  1  CHI  NE  1 
1985  19850919  0  CHI  MIN  1 
1985  19850915  0  NE  CHI  0 
1985  19850919  1  MIN  CHI  0 
# use toggle
home_games = df[[
'date', 'season', 'neutral', 'playoff', 'team1', 'team2', 'result1', 'spread'
]][df.season > 1975]
home_games['home'] = 1
# Now swap the teams for "away"
away_games = home_games.rename(columns={'team1':'team2', 'team2':'team1'})
away_games['home'] = 1  home_games.home
# Remember to switch the meaning of the winning and spread columns too
away_games['result1'] = 1  home_games.result1
away_games['spread'] =  home_games.spread
by_team = (
pd.concat([home_games, away_games], ignore_index=True)
.sort_values(by=['season', 'team1', 'date'])
)
Set up for iterative updating¶
Both the models will follow the same procedure:
 Initialize the model with a baseline:
 Bernoulli win probability = 50%
 Tdistribution spread = 0
 After every game, update the hyperparameters with the new information as described in their respective formulas
 Use the hyperparameters to calculate the posterior predictive distribution
Make a base class that will keep track of all of the hyperparameters for each team and each game. The parameters can be added as columns to the main dataset for backtesting. In practice, the Updater.lookup
attribute would be accessing a database, not a dictionary, and the update step would be run weekly to increment the hyperparameters based on the game results. But for this blog post, it will remain a dictionary.
# use toggle
from abc import ABC, abstractmethod
from collections import namedtuple
class Updater(ABC):
def __init__(self, *hyperparameters):
# Form of lookup:
# {season: {team: [{week1 data}, {week2 data}, ... {weekN data}]}}
self.lookup = {}
self.Params = namedtuple('Params', ['date'] + list(hyperparameters))
def iterrows(self):
for season, teams in self.lookup.items():
for team, results in teams.items():
for row in results:
yield dict(season=season, team1=team, **row._asdict())
def get_rows(self):
return [r for r in self.iterrows()]
@abstractmethod
def revert_to_mean(self, season, team, keep=.3, n_obs=8):
pass
@abstractmethod
def update(self, row):
pass
Bernoulli distribution¶
The probability of winning is equal to:
$$ \text{Pr}(\text{win}) = \frac{\alpha}{\alpha + \beta} \pm \text{home advantage} $$
The hyperparameters for the Bernoulli distribution are:
$$ \begin{aligned} \alpha &= 1 + \text{number of wins} \\ \beta &= 1 + \text{number of losses} \end{aligned} $$
Updates are done by this rule:
 A win increments $\alpha$ by 1
 A loss increments $\beta$ by 1
 A tie increments both $\alpha$ and $\beta$ by 0.5
The default initial value sets $\alpha = \beta$, where you can choose how many "observations" (n_obs
) to pretend we have already seen. If n_obs
is a large number, then the incremental effect of new games won't be as pronounced. If n_obs
is a small number, then the new season's data will quickly overwhelm the past performance. I tuned the parameters for good performance:
 Weight $\alpha$ and $\beta$ at the beginning of each year as if they count for 4 games.
 Revert $\alpha$ and $\beta$ back to the mean ($\alpha = \beta$) keeping 80% of last year's final value.
class BernoulliUpdater(Updater):
def __init__(self):
super().__init__('alpha1', 'beta1')
def revert_to_mean(self, season, team, keep=.8, n_obs=4):
# default
alpha = beta = 1 + n_obs * .5
# or use existing data
if season in self.lookup and team in self.lookup[season]:
last_entry = self.lookup[season][team].pop()
date, alpha0, beta0 = last_entry
p = alpha0 / (alpha0 + beta0)
alpha = 1 + n_obs * (keep * p + (1  keep) * .5)
beta = 1 + n_obs * (keep * (1  p) + (1  keep) * .5)
# push back the reverted value to the list
self.lookup[season][team].append(self.Params(date, alpha, beta))
return alpha, beta
def update(self, row):
if row.season not in self.lookup:
self.lookup[row.season] = {}
if row.team1 not in self.lookup[row.season]:
self.lookup[row.season][row.team1] = []
alpha, beta = self.revert_to_mean(row.season  1, row.team1)
else:
__, alpha, beta = self.lookup[row.season][row.team1][1]
# THE UPDATE STEP:
# a' = a + 1 if win else 0
# b' = b + 1 if lose
if row.result1 == 1: # Won
alpha_beta_next = self.Params(row.date, alpha + 1, beta)
elif row.result1 == 0.5: # Tie
alpha_beta_next = self.Params(row.date, alpha + .5, beta + .5)
else: # Lost
alpha_beta_next = self.Params(row.date, alpha, beta + 1)
self.lookup[row.season][row.team1].append(alpha_beta_next)
return alpha, beta
bernoulli_updater = BernoulliUpdater()
for i, row in by_team.iterrows():
bernoulli_updater.update(row)
ab = pd.DataFrame(bernoulli_updater.get_rows()).sort_values(['team1','season'])
g = ab.groupby('team1')
ab = ab.assign(alpha1 = g.alpha1.shift(), beta1=g.beta1.shift())
bernoulli_dataset = (
by_team[[c for c in by_team.columns if c != 'spread']]
.merge(ab, on=['season', 'date', 'team1'])
.reindex(columns=[
'season', 'date', 'home', 'neutral', 'playoff',
'team1', 'team2', 'result1', 'alpha1', 'beta1'])
)
print("\n\n☆ Da Bears! ☆")
bernoulli_dataset[(bernoulli_dataset.season == 1985) & (bernoulli_dataset.team1 == 'CHI')]
Convert back from "byteam" to "bygame" format¶
We now have to convert back to the original layout of one row per game. The goal is to combine each team's performance to predict the outcome of the game: team1
's hyperparameters $\alpha_1$ and $\beta_1$ are combined with team2
's hyperparameters $\beta_1$ and $\beta_2$, so that overall the formula for win probability will be:
$$ \text{Pr}(\text{team}_1 \text{ wins}) = \frac{\alpha}{\alpha + \beta} + \text{team}_1\text{ home advantage} $$
in which $\alpha = \left(\alpha_1 + \beta_2  1\right)$ and $ \beta = \left(\beta_1 + \alpha_2  1\right)$.
# use toggle
b = (
bernoulli_dataset[['season', 'date', 'team1', 'alpha1', 'beta1']]
.rename(columns=dict(team1='team2', alpha1='alpha2', beta1='beta2'))
.merge(bernoulli_dataset, on=['season', 'date', 'team2'])
.join(
rolling_avg[['win_pct_advantage']]
.rename(columns={'win_pct_advantage':'home_advantage'})
, on='season')
)
b = (
b.assign(
pwin =
(b.alpha1 + b.beta2  1) / (b.alpha1 + b.beta1 + b.alpha2 + b.beta2  2)
# if at home and not neutral add home advantage
+ b.home * (1  b.neutral) * b.home_advantage
# if away and not neutral subtract home advantage
 (1  b.home) * (1  b.neutral) * b.home_advantage
,
success = lambda row: row.pwin.round() == row.result1
)
.reindex(columns=(
list(bernoulli_dataset.columns)
+ ['alpha2', 'beta2', 'home_advantage', 'pwin', 'success']
))
)
print(b.success.mean())
b.tail()
Check model performance¶
One way to visualize the performance of a binary classification model is with a Receiver operating characteristic (ROC) plot (Wikipedia page). The total area under the curve (AUC) can be used as a measure of model performance, with larger AUC identifying better models.
Our best model with the this Bernoulli posterior has about 63% accuracy, and an area under the ROC curve of about 0.68. This is as good as my logistic regression models from prior years that incldued moving averages of penalties, turnovers, and passing and rushing yards. Cool.
# use toggle
def plot_roc(predicted, actual, resolution=100, ax=None):
"""'predicted' and 'actual' are pandas Series."""
ax = ax or plt.gca()
cutoff = np.linspace(0, 1, resolution)
total_pos = (actual == 1).sum()
total_neg = (actual != 1).sum()
true_positive_rate = np.fromiter(
map(lambda c: (actual[predicted > c] == 1).sum() / total_pos, cutoff),
float)
false_positive_rate = np.fromiter(
map(lambda c: (actual[predicted > c] != 1).sum() / total_neg, cutoff),
float)
ax.plot(
false_positive_rate, true_positive_rate,
linestyle='', color=sns.color_palette()[0], linewidth=3)
ax.set_xlim([0,1])
ax.set_ylim([0,1])
ax.plot([0,1], [0,1], 'k:')
# Area under the curve
auc = sum((true_positive_rate[:1] + true_positive_rate[1:]) / 2
* (false_positive_rate[:1]  false_positive_rate[1:]))
ax.set_title('ROC curve. AUC = {:0.3f}'.format(auc), fontsize=14);
## Start the actual plot
plt.rcParams['figure.figsize'] = (15.0, 3.0)
f, ax = plt.subplots(1, 3)
summary = b.groupby(['team1', 'season'], as_index=False).success.mean()
# Histogram
sns.distplot(summary.success, ax=ax[0], bins=np.linspace(0, 1, 11))
ax[0].axvline(0.5, color='k', linestyle=':')
ax[0].set_ylabel("frequency count")
ax[0].set_title('Model accuracy (grouped by team, season)', fontsize=14)
# Time series
sns.lineplot(x="season", y="success", data=summary, ax=ax[1])
ax[1].set_ylabel("Model success rate")
ax[1].set_title('Accuracy year over year (mean {:0.0%})'.format(b.success.mean()), fontsize=14)
# ROC
plot_roc(b.pwin, b.result1, resolution=100, ax=ax[2])
T distribution¶
This section is based on the content of this table on the Wikipedia page for conjugate priors. The spread follows a T distribution with parameters:
$$ \text{Pr}(\text{spread}) = t_{2\alpha}\left(\tilde{\text{spread}}  \mu, \sigma^2 = \frac{\beta (\nu + 1)}{\alpha\nu}\right) $$
The hyperparameters for the T distribution are:
$$ \begin{aligned} \nu &= \text{number of games} \\ \mu &= \text{running mean of spread} \\ \alpha &= \nu / 2 \\ \beta &= (\text{running sum of the squared deviations}) / 2 \end{aligned} $$
Updates are done by these rules:
 Increment $\nu$ each game, $\nu' = \nu + 1$
 Intermediate variable $\delta = x  \mu$
 The running mean is $\mu' = \mu + \delta / \nu'$
 The running sum of squared deviations is $2\beta = 2\beta + \delta * \mu'$
The default initial value sets $\mu = 0$, and the number of observations n_obs
to 3. The parameters for mean reversion were tuned for good performance and are:
 Set $\nu$ to 3 games.
 Revert $\mu$ back to zero keeping 50% of last year's final value.
 Revert $\beta$ back to the mean $\beta$ of all teams in the prior year, scaled down to 3 games. Keep 50% of last year's final value.
class TUpdater(Updater):
def __init__(self):
super().__init__('nu1', 'mu1', 'alpha1', 'beta1')
def get_mean_beta(self, season):
mean_beta = 16**2 / 2 # Default
if season in self.lookup:
team_sets = self.lookup[season].values()
mean_beta = (
sum(ts[1].beta1 for ts in team_sets)
/ sum(ts[1].nu1 for ts in team_sets))
return mean_beta
def revert_to_mean(self, season, team, keep=.5, n_obs=3):
mean_beta = self.get_mean_beta(season  1) # Default
nu, mu, alpha, beta = n_obs, 0, n_obs / 2, mean_beta * n_obs
# or use existing data
if season in self.lookup and team in self.lookup[season]:
last_entry = self.lookup[season][team].pop()
date, nu0, mu0, alpha0, beta0 = last_entry
mu = keep * mu0
beta = nu * (keep * beta0 / nu0 + (1  keep) * mean_beta)
# push back the reverted value to the list
self.lookup[season][team].append(self.Params(date, nu, mu, alpha, beta))
return nu, mu, alpha, beta
def update(self, row):
if row.season not in self.lookup:
self.lookup[row.season] = {}
if row.team1 not in self.lookup[row.season]:
self.lookup[row.season][row.team1] = []
nu, mu, alpha, beta = self.revert_to_mean(row.season  1, row.team1)
else:
__, nu, mu, alpha, beta = self.lookup[row.season][row.team1][1]
# THE UPDATE STEP:
delta = row.spread  mu
nu_mu_alpha_beta_next = self.Params(
row.date,
nu + 1, # nu' = nu + 1
mu + delta / (nu + 1), # mu' = mu + delta / (nu + 1)
alpha + .5, # alpha' = alpha + 1/2
beta + delta * (mu + delta / (nu + 1)) / 2
# beta' = beta + delta * mu' / 2
)
self.lookup[row.season][row.team1].append(nu_mu_alpha_beta_next)
return nu, mu, alpha, beta
t_updater = TUpdater()
for i, row in by_team.iterrows():
t_updater.update(row)
nmab = pd.DataFrame(t_updater.get_rows()).sort_values(['team1','season'])
g = nmab.groupby('team1')
nmab = nmab.assign(
nu1 = g.nu1.shift(),
mu1 = g.mu1.shift(),
alpha1 = g.alpha1.shift(),
beta1=g.beta1.shift())
t_dataset = (
by_team[[c for c in by_team.columns if c != 'result1']]
.merge(nmab, on=['season', 'date', 'team1'])
.reindex(columns=[
'season', 'date', 'home', 'neutral', 'playoff',
'team1', 'team2', 'spread', 'nu1', 'mu1', 'alpha1', 'beta1'])
)
print("\n\n☆ Da Bears! ☆")
t_dataset[(t_dataset.season == 1985) & (t_dataset.team1 == 'CHI')]
Convert back from "byteam" to "bygame" format¶
We will again convert back to the original layout of one row per game. Combining the two teams' hyperparameters is a little more complicated than for the Bernoulli distribution. The distribution of the spread will be:
$$ \text{Pr}(\text{team}_1 \text{ spread}) = t_{2\alpha}\left(\tilde{\text{spread}}  \mu, \sigma^2 = \frac{\beta (\nu + 1)}{\alpha\nu}\right) $$
in which
$$ \begin{aligned} \nu &= \nu_1 + \nu_2 \\ \alpha &= \alpha_1 + \alpha_2 \\ \mu &= \frac{\nu_1 \mu_1  \nu_2 \mu_2}{\nu_1 + \nu_2} \\ \beta &= \beta_1 + \beta_2 + \frac{\nu_1 \nu_2}{\nu_1 + \nu_2} \frac{(\mu_1 + \mu_2)^2}{2} \end{aligned} $$
t = (
t_dataset[['season', 'date', 'team1', 'nu1', 'mu1', 'alpha1', 'beta1']]
.rename(columns=dict(
team1='team2', nu1='nu2', mu1='mu2', alpha1='alpha2', beta1='beta2'))
.merge(t_dataset, on=['season', 'date', 'team2'])
.join(
rolling_avg[['spread_advantage']]
.rename(columns={'spread_advantage':'home_advantage'})
, on='season')
)
t = (
t.assign(
pspread =
(t.nu1 * t.mu1  t.nu2 * t.mu2) / (t.nu1 + t.nu2)
# if at home and not neutral add home advantage
+ t.home * (1  t.neutral) * t.home_advantage
# if away and not neutral subtract home advantage
 (1  t.home) * (1  t.neutral) * t.home_advantage
,
betaprime =
t.beta1 + t.beta2
+ (t.nu1 * t.nu2) / (t.nu1 + t.nu2)
* (t.mu1 + t.mu2)**2 / 2
,
pwin = (
lambda row: 1  stats.t.cdf(
0,
row.nu1 + row.nu2,
loc=row.pspread,
scale=(
row.betaprime
* (row.nu1 + row.nu2 + 1)
/ (row.nu1 + row.nu2) / (row.alpha1 + row.alpha2)
)))
,
success = lambda row: row.pwin.round() == (row.spread > 0)
)
.reindex(columns=(
list(t_dataset.columns)
+ ['nu2', 'mu2', 'alpha2', 'beta2', 'home_advantage', 'pspread', 'pwin', 'success']
))
)
print(t.success.mean())
print(t.shape)
t.tail()
Check model performance¶
So, the model's performance at predicting spread is pretty crappy. But, using a probit link to connect the T distribution to a win/loss prediction, we get performance similar to the Bernoulli distribution, with an AUC of 0.67 under the ROC plot. Not much improvement for all that extra complication.
# use toggle
ss_res = ((t.spread  t.pspread)**2).sum()
ss_tot = ((t.spread  t.spread.mean())**2).sum()
r_squared = 1  ss_res/ ss_tot
sns.jointplot("spread", "pspread", data=t, kind="hex", space=0, color="b", ratio=4)
title = "Actual spread vs. mean of distribution. R squared= {:0.0%}".format(r_squared)
plt.suptitle(title, x=.45, y=1.01, fontsize=14);
# use toggle
plt.rcParams['figure.figsize'] = (15.0, 3.0)
f, ax = plt.subplots(1, 3)
summary = t.groupby(['team1', 'season'], as_index=False).success.mean()
# Histogram
sns.distplot(summary.success, ax=ax[0], bins=np.linspace(0, 1, 11))
ax[0].axvline(0.5, color='k', linestyle=':')
ax[0].set_ylabel("frequency count")
ax[0].set_title('Model accuracy (grouped by team, season)', fontsize=14)
# Time series
sns.lineplot(x="season", y="success", data=summary, ax=ax[1])
ax[1].set_ylabel("Model success rate")
ax[1].set_title('Accuracy year over year (mean {:0.0%})'.format(t.success.mean()), fontsize=14)
# ROC
plot_roc(t.pwin, t.spread > 0, resolution=100, ax=ax[2])
Conclusion¶
It is possible to use nothing but home advantage and perteam historical win/loss data to achieve better than 60% predictive accuracy (on average) year over year for NFL wins. Further, it can be done using an analytical Bayesian approach, at a computational cost of a handful of arithmetic operations per team per week.
 It is worth your time to look at simple Bayesian solutions to predictive problems.
 Not everything needs numerical methods, people have been solving hard statistical problems for centuries without computers.
 Even if you end up using other methods, the hyperparameters from the Bayesian method can be part of your feature engineering pipeline.
 Logit / probit link for a regression that adds injuries and rest between games?
 Computational cost is nearly nothing.

The '85 Bears are the best football team ever to have existed.★ 🐻★
You can track the Bernoulli model's performance under my name in this year's FiveThirtyEight forecasting game leaderboard…or join the fun, and enter the game yourself!
Python libraries used in this notebook¶
The dataset is from the FiveThirtyEight GitHub repo:
 Repo page: https://github.com/fivethirtyeight/nflelogame
 Raw data direct link: https://github.com/fivethirtyeight/nflelogame/raw/master/data/nfl_games.csv
The rest of the dependencies are regular Python packages, listed below:
import sys, platform, IPython, pandas as pd, numpy as np, seaborn as sns
print("This notebook was created on an %s computer running OSX %s and using:\nPython %s\nIPython %s\nNumPy %s\nPandas %s\nSeaborn %s\n" % (platform.machine(), platform.mac_ver()[0], sys.version[:5], IPython.__version__, np.__version__, pd.__version__, sns.__version__))
Vocabulary¶

hyperparameter
The parameters for the conjugate prior are called "prior hyperparameters," and the parameters for the posterior distribution are called "posterior hyperparameters." The only thing called "model parameters" are the parameters of the likelihood. The word "hyperparameter" is used specifically to distinguish between the "model parameters" and everything else. Here is a worked example showing hyperparameters for the Bernoulli distribution. 
posterior
In the context of Bayesian probability theory, the posterior is the distribution that is found a posteriori, meaning after you combine your prior knowledge and the likelihood of the observed data using Bayes's theorem: $p(\theta  x) = \frac{p(x\theta) p(\theta)}{p(x)}$. Wikipedia definition of posterior. 
recursive Bayesian estimation
I didn't mention this in the blog, but upon reading the entire Wikipedia page on conjugate priors, I found out there's a name for the online updating method I implemented: recursive Bayesian estimation. It is common in engineering—the Kalman filter is a Bayesian approach used in feedback control that estimates unknown parameters in a linear system. Here is the Wikipedia recursive Bayesian estimation page and the Kalman filter page. 
support
In the context of a probability distribution, the support is the set of values for which there is a nonzero probability. For example, the support for a Bernoulli random variable is the set {0, 1}, which contains the only two possible outcomes.
References¶
 Bernoulli distribution conjugate prior—Wikipedia's table of discrete conjugate priors. The table lists the parameters and how to update them.
 Normal distribution conjugate prior—Also Wikipedia. For continuous distributions.
 FiveThirtyEight's NFL Elo rankings, explained—Also sign up for their NFL forecasting game, and check out their GitHub repo.
 NFL Pickwatch—The best forecasters' picks, aggregated in one site.
 ProFootball Reference—One of the easiest sites to use for football statistics, including a frequentlyupdated injury list.
 Singlepass variance blog post (no https)—The first blog to actually cite Knuth while giving the algorithm gets my linkback.
All links in this post were accessed on or before September 9, 2018.
Appendix: major changes to NFL rules 🏈¶
I didn't want to mess up the models with games that are totally different thanks to rules changes, so I made a summary of major NFL rules changes from the official NFL operations site, with additional citations as noted.
 1933:
 Quarterbacks can now throw from anywhere behind the line of scrimmage (previously at least 5 yards back), drastically increasing the prevalence of the passing attack
 By 1940, to increase scoring for a "maximum of entertainment":
 No penalties for multiple incomplete passes in the same series of downs
 Hash marks are at 10 yards, then 15, then 20 yards from the sidelines (Wikipedia)
 15yard penalty for roughing the passer
 Ball shape is changed to be more passfriendly
 1972:
 Hash marks move from the sidelines to 70 feet, 9 inches (~23.5 yards) from the sidelines
 1974
 Goal posts move from goal line to the end line
 All field goals missed beyond the 20yard line are a turnover
 Kickoffs are at the 35yard line (back from 40)
 Offense can't move on a punt until after the kick
 Penalty for holding, illegal use of hands, tripping is 10 yards (down from 15 which was almost unrecoverable)
 A receiver can only be chucked (legal contact? I can't find the definition) once by a given defender after the receiver has gone 3 yards downfield
 1978:
 Illegal contact is now anything beyond 5 yards downfield
 Offensive linemen can extend arms and open hands on pass plays
 1994:
 Multiple unlisted changes intended to increase scoring (which is at 2 TD/team/game at this point) and reduce injuries
 2016:
 Field goal moves to 15 yd from 2 yd, plus multiple changes to 2 point conversion
Actually all of these rules are either for safety or to favor offense and increase scoring. Except the recent change to the distance for an extra point, which just messes up historical rankings of kickers. Since both teams have an offense, and all the data I'm using is win/loss or points, any advantage is a wash. In the end I didn't use this, but didn't want to delete all the typing.