Markov-Chain Monte Carlo (MCMC) methods are a category of numerical technique used in Bayesian statistics. They numerically estimate the distribution of a variable (the posterior) given two other distributions: the prior and the likelihood function, and are useful when direct integration of the likelihood function is not tractable.

I am new to Bayesian statistics, but became interested in the approach partly from exposure to the PyMC3 library, and partly from FiveThirtyEight's promoting it in a commentary soon after the time of the p-hacking scandals a few years back (Simmons et. al. coin 'p-hacking' in 2011, and Head et. al. quantify the scale of the issue in 2014).

Until the 1980's, it was not realistic to use Bayesian techniques except when analytic solutions were possible. (Here's Wikipedia's list of analytic options. They're still useful.) MCMC opens up more options.

The Python library pymc3 provides a suite of modern Bayesian tools: both MCMC algorithms and variational inference. One of its core contributors, Thomas Wiecki, wrote a blog post entitled MCMC sampling for dummies, which was the inspiration for this post. It was enthusiastically received, and cited by people I follow as the best available explanation of MCMC. To my dismay, I didn't understand it; probably because he comes from a stats background and I come from engineering. This post is for people like me.

MCMC for dumber dummies (i.e. engineers like me)¶

My first exposure to MCMC was via the Ising model. This post is for people like me who may remember the term "MCMC" from school and want to start from there to gain a working understanding of Bayesian numerical methods. If engineering concepts are familiar, this should be an accessible read.

If you have simulated the Ising model, you've already implemented the original MCMC algorithm: the Metropolis-Hastings Method. If not, the next section briefly describes both MCMC and the Ising model.

Background¶

The Metropolis-Hastings method was first developed right after World War II, when Metropolis and his team were exploring the physics of fission and fusion for use in a thermonuclear weapon. They published the method, to be used in general statistical mechanics applications, in 1953. Teachers use the Ising model to teach the Metropolis-Hastings method because it's less complicated than modeling anything with moving atoms.

Thermodynamic ensembles as an analogy for MCMC¶

Thermodynamic ensembles are a concept from engineering: Imagine any starting state we want in a system, and step it forward in time, allowing for randomness. If the system goes long enough, the randomness will even out and we can get an equilibrium value by averaging over the ensemble.

This example of mixing is how I visualize what's happening in Markov Chain Monte Carlo. Except instead of advancing in time, like here, the model advances a system through probability space toward more probable configurations.

Ising model¶

The Ising model is named for Ernst Ising, a student of Wilhelm Lenz. He solved the 1-D problem for his doctoral thesis in 1920. In the model, a material is represented by a regular lattice of atoms that can have positive or negative dipole moment (magnetic spin $s$ is up or down: $s = \pm 1$).

For two or more dimensions, there is a Curie temperature $T_c$ at which a phase transition occurs. Below $T_c$, the substance behaves as if it were magnetic. Above it, thermal energy disrupts the tendency toward alignment, and the substance acts as a paramagnetic one.

Here is an example drawing of a typical state below the critical temperature (left—all one spin—100% magnetization) and above the critical temperature (right—random spins—nearly 0% magnetization):

Boltzmann distribution: where the MCMC comes in¶

In the Ising model, the distribution of spins in the lattice depends on temperature, and follows the Boltzmann distribution:

$$ \text{frequency distribution} \propto e^{-E / k T} $$

In which $k$ is Boltzmann's constant, $T$ is the lattice temperature, and $E$ is the total energy from magnetization. When there is no external magnetic field, and given a coupling parameter $J > 0$ (a property of the particular material), the energy from magnetization is:

$$ E = - J \sum_{\text{(atoms)}}\sum_{\text{(neighbor atoms)}} \left( s_{\text{atom}} \cdot s_{\text{neighbor}}\right) $$

There's a shorthand notation:

$$ E = - J \sum_{<i, j>} s_i s_j $$

The subscript $<i, j>$ denotes a sum over each atom's $j$ nearest neighbors.

The goal of the model is to identify $T_c$. We do that by running the MCMC analysis a bunch of times at different temperatures to determine the average magnetization as a function of temperature. Then it's pretty clear where the discontinuity occurs.

The solution in this post is from a lecture by Richard Fitzpatrick at the University of Texas. His lecture is understandable and thorough.

Options¶

Here are three approaches to calculating the equilibrium magnetization:

Sum over all states¶

We can brute-force calculate the equilibrium magnetization of the system by iterating through every single possible state, calculating its energy given the magnetization, and then performing a weighted average based on its frequency, $\exp(-E / k T)$:

$$ \begin{aligned} \text{average magnetization} &= \sum_{\text{all configurations}} \text{magnetization} \cdot \text{frequency of configuration} \\ &= \sum_{\text{all configurations}} \text{magnetization} \cdot e^{-E / k T} \end{aligned} $$

But that's a lot of states. Even if we have a quite small $10 \times 10$ grid, it makes 100 positions and thus—with spin up and spin down—$2^{100}$ total different configurations. At modern processor speeds, this is still a time prohibitive approach.

Weighted sum over randomly-chosen states¶

So, rather than stepping through all of the possible configurations, what if we randomly choose a bunch of configurations and take the weighted average of their frequency?

The problem is it's inefficient. The Boltzmann distribution is not uniform: it is weighted toward lower energies. Ideally, we'd find some way to pick spin configurations that correspond more often to the more frequently occurring states.

Metropolis-Hastings method¶

And that's precisely the contribution from nuclear physicist Nicholas Metropolis and his team of mathematicians (Metropolis paper, paywalled). The method was originally published in 1953. They pioneered both the use of random-number generation for simulation (Monte Carlo methods) and the chaining of random steps (Markov Chain Monte Carlo) in their work for the Manhattan project.

They use their newly created technique to take a random walk through the configuration space, visiting the more frequently-occurring spin configurations more often, and drastically speeding up the calculation. In their words:

Instead of choosing configurations randomly, then weighting them with $\exp(-E/kT)$, we choose configurations with a probability $\exp(-E/kT)$ and weight them evenly.

They never really explained the statistics behind why it worked, though. Just the physics of what was happening. That's why it's named Metropolis-Hastings: the '-Hastings' acknowledges the statistician W. K. Hastings's contribution in 1970, when the theory behind the technique was finally explained and generalized (Hastings paper).

I'm not a physicist or a statistician, so it's a testament to the writing talent of both Hastings and the Metropolis team that their papers are actually understandable to people with an average engineering background and a bit of stats. Hastings writes:

The main features of these methods for sampling from a distribution with density $p(x)$ are:

The computations depend on $p(x)$ only through ratios of the form $p(x')/p(x)$, where $x'$ and $x$ are sample points. Thus, the normalizing constant need not be known, no factorization of $p(x)$ is necessary, and the methods are very easily implemented on a computer. [...]

A sequence of samples is obtained by simulating a Markov chain. The resulting samples are therefore correlated and estimation of the standard deviation of an estimate and assessment of the error of an estimate may require more care than with independent samples.

What Hastings was saying¶

It's easier to understand ideas with a concrete example like the Ising model. The important physics, described earlier, is that the spin configurations follow the Boltzmann distribution:

$$ \text{frequency distribution of configurations} \propto \exp(-E / k T) $$

This notation is too verbose, so let:

$x$ = the configuration of the lattice
$p$ = the frequency distribution of the configurations (the probability mass function)
$\tilde{T}$ = dimensionless temperature $\tilde{T} = k T / J$
$H(x)$ = the sum over neighboring spins $\sum_{<ij>} s_i s_j$ for configuration $x$

With the new symbols:

$$ p(x) \propto \exp(-H(x)/\tilde{T}) = \exp\biggl(-\frac{1}{\tilde{T}} \sum_{<ij>} s_i s_j \biggr) $$

The reason we just say it's proportional to is because in order to know the probability that a specific spin configuration occurs, you need to divide it by all of the possible configurations so that the probabilities add up to one:

$$ p(x) = \frac{ \exp(- H(x) / \tilde{T}) } { \sum_{\text{all }x} \exp(- H(x) / \tilde{T}) } $$

But we don't know the denominator. Which was the original hangup for Bayesian statisticians. Fortunately, it's not a problem when you use MCMC. Like Hastings said,

The computations depend on $p(x)$ only through ratios of the form $p(x')/p(x)$, where $x'$ and $x$ are sample points. Thus, the normalizing constant need not be known, no factorization of $p(x)$ is necessary, and the methods are very easily implemented on a computer.

Implementation¶

For the Ising model, the Metropolis-Hasting algorithm follows these steps:

Pick a starting spin configuration $x$.
Pick a new spin configuration $x'$.
Make a random choice to take the new configuration or keep the old one, weighted by the relative likelihoods of the new configuration vs. the old one $p(x')/p(x)$.
Repeat #2 to #3 until the system converges.
Except actually this one just goes $N$ steps instead of checking for convergence because all I want to do is make an animation of the transient part.

The probability mass function for $x$ is

$$ p(x) = \frac{ \exp(- H(x) / \tilde{T}) } { \sum_{\text{all }x} \exp(- H(x) / \tilde{T}) } $$

which means that for step #3, the ratio of likelihoods is

$$ \frac{p(x')}{p(x)} = \frac{ \exp(- H(x') / \tilde{T}) } { \sum_{\text{all }x} \exp(- H(x) / \tilde{T}) } \bigg/ \frac{ \exp(- H(x) / \tilde{T}) } { \sum_{\text{all }x} \exp(- H(x) / \tilde{T}) } $$

It's a ratio, so we can cancel out the sum over all configurations

$$ \frac{p(x')}{p(x)} = \frac{ \exp(- H(x') / \tilde{T}) } { \exp(- H(x) / \tilde{T}) } $$

Rearranging, the ratio of likelihoods is:

$$ \frac{p(x')}{p(x)} = \exp\left(- \frac{H(x') - H(x)}{\tilde{T}}\right) $$

And now so long as the algorithm selects the new configuration $x'$ relative to the old configuration $x$ with the above proportion, we will walk the Markov Chain with appropriate weighted probabilities. Hastings describes a couple of options in the literature, but says the choice by Metropolis and his team was the most efficient. So, the algorithm is to switch to $x'$ from $x$ with the probability:

$$ p(\text{switch to }x') = \left\lbrace \begin{matrix} 1 & \text{ if } H(x') - H(x) < 0 \kern1em\text{(lower energy)}\\ \exp\left(- \frac{H(x') - H(x)}{\tilde{T}}\right) & \text{ if } H(x') - H(x) \ge 0 \kern1em\text{(higher energy)} \end{matrix} \right. $$

Setup¶

The code uses these Python libraries:

numpy mkl-service theano pymc3 array2gif

This first chunk of code has nothing to do with MCMC. It's for visualization: it converts an array of spin lattices (spin values $\pm 1$) to blue and white and writes it to an animated gif.

import numpy as np
from array2gif import write_gif


def to_two_color(lattice):
    blue = np.ones(lattice.shape, dtype=np.int) * 255 
    red = np.zeros(lattice.shape, dtype=np.int)
    red[lattice < 0] = 255 
    green = red 
    return np.array([red, green, blue])


def output_to_gif(dataset, filename, fps=8):
    print("Frames: {}".format(len(dataset)))
    colors = []
    write_gif(
        [to_two_color(lattice) for lattice in dataset],
        filename,
        fps=fps
    )

Standard approach¶

The code in this section implements the Metropolis-Hastings method for the Ising model as Richard Fitzpatrick describes it. The code about to calculate the energy (get_dH) appears first. After that, in standard_approach, comes the MCMC implementation.

The function get_dH calculates $H(x') - H(x)$. Because none of the other positions in the lattice change, it is not necessary to calculate the energy except at the location $i,j$ of the flip.

def get_dH(lattice, trial_location):
    """ H = - Sum_<ij>(s_i s_j) """
    i, j = trial_location
    height, width = lattice.shape
    H, Hflip = 0, 0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ii = (i + di) % height
        jj = (j + dj) % width
        H -= lattice[ii, jj] * lattice[i, j]
        Hflip += lattice[ii, jj] * lattice[i, j]
    return Hflip - H

The function standard_approach creates a lattice with values $\pm 1$, and loops through all positions $N * 5$ times, taking a snapshot every 5^th step.

It's not very random to step through every position in the lattice sequentially...but it's faster than randomly selecting positions because you are certain to visit every position in the fewest possible steps.

def standard_approach(T, width, height, N=60):
    # Randomly initialize the spins to either +1 or -1
    lattice = 2 * np.random.randint(2, size=(height, width)) - 1
    snapshots = []
    for snapshot in range(N):
        snapshots.append(to_two_color(lattice))
        print('{:2.0%} complete. Net magnetization: {:3.0%}'
              .format(snapshot / N,
                      abs(lattice.sum()) / lattice.size),
              end='\r')
        for step in range(5):
            # Walk through the array flipping atoms.
            for i in range(height):
                for j in range(width):
                    dH = get_dH(lattice, (i, j))
                    if dH < 0:  # lower energy: flip for sure
                        lattice[i, j] = -lattice[i, j]
                    else:  # Higher energy: flip sometimes
                        probability = np.exp(-dH / T)
                        if np.random.rand() < probability:
                            lattice[i, j] = -lattice[i, j]
    return snapshots

Using PyMC3¶

The code in this section implements the Metropolis-Hastings method for the Ising model using PyMC3.

The function get_H calculates the full $H(x)$ for the lattice, because in the PyMC3 implementation, each step is actually completely independent of the next (it's supposed to be), meaning I can't access the prior state's lattice configuration to calculate the difference get_dH, $H(x') - H(x)$, like in the custom code.

The lattice in this case is a Bernoulli random variable, so its values are $\lbrace 0, 1\rbrace$ not $\lbrace -1, +1\rbrace$. That's the reason for the to_spins function, which converts to $\lbrace -1, +1\rbrace$. Then get_H uses Theano for GPU matrix math (when available); in this case to shift the matrix indices by one to get the nearest neighbors.

import theano
import theano.tensor as tt
import pymc3 as pm
from theano.compile.ops import as_op

# Fix compile failure on OSX
# https://stackoverflow.com/a/51312739
theano.config.gcc.cxxflags = "-Wno-c++11-narrowing"


def get_H(spins):
    H = - (
        tt.roll(spins, 1, axis=1) * spins +
        tt.roll(spins, 1, axis=0) * spins +
        tt.roll(spins, -1, axis=1) * spins +
        tt.roll(spins, -1, axis=0) * spins
    )
    return H


def to_spins(lattice):
    return 2 * lattice - 1

The way PyMC3 is used here is nonstandard: typically you'd use observed values to update the prior estimate of the variables you're looking for, but this example has no observed values. Instead, it adds the pm.Potential, which will change the likelihood directly to guide the MCMC random walk toward the lower-energy configurations.

The function mc3_approach runs the MCMC simulation using PyMC3. It performs th same function as standard_approach, for the most part. The main difference is it uses a full-blown MCMC framework, so we can't cheat and step through the positions in sequence like we did with the custom code.

Instead, at each step it will draw an entirely new lattice to try. The problem with this is multiple spins may cancel each other out, meaning the MCMC algorithm will take longer to converge. To fix this, we set scaling = .0006 so that on average only one position in the lattice will change at each step. (From the source code, a random uniform value must be below 1 - .5**scaling for the algorithm to attempt a flip.)

Here's the function. Each line will be broken down in the next section.

def mc3_approach(T, width, height, N=100):
    shape = (height, width)
    x0 = np.random.randint(2, size=shape)
    with pm.Model() as model:
        x = pm.Bernoulli('x', 0.5, shape=shape, testval=x0)
        magnetization = pm.Potential(
            'm',
            -get_H(to_spins(x)) / T
        )
        scaling = .0006
        mul = int(height * width * 1.75)
        step = pm.BinaryMetropolis([x], scaling=scaling)
        trace = pm.sample(N * mul * 5, step=step, chains=1, tune=False)
    dataset = [to_two_color(2 * t['x'] - 1) for t in trace[::mul * 5]]
    # Print out the final percent magnetization
    lattice = 2 * trace[-1]['x'] - 1
    print('Finished. Net magnetization: {:3.0%}'
              .format(abs(lattice.sum()) / lattice.size))
    return dataset

The middle part is where the PyMC3 library is used. The code below is the same as above starting from the PyMC3 model context (with pm.Model() as model), with added comments.

# PyMC3 uses a model context to collect all of the
# random variables together. Every random variable that
# is declared inside this context will be attached to
# the model. The variables can depend on each other,
# and will advance through the Markov Chain together.

with pm.Model() as model:
    # PyMC3 has pre-defined common discrete and continuous
    # distributions. The Bernoulli distribution has values
    # that are in {0, 1}.
    x = pm.Bernoulli('x', 0.5, shape=shape, testval=x0)

    # The Potential function is summed and added to the
    # overall log-likelihood. PyMC3 uses log-likelihood
    # instead of likelihood so when they take p(x')/p(x)
    # they can do it with subtraction:
    #   log( p(x') / p(x)) = log(p(x')) - log(p(x))
    #
    # So, long story--this line plays the same role as
    # the `get_dH` function earlier.
    magnetization = pm.Potential('m', -get_H(to_spins(x))/ T)

    # `scaling` limits the fraction of positions in `x`
    # that flip at every step in the Markov Chain.
    # From the PyMC3 source:
    #     p_jump = 1. - .5 ** self.scaling
    #     ...
    #     switch_locs = (rand_array < p_jump)
    scaling = .0006

    # `mul` was originally height * width to make the same
    # number of iterations as in the standard approach.
    # I made it bigger because the trace didn't converge.
    mul = int(height * width * 1.75)

    # PyMC3 has a lot of different step options, including
    # No U-Turn sampling (NUTS), slice, and Hamiltonian
    # Monte-Carlo. The Metropolis method is the oldest;
    # the others may converge faster depending on the
    # application.
    step = pm.BinaryMetropolis([x], scaling=scaling)

    # `trace` is the object that contains all of the steps.
    # When there are multiple random variables in the model,
    # it will contain all of those variables.
    trace = pm.sample(N * mul * 5,
                      step=step,
                      chains=1,  # default is 2
                      tune=False)

Now run both approaches¶

Finally, this is the part where the two different methods are called and compared. I had to increase the number of samples in the MC3 approach because the random choices weren't as efficient as stepping through the lattice. Also, I chose numbers far above and below the critical ratio so that the test runs would converge faster with this larger matrix size.

You would normally run this on a 10x10 or maybe 20x20 lattice, and run for many more steps to get better convergence, but blog posts need visuals, and I wanted an animation of the transient state. I couldn't make the lattice bigger because I was running into memory issues for the PyMC3 method (which stores every single value in the trace).

def run(T_over_Tc=.9, width=50, height=50, mc3=False):
    Tc = 2.269  # Normalized T := kT/J
    T = T_over_Tc * Tc
    dataset = None
    if mc3:
       dataset = mc3_approach(T, width, height, N=80)
       filename = ('mc3_ising_{}_{}x{}.gif'
                   .format(T_over_Tc, width, height))
    else:
       dataset = standard_approach(T, width, height, N=60)
       filename = ('ising_{}_{}x{}.gif'
                   .format(T_over_Tc, width, height))
    write_gif(dataset, filename, fps=8)

run(T_over_Tc=.75)
run(T_over_Tc=.75, mc3=True)
run(T_over_Tc=1.25)
run(T_over_Tc=1.25, mc3=True)

98% complete. Net magnetization:  97%

Sequential sampling (1 chains in 1 job)
BinaryMetropolis: [x]
100%|██████████| 1750000/1750000 [24:57<00:00, 1168.24it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks

Finished. Net magnetization: 100%
98% complete. Net magnetization:   2%

Sequential sampling (1 chains in 1 job)
BinaryMetropolis: [x]
100%|██████████| 1750000/1750000 [21:42<00:00, 1343.42it/s]
Only one chain was sampled, this makes it impossible to run some convergence checks

Finished. Net magnetization:   0%

Result¶

Convergence is faster with the custom code because the 'random walk' was actually stepping through each lattice position in sequence. Still, it's possible to see the system settling in to magnetization for $T = 0.75 T_c$ in both implementations. And even though in the PyMC3 implementation, the higher temperature system looks a lot less active than in the custom implementation, the net magnetization in both is near zero, with equal amounts of spin up and down.

Conclusion¶

The PyMC3 library provides a ready-made framework for applying Markov Chain Monte Carlo methods in Bayesian inference.

If you're an engineer, it's not bad to think of MCMC as a generalized method for creating "thermodynamic ensembles" of any real-world variable.
The system steps through probability space, with a weighted random walk that's biased toward the more likely states.
At the end, you will have a sample trace that contains the samples from the random walk. The weighting of the walk is such that if you randomly draw new predictions from the tail end of the trace (after it has converged), your draws will follow the posterior distribution of your target variable.
It uses Theano under the hood. Theano is a matrix math library that can leverage the NVIDIA GPU. The caveat is that custom code requires using Theano's functions for tensor math. Note that PyMC4 is about to come out and it depends on TensorFlow if you prefer that to Theano.

PyMC3 also implements No U-Turn Sampling (NUTS) and Hamiltonian Monte Carlo methods. They are modern MCMC techniques that speed up convergence in some cases by using different weights on the random walk.

import sys, platform, IPython, numpy as np, mkl, theano, pymc3 as pm
print("This notebook was created on an %s computer running OSX %s and using:\nPython %s\nIPython %s\nmkl %s\nNumPy %s\nPyMC3 %s\nTheano %s\n" % (platform.machine(), platform.mac_ver()[0], sys.version[:5], IPython.__version__, mkl.__version__, np.__version__, pm.__version__, theano.__version__))

This notebook was created on an x86_64 computer running OSX 10.11.6 and using:
Python 3.6.5
IPython 6.4.0
mkl 1.1.2
NumPy 1.14.5
PyMC3 3.5.rc1
Theano 1.0.2

Vocabulary¶

Likelihood
This is what statisticians call the probability mass (or density) function $p(x|\theta)$ when, instead of evaluating it for fixed parameters $\theta$ they evaluate it for fixed observation $x$ and allow the parameters themselves to vary across their entire domain. To make this more explicit when talking about it, they give the function a different name: the likelihood $\mathscr{L}(\theta|x) \equiv p(\theta|x)$. The total area under the likelihood need not equal one, so it must be normalized.
Monte Carlo Method
The practice of using pseudo-random numbers generated in a computer to model random events. Named for the Monte Carlo casino in Monaco, where Stanislaw Ulam's uncle would gamble. (Ulam is a coauthor on the 1953 Metropolis paper.)
Markov Chain
A sequence of events that each satisfy the Markov property: that whatever happens next only depends on the current state, and is independent of prior events. A Brownian motion is one example. It is named for the Russian mathematician Andrey Andreevich Markov.
Markov Chain Monte Carlo (MCMC)
A sequence of successive Monte Carlo draws in which the starting point of the current draw is the outcome of the last draw. The chain steps through points in probability space. The MCMC algorithms have a weighted preference for more likely outcomes, so the chain will spend more of its time in the more likely regions. This means the tail of the resulting Markov Chain will approximate the posterior distribution, and you can draw from it like you would draw from any computer-generated random distribution.
Posterior
In the context of Bayesian statistics, this is a new distribution that combines what you knew a priori with the observed data. When using MCMC, the posterior is approximated by the tail end of the sample trace. Mathematically, you obtain the posterior from Bayes' theorem, in which $p(x)$ is a prior of your choice, often constant, $x_\text{obs}$ is the observed data, and $\mathscr{L}$ is a likelihood of your choice. $$P(x|x_\text{obs}) = \frac{P(x_\text{obs} | x)}{P(x_\text{obs})} p(x) = \frac{\mathscr{L}(x_\text{obs}| x)}{\int_{x}\mathscr{L}(x_\text{obs}|x) p(x) dx} p(x)$$
Prior
In the context of Bayesian statistics, a distribution that reflects what you know about a random variable prior to incorporating your observations. Often people will just say the prior is uniform, $p(x) \propto 1$, and normalize it out. Robert Cousins gives the example in particle physics where the prior can be zero for negative mass and uniform otherwise, or other choices that make sense for the context.

Further resources¶

The following resources (in chronological order) are for people interested in the history of MCMC methods:

Equation of State Calculations by Fast Computing Machines (paywall)
Nicholas Metropolis and collaborators' paper starting it all. J. Chem. Phys., Vol. 21, No. 6 (June 1953) pp. 1087-1092.
Monte Carlo Sampling Methods Using Markov Chains and Their Applications
Wilhelm Hastings's paper explaining and generalizing Metropolis's algorithm. Biometrika, Vol. 57, No. 1 (April 1970), pp. 97-109.
The Beginning of the Monte Carlo Method (pdf)
A delightful historical account, by Nicholas Metropolis, written as part of a special newsletter honoring the scientific contributions of Stan Ulam. Los Alamos Science, Special Issue, 1987, pp.125-130.
From EM to Data Augmentation: The Emergence of MCMC Bayesian Computation in the 1980s (pdf)
Martin Tanner and Wing Wong's fantastic survey of the history of Markov Chain Monte Carlo methods. Stat. Sci., Vol. 25, No. 4 (2010) pp. 506–516

I'm new to the Bayesian approach. The following resources (in descending order of usefulness) have helped me:

"Why isn't every physicist a Bayesian?" (pdf)
☆ Choose this if you only have time for one paper. ☆
Robert D. Cousins's paper is technical but understandable. It mercifully avoids the religious fervor of other papers on the topic. Am. J. Phys., Vol. 63, No. 5 (May 1995).
Journal Article Reporting Standards for Quantitative Research in Psychology: The APA Publications and Communications Board Task Force Report
Table 8 is about reporting Bayesian results. It's longer and more detailed than the BaSiS group checklist listed next.
Standards for Reporting of Bayesian Analyses in the Scientific Literature (no https)
The Bayesian Standards in Science group (BaSiS) compiled a checklist in 2001 for what to include when reporting Bayesian results. Short and sweet. The actual checklist (no https).
Frequentism and Bayesianism: A Python-driven Primer (pdf)
Jake VanderPlas's 2014 paper is a summary of both the Frequentist and Bayesian philosophies. It uses accessible language and examples, and demonstrates both techniques using Python but sort of feels like it's trolling Frequentists. The Cousins paper was easier for me to understand. Proc. of the 13th Python in science conf. (SCIPY 2014).
Frequentism and Bayesianism blog post
VanderPlas's 2014 blog post describes PyMC (the precursor to PyMC3) and two other Python tools, with a lot of the examples eventually going into the above paper; for people who prefer a blog format.
MCMC sampling for dummies
Thomas Wiecki's 2015 blog post that inspired this one. It's good for people who are comfortable with the statistics part of the Bayesian approach but not comfortable with the numerical methods part of MCMC. (That's actually the opposite of what I was.)

All links in this post were accessed on or before July 29, 2018.