Monte Carlo sampling algorithms (either MCMC or not) have a goal to attain samples from a distribution. They can be organized by what inputs or prior knowledge about the distribution they require. This ranges from a low amount of knowledge, as in slice sampling (just give it an unnormalized density function), to a high amount, as in Gibbs sampling (you have to decompose your distribution into individual conditionals).

Typical inputs include \(f(x)\), an unnormalized density or probability function for the target distribution, which returns a real number for a variable value. \(g()\) and \(g(x)\) represent sample generation procedures (that output a variable value); some generators require an input, some do not.

Here are the required inputs for a few algorithms. (For an overview, see e.g. Ch 29 of MacKay.) There are many more out there of course. I’m leaving off tuning parameters.

Black-box samplers: Slice sampling, Affine-invariant ensemble

- unnorm density \(f(x)\)

Metropolis (symmetric proposal)

- unnorm density/pmf \(f(x)\)

- proposal generator \(g(x)\)

Hastings (asymmetric proposal)

- unnorm density/pmf \(f(x)\)

- proposal generator \(g(x)\)

- proposal unnorm density/pmf \(q(x’; x)\) .

… [For the proposal generator at state \(x\), probability it generates \(x'\)]

Importance sampling, rejection sampling

- unnorm density/pmf \(f(x)\)

- proposal generator \(g()\)

- proposal unnorm density/pmf \(q(x)\)

Independent Metropolis-Hastings: the proposal is always the same, but still have to worry about asymmetric corrections

- unnorm density/pmf \(f(x)\)

- proposal generator \(g()\)

- proposal unnorm density/pmf \(q(x’; x)\)

Hamiltonian Monte Carlo

- unnorm density \(f(x)\)

- unnorm density gradient \(gf(x)\)

Gibbs Sampling

- local conditional generators \(g_i(x_{-i})\)

… [which have to give samples from \( p(x_i | x_{-i}) \)]

Note importance/rejection sampling are stateless, but the MCMC algorithms are stateful.

I’m distinguishing a sampling procedure \(g\) from a density evaluation function \(f\) because having the latter doesn’t necessarily give you the former. (For the one-dimension case, having an inverse CDF indeed gives you a a sampler, but multidimensional gets harder — part of why all these techniques were invented in the first place!) Shay points out their relationship is analogous to 3-SAT: it’s easy to evaluate a full variable setting, but hard to generate them. (Or specifically, think about a 3-SAT PMF \(p(x) = 1\{\text{\(x\) is boolean satisfiable}\}\) where only one \(x\) has non-zero probability; PMF evaluation is easy but the best known sampler is exponential time.)

And of course there’s a related organization of optimization algorithms. Here’s a rough look at a few unconstrained optimizers:

Black-box optimizers: Grid search, Nelder-Mead, evolutionary, …

- objective \(f(x)\)

BFGS, CG

- objective \(f(x)\)

- gradient \(gf(x)\)

Newton-Raphson

- objective \(f(x)\)

- gradient \(gf(x)\)

- hessian \(hf(x)\)

Simulated annealing

- objective \(f(x)\)

- proposal generator \(g(x)\)

SGD

- one-example gradient \(gf_i(x)\)

I think it’s neat that Gibbs Sampling and SGD don’t always require you to implement a likelihood/objective function. It’s nice to do to ensure you’re actually optimizing or exploring the posterior, but strictly speaking the algorithms don’t require it.

Very nice summary.

One question: does SGD actually work in cases in which the objective function doesn’t decompose in an obvious way into the sum of one-example objective functions? My vague recollection of the proofs of convergence is that they make heavy use of the decomposition of the objective functions.

Yeah SGD requires decomposition into minibatch gradients you can sum together for the big gradient (and thus the gradient from a minibatch is in expectation same as the full gradient). I’m not aware of SGD for non-sum situations…