(dp_mix)=
:::{post} Sept 16, 2021
:tags: mixture model,
:category: advanced
:author: Austin Rochford, Abhipsha Das
:::
The Dirichlet process is a flexible probability distribution over the space of distributions. Most generally, a probability distribution, $P$, on a set $\Omega$ is a [measure](https://en.wikipedia.org/wiki/Measure_(mathematics%29) that assigns measure one to the entire space ($P(\Omega) = 1$). A Dirichlet process $P \sim \textrm{DP}(\alpha, P_0)$ is a measure that has the property that, for every finite disjoint partition $S_1, \ldots, S_n$ of $\Omega$,

$$(P(S_1), \ldots, P(S_n)) \sim \textrm{Dir}(\alpha P_0(S_1), \ldots, \alpha P_0(S_n)).$$
Here $P_0$ is the base probability measure on the space $\Omega$. The precision parameter $\alpha > 0$ controls how close samples from the Dirichlet process are to the base measure, $P_0$. As $\alpha \to \infty$, samples from the Dirichlet process approach the base measure $P_0$.
Dirichlet processes have several properties that make them quite suitable to {term}`MCMC` simulation.
The posterior given i.i.d. observations $\omega_1, \ldots, \omega_n$ from a Dirichlet process $P \sim \textrm{DP}(\alpha, P_0)$ is also a Dirichlet process with

$$P\ |\ \omega_1, \ldots, \omega_n \sim \textrm{DP}\left(\alpha + n, \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}\right),$$

where $\delta$ is the Dirac delta measure

$$\begin{align*}
\delta_{\omega}(S)
    & = \begin{cases}
            1 & \textrm{if } \omega \in S \\
            0 & \textrm{if } \omega \not \in S
        \end{cases}
\end{align*}.$$

The posterior predictive distribution of a new observation is a compromise between the base measure and the observations,

$$\omega\ |\ \omega_1, \ldots, \omega_n \sim \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}.$$
We see that the prior precision $\alpha$ can naturally be interpreted as a prior sample size. The form of this posterior predictive distribution also lends itself to Gibbs sampling.
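To make this compromise concrete, the following small sketch (illustrative code, not part of the original notebook) draws from the posterior predictive above: with probability $\frac{\alpha}{\alpha + n}$ we draw a fresh value from the base measure $P_0$, and otherwise we repeat one of the existing observations.

```python
import numpy as np

rng = np.random.default_rng(42)


def posterior_predictive_draw(obs, alpha, base_rvs):
    """Draw from the DP posterior predictive: the base measure P0 with
    probability alpha / (alpha + n), otherwise one of the observations."""
    n = len(obs)
    if rng.uniform() < alpha / (alpha + n):
        return base_rvs()  # a fresh draw from the base measure P0
    return rng.choice(obs)  # repeat an existing observation


# Hypothetical example: three observations and a standard normal base measure.
obs = np.array([-0.3, 0.1, 1.2])
print([posterior_predictive_draw(obs, alpha=2.0, base_rvs=rng.standard_normal) for _ in range(5)])
```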
Samples, $P \sim \textrm{DP}(\alpha, P_0)$, from a Dirichlet process are discrete with probability one. That is, there are elements $\omega_1, \omega_2, \ldots$ in $\Omega$ and weights $\mu_1, \mu_2, \ldots$ with $\sum_{i = 1}^{\infty} \mu_i = 1$ such that

$$P = \sum_{i = 1}^{\infty} \mu_i \delta_{\omega_i}.$$
The stick-breaking process gives an explicit construction of the weights $\mu_i$ and samples $\omega_i$ above that is straightforward to sample from. The samples are drawn i.i.d. from the base measure, $\omega_i \sim P_0$, and if $\beta_1, \beta_2, \ldots \sim \textrm{Beta}(1, \alpha)$, then $\mu_i = \beta_i \prod_{j = 1}^{i - 1} (1 - \beta_j)$. The relationship between this representation and stick breaking may be illustrated as follows:
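Concretely, each $\beta_i$ breaks off a proportion of whatever is left of a unit-length stick after the first $i - 1$ breaks, so the first few weights are

$$\begin{align*}
\mu_1 & = \beta_1, \\
\mu_2 & = \beta_2 (1 - \beta_1), \\
\mu_3 & = \beta_3 (1 - \beta_1)(1 - \beta_2), \\
      & \;\;\vdots
\end{align*}$$

and the length of stick remaining after $i$ breaks is $\prod_{j = 1}^{i} (1 - \beta_j)$, which shrinks towards zero, so the weights sum to one.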
Suggested further reading: [this tutorial](http://mlg.eng.cam.ac.uk/tutorials/07/ywt.pdf) and {cite:t}`teh2010dirichletprocess` for a brief introduction to other flavours of Dirichlet processes and their applications.
We can use the stick-breaking process above to easily sample from a Dirichlet process in Python. For this example, $\alpha = 2$ and the base distribution is $N(0, 1)$.
We draw and plot samples from the stick-breaking process.
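A minimal sketch of this step using plain NumPy/SciPy follows; the truncation level `K` and the variable names here are illustrative choices rather than the original code.

```python
import numpy as np
import scipy.stats as st

K = 30  # truncation level for the stick-breaking process (illustrative choice)
ALPHA = 2.0
rng = np.random.default_rng(123)


def draw_dp_sample(alpha, base_dist, K, rng):
    """Draw one truncated sample from DP(alpha, base_dist) via stick breaking."""
    beta = rng.beta(1.0, alpha, size=K)
    # mu_i = beta_i * prod_{j < i} (1 - beta_j): the stick length left before break i
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
    mu = beta * remaining
    omega = base_dist.rvs(size=K, random_state=rng)  # atoms drawn from the base measure
    return mu, omega


for _ in range(5):
    mu, omega = draw_dp_sample(ALPHA, st.norm(0.0, 1.0), K, rng)
    print(f"truncated mass captured: {mu.sum():.3f}")
```

Increasing `ALPHA` here makes the sampled distributions cling more closely to the base measure, which is the limiting behaviour described next.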
As stated above, as $\alpha \to \infty$, samples from the Dirichlet process converge to the base distribution.
For the task of density estimation, the (almost sure) discreteness of samples from the Dirichlet process is a significant drawback. This problem can be solved with another level of indirection by using Dirichlet process mixtures for density estimation. A Dirichlet process mixture uses component densities from a parametric family $\mathcal{F} = \{f_{\theta}\ |\ \theta \in \Theta\}$ and represents the mixture weights as a Dirichlet process. If $P_0$ is a probability measure on the parameter space $\Theta$, a Dirichlet process mixture is the hierarchical model

$$\begin{align*}
    x_i\ |\ \theta_i
        & \sim f_{\theta_i} \\
    \theta_1, \ldots, \theta_n
        & \sim P \\
    P
        & \sim \textrm{DP}(\alpha, P_0).
\end{align*}$$
To illustrate this model, we simulate draws from a Dirichlet process mixture with $\alpha = 2$, $\theta \sim N(0, 1)$, $x\ |\ \theta \sim N(\theta, (0.3)^2)$.
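The following sketch (again illustrative, with an arbitrary truncation level) simulates one such mixture density on a plotting grid:

```python
import numpy as np
import scipy.stats as st

K = 30  # truncation level (illustrative)
ALPHA = 2.0
rng = np.random.default_rng(0)

# Stick-breaking weights and component means theta_i ~ N(0, 1)
beta = rng.beta(1.0, ALPHA, size=K)
w = beta * np.concatenate([[1.0], np.cumprod(1.0 - beta)[:-1]])
theta = rng.normal(0.0, 1.0, size=K)

# Mixture density f(x) = sum_i w_i N(x | theta_i, 0.3^2) on a plotting grid
x_plot = np.linspace(-3, 3, 200)
density = (w[:, None] * st.norm.pdf(x_plot[None, :], theta[:, None], 0.3)).sum(axis=0)
```

Each term `w[i] * st.norm.pdf(x_plot, theta[i], 0.3)` is one of the weighted mixture components that the next step examines individually.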
We now focus on a single mixture and decompose it into its individual (weighted) mixture components.
Sampling from these stochastic processes is fun, but these ideas become truly useful when we fit them to data. The discreteness of samples and the stick-breaking representation of the Dirichlet process lend themselves nicely to Markov chain Monte Carlo simulation of posterior distributions. We will perform this sampling using PyMC.
Our first example uses a Dirichlet process mixture to estimate the density of waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park.
For convenience in specifying the prior, we standardize the waiting time between eruptions.
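A sketch of this preprocessing step; the file and column names (`old_faithful.csv`, `waiting`) are assumptions for illustration.

```python
import pandas as pd
import pymc as pm

# The file and column names below are assumptions for illustration.
old_faithful_df = pd.read_csv(pm.get_data("old_faithful.csv"))
old_faithful_df["std_waiting"] = (
    old_faithful_df.waiting - old_faithful_df.waiting.mean()
) / old_faithful_df.waiting.std()
```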
Observant readers will have noted that we have not been continuing the stick-breaking process indefinitely as indicated by its definition, but rather have been truncating this process after a finite number of breaks. Obviously, when computing with Dirichlet processes, it is necessary to only store a finite number of its point masses and weights in memory. This restriction is not terribly onerous, since with a finite number of observations, it seems quite likely that the number of mixture components that contribute non-negligible mass to the mixture will grow slower than the number of samples. This intuition can be formalized to show that the (expected) number of components that contribute non-negligible mass to the mixture approaches $\alpha \log N$, where $N$ is the sample size.
There are various clever Gibbs sampling techniques for Dirichlet processes that allow the number of components stored to grow as needed. Stochastic memoization {cite:p}`roy2008npbayes` is another powerful technique for simulating Dirichlet processes while only storing finitely many components in memory. In this introductory example, we take the much less sophisticated approach of simply truncating the Dirichlet process components that are stored after a fixed number, $K$, of components. {cite:t}`ishwaran2002approxdirichlet` and {cite:t}`ohlssen2007flexible` provide justification for truncation, showing that $K > 5 \alpha + 2$ is most likely sufficient to capture almost all of the mixture weight ($\sum_{i = 1}^{K} w_i > 0.99$). In practice, we can verify the suitability of our truncated approximation to the Dirichlet process by checking the number of components that contribute non-negligible mass to the mixture. If, in our simulations, all components contribute non-negligible mass to the mixture, we have truncated the Dirichlet process too early.
Our (truncated) Dirichlet process mixture model for the standardized waiting times is
Note that instead of fixing a value of $\alpha$, as in our previous simulations, we specify a prior on $\alpha$, so that we may learn its posterior distribution from the observations.
We now construct this model using PyMC.
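The sketch below shows one way to write the truncated model in PyMC; the specific priors (a $\textrm{Gamma}(1, 1)$ prior on $\alpha$ and normal/gamma priors on the component parameters) and the truncation level are illustrative choices, not necessarily those of the original notebook.

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

K = 30  # truncation level (illustrative)


def stick_breaking(beta):
    """Convert Beta(1, alpha) draws into truncated stick-breaking weights."""
    portion_remaining = pt.concatenate([[1.0], pt.extra_ops.cumprod(1.0 - beta)[:-1]])
    return beta * portion_remaining


with pm.Model(coords={"component": np.arange(K)}) as model:
    alpha = pm.Gamma("alpha", 1.0, 1.0)
    beta = pm.Beta("beta", 1.0, alpha, dims="component")
    w = pm.Deterministic("w", stick_breaking(beta), dims="component")

    mu = pm.Normal("mu", 0.0, 5.0, dims="component")
    sigma = pm.Gamma("sigma", 1.0, 1.0, dims="component")
    # old_faithful_df with its std_waiting column comes from the preprocessing sketch above
    obs = pm.NormalMixture(
        "obs", w, mu, sigma=sigma, observed=old_faithful_df.std_waiting.values
    )
```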
We sample from the model 1,000 times using NUTS initialized with ADVI.
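A minimal version of that sampling call (the random seed is an arbitrary choice):

```python
with model:
    trace = pm.sample(1000, init="advi", random_seed=42)
```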
The posterior distribution of $\alpha$ is highly concentrated between 0.25 and 1.
To verify that truncation is not biasing our results, we plot the posterior expected mixture weight of each component.
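One way to compute and plot those posterior expected weights, assuming the weights were stored as the variable `w` as in the sketch above:

```python
import matplotlib.pyplot as plt
import numpy as np

post_w = trace.posterior["w"].mean(dim=("chain", "draw"))

fig, ax = plt.subplots()
ax.bar(np.arange(K), post_w)
ax.set_xlabel("Component")
ax.set_ylabel("Posterior expected mixture weight")
```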
We see that only three mixture components have appreciable posterior expected weights, so we conclude that truncating the Dirichlet process to thirty components has not appreciably affected our estimates.
We now compute and plot our posterior density estimate.
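A sketch of that computation, averaging the mixture density $\sum_i w_i\, N(x \mid \mu_i, \sigma_i)$ over posterior samples (variable names follow the model sketch above):

```python
import arviz as az
import numpy as np
import scipy.stats as st

x_plot = np.linspace(-3, 3, 200)

post = az.extract(trace)  # stacks chains and draws into one "sample" dimension
# Density under each posterior sample: sum_i w_i N(x | mu_i, sigma_i)
post_pdfs = (
    post["w"].values[np.newaxis, :, :]
    * st.norm.pdf(
        x_plot[:, np.newaxis, np.newaxis],
        post["mu"].values[np.newaxis, :, :],
        post["sigma"].values[np.newaxis, :, :],
    )
).sum(axis=1)
post_pdf_mean = post_pdfs.mean(axis=1)  # posterior expected density on the grid
```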
As above, we can decompose this density estimate into its (weighted) mixture components.
The Dirichlet process mixture model is incredibly flexible in terms of the family of parametric component distributions $\{f_{\theta}\ |\ \theta \in \Theta\}$. We illustrate this flexibility below by using Poisson component distributions to estimate the density of sunspots per year. This dataset was curated by {cite:t}`sidc2021sunspot` and can be downloaded.
For this example, the model is
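A hedged sketch of how such a Poisson mixture could be written in PyMC, reusing `stick_breaking` and `K` from the sketch above; the $\textrm{Uniform}(0, 300)$ prior on the rates and the stand-in data are illustrative only.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
# Synthetic stand-in for the annual sunspot counts; replace with the SIDC data.
sunspots = rng.poisson(60, size=300)

with pm.Model(coords={"component": np.arange(K)}) as sunspot_model:
    alpha = pm.Gamma("alpha", 1.0, 1.0)
    beta = pm.Beta("beta", 1.0, alpha, dims="component")
    w = pm.Deterministic("w", stick_breaking(beta), dims="component")

    # Component rates; the Uniform(0, 300) prior is an illustrative choice.
    mu = pm.Uniform("mu", 0.0, 300.0, dims="component")
    obs = pm.Mixture("obs", w, pm.Poisson.dist(mu), observed=sunspots)
```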
For the sunspot model, the posterior distribution of $\alpha$ is concentrated between 0.6 and 1.2, indicating that we should expect more components to contribute non-negligible amounts to the mixture than for the Old Faithful waiting time model.
Indeed, we see that between ten and fifteen mixture components have appreciable posterior expected weight.
We now calculate and plot the fitted density estimate.
Again, we can decompose the posterior expected density into weighted mixture densities.
:::{bibliography}
:filter: docname in docnames
:::
- Updated to use `az.extract` by Benjamin T. Vincent in February 2023 (pymc-examples#522)

:::{include} page_footer.md
:::