(awkward_binning)=
(binning)=
:::{post} Oct 23, 2021
:tags: binned data, case study, parameter estimation
:category: intermediate
:author: Eric Ma, Benjamin T. Vincent
:::
Let us say that we are interested in inferring the properties of a population. This could be the distribution of age, income, body mass index, or any of a whole range of possible measures. In completing this task, we often come across situations where we have multiple datasets, each of which can inform our beliefs about the overall population.
Very often this data can be in a form that we, as data scientists, would not consider ideal. For example, it may have been binned into categories. One reason why this is not ideal is that the binning process discards information - we lose any knowledge about where within a bin an individual datum lies. A second reason is that different studies may use different binning methods - for example, one study might record age in terms of decades (is someone in their 20s, 30s, 40s, and so on) while another might record age (indirectly) by assigning generational labels (Gen Z, Millennial, Gen X, Boomer II, Boomer I, Post War).
So we are faced with a problem: we have datasets with counts of our measure of interest (whether that be age, income, BMI, or something else), but they are binned, and they have been binned differently. This notebook presents a solution to this problem that PyMC Labs worked on, supported by the Gates Foundation: we can make inferences about the parameters of a population level distribution.
More formally, we describe the problem as: if we have the bin edges (aka cutpoints) used for data binning, and the bin counts, how can we estimate the parameters of the underlying distribution? We will present a solution, along with various illustrative examples; the approach assumes that we know the bin edges and the bin counts, but not the underlying raw data.
The approach used is heavily based upon the logic behind ordinal regression. This approach proposes that observed ordinal outcomes $Y \in \{1, 2, \ldots, K\}$ are generated by a set of bin edges (aka cutpoints) $\theta_1, \ldots, \theta_{K-1}$ operating upon a latent probability distribution which we could call $y^*$. We can describe the probability of observing data in bin 1 as:

$$P(Y=1) = \Phi(\theta_1),$$

bin 2 as:

$$P(Y=2) = \Phi(\theta_2) - \Phi(\theta_1),$$

bin 3 as:

$$P(Y=3) = \Phi(\theta_3) - \Phi(\theta_2),$$

and bin 4 as:

$$P(Y=4) = 1 - \Phi(\theta_3),$$
where $\Phi$ is the standard cumulative normal.
In ordinal regression, the cutpoints are treated as latent variables and the parameters of the normal distribution may be treated as observed (or derived from other predictor variables). This problem differs in that:

- the cutpoints are known, not latent;
- we observe the counts in each bin, rather than individual ordinal responses;
- the parameters of the latent distribution (mu and sigma) are unknown, and are what we want to infer.

We are now in a position to sketch out a generative PyMC model.
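Below is a minimal sketch of such a model, assuming a Gaussian latent distribution. The specific priors, cutpoints, and counts used here are illustrative placeholders rather than the values used later in the notebook.

```python
import numpy as np
import pymc as pm

# Illustrative bin edges and observed bin counts (placeholders only)
cutpoints = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
counts = np.array([250, 180, 150, 90, 50, 20])

with pm.Model() as model:
    # Priors over the parameters of the latent Gaussian distribution
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)

    # 1. Cumulative density of the latent Normal at each cutpoint
    cdf = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), cutpoints))

    # 2. Concatenate the cumulative density at -inf (0) and +inf (1)
    cdf = pm.math.concatenate([np.zeros(1), cdf, np.ones(1)])

    # 3. Differences of consecutive cumulative densities = bin probabilities
    probs = pm.Deterministic("probs", cdf[1:] - cdf[:-1])

    # Multinomial likelihood of the observed bin counts
    pm.Multinomial("counts", p=probs, n=counts.sum(), observed=counts)
```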
The exact way we implement the models below differs only very slightly from this, but let's decompose how this works.
First we define priors over the mu and sigma parameters of the latent distribution. Then we have 3 lines which calculate the probability that any observed datum falls in a given bin. The first of these lines calculates the cumulative density at each of the cutpoints. The second simply concatenates the cumulative density at $-\infty$ (which is zero) and at $\infty$ (which is 1). The third calculates the difference between consecutive cumulative densities, giving the actual probability of a datum falling in any given bin. Finally, we end with the Multinomial likelihood, which tells us the likelihood of observing the counts given the set of bin probabilities and the total number of observations, sum(counts).
Hypothetically, we could have used base Python or NumPy to describe the generative process. The problem with this, however, is that gradient information would be lost. Performing these operations with numerical libraries that retain gradient information allows them to be used by gradient-based MCMC sampling algorithms.
The approach was illustrated with a Gaussian distribution, and below we show a number of worked examples using Gaussian distributions. However, the approach is general, and at the end of the notebook we provide a demonstration that the approach does indeed extend to non-Gaussian distributions.
The first few examples will be based on 2 hypothetical studies which measure a Gaussian distributed variable. Each study will have its own sample size, and our task is to learn the parameters of the population level Gaussian distribution. Frustration 1 is that the data have been binned. Frustration 2 is that each study has used different categories, that is, different cutpoints in this data binning process.
In this simulation approach, we will define the true population level parameters as:
- true_mu: -2.0
- true_sigma: 2.0

Our goal will be to recover the mu and sigma values given only the bin counts and cutpoints.
The two studies used different cutpoints for their data binning process.
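As a sketch of how the simulation and binning step might look, the code below draws Gaussian samples and bins them with NumPy. The cutpoints and sample sizes are illustrative placeholders chosen to match the description (6 bins for study 1, 7 for study 2), not the exact values used to make the figures.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

true_mu, true_sigma = -2.0, 2.0

# Simulated raw data for the two hypothetical studies (sizes are illustrative)
x1 = rng.normal(true_mu, true_sigma, size=1500)
x2 = rng.normal(true_mu, true_sigma, size=2000)

# Illustrative cutpoints: c1 yields 6 bins and lumps the lower half of the
# distribution into a single open-ended bin; c2 yields 7 bins
c1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
c2 = np.array([-6.0, -4.0, -2.0, 0.0, 2.0, 4.0])


def bin_counts(x, cutpoints):
    """Count observations in each bin defined by the cutpoints, including
    the open-ended bins below the first and above the last cutpoint."""
    edges = np.concatenate([[-np.inf], cutpoints, [np.inf]])
    counts, _ = np.histogram(x, bins=edges)
    return counts


counts1 = bin_counts(x1, c1)
counts2 = bin_counts(x2, c2)
```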
Let's visualise this in one convenient figure. The left hand column shows the underlying data and the cutpoints for both studies. The right hand column shows the resulting bin counts, with each bin paired with its count.
As you'll see above, c1 and c2 bin the data differently: one has 6 bins, the other has 7, and c1 misses basically half of the Gaussian distribution.
To recap, in a real situation we might have access to the cutpoints and to the bin counts, but not the underlying data x1 or x2.
We will start by investigating what happens when we use only one set of bins to estimate the mu and sigma parameters.
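Below is a sketch of how this might look, fitting the model to c1 and counts1 from the simulation sketch above; the priors and default sampler settings are illustrative. We also draw posterior predictive samples here, which we will use in the checks that follow.

```python
with pm.Model() as model1:
    # Priors over the parameters of the latent Gaussian distribution
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)

    # Probability of falling in each of study 1's bins
    cdf = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), c1))
    cdf = pm.math.concatenate([np.zeros(1), cdf, np.ones(1)])
    probs = pm.Deterministic("probs", cdf[1:] - cdf[:-1])

    # Multinomial likelihood of study 1's observed bin counts
    pm.Multinomial("counts1", p=probs, n=counts1.sum(), observed=counts1)

    # Sample from the posterior, then draw posterior predictive samples
    idata1 = pm.sample()
    idata1.extend(pm.sample_posterior_predictive(idata1))
```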
We start with posterior predictive checks: given the posterior values, we should be able to generate observations that look close to what we observed.
We can do this graphically.
It looks like the numbers are in the right ballpark, and with the bins ordered correctly, the proportions also look right.
We can also get programmatic access to our posterior predictions in a number of ways.
Let's take the mean and compare it against the observed counts:
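For example, continuing with the idata1 object from the sketch above:

```python
# Mean posterior predictive bin counts, averaged over chains and draws
ppc_mean = idata1.posterior_predictive["counts1"].mean(dim=["chain", "draw"])

print(ppc_mean.values)  # predicted bin counts
print(counts1)          # observed bin counts, for comparison
```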
The more important question is whether we have recovered the parameters of the distribution or not.
Recall that we used mu = -2 and sigma = 2 to generate the data.
Pretty good! And we can access the posterior mean estimates (stored as xarray types) as below. The MCMC samples arrive back in a 2D matrix with one dimension for the MCMC chain (chain), and one for the sample number (draw). We can calculate the overall posterior average with .mean(dim=["draw", "chain"]).
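For example, again using the idata1 object from the earlier sketch:

```python
# Overall posterior means of the latent distribution's parameters
mu_mean = idata1.posterior["mu"].mean(dim=["draw", "chain"])
sigma_mean = idata1.posterior["sigma"].mean(dim=["draw", "chain"])
print(float(mu_mean), float(sigma_mean))
```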
Above, we used one set of binned data. Let's see what happens when we swap out for the other set of data.
As above, the model specification is the same except that it uses the second study's cutpoints and counts.
Let's run a PPC check to ensure we are generating data that are similar to what we observed.
We calculate the mean posterior predictive bin counts, averaged over samples.
Looks like a good match. But as always it is wise to visualise things.
Not bad!
And did we recover the parameters?
Now we need to see what happens if we add in both ways of binning.
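A sketch of such a model is shown below, reusing c1, c2, counts1 and counts2 from the simulation sketch: a single mu and sigma feed two Multinomial likelihoods, one per study. The priors are illustrative.

```python
with pm.Model() as model_both:
    # Shared population level parameters
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)

    # Bin probabilities implied by study 1's cutpoints
    cdf1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), c1))
    cdf1 = pm.math.concatenate([np.zeros(1), cdf1, np.ones(1)])
    probs1 = pm.Deterministic("probs1", cdf1[1:] - cdf1[:-1])

    # Bin probabilities implied by study 2's cutpoints
    cdf2 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), c2))
    cdf2 = pm.math.concatenate([np.zeros(1), cdf2, np.ones(1)])
    probs2 = pm.Deterministic("probs2", cdf2[1:] - cdf2[:-1])

    # One Multinomial likelihood per study, both informing mu and sigma
    pm.Multinomial("counts1", p=probs1, n=counts1.sum(), observed=counts1)
    pm.Multinomial("counts2", p=probs2, n=counts2.sum(), observed=counts2)
```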
For the sake of completeness, let's see how we can estimate population parameters based on one set of continuous measures and one set of binned measures. We will use the simulated data we have already generated.
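Below is a sketch of one way this could look, assuming (arbitrarily, for illustration) that we observe study 1 only as binned counts but have the raw x2 values from study 2.

```python
with pm.Model() as model_mixed:
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=10.0)

    # Study 1: binned counts only
    cdf1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu, sigma=sigma), c1))
    cdf1 = pm.math.concatenate([np.zeros(1), cdf1, np.ones(1)])
    probs1 = pm.Deterministic("probs1", cdf1[1:] - cdf1[:-1])
    pm.Multinomial("counts1", p=probs1, n=counts1.sum(), observed=counts1)

    # Study 2: raw, unbinned measurements
    pm.Normal("x2_obs", mu=mu, sigma=sigma, observed=x2)
```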
We can calculate the mean and standard deviation of the posterior predictive distribution for study 2 and see that they are close to our true parameters.
Finally, we can check the posterior estimates of the parameters and see that the estimates here are spot on.
The previous examples all assumed that study 1 and study 2 data were sampled from the same population. While this was in fact true for our simulated data, when we are working with real data, we are not in a position to know this. So it could be useful to be able to ask the question, "does it look like data from study 1 and study 2 are drawn from the same population?"
We can do this using the same basic approach - we can estimate population level parameters like before, but now we can add in study level parameter estimates. This will be a new hierarchical layer in our model between the population level parameters and the observations.
This time, because we are getting into a more complicated model, we will use coords to tell PyMC about the dimensionality of the variables. This feeds into the posterior samples, which are output in xarray format, making life easier when processing posterior samples for statistical or visualization purposes later.
The model above is fine, but running it as-is results in hundreds of divergences in the sampling process (you can find out more about divergences in the {ref}`diagnosing_with_divergences` notebook). While we won't go deep into the reasons here, the long story cut short is that the centered parameterisation of the Gaussians introduces pathologies into our log likelihood space that make it difficult for MCMC samplers to work. Firstly, we remove the population level estimate of sigma and just stick with study level priors, using a Gamma distribution to avoid any zero values. Secondly, we use a non-centered reparameterization to specify mu. This does not completely solve the problem, but it does drastically reduce the number of divergences.
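A sketch of what this reparameterised hierarchical model might look like is below. The coords, prior choices (including the Gamma hyperparameters), and variable names are illustrative assumptions, reusing c1, c2, counts1 and counts2 from the simulation sketch.

```python
coords = {
    "study": ["study1", "study2"],
    "bin1": np.arange(len(counts1)),
    "bin2": np.arange(len(counts2)),
}

with pm.Model(coords=coords) as model_hier:
    # Population level prior on the mean, plus its spread across studies
    pop_mu = pm.Normal("pop_mu", mu=0.0, sigma=10.0)
    pop_mu_sigma = pm.HalfNormal("pop_mu_sigma", sigma=5.0)

    # Non-centered parameterisation of the study level means
    mu_offset = pm.Normal("mu_offset", mu=0.0, sigma=1.0, dims="study")
    mu = pm.Deterministic("mu", pop_mu + mu_offset * pop_mu_sigma, dims="study")

    # Study level standard deviations; Gamma keeps them away from zero
    sigma = pm.Gamma("sigma", alpha=2.0, beta=1.0, dims="study")

    # Study 1 likelihood
    cdf1 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[0], sigma=sigma[0]), c1))
    cdf1 = pm.math.concatenate([np.zeros(1), cdf1, np.ones(1)])
    probs1 = pm.Deterministic("probs1", cdf1[1:] - cdf1[:-1], dims="bin1")
    pm.Multinomial("counts1", p=probs1, n=counts1.sum(), observed=counts1, dims="bin1")

    # Study 2 likelihood
    cdf2 = pm.math.exp(pm.logcdf(pm.Normal.dist(mu=mu[1], sigma=sigma[1]), c2))
    cdf2 = pm.math.concatenate([np.zeros(1), cdf2, np.ones(1)])
    probs2 = pm.Deterministic("probs2", cdf2[1:] - cdf2[:-1], dims="bin2")
    pm.Multinomial("counts2", p=probs2, n=counts2.sum(), observed=counts2, dims="bin2")
```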
We can see that despite our efforts, we still get some divergences. Plotting the samples and highlighting the divergences suggests (from the top left subplot) that our model is suffering from the funnel problem.
Any evidence for differences in study-level means or standard deviations?
No compelling evidence for differences between the study level means and standard deviations.
We also obtain population level estimates of the mean parameter. There is no population level estimate of sigma in this reparameterised model.
Another possible solution would be to make independent inferences about the study level parameters from group 1 and group 2, and then look for any evidence that these differ. Taking this approach works just fine, with no divergences in sight, although it drifts away from our core goal of making population level inferences. Rather than fully work through this example, we include the code in case it is useful to anyone's use case.
In theory, the method we're using is quite general: it only requires the cutpoints, the bin counts, and a parametric distribution whose cumulative distribution function we can evaluate.
We will now empirically verify that the parameters of other distributions are recoverable using the same methods. We will approximate the distribution of Body Mass Index (BMI) from 2 hypothetical (simulated) studies.
In the first study, the fictional researchers used a set of thresholds that gives them many fine-grained BMI categories. The second set of researchers used a categorisation scheme recommended by the Hospital Authority of Hong Kong.
We assume the true underlying BMI distribution is Gumbel distributed with mu and beta parameters of 20 and 4, respectively.
This is a variation of Example 3 above. The only changes are that the latent Gaussian distribution is swapped for a Gumbel distribution, and that the priors are now placed over the Gumbel's mu and beta parameters, as sketched below.
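The sketch below shows just that structural swap. The BMI cutpoints, bin counts, and priors are illustrative placeholders, and for brevity only a single study's likelihood is shown; in practice you would repeat the likelihood for each study's cutpoints, as in the Gaussian examples.

```python
# Hypothetical BMI cutpoints and bin counts for one of the studies
bmi_cutpoints = np.array([15.0, 18.5, 23.0, 25.0, 30.0, 35.0])
bmi_counts = np.array([10, 110, 380, 240, 160, 70, 30])

with pm.Model() as model_bmi:
    # Priors over the parameters of the latent Gumbel distribution
    mu = pm.Normal("mu", mu=20.0, sigma=10.0)
    beta = pm.HalfNormal("beta", sigma=10.0)

    # The only structural change from the Gaussian examples: the Gumbel CDF
    cdf = pm.math.exp(pm.logcdf(pm.Gumbel.dist(mu=mu, beta=beta), bmi_cutpoints))
    cdf = pm.math.concatenate([np.zeros(1), cdf, np.ones(1)])
    probs = pm.Deterministic("probs", cdf[1:] - cdf[:-1])

    pm.Multinomial("bmi_counts", p=probs, n=bmi_counts.sum(), observed=bmi_counts)
```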
We can see that we were able to do a good job of recovering the known parameters of the underlying BMI population.
If we were interested in testing whether there were any differences in the BMI distributions in Study 1 and Study 2, then we could simply take the model in Example 6 and adapt it to operate with our desired target distribution, just like we did in this example.
As you can see, this method for recovering known parameters of Gaussian and non-Gaussian distributions works pretty well. While these examples have been applied to synthetic data, doing this kind of parameter recovery study is crucial: if we could not recover population level parameters from counts when we know the ground truth, that would indicate the approach is not trustworthy. The various parameter recovery examples demonstrate that we can in fact accurately recover population level parameters from binned, and differently binned, data.
A key technical point to note here is that when we pass in the observed counts, they must be in the same order as the bins defined by the cutpoints (i.e. the order in which the CDF is evaluated). Not shown here are experiments where we scrambled the counts' order; there, the estimates of the underlying distribution parameters were incorrect.
We have presented a range of different examples here which make clear that the general approach can be adapted easily to the particular situation or research questions being faced. These approaches should be easily adaptable to novel but related data science situations.
:::{include} ../page_footer.md
:::