(factor_analysis)=

# Factor analysis

:::{post} 19 Mar, 2022
:tags: factor analysis, matrix factorization, PCA
:category: advanced, how-to
:author: Chris Hartl, Christopher Krapu, Oriol Abril-Pla, Erik Werner
:::

Factor analysis is a widely used probabilistic model for identifying low-rank structure in multivariate data as encoded in latent variables. It is very closely related to principal components analysis, and differs only in the prior distributions assumed for these latent variables. It is also a good example of a linear Gaussian model as it can be described entirely as a linear transformation of underlying Gaussian variates. For a high-level view of how factor analysis relates to other models, you can check out this diagram originally published by Ghahramani and Roweis.

:::{include} ../extra_installs.md
:::

## Simulated data generation

To work through a few examples, we'll first generate some data. The data will not follow the exact generative process assumed by the factor analysis model, as the latent variates will not be Gaussian. We'll assume that we have an observed data set of $N$ observations of $d$ variables, which are in fact noisy linear functions of $k_{true}$ latent variables.

The next code cell generates the data by creating a latent variable array `M` and a linear transformation `Q`. The matrix product $QM$ is then perturbed with additive Gaussian noise whose scale is controlled by the standard-deviation parameter `err_sd`.
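A minimal sketch of such a generating process is shown below. The array sizes, the random seed, and the particular non-Gaussian distribution used for `M` are illustrative assumptions rather than the original choices; the text only fixes the names `M`, `Q`, and `err_sd` and the use of four latent factors.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen only for reproducibility of this sketch

n = 250       # number of observations (illustrative)
d = 10        # number of observed variables (illustrative)
k_true = 4    # number of true latent factors (the text reports four)
err_sd = 2.0  # standard deviation of the additive noise (value assumed)

# Non-Gaussian latent variates: one row per latent factor, one column per observation
M = rng.binomial(1, 0.25, size=(k_true, n))
# Linear transformation mapping the k_true factors to the d observed variables
Q = rng.exponential(scale=2.0, size=(d, k_true)) * rng.binomial(1, 0.75, size=(d, k_true))
# Observed data: noisy linear function of the latent variables.
# Y is stored with one column per observation, matching the X_(d,n) convention used below.
Y = Q @ M + rng.normal(scale=err_sd, size=(d, n))
```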

Because of the way we have generated the data, the covariance matrix of the $d$ observed variables will be dominated by a low-rank term proportional to $QQ^T$, plus diagonal noise. The fundamental assumption of PCA and factor analysis is that $QQ^T$ is not full rank. We can see hints of this if we plot the empirical covariance matrix:
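One possible way to produce such a plot, using the `Y` from the sketch above (where each column is one observation), is:

```python
import matplotlib.pyplot as plt

# Empirical d x d covariance across the d observed variables
plt.imshow(np.cov(Y), cmap="viridis")
plt.colorbar(label="covariance")
plt.title("Empirical covariance of the simulated data")
plt.show()
```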

If you squint long enough, you may be able to glimpse a few places where distinct columns are likely linear functions of each other.

## Model

Probabilistic PCA (PPCA) and factor analysis (FA) are a common source of topics on PyMC Discourse, where several posts handle different aspects of the problem.

### Direct implementation

The model for factor analysis is the probabilistic matrix factorization

$X_{(d,n)}|W_{(d,k)}, F_{(k,n)} \sim \mathcal{N}(WF, \Psi)$

with $\Psi$ a diagonal matrix. Subscripts denote the dimensionality of the matrices. Probabilistic PCA is a variant that sets $\Psi = \sigma^2 I$. A basic implementation (taken from this gist) is shown in the next cell. Unfortunately, it has undesirable properties for model fitting.
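A sketch of such a direct implementation is below. The priors, the dimension names, and the choice of $k=2$ fitted factors are assumptions made for illustration; the scalar `sigma` corresponds to the PPCA-style $\Psi = \sigma^2 I$.

```python
import arviz as az
import pymc as pm

k = 2  # number of latent factors to fit (the text later uses two)

coords = {
    "observed_dim": np.arange(d),  # the d observed variables
    "latent_dim": np.arange(k),    # the k fitted latent factors
    "observation": np.arange(n),   # the n observations
}

with pm.Model(coords=coords) as PPCA:
    W = pm.Normal("W", dims=("observed_dim", "latent_dim"))
    F = pm.Normal("F", dims=("latent_dim", "observation"))
    sigma = pm.HalfNormal("sigma", 1.0)
    X = pm.Normal("X", mu=W @ F, sigma=sigma, observed=Y, dims=("observed_dim", "observation"))

    # Expect convergence warnings here; that is the point of this section.
    trace = pm.sample(tune=2000, random_seed=42)
```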

At this point, there are already several warnings regarding failed convergence checks. We can see further problems in the trace plot below. This plot shows the path taken by each sampler chain for a single entry in the matrix $W$ as well as the average evaluated over samples for each chain.
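One way to produce this kind of plot with ArviZ, for an arbitrarily chosen entry of $W$, is sketched below.

```python
# Trace plot of a single entry of W across all chains
az.plot_trace(trace, var_names=["W"], coords={"observed_dim": [1], "latent_dim": [1]})

# Per-chain posterior means for the same entry
trace.posterior["W"].sel(observed_dim=1, latent_dim=1).mean(dim="draw")
```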

Each chain appears to have a different sample mean, and we can also see a great deal of autocorrelation within each chain, manifesting as long-range trends over the sampling iterations.

One of the primary drawbacks of this model formulation is its lack of identifiability. With this model representation, only the product $WF$ matters for the likelihood of $X$, so $P(X|W, F) = P(X|W\Omega, \Omega^{-1}F)$ for any invertible matrix $\Omega$. While the priors on $W$ and $F$ constrain $|\Omega|$ to be neither too large nor too small, factors and loadings can still be rotated, reflected, and/or permuted without changing the model likelihood. Expect this to happen between runs of the sampler, or even for the parametrization to "drift" within a single run, and to produce the highly autocorrelated $W$ trace plot seen above.

### Alternative parametrization

This can be fixed by constraining the form of $W$ to be:

- Lower triangular
- Positive and increasing values along the diagonal

We can adapt `expand_block_triangular` to fill out a non-square matrix. This function mimics `pm.expand_packed_triangular`, but while the latter only works on packed versions of square matrices (i.e. $d=k$ in our model), the former can also be used with non-square matrices.
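A sketch of such a helper is shown below; the exact signature and index bookkeeping are assumptions, but the idea is simply to scatter a packed vector of entries into the lower-triangular block of a $d \times k$ matrix.

```python
import pytensor.tensor as pt


def expand_block_triangular(d, k, packed, diag=None):
    # Fill a d x k (d >= k) matrix whose lower-triangular block is taken from the
    # flat vector `packed`. If `diag` is given, it is placed on the main diagonal
    # and `packed` only fills the strictly lower-triangular entries.
    out = pt.zeros((d, k))
    if diag is None:
        idxs = np.tril_indices(d, m=k)        # lower triangle, including the diagonal
        out = pt.set_subtensor(out[idxs], packed)
    else:
        idxs = np.tril_indices(d, k=-1, m=k)  # strictly lower triangle
        out = pt.set_subtensor(out[idxs], packed)
        out = pt.set_subtensor(out[(np.arange(k), np.arange(k))], diag)
    return out
```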

We'll also define another function which helps create a diagonal matrix with increasing entries along the main diagonal.
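One way to obtain such a matrix is to take the cumulative sum of `HalfNormal` variables, which is positive and increasing by construction. The helper below (its name and prior are illustrative assumptions) packages that idea as a $k \times k$ diagonal matrix.

```python
def make_increasing_diag(k, name="W_z"):
    # Diagonal matrix whose entries are positive and increasing: the cumulative
    # sum of HalfNormal variables. Must be called inside a model context.
    z = pm.HalfNormal(name, 1.0, shape=k)
    return pt.diag(pt.cumsum(z))
```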

With these modifications, we remake the model and run the MCMC sampler again.
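A possible version of the constrained model is sketched below, wrapping the construction of $W$ in a small helper so it can be reused later. Priors, variable names, and the model name are illustrative assumptions.

```python
def makeW(d, k, dim_names):
    # W as a lower-triangular matrix with a positive, increasing diagonal:
    # unconstrained Normals fill the strictly lower-triangular block, and the
    # columns are rescaled by an increasing positive diagonal.
    n_od = int(k * d - k * (k - 1) / 2 - k)  # number of strictly lower-triangular entries
    b = pm.Normal("W_b", 0.0, 1.0, shape=n_od)
    L = expand_block_triangular(d, k, b, pt.ones(k))
    return pm.Deterministic("W", L @ make_increasing_diag(k), dims=dim_names)


with pm.Model(coords=coords) as PPCA_identified:
    W = makeW(d, k, ("observed_dim", "latent_dim"))
    F = pm.Normal("F", dims=("latent_dim", "observation"))
    sigma = pm.HalfNormal("sigma", 1.0)
    X = pm.Normal("X", mu=W @ F, sigma=sigma, observed=Y, dims=("observed_dim", "observation"))

    trace_identified = pm.sample(tune=2000, random_seed=42)
```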

$W$ (and $F$!) now have entries whose posterior distributions agree between sampler chains, although it's apparent that some autocorrelation remains.

Because the $k \times n$ parameters in $F$ all need to be sampled, sampling can become quite expensive for very large $n$. In addition, the link between an observed data point $X_i$ and an associated latent value $F_i$ means that streaming inference with mini-batching cannot be performed.

This scalability problem can be addressed analytically by integrating $F$ out of the model. By doing so, we postpone any calculation for individual values of $F_i$ until later; hence, this approach is sometimes described as amortized inference. However, it fixes the prior on $F$, allowing for less modeling flexibility. In keeping with $F_{ij} \sim \mathcal{N}(0, 1)$ we have:

$X|W, F \sim \mathcal{N}(WF, \sigma^2 I).$

We can therefore write $X$ as

$X = WF + \sigma I \epsilon,$

where $\epsilon \sim \mathcal{N}(0, I)$. Fixing $W$ but treating $F$ and $\epsilon$ as random variables, both $\sim\mathcal{N}(0, I)$, we see that $X$ is the sum of two multivariate normal variables, with means $0$ and covariances $WW^T$ and $\sigma^2 I$, respectively. It follows that

$X|W \sim \mathcal{N}(0, WW^T + \sigma^2 I )$.
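A sketch of the marginalized model follows directly from this expression, reusing the `makeW` helper defined above; the model and variable names are again illustrative.

```python
with pm.Model(coords=coords) as PPCA_marginal:
    W = makeW(d, k, ("observed_dim", "latent_dim"))
    sigma = pm.HalfNormal("sigma", 1.0)
    cov = W @ W.T + sigma**2 * pt.eye(d)
    # Each column of Y is one observation, so the transposed data is passed as (n, d)
    X = pm.MvNormal("X", mu=pt.zeros(d), cov=cov, observed=Y.T)

    trace_marginal = pm.sample(tune=2000, random_seed=42)
```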

Unfortunately, sampling from this model is very slow, presumably because calculating the log-probability of the `MvNormal` requires inverting the covariance matrix. However, the explicit integration of $F$ also enables batching the observations for faster computation with ADVI and FullRankADVI approximations.
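A sketch of a mini-batched ADVI fit of the same marginalized model is below; the batch size, number of iterations, and naming are arbitrary choices, and `total_size` rescales the mini-batch likelihood to the full data set.

```python
with pm.Model(coords=coords) as PPCA_advi:
    W = makeW(d, k, ("observed_dim", "latent_dim"))
    sigma = pm.HalfNormal("sigma", 1.0)
    cov = W @ W.T + sigma**2 * pt.eye(d)

    Y_mb = pm.Minibatch(Y.T, batch_size=50)  # stream mini-batches of observations
    X = pm.MvNormal("X", mu=pt.zeros(d), cov=cov, observed=Y_mb, total_size=n)

    approx = pm.fit(n=20_000, method="advi")  # or method="fullrank_advi"
    trace_advi = approx.sample(1000)
```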

### Results

When we compare the posteriors calculated using MCMC and VI, we find that (at least for the specific parameter we are looking at) the two distributions are close, but they do differ in their means. The two MCMC posteriors agree with each other quite well, and the ADVI estimate is not far off.
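A compact way to overlay the estimates from the different fits is a forest plot. The sketch below compares $\sigma$, which is available in all three traces; a single entry of $W$ can be compared the same way whenever it is stored in the variational samples.

```python
az.plot_forest(
    [trace_identified, trace_marginal, trace_advi],
    model_names=["MCMC (explicit F)", "MCMC (marginalized)", "ADVI"],
    var_names=["sigma"],
    combined=True,
)
```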

### Post-hoc identification of F

The matrix $F$ is typically of interest for factor analysis, and is often used as a feature matrix for dimensionality reduction. However, $F$ has been marginalized away in order to make fitting the model easier, and now we need it back. Transforming

$X|W, F \sim \mathcal{N}(WF, \sigma^2 I)$

into

$(W^TW)^{-1}W^T X|W,F \sim \mathcal{N}(F, \sigma^2(W^TW)^{-1})$

we have represented $F$ as the mean of a multivariate normal distribution with a known covariance matrix. Using the prior $ F \sim \mathcal{N}(0, I) $ and updating according to the expression for the conjugate prior, we find

$ F | X, W \sim \mathcal{N}(\mu_F, \Sigma_F) $,

where

$\mu_F = \left(I + \sigma^{-2} W^T W\right)^{-1} \sigma^{-2} W^T X$

$\Sigma_F = \left(I + \sigma^{-2} W^T W\right)^{-1}$

For each value of $W$ and $X$ found in the trace, we now draw a sample from this distribution.
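A plain-NumPy sketch of this conditional sampling step is shown below, using the posterior from the marginalized MCMC fit above; the variable names and the choice of one draw of $F$ per posterior sample of $(W, \sigma)$ are made for illustration.

```python
post = trace_marginal.posterior.stack(sample=("chain", "draw"))
W_samples = post["W"].values          # shape (d, k, n_samples)
sigma_samples = post["sigma"].values  # shape (n_samples,)

F_samples = []
for i in range(W_samples.shape[-1]):
    W_i = W_samples[..., i]
    sigma_i = sigma_samples[i]
    Sigma_F = np.linalg.inv(np.eye(k) + (W_i.T @ W_i) / sigma_i**2)  # (I + sigma^-2 W^T W)^-1
    mu_F = (Sigma_F @ W_i.T @ Y) / sigma_i**2                        # shape (k, n)
    # One draw of F for every observation, all sharing the conditional covariance Sigma_F
    chol = np.linalg.cholesky(Sigma_F)
    F_samples.append(mu_F + chol @ rng.normal(size=(k, n)))

F_samples = np.stack(F_samples, axis=-1)  # shape (k, n, n_samples)
```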

### Comparison to original data

To check how well our model has captured the original data, we will test how well we can reconstruct it from the sampled $W$ and $F$ matrices. For each element of $Y$ we plot the mean and standard deviation of $X = W F$, which is the reconstructed value based on our model.
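A sketch of such a comparison, reusing the `W_samples` and `F_samples` arrays from the previous step, might look as follows.

```python
# Reconstruct X = W F for every posterior sample and compare with the observed data
X_recon = np.einsum("dks,kns->dns", W_samples, F_samples)  # shape (d, n, n_samples)
X_mean = X_recon.mean(axis=-1)
X_sd = X_recon.std(axis=-1)

plt.errorbar(Y.ravel(), X_mean.ravel(), yerr=X_sd.ravel(), fmt=".", alpha=0.3)
lims = [Y.min(), Y.max()]
plt.plot(lims, lims, color="k", linestyle="--")  # perfect-reconstruction line
plt.xlabel("observed value")
plt.ylabel("reconstructed value (posterior mean ± sd)")
plt.show()
```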

We find that our model does a decent job of capturing the variation in the original data, despite only using two latent factors instead of the actual four.

## Authors

- Authored by chartl on May 6, 2019
- Updated by Christopher Krapu on April 4, 2021
- Updated by Oriol Abril-Pla to use PyMC v4 and xarray-einstats in March 2022
- Updated by Erik Werner in December 2023 (pymc-examples#612)

## Watermark

:::{include} ../page_footer.md
:::