Beyond the Dirichlet-Multinomial Conjugacy

Machine Learning and Data Mining Algorithms are often designed using probabilistic models. In particular, some of the most popular and best performing methods are Probabilistic Graphical Models (PGM). These are Bayesian Statistical treatments of variables where conditional dependencies are explicitly modeled.

Figure: Example of a Graphical Model taken from CoBaFi

Many of these models make use of latent (hidden) variables that are not directly observed. To make inference simpler, it is common to collapse (integrate) out these variables using the inherent properties of conjugacy. Within the Graphical Model literature, the most common conjugacy by far is the Dirichlet-Multinomial. It is easy to find many introductions and tutorials on this particular distribution pairing.
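To make the idea of collapsing concrete, here is a minimal sketch of the collapsed Dirichlet-Multinomial predictive: with the multinomial parameters integrated out, the probability of the next draw depends only on the observed counts and the Dirichlet hyperparameters. The function name, the toy observations, and the symmetric prior of 1.0 are all made up for illustration.

```python
from collections import Counter


def collapsed_predictive(counts, alpha):
    """P(next draw = k | counts), with the multinomial parameters integrated out.

    counts[k] is how often category k was observed; alpha[k] is the
    Dirichlet hyperparameter (pseudo-count) for category k.
    """
    total = sum(counts) + sum(alpha)
    return [(n_k + a_k) / total for n_k, a_k in zip(counts, alpha)]


# Hypothetical toy data: five draws over three categories.
observations = ["a", "b", "a", "c", "a"]
categories = ["a", "b", "c"]
tally = Counter(observations)
counts = [tally[c] for c in categories]

alpha = [1.0, 1.0, 1.0]  # assumed symmetric Dirichlet prior
probs = collapsed_predictive(counts, alpha)
print(dict(zip(categories, probs)))
```

No explicit multinomial parameter vector ever appears in the code; that is what makes collapsed samplers convenient.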

Yet, when we talk about Statistics in most contexts outside of Machine Learning, people assume either Uniform Distributions (Equal Probabilities) or Gaussian (Normal aka Bell Curve) Distributions. However, within a Bayesian Statistical treatment of Machine Learning, these distributions are rarely used. You might assume this is due to a lack of conjugacy properties, but that is not the case for these distributions. As far as I can tell from discussions with experts in the field, it is more a function of popularity. There are lots of datasets containing discrete values, and for those a Dirichlet-Multinomial conjugacy makes sense to use. In cases where data exhibits a more Gaussian distribution, the usual assumption is that the model will approximate the distribution in the limit, and that it should be close enough to just use a multinomial. Yet, since there is a conjugacy for the Gaussian Distribution, it makes sense to use this instead, especially if you have continuous valued data. It seems that there is just a lack of familiarity with this conjugacy property due to its absence in much of the literature.

Figure (PDF graphs): You can approximate a Gaussian with a multinomial by discretizing continuous data into bins, but it often makes more sense to use the actual distribution.
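As a rough illustration of the binning approach, the sketch below computes the probability mass a Gaussian assigns to each bin, which is the parameter vector a multinomial would have to learn to mimic it. The bin edges and the standard normal parameters are arbitrary choices for the example, and the helper names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_bin_probs(mu, sigma, edges):
    """Mass the Gaussian assigns to each bin defined by sorted edges.

    Two open-ended tail bins are included, so the result has
    len(edges) + 1 entries and sums to 1.
    """
    cdf = [0.0] + [normal_cdf(e, mu, sigma) for e in edges] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


# Assumed example: a standard normal cut at integer points.
edges = [-2.0, -1.0, 0.0, 1.0, 2.0]
bin_probs = gaussian_bin_probs(0.0, 1.0, edges)
print([round(p, 3) for p in bin_probs])
```

Here the multinomial needs six bin probabilities (five of them free) to approximate what the Gaussian describes with a mean and a variance, and the approximation only improves by adding more bins.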

The treatment of Uniform Distributions in a Graphical Model context is quite straightforward, so I won't talk about it here. However, using a conjugate prior for a Gaussian distribution is not any more complex than using a Dirichlet for a Multinomial/Categorical. The conjugate prior of the Multivariate Gaussian Distribution (with unknown mean and precision) is the Gaussian-Wishart Distribution. In an MCMC sampler, you can simply collapse out latent Gaussian variables using a Gaussian-Wishart. As you would with a Dirichlet-Multinomial model, you explicitly model the hyperparameters of your Multivariate Gaussian variables and sample accordingly.
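For reference, the standard Gaussian-Wishart update can be written out as follows. The notation is an assumption on my part (a common textbook parameterization, not necessarily the one used in the tutorial or paper mentioned in this post): the prior mean is mu_0 with strength kappa_0, the Wishart has scale matrix W_0 and nu_0 degrees of freedom, and the data x_1, ..., x_n are d-dimensional.

```latex
% Prior over the mean \mu and precision matrix \Lambda
\mu \mid \Lambda \sim \mathcal{N}\!\left(\mu_0, (\kappa_0 \Lambda)^{-1}\right), \qquad
\Lambda \sim \mathcal{W}(W_0, \nu_0)

% Sufficient statistics of the n observations
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
S = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}

% Posterior hyperparameters (same Gaussian-Wishart family)
\kappa_n = \kappa_0 + n, \qquad
\nu_n = \nu_0 + n, \qquad
\mu_n = \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_0 + n}

W_n^{-1} = W_0^{-1} + S
  + \frac{\kappa_0 n}{\kappa_0 + n} (\bar{x} - \mu_0)(\bar{x} - \mu_0)^{\top}

% Collapsed predictive: a multivariate Student-t with scale matrix as shown
p(x^{*} \mid x_{1:n}) = t_{\nu_n - d + 1}\!\left(x^{*} \,\middle|\, \mu_n,\;
  \frac{\kappa_n + 1}{\kappa_n (\nu_n - d + 1)}\, W_n^{-1}\right)
```

As with the Dirichlet-Multinomial, the update only touches counts and sums of the data, so a sampler can add or remove a point from a cluster by adjusting these statistics.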

If you are interested in how to do this and want a more formal derivation, Tom Haines wrote a nice tutorial on how to collapse out Gaussian variables in a Bayesian setting. For an example of an actual Graphical Model that uses this conjugate prior, you can see a recent paper Alex Beutel, Christos Faloutsos, Alex Smola, and I wrote on Collaborative Filtering: CoBaFi. The Haines Tutorial was an invaluable reference when implementing our sampler. Our model for a Recommendation System clusters users and items while assuming Gaussian distributions within clusters. It should serve as a good example of using Multivariate Gaussians in Probabilistic Graphical Models and show that you can easily go beyond the Dirichlet-Multinomial Conjugacy.