AMMAI 2015 Paper reading: Latent Dirichlet Allocation

D. Blei, A. Ng, and M. Jordan.

Introduction

Latent Dirichlet Allocation (LDA) is a technique for finding latent topics that improves on an earlier method, Probabilistic Latent Semantic Indexing (PLSI). The main deficit of PLSI is that it lacks a generative model for documents: it is good at modeling the training data, but has no natural way to handle unseen documents. LDA solves this problem, and its model is also considerably smaller than PLSI's, because the number of PLSI parameters grows with the number of training documents.

Method

This is the graphical model representation of LDA. w is a word, z is the topic of that word, M is the number of documents, and θ is the per-document topic mixture, sampled from a Dirichlet distribution. In this model, we train two parameters, α and β. α determines the Dirichlet distribution from which θ is drawn. β parameterizes the k multinomial distributions over words, one per topic, from which each word w is drawn given its topic z. The marginal distribution of a document can be formulated as follows:
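In the notation of the original paper, and writing N for the number of words in a document (a symbol not introduced above), this marginal is p(w | α, β) = ∫ p(θ | α) ∏_{n=1..N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) dθ. The integral generally has no closed form. The sketch below is not the paper's inference method (the paper uses variational inference). It is only a naive Monte Carlo illustration of what the formula means: draw θ from the Dirichlet, sum out the topic of each word, and average. The function names, the toy α and β, and the use of integer word ids are all assumptions made for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def document_likelihood_given_theta(doc, theta, beta):
    """prod_n sum_z theta[z] * beta[z][w_n]: topics are summed out for a fixed theta."""
    likelihood = 1.0
    for w in doc:
        likelihood *= sum(theta[z] * beta[z][w] for z in range(len(theta)))
    return likelihood

def estimate_marginal(doc, alpha, beta, n_samples=20000, seed=0):
    """Monte Carlo estimate of p(doc | alpha, beta), averaging over theta ~ Dirichlet(alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        total += document_likelihood_given_theta(doc, theta, beta)
    return total / n_samples

# Hypothetical toy model: k = 2 topics, vocabulary of 4 word ids (0..3).
alpha = [0.5, 0.5]
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 puts all its mass on words 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # topic 1 puts all its mass on words 2 and 3
]

# For a one-word document the marginal is sum_z E[theta_z] * beta[z][w],
# which is 0.5 * 0.5 = 0.25 here; the estimate should land near that value.
print(estimate_marginal([0], alpha, beta))
print(estimate_marginal([0, 1, 0], alpha, beta))
```

This estimator becomes inefficient for long documents, since the per-sample likelihood varies widely across draws of θ. That is one reason approximate inference methods are used in practice.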