D. Blei, A. Ng, and M. Jordan.
Introduction
Latent Dirichlet Allocation (LDA) improves on an earlier method for finding latent topics, Probabilistic Latent Semantic Indexing (PLSI). The main weakness of PLSI is that it is not a fully generative model of documents: it models the training data well, but gives no natural way to assign a probability to unseen documents. LDA addresses this problem, and its model is much smaller than PLSI's.
Method
This is the graphical model representation of LDA. w is a word, z is the topic assigned to that word, M is the number of documents, and θ is the per-document topic mixture, sampled from a Dirichlet distribution. The model has two parameters to train, α and β. α parameterizes the Dirichlet distribution from which θ is drawn. β parameterizes the k multinomial distributions over words, one per topic, from which each word w is drawn given its topic z. The marginal distribution of a document can be written as follows:
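For reference, this is a sketch of the marginal in the notation of the original LDA paper. N (the number of words in the document) and the subscript n (the word position) are symbols not introduced in the text above.

```latex
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right)
    d\theta
```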
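This result shows that no parameter has a size that depends on the number of training documents: α has one entry per topic and β has one row per topic. The document-specific information is carried by θ, which is a latent variable integrated out of the marginal rather than a parameter stored for every training document.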
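The generative process described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `generate_document`, the toy sizes, and the fixed document length `n_words` are choices made here, not taken from the post.

```python
import numpy as np


def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process.

    alpha:   shape (k,), parameter of the Dirichlet over topic mixtures.
    beta:    shape (k, V), row i is the word distribution of topic i.
    n_words: document length (fixed here by assumption).
    """
    k, vocab_size = beta.shape
    theta = rng.dirichlet(alpha)                    # topic mixture of this document
    topics = rng.choice(k, size=n_words, p=theta)   # one topic z per word position
    # Each word is drawn from the multinomial of its assigned topic.
    words = np.array([rng.choice(vocab_size, p=beta[z]) for z in topics])
    return theta, topics, words


rng = np.random.default_rng(0)
k, vocab_size = 3, 8                                # toy sizes, for illustration only
alpha = np.full(k, 0.5)
beta = rng.dirichlet(np.ones(vocab_size), size=k)   # random topics, shape (k, V)

theta, topics, words = generate_document(alpha, beta, n_words=10, rng=rng)

# The trained parameters are alpha and beta only: k + k * V numbers,
# no matter how many documents are generated or observed.
n_parameters = alpha.size + beta.size
print(theta, topics, words, n_parameters)
```

Calling `generate_document` repeatedly produces more documents without adding parameters, which is the point made about model size.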
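In the training stage, an EM algorithm is used to maximize the likelihood of the data, as with PLSI (also known as PLSA). Because the posterior over θ and z cannot be computed exactly, the paper uses a variational approximation in the E step.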
First, expand the first equation (the marginal distribution of a document) in terms of the model parameters:
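As a sketch of what this expansion looks like in the original paper's notation: substitute the Dirichlet density for p(θ | α), θ_i for p(z_n = i | θ), and β_ij for the probability of word j under topic i. Here V (the vocabulary size) and w_n^j (1 if the n-th word is vocabulary item j, otherwise 0) are symbols assumed from the paper, not defined in the text above.

```latex
p(\mathbf{w} \mid \alpha, \beta)
  = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}
    \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right)
         \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V}
                (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
```

The coupling between θ and β inside the sum over topics is what makes this integral intractable to compute exactly.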
The authors refer to another work to obtain the likelihood:
Then apply the EM algorithm. E step:
M step:
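The two steps can be sketched in NumPy as follows, following the variational updates given in the LDA paper: the E step fits per-document variational parameters γ (over topics) and φ (per-word topic responsibilities), and the M step re-estimates β from the φ values. Several things here are assumptions for illustration: the names `e_step`, `m_step_beta` and `digamma`, the fixed iteration count, the small `smoothing` constant, and keeping α fixed (the paper updates α with a Newton-Raphson step, omitted here). The `digamma` helper is a plain series approximation used only to avoid an extra dependency.

```python
import numpy as np


def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence + asymptotic series)."""
    x = np.asarray(x, dtype=float)
    shift = sum(1.0 / (x + i) for i in range(6))   # psi(x) = psi(x + 6) - shift
    y = x + 6.0
    return (np.log(y) - 1 / (2 * y) - 1 / (12 * y**2)
            + 1 / (120 * y**4) - 1 / (252 * y**6) - shift)


def e_step(doc, alpha, beta, n_iter=50):
    """Variational E step for one document given as an array of word indices."""
    k = alpha.shape[0]
    n_words = len(doc)
    gamma = alpha + n_words / k                    # initialisation used in the paper
    phi = np.full((n_words, k), 1.0 / k)
    for _ in range(n_iter):
        # phi_ni is proportional to beta[i, w_n] * exp(digamma(gamma_i)); the
        # digamma(sum(gamma)) term is the same for every topic and cancels
        # when each row is normalised.
        log_phi = np.log(beta[:, doc].T) + digamma(gamma)
        log_phi -= log_phi.max(axis=1, keepdims=True)
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)            # gamma_i = alpha_i + sum_n phi_ni
    return gamma, phi


def m_step_beta(docs, phis, k, vocab_size, smoothing=1e-12):
    """M step for beta: beta_ij is proportional to sum over d, n of phi_dni * [w_dn == j]."""
    counts = np.full((vocab_size, k), smoothing)   # tiny constant avoids log(0) later
    for doc, phi in zip(docs, phis):
        np.add.at(counts, np.asarray(doc), phi)    # accumulate phi rows per word id
    beta = counts.T
    return beta / beta.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
k, vocab_size = 2, 5                               # toy sizes, for illustration only
alpha = np.full(k, 0.5)                            # kept fixed in this sketch
beta = rng.dirichlet(np.ones(vocab_size), size=k)
docs = [np.array([0, 1, 1, 4]), np.array([2, 3, 3])]

for _ in range(10):                                # a few EM iterations
    results = [e_step(d, alpha, beta) for d in docs]
    beta = m_step_beta(docs, [phi for _, phi in results], k, vocab_size)
print(beta)
```

Each φ row sums to 1, so γ always sums to the sum of α plus the document length, and each row of the updated β is a proper distribution over the vocabulary.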
Result
The LDA curves lie at the bottom of the diagrams above, which means that LDA has the lowest perplexity of the compared models. Lower perplexity indicates that a model generalizes better to unseen documents.
These diagrams show that LDA achieves the highest accuracy on the text classification task.








