Thursday, June 4, 2015

Latent Dirichlet Allocation

D. Blei, A. Ng, and M. Jordan.


Introduction

Latent Dirichlet Allocation (LDA) is a technique for finding latent topics that improves on an earlier method, Probabilistic Latent Semantic Indexing (PLSI). The main deficit of PLSI is that it is not a full generative model at the document level: it models the training documents well, but it has no natural way to assign a probability to unseen documents. LDA solves this problem, and its model is much smaller than PLSI's because the number of parameters does not grow with the number of training documents.

Method

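In the graphical model of LDA, w is a word, z is the topic of that word, N is the number of words in a document, M is the number of documents, and θ is the per-document topic mixture, sampled from a Dirichlet distribution. In this model, we train two parameters, α and β. α determines the Dirichlet distribution from which θ is drawn. β parameterizes the k multinomial distributions over words, one per topic, from which each word w is drawn given its topic z. The marginal distribution of a document can be written as:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta,$$

and the probability of a corpus D of M documents is

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d.$$

These expressions show that no parameter grows with the size of the input: α has k entries and β has k × V entries, where V is the vocabulary size, regardless of how many documents there are. The per-document information is carried by the latent variable θ, which is integrated out rather than stored as a parameter.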
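The generative story behind these formulas can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names, the toy values of alpha and beta, and the fixed document length are my own assumptions (the paper draws the length from a Poisson distribution).

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, n_words, rng):
    """Generate one document: theta ~ Dir(alpha), z_n ~ Mult(theta), w_n ~ Mult(beta[z_n])."""
    k = len(alpha)
    vocab_size = len(beta[0])
    theta = sample_dirichlet(alpha, rng)
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(k), weights=theta)[0]             # topic of this word
        w = rng.choices(range(vocab_size), weights=beta[z])[0]  # word id given the topic
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy setting (assumed values): k = 2 topics over a vocabulary of V = 4 word ids.
alpha = [0.5, 0.5]
beta = [[0.7, 0.2, 0.1, 0.0],
        [0.0, 0.1, 0.2, 0.7]]

rng = random.Random(0)
corpus = [generate_document(alpha, beta, n_words=8, rng=rng)[2] for _ in range(5)]

# The trained parameters are alpha (k numbers) and beta (k * V numbers);
# this count stays the same however many documents are generated.
n_parameters = len(alpha) + len(beta) * len(beta[0])
```

Note that theta and the topic assignments are sampled fresh for every document and then thrown away; only alpha and beta are shared across the corpus.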

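In the training stage, we use an EM algorithm to maximize the likelihood, much as in PLSI. Because the posterior over θ and z cannot be computed exactly, the E step is a variational approximation.
First, expanding the marginal distribution of a document in terms of the model parameters gives

$$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta,$$

where w_n^j is 1 if the n-th word is vocabulary item j and 0 otherwise. This integral is intractable because θ and β are coupled inside the sum over topics.
The authors draw on earlier work on variational inference to obtain a tractable lower bound on the log likelihood:

$$\log p(\mathbf{w} \mid \alpha, \beta) \ge L(\gamma, \phi; \alpha, \beta) = E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z} \mid \gamma, \phi)],$$

where q(θ, z | γ, φ) = q(θ | γ) ∏_n q(z_n | φ_n) is a variational distribution with Dirichlet parameter γ and multinomial parameters φ_n.
Then we apply the EM algorithm to this bound. E step: for each document, iterate the following updates until convergence, where Ψ is the digamma function:

$$\phi_{ni} \propto \beta_{i w_n} \exp(\Psi(\gamma_i)), \qquad \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}.$$

M step: with γ and φ fixed, re-estimate the model parameters. The topic-word distributions have a closed-form update,

$$\beta_{ij} \propto \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi_{dni}\, w_{dn}^j,$$

and α is updated with a Newton-Raphson method.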
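As a rough illustration of the training loop, here is a minimal pure-Python sketch of variational EM. It assumes the updates from the LDA paper (φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i)), γ_i = α_i + Σ_n φ_ni, and β_ij proportional to expected topic-word counts). To keep it short, α is held fixed instead of being updated by Newton-Raphson, a fixed number of iterations replaces a convergence test, and every function name and toy value here is my own assumption.

```python
import math

def digamma(x):
    """Digamma for x > 0: shift x up to >= 6 by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def e_step(doc, alpha, beta, n_iter=50):
    """Variational E step for one document (a list of word ids)."""
    k, n_words = len(alpha), len(doc)
    gamma = [a + n_words / k for a in alpha]
    phi = [[1.0 / k] * k for _ in doc]
    for _ in range(n_iter):
        for n, w in enumerate(doc):
            weights = [beta[i][w] * math.exp(digamma(gamma[i])) for i in range(k)]
            total = sum(weights)
            phi[n] = [x / total for x in weights]
        gamma = [alpha[i] + sum(row[i] for row in phi) for i in range(k)]
    return gamma, phi

def m_step_beta(docs, phis, k, vocab_size):
    """M step for beta: normalize the expected topic-word counts."""
    counts = [[0.0] * vocab_size for _ in range(k)]
    for doc, phi in zip(docs, phis):
        for n, w in enumerate(doc):
            for i in range(k):
                counts[i][w] += phi[n][i]
    new_beta = []
    for row in counts:
        row_total = sum(row)
        new_beta.append([c / row_total for c in row])
    return new_beta

def variational_em(docs, alpha, beta, n_em_iter=20):
    """Alternate E and M steps; alpha stays fixed in this sketch."""
    k, vocab_size = len(alpha), len(beta[0])
    gammas = []
    for _ in range(n_em_iter):
        results = [e_step(doc, alpha, beta) for doc in docs]
        gammas = [g for g, _ in results]
        phis = [p for _, p in results]
        beta = m_step_beta(docs, phis, k, vocab_size)
    return gammas, beta

# Toy corpus (assumed): word ids 0-1 and 2-3 tend to occur in different documents.
docs = [[0, 0, 1, 0, 1], [2, 3, 3, 2], [0, 1, 1, 0], [3, 2, 3, 3, 2]]
alpha = [0.5, 0.5]
# beta must start slightly asymmetric, otherwise the topics never separate.
beta_init = [[0.3, 0.3, 0.2, 0.2], [0.2, 0.2, 0.3, 0.3]]
gammas, beta_hat = variational_em(docs, alpha, beta_init)
```

Each row of `beta_hat` is a topic's distribution over the vocabulary, and each entry of `gammas` is the variational Dirichlet parameter for one document, which plays the role of that document's topic proportions.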




Result

The lines for LDA lie at the bottom of the diagrams above, which means that LDA has the lowest perplexity on held-out documents, i.e. it generalizes best to unseen data.
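For reference, the quantity plotted in such comparisons is usually perplexity: the exponential of the negative average per-word log likelihood on test documents, where lower is better. A minimal sketch, assuming the per-document log likelihoods log p(w_d) have already been computed by some model (the function and argument names are hypothetical):

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(- sum_d log p(w_d) / sum_d N_d) over a set of test documents."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)

# Sanity check: a model that spreads probability uniformly over V words
# assigns each word log(1 / V), so its perplexity should come out as V.
V = 1000
lengths = [10, 25, 7]
uniform_ll = [n * math.log(1.0 / V) for n in lengths]
uniform_perplexity = perplexity(uniform_ll, lengths)
```

Intuitively, a perplexity of V means the model is as uncertain as a uniform guess among V words, so a curve that sits lower on the plot belongs to a model that predicts unseen text better.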


These diagrams show that LDA achieves the highest accuracy on text classification.









