Sunday, May 31, 2015

Probabilistic Latent Semantic Indexing

Thomas Hofmann


Introduction

Latent Semantic Analysis (LSA) is a technique for finding the latent topics in a collection of documents. All we can observe is that there are some documents and that they contain some words; the topic of each document is unknown, so we need a method such as LSA to uncover these latent topics. However, LSA lacks a sound statistical foundation, so the author proposes a new method called Probabilistic Latent Semantic Analysis (PLSA) that addresses these shortcomings.

Method

The author introduces a statistical model called the 'aspect model', which is the core of PLSA. For an unobservable class (or topic) z in Z = {z_1, ..., z_K}, a word w, and a document d, we can form the following formulation:

P(d,w) = P(d)*P(w|d), where P(w|d) = sum over z in Z of P(w|z)*P(z|d)

If we can find this decomposition of P(d,w), we obtain the relationship between the latent topics, the words, and the documents. The parameters are fitted with the Expectation Maximization (EM) algorithm.
First, define the likelihood, which measures how well the fitted model explains the observed data and serves as the objective function.
Second, re-parameterize the objective function using the equivalent symmetric form P(d,w) = sum over z in Z of P(z)*P(d|z)*P(w|z).
Finally, apply the EM algorithm iteratively, alternating the two steps below until convergence.
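The objective in the first step is usually written as a log-likelihood over the observed counts. A sketch in LaTeX, assuming n(d,w) is the number of times word w occurs in document d, and that D and W (names chosen here for illustration) denote the sets of documents and words:

```latex
\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d,w) \, \log P(d,w),
\qquad
P(d,w) = \sum_{z \in Z} P(z)\, P(d \mid z)\, P(w \mid z)
```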

E-step
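For the symmetric parameterization, the E-step computes the posterior probability of each latent class for every observed (d, w) pair. A sketch in LaTeX, with z' used only as a summation index:

```latex
P(z \mid d, w) =
  \frac{P(z)\, P(d \mid z)\, P(w \mid z)}
       {\sum_{z' \in Z} P(z')\, P(d \mid z')\, P(w \mid z')}
```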

M-step
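The M-step re-estimates the parameters from the E-step posteriors, weighted by the counts n(d,w). A sketch in LaTeX for the symmetric parameterization, where each line is normalized so that the distribution sums to one:

```latex
\begin{aligned}
P(w \mid z) &\propto \sum_{d} n(d,w)\, P(z \mid d, w) \\
P(d \mid z) &\propto \sum_{w} n(d,w)\, P(z \mid d, w) \\
P(z)        &\propto \sum_{d} \sum_{w} n(d,w)\, P(z \mid d, w)
\end{aligned}
```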
The author then proposes an improved EM algorithm called tempered EM (TEM), which helps to avoid overfitting by raising the P(d|z)*P(w|z) term of the E-step posterior to a power beta <= 1.
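The EM iteration, with optional tempering, can be sketched in plain Python. This is a minimal illustration and not the author's implementation: the function names `plsa_em` and `log_likelihood`, the random initialization, the toy count matrix, and the fixed value of `beta` are all assumptions made for the example. With `beta=1.0` it reduces to ordinary EM on the symmetric form P(d,w) = sum_z P(z)*P(d|z)*P(w|z).

```python
import math
import random


def plsa_em(counts, n_topics, n_iter=50, beta=1.0, seed=0):
    """Fit P(z), P(d|z), P(w|z) to a count matrix counts[d][w] = n(d, w).

    beta=1.0 gives plain EM; beta < 1 tempers the E-step posterior (TEM).
    Initialization and stopping rule are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_dist(size):
        raw = [rng.random() + 0.1 for _ in range(size)]
        total = sum(raw)
        return [x / total for x in raw]

    p_z = [1.0 / n_topics] * n_topics
    p_d_z = [random_dist(n_docs) for _ in range(n_topics)]   # p_d_z[z][d]
    p_w_z = [random_dist(n_words) for _ in range(n_topics)]  # p_w_z[z][w]

    for _ in range(n_iter):
        acc_z = [0.0] * n_topics
        acc_d = [[0.0] * n_docs for _ in range(n_topics)]
        acc_w = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: (tempered) posterior P(z | d, w), up to normalization.
                post = [p_z[z] * (p_d_z[z][d] * p_w_z[z][w]) ** beta
                        for z in range(n_topics)]
                norm = sum(post)
                if norm == 0.0:
                    continue
                for z in range(n_topics):
                    resp = counts[d][w] * post[z] / norm
                    acc_z[z] += resp
                    acc_d[z][d] += resp
                    acc_w[z][w] += resp
        # M-step: normalize the expected counts into probabilities.
        total = sum(acc_z)
        for z in range(n_topics):
            if acc_z[z] > 0.0:
                p_d_z[z] = [x / acc_z[z] for x in acc_d[z]]
                p_w_z[z] = [x / acc_z[z] for x in acc_w[z]]
            p_z[z] = acc_z[z] / total
    return p_z, p_d_z, p_w_z


def log_likelihood(counts, p_z, p_d_z, p_w_z):
    """Sum of n(d, w) * log P(d, w) over all observed pairs."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw > 0:
                p_dw = sum(p_z[z] * p_d_z[z][d] * p_w_z[z][w]
                           for z in range(len(p_z)))
                total += n_dw * math.log(p_dw)
    return total


# Toy data: two documents about one topic, two about another.
counts = [[4, 3, 0, 0], [3, 5, 0, 1], [0, 0, 4, 4], [0, 1, 5, 3]]
p_z, p_d_z, p_w_z = plsa_em(counts, n_topics=2, n_iter=30)
tempered = plsa_em(counts, n_topics=2, n_iter=30, beta=0.8)
```

In practice beta would be chosen by monitoring performance on held-out data rather than fixed in advance; the constant 0.8 above is only a placeholder.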

Other

This is the geometry of the aspect model

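PLSA vs. LSA
LSA -> factorizes the co-occurrence matrix with SVD, which gives the best low-rank approximation in the L2 (Frobenius) norm

PLSA -> has the same matrix form P = U*Sigma*V^T, with U=P(d|z), Sigma=diag(P(z)), V=P(w|z), but is fitted by maximizing the likelihood function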
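To make the matrix view concrete, the sketch below builds the joint matrix P = U*Sigma*V^T entry by entry from hand-picked parameters. The function name `joint_matrix` and the numbers are hypothetical and chosen only so that each distribution sums to one.

```python
def joint_matrix(p_z, p_d_z, p_w_z):
    """Return P with P[d][w] = sum_z P(d|z) * P(z) * P(w|z)."""
    n_topics = len(p_z)
    n_docs, n_words = len(p_d_z[0]), len(p_w_z[0])
    return [[sum(p_d_z[z][d] * p_z[z] * p_w_z[z][w] for z in range(n_topics))
             for w in range(n_words)]
            for d in range(n_docs)]


# Hypothetical parameters: 2 topics, 3 documents, 2 words.
p_z = [0.6, 0.4]                              # diagonal of Sigma
p_d_z = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]    # columns of U, indexed [z][d]
p_w_z = [[0.7, 0.3], [0.1, 0.9]]              # columns of V, indexed [z][w]

P = joint_matrix(p_z, p_d_z, p_w_z)
total = sum(sum(row) for row in P)            # a joint distribution sums to 1
```

Unlike the SVD factors in LSA, every factor here is a non-negative, normalized probability distribution, so the entries of P are themselves valid probabilities.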

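Finally, the author proposes two PLSA-based variants for retrieval.

(i) PLSA-U: smooths the empirical word distribution of each document
P'(w|d)=n(d,w)/n(d),
and the final P''(w|d)=lambda*P'(w|d)+(1-lambda)*P(w|d)

(ii) PLSA-Q: represents documents and queries in the low-dimensional latent space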
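The interpolation in (i) can be sketched in a few lines of Python. The function name `plsa_u_smooth` and the example numbers are hypothetical; `model_p_w_d` stands for the model's P(w|d), which would be computed as sum_z P(w|z)*P(z|d) from a fitted aspect model.

```python
def plsa_u_smooth(doc_counts, model_p_w_d, lam):
    """Return lambda * n(d,w)/n(d) + (1 - lambda) * P(w|d) for one document."""
    n_d = sum(doc_counts)
    empirical = [n / n_d for n in doc_counts]          # P'(w|d)
    return [lam * e + (1.0 - lam) * m
            for e, m in zip(empirical, model_p_w_d)]   # P''(w|d)


# A document that never uses the second word still gets mass on it.
smoothed = plsa_u_smooth([2, 0, 2], [0.2, 0.6, 0.2], lam=0.5)
```

With lambda = 0.5 the unseen second word receives probability 0.3 instead of 0, which is the smoothing effect the variant is after.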

Result


These are the 10 most probable words for the queries 'flight' and 'love'.

Performance


We can see that the performance is quite good: the precision-recall (PR) curves for PLSA lie above the others.












