Thomas Hofmann
Introduction
Latent Semantic Analysis (LSA) is a technique for finding the latent topics of a collection of documents. All we can observe is that there are some documents and that they contain some words; the topic of each document is unknown, so we need a method such as LSA to uncover it. However, LSA lacks some important statistical properties, so the author proposes a new method, Probabilistic Latent Semantic Analysis (PLSA), that addresses these shortcomings.
Method
The author introduces a statistical model called the 'aspect model', which is the core of PLSA. For an unobservable class (or topic) z in Z = {z_1, ..., z_K}, a word w, and a document d, the joint probability is modeled as
P(d,w) = P(d)*P(w|d), where P(w|d) = sum_z P(w|z)*P(z|d)
If we can find this decomposition of P(d,w), we obtain the relationship between the latent topics, the words, and the documents. The parameters are estimated with the Expectation Maximization (EM) algorithm.
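To make the decomposition concrete, here is a minimal numerical sketch. It assumes the asymmetric form P(d,w) = P(d) * sum_z P(w|z)*P(z|d); the array names and all numbers are made up for illustration and do not come from the paper.

```python
import numpy as np

# Made-up parameters: 2 documents, 3 words, 2 latent topics.
p_d = np.array([0.4, 0.6])                    # P(d)
p_z_given_d = np.array([[0.9, 0.1],
                        [0.2, 0.8]])          # P(z|d), one row per document
p_w_given_z = np.array([[0.6, 0.3, 0.1],
                        [0.1, 0.2, 0.7]])     # P(w|z), one row per topic

# P(w|d) = sum_z P(w|z) * P(z|d): each document is a mixture of topic distributions.
p_w_given_d = p_z_given_d @ p_w_given_z       # shape (documents, words)

# P(d,w) = P(d) * P(w|d)
p_dw = p_d[:, None] * p_w_given_d

print(p_w_given_d)
print(p_dw.sum())
```

Each row of p_w_given_d is itself a probability distribution over words, and p_dw sums to one over all (d, w) pairs.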
First, we define the objective function: the log-likelihood of the observed data under the model, L = sum_d sum_w n(d,w)*log P(d,w), where n(d,w) is the number of times word w occurs in document d.
Second, we re-parameterize the model so that it is symmetric in d and w: P(d,w) = sum_z P(z)*P(d|z)*P(w|z)
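Finally, we apply the EM algorithm iteratively, alternating the following two steps until convergence.
E-step: P(z|d,w) = P(z)*P(d|z)*P(w|z) / sum_z' P(z')*P(d|z')*P(w|z')
M-step: with the word counts n(d,w), P(w|z) is proportional to sum_d n(d,w)*P(z|d,w), P(d|z) is proportional to sum_w n(d,w)*P(z|d,w), and P(z) is proportional to sum_d sum_w n(d,w)*P(z|d,w)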
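The procedure above can be sketched in NumPy as follows. This is a minimal illustration, not the author's implementation: it assumes the symmetric parameterization P(d,w) = sum_z P(z)*P(d|z)*P(w|z), a random initialization, and a fixed number of iterations. The function name plsa_em, its arguments, and the toy count matrix are hypothetical.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z), P(d|z), P(w|z) to a (documents x words) count matrix with EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape

    # Random initialization (an assumption); every distribution sums to 1.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    history = []  # log-likelihood of the parameters at the start of each iteration
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z) * P(d|z) * P(w|z); shape (K, N, M).
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]
        p_dw = np.maximum(joint.sum(axis=0), 1e-300)  # guard against division by zero
        posterior = joint / p_dw

        # Log-likelihood: sum over (d, w) of n(d,w) * log P(d,w).
        history.append(float(np.sum(counts * np.log(p_dw))))

        # M-step: re-estimate from expected counts n(d,w) * P(z|d,w).
        expected = counts[None, :, :] * posterior
        p_w_z = expected.sum(axis=1)
        p_d_z = expected.sum(axis=2)
        p_z = expected.sum(axis=(1, 2))
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z, history

# Toy corpus (made up): documents 0-1 mostly use words 0-1, documents 2-3 words 2-3.
counts = np.array([[4.0, 3.0, 0.0, 0.0],
                   [3.0, 5.0, 0.0, 1.0],
                   [0.0, 0.0, 4.0, 2.0],
                   [0.0, 1.0, 3.0, 5.0]])
p_z, p_d_z, p_w_z, history = plsa_em(counts, n_topics=2)
print(history[0], history[-1])
```

The dense (K, N, M) posterior array keeps the sketch short but only suits small corpora; a practical implementation would iterate over the nonzero entries of the count matrix instead.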
The author then proposes an improved EM algorithm called tempered EM (TEM), which helps to avoid overfitting.
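A minimal sketch of a tempered E-step follows. It assumes the tempered posterior has the form P_beta(z|d,w) proportional to P(z)*[P(d|z)*P(w|z)]^beta with an inverse temperature 0 <= beta <= 1, so that beta = 1 recovers the ordinary E-step. The function name, parameter values, and choice of beta are illustrative assumptions, and the schedule for adjusting beta is not shown.

```python
import numpy as np

def tempered_e_step(p_z, p_d_z, p_w_z, beta=1.0):
    """Return P_beta(z|d,w) with shape (topics, documents, words)."""
    # The likelihood part is raised to the power beta; the prior P(z) is not.
    likelihood = (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
    unnormalized = p_z[:, None, None] * likelihood
    return unnormalized / unnormalized.sum(axis=0, keepdims=True)

# Made-up parameters: 2 topics, 2 documents, 3 words, uniform P(z).
p_z = np.array([0.5, 0.5])
p_d_z = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
p_w_z = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7]])

post_full = tempered_e_step(p_z, p_d_z, p_w_z, beta=1.0)  # ordinary E-step
post_half = tempered_e_step(p_z, p_d_z, p_w_z, beta=0.5)  # flatter posterior
post_zero = tempered_e_step(p_z, p_d_z, p_w_z, beta=0.0)  # falls back to P(z)
```

Lowering beta pulls the posterior toward the prior P(z), which makes the topic assignments less confident; this is the sense in which tempering dampens overfitting in this sketch.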
Other
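Geometry of the aspect model (figure not reproduced here): the distributions P(w|z) span a sub-simplex of the probability simplex over words, and every P(w|d) of the aspect model is a convex combination of them, so it lies inside that sub-simplex.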
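PLSA vs. LSA
Writing the joint probabilities as a matrix P = U*Sigma*V^T, PLSA corresponds to U = P(d|z), Sigma = diag(P(z)), V = P(w|z). The factorization resembles the SVD used by LSA, but the objective differs: LSA minimizes the L2 (Frobenius) norm of the approximation error, whereas PLSA maximizes the likelihood function.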
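The matrix view can be checked numerically. The sketch below builds U, Sigma and V from made-up PLSA parameters (all values are illustrative assumptions) and confirms that U*Sigma*V^T equals the joint distribution sum_z P(z)*P(d|z)*P(w|z).

```python
import numpy as np

# Made-up parameters: 2 topics, 3 documents, 4 words.
p_z = np.array([0.4, 0.6])                        # P(z)
p_d_z = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.2, 0.7]])               # P(d|z), one row per topic
p_w_z = np.array([[0.6, 0.3, 0.1, 0.0],
                  [0.1, 0.1, 0.3, 0.5]])          # P(w|z), one row per topic

U = p_d_z.T              # (documents, topics)
Sigma = np.diag(p_z)     # (topics, topics)
V = p_w_z.T              # (words, topics)
P = U @ Sigma @ V.T      # (documents, words), the joint P(d,w)

# The same quantity written as an explicit sum over topics.
P_explicit = np.einsum("k,kn,km->nm", p_z, p_d_z, p_w_z)

# For comparison, the SVD of P: with K = 2 topics the matrix has rank at most 2.
singular_values = np.linalg.svd(P, compute_uv=False)
print(np.allclose(P, P_explicit), singular_values)
```

Unlike SVD factors, the columns of U and V here are non-negative and each sums to one, so they can be read directly as probability distributions.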
Finally, the author proposes two variants of PLSA.
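(i) PLSA-U: uses the aspect model to smooth the empirical word distribution of each document. With the empirical distribution P'(w|d)=n(d,w)/n(d), where n(d) is the length of document d, the final estimate is the linear interpolation P''(w|d)=lambda*P'(w|d)+(1-lambda)*P(w|d)
(ii) PLSA-Q: a latent-space model that gives documents and queries a low-dimensional representation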
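The PLSA-U interpolation in (i) can be sketched as follows. The function name plsa_u, the toy arrays, and the value lambda = 0.5 are illustrative assumptions; the sketch also assumes that every document contains at least one word, so that n(d) > 0.

```python
import numpy as np

def plsa_u(counts, p_w_given_d, lam):
    """Interpolate the empirical P'(w|d) with the aspect-model P(w|d)."""
    n_d = counts.sum(axis=1, keepdims=True)   # n(d), the document lengths
    empirical = counts / n_d                   # P'(w|d) = n(d,w) / n(d)
    return lam * empirical + (1.0 - lam) * p_w_given_d

# Made-up data: 2 documents, 3 words, 2 topics.
counts = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 3.0]])
p_z_given_d = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
p_w_given_z = np.array([[0.6, 0.3, 0.1],
                        [0.1, 0.2, 0.7]])
p_w_given_d = p_z_given_d @ p_w_given_z       # aspect-model P(w|d)

smoothed = plsa_u(counts, p_w_given_d, lam=0.5)
print(smoothed)
```

Word 2 never occurs in document 0, so its empirical probability is zero, but the smoothed estimate assigns it a positive probability through the topics that the document shares with other documents.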
Result
performance
The performance is good: in the precision-recall (PR) plots, the PLSA curves lie above those of the compared methods.
word