AMMAI 2015 Paper reading: Probabilistic Latent Semantic Indexing

Thomas Hofmann

Introduction

Latent Semantic Analysis (LSA) is a technique for finding the latent topics of a set of documents. All we can observe is that there are some documents and that they contain certain words; the topic of each document is unknown. LSA helps uncover these latent topics. However, LSA lacks a solid statistical foundation, so the author proposes a new method called Probabilistic Latent Semantic Analysis (PLSA) that addresses these shortcomings.

Method

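The author introduces a statistical model called the 'aspect model', which is the core of PLSA. Given an unobserved class (or topic) variable $z \in Z = \{z_1, \dots, z_K\}$, a word $w \in W = \{w_1, \dots, w_M\}$, and a document $d \in D = \{d_1, \dots, d_N\}$, the joint probability of a document and a word can be written as

$$P(d, w) = P(d)\,P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z)\,P(z \mid d).$$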
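To make the decomposition concrete, here is a small numerical sketch. It assumes the aspect model has the form P(d, w) = P(d) * sum over z of P(w|z) P(z|d). The probability tables and the helper name `joint` are made up for illustration and do not come from the paper.

```python
# Toy aspect model: 2 documents, 3 words, 2 latent topics.
# All numbers below are invented for illustration.
p_d = [0.5, 0.5]                      # P(d)
p_z_given_d = [[0.9, 0.1],            # P(z|d), one row per document
               [0.2, 0.8]]
p_w_given_z = [[0.7, 0.2, 0.1],       # P(w|z), one row per topic
               [0.1, 0.3, 0.6]]


def joint(p_d, p_z_given_d, p_w_given_z):
    """Return the table P(d, w) = P(d) * sum_z P(w|z) P(z|d)."""
    n_docs = len(p_d)
    n_topics = len(p_w_given_z)
    n_words = len(p_w_given_z[0])
    return [[p_d[d] * sum(p_z_given_d[d][z] * p_w_given_z[z][w]
                          for z in range(n_topics))
             for w in range(n_words)]
            for d in range(n_docs)]


p_dw = joint(p_d, p_z_given_d, p_w_given_z)
# A valid joint distribution over (document, word) pairs sums to 1.
total = sum(sum(row) for row in p_dw)
```

Each document mixes the topic-specific word distributions with its own weights P(z|d), so the N x M table P(d, w) is described by far fewer numbers than a full table would need when K is small.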

Once we find this decomposition of P(d, w), we obtain the relationship between the latent topics, the words, and the documents. The decomposition is fitted with the Expectation Maximization (EM) algorithm.
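Before going through the steps, a compact sketch of the whole procedure may help. It is a minimal, dependency-free illustration of plain EM under the assumption that the model is parameterized by P(z|d) and P(w|z) and fitted to a word-count matrix n(d, w). The function name `plsa_em`, the toy counts, and the random initialization are assumptions made here; the paper's own variant (for example, any tempering of the E-step) may differ.

```python
import math
import random


def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Fit P(z|d) and P(w|z) to counts[d][w] = n(d, w) with plain EM.

    Assumes every document contains at least one word.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalise(xs):
        s = sum(xs)
        return [x / s for x in xs]

    def random_dist(size):
        # Strictly positive random start, normalised to sum to 1.
        return normalise([rng.random() + 1e-3 for _ in range(size)])

    p_z_d = [random_dist(n_topics) for _ in range(n_docs)]    # P(z|d)
    p_w_z = [random_dist(n_words) for _ in range(n_topics)]   # P(w|z)
    history = []  # log-likelihood (up to a constant) before each update

    for _ in range(n_iters):
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        log_lik = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) P(z|d).
                joint = [p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)]
                p_w_d = sum(joint)
                log_lik += n * math.log(p_w_d)
                for z in range(n_topics):
                    expected = n * joint[z] / p_w_d
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        history.append(log_lik)
        # M-step: renormalise the expected counts into probabilities.
        p_w_z = [normalise(row) for row in new_w_z]
        p_z_d = [normalise(row) for row in new_z_d]
    return p_z_d, p_w_z, history


# Toy corpus: documents 0-1 mostly use words 0-1, documents 2-3 words 2-3.
counts = [[4, 3, 0, 0],
          [3, 5, 0, 1],
          [0, 0, 6, 2],
          [0, 1, 4, 5]]
p_z_d, p_w_z, history = plsa_em(counts, n_topics=2)
```

The `history` list records the sum of n(d, w) log P(w|d) at each iteration, which EM should never decrease; the P(d) term is left out because it does not depend on the topic parameters.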

First, we need to define the likelihood, which measures how closely the model we have fitted matches the observed data.