Tuesday, June 16, 2015

Text Understanding from Scratch

Zhang, Xiang, and Yann LeCun


Introduction

Text classification is a different problem from image classification. Text has syntactic and semantic properties, and we may run into problems with synonyms and with words that have multiple meanings. This paper proposes a method that uses a CNN to classify text. The experiments show that this approach outperforms other methods, such as Bag of Words and word2vec.

Method

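The convolution is computed by the following formula:
$$h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c),$$

where g is the input, f is a kernel of size k, d is the stride and c = k - d + 1 is an offset constant. The max-pooling function is
$$h(y) = \max_{x=1}^{k} g(y \cdot d - x + c).$$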
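
The convolution and max-pooling steps can be sketched in plain Python. The sketch assumes the convention h(y) = sum over x = 1..k of f(x) * g(y*d - x + c) with offset c = k - d + 1 and 1-based indices; the function names `conv1d` and `max_pool1d` are made up for this illustration and are not taken from the paper's code.

```python
def conv1d(g, f, d=1):
    """Strided 1-D convolution: h(y) = sum_{x=1..k} f(x) * g(y*d - x + c).

    g is the input sequence, f the kernel and d the stride.
    The formula is 1-based, so 1 is subtracted when indexing Python lists.
    """
    l, k = len(g), len(f)
    c = k - d + 1                   # offset constant
    out_len = (l - k) // d + 1      # number of valid output positions
    h = []
    for y in range(1, out_len + 1):
        total = 0.0
        for x in range(1, k + 1):
            total += f[x - 1] * g[y * d - x + c - 1]
        h.append(total)
    return h


def max_pool1d(g, k, d=None):
    """Max-pooling: h(y) = max_{x=1..k} g(y*d - x + c).

    The stride d defaults to k (non-overlapping windows), which is an
    assumption of this sketch.
    """
    d = k if d is None else d
    c = k - d + 1
    out_len = (len(g) - k) // d + 1
    return [
        max(g[y * d - x + c - 1] for x in range(1, k + 1))
        for y in range(1, out_len + 1)
    ]
```

For example, `conv1d([1, 2, 3, 4, 5], [1, 0, -1])` gives `[2.0, 2.0, 2.0]`, and `max_pool1d([1, 5, 2, 8, 3, 4], 3)` gives `[5, 8]`.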

Every character has to be encoded before the input is fed to the network. There are 69 characters in total:


and the following is the result after encoding


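A minimal sketch of such a character encoding is given here, assuming a one-hot vector per character. The alphabet is an assumption built from Python's `string` constants (26 letters, 10 digits, 32 punctuation marks and the newline, 69 symbols in total) and may not match the original list exactly. Lowercasing, the fixed `length` parameter and the all-zero vector for unknown characters are also assumptions of this sketch.

```python
import string

# Assumed alphabet: 26 letters + 10 digits + 32 punctuation marks + newline = 69.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + "\n"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(text, length):
    """Return `length` one-hot vectors, each of size len(ALPHABET).

    The text is lowercased, then truncated or padded to `length`.
    Characters outside the alphabet (for example the space) and padding
    positions become all-zero vectors.
    """
    text = text.lower()[:length]
    frames = []
    for ch in text:
        vec = [0] * len(ALPHABET)
        if ch in CHAR_INDEX:
            vec[CHAR_INDEX[ch]] = 1
        frames.append(vec)
    # Pad with all-zero vectors up to the fixed length.
    frames.extend([0] * len(ALPHABET) for _ in range(length - len(frames)))
    return frames
```

Calling `encode("Ab c", 6)` returns 6 vectors of 69 entries each: one-hot vectors for `a`, `b` and `c`, and all-zero vectors for the space and the two padding positions.
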
This is the overall architecture of the model:


The network has 6 convolutional layers and 3 fully-connected layers. The following tables give the parameters of these layers:


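To get a feel for how a stack of 6 convolutional layers shrinks the sequence before the 3 fully-connected layers, the sketch below tracks the frame length layer by layer. The kernel sizes, pooling sizes and the input length of 1014 are assumptions chosen for illustration, not values read from the tables. A valid convolution with kernel size k and stride 1 maps a length l to l - k + 1, and non-overlapping pooling of size p maps l to l // p.

```python
# (kernel size, pooling size or None) for each of the 6 convolutional layers.
# These values are illustrative assumptions.
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]


def frame_lengths(input_length, layers=CONV_LAYERS):
    """Return the sequence length after each convolutional layer."""
    lengths = []
    length = input_length
    for kernel, pool in layers:
        length = length - kernel + 1      # valid convolution, stride 1
        if pool is not None:
            length = length // pool       # non-overlapping max-pooling
        lengths.append(length)
    return lengths
```

With these assumed sizes, `frame_lengths(1014)` gives `[336, 110, 108, 106, 104, 34]`, so 34 frames would be passed on to the first fully-connected layer.
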
As in other learning work, data augmentation is needed. The authors do this with a thesaurus, that is, a dictionary of synonyms. New samples are obtained by replacing some words in a sentence with their synonyms.
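
The synonym replacement can be sketched as follows. The toy `THESAURUS`, the function name `augment` and the rule of replacing each known word with a fixed probability `replace_prob` are all assumptions made for this example; the authors' actual thesaurus and sampling rule are not given here.

```python
import random

# Hypothetical toy thesaurus; a real one would be loaded from a synonym dictionary.
THESAURUS = {
    "good": ["fine", "great"],
    "movie": ["film"],
    "quick": ["fast", "rapid"],
}


def augment(sentence, replace_prob=0.5, rng=None):
    """Return a copy of `sentence` with some words swapped for a synonym.

    Each word found in THESAURUS is replaced with probability `replace_prob`
    by a synonym chosen uniformly at random; other words are kept as they are.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for word in sentence.split():
        synonyms = THESAURUS.get(word.lower())
        if synonyms and rng.random() < replace_prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(word)
    return " ".join(out)
```

For instance, `augment("a good movie", rng=random.Random(0))` returns a three-word sentence in which "good" and "movie" may each have been swapped for one of their synonyms, while "a" is always kept.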

Result

The authors evaluated performance on several tasks, including DBpedia Ontology Classification, Amazon Review Analysis, Yahoo! Answers Topic Classification, News Categorization in English and News Categorization in Chinese. They also implemented two other methods, Bag of Words and word2vec, and compared them with their own model.

DBpedia



Amazon Review





Yahoo! Answers




News Categorization in English




News Categorization in Chinese


We can see that in all of these comparisons their method performs better than the other two.

