Tuesday, June 16, 2015

What does classifying more than 10,000 image categories tell us?

J. Deng, A. C. Berg, K. Li, and L. Fei-Fei


Introduction

Image classification has long been an important problem in MIR, and many methods have been proposed to solve it. However, as the number of images and classes grows rapidly, these older methods gradually become infeasible. This paper examines several well-known methods and compares them along several dimensions.

Comparison


Datasets : 
(1) ImageNet10K
(2) ImageNet7K
(3) ImageNet1K
(4) Rand200
(5) Ungulate183
(6) Fungus134
(7) Vehicle262
(8) CalNet200

Evaluation : 
(1) Mean accuracy
(2) Mean misclassification cost

Method : 
(1) GIST+NN 
(2) BOW+NN 
(3) BOW+SVM
(4) SPM+SVM

Consideration :
(1) Computation
(2) Size
(3) Density
(4) Hierarchy

(1) Computation
The authors estimate that training these classifiers takes several CPU years. They used a 66-CPU cluster and parallel algorithms for training, but the experiments still took several weeks to finish.

(2) Size


Experiments show that accuracy drops as the number of classes increases. A technique that significantly outperforms others on small datasets may actually underperform them on a large number of categories.

(3) Density
The sparser the data, the better the classification performance. Here, sparser means that the data points lie farther apart from each other.


(4) Hierarchy
The authors argue that different types of errors should incur different costs. For instance, misclassifying a redshank as a bird should cost less than misclassifying it as a microwave. They call this the "hierarchical cost".
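As a rough illustration of a hierarchical cost, the sketch below assumes that the cost of a mistake is the height of the lowest common ancestor of the true and predicted classes in a class tree. The toy hierarchy, the class names other than the redshank example, and the function names are all made up for illustration and are not taken from the paper.

```python
# Toy hierarchy (child -> parent). Illustrative only, not the real class tree.
PARENT = {
    "redshank": "bird",
    "sparrow": "bird",
    "bird": "animal",
    "dog": "animal",
    "animal": "entity",
    "microwave": "appliance",
    "appliance": "entity",
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def height(node):
    """Height of a node: length of the longest downward path to a leaf."""
    children = [child for child, parent in PARENT.items() if parent == node]
    if not children:
        return 0
    return 1 + max(height(child) for child in children)


def hierarchical_cost(true_label, predicted_label):
    """Assumed cost: height of the lowest common ancestor (0 if correct)."""
    if true_label == predicted_label:
        return 0
    predicted_path = set(ancestors(predicted_label))
    for node in ancestors(true_label):
        if node in predicted_path:
            return height(node)
    raise ValueError("labels do not share a common ancestor")


def mean_misclassification_cost(pairs):
    """Average hierarchical cost over (true, predicted) label pairs."""
    return sum(hierarchical_cost(t, p) for t, p in pairs) / len(pairs)
```

With this toy tree, predicting "bird" for a redshank costs 1, while predicting "microwave" costs 3, which matches the intuition that the second mistake is worse.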






Text Understanding from Scratch

Zhang, Xiang, and Yann LeCun


Introduction

Text classification is a different problem from image classification. Text has syntactic and semantic properties, and we may run into synonyms and words with multiple meanings. This paper proposes a method that uses a CNN to classify text. Experiments show that it outperforms other approaches such as bag-of-words and word2vec.

Method

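The convolution between a kernel f(x) of size k and an input g(x), with stride d, is computed by the following formula:

$$h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c),$$

where c = k - d + 1 is an offset constant, and the max-pooling function is

$$h(y) = \max_{x=1}^{k} g(y \cdot d - x + c).$$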
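To make the two operations concrete, here is a minimal pure-Python sketch. It assumes the strided form h(y) = sum over x of f(x) * g(y*d - x + c) with offset c = k - d + 1 and 1-based indices, and the same indexing with max in place of the sum for pooling. The function names are my own.

```python
def temporal_convolution(g, f, d=1):
    """1-D convolution of input g with kernel f and stride d.

    Uses 1-based indices as in h(y) = sum_x f(x) * g(y*d - x + c),
    with c = k - d + 1, and converts to 0-based list access.
    """
    l, k = len(g), len(f)
    c = k - d + 1
    n_out = (l - k) // d + 1
    out = []
    for y in range(1, n_out + 1):
        total = 0.0
        for x in range(1, k + 1):
            total += f[x - 1] * g[y * d - x + c - 1]
        out.append(total)
    return out


def temporal_max_pooling(g, k, d=None):
    """1-D max-pooling with window k and stride d (defaults to k)."""
    if d is None:
        d = k
    c = k - d + 1
    n_out = (len(g) - k) // d + 1
    return [
        max(g[y * d - x + c - 1] for x in range(1, k + 1))
        for y in range(1, n_out + 1)
    ]
```

For example, convolving [1, 2, 3, 4, 5] with the kernel [1, 0, -1] at stride 1 gives [2.0, 2.0, 2.0], and pooling [1, 3, 2, 5, 4, 0] with window 2 gives [3, 5, 4].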

Every character of the input text must be encoded before it is fed to the network. The alphabet contains 69 characters in total, and each character of the text is encoded against this alphabet.
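A rough sketch of character-level encoding is given here, assuming a one-hot scheme in which characters outside the alphabet become all-zero vectors. The alphabet used in the code is a hypothetical 36-character stand-in (lowercase letters and digits), not the 69-character alphabet from the paper, and the function name is my own.

```python
import string

# ASSUMPTION: a small stand-in alphabet, not the paper's 69 characters.
ALPHABET = string.ascii_lowercase + string.digits
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, length):
    """Encode text as `length` one-hot vectors over ALPHABET.

    Text longer than `length` is truncated; shorter text is padded with
    all-zero vectors. Characters outside ALPHABET also map to all zeros.
    """
    frames = []
    for ch in text.lower()[:length]:
        vector = [0] * len(ALPHABET)
        index = CHAR_INDEX.get(ch)
        if index is not None:
            vector[index] = 1
        frames.append(vector)
    while len(frames) < length:
        frames.append([0] * len(ALPHABET))
    return frames
```

The result is a fixed-size length-by-alphabet matrix, which is the kind of input a 1-D convolutional network can consume directly.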


This is the overall architecture of this work


The network has 6 convolutional layers followed by 3 fully-connected layers; the paper tabulates the parameters of each layer.
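To see how the sequence length shrinks through the convolutional stack, the sketch below tracks the length layer by layer. The kernel sizes, pooling sizes, and the input length of 1014 are assumptions recalled from the paper rather than values given in these notes, so they should be checked against the paper's tables before being relied on.

```python
# (kernel size, pooling size or None) for each convolutional layer.
# ASSUMPTION: recalled from the paper, not confirmed by these notes.
CONV_LAYERS = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]


def output_length(input_length, layers=CONV_LAYERS):
    """Length of the feature sequence after all convolutional layers."""
    length = input_length
    for kernel, pool in layers:
        length = length - kernel + 1      # stride-1 convolution, no padding
        if pool is not None:
            length = length // pool       # non-overlapping max-pooling
        if length <= 0:
            raise ValueError("input is too short for this stack of layers")
    return length


# ASSUMPTION: 1014 input characters per sample.
final_length = output_length(1014)
```

Under these assumptions the first fully-connected layer would see final_length frames per sample, and the closed form (l0 - 96) / 27 gives the same number.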


As in other learning work, data augmentation is needed. Here it is done with a thesaurus, that is, a dictionary of synonyms: new samples are obtained by replacing some words in a sentence with their synonyms.
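The synonym replacement can be sketched as follows. The tiny thesaurus, the replacement probability, and the function name are hypothetical; the paper's actual thesaurus and sampling scheme are not described in these notes.

```python
import random

# ASSUMPTION: a toy thesaurus for illustration only.
THESAURUS = {
    "good": ["fine", "great"],
    "movie": ["film"],
    "quick": ["fast", "rapid"],
}


def augment(sentence, thesaurus=THESAURUS, replace_prob=0.5, rng=None):
    """Return a copy of sentence with some words swapped for synonyms.

    Each word that has an entry in the thesaurus is replaced with
    probability replace_prob by a randomly chosen synonym.
    """
    rng = rng or random.Random()
    words = []
    for word in sentence.split():
        synonyms = thesaurus.get(word.lower())
        if synonyms and rng.random() < replace_prob:
            words.append(rng.choice(synonyms))
        else:
            words.append(word)
    return " ".join(words)
```

Calling augment several times on the same sentence yields different variants, which can be added to the training set with the original label.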

Result

The authors evaluated performance on several tasks, including DBpedia Ontology Classification, Amazon Review Analysis, Yahoo! Answers Topic Classification, News Categorization in English, and News Categorization in Chinese. They also implemented two other methods, bag of words and word2vec, and compared them with their own.

DBpedia



Amazon Review





Yahoo! Answers




News Categorization in English




News Categorization in Chinese


In these experiments, their method consistently achieves the best performance.


Sunday, June 14, 2015

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups

Hinton, Geoffrey, et al.


Summary

This article introduces many techniques for speech recognition using deep neural networks. Some of them seem specialized and seldom appear in other domains.

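The Restricted Boltzmann Machine (RBM) is the model used in generative pre-training. For binary visible units v and hidden units h, with biases a and b and weights w, the energy function is given by

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}.$$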

Contrastive Divergence (CD) is an efficient learning procedure for RBMs.
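A minimal sketch of the energy function and one step of contrastive divergence (CD-1) for a binary RBM follows. It assumes the standard energy E(v, h) = -sum a_i v_i - sum b_j h_j - sum v_i h_j w_ij, and it uses probabilities instead of sampled states for the reconstruction, which is one common variant rather than necessarily the exact procedure in the article. All function names and the learning rate are my own choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, W, a, b):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j W[i][j]."""
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(
        v[i] * h[j] * W[i][j] for i in range(len(v)) for j in range(len(h))
    )
    return -visible_term - hidden_term - interaction


def hidden_probs(v, W, b):
    """p(h_j = 1 | v) for every hidden unit."""
    return [
        sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(b))
    ]


def visible_probs(h, W, a):
    """p(v_i = 1 | h) for every visible unit."""
    return [
        sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
        for i in range(len(a))
    ]


def cd1_update(v0, W, a, b, lr=0.1, rng=None):
    """One CD-1 step on a single training vector; updates W, a, b in place."""
    rng = rng or random.Random()
    p_h0 = hidden_probs(v0, W, b)
    h0 = [1 if rng.random() < p else 0 for p in p_h0]   # sample hidden states
    v1 = visible_probs(h0, W, a)                        # reconstruction
    p_h1 = hidden_probs(v1, W, b)
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * p_h0[j] - v1[i] * p_h1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (p_h0[j] - p_h1[j])
```

The update moves the weights toward the statistics of the data and away from the statistics of the one-step reconstruction, which is what makes CD cheap compared with running the Markov chain to equilibrium.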

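To model real-valued data, the Gaussian-Bernoulli RBM (GRBM) is used. Its energy function is given by

$$E(\mathbf{v}, \mathbf{h}) = \sum_{i \in \text{visible}} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i} h_j w_{ij},$$

where \sigma_i is the standard deviation of the Gaussian noise for visible unit i.

We can stack RBMs to make a Deep Belief Network (DBN).

A DNN that is pre-trained generatively as a DBN is called a DBN-DNN.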

The article also compares results from the Microsoft, Google, and IBM research groups.



This article is an important introduction to speech recognition with deep neural networks. Several of the techniques seem to have been used in this field for a long time, but I have not fully grasped them yet. Still, the article is well worth studying carefully.











Rich feature hierarchies for accurate object detection and semantic segmentation


Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik


Introduction

Object detection, like image classification, plays an important role in MIR. The main difference is that object detection tells you not only what is in the image but also where it is. This paper proposes R-CNN, a method that combines region proposals with CNNs for object detection and beats the previous state of the art by a wide margin.

Method

The following figure is the overview of this system.



The first step is to find candidate regions in the image, called region proposals, that probably contain an object. Various papers offer methods for generating region proposals; this work uses selective search.
The next step is feature extraction. The paper uses a CNN with the architecture proposed by Krizhevsky et al. to extract a 4096-dimensional feature vector from each region proposal. Because the CNN requires inputs of a fixed size, each region proposal must be resized. The authors chose the simplest option, a simple warp.
After the features of the region proposals are extracted, the authors use SVMs for classification; the SVM scores determine which class each proposal belongs to.
The CNN in the second step was pre-trained with supervision on a large auxiliary dataset (ILSVRC 2012).
Because the CNN has multiple layers, the authors then examine which layer should be used as the input to the SVMs. After several experiments, they found that, with fine-tuning, fully connected layer 7 gives the best performance.
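The warping step can be illustrated with a small sketch: a region of arbitrary shape is stretched to a fixed square size without preserving its aspect ratio. The code below uses nearest-neighbor sampling on a plain 2-D list. The function name, the box convention, and the default output size are assumptions for illustration, and a real system would warp to the CNN's actual input size with a proper image library.

```python
def warp_region(image, box, size=4):
    """Warp the region box = (x0, y0, x1, y1) of image to size x size.

    image is a 2-D list of pixel values indexed as image[y][x]; x1 and y1
    are exclusive. The aspect ratio is deliberately not preserved.
    """
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    return [
        [
            image[y0 + (row * height) // size][x0 + (col * width) // size]
            for col in range(size)
        ]
        for row in range(size)
    ]
```

Every proposal, whether wide, tall, or tiny, ends up with the same shape, so all of them can be passed through the same network.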

Performance



The accuracy of this method is far higher than that of the other methods.

Result

















Wednesday, June 10, 2015

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton


Introduction

Image classification plays an important role in MIR, and many methods have been proposed for it. This paper proposes using deep convolutional neural networks for image classification and achieves a large improvement.

Method

The authors apply Rectified Linear Units (ReLU) to every neuron. The function is simple:

f(x) = max(0,x),

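then local response normalization is applied after the ReLU:

$$b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Big(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^{j}_{x,y}\big)^{2}\Big)^{\beta},$$

where a^i_{x,y} is the activity of kernel i at position (x, y), N is the number of kernels in the layer, and k, n, \alpha, and \beta are hyperparameters.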
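A small sketch of both operations at a single spatial position is shown below. It assumes the cross-channel form of the normalization, in which each activity is divided by (k + alpha * sum of squared activities of neighboring channels) raised to the power beta. The default hyperparameter values are the ones I recall from the paper and should be treated as assumptions.

```python
def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize channel activities a[0..N-1] at one spatial position.

    Each a[i] is divided by (k + alpha * sum of a[j]^2)^beta, where j runs
    over the n neighboring channels centered on i, clipped to [0, N-1].
    """
    N = len(a)
    half = n // 2
    out = []
    for i in range(N):
        lo = max(0, i - half)
        hi = min(N - 1, i + half)
        squared_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * squared_sum) ** beta)
    return out


# ReLU first, then normalization across the channels at this position.
activities = [relu(x) for x in [-1.0, 2.0, 0.5, -3.0, 4.0]]
normalized = local_response_norm(activities)
```

A channel with a large response damps its neighbors, which is a form of lateral inhibition between kernels.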


The overall architecture looks like this :


To reduce overfitting, the authors apply data augmentation and dropout. For data augmentation, they apply simple transformations to the original images, such as translation and horizontal reflection. Dropout means that each neuron has a certain probability of not propagating its output to the next layer.
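Both ideas are easy to sketch on plain Python lists. The function names, the drop probability default, and the crop-based translation are illustrative assumptions, and the sketch ignores details such as how activations are rescaled at test time.

```python
import random


def horizontal_reflection(image):
    """Mirror a 2-D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def random_translation(image, crop_height, crop_width, rng=None):
    """Translate by taking a random crop of the given size from image."""
    rng = rng or random.Random()
    top = rng.randint(0, len(image) - crop_height)
    left = rng.randint(0, len(image[0]) - crop_width)
    return [row[left:left + crop_width] for row in image[top:top + crop_height]]


def dropout(activations, p_drop=0.5, rng=None):
    """Zero each activation independently with probability p_drop."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < p_drop else a for a in activations]
```

Because a different random subset of neurons is silenced for every training example, no neuron can rely on the presence of particular other neurons.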
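In the learning step, the authors use the following update rule:

$$v_{i+1} = 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$$

$$w_{i+1} = w_i + v_{i+1}$$

The variable v in the first equation is the momentum variable and \epsilon is the learning rate; in the second equation, v is added to w to update the weights.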
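The update can be sketched as a single function, assuming the usual form of stochastic gradient descent with momentum and weight decay: the new velocity is momentum times the old velocity, minus weight decay times learning rate times the weight, minus learning rate times the gradient. The default constants 0.9 and 0.0005 are the values I recall from the paper; the learning rate default and the function name are arbitrary choices.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update of the weights w and the momentum variable v.

    v_new = momentum * v - weight_decay * lr * w - lr * grad
    w_new = w + v_new
    Returns (w_new, v_new) as new lists.
    """
    new_v = [
        momentum * vi - weight_decay * lr * wi - lr * gi
        for wi, vi, gi in zip(w, v, grad)
    ]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v
```

Here grad stands for the gradient of the loss averaged over the current mini-batch; the momentum term keeps part of the previous step, which smooths the trajectory across noisy batches.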


Performance


The error rates of this method on the dataset are lower than those of the other methods.

Result







Monday, June 8, 2015

Story-Driven Summarization for Egocentric Video

Z. Lu and K. Grauman


Introduction

Egocentric video is video captured by a camera worn on the body, not necessarily held in the hand, but mounted on the head or arm. Such devices have become increasingly popular in recent years, and the videos are usually shot by amateurs. One problem is that these videos are too long: few people will spend several hours watching unlabeled footage, not to mention how many such videos there are. Video summarization has therefore become valuable. This paper proposes a novel method that takes into account not only low-level features but also the "story", and it clearly outperforms earlier approaches.

Method

This method can be separated into several parts :

(1) Segment the original video into a series of n sub-shots
(2) Define the components of the objective function
(3) Optimize the objective function
(4) Produce the final summary


(1) The authors use optical flow and blur as features and train an SVM classifier on them.

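(2) Among all order-preserving chains of K selected nodes, the goal is to select the optimal K-node chain S*:

$$S^{*} = \arg\max_{S} \; \lambda_s \, \mathcal{S}(S) + \lambda_i \, \mathcal{I}(S) + \lambda_d \, \mathcal{D}(S)$$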


S(S) is the story term, which can be computed by :
and

To account for coherency as well as influence, the objective function was modified to :

I(S) is the importance of the individual sub-shots, which is computed using a method from earlier work.

See
  1. Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012.
for the details of this term.

D(S) is diversity among transitions :

(3) For optimization, the authors adapt, with some modifications, the approach of
  1. D. Shahaf and C. Guestrin. Connecting the dots between news articles. In KDD, 2010.
(4) Because the measurement of influence across the boundaries of major distinct events may sometimes be incorrect, the authors pose the final summarization task in two layers.

Using the formulation above, plus some extra steps, the video is first decomposed into events. Then the argument K from the previous step is varied to find the best chains, which are concatenated to form the summary.
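The chain selection in step (2) can be illustrated with a brute-force sketch: enumerate every order-preserving chain of K sub-shots and keep the one with the highest weighted score. This exhaustive search is only practical for tiny inputs and is not the optimization procedure used by the authors. The three scoring functions, the weights, and the toy data are all hypothetical stand-ins for the story, importance, and diversity terms.

```python
from itertools import combinations


def best_chain(n, K, story, importance, diversity, weights=(1.0, 1.0, 1.0)):
    """Return the best order-preserving K-node chain among n sub-shots.

    story, importance, and diversity each map a chain (a tuple of
    increasing sub-shot indices) to a score; weights holds the three
    corresponding multipliers.
    """
    w_story, w_importance, w_diversity = weights
    best, best_score = None, float("-inf")
    # combinations() yields index tuples in increasing order, so every
    # candidate preserves the temporal order of the sub-shots.
    for chain in combinations(range(n), K):
        score = (w_story * story(chain)
                 + w_importance * importance(chain)
                 + w_diversity * diversity(chain))
        if score > best_score:
            best, best_score = chain, score
    return best, best_score


# Toy terms (all made up): the story term is a weakest-link score that
# favors nearby sub-shots, importance is a sum of per-sub-shot scores.
IMPORTANCE = [0.1, 0.9, 0.2, 0.8, 0.3]


def toy_story(chain):
    return min(1.0 / (b - a) for a, b in zip(chain, chain[1:]))


def toy_importance(chain):
    return sum(IMPORTANCE[i] for i in chain)


def toy_diversity(chain):
    return 0.0


chain, score = best_chain(5, 2, toy_story, toy_importance, toy_diversity)
```

With these toy terms the selected chain is (1, 3): the two most important sub-shots win even though they are not adjacent, because the importance gain outweighs the weaker story link.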

Performance

The results show that most people judged this method to perform better.

This method also achieves a higher average true positive rate.

Result