Josef Sivic & Andrew Zisserman
Introduction
In this work, the authors propose a fast, efficient way to retrieve a particular object from the frames of a video. Many earlier works have succeeded in identifying an object in an image. The authors borrow techniques from text retrieval, with the goal of querying a video for an object just like searching for text with Google.
Feature Description
The descriptor must be unaffected by changes in viewpoint, scale, and illumination. Many earlier works have provided solutions for this. The authors chose two different ways to find the feature regions:
(1) Shape Adapted (SA)
(2) Maximally Stable (MS),
then used SIFT as the descriptor for each region.
Visual Vocabulary
The authors used K-Means clustering to quantize the descriptors into visual words, with Mahalanobis distance as the distance function.
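To make the quantization step more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes a single covariance matrix estimated from all descriptors, and it relies on the fact that after a whitening transform, ordinary Euclidean distance equals Mahalanobis distance, so plain K-Means can be run on the whitened data. The function names (`whiten`, `assign`, `kmeans`), the `ridge` parameter, and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np

def whiten(descriptors, ridge=1e-6):
    """Map descriptors so that Euclidean distance equals Mahalanobis distance."""
    cov = np.cov(descriptors, rowvar=False)
    # A small ridge keeps the covariance invertible (an assumption of this sketch).
    cov = cov + ridge * np.eye(cov.shape[0])
    # inv(cov) = L @ L.T, so ||(a - b) @ L|| is the Mahalanobis distance of a and b.
    L = np.linalg.cholesky(np.linalg.inv(cov))
    return descriptors @ L

def assign(points, centers):
    """Index of the nearest center (squared Euclidean distance) for each point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; each center acts as one visual word."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers, assign(points, centers)

# Random vectors stand in for real SIFT descriptors here.
descriptors = np.random.default_rng(0).normal(size=(500, 8))
centers, words = kmeans(whiten(descriptors), k=10)
```

Each entry of `words` is the index of the visual word assigned to the corresponding descriptor. Real SIFT descriptors are 128-dimensional, and a real vocabulary would have far more than 10 words.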
Frequency Computation
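Term frequency-inverse document frequency (tf-idf) was used to compute the weight of each visual word in each frame. A word gets a high weight when it occurs often in that frame but rarely in the rest of the video.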
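As a rough illustration of the weighting, the sketch below computes tf-idf from a matrix of visual word counts, treating each frame as a document. It assumes the common textbook form: term frequency is the word count divided by the total number of words in the frame, and inverse document frequency is the log of the number of frames divided by the number of frames containing the word. The paper's exact definition may differ in detail, and the function name `tfidf` is only illustrative.

```python
import numpy as np

def tfidf(counts):
    """counts[d, i] = number of times visual word i occurs in frame d."""
    counts = np.asarray(counts, dtype=float)
    n_frames = counts.shape[0]
    words_per_frame = counts.sum(axis=1, keepdims=True)
    # Term frequency; frames with no words at all get an all-zero row.
    tf = np.divide(counts, words_per_frame,
                   out=np.zeros_like(counts), where=words_per_frame > 0)
    # Number of frames in which each word occurs at least once.
    frame_freq = (counts > 0).sum(axis=0)
    idf = np.log(n_frames / np.maximum(frame_freq, 1))
    return tf * idf

# Three frames, three visual words (toy counts).
weights = tfidf([[2, 1, 0],
                 [1, 0, 1],
                 [3, 0, 0]])
```

In this toy example, word 0 appears in every frame, so its weight is zero everywhere: it carries no information for telling frames apart.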
Experiments
The system is used as follows: first, the user selects a region in a frame, generally one that contains an object. The system then returns the frames that also contain the selected object.
From the results, we can see that the retrieval accuracy is reasonably good.
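The retrieval step can be sketched as a ranking problem: the selected region and every frame are each represented by a vector of visual word weights, and frames are sorted by how similar their vector is to the query vector. The sketch below assumes cosine similarity (the normalized dot product) as the score. The function name `rank_frames` and the toy vectors are made up for illustration and are not taken from the authors' system.

```python
import numpy as np

def rank_frames(query_vec, frame_vecs):
    """Return frame indices sorted from most to least similar, plus the scores."""
    q = np.asarray(query_vec, dtype=float)
    frames = np.asarray(frame_vecs, dtype=float)
    denom = np.linalg.norm(q) * np.linalg.norm(frames, axis=1)
    # Cosine similarity; frames with an all-zero vector score 0.
    scores = np.divide(frames @ q, denom,
                       out=np.zeros(len(frames)), where=denom > 0)
    order = np.argsort(-scores, kind="stable")
    return order, scores

# Toy weight vectors: the query contains only visual word 0.
order, scores = rank_frames([1, 0, 0],
                            [[1, 0, 0],
                             [0, 1, 0],
                             [1, 1, 0]])
```

Here frame 0 matches the query exactly and is ranked first, frame 2 shares one word with it and comes second, and frame 1 shares nothing and comes last.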

