
Image Categorization

Origin and Motivation


Origin 1: Texture recognition
• Texture is characterized by the repetition of basic elements, or textons.

• For stochastic textures, it is the identity of the textons, not their spatial arrangement, that
matters.
Origin 2: Bag-of-words models
• Orderless document representation: frequencies of words from a dictionary (Salton &
McGill, 1983).
“Bag-of-Features” Approach:
The task of image categorization is to label a query image with a certain scene type, e.g.,
“building,” “street,” “mountains,” or “forest.” The main difference compared to recognition
tasks for distinct objects is a much wider range of intra-class variation. Two instances of the
type “building,” for example, can look very different in spite of having certain common
features. Therefore, a more or less rigid model of the object geometry is no longer applicable.

Main Idea:
A similar problem is faced in document analysis when attempting to automatically assign a
piece of text to a certain topic, e.g., “mathematics,” “news,” or “sports.” There, this problem
is solved by the definition of a so-called codebook. A codebook consists of lists of words
or phrases which are typical for a certain topic.

It is built in a training phase. As a result, each topic is characterized by a “bag of words” (a
set of codebook entries), regardless of the positions at which the words actually appear in the
text. During classification of an unknown text, the codebook entries can be used for gathering
evidence that the text belongs to a specific topic.

This solution can be applied to the image categorization task as well: here, the “visual
codebook” consists of characteristic region descriptors (which correspond to the “words”),
and the “bag of words” is often referred to as a “bag of features” in the literature.

The visual codebook is built in a training phase where descriptors are extracted from sample
images of different scene types and clustered in feature space. The cluster centers can be
interpreted as the visual words.

In the recognition phase, the feature distribution of a query image is derived based on the
codebook (e.g., by assigning each descriptor to the most similar codebook entry).
Classification is then done by comparing this distribution to the distributions of the scene
types learned in the training phase, e.g., by calculating some similarity measure between the
histogram of the query image and the histograms of the known scene types in the model
database.
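
As a minimal sketch of this comparison step, the snippet below classifies a query histogram
by histogram intersection against per-class histograms learned in training; the 4-word
codebook and all histogram values are made up for illustration.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Similarity of two L1-normalized histograms: the sum of element-wise minima."""
    return np.minimum(h1, h2).sum()

def classify(query_hist, class_hists):
    """Return the scene type whose training histogram is most similar to the query."""
    return max(class_hists, key=lambda c: histogram_intersection(query_hist, class_hists[c]))

# Made-up 4-word codebook histograms, L1-normalized.
class_hists = {
    "building": np.array([0.6, 0.2, 0.1, 0.1]),
    "forest":   np.array([0.1, 0.1, 0.5, 0.3]),
}
query = np.array([0.5, 0.3, 0.1, 0.1])
print(classify(query, class_hists))  # -> building
```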
Bag of features: outline

1. Extract features.

2. Learn the “visual vocabulary”.

3. Quantize features using the visual vocabulary.

4. Represent images by frequencies of “visual words”.

1. Extract features:

The identification of image patches can be achieved by one of the following strategies:

 Interest point detector

 Regular grid

 Random sampling: an alternative strategy is to sample the image at random
positions. Empirical studies give evidence that such a simple random sampling
strategy yields equal or even better recognition results, because it is possible
to sample image patches densely, whereas the number of patches is limited for
keypoint detectors, as they focus on characteristic points. Dense sampling has
the advantage of containing more information (see the sketch after this list).

 Segmentation-based patches
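
As a rough sketch of the sampling strategies above, the following (with made-up image and
patch sizes) draws random patch positions and, for comparison, a dense regular grid:

```python
import numpy as np

def random_patch_coords(h, w, patch, n, seed=0):
    """Sample n top-left corners of patch-x-patch windows uniformly at random."""
    rng = np.random.default_rng(seed)
    ys = rng.integers(0, h - patch + 1, size=n)
    xs = rng.integers(0, w - patch + 1, size=n)
    return list(zip(ys, xs))

def grid_patch_coords(h, w, patch, stride):
    """Top-left corners of a dense regular grid of patch-x-patch windows."""
    return [(y, x) for y in range(0, h - patch + 1, stride)
                   for x in range(0, w - patch + 1, stride)]

# Made-up sizes: for a 480x640 image, 500 random 16x16 patches
# versus a dense grid with stride 8.
rand_coords = random_patch_coords(480, 640, 16, 500)
grid_coords = grid_patch_coords(480, 640, 16, 8)
print(len(rand_coords), len(grid_coords))
```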
2. Learning the visual vocabulary

After feature detection, each image is abstracted by several local patches. Feature
representation methods deal with how to represent the patches as numerical vectors. These
methods are called feature descriptors. A good descriptor should have the ability to handle
intensity, rotation, scale, and affine variations to some extent. One of the most famous
descriptors is the Scale-Invariant Feature Transform (SIFT). SIFT converts each patch into a
128-dimensional vector. After this step, each image is a collection of vectors of the same
dimension (128 for SIFT), where the order of the different vectors is of no importance.
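
For illustration, SIFT descriptors can be extracted with OpenCV roughly as follows; the file
name is hypothetical, and an OpenCV build that includes SIFT (opencv-python 4.4 or later)
is assumed:

```python
import cv2

# "scene.jpg" is a hypothetical input image.
img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors has shape (num_patches, 128); the row order is irrelevant
# for the bag-of-features representation.
print(descriptors.shape)
```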

A codeword can be considered a representative of several similar patches. One simple
method is to perform k-means clustering over all the vectors. Codewords are then defined as
the centers of the learned clusters.

Thus, each patch in an image is mapped to a certain codeword through the clustering process,
and the image can be represented by the histogram of the codewords.
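
A minimal sketch of this codebook-learning step, assuming scikit-learn is available and that
descriptor_sets holds one (n_i x 128) descriptor array per training image (both assumptions,
not part of the original text):

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(descriptor_sets, k=200, seed=0):
    """Stack all training descriptors and cluster them;
    the k cluster centers are the visual words (code vectors)."""
    all_desc = np.vstack(descriptor_sets)
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(all_desc)

def bag_of_features(kmeans, descriptors):
    """Assign each descriptor to its nearest codeword and count occurrences."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so image size does not matter
```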
The k-means algorithm intends to identify densely populated regions in feature space (i.e.,
regions where many descriptors are located close to each other). Each cluster center produced
by k-means becomes a code vector.

The advantage of k-means clustering is that the codebook fits well to the actual distribution
of the data; on the other hand, at least in its original form, k-means only performs local
optimization, and the number of clusters k has to be known in advance. Other approaches
therefore use agglomerative clustering.

Agglomerative Clustering:

Agglomerative clustering automatically determines the number of clusters by successively
merging features until a cut-off threshold t on the cluster compactness is reached. However,
both the runtime and the memory requirements are often significantly higher for
agglomerative methods.
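
A sketch of threshold-based agglomerative clustering using scikit-learn (an assumption; the
original text does not name a library), on 2-D toy data for readability:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy 2-D data for readability (real SIFT descriptors are 128-D).
rng = np.random.default_rng(0)
descriptors = rng.random((300, 2))

# distance_threshold plays the role of the cut-off t on cluster compactness,
# so the number of clusters is determined automatically rather than fixed.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=0.15,
                              linkage="average").fit(descriptors)

# One code vector per cluster: the mean of the cluster's members.
codewords = np.array([descriptors[agg.labels_ == c].mean(axis=0)
                      for c in range(agg.n_clusters_)])
print(agg.n_clusters_, codewords.shape)
```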

3. Quantize features using the visual vocabulary: The codebook is used for quantizing
features. A vector quantizer takes a feature vector and maps it to the index of the nearest
code vector in the codebook.

Codebook = visual vocabulary

Code vector = visual word
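
A minimal NumPy sketch of such a vector quantizer; the toy codebook and descriptors are
random placeholders:

```python
import numpy as np

def quantize(descriptors, codebook):
    """Map each descriptor (row) to the index of the nearest code vector.
    descriptors: (n, d), codebook: (k, d); returns n indices in [0, k)."""
    # Squared Euclidean distance between every descriptor and every code vector.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.random((5, 128))    # 5 toy visual words
desc = rng.random((10, 128))       # 10 toy descriptors
print(quantize(desc, codebook))    # one visual-word index per descriptor
```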


[Figure: visual vocabulary example, with image patch examples of visual words]


4. Image representation

Each image is represented by the histogram of visual word frequencies over its patches.

Image Classification
Given the bag-of-features representations of images from different classes, a classifier can
be trained that assigns a query image to the scene type whose learned histogram it matches
best.
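
As one possible classifier on top of the histograms, a support vector machine is a common
choice; the sketch below uses scikit-learn with random placeholder histograms and labels
(all illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: 40 training images, 4 scene types, a 200-word codebook.
rng = np.random.default_rng(0)
X_train = rng.random((40, 200))          # bag-of-features histograms
y_train = rng.integers(0, 4, size=40)    # scene-type labels

clf = SVC(kernel="rbf")  # histogram-oriented kernels (e.g., intersection) are also common
clf.fit(X_train, y_train)

X_query = rng.random((1, 200))
print(clf.predict(X_query))  # predicted scene type for the query image
```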

Disadvantages:
One notorious disadvantage of BoW is that it ignores the spatial relationships among the
patches, which are very important in image representation. Researchers have proposed
several methods to incorporate spatial information, e.g., spatial pyramid matching, which
computes bag-of-features histograms over increasingly fine sub-regions of the image.
