recognition employing MKL. In this paper we propose food image recognition employing the MKL-based feature fusion method.

3. PROPOSED METHOD

In this paper, we realize image recognition of many kinds of foods with high accuracy by introducing an MKL-based feature fusion method into food image recognition. In our recognition, we prepare 50 kinds of food categories as shown in Figure 1, and classify an unknown food image into one of these categories. So far, there has been no system which can handle this many kinds of food categories.

In the training step, we extract various kinds of image features such as bag-of-features (BoF), color histograms and Gabor texture features from the training images, and we train an MKL-SVM with the extracted features.

In the classification step, we extract image features from a given image in the same way as in the training step, and classify it into one of the given food categories with the trained MKL-SVM.

3.1. Image Features

In this paper, we use the following image features: bag-of-features, color and texture.

Bag-of-Features: The bag-of-features representation [1] has recently attracted attention in the object recognition community, since it has proved to represent image concepts very well for visual object categorization and recognition in spite of its simplicity. The basic idea of the bag-of-features representation is that a set of local image points is sampled by an interest point detector, randomly, or on a regular grid, and a visual descriptor is extracted at each point with the Scale Invariant Feature Transform (SIFT) descriptor [9]. The resulting distribution of descriptor vectors is then quantized against pre-specified codewords, and the quantized distribution vector is used as a characterization of the image. The codewords are generated in advance by k-means clustering over the SIFT vectors extracted from all the training images. That is, an image is represented by a set of “visual words”, in the same way that a text document consists of words. In this paper, we use all of the following three kinds of sampling strategies: Difference of Gaussian (DoG), random sampling, and regular grid sampling with every 10 pixels. In the experiment, about 500-1000 points, depending on the image, are sampled by the DoG keypoint detector, and we sample 3000 points by random sampling. We set the number of codewords to 1000 and 2000.
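To make the codebook construction and quantization concrete, the following is a minimal Python sketch of the BoF pipeline. It assumes the SIFT descriptors have already been extracted (dummy arrays stand in for them here), and it uses scikit-learn's KMeans, which is our implementation choice rather than one stated in the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptor_sets, n_codewords=1000, seed=0):
        """Cluster SIFT descriptors from all training images into codewords."""
        all_desc = np.vstack(descriptor_sets)            # (total_points, 128)
        return KMeans(n_clusters=n_codewords, random_state=seed, n_init=4).fit(all_desc)

    def bof_vector(descriptors, codebook):
        """Quantize one image's descriptors and return a normalized BoF histogram."""
        words = codebook.predict(descriptors)            # nearest codeword per point
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

    # Toy usage with dummy 128-dim descriptors; the paper uses codebooks of 1000 and 2000.
    train_desc = [np.random.rand(500, 128) for _ in range(10)]
    codebook = build_codebook(train_desc, n_codewords=200)
    vec = bof_vector(np.random.rand(800, 128), codebook)  # 200-dim BoF vector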
Color Histogram: A color histogram is a very common image representation. We divide an image into 2 × 2 blocks, and extract a 64-bin RGB color histogram from each block by dividing the color space into 4 × 4 × 4 bins. In total, we extract a 256-dim color feature vector from each image.
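The block-wise color histogram can be written in a few lines of NumPy; the sketch below assumes the input is an H x W x 3 uint8 RGB array, and the function name is ours.

    import numpy as np

    def color_histogram_256(img):
        """2 x 2 blocks, one 64-bin (4 x 4 x 4) RGB histogram per block -> 256-dim."""
        h, w, _ = img.shape
        feats = []
        for by in range(2):
            for bx in range(2):
                block = img[by * h // 2:(by + 1) * h // 2,
                            bx * w // 2:(bx + 1) * w // 2]
                q = (block // 64).reshape(-1, 3).astype(int)    # 4 bins per channel
                bin_ids = q[:, 0] * 16 + q[:, 1] * 4 + q[:, 2]  # 0..63
                hist = np.bincount(bin_ids, minlength=64).astype(float)
                feats.append(hist / hist.sum())
        return np.concatenate(feats)                            # shape (256,)

    vec = color_histogram_256(np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8))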
Gabor Texture Features: A Gabor texture feature represents the texture patterns of local regions at several scales and orientations. In this paper, we use 24 Gabor filters with four kinds of scales and six kinds of orientations. Before applying the Gabor filters to an image, we divide the image into 3 × 3 or 4 × 4 blocks. We apply the 24 Gabor filters to each block, average the filter responses within the block, and obtain a 24-dim Gabor feature vector for each block. Finally, we simply concatenate all the extracted 24-dim vectors into one 216-dim or 384-dim vector for each image.
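A rough sketch of this block-wise Gabor feature is given below. The filter bank is built with scikit-image's gabor filter; the specific frequencies are our assumption, since the paper only states that four scales and six orientations are used.

    import numpy as np
    from skimage.filters import gabor

    FREQS = (0.1, 0.2, 0.3, 0.4)                  # 4 scales (assumed values)
    THETAS = tuple(np.arange(6) * np.pi / 6)      # 6 orientations

    def gabor_features(gray, grid=3):
        """Mean Gabor magnitude per filter and block -> grid*grid*24-dim vector."""
        h, w = gray.shape
        feats = []
        for by in range(grid):
            for bx in range(grid):
                block = gray[by * h // grid:(by + 1) * h // grid,
                             bx * w // grid:(bx + 1) * w // grid]
                for f in FREQS:
                    for t in THETAS:
                        real, imag = gabor(block, frequency=f, theta=t)
                        feats.append(np.hypot(real, imag).mean())
        return np.asarray(feats)    # 216-dim for grid=3, 384-dim for grid=4

    vec = gabor_features(np.random.rand(240, 320), grid=3)   # toy grayscale input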
3.2. Classification with Multiple Kernel Learning

In this paper, we carry out multi-class classification for 50 categories of food images. As a classifier we use a support vector machine (SVM), and we adopt the one-vs-rest strategy for multi-class classification. In the experiment, we build 50 food detectors by regarding one category as the positive set and the other 49 categories as the negative set.
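The one-vs-rest setup can be sketched as follows with precomputed Gram matrices; using scikit-learn's SVC here is our simplification, whereas the paper uses an MKL-SVM as described below.

    import numpy as np
    from sklearn.svm import SVC

    def train_one_vs_rest(train_gram, labels, n_classes=50, C=1.0):
        """One binary detector per category: that category vs. the other 49."""
        detectors = []
        for c in range(n_classes):
            y = np.where(labels == c, 1, -1)
            detectors.append(SVC(kernel='precomputed', C=C).fit(train_gram, y))
        return detectors

    def classify(test_gram, detectors):
        """Assign each test image to the category whose detector scores highest."""
        scores = np.column_stack([d.decision_function(test_gram) for d in detectors])
        return scores.argmax(axis=1)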
A normal SVM can handle only a single kind of image feature. Therefore, in this paper, we use multiple kernel learning (MKL) to integrate various kinds of image features. With MKL, we can train an SVM with an adaptively-weighted combined kernel which fuses different kinds of image features. The combined kernel is as follows:

    K_comb(x, y) = Σ_{j=1}^{K} β_j K_j(x, y),  with β_j ≥ 0 and Σ_{j=1}^{K} β_j = 1,   (1)

where the β_j are the weights used to combine the sub-kernels K_j(x, y). MKL can estimate the optimal weights from the training data.

By preparing one sub-kernel for each image feature and estimating the weights with the MKL method, we can obtain an optimal combined kernel. With the estimated optimal combined kernel, we can efficiently train an SVM over different kinds of image features.
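As an illustration of Eq. (1), the combined kernel can be computed from per-feature Gram matrices as below; the fixed example weights are only placeholders, since MKL learns them from the training data.

    import numpy as np

    def combined_kernel(grams, betas):
        """K_comb = sum_j beta_j * K_j, with beta_j >= 0 and sum_j beta_j = 1 (Eq. 1)."""
        betas = np.asarray(betas, dtype=float)
        assert np.all(betas >= 0) and np.isclose(betas.sum(), 1.0)
        return sum(b * K for b, K in zip(betas, grams))

    # Three per-feature Gram matrices over the same 4 training images (toy data).
    grams = [np.random.rand(4, 4) for _ in range(3)]
    grams = [K @ K.T for K in grams]          # symmetrize to make valid kernels
    K_comb = combined_kernel(grams, betas=[0.5, 0.3, 0.2])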
Sonnenburg et al. [10] proposed an efficient algorithm of MKL to estimate the optimal weights and the SVM parameters simultaneously by iterating training steps of a normal SVM. This implementation is available as the SHOGUN machine learning toolbox at the Web site of the first author of [10]. In the experiment, we use the MKL library included in the SHOGUN toolbox as the implementation of MKL.
4. EXPERIMENTAL RESULTS

In the experiments, we carried out image classification for the 50 kinds of food images shown in Figure 1 to evaluate the proposed method.

First of all, we built a 50-category food image set by gathering food images from the Web and selecting 100 relevant images by hand for each category, since images on the Web are taken by many people in various real situations, which is completely different from an artificial setting for experiments. Basically we selected images containing foods which are ready to eat, as shown in Figure 1. For some images, we clipped out the regions where the target food was located. Because our targets are originally common foods in Japan, some uniquely Japanese foods are included in the dataset, which might be unfamiliar to people other than Japanese.

The image features used in the experiments were color, bag-of-features (BoF) and Gabor. As the color feature, we used a 256-dim color histogram. To extract BoF vectors, we tried three kinds of point-sampling methods (DoG, random, and grid) and two codebook sizes (1000 and 2000). In total, we prepared six kinds of BoF vectors. As Gabor texture features, we prepared 216-dim and 384-dim Gabor feature vectors extracted from 3 × 3 and 4 × 4 blocks, respectively. In total, we extracted nine types of image feature vectors from one image. With MKL, we integrated all of the nine features.
We employ an SVM for training and classification. As the kernel function of the SVM, we used the χ² kernel, which is commonly used in object recognition tasks:

    K_comb(x, y) = Σ_{f=1}^{K} β_f exp(−γ_f χ²(x_f, y_f)),  where χ²(x, y) = Σ_i (x_i − y_i)² / (x_i + y_i),

where γ_f is a kernel parameter. Zhang et al. [11] reported that the best results were obtained when the average χ² distance over all the training data is used to set the parameter γ of the χ² kernel. We followed this method to set γ.
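A minimal sketch of the χ² kernel and of the γ setting is given below. We interpret the rule from Zhang et al. [11] as γ = 1 / (average χ² distance on the training set), which is the usual normalization; that interpretation, the small epsilon guarding empty bins, and the function names are our assumptions.

    import numpy as np

    def chi2_distances(X, Y, eps=1e-10):
        """Pairwise chi-square distances: sum_i (x_i - y_i)^2 / (x_i + y_i)."""
        D = np.zeros((len(X), len(Y)))
        for i, x in enumerate(X):
            D[i] = ((x - Y) ** 2 / (x + Y + eps)).sum(axis=1)
        return D

    def chi2_kernel(X_train, X_eval=None):
        """exp(-gamma * chi2) with gamma = 1 / mean chi2 distance on the training set."""
        D_train = chi2_distances(X_train, X_train)
        gamma = 1.0 / D_train.mean()
        D = D_train if X_eval is None else chi2_distances(X_eval, X_train)
        return np.exp(-gamma * D), gamma

    # Example with random 256-dim histograms for 20 training images:
    K_train, gamma = chi2_kernel(np.random.rand(20, 256))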
For evaluation, we adopted 5-fold cross validation and used the classification rate, which corresponds to the average of the diagonal elements of the confusion matrix. To compare between categories, we used the recall rate, which is calculated as (the number of correctly classified images) / (the number of all the images in the category).
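The two evaluation measures can be computed as follows; the use of scikit-learn's confusion_matrix and the variable names are our choices.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def classification_rate(y_true, y_pred, n_classes=50):
        """Average of the diagonal of the row-normalized confusion matrix."""
        cm = confusion_matrix(y_true, y_pred, labels=range(n_classes)).astype(float)
        cm /= np.maximum(cm.sum(axis=1, keepdims=True), 1.0)   # each row sums to 1
        return cm.diagonal().mean()

    def recall_rate(y_true, y_pred, category):
        """(# correctly classified images) / (# images in the category)."""
        mask = (y_true == category)
        return (y_pred[mask] == category).mean()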
Table 1. Results from single features and fusion by MKL
    image features        classification rate
    color                 38.18%
    BoF (dog1000)         26.52%
    BoF (dog2000)         27.48%
    BoF (grid1000)        26.10%
    BoF (grid2000)        27.68%
    BoF (random1000)      28.42%
    BoF (random2000)      29.70%
    Gabor3x3              31.28%
    Gabor4x4              34.64%
    MKL (after fusion)    61.34%

Table 1 shows the classification results evaluated by the classification rate. While the best rate with a single feature was 38.18%, obtained by the color histogram, MKL-based feature fusion achieved a classification rate of 61.34% for 50-class food image categorization. If we accept at most three candidate categories, taken in descending order of the output values of the 1-vs-rest classifiers, the classification rate increases to 80.05%.

Table 2. The best five and worst five categories in the recall rate of the results by MKL.
    top 5 category              recall    worst 5 category       recall
    1 miso soup                 97%       1 simmered pork        18%
    2 soba noodle               94%       2 ginger pork saute    28%
    2 eels on rice              94%       3 toast                31%
    4 potage                    91%       4 pilaf                39%
    5 omelet with fried rice    87%       4 egg roll             39%

Table 2 shows the best five and the worst five food categories in terms of the recall rate of the results obtained by MKL, and Figure 2 shows food images of the best five categories in the recall rate. The variation of the food images belonging to the best five categories was small. Four of the best five categories exceeded a 90% recall rate, while “simmered pork” was below 20%. This indicates that recognition accuracy varies greatly depending on the food category. One of the reasons is that some of the categories are taxonomically very close and their food images are very similar to each other. For example, images of “beef curry” and those of “cutlet curry” are very similar, since both are variations of curry. Although selecting the categories to be classified is in fact not an easy task, we need to examine carefully whether all the categories used in the experiments are appropriate.

Figure 3 shows the weights estimated by MKL for the 1-vs-rest classifiers of ten categories, as well as the average weights. BoF (DoG2000) was assigned the largest weight, and BoF (random2000) was the second largest in terms of the average weight. The weights of color and Gabor were only about 10% and 7%, respectively. As a result, BoF occupied 93% of the total weight. This means that BoF is the most important feature for food image classification, and that DoG and random sampling are more effective than grid sampling for building BoF vectors. In terms of codebook size, 2000 is more useful than 1000, which shows that larger codebooks are better.
uploaded food images were taken in bad conditions, for example the food occupied only a very small region of the photo, or images taken in a dark room were too dark to recognize. Therefore, the accuracy of the prototype system might be improved by instructing users how to take an easy-to-recognize food photo.
5. CONCLUSIONS