
Utilizing Google Images for Semantic Segmentation via CRF-MAP

Rizki Perdana Rangkuti, Vektor Dewanto, Wisnu Jatmiko
Abstract: This research aims to improve semantic segmentation from the data perspective by utilizing Google Images as a source of training data. Google Images returns images related to a given keyword, and the keyword leads to images that represent the desired object; for example, the keyword car would retrieve images of cars. This allows semantic segmentation to collect many training images without labeling cost. The rule of thumb is that the more the data, the higher the accuracy. This paper challenges that argument, and it turns out that combining the VOC PASCAL dataset with a Google Images dataset gives competitive predictions in accuracy and visual quality. However, increasing the number of Google Images does not significantly improve the prediction accuracy compared to using the VOC dataset alone, and can even degrade it.
Keywords: Computer Society, IEEEtran, journal, LATEX, paper, template.

1 INTRODUCTION

The goal of semantic segmentation is to label every pixel with an object-class label from a pre-defined set of labels, e.g., car, person, and bus. Figure 1 provides some semantic segmentation examples. The three pictures in the first row are original images, typically taken by cameras. The second row contains the semantic segmentations of those images. In practice, a semantic segmentation marks the pixels with unique colors according to their labels, and the colored images are called labeled images. For example, the cow label is colored blue, the grass label green, and the person label brown. A pixel may own a label different from its neighbors. The snapshot or configuration of labels is called a labeling. Two identical original images should have the same labeling and identical labeled images.
Semantic segmentation needs a sufficient amount of data to establish a prediction model, yet recent works depend heavily on a limited number of datasets. The rule of thumb in machine learning is that the more the data, the better the prediction accuracy. Producing pixel-wise labeled training images is currently costly, since the labels are generated by hand, and the lack of training data leaves the classifier with poor accuracy.

Fig. 1: Pixel-wise semantic segmentation of [1]. The first row displays the original images. The second row displays the corresponding labeled images.

VOC PASCAL 2010 [2] and MSRC [3] are popular datasets in the topic of semantic segmentation. Both are categorized as strong labeled datasets, because they are labeled pixel-wise. VOC PASCAL 2010 and MSRC contain 1928 and 591 labeled images respectively. In a multiclass setting, those numbers are not enough for training the classifiers as the number of classes grows.

On the other hand, the Internet provides thousands of images. Some image search services provide images with metadata, which enriches each image with useful information. However, unlike VOC PASCAL 2010 and MSRC, many of them cannot directly be used as training datasets, because the label information is not directly available.


Google Images, for example, is categorized as a weakly labeled dataset, because its images are not labeled pixel-wise. Google Images returns images according to a given keyword. The desired object often appears salient in the search results when the keyword refers to a real-world object. The keyword hints at which label should be assigned, but it is not informative enough to place the labels. A salient region of an image covers the important part which belongs to a particular class of labels. If computers can recognize the salient region, then it becomes possible to place labels on the weakly labeled images; in other words, the computer can generate strongly labeled datasets from Google Images automatically. The author argues that by enabling computers to recognize saliency, Google Images can be utilized as training data for semantic segmentation.

2 CONDITIONAL RANDOM FIELDS - MAXIMUM A POSTERIORI (CRF-MAP)


A Conditional Random Field (CRF) describes an image as a graph where the pixels are represented as nodes. A CRF determines an unobserved label y_i based on an observed value x_i. The observed value x_i is simply a feature, e.g., color, texture, or location. A node may correlate with its neighbors. A Conditional Random Field is a variant of a Markov Random Field which directly estimates the posterior probability. According to the Markov-Gibbs equivalence, the posterior probability relates exponentially to the energy of labeling y given the data x:
P(y | x) = (1/Z) exp(-E(y, x))    (1)

The energy of labeling y is the sum of potential functions:

E(y, x) = Σ_i ψ_u(y_i, x_i) + Σ_i Σ_{j ∈ N_i} ψ_p(y_i, y_j, x_i)

N_i denotes the neighboring pixel indices of i. The term Z normalizes the probability to the range between 0 and 1. Computing Z for a large CRF can be intractable, because it is the sum of exponentially many terms:

Z = Σ_{y ∈ Y} exp(-E(y, x))

To obtain the optimal prediction, one should minimize the misclassification risk. According to [4], the minimal-risk estimate is equivalent to the Maximum A Posteriori (MAP) estimate:

y* = argmax_{y ∈ Y} P(y | x)

The Hammersley-Clifford theorem proves that the probability of a pixel i being labeled as y_i depends on the potentials of the neighbors of pixel i [5]. The theorem provides a practical simplification for estimating the joint probability of labeling y by specifying the potential functions of the energy.
ψ_u and ψ_p denote the unary potential and the pairwise potential respectively. A potential function can be regarded as a penalty on a label. A unary potential penalizes a label assignment based on its likelihood given the features. For example, a furry texture would penalize animal-related labels, such as cat and dog, less than man-made labels, such as aeroplane and car. A pairwise potential penalizes a pair of labels that are unlikely to coexist. For example, an aeroplane label is penalized less when its neighbor is also an aeroplane label rather than any other label. To realize the potential terms, the CRF can use a classifier for the unary potentials, such as the TextonBoost classifier proposed by [1], and the Potts model as the pairwise potential (see Equation 2):

ψ_p(y_i, y_j, x_i) = [y_i ≠ y_j]    (2)

where [·] evaluates to 1 when its argument holds and to 0 otherwise.
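For concreteness, the following minimal Python sketch (illustrative only, not the implementation used in this work) evaluates Equations 1 and 2 on a toy four-pixel chain with made-up unary penalties, and shows that the labeling maximizing the posterior is exactly the one minimizing the energy.

# Illustrative sketch: a tiny CRF on a 4-pixel chain with hypothetical unary
# penalties and a Potts pairwise penalty (Eq. 2), evaluated by brute force.
import itertools
import math

LABELS = ["cat", "grass"]            # toy label set
N_PIXELS = 4                         # pixels 0-1-2-3 connected as a chain
NEIGHBORS = [(0, 1), (1, 2), (2, 3)]

# Hypothetical unary penalties psi_u(y_i, x_i): lower = label fits the feature better.
UNARY = [
    {"cat": 0.2, "grass": 1.5},      # pixel 0 looks furry
    {"cat": 0.4, "grass": 1.0},
    {"cat": 1.2, "grass": 0.3},
    {"cat": 1.6, "grass": 0.1},      # pixel 3 looks green
]
POTTS_WEIGHT = 0.5                   # penalty when neighboring labels differ

def energy(labeling):
    """E(y, x) = sum of unary penalties + Potts pairwise penalties."""
    e = sum(UNARY[i][labeling[i]] for i in range(N_PIXELS))
    e += sum(POTTS_WEIGHT for i, j in NEIGHBORS if labeling[i] != labeling[j])
    return e

# Partition function Z: a sum over all |LABELS|^N_PIXELS labelings (exponential cost).
all_labelings = list(itertools.product(LABELS, repeat=N_PIXELS))
Z = sum(math.exp(-energy(y)) for y in all_labelings)

def posterior(labeling):
    """P(y | x) = exp(-E(y, x)) / Z  (Equation 1)."""
    return math.exp(-energy(labeling)) / Z

# MAP labeling: maximizing the posterior equals minimizing the energy.
y_map = min(all_labelings, key=energy)
print("MAP labeling:", y_map, "posterior:", posterior(y_map))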

The role of the energy function is to support training and validation of the CRF. The parameters and the potential functions are learnt from the training data such that the ground-truth labeling has the lowest energy. In the opposite direction, once the parameters and the potential functions are established, a semantic segmentation can be predicted for unseen data.
A good labeling y maximizes the posterior probability globally. Maximizing the posterior probability is equivalent to minimizing the energy of labeling y. This fact has an important consequence, because it turns the vision problem into an optimization problem:

y* = argmin_{y ∈ Y} E(y, x)

Efficient optimization methods for discrete labels are known to exist for some problem domains, such as binary segmentation and multiclass segmentation. Beyond inference, optimization is also used for learning the CRF parameters.
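As a sketch of how such energy minimization proceeds, the snippet below implements Iterated Conditional Modes (ICM), a simple coordinate-descent minimizer. It is only an illustrative stand-in; practical CRF-MAP solvers use more efficient methods such as graph cuts or mean-field inference in fully connected models [9].

# Illustrative sketch: ICM minimization of a grid CRF energy with Potts pairwise terms.
import numpy as np

def icm(unary, potts_weight, n_iters=10):
    """unary: (H, W, L) array of unary penalties; returns an (H, W) label map."""
    labels = unary.argmin(axis=2)              # start from the unary-only minimum
    H, W, L = unary.shape
    for _ in range(n_iters):
        changed = False
        for i in range(H):
            for j in range(W):
                neigh = [labels[a, b]
                         for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                         if 0 <= a < H and 0 <= b < W]
                # local energy of every candidate label at pixel (i, j)
                costs = [unary[i, j, l] + potts_weight * sum(l != n for n in neigh)
                         for l in range(L)]
                best = int(np.argmin(costs))
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:
            break                              # reached a local minimum of E(y, x)
    return labels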

3 SALIENCY FILTERS

Saliency, or salience, is the state of standing out and being easy to see. According to the Longman dictionary, an object is salient when it appears as the most important or noticeable object among other objects. Saliency of an object is interpreted as visual prominence and the state of being the central object. Salient objects tend to have a more compact appearance compared to background objects. For example, Figure 2a shows a cat as the main object in the image, dominating the visual perception over all other objects, whereas Figure 2b shows a cat and a baby where the cat is not a salient object.

Fig. 2: The picture on the left (a) shows a cat as a salient object according to human perception. The picture on the right (b) shows a scene of a cat and a baby; the cat is not a salient object, because it shares the viewer's attention with the baby.
Most methods for saliency detection use contrast information. The work of [6] recasts the contrast information as two measurements, element uniqueness and element distribution. The SLIC method of [7] is utilized to abstract the image into superpixels; this abstraction removes undesired details. The element uniqueness of each element i is computed as follows:

U_i = Σ_j ||c_i - c_j||² w^p_ij,   with   w^p_ij = (1/Z_i) exp(-||p_i - p_j||² / (2σ_p²))    (3)

p_i and c_i denote the position and the color of superpixel i respectively. Equation 4 describes the calculation of the element distribution of each element: the value D_i is the sum of squared distances between the element positions p_j and the weighted mean position μ_i, weighted by the color similarity w^c_ij:

D_i = Σ_j ||p_j - μ_i||² w^c_ij,   with   w^c_ij = (1/Z_i) exp(-||c_i - c_j||² / (2σ_c²))    (4)

The position information encodes the locality aspects. The locality weights can be viewed as a Gaussian filtering kernel, which allows the sums above to be approximated and their complexity reduced from O(N²) to O(N) through the permutohedral lattice [6]. The saliency level S_i of element i is formulated as:

S_i = U_i · exp(-k · D_i)    (5)
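A naive quadratic-time Python sketch of Equations 3-5 over superpixel means is given below for illustration only; the reference method [6] evaluates the same sums in O(N) with the permutohedral lattice, and the values of sigma_p, sigma_c, and k here are assumed examples rather than the parameters used in this work.

# Illustrative sketch: element uniqueness U_i (Eq. 3), element distribution D_i (Eq. 4),
# and saliency S_i (Eq. 5), computed naively in O(N^2) over superpixel means.
import numpy as np

def saliency_filters(colors, positions, sigma_p=0.25, sigma_c=20.0, k=6.0):
    """colors: (N, 3) mean Lab colors; positions: (N, 2) normalized centroid positions."""
    c2 = np.sum((colors[:, None, :] - colors[None, :, :]) ** 2, axis=2)        # ||c_i - c_j||^2
    p2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=2)  # ||p_i - p_j||^2

    # Element uniqueness (Eq. 3): color contrast weighted by spatial proximity.
    w_p = np.exp(-p2 / (2 * sigma_p ** 2))
    w_p /= w_p.sum(axis=1, keepdims=True)            # the 1/Z_i normalization
    U = (c2 * w_p).sum(axis=1)

    # Element distribution (Eq. 4): spatial spread of each color, weighted by color similarity.
    w_c = np.exp(-c2 / (2 * sigma_c ** 2))
    w_c /= w_c.sum(axis=1, keepdims=True)
    mu = w_c @ positions                             # weighted mean position mu_i
    D = (np.sum((positions[None, :, :] - mu[:, None, :]) ** 2, axis=2) * w_c).sum(axis=1)

    # Saliency (Eq. 5): elements that are unique and spatially compact are salient.
    return U * np.exp(-k * D)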

4 DATASETS

VOC PASCAL 2010 was introduced in the VOC PASCAL 2010 competition [2]. It provides raw JPEG images, PNG annotation files, and an evaluation program written in Matlab. A set of 1928 JPEG and PNG file pairs is provided as the ground truth. Every pixel is classified into one of the classes aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor, and background.

VOC PASCAL 2010 is split into training, validation, and testing portions. The split sizes follow those determined in [8]: 600 images are used for training, 364 for validation, and 964 for testing.
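A minimal sketch of such a split is shown below. It only illustrates a reproducible 600/364/964 partition of the image IDs; the directory layout and file naming are assumptions, not the exact procedure used in this work.

# Illustrative sketch: reproducible 600/364/964 split of the VOC PASCAL 2010 image IDs.
import random
from pathlib import Path

ids = sorted(p.stem for p in Path("VOC2010/SegmentationClass").glob("*.png"))  # assumed layout
random.Random(0).shuffle(ids)                    # fixed seed for reproducibility
train, val, test = ids[:600], ids[600:964], ids[964:1928]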


Fig. 5: Saliency filters assign a saliency level to each pixel. A binary segmentation uses these levels as potential functions to separate the salient region from the non-salient region (the rightmost image).
Fig. 3: Samples of VOC images. (a) The original image (JPEG file). (b) The ground truth (PNG file).

5 GOOGLE IMAGES TRANSFORMATION

A keyword can be utilized to represent a class. Google Images returns images based on the keyword. The author observes that the salient regions of the returned images generally depict the object named by the keyword. Meanwhile, the saliency filters can guide a computer to recognize salient regions. This creates the possibility of segmenting the salient regions and regarding them as belonging to the object class that the keyword refers to. Figure 4 illustrates the Google Images transformation.

Fig. 4: Google Images can be transformed into a strongly labeled dataset (query: aeroplane → Google Images → Saliency Detector → Labeled Images). A keyword determines the labels of the images; the transformation regards the foreground object as the queried object class (i.e., aeroplane). The aeroplane label class is colored red, while the background label class is colored black.

The Google Images transformation performs a binary segmentation to separate the salient regions from the background. The binary segmentation employs a CRF and the saliency filters. From Equation 5, S_i can be regarded as a saliency map that rates the saliency level of every pixel. Figure 5 shows the results of the saliency filters; the saliency map is the image in the middle. It can be used to segment the salient part of an image through a binary segmentation where the unary potential is derived from S_i [6]. The energy formulation of the CRF is written as follows:

y* = argmin_{y ∈ Y} E(y, l_f, x)

E(y, l_f, x) = Σ_i ψ_saliency(y_i, l_f, x) + Σ_i Σ_{j ∈ N_i} [y_i ≠ y_j]

ψ_saliency(y_i, l_f, x) = 1 - saliency(i, x)   if y_i = l_f
                        = saliency(i, x)       otherwise

ψ_saliency denotes a unary potential function that takes S_i as its value, and l_f informs the inference algorithm which label class the CRF should assign to the foreground region. The rightmost image in Figure 5 is the result of the binary segmentation: since l_f there has the value doll, the foreground region is regarded as the figure of a doll and the rest is labeled as background. Figure 6 shows some examples of transformation results.
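The sketch below illustrates the saliency-based unary term above. It deliberately ignores the Potts pairwise term (which in the full model smooths the segment boundary) and therefore reduces to a per-pixel decision; the function name and the assumption that the saliency map is rescaled to [0, 1] are illustrative choices, not the original implementation.

# Illustrative sketch: foreground/background labeling from the saliency unary penalties,
# without the pairwise smoothing term (equivalent to thresholding the saliency map at 0.5).
import numpy as np

def transform_to_labels(saliency_map, foreground_label=1, background_label=0):
    """saliency_map: (H, W) array of S_i values rescaled to [0, 1]; returns a label image."""
    penalty_fg = 1.0 - saliency_map      # psi_saliency(y_i = l_f): cheap where the pixel is salient
    penalty_bg = saliency_map            # psi_saliency(y_i = background)
    return np.where(penalty_fg < penalty_bg, foreground_label, background_label)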

6 EXPERIMENT RESULTS

The experiment scenarios aim to investigate the behaviour of the semantic segmentation under different settings. The procedure of the experiment mainly consists of two steps, training and testing. In the training phase, a CRF model is learned from the given datasets. In the testing phase, the CRF model is used to predict unseen samples. There are three experiment scenarios. Each scenario follows the steps described above, but differs in the dataset composition used for training.

Fig. 6: The results of the Google Images transformation. (a) Original Google Image. (b) Labeled Image.
The first scenario compares two cases: an experiment using the combination of VOC PASCAL 2010 and Google Images, and an experiment using the VOC PASCAL 2010 dataset alone.

The second scenario compares two cases: an experiment using the VOC PASCAL 2010 dataset and an experiment using the Google Images dataset only.

The third scenario compares several cases, where each case uses a certain number of Google Images only; this paper uses 600, 700, 800, and 900 Google Images respectively. Performance is measured with averaged class accuracy (abbreviated as CA) and global accuracy (abbreviated as GA).
Table 1 summarizes the results of the first scenario. The first experiment (CRF+VOC) uses VOC PASCAL 2010 as the training dataset, while the second experiment (CRF+VOC+Google Images) uses VOC PASCAL 2010 and Google Images together. The first experiment achieves 11.0450% CA and 79.213% GA. In the second experiment, the CA and GA increase by 0.6592% and 0.0860% respectively. For comparison, the related work [9] reported 13% averaged CA using unary potentials without pairwise potentials. The difference in accuracy between this result and the result of the original work is due to the different image scale: the experiments here use images rescaled to half of the original size, whereas the original work employs full-sized images. This result shows that Google Images improves the prediction accuracy. Figure 7 shows that the second experiment tends to predict correctly parts that the baseline method misses, despite imperfect segmentation boundaries. One possible reason is that Google Images introduces novel characteristics that VOC PASCAL 2010 does not provide to the CRF. Based on this result, the combination of VOC and Google Images improves the accuracy of the semantic segmentation by broadening the characteristics of the classes.
Experiment Name            Averaged CA (%)   GA (%)
(CRF+VOC)                  11.0450           79.213
(CRF+VOC+Google Images)    11.7042           79.299

TABLE 1: Summarized results of the first scenario.
Table 2 details the performance for each class. The VOC PASCAL 2010 and Google Images combination fails to improve the CA of several classes, namely aeroplane, cat, chair, dog, horse, person, potted plant, sofa, and background.
Table 3 summarizes the results of the second scenario. The first experiment (CRF+VOC) uses VOC PASCAL 2010 as the training dataset, while the second experiment (CRF+Google Images) uses Google Images only. The first experiment achieves 11.045% CA and 79.213% GA. In the second experiment, the CA and GA decrease by 1.789% and 5.854% respectively.



No   Classes        (CRF+VOC)   (CRF+VOC+Google Images)
1    aeroplane      15.1694     12.3211
2    bicycle        0.0000      0.9129
3    bird           3.1685      3.5391
4    boat           5.3229      10.4128
5    bottle         0.7876      3.5560
6    bus            10.5571     14.9576
7    car            14.3507     16.6133
8    cat            12.5929     10.6088
9    chair          4.0795      2.0624
10   cow            1.6487      6.1896
11   diningtable    3.9595      4.0120
12   dog            5.1569      2.2727
13   horse          4.7258      4.7002
14   motorbike      14.9855     16.2211
15   person         24.1621     22.7428
16   pottedplant    1.1087      0.7029
17   sheep          11.7376     12.8854
18   sofa           1.5923      1.2566
19   train          15.5669     15.4514
20   tvmonitor      3.4973      6.9930
21   background     77.7755     77.3758
     Averaged CA    11.0450     11.7042

TABLE 2: Per-class prediction accuracy (%) in the first scenario, comparing the (CRF+VOC) case and the (CRF+VOC+Google Images) case. The VOC PASCAL 2010 and Google Images combination fails to improve the CA of the aeroplane, cat, chair, dog, horse, person, potted plant, sofa, and background classes.
Experiment Name          Averaged CA (%)   GA (%)
(CRF+VOC)                11.045            79.213
(CRF+Google Images)      9.256             73.359

TABLE 3: Summarized results of the second scenario.

No   Classes        (CRF+VOC)   (CRF+Google Images)
1    aeroplane      15.169      8.012
2    bicycle        0.000       0.058
3    bird           3.169       2.138
4    boat           5.323       8.056
5    bottle         0.788       0.000
6    bus            10.557      17.212
7    car            14.351      4.112
8    cat            12.593      9.461
9    chair          4.080       0.554
10   cow            1.649       6.106
11   diningtable    3.960       0.654
12   dog            5.157       1.753
13   horse          4.726       3.891
14   motorbike      14.986      7.364
15   person         24.162      9.664
16   pottedplant    1.109       2.207
17   sheep          11.738      13.959
18   sofa           1.592       0.721
19   train          15.567      17.521
20   tvmonitor      3.497       5.352
21   background     77.776      78.314
     Averaged CA    11.045      9.256

TABLE 4: Per-class prediction accuracy (%) in the second scenario, comparing the (CRF+VOC) case and the (CRF+Google Images) case.

This result confirms that Google Images alone cannot surpass the baseline accuracy. The reason is that Google Images lacks class variability within a single image: each image typically contains one particular object class plus the background class. This removes the opportunity for the CRF to learn the correlations between object classes, so it can hardly perform multiclass segmentation in the testing phase.

Table 4 details the performance for each class. Training with the Google Images dataset alone fails in most classes.
Table 5 summarizes the results of the third scenario. The prediction accuracy decreases as the number of Google Images increases. The number of images needed to achieve an optimal result might depend on the complexity of the objects.

No   Classes           600       700       800       900
1    aeroplane         8.012     7.610     11.095    9.477
2    bicycle           0.058     0.465     0.880     1.431
3    bird              2.138     0.743     0.566     0.959
4    boat              8.056     10.260    7.982     7.810
5    bottle            0.000     0.041     0.000     1.266
6    bus               17.212    19.210    19.067    18.783
7    car               4.112     4.845     6.014     6.013
8    cat               9.461     7.204     6.695     8.359
9    chair             0.554     0.260     0.085     0.379
10   cow               6.106     7.961     5.635     8.121
11   diningtable       0.654     0.912     1.153     0.623
12   dog               1.753     3.890     2.775     2.566
13   horse             3.891     4.651     4.198     5.464
14   motorbike         7.364     4.887     6.762     5.064
15   person            9.664     9.677     7.442     7.479
16   pottedplant       2.207     2.039     3.603     1.691
17   sheep             13.959    11.578    10.771    9.975
18   sofa              0.721     0.922     1.242     0.371
19   train             17.521    17.446    13.478    12.826
20   tvmonitor         5.352     3.548     5.188     4.485
21   background        75.578    75.459    75.690    75.170
     Averaged CA       9.256     9.219     9.063     8.967
     Global Accuracy   73.359    73.503    73.770    73.046

TABLE 5: Comparison of prediction accuracy (%) for different numbers of Google Images.

7 CONCLUSIONS AND FUTURE WORKS

Fig. 7: Example results from the first scenario. (a) The original images. (b) The ground-truth labeled images. (c) The results of the first experiment (CRF+VOC). (d) The results of the second experiment (CRF+VOC+Google Images).

This research proposes Google Images as a source of training data. Google Images is converted into a strongly labeled dataset by saliency filtering. The performance improvement varies across the scenarios. Combining the VOC PASCAL 2010 dataset and the Google Images dataset increases the prediction accuracy; Google Images helps the semantic segmentation by enlarging the class characteristics. On the other hand, using Google Images alone does not improve the performance. Furthermore, adding more Google Images does not lead to better performance.
The author realizes that this research leaves many things to explore. It requires an investigation of the effective number of Google Images, because the experiments show that adding more data does not increase the performance. The experiments have not established the exact rate of improvement, i.e., how many Google Images are needed to achieve a certain level of accuracy. In addition, the keywords can also affect the search results; there might be a better keyword to describe an object, which could give a more decent result. Choosing the right keyword would be an interesting problem.


REFERENCES

[1] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," Int. J. Comput. Vision, vol. 81, no. 1, pp. 2-23, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1007/s11263-007-0109-1
[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results," http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html, 2010.
[3] Research.microsoft.com, "Object class recognition - Microsoft Research," 2015. [Online]. Available: http://research.microsoft.com/en-us/projects/objectclassrecognition/
[4] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 6, pp. 721-741, Nov. 1984. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.1984.4767596
[5] J. M. Hammersley and P. Clifford, "Markov fields on finite graphs and lattices," 1971.
[6] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in CVPR, 2012, pp. 733-740.
[7] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels," EPFL, Tech. Rep. 149300, June 2010.
[8] P. Krähenbühl, "Efficient inference in fully connected CRFs with Gaussian edge potentials," 2014. [Online]. Available: http://graphics.stanford.edu/projects/densecrf/
[9] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 109-117. [Online]. Available: http://papers.nips.cc/paper/4296-efficient-inference-in-fully-connected-crfs-with-gaussian-edge-potentials.pdf
