
Utilizing Google Images for Semantic Segmentation via CRF-MAP

Rizki Perdana Rangkuti, Vektor Dewanto, Wisnu Jatmiko
Abstract: This research aims to improve semantic segmentation from the data perspective by utilizing Google Images as a source of training data. Google Images returns images related to a given keyword, and the keyword leads to images that represent the desired object; for example, the keyword car would retrieve images of cars. This allows semantic segmentation to collect many training images without labeling cost. The rule of thumb is that the more the data, the higher the accuracy. This paper challenges that argument, and it turns out that combining the VOC PASCAL dataset with a Google Images dataset gives competitive predictions in accuracy and visual quality. However, increasing the number of Google Images does not significantly improve the prediction accuracy compared to using the VOC dataset alone, and can even degrade it.
Keywords: Computer Society, IEEEtran, journal, LATEX, paper, template.

1 INTRODUCTION

The goal of semantic segmentation is to label every pixel with an object-class label from a pre-defined set of labels, e.g., car, person, and bus. Figure 1 provides some semantic segmentation examples. The three pictures in the first row are original images, typically taken by cameras. The second row contains the semantic segmentations of those images. In practice, a semantic segmentation marks the pixels with unique colors according to their labels, and the colored images are called labeled images. For example, the cow label is colored blue, the grass label green, and the person label brown. A pixel may own a label different from its neighbors. The snapshot or configuration of labels is called a labeling. Two identical original images should have the same labeling and identical labeled images.
Semantic segmentation needs a sufficient amount of data to establish a prediction model, yet recent works depend heavily on a limited number of datasets. The rule of thumb in machine learning is that the more the data, the better the prediction accuracy. Producing pixel-wise labeled training images is currently costly, since the labels are generated by hand, and the lack of training data leaves the classifier with poor accuracy.

Fig. 1: Pixel-wise semantic segmentation of [1]. The first row displays the original images. The second row displays the corresponding labeled images.

VOC PASCAL 2010 [2] and MSRC [3] are popular datasets in the topic of semantic segmentation. Both are categorized as strong labeled datasets, because they are labeled pixel-wise. VOC PASCAL 2010 and MSRC contain 1928 and 591 labeled images respectively. In a multiclass setting, those numbers are not enough for training the classifiers as the number of classes grows.

On the other hand, the Internet provides thousands of images. Some image search services provide images with metadata, which enriches each image with useful information. However, unlike VOC PASCAL 2010 and MSRC, many of them cannot directly be used as training datasets, because the label information is not directly available.


Google Images, for example, is categorized as a weakly labeled dataset, because its images are not labeled pixel-wise. Google Images returns images according to a given keyword. The desired object often appears salient in the search results when the keyword refers to a real-world object. The keyword hints at which label should be assigned, but it is not informative enough to place the labels. A salient region of an image covers the important part which belongs to a particular class of labels. If computers can recognize the salient region, then it becomes possible to place labels on the weakly labeled images; in other words, the computer can generate strongly labeled datasets from Google Images automatically. The author argues that by enabling computers to recognize saliency, Google Images can be utilized as training data for semantic segmentation.

2 CONDITIONAL RANDOM FIELDS - MAXIMUM A POSTERIORI (CRF-MAP)


A Conditional Random Field (CRF) describes an image as a graph where the pixels are represented as nodes. A CRF determines an unobserved label y_i based on an observed value x_i. The observed value x_i is simply a feature, e.g., color, texture, or location. A node may correlate with its neighbors. A Conditional Random Field is a variant of a Markov Random Field which directly estimates the posterior probability. According to the Markov-Gibbs equivalence, the posterior probability relates exponentially to the energy of labeling y given the data x:
P(y | x) = (1/Z) exp(-E(y, x))    (1)

The energy of labeling y is the sum of potential functions:

E(y, x) = Σ_i ψ_u(y_i, x_i) + Σ_i Σ_{j ∈ N_i} ψ_p(y_i, y_j, x_i)

N_i denotes the neighboring pixel indices of i. The term Z normalizes the probability to the range between 0 and 1. Computing Z for a large CRF can be intractable, because it is the sum of exponentially many terms:

Z = Σ_{y ∈ Y} exp(-E(y, x))

To obtain the optimal prediction, one should minimize the misclassification risk. According to [4], the minimal-risk estimate is equivalent to the Maximum A Posteriori (MAP) estimate:

y* = argmax_{y ∈ Y} P(y | x)

The Hammersley-Clifford theorem proves that the probability of a pixel i being labeled as y_i depends on the potentials of the neighbors of pixel i [5]. The theorem provides a practical simplification for estimating the joint probability of labeling y by specifying the potential functions of the energy.
ψ_u and ψ_p denote the unary potential and the pairwise potential respectively. A potential function can be regarded as a penalty on a label. A unary potential penalizes a label assignment based on its likelihood given the features. For example, a furry texture would penalize animal-related labels, such as cat and dog, less than man-made labels, such as aeroplane and car. A pairwise potential penalizes a pair of labels that are unlikely to coexist. For example, an aeroplane label is penalized less when its neighbor is also an aeroplane label rather than any other label. To realize the potential terms, the CRF can use a classifier for the unary potentials, such as the TextonBoost classifier proposed by [1], and the Potts model as the pairwise potential (see Equation 2):

ψ_p(y_i, y_j, x_i) = [y_i ≠ y_j]    (2)

where [·] evaluates to 1 when its argument holds and to 0 otherwise.
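For concreteness, the following minimal Python sketch (illustrative only, not the implementation used in this work) evaluates Equations 1 and 2 on a toy four-pixel chain with made-up unary penalties, and shows that the labeling maximizing the posterior is exactly the one minimizing the energy.

# Illustrative sketch: a tiny CRF on a 4-pixel chain with hypothetical unary
# penalties and a Potts pairwise penalty (Eq. 2), evaluated by brute force.
import itertools
import math

LABELS = ["cat", "grass"]            # toy label set
N_PIXELS = 4                         # pixels 0-1-2-3 connected as a chain
NEIGHBORS = [(0, 1), (1, 2), (2, 3)]

# Hypothetical unary penalties psi_u(y_i, x_i): lower = label fits the feature better.
UNARY = [
    {"cat": 0.2, "grass": 1.5},      # pixel 0 looks furry
    {"cat": 0.4, "grass": 1.0},
    {"cat": 1.2, "grass": 0.3},
    {"cat": 1.6, "grass": 0.1},      # pixel 3 looks green
]
POTTS_WEIGHT = 0.5                   # penalty when neighboring labels differ

def energy(labeling):
    """E(y, x) = sum of unary penalties + Potts pairwise penalties."""
    e = sum(UNARY[i][labeling[i]] for i in range(N_PIXELS))
    e += sum(POTTS_WEIGHT for i, j in NEIGHBORS if labeling[i] != labeling[j])
    return e

# Partition function Z: a sum over all |LABELS|^N_PIXELS labelings (exponential cost).
all_labelings = list(itertools.product(LABELS, repeat=N_PIXELS))
Z = sum(math.exp(-energy(y)) for y in all_labelings)

def posterior(labeling):
    """P(y | x) = exp(-E(y, x)) / Z  (Equation 1)."""
    return math.exp(-energy(labeling)) / Z

# MAP labeling: maximizing the posterior equals minimizing the energy.
y_map = min(all_labelings, key=energy)
print("MAP labeling:", y_map, "posterior:", posterior(y_map))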

The role of the energy function is to support training and validation of the CRF. The parameters and the potential functions are learnt from the training data such that the ground-truth labeling has the lowest energy. In the opposite direction, once the parameters and the potential functions are established, a semantic segmentation can be predicted for unseen data.
A good labeling y maximizes the posterior probability globally. Maximizing the posterior probability is equivalent to minimizing the energy of labeling y. This fact has an important consequence, because it turns the vision problem into an optimization problem:

y* = argmin_{y ∈ Y} E(y, x)

Efficient optimization methods for discrete labels are known to exist for some problem domains, such as binary segmentation and multiclass segmentation. Beyond inference, optimization is also used for learning the CRF parameters.
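As a sketch of how such energy minimization proceeds, the snippet below implements Iterated Conditional Modes (ICM), a simple coordinate-descent minimizer. It is only an illustrative stand-in; practical CRF-MAP solvers use more efficient methods such as graph cuts or mean-field inference in fully connected models [9].

# Illustrative sketch: ICM minimization of a grid CRF energy with Potts pairwise terms.
import numpy as np

def icm(unary, potts_weight, n_iters=10):
    """unary: (H, W, L) array of unary penalties; returns an (H, W) label map."""
    labels = unary.argmin(axis=2)              # start from the unary-only minimum
    H, W, L = unary.shape
    for _ in range(n_iters):
        changed = False
        for i in range(H):
            for j in range(W):
                neigh = [labels[a, b]
                         for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                         if 0 <= a < H and 0 <= b < W]
                # local energy of every candidate label at pixel (i, j)
                costs = [unary[i, j, l] + potts_weight * sum(l != n for n in neigh)
                         for l in range(L)]
                best = int(np.argmin(costs))
                if best != labels[i, j]:
                    labels[i, j] = best
                    changed = True
        if not changed:
            break                              # reached a local minimum of E(y, x)
    return labels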

3 SALIENCY FILTERS

Saliency, or salience, is the state of standing out and being easy to see. According to the Longman dictionary, an object is salient when it appears as the most important or noticeable object among other objects. Saliency of an object is interpreted as visual prominence and the state of being the central object. Salient objects tend to have a more compact appearance compared to background objects. For example, Figure 2a shows a cat as the main object in the image, dominating the visual perception over all other objects, whereas Figure 2b shows a cat and a baby where the cat is not a salient object.

Fig. 2: The picture on the left (a) shows a cat as a salient object according to human perception. The picture on the right (b) shows a scene of a cat and a baby; the cat is not a salient object, because it shares the viewer's attention with the baby.
Most methods for saliency detection use contrast information. The work of [6] recasts the contrast information as two measurements, element uniqueness and element distribution. The SLIC method of [7] is utilized to abstract the image into superpixels; this abstraction removes undesired details. The element uniqueness of each element i is computed as follows:

U_i = Σ_j ||c_i - c_j||² w^p_ij,   with   w^p_ij = (1/Z_i) exp(-||p_i - p_j||² / (2σ_p²))    (3)

p_i and c_i denote the position and the color of superpixel i respectively. Equation 4 describes the calculation of the element distribution of each element: the value D_i is the sum of squared distances between the element positions p_j and the weighted mean position μ_i, weighted by the color similarity w^c_ij:

D_i = Σ_j ||p_j - μ_i||² w^c_ij,   with   w^c_ij = (1/Z_i) exp(-||c_i - c_j||² / (2σ_c²))    (4)

The position information encodes the locality aspects. The locality weights can be viewed as a Gaussian filtering kernel, which allows the sums above to be approximated and their complexity reduced from O(N²) to O(N) through the permutohedral lattice [6]. The saliency level S_i of element i is formulated as:

S_i = U_i · exp(-k · D_i)    (5)
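A naive quadratic-time Python sketch of Equations 3-5 over superpixel means is given below for illustration only; the reference method [6] evaluates the same sums in O(N) with the permutohedral lattice, and the values of sigma_p, sigma_c, and k here are assumed examples rather than the parameters used in this work.

# Illustrative sketch: element uniqueness U_i (Eq. 3), element distribution D_i (Eq. 4),
# and saliency S_i (Eq. 5), computed naively in O(N^2) over superpixel means.
import numpy as np

def saliency_filters(colors, positions, sigma_p=0.25, sigma_c=20.0, k=6.0):
    """colors: (N, 3) mean Lab colors; positions: (N, 2) normalized centroid positions."""
    c2 = np.sum((colors[:, None, :] - colors[None, :, :]) ** 2, axis=2)        # ||c_i - c_j||^2
    p2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=2)  # ||p_i - p_j||^2

    # Element uniqueness (Eq. 3): color contrast weighted by spatial proximity.
    w_p = np.exp(-p2 / (2 * sigma_p ** 2))
    w_p /= w_p.sum(axis=1, keepdims=True)            # the 1/Z_i normalization
    U = (c2 * w_p).sum(axis=1)

    # Element distribution (Eq. 4): spatial spread of each color, weighted by color similarity.
    w_c = np.exp(-c2 / (2 * sigma_c ** 2))
    w_c /= w_c.sum(axis=1, keepdims=True)
    mu = w_c @ positions                             # weighted mean position mu_i
    D = (np.sum((positions[None, :, :] - mu[:, None, :]) ** 2, axis=2) * w_c).sum(axis=1)

    # Saliency (Eq. 5): elements that are unique and spatially compact are salient.
    return U * np.exp(-k * D)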

4 DATASETS

VOC PASCAL 2010 was introduced in the VOC PASCAL 2010 competition [2]. It provides raw JPEG images, PNG annotation files, and an evaluation program written in Matlab. A set of 1928 JPEG and PNG file pairs is provided as the ground truth. Every pixel is classified into one of the classes aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor, and background.

VOC PASCAL 2010 is split into training, validation, and testing portions. The split sizes follow those determined in [8]: 600 images are used for training, 364 for validation, and 964 for testing.
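A minimal sketch of such a split is shown below. It only illustrates a reproducible 600/364/964 partition of the image IDs; the directory layout and file naming are assumptions, not the exact procedure used in this work.

# Illustrative sketch: reproducible 600/364/964 split of the VOC PASCAL 2010 image IDs.
import random
from pathlib import Path

ids = sorted(p.stem for p in Path("VOC2010/SegmentationClass").glob("*.png"))  # assumed layout
random.Random(0).shuffle(ids)                    # fixed seed for reproducibility
train, val, test = ids[:600], ids[600:964], ids[964:1928]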


Fig. 5: Saliency filters assign a saliency level to each pixel. A binary segmentation uses these levels as potential functions to separate the salient region from the non-salient region (the rightmost image).
Fig. 3: Samples of VOC images. (a) The original image (JPEG file). (b) The ground truth (PNG file).

5 GOOGLE IMAGES TRANSFORMATION

A keyword can be utilized to represent a class. Google Images returns images based on the keyword. The author observes that the salient regions of the returned images generally depict the object named by the keyword. Meanwhile, the saliency filters can guide a computer to recognize salient regions. This creates the possibility of segmenting the salient regions and regarding them as belonging to the object class that the keyword refers to. Figure 4 illustrates the Google Images transformation.

Fig. 4: Google Images can be transformed into a strongly labeled dataset (query: aeroplane → Google Images → Saliency Detector → Labeled Images). A keyword determines the labels of the images; the transformation regards the foreground object as the queried object class (i.e., aeroplane). The aeroplane label class is colored red, while the background label class is colored black.

The Google Images transformation performs a binary segmentation to separate the salient regions from the background. The binary segmentation employs a CRF and the saliency filters. From Equation 5, S_i can be regarded as a saliency map that rates the saliency level of every pixel. Figure 5 shows the results of the saliency filters; the saliency map is the image in the middle. It can be used to segment the salient part of an image through a binary segmentation where the unary potential is derived from S_i [6]. The energy formulation of the CRF is written as follows:

y* = argmin_{y ∈ Y} E(y, l_f, x)

E(y, l_f, x) = Σ_i ψ_saliency(y_i, l_f, x) + Σ_i Σ_{j ∈ N_i} [y_i ≠ y_j]

ψ_saliency(y_i, l_f, x) = 1 - saliency(i, x)   if y_i = l_f
                        = saliency(i, x)       otherwise

ψ_saliency denotes a unary potential function that takes S_i as its value, and l_f informs the inference algorithm which label class the CRF should assign to the foreground region. The rightmost image in Figure 5 is the result of the binary segmentation: since l_f there has the value doll, the foreground region is regarded as the figure of a doll and the rest is labeled as background. Figure 6 shows some examples of transformation results.
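The sketch below illustrates the saliency-based unary term above. It deliberately ignores the Potts pairwise term (which in the full model smooths the segment boundary) and therefore reduces to a per-pixel decision; the function name and the assumption that the saliency map is rescaled to [0, 1] are illustrative choices, not the original implementation.

# Illustrative sketch: foreground/background labeling from the saliency unary penalties,
# without the pairwise smoothing term (equivalent to thresholding the saliency map at 0.5).
import numpy as np

def transform_to_labels(saliency_map, foreground_label=1, background_label=0):
    """saliency_map: (H, W) array of S_i values rescaled to [0, 1]; returns a label image."""
    penalty_fg = 1.0 - saliency_map      # psi_saliency(y_i = l_f): cheap where the pixel is salient
    penalty_bg = saliency_map            # psi_saliency(y_i = background)
    return np.where(penalty_fg < penalty_bg, foreground_label, background_label)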

6 EXPERIMENT RESULTS

The experiment scenarios aim to investigate the behaviour of the semantic segmentation under different settings. The procedure of the experiment mainly consists of two steps, training and testing. In the training phase, a CRF model is learned from the given datasets. In the testing phase, the CRF model is used to predict unseen samples. There are three experiment scenarios. Each scenario follows the steps described above, but differs in the dataset composition used for training.

Fig. 6: The results of the Google Images transformation. (a) Original Google Image. (b) Labeled Image.
The first scenario compares two cases: an experiment using the combination of VOC PASCAL 2010 and Google Images, and an experiment using the VOC PASCAL 2010 dataset alone.

The second scenario compares two cases: an experiment using the VOC PASCAL 2010 dataset and an experiment using the Google Images dataset only.

The third scenario compares several cases, where each case uses a certain number of Google Images only; this paper uses 600, 700, 800, and 900 Google Images respectively. Performance is measured with averaged class accuracy (abbreviated as CA) and global accuracy (abbreviated as GA).
Table 1 summarizes the results of the first scenario. The first experiment (CRF+VOC) uses VOC PASCAL 2010 as the training dataset, while the second experiment (CRF+VOC+Google Images) uses VOC PASCAL 2010 and Google Images together. The first experiment achieves 11.0450% CA and 79.213% GA. In the second experiment, the CA and GA increase by 0.6592% and 0.0860% respectively. For comparison, the related work [9] reported 13% averaged CA using unary potentials without pairwise potentials. The difference in accuracy between this result and the result of the original work is due to the different image scale: the experiments here use images rescaled to half of the original size, whereas the original work employs full-sized images. This result shows that Google Images improves the prediction accuracy. Figure 7 shows that the second experiment tends to predict correctly parts that the baseline method misses, despite imperfect segmentation boundaries. One possible reason is that Google Images introduces novel characteristics that VOC PASCAL 2010 does not provide to the CRF. Based on this result, the combination of VOC and Google Images improves the accuracy of the semantic segmentation by broadening the characteristics of the classes.
Experiment Name            Averaged CA (%)   GA (%)
(CRF+VOC)                  11.0450           79.213
(CRF+VOC+Google Images)    11.7042           79.299

TABLE 1: Summarized results of the first scenario.
Table 2 details the performance for each class. The VOC PASCAL 2010 and Google Images combination fails to improve the CA of several classes, namely aeroplane, cat, chair, dog, horse, person, potted plant, sofa, and background.
Table 3 summarizes the results of the second scenario. The first experiment (CRF+VOC) uses VOC PASCAL 2010 as the training dataset, while the second experiment (CRF+Google Images) uses Google Images only. The first experiment achieves 11.045% CA and 79.213% GA. In the second experiment, the CA and GA decrease by 1.789% and 5.854% respectively.



No   Classes        (CRF+VOC)   (CRF+VOC+Google Images)
1    aeroplane      15.1694     12.3211
2    bicycle        0.0000      0.9129
3    bird           3.1685      3.5391
4    boat           5.3229      10.4128
5    bottle         0.7876      3.5560
6    bus            10.5571     14.9576
7    car            14.3507     16.6133
8    cat            12.5929     10.6088
9    chair          4.0795      2.0624
10   cow            1.6487      6.1896
11   diningtable    3.9595      4.0120
12   dog            5.1569      2.2727
13   horse          4.7258      4.7002
14   motorbike      14.9855     16.2211
15   person         24.1621     22.7428
16   pottedplant    1.1087      0.7029
17   sheep          11.7376     12.8854
18   sofa           1.5923      1.2566
19   train          15.5669     15.4514
20   tvmonitor      3.4973      6.9930
21   background     77.7755     77.3758
     Averaged CA    11.0450     11.7042

TABLE 2: Per-class prediction accuracy (%) in the first scenario, comparing the (CRF+VOC) case and the (CRF+VOC+Google Images) case. The VOC PASCAL 2010 and Google Images combination fails to improve the CA of the aeroplane, cat, chair, dog, horse, person, potted plant, sofa, and background classes.
Experiment Name          Averaged CA (%)   GA (%)
(CRF+VOC)                11.045            79.213
(CRF+Google Images)      9.256             73.359

TABLE 3: Summarized results of the second scenario.

No   Classes        (CRF+VOC)   (CRF+Google Images)
1    aeroplane      15.169      8.012
2    bicycle        0.000       0.058
3    bird           3.169       2.138
4    boat           5.323       8.056
5    bottle         0.788       0.000
6    bus            10.557      17.212
7    car            14.351      4.112
8    cat            12.593      9.461
9    chair          4.080       0.554
10   cow            1.649       6.106
11   diningtable    3.960       0.654
12   dog            5.157       1.753
13   horse          4.726       3.891
14   motorbike      14.986      7.364
15   person         24.162      9.664
16   pottedplant    1.109       2.207
17   sheep          11.738      13.959
18   sofa           1.592       0.721
19   train          15.567      17.521
20   tvmonitor      3.497       5.352
21   background     77.776      78.314
     Averaged CA    11.045      9.256

TABLE 4: Per-class prediction accuracy (%) in the second scenario, comparing the (CRF+VOC) case and the (CRF+Google Images) case.

This result confirms that Google Images alone cannot surpass the baseline accuracy. The reason is that Google Images lacks class variability within a single image: each image typically contains one particular object class plus the background class. This removes the opportunity for the CRF to learn the correlations between object classes, so it can hardly perform multiclass segmentation in the testing phase.

Table 4 details the performance for each class. Training with the Google Images dataset alone fails in most classes.
Table 5 summarizes the results of the third scenario. The prediction accuracy decreases as the number of Google Images increases. The number of images needed to achieve an optimal result might depend on the complexity of the objects.

No   Classes           600       700       800       900
1    aeroplane         8.012     7.610     11.095    9.477
2    bicycle           0.058     0.465     0.880     1.431
3    bird              2.138     0.743     0.566     0.959
4    boat              8.056     10.260    7.982     7.810
5    bottle            0.000     0.041     0.000     1.266
6    bus               17.212    19.210    19.067    18.783
7    car               4.112     4.845     6.014     6.013
8    cat               9.461     7.204     6.695     8.359
9    chair             0.554     0.260     0.085     0.379
10   cow               6.106     7.961     5.635     8.121
11   diningtable       0.654     0.912     1.153     0.623
12   dog               1.753     3.890     2.775     2.566
13   horse             3.891     4.651     4.198     5.464
14   motorbike         7.364     4.887     6.762     5.064
15   person            9.664     9.677     7.442     7.479
16   pottedplant       2.207     2.039     3.603     1.691
17   sheep             13.959    11.578    10.771    9.975
18   sofa              0.721     0.922     1.242     0.371
19   train             17.521    17.446    13.478    12.826
20   tvmonitor         5.352     3.548     5.188     4.485
21   background        75.578    75.459    75.690    75.170
     Averaged CA       9.256     9.219     9.063     8.967
     Global Accuracy   73.359    73.503    73.770    73.046

TABLE 5: Comparison of prediction accuracy (%) for different numbers of Google Images.

7 CONCLUSIONS AND FUTURE WORKS

Fig. 7: Example results from the first scenario. (a) The original images. (b) The ground-truth labeled images. (c) The results of the first experiment (CRF+VOC). (d) The results of the second experiment (CRF+VOC+Google Images).

This research proposes Google Images as a source of training data. Google Images is converted into a strongly labeled dataset by saliency filtering. The performance improvement varies across the scenarios. Combining the VOC PASCAL 2010 dataset and the Google Images dataset increases the prediction accuracy; Google Images helps the semantic segmentation by enlarging the class characteristics. On the other hand, using Google Images alone does not improve the performance. Furthermore, adding more Google Images does not lead to better performance.
The author realizes that this research leaves many things to explore. It requires an investigation of the effective number of Google Images, because the experiments show that adding more data does not increase the performance. The experiments have not established the exact rate of improvement, i.e., how many Google Images are needed to achieve a certain level of accuracy. In addition, the keywords can also affect the search results; there might be a better keyword to describe an object, which could give a more decent result. Choosing the right keyword would be an interesting problem.


REFERENCES

[1] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," Int. J. Comput. Vision, vol. 81, no. 1, pp. 2-23, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1007/s11263-007-0109-1
[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results," http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html, 2010.
[3] Research.microsoft.com, "Object class recognition - Microsoft Research," 2015. [Online]. Available: http://research.microsoft.com/en-us/projects/objectclassrecognition/
[4] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 6, no. 6, pp. 721-741, Nov. 1984. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.1984.4767596
[5] J. M. Hammersley and P. Clifford, "Markov fields on finite graphs and lattices," 1971.
[6] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in CVPR, 2012, pp. 733-740.
[7] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels," EPFL, Tech. Rep. 149300, June 2010.
[8] P. Krähenbühl, "Efficient inference in fully connected CRFs with Gaussian edge potentials," 2014. [Online]. Available: http://graphics.stanford.edu/projects/densecrf/
[9] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 109-117. [Online]. Available: http://papers.nips.cc/paper/4296-efficient-inference-in-fully-connected-crfs-with-gaussian-edge-potentials.pdf
