1. Introduction
Boundaries and regions are closely intertwined. A closed
boundary generates a region while every image region has a
boundary. Psychophysics experiments suggest that humans
use both boundary and region cues to perform segmentation
[43]. In order to build a vision system capable of parsing
natural images into coherent units corresponding to surfaces
(Figure: an example image with human segmentations (Human SS, Human IC) and results from the Patch-Only, IC-Only, and Patch+IC affinity models. The ecological statistics of the dataset show that regions are mostly convex, validating the assumptions made by the intervening contour approach.)
2. Methodology
We formulate the problem of learning the pixel affinity function as a classification problem: discriminating same-segment pixel pairs from different-segment pairs. Let S(i,j) be the true same-segment indicator, so that S(i,j) = 1 when pixels i and j are in the same segment, and S(i,j) = 0 when pixels i and j are in different segments.
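Training pairs carrying this indicator label can be sampled directly from a labeled segment map. A minimal sketch; the sampling scheme and names are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def sample_pixel_pairs(seg, n_pairs, max_dist=20, rng=None):
    """Sample pixel pairs and label them with the same-segment indicator.

    seg: 2-D integer array of segment ids (one human segmentation).
    Returns (pts_i, pts_j, labels) where labels[k] = 1 iff both pixels
    of pair k carry the same segment id, matching S(i, j) in the text.
    """
    rng = np.random.default_rng(rng)
    h, w = seg.shape
    # First endpoint: uniform over the image.
    yi = rng.integers(0, h, n_pairs)
    xi = rng.integers(0, w, n_pairs)
    # Second endpoint: a bounded random offset, clipped to the image.
    yj = np.clip(yi + rng.integers(-max_dist, max_dist + 1, n_pairs), 0, h - 1)
    xj = np.clip(xi + rng.integers(-max_dist, max_dist + 1, n_pairs), 0, w - 1)
    labels = (seg[yi, xi] == seg[yj, xj]).astype(int)
    return np.stack([yi, xi], 1), np.stack([yj, xj], 1), labels
```

Any off-the-shelf binary classifier can then be fit to cue features computed for these pairs.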
(Precision-recall plot comparing the affinity models. Legend: Patch F=0.60, MI=0.16; IC F=0.61, MI=0.14; Patch+IC F=0.65, MI=0.19; Humans F=0.77, MI=0.25; iso-F curve at F=0.77.)
3. Features
We will model the affinity between two pixels as a function
of both patch-based and gradient-based features. In each
case, we can use brightness, color, or texture, producing a
total of six features. We also consider the distance between
the pixels as a seventh feature.
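The seven features can be collected into one vector per pixel pair. In the sketch below, the patch-based and gradient-based similarity functions are hypothetical placeholders for the cue computations described in the text:

```python
import numpy as np

def pair_features(patch_sim, grad_sim, p, q):
    """Assemble the seven-dimensional feature vector for a pixel pair.

    patch_sim, grad_sim: dicts mapping channel name -> similarity
    function f(p, q); stand-ins for the patch and gradient cues.
    p, q: (y, x) pixel coordinates.
    """
    channels = ("brightness", "color", "texture")
    feats = [patch_sim[c](p, q) for c in channels]    # 3 patch-based features
    feats += [grad_sim[c](p, q) for c in channels]    # 3 gradient-based features
    feats.append(np.hypot(p[0] - q[0], p[1] - q[1]))  # 7th feature: distance
    return np.array(feats)
```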
(Figure 3 plots: precision-recall distributions when comparing a left-out human segmentation to the remaining segmentations of the same image (left) and of different images (right).)
the family of Pb measures, as well as the mean of Pb. The next section will cover the choice of the intervening contour function, as well as the best way to combine the contour information from the brightness, color, and texture channels.
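The straight-line intervening contour cue can be sketched as one minus the maximum boundary strength sampled along the segment between the two pixels, so that a strong contour anywhere on the line drives the affinity toward zero. The uniform line sampling below is a simplification of proper rasterization:

```python
import numpy as np

def ic_affinity(pb, p, q):
    """Intervening-contour affinity between pixels p and q.

    pb: 2-D array of boundary probabilities in [0, 1].
    Returns 1 - max Pb sampled along the straight line p -> q.
    """
    # Sample one point per unit of the longer axis span.
    n = int(max(abs(q[0] - p[0]), abs(q[1] - p[1]))) + 1
    ys = np.linspace(p[0], q[0], n).round().astype(int)
    xs = np.linspace(p[1], q[1], n).round().astype(int)
    return 1.0 - pb[ys, xs].max()
```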
4. Findings
4.1. Validating the Ground-Truth Dataset
Before applying the human segmentation data to the problem of learning and evaluating affinity functions, we must
determine that the dataset is self-consistent. To this end,
we validate the dataset by comparing each segmentation to
the remaining segmentations of the same image. Treating
the left-out segmentation as the signal and the remaining
segmentations as ground-truth, we apply our two evaluation
methods.
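For a binary same-segment signal, precision and recall reduce to counting agreements over pixel pairs. A brute-force sketch of the comparison (suitable only for tiny maps; the full dataset would require sampling pairs):

```python
import numpy as np
from itertools import combinations

def pair_precision_recall(signal_seg, truth_seg):
    """Precision/recall of same-segment pairs, signal vs. ground truth.

    Both arguments are 2-D integer segment-id maps of the same shape.
    A pair is positive when its two pixels share a segment id.
    """
    s = signal_seg.ravel()
    t = truth_seg.ravel()
    idx = list(combinations(range(s.size), 2))
    same_s = np.array([s[i] == s[j] for i, j in idx])
    same_t = np.array([t[i] == t[j] for i, j in idx])
    tp = np.sum(same_s & same_t)
    precision = tp / max(same_s.sum(), 1)  # fraction of signal pairs in truth
    recall = tp / max(same_t.sum(), 1)     # fraction of truth pairs in signal
    return precision, recall
```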
The left panel of Figure 3 shows the distribution of precision and recall for the entire dataset of 1020 images. Since the signal in this case is binary-valued, we have a single point for each segmentation. The distribution is characterized by high recall, with a median value of 99%. This indicates that 99% of the same-segment pairs in the ground-truth are contained in a left-out human segmentation.
Figure 5: Each panel shows two pixels, marked with squares, that
all humans declared to be in the same segment. The intensity at
each point represents the empirical probability that a one-hop path
through that point does not intersect a boundary contour. The left
column is conditioned on there being an unobstructed straight-line
(SL) path between the pixels, while the right column shows the
probabilities when the SL path is obstructed. The top row shows
data gathered from pixel pairs with small separation; the bottom
row for pairs with large separation. See Section 4.2 for further
discussion.
Straight-line intervening contour is a good approximation to the same-segment indicator for small distances: 49% of same-segment pairs are in the 75% correct range.
When straight-line intervening contour fails for a same-segment pair, there exist more complex paths connecting the pixels. Consider the set of paths that consist of two straight-line segments, which we call one-hop paths. The situations where one-hop paths succeed but straight-line paths fail can give us intuition about how much can be gained by examining paths more complex than straight lines.
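Whether a one-hop path exists can be tested by trying every intermediate point and checking both straight-line legs against a boundary map. A brute-force sketch over a thresholded binary boundary map (an assumption for clarity; the analysis in the text uses boundary contours from the human segmentations):

```python
import numpy as np

def line_clear(boundary, p, q):
    """True if the straight line p -> q crosses no boundary pixel."""
    n = int(max(abs(q[0] - p[0]), abs(q[1] - p[1]))) + 1
    ys = np.linspace(p[0], q[0], n).round().astype(int)
    xs = np.linspace(p[1], q[1], n).round().astype(int)
    return not boundary[ys, xs].any()

def one_hop_exists(boundary, p, q):
    """True if some two-segment path p -> m -> q avoids all boundaries."""
    h, w = boundary.shape
    return any(line_clear(boundary, p, (y, x)) and line_clear(boundary, (y, x), q)
               for y in range(h) for x in range(w))
```

For a concave region, the straight line can be blocked while a one-hop detour around the concavity succeeds.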
Figure 5 shows the empirical spatial distribution of one-hop paths between same-segment pixel pairs, using the union of human segmentations. The probability of a one-hop path existing is conditioned on (left) there being a straight-line path with no intervening contour and (right) there being no straight-line path. If the human subjects' regions were completely convex, the right column images would be zero. Instead, we see that when straight-line intervening contour fails, there is a small but significant probability that a more complex one-hop path will succeed, and the probability of such a path is larger at smaller scales. There is clearly some benefit from the more complex paths due to concavities in the regions. However, the degree to which an algorithm could take advantage of the more powerful one-hop version of intervening contour depends on the frequency with which one-hop paths find holes in the estimated boundary map. In any case, the figure makes clear that the simple straight-line path is a good first-order approximation to the connectivity of same-segment pairs.
Since the straight-line version of intervening contour will underestimate connectivity in concave regions, it may have a tendency toward over-segmentation.
(Plot: Prob. no IC and pair frequency versus distance in pixels.)
The median precision is 66%, indicating that 66% of the same-segment pairs in a left-out human segmentation are contained in the ground-truth. These values support the interpretation that the different subjects provide consistent segmentations varying simply in their degree of detail. For comparison, the right panel of the figure shows the distribution when the left-out human segmentation and the ground-truth segmentations are of different images.
The left panel of Figure 4 shows the distribution of the
F-measure for the same-image and different-image cases.
Similarly, the right panel shows the distributions for mutual information. The clear separation between same-image
and different-image comparisons attests to the consistency
of segmentations of the same image. The median F-measure
of 0.76 and mutual information of 0.25 represent the maximum achievable performance for pairwise affinity functions.
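For reference, the F-measure used throughout is the harmonic mean of precision and recall; a one-line sketch:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (equal weighting)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```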
Figure 4: Distributions of the F-measure (left) and mutual information (right) for the same-image and different-image comparisons.
(Figure 6 plots: precision-recall scatter, Original versus With IC, with marginal precision densities at right.)
Figure 6: The potential over-segmentation caused by the intervening contour approach agrees with the refinement of objects by human observers. The distribution of precision and recall at left is generated in the same manner as the left panel of Figure 3, with the added constraint that the same-segment pairs from the left-out human must not have an intervening boundary contour. The recall naturally decreases from adding a constraint to the signal. However, from the marginal distributions shown at right, we see that precision increases with the added constraint. Because the union segmentation is, on average, a refinement of the left-out segmentation, intervening contour tends to break non-convex regions in a manner similar to the human subjects.
Figure 6 shows the effect on precision and recall for the human data when we add the constraint that same-segment pairs have no intervening boundary contour. As in Figure 3, we are comparing a left-out human to the union of the remaining humans. On average, the union segmentation will be more detailed than the left-out human. The figure shows an increase in median precision from 66% to 75%, indicating that intervening contour tends to break up non-convex segments in a manner similar to the human subjects. This lends confidence to an approach to perceptual organization that first finds convex object pieces through low-level processes, and then groups those pieces into whole objects using higher-level cues.
(Precision-recall legend for the feature combinations: Distance F=0.404 @(0.504,0.338), MI=0.0399; Patch F=0.602 @(0.649,0.562), MI=0.157; IC F=0.609 @(0.580,0.641), MI=0.142; Patch+IC F=0.652 @(0.652,0.652), MI=0.190; Patch+Dist F=0.609 @(0.654,0.569), MI=0.160; IC+Dist F=0.612 @(0.579,0.648), MI=0.143; Patch+IC+Dist F=0.652 @(0.666,0.638), MI=0.191.)
Number of textons versus performance:

    Num    F     MI
    32     0.48  0.081
    64     0.50  0.093
    128    0.52  0.10
    256    0.53  0.11
    512    0.55  0.12
    1024   0.52  0.10
    2048   0.52  0.10

Texture performance:

    F     MI
    0.50  0.097
    0.53  0.11
    0.55  0.12
    0.53  0.11
    0.50  0.089

Patch radius versus performance for the brightness and color channels:

    Radius   Brightness F / MI   Color F / MI
    0.010    0.46 / 0.069        0.49 / 0.093
    0.014    0.46 / 0.071        0.50 / 0.096
    0.020    0.46 / 0.074        0.50 / 0.097
    0.028    0.46 / 0.073        0.50 / 0.097
    0.040    0.46 / 0.071        0.49 / 0.095
    0.056    0.45 / 0.067        0.48 / 0.091
    0.080
    0.112
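Textons in this setting are vector-quantized filter-bank responses. The sketch below illustrates the quantization step with a plain Lloyd's k-means, an illustrative stand-in; the filter responses are assumed to be precomputed:

```python
import numpy as np

def compute_textons(responses, k, iters=10, seed=0):
    """Vector-quantize filter-bank responses into k textons.

    responses: (n_pixels, n_filters) array of filter responses.
    Returns (labels, centers): a texton id per pixel and the k
    texton prototypes, via a few rounds of Lloyd's k-means.
    """
    rng = np.random.default_rng(seed)
    # Initialize prototypes from k distinct data points.
    centers = responses[rng.choice(len(responses), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each pixel to its nearest prototype.
        d = ((responses[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Move each prototype to the mean of its assigned responses.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = responses[labels == j].mean(0)
    return labels, centers
```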
Figure 9: The three plots show classifiers that use either brightness (left), color (middle), or texture (right). Each plot shows the performance of a classifier using the patch cue, the gradient cue, and both together. The brightness patch is an especially weak cue, which is to be expected given the frequency of shading gradients in images. Both texture patches and texture gradients are powerful cues, and their combination is everywhere superior to using either alone.
(Bottom-Up (left) and Top-Down (right) precision-recall panels. Legend: P{bct}IC{bct} F=0.652 @(0.666,0.638), MI=0.190; P{bct}IC{bt} F=0.651 @(0.668,0.635), MI=0.189; P{ct}IC{bt} F=0.648 @(0.667,0.630), MI=0.187; P{ct}IC{t} F=0.634 @(0.650,0.619), MI=0.178; P{c}IC{t} F=0.605 @(0.605,0.605), MI=0.160; P{}IC{t} F=0.564 @(0.535,0.598), MI=0.116.)
References
[1] I. Abdou and W. Pratt. Quantitative design and evaluation of enhancement/thresholding edge detectors. Proc. of the IEEE, 67(5):753-763, May 1979.
[2] K. Bowyer, C. Kranenburg, and S. Dougherty. Edge detector evaluation using empirical ROC curves. 1999.
[3] E. Brunswik and J. Kamiya. Ecological validity of proximity and other Gestalt factors. Am. J. Psych., pages 20-32, 1953.
[4] J. Canny. A computational approach to edge detection. IEEE PAMI, 8:679-698, 1986.
[5] Berkeley Segmentation Dataset, 2002. http://www.cs.berkeley.edu/projects/vision/bsds.
[6] J. Elder and S. Zucker. Computing contour closures. In ECCV, 1996.
[7] P. Felzenszwalb and D. Huttenlocher. Image segmentation using local variation. 1998.
[8] I. Fogel and D. Sagi. Gabor filters as texture discriminator. Bio. Cybernetics, 61:103-113, 1989.
[9] C. Fowlkes, D. Martin, and J. Malik. Understanding Gestalt cues and ecological statistics using a database of human segmented images. POCV Workshop, ICCV, 2001.
[10] Y. Gdalyahu, D. Weinshall, and M. Werman. Stochastic image segmentation by typical cuts. In CVPR, 1999.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE PAMI, 6:721-741, Nov. 1984.