[Figure 1 panels: (a) Multi-edge Graph; (b) Tag Analysis with Multi-edge Graph; (c) Unified Tag Analysis (Tag Refinement, Tag to Region, Automatic Tagging). Legend: v: feature vector; f: tag confidence vector; y: tag assignment vector.]
Figure 1: (a) An illustration of multi-edge graph, a graph with multiple edges between each vertex pair. (b)
For image tag analysis, each edge connects two segmented image regions from two unique images/vertices,
and thus each image/vertex pair is generally connected with multiple edges, where the number of edges for
each image/vertex pair may be different. (c) Based on multi-edge graph, various image tag analysis tasks
can be formulated within a unified framework. For better viewing, please see original color pdf file.
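As a rough illustration of the structure described in Figure 1(b), the sketch below builds the edge set between two images by linking pairs of segmented regions, one region from each image. The histogram-intersection similarity, the threshold value, and all function names are assumptions made for this example only; they are not details given in the text.

```python
import numpy as np

def histogram_intersection(a, b):
    """Similarity of two normalized histograms (an assumed measure), in [0, 1]."""
    return float(np.minimum(a, b).sum())

def build_multi_edges(regions_i, regions_j, threshold=0.5):
    """Return the list of edges between two images (vertices) i and j.

    regions_i, regions_j: arrays of shape (num_regions, dim), holding one
    normalized feature histogram per segmented region.
    Each edge is a tuple (region index in i, region index in j, similarity).
    """
    edges = []
    for p, rp in enumerate(regions_i):
        for q, rq in enumerate(regions_j):
            s = histogram_intersection(rp, rq)
            if s >= threshold:  # keep only sufficiently similar region pairs
                edges.append((p, q, s))
    return edges

# Toy example: image i has 2 regions, image j has 3 regions.
img_i = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])
img_j = np.array([[0.9, 0.1, 0.0], [0.0, 0.6, 0.4], [0.0, 0.2, 0.8]])
edges_ij = build_multi_edges(img_i, img_j)
```

The number of edges depends on how many region pairs pass the threshold, which is why different image pairs may end up with different numbers of edges.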
tags of an image and only reserves the top ones as the refined results. However, this method focuses on selecting a coherent subset of tags from the existing tags associated with each image, and thus is incapable of “inventing” new tags for a particular image if the tags are not initially provided by users. To address this issue, Liu et al. [8] propose to refine the tags based on the visual and semantic consistency residing in the social images, and assign similar tags to visually similar images. The refined and possibly enriched tags can then better describe the visual contents of the images and in turn improve the performance of tag-related applications.

Tag-to-Region Assignment. Even after we obtain reliable tags annotated at the image level, we still need a mechanism to assign tags to regions within an image, which is a promising direction for developing reliable and visible content-based image retrieval systems. There exist some related works [9, 10, 11] in the computer vision community, known as simultaneous object recognition and image segmentation, which aim to learn an explicit detection model for each class/tag, and are thus inapplicable to real applications due to the difficulties in collecting precisely labeled image regions for each tag. In the multimedia community, a recent work is proposed by Liu et al. [12], which accomplishes the task of tag-to-region assignment by using a bi-layer sparse coding formulation for uncovering how an image or semantic region could be reconstructed from the over-segmented image patches of the entire image repository. In [12], the atomic representation is based on over-segmented patches, which are not descriptive enough and may be ambiguous across tags, and thus the algorithmic performance is limited.

Multi-label Automatic Tagging. Accurate manual tagging is generally laborious when the image repository volume becomes large, while a large portion of images uploaded onto photo sharing websites are unlabeled. Therefore, an automatic process to predict the tags associated with an image is highly desirable. Generally, popular machine learning techniques are applied to learn the relationship between the image contents and semantic tags based on a collection of manually labeled training images, and then use the learnt models to predict the tags of unlabeled images. Many algorithms have been proposed for automatic image tagging, varying from building classifiers for individual semantic labels [13, 14] to learning relevance models between images and keywords [15, 16]. However, these algorithms generally rely on sufficient training samples with high quality labels, which are difficult, if not impossible, to obtain in many real-world applications. Fortunately, semi-supervised learning can relieve this difficulty by leveraging both labeled and unlabeled images in the automatic tagging process. The most frequently applied semi-supervised learning method for automatic image tagging is graph-based label propagation [17]. A core component of the algorithms for label propagation on a graph is the graph construction. Most existing algorithms simply construct a graph to model the relationship among individual images, and a single edge is connected between two vertices for capturing the image-to-image visual similarity. Nevertheless, using a single edge to model the relationship of two images is inadequate in practice, especially for real-world images typically associated with multiple tags.

Although many tag-related tasks have been explored, there exist several limitations for these existing algorithms. (1) Most existing algorithms for tag refinement and automatic tagging tasks typically infer the correspondence between the images and their associated tags only at the image level, and utilize the image-to-image visual similarity to refine or predict image tags. However, two images with partial common tags may be considerably different in terms of holistic image features, and thus image level similarity may be inadequate to describe the similarity of the underlying concepts among the images. (2) The only work [12] for the tag-to-region assignment task, despite working at the region level, suffers from the possible ambiguities among the smaller-size patches, which are expected to sparsely reconstruct the desired semantic regions.
It is worth noting that the various tasks on tag analysis are essentially closely related, and the common goal is to discover the tags associated with the underlying semantic regions in the images. Therefore, we naturally believe that there exists a unified framework which can accomplish these tasks in a coherent way, which directly motivates our work in this paper. Towards this goal, we propose a new concept of multi-edge graph, upon which we derive a unified formulation and solution to various tag analysis tasks. We characterize each vertex in the graph with a unique image, which is encoded as a “bag-of-regions” with multiple segmentations, and the thresholding of the pairwise similarities between the individual regions naturally constructs the multiple edges between each two vertices. The foundation of the unified tag analysis framework is cross-level tag propagation, which propagates and adapts the tags between a vertex and its connected edges, as well as between all edges in the graph structure. A core vertex-vs-edge tag equation unique for multi-edge graph is derived to bridge the image/vertex tags and the region-pair/edge tags. That is, the maximum confidence scores over all the edges between two vertices indicate the shared tags of these two vertices (see Section 2). Based on this core equation, we formulate the unified tag analysis framework as a constrained optimization problem, where the objective characterizing the cross-region tag consistency is constrained by the core equations for all vertex pairs, and the cutting plane method is utilized for efficient optimization. Figure 1 illustrates the unified tag analysis framework based on the proposed multi-edge graph.

The main contributions of this work can be summarized as follows:

• We propose a unified formulation and solution to various tag analysis tasks, upon which tag refinement, tag-to-region assignment as well as automatic tagging can be implemented in a coherent way.
• We propose to use the multi-edge graph to model the parallel semantic relationships between the images. We also discover a core equation to connect the tags at the image level with the tags at the region level, which naturally realizes the cross-level tag propagation.
• We propose an iterative and convergence provable procedure with the cutting plane method for efficient optimization.

2. GENERAL MULTI-EDGE GRAPH

An object set endowed with pairwise relationships can be naturally illustrated as a graph, in which the vertices represent the objects, and any two vertices that have some kind of relationship are joined together by an edge. However, in many real-world problems, representing the relationship between two complex relational objects with a single edge may cause considerable information loss, especially when the individual objects convey multiple information channels simultaneously. A natural way to remedy the information loss issue in the traditional graph is to link the objects with multiple edges, each of which characterizes the information in a certain aspect of the objects, and then use a multi-edge graph to model the complex relationships amongst the objects. Take the friendship mining task on the Flickr website as an example: the individual users often provide a collection of information channels including personal profiles, uploaded images, posted comments, tags, etc. Typically, the friendship relations do not reside only within a single information channel, but rather are spread across multiple information channels. Therefore, we represent each shared information channel between any two users with an edge, which naturally forms a multi-edge graph structure. Note that the available information channels for the individual users might be different (footnote 1), thus the edge number between different user pairs may be different. With this new graph structure, we can see a more comprehensive picture of the relations between the users, and thus better recognize the friendship relations.

Footnote 1: For example, some users only upload photos onto Flickr but never provide other information, while some users only post comments to others’ photos without sharing their own photos.

Now we give the definition of multi-edge graph. Formally, we have a multi-edge graph $G = (V, E)$, where $V = \{1, 2, \ldots, N\}$ denotes the vertex set and $E$ is a family of multi-edge sets $E^{ij}$ of $V$, where $E^{ij} = \{e^{ij}_1, \ldots, e^{ij}_{|E^{ij}|}\}$ denotes the multiple edges between vertex $i$ and vertex $j$. Since each edge in the graph represents one shared information channel between two vertices, we may place a feature representation for the edge to indicate the shared information within the specific information channel conveyed by this edge. We assume that the multi-edge graph $G = (V, E)$ is represented by a symmetric edge weighting matrix $\mathbf{W}$, where each $W_{ij}$ measures the weighting similarity between two edge feature representations.

Obviously both graph vertices and edges may convey the class/tag information for the multi-edge graph, and then the general target of learning over the multi-edge graph becomes to infer the tags of the edges supported by the individual information channels and/or the tags of the unlabeled vertices based on the tags of the labeled vertices. To realize the cross-level tag inference, a core equation unique for multi-edge graph is available for bridging the tags at the vertex level and the tags at the edge level. Given a tag $c$, assume a scalar $y_i^c \in \{0, 1\}$ (footnote 2) is used to illustrate the labeling information of the vertex $i$, where $y_i^c = 1$ means that vertex $i$ is labeled as positive with respect to $c$ and $y_i^c = 0$ otherwise. We also associate a real-valued label $f_t^c \in [0, 1]$ with the edge $e_t$ to capture the probability that the edge $e_t$ is labeled as positive with respect to $c$ under the specific information channel residing on $e_t$. Then for any two vertices $i$ and $j$, a core equation can be formulated as follows:

$$y_i^c \, y_j^c = \max_{e_t \in E^{ij}} f_t^c. \tag{1}$$

Footnote 2: $y_i^c$ denotes the entry corresponding to tag $c$ in the tag vector $\mathbf{y}_i$, and thus is a scalar. A similar statement holds for the vector $\mathbf{f}_t$ and its scalar entry $f_t^c$.

Discussion. Compared with the traditional single edge graph, the proposed multi-edge graph has the following characteristics: (1) each edge is linked at a specific information channel, which naturally describes the vertex relation from a specific aspect; (2) the graph adjacency structure better captures the complex relations between the vertices owing to the multi-edge linkage representation; and (3) the graph weighting is performed over the individual edge representations, which may describe the correlation of the heterogeneous information channels.

There are also some works [18, 19] related to the term “multi-graph” in the literature, but the focus of these works is to combine multiple graphs obtained from different modalities
of the same observations to obtain a more comprehensive single edge graph, which, however, is essentially different from the multi-edge graph introduced in this work, where multiple edges are constructed to characterize the parallel relationships from different information channels between the vertices.

The recently proposed hypergraph [20] also attempts to model the complex relationships among the objects. However, the modeling is performed on a single information channel of the objects, which also causes considerable information loss. Instead of modeling the relationship of objects with a single channel, multi-edge graph differs from hypergraph by explicitly constructing the linkages with multiple information channels of the vertices.

3. CONSTRUCT MULTI-EDGE GRAPH FOR UNIFIED TAG ANALYSIS

In this section, we first describe the bag-of-regions representation for the images, and then introduce the multi-edge graph construction procedure for image-based tag analysis.

Figure 2: The “bag-of-regions” representation of an input image. We use two segmentation algorithms to segment the input image simultaneously and then collect all derived segments to obtain the bag-of-regions representation.

[...] (visual word index) between 1 and 500. Finally, each region is represented as a normalized histogram over the visual vocabulary.
[...] parameter of the Gaussian function, and is set as the median value of all pairwise Euclidean distances between the edges.

4. UNIFIED TAG ANALYSIS OVER IMAGE-BASED MULTI-EDGE GRAPH

In this section, we present the unified tag analysis framework over the image-based multi-edge graph. We first describe the problem formulation and then introduce an efficient iterative procedure for the optimization.

4.1 Basic Framework

We restrict our discussion to a transductive setting, where $l$ labeled images $\{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_l, \mathbf{y}_l)\}$ and $u$ unlabeled images $\{\mathbf{x}_{l+1}, \ldots, \mathbf{x}_{l+u}\}$ are given (footnote 3). We represent each image $\mathbf{x}_i$ ($i = 1, 2, \ldots, l+u$) with the “bag-of-regions” representation as described in Section 3.1. All unique tags appearing in the labeled image collection can be collected into a set $\mathcal{Y} = \{y_1, y_2, \ldots, y_m\}$, where $m$ denotes the total number of tags. For any labeled image $(\mathbf{x}_i, \mathbf{y}_i)$ ($i = 1, 2, \ldots, l$), the tag assignment vector $\mathbf{y}_i$ is a binary vector $\mathbf{y}_i \in \{0, 1\}^m$, where the $k$-th entry $y_i^k = 1$ indicates that $\mathbf{x}_i$ is assigned with tag $y_k \in \mathcal{Y}$ and $y_i^k = 0$ otherwise. Besides, a multi-edge graph $G = (V, E)$ with vertices $V$ corresponding to the $l+u$ images, where vertices $\mathcal{L} = \{1, \ldots, l\}$ correspond to the labeled images and vertices $\mathcal{U} = \{l+1, \ldots, l+u\}$ correspond to the unlabeled images, is also provided. Furthermore, we collect all edges in the graph and re-index the edges in $E$ as $E = \{e_1, e_2, \ldots, e_T\}$, where $T$ is the total number of edges in the graph.

Footnote 3: Note that the proposed framework is general and may not need the support of unlabeled data. For example, in tag refinement and tag-to-region assignment tasks, we can use only the labeled images to obtain the desired results.

The general goal of tag analysis is to infer a globally consistent edge-specific tag confidence vector $\mathbf{f}_t$ for each edge $e_t$ ($t = 1, \ldots, T$), wherein the $c$-th entry $f_t^c$ denotes the probability of assigning tag $y_c$ to the edge $e_t$. This is essentially to infer the fine tag information at the region level from the coarse tag information at the image level. Intuitively, the side information conveyed by a set of images annotated with the same tag could be sufficient for disclosing which regions should be simultaneously labeled with the tag under consideration. This is due to the fact that any two images with the same tag will be linked by at least one edge that connects two regions corresponding to the concerned tag in the multi-edge graph. The repetition of such pairwise connections in a fraction of the labeled images will be an important signal that can be used to infer a common “visual prototype” throughout the image collection. Finding multiple visual prototypes and their correspondences with respect to the individual tags is the foundation of the unified tag analysis framework. More importantly, the core equation in Eq. (1) plays an important role in linking the tags of the edges and the tags of the images, and offers the capability for propagating tags between them.

Furthermore, by assuming that similar edges share similar regional features, the distribution of edge-specific tag confidence vectors should be smooth over the graph. Given a weighting matrix $\mathbf{W}$ which describes the pairwise edge relationship based on Eq. (2), we can propagate the edge-specific tag confidence vectors from the labeled image pairs to the unlabeled ones by enforcing the smoothness of these edge-specific tag confidence vectors over the graph during the propagation. Here, we refer to this approach as Cross-level Tag Propagation due to the extra constraints from the core vertex-vs-edge tag equations for all vertex pairs.

We formulate the cross-level tag propagation within a regularized framework which conducts the propagation implicitly by optimizing the regularized objective function,

$$\begin{aligned}
\min_{\mathbf{F}} \;\; & \Omega(\mathbf{F}, \mathbf{W}) + \lambda \sum_{c=1}^{m} \sum_{i,j \in \mathcal{L}_c} \ell\Big(y_i^c y_j^c,\; \max_{e \in E^{ij}} \mathbf{I}_c^\top \mathbf{f}_{k(e)}\Big) \\
\text{s.t.} \;\; & \mathbf{f}_t \ge 0, \;\; \sum_{c=1}^{m} f_t^c = 1, \;\; t = 1, 2, \ldots, T,
\end{aligned} \tag{3}$$

where $\mathbf{F} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_T]^\top$ consists of the edge-specific tag confidence vectors to be learned for the $T$ edges in the graph, $\mathcal{L}_c$ denotes the collection of labeled images annotated with a specific tag $y_c$, and $k(e)$ denotes the index of edge $e$ in $E$. $\ell(\cdot, \cdot)$ is a convex loss function such as the hinge loss, and $\mathbf{I}_c$ is an $m$-dimensional binary vector with the $c$-th entry being 1 and all other entries being 0. $\Omega$ is a regularization term responsible for the implicit tag propagation, and $\lambda$ is a regularization parameter. Since the numerical values residing in $\mathbf{F}$ are edge-specific tag confidence scores over the tags, we enforce $\mathbf{f}_t \ge 0$ to ensure that the scores are non-negative. Furthermore, the constraint $\sum_{c=1}^{m} f_t^c = 1$ is imposed to ensure that the probabilities sum to 1, which essentially couples different tags together and makes them interact with each other in a collaborative way.

We enforce the tag smoothness over the multi-edge graph by minimizing the regularization term $\Omega$:

$$\Omega(\mathbf{F}, \mathbf{W}) = \sum_{t,h=1}^{T} S_{th} \, \|\mathbf{f}_t - \mathbf{f}_h\|^2 = 2\,\mathrm{tr}(\mathbf{F}^\top \mathbf{L} \mathbf{F}). \tag{4}$$

Here $\mathbf{S} = \mathbf{Q}^{-1/2} \mathbf{W} \mathbf{Q}^{-1/2}$ is a normalized weight matrix of $\mathbf{W}$, $\mathbf{Q}$ is a diagonal matrix whose $(i, i)$-entry is the $i$-th row sum of $\mathbf{W}$, and $\mathbf{L} = (\mathbf{I} - \mathbf{S})$ is the graph Laplacian. Obviously, the minimization of Eq. (4) yields a smooth propagation of the edge-specific tag confidence scores over the graph, and nearby edges shall have similar tag confidence vectors.

Based on the pairwise side information provided by images annotated with the same tag, we instantiate the loss function $\ell(\cdot, \cdot)$ in Eq. (3) with the $\ell_1$-loss:

$$\ell\Big(y_i^c y_j^c,\; \max_{e \in E^{ij}} \mathbf{I}_c^\top \mathbf{f}_{k(e)}\Big) = \Big| y_i^c y_j^c - \max_{e \in E^{ij}} \mathbf{I}_c^\top \mathbf{f}_{k(e)} \Big|. \tag{5}$$

By introducing slack variables and plugging Eq. (4) and Eq. (5) into Eq. (3), we obtain the following convex optimization problem:

$$\begin{aligned}
\min_{\mathbf{F}, \xi_{cij}} \;\; & 2\,\mathrm{tr}(\mathbf{F}^\top \mathbf{L} \mathbf{F}) + \lambda \sum_{c=1}^{m} \sum_{i,j \in \mathcal{L}_c} \xi_{cij} \\
\text{s.t.} \;\; & -\xi_{cij} \le y_i^c y_j^c - \max_{e \in E^{ij}} \mathbf{I}_c^\top \mathbf{f}_{k(e)} \le \xi_{cij}, \;\; \xi_{cij} \ge 0, \\
& \mathbf{f}_t \ge 0, \;\; \mathbf{1}^\top \mathbf{f}_t = 1, \;\; t = 1, 2, \ldots, T.
\end{aligned} \tag{6}$$
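The ingredients of Eqs. (1), (4) and (5) can be checked numerically with a short sketch. The code below is illustrative only: it assumes a dense NumPy affinity matrix `W` with positive row sums, stores `F` as a T-by-m array with one row per edge, and uses hypothetical helper names (`normalized_laplacian`, `smoothness`, `pair_loss`) that do not come from the paper.

```python
import numpy as np

def normalized_laplacian(W):
    """L = I - Q^{-1/2} W Q^{-1/2}, with Q the diagonal matrix of row sums of W."""
    q = W.sum(axis=1)
    q_inv_sqrt = 1.0 / np.sqrt(q)  # assumes every row sum is positive
    S = q_inv_sqrt[:, None] * W * q_inv_sqrt[None, :]
    return np.eye(W.shape[0]) - S

def smoothness(F, W):
    """Regularizer 2 tr(F^T L F) from Eq. (4); F has one row f_t per edge."""
    L = normalized_laplacian(W)
    return 2.0 * np.trace(F.T @ L @ F)

def pair_loss(F, edge_ids, c, y_i_c, y_j_c):
    """l1 loss of Eq. (5) for one vertex pair (i, j) and one tag c.

    edge_ids: row indices k(e) of the edges e between vertices i and j.
    """
    best = F[edge_ids, c].max()  # max over the edges of I_c^T f_{k(e)}
    return abs(y_i_c * y_j_c - best)

# Toy graph: T = 3 edges, m = 2 tags; each row of F sums to one.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
F = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.2, 0.8]])
omega = smoothness(F, W)
loss = pair_loss(F, edge_ids=[0, 1], c=0, y_i_c=1, y_j_c=1)
```

In this toy example every row of the normalized matrix S sums to one, so the pairwise form on the left of Eq. (4) gives the same value as the trace form. The loss is small because one of the two edges between the vertex pair already carries a high confidence for the shared tag, which is what the core equation in Eq. (1) asks for.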
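This is also equivalent to

$$\begin{aligned}
\min_{\mathbf{F}, \xi_{cij}} \;\; & 2\,\mathrm{tr}(\mathbf{F}^\top \mathbf{L} \mathbf{F}) + \lambda \sum_{c=1}^{m} \sum_{i,j \in \mathcal{L}_c} \xi_{cij} \\
\text{s.t.} \;\; & \mathbf{I}_c^\top \mathbf{f}_{k(e)} - \xi_{cij} \le y_i^c y_j^c, \;\; e \in E^{ij}, \\
& y_i^c y_j^c - \max_{e \in E^{ij}} \mathbf{I}_c^\top \mathbf{f}_{k(e)} \le \xi_{cij}, \\
& \xi_{cij} \ge 0, \;\; c = 1, 2, \ldots, m, \;\; i, j \in \mathcal{L}_c, \\
& \mathbf{f}_t \ge 0, \;\; \mathbf{1}^\top \mathbf{f}_t = 1, \;\; t = 1, 2, \ldots, T.
\end{aligned} \tag{7}$$

4.2 Optimization with Cutting Plane

The optimization problem in Eq. (7) is convex, resulting [...] confidence vectors $\mathbf{F}_{ij} = [\mathbf{f}_{k(e^{ij}_1)}, \mathbf{f}_{k(e^{ij}_2)}, \ldots, \mathbf{f}_{k(e^{ij}_{|E^{ij}|})}]$ that correspond to the edges between two given vertices $i$ and $j$ yields a standard Quadratic Programming (QP) problem:

$$\begin{aligned}
\min_{\mathbf{F}_{ij}, \xi_c} \;\; & \Omega(\mathbf{F}_{ij}, \mathbf{W}) + \lambda \sum_{c=1}^{m} \xi_c \\
\text{s.t.} \;\; & \mathbf{I}_c^\top \mathbf{f}_{k(e)} - \xi_c \le y_i^c y_j^c, \;\; e \in E^{ij}, \\
& y_i^c y_j^c - \max_{e \in E^{ij}} \mathbf{I}_c^\top \mathbf{f}_{k(e)} \le \xi_c, \\
& \xi_c \ge 0, \;\; c = 1, 2, \ldots, m, \\
& \mathbf{f}_{k(e)} \ge 0, \;\; \mathbf{1}^\top \mathbf{f}_{k(e)} = 1, \;\; e \in E^{ij}.
\end{aligned} \tag{8}$$

Since we only solve the $\mathbf{f}_d$'s residing on edges between vertex $i$ and vertex $j$ and treat the others as constant vectors during the optimization, the smoothness term that relates to the concerned optimization variable can be written as

$$\begin{aligned}
\Omega(\mathbf{F}_{ij}, \mathbf{W}) = \; & 2 \sum_{e_t \in E^{ij}} \sum_{e_h \notin E^{ij}} S_{t^\star h^\star} \big(\mathbf{f}_{t^\star}^\top \mathbf{f}_{t^\star} - 2\,\mathbf{f}_{t^\star}^\top \mathbf{f}_{h^\star}\big) \\
& + 2 \sum_{e_t \in E^{ij}} \sum_{e_h \in E^{ij} \setminus \{e_t\}} S_{t^\star h^\star} \big(\mathbf{f}_{t^\star}^\top \mathbf{f}_{t^\star} - 2\,\mathbf{f}_{t^\star}^\top \mathbf{f}_{h^\star} + \mathbf{f}_{h^\star}^\top \mathbf{f}_{h^\star}\big),
\end{aligned} \tag{9}$$

where $t^\star = k(e_t)$ and $h^\star = k(e_h)$ denote the indices of edge $e_t$ and edge $e_h$ in the edge set $E$, respectively.

We vectorize the matrix $\mathbf{F}_{ij}$ by stacking its columns into a vector. Denote by $\mathbf{r} = \mathrm{vec}(\mathbf{F}_{ij})$; then $\mathbf{f}_{k(e_t)}$ can be calculated as $\mathbf{f}_{k(e_t)} = \mathbf{A}_t \mathbf{r}$, where $\mathbf{A}_t = [\mathbf{I}_{(t-1) \cdot m + 1}, \ldots, \mathbf{I}_{t \cdot m}]^\top$ with $\mathbf{I}_g$ an $(m \times |E^{ij}|)$-dimensional column vector in which the $g$-th entry is 1 while the other entries are all 0. Thus the objective function can be rewritten as

$$\begin{aligned}
\min_{\mathbf{r}, \xi_c} \;\; & \Omega(\mathbf{r}, \mathbf{W}) + \lambda \sum_{c=1}^{m} \xi_c \\
\text{s.t.} \;\; & \mathbf{I}_c^\top \mathbf{A}_t \mathbf{r} - \xi_c \le y_i^c y_j^c, \;\; e_t \in E^{ij}, \\
& y_i^c y_j^c - \max_{e_t \in E^{ij}} \mathbf{I}_c^\top \mathbf{A}_t \mathbf{r} \le \xi_c, \\
& \xi_c \ge 0, \;\; c = 1, 2, \ldots, m, \\
& \mathbf{r} \ge 0, \;\; \mathbf{1}^\top \mathbf{A}_t \mathbf{r} = 1, \;\; e_t \in E^{ij}.
\end{aligned} \tag{10}$$

Although the objective function in Eq. (10) is quadratic and the constraints are all convex, the second constraint is not smooth. Hence, we cannot directly use an off-the-shelf convex optimization toolbox to solve the problem. Fortunately, the cutting plane method [26] is able to solve optimization problems with non-smooth constraints. In Eq. (10), we have $m$ slack variables $\xi_c$ ($c = 1, 2, \ldots, m$). To solve it efficiently, we first derive the 1-slack form of Eq. (10). More specifically, we introduce one single slack variable $\xi = \sum_{c=1}^{m} \xi_c$. Finally, we can obtain the objective function as

$$\min_{\mathbf{r}, \xi} \;\; \Omega(\mathbf{r}, \mathbf{W}) + \lambda \xi$$

[...] $y_i^c y_j^c,\; y_i^c y_j^c - \max_{e_t \in E^{ij}} \mathbf{I}_c^\top \mathbf{A}_t \mathbf{r})$. The basic idea of the cutting plane method is to substitute the nonlinear constraints with a collection of linear constraints. The algorithm starts with the following optimization problem:

$$\begin{aligned}
\min_{\mathbf{r}} \;\; & \Omega(\mathbf{r}, \mathbf{W}) \\
\text{s.t.} \;\; & \mathbf{r} \ge 0, \;\; \mathbf{1}^\top \mathbf{A}_t \mathbf{r} = 1, \;\; e_t \in E^{ij}.
\end{aligned} \tag{12}$$

After obtaining the solution $\mathbf{r}^s$ ($s = 0$) of the above problem, we approximate the nonlinear function $\xi(\mathbf{r})$ around $\mathbf{r}^s$ by a linear inequality:

$$\xi(\mathbf{r}) \ge \xi(\mathbf{r}^s) + \frac{\partial \xi(\mathbf{r})}{\partial \mathbf{r}}\Big|_{\mathbf{r} = \mathbf{r}^s} (\mathbf{r} - \mathbf{r}^s). \tag{13}$$

Denote by $g_1(\mathbf{r}) = 0$, $g_2(\mathbf{r}) = \mathbf{I}_c^\top \mathbf{A}_t \mathbf{r} - y_i^c y_j^c$, and $g_3(\mathbf{r}) = y_i^c y_j^c - \max_{e_t \in E^{ij}} \mathbf{I}_c^\top \mathbf{A}_t \mathbf{r}$. The subgradient of $\xi(\mathbf{r})$ in Eq. (13) can then be calculated as follows:

$$\frac{\partial \xi(\mathbf{r})}{\partial \mathbf{r}}\Big|_{\mathbf{r} = \mathbf{r}^s} = \sum_{c=1}^{m} \Big( z_p(\mathbf{r}) \times \frac{\partial g_p(\mathbf{r})}{\partial \mathbf{r}}\Big|_{\mathbf{r} = \mathbf{r}^s} \Big). \tag{14}$$

Here,

$$z_p(\mathbf{r}) = \begin{cases} 1, & \text{if } p = \arg\max_{p = 1, 2, 3} g_p(\mathbf{r}), \\ 0, & \text{otherwise}, \end{cases} \tag{15}$$

and $\partial g_1(\mathbf{r}) / \partial \mathbf{r} = 0$, $\partial g_2(\mathbf{r}) / \partial \mathbf{r} = \mathbf{I}_c^\top \mathbf{A}_t$, and $\partial g_3(\mathbf{r}) / \partial \mathbf{r} = -\sum_{e_t \in E^{ij}} \gamma_{e_t} \mathbf{I}_c^\top \mathbf{A}_t$, where $\gamma_{e_t}$ is defined as follows:

$$\gamma_{e_t} = \begin{cases} 1, & \text{if } e_t = \arg\max_{e_t \in E^{ij}} \mathbf{I}_c^\top \mathbf{A}_t \mathbf{r}, \\ 0, & \text{otherwise}. \end{cases} \tag{16}$$

The linear constraint in Eq. (13) is called a cutting plane in the cutting plane terminology. We add the obtained constraint into the constraint set and then use an efficient QP package to solve the reduced cutting plane problem. The optimization procedure is repeated until there is no salient gain from adding more cutting plane constraints. In practice, the cutting plane procedure halts when satisfying the ε-optimality condition, i.e., the difference between the objective value of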
the master problem and the objective value of the reduced cutting plane problem is smaller than a threshold ε. In this work, we set ε to be 0.1. We use $J^t$ to denote the value of the objective function in Eq. (7); the whole optimization procedure can then be summarized as in Algorithm 1.

Algorithm 1 Unified Tag Analysis Over Multi-edge Graph
1: Input: A multi-edge graph $G = (V, E)$ of an image collection $\mathcal{X}$, with the labeled subset $\mathcal{X}_l$ and the unlabeled subset $\mathcal{X}_u$, edge affinity matrix $\mathbf{W} = \{w_{ij}\}$ ($i, j = 1, 2, \ldots, T$) of the edge set $E$, and regularization constant $\lambda$.
2: Output: The edge-specific tag confidence vectors $\mathbf{F} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_T]^\top$.
3: Initialize $\mathbf{F} = \mathbf{F}^0$, $t = 0$, $\Delta J = 10^{-3}$, $J^{-1} = 10^{-3}$.
4: while $\Delta J / J^{t-1} > 0.01$ do
5:   for any pairwise labeled vertices $i$ and $j$ do
6:     Derive problem in Eq. (11), set the constraint set $\Omega = \emptyset$ and $s = -1$;
7:     Cutting plane iterations:
8:     repeat
9:       $s = s + 1$;
10:      Get $(\mathbf{r}^s, \xi^s)$ by solving Eq. (11) under $\Omega$;
11:      Compute the linear constraint by Eq. (13) and update the constraint set $\Omega$ by adding the obtained linear constraint into it.
12:    until Convergence
13:  end for
14:  for any pairwise vertices $i$ and $j$ in which at least one is unlabeled do
15:    Solve the QP problem whose form is similar to Eq. (12).
16:  end for
17:  $t = t + 1$; $\mathbf{F}^t = \mathbf{F}^{t-1}$; $\Delta J = J^{t-1} - J^t$;
18: end while

4.3 Unified Tag Analysis with the Derived F

Once the tag confidence vectors $\mathbf{F} = [\mathbf{f}_1, \mathbf{f}_2, \ldots, \mathbf{f}_T]^\top$ are obtained, the predicted tag vector $\mathbf{y}^*$ for a given semantic region $x$ can be obtained by a majority voting strategy. Assume there are in total $p$ edges connected with the region $x$ in the multi-edge graph, which form an edge subset $E(x) = \{e^x_1, e^x_2, \ldots, e^x_p\}$. We collect the corresponding confidence vectors into a subset $\mathbf{F}(x) = [\mathbf{f}_{k(e^x_1)}, \mathbf{f}_{k(e^x_2)}, \ldots, \mathbf{f}_{k(e^x_p)}]$, where $k(e^x_i)$ denotes the index of edge $e^x_i$ in the whole edge set of the multi-edge graph. In essence, each confidence vector in $\mathbf{F}(x)$ represents the tag prediction result of region $x$, where the $c$-th entry corresponds to the confidence score of assigning tag $y_c$ to region $x$. Since each region usually contains only one semantic concept, we select the tag with the maximum confidence score in the confidence vector as the predicted tag of the given region. Furthermore, since there are $p$ edges simultaneously connected with region $x$, we predict the tag with each confidence vector and select the tag predicted most often as the final predicted tag of region $x$. By doing so, we can obtain a binary tag assignment vector for each region, and then the three tag analysis tasks can be implemented in the following manner:

Tag Refinement. For each image with user-provided tags, we predict the binary tag assignment vector for each of its composite regions in the bag-of-regions representation and then fuse all binary vectors through the logical OR operator. The obtained vector then indicates the refined tags for the given image.

Tag-to-Region Assignment. The tag-to-region assignment is implemented in a pixel-wise manner. Since our approach relies on two segmentation algorithms, one pixel will appear twice in two regions. For each region, we predict its tag and count the corresponding number of predictions with the tag prediction method discussed above. If the tag predictions for the two regions are inconsistent, we select the tag predicted most often as the predicted tag of the concerned pixel.

Automatic Tagging. We predict the binary tag assignment vector for each region of an unlabeled image, and then fuse the obtained binary vectors via the OR operator.

4.4 Algorithmic Analysis

We first analyze the time complexity of Algorithm 1. In the optimization, each sub-problem is solved with the cutting plane method. As for the time complexity of the cutting plane iteration, we have the following two theorems:

Theorem 1. Each iteration in steps 9-11 of Algorithm 1 takes time $O(dn)$ for a constant constraint set $\Omega$, where $d$ is the number of edges between vertex $i$ and vertex $j$.

Theorem 2. The cutting plane iteration in steps 8-12 of Algorithm 1 terminates after at most $\max\{2/\varepsilon,\; 8\lambda R^2/\varepsilon^2\}$ steps, where

$$R^2 = \sum_{c=1}^{m} \max_{e_t \in E^{ij}} \Big(0,\; \|\mathbf{I}_c^\top \mathbf{A}_t\|,\; \Big\|\sum_{e_t \in E^{ij}} \gamma_{e_t} \mathbf{I}_c^\top \mathbf{A}_t\Big\|\Big),$$

and ε is the threshold for terminating the cutting plane iteration (see Section 4.2). Here $\|\cdot\|$ denotes the $\ell_2$-norm.

The proofs of the above two theorems follow directly from the proofs in [27], and are omitted here due to space limitations. In this work, we implement Algorithm 1 on the MATLAB platform of an Intel Xeon X5450 workstation with 3.0 GHz CPU and 16 GB memory, and observe that the cutting plane iteration converges fast. For example, in the tag-to-region assignment experiment on the MSRC dataset (see Section 5.2), one optimization sub-problem between two vertices can be processed within only 0.1 seconds, which demonstrates that the cutting plane iteration of Algorithm 1 is efficient. Furthermore, as each optimization sub-problem in Eq. (11) is convex, its solution is globally optimal. Therefore, each optimization between two vertices will monotonically decrease the objective function, and hence the algorithm will converge. Figure 3 shows the convergence process of the iterative optimization, which is captured during the tag-to-region assignment experiment on the MSRC dataset. Here one iteration refers to one round of alternating optimizations for all vertex pairs, which corresponds to steps 5-16 of Algorithm 1. As can be seen, the objective function converges to the minimum after about 5 iterations.

5. EXPERIMENTS

In this section, we systematically evaluate the effectiveness of the proposed unified formulation and solution to various tag analysis tasks over three widely used datasets and report comparative results on two tag analysis tasks: (1) tag-to-region assignment, and (2) multi-label automatic tagging. We have also conducted extensive experiments on
the tag refinement task over real-world social images and obtained better results than the state-of-the-art tag refinement methods in [7] and [8]. This result is not shown here due to space limitations.

[Figure 3 plot: objective function value (vertical axis, ticks from 3000 to 9000) versus iteration number (horizontal axis, 1 to 10).]
Figure 3: The convergence curve of Algorithm 1 on the MSRC dataset in the tag-to-region assignment experiment (see Section 5.2 for the details).

5.1 Image Datasets

We perform the experiments on three publicly available benchmark image datasets: MSRC [11], COREL-5k [28] and NUS-WIDE [4], which are progressively more difficult. The MSRC dataset contains 591 images from 23 object and scene categories with region-level ground truths. There are about 3 tags on average for each image. The second dataset, COREL-5k, is a widely adopted benchmark for keyword-based image retrieval and image annotation. It contains 5,000 images manually annotated with 4 to 5 tags, and the whole vocabulary contains 374 tags. A fixed set of 4,500 images is used as the training set, and the remaining 500 images are used for testing. The NUS-WIDE dataset, recently collected by the National University of Singapore, is a more challenging collection of real-world social images from Flickr. It contains 269,648 images in 81 categories and has about 2 tags per image. We select a subset of this dataset, focusing on images containing at least 5 tags, and obtain a collection of 18,325 images with 41,989 unique tags. We refer to this social image dataset as NUS-WIDE-SUB; the average number of tags for each image is 7.8.

5.2 Exp-I: Tag-to-Region Assignment

Our proposed unified solution is able to automatically assign tags to the corresponding regions within the individual images, which naturally accomplishes the task of tag-to-region assignment. In this subsection, we compare the quantitative results from our proposed unified solution with those from existing tag-to-region assignment algorithms.

5.2.1 Experiment setup

To evaluate the effectiveness of our proposed unified so-

[...] ing formulation to uncover how an image or semantic region can be robustly reconstructed from the over-segmented image patches of an image set, and is the state-of-the-art for the tag-to-region assignment task.

The evaluation of the tag-to-region assignment is based on the region-level ground truth of each image, which is not available for the COREL-5k and NUS-WIDE-SUB datasets. To allow for direct comparison, here we report the performance comparisons on the MSRC-350 and COREL-100 datasets, both of which have been utilized to evaluate the tag-to-region performance in [12]. More specifically, the MSRC-350 dataset focuses on relatively well labeled classes in the MSRC dataset, and contains 350 images annotated with 18 different tags. The second dataset, COREL-100, is a subset of the COREL image set, which contains 100 images with 7 tags, and each image is annotated with the region-level ground truth. Besides, we fix the parameter λ to be 1 in all the experiments. The tag-to-region assignment performance is measured by pixel-level accuracy, which is the percentage of pixels with agreement between the assigned tags and the ground truth.

Table 1: Tag-to-region assignment accuracy comparison on the MSRC-350 and COREL-100 datasets. The kNN-based algorithm is implemented with different values for the parameter k, namely, kNN-1: k=49, kNN-2: k=99.

Dataset      kNN-1   kNN-2   bi-layer [12]   M-E Graph
MSRC-350     0.45    0.37    0.63            0.73
COREL-100    0.52    0.44    0.61            0.67

5.2.2 Results and Analysis

Table 1 shows the pixel-level accuracy comparison between the baseline algorithms and our proposed unified solution on the MSRC-350 and COREL-100 datasets. From the results, we have the following observations. (1) Our proposed unified solution achieves much better performance compared to the other three baseline algorithms. This clearly demonstrates the effectiveness of our proposed unified solution in the tag-to-region assignment task. (2) The contextual tag-to-region assignment algorithms, including the bi-layer algorithm and our proposed multi-edge graph based unified solution, outperform the kNN based counterparts. This is because the former two harness the contextual information among the semantic regions in the image collection. (3) The multi-edge graph based tag-to-region assignment algorithm clearly beats the bi-layer sparse coding algorithm, which owes to the fact that the structure of the multi-edge graph avoids the ambiguities among the smaller size patches in the bi-layer sparse coding algorithm. Figure 4 illustrates the
lution in the tag-to-region assignment task, two algorithms tag-to-region assignment results for some exemplary images
are implemented as baselines for comparison. (1) The kNN from MSRC-350 and COREL-100 datasets produced by our
algorithm. For each segmented region of an image, we first proposed unified solution. By comparing the results in [12],
select its k nearest neighboring regions from the whole se- the results in Figure 4 are more precise, especially for the
mantic region collection, and collect the images containing boundaries of the objects.
these regions into an image subset. Then we count the occur-
rence number of each tag in the obtained image subset and 5.3 Exp-II: Multi-label Automatic Tagging
choose the most frequent tag as the tag of the given seman- In this subsection, we evaluate the performance of our
tic region. We apply this baseline with different parameter proposed unified formulation and solution in the task of au-
setting, i.e., k=49 and 99, and thus can obtain two results tomatic image tagging. Since our proposed unified solution
from this baseline. (2) The bi-layer sparse coding (bi-layer) is founded on the multi-edge graph, the learning procedure
algorithm in [12]. This algorithm uses a bi-layer sparse cod- is essentially in a transductive setting.
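
As a rough illustration of the kNN baseline described in Section 5.2.1, the voting step might be sketched as follows. The function name, the Euclidean distance, and the data layout (one feature row per region, a parallel list of image ids, and a dictionary of image tags) are assumptions made for this sketch; the paper does not specify them. The sketch also assumes the caller has already excluded the query region from the collection.

```python
from collections import Counter

import numpy as np


def knn_region_tag(query, region_feats, region_image_ids, image_tags, k):
    """Pick a tag for one region by voting over images of its k nearest regions.

    query            : feature vector of the region to be tagged
    region_feats     : (n_regions, dim) array, features of the region collection
    region_image_ids : length n_regions, id of the image each region came from
    image_tags       : dict mapping image id -> list of image-level tags
    k                : number of neighbouring regions (49 or 99 in the paper)
    """
    feats = np.asarray(region_feats, dtype=float)
    # Euclidean distance to every region (an assumed metric for this sketch)
    dists = np.linalg.norm(feats - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    # Images containing the selected regions form the image subset
    image_subset = {region_image_ids[i] for i in nearest}
    # Count how often each tag occurs in the subset and keep the most frequent
    counts = Counter(tag for img in image_subset for tag in image_tags[img])
    return counts.most_common(1)[0][0]


# Toy collection: five regions drawn from three images (hypothetical data)
toy_feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]])
toy_image_ids = [0, 1, 1, 2, 2]
toy_tags = {0: ["grass", "cow"], 1: ["grass", "sky"], 2: ["water"]}
```

Note that each image in the subset votes once regardless of how many of its regions were selected; whether the original baseline counts images once or once per region is not stated, so this is a design choice of the sketch. Ties between equally frequent tags are broken arbitrarily here.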
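
The pixel-level accuracy used in Section 5.2 (the percentage of pixels whose assigned tag agrees with the ground truth) can be sketched as below. The function name and the optional `ignore_label` argument for unlabeled pixels are assumptions of this sketch, as is pooling pixels over all images rather than averaging per-image scores; the paper does not state which convention it follows.

```python
import numpy as np


def pixel_accuracy(pred_maps, gt_maps, ignore_label=None):
    """Fraction of pixels whose predicted tag index equals the ground truth.

    pred_maps, gt_maps : sequences of 2-D integer label maps, one pair per image
    ignore_label       : optional ground-truth value to exclude (assumed, e.g.
                         for unlabeled pixels); None means every pixel counts
    """
    correct = 0
    total = 0
    for pred, gt in zip(pred_maps, gt_maps):
        pred = np.asarray(pred)
        gt = np.asarray(gt)
        if ignore_label is None:
            valid = np.ones(gt.shape, dtype=bool)
        else:
            valid = gt != ignore_label
        # Count agreeing pixels among the pixels that are evaluated
        correct += int(np.sum((pred == gt) & valid))
        total += int(np.sum(valid))
    return correct / total
```

For example, a 2x2 prediction that matches the ground truth on three of its four pixels scores 0.75.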
Table 2: Performance comparison of different automatic tagging algorithms on
different datasets.

Dataset                   Method      Precision   Recall
MSRC (22 tags)            ML-LGC      0.62        0.55
                          CNMF        0.65        0.61
                          SMSE        0.67        0.62
                          MISSL       0.63        0.60
                          M-E Graph   0.72        0.67
COREL-5k (260 tags)       ML-LGC      0.22        0.24
                          CNMF        0.24        0.27
                          SMSE        0.23        0.28
                          MISSL       0.22        0.29
                          M-E Graph   0.25        0.31
NUS-WIDE-SUB (81 tags)    ML-LGC      0.28        0.29
                          CNMF        0.29        0.31
                          SMSE        0.32        0.32
                          MISSL       0.27        0.33
                          M-E Graph   0.35        0.37
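
The exact protocol behind the precision and recall columns of Table 2 is not given in this part of the text. One common convention in image annotation is to compute precision and recall per tag and then average over the tags; the sketch below follows that convention as an assumption, and the function name and binary-matrix layout are hypothetical.

```python
import numpy as np


def per_tag_precision_recall(pred, truth):
    """Mean per-tag precision and recall for binary tag assignments.

    pred, truth : (n_images, n_tags) arrays, nonzero where a tag is assigned.
    Tags that are never predicted (or never present) contribute 0 to the
    corresponding average, which is an assumption of this sketch.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = (pred & truth).sum(axis=0).astype(float)  # correct assignments per tag
    n_pred = pred.sum(axis=0)                      # predicted assignments per tag
    n_true = truth.sum(axis=0)                     # ground-truth assignments per tag
    precision = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)
    return float(precision.mean()), float(recall.mean())
```

With three images and two tags, predicting the first tag for one extra image halves that tag's precision while leaving recall untouched, so the averages come out at 0.75 and 1.0 for the example in the test.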
have the following observations: (1) The proposed unified solution based on
the multi-edge graph outperforms the other four multi-label image tagging
algorithms on all datasets, which indicates the effectiveness of the proposed
unified solution in the automatic image tagging task. (2) Our proposed
unified solution outperforms MISSL, which is also based on semantic regions,
as our algorithm explicitly explores the region-to-region relations in the
multi-edge graph structure.

6. CONCLUSIONS AND FUTURE WORK
We have introduced a new concept of multi-edge graph into image tag analysis,
upon which a unified framework is derived for different yet related analysis
tasks. The unified tag analysis framework is characterized by performing
cross-level tag propagation between a vertex and its edges as well as between
all edges in the graph. A core equation is developed to unify the vertex tags
and the edge tags, based on which the unified tag analysis is formulated as a
constrained optimization problem, and the cutting plane method is used for
efficient optimization. Extensive experiments on two example tag analysis
tasks over three widely used benchmark datasets have demonstrated the
effectiveness of the proposed unified solution. For future work, we will
pursue other applications of the multi-edge graph, such as analyzing user
behavior in social networks or mining knowledge from the rich information
channels of multimedia documents.

7. ACKNOWLEDGMENT
This work is supported by the CSIDM Project No. CSIDM-200803, partially funded
by a grant from the National Research Foundation (NRF) administered by the
Media Development Authority (MDA) of Singapore.

8. REFERENCES
[1] Flickr. http://www.flickr.com/.
[2] Picasa. http://picasa.google.com/.
[3] L. Kennedy, S.-F. Chang, and I. Kozintsev. To search or to label?:
    Predicting the performance of search-based automatic image classifiers.
    In MIR, 2006.
[4] T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. NUS-WIDE: A
    real-world web image database from National University of Singapore.
    In CIVR, 2009.
[5] D. Liu, X.-S. Hua, L. Yang, M. Wang, and H.-J. Zhang. Tag ranking.
    In WWW, 2009.
[6] M. Ames and M. Naaman. Why we tag: Motivations for annotation in mobile
    and online media. In CHI, 2007.
[7] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang. Content-based image
    annotation refinement. In CVPR, 2007.
[8] D. Liu, X.-S. Hua, M. Wang, and H.-J. Zhang. Image retagging. In MM, 2010.
[9] L. Cao and L. Fei-Fei. Spatially coherent latent topic model for
    concurrent segmentation and classification of objects and scenes.
    In ICCV, 2007.
[10] Y. Chen, L. Zhu, A. Yuille, and H.-J. Zhang. Unsupervised learning of
    probabilistic object models (POMs) for object classification,
    segmentation, and recognition using knowledge propagation. TPAMI, 2009.
[11] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint
    appearance, shape and context modeling for multi-class object recognition
    and segmentation. In ECCV, 2006.
[12] X. Liu, B. Cheng, S. Yan, J. Tang, T. Chua, and H. Jin. Label to region
    by bi-layer sparsity priors. In MM, 2009.
[13] J. Li and J. Wang. Automatic linguistic indexing of pictures by a
    statistical modeling approach. TPAMI, 2003.
[14] E. Chang, K. Goh, G. Sychay, and G. Wu. CBSA: Content-based soft
    annotation for multimodal image retrieval using Bayes point machines.
    TCSVT, 2003.
[15] J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and
    retrieval using cross-media relevance models. In SIGIR, 2003.
[16] S. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance
    models for image and video annotation. In CVPR, 2004.
[17] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with
    local and global consistency. In NIPS, 2003.
[18] D. Zhou and C. Burges. Spectral clustering and transductive learning
    with multiple views. In ICML, 2007.
[19] D. Zhou, S. Zhu, K. Yu, X. Song, B. Tseng, H. Zha, and C. Giles. Learning
    multiple graphs for document recommendations. In WWW, 2008.
[20] D. Zhou, J. Huang, and B. Schölkopf. Learning with hypergraphs:
    Clustering, classification, and embedding. In NIPS, 2006.
[21] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image
    segmentation. IJCV, 2004.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 2000.
[23] B. Russell, W. Freeman, A. Efros, J. Sivic, and A. Zisserman. Using
    multiple segmentations to discover objects and their extent in image
    collections. In CVPR, 2006.
[24] K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point
    detectors. IJCV, 2004.
[25] D. Lowe. Distinctive image features from scale-invariant keypoints.
    IJCV, 2004.
[26] J. Kelley. The cutting-plane method for solving convex programs.
    JSIAM, 1960.
[27] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
[28] H. Müller, S. Marchand-Maillet, and T. Pun. The truth about Corel -
    evaluation in image retrieval. In CIVR, 2002.
[29] Y. Liu, R. Jin, and L. Yang. Semi-supervised multi-label learning by
    constrained non-negative matrix factorization. In AAAI, 2006.
[30] G. Chen, Y. Song, F. Wang, and C. Zhang. Semi-supervised multi-label
    learning by solving a Sylvester equation. In SDM, 2008.
[31] Z.-J. Zha, T. Mei, J. Wang, Z. Wang, and X.-S. Hua. Graph-based
    semi-supervised learning with multiple labels. JVCI, 2009.
[32] R. Rahmani and S. Goldman. MISSL: Multiple instance semi-supervised
    learning. In ICML, 2006.