Abstract
A statistical correlation model for image retrieval is proposed. This model captures
the semantic relationships among images in a database from simple statistics of user-
provided relevance feedback information. It is applied in the post-processing of
image retrieval results such that more semantically related images are returned to the
user. The algorithm is easy to implement and can be efficiently integrated into an
image retrieval system to help improve the retrieval performance. Preliminary
experimental results on a database of 100,000 images show that the proposed model
could improve image retrieval performance for both content-based and text-based
queries.
Keywords
1. Introduction
As the amount of digital image data available on the Internet and in digital libraries grows rapidly, there is a great need for efficient image indexing and access tools to fully utilize this massive digital resource. Image retrieval is a research area dedicated to addressing this issue, and substantial research efforts have been made. However, by and large, the earlier image retrieval systems have all taken keyword- or text-based approaches to indexing and retrieving image data. Because image annotation is a tedious process, it is practically impossible to annotate all images on the Internet. Furthermore, due to the multiplicity of contents in a single image and the subjectivity of human perception and understanding, different users are also unlikely to annotate the same image in exactly the same way. To address these
limitations, content-based image retrieval (CBIR) approaches have been studied in the
last decade [1, 2, 3, 4, 5]. These approaches work with descriptions based on properties inherent in the images themselves, such as color, texture, and shape, and utilize them for retrieval purposes. Since visual features are automatically
extracted from images, automated indexing of image databases becomes possible.
However, despite the many research efforts, the retrieval accuracy of today's CBIR algorithms is still limited and often worse than that of keyword-based approaches. The problem stems from the fact that visual similarity measures, such as color histograms, do not in general match the perceptual semantics and subjectivity of images. In addition, each type of image feature tends to capture only one of many aspects of image similarity, and it is difficult to require a user to specify clearly which aspect
query. To address these problems, interactive relevance feedback techniques have
been proposed [6, 7, 8, 9, 10, 11, 12]. The idea is that we should incorporate human
perception subjectivity into the retrieval process and provide users opportunities to
evaluate retrieval results, and automatically refine queries on the basis of those
evaluations. Lately, this research topic has become the most challenging one in CBIR
research.
The early relevance feedback schemes for CBIR have been mainly adopted from text
document retrieval research and can be classified into two approaches: query point
movement (query refinement) and re-weighting (similarity measure refinement).
Both have been built based upon the vector model in information retrieval theory [13,
14]. Recently, more computationally robust methods that perform global optimization
have been proposed. The MindReader retrieval system formulates a minimization
problem on the parameter estimating process [9]. It allows for correlations between
attributes in addition to different weights on each component.
However, as presented above, while all the approaches adapted from text document
retrieval do improve the performance of CBIR, there are severe limitations: even with
feedback, it is still difficult to capture high level semantics of images when only low-
level image features are used in queries. The inherent problem with these approaches
is that the low-level features are often not as powerful in representing complete
semantic content of images as keywords in representing text documents. In other
words, applying the relevance feedback approaches used in text document retrieval
technologies to low-level feature based image retrieval will not be as successful as in
text document retrieval. Using low-level features alone is not effective in representing users' feedback or in describing their intentions. Furthermore, in these algorithms, the semantics potentially captured during the relevance feedback process in one query session are not memorized to continuously improve the retrieval performance of the system. To overcome these limitations, another school of thought is to
use learning approaches in incorporating semantics in relevance feedback [15, 16, 17,
18].
The PicHunter framework further extended the relevance feedback and learning idea
with a Bayesian approach [17]. With an explicit model of what users would do given the target image they want, PicHunter uses Bayes' rule to predict the target they want, given their actions. This is done via a probability distribution over
possible image targets, rather than refining a query. To achieve this, an entropy-
minimizing display algorithm is developed that attempts to maximize the information
obtained from a user at each iteration of the search. Also, this proposed framework
makes use of hidden annotation rather than a possibly inaccurate and inconsistent
annotation structure that the user must learn and make queries in. However, this
could be a disadvantage as well since it excludes the possibility of benefiting from
good annotations, which may lead to a very slow convergence.
Motivated by the work of statistical language modeling [19] and the link structure
analysis in web page search [20, 21], we propose a statistical correlation model that is
able to accumulate and memorize the semantic knowledge learnt from the relevance
feedback information of previous queries. We have also developed an effective
algorithm to apply this model in image retrieval so as to help yield better results for
future queries. This model simply estimates the probability of how likely two images
are semantically similar to each other based on the co-occurrence frequency that both
images are labeled as positive examples during a query / feedback session. It can be
trained from the users’ relevance feedback log, and dynamically updated during the
image retrieval process. The algorithm is so simple that it can be easily incorporated
into an image retrieval system.
Preliminary versions of this paper appeared in the proceedings of the 3rd International Workshop on Multimedia Information Retrieval (MIR 2001) [22] and as a keynote at the 1st International Workshop on Pattern Recognition in Information Systems (PRIS 2001).
The remainder of this paper is organized as follows. In Section 2, the definition of the
correlation model is introduced. In Section 3, the training algorithms of the model are
described. In Section 4, the image ranking schemes based on the correlation model
are explained. Preliminary experimental results on a database of 100,000 images are
presented in Section 5. Finally, concluding remarks are given in Section 6.
2. The Correlation Model
The main idea behind the proposed model is the assumption that two images represent
similar semantics if they are jointly labeled as relevant to the same query in a
relevance feedback phase. Accordingly, the model estimates the semantic correlation
between two images based on the number of search sessions in which both images are
relevant examples. A search session starts with a query phase, and is possibly
followed by one or more feedback phases. For simplicity, the number of search sessions in which two images are co-relevant is referred to as their bigram frequency, while the number of sessions in which a single image is relevant is referred to as its unigram frequency. The maximum value over all unigram and bigram frequencies is referred to as the maximum frequency. Intuitively, the larger the
bigram frequency is, the more likely that these two images are semantically similar to
each other, so the higher the semantic correlation between them. Ideally, the
correlation strength might be defined as the ratio between the bigram frequency and
the total number of search sessions. In practice, however, there are many images in
the database, and users are usually reluctant to provide feedback information.
Therefore, the bigram frequency is very small with respect to the number of queries.
Here, we define the semantic correlation between two images as the ratio between the
bigram frequency and the maximum frequency. Since the definition of bigram
frequency is symmetric, the semantic correlation is also symmetric. The self-
correlation, i.e., the correlation between an image and itself, is defined in a similar
way, except that the bigram frequency is changed with the unigram frequency of this
image. By definition, the correlation strength is within the interval between 0 and 1.
To be specific,

0 ≤ R(I, J) ≤ 1,
R(I, J) = R(J, I),
R(I, J) = U(I) / M,   if I = J,
R(I, J) = B(I, J) / M,   if I ≠ J,

where I and J are two images, B(I, J) is their bigram frequency, U(I) is the unigram frequency of image I, M is the maximum frequency, and R(I, J) is the semantic correlation strength between images I and J.
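For concreteness, the model can be sketched in a few lines of Python. The dictionary-based frequency storage and the function name below are illustrative assumptions for this sketch, not the system's actual implementation:

```python
def correlation(i, j, unigram, bigram, max_freq):
    """Semantic correlation R(I, J).

    unigram: dict mapping image id -> U(I);
    bigram:  dict mapping an ordered pair (min, max) -> B(I, J),
             stored once since B is symmetric;
    max_freq: M, the maximum over all unigram and bigram frequencies.
    """
    if max_freq == 0:
        return 0.0
    if i == j:
        # self-correlation uses the unigram frequency
        return unigram.get(i, 0) / max_freq
    # symmetric lookup: R(I, J) = R(J, I)
    return bigram.get((min(i, j), max(i, j)), 0) / max_freq
```

Because the lookup key is the ordered pair, symmetry of R follows directly from the symmetry of B.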
3. Training Algorithms
The proposed correlation model is solely determined by the unigram and bigram
frequencies of images in the database. An intuitive training method is to obtain these
frequencies from the statistics of user-provided feedback information collected in the
user log. Let A denote the query-image adjacency matrix. Its (i, j )th entry is equal
to 1 if the j th image is relevant to the i th query, and is equal to 0 otherwise. Then the
co-relevant matrix AT A contains all the necessary information: its diagonal entries are the unigram frequencies, while its off-diagonal entries are the bigram frequencies. This method is quite simple and can be used to dynamically update the correlation model during the image retrieval process.
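For illustration, this computation amounts to a single matrix product. The following NumPy sketch uses a hypothetical three-query, four-image log and recovers both frequency types at once:

```python
import numpy as np

# Toy query-image adjacency matrix A (hypothetical log):
# A[i, j] = 1 if image j is relevant to query i, else 0.
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0]])

C = A.T @ A            # co-relevant matrix
unigram = np.diag(C)   # diagonal entries: unigram frequencies U(I)
# off-diagonal entries C[i, j], i != j, are the bigram frequencies B(I, J)
```

Since C is symmetric by construction, the symmetry of the bigram frequencies comes for free.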
In our solution, the definition of unigram and bigram frequency is extended to take
account of irrelevant images. For a specific search session, we assume a positive
correlation between two positive (relevant) examples, and the corresponding bigram
frequency is increased. We assume a negative correlation between a positive example
and a negative (irrelevant) example, and their bigram frequency is decreased.
However, we do not assume any correlation between two negative examples, because
they may be irrelevant to the user’s query in many different ways. Accordingly, the
unigram frequency of a positive example is increased, while that of a negative
example is decreased. The non-feedback images are not automatically treated as
negative examples in our proposed model. Therefore, these images are excluded from
the calculation of unigram and bigram frequencies.
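One session's update under this extended scheme might be sketched as follows. The dictionary-based storage and the function name are our assumptions for illustration; only labeled images are touched, matching the rule that non-feedback images are excluded:

```python
def update_frequencies(positives, negatives, unigram, bigram):
    """Apply one search session's feedback to the frequency counts.

    positives / negatives: sets of image ids labeled in this session.
    unigram: dict image -> U(I); bigram: dict (min, max) pair -> B(I, J).
    """
    # unigram: increase for positives, decrease for negatives
    for i in positives:
        unigram[i] = unigram.get(i, 0) + 1
    for i in negatives:
        unigram[i] = unigram.get(i, 0) - 1
    # positive-positive pairs: positive correlation, bigram increased
    pos = sorted(positives)
    for a in range(len(pos)):
        for b in range(a + 1, len(pos)):
            key = (pos[a], pos[b])
            bigram[key] = bigram.get(key, 0) + 1
    # positive-negative pairs: negative correlation, bigram decreased
    for p in positives:
        for n in negatives:
            key = (min(p, n), max(p, n))
            bigram[key] = bigram.get(key, 0) - 1
    # negative-negative pairs: no correlation assumed, nothing updated
```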
To overcome the problem of data sparseness, the feedback from search sessions with the same query, either a text query or an image example, is grouped together so that feedback images from different sessions can contribute correlation information to one another.
Within each group of search sessions with the same query, the local unigram
frequency of each image, which is referred to as unigram count, is calculated at first.
Based on these counts, the global unigram and bigram frequencies are updated.
4. Image Ranking Schemes
The basic idea is to reorder the retrieved images based on the correlation model. It
comes from the following observations. Given a query, the similarity between the
query and an image in the database is measured based on their feature vectors. If the
user provides any relevance feedback, the similarity measure is refined accordingly.
Images with the highest similarities are returned as the retrieval results. Among these images, some are relevant to the query and some are not. In general, the retrieval precision declines as the number of images in consideration increases. This
implies that the feature-based similarity is also a measure of relevance although it is
often not good enough. On the other hand, the retrieved images exhibit different
relationships. Since relevant images convey semantically similar content with respect
to the query, it is likely that previous users have already judged them as co-relevant
through relevance feedback. Therefore, the correlation strength between two relevant
images is expected to be high. In contrast, as irrelevant images may be semantically
different from the query in many different aspects, it is unlikely that they have been
jointly labeled as relevant. Thus, the correlation strength between two irrelevant
images is expected to be low. Similarly, the correlation between a relevant image and
an irrelevant one is also expected to be low. Therefore, it is reasonable to assume that
images having strong correlations with the top-ranked images are likely to be relevant
to the query, even if their similarity scores defined in the feature space are low.
With each retrieved image, we associate a non-negative relevance score, which can be
treated as semantic similarity, and is initialized to its feature-based similarity. Then,
we make use of the relationship between images to iteratively update the scores in the
following way. Relevance scores are propagated into other images via the correlation
model, and each image receives a refined score: the sum of all relevance scores, each weighted by the correlation strength between this image and the others in the retrieved list. The refined relevance scores are further propagated into
others. This process repeats until the properly normalized scores converge to some
equilibrium values.
Suppose there are n images returned by the system, denoted I_1, I_2, …, I_n, ranked in descending order of their similarities. Let P denote the vector of relevance scores, and W be an n × n matrix whose (i, j)th entry is equal to the correlation strength between images I_i and I_j. The iterative refinement of relevance scores is equivalent to P' = λ_k W^k P with k increasing without bound, where λ_k is a normalization factor and P' is the vector of refined scores. As W is a symmetric matrix with only non-negative entries, it can be proved that the unit vector in the direction of P' converges to the principal eigenvector of W, which corresponds to the largest eigenvalue of W and has only non-negative entries [20]. This leads to a possible image ranking scheme: images are re-ranked based on the corresponding coordinates of the principal eigenvector of W.
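This eigenvector-based scheme can be sketched with simple power iteration. The NumPy sketch below is illustrative; the fixed iteration count is our assumption, chosen to reach convergence on small examples:

```python
import numpy as np

def eigen_rank(W, iters=50):
    """Re-rank images by the principal eigenvector of W.

    W: symmetric, non-negative correlation matrix.
    Returns image indices sorted by descending eigenvector coordinate.
    """
    n = W.shape[0]
    p = np.ones(n) / np.sqrt(n)      # start from a uniform vector
    for _ in range(iters):
        p = W @ p                     # propagate scores: W^k P
        p /= np.linalg.norm(p)        # normalization factor lambda_k
    return np.argsort(-p)             # descending coordinate order
```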
This ranking method does improve the image retrieval precision in our experiments.
However, it is not reliable enough. Unlike the link structure analysis in web page
search [20, 21], the correlation model is trained using the limited feedback
information available. Thus, it may not be well trained and may be inaccurate in some sense, so we cannot rely on it alone to re-rank images. Moreover, when the number of retrieved images is large, the extraction of the principal eigenvector becomes computationally inefficient, and images relevant to other semantics might dominate the coordinates of the principal eigenvector. Therefore, a more reasonable method is
to calculate the relevance scores in an efficient way and combine them with the
feature-based similarities in producing the final ranking.
Our ranking scheme is as follows. For image I_j, its relevance score p_j is initialized to its similarity s_j, and is iteratively updated a fixed number of times k according to the following equation:

p_j = ( Σ_{i=1}^{m} p_i × r_ij ) / ( Σ_{i=1}^{m} p_i ),   m ≤ n,   j = 1, 2, …, n,

where r_ij is the correlation strength between images I_i and I_j, i.e., r_ij = R(I_i, I_j). In this equation, only the relevance scores of the top m images are propagated to the others. The final ranking score of image I_j is then the weighted sum of the relevance score p_j and the similarity s_j:

S_j = w × s_j + (1 − w) × p_j,   0 ≤ w ≤ 1,

where w is the semantic weight.
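The complete ranking computation can be sketched as follows. This is an illustrative NumPy sketch: for simplicity it propagates scores from the first m images of the similarity-ranked list and assumes positive similarity scores, so the normalizing sum is never zero:

```python
import numpy as np

def rank_scores(s, R, k=5, m=30, w=0.5):
    """Final ranking scores S_j = w * s_j + (1 - w) * p_j.

    s: feature-based similarities of the n retrieved images,
       assumed positive and listed in descending order;
    R: n x n semantic correlation matrix;
    k: number of propagation iterations; m: images propagated;
    w: semantic weight in [0, 1].
    """
    n = len(s)
    m = min(m, n)
    p = np.asarray(s, dtype=float)     # relevance scores, init to s_j
    for _ in range(k):
        top = p[:m]
        # p_j = sum_i p_i * r_ij / sum_i p_i over the top-m images
        p = (top @ R[:m, :]) / top.sum()
    return w * np.asarray(s) + (1 - w) * p
```

With w = 1 the scheme degenerates to the original feature-based ranking, while w = 0 ranks purely by propagated relevance.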
5. Experiments
We have implemented the correlation model and integrated it with an image search
system [23], which provides the functionalities of keyword-based image search, query by image example, and relevance feedback. In this system, the image database has been greatly expanded and now contains about 100,000 images collected from more than 2,000 representative websites. These images cover a variety of categories, such as "animals", "arts", "nature", etc. Their high-level textual features and low-level visual features are extracted from the web pages containing the images and from the images themselves, respectively [24]. The following six low-level features are used in this
system: color histogram in HSV space with quantization 256, first and second color
moments in Lab space, color coherence vector in LUV space with quantization 64,
MRSAR texture feature, Tamura coarseness feature, and Tamura directionality.
The correlation model is trained using users' search and feedback data collected in the user log. After months of internal use, about 3,000 queries with relevance feedback were collected.
Two experiments were conducted to evaluate the proposed method: one on text-based image retrieval, the other on pure content-based retrieval. For the former, we chose 20
text queries. These queries are the following keywords: car, flower, tree, cat,
submarine, mars, spring, galaxy, movie star, potato, ship, space, tomb raider, woman,
mountain, Clinton, Jordan, angel, dog, and summer. We then asked two subjects to perform the image search experiments. Each of them was required to search for images with every query twice and to label all relevant and irrelevant images within the top 200 results returned by the system, according to his or her own subjective judgment. In the first search, no feedback was given; in the second, three images were selected as either positive or negative examples. All this information was stored in the log, from which
the ground-truth is extracted automatically. For CBIR, 10 image examples were
selected. So far, only one subject performed the query by example experiments. The
ground-truth is obtained by repeatedly conducting relevance feedback until no more
images relevant to the query can be retrieved.
Based on the queries and the ground-truth, the performance evaluation is conducted
automatically by setting different semantic weights. In the experiments, the number
of iterations to refine the relevance scores is set to 5 ( k = 5 ), while the number of
images to propagate relevance scores is set to 30 ( m = 30 ). Because of the
subjectivity of relevance judgment, the image retrieval precision is calculated for each
subject separately, and is averaged finally. The precision is defined as the percentage
of relevant images in the retrieved list.
The experimental results for CBIR are presented in Figure 1, and those for text-based retrieval in Figure 2, where the horizontal axis is the number of top images in consideration, the vertical axis is the corresponding retrieval precision, and w is the semantic weight. It is not surprising that the performance of text-based retrieval is
much higher than that of CBIR. In both cases, the proposed correlation model
significantly improves the retrieval precision. For CBIR, the precision is improved
from 10% to 41% for top 10 images, while from 4.6% to 18.5% for top 100 images.
6. Conclusion
Acknowledgments
We thank Fang Qian, Xiaoxin Yin and Lei Zhang for helping perform the image
search experiments.
References
About the Author – Zheng Chen received his B.S., and Ph.D. degrees in computer
science from Tsinghua University, China, in 1994 and 1999, respectively. He joined
Microsoft Research China in March 1999. His research interests include speech
recognition, natural language processing, information retrieval, multimedia
information retrieval, personal information management, and artificial intelligence.
About the Author – Hong-Jiang Zhang received his B.S. from Zhengzhou University,
China in 1982, and Ph.D. from the Technical University of Denmark in 1991, both in
electrical engineering. His research interests include video and image analysis and
processing, content-based image / video / audio retrieval, media compression and
streaming, computer vision and their applications in consumer and enterprise markets.
He has published over 120 articles in these areas. He is a senior member of the IEEE, and also serves on the editorial boards of 5 professional journals and a dozen committees of various international conferences.