
Naive Bayes Nearest Neighbor Classifiers

Christos Varytimidis
Image, Video and Multimedia Systems Laboratory National Technical University of Athens

January 2011

Outline

Irani - In Defence of Nearest-Neighbor Based Image Classification

Wang - Image-to-Class Distance Metric Learning for Image Classification

Behmo - Towards Optimal Naive Bayes Nearest Neighbor


Irani - In Defence of Nearest-Neighbor Based Image Classification

Boiman, Shechtman and Irani, CVPR 2008: In Defence of Nearest-Neighbor Based Image Classification

Naive Bayes Nearest-Neighbor Classifier

especially severe for classes with large diversity. In this paper we propose a remarkably simple non-parametric NN-based classifier, which requires no descriptor quantization, and employs a direct Image-to-Class distance. We show that under the Naive-Bayes assumption (i.e., descriptors are i.i.d. given the image class), the theoretically optimal image classifier can be accurately approximated by this simple algorithm. For brevity, we refer to this classifier as NBNN, which stands for Naive-Bayes Nearest-Neighbor. NBNN is embarrassingly simple: Given a query image, compute all its local image descriptors $d_1, \ldots, d_n$. Search for the class $C$ which minimizes the sum $\sum_{i=1}^{n} \|d_i - NN_C(d_i)\|^2$ (where $NN_C(d_i)$ is the NN descriptor of $d_i$ in class $C$). Although NBNN is extremely simple and requires no learning/training, its performance ranks among the top leading learning-based image classifiers. Empirical comparisons are shown on several challenging databases (Caltech-101, Caltech-256 and Graz-01). The paper is organized as follows: Sec. 2 discusses the causes for the inferior performance of standard NN-based image classifiers. Sec. 3 provides the probabilistic formulation and the derivation of the optimal Naive-Bayes image classifier. In Sec. 4 we show how the optimal Naive-Bayes classifier can be accurately approximated with a very simple NN-based classifier (NBNN). Finally, Sec. 5 provides empirical evaluation and comparison to other methods.





As will be shown next, the amount of discriminative information is considerably reduced due to the rough quantization. Learning-based algorithms can compensate for some of this information loss by their learning phase, leading to good classification results. This, however, is not the case for simple non-parametric algorithms, since they have no training phase to undo the quantization damage. It is well known that highly frequent descriptors have low quantization error, while rare descriptors have high quantization error. However, the most frequent descriptors in a large database of images (e.g., Caltech-101) comprise simple edges and corners that appear abundantly in all the classes within the database, and are therefore least informative for classification (they provide very low class discriminativity). In contrast, the most informative descriptors for classification are the ones found in one (or few) classes, but are rare in other classes. These discriminative descriptors tend to be rare in the database, hence get high quantization error. This problem is exemplified in Fig. 1 on a face image from Caltech-101, even when using a relatively large codebook of quantized descriptors. As noted before [14, 26], when densely sampled image descriptors are divided into fine bins, the bin-density follows a power-law (also known as long-tail or heavy-tail distributions). This implies that most descriptors are infrequent (i.e., found in low-density regions in the descriptor space), and are therefore rather isolated.
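The quantization error discussed here is simply a descriptor's distance to its assigned codeword. A minimal sketch of measuring it, assuming a codebook (e.g., k-means centroids) is already given; all names are illustrative:

```python
import numpy as np

def quantization_error(descriptors, codebook):
    """Per-descriptor quantization error: squared distance to the nearest codeword.

    descriptors: (n, d) array of local descriptors (e.g., SIFT)
    codebook:    (k, d) array of codewords (e.g., k-means centroids)
    """
    # Pairwise squared distances between descriptors and codewords
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1)  # error of assigning each descriptor to its closest codeword

# Rare, informative descriptors (far from any codeword) get a high error,
# frequent ones (simple edges, corners) a low error, as illustrated in Fig. 1 (b)-(d).
```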

Quantization Error


Figure 1. Effects of descriptor quantization: Informative descriptors have low database frequency, leading to high quantization error. (a) An image from the Face class in Caltech-101. (b) Quantization error of densely computed image descriptors (SIFT) using a large codebook (size 6,000) of Caltech-101 (generated using [14]). Red = high error; Blue = low error. The most informative descriptors (eye, nose, etc.) have the highest quantization error. (c) Green marks the 8% of the descriptors in the image that are most frequent in the database (simple edges). (d) Magenta marks the 8% of the descriptors in the image that are least frequent in the database (mostly facial features).


In the bag-of-words approach, the descriptor data (typically hundreds of thousands of descriptors extracted from the training images) is quantized to a rather small codebook (typically 200-1000 representative descriptors). Lazebnik et al. [16] further proposed to add rough quantized location information to the histogram representation.

Effects of descriptor Quantization


In other words, there are almost no clusters in the descriptor space. Consequently, any clustering to a small number of clusters (even thousands) will inevitably incur a very high quantization error in most database descriptors. Thus, such a long-tail descriptor distribution is inherently inappropriate for quantization. High quantization error leads to a drop in the discriminative power of descriptors. Moreover, the more informative (discriminative) a descriptor is, the more severe the degradation in its discriminativity. This is shown quantitatively in Fig. 2. The graph provides evidence of the severe drop in the discriminativity (informativeness) of the (SIFT) descriptors in Caltech-101 as a result of quantization. The descriptor discriminativity measure of [2, 26] was used: $p(d|C)/p(d|\bar{C})$, which measures how well a descriptor $d$ discriminates between its class $C$ and all other classes $\bar{C}$. We compare the average discriminativity of all descriptors in all Caltech-101 classes after quantization, $p(d_{quant}|C)/p(d_{quant}|\bar{C})$, to their discriminativity before quantization. Alternative methods have been proposed for generating compact codebooks via informative feature selection [26, 2]. These approaches, however, discard all but a small set of highly discriminative descriptors/features. In particular, they discard all descriptors with low discriminativity. Although individually such descriptors offer little discriminative power, collectively they still carry useful discriminative information.




Effects of descriptor Quantization


Figure 2. Effects of descriptor quantization Severe drop in descriptor discriminative power. We generated a scatter plot of descriptor discriminative power before and after quantization (for a very large sample set of SIFT descriptors d in Caltech-101, each for its respective class C ). We then averaged this scatter plot along the y-axis. This yields the Average discriminative power after quantization (the RED graph). The display is in logarithmic scale in both axes. NOTE: The more informative (discriminative) a descriptor d is, the larger the drop in its discriminative power.

2.2. Image-to-Image vs. Image-to-Class Distance


In this section we argue that the Image-to-Image distance, which is fundamental to most image classifiers, significantly limits the generalization of non-parametric image classifiers when the number of labelled (training) images is small.

Image-to-Class Distance

Figure 3. Image-to-Image vs. Image-to-Class distance. A Ballet class with large variability and a small number (three) of labelled images (bottom row). Even though the Query-to-Image distance is large to each individual labelled image, the Query-to-Class distance is small. Top right image: For each descriptor at each point in Q we show (in color) the labelled image which gave it the highest descriptor likelihood. It is evident that the new query configuration is more likely given the three images than each individual image separately. (Images taken from [4].)


Probabilistic formulation - Maximum Likelihood

Inferring new image configurations by composing pieces from a set of other images was previously shown to be useful in [17, 4]. We prove (Sec. 3) that under the Naive-Bayes assumption, the optimal distance to use in image classification is the KL Image-to-Class distance, and not the commonly used Image-to-Image distribution distances (KL, $\chi^2$, etc.).

3. Probabilistic Formulation

In this section we derive the optimal Naive-Bayes image classifier, which is approximated by NBNN (Sec. 4). Given a new query (test) image $Q$, we want to find its class $\hat{C}$. It is well known [7] that the maximum-a-posteriori (MAP) classifier minimizes the average classification error: $\hat{C} = \arg\max_C p(C|Q)$.

Bayes Rule: $p(C|Q) = \dfrac{p(Q|C)\,p(C)}{p(Q)}$ (posterior = likelihood x prior / evidence)

When the class prior $p(C)$ is uniform, the MAP classifier reduces to the Maximum-Likelihood (ML) classifier:

$\hat{C} = \arg\max_C p(C|Q) = \arg\max_C p(Q|C)$


Let d1 , ..., dn denote all the descriptors of the query image Q. We assume the simplest (generative) probabilistic model, which is the Naive-Bayes assumption (that the descriptors d1 , ..., dn of Q are i.i.d. given its class C ), namely:


Probabilistic formulation - Naive Bayes Assumption

Naive Bayes Assumption = descriptors $d_1, \ldots, d_n$ are i.i.d. given the image class $C$


$p(Q|C) = p(d_1, \ldots, d_n | C) = \prod_{i=1}^{n} p(d_i|C)$

Taking the log probability of the ML decision rule we get:

$\hat{C} = \arg\max_C \log p(C|Q) = \arg\max_C \frac{1}{n} \sum_{i=1}^{n} \log p(d_i|C) \quad (1)$

The simple classifier implied by Eq. (1) is the optimal classification algorithm under the Naive-Bayes assumption.

In Sec. 4 we show how it can be accurately approximated using a non-parametric NN-based algorithm (without descriptor quantization).

Naive-Bayes classifier = Minimum Image-to-Class KL-Distance: In Sec. 2.2 we discussed the generalization benefits of using an Image-to-Class distance. We next show that the above MAP classifier of Eq. (1) is equivalent to minimizing Query-to-Class KL-distances. Eq. (1) can be rewritten as:

$\hat{C} = \arg\max_C \sum_{d} p(d|Q) \log p(d|C)$

where we sum over all possible descriptors $d$. We can subtract a constant term independent of $C$ from the right-hand side of the above equation without affecting $\hat{C}$. By subtracting $\sum_d p(d|Q) \log p(d|Q)$, we get:

$\hat{C} = \arg\max_C \sum_d p(d|Q) \log \frac{p(d|C)}{p(d|Q)} = \arg\min_C KL\big(p(d|Q)\,\|\,p(d|C)\big) \quad (2)$

where $KL(\cdot\|\cdot)$ is the KL-distance (divergence) between two probability distributions. In other words, under the Naive-Bayes assumption, the optimal MAP classifier minimizes a Query-to-Class KL-distance between the descriptor distributions of the query $Q$ and the class $C$. A similar relation between Naive-Bayes classification and KL-distance was used in [28] for texture classification, yet between Image-to-Image distances and not between an image and a class. KL-distances between descriptor distributions have also been used for classification before, again between pairs of images.

Parzen Window - Nearest Neighbors

The optimal MAP Naive-Bayes classifier of Eq. (1) requires computing the probability density $p(d|C)$ of a descriptor $d$ in a class $C$. A non-parametric estimate of this density is obtained with a Parzen window estimator [7]. Let $d_1^C, \ldots, d_L^C$ denote all the descriptors obtained from all the labelled images of class $C$. The Parzen likelihood estimate is:

$\hat{p}(d|C) = \frac{1}{L} \sum_{j=1}^{L} K(d - d_j^C) \quad (3)$

where $K$ is the Parzen kernel function (typically a Gaussian, $K(x) = \exp(-\|x\|^2/2\sigma^2)$). Computing Eq. (3) exactly is computationally time-consuming, since it requires the distance $(d - d_j^C)$ for all descriptors in the class. However, due to the long-tail characteristic of descriptor distributions, almost all of the descriptors are rather isolated in the descriptor space, and therefore very far from most descriptors in the database. Consequently, all of the terms in the summation of Eq. (3), except for a few, will be negligible ($K$ decreases exponentially with the NN distance). Thus we can accurately approximate the summation in Eq. (3) using the (few) $r$ largest elements in the sum. These $r$ largest elements correspond to the $r$ nearest neighbors of a descriptor $d \in Q$ within the class descriptors $d_1^C, \ldots, d_L^C$:

$\hat{p}_{NN}(d|C) = \frac{1}{L} \sum_{j=1}^{r} K(d - NN_j^C(d)) \quad (4)$

Note that the approximation of Eq. (4) always bounds from below the complete Parzen window estimate of Eq. (3). Fig. 4 shows the accuracy of such NN approximation of the distribution $p(d|C)$. Even when using a very small number of nearest neighbors (as small as $r = 1$, a single nearest-neighbor descriptor for each $d$ in each class $C$), a very accurate approximation $\hat{p}_{NN}(d|C)$ of the complete Parzen window estimate is obtained (see Fig. 4.a). Moreover, the NN descriptor approximation hardly reduces the discriminative power of descriptors (see Fig. 4.b). This is in contrast to the severe drop in discriminativity of descriptors due to descriptor quantization. We have indeed found very small differences in the actual classification results when changing $r$ from 1 to 1000. The case of $r = 1$ is especially convenient to use, since $\log \hat{p}(d|C)$ obtains a very simple form: $\log \hat{p}(Q|C) \propto -\sum_{i=1}^{n} \|d_i - NN_C(d_i)\|^2$, and there is no longer a dependence on $\sigma$, the variance of the Gaussian kernel $K$. $r = 1$ was used in all the experimental results reported in Sec. 5.
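A small numerical sketch of this approximation (illustrative only: the Gaussian kernel matches Eq. (3), but the bandwidth value and data layout are assumptions):

```python
import numpy as np

def parzen_log_likelihood(d, class_descs, sigma=0.2, r=None):
    """Parzen estimate of p(d|C) from Eq. (3), or its r-NN approximation from Eq. (4).

    d:           (D,) query descriptor
    class_descs: (L, D) all descriptors of class C
    r:           if given, keep only the r largest kernel terms (r nearest neighbors)
    """
    sq_dists = ((class_descs - d) ** 2).sum(axis=1)
    if r is not None:
        sq_dists = np.sort(sq_dists)[:r]           # r nearest neighbors only
    kernel = np.exp(-sq_dists / (2 * sigma ** 2))  # Gaussian kernel terms
    return np.log(kernel.sum() / len(class_descs))

# With long-tailed descriptor data, r = 1 is typically already very close
# to the full estimate, which is the observation behind NBNN.
```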


NN density distribution - discriminativity


Figure 4. NN descriptor estimation preserves descriptor density distribution and discriminativity. (a) A scatter plot of the 1-NN probability density distribution p (d|C ) vs. the true NN distribution p(d|C ). Brightness corresponds to the concentration of points in the scatter plot. The plot shows that 1-NN distribution provides a very accurate approximation of the true distribution. (b) 20-NN descriptor approximation (Green graph) and 1NN descriptor approximation (Blue graph) preserve quite well the discriminative power of descriptors. In contrast, descriptor quantization (Red graph) severely reduces discriminative power of descriptors. Displays are in logarithmic scale in all axes.


The NBNN Algorithm!


The resulting Naive-Bayes NN image classifier (NBNN) can therefore be summarized as follows:

The NBNN Algorithm:
1. Compute descriptors $d_1, \ldots, d_n$ of the query image $Q$.
2. $\forall d_i\ \forall C$ compute the NN of $d_i$ in $C$: $NN_C(d_i)$.
3. $\hat{C} = \arg\min_C \sum_{i=1}^{n} \|d_i - NN_C(d_i)\|^2$.

Despite its simplicity, this algorithm accurately approximates the theoretically optimal Naive-Bayes classifier, requires no learning/training, and is efficient.
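A minimal brute-force sketch of these three steps (descriptor extraction is assumed to happen elsewhere; the data layout and names are illustrative, not the authors' code):

```python
import numpy as np

def nbnn_classify(query_descs, class_descs):
    """NBNN decision rule: pick the class minimizing sum_i ||d_i - NN_C(d_i)||^2.

    query_descs: (n, D) descriptors of the query image Q
    class_descs: dict mapping class label -> (L_c, D) descriptors of that class
    """
    scores = {}
    for c, descs in class_descs.items():
        # squared distances of every query descriptor to every descriptor of class c
        d2 = ((query_descs[:, None, :] - descs[None, :, :]) ** 2).sum(-1)
        scores[c] = d2.min(axis=1).sum()  # sum of squared NN distances
    return min(scores, key=scores.get)
```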


Combining Several Types of Descriptors


Combining Several Types of Descriptors: Recent approaches to image classification [5, 6, 20, 27] have demonstrated that combining several types of descriptors in a single classifier can significantly boost the classification performance. In our case, when multiple ($t$) descriptor types are used, we represent each point in each image using $t$ descriptors. Using a Naive-Bayes assumption on all the descriptors of all types yields a very simple extension of the NBNN algorithm above. The decision rule linearly combines the contribution of each of the $t$ descriptor types. Namely, Step (3) in the above single-descriptor-type NBNN is replaced by:

$\hat{C} = \arg\min_C \sum_{j=1}^{t} w_j \sum_{i=1}^{n} \|d_i^j - NN_C(d_i^j)\|^2$

where $d_i^j$ is the $i$-th query descriptor of type $j$, and $w_j$ are determined by the variance of the Parzen Gaussian kernel $K_j$ corresponding to descriptor type $j$. Unlike [5, 6, 20, 27], who learn weights $w_j$ per descriptor-type per class, our $w_j$ are fixed and shared by all classes.

Computational Complexity & Runtime: We use the efficient approximate-r-nearest-neighbors algorithm and KD-tree implementation of [23]. The expected time for a NN-search is logarithmic in the number of elements stored in the KD-tree [1]. Note that the KD-tree data structure is independent of the query images, so its construction is a one-time preprocessing step.
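A sketch of this multi-type decision rule using per-class KD-trees for the NN searches (scipy's cKDTree stands in for the approximate-NN library of [23]; the weights w_j are assumed to be given):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_class_trees(class_descs_per_type):
    """class_descs_per_type[j][c] is an (L, D_j) array of type-j descriptors of class c."""
    return [{c: cKDTree(descs) for c, descs in per_class.items()}
            for per_class in class_descs_per_type]

def nbnn_multi(query_descs_per_type, trees, weights):
    """Weighted multi-descriptor NBNN: sum_j w_j * sum_i ||d_i^j - NN_C(d_i^j)||^2."""
    classes = trees[0].keys()
    scores = {c: 0.0 for c in classes}
    for j, (descs, class_trees) in enumerate(zip(query_descs_per_type, trees)):
        for c in classes:
            nn_dists, _ = class_trees[c].query(descs, k=1)  # Euclidean NN distances
            scores[c] += weights[j] * float((nn_dists ** 2).sum())
    return min(scores, key=scores.get)
```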


Experiments - Descriptor extraction


The descriptors are densely computed for each image, at five different spatial scales, enabling some scale invariance.

Table 1. Comparing the performance of non-parametric NN-based approaches on the Caltech-101 dataset (n_label = 15). All the listed methods do not require a learning phase.

NN-based method | Performance
SPM NN Image [27] | 42.1 ± 0.81%
GBDist NN Image [27] | 45.2 ± 0.96%
GB Vote NN [3] | 52%
SVM-KNN [30] | 59.1 ± 0.56%
NBNN (1 Desc) | 65.0 ± 1.14%
NBNN (5 Desc) | 72.8 ± 0.39%


5.2. Experiments
Following common benchmarking procedures, we split each class into randomly chosen disjoint sets of training images and test images. In our NBNN algorithm, since there is no training, we use the term labelled images instead of training images. In learning-based methods, the training images are fed to a learning process generating a classifier.


Combining Several Types of Descriptors


5.1. Implementation
We tested our NBNN algorithm with a single descriptor-type (SIFT), and with a combination of 5 descriptor-types:
1. The SIFT descriptor ([19]).
2 + 3. Simple Luminance & Color Descriptors: We use log-polar sampling of raw image patches, and take the luminance part (L* from a CIELAB color space) as a luminance descriptor, and the chromatic part (a*b*) as a color descriptor. Both are normalized to unit length.
4. Shape descriptor: We extended the Shape-Context descriptor [22] to contain edge-orientation histograms in its log-polar bins. This descriptor is applied to texture-invariant edge maps [21], and is normalized to unit length.
5. The Self-Similarity descriptor of [25].

The descriptors are densely computed for each image, at five different spatial scales, enabling some scale invariance. To further utilize rough spatial position (similar to [30, 16]), we augment each descriptor $d$ with its location $l$ in the image: $\hat{d} = (d, l)$. The resulting $L_2$ distance between descriptors, $\|\hat{d}_1 - \hat{d}_2\|^2 = \|d_1 - d_2\|^2 + \alpha \|l_1 - l_2\|^2$, combines descriptor distance and location distance. ($\alpha$ was manually set in our experiments. The same fixed $\alpha$ was used for Caltech-101 and Caltech-256, and $\alpha = 0$ for Graz-01.)
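A small sketch of this location augmentation (alpha corresponds to the manually set weight in the text; the value below is a placeholder):

```python
import numpy as np

def augment_with_location(descs, locations, alpha=1.0):
    """Append sqrt(alpha)-scaled (x, y) to each descriptor so that the plain
    squared L2 distance becomes ||d1 - d2||^2 + alpha * ||l1 - l2||^2."""
    return np.hstack([descs, np.sqrt(alpha) * locations])

# aug = augment_with_location(sift_descs, keypoint_xy, alpha=0.5)
# Ordinary NN search on the augmented vectors now trades off appearance and position.
```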

NBNN requires no learning/training, yet its performance ranks among the top leading learning-based image classifiers. Sec. 5.3 further demonstrates experimentally the damaging effects of using descriptor quantization or Image-to-Image distances in a non-parametric classifier.


Results on Caltech-101


Figure 5. Performance comparison on Caltech-101. (a) Single descriptor type methods: NBNN (1 Desc), Griffin SPM [13], SVM-KNN [30], SPM [16], PMK [12], DHDP [29], GB SVM (SVM with Geometric Blur) [27], GB Vote NN [3], GB NN (NN-Image with Geometric Blur) [27], SPM NN (NN-Image with Spatial Pyramids Match) [27]. (b) Multiple descriptor type methods: NBNN (5 Desc), Bosch Trees (with ROI Optimization) [5], Bosch SVM [6], LearnDist [11], SKM [15], Varma [27], KTA [18].

Our multi-descriptor NBNN algorithm performs even better (72.8% on 15 labelled images). GB Vote NN [3] uses an image-to-class NN-based voting scheme (without descriptor quantization), but each descriptor votes only for a single (nearest) class, hence the inferior performance.

Results - Contribution Evaluation


5.3. Impact of Quantization & Image-to-Image Dist.
In Sec. 2 we have argued that descriptor quantization and Image-to-Image distance degrade the performance of non-parametric image classifiers. Table 3 displays the results of introducing either of them into NBNN (tested on Caltech-101 with n_label = 30). The baseline performance of NBNN (1 Desc) with a SIFT descriptor is 70.4%. If we replace the Image-to-Class KL-distance in NBNN with an Image-to-Image KL-distance, the performance drops to 58.4% (i.e., a 17% drop in performance). To check the effect of quantization, the SIFT descriptors are quantized to a codebook of 1000 words. This reduces the performance of NBNN to 50.4% (i.e., a 28.4% drop in performance). The spatial pyramid match kernel of [16] measures distances between histograms of quantized SIFT descriptors, but within an SVM classifier. Their SVM learning phase compensates for some of the information loss due to quantization, raising classification performance up to 64.6%. However, comparison to the baseline performance of NBNN (70.4%) implies that the information loss incurred by the descriptor quantization was larger than the gain obtained by using SVM.

Table 3. Impact of introducing descriptor quantization or Image-to-Image distance into NBNN (using a SIFT descriptor on Caltech-101, n_label = 30).

Distance | No Quant. | With Quant.
Image-to-Class | 70.4% | 50.4% (-28.4%)
Image-to-Image | 58.4% (-17%) | -

Table 2. Results on Graz-01.

Class | Opelt [24] | Zhang [31] | Lazebnik [16] | NBNN (1 Desc) | NBNN (5 Desc)
Bikes | 86.5 | 92.0 | 86.3 ± 2.5 | 89.2 ± 4.7 | 90.0 ± 4.3
People | 80.8 | 88.0 | 82.3 ± 3.1 | 86.0 ± 5.0 | 87.0 ± 4.6

On Graz-01, NBNN's performance is better than the learning-based classifiers of [16] (SVM-based) and [24] (Boosting-based). NBNN performs only slightly worse than the SVM-based classifier of [31].


Outline

Irani - In Defence of Nearest-Neighbor Based Image Classification

Wang - Image-to-Class Distance Metric Learning for Image Classification

Behmo - Towards Optimal Naive Bayes Nearest Neighbor

Wang - Image-to-Class Distance Metric Learning for Image Classification


Wang, Hu and Chia, ECCV 2010: Image-to-Class Distance Metric Learning for Image Classification

Contributions - Differences

Differences in approach:
1. Mahalanobis distance instead of Euclidean (learning a metric M_c per class)
2. Spatial Pyramid Match
3. Weighted local features

Results:
1. Fewer features/descriptors needed per image
2. Lower testing time
3. and... higher performance!

Notation

In this section, we formulate a large margin convex optimization problem for learning the Per-Class metrics and introduce an efficient gradient descent method to solve this problem. We also adopt two strategies to further enhance the discrimination of our learned I2C distance.

2.1 Notation

Our work deals with the image represented by a collection of its local feature descriptors extracted from patches around each keypoint. Let $F_i = \{f_{i1}, f_{i2}, \ldots, f_{im_i}\}$ denote the features belonging to image $X_i$, where $m_i$ represents the number of features in $X_i$ and each feature is denoted as $f_{ij} \in R^d$, $j \in \{1, \ldots, m_i\}$. To calculate the I2C distance from an image $X_i$ to a candidate class $c$, we need to find the NN of each feature $f_{ij}$ in class $c$, which is denoted as $f_{ij}^c$. The original I2C distance from image $X_i$ to class $c$ is defined as the sum of Euclidean distances between each feature in image $X_i$ and its NN in class $c$.

Introducing Mahalanobis distance



The original I2C distance can be formulated as:

$Dist(X_i, c) = \sum_{j=1}^{m_i} \|f_{ij} - f_{ij}^c\|^2 \quad (1)$

After learning the Per-Class metric $M_c \in R^{d \times d}$ for each class $c$, we replace the Euclidean distance between each feature in image $X_i$ and its NN in class $c$ by the Mahalanobis distance, and the learned I2C distance becomes:

$Dist(X_i, c) = \sum_{j=1}^{m_i} (f_{ij} - f_{ij}^c)^T M_c (f_{ij} - f_{ij}^c) \quad (2)$

... more Notation
This learned I2C distance can also be represented in matrix form by introducing a new term $X_{ic}$, which is an $m_i \times d$ matrix representing the difference between all features in the image $X_i$ and their nearest neighbors in the class $c$:

$X_{ic} = \begin{pmatrix} (f_{i1} - f_{i1}^c)^T \\ (f_{i2} - f_{i2}^c)^T \\ \vdots \\ (f_{im_i} - f_{im_i}^c)^T \end{pmatrix} \quad (3)$

So the learned I2C distance from image $X_i$ to class $c$ can be reformulated as:

$Dist(X_i, c) = Tr(X_{ic} M_c X_{ic}^T) \quad (4)$

This is equivalent to equation (2). If $M_c$ is an identity matrix, then it is also equivalent to the original Euclidean distance form of equation (1). In the following subsection, we will use this formulation in the optimization function.
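A small sketch of the learned I2C distance of equations (2)-(4) (a minimal illustration; the per-class metric M_c is assumed to have been learned already):

```python
import numpy as np

def learned_i2c_distance(features, class_descs, M_c):
    """I2C distance from an image to class c under a learned Mahalanobis metric M_c.

    features:    (m, d) local features of the image
    class_descs: (L, d) all training features of class c
    M_c:         (d, d) positive semi-definite metric for class c
    """
    # Nearest neighbor of each image feature within class c (Euclidean NN search)
    d2 = ((features[:, None, :] - class_descs[None, :, :]) ** 2).sum(-1)
    nn = class_descs[d2.argmin(axis=1)]
    diff = features - nn                       # rows of the matrix X_ic in Eq. (3)
    # Tr(X_ic M_c X_ic^T) = sum_j (f_ij - f_ij^c)^T M_c (f_ij - f_ij^c)
    return float(np.einsum('jd,de,je->', diff, M_c, diff))
```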

Optimization problem

2.2 Problem Formulation

The objective function in our optimization problem is composed of two terms: the regularization term and the error term. This is analogous to the optimization problem in SVM. In the error term, we incorporate the idea of large margin and formulate the constraint that the I2C distance from image $X_i$ to its belonging class $p$ (named the positive distance) should be smaller than the distance to any other class $n$ (named the negative distance) by a margin:

$Tr(X_{in} M_n X_{in}^T) - Tr(X_{ip} M_p X_{ip}^T) \geq 1 \quad (5)$

In the regularization term, we simply minimize all the positive distances, similar to [20]. So for the whole objective function, on one side we try to minimize all the positive distances, and on the other side, for every image, we keep the negative distances away from the positive distance by a large margin.


In order to allow a soft margin, we introduce a slack variable $\xi$ in the error term, and the whole convex optimization problem is therefore formed as:

$\min_{M_1, M_2, \ldots, M_C} \; O(M_1, M_2, \ldots, M_C) = (1-\lambda) \sum_{i, p_i} Tr(X_{ip} M_p X_{ip}^T) + \lambda \sum_{i, p_i, n_i} \xi_{ipn} \quad (6)$

s.t. $\forall i, p, n: \; Tr(X_{in} M_n X_{in}^T) - Tr(X_{ip} M_p X_{ip}^T) \geq 1 - \xi_{ipn}$
$\forall i, p, n: \; \xi_{ipn} \geq 0$
$\forall c: \; M_c \succeq 0$

This optimization problem is an instance of SDP, which can be solved using a standard SDP solver. However, as standard SDP solvers are computationally expensive, we use an efficient gradient descent based method derived from [20,19] to solve our problem. Details are explained in the next subsection.

2.3 An Efficient Gradient Descent Solver

Due to the expensive computation cost of standard SDP solvers, we propose an efficient gradient descent solver derived from Weinberger et al. [20,19] to solve this optimization problem. Since the method proposed by Weinberger et al. targets learning only one global metric, we modify it to learn our Per-Class metrics. This solver updates all matrices iteratively by taking a small step along the gradient direction to reduce the objective function (6) and projecting onto the feasible set to ensure that each matrix is positive semi-definite in each iteration.

Gradient Descent: $M_c^{t+1} = M_c^t - \gamma \, \nabla_{M_c} O(M_1, M_2, \ldots, M_C)$

To evaluate the gradient of the objective function for each matrix, we denote the matrix $M_c$ for each class $c$ at the $t$-th iteration as $M_c^t$, and the corresponding gradient as $G(M_c^t)$. We define a set of triplet error indices $N^t$ such that $(i,p,n) \in N^t$ if $\xi_{ipn} > 0$ at the $t$-th iteration. Then the gradient $G(M_c^t)$ can be calculated by taking the derivative of the objective function (6) with respect to $M_c^t$:

An Efficient Gradient Descent Solver

$G(M_c^t) = (1-\lambda) \sum_{i,\, c=p} X_{ic}^T X_{ic} + \lambda \Big( \sum_{(i,p,n) \in N^t,\, c=p} X_{ic}^T X_{ic} - \sum_{(i,p,n) \in N^t,\, c=n} X_{ic}^T X_{ic} \Big) \quad (7)$

Directly calculating the gradient in each iteration using this formula would be computationally expensive. As the changes in the gradient from one iteration to the next are determined only by the differences between the sets $N^t$ and $N^{t+1}$, we use $G(M_c^t)$ to calculate the gradient $G(M_c^{t+1})$ in the next iteration, which is more efficient:

$G(M_c^{t+1}) = G(M_c^t) + \lambda \Big( \sum_{(i,p,n) \in (N^{t+1} \setminus N^t),\, c=p} X_{ic}^T X_{ic} - \sum_{(i,p,n) \in (N^{t+1} \setminus N^t),\, c=n} X_{ic}^T X_{ic} \Big) - \lambda \Big( \sum_{(i,p,n) \in (N^t \setminus N^{t+1}),\, c=p} X_{ic}^T X_{ic} - \sum_{(i,p,n) \in (N^t \setminus N^{t+1}),\, c=n} X_{ic}^T X_{ic} \Big) \quad (8)$

Since $(X_{ic}^T X_{ic})$ is unchanged during the iterations, we can accelerate the updating procedure by pre-calculating this value before the first iteration.

Since the optimization problem (6) is convex, this solver is able to converge to the global optimum. We summarize the whole workflow in Algorithm 1.

Gradient Descent Algorithm

Algorithm 1. A Gradient Descent Method for Solving Our Optimization Problem
Input: step size $\gamma$, parameter $\lambda$, and pre-calculated data $(X_{ic}^T X_{ic})$, $\forall i \in \{1, \ldots, N\}$, $c \in \{1, \ldots, C\}$
for c := 1 to C do
  $G(M_c^0) := (1-\lambda) \sum_{i, p_i} X_{ip}^T X_{ip}$
  $M_c^0 := I$
end for  {Initialize M and gradient for each class}
Set t := 0
repeat
  Compute $N^t$ by checking each error term $\xi_{ipn}$
  for c = 1 to C do
    Update $G(M_c^{t+1})$ using equation (8)
    $M_c^{t+1} := M_c^t - \gamma \, G(M_c^{t+1})$
    Project $M_c^{t+1}$ to keep it positive semi-definite
  end for
  Calculate the new objective function
  t := t + 1
until the objective function converges
Output: matrices $M_1, \ldots, M_C$
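A simplified sketch of this solver (illustrative only: it recomputes the full gradient each iteration instead of the incremental update of Eq. (8), and the triplet construction, step size and trade-off weight are assumptions):

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the positive semi-definite cone."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

def solve_metrics(XtX, triplets, n_classes, lam=0.5, step=1e-3, iters=100):
    """Learn per-class metrics M_c by projected gradient descent on objective (6).

    XtX[(i, c)]: pre-computed d x d matrix X_ic^T X_ic
    triplets:    list of (i, p, n) with p the true class of image i, n another class
    """
    d = next(iter(XtX.values())).shape[0]
    M = [np.eye(d) for _ in range(n_classes)]

    def dist(i, c):                      # Tr(X_ic M_c X_ic^T) = <M_c, X_ic^T X_ic>
        return float(np.sum(M[c] * XtX[(i, c)]))

    for _ in range(iters):
        G = [np.zeros((d, d)) for _ in range(n_classes)]
        for (i, p) in {(i, p) for (i, p, n) in triplets}:   # regularization term
            G[p] += (1 - lam) * XtX[(i, p)]
        for (i, p, n) in triplets:                          # hinge (error) term
            if 1 + dist(i, p) - dist(i, n) > 0:             # margin violated
                G[p] += lam * XtX[(i, p)]
                G[n] -= lam * XtX[(i, n)]
        M = [project_psd(M[c] - step * G[c]) for c in range(n_classes)]
    return M
```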

Spatial Pyramid Match

To generate a more discriminative I2C distance for better recognition performance, we improve our learned distance by adopting the ideas of spatial pyramid match [9] and of learning an I2C distance function [16].

Spatial pyramid match (SPM) was proposed by Lazebnik et al. [9]; it makes use of spatial correspondence, and the idea of pyramid match is adapted from Grauman et al. [8]. This method recursively divides the image into subregions at increasingly fine resolutions. We adopt this idea in our NN search by limiting each feature point in the image to find its NN only in the same subregion of a candidate class at each level. So the feature searching set in the candidate class is reduced from the whole image (top level, or level 0) to only the corresponding subregion (finer level); see Figure 2 for details. This spatial restriction enhances the robustness of the NN search by reducing the effect of noise due to wrong matches from other subregions. The learned distances from all levels are then merged together as a pyramid combination. In addition, we find in our experiments that a single-level spatial restriction at a finer resolution gives better recognition accuracy than the top level, especially for images with geometric scene structure, although the accuracy is slightly lower than the pyramid combination of all levels. Since the candidate searching set is smaller at a finer level, which requires less computation cost for the NN search, we can use just a single-level spatial restriction of the learned I2C distance to speed up the classification of test images.

Fig. 2. The left parallelogram denotes an image, and the right parallelograms denote images in a class. We adopt the idea of spatial pyramid by restricting each feature descriptor in the image to only find its NN in the same subregion from a class at each level.
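A sketch of the single-level spatial restriction (a rough illustration; the grid assignment and the handling of empty subregions are assumptions, not the paper's exact implementation):

```python
import numpy as np

def subregion_index(locations, img_size, grid=(2, 2)):
    """Map (x, y) feature locations to a cell index of a grid division of the image."""
    gx, gy = grid
    cx = np.minimum((locations[:, 0] / img_size[0] * gx).astype(int), gx - 1)
    cy = np.minimum((locations[:, 1] / img_size[1] * gy).astype(int), gy - 1)
    return cy * gx + cx

def restricted_i2c(features, locs, class_feats, class_locs, img_size, grid=(2, 2)):
    """I2C distance where each feature searches its NN only inside the same subregion."""
    q_cells = subregion_index(locs, img_size, grid)
    c_cells = subregion_index(class_locs, img_size, grid)
    total = 0.0
    for cell in np.unique(q_cells):
        q = features[q_cells == cell]
        ref = class_feats[c_cells == cell]
        if len(ref) == 0:           # empty subregion in the class: skip (a design choice)
            continue
        d2 = ((q[:, None, :] - ref[None, :, :]) ** 2).sum(-1)
        total += d2.min(axis=1).sum()
    return total
```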


Weights on local features

Compared to the top level, a finer-level spatial restriction not only reduces the computation cost, but also improves the recognition accuracy on most datasets. For some images without geometric scene structure, this single level can still preserve the recognition performance due to sufficient features in the candidate class. We also use the method of learning an I2C distance function proposed in [16] to combine with the learned Mahalanobis I2C distance. The idea of learning a local distance function was originally proposed by Frome et al. and used for image classification and retrieval in [6,5]. Their method learns a weighted distance function for measuring I2I distance, which is achieved by also using a large margin framework to learn the weight associated with each local feature. Wang et al. [16] have used this idea to learn a weighted I2C distance function from each image to a candidate class, and we find our distance metric learning method can be combined with this distance function learning approach. For each class, its weighted I2C distance is multiplied with our learned Per-Class matrix to generate a more discriminative weighted Mahalanobis I2C distance. Details of this local distance function for learning weights can be found in [6,16].
3 Experiment

3.1 Datasets and Setup

Experiments - descriptors

For feature extraction, we use a dense sampling strategy and SIFT features [12] as our descriptor, computed on 16x16 patches over a grid with a spacing of 8 pixels for all datasets. This is a simplified method compared to some papers that use densely sampled and multi-scale patches to extract a large number of features.

Results - Spatial Pyramid Match gain

Table 2. Accuracy (%) on the Sports, Scene 15 and Corel datasets:

Method | Sports | Scene 15 | Corel
I2CDML+SPM | 81.2 ± 0.52 | 79.7 ± 1.83 | 89.8 ± 1.16
I2CDML+Weight | 78.5 ± 0.74 | 81.3 ± 1.46 | 90.1 ± 0.94
I2CDML+SPM+Weight | 83.7 ± 0.49 | 84.3 ± 1.52 | 91.4 ± 0.88

Fig. 4. Comparing the performance of no spatial restriction (NS), spatial single level restriction (SSL) and spatial pyramid match (SPM) for both I2CDML and I2CDML+Weight in all the three datasets. With only spatial single level, it achieves better performance than without spatial restriction, although slightly lower than spatial pyramid combination of multiple levels. But it requires much less computation cost for feature NN search.

Then we show in Table 2 the improved I2C distance obtained through the spatial pyramid restriction, following the idea of spatial pyramid match in [9], and through the learned weight associated with each local feature from [16]. Both strategies are able to augment the recognition performance.

Results - Caltech 101



Fig. 5. Comparing the performances of I2CDML, I2CDML+Weight and NBNN from spatial divisions of 1x1 to 7x7 and the spatial pyramid combination (SPM) on Caltech 101.

We extract less than 1000 features per image on average using our feature extraction strategy, which is about 1/20 of the size of the feature set in [1]. We also use a single-level spatial restriction to constrain the NN search for acceleration. For each image, we divide it into from 2x2 to 7x7 subregions and test the performance.

Outline

Irani - In Defence of Nearest-Neighbor Based Image Classification

Wang - Image-to-Class Distance Metric Learning for Image Classification

Behmo - Towards Optimal Naive Bayes Nearest Neighbor

Behmo - Towards Optimal Naive Bayes Nearest Neighbor


Behmo, Marcombes, Dalalyan and Prinet, ECCV 2010: Towards Optimal Naive Bayes Nearest Neighbor

Fig. 1. Subwindow detection for the original NBNN (red) and for our version of NBNN (green). Since the background class is more densely sampled than the object class, the original NBNN tends to select an object window that is too small relative to the object instance. As these examples show, our approach addresses this issue.

Naive Bayes Nearest Neighbor (NBNN) is a classifier introduced in [1] that was designed to address this issue: NBNN is non-parametric, does not require any feature quantization step and thus uses to advantage the full discriminative power of visual features. However, in practice, we observe that NBNN performs relatively well on certain datasets, but not on others. To remedy this, we start by analyzing the theoretical foundations of NBNN. We show that this performance variability could stem from the assumption that the normalization factor involved in the kernel estimator of the conditional density of features is class-independent. We relax this assumption and provide a new formulation of NBNN which is richer than the original one. In particular, our approach is well suited for optimal, multi-channel image classification and object detection.

Contributions - Differences

Differences in approach:
1. Optimize the normalization factors of the Parzen window (learning)
2. Learn optimal combinations of different descriptors (channels)
3. Spatial Pyramid Matching
4. Classification by detection using ESS

Results:
1. Copes with differently populated classes
2. Has higher performance!
3. but... is slow at both learning and testing!

This classifier is shown to outperform the usual nearest neighbor classifier. Moreover, it does not require any feature quantization step, and the descriptive power of image features is thus preserved. The reasoning above proceeds in three distinct steps: the naive Bayes assumption considers that image features are independent and identically distributed given the image class $c_I$ (equation 1). Then, the estimation of a feature probability density is obtained by a non-parametric density estimation process like the Parzen-Rosenblatt estimator (equation 2). NBNN is based on the assumption that the logarithm of this value, which is a sum of distances, can be approximated by its largest term (equation 3). In the following section, we will show that the implicit simplification that consists in removing the normalization parameter from the density estimator is invalid in most practical cases. Along with the notation introduced in this section, we will also need the notion of point-to-set distance, which is simply the squared Euclidean distance of a point to its nearest neighbor in the set: $\forall \Phi \subset R^D, \forall x \in R^D, \; \delta(x, \Phi) = \inf_{y \in \Phi} \|x - y\|^2$. In what follows, $\delta(x, \Phi_c)$ will be abbreviated as $\delta_c(x)$.

Notation

2.2 Affine Correction of Nearest Neighbor Distance for NBNN

The most important theoretical limitation of NBNN is that, in order to obtain a simple approximation of the log-likelihood, the normalization factor $1/Z$ of the kernel estimator is assumed to be the same for all classes. Yet, there is no a priori reason to believe that this assumption is satisfied in practice. If this factor significantly varies from one class to another, then the approximation of the maximum a posteriori class label $\hat{c}_I$ by equation 4 becomes unreliable. It should be noted that the objection that we raise does not concern the core hypothesis of NBNN, namely the naive Bayes hypothesis and the approximation of the sum of exponentials of equation 2 by its largest term. In fact, in the following we will essentially follow and extend the arguments presented in [1], starting from the same hypotheses.

NBNN reasoning steps

1. Naive Bayes assumption (eq. 1)
2. Parzen window estimator of the pdf (eq. 2)
3. Nearest neighbor approximation (eq. 3) (invalid removal of the normalization parameters)

2.1 Initial Formulation of NBNN

Original NBNN

In this section, we briefly recall the main arguments of NBNN described by Boiman et al. [1] and introduce some necessary notation. In an image $I$ with hidden class label $c_I$, we extract $K_I$ features $(d_k^I)_k \subset R^D$. Under the naive Bayes assumption, and assuming all image labels are equally probable ($P(c) \approx$ const), the optimal prediction $\hat{c}_I$ of the class label of image $I$ maximizes the product of the feature probabilities relative to the class label:
$\hat{c}_I = \arg\max_c \prod_{k=1}^{K_I} P(d_k^I | c). \quad (1)$

The feature probability conditioned on the image class, $P(d_k^I|c)$, can be estimated by a non-parametric kernel estimator, also called the Parzen-Rosenblatt estimator. If we denote by $\Phi_c = \{d_k^J \mid c_J = c,\ 1 \leq k \leq K_J\}$ the set of all features from all training images that belong to class $c$, we can write:

$P(d_k^I|c) = \frac{1}{Z} \sum_{d \in \Phi_c} \exp\left( -\frac{\|d_k^I - d\|^2}{2\sigma^2} \right) \quad (2)$

where $\sigma$ is the bandwidth of the density estimator. In [1], this estimator is further approximated by the largest term of the sum on the RHS. This leads to a quite simple expression:

$\forall d, \forall c: \quad \log P(d|c) \propto -\min_{d' \in \Phi_c} \|d - d'\|^2. \quad (3)$

The decision rule for image $I$ is thus:

$\hat{c}_I = \arg\max_c P(I|c) = \arg\min_c \sum_{k=1}^{K_I} \min_{d \in \Phi_c} \|d_k^I - d\|^2 \quad (4)$

Correction of Nearest Neighbor Distance

The convergence speed of the Parzen-Rosenblatt (PR) estimator is $K^{-4/(4+D)}$ [13]. This means that in the case of a 128-dimensional feature space, such as the SIFT feature space, in order to reach an approximation bounded by 1/2 we need to sample $2^{33}$ points. In practice, the PR estimator does not converge and there is little sense in keeping more than just the first term of the sum. Thus, the log-likelihood of a visual feature $d$ relative to an image label $c$ is:

$\log P(d|c) = \log\left( \frac{1}{Z^c} \exp\left( -\frac{\delta^c(d)}{2(\sigma^c)^2} \right) \right) = -\frac{\delta^c(d)}{2(\sigma^c)^2} - \log(Z^c), \quad (6)$

where $Z^c = |\Phi_c| (2\pi)^{D/2} (\sigma^c)^D$. Recall that $\delta^c(d)$ is the squared Euclidean distance of $d$ to its nearest neighbor in $\Phi_c$. In the above equations, we have replaced the class-independent notation $\sigma, Z$ by $\sigma^c, Z^c$ since, in general, there is no reason to believe that these parameters should be equal across classes. For instance, both parameters are functions of the number of training features of class $c$ in the training set. Returning to the naive Bayes formulation, we obtain:

$\forall c: \quad -\log(P(I|c)) = \sum_{k=1}^{K_I} \left( \frac{\delta^c(d_k^I)}{2(\sigma^c)^2} + \log(Z^c) \right) = \alpha^c \sum_{k=1}^{K_I} \delta^c(d_k^I) + K_I \beta^c, \quad (7)$

where $\alpha^c = 1/(2(\sigma^c)^2)$ and $\beta^c = \log(Z^c)$ is a re-parametrization of the log-likelihood that has the advantage of being linear in the model parameters. The image label is then decided according to a criterion that is slightly different from equation 4:

$\hat{c}_I = \arg\min_c \left( \alpha^c \sum_{k=1}^{K_I} \delta^c(d_k^I) + K_I \beta^c \right). \quad (8)$

We note that this modified decision criterion can be interpreted in two different ways: it can either be interpreted as the consequence of a density estimator to which a multiplicative factor was added, or as an unmodified NBNN in which an affine correction has been added to the squared Euclidean distance. In the former, the resulting formulation can be considered different from the initial NBNN. In the latter, equation 8 can be obtained from equation 4 simply by replacing $\delta^c(d)$ by $\alpha^c \delta^c(d) + \beta^c$ (since $\alpha^c$ is positive, the nearest neighbor distance itself does not change). This formulation differs from [1] only in the evaluation of the distance function, leaving us with two parameters per class to be evaluated. At this point, it is important to recall that the introduction of parameters $\alpha^c$ and $\beta^c$ does not violate the naive Bayes assumption, nor the assumption of equiprobability of classes. If a class is more densely sampled than others (i.e., its feature space contains more training samples), then the NBNN estimator will have a bias towards that class, even though it made the assumption that all classes are equally probable. The purpose of setting appropriate values for $\alpha^c$ and $\beta^c$ is to correct this bias. It might be noted that deciding on a suitable value for $\alpha^c$ and $\beta^c$ reduces to defining an appropriate bandwidth $\sigma^c$. Indeed, the dimensionality $D$ of the feature space and the number $|\Phi_c|$ of training feature points are known parameters. However, in practice, choosing a good value for the bandwidth parameter is time-consuming and inefficient. To cope with this issue, we designed an optimization scheme to find the optimal values of parameters $\alpha^c, \beta^c$ with respect to cross-validation.
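A minimal sketch of the corrected decision rule of Eq. (8), single channel, assuming the per-class parameters alpha_c and beta_c have already been chosen (e.g., by the cross-validation scheme mentioned above):

```python
import numpy as np
from scipy.spatial import cKDTree

def corrected_nbnn(query_descs, class_descs, alpha, beta):
    """Affine-corrected NBNN: argmin_c alpha_c * sum_k delta_c(d_k) + K_I * beta_c.

    class_descs: dict class -> (L_c, D) training descriptors
    alpha, beta: dicts class -> float (the learned correction parameters)
    """
    K_I = len(query_descs)
    scores = {}
    for c, descs in class_descs.items():
        nn_dist, _ = cKDTree(descs).query(query_descs, k=1)
        delta_sum = float((nn_dist ** 2).sum())      # sum of squared NN distances
        scores[c] = alpha[c] * delta_sum + K_I * beta[c]
    return min(scores, key=scores.get)

# With alpha_c = 1 and beta_c = 0 for every class this reduces to the original NBNN rule.
```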


2.3 Multi-channel Image Classification

In the most general case, an image is described by different features coming from different sources or sampling methods. For example, we can sample SIFT features and local color histograms from an image. We observe that the classification criterion of equation 1 copes well with the introduction of multiple feature sources. The only difference should be the parameters for density estimation, since feature types correspond, in general, to different feature spaces. In order to handle different feature types, we need to introduce a few definitions.

Channels can be defined arbitrarily: a channel can be associated to a particular detector/descriptor pair, but can also represent global image characteristics. For instance, an image channel can consist of a single element, such as the global color histogram. Let us assume we have defined a certain number of channels $(\Phi_n)_{1 \leq n \leq N}$ that are expected to be particularly relevant to the problem at hand. Adapting the framework of our modified NBNN to multiple channels is just a matter of changing notation. Similarly to the single-channel case, we aim here at estimating the class label of an image $I$:

Combining descriptors

$$\hat{c}_I = \arg\max_c \; P(I\,|\,c), \qquad \text{with} \qquad P(I\,|\,c) = \prod_{n} \prod_{d \in \Omega_n(I)} P(d\,|\,c). \qquad (9)$$

Since different channels have different feature spaces, the density correction parameters should depend on the channel index: $\alpha_c, \beta_c$ will thus be denoted $\alpha_n^c, \beta_n^c$. The notation from the previous section is adapted in a similar way: we call $\Omega_n^c = \bigcup_{J \,|\, c_J = c} \Omega_n(J)$ the set of all features from class $c$ and channel $n$, and define the distance function of a feature $d$ to $\Omega_n^c$ by $\phi_n^c(d) = \phi(d, \Omega_n^c)$. This leads to the classification criterion:

$$\hat{c}_I = \arg\min_c \; \sum_{n} \Big( \sum_{d \in \Omega_n(I)} \alpha_n^c\, \phi_n^c(d) \;+\; \beta_n^c\, |\Omega_n(I)| \Big). \qquad (10)$$

Naturally, when adding feature channels to our decision criterion, we wish to balance the importance of each channel relative to its relevance to the problem at hand. Equation 10 shows that the role of relevance weighting can be assigned to the distance correction parameters. The problems of adequate channel balancing and nearest neighbor distance correction should thus be addressed in one single step. In the following section, we present a method to find the optimal values of these parameters.
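As an illustration, here is a minimal sketch of the multi-channel decision rule (10). The per-channel class feature sets $\Omega_n^c$ and the correction parameters $\alpha_n^c, \beta_n^c$ are assumed to be precomputed, and all names are ours, not the paper's.

```python
import numpy as np

def multichannel_nbnn_label(query_channels, class_channel_features, alpha, beta):
    """Multi-channel corrected NBNN decision (eq. 10).

    query_channels:          dict channel -> (K_n, D_n) descriptors of the query image.
    class_channel_features:  dict (class, channel) -> (M, D_n) training descriptors.
    alpha, beta:             dicts (class, channel) -> correction parameters.
    """
    class_labels = {c for (c, _) in class_channel_features}
    scores = {}
    for c in class_labels:
        total = 0.0
        for n, q in query_channels.items():
            feats = class_channel_features[(c, n)]
            # phi_n^c(d): squared NN distance of each query descriptor within channel n
            d2 = ((q[:, None, :] - feats[None, :, :]) ** 2).sum(axis=-1)
            total += alpha[(c, n)] * d2.min(axis=1).sum() + beta[(c, n)] * len(q)
        scores[c] = total
    return min(scores, key=scores.get)
```

The per-channel sums $\sum_d \phi_n^c(d)$ and the counts $|\Omega_n(I)|$ are exactly the components of the vector $X^c(I)$ introduced in the next section, so the same quantities can be reused for parameter estimation.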
2.4 Parameter Estimation

We now turn to the problem of estimating values of $\alpha_n^c$ and $\beta_n^c$ that are optimal for classification.

Parameter Estimation - Optimization

For every class $c$ and channel $n$, define

$$X_n^c(I) = \sum_{d \in \Omega_n(I)} \phi_n^c(d), \qquad X_{N+n}^c(I) = |\Omega_n(I)|, \qquad n = 1, \dots, N. \qquad (11)$$

For every $c$, the vector $X^c(I)$ can be considered as a global descriptor of image $I$. We also denote by $\omega^c$ the $(2N)$-vector $(\alpha_1^c, \dots, \alpha_N^c, \beta_1^c, \dots, \beta_N^c)$ and by $W$ the matrix that results from the concatenation of the vectors $\omega^c$ for the different values of $c$. Using these notations, the classifier we propose can be rewritten as:

$$\hat{c}_I = \arg\min_c \; (\omega^c)^\top X^c(I), \qquad (12)$$

where $(\omega^c)^\top$ stands for the transpose of $\omega^c$. This is close in spirit to the winner-takes-all classifier widely used for multiclass classification.

Given a labeled sample $(I_i, c_i)_{i=1,\dots,K}$, independent of the sample used for computing the sets $\Omega_n^c$, we can define a constrained linear energy optimization problem that minimizes the hinge loss of a multi-channel NBNN classifier:

$$E(W) = \sum_{i=1}^{K} \max_{c:\, c \ne c_i} \Big( 1 + (\omega^{c_i})^\top X^{c_i}(I_i) - (\omega^{c})^\top X^{c}(I_i) \Big)_+, \qquad (13)$$

where $(x)_+$ stands for the positive part of a real $x$. The minimization of $E(W)$ can be recast as a linear program, since it is equivalent to minimizing $\sum_i \xi_i$ subject to the constraints:

$$\xi_i \ge 1 + (\omega^{c_i})^\top X^{c_i}(I_i) - (\omega^{c})^\top X^{c}(I_i), \qquad i = 1, \dots, K, \; c \ne c_i, \qquad (14)$$
$$\xi_i \ge 0, \qquad i = 1, \dots, K, \qquad (15)$$
$$(\omega^{c})^\top e_n \ge 0, \qquad n = 1, \dots, N, \qquad (16)$$

where $e_n$ stands for the vector of $\mathbb{R}^{2N}$ having all coordinates equal to zero, except for the $n$th coordinate, which is equal to 1. This linear program can be solved quickly, even for a relatively large number of channels and images. In practice, the number of channels should be kept small relative to the number of training samples to avoid overfitting.
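Below is a minimal sketch of how constraints (14)-(16) could be assembled for a standard LP solver. It assumes the vectors $X^c(I_i)$ of eq. (11) have been precomputed on the held-out sample; the use of scipy.optimize.linprog and all names are our choices, not the paper's.

```python
import numpy as np
from scipy.optimize import linprog

def fit_omega(X, labels, n_channels):
    """Fit the correction vectors omega^c (eqs. 13-16) by linear programming.

    X:          array of shape (K, C, 2*N); X[i, c] is the vector X^c(I_i) of eq. (11).
    labels:     length-K array of true class indices c_i.
    n_channels: N, the number of channels.
    Returns:    array W of shape (C, 2*N), one omega^c per class.
    """
    K, C, twoN = X.shape
    n_var = C * twoN + K                    # all omega^c stacked, then the slacks xi
    cost = np.zeros(n_var)
    cost[C * twoN:] = 1.0                   # minimize the sum of slacks

    A_ub, b_ub = [], []
    for i in range(K):
        ci = labels[i]
        for c in range(C):
            if c == ci:
                continue
            row = np.zeros(n_var)
            row[ci * twoN:(ci + 1) * twoN] = X[i, ci]   # +(omega^{c_i}) . X^{c_i}(I_i)
            row[c * twoN:(c + 1) * twoN] -= X[i, c]     # -(omega^c)     . X^c(I_i)
            row[C * twoN + i] = -1.0                    # -xi_i
            A_ub.append(row)
            b_ub.append(-1.0)                           # ... <= -1  (constraint 14)

    # alpha components (first N coordinates of each omega^c) are non-negative (16),
    # beta components are free, slacks are non-negative (15).
    bounds = []
    for _ in range(C):
        bounds += [(0, None)] * n_channels + [(None, None)] * n_channels
    bounds += [(0, None)] * K

    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return res.x[:C * twoN].reshape(C, twoN)
```

With the fitted $W$, classifying a new image reduces to computing its $X^c$ vectors and taking the arg min of $(\omega^c)^\top X^c(I)$, as in eq. (12).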

Classification by detection

We assume that the distribution of a feature, conditionally on the object class and on the object position $\rho$ (a rectangular image window), only depends on the point belonging or not to the object:

$$\forall\, n, d, c, \qquad -\log P(d\,|\,c, \rho) = \begin{cases} \phi_n^c(d) & \text{if } d \in \rho, \\ \bar{\phi}_n^c(d) & \text{if } d \notin \rho. \end{cases} \qquad (17)$$

In the above equation, we have written the feature-to-set distance functions $\phi_n^c$ and $\bar{\phi}_n^c$ without apparent density correction in order to alleviate the notation. We leave to the reader the task of replacing $\phi_n^c$ by $\alpha_n^c \phi_n^c + \beta_n^c$ in the equations of this section. The image log-likelihood function is now decomposed over all features inside and outside the object: $E(I, c, \rho) \equiv -\log P(I\,|\,c, \rho) = \sum_n \big( \sum_{d \in \rho} \phi_n^c(d) + \sum_{d \notin \rho} \bar{\phi}_n^c(d) \big)$. The term on the right-hand side can be rewritten:

$$E(I, c, \rho) = \sum_{n} \sum_{d \in \rho} \big( \phi_n^c(d) - \bar{\phi}_n^c(d) \big) \;+\; \sum_{n} \sum_{d} \bar{\phi}_n^c(d). \qquad (18)$$

Observing that the second sum on the right-hand side does not depend on $\rho$, we get $E(I, c, \rho) = E_1(I, c, \rho) + E_2(I, c)$, where $E_1(I, c, \rho) = \sum_n \sum_{d \in \rho} \big( \phi_n^c(d) - \bar{\phi}_n^c(d) \big)$ and $E_2(I, c) = \sum_n \sum_d \bar{\phi}_n^c(d)$. Let us define the optimal object position relative to class $c$ as the position that minimizes the first energy term: $\rho^c = \arg\min_\rho E_1(I, c, \rho)$, for all $c$. Then we can obtain the most likely image class and object position by:

$$\hat{c}_I = \arg\min_c \big( E_1(I, c, \rho^c) + E_2(I, c) \big), \qquad \hat{\rho}_I = \rho^{\hat{c}_I}. \qquad (19)$$

For any class $c$, finding the rectangular window $\rho^c$ that is the most likely candidate can be done naively by exhaustive search, but this proves prohibitively expensive. Instead, we make use of fast branch-and-bound subwindow search [2]. The method used to search for the image window that maximizes the prediction of a linear SVM can be generalized to any classifier that is linear in the image features, such as our optimal multi-channel NBNN. In short, the most likely class label and object position for a test image $I$ are found by the following algorithm:

Detection Algorithm

declare variables ĉ, ρ̂
Ê = +∞
for each class label c do
    find ρ^c by efficient branch-and-bound subwindow search:
        ρ^c = arg min_ρ E1(I, c, ρ)
    if E1(I, c, ρ^c) + E2(I, c) < Ê then
        Ê = E1(I, c, ρ^c) + E2(I, c)
        ĉ = c
        ρ̂ = ρ^c
    end if
end for
return ĉ, ρ̂
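For concreteness, here is a small Python sketch of this detection loop. It assumes the per-feature distances $\phi_n^c$ and $\bar{\phi}_n^c$ (already summed over channels and corrected by $\alpha, \beta$) are given, and a coarse exhaustive search over candidate windows stands in for the branch-and-bound subwindow search of [2]; all names are illustrative.

```python
import numpy as np

def detect(feature_xy, phi, phi_bar, window_grid):
    """Joint class/window prediction following the detection algorithm above.

    feature_xy:   (M, 2) positions of the query-image features (all channels pooled).
    phi, phi_bar: dicts class -> (M,) corrected distances phi^c(d), phi_bar^c(d),
                  already summed over channels for each feature d.
    window_grid:  iterable of candidate windows (xmin, ymin, xmax, ymax); a coarse
                  grid stands in for the branch-and-bound search of [2].
    Returns (best_class, best_window).
    """
    best = (None, None, np.inf)
    for c in phi:
        u = phi[c] - phi_bar[c]              # per-feature contribution to E1
        e2 = phi_bar[c].sum()                # E2(I, c): independent of the window
        # rho^c = argmin over candidate windows of E1(I, c, rho)
        best_e1, best_rho = np.inf, None
        for (x0, y0, x1, y1) in window_grid:
            inside = ((feature_xy[:, 0] >= x0) & (feature_xy[:, 0] <= x1) &
                      (feature_xy[:, 1] >= y0) & (feature_xy[:, 1] <= y1))
            e1 = u[inside].sum()
            if e1 < best_e1:
                best_e1, best_rho = e1, (x0, y0, x1, y1)
        if best_e1 + e2 < best[2]:
            best = (c, best_rho, best_e1 + e2)
    return best[0], best[1]
```

Because $E_1$ is a plain sum of per-feature scores over the window, each candidate window can be evaluated in constant time from 2-D prefix sums of the scores, which is also what makes the bounds used by the branch-and-bound search of [2] cheap to compute.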


4 Experiments
Our optimal NBNN classifier was tested on three datasets: Caltech-101 [15], SceneClass 13 [16] and Graz-02 [14]. In each case, the training set was divided into two equal parts for parameter selection. Classification results are expressed in percent and reflect the rate of good classification, per class or averaged over all classes.


Fast NN search - LSH

A major practical limitation of NBNN, and of our approach, is the computational time needed for nearest neighbor search, since the sets of potential nearest neighbors to explore can contain on the order of $10^5$ to $10^6$ points. We thus need an appropriate search method. However, the dimensionality of the descriptor space can also be quite large, and traditional exact search methods, such as kd-trees or vantage point trees [17], are inefficient. We chose Locality Sensitive Hashing (LSH) and addressed the thorny issue of parameter tuning by using multi-probe LSH [18] with a recall rate of 0.8. We observed that the resulting classification performance is not overly sensitive to small variations in the required recall rate; computation speed, however, is: compared to exhaustive naive search, the observed speed-up was more than ten-fold. Further improvement in execution time can be achieved using recent approximate NN-search methods [19,20].

Let us describe the databases used in our experiments.

Caltech-101 (5 classes). This dataset includes the five most populated classes of the Caltech-101 dataset: faces, airplanes, cars-side, motorbikes and background. These images present relatively little clutter and variation in object pose. Images were resized to a maximum of 300 × 300 pixels prior to processing. The training and testing sets both contain 30 randomly chosen images per class. Each experiment was repeated 20 times and we report the average results over all experiments.

SceneClass 13. Each image of this dataset belongs to one of 13 indoor and outdoor scene categories.
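Returning to the nearest-neighbor search step above: as a simple stand-in for the multi-probe LSH index used by the authors, the per-class distances $\phi^c(d)$ could be obtained from an off-the-shelf neighbor index such as scikit-learn's NearestNeighbors (exact here, so slower, but enough to reproduce the pipeline); the helper names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_indexes(class_features):
    """One NN index per class; the paper uses multi-probe LSH instead of exact search."""
    return {c: NearestNeighbors(n_neighbors=1).fit(f) for c, f in class_features.items()}

def nn_sq_distances(indexes, query_descriptors):
    """Return phi^c(d): squared Euclidean NN distance for every query descriptor."""
    phi = {}
    for c, index in indexes.items():
        dist, _ = index.kneighbors(query_descriptors, return_distance=True)
        phi[c] = dist[:, 0] ** 2
    return phi
```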

Results on 1 SIFT and 5 SIFT descriptors

4.1 Single-Channel Classification

The impact of optimal parameter selection on NBNN is measured by performing image classification with just one feature channel. We chose SIFT features [22] for their relative popularity. Results are summarized in Tables 1 and 2. In Table 1, the first two columns refer to the classification of bags of words by linear SVM and by χ²-kernel SVM. In all three experiments we selected the most efficient SVM codebook size (between 500 and 3000), and feature histograms were normalized by their L1 norm. Furthermore, only the results for the χ²-kernel SVM with the best possible value (in a finite grid) of the smoothing parameter are reported. In Table 2 we omitted the results of BoW/SVM because of their clear inferiority w.r.t. BoW/χ²-SVM.

Table 1. Performance comparison between the bag of words classified by linear and χ²-kernel SVM, the NBNN classifier and our optimal NBNN.

Datasets            BoW/SVM         BoW/χ²-SVM      NBNN [1]        Optimal NBNN
SceneClass13 [16]   67.85 ± 0.78    76.7 ± 0.60     48.52 ± 1.53    75.35 ± 0.79
Graz02 [14]         68.18 ± 4.21    77.91 ± 2.43    61.13 ± 5.61    78.98 ± 2.37
Caltech101 [15]     59.2 ± 11.89    89.13 ± 2.53    73.07 ± 4.02    89.77 ± 2.31

Table 2. Performance comparison between the bag of words classified by χ²-kernel SVM, the NBNN classifier and our optimal NBNN. Per-class results for the Caltech-101 (5 classes) dataset.

Class        BoW/χ²-SVM      NBNN [1]         Optimal NBNN
Airplanes    91.99 ± 4.87    34.17 ± 11.35    95.00 ± 3.25
Car-side     96.16 ± 3.84    97.67 ± 2.38     94.00 ± 4.29

OpponentSIFT is the best-performing of the tested descriptors, with a 91.10% good classification rate, while rgSIFT performs worst, with 85.17%. Thus, a wrong evaluation of the feature space properties undermines the descriptor performance.

Table 3. Caltech101 (5 classes): influence of various radiometry invariant features. Best and worst SIFT invariants are highlighted in blue and red, respectively.

Feature              BoW/χ²-SVM     NBNN [1]        Optimal NBNN
SIFT                 88.90 ± 2.59   73.07 ± 4.02    89.77 ± 2.31
OpponentSIFT         89.90 ± 2.18   72.73 ± 6.01    91.10 ± 2.45
rgSIFT               86.03 ± 2.63   80.17 ± 3.73    85.17 ± 4.86
cSIFT                86.13 ± 2.76   75.43 ± 3.86    86.87 ± 3.23
Transf. color SIFT   89.40 ± 2.48   73.03 ± 5.52    90.01 ± 3.03

4.3 Multi-channel Classification

The notion of channel is sufficiently versatile to be adapted to a variety of different contexts. In this experiment, we borrow the idea developed in [4] and subdivide the image into different spatial regions (Fig. 2); the resulting channels and classification rates are reported in Table 4 below.

Results on Spatial Pyramid Matching


Fig. 2. Feature channels as image subregions: 1×1, 1×2, 1×3, 1×4.

Table 4. Multi-channel classification, SceneClass13 dataset.

Channels      #channels   NBNN     Optimal NBNN
1×1           1           48.52    75.35
1×1 + 1×2     3           53.59    76.10
1×1 + 1×3     4           55.24    76.54
1×1 + 1×4     5           55.37    78.26

An image channel associated to a given image region is composed of all the features that are located inside this region. In practice, image regions are regular grids of fixed size. We conducted the experiments of Table 4 on the SceneClass13 dataset.
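For concreteness, here is a small helper showing how features could be assigned to grid-cell channels such as the 1×1, 1×2, 1×3 and 1×4 subdivisions of Fig. 2; the function and its arguments are illustrative, not from the authors' implementation.

```python
import numpy as np

def grid_channels(feature_xy, descriptors, image_w, image_h, n_rows, n_cols):
    """Split the features of one image into (n_rows x n_cols) grid-cell channels."""
    col = np.minimum((feature_xy[:, 0] / image_w * n_cols).astype(int), n_cols - 1)
    row = np.minimum((feature_xy[:, 1] / image_h * n_rows).astype(int), n_rows - 1)
    channels = {}
    for r in range(n_rows):
        for c in range(n_cols):
            channels[(n_rows, n_cols, r, c)] = descriptors[(row == r) & (col == c)]
    return channels
```

A configuration such as "1×1 + 1×3" in Table 4 is then simply the union of the channels produced by two calls of this helper (4 channels in total).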

Results on Classification by Detection


Fig. 3. Subwindow detection for NBNN (red) and optimal NBNN (green). For this experiment, all five SIFT radiometry invariants were combined (see Section 4.4).

It can be observed that the non-parametric NBNN usually converges towards an optimal object window that is too small relative to the object instance. This is due to the fact that the background class is more densely sampled; consequently, the nearest neighbor distance gives an estimate of the probability density that is too large. It was precisely to address this issue that optimal NBNN was designed.

5 Conclusion
