of languages.

This paper makes four main contributions: 1. In section 2 we introduce specialized hash functions with unbiased inner-products that are directly applicable to a large variety of kernel methods. 2. In section 3 we provide exponential tail bounds that help explain why hashed feature vectors have repeatedly led to, at times surprisingly, strong empirical results. 3. In the same section we show that the interference between independently hashed subspaces is negligible with high probability, which allows large-scale multi-task learning in a very compressed space. 4. In section 5 we introduce collaborative email-spam filtering as a novel application for hash representations and provide experimental results on large-scale real-world spam data sets.

2. Hash Functions

We introduce a variant on the hash kernel proposed by (Shi et al., 2009). This scheme is modified through the introduction of a signed sum of hashed features, whereas the original hash kernels use an unsigned sum. This modification leads to an unbiased estimate, which we demonstrate and further utilize in the following section.

Definition 1 Denote by $h$ a hash function $h : \mathbb{N} \to \{1, \ldots, m\}$. Moreover, denote by $\xi$ a hash function $\xi : \mathbb{N} \to \{\pm 1\}$. Then for vectors $x, x' \in \ell_2$ we define the hashed feature map $\phi$ and the corresponding inner product as

$$\phi_i^{(h,\xi)}(x) = \sum_{j : h(j) = i} \xi(j)\, x_j \qquad (2)$$

and

$$\langle x, x' \rangle_\phi := \left\langle \phi^{(h,\xi)}(x),\, \phi^{(h,\xi)}(x') \right\rangle. \qquad (3)$$
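To make Definition 1 concrete, here is a minimal Python sketch of the signed feature map and the induced hash kernel. The use of salted MD5 digests to realize $h$ and $\xi$ is an illustrative assumption (any sufficiently uniform hash family would do), and the token-dictionary input format is ours, not the paper's.

```python
import hashlib

def _hash(salt: str, token: str, mod: int) -> int:
    """Deterministic stand-in for the hash functions h and xi of Definition 1."""
    digest = hashlib.md5((salt + token).encode()).hexdigest()
    return int(digest, 16) % mod

def hashed_feature_map(x: dict, m: int) -> list:
    """phi^{(h,xi)}(x): map a sparse vector {token: value} into R^m.

    Coordinate i accumulates sum_{j : h(j) = i} xi(j) * x_j, where the
    sign xi(j) in {+1, -1} removes the bias of the unsigned hash kernel.
    """
    phi = [0.0] * m
    for token, value in x.items():
        i = _hash("bucket:", token, m)                    # h(j) in {0, ..., m-1}
        sign = 1.0 if _hash("sign:", token, 2) else -1.0  # xi(j) in {+1, -1}
        phi[i] += sign * value
    return phi

def hash_kernel(x: dict, xp: dict, m: int) -> float:
    """<x, x'>_phi = <phi(x), phi(x')>, both sides hashed with the same (h, xi)."""
    return sum(a * b for a, b in zip(hashed_feature_map(x, m), hashed_feature_map(xp, m)))
```

Tokens that collide in a bucket contribute cross-terms to the kernel; the random signs make those cross-terms cancel in expectation, which is exactly the unbiasedness established in Lemma 2.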
Although the hash functions in Definition 1 are defined over the natural numbers $\mathbb{N}$, in practice we often consider hash functions over arbitrary strings. These are equivalent, since each finite-length string can be represented by a unique natural number. Usually, we abbreviate the notation $\phi^{(h,\xi)}(\cdot)$ by just $\phi(\cdot)$. Two hash functions $\phi$ and $\phi'$ are different when $\phi = \phi^{(h,\xi)}$ and $\phi' = \phi^{(h',\xi')}$ such that either $h' \neq h$ or $\xi' \neq \xi$. The purpose of the binary hash $\xi$ is to remove the bias inherent in the hash kernel of (Shi et al., 2009).

In a multi-task setting, we obtain instances in combination with tasks, $(x, u) \in X \times U$. We can naturally extend our Definition 1 to hash pairs, and will write $\phi_u(x) = \phi((x, u))$.

3. Analysis

The following section is dedicated to the theoretical analysis of hash kernels and their applications. In this sense, the present paper continues where (Shi et al., 2009) falls short: we prove exponential tail bounds. These bounds hold for general hash kernels, which we later apply to show how hashing enables us to do large-scale multitask learning efficiently. We start with a simple lemma about the bias and variance of the hash kernel. The proof of this lemma appears in Appendix A.

Lemma 2 The hash kernel is unbiased, that is $\mathbf{E}_\phi[\langle x, x' \rangle_\phi] = \langle x, x' \rangle$. Moreover, the variance is

$$\sigma^2_{x,x'} = \frac{1}{m} \sum_{i \neq j} \left( x_i^2\, {x'_j}^2 + x_i\, x'_i\, x_j\, x'_j \right),$$

and thus, for $\|x\|_2 = \|x'\|_2 = 1$, $\sigma^2_{x,x'} = O\!\left(\tfrac{1}{m}\right)$.

This suggests that typical values of the hash kernel should be concentrated within $O(\tfrac{1}{\sqrt{m}})$ of the target value. We use Chebyshev's inequality to show that half of all observations are within a range of $\pm\sqrt{2}\,\sigma$. This, together with Talagrand's convex distance inequality, enables us to construct exponential tail bounds.
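Lemma 2 is straightforward to sanity-check numerically: draw many independent hash pairs $(h, \xi)$ and compare the empirical mean and variance of $\langle x, x' \rangle_\phi$ with the stated values. This simulation is our own illustration, not part of the paper's analysis; it represents $h$ and $\xi$ as freshly drawn random tables rather than fixed hash functions.

```python
import random

def signed_hash_kernel(x, xp, m, rng):
    """One draw of <x, x'>_phi with random tables h: j -> {0..m-1}, xi: j -> {+1,-1}."""
    d = len(x)
    h = [rng.randrange(m) for _ in range(d)]
    xi = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    phi_x, phi_xp = [0.0] * m, [0.0] * m
    for j in range(d):
        phi_x[h[j]] += xi[j] * x[j]
        phi_xp[h[j]] += xi[j] * xp[j]
    return sum(a * b for a, b in zip(phi_x, phi_xp))

def estimate_bias_and_variance(x, xp, m, trials=20000, seed=0):
    """Empirical mean and variance of the hash kernel over many (h, xi) draws."""
    rng = random.Random(seed)
    samples = [signed_hash_kernel(x, xp, m, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    return mean, var
```

For the unit vectors $x = (0.6, 0.8, 0)$ and $x' = (0.8, 0.6, 0)$ the empirical mean stays near $\langle x, x' \rangle = 0.96$ for every $m$, while the variance shrinks roughly like $1/m$, as the lemma predicts.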
Feature Hashing for Large Scale Multitask Learning
3.1. Concentration of Measure Bounds

In this subsection we show that under a hashed feature-map the length of each vector is preserved with high probability. Talagrand's inequality (Ledoux, 2001) is a key tool for the proof of the following theorem (detailed in Appendix B).

Theorem 3 Let $\epsilon < 1$ be a fixed constant and $x$ be a given instance. Let $\eta = \|x\|_\infty / \|x\|_2$. Under the assumptions above, the hash kernel satisfies the following inequality:

$$\Pr\left\{ \frac{\big|\, \|x\|_\phi^2 - \|x\|_2^2 \,\big|}{\|x\|_2^2} \ge \sqrt{2}\,\sigma_{x,x}\,\epsilon + \epsilon^2 \right\} \le 2 \exp\!\left( -\frac{\epsilon^2}{4\eta} \right).$$

Note that an analogous result would also hold for the original hash kernel of (Shi et al., 2009), the only modification being the associated bias terms. The above result can also be utilized to show a concentration bound on the inner product between two general vectors $x$ and $x'$.

Corollary 4 For two vectors $x$ and $x'$, let us define

$$\sigma := \max\left( \sigma_{x,x},\, \sigma_{x',x'},\, \sigma_{x-x',x-x'} \right), \qquad \eta := \min\left( \frac{\|x\|_\infty}{\|x\|_2},\, \frac{\|x'\|_\infty}{\|x'\|_2},\, \frac{\|x - x'\|_\infty}{\|x - x'\|_2} \right).$$

Also let $\Delta = \|x\|_2^2 + \|x'\|_2^2 + \|x - x'\|_2^2$. Under the assumptions above, we have that

$$\Pr\left\{ \big|\, \langle x, x' \rangle_\phi - \langle x, x' \rangle \,\big| > \epsilon\left( \sqrt{2}\,\sigma + \epsilon \right) \Delta / 2 \right\} < 3\, e^{-\frac{\epsilon^2}{4\eta}}.$$

The proof for this corollary can be found in Appendix C. We can also extend the bound in Theorem 3 to the maximal canonical distortion over large sets of distances between vectors as follows:

Corollary 5 Denote by $X = \{x_1, \ldots, x_n\}$ a set of vectors which satisfy $\|x_i - x_j\|_\infty \le \eta\, \|x_i - x_j\|_2$ for all pairs $i, j$. In this case with probability $1 - \delta$ we have for all $i, j$:

$$\frac{\big|\, \|x_i - x_j\|_\phi^2 - \|x_i - x_j\|_2^2 \,\big|}{\|x_i - x_j\|_2^2} \le \sqrt{\frac{2 + 64\,\eta^2 \log^2 (n/\delta)}{m}}.$$

This means that the number of observations $n$ (or correspondingly the size of the un-hashed kernel matrix) only enters logarithmically in the analysis.
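The role of $\eta = \|x\|_\infty / \|x\|_2$ in these tail bounds can be seen in simulation: a vector whose mass is spread over many coordinates (small $\eta$) essentially never suffers a large length distortion, while a spiky vector (large $\eta$) does so whenever its few heavy coordinates collide. The sketch below draws $h$ and $\xi$ as fresh random tables for each trial; the dimensions and thresholds are our own choices for illustration.

```python
import math
import random

def hashed_norm_sq(x, m, rng):
    """One draw of ||phi(x)||_2^2 under a random hash pair (h, xi)."""
    phi = [0.0] * m
    for v in x:
        phi[rng.randrange(m)] += rng.choice((-1.0, 1.0)) * v
    return sum(p * p for p in phi)

def tail_frequency(x, m, eps=0.9, trials=4000, seed=1):
    """Fraction of draws with relative distortion | ||phi(x)||^2 - ||x||^2 | / ||x||^2 > eps."""
    rng = random.Random(seed)
    n2 = sum(v * v for v in x)
    hits = sum(1 for _ in range(trials) if abs(hashed_norm_sq(x, m, rng) - n2) / n2 > eps)
    return hits / trials

d, m = 512, 64
spread = [1.0 / math.sqrt(d)] * d                   # eta = 1/sqrt(512): mass spread out
spiky = [1.0 / math.sqrt(2), 1.0 / math.sqrt(2)]    # eta = 1/sqrt(2): two heavy coordinates
```

Here `tail_frequency(spread, m)` comes out essentially zero, whereas `tail_frequency(spiky, m)` is on the order of the collision probability $1/m$: whenever the two heavy coordinates share a bucket, the squared norm becomes 0 or 2 instead of 1.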
Storing all predictors $w_i$ requires $O(d\,(|U| + 1))$ memory. In a task like collaborative spam-filtering, $|U|$, the number of users, can be in the hundreds of thousands, and the size of the vocabulary is usually in the order of millions. The naive way of dealing with this is to eliminate all infrequent tokens. However, spammers target this memory-vulnerability by maliciously misspelling words and thereby creating highly infrequent but spam-typical tokens that fall under the radar of conventional classifiers. Instead, if all words are hashed into a finite-sized feature vector, infrequent but class-indicative tokens get a chance to contribute to the classification outcome. Further, large-scale spam-filters (e.g. Yahoo Mail™ or GMail™) typically have severe memory and time constraints, since they have to handle billions of emails per day. To guarantee a finite-size memory footprint we hash all weight vectors $w_0, \ldots, w_{|U|}$ into a joint, significantly smaller feature space $\mathbb{R}^m$ with different hash functions $\phi_0, \ldots, \phi_{|U|}$. The resulting hashed-weight vector $w_h \in \mathbb{R}^m$ can then be written as:

$$w_h = \phi_0(w_0) + \sum_{u \in U} \phi_u(w_u). \qquad (5)$$

Note that in practice the weight vector $w_h$ can be learned directly in the hashed space; the un-hashed weight vectors never need to be computed. Given a new document/email $x$ of user $u \in U$, the prediction task now consists of calculating $\langle \phi_0(x) + \phi_u(x),\, w_h \rangle$. Due to hashing we have two sources of error: the distortion $\epsilon_d$ of the hashed inner products and the interference with other hashed weight vectors. In the hash-feature space we want to ensure that there is little interaction between the different parameter vectors. Let $U$ be a set of different tasks, $u \in U$ being a specific one. Let $w$ be a combination of the parameter vectors of the tasks in $U \setminus \{u\}$. We show that for any observation $x$ for task $u$, the interaction of $w$ with $x$ in the hashed feature space is minimal. For each $x$, let the image of $x$ under the hash feature-map for task $u$ be denoted as $\phi_u(x) = \phi^{(\xi,h)}((x, u))$.

Theorem 7 Let $w \in \mathbb{R}^m$ be a parameter vector for the tasks in $U \setminus \{u\}$. In this case the value of the inner product $\langle w, \phi_u(x) \rangle$ is bounded by

$$\Pr\big\{ \big| \langle w, \phi_u(x) \rangle \big| > \epsilon \big\} \le 2 \exp\!\left( - \frac{\epsilon^2 / 2}{\;\|w\|_2^2 \|x\|_2^2 / m + \epsilon\, \|w\|_\infty \|x\|_\infty / 3\;} \right).$$

To show that $\epsilon_d$ is small with high probability, we apply Corollary 4 once for each possible value of $v$.

In section 5 we show experimental results for this setting. The empirical results are stronger than the theoretical bounds derived in this subsection: our technique outperforms a single global classifier on hundreds of thousands of users. In the same section we provide an intuitive explanation for these strong results.

Massively Multiclass Estimation We can also regard massively multi-class classification as a multitask problem, and apply feature hashing in a way similar to the personalization setting. Instead of using a different hash function for each user, we use a different hash function for each class.

(Shi et al., 2009) apply feature hashing to problems with a high number of categories. They show empirically that joint hashing of the feature vector $\phi(x, y)$ can be efficiently achieved for problems with millions of features and thousands of classes.

Collaborative Filtering Assume that we are given a very large sparse matrix $M$ where the entry $M_{ij}$ indicates what action user $i$ took on instance $j$. A common example for actions and instances is user-ratings of movies (Bennett & Lanning, 2007). A successful method for finding common factors amongst users and instances for predicting unobserved actions is to factorize $M$ into $M = U^\top W$.
[Figure: a text document (email) is reduced to a bag of words; for personalization, each token (e.g. "Votre", "Apotheke") is additionally duplicated with a user-specific tag (e.g. "USER123_Votre"), and the result is hashed into a compact sparse vector $\phi_0(x) + \phi_u(x)$.]
User clustering One hypothesis for the strong results in Figure 2 might originate from the non-uniform distribution of user votes: it is possible that by using personalization and feature hashing we benefit a small number of users who have labeled many emails, degrading the performance of most users (who have labeled few or no emails) in the process. In fact, in real life, a large fraction of email users do not contribute at all to the training corpus and only interact with the classifier during test time. The personalized version of the test email, $\phi_u(x_u)$, is then hashed into buckets of other tokens and only adds interference noise $\epsilon_i$ to the classification.

To show that we improve the performance of most users, it is therefore important that we not only report averaged results over all emails, but explicitly examine the effects of the personalized classifier for users depending on their contribution to the training set. To this end, we place users into exponentially growing buckets based on their number of training emails and compute the relative reduction of uncaught spam for each bucket individually. Figure 3 shows the results on a per-bucket basis. We do not compare against a purely local approach, with no global component, since for a large fraction of users (those without training data) this approach cannot outperform random guessing.

[Figure 3. Results for users clustered by training emails. For example, the bucket [8, 15] consists of all users with eight to fifteen training emails. Although users in buckets with large amounts of training data do benefit more from the personalized classifier (up to 65% reduction in spam), even users that did not contribute to the training corpus at all obtain almost 20% spam-reduction.]

It might appear surprising that users in the bucket with no or very few training emails (the line of bucket [0] is identical to bucket [1]) also benefit from personalization. After all, their personalized classifier was never trained and can only add noise at test-time. The classifier improvement of this bucket can be explained by the subjective definition of spam and not-spam. In the personalized setting, the individual component of user labeling is absorbed by the local classifiers, and the global classifier represents the common definition of spam and not-spam. In other words, the global part of the personalized classifier obtains better generalization properties, benefiting all users.

6. Related Work

A number of researchers have tackled related, albeit different, problems.

(Rahimi & Recht, 2008) use Bochner's theorem and sampling to obtain approximate inner products for Radial Basis Function kernels. (Rahimi & Recht, 2009) extend this to sparse approximation of weighted combinations of basis functions. This is computationally efficient for many function spaces. Note that the representation is dense.

(Li et al., 2007) take a complementary approach: for sparse feature vectors $\phi(x)$, they devise a scheme of reducing the number of nonzero terms even further. While this is in principle desirable, it does not resolve the problem of $\phi(x)$ being high dimensional. More succinctly, it is necessary to express the function in the dual representation rather than expressing $f$ as a linear function, where $w$ is unlikely to be compactly represented: $f(x) = \langle \phi(x), w \rangle$.

(Achlioptas, 2003) provides computationally efficient randomization schemes for dimensionality reduction. Instead of performing a dense $d \cdot m$-dimensional matrix-vector multiplication to reduce the dimensionality of a vector from $d$ to $m$, as is required by the algorithm of (Gionis et al., 1999), he only requires 1/3 of that computation by designing a matrix consisting only of entries in $\{-1, 0, 1\}$.

(Shi et al., 2009) propose a hash kernel to deal with the issue of computational efficiency by a very simple algorithm: high-dimensional vectors are compressed by adding up all coordinates which have the same hash value; one only needs to perform as many calculations as there are nonzero terms in the vector. This is a significant computational saving over locality sensitive hashing (Achlioptas, 2003; Gionis et al., 1999).

Several additional works provide motivation for the investigation of hashing representations. For example, (Ganchev & Dredze, 2008) provide empirical evidence that the hashing-trick can be used to effectively reduce the memory footprint on many sparse learning problems by an order of magnitude via removal of the dictionary. Our experimental results validate this, and show that much more radical compression levels are achievable. In addition, (Langford et al., 2007) released the Vowpal Wabbit fast online learning software, which uses a hash representation similar to that discussed here.

7. Conclusion

In this paper we analyze the hashing-trick for dimensionality reduction theoretically and empirically. As part of our theoretical analysis we introduce unbiased hash functions and provide exponential tail bounds for hash kernels. These give further insight into hash-spaces and explain previously made empirical observations. We also derive that random subspaces of the hashed space are likely not to interact, which makes multitask learning with many tasks possible. Our empirical results validate this on a real-world application within the context of spam filtering. Here we demonstrate that even with a very large number of tasks and features, all mapped into a joint lower-dimensional hash-space, one can obtain impressive classification results with a finite memory guarantee.
References

Daume, H. (2007). Frustratingly easy domain adaptation. Annual Meeting of the Association for Computational Linguistics (p. 256).

Ganchev, K., & Dredze, M. (2008). Small statistical models by random feature mixing. Workshop on Mobile Language Processing, Annual Meeting of the Association for Computational Linguistics.

Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. Proceedings of the 25th VLDB Conference (pp. 518-529). Edinburgh, Scotland: Morgan Kaufmann.

Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit online learning project (Technical Report). http://hunch.net/?p=309.

Ledoux, M. (2001). The concentration of measure phenomenon.

Rahimi, A., & Recht, B. (2009). Randomized kitchen sinks. In L. Bottou, Y. Bengio, D. Schuurmans and D. Koller (Eds.), Advances in Neural Information Processing Systems 21. Cambridge, MA: MIT Press.

Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels. Proc. Intl. Workshop on Artificial Intelligence and Statistics 12.

A. Mean and Variance

Proof [Lemma 2] To compute the expectation we expand

$$\langle x, x' \rangle_\phi = \sum_{i,j} \xi(i)\,\xi(j)\, x_i\, x'_j\, \delta_{h(i),h(j)}.$$

Since $\mathbf{E}_\xi[\xi(i)\,\xi(j)] = \delta_{ij}$, taking the expectation yields $\mathbf{E}_\phi[\langle x, x' \rangle_\phi] = \sum_i x_i x'_i = \langle x, x' \rangle$, i.e. the estimate is unbiased. For the second moment, the expansion can be simplified by noting that:

$$\mathbf{E}_\xi\big[\xi(i)\,\xi(j)\,\xi(k)\,\xi(l)\big] = \delta_{ij}\,\delta_{kl} + [1 - \delta_{ijkl}]\left( \delta_{ik}\,\delta_{jl} + \delta_{il}\,\delta_{jk} \right).$$

Passing the expectation over $\xi$ through the sum, this allows us to break down the expansion of the variance into two terms:

$$\mathbf{E}_\phi\big[\langle x, x' \rangle_\phi^2\big] = \sum_{i,k} x_i x'_i x_k x'_k + \sum_{i \neq j} x_i^2\, {x'_j}^2\, \mathbf{E}_h\big[\delta_{h(i),h(j)}\big] + \sum_{i \neq j} x_i x'_i x_j x'_j\, \mathbf{E}_h\big[\delta_{h(i),h(j)}\big]$$
$$= \langle x, x' \rangle^2 + \frac{1}{m} \sum_{i \neq j} x_i^2\, {x'_j}^2 + \frac{1}{m} \sum_{i \neq j} x_i x'_i x_j x'_j,$$

by noting that $\mathbf{E}_h[\delta_{h(i),h(j)}] = \frac{1}{m}$ for $i \neq j$. Using the fact that $\sigma^2_{x,x'} = \mathbf{E}_\phi[\langle x, x' \rangle_\phi^2] - \mathbf{E}_\phi[\langle x, x' \rangle_\phi]^2$ proves the claim.
Next denote by $d(\xi, A)$ the distance between a hash function $\xi$ and a set $A$ of hash functions, that is $d(\xi, A) = \inf_{\xi' \in A} d(\xi, \xi')$. In this case Talagrand's convex distance inequality (Ledoux, 2001) holds. If $\Pr(A)$ denotes the total probability mass of the set $A$, then

$$\Pr\{ d(\xi, A) \ge s \} \le [\Pr(A)]^{-1}\, e^{-s^2/4}. \qquad (10)$$

Proof [Theorem 3] Without loss of generality assume that $\|x\|_2 = 1$; we can then easily generalize to the general case. From Lemma 2 it follows that the variance of $\|x\|_\phi^2$ is given by $\sigma^2_{x,x} = \frac{2}{m}\big[ 1 - \|x\|_4^4 \big]$ and $\mathbf{E}(\|x\|_\phi^2) = 1$.

Chebyshev's inequality states that $\Pr\big( |X - \mathbf{E}(X)| \ge \sqrt{2}\,\sigma \big) \le \frac{1}{2}$. We can therefore denote

$$A := \big\{ \xi \;:\; \big| \|x\|_\phi^2 - 1 \big| \le \sqrt{2}\,\sigma_{x,x} \big\}$$

and obtain $\Pr(A) \ge \frac{1}{2}$. From Talagrand's inequality (10) we know that $\Pr(\{\xi : d(\xi, A) \ge s\}) \le 2 e^{-s^2/4}$. Now assume that we have a pair of hash functions $\xi$ and $\xi'$, with $\xi' \in A$. Let us define the difference of their hashed inner products as $\Delta := \|x\|_\phi^2 - \langle x, x \rangle_{\phi'}$. By the triangle inequality, and because $\xi' \in A$, we can state that

$$\big| \|x\|_\phi^2 - 1 \big| \le \big| \|x\|_\phi^2 - \langle x, x \rangle_{\phi'} \big| + \big| \langle x, x \rangle_{\phi'} - 1 \big| \le |\Delta| + \sqrt{2}\,\sigma. \qquad (11)$$

Let us now denote the coordinate-wise difference between the hashed features as $v_i := \phi'_i(x) - \phi_i(x)$. With this definition, we can express $\Delta$ in terms of $v$:

$$\Delta = \sum_i \big[ \phi_i(x)^2 - \phi'_i(x)^2 \big] = -2\,\langle \phi'(x), v \rangle + \|v\|_2^2.$$

By applying the Cauchy-Schwartz inequality to the inner product $\langle \phi'(x), v \rangle$, we obtain

$$|\Delta| \le 2\, \|\phi'(x)\|_2\, \|v\|_2 + \|v\|_2^2, \qquad \text{with} \quad \|v\|_2 \le 4\, \|x\|_\infty\, d^2(\xi, \xi'). \qquad (12)$$

(The last inequality holds because in the worst case all mass is concentrated in a single entry of $v_i$.) As a next step we will express $\|\phi'(x)\|_2$ in terms of $\sigma_{x,x}$. Because $\xi' \in A$, we obtain that

$$\|\phi'(x)\|_2 = \sqrt{\langle x, x \rangle_{\phi'}} \le \big( 1 + \sqrt{2}\,\sigma_{x,x} \big)^{1/2} \le 1 + \sigma_{x,x}/\sqrt{2}.$$

To simplify our notation, let us define $\gamma = 1 + \sigma_{x,x}/\sqrt{2}$. Plugging our upper bounds for $\|v\|_2$ and $\|\phi'(x)\|_2$ into (12) leads to

$$\|x\|_\phi^2 - 1 \le 8\, \|x\|_\infty\, d^2(\xi, \xi') \big( \gamma + 2\, \|x\|_\infty\, d^2(\xi, \xi') \big) + \sqrt{2}\,\sigma.$$

As we have not specified our particular choice of $\xi'$, we can now choose it to be the closest element to $\xi$ within $A$, i.e. such that $d(\xi, \xi') = d(\xi, A)$. By Talagrand's inequality, we know that with probability at least $1 - 2e^{-s^2/4}$ we obtain $d(\xi, A) \le s$, and therefore with high probability:

$$\|x\|_\phi^2 - 1 \le 8\gamma\, \|x\|_\infty\, s^2 + 16\, \|x\|_\infty^2\, s^4 + \sqrt{2}\,\sigma.$$

A change of variables, $s^2 = \epsilon\big( \sqrt{2}\,\sigma + \epsilon \big)/4\|x\|_\infty$, gives us that $\|x\|_\phi^2 - 1 \le \sqrt{2}\,\sigma\,\epsilon + \epsilon^2$ w.p. $1 - 2e^{-s^2/4}$. Noting that $s^2 = \epsilon\big( \sqrt{2}\,\sigma + \epsilon \big)/4\|x\|_\infty \ge \epsilon^2/4\|x\|_\infty$ lets us obtain our final result

$$\|x\|_\phi^2 - 1 \le \sqrt{2}\,\sigma\,\epsilon + \epsilon^2 \qquad \text{w.p.} \quad 1 - 2e^{-\epsilon^2/4\|x\|_\infty}.$$

Finally, for a general $x$, we can derive the above result for $y = \frac{x}{\|x\|_2}$. Replacing $\|y\|_\infty = \frac{\|x\|_\infty}{\|x\|_2}$ we get the following version for a general $x$:

$$\Pr\left\{ \frac{\big| \|x\|_\phi^2 - \|x\|_2^2 \big|}{\|x\|_2^2} \ge \sqrt{2}\,\sigma_{x,x}\,\epsilon + \epsilon^2 \right\} \le 2 \exp\!\left( -\frac{\epsilon^2}{4\eta} \right), \qquad \text{where } \eta = \frac{\|x\|_\infty}{\|x\|_2}.$$