
Feature Hashing for Large Scale Multitask Learning

Kilian Weinberger KILIAN@YAHOO-INC.COM
Anirban Dasgupta ANIRBAN@YAHOO-INC.COM
John Langford JL@HUNCH.NET
Alex Smola ALEX@SMOLA.ORG
Josh Attenberg JOSH@CIS.POLY.EDU
Yahoo! Research, 2821 Mission College Blvd., Santa Clara, CA 95051 USA

Abstract

Empirical evidence suggests that hashing is an effective strategy for dimensionality reduction and practical nonparametric estimation. In this paper we provide exponential tail bounds for feature hashing and show that the interaction between random subspaces is negligible with high probability. We demonstrate the feasibility of this approach with experimental results for a new use case: multitask learning with hundreds of thousands of tasks.

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

1. Introduction

Kernel methods use inner products as the basic tool for comparisons between objects. That is, given objects $x_1, \dots, x_n \in X$ for some domain $X$, they rely on

$$k(x_i, x_j) := \langle \phi(x_i), \phi(x_j)\rangle \qquad (1)$$

to compare the features $\phi(x_i)$ of $x_i$ and $\phi(x_j)$ of $x_j$ respectively.

Eq. (1) is often famously referred to as the kernel-trick. It allows the use of inner products between very high dimensional feature vectors $\phi(x_i)$ and $\phi(x_j)$ implicitly through the definition of a positive semi-definite kernel matrix $k$, without ever having to compute a vector $\phi(x_i)$ directly. This can be particularly powerful in classification settings where the original input representation has a non-linear decision boundary. Often, linear separability can be achieved in a high dimensional feature space $\phi(x_i)$.

In practice, for example in text classification, researchers frequently encounter the opposite problem: the original input space is almost linearly separable (often because of the existence of handcrafted non-linear features), yet the training set may be prohibitively large in size and very high dimensional. In such a case, there is no need to map the input vectors into a higher dimensional feature space. Instead, limited memory makes storing a kernel matrix infeasible.

For this common scenario several authors have recently proposed an alternative, but highly complementary, variation of the kernel-trick, which we refer to as the hashing-trick: one hashes the high dimensional input vectors $x$ into a lower dimensional feature space $\mathbb{R}^m$ with $\phi : X \to \mathbb{R}^m$ (Langford et al., 2007; Shi et al., 2009). The parameter vector of a classifier can therefore live in $\mathbb{R}^m$ instead of in the original input space $\mathbb{R}^d$ (or in $\mathbb{R}^n$ in the case of kernel matrices), where $m \ll n$ and $m \ll d$. Different from random projections, the hashing-trick preserves sparsity and introduces no additional overhead to store projection matrices.

To our knowledge, we are the first to provide exponential tail bounds on the canonical distortion of these hashed inner products. We also show that the hashing-trick can be particularly powerful in multi-task learning scenarios where the original feature spaces are the cross-product of the data, $X$, and the set of tasks, $U$. We show that one can use different hash functions for each task, $\phi_1, \dots, \phi_{|U|}$, to map the data into one joint space with little interference. Sharing amongst the different tasks is achieved with an additional hash function $\phi_0$ that also maps into the same joint space. The hash function $\phi_0$ is shared amongst all $|U|$ tasks and allows us to learn their common components.

While many potential applications exist for the hashing-trick, as a particular case study we focus on collaborative email spam filtering. In this scenario, hundreds of thousands of users collectively label emails as spam or not-spam, and each user expects a personalized classifier that reflects their particular preferences. Here, the set of tasks, $U$, is the set of email users (this can be very large for open systems such as Yahoo Mail™ or Gmail™), and the feature space spans the union of vocabularies in multitudes of languages.

This paper makes four main contributions: 1. In Section 2 we introduce specialized hash functions with unbiased inner products that are directly applicable to a large variety of kernel methods. 2. In Section 3 we provide exponential tail bounds that help explain why hashed feature vectors have repeatedly led to, at times surprisingly, strong empirical results. 3. In the same section we show that the interference between independently hashed subspaces is negligible with high probability, which allows large-scale multi-task learning in a very compressed space. 4. In Section 5 we introduce collaborative email-spam filtering as a novel application for hash representations and provide experimental results on large-scale real-world spam data sets.

2. Hash Functions

We introduce a variant on the hash kernel proposed by (Shi et al., 2009). This scheme is modified through the introduction of a signed sum of hashed features, whereas the original hash kernels use an unsigned sum. This modification leads to an unbiased estimate, which we demonstrate and further utilize in the following section.

Definition 1 Denote by $h$ a hash function $h : \mathbb{N} \to \{1, \dots, m\}$. Moreover, denote by $\xi$ a hash function $\xi : \mathbb{N} \to \{\pm 1\}$. Then for vectors $x, x' \in \ell_2$ we define the hashed feature map $\phi^{(h,\xi)}$ and the corresponding inner product as

$$\phi_i^{(h,\xi)}(x) = \sum_{j:\,h(j)=i} \xi(j)\, x_j \qquad (2)$$

$$\text{and}\qquad \langle x, x'\rangle_\phi := \left\langle \phi^{(h,\xi)}(x),\, \phi^{(h,\xi)}(x')\right\rangle. \qquad (3)$$

Although the hash functions in Definition 1 are defined over the natural numbers $\mathbb{N}$, in practice we often consider hash functions over arbitrary strings. These are equivalent, since each finite-length string can be represented by a unique natural number. Usually, we abbreviate the notation $\phi^{(h,\xi)}(\cdot)$ by just $\phi(\cdot)$. Two hash functions $\phi$ and $\phi'$ are different when $\phi = \phi^{(h,\xi)}$ and $\phi' = \phi^{(h',\xi')}$ such that either $h' \neq h$ or $\xi \neq \xi'$. The purpose of the binary hash $\xi$ is to remove the bias inherent in the hash kernel of (Shi et al., 2009).

In a multi-task setting, we obtain instances in combination with tasks, $(x, u) \in X \times U$. We can naturally extend our Definition 1 to hash pairs, and will write $\phi_u(x) = \phi((x, u))$.
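For concreteness, the following sketch (our own illustration rather than the implementation used in the paper; the function names and the choice of MD5 as the underlying string hash are assumptions) implements the signed feature map of Eq. (2) for sparse, string-indexed inputs, together with the hashed inner product of Eq. (3):

```python
import hashlib
import numpy as np

def _h(token, m, salt=""):
    """Index hash h: strings -> {0, ..., m-1} (MD5 is used purely for illustration)."""
    digest = hashlib.md5((salt + "|h|" + token).encode()).hexdigest()
    return int(digest, 16) % m

def _xi(token, salt=""):
    """Sign hash xi: strings -> {-1, +1}."""
    digest = hashlib.md5((salt + "|xi|" + token).encode()).hexdigest()
    return 1.0 if int(digest, 16) % 2 == 0 else -1.0

def hashed_feature_map(x, m, salt=""):
    """phi^{(h,xi)}(x) for a sparse input x given as {token: value}, cf. Eq. (2)."""
    phi = np.zeros(m)
    for token, value in x.items():
        phi[_h(token, m, salt)] += _xi(token, salt) * value
    return phi

def hashed_inner_product(x, x_prime, m, salt=""):
    """<x, x'>_phi as defined in Eq. (3)."""
    return float(hashed_feature_map(x, m, salt) @ hashed_feature_map(x_prime, m, salt))

# Two small bag-of-words vectors hashed into m = 1024 buckets.
x  = {"free": 2.0, "viagra": 1.0, "meeting": 1.0}
xp = {"free": 1.0, "meeting": 3.0}
print(hashed_inner_product(x, xp, m=1024))   # close to the exact inner product 5.0
```

Because tokens are mapped straight to indices, no dictionary has to be stored; this is the property the rest of the paper exploits. Different values of the hypothetical `salt` parameter yield the "different" hash functions $\phi_0, \phi_1, \dots$ used below for multitask learning.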
3. Analysis

The following section is dedicated to the theoretical analysis of hash kernels and their applications. In this sense, the present paper continues where (Shi et al., 2009) falls short: we prove exponential tail bounds. These bounds hold for general hash kernels, which we later apply to show how hashing enables us to do large-scale multitask learning efficiently. We start with a simple lemma about the bias and variance of the hash kernel. The proof of this lemma appears in Appendix A.

Lemma 2 The hash kernel is unbiased, that is $E_\phi[\langle x, x'\rangle_\phi] = \langle x, x'\rangle$. Moreover, the variance is $\sigma^2_{x,x'} = \frac{1}{m}\sum_{i\neq j}\left(x_i^2 x_j'^2 + x_i x_i' x_j x_j'\right)$, and thus, for $\|x\|_2 = \|x'\|_2 = 1$, $\sigma^2_{x,x'} = O\!\left(\frac{1}{m}\right)$.

This suggests that typical values of the hash kernel should be concentrated within $O(\frac{1}{\sqrt{m}})$ of the target value. We use Chebyshev's inequality to show that half of all observations are within a range of $\sqrt{2}\,\sigma$. This, together with Talagrand's convex distance inequality, enables us to construct exponential tail bounds.
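As an informal numerical companion to Lemma 2 (our own check, not part of the paper), one can simulate many independent draws of $(h, \xi)$ with random index and sign tables and verify that the hashed inner product is centered on $\langle x, x'\rangle$ with variance on the order of $1/m$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 100, 32, 20000

x  = rng.normal(size=d); x  /= np.linalg.norm(x)
xp = rng.normal(size=d); xp /= np.linalg.norm(xp)

estimates = np.empty(trials)
for t in range(trials):
    h  = rng.integers(0, m, size=d)        # random index hash h(j)
    xi = rng.choice([-1.0, 1.0], size=d)   # random sign hash xi(j)
    phi_x  = np.bincount(h, weights=xi * x,  minlength=m)
    phi_xp = np.bincount(h, weights=xi * xp, minlength=m)
    estimates[t] = phi_x @ phi_xp

print("true inner product:", x @ xp)
print("mean of estimates :", estimates.mean())   # matches the true value (unbiasedness)
print("empirical variance:", estimates.var())    # on the order of 1/m, cf. Lemma 2
```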
3.1. Concentration of Measure Bounds

In this subsection we show that under a hashed feature map the length of each vector is preserved with high probability. Talagrand's inequality (Ledoux, 2001) is a key tool for the proof of the following theorem (detailed in Appendix B).

Theorem 3 Let $\epsilon < 1$ be a fixed constant and $x$ be a given instance. Let $\eta = \frac{\|x\|_\infty}{\|x\|_2}$. Under the assumptions above, the hash kernel satisfies the following inequality:

$$\Pr\left\{\frac{\big|\,\|x\|_\phi^2 - \|x\|_2^2\,\big|}{\|x\|_2^2} \ge \sqrt{2}\,\sigma_{x,x} + \epsilon\right\} \le 2\exp\!\left(-\frac{\epsilon}{4\eta}\right).$$

Note that an analogous result would also hold for the original hash kernel of (Shi et al., 2009), the only modification being the associated bias terms. The above result can also be utilized to show a concentration bound on the inner product between two general vectors $x$ and $x'$.

Corollary 4 For two vectors $x$ and $x'$, let us define

$$\sigma := \max\left(\sigma_{x,x},\, \sigma_{x',x'},\, \sigma_{x-x',x-x'}\right)$$
$$\eta := \min\left(\frac{\|x\|_\infty}{\|x\|_2},\, \frac{\|x'\|_\infty}{\|x'\|_2},\, \frac{\|x-x'\|_\infty}{\|x-x'\|_2}\right).$$

Also let $\Delta = \|x\|^2 + \|x'\|^2 + \|x-x'\|^2$. Under the assumptions above, we have that

$$\Pr\Big[\,\big|\langle x, x'\rangle_\phi - \langle x, x'\rangle\big| > (\sqrt{2}\,\sigma + \epsilon)\,\Delta/2\,\Big] < 3\,e^{-\frac{\epsilon}{4\eta}}.$$

The proof for this corollary can be found in Appendix C.

We can also extend the bound in Theorem 3 to the maximal canonical distortion over large sets of distances between vectors, as follows:

Corollary 5 Denote by $X = \{x_1, \dots, x_n\}$ a set of vectors which satisfy $\|x_i - x_j\|_\infty \le \eta\,\|x_i - x_j\|_2$ for all pairs $i, j$. In this case, with probability $1 - \delta$, we have for all $i, j$:

$$\frac{\big|\,\|x_i - x_j\|_\phi^2 - \|x_i - x_j\|_2^2\,\big|}{\|x_i - x_j\|_2^2} \;\le\; \sqrt{\frac{2 + 64\,\eta^2 \log^2\frac{n}{\delta}}{m}}\,.$$

This means that the number of observations $n$ (or correspondingly the size of the un-hashed kernel matrix) only enters logarithmically in the analysis.

Proof We apply the bound of Theorem 3 to each distance individually. Note the bound $\sigma^2_{x,x} \le \frac{2}{m}$ for all normalized vectors. Also, since we have $\frac{n(n-1)}{2}$ pairs of distances, the union bound yields a corresponding factor. Solving $\frac{n(n-1)}{2}\,e^{-\frac{\epsilon}{4\eta}} \le \delta$ for $\epsilon$ and easy inequalities proves the claim.
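The following is our own empirical counterpart to Theorem 3 and Corollary 5 (illustrative only; it does not verify the exact constants): hash a small set of unit vectors once and inspect the relative distortion of all pairwise squared distances.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 20, 5000, 1024

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

h  = rng.integers(0, m, size=d)          # one draw of the hash pair (h, xi)
xi = rng.choice([-1.0, 1.0], size=d)
Phi = np.stack([np.bincount(h, weights=xi * row, minlength=m) for row in X])

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        true   = np.sum((X[i] - X[j]) ** 2)
        hashed = np.sum((Phi[i] - Phi[j]) ** 2)
        worst  = max(worst, abs(hashed - true) / true)

print(worst)   # maximal relative distortion over all n(n-1)/2 pairs stays modest
```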
3.2. Multiple Hashing

Note that the tightness of the union bound in Corollary 5 depends crucially on the magnitude of $\eta$. In other words, for large values of $\eta$, that is, whenever some terms in $x$ are very large, even a single collision can already lead to significant distortions of the embedding. This issue can be amended by trading off sparsity with variance. A vector of unit length may be written as $(1, 0, 0, 0, \dots)$, or as $\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0, \dots\right)$, or more generally as a vector with $c$ nonzero terms of magnitude $c^{-\frac{1}{2}}$. This is relevant, for instance, whenever the magnitudes of $x$ follow a known pattern, e.g. when representing documents as bags of words, since we may simply hash frequent words several times. The following lemma gives an intuition as to how the confidence bounds scale in terms of the replications:

Lemma 6 If we let $x' = \frac{1}{\sqrt{c}}(x, \dots, x)$ then:

1. It is norm preserving: $\|x\|_2 = \|x'\|_2$.

2. It reduces component magnitude by a factor $\frac{1}{\sqrt{c}}$: $\|x'\|_\infty = \frac{\|x\|_\infty}{\sqrt{c}}$.

3. The variance increases to $\sigma^2_{x',x'} = \frac{1}{c}\,\sigma^2_{x,x} + \frac{c-1}{c^2}\,\|x\|_2^4$.

Applying Lemma 6 to Theorem 3, a large magnitude can be decreased at the cost of an increased variance.
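The replication idea can be realized by hashing each coordinate $c$ times under distinct copy tags and scaling by $1/\sqrt{c}$. The sketch below is our own illustration (the copy-tagging convention and the MD5 hash are assumptions); it shows how a single dominant coordinate gets spread over several buckets while the overall norm stays roughly unchanged, in line with Lemma 6.

```python
import hashlib
import numpy as np

def signed_bucket(token, m):
    """Return (bucket, sign) for a string token; MD5 is used here for illustration."""
    v = int(hashlib.md5(token.encode()).hexdigest(), 16)
    return v % m, (1.0 if (v >> 64) % 2 == 0 else -1.0)

def hashed_map_replicated(x, m, c):
    """Hash each feature c times under copy tags, scaled by 1/sqrt(c), cf. Lemma 6."""
    phi = np.zeros(m)
    for token, value in x.items():
        for copy in range(c):
            b, s = signed_bucket(f"{token}#copy{copy}", m)
            phi[b] += s * value / np.sqrt(c)
    return phi

x = {"the": 10.0, "hashing": 1.0, "trick": 1.0}      # one dominant coordinate
for c in (1, 4, 16):
    phi = hashed_map_replicated(x, m=64, c=c)
    print(c, round(np.linalg.norm(phi), 2), round(np.abs(phi).max(), 2))
    # the norm stays close to ||x||_2, while the largest bucket shrinks roughly as 1/sqrt(c)
```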
3.3. Approximate Orthogonality

For multitask learning, we must learn a different parameter vector for each related task. When mapped into the same hash-feature space, we want to ensure that there is little interaction between the different parameter vectors. Let $U$ be a set of different tasks, $u \in U$ being a specific one. Let $w$ be a combination of the parameter vectors of the tasks in $U \setminus \{u\}$. We show that for any observation $x$ for task $u$, the interaction of $w$ with $x$ in the hashed feature space is minimal. For each $x$, let the image of $x$ under the hash feature map for task $u$ be denoted as $\phi_u(x) = \phi^{(\xi,h)}((x, u))$.

Theorem 7 Let $w \in \mathbb{R}^m$ be a parameter vector for tasks in $U \setminus \{u\}$. In this case the value of the inner product $\langle w, \phi_u(x)\rangle$ is bounded by

$$\Pr\left\{\,\big|\langle w, \phi_u(x)\rangle\big| > \epsilon\,\right\} \le 2\exp\!\left(-\frac{\epsilon^2/2}{\frac{1}{m}\|w\|_2^2\,\|x\|_2^2 + \frac{\epsilon}{3}\,\|w\|_\infty\|x\|_\infty}\right).$$

Proof We use Bernstein's inequality (Bernstein, 1946), which states that for independent random variables $X_j$ with $E[X_j] = 0$, if $C > 0$ is such that $|X_j| \le C$, then

$$\Pr\left(\sum_{j=1}^n X_j > t\right) \le \exp\!\left(-\frac{t^2/2}{\sum_{j=1}^n E\!\left[X_j^2\right] + Ct/3}\right). \qquad (4)$$

We have to compute the concentration property of $\langle w, \phi_u(x)\rangle = \sum_j x_j\,\xi(j)\,w_{h(j)}$. Let $X_j = x_j\,\xi(j)\,w_{h(j)}$. By the definition of $h$ and $\xi$, the $X_j$ are independent. Also, for each $j$, since $w$ depends only on the hash functions for $U \setminus \{u\}$, $w_{h(j)}$ is independent of $\xi(j)$. Thus, $E[X_j] = E_{(\xi,h)}\!\left[x_j\,\xi(j)\,w_{h(j)}\right] = 0$. For each $j$ we also have $|X_j| < \|x\|_\infty\,\|w\|_\infty =: C$. Finally, $\sum_j E[X_j^2]$ is given by

$$\sum_j E\!\left[(x_j\,\xi(j)\,w_{h(j)})^2\right] = \frac{1}{m}\sum_{j,\ell} x_j^2\, w_\ell^2 = \frac{1}{m}\,\|x\|_2^2\,\|w\|_2^2.$$

The claim follows by plugging both terms and $C$ into Bernstein's inequality (4).

Theorem 7 bounds the influence of unrelated tasks on any particular instance. In Section 5 we demonstrate the real-world applicability with empirical results on a large-scale multi-task learning problem.
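A small numerical illustration of Theorem 7 (our own; the randomly seeded tables below merely simulate task-specific hash functions $\phi_v$): a parameter vector accumulated from many other tasks has only a small inner product with $\phi_u(x)$ for an unrelated task $u$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, num_tasks = 1000, 4096, 50

def phi_task(x, task):
    """Task-specific signed hashing phi_task(x), simulated with a seeded random table."""
    local = np.random.default_rng([task, 20090601])
    h  = local.integers(0, m, size=x.shape[0])
    xi = local.choice([-1.0, 1.0], size=x.shape[0])
    return np.bincount(h, weights=xi * x, minlength=m)

# w accumulates the hashed weight vectors of tasks 1, ..., num_tasks (all unrelated to u).
w = np.zeros(m)
for task in range(1, num_tasks + 1):
    w += phi_task(rng.normal(size=d) / np.sqrt(d), task)

u = num_tasks + 1                       # a task whose hash functions are independent of w
x_u = rng.normal(size=d)
x_u /= np.linalg.norm(x_u)
phi_u_x = phi_task(x_u, u)

print(abs(w @ phi_u_x))                               # small interference, cf. Theorem 7
print(np.linalg.norm(w) * np.linalg.norm(phi_u_x))    # Cauchy-Schwarz scale, for comparison
```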

4. Applications

The advantage of feature hashing is that it allows for significant storage compression of parameter vectors: storing $w$ in the raw feature space naively requires $O(d)$ numbers when $w \in \mathbb{R}^d$. By hashing, we are able to reduce this to $O(m)$ numbers while avoiding the costly matrix-vector multiplications common in Locality Sensitive Hashing. In addition, the sparsity of the resulting vector is preserved.

The benefits of the hashing-trick lead to applications in almost all areas of machine learning and beyond. In particular, feature hashing is extremely useful whenever large numbers of parameters with redundancies need to be stored within a bounded memory capacity.

Personalization  (Daume, 2007) introduced a very simple but strikingly effective method for multitask learning. Each task updates its own very specific (local) weights and a set of common (global) weights that are shared amongst all tasks. Theorem 7 allows us to hash all these multiple classifiers into one feature space with little interaction. To illustrate, we explore this setting in the context of spam-classifier personalization.

Suppose we have thousands of users $U$ and want to perform related but not identical classification tasks for each of them. Users provide labeled data by marking emails as spam or not-spam. Ideally, for each user $u \in U$, we want to learn a predictor $w_u$ based on the data of that user solely. However, webmail users are notoriously lazy in labeling emails, and even those that do not contribute to the training data expect a working spam filter. Therefore, we also need to learn an additional global predictor $w_0$ to allow data sharing amongst all users.

Storing all predictors $w_i$ requires $O(d \times (|U| + 1))$ memory. In a task like collaborative spam-filtering, $|U|$, the number of users, can be in the hundreds of thousands and the size of the vocabulary is usually in the order of millions. The naive way of dealing with this is to eliminate all infrequent tokens. However, spammers target this memory-vulnerability by maliciously misspelling words and thereby creating highly infrequent but spam-typical tokens that fall under the radar of conventional classifiers. Instead, if all words are hashed into a finite-sized feature vector, infrequent but class-indicative tokens get a chance to contribute to the classification outcome. Further, large scale spam-filters (e.g. Yahoo Mail™ or Gmail™) typically have severe memory and time constraints, since they have to handle billions of emails per day. To guarantee a finite-size memory footprint we hash all weight vectors $w_0, \dots, w_{|U|}$ into a joint, significantly smaller, feature space $\mathbb{R}^m$ with different hash functions $\phi_0, \dots, \phi_{|U|}$. The resulting hashed-weight vector $w_h \in \mathbb{R}^m$ can then be written as:

$$w_h = \phi_0(w_0) + \sum_{u \in U} \phi_u(w_u). \qquad (5)$$

Note that in practice the weight vector $w_h$ can be learned directly in the hashed space; the un-hashed weight vectors never need to be computed. Given a new document/email $x$ of user $u \in U$, the prediction task now consists of calculating $\langle \phi_0(x) + \phi_u(x), w_h\rangle$. Due to hashing we have two sources of error: the distortion $\epsilon_d$ of the hashed inner products and the interference $\epsilon_i$ with other hashed weight vectors. More precisely:

$$\langle \phi_0(x) + \phi_u(x),\, w_h\rangle = \langle x,\, w_0 + w_u\rangle + \epsilon_d + \epsilon_i. \qquad (6)$$

The interference error consists of all collisions of $\phi_0(x)$ or $\phi_u(x)$ with the hash functions of other users,

$$\epsilon_i = \sum_{v \in U,\, v \neq 0} \langle \phi_0(x),\, \phi_v(w_v)\rangle + \sum_{v \in U,\, v \neq u} \langle \phi_u(x),\, \phi_v(w_v)\rangle. \qquad (7)$$

To show that $\epsilon_i$ is small with high probability we can apply Theorem 7 twice, once for each term of (7). We consider each user's classification to be a separate task, and since $\sum_{v \in U, v \neq 0} \phi_v(w_v)$ is independent of the hash function $\phi_0$, the conditions of Theorem 7 apply with $w = \sum_{v \neq 0} \phi_v(w_v)$, and we can employ it to bound the first term, $\sum_{v \in U, v \neq 0} \langle \phi_0(x), \phi_v(w_v)\rangle$. The second application is identical except that all subscripts $0$ are substituted with $u$. For lack of space we do not derive the exact bounds.

The distortion error occurs because each hash function that is utilized by user $u$ can self-collide:

$$\epsilon_d = \sum_{v \in \{u, 0\}} \big|\, \langle \phi_v(x),\, \phi_v(w_v)\rangle - \langle x,\, w_v\rangle \,\big|. \qquad (8)$$

To show that $\epsilon_d$ is small with high probability, we apply Corollary 4 once for each possible value of $v$.

In Section 5 we show experimental results for this setting. The empirical results are stronger than the theoretical bounds derived in this subsection: our technique outperforms a single global classifier on hundreds of thousands of users. In the same section we provide an intuitive explanation for these strong results.
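A minimal sketch of this construction (our own code, not the production system; the token-tagging convention, the MD5 hash, the bucket count and the learning rate are assumptions): every token is hashed once globally and once prefixed with the user id, and a single weight vector $w_h \in \mathbb{R}^m$ is trained and evaluated directly in the hashed space, as in Eqs. (5) and (6).

```python
import hashlib
import numpy as np

M = 2 ** 18   # number of hash buckets (the paper's experiments go up to 2**26)

def signed_bucket(token):
    v = int(hashlib.md5(token.encode()).hexdigest(), 16)
    return v % M, (1.0 if (v >> 64) % 2 == 0 else -1.0)

def personalized_features(tokens, user):
    """phi_0(x) + phi_u(x): each token contributes a global copy and a user-tagged copy."""
    phi = np.zeros(M)
    for t in tokens:
        for tagged in (t, f"{user}_{t}"):
            b, s = signed_bucket(tagged)
            phi[b] += s
    return phi

w_h = np.zeros(M)   # joint hashed weight vector, cf. Eq. (5)

def sgd_update(tokens, user, label, lr=0.1):
    """One stochastic gradient step on the squared loss, taken directly in hashed space."""
    phi = personalized_features(tokens, user)
    err = float(phi @ w_h) - label        # label: +1 for spam, -1 for not-spam
    w_h[:] -= lr * err * phi

sgd_update(["votre", "apotheke"], user="USER123", label=+1)
print(personalized_features(["votre", "apotheke"], "USER123") @ w_h)   # prediction score
```

Note that the un-hashed weight vectors $w_0, \dots, w_{|U|}$ never appear in this sketch; only $w_h$ is stored.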
Massively Multiclass Estimation  We can also regard massively multi-class classification as a multitask problem, and apply feature hashing in a way similar to the personalization setting. Instead of using a different hash function for each user, we use a different hash function for each class.

(Shi et al., 2009) apply feature hashing to problems with a high number of categories. They show empirically that joint hashing of the feature vector $\phi(x, y)$ can be efficiently achieved for problems with millions of features and thousands of classes.

Collaborative Filtering  Assume that we are given a very large sparse matrix $M$ where the entry $M_{ij}$ indicates what action user $i$ took on instance $j$. A common example for actions and instances is user ratings of movies (Bennett & Lanning, 2007). A successful method for finding common factors amongst users and instances for predicting unobserved actions is to factorize $M$ into $M = U^\top W$.


Figure 1. The hashed personalization summarized in a schematic layout. Each token is duplicated and one copy is individualized (e.g. by concatenating each word with a unique user identifier). Then, the global hash function maps all tokens into a low dimensional feature space where the document is classified.

Figure 2. The decrease of uncaught spam over the baseline classifier averaged over all users. The classification threshold was chosen to keep the not-spam misclassification fixed at 1%. The hashed global classifier (global-hashed) converges relatively soon, showing that the distortion error $\epsilon_d$ vanishes. The personalized classifier results in an average improvement of up to 30%.

If we have millions of users performing millions of actions, storing $U$ and $W$ in memory quickly becomes infeasible. Instead, we may choose to compress the matrices $U$ and $W$ using hashing. For $U, W \in \mathbb{R}^{n \times d}$ denote by $u, w \in \mathbb{R}^m$ vectors with

$$u_i = \sum_{j,k:\, h(j,k)=i} \xi(j,k)\, U_{jk} \qquad \text{and} \qquad w_i = \sum_{j,k:\, h'(j,k)=i} \xi'(j,k)\, W_{jk},$$

where $(h, \xi)$ and $(h', \xi')$ are independently chosen hash functions. This allows us to approximate the matrix elements $M_{ij} = [U^\top W]_{ij}$ via

$$\tilde{M}_{ij} := \sum_k \xi(k, i)\, \xi'(k, j)\, u_{h(k,i)}\, w_{h'(k,j)}.$$

This gives a compressed vector representation of $M$ that can be efficiently stored.
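Our own sketch of this compression (the matrix shapes, the bucket count and the random tables simulating $(h, \xi)$ and $(h', \xi')$ are assumptions): $U$ and $W$ are folded into hashed vectors $u$ and $w$, and entries of $M = U^\top W$ are then recovered, noisily but without storing $U$ and $W$, from the compressed representation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, m = 50, 500, 2 ** 16        # d latent factors, n users/items, m hash buckets

U = rng.normal(size=(d, n)) / np.sqrt(d)
W = rng.normal(size=(d, n)) / np.sqrt(d)

# Independent hash pairs (h, xi) and (h', xi') over index pairs, simulated as fixed tables.
h1 = rng.integers(0, m, size=(d, n)); s1 = rng.choice([-1.0, 1.0], size=(d, n))
h2 = rng.integers(0, m, size=(d, n)); s2 = rng.choice([-1.0, 1.0], size=(d, n))

u = np.bincount(h1.ravel(), weights=(s1 * U).ravel(), minlength=m)
w = np.bincount(h2.ravel(), weights=(s2 * W).ravel(), minlength=m)

def approx_entry(i, j):
    """Approximate M_ij = [U^T W]_ij from the compressed vectors u and w."""
    return float(np.sum(s1[:, i] * s2[:, j] * u[h1[:, i]] * w[h2[:, j]]))

pairs  = rng.integers(0, n, size=(200, 2))
exact  = np.array([U[:, i] @ W[:, j] for i, j in pairs])
approx = np.array([approx_entry(i, j) for i, j in pairs])
print(np.corrcoef(exact, approx)[0, 1])   # clearly positive: a noisy but usable reconstruction
```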
5. Results

We evaluated our algorithm in the setting of personalization. As data set, we used a proprietary email spam-classification task of $n = 3.2$ million emails, properly anonymized, collected from $|U| = 433167$ users. Each email is labeled as spam or not-spam by one user in $U$. After tokenization, the data set consists of 40 million unique words.

For all experiments in this paper, we used the Vowpal Wabbit implementation¹ of stochastic gradient descent on a square-loss. In the mail-spam literature the misclassification of not-spam is considered to be much more harmful than the misclassification of spam. We therefore follow the convention of setting the classification threshold during test time such that exactly 1% of the not-spam test data is classified as spam. Our implementation of the personalized hash functions is illustrated in Figure 1. To obtain a personalized hash function $\phi_u$ for user $u$, we concatenate a unique user-id to each word in the email and then hash the newly generated tokens with the same global hash function.
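The operating point above is defined by the not-spam false-positive rate rather than by overall accuracy. A short sketch of how such a threshold can be picked from held-out scores (our own illustration; the score distributions below are synthetic):

```python
import numpy as np

def threshold_at_fp_rate(scores_not_spam, fp_rate=0.01):
    """Choose the decision threshold so that fp_rate of not-spam scores exceed it."""
    return float(np.quantile(scores_not_spam, 1.0 - fp_rate))

rng = np.random.default_rng(4)
scores_not_spam = rng.normal(-1.0, 1.0, size=100_000)   # classifier scores on not-spam mail
scores_spam     = rng.normal(+1.0, 1.0, size=100_000)   # classifier scores on spam mail

tau = threshold_at_fp_rate(scores_not_spam, 0.01)
print("threshold:", tau)
print("uncaught spam rate:", float(np.mean(scores_spam <= tau)))   # uncaught spam at a 1% false-positive rate
```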
The data set was collected over a span of 14 days. We used the first 10 days for training and the remaining 4 days for testing. As baseline, we chose the purely global classifier trained over all users and hashed into a $2^{26}$-dimensional space. As $2^{26}$ far exceeds the total number of unique words, we can regard the baseline as representative of classification without hashing. All results are reported as the amount of spam that passed the filter undetected, relative to this baseline (e.g. a value of 0.80 indicates a 20% reduction in spam for the user).²

Figure 2 displays the average amount of spam in users' inboxes as a function of the number of hash keys $m$, relative to the baseline above. In addition to the baseline, we evaluate two different settings.

The global-hashed curve represents the relative spam catch-rate of the global classifier after hashing, $\langle \phi_0(w_0), \phi_0(x)\rangle$. At $m = 2^{26}$ this is identical to the baseline. Early convergence at $m = 2^{22}$ suggests that at this point hash collisions have no impact on the classification error and the baseline is indeed equivalent to that obtainable without hashing.

In the personalized setting each user $u \in U$ gets her own classifier $\phi_u(w_u)$ as well as the global classifier $\phi_0(w_0)$. Without hashing the feature space explodes, as the cross product of 400K users and 40M tokens results in 16 trillion possible unique personalized features. Figure 2 shows that despite aggressive hashing, personalization results in a 30% spam reduction once the hash table is indexed by 22 bits.

¹ http://hunch.net/vw/
² As part of our data sharing agreement, we agreed not to include absolute classification error-rates.

Figure 3. Results for users clustered by training emails. For example, the bucket [8, 15] consists of all users with eight to fifteen training emails. Although users in buckets with large amounts of training data do benefit more from the personalized classifier (up to 65% reduction in spam), even users that did not contribute to the training corpus at all obtain almost 20% spam reduction.

User clustering  One hypothesis for the strong results in Figure 2 might originate from the non-uniform distribution of user votes: it is possible that by using personalization and feature hashing we benefit a small number of users who have labeled many emails, degrading the performance of most users (who have labeled few or no emails) in the process. In fact, in real life, a large fraction of email users do not contribute at all to the training corpus and only interact with the classifier during test time. The personalized version of the test email $\phi_u(x_u)$ is then hashed into buckets of other tokens and only adds interference noise $\epsilon_i$ to the classification.

To show that we improve the performance of most users, it is therefore important that we not only report averaged results over all emails, but explicitly examine the effects of the personalized classifier for users depending on their contribution to the training set. To this end, we place users into exponentially growing buckets based on their number of training emails and compute the relative reduction of uncaught spam for each bucket individually. Figure 3 shows the results on a per-bucket basis. We do not compare against a purely local approach, with no global component, since for a large fraction of users (those without training data) this approach cannot outperform random guessing.

It might appear surprising that users in the buckets with no or very few training emails (the line of bucket [0] is identical to bucket [1]) also benefit from personalization. After all, their personalized classifier was never trained and can only add noise at test time. The classifier improvement of this bucket can be explained by the subjective definition of spam and not-spam. In the personalized setting the individual component of user labeling is absorbed by the local classifiers, and the global classifier represents the common definition of spam and not-spam. In other words, the global part of the personalized classifier obtains better generalization properties, benefiting all users.

6. Related Work

A number of researchers have tackled related, albeit different, problems.

(Rahimi & Recht, 2008) use Bochner's theorem and sampling to obtain approximate inner products for Radial Basis Function kernels. (Rahimi & Recht, 2009) extend this to sparse approximation of weighted combinations of basis functions. This is computationally efficient for many function spaces. Note that the representation is dense.

(Li et al., 2007) take a complementary approach: for sparse feature vectors, $\phi(x)$, they devise a scheme of reducing the number of nonzero terms even further. While this is in principle desirable, it does not resolve the problem of $\phi(x)$ being high dimensional. More succinctly, it is necessary to express the function in the dual representation rather than expressing $f$ as a linear function, where $w$ is unlikely to be compactly represented: $f(x) = \langle \phi(x), w\rangle$.

(Achlioptas, 2003) provides computationally efficient randomization schemes for dimensionality reduction. Instead of performing a dense $d \times m$ matrix-vector multiplication to reduce a vector of dimensionality $d$ to one of dimensionality $m$, as is required by the algorithm of (Gionis et al., 1999), he only requires 1/3 of that computation by designing a matrix consisting only of entries in $\{-1, 0, 1\}$.

(Shi et al., 2009) propose a hash kernel to deal with the issue of computational efficiency by a very simple algorithm: high-dimensional vectors are compressed by adding up all coordinates which have the same hash value; one only needs to perform as many calculations as there are nonzero terms in the vector. This is a significant computational saving over locality sensitive hashing (Achlioptas, 2003; Gionis et al., 1999).

Several additional works provide motivation for the investigation of hashing representations. For example, (Ganchev & Dredze, 2008) provide empirical evidence that the hashing-trick can be used to effectively reduce the memory footprint on many sparse learning problems by an order of magnitude via removal of the dictionary. Our experimental results validate this, and show that much more radical compression levels are achievable. In addition, (Langford et al., 2007) released the Vowpal Wabbit fast online learning software which uses a hash representation similar to that discussed here.

7. Conclusion

In this paper we analyze the hashing-trick for dimensionality reduction theoretically and empirically. As part of our theoretical analysis we introduce unbiased hash functions and provide exponential tail bounds for hash kernels.

These give further insight into hash-spaces and explain previously made empirical observations. We also derive that random subspaces of the hashed space are likely to not interact, which makes multitask learning with many tasks possible. Our empirical results validate this on a real-world application within the context of spam filtering. Here we demonstrate that even with a very large number of tasks and features, all mapped into a joint lower dimensional hash-space, one can obtain impressive classification results with a finite memory guarantee.

References

Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66, 671-687.

Bennett, J., & Lanning, S. (2007). The Netflix Prize. Proceedings of the KDD Cup and Workshop 2007.

Bernstein, S. (1946). The theory of probabilities. Moscow: Gastehizdat Publishing House.

Daume, H. (2007). Frustratingly easy domain adaptation. Annual Meeting of the Association for Computational Linguistics (p. 256).

Ganchev, K., & Dredze, M. (2008). Small statistical models by random feature mixing. Workshop on Mobile Language Processing, Annual Meeting of the Association for Computational Linguistics.

Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. Proceedings of the 25th VLDB Conference (pp. 518-529). Edinburgh, Scotland: Morgan Kaufmann.

Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit online learning project (Technical Report). http://hunch.net/?p=309.

Ledoux, M. (2001). The concentration of measure phenomenon. Providence, RI: AMS.

Li, P., Church, K., & Hastie, T. (2007). Conditional random sampling: A sketch-based sampling technique for sparse data. In B. Scholkopf, J. Platt and T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19, 873-880. Cambridge, MA: MIT Press.

Rahimi, A., & Recht, B. (2008). Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20. Cambridge, MA: MIT Press.

Rahimi, A., & Recht, B. (2009). Randomized kitchen sinks. In L. Bottou, Y. Bengio, D. Schuurmans and D. Koller (Eds.), Advances in Neural Information Processing Systems 21. Cambridge, MA: MIT Press.

Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels. Proc. Intl. Workshop on Artificial Intelligence and Statistics 12.

A. Mean and Variance

Proof [Lemma 2] To compute the expectation we expand

$$\langle x, x'\rangle_\phi = \sum_{i,j} \xi(i)\,\xi(j)\, x_i\, x'_j\, \delta_{h(i),h(j)}. \qquad (9)$$

Since $E_\phi[\langle x, x'\rangle_\phi] = E_h[E_\xi[\langle x, x'\rangle_\phi]]$, taking expectations over $\xi$ we see that only the terms $i = j$ have nonzero value, which shows the first claim. For the variance we compute $E_\phi[\langle x, x'\rangle_\phi^2]$. Expanding this, we get:

$$\langle x, x'\rangle_\phi^2 = \sum_{i,j,k,l} \xi(i)\,\xi(j)\,\xi(k)\,\xi(l)\, x_i x'_j x_k x'_l\, \delta_{h(i),h(j)}\,\delta_{h(k),h(l)}.$$

This expression can be simplified by noting that

$$E_\xi[\xi(i)\,\xi(j)\,\xi(k)\,\xi(l)] = \delta_{ij}\,\delta_{kl} + [1-\delta_{ijkl}]\left(\delta_{ik}\,\delta_{jl} + \delta_{il}\,\delta_{jk}\right).$$

Passing the expectation over $\xi$ through the sum allows us to break down the expansion of the variance into two terms:

$$E_\phi[\langle x, x'\rangle_\phi^2] = \sum_{i,k} x_i x'_i x_k x'_k + \sum_{i\neq j} x_i^2 x_j'^2\, E_h\!\left[\delta_{h(i),h(j)}\right] + \sum_{i\neq j} x_i x'_i x_j x'_j\, E_h\!\left[\delta_{h(i),h(j)}\right]$$
$$= \langle x, x'\rangle^2 + \frac{1}{m}\sum_{i\neq j} x_i^2 x_j'^2 + \frac{1}{m}\sum_{i\neq j} x_i x'_i x_j x'_j,$$

by noting that $E_h[\delta_{h(i),h(j)}] = \frac{1}{m}$ for $i \neq j$. Using the fact that $\sigma^2 = E_\phi[\langle x, x'\rangle_\phi^2] - E_\phi[\langle x, x'\rangle_\phi]^2$ proves the claim.

B. Concentration of Measure

Our proof uses Talagrand's convex distance inequality. We first define a weighted Hamming distance function between two hash functions $\phi$ and $\phi'$ as follows:

$$d(\phi, \phi') = \sup_{\|\alpha\|_2 \le 1} \sum_i \alpha_i\, I\big(h(i) \neq h'(i) \text{ or } \xi(i) \neq \xi'(i)\big) = \sqrt{\big|\{i : h(i) \neq h'(i) \text{ or } \xi(i) \neq \xi'(i)\}\big|}\,.$$

Next denote by $d(\phi, A)$ the distance between a hash function $\phi$ and a set $A$ of hash functions, that is $d(\phi, A) = \inf_{\phi' \in A} d(\phi, \phi')$. In this case Talagrand's convex distance inequality (Ledoux, 2001) holds. If $\Pr(A)$ denotes the total probability mass of the set $A$, then

$$\Pr\{d(\phi, A) \ge s\}\cdot\Pr(A) \le e^{-s^2/4}. \qquad (10)$$

Proof [Theorem 3] Without loss of generality assume that $\|x\|_2 = 1$; we can then easily generalize to the general-$x$ case. From Lemma 2 it follows that the variance of $\|x\|_\phi^2$ is given by $\sigma_{x,x}^2 = \frac{2}{m}\left[1 - \|x\|_4^4\right]$ and $E(\|x\|_\phi^2) = 1$.

Chebyshev's inequality states that $P(|X - E(X)| \ge \sqrt{2}\,\sigma) \le \frac{1}{2}$. We can therefore define

$$A := \left\{\phi \;\text{ where }\; \big|\,\|x\|_\phi^2 - 1\,\big| \le \sqrt{2}\,\sigma_{x,x}\right\}$$

and obtain $\Pr(A) \ge \frac{1}{2}$. From Talagrand's inequality (10) we know that $\Pr(\{\phi : d(\phi, A) \ge s\}) \le 2e^{-s^2/4}$. Now assume that we have a pair of hash functions $\phi$ and $\phi'$, with $\phi' \in A$. Let us define the difference of their hashed inner products as $\Delta := \|x\|_\phi^2 - \langle x, x\rangle_{\phi'}$. By the triangle inequality and because $\phi' \in A$, we can state that

$$\big|\,\|x\|_\phi^2 - 1\,\big| \le \big|\,\|x\|_\phi^2 - \langle x, x\rangle_{\phi'}\big| + \big|\langle x, x\rangle_{\phi'} - 1\big| \le |\Delta| + \sqrt{2}\,\sigma. \qquad (11)$$

Let us now denote the coordinate-wise difference between the hashed features as $v_i := \phi'_i(x) - \phi_i(x)$. With this definition, we can express $\Delta$ in terms of $v$: $\Delta = \sum_i \phi_i(x)^2 - \phi'_i(x)^2 = -2\langle \phi'(x), v\rangle + \|v\|_2^2$. By applying the Cauchy-Schwarz inequality to the inner product $\langle \phi'(x), v\rangle$, we obtain $|\Delta| \le 2\|\phi'(x)\|_2\|v\|_2 + \|v\|_2^2$. Plugging this into (11) leads us to

$$\big|\,\|x\|_\phi^2 - 1\,\big| \le 2\|\phi'(x)\|_2\|v\|_2 + \|v\|_2^2 + \sqrt{2}\,\sigma_{x,x}. \qquad (12)$$

Next, we bound $\|v\|_2$ in terms of $d(\phi, \phi')$. To do this, expand $v_i = \sum_j x_j\left(\xi'(j)\,\delta_{h'(j),i} - \xi(j)\,\delta_{h(j),i}\right)$. As $\xi(j) \in \{+1, -1\}$, we know that $|\xi(j) - \xi'(j)| \le 2$. Further, $x_j \le \|x\|_\infty$ and we can write

$$|v_i| \le 2\,\|x\|_\infty \sum_j \left(\delta_{h(j),i} + \delta_{h'(j),i}\right). \qquad (13)$$

We can now make two observations: first, note that $\sum_i\sum_j\left(\delta_{h(j),i} + \delta_{h'(j),i}\right)$ is at most $2t$, where $t = |\{j : h(j) \neq h'(j)\}|$. Second, from the definition of the distance function, we get that $d(\phi, \phi') \ge \sqrt{t}$. Putting these together,

$$\sum_i |v_i| \le 4\,\|x\|_\infty\, t \le 4\,\|x\|_\infty\, d^2(\phi, \phi')$$
$$\|v\|_2^2 = \sum_i |v_i|^2 \le 16\,\|x\|_\infty^2\, d^4(\phi, \phi').$$

(The last inequality holds because in the worst case all mass is concentrated in a single entry of $v_i$.) As a next step we will express $\|\phi'(x)\|_2$ in terms of $\sigma_{x,x}$. Because $\phi' \in A$, we obtain that

$$\|\phi'(x)\|_2 = \sqrt{\langle x, x\rangle_{\phi'}} \le \left(1 + \sqrt{2}\,\sigma_{x,x}\right)^{1/2} \le 1 + \sigma_{x,x}/\sqrt{2}.$$

To simplify our notation, let us define $\gamma = 1 + \sigma_{x,x}/\sqrt{2}$. Plugging our upper bounds for $\|v\|_2$ and $\|\phi'(x)\|_2$ into (12) leads to

$$\big|\,\|x\|_\phi^2 - 1\,\big| \le 8\,\|x\|_\infty\, d^2(\phi, \phi')\left(\gamma + 2\,\|x\|_\infty\, d^2(\phi, \phi')\right) + \sqrt{2}\,\sigma.$$

As we have not specified our particular choice of $\phi'$, we can now choose it to be the closest vector to $\phi$ within $A$, i.e. such that $d(\phi, \phi') = d(\phi, A)$. By Talagrand's inequality, we know that with probability at least $1 - 2e^{-s^2/4}$ we obtain $d(\phi, A) \le s$ and therefore, with high probability,

$$\big|\,\|x\|_\phi^2 - 1\,\big| \le 8\,\|x\|_\infty\, s^2\,\gamma + 16\,\|x\|_\infty^2\, s^4 + \sqrt{2}\,\sigma.$$

A change of variables, $s^2 = (\sqrt{2}\,\sigma + \epsilon)/(4\|x\|_\infty)$, gives us that $\big|\,\|x\|_\phi^2 - 1\,\big| \le \sqrt{2}\,\sigma + \epsilon$ with probability $1 - 2e^{-s^2/4}$. Noting that $s^2 = (\sqrt{2}\,\sigma + \epsilon)/(4\|x\|_\infty) \ge \epsilon/(4\|x\|_\infty)$ lets us obtain our final result:

$$\big|\,\|x\|_\phi^2 - 1\,\big| \le \sqrt{2}\,\sigma + \epsilon \qquad \text{w.p. } 1 - 2e^{-\epsilon/(4\|x\|_\infty)}.$$

Finally, for a general $x$, we can derive the above result for $y = x/\|x\|_2$. Replacing $\|y\|_\infty = \|x\|_\infty/\|x\|_2$ we get the following version for a general $x$:

$$\Pr\left\{\frac{\big|\,\|x\|_\phi^2 - \|x\|_2^2\,\big|}{\|x\|_2^2} \ge \sqrt{2}\,\sigma_{x,x} + \epsilon\right\} \le 2\exp\!\left(-\frac{\epsilon\,\|x\|_2}{4\,\|x\|_\infty}\right).$$

C. Inner Product

Proof [Corollary 4] We have that $2\langle x, x'\rangle = \|x\|^2 + \|x'\|^2 - \|x - x'\|^2$; the same identity holds for the hashed inner product. Thus,

$$\big|2\langle \phi(x), \phi(x')\rangle - 2\langle x, x'\rangle\big| \le \big|\,\|\phi(x)\|^2 - \|x\|^2\,\big| + \big|\,\|\phi(x')\|^2 - \|x'\|^2\,\big| + \big|\,\|\phi(x - x')\|^2 - \|x - x'\|^2\,\big|.$$

Each of the terms above is bounded using Theorem 3. Thus, using the union bound and putting the bounds together, we have that, with probability $1 - 3\exp\!\left(-\frac{\epsilon}{4\eta}\right)$,

$$\big|2\langle \phi(x), \phi(x')\rangle - 2\langle x, x'\rangle\big| \le (\sqrt{2}\,\sigma + \epsilon)\left(\|x\|^2 + \|x'\|^2 + \|x - x'\|^2\right).$$

Dividing both sides by two proves the claim.