
Who Supported Obama in 2012?

Ecological Inference through Distribution Regression

Seth R. Flaxman
Machine Learning Department and H. J. Heinz III College
Carnegie Mellon University
sflaxman@cs.cmu.edu

Yu-Xiang Wang
Machine Learning Department
Carnegie Mellon University
yuxiangw@cs.cmu.edu

Alexander J. Smola
Machine Learning Department
Carnegie Mellon University and Marianas Labs
alex@smola.org

ABSTRACT

We present a new solution to the ecological inference problem, of learning individual-level associations from aggregate data. This problem has a long history and has attracted much attention, debate, claims that it is unsolvable, and purported solutions. Unlike other ecological inference techniques, our method makes use of unlabeled individual-level data by embedding the distribution over these predictors into a vector in Hilbert space. Our approach relies on recent learning theory results for distribution regression, using kernel embeddings of distributions. Our novel approach to distribution regression exploits the connection between Gaussian process regression and kernel ridge regression, giving us a coherent, Bayesian approach to learning and inference and a convenient way to include prior information in the form of a spatial covariance function. Our approach is highly scalable as it relies on FastFood, a randomized explicit feature representation for kernel embeddings. We apply our approach to the challenging political science problem of modeling the voting behavior of demographic groups based on aggregate voting data. We consider the 2012 US Presidential election, and ask: what was the probability that members of various demographic groups supported Barack Obama, and how did this vary spatially across the country? Our results match standard survey-based exit polling data for the small number of states for which it is available, and serve to fill in the large gaps in this data, at a much higher degree of granularity.

Keywords

Machine learning; supervised learning; kernel methods; Gaussian processes; distribution regression

1. INTRODUCTION

The name ecological inference refers to the idea of ecological correlations [28], that is, correlations between variables observed for a group of individuals, as opposed to individual correlations, where the individuals are the unit of analysis. The ecological inference problem has much in common with the modifiable areal unit problem [20] and Simpson's paradox. Simply put, it is the problem of inferring individual correlations from ecological correlations. This challenge arises in computational advertising, healthcare data, opinion survey data, and population health data, because in each case, for privacy or cost reasons, we are missing individual-level data, we have access to aggregate-level data, and we want to make individual-level predictions. One way to understand the reason it is called a problem is to consider a two-by-two contingency table, with unknown entries inside the table, and known marginals. As shown in the contingency table below, we might know that a certain electoral district's voting population is 43% men and 57% women, and that in the last election the outcome was 63% in favor of the Democratic candidate and 37% in favor of the Republican candidate. These percentages correspond to the numbers of individuals shown below:

          Democrat   Republican
  Men     ?          ?            1,500
  Women   ?          ?            2,000
          2,200      1,300

Is it possible to infer the joint and thus conditional probabilities? For example, can we ask: what was the Democratic candidate's vote share among women voters? It is clear that only very loose bounds can be placed on these probabilities without any more information. Based on the fact that rows and columns must sum to their marginals, we know, e.g., that the number of Democrats who are men is between 0 and 1,500. These types of deterministic bounds have been around since the 1950s, under the name the method of bounds [4].

What if we are given a set of electoral districts, where for each we know the marginals of the two-by-two contingency table, but none of the inner entries? Then, thinking statistically, we might be tempted to run a regression, predicting the electoral outcomes based on the gender breakdowns of the districts. But this approach, formalized as Goodman's method [8] a few years after the method of bounds was proposed, can easily lead us astray: there is not even a guarantee that outcomes be bounded between 0 and 1, and it ignores potentially useful information provided by deterministic bounds.

We review related work in Section 2 and provide the necessary background on kernel embeddings of distributions, distribution regression, and GP regression in Section 3. We formalize the ecological inference problem in Section 4 and propose our method in Section 5. We apply it to the case of the 2012 US presidential election in Section 6, comparing our results to survey-based exit polls.
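Returning to the contingency table above, the method of bounds is easy to state in a few lines. The following minimal sketch is our illustration (not code from [4]); it computes the feasible range for one interior cell directly from the known row and column marginals.

```python
# Minimal sketch: deterministic (method-of-bounds style) interval for one cell
# of a 2x2 table with known marginals, using the example table from the text.
row_totals = {"men": 1500, "women": 2000}              # gender marginals
col_totals = {"democrat": 2200, "republican": 1300}    # vote marginals
grand_total = sum(row_totals.values())                 # 3,500 voters in the district

def cell_bounds(row, col):
    """Feasible range for the unknown count in cell (row, col)."""
    upper = min(row_totals[row], col_totals[col])
    lower = max(0, row_totals[row] + col_totals[col] - grand_total)
    return lower, upper

# The number of men who voted for the Democratic candidate must lie in [200, 1500],
# consistent with (and slightly tightening) the loose 0 to 1,500 row bound above.
print(cell_bounds("men", "democrat"))
```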
2. RELATED WORK

The ecological inference problem has a long history of solutions and counter-solutions, and it is often taught with a note of grave caution and stark warnings that ecological inference is to be avoided at all costs, usually in favor of individual-level surveys. As with Simpson's paradox, it should come as no surprise that correlations at one level of aggregation can and do flip signs at other levels of aggregation. But abandoning all attempts at ecological inference in favor of surveys is not feasible or appropriate in many circumstances: relevant respondents are no longer alive to answer historical questions of interest; subjects are reluctant to answer questions about sensitive topics like drug usage or cheating. As a result, social scientists have been hard-pressed and even discouraged from studying many interesting and important questions. Ecological inference problems appear in demography, sociology, geography, and political science, and, as discussed in [13], landmark legislation in the US such as the Voting Rights Act requires a solution to the ecological inference problem to understand racial voting patterns.

[Footnote 1: Long-standing solutions have proved quite inadequate: in one court case involving the Voting Rights Act, a qualified expert testified, based on Goodman's method, that the percentage of blacks who were registered to vote in a certain electoral district exceeded 100% [13]. This evidently false claim was apparently made earnestly.]

This problem has attracted a variety of approaches over the years, as summarized in [13], which also proposes a Bayesian statistical modeling framework incorporating the method of bounds (thus uniting the deterministic and probabilistic approaches). [13] sparked a renewed interest in ecological inference, much of which is summarized in [14]. A parametric Bayesian approach to this setting was proposed in [12] and a semiparametric approach was proposed in [23].

Our method differs from existing methods in four ways. First, it uses more information than is typically considered in a standard ecological regression setting: we assume that we have access to representative unlabeled individual-level data. In the voting example, this means having a sample of individual-level census records (microdata) about each electoral district. Second, our method incorporates spatial variation. Spatial data is a common feature of ecological regressions (which, after all, usually have much to do with geography), but it is only very recently that ecological inference methods have begun to address spatial variation explicitly [14]. Third, while our method may be applied to the classic ecological inference problem of inferring individual-level correlations from aggregate data, we propose that it is most well-suited to a related ecological problem, common in political science: inferring the unobserved behavior of subgroups based on the aggregate behavior of groups of which they are part. For our application, this means inferring the voting behavior of men and women separately by electoral district, given aggregate voting information by district. Finally, our work is nonparametric. Kernel embeddings are used to capture all moments of the probability distribution over covariates, and Gaussian process regression is used to non-parametrically model the dependence between predictors and labels.

A related line of work, termed learning from label proportions by some authors [24, 15, 30, 21], has the individual-level goal in mind, and aims to build a classifier for individual instances based only on group-level label proportions. While in principle this approach could be used in our setting, since we are only interested in subgroup-level predictions, the extra task of estimating individual-level predictions is probably not worth the effort considering we are working with n = 10 million individuals.

Our method is based on recent advances in distribution regression [6, 33], which we generalize to address the ecological inference case. Previous work on distribution regression has relied on kernel ridge regression, but we use Gaussian process (GP) regression instead, thus enabling us to incorporate spatial variation, learn kernel hyperparameters, and provide posterior uncertainty intervals, all in a fully Bayesian setting. For scalability (our experiments use n = 10 million individuals), we use a randomized explicit feature representation (FastFood) [16] rather than the kernel trick.

3. BACKGROUND

In this section we review kernel embeddings for probability distributions, distribution regression, FastFood, and Gaussian process (GP) regression.

3.1 Kernel embeddings of distributions

Kernel embeddings of distributions, e.g. [31, 32, 5], are a powerful class of reproducing kernel Hilbert space (RKHS) techniques that map joint, marginal and conditional probability distributions to vectors in a high (or infinite) dimensional feature space. Let $\phi: \mathbb{R}^d \to \mathcal{H}$. It has been shown that if a kernel map $\phi$ is universal/characteristic (e.g. a Gaussian RBF kernel), then for iid samples $x \sim X$, the mean embedding in feature space, denoted

$$\mu_X = \mathbb{E}_{x \sim X}[\phi(x)] \qquad (1)$$

completely characterizes the distribution, in the sense that any two distributions with a difference in any moment will be mapped to a different point in the Hilbert space. This result has been used in a variety of kernel-based statistical tests, including tests of independence and two-sample tests [10]. It is a key feature of our method, because it will allow us to link aggregate labels to individual-level data without throwing out any information.

In this work, we use the simple empirical mean estimator for the kernel mean:

$$\hat\mu_X = \frac{1}{N} \sum_j \phi(x_j) \qquad (2)$$

It is shown in Smola et al. [31] that this plug-in estimator is consistent, and it converges to $\mu_X$ with rate $O(R_n(\mathcal{H}) + 1/\sqrt{n})$, where $R_n(\mathcal{H})$ is the Rademacher complexity of the RKHS. As long as $R_n(\mathcal{H}) = O(n^{-1/2})$ we have the (optimal) parametric rate. Recent work has focused on improving this estimator using James-Stein shrinkage [17].

3.2 Distribution regression

In this section, we formalize distribution regression, the task of learning a classifier or a regression function that maps probability distributions to labels. The problem is fundamentally challenging because we only observe the probability distributions through groups of samples from these distributions. Specifically, our dataset is structured as follows:

$$\left( \{x_1^j\}_{j=1}^{N_1}, y_1 \right), \left( \{x_2^j\}_{j=1}^{N_2}, y_2 \right), \ldots, \left( \{x_n^j\}_{j=1}^{N_n}, y_n \right) \qquad (3)$$

where group i has a single real-valued label $y_i$ and $N_i$ individual observations (e.g. demographic covariates for $N_i$ individuals) denoted $x_i^j \in \mathbb{R}^d$.
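As a minimal illustration of the kernel mean estimator in Eq. (2), the sketch below (our illustration, not code from the paper) represents each group's empirical embedding implicitly through its samples and evaluates the inner product between two group embeddings under a Gaussian RBF kernel; this is exactly the kind of group-level Gram matrix entry used in Sections 3.2-3.3.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * lengthscale ** 2))

def mean_embedding_inner_product(Xa, Xb, lengthscale=1.0):
    """<mu_hat_a, mu_hat_b> in the RKHS: the average pairwise kernel value between
    the samples of group a and the samples of group b (Eq. (2) plugged into an
    inner product)."""
    return rbf_kernel(Xa, Xb, lengthscale).mean()

# Two toy "groups" of individual-level covariates (N_a = 100, N_b = 80, d = 3).
rng = np.random.default_rng(0)
Xa = rng.normal(0.0, 1.0, size=(100, 3))
Xb = rng.normal(0.5, 1.0, size=(80, 3))
print(mean_embedding_inner_product(Xa, Xb))  # one entry of the group-level Gram matrix
```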
To admit a theoretical analysis, it is assumed that the probability distributions themselves are drawn randomly from some unknown meta-distribution over probability distributions. The intuition behind why distribution regression is possible is that if each group of samples is a set of iid draws from a distribution which is itself an iid draw from the meta-distribution, then we will be able to learn.

Recently, this two-stage sampled structure was analyzed, showing that a ridge regression estimator is consistent [33] with a polynomial rate of convergence for almost any meta-distribution over distributions that are sufficiently smooth. The basic approach is as follows: use the kernel mean estimator of Eq. (2) for each group separately to estimate

$$\hat\mu_1 = \frac{1}{N_1}\sum_{j=1}^{N_1} \phi(x_1^j), \quad \ldots, \quad \hat\mu_n = \frac{1}{N_n}\sum_{j=1}^{N_n} \phi(x_n^j) \qquad (4)$$

Next, use kernel ridge regression [29] to learn a function f:

$$y = f(\hat\mu) + \epsilon \qquad (5)$$

where the objective is to minimize the L2 loss subject to a ridge complexity penalty weighted by a positive constant $\lambda$:

$$f = \arg\min_{f \in \mathcal{H}_f} \sum_i \left[ y_i - f(\hat\mu_i) \right]^2 + \lambda \|f\|^2_{\mathcal{H}_f} \qquad (6)$$

In [33] a variety of kernels for f corresponding to the Hilbert space $\mathcal{H}_f$ are considered. We follow the simplest choice of the linear kernel $k(\hat\mu_i, \hat\mu_j) = \langle \hat\mu_i, \hat\mu_j \rangle$, motivated by the fact that we are already working in a Hilbert space over the $\mu_i$. Following the standard derivation of kernel ridge regression [29], we can find the function f in closed form for a new test group $\mu^*$:

$$f(\mu^*) = k^* (K + \lambda I)^{-1} [y_1, \ldots, y_n]^T \qquad (7)$$

where $k^* = [\langle \hat\mu_1, \mu^* \rangle, \ldots, \langle \hat\mu_n, \mu^* \rangle]$ and $K_{ab} = \langle \hat\mu_a, \hat\mu_b \rangle$. In practice, it is hard to know whether the conditions under which the proofs in these papers hold are met. As a partial remedy, our Bayesian approach allows us to quantify the degree of uncertainty in our posterior predictions. Also, as shown in the experiments, a useful diagnostic is to measure the distance between training and test distributions.

3.3 FastFood for explicit kernel expansion

Naively implementing distribution regression using the kernel trick is not scalable in the setting we consider: to compute just one entry in K requires computing $K_{ab} = \langle \hat\mu_a, \hat\mu_b \rangle = \frac{1}{N_a N_b} \sum_{j_1, j_2} k(x_a^{j_1}, x_b^{j_2})$. This computation is $O(N^2)$ (where we assume for simplicity $N_i = N$ for all i), so computing K is $O(n^2 N^2)$. In our application, $N \approx 10^4$, so we need a much more scalable approach. Since we ultimately only need to work with the mean embeddings $\mu_i$ rather than the individual observations $x_i^j$, an explicit feature representation, even if it is very high-dimensional, will drastically reduce our computational costs.

We use an approximate kernel transformation called FastFood [16], which finds a p-dimensional approximation $\hat\phi(x) \in \mathbb{R}^p$ of $\phi(x)$ for every x. Here $\phi$ can correspond to any radial basis function (RBF) kernel. Taking the Gaussian RBF kernel as an example, FastFood boils down to the following transformation due to Rahimi and Recht [25]:

$$\hat\phi(x) = p^{-1/2} \exp(i [V x])$$

where i is the imaginary unit of a complex number and V is an appropriately scaled $p \times d$ Gaussian random matrix with $p > d$. FastFood allows us to approximately compute $Vx$ without explicitly constructing V. In particular, the FastFood transformation takes $V = [V_1^T, V_2^T, \ldots, V_{\lceil p/d \rceil}^T]^T$, where each square $d \times d$ block is given by

$$V_j = \frac{1}{\sqrt{d}}\, S H G \Pi H B$$

Here S, B, G are diagonal random matrices (nonnegative scaling, Rademacher and Gaussian, respectively), $\Pi$ is a random permutation, and H is the Walsh-Hadamard matrix. Every single one of these transformations can be computed in almost linear time. The whole transformation $\hat\phi(x)$ can therefore be computed in $O(p \log d)$ time. This is orders of magnitude faster than random kitchen sinks [25], which cost $O(pd)$ per transformation, or the kernel trick, which needs to do an $O(N^3)$ inversion of a dense $N \times N$ kernel matrix.

It is shown in [16, Theorem 6] that for any $x, x'$, $\langle \hat\phi(x), \hat\phi(x') \rangle$ converges to $\langle \phi(x), \phi(x') \rangle$ with rate $O(\log(2/\delta)/p^{1/2})$, where $\delta$ is the failure probability. This can be viewed as a Johnson-Lindenstrauss transformation of an infinite dimensional space to a finite dimensional Euclidean space, preserving the angles and distances of the original space. While this is not a uniform convergence bound as with the random features in [25], the exponential tail enables us to simultaneously guarantee an exponential number of kernel evaluations via the union bound. Not surprisingly, it has been empirically shown to be comparable in accuracy and to approximate the kernel transformation for all data points quite well. For simplicity, from here onwards we will overload notation and refer to $\phi(x) \in \mathbb{R}^p$ as our feature mapping, which is understood to be approximated with FastFood.

3.4 Gaussian process regression

In this section, we briefly state the main results we need from Gaussian process regression [26], reviewing the well-known connection between the posterior mean in GP regression and the kernel ridge regression estimator of Eq. (7). Given observations $(s_1, y_1), \ldots, (s_n, y_n)$, a Gaussian process prior on a function f, where our model is $y = f(s) + \epsilon$, is written

$$f \sim \mathcal{GP}(0, k(s, s'))$$

with mean 0 and covariance function k. This implies that for a finite set of locations $X = \{s_1, \ldots, s_n\}$, the distribution of $\mathbf{f} = [f(s_1), \ldots, f(s_n)]^\top$ is multivariate Gaussian:

$$\mathbf{f} \sim \mathcal{N}(0, K) \qquad (8)$$

where $K_{ij} = k(s_i, s_j)$. Notice that we have switched from a function f(s) to a vector $\mathbf{f}$. This is because it is only formally correct to consider a probability distribution over the finite-dimensional vector $\mathbf{f}$, not over the infinite-dimensional function f(s). For a formal discussion see [35]. Conditional on the latent variable $\mathbf{f}$, we have a Gaussian observation model:

$$y_i \mid f(s_i) \sim \mathcal{N}(f(s_i), \sigma^2), \quad \forall i \qquad (9)$$

for a variance parameter $\sigma^2$ which can be thought of as measurement error (known as the nugget in geostatistics). For a fixed set of locations X, it is straightforward to sample $\mathbf{f}$ from its prior distribution in Eq. (8). Due to conjugacy, we can marginalize out $\mathbf{f}$ in closed form to find the distribution:

$$y \sim \mathcal{N}(0, K + \sigma^2 I) \qquad (10)$$
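The short sketch below (ours, for illustration only) draws a latent function from the GP prior of Eq. (8) at a fixed set of locations and then draws noisy observations, matching the marginal distribution of Eq. (10); a squared-exponential covariance and toy 2-D locations are assumed purely for the example.

```python
import numpy as np

def sq_exp_kernel(S, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance matrix over locations S (n x d)."""
    sq_dists = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-sq_dists / (2 * lengthscale ** 2))

rng = np.random.default_rng(1)
S = rng.uniform(0, 10, size=(50, 2))        # 50 toy spatial locations in 2-D
K = sq_exp_kernel(S, lengthscale=2.0)       # K_ij = k(s_i, s_j), as in Eq. (8)
sigma2 = 0.1                                # measurement-error variance of Eq. (9)

# f ~ N(0, K): one draw of the latent function values (small jitter for stability)
f = rng.multivariate_normal(np.zeros(len(S)), K + 1e-8 * np.eye(len(S)))
# y | f ~ N(f, sigma2 * I); marginally y ~ N(0, K + sigma2 * I), as in Eq. (10)
y = f + rng.normal(0.0, np.sqrt(sigma2), size=len(S))
```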
If we wish to make a prediction at a new location $s^*$, the standard predictive equations for GP regression [26], derived by conditioning a multivariate Gaussian distribution, tell us:

$$y^* \mid s^*, X, y \sim \mathcal{N}\left( k^* (K + \sigma^2 I)^{-1} y,\;\; k^{**} - k^* (K + \sigma^2 I)^{-1} k^{*\top} \right) \qquad (11)$$

where $K_{ij} = k(s_i, s_j)$, $k^* = [k(s_1, s^*), \ldots, k(s_n, s^*)]$ and $k^{**} = k(s^*, s^*)$. Thus we have a way of combining a prior over f, parametrized by $k(s, s')$, with observed data to obtain a posterior distribution over a new prediction $y^*$ at a new location $s^*$. This is a very powerful method, as it enables a fully Bayesian treatment of regression, a coherent approach to kernel learning through the marginal likelihood (for details see [26]), and posterior uncertainty intervals.

We can immediately see the connection between the kernel ridge regression estimator in Eq. (7) and the posterior mean of the GP in Eq. (11). (A superficial difference is that in Eq. (7) our predictors are $\hat\mu_i$ while in Eq. (11) they are generic locations $s_i$, but this difference will go away in Section 5 when we propose using GP regression for distribution regression.) The predictive mean of GP regression is exactly equal to the kernel ridge regression estimator, with $\sigma^2$ corresponding to $\lambda$. In ridge regression, a larger penalty $\lambda$ leads to a smoother fit (equivalently, less overfitting), while in GP regression a larger $\sigma^2$ favors a smoother GP posterior because it implies more measurement error. For a full discussion of the connections see [2, Sections 6.2.2-6.2.3].

4. ECOLOGICAL INFERENCE

In this section we state the ecological inference problem that we intend to solve. We use the motivating example of inferring Barack Obama's vote share by demographic subgroup (e.g. men versus women) in the 2012 US presidential election, without access to any individual-level labels. Vote totals by electoral precinct are publicly available, and these provide the labels in our problem. Predictors are in the form of demographic covariates about individuals (e.g. from a survey with individual-level data like the census). The challenge is that the labels are aggregate, so it is impossible to know which candidate was selected by any particular individual. This explains the terminology: ecological correlations are correlations between variables which are only available as aggregates at the group level [28].

We use the same notation as in Section 3.2. Let $x_i^j \in \mathbb{R}^d$ be a vector of covariates for individual j in region i. Let $w_i^j$ be survey weights. Let $y_i$ be labels in the form of two-dimensional vectors $(k_i, n_i)$, where $k_i$ is the number of votes received by Obama out of $n_i$ total votes in region i. Then our dataset is:

$$\left( \{x_1^j\}_{j=1}^{N_1}, y_1 \right), \left( \{x_2^j\}_{j=1}^{N_2}, y_2 \right), \ldots, \left( \{x_n^j\}_{j=1}^{N_n}, y_n \right) \qquad (12)$$

We will typically have a rich set of covariates available, in addition to the demographic variables we are interested in stratifying on, so the $x_i^j$ will be high-dimensional vectors denoting gender, age, income, education, etc.

Our task is to learn a function f from a demographic subgroup (which could be everyone) within region i to the probability that this demographic subgroup supported Obama, i.e. the number of votes this group gave Obama divided by the total number of votes in this group.

[Footnote 2: Covariates usually come from a survey based on a random sample of individuals. Typically, surveys are reported with survey weights $w_i^j$ for each individual to correct for oversampling and non-response, which must be taken into account for any valid inference (e.g. summary statistics, regression coefficients, standard errors, etc.).]

5. OUR METHOD

In this section we propose our new ecological inference method. Our approach is illustrated in a schematic in Figure 1 and formally stated in Algorithm 1.

[Figure 1: Illustration of our approach. Labels $y_1$, $y_2$ and $y_3$ are available at the group level, giving Obama's vote share in regions 1, 2, and 3. Covariates are available at the individual level, giving the demographic characteristics of a sample of individuals in regions 1, 2, and 3. We project the individuals from each group into feature space using a feature map $\phi(x)$ and take the mean by group to find high-dimensional vectors $\mu_1$, $\mu_2$ and $\mu_3$, e.g. $\mu_1 = \frac{1}{3}(\phi(x_1^1) + \phi(x_1^2) + \phi(x_1^3))$. Now our problem is reduced to supervised learning, where we want to learn a function $f: \mu \to y$. Once we have learned f, we make subgroup predictions for men and women in region 3 by calculating mean embeddings for the men, $\mu_3^m = \frac{1}{2}(\phi(x_3^3) + \phi(x_3^4))$, and women, $\mu_3^w = \frac{1}{3}(\phi(x_3^1) + \phi(x_3^2) + \phi(x_3^5))$, and then calculating $f(\mu_3^m)$ and $f(\mu_3^w)$. For a more rigorous description of our algorithm see Algorithm 1.]

Recall the two-stage distribution regression approach introduced in Section 3.2. Our method has a similar structure. To begin, we use FastFood as introduced in Section 3.3 with an RBF kernel to produce an explicit feature map $\phi$ and calculate the mean embeddings, one for each region i, of Eq. (4) with survey weights:

$$\hat\mu_1 = \frac{\sum_j w_1^j \phi(x_1^j)}{\sum_j w_1^j}, \quad \ldots, \quad \hat\mu_n = \frac{\sum_j w_n^j \phi(x_n^j)}{\sum_j w_n^j} \qquad (13)$$

[Footnote 3: Distribution regression with explicit random features was previously considered in Oliva et al. [19], using Rahimi and Recht [25] to speed up an earlier distribution regression method based on kernel density estimation [22]. This approach has comparable statistical guarantees to distribution regression using RKHS mean embeddings but inferior empirical performance [33]. As far as we are aware, using FastFood kernel mean embeddings for distribution regression is a novel approach.]
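A minimal sketch of the survey-weighted mean embedding of Eq. (13) is given below. It is our illustration, with toy data, and it uses plain random Fourier features in the style of Rahimi and Recht [25] as a stand-in for FastFood; in the actual pipeline the explicit Gaussian matrix would be replaced by the Hadamard-based construction of Section 3.3.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Explicit feature map approximating a Gaussian RBF kernel:
    phi(x) = sqrt(2/p) * cos(W x + b), with rows of W drawn from N(0, 1/lengthscale^2)
    and b drawn uniformly from [0, 2*pi)."""
    p = W.shape[0]
    return np.sqrt(2.0 / p) * np.cos(X @ W.T + b)

def weighted_mean_embedding(X, w, W, b):
    """Eq. (13): survey-weighted average of feature-mapped individuals in one region."""
    Phi = random_fourier_features(X, W, b)          # shape (N_i, p)
    return (w[:, None] * Phi).sum(0) / w.sum()      # shape (p,)

rng = np.random.default_rng(2)
d, p = 50, 512                                      # toy sizes (Section 6 uses d = 3,251, p = 4,096)
W = rng.normal(0.0, 1.0, size=(p, d))               # would be the structured FastFood matrix in practice
b = rng.uniform(0.0, 2 * np.pi, size=p)

X_region = rng.normal(size=(500, d))                # toy stand-in for one region's microdata
w_region = rng.uniform(0.5, 2.0, size=500)          # toy survey weights
mu_hat = weighted_mean_embedding(X_region, w_region, W, b)
```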
Algorithm 1: Ecological inference algorithm

Input: $\left( \{(x_1^j, w_1^j)\}_{j=1}^{N_1}, s_1, y_1 \right), \ldots, \left( \{(x_n^j, w_n^j)\}_{j=1}^{N_n}, s_n, y_n \right)$
1: for i = 1 ... n do
2:   Calculate $\hat\mu_i$ using Eq. (13) with FastFood.
3:   Calculate $\mu_i^m$ using Eq. (17) with FastFood.
4: end for
5: Learn hyperparameters $\theta = (\sigma_x^2, \sigma_s^2, \ell)$ of the GP model specified by Eqs. (14)-(15) with observations $y_i$ at locations $(\hat\mu_1, s_1), \ldots, (\hat\mu_n, s_n)$ using gradient descent and the Laplace approximation.
6: Make posterior predictions using $\theta$ at locations $(\mu_1^m, s_1), \ldots, (\mu_n^m, s_n)$ using the Laplace approximation.
Output: Posterior means and variances for $y_1^m, \ldots, y_n^m$

Next, instead of kernel ridge regression, we use GP regression. Recall that unlike in distribution regression, our labels $y_i$ are given by vote counts $(k_i, n_i)$. We use a Binomial likelihood as the observation model in GP regression (this is sometimes known as a logistic Gaussian process [27]). We transform each component of the latent real-valued vector $\mathbf{f}$ of Section 3.4 by the logistic link function $\sigma(f) = \frac{1}{1 + e^{-f}}$ and we replace Eq. (9) with the following:

$$k_i \mid f(x_i) \sim \mathrm{Binomial}(n_i, \sigma(f(x_i))) \qquad (14)$$

where we use the formulation of the Binomial distribution with $n_i$ trials and probability of success $\sigma(f(x_i))$. This is the generalized linear model (GLM) specification for binary data, combining a Binomial distribution with a logistic link function [3, Ch. 7].

The predictors in our GP are the mean embeddings $\hat\mu_1, \ldots, \hat\mu_n$. We also include spatial information in the form of 2-dimensional spatial coordinates $s_i$ giving the centroid of region i. Putting these predictors together, we adopt an additive covariance structure:

$$f \sim \mathcal{GP}\left(0,\; \sigma_x^2 \langle \hat\mu_i, \hat\mu_j \rangle + k_s(s_i, s_j)\right) \qquad (15)$$

where we have used a linear kernel between mean embeddings weighted by a variance parameter $\sigma_x^2$. Since the mean embeddings are already in feature space using the FastFood approximation to the RBF kernel, we are approximately using the RBF kernel. For the spatial coordinates we use the Matérn covariance function, a popular choice in spatial statistics [11], with $\nu = 3/2$, length-scale $\ell$ and variance parameter $\sigma_s^2$:

$$k(s, s') = \sigma_s^2 \left(1 + \frac{\sqrt{3}\,\|s - s'\|}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,\|s - s'\|}{\ell}\right) \qquad (16)$$

By adding together the linear kernel between mean embeddings and the spatial covariance function, we allow for a smoothly varying surface over space and demographics. The intuition is that this additive covariance encourages predictions for regions which are nearby in space and have similar demographic compositions to be similar; predictions for regions which are far away or have different demographics are allowed to be less similar. GP regression with a spatial covariance function is equivalent to the spatial statistics technique of kriging: we are effectively smoothly interpolating y values over a very high dimensional space of predictors. Another way to think about additivity is that we are accounting for a spatially autocorrelated error structure in the predictions we get from covariates alone. (We also considered a multiplicative structure, which had slightly worse performance.)

Eqs. (14)-(15) complete our hierarchical model specification. For non-Gaussian observation models like Eq. (14), the posterior prediction in Eq. (11) is no longer available in closed form due to non-conjugacy. We follow the standard approach for GP classification and logistic Gaussian processes and use the Laplace approximation [36, 27]. The Laplace approximation gives an approximate posterior distribution for $\mathbf{f}$, from which we can calculate a posterior distribution over the $k_i$ of Eq. (14), as explained in detail in [26, Section 3.4.2]. The Laplace approximation also allows us to calculate the marginal likelihood, which is the probability of the observed data, integrating out $\mathbf{f}$. To learn $\sigma_x^2$, $\sigma_s^2$, and $\ell$, we use gradient ascent to maximize the log marginal likelihood.

Once we have learned the best set of hyperparameters for our model, we can make predictions for any demographic subgroup of interest. To predict the fraction of men who voted for Obama, we create new mean embedding vectors by gender and region, modifying Eq. (13):

$$\hat\mu_i^m = \frac{\sum_{j \in m} w_i^j \phi(x_i^j)}{\sum_{j \in m} w_i^j}, \quad \forall i \qquad (17)$$

where $j \in m$ indexes the observations of men in region i and $\hat\mu_i^m$ is the mean embedding of the covariates for the men in region i. We then make posterior predictions using the Laplace approximation as above at these new gender-region predictors. Notice that for a new $\mu^*$ this requires calculating $k^* = [k_1^*, k_2^*, \ldots, k_n^*]$ of Eq. (11), where $k_i^* = \sigma_x^2 \langle \hat\mu_i, \mu^* \rangle + k_s(s_i, s^*)$, using Eq. (15). Thus new predictions will be similar to existing predictions in regions with similar covariates, and they will be similar to existing predictions at the same (and nearby) locations.

Our algorithm is stated in Algorithm 1. We now analyze its complexity. Lines 2-3 are calculated by streaming through the data for individuals. For each individual, calculating the FastFood feature transformation $\phi(x_i^j)$ takes $O(p \log d)$, where $x_i^j \in \mathbb{R}^d$ and $\phi(x_i^j) \in \mathbb{R}^p$. To save memory, there is no need to store each $\phi(x_i^j)$; we simply update the weighted average $\hat\mu_i$ by adding $w_i^j \phi(x_i^j)$ to it. Notice that the demographic subgroup considered in line 3 is simply a subset of the observations calculated in line 2, so there is no added cost to calculate the $\mu_i^m$, or indeed a set of $\mu_i^{m_1}, \ldots, \mu_i^{m_q}$ for q different demographic subgroups of interest. Overall, if we have N individuals, the for loop takes time $O(N p \log d)$. Usually $p \ll N$ and $d \ll N$, so this is practically linear and trivially parallelizable.

Learning the hyperparameters of the GP regression on line 5 requires calculations involving the covariance matrix $K \in \mathbb{R}^{n \times n}$. Each entry in K requires computing a dot product $\langle \hat\mu_i, \hat\mu_j \rangle$, which takes $O(p)$, and it requires computing the Matérn kernel for the spatial locations, which is a fast arithmetic calculation. Once we have K, the Laplace approximation is usually implemented with Cholesky decompositions for numerical reasons. The runtime of computing the marginal likelihood and relevant gradients is $O(n^3)$ [26], and gradient ascent usually takes less than a hundred steps to converge. Posterior predictions on line 6 require calculating $k^* \in \mathbb{R}^{1 \times n}$ for each $\mu_i^m$, so this is $O(n^2)$. Reusing the Cholesky decompositions above means predictions can be made in $O(n^2)$. GP regression requires $O(n^2)$ storage.
Overall, we expect $n \ll N$, so our algorithm is practically $O(N)$, with little extra computational cost arising from the GP regression as compared to the work of streaming through all the observations. The N observations do not need to be stored in memory, so the overall memory complexity is only $O(n^2)$.
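To make the additive covariance of Eqs. (15)-(16) concrete, the sketch below (our illustration, with made-up inputs in place of the real embeddings and centroids) assembles the $n \times n$ covariance matrix from region-level mean embeddings and a Matérn-3/2 kernel on region coordinates; this matrix is what the Laplace approximation and marginal-likelihood optimization on line 5 of Algorithm 1 operate on. The hyperparameter values are the ones reported in Section 7.

```python
import numpy as np

def matern32(S, lengthscale, variance):
    """Matern covariance with nu = 3/2 over spatial coordinates S (n x 2), as in Eq. (16)."""
    dists = np.sqrt(((S[:, None, :] - S[None, :, :]) ** 2).sum(-1))
    scaled = np.sqrt(3.0) * dists / lengthscale
    return variance * (1.0 + scaled) * np.exp(-scaled)

def additive_covariance(mu, S, sigma2_x, sigma2_s, lengthscale):
    """Eq. (15): linear kernel on mean embeddings plus Matern kernel on centroids."""
    return sigma2_x * (mu @ mu.T) + matern32(S, lengthscale, sigma2_s)

rng = np.random.default_rng(3)
n, p = 837, 4096                               # number of regions and embedding dimension (Section 6)
mu = rng.normal(size=(n, p)) / np.sqrt(p)      # toy stand-in for the weighted mean embeddings
S = rng.uniform(-120, -70, size=(n, 2))        # toy stand-in for region centroid coordinates
K = additive_covariance(mu, S, sigma2_x=4.56, sigma2_s=0.18, lengthscale=7.92)
```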

6. EXPERIMENTS

In this section, we describe our experimental evaluation, using data from the 2012 US Presidential election, and compare our results to survey-based exit polls, which are only available for the 18 states for which large enough samples were obtained. Our method enables us to fill in the full picture, with much finer-grained spatial estimation and results for a much richer variety of demographic variables. This demonstration shows the applicability of our new method to a large body of political science literature (see, e.g. [7]) on voting patterns by demographics and geography. Because voting behavior is unobservable and due to the ecological inference problem, previous work has been mostly based on exit polls or opinion polls.

[Figure 2: Election outcomes were available for the 67 counties in Florida shown in (a). Demographic data from the American Community Survey was available for 127 public use microdata areas (PUMAs) in Florida, which sometimes overlapped parts of multiple counties and sometimes contained multiple counties. We merged counties and PUMAs as described in the text to create a set of disjoint regions, with the result of 37 electoral regions as shown in (b).]

We obtained vote totals for the 2012 US Presidential Election at the county level [Footnote 4: https://github.com/huffpostdata/election-2012-results]. Most voters chose to either re-elect President Barack Obama or vote for the Republican party candidate, Mitt Romney. A small fraction of voters (< 2% across the country) chose a third party candidate. Separately, we obtained data from the US Census, specifically the 2006-2010 American Community Survey's Public Use Microdata Sample (PUMS). The American Community Survey is an ongoing survey that supplements the decennial US census and provides demographically representative individual-level observations. PUMS data is coded by public use microdata areas (PUMAs), contiguous geographic regions of at least 100,000 people, nested within states. We used the 5-year PUMS file (rather than a 1-year or 3-year sample) because it contains a larger sample and thus there is less censoring for privacy reasons. To merge the PUMS data with the 2012 election results, we created a mapping between counties and PUMAs [Footnote 5: using the PUMA 2000 codes and the tool at http://mcdc.missouri.edu/websas/geocorr12.html], merging individual-level census data and aggregating vote totals as necessary to create larger geographic regions for which the census data and electoral data coincided. The mapping between PUMAs and counties is many-to-many, so we were effectively finding the connected components. Since counties and PUMAs do not cross state borders, none of the geographic regions we created cross state borders. An example is shown in Figure 2.

In total, we ended up with 837 geographic regions, ranging from Orleans Parish in New Orleans, which voted 91% for Barack Obama, to Davis County, a suburb of Salt Lake City, Utah, which voted 84% for Mitt Romney. For the census data, we excluded individuals under the age of 18 (the voting age in the US) and non-citizens (only citizens can vote in presidential elections). There were a total of 10,787,907 individual-level observations, or in other words, almost 11 million people included in the survey. The mean number of people per geographic region was 12,812, with standard deviation 21,939.

There were 223 variables in the census data, including both categorical variables such as race, occupation, and educational attainment and real-valued variables such as income in the past 12 months (in dollars) and travel time to work (in minutes). We divided the real-valued variables by their standard deviation to put them all on the same scale. For the categorical variables with D categories, we converted them into D-dimensional 0/1 indicator variables; i.e., for the variable "when last worked", with categories 1 = within the past 12 months, 2 = 1-5 years ago, and 3 = over 5 years ago or never worked, we mapped 1 to [1 0 0]^T, 2 to [0 1 0]^T and 3 to [0 0 1]^T.

Putting together the indicator variables and real-valued variables, we ended up with 3,251 variables total. For every single individual-level observation, we used FastFood with an RBF kernel to generate a 4,096-dimensional feature representation. Using Eq. (13) we calculated the weighted mean embedding for each region. The result was a set of 837 vectors, each 4,096-dimensional.

We treated the vote totals for Obama and Romney as is, and discarded the remaining third party votes, as the exit polls we use for validation did not report third party votes. Thus for each region, we had a positive integer valued 2-dimensional label giving the number of votes for Obama and the total number of votes.

We focused on the ecological inference problem of predicting Obama's vote share by the following demographic groups: women, men, income <= US$50,000 per year, income between $50,000 and $100,000 per year, income >= $100,000 per year, ages 18-29, 30-44, 45-64, and 65 plus. For each region, we used the strategy outlined above, restricting our census sample to only those observations matching the subgroup of interest and creating new mean embedding predictors as in Eq. (17), $\mu_i^{\mathrm{subgroup}}$. We made predictions for each region-demographic pair. Note that we have made our task harder than necessary to demonstrate our method; we could have trained our model using the exit polling data, where available, and we would certainly recommend practitioners use all available data to get the best possible estimates.

All of our models were fit using the GPstuff package with scaled conjugate gradient optimization and the Laplace approximation [34]. Since $n \ll N$, the time required to fit the GP model and make predictions is much less than the time required to preprocess the data to create the mean embeddings at the beginning of Algorithm 1.
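A minimal sketch (ours, with a toy two-variable example rather than the actual 223 PUMS variables) of the standardization and indicator encoding described above:

```python
import numpy as np

def preprocess(real_vals, categorical, n_categories):
    """Scale a real-valued column by its standard deviation and one-hot encode a
    categorical column with values in {0, ..., n_categories - 1}."""
    scaled = real_vals / real_vals.std()                       # unit-variance scaling
    onehot = np.eye(n_categories)[categorical]                 # D-dimensional 0/1 indicators
    return np.column_stack([scaled, onehot])

# Toy example: "travel time to work" (minutes) and "when last worked" (3 categories).
travel_time = np.array([15.0, 40.0, 25.0, 60.0])
when_last_worked = np.array([0, 2, 1, 0])                      # 0 = within 12 months, etc.
X = preprocess(travel_time, when_last_worked, n_categories=3)  # one row per individual
```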
[Figure 3, panels: (a) exit poll results for women; (b) exit poll results for men; (c) ecological regression results for women; (d) ecological regression results for men. Caption: Support for Obama among women (a) and men (b) in the 18 states for which exit polling was done; due to cost, no representative data was collected for the majority of states or for regions smaller than states. Support for Obama among women (c) and men (d) in 837 different regions as inferred using our ecological regression method.]

7. RESULTS

We learned the following hyperparameters for our GP: $\sigma_s^2 = 0.18$, $\ell = 7.92$, and $\sigma_x^2 = 4.56$. The $\sigma^2$ parameters can be roughly interpreted as the fraction of variance explained, so the fact that $\sigma_x^2$ is much larger than $\sigma_s^2$ means that the demographic covariates encoded in the mean embedding are much more important to the model than the spatial coordinates. The length-scale for the Matérn kernel is a little more than half the median distance between locations, which indicates that it is performing a reasonable degree of smoothing.

We used 10-fold crossvalidation to evaluate our model and ensure that it was not overfitting, an important consideration as generalization performance is critical. The root mean squared error of the model was 2.5 and the mean log predictive density was -1.9. Predictive density is a useful measure because it takes posterior uncertainty intervals into account. For comparison, predicting the national average of Obama receiving 51.1% of the vote in every location has a root mean squared error of 8.3. As a sensitivity analysis, we also considered a multiplicative model, for which the performance was comparable.

To validate our models, we compared to the 2012 exit polls, conducted by Edison Research for a consortium of news organizations. National results were based on interviews with voters in 350 randomly chosen precincts, and state results in 18 states were based on interviews in 11 to 50 random precincts. In these interviews, conducted as voters left polling stations, voters were asked who they voted for and a variety of demographic questions about themselves. Bias due to factors such as unrepresentativeness of the sampled precincts and inadequate coverage of early or absentee voters could be an issue [1]. The national results had a margin of error (corresponding to a 95% uncertainty interval) of 4 percentage points (see footnote 6) and the state results had a margin of error of between 4 and 5 percentage points [18]. For comparing to the 18 state-level exit polls, we aggregated our geographic regions by state, weighting by subgroup population.

As a preview of our results by gender, income, and age, and to get an idea of the power of our method, Figure 3 shows four maps visualizing Obama's support among women and men. In Figures 3a-3b, we show the results from the exit polls, at the state level, for only 18 states. In Figures 3c-3d we fill in the missing picture, providing estimates for 837 different regions. We compare to competing methods below for national-level gender estimates. In the supplementary materials, we consider the non-binary demographic covariates age and income and the case of regional-level estimates, which present difficulties for the competing methods.

7.1 Gender

Voting by gender is shown in Figure 4, where we compare our results to the exit poll results. The fit is quite good, with correlations equal to 0.96 for men and 0.94 for women. The inference that we are most interested in is the gender

[Footnote 6: This presumably corresponds to a sample size of only n = 600 individuals, since the usual margin of error reported by news organizations is $1.96\sqrt{.5^2/(n-1)}$.]
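For reference, plugging n = 600 into the margin-of-error expression in footnote 6 reproduces the reported 4 percentage points (our arithmetic, under the footnote's own assumption about the reported formula):

$$1.96\sqrt{\frac{0.5^2}{n-1}} = 1.96\sqrt{\frac{0.25}{599}} \approx 1.96 \times 0.0204 \approx 0.04$$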
REFERENCES

[1] Implementing a racially stratified homogenous precinct approach. PS: Political Science and Politics, 39(3):477-483, 2006.
[2] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
[3] A. Dobson. An introduction to generalized linear models. Chapman & Hall Texts in Statistical Science, 2002.
[4] O. D. Duncan and B. Davis. An alternative to ecological correlation. American Sociological Review, 16:665-666, 1953.
[5] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule. In NIPS, pages 1737-1745, 2011.
[6] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In ICML, volume 2, pages 179-186, 2002.
[7] A. Gelman, D. Park, B. Shor, J. Bafumi, and J. Cortina. Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do. Princeton University Press, Aug. 2008.
[8] L. A. Goodman. Some alternatives to ecological correlation. American Journal of Sociology, pages 610-625, 1959.
[9] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
[10] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723-773, 2012.
[11] M. S. Handcock and M. L. Stein. A Bayesian analysis of kriging. Technometrics, 35(4):403-410, 1993.
[12] C. Jackson, N. Best, and S. Richardson. Improving ecological inference using individual-level data. Statistics in Medicine, 25(12):2136-2159, 2006.
[13] G. King. A Solution to the Ecological Inference Problem. Princeton University Press, Mar. 1997.
[14] G. King, M. A. Tanner, and O. Rosen. Ecological inference: New methodological strategies. Cambridge University Press, 2004.
[15] H. Kueck and N. de Freitas. Learning about individuals from group statistics. In 21st Uncertainty in Artificial Intelligence (UAI), pages 332-339, 2005.
[16] Q. Le, T. Sarlos, and A. Smola. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.
[17] K. Muandet, K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Kernel mean estimation and Stein effect. In Proceedings of the 31st International Conference on Machine Learning, pages 10-18, 2014.
[18] New York Times. President exit polls - election 2012. http://elections.nytimes.com/2012/results/president/exit-polls, 2012. Accessed: 16 February 2015.
[19] J. B. Oliva, W. Neiswanger, B. Póczos, J. G. Schneider, and E. P. Xing. Fast distribution to real regression. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, volume 33 of JMLR Proceedings, pages 706-714. JMLR.org, 2014.
[20] S. Openshaw. Ecological fallacies and the analysis of areal census data. Environment and Planning A, 16(1):17-31, 1984.
[21] G. Patrini, R. Nock, T. Caetano, and P. Rivera. (Almost) no label no cry. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, NIPS 27, pages 190-198. Curran Associates, Inc., 2014.
[22] B. Póczos, A. Singh, A. Rinaldo, and L. Wasserman. Distribution-free distribution regression. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 507-515, 2013.
[23] R. L. Prentice and L. Sheppard. Aggregate data studies of disease risk factors. Biometrika, 82(1):113-125, 1995.
[24] N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. JMLR, 10:2349-2374, 2009.
[25] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313-1320, 2008.
[26] C. E. Rasmussen and C. K. Williams. Gaussian processes for machine learning. 2006.
[27] J. Riihimäki and A. Vehtari. Laplace approximation for logistic Gaussian process density estimation and regression. Bayesian Analysis, 9(2):425-448, 2014.
[28] W. S. Robinson. Ecological correlations and the behavior of individuals. American Sociological Review, 15:351-357, 1950.
[29] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the 15th International Conference on Machine Learning (ICML-1998), pages 515-521. Morgan Kaufmann, 1998.
[30] D. R. Sheldon and T. G. Dietterich. Collective graphical models. In NIPS, pages 1161-1169, 2011.
[31] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13-31. Springer, 2007.
[32] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961-968. ACM, 2009.
[33] Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur. Two-stage sampled learning theory on distributions. In Artificial Intelligence and Statistics (AISTATS), Feb. 2015.
[34] J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari. GPstuff: Bayesian modeling with Gaussian processes. JMLR, 14(1):1175-1179, 2013.
[35] G. Wahba. Spline models for observational data, volume 59. SIAM, 1990.
[36] C. K. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.
