$k_i \mid f(x_i) \sim \mathrm{Binomial}(n_i, \sigma(f(x_i)))$   (14)

where we use the formulation for the Binomial distribution of $n_i$ trials and probability of success $\sigma(f(x_i))$. This is the generalized linear model (GLM) specification for binary data, combining a Binomial distribution with a logistic link function [3, Ch. 7].

The predictors in our GP are the mean embeddings $\hat\mu_1, \ldots, \hat\mu_n$. We also include spatial information in the form of 2-dimensional spatial coordinates $s_i$ giving the centroid of region $i$. Putting these predictors together, we adopt an additive covariance structure:

$f \sim \mathcal{GP}\big(0,\ \sigma_x^2 \langle \hat\mu_i, \hat\mu_j \rangle + k_s(s_i, s_j)\big)$   (15)

where we have used a linear kernel between mean embeddings, weighted by a variance parameter $\sigma_x^2$. Since the mean embeddings are already in feature space using the FastFood approximation to the RBF kernel, we are approximately using the RBF kernel. For the spatial coordinates we use the Matérn covariance function, a popular choice in spatial statistics [11], with $\nu = 3/2$, length-scale $\ell$ and variance parameter $\sigma_s^2$:

$k(s, s') = \sigma_s^2 \left(1 + \frac{\|s - s'\|\sqrt{3}}{\ell}\right) \exp\left(-\frac{\|s - s'\|\sqrt{3}}{\ell}\right)$   (16)

By adding together the linear kernel between mean embeddings and the spatial covariance function, we allow for a smoothly varying surface over space and demographics. The intuition is that this additive covariance encourages predictions for regions which are nearby in space and have similar demographic compositions to be similar; predictions for regions which are far away or have different demographics are allowed to be less similar. GP regression with a spatial covariance function is equivalent to the spatial statistics technique of kriging: we are effectively smoothly interpolating $y$ values over a very high dimensional space of predictors. Another way to think about additivity is that we are accounting for a spatially autocorrelated error structure.

For a demographic subgroup, say men, we form a new predictor $\hat\mu_i^m$ as in Eq. (17): the mean embedding of the covariates for the men in region $i$, computed over the indices $j^m$ of the observations of men in region $i$. We then make posterior predictions using the Laplace approximation as above at these new gender-region predictors. Notice that for a new $\hat\mu_*$ this requires calculating $k_* = [k_1, k_2, \ldots, k_n]$ of Eq. (11), where $k_i = \sigma_x^2 \langle \hat\mu_i, \hat\mu_* \rangle + k_s(s_i, s_*)$ using Eq. (15). Thus new predictions will be similar to existing predictions in regions with similar covariates, and they will be similar to existing predictions at the same (and nearby) locations.

Our algorithm is stated in Algorithm 1. We now analyze its complexity. Lines 2-3 are calculated by streaming through the data for individuals. For each individual, calculating the FastFood feature transformation $\phi(x_{ji})$ takes $O(p \log d)$, where $x_{ji} \in \mathbb{R}^d$ and $\phi(x_{ji}) \in \mathbb{R}^p$. To save memory, there is no need to store each $\phi(x_{ji})$: we simply update the weighted average $\hat\mu_i$ by adding $w_{ji}\phi(x_{ji})$ to it. Notice that the demographic subgroup considered in line 3 is simply a subset of the observations calculated in line 2, so there is no added cost to calculate the $\hat\mu_i^m$, or indeed a set of $\hat\mu_i^{m_1}, \ldots, \hat\mu_i^{m_q}$ for $q$ different demographic subgroups of interest. Overall, if we have $N$ individuals, the for loop takes time $O(Np \log d)$. Usually $p \ll N$ and $d \ll N$, so this is practically linear and trivially parallelizable.

On line 5, learning the hyperparameters in the GP regression requires calculations involving the covariance matrix $K \in \mathbb{R}^{n \times n}$. Each entry in $K$ requires computing a dot product $\langle \hat\mu_i, \hat\mu_j \rangle$, which takes $O(p)$, and computing the Matérn kernel for the spatial locations, which is a fast arithmetic calculation. Once we have $K$, the Laplace approximation is usually implemented with Cholesky decompositions for numerical reasons. The runtime of computing the marginal likelihood and relevant gradients is $O(n^3)$ [26], and gradient ascent usually takes less than a hundred steps to converge. Posterior predictions on line 6 require calculating $k_* \in \mathbb{R}^{1 \times n}$ for each $\hat\mu_i^m$, so this is $O(n^2)$. Reusing the Cholesky decompositions above means predictions can be made in $O(n^2)$. GP regression requires $O(n^2)$ storage. Overall, we expect $n \ll N$, so our algorithm is practically $O(N)$, with little extra computational cost arising from the GP regression as compared to the work of streaming through all the observations. The $N$ observations do not need to be stored in memory, so the overall memory complexity is only $O(n^2)$.
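To make the pipeline concrete, here is a minimal numpy sketch of the two building blocks just described: streaming weighted mean embeddings, and the additive covariance of Eqs. (15)-(16). This is our own illustration under stated assumptions, not the paper's implementation: plain random Fourier features stand in for FastFood (both approximate the RBF kernel; FastFood just computes the projection faster), and all function names and toy dimensions are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random Fourier features as a simple stand-in for FastFood: both
# approximate the RBF kernel; FastFood computes the projection in
# O(p log d) rather than O(pd).
d, p = 5, 64                       # raw covariate dim, feature dim
W = rng.normal(size=(p, d))        # random frequencies
b = rng.uniform(0, 2 * np.pi, p)   # random phases

def phi(x):
    """Feature map phi(x) approximating the RBF kernel."""
    return np.sqrt(2.0 / p) * np.cos(W @ x + b)

def mean_embedding(xs, ws):
    """Stream observations, accumulating the weighted average of
    phi(x_ji) without ever storing the individual features."""
    mu, total = np.zeros(p), 0.0
    for x, w in zip(xs, ws):
        mu += w * phi(x)
        total += w
    return mu / total

def matern32(si, sj, ell, sigma2_s):
    """Matern nu=3/2 spatial covariance, as in Eq. (16)."""
    r = np.sqrt(3.0) * np.linalg.norm(si - sj) / ell
    return sigma2_s * (1.0 + r) * np.exp(-r)

def additive_cov(mus, S, sigma2_x, ell, sigma2_s):
    """Covariance matrix K of Eq. (15): linear kernel between mean
    embeddings plus Matern 3/2 between region centroids."""
    n = len(mus)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = (sigma2_x * mus[i] @ mus[j]
                       + matern32(S[i], S[j], ell, sigma2_s))
    return K

# toy data: 3 regions, each with 20 weighted observations and a centroid
mus = [mean_embedding(rng.normal(size=(20, d)), rng.uniform(1, 2, 20))
      for _ in range(3)]
S = rng.normal(size=(3, 2))
K = additive_cov(mus, S, sigma2_x=4.56, ell=7.92, sigma2_s=0.18)
```

Each entry of $K$ costs one $O(p)$ dot product plus a cheap Matérn evaluation, matching the complexity analysis above.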
6. EXPERIMENTS

In this section, we describe our experimental evaluation, using data from the 2012 US Presidential election, and compare our results to survey-based exit polls, which are only available for the 18 states for which large enough samples were obtained. Our method enables us to fill in the full picture, with much finer-grained spatial estimation and results for a much richer variety of demographic variables. This demonstration shows the applicability of our new method to a large body of political science literature (see, e.g., [7]) on voting patterns by demographics and geography. Because voting behavior is unobservable and due to the ecological inference problem, previous work has been mostly based on exit polls or opinion polls.

Figure 2: Election outcomes were available for the 67 counties in Florida shown in (a). Demographic data from the American Community Survey was available for 127 public use microdata areas (PUMAs) in Florida, which sometimes overlapped parts of multiple counties and sometimes contained multiple counties. We merged counties and PUMAs as described in the text to create a set of disjoint regions, with the result of 37 electoral regions as shown in (b).

We obtained vote totals for the 2012 US Presidential Election at the county level⁴. Most voters chose to either re-elect President Barack Obama or vote for the Republican party candidate, Mitt Romney. A small fraction of voters (< 2% across the country) chose a third party candidate. Separately, we obtained data from the US Census, specifically the 2006-2010 American Community Survey's Public Use Microdata Sample (PUMS). The American Community Survey is an ongoing survey that supplements the decennial US census and provides demographically representative individual-level observations. PUMS data is coded by public use microdata areas (PUMAs), contiguous geographic regions of at least 100,000 people, nested within states. We used the 5-year PUMS file (rather than a 1-year or 3-year sample) because it contains a larger sample and thus there is less censoring for privacy reasons. To merge the PUMS data with the 2012 election results, we created a mapping between counties and PUMAs⁵, merging individual-level census data and aggregating vote totals as necessary to create larger geographic regions for which the census data and electoral data coincided. The mapping between PUMAs and counties is many-to-many, so we were effectively finding the connected components. Since counties and PUMAs do not cross state borders, none of the geographic regions we created cross state borders. An example is shown in Figure 2.

⁴ https://github.com/huffpostdata/election-2012-results
⁵ Using the PUMA 2000 codes and the tool at http://mcdc.missouri.edu/websas/geocorr12.html

In total, we ended up with 837 geographic regions, ranging from Orleans Parish in New Orleans, which voted 91% for Barack Obama, to Davis County, a suburb of Salt Lake City, Utah, which voted 84% for Mitt Romney. For the census data, we excluded individuals under the age of 18 (voting age in the US) and non-citizens (only citizens can vote in presidential elections). There were a total of 10,787,907 individual-level observations, or in other words, almost 11 million people included in the survey. The mean number of people per geographic region was 12,812 with standard deviation 21,939.

There were 223 variables in the census data, including both categorical variables such as race, occupation, and educational attainment and real-valued variables such as income in the past 12 months (in dollars) and travel time to work (in minutes). We divided the real-valued variables by their standard deviation to put them all on the same scale. For the categorical variables with $D$ categories, we converted them into $D$-dimensional 0/1 indicator variables, i.e. for the variable "when last worked" with categories 1 = within the past 12 months, 2 = 1-5 years ago, and 3 = over 5 years ago or never worked, we mapped 1 to $[1\ 0\ 0]^T$, 2 to $[0\ 1\ 0]^T$ and 3 to $[0\ 0\ 1]^T$.

Putting together the indicator variables and real-valued variables, we ended up with 3,251 variables total. For every single individual-level observation, we used FastFood with an RBF kernel to generate a 4,096-dimensional feature representation. Using Eq. (13) we calculated the weighted mean embedding for each region. The result was a set of 837 vectors which were 4,096-dimensional.

We treated the vote totals for Obama and Romney as is, and discarded the remaining third party votes, as the exit polls we use for validation did not report third party votes. Thus for each region, we had a positive integer valued 2-dimensional label giving the number of votes for Obama and the total number of votes.

We focused on the ecological inference problem of predicting Obama's vote share by the following demographic groups: women, men, income ≤ US$50,000 per year, income between $50,000 and $100,000 per year, income ≥ $100,000 per year, ages 18-29, 30-44, 45-64, and 64 plus. For each region, we used the strategy outlined above, restricting our census sample to only those observations matching the subgroup of interest and creating new mean embedding predictors $\hat\mu_i^{\mathrm{subgroup}}$ as in Eq. (17). We made predictions for each region-demographic pair. Note that we have made our task harder than necessary to demonstrate our method; we could have trained our model using the exit polling data, where available, and we would certainly recommend practitioners use all available data to get the best possible estimates.
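The variable preprocessing just described (unit-variance scaling of real-valued variables, $D$-dimensional 0/1 indicators for $D$-category variables) can be sketched as follows. This is our own minimal numpy illustration, not code from the paper's pipeline; the function name and column layout are assumptions.

```python
import numpy as np

def encode(real_cols, cat_cols, n_categories):
    """Scale each real-valued column by its standard deviation and
    expand each categorical column with D categories (coded 1..D)
    into D 0/1 indicator columns."""
    parts = [real_cols / real_cols.std(axis=0, ddof=0)]
    for col, D in zip(cat_cols.T, n_categories):
        ind = np.zeros((len(col), D))
        ind[np.arange(len(col)), col - 1] = 1.0  # category k -> k-th indicator
        parts.append(ind)
    return np.hstack(parts)

# income and travel time, plus 'when last worked' with 3 categories:
# category 1 maps to [1 0 0], 2 to [0 1 0], 3 to [0 0 1]
real = np.array([[30000.0, 15.0], [60000.0, 45.0], [90000.0, 30.0]])
cat = np.array([[1], [2], [3]])
X = encode(real, cat, n_categories=[3])
```

Applied to all 223 census variables, this kind of expansion is what yields the 3,251-dimensional raw representation fed into FastFood.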
All of our models were fit using the GPstuff package with scaled conjugate gradient optimization and the Laplace approximation [34]. Since $n \ll N$, the time required to fit the GP model and make predictions is much less than the time
required to preprocess the data to create the mean embeddings at the beginning of Algorithm 1.

(a) Exit poll results for women. (b) Exit poll results for men. (c) Ecological regression results for women. (d) Ecological regression results for men.
Figure 3: Support for Obama among women (a) and men (b) in the 18 states for which exit polling was done; due to cost, no representative data was collected for the majority of states or for regions smaller than states. Support for Obama among women (c) and men (d) in 837 different regions as inferred using our ecological regression method.

7. RESULTS

We learned the following hyperparameters for our GP: $\sigma_s^2 = 0.18$, $\ell = 7.92$, and $\sigma_x^2 = 4.56$. The $\sigma^2$ parameters can be roughly interpreted as the fraction of variance explained, so the fact that $\sigma_x^2$ is much larger than $\sigma_s^2$ means that the demographic covariates encoded in the mean embedding are much more important to the model than the spatial coordinates. The length-scale for the Matérn kernel is a little more than half the median distance between locations, which indicates that it is performing a reasonable degree of smoothing.

We used 10-fold cross-validation to evaluate our model and ensure that it was not overfitting, an important consideration as generalization performance is critical. The root mean squared error of the model was 2.5 and the mean log predictive density was -1.9. Predictive density is a useful measure because it takes posterior uncertainty intervals into account. For comparison, predicting the national average of Obama receiving 51.1% of the vote in every location has a root mean squared error of 8.3. As a sensitivity analysis, we also considered a multiplicative model, for which the performance was comparable.

To validate our models, we compared to the 2012 exit polls, conducted by Edison Research for a consortium of news organizations. National results were based on interviews with voters in 350 randomly chosen precincts, and state results in 18 states were based on interviews in 11 to 50 random precincts. In these interviews, conducted as voters left polling stations, voters were asked who they voted for and a variety of demographic questions about themselves. Bias due to factors such as unrepresentativeness of the sampled precincts and inadequate coverage of early or absentee voters could be an issue [1]. The national results had a margin of error (corresponding to a 95% uncertainty interval) of 4 percentage points⁶ and the state results had a margin of error of between 4 and 5 percentage points [18]. To compare to the 18 state-level exit polls, we aggregated our geographic regions by state, weighting by subgroup population.

As a preview of our results by gender, income, and age, and to get an idea of the power of our method, Figure 3 shows four maps visualizing Obama's support among women and men. In Figures 3a-3b, we show the results from the exit polls, at the state level, for only 18 states. In Figures 3c-3d we fill in the missing picture, providing estimates for 837 different regions. We compare to competing methods below for national-level gender estimates. In the supplementary materials, we consider the non-binary demographic covariates age and income and the case of regional-level estimates, which present difficulties for the competing methods.

7.1 Gender

Voting by gender is shown in Figure 4, where we compare our results to the exit poll results. The fit is quite good, with correlations equal to 0.96 for men and 0.94 for women. The inference that we are most interested in is the gender

⁶ This presumably corresponds to a sample size of only $n = 600$ individuals, since the usual margin of error reported by news organizations is $1.96\sqrt{.5^2/(n-1)}$.
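The back-of-the-envelope calculation in footnote 6 is easy to check numerically. The sketch below is our own illustration of the quoted formula, not code from the paper:

```python
import math

def margin_of_error(n, p=0.5):
    """95% margin of error for a proportion, in the simple form
    1.96 * sqrt(p^2 / (n - 1)) quoted in footnote 6."""
    return 1.96 * math.sqrt(p ** 2 / (n - 1))

# Inverting for a 4-point margin: 1.96 * 0.5 / sqrt(n - 1) = 0.04
# gives n = 1 + (1.96 * 0.5 / 0.04)^2, roughly 600 respondents.
implied_n = 1 + (1.96 * 0.5 / 0.04) ** 2
```

With $n \approx 600$ the formula indeed returns a margin of about 0.04, consistent with the footnote's reading of the national exit poll.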
[1] Implementing a racially stratified homogeneous precinct approach. PS: Political Science and Politics, 39(3):477-483, 2006. ISSN 1049-0965.
[2] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
[3] A. Dobson. An introduction to generalized linear models. Chapman & Hall Texts in Statistical Science, 2002.
[4] O. D. Duncan and B. Davis. An alternative to ecological correlation. American Sociological Review, 16:665-666, 1953.
[5] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule. In NIPS, pages 1737-1745, 2011.
[6] T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In ICML, volume 2, pages 179-186, 2002.
[7] A. Gelman, D. Park, B. Shor, J. Bafumi, and J. Cortina. Red State, Blue State, Rich State, Poor State: Why Americans Vote the Way They Do. Princeton University Press, Aug. 2008. ISBN 069113927X.
[8] L. A. Goodman. Some alternatives to ecological correlation. American Journal of Sociology, pages 610-625, 1959.
[9] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Dataset Shift in Machine Learning, 3(4):5, 2009.
[10] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723-773, 2012.
[11] M. S. Handcock and M. L. Stein. A Bayesian analysis of kriging. Technometrics, 35(4):403-410, 1993.
[12] C. Jackson, N. Best, and S. Richardson. Improving ecological inference using individual-level data. Statistics in Medicine, 25(12):2136-2159, 2006.
[13] G. King. A Solution to the Ecological Inference Problem. Princeton University Press, Mar. 1997.
[14] G. King, M. A. Tanner, and O. Rosen. Ecological inference: New methodological strategies. Cambridge University Press, 2004.
[15] H. Kueck and N. de Freitas. Learning about individuals from group statistics. In 21st Uncertainty in Artificial Intelligence (UAI), pages 332-339, 2005.
[16] Q. Le, T. Sarlos, and A. Smola. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.
[17] K. Muandet, K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Kernel mean estimation and Stein effect. In Proceedings of The 31st International Conference on Machine Learning, pages 10-18, 2014.
[18] New York Times. President exit polls - election 2012. http://elections.nytimes.com/2012/results/president/exit-polls, 2012. Accessed: 16 February 2015.
[19] J. B. Oliva, W. Neiswanger, B. Póczos, J. G. Schneider, and E. P. Xing. Fast distribution to real regression. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, volume 33 of JMLR Proceedings, pages 706-714. JMLR.org, 2014.
[20] S. Openshaw. Ecological fallacies and the analysis of areal census data. Environment and Planning A, 16(1):17-31, 1984.
[21] G. Patrini, R. Nock, T. Caetano, and P. Rivera. (Almost) no label no cry. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, NIPS 27, pages 190-198. Curran Associates, Inc., 2014.
[22] B. Póczos, A. Singh, A. Rinaldo, and L. Wasserman. Distribution-free distribution regression. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, pages 507-515, 2013.
[23] R. L. Prentice and L. Sheppard. Aggregate data studies of disease risk factors. Biometrika, 82(1):113-125, 1995.
[24] N. Quadrianto, A. J. Smola, T. S. Caetano, and Q. V. Le. Estimating labels from label proportions. JMLR, 10:2349-2374, 2009.
[25] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313-1320, 2008.
[26] C. E. Rasmussen and C. K. Williams. Gaussian processes for machine learning, 2006.
[27] J. Riihimäki and A. Vehtari. Laplace approximation for logistic Gaussian process density estimation and regression. Bayesian Analysis, 9(2):425-448, 2014.
[28] W. S. Robinson. Ecological correlations and the behavior of individuals. American Sociological Review, 15:351-357, 1950.
[29] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In (ICML-1998) Proceedings of the 15th International Conference on Machine Learning, pages 515-521. Morgan Kaufmann, 1998.
[30] D. R. Sheldon and T. G. Dietterich. Collective graphical models. In NIPS, pages 1161-1169, 2011.
[31] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13-31. Springer, 2007.
[32] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961-968. ACM, 2009.
[33] Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur. Two-stage sampled learning theory on distributions. Artificial Intelligence and Statistics (AISTATS), Feb. 2015.
[34] J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari. GPstuff: Bayesian modeling with Gaussian processes. JMLR, 14(1):1175-1179, 2013.
[35] G. Wahba. Spline models for observational data, volume 59. SIAM, 1990.
[36] C. K. Williams and D. Barber. Bayesian classification with Gaussian processes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(12):1342-1351, 1998.