Consumer
Data
Research
Paul Longley, James Cheshire
and Alex Singleton
Acknowledgements
Consumer Data Research Centre: An ESRC Data Investment
Contents
INTRODUCTION
Consumer Data Research – An Overview
Paul Longley, James Cheshire and Alex Singleton

PART ONE
PROVENANCE AND CONSUMER DATA INFRASTRUCTURE

6. Movements in Cities: Footfall and its Spatio-Temporal Distribution
Roberto Murcio, Balamurugan Soundararaj and Karlo Lugomer

7. The Geography of Online Retail Behaviour
Alexandros Alexiou, Dean Riddlesden and Alex Singleton

8. Smart Card Data and Human Mobility
Nilufer Sari Aslam and Tao Cheng

11. Geotemporal Twitter Demographics
Alistair Leak and Guy Lansley

12. Developing Indicators for Measuring Health-Related Features of Neighbourhoods
Konstantinos Daras, Alec Davies, Mark A Green and Alex Singleton

13. Consumers in their Built Environment Context
Alexandros Alexiou and Alex Singleton
It has become a cliché to observe that new sources of Big Data are becoming available in ever greater variety, in unprecedented volumes and with ever more frequent temporal updating (velocity). This book is about ‘consumer data’ that arise out of every-day transactions for goods and services, carried out between individuals and organisations. Such data account for an increasing real share of all of the characteristics and activities of active citizens today, and offer the prospect of better understanding the nature and functioning of society.

Consumer data are not created for the edification of researchers and analysts. Instead, they are a by-product of the myriad consumer transactions that created them. This has important implications for the data’s content and coverage when they are reused for research purposes. First, the traces of (some kinds of) transactions or those people conducting them may be more evident or detailed than others, and this outcome is usually well beyond the control of the analyst. Second, different individuals have different wants, needs and spending power, and so some individuals in the population at large will be represented more prominently than others – and at the other extreme, those that consume nothing from a particular retailer / service provider will not be represented at all. A related point is that few consumer organisations have a monopoly of their markets, and many focus upon particular market niches. Taken together, this means that there is bias in the content and coverage of consumer data sources, and that the source and operation of bias cannot be ascertained without reference to external sources. In many ways these issues are akin to those that characterise volunteered or crowd sourced data – in that individuals need to feel motivated in order to contribute data, and the distinctive characteristics of those that feel motivated may affect the content and coverage of the resulting dataset (Haklay, 2010).
This situation contrasts sharply with the design of conventional social surveys, where the principles of scientific sampling are used to ensure complete coverage of the relevant population of interest at the design stage. Nevertheless the quality of social surveys is diminished where acceptable response rates are not achieved, or there is bias in the relevant characteristics of those that respond to the surveys and those that do not. In this context, it is important to recognise that recent years have seen cumulative declines in response rates throughout the developed world (e.g. Sax et al, 2003) and that in important respects social surveys are no longer a panacea for social science research. More generally, there is also no guarantee that we will be able to rely on the long-term availability of those traditional sources of data such as a Census of the Population, as within many countries these expensive and time-consuming surveys have come under increasing threat in line with fiscal constraint (Singleton et al, 2017).

Many of the chapters in this book arise out of shared challenges that are faced by academics and the organisations that, to differing degrees, create consumer data. There are, of course, differences too: the timescales that characterise academic research offer horizon scanning that business organisations, usually focused upon more operational matters such as optimising the next set of sales figures, are less likely to have the resources to facilitate. There may be tensions too, in that consumer data providers may safeguard their competitive position, while contributing to research that ultimately increases the competitiveness of their industrial sector as a whole. There are also differences of emphasis in method, technique and application that have evolved in different ways between the academic and business sectors. But it is also possible that there is shared interest in better understanding the form and functioning of social systems.

The research reported in this book has developed using the Consumer Data Research Centre’s (CDRC) ‘ladder of engagement’, whereby initial collaborations with consumer organisations are focused upon specific small MSc projects. A number of these have developed into co-sponsored PhD projects, or shared projects staffed by CDRC Data Scientists. Some data providers then progress to providing data for wider use by the academic community, under agreed terms set out in data licensing agreements. Finally, it is also possible to engage data providers in the co-production of data with the CDRC itself. Good examples are provided by our engagement with players in the domestic energy provision and retail sector who have participated in the Master’s Research Dissertation Programme before going on to co-sponsor PhD research. This latter development in turn led to providing CDRC with a nationwide dataset, which is available to other researchers through the CDRC service. The collaboration with the Local Data Company (LDC) reported in this book represents the highest rung of this ‘ladder of engagement’ and follows successful collaboration on MSc and PhD projects as well as the co-production of nationwide data with CDRC for further research and development.

Many consumer-facing organisations are highly sensitised to the risks of disclosure, although these risks are minimal where data are anonymised prior to transfer, and appropriate resources to access them are put in place. To this end, CDRC uses a number of secure data facilities (one of which is accredited by the London Metropolitan Police), and CDRC researchers are familiar with using novel data access technologies such as secure links to sensitive data-sets held by different organisations.

The approaches to consumer data research that are reported in this book come at an interesting time in the evolution of data landscapes in advanced economies. There
PART ONE
PROVENANCE AND CONSUMER DATA INFRASTRUCTURE

1. Consumer Registers as Spatial Data Infrastructure and their Use in Migration and Residential Mobility Research
Guy Lansley and Wen Li
minority ethnic backgrounds and foreign individuals who were eligible to vote due to their country of citizenship (i.e. Irish and Commonwealth citizens). In addition, only 57% of respondents in privately rented properties were found to be in the Electoral Register. This suggests that it is the geographically mobile population that is typically under-enumerated or inaccurately recorded. It is highly likely that the remaining data sources in the Consumer Registers will also under-enumerate those who recently changed address, as there is little incentive to update your details immediately for many services following a change of address. It is also possible that different sources of consumer data may have particular demographic and socio-economic biases.

Previous research has focused upon issues of under-representation when discussing the provenance of big datasets. The Consumer Registers, by contrast, appear to over-represent the size of the adult population. We have compared the number of records to the estimated population of persons aged 17 and above from the ONS mid-year population estimates. For example, the 2013 and 2014 Consumer Registers each contain over three million more individuals than the ONS population estimates for the same year. This could be due to a number of reasons such as the duplication of those who live at multiple addresses, failure to delete old records and issues of cross contamination when data are pooled (Bollier, 2010). There are also likely to be some individuals below the age of 17 in the consumer data who cannot be distinguished due to the unavailability of demographic variables. We should also consider that population estimates do not represent the actual population counts.

We have attempted to identify if there are geographic patterns of over-representation. Firstly, we have considered local authority (or district) level variations at the national level through comparisons to the 2011 Census population (adults aged 17 and above) (Figure 1.1). It can be observed that the two main areas of under-representation are London and Northern Ireland. Whilst under-enumeration in London can possibly be accounted for by the higher proportion of (non-voter) migrants and individuals in rental properties, the low counts in Northern Ireland are probably due to different administrative procedures of their Electoral Office or a low presence of participating retailers. Indeed the pattern across the UK is rather irregular; whilst the most over-represented districts are generally less densely populated, this is not always the case. As the electoral roll is administered by local authorities, it is possible their varying practices have contributed to these differences. In addition, some of the consumer data may come from companies which have regional customer biases.

We have also considered the spatial distribution of representation at the intra-urban scale. We have taken the City of Bristol as an example due to its pronounced socio-spatial inequalities and observed the rate at the census output area (OA) level. Census OAs had an average population of just over 300 in 2011. Figure 1.1 also highlights that most under-representation occurs in the centre of the city. This part of the city has the greatest proportion of young adults, ethnic minorities and those in privately rented accommodation. All three of these characteristics were found to be associated with under-enumeration in the Electoral Register (Electoral Commission, 2016). Generally, it is neighbourhoods with the greatest rate of homeownership which have the highest counts in the consumer registers.

1.4 Address matching

The addresses recorded in the registers are formatted into six text columns representing distinctive lines of their postal addresses, such as house numbers or names, streets, cities, etc. In addition, there is also a postcode column. However, unfortunately,
the addresses are not consistently structured. For example, the first line of an address may represent a flat number for some addresses, whilst it could represent the street name and house number for others. In addition, the number of lines in each address varies; many records do not include the county or region name. Although the data provider did include a unique reference number for each address, there were inconsistencies between its recording in 2013 and 2014.

Our aim was to create a methodology to match as many addresses as possible, regardless of how they are formatted. Due to inconsistencies within the database, we could not match all dwellings via a simple string match. To improve the quality of joining via textual addresses, we devised a method for matching addresses based on similarity of text strings. The method combines three similarity functions derived from intuition about UK addresses. The first one is based on the numbers used in the addresses, including property numbers and flat numbers. Examples are ‘14’, ‘14a’. The second is based on the word difference between two addresses, which measures how close the sets of words used in the two addresses are. This will cover the cases where addresses do not contain a house number. The function also takes into account the common words in addresses (such as road, street), as well as their abbreviations, by weighting the difference between words in inverse proportion to their frequency in the data. The third function is a variant of Levenshtein Distance (a.k.a. Edit Distance) which measures the difference in terms of characters. The adaptation incorporates a weighting scheme to emphasise differences at the beginning of the textual addresses. To match addresses from a set of

Figure 1.1 The ratio of the number of recorded persons in the 2013 Consumer Register by the population of persons aged 17 and above from the 2011 Census at the district level for the UK (left) and output area level for Bristol (right).
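The three similarity functions described in Section 1.4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abbreviation table, the inverse-frequency formula, the position-dependent edit cost and the mixing weights in `match_score` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative abbreviation table (an assumption, not taken from the chapter).
ABBREVIATIONS = {"rd": "road", "st": "street", "ave": "avenue"}

def tokens(address):
    """Lower-case alphanumeric tokens with common abbreviations expanded."""
    return [ABBREVIATIONS.get(t, t) for t in re.findall(r"[a-z0-9]+", address.lower())]

def number_similarity(a, b):
    """Function 1: overlap of number-bearing tokens such as '14' or '14a'."""
    na = {t for t in tokens(a) if any(c.isdigit() for c in t)}
    nb = {t for t in tokens(b) if any(c.isdigit() for c in t)}
    if not na and not nb:
        return 1.0  # no numbers on either side, so nothing to disagree about
    return len(na & nb) / len(na | nb)

def idf_weights(addresses):
    """Weight each token inversely to the number of addresses containing it."""
    counts = Counter(t for addr in addresses for t in set(tokens(addr)))
    return {t: math.log(1 + len(addresses) / c) for t, c in counts.items()}

def word_similarity(a, b, weights):
    """Function 2: frequency-weighted overlap of the non-numeric word sets."""
    wa = {t for t in tokens(a) if t.isalpha()}
    wb = {t for t in tokens(b) if t.isalpha()}
    default = max(weights.values(), default=1.0)  # unseen words count as rare
    total = sum(weights.get(t, default) for t in wa | wb)
    shared = sum(weights.get(t, default) for t in wa & wb)
    return shared / total if total else 1.0

def edit_similarity(a, b):
    """Function 3: Levenshtein variant in which edits near the start cost more."""
    a, b = " ".join(tokens(a)), " ".join(tokens(b))

    def cost(k):
        return 1.0 + 1.0 / (1 + k)  # k is the 0-based character position

    prev = [0.0]
    for j in range(1, len(b) + 1):
        prev.append(prev[-1] + cost(j - 1))
    for i in range(1, len(a) + 1):
        cur = [prev[0] + cost(i - 1)]
        for j in range(1, len(b) + 1):
            c = cost(max(i, j) - 1)
            substitute = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else c)
            cur.append(min(substitute, prev[j] + c, cur[j - 1] + c))
        prev = cur
    worst = sum(cost(k) for k in range(max(len(a), len(b))))
    return 1.0 - prev[-1] / worst if worst else 1.0

def match_score(a, b, weights, mix=(0.4, 0.3, 0.3)):
    """Blend the three similarities; the mixing weights are arbitrary placeholders."""
    return (mix[0] * number_similarity(a, b)
            + mix[1] * word_similarity(a, b, weights)
            + mix[2] * edit_similarity(a, b))

def best_match(query, candidates, weights):
    """Return the candidate address with the highest blended score."""
    return max(candidates, key=lambda c: match_score(query, c, weights))

# Made-up addresses for illustration only.
register_2014 = ["Flat 2, 14 High Street", "14a High Road", "27 High Street"]
weights = idf_weights(register_2014)
print(best_match("14A HIGH RD", register_2014, weights))
```

Treating ‘14’ and ‘14a’ as different numbers is a deliberate choice in this sketch, since they usually denote different dwellings; a production matcher would also need a minimum score below which no match is accepted.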
Table 1.1 Changing household characteristics, 2013–14.

Household type | Number of households
Stable household | 19,940,359
moves, such as distance and deprivation. This insight can then be used to allocate the non-unique name holders into the most likely origin-destination pairings.

1.6.1 Unique names

As our models are largely based on the linkage of unique occurrences of names between our movers databases, it is important to understand the connotations this may have when attempting to represent the wider population. Most full names are relatively uncommon. For example, in the 2013 register, 18.3% of the population have unique full names and 50% of adults share their names with fewer than 16 other individuals. Figure 1.3 displays the cumulative frequency for full names in 2013. However, in addition to considering unique names alone, by pooling all of the names within households that change address, our models will also consider many individuals with more common names. For example, while there are over 11,000 David Smiths and 6,000 Margaret Smiths (the most common male and female names respectively) in the 2013 register, there are fewer than 140 households comprising these two names together, despite it being the third most common household name composition.

Figure 1.3 A cumulative percentage of the frequency of full names in 2013 (x-axis: frequency of full names; y-axis: cumulative percentage).

Figure 1.3 also labels the nine most popular names in the 2013 data, all of which are white British male names. We have considered that a large proportion of adults with unique names may have international heritage. Therefore, to explore the relationship between ethnic heritage and name popularity we ran all of the names from the 2013 register through a names classifying tool called Onomap (www.onomap.org). The tool assigns each name (considering both forenames and surnames) to its most likely cultural, ethnic and linguistic group and was produced from clustering an extensive database of forename-surname pairs (Mateos et al, 2007). The proportion of name-inferred ethnic groups for the 2013 Consumer Register and a subset of those with unique names only is shown in Table 1.2. The percentages of British ethnic groups as recorded in the 2011 Census have also been included for comparison.

Although reliant on names as proxies of cultural heritage, the analysis suggested that the Consumer Register slightly over-represents the White British population. This assumption is reasonable given that the Electoral Register is known to under-enumerate ethnic minorities. Although the precise sources of the consumer data are not known, ethnic minorities are also known to be under-represented in large customer loyalty databases. As anticipated, the under-representation of the White British population is considerable amongst adults with unique names. For example, names identified as ‘other white’ background were over 3.5 times as prominent in the unique names subset relative to the original data. This reflects the range and diversity of European names. We, therefore, need to consider that although we have devised a
novel way of estimating internal migration, a greater proportion of the modelled flows may be representative of those with international heritage.

Table 1.2 The proportions of ethnic groups for the 2013 Consumer Register, a subset of adults with unique names only, and the UK 2011 Census.

2001 Census Ethnic Group | Consumer Register | Unique names only | 2011 Census (excl. NI)
A) White - British | 84.36% | 60.56% | 81.47%
B) White - Irish | 3.79% | 4.55% | 0.95%
C) White - Any other White background | 3.77% | 13.30% | 4.32%
H) Asian or Asian British - Indian | 2.02% | 4.48% | 2.36%
J) Asian or Asian British - Pakistani | 1.78% | 3.18% | 1.91%
K) Asian or Asian British - Bangladeshi | 0.40% | 0.70% | 0.73%
L) Asian or Asian British - Any other Asian background | 0.17% | 0.74% | 1.40%
M) Black or Black British - Caribbean | 0.04% | 0.14% | 0.98%
N) Black or Black British - African | 0.79% | 2.67% | 1.66%
R) Other ethnic groups - Chinese | 0.43% | 0.86% | 0.70%
S) Other ethnic groups - Any other ethnic group | 1.67% | 4.68% | 0.55%
Y) Unclassified | 0.78% | 4.14% | NA

1.6.2 Representing migration

In total, our model estimated the origin and destination of 762,359 individuals. In addition to these, our model also identified a further 100,000 cases where adults moved within the same postcode. We have considered that these movers may have remained at addresses that were recorded differently in the two registers. Therefore these individuals are not included in the subsequent results.

By joining the postcodes to the ONS Postcode Directory, it was possible to observe spatial trends in modelled migration. Most moves tended to occur over relatively short distances, which corresponds with known migration traits within the UK (Stillwell and Thomas, 2016). Our median distance was just 19.7 miles as the crow flies, whilst the mean was 66.1 (Figure 1.4). The Royal Mail identified that the average distance of movers which could be identified by their redirection service is just 25.83 miles (Royal Mail, 2017). However this service is likely to be biased towards home owners.

We have presented the key spatial trends as a flow map, which displays the interactions between local authorities in Great Britain (Figure 1.5). In order to convey only the key trends in the data and avoid issues of disclosure, only flows of at least 40 persons are shown. In addition, we have also included moves within each district. These are displayed as proportional symbols in the centre of each authority.

Most moves between 2013 and 2014 occurred within the same local authority district. It is also observable from Figure 1.5 that a large proportion of flows are between neighbouring local authorities. It is also interesting to predict migration between regions, and observe how it may vary from officially recorded statistics from the 2011 Census. Although recorded differently and
Figure 1.4 A histogram of the distance moved by adults in the Consumer Registers (x-axis: distance in km; y-axis: frequency).

Figure 1.5 Flows of home movers between local authorities.
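The crow-flies distances summarised in Section 1.6.2 and Figure 1.4 can be sketched as follows. This is an illustration only: it assumes each move has already been given origin and destination coordinates in decimal degrees (for example from a postcode lookup), it uses the haversine great-circle formula rather than whatever distance calculation the authors used, and the function names and sample coordinates are hypothetical.

```python
import math
from statistics import mean, median

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def crow_flies_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def summarise_moves(moves):
    """Return (median, mean) distance for moves given as (lat1, lon1, lat2, lon2)."""
    distances = [crow_flies_miles(*move) for move in moves]
    return median(distances), mean(distances)

# Made-up moves: two short local moves and one long-distance move.
moves = [
    (51.50, -0.10, 51.51, -0.10),
    (51.50, -0.10, 51.52, -0.10),
    (51.5074, -0.1278, 51.4545, -2.5879),
]
med, avg = summarise_moves(moves)
print(f"median {med:.1f} miles, mean {avg:.1f} miles")
```

A handful of long moves pulls the mean well above the median, which is the pattern one would expect from a distance distribution dominated by short moves.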
absent of any children, the trends identified by the Consumer Registers were similar to the official statistics from 2011. Flows between regions as recorded from the Consumer Registers and the 2011 Census are displayed in Figure 1.6.

Figure 1.6 Chord diagrams representing the proportion of moves between regions as identified from the Consumer Registers (left) and the 2011 Census (right).

The migration model also presents an opportunity to gain an understanding of segregation, social mobility and asset accumulation through geographic data linkage. There is an assumption that geographic mobility and social mobility are intrinsically linked as people generally move to improve their life chances (Savage, 1998). Focusing on the English Index of Multiple Deprivation (IMD), it is possible to observe the social trajectory of internal migrants by considering the deprivation ranks of their origin and destination Lower Super Output Areas (LSOAs). To demonstrate the key trends in our data, we have aggregated all of the English LSOAs into IMD quintiles and observed the flows of migrants between them (Figure 1.7).

For each quintile, the most popular out-flow feeds back into the same group. The next largest flows are those into the adjacent quintiles, which suggests that there is still only limited social mobility in England. Only a minority of migrants move between places of drastically different levels of deprivation. Interestingly, there was only a slight majority of upwardly mobile flows over downwardly mobile flows. Whilst this could highlight that migration is no longer more abundant amongst socially mobile adults, it is probably also due to adults moving between living with parents, rental accommodation and eventually home ownership. House prices have made many of the least deprived neighbourhoods unaffordable for first-time buyers (Dorling, 2015). In addition, there are also occurrences of elderly relatives moving in with family or to assisted accommodation. Indeed, these results can also be explained by the fact that most moves occur over relatively short distances and deprivation is positively spatially autocorrelated. Figure 1.7 also identifies addresses that were sold in 2013 or 2014. It is also noteworthy that a greater proportion of moves where a house was purchased occurred for movers moving to and from the least deprived parts of the country.
Figure 1.7 An alluvial plot of migration between different quintiles (1 to 5) of the 2015 English Index of Multiple Deprivation in 2013 and 2014, where the lowest quintiles are most deprived. Moves to addresses that were sold in 2013 have been coloured green.
and fill in data gaps.
phenomenon than previous work with more ‘traditional’ datasets such as national censuses. These depictions would have widespread applications in a broad range of public service decision-making processes from transport planning to health.

It is therefore paradoxical that despite the growing abundance of these data, their use by academic researchers has been limited. This, perhaps understandably, is partly due to the data’s origins in privately owned businesses and their secure storage requirements, since they provide information about consumer transactions, residential locations, movements and interactions. This raises substantial ethical and legal considerations with regard to disclosure control, anonymisation and privacy. Safeguards are therefore required, especially where geographic information technologies facilitate the linkage of these data to the likes of administrative or alternative spatial data sources. Combined, these aspects have generated substantial barriers to advancing understanding of their fitness for purpose outside of commercial contexts. Yet, access to these data via the CDRC has provided a means of beginning to overcome such obstacles, allowing exploration of their dynamics and the challenges encountered when attempting to apply these data in research. For example, these data are adequate from a retailer’s perspective, as variables are created and data interpreted with the primary focus of understanding and maximising the buying behaviours of their customer base. Conversely, academic endeavours strive to obtain rigorous representative data for their population of interest and therefore tend to prefer official statistics collated by government. It would be impractical to assume that these kinds of consumer data will meet the ‘gold standards’ of national statistical datasets in terms of both their quality and representativeness; therefore, understanding their applications for research purposes requires preliminary considerations such as the completeness, accuracy, bias and validity/plausibility of the data. As we demonstrate below, access to the CDRC HSR loyalty data has facilitated a better understanding of the nature of these data in their raw form. These insights are foundational to a pragmatic approach to utilising these data in wider research and facilitate an appraisal of the potential for such research to offer substantive insights into social and geographical phenomena.

2.3 Loyalty cards as social and spatial data

Loyalty cards typically produce very rich temporal data on consumption patterns. Whilst these behaviours may arguably provide a very useful context of socio-demographic characteristics, loyalty data comprise little explicit socio-demographic information. However, customer postcodes provide a valuable means of linkage to conventional statistical geographic units and data associated with them such as existing national statistics. This provides a number of advantages. For example, it allows measures of neighbourhood type, population characteristics or cultural background to be appended to individual customer records, which permits interpretation of how consumer behaviours may vary with population characteristics. This also allows for the identification of potential biases in the data. Conversely, these data offer a variety of attractions for our current understanding of geodemographic phenomena. Geodemographic classifications are widely used in business and public service organisations, yet are typically derived from surveys such as national Censuses, which may have limited sample sizes that can be affected by non-response rates, a coarse spatial scale and low temporal granularity (collected on a decennial basis). In addition, whilst traditional classifications provide valuable local indicators, and there have been contributions towards daytime indicators with the production of small area workplace statistics, human identity
encapsulates more than the duality of work and residence (Longley, 2017). It is therefore increasingly important to incorporate more appropriate representations of individual trajectories with finer temporal granularities, such as those that represent the dynamics of day-to-day activities. The emergence of novel forms of Big Data, such as those from loyalty cards, offers the potential to facilitate a more sophisticated view of this phenomenon, providing voluminous consumer data that are not compromised by uneven response rates, can be updated on a regular basis and permit consistent comparison between different behavioural datasets on a relatively granular scale (over 1.4 million postcode units across the UK). Such information may provide an enriched description of what makes people, or groups of people, distinctive. However, utilising these data in this context also gives rise to a number of shortcomings, such as the well-established issue of ecological fallacy when aggregating data to a small-area level (i.e. confounding the characteristics of areas with particular

Figure 2.1 Customer residence to store flows – Central Lowlands, Scotland.
2. The Provenance of Customer Loyalty Card Data 33
into their daily travel obligations, of which the majority of research to date has only been able to utilise self-reported travel diaries of relatively small sample sizes. In addition, the vast majority of research into trip chaining has focused on home to work based trips only, despite work-related travel not representing all activities that are undertaken (e.g. leisure and tourism; Primerano et al., 2008). Loyalty consumption patterns could provide insight into how activities change over time, or how interactions with increasingly popular online alternatives (such as click and collect or home delivery) may affect subsequent behaviours. The data produced by loyalty cards allow us to investigate a broad range of variables relating to mobility, such as distances travelled, size of store networks and the characteristics of locations that individuals visit over time. These insights have important implications for planning decisions and policies in urban environments and also issues relating to high street retail e-resilience.

By incorporating the temporal element of these movements, we can further utilise these data to understand more complex socio-spatial characteristics. This evolving research may enable us to summarise daily activity patterns in both time and space. Figure 2.2 shows an example of patronage flows from customer residences using lunchtime, weekday transactions of HSR customers, compared to self-reported origin to workplace destination flows from the 2011 Census.

Figure 2.2 Daytime flows card vs. census.

Such comparisons suggest that loyalty card data may be able to provide us with the means to understand daytime activities within the general population, help us to better understand aspects such as the connectedness of various locations over different temporal periods (e.g. daily, weekly, seasonal) and – ultimately – aid the construction of geo-temporal profiles. These temporally integrated analyses postulate that people are influenced not only by where they live, but also by places they visit, when they visit them and who they interact with. It is our expectation that loyalty card data, both alone and in combination with other datasets, will advance our knowledge of the functional relationships between places given the volume of interactions between different social, economic and demographic groups that they are able to capture.
Figure 2.3 Summary diagram of issues affecting data.
For example, the loyalty customer base for example, by analysing distributions Figure 2.4
may be subject to an underlying self- of age and gender characteristics present Proportions of customers
by OAC for loyalty
selection process such as customers who in the data. In addition, we can attempt to customers vs. census –
are more money conscious or receptive quantify dynamics by drawing comparisons group level.
to special offers being more likely to between existing geodemographic
participate, whereas those with privacy classifications. Figure 2.4 demonstrates an
concerns likely being deterred. Variations example of the volumes of HSR customers
in behaviour as a result of individual/ across Output Area Classification (OAC)
psychological dispositions to participate groups in comparison to Census estimates.
can be investigated to some extent by These classifications categorise the general
comparing loyalty card and non-card UK population based on socio-economic
transactional data. However, making direct characteristics obtained from the 2011 Census.
comparisons can be problematic due to
non-card data also comprising instances It is clear that certain groups are
where a cardholder did not use their card disproportionately represented by these
with a transaction. Yet, using a data-driven data in terms of both their characteristics
approach, demographic biases can be and geographic locations, with more
investigated by drawing comparisons affluent groups likely being over-
with existing national population statistics, represented (particularly ageing suburban
stores (i.e. smaller stores located in urban areas). Conversely, higher levels of participation are observed in destination locations, such as city centre flagship stores. This is likely due to the importance of larger basket sizes in these store types, which produce higher loyalty participation (i.e. due to the perceived benefits of more expensive purchases). A product type bias is also evident, with cards more likely to be used with higher value items. The implications of these trends are that, firstly, the distribution of behavioural data will be influenced by the characteristics of a store location, and secondly, if analysing individual product buying behaviours, it is important to consider that the purchasing of certain products may be over- or under-represented based on the propensity to use a card for that particular item. These aspects have important implications for the mobility and product-buying analyses outlined in Section 2.3, as the completeness of individual trajectories may be influenced by these differing motivations to participate. Despite this (due to the enormous volume of overall data) there is still a vast amount of data produced by loyalty cards available across all store locations and product types.

Finally, due to the lack of data collection control, there may be elements of uncertainty regarding the completeness, accuracy and validity of these data. Assessment of data completeness and accuracy may be particularly important in the case of loyalty card metadata, as this information is entirely dependent on accurate human input at the time of enrolment. Simple exploratory analyses can be applied to identify basic errors in these data, such as invalid postcodes or illogical age ranges. However, a more complex issue is that the accuracy of customer postcodes is also dependent on the motivation of a customer to update this information in the event of a location change. For example, we have demonstrated that through the linkage of locational and behavioural attributes, we are able to identify those in the loyalty card population (approximately 3.5%) with stated addresses that may no longer be their usual place of residence. Although this is reassuring since it indicates that the vast majority of loyalty data likely contain valid spatial references, it does also suggest errors assumed not to be present in official statistics. This has important implications if using the postcode information as a key spatial reference to infer social and spatial processes, and efforts should be made to identify potentially spurious patterns before utilising the data. However, we also highlight how these errors are not random and can be disproportionately ascribed to certain segments of the loyalty population – primarily students and other groups who are likely to have particularly transient residential locations. Therefore, we can make attempts to identify customers who are most at risk of exhibiting these data errors.

2.5
Conclusion

Loyalty card data offer an untapped opportunity for researchers to analyse societal and geographical questions in an entirely new way. They represent large numbers of people and allow analyses at a variety of spatio-temporal scales. However, there are a number of preliminary considerations and pragmatic steps required to ensure these data are fit for purpose in a research context. For example, loyalty cards represent large but selective samples, carry inherent socio-economic and spatial biases and present elements of data uncertainty. It would be impractical to assume that these kinds of consumer data will meet the standards of national statistical datasets, yet it is suggested here that pragmatic actions can be taken to ensure the quality of these data is both understood and considered. As such, the preliminary focus when applying these data should be suitably based on the initial assessment and quantification of inherent data quality issues, such as those outlined in Section 2.4. Careful consideration of these
characteristics may facilitate extraction of insights that were not previously possible nor practically obtainable using traditional methodologies. These cautions mirror those adopted in traditional methods of data handling in regards to data quality and sampling bias; however, due to the nature of Big Data collection, efficient methods of revealing these inherent data issues require exploration.

An important future direction is therefore to continue to develop new methods of handling and analysing these data. Traditional statistical methods have been focused on data-scarce science, where aims are to identify significant relationships from small, controlled sample sizes with known relationships. Developments in Big Data research may involve applying data-driven approaches to quantify uncertainty within the data, continual critique and truth propagation and using contemporary social and geographical theory to support the reliable use of these new kinds of data sources. Beyond this, future prospects are concerned with gaining a robust understanding of the applications of these data to advancing our knowledge of population dynamics in respect to consumption behaviour, daytime activities, mobility patterns, spatio-temporal dynamics and the relationship of these patterns in regards to consumer attitudes and lifestyles. Moving towards constructing spatio-temporal classifications using these kinds of data may advance our knowledge of relationships between places in terms of the volumes of interactions generated by different social, economic and demographic groups over different temporal periods. It is further possible, through the availability of common spatial keys such as postcodes, to draw comparisons between classifications derived from alternative spatially referenced datasets. This offers the potential to, firstly, bridge gaps between issues of representation that are inherent in these data, but also to create an enriched view of social and spatial processes based on a broader range of information. We may expect to see correspondence between the clusters derived from loyalty cards and the categories of traditional classifications for example, which will ultimately provide an enhanced description of what makes certain groups of people distinctive.

The development of these applications will be made possible through the continuing data collaborations facilitated by the CDRC, which provides a means of utilising data of a personal nature, whilst adhering to important disclosure controls. It is critically important that analyses of this nature endeavour to achieve outputs that are both informative and safe, especially where data linkage is concerned. Nevertheless, the prospects of loyalty card data as a social and spatial data source present promising applications for the use of large consumer datasets in social science research.

Further Reading

Cortiñas, M., Elorz, M. and Múgica, J. M. (2008). The use of loyalty-cards databases: Differences in regular price and discount sensitivity in the brand choice decision between card and non-card holders. Journal of Retailing and Consumer Services, 15(1), 52-62.

Longley, P. A. (2017). Geodemographic profiling, In The International Encyclopedia of Geography. Wiley and the American Association of Geographers (AAG).

Loyalive (2015). Loyalive – an introduction. URL no longer available.

Primerano, F., Taylor, M. A., Pitaksringkarn, L. and Tisato, P. (2008). Defining and understanding trip chaining behaviour. Transportation, 35(1), 55-72.

Wright, C. and Sparks, L. (1999). Loyalty saturation in retailing: Exploring the end of retail loyalty cards? International Journal of Retail & Distribution Management, 27(10), 429-440.

YouGov (2013). British shoppers in love with loyalty cards. Online: yougov.co.uk/news/2013/11/07/british-shoppers-love-loyalty-cards/

Acknowledgements

The authors thank ‘High Street Retailer’ for providing transaction data to enable us to carry out this research. The first author’s PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.
3
format, which hampers their use for research purposes.

company Geolytix for the year 2013 and were available as open data.
locations were typically the result of the two-dimensional representation of retail units within multi-storey buildings. Thus, the removal of duplicates (any points within a 2 metre radius from another point) was carried out.

3.3
Estimating retail centre location and extent: methods and calibration

Cluster analysis is a collection of unsupervised learning methods that address the issue of grouping a set of objects based on similarity. Many commonly used clustering algorithms make group allocations with the objective of increasing similarity within a cluster and increasing dissimilarity between clusters. Other commonly used clustering techniques such as density-based algorithms seek dense regions separated by low density regions, while model-based methods assume that the data come from a mixture of probability distributions, each of which represents a different cluster. Cluster analysis is a multivariate technique (multiple attributes of the phenomenon under investigation can be used), but in this study it is strictly spatial, utilising only the locations of the retail units. This is an appropriate approach for the identification of retail agglomerations where the extent of the clusters is determined by spatial discontinuity in unit distribution (Dearden and Wilson, 2011).

To estimate the definition of retail centres, the following clustering methods were evaluated: DBSCAN (Ester et al, 1996), Quality Threshold (Scharl and Leisch, 2006), Kernel Density Estimation (Azzalini and Torelli, 2007), Random Walk (Csardi and Nepusz, 2006) and K-means (Lloyd 1982). As will be described, all of the clustering methods evaluated require the calibration of tuning parameters that we selected to optimise using the S_Dbw internal evaluation indicator (Halkidi and Vazirgiannis, 2002). As such, the process of calibrating each clustering method was carried out prior to implementation in the evaluation by identifying suitable starting values (for those tuning parameters for which a single value could not be determined), then producing a number of different models within a range of values and finally selecting the optimal model based on the S_Dbw index.

DBSCAN is probably the most prevalent density-based clustering method, and it requires the specification of two tuning parameters: the radius and the minimum number of nearest neighbours from a focal point. It can identify clusters of arbitrary size and shape, it is computationally efficient and is robust to the presence of outliers. However, the biggest drawback of DBSCAN is its limited sensitivity for datasets with varying point densities.

K-means is the most frequently used clustering method and requires the specification of a single parameter which is the number of clusters in the dataset. It has the disadvantage of producing clusters of convex hull shape but it has low computational complexity; however, given its popularity, it was used as a benchmark against the other clustering methods.

The quality threshold method requires specification of two parameters: the maximum diameter of the clusters and the minimum number of neighbours within a cluster. The method has the advantage that its parameters are relevant in the context of identifying retail agglomerations and it is also robust in the presence of outliers. However, given that it is stochastic, it suffers from long running times.

The non-parametric Kernel Density Estimation (KDE) method combines KDE with graph structures and algorithms. It requires the specification of a single tuning parameter, and given that it is non-parametric, it is insensitive to the data distribution. However, similar to the quality threshold method, it is stochastic and suffers from long running times.
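
To make the two DBSCAN tuning parameters concrete, the following is a minimal pure-Python sketch of the algorithm, not the implementation used in the study. The function names, the sample coordinates (assumed to be in metres) and the choice to count a point as its own neighbour are all illustrative assumptions.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for outliers."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = noise          # may later be claimed as a border point
            continue
        cluster += 1                   # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # expand only through core points
    return labels

# Hypothetical retail unit locations: two dense groups and one isolated unit.
group_a = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
group_b = [(500, 500), (510, 500), (500, 510), (510, 510), (505, 505)]
units = group_a + group_b + [(1000, 0)]
labels = dbscan(units, eps=50, min_pts=4)
```

With these assumed inputs the isolated unit is labelled as an outlier rather than forced into a cluster, which is the robustness property noted above. A single global radius would not suit both groups if their densities differed markedly.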
44 CONSUMER DATA RESEARCH: PART ONE
[Map figure: surviving label ‘Winchester’; scale bar 0, 75, 150 km]
established as well as being available
3.4
Centre definition and evaluation
assist with input parameter specification and testing during the calibration process described in the previous section. Secondly, boundaries for the 339 largest ‘retail places’ in GB were acquired from Geolytix, and although they represent only a subset of total retail boundaries, they nevertheless provide an additional and relatively large sample of independently created retail area extents suitable for comparison.

Table 3.1 presents the overall evaluation results from the qualitative comparison for all of the eight study areas. In most cases, the DBSCAN method provided results that were more consistent with those formal definitions created from the respective local authorities. Importantly, DBSCAN was the most efficient method in terms of computing resources, which is particularly significant for a national extent study. In addition, it was easier to identify starting values for the parameters of the method, while one of the strongest advantages of DBSCAN was the identification of outliers.

Table 3.1
Results from the qualitative comparison of the clustering methods in eight locations across Great Britain.

It is clear from the results that DBSCAN performed well for the case study selection; however, this method is known to underperform in areas where the density is not uniform (Everitt et al, 2011). Such an issue also becomes apparent when looking at the range of the optimal epsilon values that were used for the selected areas (Table 3.2). If a single global epsilon value had been used for all case studies, it would have resulted in suboptimal local results. As such, we developed a refinement to the method which involves splitting of the national-scale data into more homogeneous areas for separate treatment; with the challenge being that unlike the case study evaluations, this required automation given that coverage was for the national extent.

Table 3.2
Optimal epsilon values used by DBSCAN in the selected study areas.

Study Area           DBSCAN epsilon (metres)
Abertillery          84
Bristol              119
Cardiff              120
Clapham Junction     70
Glasgow              70
Inverurie            120
Winchester           80
Wolverhampton        91
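
Local epsilon values of this kind can be derived from the data rather than set by hand. The sketch below follows the rule the chapter reports later for the national run: epsilon is the 95th percentile of the 4-nearest neighbour distance, bounded between 80 and 170 metres. The function names, the nearest-rank percentile definition and the reading of "4-nearest neighbour distance" as the distance to the fourth nearest neighbour are assumptions, and the grids are invented test data.

```python
from math import ceil, dist

def kth_nn_distances(points, k=4):
    """Distance from each point to its k-th nearest neighbour (needs > k points)."""
    distances = []
    for i, p in enumerate(points):
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        distances.append(others[k - 1])
    return distances

def percentile(values, pct):
    """Nearest-rank percentile; the chapter does not say which definition was used."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def local_epsilon(points, k=4, pct=95, lower=80.0, upper=170.0):
    """Percentile of the k-NN distance, clamped to the [lower, upper] range in metres."""
    eps = percentile(kth_nn_distances(points, k), pct)
    return min(max(eps, lower), upper)

def grid(spacing, size=5):
    """Hypothetical square grid of retail unit coordinates, in metres."""
    return [(x * spacing, y * spacing) for x in range(size) for y in range(size)]

eps_dense = local_epsilon(grid(10))     # raw value below the lower bound
eps_medium = local_epsilon(grid(60))    # raw value inside the range
eps_sparse = local_epsilon(grid(100))   # raw value above the upper bound
```

The clamp reproduces the two stated purposes of the bounds: the upper bound stops sparse areas from pulling outliers into clusters, and the lower bound stops very dense areas from yielding a radius small enough to split a single large shopping mall off as its own cluster.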
Figure 3.2
The point data are represented as a sparse graph using a distance-constrained k-NN sparse matrix. DBSCAN is first applied in an exploratory approach. The neighbouring clusters (that share a common edge) with similar point density are selected forming a new study area of homogeneous point density, where DBSCAN is iteratively applied until no cluster can be formed.

as outliers. Following this, a new study area of homogeneous point density is created from the selected points and DBSCAN is applied again to identify the clusters. The selected clusters are then removed from the graph representation of the point data, and the process of using an exploratory DBSCAN model to identify a cluster and select those neighbouring clusters with similar point density is iteratively carried out until no cluster can be formed. This process is summarized in Figure 3.2. It should be noted that one of the advantages of the methodology is that it is no longer required to optimise the clustering solution using the S_Dbw index, which results in a faster algorithm.

To evaluate the point density similarity among clusters, the standard deviation of point density in a subgraph was used. More specifically, those neighbouring clusters with point density within 1 standard deviation from the point density of the initially selected cluster were also selected, with the assumption being that they define an area of homogeneous point density. To test the sensitivity
of the method to the standard deviation threshold, five different values were considered, 0.6, 0.8, 1.0, 1.2 and 1.4. As can be seen in Table 3.3, the clustering solutions are practically identical when looking at the number of clusters produced and the distribution of the local epsilon value.

Table 3.3
Summary values of five clustering models with different standard deviation thresholds.

For the parameter values required by DBSCAN, as detailed earlier, the value of the minimum points parameter was set equal to 10 and the epsilon value was calculated as the 95th percentile of the 4-nearest neighbour distance. However, the epsilon value was only allowed to vary between an upper bound of 170 metres, which was found to be useful to exclude outliers from being identified as members of clusters, and a lower bound of 80 metres, which was used to avoid identifying certain large shopping malls as clusters. This necessity is a consequence of the hierarchical nature of retail centres within GB given that the objective of the analysis was to create clusters that were inclusive of the different functional retail forms. Following the application of DBSCAN to each subgraph and the extraction of 2,920 clusters, the final retail agglomerations were compiled and each retail location was assigned an identifying number denoting cluster membership. The results derived from this new method were compared to data supplied by Geolytix, which represent the only freely available and independently created national sample of contemporary retail centre extents. They provide frequent updates of a dataset of retail places across the UK, part of which (339 places) were licensed as open data in 2012. The Geolytix boundaries are produced using multiple variables (including the locations of retail units) with information that was collected at least three years prior to the data that were used in our analysis. Additional causes of difference between the two datasets might also include the different objectives and notion of what constitutes a retail centre (Geolytix did not use a threshold of minimum 10 retail units), and only the boundary polygons from the clustered locations of the retail units were available. Given that the creation of similar polygon boundaries for our output may have resulted in an additional source of error, it was decided to compare the Geolytix boundaries against the retail unit locations and associated clusters. The comparison was based on two metrics, the ‘n-ary’ relation between the two datasets, and the proportion of points within the Geolytix polygons. The n-ary relation returns a score where the higher the number of clusters that had a one-to-one relation with the clusters identified by Geolytix the better the relationship.

Data pre-processing removed the major out-of-town retail parks from the Geolytix dataset, which was followed by a spatial
3. Retail Areas and their Catchments 49
and private organisations across the country it can be anticipated that these results will prove to be valuable for research and analysis.

With the developed methodology being open source, it will also be straightforward to update the retail boundaries on a regular basis, and potentially apply the suggested method within a context of historic data. Finally, given the variety in point density, size and shape of the retail clusters in the dataset it would be reasonable to assume that the methodology could be applicable with different datasets and for different

Dearden, J. and Wilson, A. (2011). A framework for exploring urban retail discontinuities. Geographical Analysis, 43(2), 172-187.

Ester, M., Kriegel, H. P., Sander, J. and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Online: www.aaai.org/Papers/KDD/1996/KDD96-037.pdf

Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis. 5th ed. Chichester, Wiley.

Halkidi, M. and Vazirgiannis, M. (2002). Clustering validity assessment using multi-representatives. Online: lpis.csd.auth.gr/setn02/poster_papers/237.pdf

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28, 128-137.
Acknowledgements
enable users to examine these traits of any name that is part of our dataset. Users will also be able to interrogate the data in order to identify the most prevalent surname by country lists or intra-country distributions of popular single-country-origin names.

As such, the project is clearly a Big Data project that is vulnerable to the vagaries of Big Data sources that are discussed at various points in this volume. Some of the data sources are acquired under licence, with restrictions upon how they may be redistributed, particularly those that formed part of the original Worldnames 1 project. The countries that have formed the focus of our renewed attention on the project are openly available on the web, without needing a login or subscription to access, and from the original source, rather than from other consolidator sites with related foci of interest. We also exclude datasets whose custodians did not, in our opinion, intend the data to be made available for wide public use, albeit in aggregate form. These judgments are inevitably subjective and our intention is to avoid any legal infringements arising from reuse of data for new purposes. In particular, we have avoided using data published by third parties without the consent of the original owner to this end. From this standpoint, newspaper republishing of time-restricted electoral lists would be considered to be valid but database dumps, obtained as a result of a breach of security or insider leaks, would not be used. This distinction is not always clear cut, and requires decisions to be made on a case-by-case basis – for example, data obtained from the WikiLeaks service and similar investigative journalism projects, or those where the original source and authority to publish is unclear. To simplify processing a vast array of diverse datasets, a number of simplifications are applied. The western-style naming convention of a given (assigned) first name followed by a (typically hereditary) family name is assumed, with other name structures (e.g. Spanish double surnames) simplified. The ‘Western Latin’ alphabet is used, with accents and capitalisations removed and the only non-alphabetic characters allowed are apostrophes and dashes – these being combined anyway with pure-alphabetic variants. This is necessary to accommodate the inconsistent ways in which names are stored in the official records that are typically used in the project. For example, MacDonald can appear, in different datasets across different countries, as Mac Donald, MACDONALD, Mac.Donald and Mac-Donald. Other non-alphabetic characters, such as spaces and underscores, are replaced or removed as judged appropriate.

It is acknowledged that, with a project of this scale, using hundreds of diverse datasets, such simplifications will potentially obscure helpful demographic information; this is minimized where practical. We have retained the names in the original forms captured, however, in order to allow the incorporation of other accents in future spin out projects from the research.

4.4
Data acquisition and processing methodology

4.4.1
Search-based initial discovery

To ensure a reasonable level of quality and a high geographical and demographic representation for each country are maintained, we carry out the data collection manually, rather than creating a ‘bot’ or ‘spider’ to crawl the web automatically. This also presents opportunities to discover additional unindexed datasets with intelligent URL modification by the investigator.

This means that, for each of the ~200 countries of the world, a different collection process is employed, built up by starting from a set of common principles detailed below, but then refined as name data are discovered from the current active country.
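
The name simplifications described before Section 4.4 (accents and capitalisation removed, apostrophes and dashes kept, and those variants then combined with the pure-alphabetic form) can be sketched as follows. This is an illustrative reading of those rules rather than the project's own code: the function names are hypothetical, and simply dropping spaces, dots and underscores is an assumption, since the text says such characters are "replaced or removed as judged appropriate".

```python
import re
import unicodedata

def simplify_name(raw):
    """Lower-case a name, strip accents, keep only a-z, apostrophes and dashes."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: every other character (space, dot, underscore) is removed.
    return re.sub(r"[^a-z'\-]", "", no_accents.lower())

def match_key(raw):
    """Pure-alphabetic key under which dashed and apostrophe variants combine."""
    return re.sub(r"[^a-z]", "", simplify_name(raw))

variants = ["MacDonald", "Mac Donald", "MACDONALD", "Mac.Donald", "Mac-Donald"]
keys = {match_key(v) for v in variants}
```

In this sketch all five MacDonald spellings from the text collapse to a single key, while `simplify_name` still preserves the dash in `Mac-Donald`. Letters that have no accent decomposition (for example `ø`) would be dropped by this approach, so a real pipeline would need an explicit transliteration table for them.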
Figure 4.1
Map of St Lucia. St Lucia is approximately 40km in length (north to south) and 20km in width (east to west).

Saint Lucia is an island nation in the Caribbean with a population of approximately 186,000 people (Figure 4.1). Our names data come from the polling lists published by the Saint Lucia Electoral Department at www.electoral.gov.lc/polling-list. Saint Lucia’s top-level administrative areas are known as quarters; the constituencies are based on these quarters but with a number split or merged, to make 17 in total. These are then each further split into between 3 and 9 polling divisions, or precincts. The electoral data for each of the 84 precincts are listed on the website as paged tables, with a POST query needed to access each page.

The data listed on the tables include the given name, family name, street name, constituency, unique registration number, precinct and gender.

A Python script was used to send POST queries and download the HTML tables, and extract the data from them using regex into a CSV file. The data are believed to be relatively up-to-date, as the most recent election was in 2016. 162,025 records were extracted in this way, at first glance matching well with official summary information for the 2016 election suggesting that there were 161,883 registered voters. However, a large number of duplicated records were found – where the same record would appear on multiple pages in a table for a single precinct. On de-duplicating, 129,685 records remained, representing around 70% of the 2016 UN estimate of 186,383 people. The reasons for this large discrepancy are not clear, but, if the official figures are at fault, it would go some way to explaining the apparently low turnout of 57% and that the numbers of reported registered voters have increased at a much larger rate than the population in general, since 2000.

70% of the 2016 estimate is a plausible percentage, as electoral lists are not population lists – they typically exclude young people and foreign residents. Voters for a general election in Saint Lucia must be at least 18 years of age and either a Saint Lucian or a Commonwealth citizen who has resided in Saint Lucia for at least seven years.

Because of the readily available geospatial boundary data for Saint Lucia’s quarters, and the relatively small population of the nation as a whole, it was decided to sub-divide the population by a single level – quarters, but merged where the constituencies go across quarters, for the Worldnames 2 project – rather than further
4. Given and Family Names as Global Spatial Data Infrastructure 57
A small amount of logic, however, can be shared across countries. A number of countries in West Africa, for example, use the same off-the-shelf portal software for their government or public service websites, and so file locations discovered during the search for data for one country, can then be reapplied to additional countries using the same software.

The initial stage is to perform a simple search, typically using Google Search, for ‘obvious’ open-access lists of the greater part of a country’s population. These may take the form of public versions of electoral rolls or civil registry lists. These may be posted both on a country’s official portal, but also occasionally republished by citizens on private websites, for example the Chilean electoral roll can be freely obtained on disk and implicitly republished privately, but is not itself published online by the electoral authorities.

If a comprehensive list is not obtained, then it is necessary to search using distinctive surnames and then full names for any given country. It is necessary to ‘seed’ the search with certain key names which are popular in the country in question, but ideally not in neighbouring countries or other jurisdictions. To do this, reliable pan-national websites containing lists of famous people from a country, for example from Wikipedia, were used, as well as government and pan-national database-driven websites of elected government ministers or national election candidates – one type of data that are nearly universally published and that a number of separate projects, such as IFES Election Guide and International IDEA, are aiming to catalogue and maintain. The latter project also contains some information about the availability of online electoral rolls for each country and its approach to open data, including direct links to them where available.

As a rule of thumb in our research we deemed that at least three distinctive ‘seed’ surnames were required for any country. Google Search queries were then carried out using these surnames in conjunction with various combinations of numerical, country filter, data format and/or list keywords. Top-level country filters on Google Search were used, along with gov.xx and edu.xx second level domain filters (xx here the country’s top level domain). These filters restricted results to being from subdomains of the domains specified and, due to the nature of Google Search’s indexing, this more specific search often revealed additional results of interest. Format keywords can narrow large numbers of results returned to ones likely in the form of a downloadable, processable, list, for example ‘pdf’, ‘xls’ or ‘xlsx’. The CSV format is probably the simplest and easiest list format to parse but is little used by non-technology focused websites. A small number of useful sources were found in the more modern JSON file format. Adding sequential numbers, e.g. 1345 1346 1347 1348 can both reveal list-focused results, and ones with a likely population of (in this case) well over a thousand names. Finally, the inclusion of other key words in the search for distinctive document classes can also be useful – as with ‘cedula’ (national identification document) for Spanish-speaking countries. Other more generic words were also useful in our searches, e.g. ‘first name’, ‘given name’, ‘forename’, ‘last name’, ‘family name’, ‘surname’, ‘ID number’, or ‘candidate’. Translating these into different languages (typically using Google Translate) was also useful.

Somewhat counter intuitively, increasing the numbers of names in a query led to more search results being returned. This is another example, as mentioned above, of the heuristics applied in Google Search and other search engines, and the ways in which key words for websites are stored in the internal indices of the search engines. Where surnames alone did not reveal useful datasets, use of distinctive individual full
names was useful, since this focused searches on websites with a record for that person amongst a larger list of bearers of less distinctive or unusual names, or with a search function or directory index (as discussed below) that revealed additional data files. As a general point, we avoided searches using famous names since these directed focus away from general population names.

as documents, simple Python scripting was used to automate retrieval of large numbers of webpages, supplying appropriate GET/POST parameters on a consistent, sequential or known list basis, and simple processing and name extraction of the resulting HTML files retrieved. Where webpages listed a large number of documents, a bulk downloader browser extension ‘uSelect iDownload’ was used.
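
The processing half of such a script (name extraction from retrieved HTML, followed by the kind of de-duplication reported for the Saint Lucia polling lists) might look like the sketch below. It is not the project's script: the table layout, column names and sample rows are invented for illustration, retrieval over GET/POST is omitted, and treating a whole row as the duplicate key is an assumption.

```python
import csv
import io
import re

# Assumed layout: each record is one <tr> holding exactly three <td> cells.
ROW_RE = re.compile(
    r"<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>", re.S
)

def extract_rows(html):
    """Return (given name, family name, registration number) tuples from a page."""
    return [tuple(cell.strip() for cell in row) for row in ROW_RE.findall(html)]

def deduplicate(rows):
    """Drop repeated records while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def to_csv(rows, header=("given_name", "family_name", "registration_number")):
    """Serialise the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Two hypothetical result pages; one record is repeated across the pages.
PAGE_1 = ("<table><tr><td>Ann</td><td>Joseph</td><td>001</td></tr>"
          "<tr><td>Ben</td><td>Charles</td><td>002</td></tr></table>")
PAGE_2 = ("<table><tr><td>Ben</td><td>Charles</td><td>002</td></tr>"
          "<tr><td>Cara</td><td>Louis</td><td>003</td></tr></table>")

extracted = extract_rows(PAGE_1) + extract_rows(PAGE_2)
unique_rows = deduplicate(extracted)
csv_text = to_csv(unique_rows)
```

Comparing `len(extracted)` with `len(unique_rows)` gives the same sanity check the Saint Lucia case needed, where the raw extract appeared to match the official voter count until repeated rows were removed. Regular expressions are fragile against changes in markup, so an HTML parser would be a safer choice for anything beyond simple, regular tables.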
Figure 4.3
Map of Somalia.

There are no openly accessible names lists for Somalia (Figure 4.3) that provide widespread coverage of the country – not surprising in a country without an effective government controlling much of its territory. An additional problem, from the perspective of the Worldnames 2 project, is that documents are generally published in Somali, the native language, although this uses the Latin alphabet and so is relatively easily translated.

The best data source that was available was a number of PDF downloads from the Ministry of Education for the Federal Government of Somalia, at moesomalia.net/english/ (although the website has been updated since the data was extracted and so it is no longer available for direct download). The data files are lists of students that have received the national secondary school leaving certificate. The data include the full name of each pupil (with given and family names not split out), the mother’s name, year of birth, roll number and certificate number (neither unique to the country), academic year, score and issue date. Somalian names do not tend to follow the Western structure where family names are passed down each generation; instead the family name of a child is typically the given name of the father.

This is likely to be a highly selective sample, for example, completing secondary school may not be possible in large parts of the country due to safety concerns, there may be a tradition of only boys or only girls attending school in some areas, and so on.

The data sources, combined, contain 7,834 name pairs, representing just 0.07% of the population based on the 2015 UN estimate of around 11 million. Combined with the likely demographic biases discussed above, this means that the Somalia dataset will likely give a poor profile of given name and family name distributions in the country, and as such is included simply because of the desire to have as many countries as possible represented in the database – poor data being better than no data at all.
project. Some more recent (or legacy, where the names data were for jurisdictions that have recently changed) administrative boundaries were obtained from other projects – Natural Earth Data (www.naturalearthdata.com), OpenStreetMap data (osm.org) from Geofabrik Downloads (download.geofabrik.de) and the now discontinued MapZen Borders service (mapzen.com/data/borders), and various country-specific projects, often available through GitHub (github.com) repositories. QGIS (www.qgis.org) was used to process and organise the metadata associated with the geodata.

This included the creation and population of a globally consistent ID for all countries and the first and second level subdivisions, where available and used for certain countries. While other projects maintain such a list (e.g. HASC codes and ISO 3166-2 codes), our own system was used for maximum flexibility, particularly as occasionally customized topologies and aggregations had to be employed, depending on the name data available. The system is based on the ISO 3166-1 country codes, an administrative level number and a padded integer code for the unit. Occasionally, the country’s official codes were adopted for the latter part, where these were integer based.

For the world map of countries, a GeoJSON-format dataset from Natural Earth Data was used. Where countries had sub-country name data, the MapShaper website service was used to simplify the topology of the geodata, and one TopoJSON-format data file for up to two levels of sub-country area borders was created using it. TopoJSON (github.com/topojson/topojson) is a modern, flexible and highly compact file format.

4.5
Conclusion

The project thus far has collected individual data on approximately 1.7 billion individuals, plus a number of surname-only statistical breakdowns representing another 1.5 billion people (the majority of the names in this latter category being from China). Around 175 countries are represented, with an eventual aim to also include data for the remaining approximately 30 countries, albeit likely very simple statistical summaries of the most popular names.

As stated in the introduction, this project is quite different to other CDRC initiatives in its longevity (the first funding for this work was received in 2003), in the persistently high levels of public interest that it has generated – recording nearly a million visits a year to Worldnames 1 for several years following its launch (worldnames.publicprofiler.org/webstats/index.html) and articles in large-circulation media such as the Guardian and Daily Mail newspapers – and in the way that it has been conducted in spare time between funding streams. It is nonetheless important as the only CDRC project that purports to provide something approaching a global spatial data infrastructure, albeit founded on diverse, piecemeal and fragmented data sources.

This is, without doubt, a Big Data project – albeit one in which the search for and processing of appropriate data sources has been very labour intensive. We believe that this has implications for the wider practice – namely that Big Data have to be broadly understood before they are ‘ingested’, and that significant flaws in the content and coverage of data cannot be accommodated in subsequent analysis through blind application of sophisticated techniques. Spatial data are special by their very nature and geographic skills are foremost of those required to understand the possible sources and operation of bias in datasets such as Worldnames 2.

The greatest impact of the research to date has been upon the legions of amateur genealogists who are interested in understanding the geographies of their
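The identifier scheme described in this section (an ISO 3166-1 country code, an administrative level number and a padded integer code for the unit) can be illustrated with a short sketch. The chapter does not give the exact layout, so the hyphen separator, the four-digit padding and the function names below are assumptions made purely for illustration.

```python
def make_unit_id(country_code, admin_level, unit_number, width=4):
    """Build an ID such as 'SO-1-0007' (separator and padding are assumed)."""
    if len(country_code) != 2 or not country_code.isalpha():
        raise ValueError("expected a two-letter ISO 3166-1 alpha-2 code")
    if admin_level < 0 or unit_number < 0:
        raise ValueError("level and unit number must be non-negative")
    return f"{country_code.upper()}-{admin_level}-{unit_number:0{width}d}"


def parse_unit_id(unit_id):
    """Split an ID back into (country code, admin level, unit number)."""
    country, level, unit = unit_id.split("-")
    return country, int(level), int(unit)


# Level 0 is taken here to mean the country itself, level 1 its first-level
# subdivisions; this numbering convention is also an assumption.
country_id = make_unit_id("so", 0, 0)
region_id = make_unit_id("so", 1, 7)
parsed = parse_unit_id(region_id)
```

Keeping the country code and level inside the key makes it straightforward to group units by country or by level with simple string operations, which is one plausible reason for preferring such a scheme over opaque integers.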
the generations – both by the contagious diffusion of a surname from its known point of geographic origin (nearly a thousand years in the case of Anglo Saxon surnames, but less than a century for much of Turkey, for example) and by hierarchical diffusion cascading through the increasingly interconnected system of world cities. From these standpoints,

Mateos, P., Webber, R. and Longley, P. (2007). The cultural, ethnic and linguistic classification of populations and neighbourhoods using personal names. CASA Working Papers Series 116. Centre for Advanced Spatial Analysis, University College London.

Munzert, S., Rubba C., Meißner, P. and Nyhuis, D. (2014). Mapping the geographic distribution of names. In Automated Data Collection with R. Chapter 15, 380–395. John Wiley & Sons, Ltd.
DYNAMICS AND
CONSUMER DATA
INFRASTRUCTURES
5
71
Table 5.1
Comparison of ethnic groups across censuses and merged census ethnic groups for this study. Note: This table is modified from the table in Catney (2015).

Moreover, UK Censuses are usually updated only every ten years, which undermines the merits of census data when finer-grained temporal resolution is needed to capture population change in intermediate years. For example, a Polish immigration wave after Poland’s accession to the European Union in 2004 can only be observed as accumulated results using 2011 data (see Note 2) on ethnicity. This issue is further compounded by the omnipresent Modifiable Areal Unit Problem (Openshaw, 1984), which is unlikely to be wholly negated by census zone design considerations.

In the remainder of this chapter we seek to demonstrate that the linkage of consumer registers and use of names-based ethnicity classifications offers promising ways to begin to address the shortcomings of census data. First, ethnicity is assessed at the individual level in a less subjective way with a consistent method applied over selected inter-Censual years. Second, ethnicity data can be generated almost continuously because consumer data are generated in real time and are consolidated at least once a year. Thus, population dynamics can be measured at a much higher temporal resolution. Third, since consumer data can be geocoded at address level, they enable the examination of ethnic geographies across the fullest range of scales. Taken together, and used in association with conventional census data, consumer data can be used to identify highly granular geo-temporal patterns of segregation in contemporary Britain.
74 CONSUMER DATA RESEARCH: PART TWO
Figure 5.1
Workflow of formulating address-level ethnicity data from consumer data.
Figure 5.2
White and non-White proportion change over time. [Bar chart of Proportion (White, Non-White) by Year. White share as labelled: 1998 95.9%; 1999 95.8%; 2000 95.6%; 2001 95.5%; 2002 95.4%; 2003 94.9%; 2007 92.9%; 2013 92.3%; 2014 92.1%; 2015 92.1%; 2016 92.2%.]
classification is condensed into the merged 2011 Census ethnic groups (Table 5.1) for this study. A new version of Onomap, the Ethnicity Estimator, is currently under development and its main new feature will be its calibration with micro data on self-assigned ethnicity from the 2011 UK Census.

5.4
Ethnic diversity in contemporary Britain

We begin by exploring the ethnic composition change over time as derived from the address-level ethnicity data. The White ethnic group, including White British, White Irish, and Other White, constitutes the majority of Britain’s population: White British alone accounted for 85% of the population of Britain in the 2016 consumer data. By contrast, other ethnic groups such as Pakistani, Indian, Bangladeshi and Chinese together comprised less than ten per cent (Figure 5.2). The proportion of the White majority group fell from 95.9% in 1998 to 92.2% in 2016 according to the consumer data, although the absolute size of the White population increased over this period.

The White and non-White bipartition in Figure 5.2 can be further divided into finer ethnic categorisations to gain a more in-depth picture of the ethnic composition of Britain. Since the White British group is so predominant that it could easily overshadow patterns of other minority ethnic groups, it is excluded from the selected groups of the 2011 Census ethnic classification in Figure 5.3. Three years are chosen from the timeframe: 2001, 2007 and 2016. Year 2001 is the only available directly comparable reference point to any Census year, although further acquisitions are in prospect.

As suggested by Figure 5.3, Indian is the largest non-White group, with around 2.1% of the population in 2016. Pakistani is the second largest non-White community in Britain with 1.9% in 2016. Except for the Black African and Bangladeshi groups, an increase in the proportion across the three years can be seen for most ethnic groups. There has been a noticeable boost for the Other White group with an increase of around 1.1% in 2007 and 2% in 2016 compared with base year 2001. This is in accordance with the 2011 Census analysis on ethnicity of the non-UK born
Figure 5.3
Ethnic composition change of Britain in 2001, 2007 and 2016 (White British excluded). [Bar chart of Percentages (0.0% to 4.5%) for selected ethnic groups: Other White, White Irish, Indian, Pakistani, Other, Black African, Other Asian, Bangladeshi, Chinese, Black Caribbean.]
population, which claimed that 71% of the residents who identified themselves as Other White arrived in the UK between 2001 and 2011. It is suggested that the drastic increase for the Other White group is mostly due to the 2004 accession of several Eastern European countries into the European Union.

Some facts can be summarised from the ethnicity profiles above. Over the years, the relative share of the White majority population has decreased, although it has increased in absolute size. The White British group remains the largest ethnic category in Britain followed by the Other White group. Most of the minority ethnic groups are experiencing an increase in their proportion of the population. Therefore, it can be concluded that Britain has become more and more ethnically diverse over time. It is also evident in Figure 5.3 that all of the ethnic minorities are growing in proportion over the three years, except for the Bangladeshi and Black African groups.

The year 2001 is the only point in time that is shared by both the Census and one of the consumer registers currently held by CDRC. Since the eligible age for electoral roll registration was 16 in Britain in 2001, the adult (aged 16+) counts are extracted and combined from the 2001 Census for England/Wales and Scotland for comparison purposes. Population counts of ethnic groups from consumer data are compared against adult counts of ethnic groups provided by the 2001 Census (Table 5.2). The comparison shows that consumer data only account for 87% of the 2001 Census total population and most of the individual ethnic groups are under-represented to a greater or lesser degree. In particular, the White British group is relatively well represented in the consumer data, at 88% of its 2001 Census count, while only 40% of the Chinese group is represented. The representation rates for the Indian and Pakistani groups are 72% and 86% respectively.

The Black Caribbean and Other Mixed groups are severely under-represented and the Arab group is not applicable in the 2001 Census ethnic categorisation. With the elimination of the above three ethnic groups, the ratios of adult counts for each individual ethnic group in consumer data to counts in 2001 Census data are visualised in Figure 5.4. The red dashed line indicates 1:1 representation. The White Irish group is extremely over-represented, which suggests that a considerable number of people who are classified as White Irish
5. Ethnicity and Residential Segregation 77
Merged Census Groups   Adult (16+) counts in 2001   Adult (16+) counts in 2001   Ratios of counts
for the study          Consumer Data                Census Data                  (Consumer data to Census Data)
Other Asian            151,594                      189,036                      0.802
Bangladeshi            106,480                      174,257                      0.611
Chinese                78,532                       198,145                      0.396
Indian                 580,620                      811,044                      0.716
Pakistani              418,401                      486,061                      0.861
Black African          137,123                      338,827                      0.405
Black Caribbean        11,703                       450,498                      0.026
Other Mixed            291                          337,547                      0.001
Other                  300,890                      245,330                      1.226
Arabic                 4                            NULL                         NULL
Other White            829,458                      1,226,886                    0.676
White British          35,754,961                   40,534,837                   0.882
White Irish            1,178,502                    650,658                      1.811
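As a worked check on the table above, the short sketch below recomputes each ratio (consumer data count divided by Census count) and the overall coverage figure of about 87% quoted in the text. The counts are copied from the table; the Arabic row is omitted because it has no 2001 Census counterpart, and the variable names are illustrative.

```python
# (consumer data count, 2001 Census count) for adults aged 16+, from the table.
counts = {
    "Other Asian": (151_594, 189_036),
    "Bangladeshi": (106_480, 174_257),
    "Chinese": (78_532, 198_145),
    "Indian": (580_620, 811_044),
    "Pakistani": (418_401, 486_061),
    "Black African": (137_123, 338_827),
    "Black Caribbean": (11_703, 450_498),
    "Other Mixed": (291, 337_547),
    "Other": (300_890, 245_330),
    "Other White": (829_458, 1_226_886),
    "White British": (35_754_961, 40_534_837),
    "White Irish": (1_178_502, 650_658),
}

# Representation ratio per group: values below 1 mean under-representation.
ratios = {group: round(consumer / census, 3)
          for group, (consumer, census) in counts.items()}

# Overall coverage of the consumer data relative to the Census.
total_consumer = sum(consumer for consumer, _ in counts.values())
total_census = sum(census for _, census in counts.values())
overall_ratio = total_consumer / total_census

# Groups with more records in the consumer data than in the Census.
over_represented = sorted(g for g, r in ratios.items() if r > 1)
```

The two groups returned in `over_represented` (Other and White Irish) are the exceptions to the general pattern of under-representation.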
5.5
[Map residue: place labels Barnet, Harrow, Haringey, Redbridge, Havering, South Cambridgeshire, Bristol and London; legend ‘ratio’ with classes 0.207 - 0.500, 0.501 - 0.750 and 0.751 - 0.900.]
... Dissimilarity and the Information Theory index for the Evenness dimension; the ...
advance. To demonstrate the feasibility of using consumer data in residential segregation studies, we choose the simplest and most readily interpretable measure, the aspatial Index of Dissimilarity. Since the index is widely accepted in previous studies as well as governmental reports, this choice facilitates comparisons with related studies in the British context. For the same purpose of comparability, we aggregated population counts by ethnic group to the 2011 OAs. There are 227,759 OAs in Great Britain in 2011. In addition, traditionally ethnic diversity is largely an urban phenomenon and studies of ethnic segregation focus on metropolitan areas. Nonetheless, dispersal ... includes both urban and rural OAs when examining the overall trends of ethnic segregation in Britain.

The national residential segregation of individual ethnic groups is measured using the pairwise Index of Dissimilarity denoted as D in Equation (5.1), which captures the absolute difference between the spread of a specified group and the spread of the rest of the population across spatial units nationally. Here wi denotes the number of residents of the ethnic group under examination in the ith OA, and W denotes the total population of the ethnic group under examination in Britain.
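A minimal sketch of the Index of Dissimilarity in Equation (5.1) follows. The text defines wi and W; the sketch assumes that bi and B are the corresponding counts for the rest of the population (in the ith OA and in Britain as a whole), in line with the description of D as comparing the spread of a group with the spread of the rest of the population. The function name and the toy counts are illustrative.

```python
def dissimilarity_index(group_counts, rest_counts):
    """D = 0.5 * sum_i |w_i / W - b_i / B| over areas i.

    group_counts[i] is w_i, the group's residents in area i; rest_counts[i]
    is b_i, taken here (as an assumption) to be the rest of the population
    in the same area.
    """
    if len(group_counts) != len(rest_counts):
        raise ValueError("both lists must cover the same areas")
    W = sum(group_counts)
    B = sum(rest_counts)
    if W == 0 or B == 0:
        raise ValueError("both populations must be non-empty")
    return 0.5 * sum(abs(w / W - b / B)
                     for w, b in zip(group_counts, rest_counts))


# Toy examples with two areas (counts are invented).
d_even = dissimilarity_index([10, 10], [90, 90])    # same spread in both areas
d_separate = dissimilarity_index([10, 0], [0, 90])  # no area is shared
d_partial = dissimilarity_index([30, 10], [20, 40])
```

The index ranges from 0 (the group is spread exactly like the rest of the population) to 1 (complete separation), which matches its interpretation as the proportion of the group that would need to move to achieve an even distribution.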
$$ D = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{w_i}{W} - \frac{b_i}{B} \right| \qquad (5.1) $$

Using Equation (5.1), pairwise dissimilarity indices for all of the ethnic groups are calculated from year 1998 onwards, aiming to examine the extent to which ethnic groups are evenly distributed across OAs of Britain. Results of selected ethnic groups are shown in Figure 5.6. The Index of Dissimilarity can be interpreted as the proportion of residents in the ethnic group under examination who would need to be moved to other OAs to achieve even distribution. First of all, the overall trend of residential segregation at the level of Great Britain is decreasing as Britain becomes more ethnically diverse at the same time. A clear decline can be seen in the temporal changes for most of the ethnic groups (Figure 5.6). However, the White British group is an exception among the ethnic groups, with a slight rise of 1.4% in its dissimilarity index. Although there is an increase in the segregation level for the White British group, Catney (2015) interpreted this phenomenon in the context of new ethnic group mixing in less diverse locales. She argued that there would be an increase of unevenness whenever members ...

... 40% and 50%. The White Irish group stands out as the low segregation group whose dissimilarity index is below 30%. The Irish group has a long migration and settlement history in Britain and, if names have remained an indicator of ethnic identity, they appear to be more evenly distributed across Britain. The results also indicate that the White group, including the White British, White Irish, and Other White, is more spatially integrated across the whole of Britain than other ethnic minorities.

The above findings address the concern brought up at the beginning of this section as to whether Britain is becoming more residentially segregated or mixed. They lead to the conclusion that Britain has become more ethnically diverse and more residentially mixed at the national level. It should be noted, however, that the segregation pattern is an outcome not only of selective residential mobility/migration but also of differential fertility and mortality rates among ethnic groups (Catney, 2015). Another demographic dynamic comes from international immigration during the past decade. According to the 2011 Census, 13% of the total population in England and Wales in 2011 were born outside the UK. There is a
Figure 5.6
Changes of Index of Dissimilarity for selected ethnic groups over time. [Line chart of Dissimilarity Index (20% to 100%) by Year.]
long history of immigrants into the UK. Initially they were drawn to the UK by labour shortages in particular areas, for example road-building in the early 20th century, health and transport services in the 1960s, the textile industry in the 1970s, and agriculture in the 2000s (Simpson, 2012). These particular migrant destinations, mostly major cities of the UK such as London, Manchester and Birmingham, served as ‘gateway areas’ (Catney, 2015) for immigration flows. International immigrants settled down in these gateway areas first and then some of them spread into other areas of the UK, which results in growing ethnic diversity and more even distributions of ethnic minorities.

Last but not least, it is important to investigate the cause of the dramatic drop starting from year 2003 shown in Figure 5.6. It has been noted earlier in this chapter that consumer data are compiled from multiple data sources. Consumer data are mainly derived from the public registers of electoral rolls before 2003. Afterwards, the consumer data comprise both electoral rolls and other commercial data sources. Therefore, it is necessary to establish whether the dramatic drop from 2003 is caused by the vagaries of data sources or by a demographic process. The best way is to filter out the consumer register part from the consumer data after 2003 so that the filtered consumer data solely consist of the public version of the electoral roll. The filtering can be done by using one attribute of each record that indicates the general source of the data. In order to examine whether the data source vagaries have an impact on the segregation indices, comparisons are made between dissimilarity indices calculated from the original consumer data and from the filtered consumer data.

Based on this strategy, the workflow in Figure 5.1 is repeated and the dissimilarity indices of ethnic groups from original consumer data and from filtered consumer data are compared in Figure 5.7, where different colours represent different ethnic groups, dashed lines represent the original consumer data and solid lines represent the filtered consumer data. Within each ethnic group, the segregation indices are relatively underestimated by the original consumer data after 2002 compared against the filtered consumer data. The most noticeable underestimation occurs in the Chinese
Figure 5.7
Dissimilarity indices derived from original consumer data (dashed line) and from filtered consumer data (solid line). [Line chart of Dissimilarity Index (20% to 90%) by Year.]
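The comparison between original and filtered consumer data described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' workflow: the record fields (`oa`, `group`, `source`), the source labels `electoral_roll` and `commercial`, and the toy records are all invented; the chapter says only that one attribute of each record indicates its general source.

```python
from collections import Counter


def dissimilarity(records, group):
    """Index of Dissimilarity of `group` against everyone else, by area."""
    in_group, rest = Counter(), Counter()
    for record in records:
        target = in_group if record["group"] == group else rest
        target[record["oa"]] += 1
    W, B = sum(in_group.values()), sum(rest.values())
    areas = set(in_group) | set(rest)
    # Counter returns 0 for areas where one of the populations is absent.
    return 0.5 * sum(abs(in_group[a] / W - rest[a] / B) for a in areas)


def make_records(oa, group, source, n):
    """Return n identical toy records (read-only, so sharing is harmless)."""
    return [{"oa": oa, "group": group, "source": source}] * n


# Invented records for one year: the commercial source adds Chinese
# residents in area B who do not appear on the electoral roll.
original = (
    make_records("A", "Chinese", "electoral_roll", 4)
    + make_records("B", "Chinese", "electoral_roll", 1)
    + make_records("B", "Chinese", "commercial", 3)
    + make_records("A", "White British", "electoral_roll", 10)
    + make_records("B", "White British", "electoral_roll", 10)
)

# Keep only records flagged as coming from the public electoral roll.
filtered = [r for r in original if r["source"] == "electoral_roll"]

d_original = dissimilarity(original, "Chinese")
d_filtered = dissimilarity(filtered, "Chinese")
```

In this toy case the index from the original data is lower than the index from the filtered data, which mirrors the direction of the difference reported for Figure 5.7; real data would of course be computed per year and per ethnic group.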
ethnic group, possibly because some of them, for instance international students from China, are not eligible to vote in any election. Thus they are more under-represented in the electoral registers than in the consumer registers. In contrast, the variations are not that large among some native ethnic groups such as the White British and White Irish, or some ethnic communities from Commonwealth countries, for example the Indian, Pakistani and Bangladeshi groups, or some of the Other White group from EU countries, all of whom are allowed to vote in at least some elections. Although slightly underestimated, the dissimilarity indices for all ethnic communities are still decreasing and the overall patterns of high, moderate and low segregation groups still exist among these different ethnic communities. It can also be concluded that the dissimilarity indices for each ethnic group prior to 2003 are overestimated since fewer non-British groups are included in the public electoral roll before 2003. Afterwards, other commercial data sources are compiled into the consumer data to include as many non-voters as possible. Changes in the data source partly explain the strong decrease in dissimilarity indices observed from 2003 to 2007. However, they do not affect the overall changing tendency of segregation indices for individual ethnic groups. That tendency is likely the result of a natural demographic process, for example the increasing number of immigrants and migrant dispersal, although this needs further evidence from the Census migration data.

5.6
Conclusion

Contemporary Britain continues to experience demographic changes in its population, and these have accelerated in recent years. Ethnic diversity at various scales offers a key perspective on these dynamics. Against this background, there are concerns that a more diverse population may become more segregated. Taking the opportunity of the 2011 Census release, numerous studies have been conducted into these issues but are now beginning to look outdated and possibly overtaken by events as the Census data age. In addition, such analyses are often limited to high levels of spatial aggregation for reasons of disclosure control. Thus, the use of coded consumer data offers merit in understanding changes over time at finer spatial scales.
Our work in this area is just beginning, and there are very significant start-up costs in establishing the provenance and quality of consumer data as well as the veracity of the individual-level inference of ethnic group. The study contributes to the segregation debate with empirical findings from consumer data with finer-grained resolutions in both spatial and temporal dimensions. The results suggest that Britain is becoming more and more ethnically diverse over time with shrinking White British majorities and growing ethnic minorities. A decrease in the overall residential segregation in Britain can be identified from the changes in dissimilarity indices for most ethnic groups, except for a small increase for the White British group. These findings reinforce with empirical evidence the existing claims of a more mixed Britain in related studies in the literature. It is believed that these changes are the consequence of a natural demographic process of fertility, mortality, migration and immigration of the population. The debate shows that there is a need to effectively assess ethnic geographies across a full range of spatial scales using data and methods that permit robust, timely and transparent assessments of residential segregation.

Consumer data appear to be an effective supplement to census data; however, several limitations should be noted. First, although achieving a relatively fine penetration rate of the population in the UK, consumer data are incomplete as they only record the adult population. The age limit depends on the eligible age for voting or for applying for credit cards and loyalty cards. Second, because willingness and eligibility to register to vote differ, different ethnic groups might be under-represented to different degrees (ethnic bias). Third, multiple data sources could have an impact on the results of the analysis because it is only since 2003 that persons registered on the Electoral roll can opt out of inclusion in the derivative commercial dataset. Hence, since 2003, the consumer registers have been further enhanced with other datasets. Although this might cause inconsistencies in trends between years, these seem to be limited based on the comparison between the public electoral roll subset and the full consumer registers. Consequently, the Dissimilarity Indices calculated from the original consumer data likely underestimate segregation to varying degrees, as found in the comparison between the original consumer data and the filtered consumer data (Figure 5.7). Considerations and precautions should be taken to understand the possible outcomes of these limitations with respect to the purpose of the analysis.

In this chapter, both opportunities and challenges of making use of consumer data are presented by re-visiting the ethnic diversity and segregation issues of contemporary Britain. It has been demonstrated that it is feasible to address such social concerns with novel data sources. By linking to other data sources, consumer data have even greater potential to address other micro demographics in a broader spectrum. For example, house price inflation and asset accumulation can be investigated by linking consumer data with Land Registry data. Other examples of possible application of consumer data could be issues such as changes in population density, household composition, and age structure. With more such new Big Data sources emerging, changes and dynamics of the contemporary population can be better understood in novel ways.
Casey, D. L. (2016). The Casey Review: A review into opportunity and integration. Online: https://www.gov.uk/government/publications/the-casey-review-a-review-into-opportunity-and-integration.

Catney, G. (2015). Exploring a decade of small area ethnic (de-)segregation in England and Wales. Urban Studies, 53(8), 1691-1709. doi:10.1177/0042098015576855

Catney, G. (2016). The Changing Geographies of Ethnic Diversity in England and Wales, 1991–2011. Population, Space and Place, 22(8), 750-765. doi:10.1002/psp.1954

Finney, N., and Simpson, L. (2009). ‘Sleepwalking to segregation’? Challenging myths about race and migration. Policy Press at the University of Bristol.

Mateos, P., Longley, P. A. and O’Sullivan, D. A. (2011). Ethnicity and Population Structure in Personal Naming Networks. PLOS ONE, 6(9). doi:10.1371/journal.pone.0022943

The authors would like to thank CACI Ltd and DataTalk Research Ltd for providing the Consumer Register and Electoral Roll data under a special research licence to enable us to carry out this research. We would also like to thank Owen Abbott and Adriana Castaldo, Office for National Statistics, for their support in developing Ethnicity Estimator. The research was also funded by Engineering and Physical Science Research Council grant EP/M023483/1 and the Economic and Social Research Council grant ES/L013800/1.

Notes

1. http://ons.maps.arcgis.com/home/item.html?id=471e6948594540a3bccb2678e0cf50fe
2. www.ons.gov.uk/peoplepopulationandcommunity/
3. www.ordnancesurvey.co.uk/business-and-government/products/addressbase-products.html
national footfall patterns using automated data collection.

As a first step, various locations for the study were identified by CDRC to ensure a geographical spread, different local demographic characteristics and a range of retail centre profiles. A custom footfall counting technology using WiFi based sensors was developed by LDC and the sensors were installed in the identified locations. The sensor monitors and records signals sent by WiFi enabled mobile devices present in its range. In addition, pedestrians walking past the sensor were counted manually for short time periods during the installation. The project aims to combine these two sets of data to estimate footfall at these locations. The first sensor was installed in July 2015 and the network has grown to almost 789 total active sensors as of June 2017.

The primary aim of the project is to improve our understanding of the dynamics of high-street retailing in the UK. The key challenge in this area is the collection of data at the finest scales possible with minimal resources while not infringing on people’s privacy. This challenge, when solved, can provide immense value to occupiers, landlords, local authorities, investors and consumers within the retail industry. The project aims to facilitate decision making by stakeholders in addition to providing tremendous opportunities for academic research.

6.3.1
Hardware setup

The data are collected through a network of SmartStreetSensors: WiFi based sensors that collect a specific type of packet (probe requests) relayed by mobile devices within the device’s signal range. The sensor is usually installed in a partnering retailer’s shop window so that its range covers the pavement in front of the shop. In a handful of cases (3%), the sensor is placed within a large shop to monitor internal footfall. Each device collects data independently and uploads the collected data to a central container at 5-minute intervals through a dedicated 3G mobile data connection. The sensor hardware has been improved over the course of the project and currently has built-in failure prevention mechanisms such as a backup battery for power failures, automatic reboot capabilities and in-device memory for holding data when the Internet is not available.

6.3.2
Data collection, data storage and data retrieval

The probe request frame is the signal sent by a WiFi capable device when it needs to obtain information from another WiFi device. For example, a smartphone would send a probe request to determine which WiFi access points are within range and suitable for connection. On receipt of a probe request, an access point sends a probe response frame that contains its capability information, supported data rates, etc. This ‘request-response’ interaction forms the first step in the connection process between these devices. The request frame has two parts: a MAC header that identifies the source device, and the frame body that contains information about the source device. As mentioned, the SmartStreetSensor collects some of the information available in the probe request frame relayed by the mobile devices, along with the time interval in which the request was collected and the number of such requests collected during that interval. The actual information present in the data collected by the SmartStreetSensor is shown in Table 6.1.

After the probe requests are collected, the MAC addresses in the data are hashed at the sensor level to preserve the privacy of the device owners and sent to LDC’s cloud storage. From there, through a secure channel, they are sent to the CDRC secure servers, where the formal translation of probe requests to footfall data is completed.
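The hashing and 5-minute aggregation steps described above can be sketched as follows. The chapter does not specify the hashing scheme used on the sensor, so the salted SHA-256 below is an assumption, as are the salt value, the function names and the invented MAC addresses and timestamps.

```python
import hashlib
from collections import defaultdict
from datetime import datetime


def hash_mac(mac, salt="example-salt"):
    """Replace a MAC address with a salted SHA-256 digest (assumed scheme)."""
    return hashlib.sha256((salt + mac.lower()).encode("utf-8")).hexdigest()


def interval_start(timestamp, minutes=5):
    """Round a timestamp down to the start of its 5-minute interval."""
    floored = timestamp.minute - timestamp.minute % minutes
    return timestamp.replace(minute=floored, second=0, microsecond=0)


def summarise(probes):
    """Return {interval: (probe requests, unique hashed devices)}."""
    requests = defaultdict(int)
    devices = defaultdict(set)
    for timestamp, mac in probes:
        key = interval_start(timestamp)
        requests[key] += 1
        devices[key].add(hash_mac(mac))
    return {key: (requests[key], len(devices[key])) for key in sorted(requests)}


# Invented probe requests: (time received, source MAC address).
probes = [
    (datetime(2017, 6, 1, 9, 1, 10), "AA:BB:CC:00:00:01"),
    (datetime(2017, 6, 1, 9, 3, 40), "AA:BB:CC:00:00:01"),
    (datetime(2017, 6, 1, 9, 4, 5), "AA:BB:CC:00:00:02"),
    (datetime(2017, 6, 1, 9, 6, 0), "AA:BB:CC:00:00:01"),
]
summary = summarise(probes)
```

Because the same address always hashes to the same digest, unique devices can still be counted within an interval without storing the raw MAC address.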
Table 6.1
Information collected by the SmartStreetSensor.

Field           Description
MAC address     The MAC address of the source device with last two digits hashed
Time interval   5-minute time interval in which the data was captured
Figure 6.1 The total number of probe requests collected every 5-min interval vs the number of unique MAC addresses collected in the same interval vs the final count for the same interval after cleaning long dwelling devices. (Axes: Packets against Time, in minutes from 00:00.)

device, which also leads to periods of zero counts until the device is switched back on. If such intervals are short (no more than half an hour), we can safely interpolate the counts to have better aggregated estimations of the daily counts. In practice, the estimated count c, at time t, is obtained by a simple linear interpolation:

$$ c = c_1 + m(t - t_1) \qquad (6.1) $$

where m is the slope and $c_1$ are the counts at time $t_1$.

6.4.2
External validation

The previously internally validated data are externally validated against, and adjusted to, manual counts. The ratio between manual counts and internally validated (cleaned) sensor counts is known as adjustment factor α:

$$ \alpha = \frac{M}{\psi} \qquad (6.2) $$

Where α is the adjustment factor, though there are certain differences between weekdays and weekends, M is the number of the passers-by counted manually on the street and ψ is the number of the processed sensor counts.
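Equations 6.1 and 6.2 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the sample numbers are hypothetical, the slope is assumed to be taken between the last valid count before the gap and the first valid count after it, and the adjusted count is assumed to be the cleaned sensor count multiplied by α.

```python
def interpolate_count(t, t1, c1, t2, c2):
    """Equation 6.1: c = c1 + m (t - t1).

    The slope m is assumed to be computed from the valid observations
    (t1, c1) and (t2, c2) on either side of the gap.
    """
    m = (c2 - c1) / (t2 - t1)
    return c1 + m * (t - t1)


def adjustment_factor(manual_count, sensor_count):
    """Equation 6.2: alpha = M / psi."""
    return manual_count / sensor_count


# Hypothetical gap: valid counts at minute 600 and minute 620, nothing at 610.
filled = interpolate_count(610, 600, 40, 620, 60)

# Hypothetical validation survey: 120 passers-by counted by hand while the
# cleaned sensor data recorded 100 over the same period.
alpha = adjustment_factor(120, 100)

# Assumed use of alpha: scale a cleaned sensor count towards the manual count.
adjusted = alpha * filled
```

Since the text notes that the factor differs between weekdays and weekends, a fuller version would presumably hold one α per sensor and day type.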
At this point, once we have translated probe requests into footfall counts with a sufficient degree of certainty, we can start a proper analysis of the particular patterns generated at each location to compare trends and define different functional areas across different parts of the country. An example of this is presented in Section 6.5.

6.5
UK footfall index

One of the first analyses conducted, based on the validated footfall counts, was to look at the shift in footfall figures nationwide to establish seasonal peaks and troughs and ensure they reflect known trends. For example, footfall tends to rise in the run up to Christmas but falls during the first months of the year.

Two different indexes were therefore defined: the first to track seasonal trends in footfall, taking a particular month as a baseline, and the second to compare the change in footfall between two consecutive months. Both indexes try to detect major shifts and overall tendencies from one month to another at the national level, not to explain actual activity patterns.

For both, the counts were aggregated to each half-hour, removing those devices that were present for more than 5 minutes at every location and without applying any adjustment factor, as these indexes are more concerned with counting all the footfall activity around the sensors, and not only retail related activity.

Equation 6.3 measures the relative change in footfall from one month to another:

$$ \text{Footfall index}(a,b) = \frac{b-a}{a} \times 100 \qquad (6.3) $$

where b = Total footfall at month $M_b$, a = Total footfall at base month $M_a$, a≠b.

The major challenge was the actual construction of b and a, as i) the number of sensors is not the same between any two given months (at this stage of the project there are always more sensors in month $M_b$ than in month $M_a$); ii) a single sensor could be measuring H hours in month $M_a$ and K hours in month $M_b$, with K≠H; and iii) some sensors can be considered just as white noise, because they may have only a few valid measures within a particular month. These discrepancies make, in principle, these two months incomparable with each other.

To solve this, we proceed as follows:

1) Define

$$ S^{d}_{M_{a,b}} = \sum_{i=1}^{H_{a,b}} h^{d}_{i} $$

where $H_{a,b}$ is the total number of half hours in months $M_{a,b}$, and $h^{d}_{i}$ is the half hour aggregated footfall counts at sensor d at bin i. Put simply, $S^{d}_{M_{a,b}}$ is the sum of all the footfall in a single month at sensor d.

2) Calculate the theoretical probability distribution of all $S^{d}_{M_{a,b}}$ in a month.

a) Discard all sensors skewed to the left of the bulk of the distribution, i.e. those that are to the left of the standard deviation value. In other words, remove all sensors that didn’t work properly during months $M_{a,b}$.

b) For sensors skewed to the right, i.e. those that are two times above the standard deviation value, we firstly verify if their behaviour is the same across the previous few months or if the month in question was an anomalous one. If it is the former, we remove the counts, otherwise they are kept in.

3) With the remaining sensors, we define a and b as follows:

$$ a = \frac{\sum_{i=1}^{H_a} h_i}{S_a}, \qquad b = \frac{\sum_{i=1}^{H_b} h_i}{S_b} \qquad (6.4) $$

where $H_{a,b}$ is the total number of hours in month $M_{a,b}$, $h_i$ is the half an hour aggregated footfall counts at bin i and $S_a$ and $S_b$ are the total number of sensors left after step 2.
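The construction of a and b and the index in Equation 6.3 can be sketched as follows. This is a simplified illustration, not the authors' implementation: the left-tail rule is interpreted here as "monthly total below the mean minus one standard deviation" (an assumption, since the text does not give the exact threshold), the right-tail check in step 2b is omitted because it needs data from earlier months, and the sensor identifiers and counts are made up.

```python
from statistics import mean, pstdev


def monthly_totals(half_hour_counts):
    """Step 1: sum the half-hour counts of each sensor over the month."""
    return {sensor: sum(counts) for sensor, counts in half_hour_counts.items()}


def drop_left_tail(totals, n_sd=1.0):
    """Step 2a (assumed rule): drop sensors below mean - n_sd * sd."""
    values = list(totals.values())
    threshold = mean(values) - n_sd * pstdev(values)
    return {sensor: s for sensor, s in totals.items() if s >= threshold}


def weighted_count(totals):
    """Step 3 / Equation 6.4: total footfall divided by sensors kept."""
    return sum(totals.values()) / len(totals)


def footfall_index(a, b):
    """Equation 6.3: percentage change from base month a to month b."""
    return (b - a) / a * 100


# Hypothetical half-hour counts; s3 and s5 barely recorded anything.
month_a = {"s1": [100, 120], "s2": [90, 110], "s3": [0, 1]}
month_b = {"s1": [130, 140], "s2": [120, 130], "s4": [125, 135], "s5": [1, 1]}

kept_a = drop_left_tail(monthly_totals(month_a))
kept_b = drop_left_tail(monthly_totals(month_b))
a = weighted_count(kept_a)
b = weighted_count(kept_b)
index = footfall_index(a, b)
```

Dividing by the number of sensors kept is what makes two months with different sensor networks comparable: the index then reflects change in footfall per sensor rather than growth of the network.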
Figure 6.2 Percentage change in footfall over a 7-month period with October 2016 as the base month.

Equation 6.4 captures the weighted counts at each month, which standardises the measures, making both months comparable. In the next section, we present the results obtained when a single month is set as base month, in this case, October 2016. The second index, where we compare the change in footfall between two given months, is explained in detail in the online supplementary information.

6.5.1
Footfall trends over long time periods

Defining October 2016 (with a net footfall of approximately 131 million) as the base month, we explored the percentage change in footfall across a 7-month period (Figure 6.2). November shows a marginal increase of 6% while December increases by almost 25%, which is expected due to the festive season. After this peak, in the first quarter of the year, footfall returns to the October levels, then there is an unusual increase in April 2017 (17%) before finally returning to the base month level in May. The April increase could be related to the Easter holiday period, but this is something still to be investigated.

Although both indexes were presented at national level, they can be disaggregated to, for example, retail centre level, to compare the corresponding flows between different retail areas.

6.5.2
Footfall trends over short time periods

In order to illustrate the differences in the volume of footfall across Central London, the validated sensor measurements were taken for the five-minute intervals of each day of the week over the period of ten weeks (9 January 2017 to 19 March 2017) for all the sensors for which data were available. The period was chosen to avoid holiday seasons (Christmas, Easter, summer) or the occurrence of Monday Bank Holidays which would have influenced the usual weekday footfall volumes. The spatial variation of overall average five-minute footfall during the weekdays between 7am and 7pm in Central London is shown in Figure 6.3.

Areas well known for their busyness are Soho (Central London) and Camden Town, as well as locations around some of the Tube and rail stations, with some notable examples labelled on the map (Victoria, Waterloo and Angel stations). The influence of station proximity is also seen on Edgware Road. Footfall around Edgware Road and Marble Arch Tube stations appears to be higher, while at the same
time sensors between them record lower and relatively consistent and spatially comparable footfall. On the other hand, stores situated in quieter side streets or less attractive areas show lower footfall, including areas that may be near main attractions but outside main corridors, Tooley Street being a good example, situated behind the far more crowded Thames path near Tower Bridge.

While very important, assessment of the overall footfall may fall short of detecting some other interesting patterns of human activity, for example how a certain area of the city is being used by its residents, workers and visitors during the characteristic time periods during the day and the week. In order to explore some of these differences in diurnal patterns, temporal profiles of three locations were constructed. Those locations were Holborn Station, Connaught Street (situated to the west of Edgware Road) and a pub in Tooley Street between London Bridge and Tower Bridge. Temporal patterns and volume of footfall differ among the three locations on multiple levels (Figure 6.4). First, overall volume is very high around Holborn Station and very low at the Connaught Street location. Second, general profiles differ, so that both Holborn and Tooley Street display three peaks (morning and afternoon rush hour and lunchtime), while Connaught Street has a less clear, noisier pattern, which could be owing to a low footfall. Finally, there are differences even between the profiles of Holborn and Tooley Street with the latter experiencing a relatively higher PM rush hour peak.

Figure 6.3 Average five-minute footfall in Central London during the working days (7am - 7pm) between 9 January and 19 March 2017. Source: Local Data Company (2017); Ordnance Survey Vector Map District (2017). (Map legend: Average Footfall per five minutes, classes 1, 10, 50, 100; labelled places include Camden Town, Angel, Bloomsbury, Holborn, Edgware Road, Soho, Marble Arch, Piccadilly Circus, Waterloo, Tooley Street and Victoria.)
[Figure: footfall profiles by hour of day (0 to 21, log scale) for Connaught Street, Holborn Station and Tooley Street.]

average five-minute footfall on weekdays

$$ I_w = \frac{F_1(\text{Sat-Sun}, 7\text{-}19)}{F_2(\text{Mon-Fri}, 7\text{-}19)} \times 100 \qquad (6.5) $$

where $I_w$ is the index of relative weekend daytime activity, $F_1$ is the average five-minute weekend footfall between 7am and 7pm and $F_2$ is the average five-minute

[Map of Central London. Legend: Index of Relative Weekend Activity, classes > 116, 101 - 116, 77 - 100 and < 77; labelled places include Bloomsbury, Holborn, Edgware Road, Soho, Leicester Square, Marble Arch, Piccadilly Circus, Waterloo, Tooley Street and Victoria.]
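Equation 6.5 amounts to a ratio of two averages. The short sketch below shows the arithmetic under the assumption that the inputs are lists of five-minute counts already restricted to the 7am to 7pm window; the function name and the sample counts are hypothetical.

```python
def weekend_activity_index(weekend_counts, weekday_counts):
    """Equation 6.5: average weekend footfall over average weekday footfall, x 100.

    Both arguments are assumed to hold five-minute counts taken between
    7am and 7pm only.
    """
    f1 = sum(weekend_counts) / len(weekend_counts)
    f2 = sum(weekday_counts) / len(weekday_counts)
    return f1 / f2 * 100


# Hypothetical five-minute counts for one sensor.
weekend = [30, 50, 40]   # Saturday-Sunday, 7am-7pm
weekday = [45, 55, 50]   # Monday-Friday, 7am-7pm

i_w = weekend_activity_index(weekend, weekday)
```

A value below 100 indicates a location that is quieter at weekends than on weekdays, and a value above 100 the reverse.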
Cunche, M., Kaafar, M.-A. and Boreli, R. (2014). Linking wireless devices using information contained in Wi-Fi probe requests. Pervasive and Mobile Computing, 11, 56–69.

Freudiger, J. (2015). How talkative is your mobile device? An experimental study of Wi-Fi probe requests. In Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks. ACM, p. 8.

Hidalgo, C.A. and Rodriguez-Sickert, C. (2008). The dynamics of a mobile phone network. Physica A: Statistical Mechanics and its Applications, 387(12), 3017–3024.

Kobsa, A. (2014). User acceptance of footfall analytics with aggregated and anonymized mobile phone data. In Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 168–179. Springer International Publishing Switzerland 2014.

Steenbruggen, J. et al. (2013). Mobile phone data from GSM networks for traffic parameter and urban spatial pattern assessment: A review of applications and opportunities. GeoJournal, 78(2), 223–243.

Torrens, P. M. (2008). Wi-fi geographies. Annals of the Association of American Geographers, 98(1), 59–84.

Vazquez-Prokopec, G. M. et al. (2013). Using GPS technology to quantify human mobility, dynamic contacts and infectious disease dynamics in a resource-poor urban environment. PloS one 8(4), e58802.

Acknowledgements

The authors would like to thank Local Data Company Ltd, for providing, in partnership with CDRC, the SmartStreetSensor footfall data. The second and third authors’ PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.
Table 7.1 Variable domains used in the IUC.

Domain                     Description
Infrastructure             Fixed-line household infrastructure access and broadband Internet performance.
Mobile phones              Mobile access, connectivity and usage.
Perceptions                People’s attitudes and perceptions about the use and utility of the Internet.
Access patterns            Information on Internet access patterns, e.g. only at home, while travelling, through a mobile, etc.
Commercial applications    Information on the use of commercial applications such as online shopping, online banking and online bill payments.
User population            Current Internet users, ex-users and non-users.
Demographics and attributes of contextual geography    Demographic attributes such as age, education and occupation and attributes of contextual geography such as rurality, population density, etc.

Generally, the steps required to create a classification include selecting the appropriate classification scale, selecting the input variables, preparing the data (e.g. variable transformations or weighting), applying a clustering method and finally interpreting results (clusters). Due to data availability, the IUC was built for England at the LSOA level. LSOA geography is the second most granular Census geography available, comprising 32,844 zones of between 1,000-3,000 people or 400-1,200 households. The majority of data under consideration were available at the Great Britain level, albeit that those datasets available for England were more robust. Furthermore, the nature of these geographies in Scotland and Wales varies significantly compared to England (e.g. in terms of the characteristics of rural areas) and so the decision was taken to exclude Scotland and Wales from the analysis.

Selecting the appropriate variables to be used in the classification, however, can be more challenging. The multi-dimensionality of the IUC is important; a wide range of spatially referenced input measures are essential to the success of the classification, similar to how geodemographics typically include a plethora of socio-economic attributes in order to represent neighbourhoods. If combined and summarised effectively, meaningful measures could represent a typology of Internet use and engagement. Broadly, these datasets should cover several domains, such as those listed in Table 7.1.

An important source of data forming input to the IUC was the Oxford Internet Survey (OxIS), which was launched by the Oxford Internet Institute in 2003. The survey, conducted every two years, is carried out by interview using a probability sample of around 2,000 people in Great Britain, enabling comparisons over time (more details can be found on the OxIS website, available at: oxis.oii.ox.ac.uk/research/methodology/). For the creation of the IUC, the 2013 study was used. The OxIS covers a broad range of topics regarding people’s perception of the Internet; given the vast number of questions that were available for analysis (there are over 500 potential lines of enquiry), it was necessary to identify a smaller subset of questions relating to key dimensions of Internet use, behaviours and attitudes.

The sample used for the 2013 OxIS is representative of the UK population, but its size is relatively small to capture the full breadth of the survey at higher geographic scales. As such, a method for synthetic data estimation was implemented to extrapolate the survey results to national small area coverage. Projection of the survey results was carried out using a Small Area Estimation (SAE) technique. SAE was applied to each question and generated a predicted response rate at the Output Area
(OA) Census geography. The estimations are ‘indirect’, in that they borrow strength by using values of the variables of interest from related areas through a model that provides that link using secondary data, such as Census counts and administrative records (Rao, 2003). In the most basic sense, it is possible to predict results for unsampled areas by using data from sampled areas. For instance, profiling the relationship between age structure and Internet usage and subsequently using the results to predict rates for an unsampled geography where no survey data are available, but the age structure is known (i.e. from a recent Census of population).

In practice, however, the process was more complex. Firstly, the required explanatory variables for each survey question should be identified. Predictor variables explored at this stage were based on those factors known to influence Internet use and behaviour that were identified in relevant literature, namely age (Rice and Katz, 2003; Warf, 2013), socio-economic status (Silver, 2014), ethnicity (Wilson et al., 2003), gender (Prieger and Hu, 2008), rurality (Warren, 2007), education (Helsper and Eynon, 2010) and Internet connectivity (Riddlesden and Singleton, 2014). There are a number of techniques that can be used to identify those attributes with the highest influence; in this case, a decision tree algorithm was implemented, specifically the Quick Unbiased Efficient Statistical Tree (QUEST) algorithm, in order to identify the relationship of those external attributes to response rates. Decision tree algorithms are commonly used in data mining and seem to perform well compared to, e.g., ecological regression analysis, which in this case provided poor results.

In total, 42 OxIS questions were selected, covering each of the 171,372 OAs in England and Wales. The described model outputs a series of rates which were then fitted to OA geography by examining the distribution of these population sub-groups within each OA nationally. Essentially, an OA rate for each question is a weighted average derived from all the population sub-groups present within it.

Two tests were carried out in order to validate estimated results. One way was to compare the average deviation between mean rates of the estimated data at the national level, and mean rates of the original OxIS sample. The average difference was <0.1%, which suggests the estimated dataset is broadly representative, as national means are comparable to those of the original data. Furthermore, comparing distributions showed that the estimation method is not skewing the output such that it is unrepresentative of the sample it was built from. Vastly different average rates between the estimated and original data at this stage would have flagged potential problems with estimation methods.

The next stage in validation involved profiling responses geographically, to examine if variability pertained to patterns that would be expected. An external dataset was used for this purpose; each of the OA response estimates was profiled using the 2011 Output Area Classification (OAC), an open geodemographic system (available for download through the CDRC portal at: data.cdrc.ac.uk/dataset/cdrc-2011-oac-geodata-pack-uk), to ensure that the propensity for certain responses (e.g. engagement with online shopping) was in line with responses given to the general demographic profile of the clusters. For instance, the national average response rate of question QC30b: Buying Online for those who responded as ‘frequently’ (i.e. buying online at least monthly) is 53.5% of all Internet users. Figure 7.1 shows the deviation of frequent users from the national mean by OAC profiles.

Profiling response rates by OAC revealed significant correspondence between socio-spatial groups and prevailing levels of engagement with different domains of the Internet. In most cases, groups with
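The statement that an OA rate is a weighted average over the population sub-groups present in the area can be made concrete with a small sketch. The sub-group labels, the predicted rates and the population counts below are invented for illustration and are not OxIS or Census figures; the function name is likewise hypothetical.

```python
def oa_rate(subgroup_rates, oa_population):
    """Population-weighted average of sub-group response rates for one OA.

    subgroup_rates: predicted response rate per sub-group (model output).
    oa_population:  number of residents of the OA in each sub-group.
    """
    total = sum(oa_population.values())
    weighted = sum(subgroup_rates[g] * n for g, n in oa_population.items())
    return weighted / total


# Hypothetical predicted rates of frequent online buying per age band.
rates = {"16-34": 0.70, "35-64": 0.55, "65+": 0.20}

# Hypothetical Census-style counts for a single Output Area.
population = {"16-34": 100, "35-64": 150, "65+": 50}

rate = oa_rate(rates, population)
```

Under this reading, the first validation test described above would compare the mean of such rates across all OAs with the corresponding mean in the original survey sample.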
Figure 7.1 Deviation from the national average of response rates to the question QC30b. (Axis: Output Area Classification 2011, Group Level.)
Table 7.2 Number of variables per domain used in the IUC, and data sources.

Domain    Variables    Data Source
Age       15           Census 2011
Table 7.3 The structure and class labels of the IUC.

Supergroup                         Group
1: E-unengaged                     1a: Too Old to Engage
                                   1b: E-marginals: Not a Necessity
                                   1c: E-marginals: Opt Out
2: E-professionals and students    2a: Next Generation Users
                                   2b: Totally Connected
                                   2c: Students Online
3: Typical trends                  3a: Uncommitted and Casual Users
                                   3b: Young and Mobile
4: E-rural and fringe              4a: E-fringe
                                   4b: Constrained by Infrastructure
                                   4c: Low Density but High Connectivity
Figure 7.2 The Greater London Region by IUC Group.

based on the values of the input attributes (usually through radial plots) and mapping results for visual analysis. These outputs informed the Supergroup and Group names (Table 7.3) as well as the ‘Pen Portraits’. These describe the typical characteristics of the areas included in each of the clusters, while also considering their variability between clusters. The complete pen portraits can be found in Singleton et al (2016).

Along with pen portraits, a series of maps are essential in order to reveal the spatial structure that emerges from the classification. Figure 7.2 demonstrates the resulting Group typology for the Greater London Region. The map clearly shows the differentiation between central London and the periphery, with the centre occupied by the highly engaged Supergroup 2 classes, such as 2a: Next Generation Users and 2b: Totally Connected. Cluster 3b: Young and Mobile clearly forms several clusters to East and West London, while the periphery is mostly identified by less engaged populations.

Potential uses of the IUC are broad, and fields of use may include data profiling, online survey stratification, targeted marketing, location planning, customer insight, and public policy formation and delivery. Such a classification is particularly useful in the commercial sector, as the IUC could be used in the profiling of existing customer databases to identify trends, assisting in the development of targeted marketing strategies. This may be valuable for businesses that operate online, or are interested in the aggregate Internet engagement characteristics of their customer base.

The IUC is an open product that is offered through the CDRC data portal (available for download at: data.cdrc.ac.uk/dataset/cdrc-2014-iuc-geodata-pack-england).
Furthermore, an interactive map of the classification is available on the CDRC website (maps.cdrc.ac.uk/#/geodemographics/iuc14/).

7.3
e-Resilience and the online geography of retail centres

Online shopping impacts upon retail centres in complex ways, often referred to in the literature as a ‘slow burn’ (Pendall et al, 2010). UK Government initiatives aimed at revitalisation of British high streets highlight the importance of digital technology in redefining traditional retail spaces (Digital High Street Advisory Board, 2015). In this framework, it is important to study the impacts that online shopping has on the structure of traditional high streets at a more granular level. For instance, in the UK a number of national retailers such as Borders, Zavvi, Jessops and Game have either entirely withdrawn or substantially limited their physical retail offerings within the past few years, while some other major retailers such as John Lewis, Next, Boots or Argos have successfully embraced new technologies through opening click-and-collect points, or by developing mobile applications (Turner and Gardner, 2014).

Despite evidence to suggest that factors impacting decisions about whether or not to shop online are linked to demographic and socioeconomic characteristics of populations (Longley and Singleton, 2009), there is limited knowledge about the geography of online sales (Forman et al, 2008). This study explored these challenges through a concept defined here as ‘e-resilience’, a concept that provided both the theoretical and methodological framework in assessing the vulnerability of retail centres to the effects of rapidly growing Internet sales, balancing characteristics of both supply and demand. E-resilience defines the vulnerability of retail centres to the effects of growing Internet sales, and estimates the likelihood that their existing infrastructure, functions and ownership will govern the extent to which they can adapt to or accommodate these changes. Essentially, e-resilience can be expressed as a balance between the propensity of localised populations to engage with online retailing and the physical retail provision and mix that might increase or constrain these effects, as not all retail categories would be equally impacted.

Measuring the vulnerability of competing retail destinations to consumers of differential Internet engagement characteristics requires an understanding of the location and geographic extent of retail centres, combined with some assessment of their composition and size. A nationally expansive record of the location, occupancy and fascia of UK retail stores is generated by the Local Data Company (London, UK), a commercial organisation that employs a large survey team to collect these data on a rolling basis. A national extract for February 2014 was made available for this research, with each record comprising the location of a retail premises with latitude and longitude coordinates, retail category and details of the current occupier. The dataset is currently available through CDRC (data.cdrc.ac.uk/dataset/local-data-company-retail-unit-address-data) with permission.

Retail unit data were used to calculate a series of measures which were identified in relevant literature to influence propensity to online shopping, e.g. physical store attractiveness or retail category vulnerability, calculated as the level of risk of the main product switching from physical to online offering channels. A composite of these measures forms a ‘supply vulnerability index’. Input measures to this index included the weighted percentage of anchor stores (Damian et al, 2011), i.e. the top 20 most attractive stores as presented by Wrigley and Dolega (2011), and leisure outlets (Reimers and Clulow, 2009), as opposed to the prevalence of ‘digitalisation retail’, such as newsagents, booksellers, computer
Table 7.4 The 10 most e-resilient town centres identified within England.

Table 7.5 The 10 least e-resilient town centres identified within England.

as knowledge about the geography and drivers of Internet shopping is still limited. This study explores some aspects of online retail behaviour, particularly the nature and impact that Internet user behaviour is having on retail centres nationally.

The analysis of the geography of online retail provides unique insights into the apparent diversity of population groups with regards to online shopping, and to the role and future of town centres at the national scale. Certainly, one of the most influential factors is the behavioural component: whether or not to use the Internet for a given activity. The study highlighted that such decisions are influenced both by demographic effects, mainly age and socioeconomic status, and by local retail supply, including ‘softer’ factors such as convenience and accessibility.
For example, the 2021 Census in the UK will largely be completed online and so the IUC can assist in highlighting areas where low response rates are likely.

A further application of the IUC is its contribution to the e-resilience indicator. The distribution of e-resilience measures revealed a geography where attractive and large retail centres such as the inner cores of large metropolitan areas, along with smaller, specialised centres were highlighted as more resilient, while centres within many secondary and medium sized centres were identified as most vulnerable.

One of the most defining contributions of this approach is that it provides a comprehensive classification of all retail centres based on their e-resilience levels, a resource that can be used by a wide range of stakeholders including academics, retailers and town centre managers, and inform policy decisions.

Birkin, M., Clarke, G. and Clarke, M. (2002). Retail Geography and Intelligent Network Planning. Chichester, NY: Wiley.

Birkin, M., Clarke, G. and Clarke, M. (2010). Refining and operationalizing entropy-maximizing models for business applications. Geographical Analysis, 42(4), 422–445.

Calderwood, E. and Freathy, P. (2014). Consumer mobility in the Scottish isles: The impact of internet adoption upon retail travel patterns. Transportation Research Part A: Policy and Practice, 59, 192–203.

Carlson, J., O’Cass, A. and Ahrholdt, D. (2015). Assessing customers’ perceived value of the online channel of multichannel retailers: A two country examination. Journal of Retailing and Consumer Services, 27, 90–102.

Damian, D., Curto, J. and Pinto, J. (2011). The impact of anchor stores on the performance of shopping centres: The case of Sonae Sierra. International Journal of Retail & Distribution Management, 39(6), 456–475.

Dholakia, U. M., Kahn, B. E., Reeves, R., Rindfleisch, A., Stewart, D. and Taylor, E. (2010). Consumer behavior in a multichannel, multimedia retailing environment. Journal of Interactive Marketing, 24(2), 86–95. Special Issue on Emerging Perspectives on Marketing in a Multichannel and Multimedia Retailing Environment.

Harris, R., Sleight, P. and Webber, R. (2005). Geodemographics, GIS, and Neighbourhood Targeting. Chichester: John Wiley and Sons.

Hart, C. and Laing, A. (2014). The consumer journey through the high street in the digital era. In Evolving High Streets: Resilience and Reinvention - Perspectives from Social Science, pp. 36–39. University of Southampton, Southampton.

Helsper, E. and Eynon, R. (2010). Digital natives: Where is the evidence? British Educational Research Journal, 36(3), 503–520.

Huff, D.L. (1964). Defining and estimating a trade area. Journal of Marketing, 28(3), 34–38.

Kaufmann, A., Lehner, P. and Todtling, F. (2003). Effects of the internet on the spatial structure of innovation networks. Information Economics and Policy, 15(3), 402–424.

Longley, P. A. and Singleton, A. D. (2009). Classification through consultation: Public views of the geography of the e-society. International Journal of Geographical Information Science, 23(6), 737–763.

Nathan, M. and Rosso, A. (2015). Mapping digital businesses with big data: Some early findings from the UK. Research Policy, 44(9), 1714–1733. The New Data Frontier.

Newing, A., Clarke, G. and Clarke, M. (2015). Developing and applying a disaggregated retail location model with extended retail demand estimations. Geographical Analysis, 47(3), 219–239.

Pendall, R., Foster, K. and Cowell, M. (2010). Resilience and regions: Building understanding of the metaphor. Cambridge Journal of Regions Economy and Society, 3(1), 71–84.

Prieger, J. E. and Hu, W. M. (2008). The broadband digital divide and the nexus of race, competition, and quality. Information Economics and Policy, 20(2), 150–167.

Rao, J. (2003). Small Area Estimation. Wiley series in survey methodology. Hoboken, NJ: John Wiley.

Reimers, V. and Clulow, V. (2009). Retail centres: It’s time to make them convenient. International Journal of Retail & Distribution Management, 37(7), 541–562.

Rice, R. E. and Katz, J. E. (2003). Comparing internet and mobile phone usage: Digital divides of usage, adoption, and dropouts. Telecommunications Policy, 27(8), 597–623.

Riddlesden, D. and Singleton, A. D. (2014). Broadband speed equity: A new digital divide? Applied Geography, 52(0), 25–33.

Ryan-Collins, J., Cox, E., Potts, R., and Squires, P. (2010). Re-imagining the High Street: Escape from Clone Town Britain. London: New Economics Foundation.

Silver, M. (2014). Socio-economic status over the lifecourse and internet use in older adulthood. Ageing and Society, 34, 1019–1034.

Singleton, A. D. and Dolega, L. (2015). The e-resilience of UK town centres. In Evolving High Streets: Resilience & Reinvention, Perspectives from Social Science, pp. 40–43. Economic and Social Research Council.

Singleton, A. D., Dolega, D., Riddlesden, D. and Longley, P. A. (2016). Measuring the spatial vulnerability of retail centres to online consumption through a framework of e-resilience. Geoforum, 69(1), 5–18.

Turner, J. and Gardner, T. (2014). Critical reflections on the decline of the UK high street: Exploratory conceptual research into the role of the service encounter. In Handbook of Research on Retailer-Consumer Relationship Development, pp. 127–151. Hershey, PA: IGI Global.

Verhoef, P. C., Kannan, P. and Inman, J. J. (2015). From multi-channel retailing to omni-channel retailing: Introduction to the special issue on multi-channel retailing. Journal of Retailing, 91(2), 174–181.

Warf, B. (2013). Contemporary digital divides in the United States. Tijdschrift voor economische en sociale geografie, 104(1), 1–17.

Warren, M. (2007). The digital vicious cycle: Links between social disadvantage and digital exclusion in rural areas. Telecommunications Policy, 31(6-7), 374–388.

Wilson, K. R., Wallin, J. S. and Reiser, C. (2003). Social stratification and the digital divide. Social Science Computer Review, 21(2), 133–143.

Wood, S. and Reynolds, J. (2012). Leveraging locational insights within retail store development? Assessing the use of location planners’ knowledge in retail marketing. Geoforum, 43(6), 1076–1087.

Wrigley, N. and Dolega, L. (2011). Resilience, fragility, and adaptation: New evidence on the performance of UK high streets during global economic crisis and its policy implications. Environment and Planning A, 43(10), 2337–2363.

Wrigley, N. and Lambiri, D. (2014). High Street Performance and Evolution: A Brief Guide to the Evidence. Technical report. Southampton: University of Southampton.

Acknowledgements

The authors would like to thank the Oxford Internet Institute for providing survey data and the Local Data Company Ltd for providing retail unit data for this research. The research was also funded by the Economic and Social Research Council, grant number ES/L003546/1.
8
Figure 8.1
The frequency of journeys by number of users.
[Chart: x-axis JOURNEY COUNT (ticks 1 to 60); y-axis NUMBER OF USERS (ticks 0 to 10,000).]
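Figure 8.1 bears on how many journeys a card needs before its holder is treated as a regular user. A minimal sketch of such a filter follows; the function name, the toy user identifiers and the cut-off of 10 journeys are assumptions for illustration, since the chapter does not state the threshold that was finally adopted.

```python
from collections import Counter

def regular_users(journey_user_ids, min_journeys):
    """Return the users with at least `min_journeys` recorded journeys.

    journey_user_ids: one user id per journey record (hypothetical layout).
    """
    counts = Counter(journey_user_ids)
    return {user for user, n in counts.items() if n >= min_journeys}

# Toy records: u1 has 42 journeys, u2 has 6, u3 has exactly 10.
records = ["u1"] * 42 + ["u2"] * 6 + ["u3"] * 10
kept = regular_users(records, min_journeys=10)
```

Raising `min_journeys` trims irregular users but also discards genuine commuters with short records, which is the trade-off the threshold has to balance.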
The volume of data from Oyster cards on the TfL network is extremely high; more than 80 percent of the 3 million journeys carried out each day on the network make use of Oyster cards (TfL, 2016). Although the Oyster card is used on multiple modes of transportation across London, 95% of all Oyster card usage is for London’s Underground and bus journeys (Gordon, 2012). One of the limitations identified in the TfL dataset is the incomplete recording of trip information for bus journeys. As TfL does not currently capture alighting information for its bus trips, bus journeys are often excluded from certain trip analyses. Such journeys, however, can be included with an enhancement of the model, where missing information is identified as a sub-step within the identification process.

In the following analysis, the sample data available comprised a total of 60 million journeys. Since the processing of such large volumes of data is so resource intensive, for the purpose of the study, the smart card records of 9,900 randomly selected TfL users were identified for further investigation. The sample contains a total of 1,823,906 complete journey records made by individual users for the months of October and November 2013.

8.3.1
Activity description

An understanding of human mobility requires the understanding of daily activities both at the individual and the aggregate level. The classification of this activity can be thought of as a two-step process. The first step is to use the temporal information within the commuting sequences to identify the periods of stay. A period of stay is characterised by two consecutive journeys to and from the same location. The time between these two journeys is significant as it is an indicator of the type of activity and can help discern activities from transit stops. The second step in the process is the classification of activities into predefined activity types. The activities are classified by means of their association with POI. Stay location, stay duration and time provide the spatial-temporal context of the activity. Also significant in the inference of the activity is the distance of POI to the transit station that is captured via smart card data. The combination of these factors could therefore explain the different characteristics of human movement. For example, people travel to work on a daily basis but only go to watch a concert at specific times. Therefore a short stay near a concert venue can only be an indicator of the activity ‘at the concert’ if it matches the temporal attributes of POI. This chapter only discusses the identification of home and work locations along with the work related activities.

Commuting patterns provide the ability to identify regular activities such as work, once a stay location has been identified
114 CONSUMER DATA RESEARCH: PART TWO
with a certain degree of confidence. Therefore it is important to ensure the regularity of the usage prior to making inference about such activities.

Figure 8.1 shows that for 20% of the users, the available journey count (defined as a numeric parameter based on the regularity of usage) is less than 10 journeys, which is too low to carry out any meaningful analysis. The right balance is, therefore, required in the selection of threshold; a value too small will include a large number of irregular users in the dataset, and a threshold too high will be too restrictive and leave out regular users from the analysis (Hasan et al, 2012).

Table 8.1 provides examples of primary and secondary activities that are linked to work (W) and home (H) locations. It is assumed that an individual can have one or more locations classified as home locations if they fit the criteria defined for the home location identification. Similarly, one or more locations can be classified as work locations. Identification of secondary activities is not within the scope of this analysis.

Table 8.1
A variety of travel-based activities can be identified.

Primary Activities     Home (H) and Work (W) related activities    H to W, W to H      Work, offices, universities, schools, college, etc.
Secondary Activities   After Work (W) Activities                   W to X2, X2 to H    Pub, dinner, shopping
                       Midday/Work (W) Activities                  W to X3, X3 to W    Lunch, shopping

Irregular activities are more challenging to model, so for the purpose of this chapter they are defined as those activities that fall outside of a regular commuter journey’s key user locations. User activities that are irregular in nature are more challenging to model. In such a scenario the spatial-temporal aspects of the trip alone do not provide significant insights into user behaviour. At the same time, spatial attributes of visited locations can provide clues for activity classification.

8.3.2
Individual mobility as a sequence of activities

Activities are representations of users’ changing presence in both space and time. Some of the activities are captured by means of the smart card data whilst others lack a digital footprint. Importantly, the states that are visible provide clues about the states that are unobserved. Figure 8.2 illustrates a simple activity sequence for an Oyster card user. The individual carries out a morning commute between 08:00 and 09:00 from a home location to a work location. The same individual uses the network for their work to home commute between 17:00 and 18:00. Observed activities for this sequence are the journeys carried out by means of the Oyster card, whilst the hidden event is the work activity.
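The first step of the two-step process, recovering the hidden stay between two observed journeys, can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's implementation: the `Journey` and `Stay` records, their field names and the station labels are hypothetical, and each journey is assumed to carry both a tap-in and a tap-out time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Journey:
    origin: str
    destination: str
    start: datetime   # tap-in time
    end: datetime     # tap-out time

@dataclass
class Stay:
    station: str
    arrival: datetime
    departure: datetime

    @property
    def duration(self) -> timedelta:
        return self.departure - self.arrival

def extract_stays(journeys):
    """Return one stay for each pair of consecutive journeys where the
    first journey ends at the station from which the next one starts."""
    ordered = sorted(journeys, key=lambda j: j.start)
    stays = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.destination == nxt.origin:
            stays.append(Stay(prev.destination, prev.end, nxt.start))
    return stays

# The simple sequence of Figure 8.2: out at 08:00-09:00, back at 17:00-18:00.
day = datetime(2013, 10, 7)
journeys = [
    Journey("Home station", "Work station", day.replace(hour=8), day.replace(hour=9)),
    Journey("Work station", "Home station", day.replace(hour=17), day.replace(hour=18)),
]
stays = extract_stays(journeys)
```

The single stay recovered here lasts eight hours at the work station; its duration and time of day are the inputs the second, classification step would use.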
8. Smart Card Data and Human Mobility 115
Figure 8.2
Sample activity sequence (simple).

Figure 8.3
Sample activity sequence (complex).

The methodology described in this work explains the identification and labelling of the activities based on POIs, for example home/work and variables such as stay duration and visit frequency.

Similarly, a more complex activity sequence is illustrated in Figure 8.3 where the home to work commute is interrupted by a short stay activity. Although the information captured within the smart card data is not sufficient to infer the exact nature of the short stay visit, the timing and location of the activities can be useful in assigning an appropriate description for such activities, for example day-care visits and school drop-offs.

8.3.3
Human mobility pattern identification

The heuristics described in this section consider the visit frequency (number of times a specific user visits a location) and stay duration (the duration between consecutive journeys) as parameters in the model to identify home and work locations. In order to explore mobility patterns of urban commuters, the data are grouped by frequency of use. An important characteristic of the temporal patterns of urban human mobility is the stay time duration. It describes the activities between locations gathered from smart card data. It serves to identify the stay duration between consecutive journeys and enables the identification of work location.

8.3.3.1
Home location

The most frequently used station is a key marker in the identification of home stations. For a large majority of users, the station for the first and last journey of the day is an indicator of a home location. In this work, an algorithm has been devised based on the frequency of the most frequently used stations coupled with the temporal information of the journey to classify home stations for users.

The origin station of the first journey of the day and the destination station of the last journey of the day are taken as the POI-classified home location. Selected stations for the available number of days are then further analysed, and if the selected station count fits the criteria required in the algorithm, the station is selected as a home location. It is possible to have more than one station classified as a user’s home station if the station fits the criteria. This type of behaviour could be due to a number of reasons, such as a user having multiple home locations, service degradation on the
Figure 8.4
The results of home locations around TfL and National Rail stations in London.
[Map: Home Stations; labelled stations include Walthamstow Central, Hammersmith, Bromley and Sutton; legend Count 1 - 40, 41 - 132, 133 - 541; London Borough boundaries; scale bar 10 km.]

TfL network, or other personal journey choices. Similarly, it is also possible that the algorithm fails to highlight any station as home station if no location meets the expected criteria.

In Figure 8.4, home locations are spread evenly across London’s outer boroughs with the City of London. Some commuter belt boroughs such as Sutton and Bromley are not well represented in the results. The users are still able to travel to and from these locations using National Rail infrastructure, but the journeys are not as accurately captured with Oyster cards. Therefore these stations do not feature significantly as major home locations.

Figure 8.4 also shows that outer London stations are represented by smaller points in comparison to the inner London stations. This is representative of the high population density in the inner city. Similarly, some central London interchange stations such as Hammersmith, King’s Cross and St Pancras station feature heavily as home locations in the results. This is because the final legs of many journeys to other cities and outer London stations are not captured by the Oyster network, hence the last station captured has been inaccurately marked as a home station. This is one significant limitation of the available data. To mitigate this, large transit stations can be excluded or additional rules for the identification of home locations can be devised. It is possible to add rules, such as frequency of weekend use, to the home location algorithm in order to identify such users.
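The home-location heuristic (first origin and last destination of each day, kept only if the station recurs often enough) can be sketched as below. The chapter does not publish its selection criteria, so the `min_share` parameter, its default of 0.5 and the tuple layout of the journey records are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def home_stations(journeys, min_share=0.5):
    """journeys: (day, start_time, origin, destination) tuples for ONE user.

    A station gets one vote per day on which it is the origin of the first
    journey or the destination of the last journey. Stations voted for on at
    least `min_share` of the observed days are returned (possibly none,
    possibly several).
    """
    by_day = defaultdict(list)
    for day, start, origin, destination in journeys:
        by_day[day].append((start, origin, destination))

    votes = Counter()
    for trips in by_day.values():
        trips.sort()  # order the day's journeys by start time
        first_origin, last_destination = trips[0][1], trips[-1][2]
        votes.update({first_origin, last_destination})  # one vote per day

    n_days = len(by_day)
    return sorted(s for s, n in votes.items() if n / n_days >= min_share)

trips = [
    ("2013-10-07", "08:00", "Walthamstow Central", "Canary Wharf"),
    ("2013-10-07", "17:30", "Canary Wharf", "Walthamstow Central"),
    ("2013-10-08", "08:05", "Walthamstow Central", "Canary Wharf"),
    ("2013-10-08", "18:00", "Canary Wharf", "Hammersmith"),
    ("2013-10-09", "07:55", "Walthamstow Central", "Canary Wharf"),
    ("2013-10-09", "17:40", "Canary Wharf", "Walthamstow Central"),
]
strict = home_stations(trips)                 # default share of 0.5
lenient = home_stations(trips, min_share=0.3)
```

With the stricter share only one station qualifies; lowering it admits a second station, which mirrors the case of a user with more than one classified home station.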
[Map: Work Stations; labelled stations include Wembley Park, Canary Wharf, Ealing Broadway, Richmond, Brixton and Wimbledon; legend Count 1 - 40, 41 - 132, 133 - 541; London Borough boundaries; scale bar 10 km.]
financial and commercial services around the City of London. This is due to the close proximity of these regions to the financial districts of the City of London and Canary Wharf.

Some locations such as Ealing Broadway, Wembley, Wimbledon, Richmond and Brixton outside of central London have also been identified as work locations. These locations are also examples of commercial centres outside of central London.

8.3.4
Validation

The results of identification of home and work locations were compared with the LTDS. LTDS data capture, among other attributes, information about home and work locations of the individuals (TfL, 2011). This makes LTDS data invaluable as it provides a source of validation for travel pattern algorithms.

The results were compared with the LTDS dataset: for 82% of users, the home location identified by the algorithm matched the LTDS data at the level of postcode district. For work locations, 60% were correctly identified.

The accuracy of the comparison relies heavily upon the correctness of the user data captured through the surveys. Any errors in the data gathering and entry would adversely impact the reliability of the comparison.

8.4
Prospects for understanding mobility

To present the complete picture of an individual’s mobility, so-called ‘secondary activities’ need to be identified. In this context, secondary activities include all activities which last longer than the standard transit stops but are shorter than the presumed work activities. These are particularly challenging since they may not have obvious recurring travel patterns; for example, a monthly theatre visit might fall into this category. With this in mind, approaches based on continuous learning from the data hold promise. For example, machine learning algorithms can recognise patterns in data and construct new rules dynamically (Ethem, 2004). Examples of machine learning in transportation include insurance premium calculations based on the driving patterns of individuals and tracking congestion. The most talked about of all the applications of machine learning in transportation is perhaps self-driving vehicles. According to research by Business Insider, there will be 10 million self-driving cars on the roads by 2020 (Gerage, 2017). The technology behind these relies upon sensors, which collect data from the surrounding environment and objects, such as size and speed. The task of machine learning algorithms is the continuous interpretation of these data in order to classify objects as pedestrians, cyclists or other cars and objects as well as the forecasting of their movements (Gates, 2017; Anil, 2017).

The identification of activities can be described as a classification problem in the context of machine learning. The activities that have been extracted based on the stay duration of individuals at a location need to be classified into one of the categories, for example, weekend social visits or weekday shopping trips. With respect to smart card data, one of the inherent challenges is the unavailability of labelled data that can be used to train the classifiers. In order to address this, a number of options can be considered to generate labelled datasets:

Expert labelling: This can be done with careful analysis of information, for example, the day of the week, time of the day, attributes of the locations (shopping centre, entertainment hub, residential area, and sports venue). Expert labelling of the activities relies on the intuition of the researcher to evaluate the available information about the activity and assign a suitable classification to the activity, e.g.
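Expert labelling of the kind described above amounts to writing the researcher's intuition down as explicit rules. The sketch below is illustrative only: the rule set, the 15-minute transit cut-off and the function signature are assumptions, while the two activity labels and the location attributes are the examples named in the text.

```python
def label_activity(weekday, duration_hours, poi_type):
    """Assign a label to one stay using hand-written rules.

    weekday: 0 = Monday ... 6 = Sunday.
    duration_hours: length of the stay.
    poi_type: attribute of the nearby location, e.g. "shopping centre".
    """
    if duration_hours < 0.25:  # hypothetical cut-off for a transit stop
        return "transit stop"
    is_weekend = weekday >= 5
    if poi_type == "shopping centre" and not is_weekend:
        return "weekday shopping trip"
    if is_weekend and poi_type in {"entertainment hub", "residential area"}:
        return "weekend social visit"
    return "unlabelled"  # left for the researcher to inspect by hand

saturday_evening = label_activity(5, 2.0, "entertainment hub")
wednesday_lunch = label_activity(2, 1.0, "shopping centre")
```

Stays that fall through to "unlabelled" are the ones where rules alone are not enough and further information is needed.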
app, and recruitment of volunteers. Labelled test data, combined with the individual user journey records from the smart card data, provide the two pieces of information necessary to classify the user activities.

Smart card data provide a rich and detailed window into activity patterns through their ability to capture vast quantities

Home and work locations are important anchors since the majority of journey activities revolve around these. A better identification of these locations would provide a more effective classification of the activities of individual users. The heuristic approach to human mobility proposed in this chapter has the potential to improve our understanding of wider

Gordon, J. B. (2012). Intermodal Passenger Flows on London’s Public Transport Network. MIT Press.

Hasan, S. et al (2012). Spatiotemporal patterns of urban human mobility. Journal of Statistical Physics, 151, 304–318.

TfL (2011). London Travel Demand Survey. Online: www.clocs.org.uk/wp-content/uploads/2014/05/london-travel-demand-survey-2011.pdf.

We are grateful to Transport for London (TfL) for provision of the experimental data for this research. The first author’s PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.
9
and offer real-time updates, typically aggregated to half hourly intervals. The data source can be considered large in volume since each household with dual fuel smart meters annually generates around 17,520 readings indicative of the correspondence between household characteristics and residential property attributes. There are thus immediate advantages to the use of smart meters in research with a focus upon consumer behaviour and energy policy. However, as these data are new to industry analysts and the research community alike, this chapter focuses on the interplay of issues of content and coverage of the data in the analysis.

Smart meters present novel opportunities for small-area population analysis when triangulated with the 2011 UK Census of Population (Anderson et al, 2017). In this chapter, we dedicate more attention to gas data as being a more direct indicator of household activities. Regardless, for understanding of behavioural patterns for both sources, be it gas/electricity energy expenditure or real-time energy consumption, these consumer data play a vital role for policy-making and regulation.

While offering greater precision in understanding the differences in behaviour over time by households, smart meter data also create further challenges when it comes to the generalisation of temporal profiles. How do we identify the average or expected energy consumption profile? Should this bring focus to the activity patterns of consuming households, their properties, or neighbourhood setting? ‘Variability in residential consumption reported in the literature suggests that there is hardly a “typical” level of consumption for any energy end-use’ (Lutzenhiser, 1993, 249). We address this by looking at both aggregate and disaggregated patterns of energy consumption.

We first provide a description of the available data within the context of the UK energy sector, briefly looking at the advantages and limitations posed by the introduction of smart meter data. We revise the methods for segmenting energy data with an example of the results. Finally, we build upon these preliminary investigations to set a research agenda for linking the smart meter data to other administrative and open datasets in order to better understand consumer behaviour in this domain. Using a case study of Bristol, we investigate the possibilities that may be available using UK Census data. The chapter is concluded with a concise discussion of issues for future research.

9.2
The UK energy sector and smart meter roll out

The UK energy sector is regulated by the Department for Business, Energy and Industrial Strategy (BEIS) and the Office of Gas and Electricity Markets (OFGEM). On the supply side, there are currently 12 large and 46 small energy companies. The market share is monitored by OFGEM and assessed on the basis of how many electricity meters are installed on the distributional network by a supplier. As of late 2016, British Gas was the largest provider with 23% share of the market, and Scottish and Southern Electricity (SSE) and E.ON the second and third largest providers with 15% and 14% share respectively.

The UK Government aims to ensure that by 2020 every domestic and non-domestic property will have been offered a smart meter. The regulatory environment encourages providers to roll out smart meters as quickly as possible to meet the obligation of complete installation by 2020. By the first quarter of 2017 there was a total of 6.78 million smart meters installed by energy suppliers across residential and business addresses in the UK, of which six million had been installed in domestic properties by the ‘Big Six’ energy providers. Electricity meters account for more than half of the total of these installations due to wider availability over gas. BEIS (2017)
9. Interpreting Smart Meter Data of Uk Domestic Energy Consumers 123
[Figure 9.1 chart: series Electricity and Gas; y-axis each trimester, 0 to 600,000; x-axis Jan-Mar, Apr-Jun, Jul-Sept, Oct-Dec.]
Table 9.2
The average annual household energy consumption compared to BEIS 2015 national estimates. As observed, our estimates are slightly lower than official statistics. This may be an indication of further bias in our dataset.

Figure 9.2
Smart electricity and gas meters by postcode sector at the end of December, 2015. These maps show the distribution of smart meters across Great Britain with the West Midlands and North West regions having the highest frequencies of meters per postcode sector.
[Map legend: Count 1 - 24, 25 - 47, 48 - 71, 72 - 97, 98 - 127, 128 - 162, 163 - 205, 206 - 268, 269 - 393, No data; scale bar 0 to 100 km.]
Table 9.3
Breakdown of smart gas and electricity meters by region.

year coverage (Figure 9.1). Between April and September, less than 5% of the total were enrolled. Finally, in December around 50,000 users were added bringing the total to 600,000 users with a smart meter. We may conclude that the rollout of the electricity meters is gathering momentum. This was also confirmed by BEIS (2017). A breakdown for the rollout by trimester is shown in Figure 9.1.

In 2015, 600,000 electricity smart meter users consumed 1,200 GWh, representing just 1.1% of the total domestic electricity consumption in Great Britain for that year. For gas, 480,000 users consumed 4,000 GWh, accounting for 1.3% of the total domestic gas consumption in Great Britain (2015). Basic centrality measures around individual consumption are shown in Table 9.2.

The geographical distribution of meters (Figure 9.2) is slightly biased towards the North West and West Midlands regions, for both electricity and gas, where almost 30% of the smart meters are installed. In contrast, Wales and North East regions are deeply under-represented, accounting for only 8% of the total of smart meters (Table 9.3).

The data for 2015 represent the early stages of smart meter roll out. One of the potential sources of bias associated with this is the fact that the first properties to receive a smart meter were those with old energy meters. Another possible bias might arise from the fact that the first households to receive an installation were more likely to be at home during the campaign: this may skew the customer representativeness slightly towards the elderly and families. To test these ideas, we compare the distribution of property build period by region (generated by the Valuation Office Agency) with the total number of smart meters installed, particularly for the 1965 to 1972 period (Table 9.4). Although the North West region scores high in both measures and Wales scores low in both, this test is by no means conclusive and the
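The 2015 totals quoted on this page (600,000 electricity users consuming 1,200 GWh; 480,000 gas users consuming 4,000 GWh) imply simple per-user means, worked through below. This is back-of-envelope arithmetic, not a reproduction of Table 9.2: many meters were installed part-way through the year, so these means may not correspond to full-year household consumption. The helper name is an assumption.

```python
def mean_kwh_per_user(total_gwh, users):
    """Mean consumption per user in kWh (1 GWh = 1,000,000 kWh)."""
    return total_gwh * 1_000_000 / users

electricity_kwh = mean_kwh_per_user(1_200, 600_000)
gas_kwh = mean_kwh_per_user(4_000, 480_000)

# National totals implied by the quoted shares (1.1% and 1.3%), in GWh.
implied_electricity_total_gwh = 1_200 / 0.011
implied_gas_total_gwh = 4_000 / 0.013
```

Under these figures the electricity mean is 2,000 kWh per user and the gas mean is a little over 8,300 kWh per user.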
Table 9.4
Meters installed and properties built between 1965-1972. The correlation between number of meters installed and properties built in this period is 0.35, indicating a weak positive correlation.

Region              Electricity Meters   Properties Built 1965 to 1972
South East                      57,100                         430,340
North West                      96,100                         330,780
East of England                 47,000                         314,640
West Midlands                   79,200                         292,610
South West                      41,100                         259,040
Yorkshire-Humber                58,000                         238,860
London                          65,900                         224,600
East Midlands                   48,600                         211,720
Wales                           28,100                         135,900
North East                      24,700                         132,820
Scotland                        53,300                              NA
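The correlation quoted with Table 9.4 can be checked against the tabulated values. The sketch below computes a plain Pearson coefficient over the ten regions that have both figures, leaving out Scotland because its build-period count is NA; whether the authors used this exact measure or these exact units is an assumption.

```python
from math import sqrt

# Values copied from Table 9.4, South East to North East (Scotland omitted).
meters = [57_100, 96_100, 47_000, 79_200, 41_100,
          58_000, 65_900, 48_600, 28_100, 24_700]
built = [430_340, 330_780, 314_640, 292_610, 259_040,
         238_860, 224_600, 211_720, 135_900, 132_820]

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson(meters, built)
r_squared = r ** 2
```

On these regional totals the coefficient comes out at roughly 0.6 and its square at roughly 0.36. The 0.35 in the caption may therefore be a squared coefficient, or may have been computed on different units; the source does not say which.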
[Map legend: Proportion of Households 0.1 - 1.2%, 1.3 - 2.1%, 2.2 - 3.1%, 3.2 - 4.6%, 4.7 - 10.1%, 10.2 - 21.4%, No data; scale bar 0 to 100 km.]
Table 9.5
Resulting national clusters for annual half hour aggregates at postcode sector level.

Cluster name   % of total
Cluster 1             51%
Cluster 2              1%
Cluster 3             38%
Cluster 4             10%
Figure 9.4
Clusters derived from annual aggregates at postcode sector level. The whisker box plots represent the median energy consumption and the variation within four quantiles. Postcode sector differences in aggregated consumption are based mainly on the variation around expected morning and evening peaks consumption.
[Chart: four panels, Cluster 1 to Cluster 4; x-axis Time (half-hourly, 00.00 to 23.30); y-axis Wh (0 to 5000).]

To date, a number of methods have been developed for clustering consumer data. The majority are associated with a reliable performance on static data only, while disregarding the sequential links between variables. This poses further challenges if we are to consider spatial and temporal dimensions. One of the immediate solutions could be to transform dynamic data into static format. For example, we may calculate the mean for each of the individuals and create a numerical indicator that represents an estimate of average consumption for the individuals in our sample. The decision on the method is thus broadly driven by the data characteristics that include: discrete versus real values, uniformity of the sample,

K-means clustering is the most popular approach due to its simplicity and fast minimisation of the dissimilarities among the objects within each cluster. It may be suitable for datasets with static features. For highly variable temporal variables, the assignment of the cluster may be highly unstable as different customers will be assigned to a different cluster subject to the day and time.

Alternatively, a Gaussian Mixture Model based on a probabilistic setting for the clustering may be proposed. Such a setting brings about the ability to handle diverse types of data, including dealing with missing or unobservable data that may have contributed to variation differences
Table 9.6
Resulting national clusters for annual half hour aggregates at postcode sector level after exclusion of outlier group observations.

Cluster name   % of total
Cluster 1             30%
Cluster 2             13%
Cluster 3              7%
Cluster 4              1%
Cluster 5              2%
Cluster 6              3%
Cluster 7             19%
Cluster 8             18%
Cluster 9              7%
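The "% of total" columns in Tables 9.5 and 9.6 are cluster shares. As a rough illustration of how k-means assigns consumption profiles and how such shares follow, the sketch below runs plain Lloyd's iterations on toy four-slot profiles. The profile values, the fixed initial centres and the function names are assumptions; the chapter's own profiles have 48 half-hourly slots and its clustering setup is not reproduced here.

```python
from collections import Counter

def kmeans(profiles, centres, iterations=20):
    """Lloyd's algorithm with squared Euclidean distance.

    centres: initial cluster centres, fixed by the caller so that the run
    is reproducible (k-means results depend on the starting centres).
    """
    centres = [list(c) for c in centres]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centre for every profile.
        labels = [
            min(range(len(centres)),
                key=lambda k: sum((v - c) ** 2 for v, c in zip(p, centres[k])))
            for p in profiles
        ]
        # Update step: each centre becomes the mean of its members.
        for k in range(len(centres)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # keep the old centre if a cluster empties
                centres[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

def cluster_shares(labels):
    """Percentage of observations falling in each cluster."""
    counts = Counter(labels)
    return {k: 100 * n / len(labels) for k, n in sorted(counts.items())}

# Toy profiles in Wh: (night, morning, midday, evening).
profiles = [
    [100, 400, 200, 500],
    [120, 420, 180, 520],
    [90, 380, 210, 480],
    [900, 1500, 1200, 1800],
]
labels, centres = kmeans(profiles, centres=[profiles[0], profiles[3]])
shares = cluster_shares(labels)
```

Here three low, stable profiles form one cluster and a single high, variable profile forms the other, a small-scale analogue of a dominant cluster alongside a tiny high-consumption group.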
among segmented groups. This is achieved by assigning a probability measure to the cluster. Where uncertainty about the assignment is greater, additional variables may be introduced or the individual may be treated as an outlier or uncertain group. Unlike k-means, it produces stable results and selects the number of clusters using smoothing. This is also convenient for the matter of replication as clustering results remain the same regardless of how many times we run the algorithm. Further research may implement clustering by dynamics, for example through grouping graphical models (please see further reading for more details).

Some immediate results of clustering for gas consumption are presented in Table 9.5 and Figure 9.4. As we note, cluster 2 represents very high and variable behaviour yet represents a very limited part of the sample. Cluster 1, in contrast, represents half of the national sample variation at aggregated level. We may conclude that, on average, gas energy consumption across Great Britain does not vary significantly and there is a tendency for highly stable consumption through the day with peak hours falling into intervals of 06:00 – 08:30 and 16:00 – 20:00. Customers in postcode sectors that fall in cluster 1 are more likely to consume during the peak hours with a lower tendency to consume at night in comparison with cluster 4, for example, where there is a greater propensity to use gas both overnight and throughout the day.

To look deeper into the variation in energy profiles across postcode sectors we removed the data that fall into cluster 2, treating it as an outlier group. A number of factors may have contributed to the unusually high variation captured in this cluster: non-domestic properties may be mistakenly occurring in the sample, or multiple occupations may be associated with a single smart meter address (e.g. student halls). What we observe is that by excluding highly variable observations, the algorithm can differentiate more variability, and after subtraction of these outliers it gives rise to nine national clusters. From Table 9.6 and Figure 9.5 we observe once again that there is a clear tendency for morning and evening peaks to be similar across profiles. Additionally, on average the consumption levels stay at the limit of 2,500 Wh per half hour across the clustered profiles. However, what we are picking up is clusters of really low consumption (clusters 1 and 7) compared to high and variable groups (clusters 3, 4 and 6) that are defined by the variability around night time consumption, early mornings and outside peak hours.

9.5.2
Usage during off-peak hours

As a further extension to this analysis we segment the temporal analysis of energy data in terms of peak hours as they have been shown to be important for the definition of the clusters. Figure 9.5 suggests that
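The probabilistic assignment described for the Gaussian Mixture Model, including the option of treating weakly assigned observations as an uncertain group, can be sketched in one dimension as follows. The component weights, means and standard deviations are fixed by hand and the 0.8 confidence cut-off is hypothetical; a fitted mixture would estimate the parameters from the data, typically by expectation-maximisation.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

def responsibilities(x, components):
    """Posterior probability that x belongs to each mixture component.

    components: list of (weight, mean, sd) tuples.
    """
    weighted = [w * normal_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

def assign(x, components, min_confidence=0.8):
    """Index of the most probable component, or None when no component
    reaches `min_confidence` (the 'uncertain' group)."""
    probs = responsibilities(x, components)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= min_confidence else None

# Hypothetical mixture over mean half-hourly consumption in Wh:
# a large low-consumption group and a smaller, more variable high group.
components = [(0.7, 500.0, 100.0), (0.3, 1500.0, 300.0)]
low = assign(450, components)
high = assign(1600, components)
uncertain = assign(800, components)
```

An observation near either centre is assigned outright, while one lying between the two groups receives split probabilities and is flagged rather than forced into a cluster.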
Figure 9.5
[Chart: nine panels, Cluster 1 to Cluster 9; x-axis Time (half-hourly, 00.00 to 23.30); y-axis ticks 0 to 5000. Of the caption, only the fragment ‘demonstrate increased’ survives.]
Figure 9.6
Clustering of ‘off-peak’ hours data. Here we assess consumption levels that are represented as the times between 11:30 and 15:00. Note: Grey areas represent no data; data provided for areas with complete annual coverage only.
[Map labels: Edinburgh, Leeds, Liverpool, Cardiff, London. Chart: four panels, Cluster 1 to Cluster 4; x-axis Time (half-hourly, 11.30 to 15.00); y-axis Wh (0 to 5000).]
UKDS and BEIS | Energy Performance Certificates (EPC) | 2005-2015 | Region or address level | N/A
Sample is sufficiently large and covers over 15 million addresses in the UK. Data contain information on energy efficiency bands and variables such as age, type of property, floor area, annual gas and electricity consumption as fuel poverty indicators. Dates of records vary because, according to regulations, assessments should be undertaken when conditions change, e.g., when a property is rented.
epc.opendatacommunities.org

CDRC | House Ages and Prices | 1989-2015 | LSOA and MSOA | ONS (quarterly) and VOA (annual)
The data were collected originally by ONS and VOA. The dwelling age counts are at LSOA level and median house prices are at MSOA level.
data.cdrc.ac.uk

Office for National Statistics (ONS) | Census 2011 statistic | 2011 | Census OA | Decennial
These data offer a fairly detailed description of all households and properties in the UK. Useful variables could include household size, employment characteristics, dwelling age, country of origin and others. However, the data take no account of recent (<10 year) temporal variations and may contain missing data.
www.ons.gov.uk/census/2011census
Table 9.7
Openly-available
datasets that may aid the
understanding of variation
in energy consumption.
9. Interpreting Smart Meter Data of UK Domestic Energy Consumers
Various datasets available for linkage with energy consumption are presented in Table 9.7, where we list the sources of additional data, the geographical and temporal references and the time period that these datasets cover. A number of limitations should be addressed when considering these data sources. Firstly, the vast majority of data on the population are only available in aggregate geographic units. However, some data are available at the household level, which will enable us to link them to specific trends identified by individual smart meters. One such example is the Energy Performance Certificate (EPC) data, which contain energy performance related variables for over 15.6 million households.

9.5.4
Bristol energy consumption and UK Census Output Area Classification

In this section, we present an analysis of the region around Bristol where energy consumption in 2014 is linked to Census Output Area (OA) geography. We tested the ability of some common predictors of energy consumption, such as property size and life-stage of energy customers combined in a single indicator, Census Output Area Classification (OAC). Based

Before looking at links between socio-demographic characteristics and consumption, we clustered temporal profiles in Bristol in a similar fashion as we did for the national dataset; the only difference is that we are now looking at a much finer OA level geography. The resulting clusters are presented in Table 9.8 and Figure 9.7. The number of distinct clusters is smaller than that of the national dataset. Nevertheless, some immediate correspondence with the clusters that we defined previously can be noted for clusters 1 and 2. The consumption in Bristol is observed to be differentiated by both peak hours and throughout-the-day patterns. Most of the OA aggregates are associated with very low or no consumption during night time. This may suggest that the variability of energy consumption at a finer geographical level may not necessarily be less representative than in the large sample. Further to this, it may help us in filling the gaps where data are missing by defining some common energy behavioural patterns that are more frequent in each of the areas in Great Britain, or as we call them, typical profiles.

Despite defining some clear relationships between household characteristics and
[Figure 9.7: consumption profiles for Cluster 1, Cluster 2 and Cluster 3, OA level, Bristol; x-axis: Time; y-axis: Wh (0 to 3000).]
[Figure 9.8 chart: y-axis: Gas consumption (kWh) outside peak hours (0 to 60); x-axis categories: 1a 1b 1c 2a 2b 2c 2d 3a 3b 3c 3d 4a 4b 4c 5a 5b 6a 6b 7a 7b 7c 7d 8a 8b 8c 8d.]
energy use, the current research suggests that over recent years energy consumption has tended to be shaped more significantly by consumer habits and lifestyle than by household size or dwelling type. Capturing this is challenging, yet the behaviour patterns clustered together may perhaps give us more direction in quantifying the lifestyles of energy customers.

Figure 9.8 presents some preliminary results using OAC to define the links between household characteristics and energy consumption outside peak hours as a potential proxy for households staying at home. It is important to note that the roll-out of the smart meter, as outlined earlier, may bias these results, as it was suggested that in 2014-2015 people who were likely to be present at home were among the first to receive a smart meter. In the case of Bristol, we observe that around 20% of the sample falls into the category of ‘Urban Professionals and Families’, which is quite contrary to the suggestion that the composition of smart meter data may be skewed towards elderly people. Besides, a number of factors that may be associated with consumption outside peak hours need further validation. To name but a few, the households present at home may be those who work from home, are retired or are carers.

Figure 9.8
An attempt to use the 2011 Census Output Area Classification (OAC) to explain the variation outside peak hours on a regular winter’s day on a sample of 1,105 meters in Bristol. Note: Interestingly, even while the number of meters representing each group is relatively small, we can identify students, rural tenants, families and the elderly as those most likely to consume outside peak hours.

9.6
Conclusion

Research domains that investigate energy consumption range from engineering and informatics to economics and political science. The complexities of investigating energy consumption motivate the development of new research methodologies to cope with the diversity of energy data available. The method we considered in this chapter, the Gaussian Mixture Model, tends to work in a quite stable fashion for the different sets of data, meaning that no matter how many times we implement the algorithm the result will hold. Further research should consider
136 CONSUMER DATA RESEARCH: PART TWO
Further Reading

Anderson, B., Lin, S., Newing, A., Bahaj, A. and James, P. (2017). Electricity consumption and household characteristics: Implications for census-taking in a smart metered future. Computers, Environment and Urban Systems, 63, 58-67.

BEIS. (2016). Sub-National Electricity and Gas Consumption Statistics. Regional, Local Authority, Middle and Lower Layer Super Output Area. Report, December 2016.

BEIS. (2017). Smart meters, Great Britain. Quarterly report, March 2017.

Cao, H.-A., Beckel, C. and Staake, T. (2013). Are domestic load profiles stable over time? An attempt to identify target households for demand side management campaigns. In Industrial Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE. IEEE, pp.4733–4738.

Silipo, R. and Winters, P. (2013). Big data, smart energy, and predictive analytics. Time Series Prediction of Smart Energy Data, 1, 37.

Swan, L. G. and Ugursal, V. I. (2009). Modeling of end-use energy consumption in the residential sector: A review of modeling techniques. Renewable and Sustainable Energy Reviews, 13(8), 1819–1835.

Acknowledgements

The authors are grateful to the ‘Domestic Energy Provider’ for providing smart meter data for this research. The first author’s PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.
NEW APPLICATIONS
AND DATA LINKAGE
10
Geovisualisation of
Consumer Data
Oliver O’Brien and James Cheshire
detail, but their effective concatenation, conflation and synthesis are far from unproblematic. There is also the potential and need for more sophisticated visualisation, in the form of visual analytics, of these kinds of data both for exploratory analysis and also for the communication of results. This provides the chapter’s focus.

During the past decade the acceleration in the development and uptake of web-mapping technologies has led to a proliferation of highly advanced mapping interfaces. These are now routinely accessed across a full range of platforms – from mobile phones to desktop computers – and have expanded from navigational devices to key forms of information visualisation. This trend has been facilitated by a move away from serving image tiles to users and towards the use of vector tiles, where the web browser effectively performs the geographic information system (GIS) operations previously undertaken on the website’s servers at source. Such tiles have the advantage of being generated as and when required, which enables the inclusion of real-time data or rapid updates.

As the technology and data for map creation become more complex, a dichotomy is emerging in the skillsets required by potential users. Web maps can now be immediately accessed and utilised with limited prior experience, but the data used to create them require more programming skills as spreadsheets give way to databases. This bifurcation, in part, fuels the data science industry as it seeks to meet the increasing demand for innovative visualisations of new data to be served to a large number of non-specialist users. Companies such as Mapbox and CARTO all offer demographic data maps, alongside Esri, the leader in this sector for the past three decades.

The Consumer Data Research Centre (CDRC) is an academic initiative that has entered this space and developed a bespoke mapping platform called CDRC Maps. Users can access maps generated from millions of data points depicting a range of data from deprivation through to Internet usage, with links to the raw data for use in their own analysis. A key motivation for creating the platform was the desire to share data from the CDRC and a recognition that online data repositories have limited effectiveness for users seeking to browse datasets or for raising awareness of particular data. As will be discussed below, CDRC Maps is coupled with the centre’s portal CDRC Data in order that users can download the raw data they have seen mapped if they wish to undertake more in-depth analysis. This has proved very useful to analysts in local and national government, in addition to the commercial and third sector, who lack the budget and skills to produce their own maps from complex data but also wish to explore relevant subsets of larger datasets.

10.2
Web mapping

The earliest web maps were created in 1993 and became more ubiquitous in the early 2000s when the growth of real-time geographic services such as mapping, routing and location-based advertising really took off, most notably in 2005 with Google’s release of Google Maps. First-generation applications provided only unidirectional flows of data and information from websites to their user bases. Over time, this system evolved into services that facilitate bi-directional collaboration between users and sites, the outcome of which is that information is collated and made available to others. The two main technologies that stimulated this development were Asynchronous JavaScript And XML (AJAX) and Application Programming Interfaces (APIs). AJAX enabled the development of websites that retain the look and feel of desktop applications, while APIs defined and documented consistent ways of accessing assets and tools created by other projects.
They have improved the usability of Web mapping significantly by enabling direct manipulation of map data where user interactions (such as ‘click and drag’) are visualised instantaneously.

Early versions of web maps were detached from the underlying data used to create them since in all cases the developer was required to pre-render the maps before loading onto the server – users were given access to images only. As web browsers have become more powerful and base-mapping data has become freely available through initiatives such as OpenStreetMap and government open data platforms, for example the London Data Store, image tiles have been superseded by vector-based systems. These offer the key advantage that the maps are rendered on demand, that is, they are generated at the time they need to be viewed, from data served via a series of database requests. As we demonstrate below, this enables a much greater amount of flexibility both in terms of reported statistics and the cartographic representation. In addition, maps can act as platforms for data download to the point that they can now be thought of as data services rather than simply static representations of a single dataset.

10.2.1
CensusProfiler

One of the first comprehensive web maps of population data was constructed from the UK’s 2001 Census data and called CensusProfiler. It was one of the first to offer panning and zooming controls, a revolution in comparison to the pre-2005 standard of clicking around a map’s edge to visualise the ‘next page’. The user interface had three key layers: a basemap showing context such as roads and rivers, the 2001 Census data, and a number of moveable toolbars that controlled the data shown and colour palettes. In the simplest sense CensusProfiler was a series of choropleth maps. This style of mapping is widely used for demographic data and typically colours a statistical unit area according to the proportion of the population within it that has a particular attribute, for example the proportion of the working-age population that are in full-time employment.

Choropleth colour ramps are usually scaled from the lowest to the highest proportions across all the areas and may be evenly banded (stepped/graduated, i.e. discrete) or use a continuous ramp. Alternatively, other methods of banding may be applied, such as Jenks natural breaks (Jenks, 1967). To serve the choropleth map to the user it is partitioned into square images, or tiles, that are created only when needed on a server, following a request made by the user’s browser – the ‘client’. Because of the enormous number of possible combinations, each resulting in a unique map tile that could therefore be viewed, it is essential to be able to create the map tiles ‘on demand’ in an efficient and timely manner, as opposed to pre-rendering these maps and storing all the tiles on a server. Therefore, CensusProfiler had a system that efficiently created custom-made maps. The website architecture also employed limited ‘caching’ of the most popular map views, to avoid repeated server-intensive spatial operations on the database and accelerate response time, but for the great majority of queries the tiles were created at the time of the query. This development was one of the key drivers behind the creation of CensusProfiler’s successor: the DataShine platform.

10.2.2
DataShine (datashine.org.uk)

DataShine visualises and provides access to the UK’s 2011 Census aggregated datasets; users can access and map nearly 2,000 variables across a quarter of a million statistical unit areas. It marks a significant advancement on the technologies deployed by its predecessor by enabling more advanced cartography and map customisation, rescaling of the data on demand and data download functions.
144 CONSUMER DATA RESEARCH: PART THREE
Figure 10.1
Impact of rescaling the colour ramp based on the local values, shown for metro use in north-west Birmingham. Top: Nationally scaled colour ramp. Bottom: Locally scaled colour ramp. The rescaling allows the variations in the low (relative to national use) but still significant (in local terms) usage to be viewed.
[Map labels: Wolverhampton, Birmingham.]
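The effect shown in Figure 10.1 can be reproduced in miniature. The sketch below assumes an evenly banded ramp with five bands and uses invented proportions; the function name `band_index` and all values are hypothetical and are not taken from DataShine.

```python
def band_index(value, lo, hi, n_bands=5):
    """Equal-interval band (0 .. n_bands - 1) for a value on a ramp scaled from lo to hi."""
    if hi <= lo:
        return 0
    fraction = (value - lo) / (hi - lo)
    fraction = min(max(fraction, 0.0), 1.0)  # clamp values outside the ramp
    return min(int(fraction * n_bands), n_bands - 1)


# Hypothetical proportions of commuters using a metro, by area.
national = [0.0, 0.01, 0.02, 0.03, 0.04, 0.25, 0.40, 0.60]
local = [0.01, 0.02, 0.03, 0.04]  # the areas currently in view

# Nationally scaled ramp: every local area falls in the lowest band.
national_bands = [band_index(v, min(national), max(national)) for v in local]
# Locally scaled ramp: the same areas spread across the bands.
local_bands = [band_index(v, min(local), max(local)) for v in local]

print(national_bands)
print(local_bands)
```

With the national ramp all four local areas share one colour, whereas rescaling to the local minimum and maximum separates them, which is the behaviour the figure caption describes.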
There is the additional UI requirement for more context and information to be shown around some of the maps. For example, some maps show composite indicators that partition areas into specific categories that need further explanation. This is provided in a series of pop-ups with links to more detailed guidance. This, again, adds to the potential complexity of the user interface that needs to be managed.

As part of broadening access to the CDRC’s data holdings, we were keen to include a series of eye-catching and easily interpretable layers in CDRC Maps to drive traffic to the platform from our user groups. To this end we feature a series of single-metric maps, showing how the metric value varies from low to high, typically using a fixed-hue colour ramp. We also show example geodemographic maps, where areas are assigned a category based on the clustering of multiple metrics. As the relationship between each category is not normally directly quantifiable, qualitative colour palettes (changing hue for each category) are typically used. Finally, a hybrid type of map is also included. Known as ‘Top Metric Maps’, these show, for each area, the top category for a single qualitative metric, for example the most common industry type or the most popular mode used to travel to work. As the categories are qualitative, hue-varying colour palettes are used. Whilst they have proved popular with our users, we are aware that top metric maps have to be used cautiously, as they may not be representative of the wider population in each area, particularly if the category break-downs are not carefully calibrated.

The entire geostack used in CDRC Maps – namely Mapnik and PostGIS for data storage and creation; and OpenLayers, JQuery and JQueryUI for display – is open source, and the datasets mapped are generally themselves derived from, or simple aggregations of, open data, with the data being available at CDRC Data, the aforementioned complementary site to CDRC Maps that acts as a data repository and viewer. CDRC Maps therefore both raises awareness of data and facilitates access to it.

10.3
Developments in web mapping

Techniques for showing maps of data on the web continue to evolve rapidly, as the geostack technology continues to be in active development. Technologists are looking to pure-vector based maps, as client-computer browsers become more sophisticated at rendering content themselves. However, the traditional raster approach can still lead to rich and effective mapping that cannot yet easily be replaced with a vector pathway. Digital cartographers are starting to consider augmenting the basic approach of displaying the data with colour variations, by incorporating other kinds of symbology, such as texturing, still best served as raster tiles.

The following section considers some of the possible advances in this area. It will first discuss prototypes to display levels of uncertainty in a dataset. The second example is the use of colour compositing of multiple datasets, each represented in different hues, on an automated, rule-based basis, to generate new ways of looking at multivariate data. Finally, we consider an alternative approach to the ‘building mask’ technique of DataShine and CDRC Maps (discussed above), by using colour to emphasise the population’s location.

10.3.1
Uncertainty

Consumer datasets are often inputs into, or augmented by, indicators such as geodemographic classifications. These are also subject to a quantifiable degree of uncertainty that is rarely mapped, but that can have important implications for analysis and interpretation. Here we take one such indicator, the UK Output Area Classification (OAC), and demonstrate
how the uncertainty inherent to it can be mapped. Unlike its commercial counterparts, this free-to-use classification benefits from an open-source methodology that facilitates the calculation of a range of uncertainty measures. Here we use the 2011 OAC developed by geographers at UCL in collaboration with the UK’s Office for National Statistics.

Figure 10.2
Left: Manchester’s urban core, showing sharp divisions. Right: Halifax (south part of map) and Bradford (north part of map) where different demographics are manifest as differences in how well the central zone is defined. Both maps are aligned with north upwards.

It is now possible to apply textures to web-based choropleths. We try two such approaches in this work; the first is applying an image file of noise and the second is to apply hatching. The level of distortion is controlled by an uncertainty measure in the 2011 OAC known as the ‘standard equalised distance’ (SED) that offers an indication of how close to the centre of each cluster a single output area (OA) falls. The smaller the SED, the more certain we can be that the bulk of an OA’s population fits its assigned ‘supergroup’ (category). For each area, the supergroup with the smallest SED becomes that area’s designated supergroup classification, but the SEDs to the other supergroups are retained. Both the absolute SED values and the relative SED between the ‘winning’ (referred to as primary) and ‘runner-up’ (secondary, tertiary etc.) supergroups are used when visualising the uncertainty of classification of each area.

With each approach we offer an example of the insights it can provide into the success, or otherwise, of 2011 OAC. Across the entire OAC 2011 supergroup dataset, the OA average SED for the dominant supergroup is 0.913, with a population standard deviation of 0.239. We apply a screen compositing operation that lightens the supergroup colour in a spatially randomised way by combining it with a ‘grain’ texture supplied by a source JPEG file. The grain effect has an opacity set based on the absolute SED score, from 0 (i.e. no compositing effect) for SED less than 0.6, increasing linearly to 1.0 (i.e. compositing the texture fully) for SED greater than 2.4.

On examining a version of the OAC 2011 map with the textured noise applied, untextured area boundaries show strongly on the choropleth map while boundaries between two areas both with a high SED are much less distinguishable. The former case often occurs if there is a linear feature that forms a physical barrier (e.g. a river or major highway) separating the two areas. The identification of such transitions is aided by the use of texture as well as
Figure 10.3
Variations in SED across
the border between
Scotland and England,
which runs diagonally,
approximately through
the middle of the image
from the bottom left to
top right.
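The SED-driven grain effect described in the text can be written down as a small sketch. The thresholds 0.6 and 2.4 come from the text; the function names, the per-channel treatment and the way opacity fades the screened result back towards the base colour are assumptions made for illustration, not the rendering code used for these maps.

```python
def grain_opacity(sed, low=0.6, high=2.4):
    """Opacity of the grain texture: 0 below `low`, 1 above `high`, linear in between."""
    if sed <= low:
        return 0.0
    if sed >= high:
        return 1.0
    return (sed - low) / (high - low)


def screen_with_grain(base, grain, opacity):
    """Screen-blend one colour channel (0..1) with a grain value, faded by opacity.

    The screen operation, 1 - (1 - base) * (1 - grain), can only lighten the base.
    """
    screened = 1.0 - (1.0 - base) * (1.0 - grain)
    return base + opacity * (screened - base)


# An OA close to its cluster centre keeps its colour; a poorly fitting one is lightened.
well_fitting = screen_with_grain(0.4, 0.5, grain_opacity(0.5))
poorly_fitting = screen_with_grain(0.4, 0.5, grain_opacity(2.5))
print(well_fitting, poorly_fitting)
```

Under these assumptions an OA with the average SED quoted in the text (0.913) receives only a light grain, since it sits near the lower end of the 0.6 to 2.4 range.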
another technique; these assign population units to a random or weighted location within each statistical unit area. However, these maps are fundamentally different from choropleth maps and are more computationally intensive to produce and not as easily interpretable since they introduce a large degree of false precision.

An alternative approach, assuming that the statistical variation in the choropleth is shown by varying the hue and/or lightness, is to vary the other colour variable (saturation in the case of the HSL colour space described here). By fading the choropleth colour in sparsely populated areas, and oversaturating it in areas of relatively high population density, the eye is drawn naturally to the latter areas on the map, regardless of the other properties of the colour shown. An appropriate key showing the hue or lightness variations for different metric values, superimposed across a number of different saturations, can help emphasise this. Care should be taken, however, to ensure that large variations in saturation do not dominate the overall visual appearance of the map, at the expense of showing the variation in the main population metric that is being mapped. This technique is used in maps.cdrc.ac.uk/#/metrics/ruralurban/ where the hue shows the category of settlement classification – the main metric being mapped – and lightness is used to show the variation in population density.

Figure 10.4
Combining different hues to show relative high values of unemployment (green), South Asian ethnicity (purple) and deprivation (red) in central London, alongside the River Thames. The three hue-based maps (above left) are overlaid to show the three socioeconomic variables on the single map (above).
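A minimal sketch of the saturation approach follows, using Python's standard `colorsys` module. The linear mapping from density to saturation, the minimum saturation of 0.15 and the example densities are assumptions chosen for illustration; the text does not specify how saturation is derived from population density.

```python
import colorsys


def density_weighted_colour(hue, lightness, density, max_density, min_sat=0.15):
    """RGB colour (0..255) whose saturation rises with population density.

    Hue and lightness carry the mapped metric; only saturation is varied here.
    """
    share = 0.0 if max_density <= 0 else min(density / max_density, 1.0)
    saturation = min_sat + (1.0 - min_sat) * share
    # colorsys uses HLS ordering: hue, lightness, saturation (all 0..1).
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return tuple(round(c * 255) for c in (r, g, b))


# Same metric value (same hue and lightness), different population densities.
sparse = density_weighted_colour(hue=0.6, lightness=0.5, density=50, max_density=10000)
dense = density_weighted_colour(hue=0.6, lightness=0.5, density=10000, max_density=10000)
print(sparse, dense)
```

The sparse area comes out close to grey while the dense area is fully saturated, so areas with the same metric value remain comparable in hue and lightness but differ in visual weight.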
Geotemporal Twitter
Demographics
Alistair Leak and Guy Lansley
Figure 11.1
Map showing a random sample of 1 million geo-tagged Tweets collected between December 2012 and January 2014. Each Tweet is depicted by a single blue point.

service is often sufficient. The availability of such data and the ease with which they may be accessed have led to an explosion in the publication of academic literature demonstrating the potential insight that may be drawn, ranging in themes from crime and security to health and mobility. However, while significant volumes of literature have been published touting the potential of Twitter data, it is often the case that only lip service is paid to the limitations of the data source – specifically with respect to the demographics of those individuals who are users of the service versus those of the population at large. Unlike traditional demographic data, those individuals who are users of Twitter are a self-selecting sample which is further limited to those who choose to disclose their location. While various anecdotal evidence exists regarding the demographic bias present, there are limited means by which said bias may be quantified. This is because users of the service are not explicitly required to provide any identifying information as part of the registration process.

Given this lack of demographic specificity, it is necessary that key markers are modelled such that they may be assessed against existing data or be employed in the study of demographics. In seeking to achieve this goal, individual screen names may be employed in the inference of key
In the UK, this approach results in 273,000 Twitter users being identified as being residents.

11.3
Benchmarking

Prior to performing any analysis, it is imperative that the nature of the sample being studied is understood. Such consideration, given existing anecdotal evidence, suggests that demographic bias in the Twitter data will be particularly important where the phenomenon being studied bears an identifiable correspondence with age, gender or ethnicity. For example, in the UK, political views are known to relate to age, with younger people having a greater affinity to Labour and those who are older leaning towards the Conservative party. Given that the Twitter users are believed to generally be younger, this may lead to a left-wing bias being present in any data being analysed. Failure to account for such bias may adversely affect interpretation and, in turn, the conclusions drawn. In possession of demographically attributed data, however, it is possible that an assessment of the data’s representativeness may be obtained. In the following, benchmarking of age, gender, ethnicity and geographic distribution is reported.

For the purpose of benchmarking, two reference datasets are employed: the 2013 Consumer Register produced by CACI Ltd and the 2011 UK Census of Population. The Consumer Register is an augmented version of the publicly available Electoral Register which substitutes names from other commercial sources.
Figure 11.2
Population pyramid of Twitter users in the UK versus the equivalent Office for National Statistics data for 2011. The ONS data are illustrated in grey.
[Chart: y-axis: Age (years), five-year bands from 10 − 14 to 85 plus; x-axis: Proportion (0 to 0.075 on each side); legend: Gender (Female, Male).]
in Lansley and Longley, 2016a) and the equivalent data from the 2011 UK Census of Population.

The population pyramid shown in Figure 11.2 confirms the anecdotal belief that Twitter is predominantly used by a younger portion of the population. However, having differentiated the data by gender it is evident that differences exist. While female users are more prevalent in the 10 to 19 bands, males become increasingly dominant beyond this age. Beyond the 20-24 age bracket, the proportion of male users increases significantly, suggesting that it is older males who have chosen to adopt the platform.

11.3.2
Ethnicity

As with age and gender, it is well recognised that ethnicity may play a role in individuals’ social attitudes, health and wellbeing. In seeking to quantify the degree to which each ethnic group is represented, we applied the Onomap tool to both the UK Twitter population inventory and likewise

Table 11.1 presents a comparison between the ethnic composition of Twitter and the Consumer Register as estimated by Onomap. This highlights population segments in which Twitter users are likely to be more or less well represented relative to the usual resident population of the UK. Clearly evident is that the combined White Group is over-represented whilst the Asian and Black groups are systematically under-represented. The Mixed group (arguably the hardest to identify using our chosen data classification techniques) is the most under-represented. Thus, in seeking to draw general inference regarding population behaviour based on Twitter it must be recognised that the minority groups are likely to be under-represented within the sample.

11.3.3
Geographic distribution

Geographic distribution is examined using the Location Quotient (LQ) measure. The LQ may be considered as the quotient of the proportion of Twitter users in a specific geographic area versus the corresponding
11.3.4
Demographic summary
assessment of demographic structure, we need to focus on individual users. Failure to consider the appropriate unit of analysis can result in the generation of invalid results or distorted insight.

cases, the analysis is performed using data collected from London’s Heathrow Airport. The largest of London’s six airports, Heathrow handled 72 million passengers in 2013. There are various motivations for

Figure 11.4
Map showing the LQ of residential locations of those UK-based Twitter users observed within the Heathrow extent.

Table 11.2
Frequencies of Tweets by topic as determined in the LDA analysis.

Topic               Frequency   Percentage
1  Destinations     3,547       8.03
2  Anger            5,183       11.73
3  Thoughts         3,597       8.14
4  Anticipation     5,031       11.39
5  Conversations    3,951       8.94
6  England          5,792       13.11
7  Travel           5,779       13.08
8  Consumers        2,975       6.73
9  Media            3,021       6.84
10 Other            5,312       12.02
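The percentage column of Table 11.2 can be reproduced from the frequency column. The short sketch below uses only the frequencies listed in the table; the variable names are ours.

```python
# Frequencies of Tweets by topic, as listed in Table 11.2.
frequencies = {
    "Destinations": 3547, "Anger": 5183, "Thoughts": 3597, "Anticipation": 5031,
    "Conversations": 3951, "England": 5792, "Travel": 5779, "Consumers": 2975,
    "Media": 3021, "Other": 5312,
}

total = sum(frequencies.values())
# Each topic's share of all Tweets in the table, rounded to two decimal places.
percentages = {topic: round(100 * n / total, 2) for topic, n in frequencies.items()}

smallest = min(percentages, key=percentages.get)
print(total, smallest, percentages[smallest])
```

The listed frequencies sum to 44,188, and dividing each by that total gives the percentages shown in the table, with Consumers the smallest group.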
Using the same sample of Tweets from Heathrow as described in the preceding sections, the Tweet messages were first cleaned in order to ensure the topic model returned valid and coherent segmentations of documents. We removed duplicated Tweets under the assumption that these rarely reflect original content and are often automatically generated messages. We then removed stop words, punctuation and non-Latin characters. We also removed very short Tweets with fewer than three words, as it would be difficult to generate significant topics from very short documents. It was also necessary to remove all Tweets with uncertain coordinates. Following the data cleaning, 48,188 of the original 56,417 Tweets from Heathrow were deemed to be appropriate for topic modelling. We used Latent Dirichlet Allocation (LDA) to generate probabilistic topics from the Twitter documents. LDA creates semantic groups from large collections of documents (Blei et al, 2003). Each word (or value) is assigned a score for each topic. It is, therefore, possible to view the topics for each Tweet by looking up the scores for each of its words.

Although LDA is an unsupervised model, the researcher does need to select the number of topics (k) to be generated (see Table 11.2). As this study is intended to demonstrate the utility of topic modelling Tweets, we have produced just 10 groups. Of course, a higher number would allow more specific and intricate topics to be generated, although these would represent fewer Tweets. The results of the model were then appended to each Tweet in the data so we could detect trends. In this case, we simply allocated each Tweet to its most probable topic based on its total probability scores. The group sizes are relatively well balanced; the smallest group represents over 6.7% of all the Tweets in our final sample.

The topics are presented as a word cloud in Figure 11.5. Only the most common words from the Tweets are shown, which have been partitioned by the topics they are most commonly found in. Labels have been given to each of the topics and were manually derived from interpreting the top words and via observing a random selection of Tweets. Each of the topics is sufficiently distinctive. Whilst a small number of topics are perhaps typical of general social media usage in the UK, many of the topics are probably particular to airport activities. The group labelled Travel predominantly describes travelling. These messages range from complaints about queues and delays within the airport to flights and transport links to central London. The England topic contains many messages about people saying their farewells to the country, notably including complaints about the English weather. The group also includes comments from those who have arrived at the airport, both
Figure 11.5
Comparative word cloud illustrating the most common terms observed in each of the 10 groups (Destinations, Anger, Thoughts, Anticipation, Conversations, England, Travel, Consumers, Media and Other).
tourists looking forward to their holiday and people happy to return home. The topic we labelled Consumers largely represents comments about retail and hospitality services; it is probable that a substantial proportion of these messages refer to purchases within the departure lounges. Lastly, the Other group comprises Tweets that could not be allocated into the other categories; almost all of the Tweets from this group are written in foreign languages. With more data, it would be viable to split these groups into unique subgroups.

It is also possible that each of the groups has distinctive temporal and spatial patterns, and these may resonate more so in groups that describe particular activities that are usually restricted to particular routines or locations. For example, the Tweets in the Consumers topic are more likely to be found in locations across Heathrow where the services are provided. Figure 11.6 illustrates the distribution of the Consumers Tweets (red) and the rest of the Tweets from the sample (blue). It can be observed that Tweets from the Consumers topic are very densely concentrated in a handful of locations across the site. These are departure and arrivals lounges where retail and catering facilities are located, which correspond with many of the activities discussed.

The classification we present here is fairly rudimentary as it was built with a relatively small number of Tweets and was restricted to just 10 topics. However, it is a valid demonstration of the types of trends that can be determined from unstructured
164 CONSUMER DATA RESEARCH: PART THREE
social media data. It can also be inferred that what is discussed on social media can often be associated with activities that may have distinctive geographies. Therefore, it is possible that harnessing Twitter data can be useful in identifying activity trends across time and space. These can range from unusual delays to feedback on retail offerings. Looking forwards, analysts can use the probability scores for our database of words to model future data into the pre-existing topic categories in order to identify trends on the fly.

Figure 11.6
Map of Heathrow Airport showing the distribution of Tweets relating to consumer activity (red) versus all Tweets submitted within the airport extent. Basemap supplied by Stamen.

11.5
Conclusion

At the outset, the objective of this chapter was to showcase the potential of Twitter data in the study of demography and highlight key considerations for the effective use of such data as part of the analysis process. It has been demonstrated how, through the novel analysis of personal names, key identity markers may be inferred and, subsequently, how such data may facilitate the assessment of representativeness. Using these data, it has been possible to estimate the degree to which Twitter is representative of the UK’s residential population.

Beyond illustrating potential techniques for data enrichment, we have sought to showcase the potential of Twitter in the study of demography. Using the example of London Heathrow Airport, it has been shown how one may recreate conventional forms of analysis and, further, how the richness of the data may be exploited
Note
We developed indicators of accessibility to ‘fast food outlets’, ‘pubs, bars and nightclubs’, ‘off-licences’ and ‘tobacconists’. Fast food outlets typically sell foods that are energy dense and nutritionally poor, and the consumption of such foods is associated with increased risk of obesity. We have two measures of accessibility to alcohol. Pubs, bars and nightclubs represent outlets that sell alcohol on-trade (i.e. alcohol is purchased and consumed on site) and off-licences are stores that primarily sell alcohol as off-trade (i.e. alcohol is purchased on site but consumed off site). These have different harms, with access to on-trade outlets typically associated with acute alcohol-related harms and off-trade outlets with chronic harms. Finally, tobacconists are specialist stores which sell primarily tobacco-related products such as cigarettes, cigars and loose form tobacco.

We also include access to ‘gambling outlets’. These represent slightly different harms compared to our other indicators. Gambling outlets represent the potential for economic losses which are indirectly related to health. Their use has also been shown to be associated with poorer mental health.

Local governments in the UK (and beyond) have sought to regulate aspects of the built environment in attempts to address the access and supply of unhealthy amenities. Planning regulations have been introduced to limit the density of fast food outlets, pubs/bars and off-licences. There is also similar interest in reducing access to gambling outlets. Therefore, the interest in such metrics is not purely academic, but has important policy relevance.

12.3
Provision of health care

Health care services provide important point of care amenities for the diagnosis, treatment and maintenance of health. One of the founding principles of the UK’s National Health Service (NHS) was that the provision of these services should be made available to all. One area of interest in implementing such a policy has been the equitable access to services with the aim of minimising geographical barriers. However, health services are not always equally spread throughout the population and there has been considerable interest in whether geographical barriers prevent the utilisation of such services.

We include measures of accessibility to features of primary and secondary health care. These include ‘General Practices (GPs)’ which are the first point of care for patients, ‘pharmacies’ which sell medicines, ‘dentists’ which provide oral health care, and finally ‘hospitals with accident and emergency (A&E) departments’ which provide more serious care. We also include accessibility to leisure services which, while not health services, offer individuals the opportunity to exercise, which is important for promoting healthy lifestyles.

12.4
Physical environment

There has been longer interest in understanding how features of the physical environment impact health compared to other domains. We focus on two important aspects of the physical domain: green space and air quality.

Green space refers to areas of natural environments including grassland, woodland, parks and other areas of vegetation. It has been demonstrated to be an important determinant of physical and mental health. Parks offer opportunities for physical activity, as well as social interactions with friends and family. Individuals residing in ‘greener’ environments also tend to have improved mental wellbeing.

Air quality is an important determinant of respiratory health and is viewed as one of
12. Developing Indicators for Measuring Health-Related Features of Neighbourhoods 169
Table 12.1
LDC categories and subcategories selected for each indicator of the Retail environment domain.

we used modelled estimates from the Department for Environment, Food and Rural Affairs (DEFRA) for a series of air pollutants with known health implications (NO₂, PM₁₀ and SO₂). The air pollution data are modelled under DEFRA’s Modelling of Ambient Air Quality contract to provide policy support and are created at a 1 × 1 km resolution. Model estimates are derived from a mixture of data collected from monitoring sites and estimated levels based on the location of industry and road networks. Additionally, we acquired information on ‘green’ spaces available for use by the public from OpenStreetMap (OSM) by selecting areas with the following tags: cemetery, common, dog park, scrub, fell, forest, garden, greenfield, golf course, grass, grassland, heath, meadow, nature reserve, orchard, park, pitch, recreation ground, village green, vineyard and wood. Accessibility measures for each of our indicators (other than the indicators of the physical environment domain) were created using the Routino⁴ open source software. Routino is an application for finding a route between two points using the OSM road network, and takes into account restrictions on roads as well as tagged speed limits and barriers. We measured the network distance between the centroid of each postcode in the National Statistics Postcode Lookup (NSPL) (a database containing all postcodes for Great Britain) and the coordinates of the nearest service (e.g. a postcode centroid for a GP practice). However, the overall process for calculating network distances for about 2 million postcodes in Great Britain is CPU-intensive and the Routino tool computes distances sequentially. To address both these issues, we implemented a parallelisation framework using 10 Docker⁵ containers that run Routino instances in parallel for subsets of 200,000 GB postcodes. In this way, we achieved a significant decrease in processing time from roughly eight days to about eight hours per indicator!

The indicators for the physical environment domain required a different approach. For measuring access to green space, we defined accessibility as a measure of the overall area of green space available to each postcode that falls within a 900 metre buffer zone. We selected this
measure following the recommendation of the European Environment Agency which argues that each person should have access to green space no further than 900 metres (or a 15-minute walk) from their home (Stanners and Bourdeau, 1995). We additionally performed sensitivity testing of additional buffer sizes; however, the results did not significantly alter. We did not measure access to our air pollution measures but used their modelled values from DEFRA and aggregated them at the LSOA level.

Table 12.2
Indicator weights and direction for each indicator of the Access to Healthy Assets and Hazards (AHAH) index.

Measured network distances for each indicator were aggregated from postcode into an aggregate geography. For England and Wales, these were Lower Super Output Areas (LSOAs), and in Scotland, Data Zones. We selected these geographies since they are relatively small zones which are regularly used in research, local government or health, and could be easily aggregated to other statistical geographies if required. To give an idea of scale, LSOAs contain a mean population size of 1,500 people with a minimum of 1,000 and maximum of 3,000 people per LSOA. For Scotland, we used ‘Data Zones’, which are the equivalent geographical scale (we refer to these as LSOAs for simplicity in the rest of the chapter) although they are slightly smaller with population sizes between 500 and 1,000 people.

Each indicator was then individually standardised by ranking LSOAs from best to worst. The direction of each variable was dictated by the literature (e.g. accessibility to fast food outlets was identified as health negating, whereas accessibility to GP practices was health promoting; see Table 12.2). Each variable was then transformed to the standard normal distribution. The indicators within each domain were combined with equal weights forming an overall domain score. We chose to equally weight each indicator since there was no clear justification for different weightings, which otherwise would emphasise the relative importance of the composite score versus those others considered.

To calculate our overall index (and domain-specific values), we followed an aspect of the methodology from the 2015 English Index of Multiple Deprivation (Smith et al, 2015). We ranked each domain R and scaled it to the range [0,1]. R=1/N was defined as
the most ‘health promoting’ LSOA and R=N/N for the least health promoting (N is the number of LSOAs in Great Britain). Exponential transformation of the ranked domain scores was then applied to LSOA values to reduce ‘cancellation effects’ (Smith et al, 2015). So, for example, high levels of accessibility in one domain are not completely cancelled out by low levels of accessibility in a different domain. The exponential transformation applied also puts more emphasis on the LSOAs at the health-demoting end of the distribution and so facilitates identification of the neighbourhoods with the worst health promoting aspects. The exponential transformed indicator score X is given by:

$$X = -23 \ln\bigl(1 - R\,(1 - \exp(-100/23))\bigr)$$

where ‘ln’ denotes natural logarithm and ‘exp’ the exponential transformation.

The main domains across our indicators (retail environment, health services and the physical environment) were then combined to form an overall index of ‘Access to Healthy Assets & Hazards’ (AHAH).

12.7
Results

Table 12.3 presents descriptive statistics for each of our indicators. These reveal for the first time low-level differences in access to various health-related features for the whole of Great Britain. Many features of the retail environment are located on average within less than 1.5 km of the population. Pubs, bars and nightclubs were the most accessible. This was followed by gambling and fast food outlets which both demonstrated high accessibility. Off-licences and tobacconists were the least accessible premises in the retail environment, particularly in Scotland

Table 12.3
Median values of each indicator for LSOAs by urban/rural status.
Figure 12.1
Quintiles of accessibility in GB: a) Physical environment domain, b) Health services domain, c) Retail environment domain.

good green space accessibility it also has high scores for the indicators of SO₂ and PM₁₀ from the farming process.

The health services domain (Figure 12.1.b) has a contrasting pattern to that seen in the physical environment (Figure 12.1.a). Rural areas have poorer accessibility to health services than urban areas (as shown in Table 12.3). Urban areas are more clearly defined in Figure 12.1.b, which is expected due to the distinct differences in infrastructure provision and population density. Plotting quintiles hides some variation between areas, particularly in rural areas where remote regions in Wales and Scotland have very poor access to health services.

The retail services domain is very similar to the health domain, with urban areas once again clearly defined. The relationships are reversed, however, since urban areas have higher accessibility to health negating retail features in contrast to health services. Conversely, rural areas have poorer access to these retail outlets in comparison with urban areas.

Figure 12.2 shows our overall index of ‘Access to Healthy Assets & Hazards’ (AHAH). The figure shows that the most remote rural areas are identified clearly as ‘unhealthy’ areas in terms of accessibility in our measure. While they typically performed well on our physical environment and retail domains (although not always, e.g. Lincolnshire), they perform poorly on accessibility to health services, due to their remoteness and being sparsely populated. At the other extreme, most urban cores of cities such as central London, central Birmingham, and the city centres of areas such as Liverpool, Leeds and Manchester also perform poorly on our index. These urban centres have high volumes of health services, but have poor accessibility due to the high number of ‘unhealthy’ services
restricted to just the urban core but also extend out to the east and west. In the east, areas in the lowest quintile extend along the River Thames representing the location of industry and river traffic. The west is characterised by Heathrow Airport which has high levels of pollution and poor access to health services.

The areas that perform best on AHAH can be found in the periphery/outskirts of the city. These areas are characterised by good access to health services and high

the Greater London metropolitan area, accessibility to retail outlets is by contrast poorer (i.e. further away) since these areas are predominantly more rural / less densely populated. There is one area within Inner London that does perform well on AHAH despite being surrounded by areas that do not perform well. This is Richmond Park and Wimbledon Common, two large expanses of green space and parkland. This area performs best on the physical environment domain (and to a lesser extent on the retail environment domain).
will be updated over time with new indicators that we develop or update.

Further Reading

Smith, T., Noble, M., Noble, S., Wright, G., McLennan, D. and Plunkett, E. (2015). The English Indices of Deprivation 2015, Department for Communities and Local Government. Online: https://www.gov.uk/government/publications/english-indices-of-

The authors would like to thank the Local Data Company Ltd for providing the retail unit data, the NHS of England, Wales and Scotland and the DEFRA for providing the health data and the air pollution data respectively under the OGL license and the OpenStreetMap Foundation (OSMF) for providing the GB network data under the Open Data Commons Open Database License. The second author’s PhD research is sponsored by the Economic and Social Research Council through the North West Doctoral Training Centre.
13
179
Table 13.1
Description of the spatial dataset compiled.

as a matter of fact, any other Census geography). OA borders were designed to maximise within-zone homogeneity in population characteristics (population normalisation), without regard to the geographical features of the area (Martin et al, 2001; see Figure 13.1). As such, for proximity based inputs there were challenges about how such measures might be calculated, and to which area they should be attributed.

To address these methodological shortcomings, three different types of attribute measures are introduced for each OA that relate to either two types of
Figure 13.1
Map showing the un-generalised OA borders (blue lines) in the Sefton Park area, Liverpool. Notice how the area of the park is divided arbitrarily between proximal OAs (pink hashed line pattern). Moreover, OA borders usually coincide with the street network, making any street network-to-area measurements impracticable.

Figure 13.2
The spatial data model used to process data and produce OA zonal inputs to the classification.
[Diagram labels: Proximity Measures; Housing and Density; Natural Environment; Public and Private Services; Building Characteristics.]
and Singleton, 2015). For example, in this application, when 600 m buffers were used for major roads, this resulted in more than 50% of buildings meeting this criterion, providing a weak differentiation. These tasks were computationally expensive, as the complete dataset contained more than 12.8 million observations (building polygons). Therefore, the database was processed within the R programming language.

Finally, there were two further types of direct measure: those that were derived from building-level geographic features, and those that were simple inputs from secondary data. The derived direct measures included listed buildings (Figure 13.3) and cul-de-sacs. The latter were defined geocomputationally as the end of a line segment that did not intersect with any other such segment. A sensitivity of 10 m was applied to this criterion in order to avoid topological errors and intermittent street segments. Results show that such measures can capture specific urban morphologies even at the small-area level.

For the other non-derived direct measures, the variables were simply aggregated directly at the OA level, such as housing
Figure 13.3
The total surface area of listed (registered) buildings (ha) per OA within the Greater Manchester metropolitan area.
[Map legend, total surface area per OA: < 0.05 ha; 0.05 - 0.25 ha; 0.25 - 0.75 ha; 0.75 - 1.5 ha; > 1.5 ha.]
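
The cul-de-sac definition given above (the end of a street segment that does not meet any other segment, with a 10 m sensitivity) can be illustrated with a small geometric sketch. It assumes straight segments stored as pairs of (x, y) coordinates in metres, and it uses a simple all-pairs comparison, which would be far too slow for a national street network without a spatial index. The function names and the toy network are hypothetical and are not taken from the authors' workflow.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the straight segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def dead_ends(segments, tolerance=10.0):
    """Return segment end points lying further than `tolerance` from every other segment."""
    found = []
    for i, (start, end) in enumerate(segments):
        for point in (start, end):
            isolated = all(
                point_segment_distance(point, c, d) > tolerance
                for j, (c, d) in enumerate(segments)
                if j != i
            )
            if isolated:
                found.append(point)
    return found


# Toy network: a main road in two pieces and a side street ending at (100, 80).
streets = [
    ((0.0, 0.0), (100.0, 0.0)),
    ((100.0, 0.0), (200.0, 0.0)),
    ((100.0, 0.0), (100.0, 80.0)),
]
ends = dead_ends(streets)

# A further street starting 5 m beyond the side street: within the 10 m
# tolerance the gap is treated as a digitising error, not a dead end.
streets_with_gap = streets + [((100.0, 85.0), (150.0, 85.0))]
ends_with_gap = dead_ends(streets_with_gap)
```

In this toy network the outer ends of the main road are also reported, because nothing connects to them; in practice the edges of the study area would need separate handling.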
Table 13.2
Built and physical environment attributes used in the classification.
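
Attributes such as those in Table 13.2 are measured on different scales, and the chapter standardises them to z-scores before clustering. The sketch below shows that standardisation together with the per-cluster attribute means that a radial plot would display. The attribute names, values and cluster assignments are invented for illustration, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise values to mean 0 and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant attribute carries no information
    return [(v - mu) / sigma for v in values]


def cluster_profile(table, clusters, cluster_id):
    """Mean z-score of each attribute across the areas assigned to one cluster."""
    members = [i for i, c in enumerate(clusters) if c == cluster_id]
    profile = {}
    for name, column in table.items():
        z = z_scores(column)
        profile[name] = mean([z[i] for i in members])
    return profile


# Hypothetical attribute table for four areas and their cluster labels.
table = {
    "flat_ratio": [0.1, 0.2, 0.8, 0.9],
    "listed_building_area_ha": [0.0, 0.0, 0.0, 2.0],
}
clusters = ["A", "A", "C", "C"]

profile_a = cluster_profile(table, clusters, "A")
profile_c = cluster_profile(table, clusters, "C")
```

A profile value of zero indicates that the cluster mean equals the overall mean for that attribute, and values are read in units of standard deviations.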
13. Consumers in their Built Environment Context 185
type. Population density was calculated using a ratio of persons per total building area, which potentially would give more accurate results regarding housing dynamics. The final OA attributes along with their descriptions are provided in Table 13.2.

13.3
A small-area classification of urban morphology features

Methodologically, the cluster analysis follows the conventional geodemographic approach, as detailed in Harris et al (2005); however, only the physical and built environment data, detailed above, are used to create the typology. A common clustering technique used in geodemographic analyses is the iterative allocation–reallocation algorithm, known as K-means. Although this algorithm has been used in a variety of geodemographic applications, this dataset is characterised by very sparsely populated attribute values, which makes it ill-suited to K-means applications. Essentially, the majority of values are zero, indicating the absence of the particular built environment or physical characteristic from that area.

Due to these shortcomings, an alternative technique was used: a Self-Organizing Map (SOM). A SOM is an unsupervised classifier that uses artificial neural networks to classify multidimensional observations in two-dimensional space based on their similarities (Kohonen, 2001). A SOM typically organises observations by projecting them as grid units onto a plane, and through consecutive iterations finds the best configuration of observations so that every observation is most similar to the others closest to it. Typically, the SOM mapping process employs a lattice of squares or hexagons as the output layer, and the results are therefore easily mapped as they retain their topology. SOMs have many applications in a broad range of fields, from medicine and biology to image analysis and computer science. SOMs have also been tested as an alternative classifier of Census data where they seem to perform well for socio-economic data at the US Census tract scale (Spielman and Thill, 2008).

Prior to clustering, the input data, consisting of 18 variables and summarised in Table 13.2, were transformed into z-scores in order to standardise their measurement scales.

This SOM implements a hexagonal grid within which OAs are projected and thus creates the classification based on the resulting topology. A relatively unexplored built environment classification with too many clusters would be difficult to interpret, so a selection of a 4-by-2 hexagonal grid was made, which produced eight distinct clusters. Once areas were assigned to clusters, mean attribute values were plotted on radar plots in order to map cluster characteristics and label them accordingly, as seen in Figure 13.4.

Radial plots are used extensively in geodemographics as they are very intuitive in identifying the nature of formed clusters. A radial plot essentially depicts the cluster centre; it is a vector representing each attribute mean (in this case for 18 variables) within the cluster. Each attribute mean can be traced along every radial axis at their intersection, forming a unique pattern for every cluster. Since values were standardised to z-scores, values of zero suggest that the cluster attribute mean is equal to the national mean, while values above or below zero suggest that cluster attribute means are above or below the national average respectively. It also means that the values shown are measured in standard deviations.

To illustrate, assume that Cluster C is under consideration. The radial plot shows that Cluster C has an above average prevalence of major roads (1.0), pedestrian streets (0.4), parks and gardens (1.4) and retail sites (1.5). It has below average values of detached and semi-detached housing ratios (-1.6 and -1.7), but a high concentration of flats and
Figure 13.4
Radial plots of cluster attribute centres, as produced by the SOM.
[Radial plot panels; visible axis labels include Density, Railway Tracks, Flat Ratio, Green Areas, Terraced Ratio and Surface Water.]

terraced housing (1.4 and 1.0). The defining aspect of this cluster, however, is the listed buildings attribute, which has an average value of 5.1 within the cluster. From the mean values of attributes of Cluster C, it is suggested that these neighbourhoods are in the periphery of the city centre, proximal to some major roads and retail activities. The number of historical buildings and the presence of flats and terraced housing suggest neighbourhoods that have been historically affluent, potentially with a strong presence of churches or administrative buildings that have been repurposed to housing (e.g. flats) or recreational facilities (e.g. pubs and restaurants).

Mapping the classification can also provide further insights in cluster labelling. For instance, looking at the Liverpool city centre, some of the OAs of Cluster C are located within the Georgian Quarter, a historic affluent housing neighbourhood built in the 1800s. Cluster C appears to be dominating the geographical extents of the City of London as well, possibly due to the high number of historical sites in the area. In a similar manner, the rest of the clusters were examined in order to identify defining characteristics. This enabled cluster types to be labelled and the following short descriptions to be created:

1. High Streets and Promenades
These clearly depicted areas represent the main retail centres of urban regions located along the main commercial streets. The main characteristic of this cluster is the very high ratio of pedestrianised street networks, not only around retail clusters but also along seafronts, where traditionally a lot of recreational and leisure venues can be found.

2. Central Business District
The area often called city centre. Typically, high-rise buildings with a lot of commercial and office spaces, hence the relatively low net population density. These areas have proximity to the majority of public amenities, and have plenty of access via major roads and railways. For moderate-size cities the title holds true, but in areas such as London they tend to be too expansive to be labelled as central.

3. The Old Town
The traditional town centre or historically affluent residential developments, usually in the periphery of the main high street. The cluster is strongly defined by the number of registered buildings. Typically, a lot of recreational facilities can be found here, like pubs and restaurants, along with many administrative buildings and some historical major roads. Although the cluster does have a considerable number of flats, densities remain low, potentially due to refurbishments and change of usage.

4. Railway Buzz
The areas that are dominated by railway tracks and railway stations. They have no other major distinguishing attributes, which may suggest that they are actually rather heterogeneous in physical and socio-economic structure.

5. Victorian Terraces
These are typical neighbourhoods with terraced housing, average densities and moderate access to public and private services. In general, this is one of the most central clusters in the classification; excluding housing types, all attributes are very close to average. It is also one of the few typologies that can be found anywhere.

6. Suburban Landscapes
These areas are typically of semi-detached houses, with good access to parks. They tend to be quite distant from retail centres. Densities are higher than average as a result of the few non-domestic properties found within (since population density is calculated per building surface). They are primarily residential areas, and tend to be close to schools. Cul-de-sacs are relatively common, possibly because of organised developments and gated communities.
7. Countryside Sceneries
These areas are dotted with detached houses, and are located either near or within open countryside. This typology is also defined by the higher than average access to green spaces. Most rural villages fall into this category, along with some city fringe developments that lie beyond the classic suburbs.

8. Waterside Settings
The principal defining attribute of these neighbourhoods is their proximity to surface water such as rivers, canals or sea (these are very distinctive in the East of England). Some of these neighbourhoods, however, can also be found within close proximity of ports, industrial or post-industrial sites (hence the low densities). Among the distinctive infrastructure are arterial roads, i.e. secondary roads wide enough to be used by lorries for the distribution of goods.

A visual interpretation of the classification is always meaningful in evaluating emergent clusters, as illustrated by the map of the Greater London Region (Figure 13.5), as identified by the MODUM classification. As discussed previously, the core of the metropolitan region is identified as Cluster C: The Old Town, expanding outwards along major transport corridors as Cluster B: Central Business District (although in the case of London, this cluster may be too expansive to provide any useful differentiation). In general, axial zones exhibit much more strongly in an urban morphology classification derived from built environment and physical features which are linear in nature, such as roads, railways and rivers.

Figure 13.5 The Greater London Region as identified by the MODUM Classification.

13.4 Conclusion

The development of the MODUM classification illustrates that the production and analysis of a classification of the built environment using Big and Open Data can offer unique insights into some aspects of geodemographic structure of urban areas. The results capture, through the multidimensionality of the data, both
physical properties of geographic space that can be used to explore correlations with other spatial phenomena, potentially in a variety of applications, from real estate and house prices to health and wellbeing. In a dynamic sense, it can be used by urban planners and investors in the built environment to identify the areas in which the physical preconditions exist for neighbourhood renewal or upscaling.

On the other hand, the classification process described here is very specific to the underlying data and methodology. An inherent disadvantage of all big data that are currently available.
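
The labelling step described in this chapter, in which each cluster is examined for attributes that set it apart from the average, can be illustrated with a minimal sketch. It is not the MODUM procedure itself: the cluster letters, attribute names (`pedestrianised_ratio`, `terraced_share`) and values below are invented for illustration, and the use of z-scores of cluster means is an assumption about one reasonable way to express "distance from average".

```python
from statistics import mean, pstdev

# Hypothetical output-area (OA) records: a cluster label plus two illustrative
# built-environment attributes. All names and values are invented for this sketch.
areas = [
    {"cluster": "A", "pedestrianised_ratio": 0.30, "terraced_share": 0.20},
    {"cluster": "A", "pedestrianised_ratio": 0.26, "terraced_share": 0.25},
    {"cluster": "E", "pedestrianised_ratio": 0.05, "terraced_share": 0.70},
    {"cluster": "E", "pedestrianised_ratio": 0.07, "terraced_share": 0.65},
    {"cluster": "F", "pedestrianised_ratio": 0.02, "terraced_share": 0.10},
    {"cluster": "F", "pedestrianised_ratio": 0.02, "terraced_share": 0.20},
]
attributes = ["pedestrianised_ratio", "terraced_share"]


def cluster_profiles(records, attrs):
    """Return {cluster: {attribute: z-score of the cluster mean}}."""
    profiles = {}
    labels = sorted({r["cluster"] for r in records})
    for attr in attrs:
        values = [r[attr] for r in records]
        overall_mean, overall_sd = mean(values), pstdev(values)
        for label in labels:
            members = [r[attr] for r in records if r["cluster"] == label]
            # Express the cluster mean in standard deviations from the overall mean.
            z = (mean(members) - overall_mean) / overall_sd if overall_sd else 0.0
            profiles.setdefault(label, {})[attr] = z
    return profiles


def defining_attribute(scores):
    """Pick the attribute whose cluster mean lies furthest from the overall average."""
    return max(scores, key=lambda name: abs(scores[name]))


profiles = cluster_profiles(areas, attributes)
for label, scores in profiles.items():
    name = defining_attribute(scores)
    print(label, name, round(scores[name], 2))
```

A cluster whose scores are all close to zero would, on this reading, be a "central" cluster with no strongly distinguishing attribute, whereas a large positive or negative score points to a candidate defining characteristic to be checked against the map.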
Epilogue:
Researching Consumer Data
Paul Longley, James Cheshire and Alex Singleton
The contributions to this book provide wide-ranging evidence that consumer data are pervasive and have the potential to generate a deeper understanding of our society. This extends their importance far beyond the realm of identifying customer tastes and preferences into substantive contributions to the social sciences in particular. For example, the chapters in this volume demonstrate that consumer data can help to provide insight into issues as diverse as urban vitality, community carbon footprints, or the collective consumption of public transport services. However, for their potential to be fully realised in tackling issues of broader societal concern, the quality and provenance of consumer datasets need to be fully understood. Developing this understanding is part of the process of assimilation and documentation of diverse sources and forms of consumer ‘Big Data’ into appropriate digital data infrastructure, a core mission of the Consumer Data Research Centre (CDRC). For example, the desire to generalise patterns, as recorded within consumer data, to the population at large necessitates triangulation and validation with more conventional sources of data, such as the Census of Population and Mid Year Population Estimates. Whilst this work integrates consumer data into the national data infrastructure routinely utilised by the academic community, it also offers insights relevant to data providers, many of whom may not be fully aware of the precise sectors of society that they serve. It is also of relevance to government in its efforts to integrate consumer data sources into official statistics. From this perspective, consumer data are in significant part a public, rather than a private, good. They are also non-rival – that is, their use does not undermine the competition concerns of individual business organisations, but they contribute to the overall competitiveness of the economy.

For this to be operationalised in the best interests of data providers as well as society as a whole, the CDRC is finding it helpful to subsume particular consumer data sources into composite indicators, similar to existing widely used indices of multiple deprivation and geo-demographic classifications. The CDRC research agenda thus includes the creation and maintenance of indicators relating to retail dynamics, use of digital channels and media, demographic structure, mobility characteristics, local health and carbon footprints.

In these respects, it is important to be aware of a significant divergence of CDRC interests from those of commercial data providers and those of government. Academic research is concerned not only with the short-term gyrations of consumer-led markets but also with their long-term evolution and potential socio-economic implications. In consolidating large assemblages of data into summary indicators, the CDRC is aware that it is important to have well-founded data infrastructure that is also enduring and facilitates comparisons across time and space. As such, the diverse data sources and case studies reported in this volume coalesce into a long-term vision that reshapes the way in which we think of digital data infrastructure in the social sciences.
First published in 2018 by
UCL Press
University College London
Gower Street
London WC1E 6BT