Consumer Data Research
Paul Longley, James Cheshire and Alex Singleton
Acknowledgements

The editors are grateful to the Economic and Social Research Council for funding and supporting the work of the Consumer Data Research Centre (CDRC), an ESRC Data Investment, grant ES/L011840/1 and all the research featured in this book.

Sarah Sheppard (CDRC Project Manager) has been particularly instrumental in the success of CDRC and, by extension, this book. Her efforts to coordinate researchers as well as maintain close working relationships with data providers are greatly appreciated! Thanks also to Patrick Morrissey (Unlimited) for his excellent work designing and typesetting the book.

The authors and the CDRC would also like to thank our Data Partners for making the data available for the research featured and for their continued support.

Consumer Data Research Centre
An ESRC Data Investment
Contents

INTRODUCTION
Consumer Data Research – An Overview
Paul Longley, James Cheshire and Alex Singleton

PART ONE
PROVENANCE AND CONSUMER DATA INFRASTRUCTURE

1. Consumer Registers as Spatial Data Infrastructure and their Use in Migration and Residential Mobility Research
Guy Lansley and Wen Li

2. The Provenance of Customer Loyalty Card Data
Alyson Lloyd, James Cheshire and Martin Squires

3. Retail Areas and their Catchments
Michalis Pavlis and Alex Singleton

4. Given and Family Names as Global Spatial Data Infrastructure
Oliver O’Brien and Paul Longley

PART TWO
DYNAMICS AND CONSUMER DATA INFRASTRUCTURES

5. Ethnicity and Residential Segregation
Tian Lan, Jens Kandt and Paul Longley

6. Movements in Cities: Footfall and its Spatio-Temporal Distribution
Roberto Murcio, Balamurugan Soundararaj and Karlo Lugomer

7. The Geography of Online Retail Behaviour
Alexandros Alexiou, Dean Riddlesden and Alex Singleton

8. Smart Card Data and Human Mobility
Nilufer Sari Aslam and Tao Cheng

9. Interpreting Smart Meter Data of UK Domestic Energy Consumers
Anastasia Ushakova and Roberto Murcio

PART THREE
NEW APPLICATIONS AND DATA LINKAGE

10. Geovisualisation of Consumer Data
Oliver O’Brien and James Cheshire

11. Geotemporal Twitter Demographics
Alistair Leak and Guy Lansley

12. Developing Indicators for Measuring Health-Related Features of Neighbourhoods
Konstantinos Daras, Alec Davies, Mark A Green and Alex Singleton

13. Consumers in their Built Environment Context
Alexandros Alexiou and Alex Singleton

EPILOGUE
Researching Consumer Data
Paul Longley, James Cheshire and Alex Singleton
Consumer Data Research – An Overview
Paul Longley, James Cheshire and Alex Singleton

It has become a cliché to observe that new sources of Big Data are becoming available in ever greater variety, in unprecedented volumes and with ever more frequent temporal updating (velocity). This book is about ‘consumer data’ that arise out of everyday transactions for goods and services, carried out between individuals and organisations. Such data account for an increasing real share of all of the characteristics and activities of active citizens today, and offer the prospect of better understanding the nature and functioning of society.

Consumer data are not created for the edification of researchers and analysts. Instead, they are a by-product of the myriad consumer transactions that created them. This has important implications for the data’s content and coverage when they are reused for research purposes. First, the traces of (some kinds of) transactions or those people conducting them may be more evident or detailed than others, and this outcome is usually well beyond the control of the analyst. Second, different individuals have different wants, needs and spending power, and so some individuals in the population at large will be represented more prominently than others – and at the other extreme, those that consume nothing from a particular retailer or service provider will not be represented at all. A related point is that few consumer organisations have a monopoly of their markets, and many focus upon particular market niches. Taken together, this means that there is bias in the content and coverage of consumer data sources, and that the source and operation of bias cannot be ascertained without reference to external sources. In many ways these issues are akin to those that characterise volunteered or crowd-sourced data, in that individuals need to feel motivated in order to contribute data, and the distinctive characteristics of those that feel motivated may affect the content and coverage of the resulting dataset (Haklay, 2010).
This situation contrasts sharply with the design of conventional social surveys, where the principles of scientific sampling are used to ensure complete coverage of the relevant population of interest at the design stage. Nevertheless the quality of social surveys is diminished where acceptable response rates are not achieved, or there is bias in the relevant characteristics of those that respond to the surveys and those that do not. In this context, it is important to recognise that recent years have seen cumulative declines in response rates throughout the developed world (e.g. Sax et al., 2003) and that in important respects social surveys are no longer a panacea for social science research. More generally, there is also no guarantee that we will be able to rely on the long-term availability of those traditional sources of data such as a Census of the Population, as within many countries these expensive and time-consuming surveys have come under increasing threat in line with fiscal constraint (Singleton et al., 2017).

Many of the chapters in this book arise out of shared challenges that are faced by academics and the organisations that, to differing degrees, create consumer data. There are, of course, differences too: the timescales that characterise academic research offer horizon scanning that business organisations are less likely to have the resources to facilitate, since they are usually focused upon more operational matters, such as optimising the next set of sales figures. There may be tensions too, in that consumer data providers may safeguard their competitive position, while contributing to research that ultimately increases the competitiveness of their industrial sector as a whole. There are also differences of emphasis in method, technique and application that have evolved in different ways between the academic and business sectors. But it is also possible that there is shared interest in better understanding the form and functioning of social systems.

The research reported in this book has developed using the Consumer Data Research Centre’s (CDRC) ‘ladder of engagement’, whereby initial collaborations with consumer organisations are focused upon specific small MSc projects. A number of these have developed into co-sponsored PhD projects, or shared projects staffed by CDRC Data Scientists. Some data providers then progress to providing data for wider use by the academic community, under agreed terms set out in data licensing agreements. Finally, it is also possible to engage data providers in the co-production of data with the CDRC itself. Good examples are provided by our engagement with players in the domestic energy provision and retail sector who have participated in the Master’s Research Dissertation Programme before going on to co-sponsor PhD research. This latter development in turn led to providing CDRC with a nationwide dataset, which other researchers can access through the CDRC service. The collaboration with the Local Data Company (LDC) reported in this book represents the highest rung of this ‘ladder of engagement’ and follows successful collaboration on MSc and PhD projects as well as the co-production of nationwide data with CDRC for further research and development.

Many consumer-facing organisations are highly sensitised to the risks of disclosure, although these risks are minimal where data are anonymised prior to transfer, and appropriate resources to access them are put in place. To this end, CDRC uses a number of secure data facilities (one of which is accredited by the London Metropolitan Police), and CDRC researchers are familiar with using novel data access technologies such as secure links to sensitive datasets held by different organisations.

The approaches to consumer data research that are reported in this book come at an interesting time in the evolution of data landscapes in advanced economies. There is emerging consensus that data are the world’s most valuable resource (The Economist, 2017). To the behemoths of the Internet age – Alphabet, Amazon, Apple, Facebook, Microsoft – data are a strategic resource, largely to be acquired and siloed within corporate organisations. From the broader public good perspective, data provide infrastructure for individual and societal decision-making. For example, there is abundant evidence that Open Data platforms and open Application Programming Interfaces (APIs) lead to wide economic and social benefits, with the data feeds from Transport for London (TfL) providing one of the best-known exemplars. Such initiatives can lead to the creation and successive updating of new data infrastructures, although in many cases this process is impeded by difficulties in apportioning the cost of infrastructure creation and maintenance. Whilst there has been significant progress, the freer movement of data within and between jurisdictions and industrial sectors still presents daunting challenges for government, not least because there exists no open market for many sources and forms of data.

Without a strong precedent, the work of CDRC relies heavily upon the attitudes to data licensing of a wide range of industrial partners with their own policies and procedures (over 20 data licensing agreements have been signed to date). These partners provide their data for the public good and pursue research questions that contribute to a more competitive economy and fairer society. Some of these shared objectives were integral to the Digital Economy Act 2017, which includes provisions to require businesses to assist in the compilation of national statistics. The spirit of the approach underpinning the chapters of this book is to go beyond narrow official requirements and engage in truly collaborative inter-sector research of common concern. It is our hope that these arrangements might flourish further in the future, for example through the ‘passporting’ of data originally acquired for government statistical purposes to researchers. Such arrangements would also have favourable implications for the preservation and curation of many sources of consumer data under the provisions for research exemptions of the General Data Protection Regulation (GDPR).

This vision raises a number of important strategic questions concerning the form and detail of the emerging data landscape:

1) Are Big Data to be thought of as a rival or non-rival resource? The siloed approach of large corporations suggests that data are a valuable commodity and strategic resource, the potency of which is diluted if data are shared with competitor ‘rivals’. Seen from this perspective, they are not to be traded or otherwise shared. Yet data sharing has been shown to leverage wide benefits, particularly if data platforms can be made open to the widest constituency of users.
2) Does GDPR present a threat to the creation and maintenance of datasets for research purposes, or an opportunity for researchers to create, maintain and preserve data-rich representations of social systems?
3) How can the Big Data ‘exhaust’ of consumer transactions and interactions be reused in representations of social systems that are genuinely inclusive? How can scientific methods be repurposed to analyse data that are created and possibly assembled without any scientific research design?
4) How can public trust and understanding of science be developed and maintained in support of research that realises more of the potential of consumer data?

CDRC’s mission includes the creation and maintenance of new measures of the ways in which ‘smart’ urban systems function, for example with respect to pedestrian flows, household activity patterns and residential and social mobility. Any representation of a ‘smart’ system is necessarily incomplete, and it is important for analysts and public alike to understand the nature and extent of this incompleteness. Furthermore, improved scientific understanding of the public is inextricably linked to improved public understanding of science, since only this is likely to bring about informed consent for the acquisition of the best data and allow the best research practices to take place.

There are rapid developments and changes in the digital data economy, ranging from renewed open data initiatives to the creation of new data silos within industry. Given their increasing real share of all data collected and their salience to understanding individual activities, attitudes and preferences, it seems clear that consumer data have an important role to play in developing tomorrow’s data infrastructures. The contributions to this book illustrate many of the ways in which academic engagement with customer-facing organisations can release consumer data that will help us to better understand what is going on in contemporary society. Yet effective representation of consumer behaviour will not be achieved unless the sources and operation of bias in consumer datasets can be successfully accommodated. This argues for a research agenda that seeks to triangulate rich, salient and timely consumer data with more conventional census, administrative data and social survey sources.

Further Reading

Haklay, M. (2010). How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design, 37(4), 682-703.

Sax, L. J., Gilmartin, S. K. and Bryant, A. N. (2003). Assessing response rates and nonresponse bias in Web and paper surveys. Research in Higher Education, 44, 409-32.

Singleton, A. D., Spielman, S. and Folch, D. (2017). Urban Analytics. London: Sage.

The Economist (2017). ‘The world’s most valuable resource is no longer oil, but data’. May 6. https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource
PART ONE
PROVENANCE AND CONSUMER DATA INFRASTRUCTURE

1. Consumer Registers as Spatial Data Infrastructure and their Use in Migration and Residential Mobility Research
Guy Lansley and Wen Li

1.1 Introduction

This chapter outlines efforts to devise modelled estimates of population change at a small-area level using annual registers that blend consumer and voter registration data. Names and addresses of individuals are routinely collected by governments and commercial organisations. However, there have been few attempts by academics to pool the data in order to track population changes despite the registers representing the majority of the adult population. Therefore, the possibility of linking databases for chronological pairs of years could provide a unique insight into population dynamics on an annual basis. Aligned with consumer data analytics, this information could reveal important statistics about the United Kingdom’s changing social structure and how it varies geographically – with far more frequent refresh than available from comprehensive government sources such as the Census of Population. Comprehensive models of migration at a household level would give us the opportunity to develop an understanding of social mobility and asset accumulation through linkage to other geographic datasets.

In this chapter, we present work on the 2013 and 2014 Consumer Registers produced by CACI Ltd (London, UK). The registers comprise the public version of the Electoral Register (sometimes termed the ‘edited register’) and are supplemented by a range of unattributed consumer data sources. Together, these population databases provide near-complete coverage of the adult population at the individual level and are consolidated on an annual basis. However, the data only contain information on adult individuals’ names and postal addresses and lack any demographic variables. In addition, due to the nature of their data collection and amalgamation, the consumer data are of unknown provenance. We have therefore developed novel data-linkage techniques in order to assess the completeness of the population recorded prior to modelling apparent trends from these pooled data.

Set in the context of harnessing information on population dynamics from data linkage between two registers, this study has three broad aims. First, to devise an appropriate technique to match addresses. Second, to estimate household dynamics by linking names at matched addresses. And finally, to estimate migration by modelling the movements of those that have left and joined addresses, specifically between 2013 and 2014. We will explore the feasibility of this model as a means of representing migration and social mobility.

1.2 The data source

The consumer registers potentially provide an invaluable source of population data as they comprise the vast majority of the adult population at an individual level. The data are routinely collected throughout the year, although collection methods vary between the registers’ different data sources. The latest public Electoral Register enumerates about 50% of the population; it is usually updated in bulk in the autumn (with a deadline for inclusion being 15th October) and then released a few months later. However, following the introduction of Individual Electoral Registration in 2014, the proportion of those who decided to opt out of the edited versions of the Electoral Registers has increased (Electoral Commission, 2016). Therefore, the consumer sources are becoming more important underpinning components of the consumer registers.

In this study we have acquired registers for 2013 and 2014. In total, the 2013 register has 54,380,747 records, whilst the 2014 register represents 55,397,463 individuals. There are slightly over 27 million unique addresses in both datasets.

1.3 Consumer representation

Issues of representation are paramount to all consumer datasets (Kitchin, 2014). Therefore, we have considered possible data biases, and how they may vary geographically. The Electoral Register has historically been considered a representative source of data on the voting population. Many social researchers have used the registers to create effective sample frames for surveys (Hoinville and Jowell, 1978). However, there are three main issues with accessing the data for social research today. Firstly, not all adults living in the UK are eligible to vote and are therefore excluded from the registers. Secondly, not all eligible adults are on the register due to political disengagement or changes of address that are untimely from the perspective of voter registration. Finally, not all adults agree to have their names and addresses shared on the public versions of the Electoral Registers. Consequently, the public versions do not fully enumerate the adult population. Only about 50% of the adult population was recorded by the edited version of the Electoral Register in 2013, and there has been considerable variation around this mean figure in recent years. The opt-out rates for the edited register have also steadily increased since its introduction in November 2001. Data made available from the UK Office for National Statistics (ONS) revealed that the opt-out rate in 2014 ranged from 19% to 88% across local authorities. Moreover, the accuracy of the records is not known with certainty: it is estimated that 91% of entries were accurate at the time of release of the 2015 register (Electoral Commission, 2016).

Previous research by the Electoral Commission found that Electoral Registers have an inherent demographic bias. As few as 67% of adults aged 20 to 24 were included in the data (Electoral Commission, 2016). There was also an under-representation of adults of black and minority ethnic backgrounds and foreign individuals who were eligible to vote due to their country of citizenship (i.e. Irish and Commonwealth citizens). In addition, only 57% of respondents in privately rented properties were found to be in the Electoral Register. This suggests that it is the geographically mobile population that is typically under-enumerated or inaccurately recorded. It is highly likely that the remaining data sources in the Consumer Registers will also under-enumerate those who recently changed address, as there is little incentive to update one’s details immediately for many services following a change of address. It is also possible that different sources of consumer data may have particular demographic and socio-economic biases.

Previous research has focused upon issues of under-representation when discussing the provenance of big datasets. The Consumer Registers, by contrast, appear to over-represent the size of the adult population. We have compared the number of records to the estimated population of persons aged 17 and above from the ONS mid-year population estimates. For example, the 2013 and 2014 Consumer Registers each contain over three million more individuals than the ONS population estimates for the same year. This could be due to a number of reasons such as the duplication of those who live at multiple addresses, failure to delete old records and issues of cross contamination when data are pooled (Bollier, 2010). There are also likely to be some individuals below the age of 17 in the consumer data who cannot be distinguished due to the unavailability of demographic variables. We should also consider that population estimates do not represent the actual population counts.

We have attempted to identify whether there are geographic patterns of over-representation. Firstly, we have considered local authority (or district) level variations at the national level through comparisons to the 2011 Census population (adults aged 17 and above) (Figure 1.1). It can be observed that the two main areas of under-representation are London and Northern Ireland. Whilst under-enumeration in London can possibly be accounted for by the higher proportion of (non-voter) migrants and individuals in rental properties, the low counts in Northern Ireland are probably due to different administrative procedures of their Electoral Office or a low presence of participating retailers. Indeed the pattern across the UK is rather haphazard; whilst the most over-represented districts are generally less densely populated, this is not always the case. As the electoral roll is administered by local authorities, it is possible their varying practices have contributed to these differences. In addition, some of the consumer data may come from companies which have regional customer biases. We have also considered the spatial distribution of representation at the intra-urban scale. We have taken the City of Bristol as an example due to its pronounced socio-spatial inequalities and observed the rate at the census output area (OA) level. Census OAs had an average population of just over 300 in 2011. Indeed, Figure 1.1 also highlights that most under-representation occurs in the centre of the city. This part of the city has the greatest proportion of young adults, ethnic minorities and those in privately rented accommodation. All three of these characteristics were found to be associated with under-enumeration in the Electoral Register (Electoral Commission, 2016). Generally, it is neighbourhoods with the greatest rate of homeownership which have the highest counts in the consumer registers.

Figure 1.1: The ratio of the number of recorded persons in the 2013 Consumer Register to the population of persons aged 17 and above from the 2011 Census at the district level for the UK (left) and output area level for Bristol (right).

1.4 Address matching

The addresses recorded in the registers are formatted into six text columns representing distinctive lines of their postal addresses, such as house numbers or names, streets, cities, etc. In addition, there is also a postcode column. However, unfortunately, the addresses are not consistently structured. For example, the first line of an address may represent a flat number for some addresses, whilst it could represent the street name and house number for others. In addition, the number of lines in each address varies; many records do not include the county or region name. Although the data provider did include a unique reference number for each address, there were inconsistencies between its recording in 2013 and 2014.

Our aim was to create a methodology to match as many addresses as possible, regardless of how they are formatted. Due to inconsistencies within the database, we could not match all dwellings via a simple string match. To improve the quality of joining via textual addresses, we devised a method for matching addresses based on similarity of text strings. The method combines three similarity functions derived from intuitions about UK addresses. The first one is based on the numbers used in the addresses including property numbers and flat numbers. Examples are ‘14’, ‘14a’. The second is based on the word difference between two addresses, which measures how close the sets of words used in the two addresses are. This will cover the cases where addresses do not contain a house number. The function also takes into account the common words in addresses (such as road, street) by weighting the difference between words in inverse proportion to their frequency in the data, as well as their abbreviations. The third function is a variant of Levenshtein Distance (a.k.a. Edit Distance) which measures the difference in terms of characters. The adaptation incorporates a weighting scheme to emphasise the difference at the beginning of the textual addresses.
To match addresses from a set of candidates, we combined the scores from the three similarity functions by weighted sums. The parameters were tuned by inspecting the matching pairs with large dissimilarity with respect to each similarity function.

Using our methodology, between 2013 and 2014 we were able to match 26,757,456 addresses, 98.9% of records in 2013.

We also acquired the addresses of all dwellings that were sold in 2013 and 2014 from the Land Registry. These data are useful for determining where changes in residence were very likely to have occurred. In total, the databases contained 683,842 sold homes in 2013 and 794,929 in 2014. Through our methodology, 100% of these addresses could be matched to addresses from the Consumer Registers.

1.5 Identifying household change

With a valid means of linking addresses, it was possible to detect household level changes between years by matching the residents. We considered both the total number of residents in each year, and also changes in household composition. This was possible by matching residents’ full names between different years in order to detect recurring residents. For example, if in one year ‘John Smith’ and ‘Sally Smith’ resided at a dwelling, and the following year ‘John Smith’ and ‘David Jones’ lived there, our model would assume one adult has remained, one adult left the property and one adult joined or came of age. We also created a key to represent the small number of individuals who may share their full name with another resident in their household. As this accounted for roughly 100,000 individuals in each dataset, we have presumed that many of these are not duplications and could be senior/junior name variants.

However, this method would fail to account for individuals whose names may have been recorded differently in different registers. Many individuals may have changed their names. There are roughly 120,000 marriages a year in England and Wales and many married women will take their husbands’ surnames. We therefore applied heuristics to detect name changes due to marriage. Titles were not found to be useful discriminators of gender: many records were missing titles and there were also occurrences of gender neutral titles such as ‘Dr’. Therefore, we used a lookup table of genders by forenames to estimate gender where the titles ‘Mr’, ‘Mrs’, ‘Miss’ or ‘Ms’ were not present. The database was built from birth certificate and consumer data files and represented over 17 million individuals (as described in Lansley and Longley, 2016). With the ability to differentiate between genders, our next task was to identify occurrences where a female’s forename matched between both datasets within a household but her surname did not. We then checked to see if a male was also present in the same household in both years. If the female’s surname in the second year was identical to that of the male, then we assumed her surname changed following marriage. Between 2013 and 2014, 100,439 individuals were identified as having names that changed due to marriage. This figure is plausible given that many wives may not change their names after marriage and a proportion may not have lived with their husband in the preceding year.

Although punctuation was removed from the name matching process, we also created a flag to identify those with double-barrelled names. It was observed that some adults may have double-barrelled surnames in one register and just one of their singular surnames in the other. Aside from marriages, the main cause of this could be inconsistencies in name entry procedures between data suppliers. In addition to the identified marriages, we found that 11,743 individuals had double-barrelled surnames that were inconsistently recorded.
Table 1.1: Changing household characteristics, 2013–14.

Household type          Number of households
Stable household        19,940,359
Complete change          3,153,518
Growth                   1,614,979
Shrinkage                1,218,182
Unstable household¹        830,418
Present in 2013 only       289,808
Present in 2014 only       512,244
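The categories in Table 1.1 are not formally defined in the text, so the sketch below shows only one plausible way of deriving them from the resident names recorded at a matched address in each year. The category rules (in particular the reading of "Unstable household" as a household with stayers, leavers and joiners) and the function names are assumptions. A `Counter` is used rather than a set so that residents who share a full name, such as senior and junior namesakes, are not collapsed into one person.

```python
from collections import Counter


def household_change(names_2013, names_2014):
    """Split residents into stayers, leavers and joiners (as multisets)."""
    before, after = Counter(names_2013), Counter(names_2014)
    stayed = before & after   # names recorded in both years
    left = before - after     # names recorded in 2013 only
    joined = after - before   # names recorded in 2014 only
    return stayed, left, joined


def classify_household(names_2013, names_2014):
    """Assign one address to a Table 1.1 category under assumed rules."""
    if not names_2013 and not names_2014:
        raise ValueError("address has no residents in either year")
    if not names_2014:
        return "Present in 2013 only"
    if not names_2013:
        return "Present in 2014 only"
    stayed, left, joined = household_change(names_2013, names_2014)
    if not left and not joined:
        return "Stable household"
    if not stayed:
        return "Complete change"
    if not left:
        return "Growth"
    if not joined:
        return "Shrinkage"
    return "Unstable household"
```

Applied to the example of 'John Smith' and 'Sally Smith' followed a year later by 'John Smith' and 'David Jones', `household_change` reports one stayer, one leaver and one joiner, and `classify_household` returns "Unstable household" under the rules assumed here.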

Finally, we also considered surnames that were misspelled using a similar approach. This time we identified occurrences of identical forenames and surnames which were different by up to just three characters. In addition to those identified as recently married, or with inconsistently formatted names, 73,532 persons were identified as having differently spelt surnames. In total, 185,714 persons were matched despite being recorded with different surnames; these were subsequently reassigned as stable residents.

Although the registers contain personal information, our analysis was automated and the outputs have been aggregated to avoid issues of privacy. Throughout the chapter we have used some names as fictitious examples to demonstrate key concepts.

Following name cleaning, our household matching model identified that the vast majority of households remained stable, by which we mean their composition of recorded residents was identical in both registers. The frequencies of different types of household change between 2013 and 2014 are outlined in Table 1.1.

We would expect there to be a geography to the rate of churn identified by linking the 2013 and 2014 databases. Taking Bristol as an example, the proportion of households with at least one continuing resident (by which we mean a name appearing at an address in both 2013 and 2014) has been mapped (Figure 1.2). It can be observed that the central parts of the city have the lowest proportion of addresses which represent the same households in both years, indicating that more population churn occurs in cosmopolitan areas.

It is very difficult to determine who may have joined a household due to a change of address or due to coming of age. One possibility is to filter adults who join households where at least one other household member shares their surname, as a large proportion of these are likely to be the offspring of other household members. Indeed, just over 2 million people met this criterion between 2013 and 2014. However, this number is very high considering the population of 18 year olds in this period was just over 770,000 according to the mid-year population estimates from the ONS. Therefore, many of these may be young adults returning to their parents’ homes due to rising rent costs or elderly family members moving in. Indeed, between 2008 and 2015 the number of young adults who resided with their parents rose sharply to 3.3 million (ONS, 2015). Through linkage to our forenames database, it was possible to obtain inferences about age structures. Names have been found to be associated with age groups due to changes in baby name popularity over time, and changing rates of migration (Lansley and Longley, 2016). The forenames database provides models for the typical age structures for over 10,000 given names and was built from birth certificate records and consumer data sources (Lansley and Longley, 2016). It was observed that the median estimated age of those who have joined the family household
1. Consumer Registers as Spatial Data Infrastructure and their Use in Migration and Residential Mobility Research 21

is substantially younger than the average for the rest of the data.

Figure 1.2 The proportion of addresses with at least one continuing resident in Bristol by output area.

1.6
Estimating migration

Having established a means of data linkage in order to detect population changes at a household level, our next objective was to model a substantive proportion of residential mobility which occurred between 2013 and 2014.

As the data lack any associated attributes, modelling migration had to be computed from novel data linkage techniques based on given and family names. Our heuristics are relatively straightforward. If we subset the data to create a database of adults who were present at a given address in the initial time period and were not rerecorded there in the subsequent time period, we could term these individuals ‘leavers’. A second subset of ‘joiners’ comprises those who were recorded at a dwelling during the subsequent time period but not the earlier.

It is highly probable that many of the leavers have also been recorded as joiners at their new address. However, it is difficult to confidently match them. Our first approach is to consider only adults with one occurrence of their full name in each subset of movers. If a name is recorded only once in the leavers subset and only once in the joiners subset, it is highly probable that they represent the same individual who has changed address between 2013 and 2014. It is, therefore, possible to record their origin and destination. To increase the number of adults who we are able to match we will also consider residential mobility at the household level. By grouping all of the movers from the same household into a single household name composition key, it is possible we may be able to link household units that moved together. In many cases, we could also identify individuals with more common names using this approach. For example, potentially thousands of John Smiths could have moved house between our sample years. However, if one John Smith was originally a resident at an address with an individual with a less common name, there is a stronger likelihood that they are the only household unit which contains those individual names together. Of course, this model can only identify the moves of household units if members move together and are, therefore, recorded identically at their new address in 2014.

Therefore, to predict migration we utilised two models. The first model attempts to identify singular joins between household units from households that left an address in 2013 and are present in households in 2014. A second model then found additional movers by focusing only on adults with unique full names, as not all moved household units will remain intact following a move (i.e. household deformation or due to delays in recording specific members). Although the models may neglect individuals with more common names, their results could be informative of broader migration trends. For instance, it will be possible to generate statistics on moves, such as distance and deprivation. This insight can then be used to allocate the non-unique name holders into the most likely origin-destination pairings.
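
The leaver and joiner heuristics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each register is assumed to be a list of (full name, address) pairs, the function names are hypothetical, and all names and addresses are fictitious.

```python
from collections import Counter, defaultdict


def split_movers(register_a, register_b):
    """Return (leavers, joiners) between two registers.

    Each register is assumed to be a list of (full_name, address) pairs.
    """
    in_a, in_b = set(register_a), set(register_b)
    # Leavers: at an address in the first year, not recorded there in the second.
    leavers = [rec for rec in register_a if rec not in in_b]
    # Joiners: at an address in the second year, not recorded there in the first.
    joiners = [rec for rec in register_b if rec not in in_a]
    return leavers, joiners


def match_unique_names(leavers, joiners):
    """Link a leaver to a joiner when the full name occurs once in each subset."""
    leaver_counts = Counter(name for name, _ in leavers)
    joiner_counts = Counter(name for name, _ in joiners)
    destinations = {n: a for n, a in joiners if joiner_counts[n] == 1}
    return {
        name: (origin, destinations[name])
        for name, origin in leavers
        if leaver_counts[name] == 1 and name in destinations
    }


def match_households(leavers, joiners):
    """Link groups of movers sharing an identical set of names at one address."""

    def composition_keys(records):
        by_address = defaultdict(list)
        for name, address in records:
            by_address[address].append(name)
        # The sorted tuple of names acts as the household name composition key.
        return {addr: tuple(sorted(names)) for addr, names in by_address.items()}

    left, arrived = composition_keys(leavers), composition_keys(joiners)
    left_counts = Counter(left.values())
    arrived_counts = Counter(arrived.values())
    destinations = {k: a for a, k in arrived.items() if arrived_counts[k] == 1}
    return {
        key: (origin, destinations[key])
        for origin, key in left.items()
        if left_counts[key] == 1 and key in destinations
    }


# Fictitious example registers.
register_2013 = [
    ("Ada Quill", "1 Elm Rd"),
    ("John Smith", "2 Oak St"),
    ("Zofia Nowak", "2 Oak St"),
    ("John Smith", "9 Ash Ln"),
    ("Mo Zed", "5 Fir Cl"),
]
register_2014 = [
    ("Ada Quill", "7 Yew Way"),
    ("John Smith", "4 Bay Rd"),
    ("Zofia Nowak", "4 Bay Rd"),
    ("John Smith", "6 Ivy Ct"),
    ("Mo Zed", "5 Fir Cl"),
]

leavers, joiners = split_movers(register_2013, register_2014)
name_moves = match_unique_names(leavers, joiners)
household_moves = match_households(leavers, joiners)
```

In this toy example neither John Smith can be linked by name alone, because the name occurs twice among the leavers, but the John Smith who moved with Zofia Nowak is linked through the shared composition key.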

1.6.1
Unique names

As our models are largely based on the linkage of unique occurrences of names between our movers databases, it is important to understand the connotations this may have when attempting to represent the wider population. Most full names are relatively uncommon. For example, in the 2013 register, 18.3% of the population have unique full names and 50% of adults share their names with fewer than 16 other individuals. Figure 1.3 displays the cumulative frequency for full names in 2013. However, in addition to considering unique names alone, by pooling all of the names within households that change address, our models will also consider many individuals with more common names. For example, while there are over 11,000 David Smiths and 6,000 Margaret Smiths (the most common male and female names respectively) in the 2013 register, there are fewer than 140 households comprising these two names together, despite it being the third most common household name composition.

Figure 1.3 A cumulative percentage of the frequency of full names in 2013. (x-axis: Frequency of full names; y-axis: Cumulative percentage.)

Figure 1.3 also labels the nine most popular names in the 2013 data, all of which are white British male names. We have considered that a large proportion of adults with unique names may have international heritage. Therefore, to explore the relationship between ethnic heritage and name popularity we ran all of the names from the 2013 register through a names classifying tool called Onomap (www.onomap.org). The tool assigns each name (considering both forenames and surnames) to its most likely cultural, ethnic and linguistic group and was produced from clustering an extensive database of forename-surname pairs (Mateos et al, 2007). The proportion of name-inferred ethnic groups for the 2013 Consumer Register and a subset of those with unique names only is shown in Table 1.2. The percentage of British ethnic groups as recorded in the 2011 Census has also been included for comparison.

Although reliant on names as proxies of cultural heritage, the analysis suggested that the Consumer Register slightly over-represents the White British population. This assumption is reasonable given that the Electoral Register is known to under-enumerate ethnic minorities. Although the precise sources of the consumer data are not known, ethnic minorities are also known to be under-represented in large customer loyalty databases. As anticipated, the under-representation of the White British population is considerable amongst adults with unique names. For example, names identified as ‘other white’ background were over 3.5 times as prominent in the unique names subset relative to the original data. This reflects the range and diversity of European names. We, therefore, need to consider that although we have devised a
novel way of estimating internal migration, a greater proportion of the modelled flows may be representative of those with international heritage.

Table 1.2 The proportions of ethnic groups for the 2013 Consumer Register, a subset of adults with unique names only, and the UK 2011 Census.

2001 Census Ethnic Group                                  Consumer Register   Unique names only   2011 (Excl. NI)
A) WHITE - BRITISH                                        84.36%              60.56%              81.47%
B) WHITE - IRISH                                          3.79%               4.55%               0.95%
C) WHITE - ANY OTHER WHITE BACKGROUND                     3.77%               13.30%              4.32%
H) ASIAN OR ASIAN BRITISH - INDIAN                        2.02%               4.48%               2.36%
J) ASIAN OR ASIAN BRITISH - PAKISTANI                     1.78%               3.18%               1.91%
K) ASIAN OR ASIAN BRITISH - BANGLADESHI                   0.40%               0.70%               0.73%
L) ASIAN OR ASIAN BRITISH - ANY OTHER ASIAN BACKGROUND    0.17%               0.74%               1.40%
M) BLACK OR BLACK BRITISH - CARIBBEAN                     0.04%               0.14%               0.98%
N) BLACK OR BLACK BRITISH - AFRICAN                       0.79%               2.67%               1.66%
R) OTHER ETHNIC GROUPS - CHINESE                          0.43%               0.86%               0.70%
S) OTHER ETHNIC GROUPS - ANY OTHER ETHNIC GROUP           1.67%               4.68%               0.55%
Y) UNCLASSIFIED                                           0.78%               4.14%               NA

1.6.2
Representing migration

In total, our model estimated the origin and destination of 762,359 individuals. In addition to these, our model also identified a further 100,000 cases where adults moved within the same postcode. We have considered that these movers may have remained in the addresses that could have been recorded differently in both registers. Therefore these individuals are not included in the subsequent results.

By joining the postcodes to the ONS Postcode Directory, it was possible to observe spatial trends in modelled migration. Most moves tended to occur over relatively short distances which corresponds with known migration traits within the UK (Stillwell and Thomas, 2016). Our median distance was just 19.7 miles as the crow flies, whilst the mean was 66.1 (Figure 1.4). The Royal Mail identified that the average distance of movers which could be identified by their redirection service is just 25.83 miles (Royal Mail, 2017). However this service is likely to be biased towards home owners.

We have presented the key spatial trends as a flow map below, which displays the interactions between local authorities in Great Britain (Figure 1.5). In order to only convey the key trends in the data and avoid issues of disclosure, only flows of at least 40 persons are shown. In addition, we have also included moves within each district. These are displayed as proportional symbols in the centre of each authority.

Most moves between 2013 and 2014 occurred within the same local authority district. It is also observable from Figure 1.5 that a large proportion of flows are between neighbouring local authorities. It is also interesting to predict migration between regions, and observe how it may vary from officially recorded statistics from the 2011 Census. Although recorded differently and
absent of any children, the trends identified by the Consumer Registers were similar to the official statistics from 2011. Flows between regions as recorded from the Consumer Registers and the 2011 Census are displayed in Figure 1.6.

Figure 1.4 A histogram of the distance moved by adults in the Consumer Registers. (x-axis: km; y-axis: Frequency.)

Figure 1.5 Flows of home movers between local authorities. (Legend, flows between districts: 40–59, 60–79, 80–99, 100–119, 120+.)

Figure 1.6 Chord diagrams representing the proportion of moves between regions as identified from the Consumer Registers (left) and the 2011 Census (right).

The migration model also presents an opportunity to gain an understanding of segregation, social mobility and asset accumulation through geographic data linkage. There is an assumption that geographic mobility and social mobility are intrinsically linked as people generally move to improve their life chances (Savage, 1988). Focusing on the English Index of Multiple Deprivation (IMD), it is possible to observe the social trajectory of internal migrants by considering the deprivation ranks of their origin and destination Lower Super Output Areas (LSOAs). To demonstrate the key trends in our data, we have aggregated all of the English LSOAs into IMD quintiles and observed the flows of migrants between them (Figure 1.7).

For each quintile, the most popular out-flow feeds back into the same group. The next largest flows are those into the adjacent quintiles which suggests that there is still only limited social mobility in England. Only a minority of migrants move between places of drastically different levels of deprivation. Interestingly, there was only a slight majority of upwardly mobile flows over downwardly mobile flows. Whilst this could highlight that migration is no longer more abundant amongst socially mobile adults, it is probably also due to adults moving between living with parents, rental accommodation and eventually home ownership. House prices have made many of the least deprived neighbourhoods unaffordable for first-time buyers (Dorling, 2015). In addition, there are also occurrences of elderly relatives moving in with family or to assisted accommodation. Indeed, these results can also be explained by the fact that most moves occur over relatively short distances and deprivation is positively spatially autocorrelated. Figure 1.7 also identifies addresses that were sold in 2013 or 2014. It is also noteworthy that a greater proportion of moves where a house was purchased occurred for movers moving to and from the least deprived parts of the country.
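
The move-level summaries used in this section, straight-line distance and flows between deprivation quintiles, can be sketched as follows. This is an illustrative sketch only: the haversine formula stands in for whatever distance calculation was actually used, the coordinates and quintile pairs are made up, and the function names are hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def crow_flies_miles(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def quintile_flows(moves):
    """Count moves for each (origin quintile, destination quintile) pair."""
    return Counter(moves)


def mobility_summary(flows):
    """Split flows into same-quintile, upward and downward moves.

    Quintile 1 is assumed to be the most deprived, so a move to a
    higher-numbered quintile is counted as upward.
    """
    summary = {"same": 0, "upward": 0, "downward": 0}
    for (origin, destination), count in flows.items():
        if destination == origin:
            summary["same"] += count
        elif destination > origin:
            summary["upward"] += count
        else:
            summary["downward"] += count
    return summary


# Approximate coordinates for central London and central Bristol.
distance = crow_flies_miles(51.5074, -0.1278, 51.4545, -2.5879)

# Made-up (origin, destination) IMD quintile pairs for seven movers.
example_moves = [(1, 1), (1, 2), (2, 1), (3, 3), (3, 4), (5, 5), (4, 5)]
flows = quintile_flows(example_moves)
summary = mobility_summary(flows)
```

With real data, the origin and destination quintiles would come from joining each mover's postcodes to LSOAs and their IMD ranks.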
Figure 1.7 An alluvial plot of migration between different quintiles of the 2015 English Index of Multiple Deprivation, where the lowest quintiles are most deprived. Moves to addresses that were sold in 2013 have been coloured green.

1.7
Conclusion

Our analysis has presented a novel means of estimating migration from population registers for two given years. Given that similar datasets are built each year, such models could fill the data void on internal migration which occurs between census years. Although the model presented in this chapter identified a larger sample of movers than any other available dataset on internal migration, excluding the decennial Census, there are a couple of means which could be employed to increase the number of moves that could be identified from the registers. The social and geographic trends from the modelled data could be used to allocate the duplicated records that could not be definitively matched before. Indeed, we have found that distance is an obvious influencer, although geodemographic characteristics are also worth considering. However, this approach is based entirely on assumptions built from a particular sample of movers. A second feasible approach is through matching registers which are recorded more than one year apart. Research from the Electoral Commission identified that there is a data lag in the electoral register which occurs when individuals change address (Electoral Commission, 2016). Only a minority of individuals are correctly updated within a year of a change of address. It is conceivable that a similar lag may be inherent in the consumer data files too, although it is not possible to determine the rate due to the anonymity of the sources.

This research has demonstrated that it is viable to model household dynamics and migration from population registers through novel data linkage techniques. Although collected for administrative purposes, the coverage of the data presents us with a unique opportunity to harness detailed information on the population at a small-area level. Through a household matching algorithm, it was possible to create indicators of churn which corresponded with results from the
2011 Census. It was also possible to determine internal migration flows for hundreds of thousands of individuals. Whilst the Consumer Registers are the most comprehensive data on the population at an individual level, there were still undoubted issues of representation, and these vary at regional and intra-urban scales. Therefore future work should attempt to understand the provenance of the data in order to scale back the data and fill in data gaps.

Further Reading

Bollier, D. (2010). The promise and peril of big data (p. 1). Washington, DC: Aspen Institute, Communications and Society Program.

Dorling, D. (2015). Policy, politics, health and housing in the UK. Policy & Politics, 43(2), 163-180.

Electoral Commission (2016). The December 2015 electoral registers in Great Britain: Accuracy and completeness of the registers in Great Britain and the transition to Individual Electoral Registration. The Electoral Commission Report, July 2016.

Hoinville, G. and Jowell, R. (1978). Survey Research Practice. London: Heinemann Educational Books.

Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures and their consequences. London: Sage.

Lansley, G. and Longley, P. (2016). Deriving age and gender from forenames for consumer analytics. Journal of Retailing and Consumer Services, 30, 271-278.

Mateos, P., Webber, R. and Longley, P. (2007). The cultural, ethnic and linguistic classification of populations and neighbourhoods using personal names. CASA Working Paper 116, Centre for Advanced Spatial Analysis, University College London.

ONS (2015). Families and households: 2015. Office for National Statistics, Statistical bulletin.

Royal Mail (2017). UK’s first home move map reveals nation’s habits. Online: http://www.royalmailgroup.com/uk%E2%80%99s-first-home-move-map-reveals-nation%E2%80%99s-habits

Savage, M. (1988). The missing link? The relationship between spatial mobility and social mobility. British Journal of Sociology, 39(4), 554-577.

Stillwell, J. and Thomas, M. (2016). How far do internal migrants really move? Demonstrating a new method for the estimation of intra-zonal distance. Regional Studies, Regional Science, 3(1), 28-47.

Acknowledgements

We are grateful to CACI for providing the Consumer Register data under a special research licence to enable us to carry out this research.

Note

1. Unstable households refer to addresses that have remained the same household size, but some of the residents have changed.
2
The Provenance of Customer Loyalty Card Data
Alyson Lloyd, James Cheshire and Martin Squires

2.1
Introduction

Loyalty card schemes have become extremely popular for both retailers and consumers. They create a system of marketing incentives that encourage customer loyalty by offering rewards for repeat shopping behaviour. These schemes, facilitated by technological innovation, have placed retailers at the forefront of the ‘Big Data’ revolution since they now retain and interpret an immense body of data about their customers and their consumption patterns. These data underpin a burgeoning consumer insight industry, but have been subject to relatively little appraisal from the academic community. This oversight is, in part, a symptom of disaggregate loyalty card data being hard to access outside commercial settings. Thanks to the efforts of the Consumer Data Research Centre (CDRC) this hurdle is slowly being overcome, offering the potential to tackle societal and geographical questions in an entirely new way. However, as is prominent in ‘Big Data’ research, many uncertainties arise when they are utilised in contexts beyond those for which they were originally created.

There has been a wealth of academic research on loyalty card schemes, which has primarily focused on the concept of loyalty and whether or not schemes are effective in increasing shopping behaviours. However, sparse evidence is available regarding the applications of loyalty card data outside of commercial contexts. The emergence of large, spatially and temporally referenced data, such as is produced by loyalty cards, has caused a considerable paradigm shift from quantitative geographic research to data-driven geography. This has fundamentally challenged long-standing research practices through its blending of abductive, inductive and deductive approaches. For example, data-driven methods advocate descriptive insights of voluminous populations, rather than
theory driven research supported by sampled observations of individuals. Such data allow for the creation of theories from observed behavioural patterns rather than self-reported responses, such as are a feature of traditional methods (i.e. surveys). They allow us to infer spatiotemporal dynamics directly, on a multitude of scales, and are collected on an ongoing basis, meaning that both mundane and unplanned events can be captured. This has allowed researchers to now analyse the world through large-scale digital systems.

This chapter provides an overview of the provenance of loyalty card data and the utility of these data in population research. Examples are presented from a loyalty card dataset obtained from a major high street retailer (HSR) in the UK, made available through the CDRC. These data represent approximately 18 million UK customers and 500 million transactions collected between 2012 and 2014, from an expansive national network of stores, providing a powerful tool for exploring the dynamics of loyalty card data.

2.2
Data provenance

Loyalty card data have attracted substantial interest from retailers, marketing agencies and the wider research community for two fundamental reasons. Firstly, the proliferation of loyalty card schemes amongst major retailers since the 1990s has provided a means of gathering data pertaining to a large proportion of the consumer population. For example, it is estimated that approximately three-quarters (76%) of consumers carry between one and five cards with them at all times (YouGov, 2013) and collectively, almost 46.5 million people, or 92% of the adult population, are currently registered with at least one loyalty card programme (Loyalive, 2015). Secondly, these schemes provide rich data about consumers. Data collected typically comprise customer metadata such as age, gender and address, in addition to transactional data describing store visiting and product buying characteristics, spending patterns and timestamps of when interactions occur. From a retailer’s perspective, these data are principally collected and analysed for Customer Relationship Marketing (CRM) purposes, allowing a greater understanding of individual customers. For example, variations in loyalty behaviour are typically quantified using segmentation; the notion that a market can be divided up into several behavioural, demographic and/or psychological groups, over a variety of behavioural indicators such as transaction frequency, average intervals, duration of activity and basket size. Facilitated by customer postcodes, these segments are then enriched through linkage to (geo)demographic classifications. These classifications provide a useful context about social structures and common characteristics between people and places, which have been widely applied by businesses to infer lifestyles and social attitudes, and to identify the best locations from which to serve and retain their customers (Longley, 2017). Postcodes are also typically utilised for marketing strategies such as mail-based rewards or location-based targeting, yet also for spatial applications such as location planning and catchment area mapping.

From an academic perspective, this relatively new form of data has attracted considerable interest due to its potential to inform a broad spectrum of social, economic, political and environmental patterns and processes, and represents a huge opportunity for endeavours in human geography. The provision of customer postcodes and store locations provides a valuable geographic reference that can be regarded as the key to utilising these data for a broad range of spatial applications, and the data are high in temporal granularity, providing detailed timestamps of consumption patterns. Equipped with these data there is the potential to build a finer-grained (in both spatial and temporal terms) depiction of societal
phenomena than previous work with more ‘traditional’ datasets such as national censuses. These depictions would have widespread applications in a broad range of public service decision-making processes from transport planning to health.

It is therefore paradoxical that despite the growing abundance of these data, their use by academic researchers has been limited. This, perhaps understandably, is partly due to the data’s origins in privately owned businesses and their secure storage requirements since they provide information about consumer transactions, residential locations, movements and interactions. This raises substantial ethical and legal considerations with regard to disclosure control, anonymisation and privacy. Safeguards are therefore required, especially where geographic information technologies facilitate the linkage of these data to the likes of administrative or alternative spatial data sources. Combined, these aspects have generated substantial barriers to advancing understanding of their fitness for purpose outside of commercial contexts. Yet, access to these data via the CDRC has provided a means of beginning to overcome such obstacles, allowing exploration of their dynamics and the challenges encountered when attempting to apply these data in research. For example, these data are adequate from a retailer’s perspective, as variables are created and data interpreted with the primary focus of understanding and maximising the buying behaviours of their customer base. Conversely, academic endeavours strive to obtain rigorous representative data for their population of interest and therefore tend to prefer official statistics collated by government. It would be impractical to assume that these kinds of consumer data will meet the ‘gold standards’ of national statistical datasets in terms of both their quality and representativeness; therefore, understanding their applications for research purposes requires preliminary considerations such as the completeness, accuracy, bias and validity/plausibility of the data. As we demonstrate below, access to the CDRC HSR loyalty data has facilitated a better understanding of the nature of these data in their raw form. These insights are foundational to a pragmatic approach to utilising these data in wider research and facilitate an appraisal of the potential for such research to offer substantive insights into social and geographical phenomena.

2.3
Loyalty cards as social and spatial data

Loyalty cards typically produce very rich temporal data on consumption patterns. Whilst these behaviours may arguably provide a very useful context of socio-demographic characteristics, loyalty data comprise little explicit socio-demographic information. However, customer postcodes provide a valuable means of linkage to conventional statistical geographic units and data associated with them such as existing national statistics. This provides a number of advantages. For example, it allows measures of neighbourhood type, population characteristics or cultural background to be appended to individual customer records, which permits interpretation of how consumer behaviours may vary with population characteristics. This also allows for the identification of potential biases in the data. Conversely, these data offer a variety of attractions for our current understanding of geodemographic phenomena. Geodemographic classifications are widely used in business and public service organisations, yet are typically derived from surveys such as national Censuses, which may have limited sample sizes that can be affected by non-response rates, a coarse spatial scale and low temporal granularity (collected on a decennial basis). In addition, whilst traditional classifications provide valuable local indicators, and there have been contributions towards daytime indicators with the production of small area workplace statistics, human identity
encapsulates more than the duality of work and residence (Longley, 2017). It is therefore increasingly important to incorporate more appropriate representations of individual trajectories with finer temporal granularities, such as those that represent the dynamics of day-to-day activities. The emergence of novel forms of Big Data such as from loyalty cards offers the potential to facilitate a more sophisticated view of this phenomenon, providing voluminous consumer data that are not compromised by uneven response rates, can be updated on a regular basis and permit consistent comparison between different behavioural datasets on a relatively granular scale (over 1.4 million postcode units across the UK). Such information may provide an enriched description of what makes people, or groups of people, distinctive. However, utilising these data in this context also gives rise to a number of shortcomings, such as the well-established issue of ecological fallacy when aggregating data to a small-area level (i.e. confounding the characteristics of areas with particular individuals who live within them). In addition, linkage of these data is wholly dependent on the assumption that addresses provided by loyalty cards are accurate, which may not always be the case (see Section 2.4).

Figure 2.1 Customer residence to store flows – Central Lowlands, Scotland.

Nevertheless, a prominent advantage of loyalty card data is that they allow us to analyse trends at both individual and aggregate levels. In addition, the transactional data offer high temporal granularity over long time periods, and are therefore able to capture both short-term changes in buying behaviours and long-term trends of changing consumer attitudes and lifestyles. An example comes from Tesco’s Clubcard, where analysis of their loyalty data revealed that customers buying nappies for the first time also tended to buy more beer. Further investigation revealed that this could be attributed to the behaviours of new fathers, who showed an increase in drinking in the home rather than socialising. These kinds of analysis will likely increase our understanding of the linkage between consumption patterns and geodemographics, providing insights for both contemporary classifications and retailers alike.

An additional dimension of loyalty card data is that they not only provide insights into what consumers buy, but also the network of stores that they use to service their needs. Spatial patterns can be observed in this instance by utilising customer postcodes and store location information. For example, Figure 2.1 shows HSR journeys made from places of residence to store locations across part of the Central Lowlands of Scotland. Darker lines indicate more popular trips, which can be generated by convenience of access, or the range of activities that different destinations offer.

Detailed analyses of individual-level daily activity patterns in loyalty card data have the potential to be useful in a number of contexts. Research regarding human activity spaces (i.e. the choice of routes through time and space that individuals take in order to meet their daily obligations) has important applications not only for understanding travel and commuting behaviour, but also under-researched areas such as trip chaining. Very little research has been able to utilise large, temporally granular and longitudinal data for research in such domains. For example, there is a need for greater insight into how consumers incorporate store visits
34 CONSUMER DATA RESEARCH: PART ONE

into their daily travel obligations, a topic on which most research to date has only been able to utilise self-reported travel diaries of relatively small sample sizes. In addition, the vast majority of research into trip chaining has focused on home-to-work trips only, despite work-related travel not representing all activities that are undertaken (e.g. leisure and tourism; Primerano et al., 2008). Loyalty consumption patterns could provide insight into how activities change over time, or how interactions with increasingly popular online alternatives (such as click and collect or home delivery) may affect subsequent behaviours. The data produced by loyalty cards allow us to investigate a broad number of variables relating to mobility, such as distances travelled, size of store networks and the characteristics of locations that individuals visit over time. These insights have important implications for planning decisions and policies in urban environments and also issues relating to high street retail e-resilience.

By incorporating the temporal element of these movements, we can further utilise these data to understand more complex socio-spatial characteristics. This evolving research may enable us to summarise daily activity patterns in both time and space. Figure 2.2 shows an example of patronage flows from customer residences using lunchtime, weekday transactions of HSR customers, compared to self-reported origin to workplace destination flows from the 2011 Census.

Figure 2.2
Daytime flows card vs. census.

Such comparisons suggest that loyalty card data may be able to provide us with the means to understand daytime activities within the general population, help us to better understand aspects such as the connectedness of various locations over different temporal periods (e.g. daily, weekly, seasonal) and, ultimately, aid the construction of geo-temporal profiles. These temporally integrated analyses postulate that people are influenced not only by where they live, but also by places they visit, when they visit them and who they interact with. It is our expectation that loyalty card data, both alone and in combination with other datasets, will advance our knowledge of the functional relationships between places given the volume of interactions between different social, economic and demographic groups that they are able to capture.

2.4
Data quality and uncertainty

The creation of data in a purely commercial setting leads to a complete absence of researcher control in the data collection process. This raises substantial methodological questions when applying commercial data in research; it is therefore necessary to determine data quality, uncertainty and fitness for purpose in order to extract and interpret meaningful insights. For example, initial assessments in this context require identification of potential bias in the sample (i.e. ascertaining both demographic and geographic coverage of customers), determining the quantity, consistency and completeness of data pertaining to individual customers, and also the plausibility of observed trends. The evaluation of these dynamics in loyalty card data is made possible through exploratory analyses of the CDRC’s HSR data, and the following discussion aims to highlight a number of pragmatic measures that should be considered when implementing loyalty card data in research practice. These can be understood in terms of a number of major themes that influence both the loyalty population sample and data quality (see Figure 2.3).

Figure 2.3
Summary diagram of issues affecting data.

In loyalty card data, issues of representation and bias are inherent due to the effects of self-selection, since only those choosing to participate in a loyalty scheme will be represented. Traditional sampling techniques require that every member of a population has a known probability of being selected as part of the sample and that, for the sample to be valid, selection is independent and random. However, loyalty card samples are neither designed nor collected and it is therefore unrealistic to assume that such a dataset will constitute a random sample of the general population. Retailers attract and target certain demographic groups, therefore over-representing those who fall within the scope of the particular markets or activities that are being tracked. In addition to these initial self-selection biases, there is ambiguity as to whether information gathered about the retailer’s loyalty population can be applied to all purchases taking place at a given outlet, or if behaviours can only be attributed to the loyalty card holding segment of customers.
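
One hedged way to make such over- and under-representation concrete is to divide each group's share of the loyalty sample by its share of a reference population, so that values above 1 indicate over-representation. The function name, group labels and counts below are invented for illustration and are not HSR or Census figures.

```python
def representation_index(sample_counts, population_counts):
    """Ratio of sample share to population share for each group.

    A value above 1 means the group is over-represented in the sample;
    a value below 1 means it is under-represented.
    Assumes every population count is greater than zero.
    """
    sample_total = sum(sample_counts.values())
    population_total = sum(population_counts.values())
    index = {}
    for group, population in population_counts.items():
        sample_share = sample_counts.get(group, 0) / sample_total
        population_share = population / population_total
        index[group] = sample_share / population_share
    return index


# Hypothetical counts (illustrative only)
loyalty = {"Group A": 300, "Group B": 500, "Group C": 200}
census = {"Group A": 4000, "Group B": 4000, "Group C": 2000}

index = representation_index(loyalty, census)
for group, value in index.items():
    print(f"{group}: {value:.2f}")
```

The same ratio can be computed per area rather than per demographic group, which gives a location-quotient-style measure of geographic coverage.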

For example, the loyalty customer base may be subject to an underlying self-selection process in which customers who are more money conscious or receptive to special offers are more likely to participate, whereas those with privacy concerns are likely to be deterred. Variations in behaviour as a result of individual/psychological dispositions to participate can be investigated to some extent by comparing loyalty card and non-card transactional data. However, making direct comparisons can be problematic because non-card data also include instances where a cardholder did not use their card with a transaction. Yet, using a data-driven approach, demographic biases can be investigated by drawing comparisons with existing national population statistics, for example, by analysing distributions of age and gender characteristics present in the data. In addition, we can attempt to quantify dynamics by drawing comparisons with existing geodemographic classifications. Figure 2.4 demonstrates an example of the volumes of HSR customers across Output Area Classification (OAC) groups in comparison to Census estimates. These classifications categorise the general UK population based on socio-economic characteristics obtained from the 2011 Census.

Figure 2.4
Proportions of customers by OAC for loyalty customers vs. census – group level. (Bar chart: percentage of customers, 0.0 to 12.5, in each OAC group, for Census Estimates and Loyalty Population.)

It is clear that certain groups are disproportionately represented by these data in terms of both their characteristics and geographic locations, with more affluent groups likely being over-represented (particularly ageing suburban
cohorts and young professionals), and deprived neighbourhoods/less affluent segments of the general population under-represented. Importantly for applications in human geography, this may bias the resulting spatial distribution of customers who are signed up to the scheme, with volumes of customers across areas reflecting underlying socio-economic characteristics rather than a representation of the general population. Figure 2.5 demonstrates an example of the distribution of cardholders across Middle Layer Super Output Areas (MSOAs) in Great Britain, taking into account underlying population volumes as estimated by the 2011 Census. Customers may also be more likely to participate in loyalty schemes if they live in close proximity to a store; therefore, the distribution of customers may also be dependent on the retailer’s store locations relative to their consumer base.

Figure 2.5
LQ of cardholders per MSOA. (Map of Great Britain shaded by Location Quotient.)

Further considerations of bias and uncertainty arise from individual differences in card usage, which can result in a disproportionate representation of customers across the database. Firstly, data may not include all of the purchases made by card-holding accounts, due to customers not having the card or feeling that it is not worth using for small purchases (Wright and Sparks, 1999). In addition, many seldom or never utilise the cards after signing up (Cortiñas, Elorz, and Múgica, 2008) and it is unrealistic to assume that consumers will patronise only one chain, meaning that the records of a single loyalty card offer limited completeness. Therefore, it is important to consider that one retailer’s loyalty data will represent a variable fraction of an individual’s full spending habits. Due to the lack of available data outside of commercial settings, very little research has been able to quantify the effects of card usage on data quality. However, within the HSR database, approximately a third of customers (33%) transacted fewer than 10 times over a two-year period, and approximately 23.4% of all customers were responsible for 60% of all transactions. Overall, only 9% were active on a weekly basis, yet selecting customers that exhibit monthly activity patterns over a two-year period represents approximately 7 million, or 38%, of HSR customers. Therefore, in relation to the total sample size, these data still represent a significantly large and rich source of data in comparison to traditional studies of population activity over longitudinal periods.

Variations in card usage are also evident across different retail locations. For example, significantly lower levels of participation are observed in more transient locations, such as ‘convenience’
stores (i.e. smaller stores located in urban areas). Conversely, higher levels of participation are observed in destination locations, such as city centre flagship stores. This is likely due to the importance of larger basket sizes in these store types, which produce higher loyalty participation (i.e. due to the perceived benefits of more expensive purchases). A product type bias is also evident, with cards more likely to be used with higher value items. The implications of these trends are that, firstly, the distribution of behavioural data will be influenced by the characteristics of a store location, and secondly, if analysing individual product buying behaviours, it is important to consider that the purchasing of certain products may be over- or under-represented based on the propensity to use a card for that particular item. These aspects have important implications for the mobility and product-buying analyses outlined in Section 2.3, as the completeness of individual trajectories may be influenced by these differing motivations to participate. Despite this (due to the enormous volume of overall data) there is still a vast amount of data produced by loyalty cards available across all store locations and product types.

Finally, due to the lack of data collection control, there may be elements of uncertainty regarding the completeness, accuracy and validity of these data. Assessment of data completeness and accuracy may be particularly important in the case of loyalty card metadata, as this information is entirely dependent on accurate human input at the time of enrolment. Simple exploratory analyses can be applied to identify basic errors in these data, such as invalid postcodes or illogical age ranges. However, a more complex issue is that the accuracy of customer postcodes is also dependent on the motivation of a customer to update this information in the event of a location change. For example, we have demonstrated that through the linkage of locational and behavioural attributes, we are able to identify those in the loyalty card population (approximately 3.5%) with stated addresses that may no longer be their usual place of residence. Although this is reassuring since it indicates that the vast majority of loyalty data likely contain valid spatial references, it does also suggest errors assumed not to be present in official statistics. This has important implications if using the postcode information as a key spatial reference to infer social and spatial processes, and efforts should be made to identify potentially spurious patterns before utilising the data. However, we also highlight how these errors are not random and can be disproportionately ascribed to certain segments of the loyalty population, primarily students and other groups who are likely to have particularly transient residential locations. Therefore, we can make attempts to identify customers who are most at risk of exhibiting these data errors.

2.5
Conclusion

Loyalty card data offer an untapped opportunity for researchers to analyse societal and geographical questions in an entirely new way. They represent large numbers of people and allow analyses at a variety of spatio-temporal scales. However, there are a number of preliminary considerations and pragmatic steps required to ensure these data are fit for purpose in a research context. For example, loyalty cards represent large but selective samples, carry inherent socio-economic and spatial biases and present elements of data uncertainty. It would be impractical to assume that these kinds of consumer data will meet the standards of national statistical datasets, yet it is suggested here that pragmatic actions can be taken to ensure the quality of these data is both understood and considered. As such, the preliminary focus when applying these data should be suitably based on the initial assessment and quantification of inherent data quality issues, such as those outlined in Section 2.4. Careful consideration of these
characteristics may facilitate extraction of insights that were not previously possible nor practically obtainable using traditional methodologies. These cautions mirror those adopted in traditional methods of data handling with regard to data quality and sampling bias; however, due to the nature of Big Data collection, efficient methods of revealing these inherent data issues require exploration.

An important future direction is therefore to continue to develop new methods of handling and analysing these data. Traditional statistical methods have been focused on data-scarce science, where aims are to identify significant relationships from small, controlled sample sizes with known relationships. Developments in Big Data research may involve applying data-driven approaches to quantify uncertainty within the data, continual critique and truth propagation and using contemporary social and geographical theory to support the reliable use of these new kinds of data sources. Beyond this, future prospects are concerned with gaining a robust understanding of the applications of these data to advancing our knowledge of population dynamics in respect to consumption behaviour, daytime activities, mobility patterns, spatio-temporal dynamics and the relationship of these patterns to consumer attitudes and lifestyles. Moving towards constructing spatio-temporal classifications using these kinds of data may advance our knowledge of relationships between places in terms of the volumes of interactions generated by different social, economic and demographic groups over different temporal periods. It is further possible, through the availability of common spatial keys such as postcodes, to draw comparisons between classifications derived from alternative spatially referenced datasets. This offers the potential to, firstly, bridge gaps between issues of representation that are inherent in these data, but also to create an enriched view of social and spatial processes based on a broader range of information. We may expect to see correspondence between the clusters derived from loyalty cards and the categories of traditional classifications, for example, which will ultimately provide an enhanced description of what makes certain groups of people distinctive.

The development of these applications will be made possible through the continuing data collaborations facilitated by the CDRC, which provides a means of utilising data of a personal nature, whilst adhering to important disclosure controls. It is critically important that analyses of this nature endeavour to achieve outputs that are both informative and safe, especially where data linkage is concerned. Nevertheless, the prospects of loyalty card data as a social and spatial data source present promising applications for the use of large consumer datasets in social science research.

Further Reading

Cortiñas, M., Elorz, M. and Múgica, J. M. (2008). The use of loyalty-cards databases: Differences in regular price and discount sensitivity in the brand choice decision between card and non-card holders. Journal of Retailing and Consumer Services, 15(1), 52-62.

Longley, P. A. (2017). Geodemographic profiling. In The International Encyclopedia of Geography. Wiley and the American Association of Geographers (AAG).

Loyalive (2015). Loyalive – an introduction. URL no longer available.

Primerano, F., Taylor, M. A., Pitaksringkarn, L. and Tisato, P. (2008). Defining and understanding trip chaining behaviour. Transportation, 35(1), 55-72.

Wright, C. and Sparks, L. (1999). Loyalty saturation in retailing: Exploring the end of retail loyalty cards? International Journal of Retail & Distribution Management, 27(10), 429-440.

YouGov (2013). British shoppers in love with loyalty cards. Online: yougov.co.uk/news/2013/11/07/british-shoppers-love-loyalty-cards/

Acknowledgements

The authors thank ‘High Street Retailer’ for providing transaction data to enable us to carry out this research. The first author’s PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.

3
Retail Areas and their Catchments
Michalis Pavlis and Alex Singleton

3.1
Introduction

Shopping destinations often exist at the core of urban areas, having evolved naturally as centres for trade and exchange, but within the contemporary urban landscape they can also emerge as purpose-created retail opportunities including: regional shopping centres, retail parks, strip malls or focused shopping destinations such as designer outlets. Regardless, retail centres are complex economic systems that constantly evolve, with changing composition and spatial extent that expands or contracts over time due to changes related to their attraction, market potential and competition. By identifying shopping destinations and delineating their spatial extent it is possible to gain a better understanding of the relationship between retail space use and changes in consumer behaviour. Furthermore, such retail area boundaries can be used to implement retail analytics tasks related to store location and demand estimation and to produce systematic metrics of retail centre morphology and performance (Thurstain-Goodwin and Unwin, 2000).

In the case of England and Wales, a national set of town centre boundaries was made available by the Department for Communities and Local Government (DCLG) in 2004, created using kernel density estimation from socio-economic variables including building density, diversity of building use, and tourist attraction. However, these boundaries are outdated (as they were produced in 2004) and more expansive (e.g. by including office space) than those that are related specifically to retail. Apart from the DCLG town centre boundaries, in the UK, each local authority is required by law to produce a retail centre health check that typically requires the delineation of their retail area boundaries. Even though such reports produced by the local authorities are rich in information, they are typically only available in PDF
format, which hampers their use for research purposes.

Given these circumstances the objective of this work was to move away from a general definition of town centre location as a centre for employment, to a more functional measure of space delineated for retail and services. This was accomplished by developing a national-scale dataset of retail agglomerations within the context of retail setting and policies in Great Britain (GB), using a dataset of approximately 530,000 retail unit locations that were provided to the Consumer Data Research Centre (CDRC) by the Local Data Company (LDC). The following methodology was used:

1) The performance of five potentially useful clustering methods was tested for the purpose of identifying agglomerations of retail units. The criteria that were used to select the candidate clustering methods included: their capability of identifying spatial clusters of arbitrary size and shape, the ease of tuning the parameter values, and also their computational complexity.
2) Eight representative areas across GB were selected to test the performance of the clustering methods relative to those retail area boundaries produced by the respective local authorities.
3) Subsequently, the selected clustering method was further refined through calibrating the tuning parameters using existing retail area boundaries, but also by relying on the formal definitions of retail areas in GB to determine some of the parameter values. For example, by definition the smallest retail area in GB should consist of at least 10 retail units, which we used as a parameter of the clustering analysis (Wrigley and Lambiri, 2015).
4) Finally, the obtained clustering solution was validated against an independent dataset of 359 retail area boundaries that were produced by the company Geolytix for the year 2013 and were available as open data.

3.2
Where are the retailers located?

A national occupancy dataset of 529,062 retail locations across GB was provided by the LDC through the CDRC and was collected via a large pool of local surveying teams during 2015. The data contain detailed information about the current occupier and location of retail unit and service premises. While a full postcode was available for all surveyed premises (enabling geocoding proximal to ~13 properties), more precise latitude and longitude coordinates were available for 437,260 units (about 83%), which were retained for further analysis, thus providing accuracy at building level. Other collected information for each location included the fascia (a surrogate for occupier) and the type of retail or service business (i.e. leisure, comparison, service and convenience) including vacant outlets. For retail units located in shopping centres, retail and leisure parks the respective name of the shopping centre or retail park was also provided.

Conceptually, utilising vacant units in the identification of local retail agglomerations may be problematic given that these voids may often occur as a result of failure of a particular retail setting, and as such are an indication of potential change in extent or morphology. For this reason, all vacant units were removed from the dataset. Additional processing also removed units that were classified as auto services, which are not typically considered part of retail agglomerations. Furthermore, miscellaneous units (unclassified or not related to retail) were also excluded. The final cleaning operation identified and removed duplicate locations (i.e. points with identical coordinates or within very close proximity), which can unduly influence clustering results as well as the identification of outliers. These duplicate
locations were typically the result of the two-dimensional representation of retail units within multi-storey buildings. Thus, the removal of duplicates (any points within a 2 metre radius from another point) was carried out.

3.3
Estimating retail centre location and extent: methods and calibration

Cluster analysis is a collection of unsupervised learning methods that address the issue of grouping a set of objects based on similarity. Many commonly used clustering algorithms make group allocations with the objective of increasing similarity within a cluster and increasing dissimilarity between clusters. Other commonly used clustering techniques such as density-based algorithms seek dense regions separated by low density regions, while model-based methods assume that the data come from a mixture of probability distributions, each of which represents a different cluster. Cluster analysis is a multivariate technique (multiple attributes of the phenomenon under investigation can be used), but in this study it is strictly spatial, utilising only the locations of the retail units. This is an appropriate approach for the identification of retail agglomerations where the extent of the clusters is determined by spatial discontinuity in unit distribution (Dearden and Wilson, 2011).

To estimate the definition of retail centres, the following clustering methods were evaluated: DBSCAN (Ester et al., 1996), Quality Threshold (Scharl and Leisch, 2006), Kernel Density Estimation (Azzalini and Torelli, 2007), Random Walk (Csardi and Nepusz, 2006) and K-means (Lloyd, 1982). As will be described, all of the clustering methods evaluated require the calibration of tuning parameters that we selected to optimise using the S_Dbw internal evaluation indicator (Halkidi and Vazirgiannis, 2002). As such, the process of calibrating each clustering method was carried out prior to implementation in the evaluation by identifying suitable starting values (for those tuning parameters for which a single value could not be determined), then producing a number of different models within a range of values and finally selecting the optimal model based on the S_Dbw index.

DBSCAN is probably the most prevalent density-based clustering method, and it requires the specification of two tuning parameters: the radius and the minimum number of nearest neighbours from a focal point. It can identify clusters of arbitrary size and shape, it is computationally efficient and is robust to the presence of outliers. However, the biggest drawback of DBSCAN is its limited ability to handle datasets with varying point densities.

K-means is the most frequently used clustering method and requires the specification of a single parameter which is the number of clusters in the dataset. It has low computational complexity but the disadvantage of producing clusters of convex shape; nevertheless, given its popularity, it was used as a benchmark against the other clustering methods.

The quality threshold method requires specification of two parameters: the maximum diameter of the clusters and the minimum number of neighbours within a cluster. The method has the advantage that its parameters are relevant in the context of identifying retail agglomerations and it is also robust in the presence of outliers. However, given that it is stochastic, it suffers from long running times.

The non-parametric Kernel Density Estimation (KDE) method combines KDE with graph structures and algorithms. It requires the specification of a single tuning parameter, and given that it is non-parametric, it is insensitive to the data distribution. However, similar to the quality threshold method, it is stochastic and suffers from long running times.

The random walk is a graph-based clustering method that requires the specification of a single parameter which is the maximum number of steps required by the algorithm to find the optimal solution. The method relies on the assumption that random short walks tend to stay within the same densely connected area. It can identify clusters of arbitrary size and shape, it is relatively fast but it is difficult to identify the optimal number of steps required by the algorithm.

Figure 3.1
The locations of the eight case study sites in Great Britain.

There is, of course, a plethora of other methods that have been shown to be useful for clustering spatial data; however, an important factor for inclusion in the evaluation was that the methods were accompanied by useful documentation that facilitated their implementation. It was also important that the methods were under active development or well established as well as being available within most programming languages.
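
To make the density-based approach more concrete, the following is a minimal pure-Python sketch of the DBSCAN idea (a radius and a minimum number of neighbours around a focal point), preceded by the removal of points lying within 2 metres of an already retained point. It is not the implementation used in this study: the function names, the synthetic coordinates and the 100 metre radius are assumptions for illustration, and `min_pts=10` merely echoes the ten-unit minimum mentioned earlier. For national-scale data a library implementation would normally be preferred.

```python
from math import dist


def drop_near_duplicates(points, tolerance=2.0):
    """Keep a point only if no already kept point lies within `tolerance` metres."""
    kept = []
    for p in points:
        if all(dist(p, q) > tolerance for q in kept):
            kept.append(p)
    return kept


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for outliers."""
    def neighbours(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally an outlier
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


# Hypothetical retail unit coordinates in metres (projected grid)
units = [(20.0 * i, 0.0) for i in range(12)]  # a 12-unit high street
units.append((20.5, 0.5))                     # same building, another storey
units.append((5000.0, 5000.0))                # an isolated shop

units = drop_near_duplicates(units)
labels = dbscan(units, eps=100.0, min_pts=10)
print(len(units), labels)
```

In this toy case the near-duplicate is dropped, the twelve high-street units form a single cluster and the isolated shop is flagged as an outlier. The brute-force neighbour search is quadratic in the number of points, which is one reason a spatially indexed implementation is preferable in practice.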

3.4
Centre definition and evaluation

The candidate methods were evaluated over eight case study areas that are representative in terms of GB retail location density and size. These included: Abertillery and Cardiff in Wales, Bristol, Clapham Junction, Winchester and Wolverhampton in England, and Glasgow and Inverurie in Scotland (Figure 3.1).

Although there is a larger pool of other representative areas, within these specific locations additional supplementary data were also available for cross validation and included two sources. Firstly, local authorities within GB are required to perform a town centre ‘health check’ (NPPF, 2012), which typically requires them to delineate boundaries for retail centres. The reports are publicly available in PDF format and, given the small number of (qualitative) comparisons that can be made against these sources without extensive re-digitising, the reports were used to
Case study area Retail centre type Preferred method


Abertillery, Wales Small town centre KDE, Random Walk
Bristol, England Large urban area DBSCAN
Cardiff, Wales City centre DBSCAN
Clapham Junction, England Large high street DBSCAN
Glasgow, Scotland Large city centre DBSCAN
Inverurie, Scotland Small high street DBSCAN
Winchester, England Historic town centre DBSCAN
Wolverhampton, England Regional town centre DBSCAN, Random Walk

Table 3.1 assist with input parameter specification while one of the strongest advantages of
Results from the and testing during the calibration process DBSCAN was the identification of outliers.
qualitative comparison of
the clustering methods in described in the previous section. Secondly,
eight locations across boundaries for the 339 largest ‘retail places’ It is clear from the results that DBSCAN
Great Britain. in GB were acquired from Geolytix, and performed well for the case study selection;
although they represent only a subset of however, this method is known to
total retail boundaries, they nevertheless underperform in areas where the density
provide an additional and relatively large is not uniform (Everitt et al, 2011). Such an
sample of independently created retail area issue also becomes apparent when looking
extents suitable for comparison. at the range of the optimal epsilon values
that were used for the selected areas
Table 3.1 presents the overall evaluation (Table 3.2). If a single global epsilon value
results from the qualitative comparison for had been used for all case studies, it would
all of the eight study areas. In most cases, have resulted in suboptimal local results.
the DBSCAN method provided results that As such, we developed a refinement to
were more consistent with those formal the method which involves splitting of
definitions created from the respective the national-scale data into more
local authorities. Importantly, DBSCAN was homogeneous areas for separate treatment;
the most efficient method in terms of with the challenge being that unlike the
computing resources, which is particularly case study evaluations, this required
significant for a national extent study. In automation given that coverage was for
addition, it was easier to identify starting the national extent.
values for the parameters of the method,

Table 3.2
Optimal epsilon values used by DBSCAN in the selected study areas.

Study area           DBSCAN epsilon (metres)
Abertillery          84
Bristol              119
Cardiff              120
Clapham Junction     70
Glasgow              70
Inverurie            120
Winchester           80
Wolverhampton        91
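
The effect of a single global epsilon can be illustrated with a toy example. The sketch below is a minimal, brute-force DBSCAN written for this illustration only; it is not the implementation used in the study. The coordinates are hypothetical: two dense ‘high streets’ separated by a 110 metre gap and one sparser street further away. The `min_pts` value of 3 and the two epsilon values are likewise illustrative assumptions.

```python
from math import hypot

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Brute-force neighbourhoods (each point counts as its own neighbour).
    neighbours = [
        [j for j in range(n)
         if hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n            # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1         # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # border point of this cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbours[j])
    return labels

def summarise(labels):
    """Return (number of clusters, number of noise points)."""
    return len(set(labels) - {-1}), labels.count(-1)

# Hypothetical retail unit coordinates in metres.
street_a = [(50.0 * i, 0.0) for i in range(10)]           # 50 m spacing
street_b = [(560.0 + 50.0 * i, 0.0) for i in range(10)]   # starts 110 m after street_a
street_c = [(5000.0 + 100.0 * i, 0.0) for i in range(10)] # 100 m spacing
points = street_a + street_b + street_c

tight = dbscan(points, eps=70.0, min_pts=3)
loose = dbscan(points, eps=120.0, min_pts=3)
print("eps=70: ", summarise(tight))
print("eps=120:", summarise(loose))
```

With the smaller epsilon the two dense streets are separated but the sparse street is discarded as noise; with the larger epsilon the sparse street is recovered but the two dense streets merge. Neither global value returns the three streets, which is the motivation for treating areas of different density separately.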
3.5
Development and application of a modified DBSCAN method

Addressing the issue of heterogeneous density, a modified approach to DBSCAN was developed by introducing three important concepts:

1) The use of k-nearest neighbours to represent the point data as a sparse graph, which allows the partition of the data into areas of more homogeneous point density.
2) The use of a maximum distance to constrain the points that can be members of a cluster. The maximum distance within a retail context represents the distance within which a location can still be considered well connected to a shopping area on foot. The rationale behind this decision is that distance is an important parameter of retail agglomerations, and that it enhances the sensitivity of the method to gaps and discontinuities.
3) The iterative application of DBSCAN, to select one cluster per iteration based on the condition that the cluster’s density is representative of the global point density.

More specifically, in the first step of the proposed methodology, a sparse graph representation of the spatial dataset is created based on a k-nearest neighbour matrix and the maximum distance constraint. The vertices of the graph are the locations that have at least one neighbour within the specified maximum distance. Next, a Depth First Search algorithm is implemented to decompose the sparse graph to create more homogeneous (in terms of point density and distance between the retail units) subgraphs, under the condition that each subgraph has at least 10 vertices, which is the minimum number of retail units required for an area to be classified as a local centre (Wrigley and Lambiri, 2015), and that each location has at least one neighbour within the maximum distance. The vertices that are not part of any subgraph are removed as outliers. The maximum distance value in this study represents the maximum distance at which a location can still be considered well connected to a shopping area on foot. Different distance values have been suggested as indicators of walking distance, ranging from 300 to 500 metres (NPPF, 2012). Based upon the definition of edge of centre for retail purposes in the UK (DCLG, 2009), the maximum distance value was set equal to 300 metres. Three k values were tested to split the study area into subgraphs: 4, 10 and 15. As would be expected, the lower the k value, the greater the number and the more homogeneous the density of the subgraphs that were produced. On the other hand, using lower k values (between 4 and 10) can result in splitting areas with low point density (mostly chained clusters, i.e. high streets) into different subgraphs. For this reason the k value was set equal to 15.

Given that the spatial extent of each subgraph depends on the connectivity and number of points within an area, each subgraph can represent a town centre, a city centre or even a metropolitan region. DBSCAN, however, assumes that the epsilon value is a representative indicator of the local density. To fulfil that assumption, in the third step of the methodology, DBSCAN is first applied (within each subgraph) in an exploratory approach to identify and select the cluster whose density (as estimated by the local epsilon, i.e. the 95th percentile of the 4-nearest neighbours’ distances) is closest to the overall density.

Figure 3.2
The point data are represented as a sparse graph using a distance-constrained k-NN sparse matrix. DBSCAN is first applied in an exploratory approach. The neighbouring clusters (that share a common edge) with similar point density are selected forming a new study area of homogeneous point density, where DBSCAN is iteratively applied until no cluster can be formed.

Following the selection of a single cluster, all the neighbouring clusters (i.e. the clusters that share a common edge in the graph) with similar density are selected along with those neighbouring points that were identified by the exploratory DBSCAN as outliers. Following this, a new study area of homogeneous point density is created from the selected points and DBSCAN is applied again to identify the clusters. The selected clusters are then removed from the graph representation of the point data, and the process of using an exploratory DBSCAN model to identify a cluster and select those neighbouring clusters with similar point density is iteratively carried out until no cluster can be formed. This process is summarized in Figure 3.2. It should be noted that one of the advantages of the methodology is that it is no longer required to optimise the clustering solution using the S_Dbw index, which results in a faster algorithm.

To evaluate the point density similarity among clusters, the standard deviation of point density in a subgraph was used. More specifically, those neighbouring clusters with point density within 1 standard deviation of the point density of the initially selected cluster were also selected, with the assumption being that they define an area of homogeneous point density. To test the sensitivity
of the method to the standard deviation threshold, five different values were considered: 0.6, 0.8, 1.0, 1.2 and 1.4. As can be seen in Table 3.3, the clustering solutions are practically identical when looking at the number of clusters produced and the distribution of the local epsilon value.

Table 3.3
Summary values of five clustering models with different standard deviation thresholds.

Standard deviation   Number of   Distribution of epsilon values (metres)
threshold            clusters    Minimum   25%   50%   Mean    75%     Maximum
0.6                  2,928       80        80    80    100.3   113.0   170.0
0.8                  2,922       80        80    80    100.3   113.0   170.0
1.0                  2,920       80        80    80    100.3   113.0   170.0
1.2                  2,923       80        80    80    100.1   113.0   170.0
1.4                  2,921       80        80    80    100.1   113.0   170.0

For the parameter values required by DBSCAN, as detailed earlier, the value of the minimum points parameter was set equal to 10 and the epsilon value was calculated as the 95th percentile of the 4-nearest neighbour distance. However, the epsilon value was only allowed to vary between an upper bound of 170 metres, which was found to be useful to exclude outliers from being identified as members of clusters, and a lower bound of 80 metres, which was used to avoid identifying certain large shopping malls as clusters. This necessity is a consequence of the hierarchical nature of retail centres within GB given that the objective of the analysis was to create clusters that were inclusive of the different functional retail forms. Following the application of DBSCAN to each subgraph and the extraction of 2,920 clusters, the final retail agglomerations were compiled and each retail location was assigned an identifying number denoting cluster membership.

The results derived from this new method were compared to data supplied by Geolytix, which represent the only freely available and independently created national sample of contemporary retail centre extents. Geolytix provide frequent updates of a dataset of retail places across the UK, part of which (339 places) was licensed as open data in 2012. The Geolytix boundaries are produced using multiple variables (including the locations of retail units) with information that was collected at least three years prior to the data that were used in our analysis. Additional causes of difference between the two datasets might also include the different objectives and notion of what constitutes a retail centre (Geolytix did not use a threshold of minimum 10 retail units), and only the boundary polygons from the clustered locations of the retail units were available. Given that the creation of similar polygon boundaries for our output may have resulted in an additional source of error, it was decided to compare the Geolytix boundaries against the retail unit locations and associated clusters. The comparison was based on two metrics: the ‘n-ary’ relation between the two datasets, and the proportion of points within the Geolytix polygons. The n-ary relation returns a score where the higher the number of clusters that had a one-to-one relation with the clusters identified by Geolytix, the better the relationship.

Data pre-processing removed the major out-of-town retail parks from the Geolytix dataset, which was followed by a spatial join of the Geolytix dataset with the clustered retailer locations. There were 294 spatial intersections between the two datasets, out of which 244 were one-to-one. Summary values of the spatial distribution of the clustered locations within the Geolytix boundaries are shown in Table 3.4. On average (based on the median value) almost 90% of the clustered points were within the Geolytix boundaries. Glasgow (Figure 3.3) serves as an example where the two datasets mostly overlap, but also shows that the spatial extent of the clusters produced in this analysis was on average larger, which to some extent is related to Geolytix post-processing of boundaries to be constrained by the road network. An example where the two datasets have significant differences is Bristol (Figure 3.4).

Table 3.4
Summary values describing the spatial distribution of the clustered locations within the Geolytix boundaries (%).

Minimum   1st Quartile   Median   Mean    3rd Quartile   Maximum
0.68      63.97          89.81    73.99   95.99          100.00

Figure 3.3
The retail unit locations that are members of the cluster of the city of Glasgow (blue circles), overlaid on the Geolytix retail centre boundaries. Basemap: Stamen/OSM.

Figure 3.4
The retail unit locations that are members of the cluster of the city of Bristol (blue circles), overlaid on the Geolytix retail centre boundaries (only the area around the Broadmead shopping district was available as open data). Basemap: Stamen/OSM.

Looking at Bristol, it can be seen that Geolytix split the city centre into smaller clusters, of which only Broadmead was available as open data. However, the clustering solution for Bristol that was produced in this analysis was very similar to the one produced by the Bristol local authority and, thus, arguably more appropriate based on this local knowledge. Despite these mismatches, which to some extent are related to different objectives and notions of what constitutes a retail centre, it could be argued that the two clustering solutions largely overlap in the areas for which the open Geolytix retail places were available, which provides evidence for the validity of the retail clusters that were produced in this work vis-à-vis competing methods.

3.6
Conclusion

The objective of this analysis was to develop a clustering method that would facilitate the identification of retail agglomerations across a national extent and that could be updated over time. For this purpose, five of the most frequently used clustering methods were compared within eight representative locations across Great Britain. The DBSCAN method was selected on the basis that it provided the most accurate representation of those retail areas relative to formal definitions; it was also faster at producing a clustering solution, and its input parameter values were easier to calibrate.

However, to address a well-known issue that DBSCAN does not cope well in areas of varying densities, the DBSCAN method was adapted so that it could be iteratively applied within smaller, more homogeneous sites that were created using a k-NN sparse graph representation of the retail locations. Each selected retail cluster was created by the DBSCAN algorithm with an epsilon value that was representative of the local point density. The clusters produced were comparable to those retail areas designated by the local authorities for the sample areas of study, and in some cases, were more accurate when compared to the traditional DBSCAN method. In addition, the identified clusters were in most areas similar in terms of spatial extent to those produced by Geolytix using alternative data and methodology. It should be noted that even though the suggested method is
more demanding in terms of computer resources compared to the traditional DBSCAN, it scales better as it could be applied in parallel for each subgraph.

Furthermore, the output of this analysis provides a better spatial coverage and option for automated update in comparison to the existing DCLG town centre boundaries. Given that the DCLG boundaries were widely used by academics, local authorities and private organisations across the country, it can be anticipated that these results will prove to be valuable for research and analysis.

With the developed methodology being open source, it will also be straightforward to update the retail boundaries on a regular basis, and potentially apply the suggested method within a context of historic data. Finally, given the variety in point density, size and shape of the retail clusters in the dataset, it would be reasonable to assume that the methodology could be applicable with different datasets and for different international locations.

Further Reading

Azzalini, A. and Torelli, N. (2007). Clustering via nonparametric density estimation. Statistics and Computing, 17, 71-80.

Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems 1695. Online: http://igraph.org

DCLG (2009). Practice guidance on need, impact and the sequential approach. Online: www.gov.uk/government/uploads/system/uploads/attachment_data/file/7781/towncentresguide.pdf

Dearden, J. and Wilson, A. (2011). A framework for exploring urban retail discontinuities. Geographical Analysis, 43(2), 172-187.

Ester, M., Kriegel, H. P., Sander, J. and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Online: www.aaai.org/Papers/KDD/1996/KDD96-037.pdf

Everitt, B. S., Landau, S., Leese, M. and Stahl, D. (2011). Cluster Analysis. 5th ed. Chichester, Wiley.

Halkidi, M. and Vazirgiannis, M. (2002). Clustering validity assessment using multi-representatives. Online: lpis.csd.auth.gr/setn02/poster_papers/237.pdf

Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28, 128-137.

National Planning Policy Framework (NPPF) (2012). Online: http://planningguidance.communities.gov.uk/blog/policy/achieving-sustainable-development/annex-2-glossary/

Scharl, T. and Leisch, F. (2006). The stochastic QT-clust algorithm: Evaluation of stability and variance on time-course microarray data. Online: http://www.ci.tuwien.ac.at/papers/Scharl+Leisch-2006.pdf

Thurstain-Goodwin, M. and Unwin, D. (2000). Defining and delineating the central areas of towns for statistical monitoring using continuous surface representations. Transactions in GIS, 4(4), 305-317.

Wrigley, N. and Lambiri, D. (2015). British high streets: From crisis to recovery? A comprehensive review of the evidence. Online: http://thegreatbritishhighstreet.co.uk/pdf/GBHS-British-High-Streets-Crisis-to-Recovery.pdf

Acknowledgements

The authors would like to thank Local Data Company Ltd for providing retail unit data for this research.
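
As a companion to Section 3.5, the first steps of the modified method (a distance-constrained k-NN graph, decomposition into subgraphs by depth-first search, and a bounded local epsilon) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors’ code: the neighbour search is brute force, the coordinates are hypothetical, and the ‘95th percentile of the 4-nearest neighbour distance’ is interpreted here as the nearest-rank 95th percentile of each point’s distance to its 4th nearest neighbour. The values k = 15, 300 metres, 10 vertices and the 80 to 170 metre bounds are taken from the text.

```python
from math import hypot

def knn_graph(points, k=15, max_dist=300.0):
    """Undirected adjacency sets: i and j are linked if one is among the
    other's k nearest neighbours and they are at most max_dist apart."""
    adj = {i: set() for i in range(len(points))}
    for i, (xi, yi) in enumerate(points):
        ranked = sorted(
            (hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        for dist, j in ranked[:k]:
            if dist <= max_dist:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def subgraphs(adj, min_vertices=10):
    """Connected components found by iterative depth-first search.
    Components with fewer than min_vertices vertices are dropped as outliers."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(component) >= min_vertices:
            components.append(sorted(component))
    return components

def local_epsilon(points, members, k=4, pct=95, lo=80.0, hi=170.0):
    """Nearest-rank pct-th percentile of the distance to the k-th nearest
    neighbour within the subgraph, clamped to the range [lo, hi]."""
    kth = []
    for i in members:
        dists = sorted(
            hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            for j in members if j != i
        )
        kth.append(dists[k - 1])
    kth.sort()
    rank = max(1, -(-pct * len(kth) // 100))   # integer ceiling of pct% of n
    return min(hi, max(lo, kth[rank - 1]))

# Hypothetical coordinates in metres: a dense street, a sparser street
# and three stray units that should be discarded as outliers.
street_a = [(30.0 * i, 0.0) for i in range(40)]
street_b = [(3000.0 + 50.0 * i, 0.0) for i in range(40)]
strays = [(10000.0, 0.0), (10050.0, 0.0), (20000.0, 0.0)]
points = street_a + street_b + strays

parts = subgraphs(knn_graph(points))
eps = [local_epsilon(points, members) for members in parts]
print(len(parts), "subgraphs with local epsilon values", eps)
```

Each subgraph receives its own epsilon, so the denser street is clustered with a smaller search radius than the sparser one. The later steps of the method (exploratory DBSCAN, selection of neighbouring clusters of similar density and iteration) are not covered by this sketch.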
4
Given and Family Names as Global Spatial Data Infrastructure
Oliver O’Brien and Paul Longley

4.1
Introduction

This chapter outlines an ongoing Consumer Data Research Centre (CDRC) project that has produced a large (c. two billion record) global database of people’s names, together with the approximate locations of their bearers. Although in part a ‘hobby’ project, the work is being reconfigured into demographic profiles of people over a full range of scales. We discuss the value and provenance of different data sources, data extraction techniques and the tools used to assemble a truly global database. We also present illustrative results from the resulting dataset, at the global scale as well as for selected countries. Case studies are used to examine some individual countries, where simple analysis of publicly available data samples can reveal internal migration patterns and sub-national variations in the popularity of both given and family names.

In important respects a name is at the same time the purest and most widely used form of personal data. It is a characteristic of a person that is typically assigned at birth and often not changed or adulterated during the bearer’s lifespan except for reasons of marriage. Names data are shared between individuals for many reasons core to social organisation, and sharing often follows established social patterns. Names are ubiquitous across cultures and throughout the world – nearly everyone has a name and it is rarely if ever assigned at random. A person’s full (given/fore- plus family/sur-) name may thus provide a direct indicator of its bearer’s gender, ethnicity and religion. Changing fashions in given naming practices often render given names a reliable indicator of age.

Additionally, most names have geographic origins, some of which may be very specific and localised. The naming practices of most societies provide for inheritance of family names, and thus comparison of
present residence with historic name origins makes it possible to trace the probable migration histories of many individuals and their blood lines. It is even possible to establish links between long settled populations and their genetic make-up (www.peopleofthebritishisles.org). The accuracy with which this may be done does of course depend upon the event histories of the individuals whose names are inherited through the generations, and historically these have been overwhelmingly male – yet local marriage practices throughout history do in practice retain this signal through the generations.

Finally, the fact of geographic concentrations of clusters of surnames means that names may be associated indirectly with the economic fortunes of the areas in which their bearers live. Inheritance, along with the intergenerational inheritances of human and social capital, may also mean that a name provides clues to economic and social standing. This tendency may be reinforced if there are social dimensions to given naming practices within broader cultural, ethnic and linguistic groups.

Linking given names and family names together, and clustering, produces groups of names typically sharing demographic traits. Studying such groups, with appropriate control data, allows the geographer to assign profiles which then can be used with similar names to infer similar demographic characteristics.

The clustering process is dependent on having a large and geographically representative pool of given/family name pairs. This chapter describes an ongoing CDRC project that is gathering many name pairs, across the world, for such clustering purposes, as well as for a similar probabilistic classification process based on the fact that many names have strong country (and indeed intra-country) geographic profiles that are retained to this day, even in an era of increasing global mobility.

4.2
The Worldnames 2 project

The Worldnames 2 project is an outcome of a series of research grants and private initiatives stretching back to 2003 (see www.onomap.org), and attempts to collect and assemble as many name pairs as possible until global coverage is achieved. The names of resident individuals are associated with the country (and often region or locality) in which they reside. The results can be mapped in order to portray the geographic distribution of names at a range of scales, and can be analysed in order to link names and groups of names to places. It follows on from the Worldnames 1 project, which presents a website showing name distributions in around 25 mainly western countries, by expanding to cover almost every country of the world. A secondary aim is to allow public dissemination of the enlarged dataset via a more modern website than the existing Worldnames 1 platform (which is around 10 years old). The data are mainly collected from freely available governmental, administrative and other public datasets from the countries concerned – although some of the data have been obtained under licence with attendant restrictions upon reuse. The project is in an ongoing phase and this chapter describes the work to date, as well as outlining some early results and presenting a number of country-specific case studies.

4.3
Objectives, execution principles and simplifications

The general objective of the ongoing project is to obtain as many name pairs as possible, on a country-by-country basis, that can be deemed to represent the established population of the country. In some senses the objective is to create a demographic framework for the world, in terms of the personal attributes that can be inferred from naming conventions at national, regional and local levels. We are in the process of creating a website that will
enable users to examine these traits of any name that is part of our dataset. Users will also be able to interrogate the data in order to identify lists of the most prevalent surnames by country or intra-country distributions of popular single-country-origin names.

As such, the project is clearly a Big Data project that is vulnerable to the vagaries of Big Data sources that are discussed at various points in this volume. Some of the data sources are acquired under licence, with restrictions upon how they may be redistributed, particularly those that formed part of the original Worldnames 1 project. The data for the countries that have formed the focus of our renewed attention on the project are openly available on the web, without needing a login or subscription to access, and from the original source, rather than from other consolidator sites with related foci of interest. We also exclude datasets whose custodians did not, in our opinion, intend the data to be made available for wide public use, albeit in aggregate form. These judgments are inevitably subjective and our intention is to avoid any legal infringements arising from reuse of data for new purposes. In particular, we have avoided using data published by third parties without the consent of the original owner to this end. From this standpoint, newspaper republishing of time-restricted electoral lists would be considered to be valid but database dumps, obtained as a result of a breach of security or insider leaks, would not be used. This distinction is not always clear cut, and requires decisions to be made on a case-by-case basis – for example, data obtained from the WikiLeaks service and similar investigative journalism projects, or those where the original source and authority to publish is unclear.

To simplify processing a vast array of diverse datasets, a number of simplifications are applied. The western-style naming convention of a given (assigned) first name followed by a (typically hereditary) family name is assumed, with other name structures (e.g. Spanish double surnames) simplified. The ‘Western Latin’ alphabet is used, with accents and capitalisations removed, and the only non-alphabetic characters allowed are apostrophes and dashes – these being combined anyway with pure-alphabetic variants. This is necessary to accommodate the inconsistent ways in which names are stored in the official records that are typically used in the project. For example, MacDonald can appear, in different datasets across different countries, as Mac Donald, MACDONALD, Mac.Donald and Mac-Donald. Other non-alphabetic characters, such as spaces and underscores, are replaced or removed as judged appropriate.

It is acknowledged that, with a project of this scale, using hundreds of diverse datasets, such simplifications will potentially obscure helpful demographic information; this is minimized where practical. We have retained the names in the original forms captured, however, in order to allow the incorporation of other accents in future spin out projects from the research.

4.4
Data acquisition and processing methodology

4.4.1
Search-based initial discovery

To ensure that a reasonable level of quality and a high geographical and demographic representation are maintained for each country, we carry out the data collection manually, rather than creating a ‘bot’ or ‘spider’ to crawl the web automatically. This also presents opportunities to discover additional unindexed datasets with intelligent URL modification by the investigator.

This means that, for each of the ~200 countries of the world, a different collection process is employed, built up by starting from a set of common principles detailed below, but then refined as name data are discovered from the current active country.
Figure 4.1
Map of St Lucia. St Lucia is approximately 40km in length (north to south) and 20km in width (east to west).

Case Study 1: Saint Lucia

Saint Lucia is an island nation in the Caribbean with a population of approximately 186,000 people (Figure 4.1). Our names data come from the polling lists published by the Saint Lucia Electoral Department at www.electoral.gov.lc/polling-list. Saint Lucia’s top-level administrative areas are known as quarters; the constituencies are based on these quarters but with a number split or merged, to make 17 in total. These are then each further split into between 3 and 9 polling divisions, or precincts. The electoral data for each of the 84 precincts are listed on the website as paged tables, with a POST query needed to access each page.

The data listed on the tables include the given name, family name, street name, constituency, unique registration number, precinct and gender.

A Python script was used to send POST queries and download the HTML tables, and extract the data from them using regex into a CSV file. The data are believed to be relatively up-to-date, as the most recent election was in 2016. 162,025 records were extracted in this way, at first glance matching well with official summary information for the 2016 election suggesting that there were 161,883 registered voters. However, a large number of duplicated records were found – where the same record would appear on multiple pages in a table for a single precinct. On de-duplicating, 129,685 records remained, representing around 70% of the 2016 UN estimate of 186,383 people. The reasons for this large discrepancy are not clear, but, if the official figures are at fault, it would go some way to explaining the apparently low turnout of 57% and the fact that the numbers of reported registered voters have increased at a much larger rate than the population in general since 2000.

70% of the 2016 estimate is a plausible percentage, as electoral lists are not population lists – they typically exclude young people and foreign residents. Voters for a general election in Saint Lucia must be at least 18 years of age and either a Saint Lucian or a Commonwealth citizen who has resided in Saint Lucia for at least seven years.

Because of the readily available geospatial boundary data for Saint Lucia’s quarters, and the relatively small population of the nation as a whole, it was decided to sub-divide the population by a single level – quarters, but merged where the constituencies go across quarters, for the Worldnames 2 project – rather than further
dividing into constituencies or precincts, the former of which are not sufficiently different from the merged quarters and the latter of which have some very small populations. The constituencies and precincts also do not have readily available boundaries.

This results in nine geographical areas, with the electoral population varying from 4,962 to 45,980.

The data were considered to be relatively ‘clean’, because of the direct extraction from HTML tables, rather than a structure-recreation approach that is often needed when extracting from PDFs.
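
The extraction and de-duplication steps described in this case study can be sketched as below. This is an illustration only, not the script used in the project: the table markup, the column order and the sample rows are invented for the example, the POST requests are omitted so that the code runs offline, and de-duplication on the registration number is an assumption about how repeated rows were identified. The record counts at the end are the figures quoted in the text.

```python
import csv
import io
import re

# Column order is an assumption; the text lists these fields but not their layout.
FIELDS = ["given_name", "family_name", "street", "constituency",
          "registration_no", "precinct", "gender"]

ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.S | re.I)
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.S | re.I)

def parse_rows(html):
    """Yield a dict for every table row with exactly len(FIELDS) <td> cells."""
    for row in ROW_RE.findall(html):
        cells = [re.sub(r"<[^>]+>", "", c).strip() for c in CELL_RE.findall(row)]
        if len(cells) == len(FIELDS):
            yield dict(zip(FIELDS, cells))

def deduplicate(records, key="registration_no"):
    """Keep the first record seen for each registration number."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    return unique

def to_csv(records):
    """Serialise the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Two invented result pages for one precinct; the second repeats a row.
page_1 = (
    "<table><tr><th>Given</th><th>Family</th></tr>"
    "<tr><td>Ann</td><td>Joseph</td><td>Main St</td><td>Central</td>"
    "<td>A001</td><td>P1</td><td>F</td></tr>"
    "<tr><td>Ben</td><td>Charles</td><td>Hill Rd</td><td>Central</td>"
    "<td>A002</td><td>P1</td><td>M</td></tr></table>"
)
page_2 = (
    "<table><tr><td>Ben</td><td>Charles</td><td>Hill Rd</td><td>Central</td>"
    "<td>A002</td><td>P1</td><td>M</td></tr>"
    "<tr><td>Cara</td><td>Henry</td><td>Bay St</td><td>Central</td>"
    "<td>A003</td><td>P1</td><td>F</td></tr></table>"
)

records = [rec for page in (page_1, page_2) for rec in parse_rows(page)]
unique = deduplicate(records)
print(to_csv(unique))

# Figures quoted in the text: records before and after de-duplication,
# and the share of the 2016 UN population estimate that remains.
duplicates_removed = 162_025 - 129_685
coverage = 129_685 / 186_383
print(duplicates_removed, f"{coverage:.1%}")
```

Regular expressions are workable here only because the tables are simple and regular; for less predictable markup an HTML parser would be the safer choice.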

As is characteristic for many Caribbean countries, Western given names have come into usage as family names. In Saint Lucia’s case, the most common family names show this trait: Joseph (4.6% – the most popular), Charles (2.1%), James, Alexander and Henry. There are 10,074 distinct family names. 7,119 of these are most common in Saint Lucia, compared to other countries for which there is currently data in the Worldnames 2 database. Hippolyte and Mathurin are two popular family names in Saint Lucia that are relatively unusual elsewhere.

Figure 4.2
Distribution of Charlemagne surname in St Lucia, with darker shading indicating higher proportions of the population bearing this surname.

Looking at the names data split by the merged quarters, a number of popular family names have strong variations through the nation. For example, Charlemagne is more popular on the western side of the island than the eastern side (Figure 4.2). The 100 people with this family name on the polling list for Choiseul quarter represent a frequency of 1.7%, while the 11 in Dennery quarter equate to a frequency of just 0.12%. However, as these are small numbers in total, this may just be a statistical coincidence.
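
The per-area frequencies quoted above are simple shares of each area’s records. A minimal sketch of that calculation follows; the function name, the area labels and the record counts are invented for illustration and are not the Saint Lucia data.

```python
from collections import Counter

def surname_share(records, surname):
    """Return {area: share of that area's records bearing `surname`}."""
    totals = Counter(area for area, _ in records)
    hits = Counter(area for area, name in records if name == surname)
    return {area: hits[area] / totals[area] for area in totals}

# Invented (area, family name) records for two areas.
records = (
    [("West", "Charlemagne")] * 3 + [("West", "Joseph")] * 97
    + [("East", "Charlemagne")] * 1 + [("East", "Joseph")] * 199
)
shares = surname_share(records, "Charlemagne")
for area, share in shares.items():
    print(f"{area}: {share:.2%}")
```

With counts as small as those in the case study, a handful of records can move a share noticeably, which is why the text treats the east-west contrast with caution.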
58 CONSUMER DATA RESEARCH: PART ONE

A small amount of logic, however, can be shared across countries. A number of countries in West Africa, for example, use the same off-the-shelf portal software for their government or public service websites, and so file locations discovered during the search for data for one country can then be reapplied to additional countries using the same software.

The initial stage is to perform a simple search, typically using Google Search, for ‘obvious’ open-access lists of the greater part of a country’s population. These may take the form of public versions of electoral rolls or civil registry lists. These may be posted on a country’s official portal, but are also occasionally republished by citizens on private websites; for example, the Chilean electoral roll can be freely obtained on disk and is implicitly republished privately, but is not itself published online by the electoral authorities.

If a comprehensive list is not obtained, then it is necessary to search using distinctive surnames and then full names for any given country. It is necessary to ‘seed’ the search with certain key names which are popular in the country in question, but ideally not in neighbouring countries or other jurisdictions. To do this, reliable pan-national websites containing lists of famous people from a country, for example from Wikipedia, were used, as well as government and pan-national database-driven websites of elected government ministers or national election candidates – one type of data that are nearly universally published and that a number of separate projects, such as IFES Election Guide and International IDEA, are aiming to catalogue and maintain. The latter project also contains some information about the availability of online electoral rolls for each country and its approach to open data, including direct links to them where available.

As a rule of thumb in our research we deemed that at least three distinctive ‘seed’ surnames were required for any country. Google Search queries were then carried out using these surnames in conjunction with various combinations of numerical, country filter, data format and/or list keywords. Top-level country filters on Google Search were used, along with gov.xx and edu.xx second level domain filters (xx here being the country’s top level domain). These filters restricted results to subdomains of the domains specified and, due to the nature of Google Search’s indexing, this more specific search often revealed additional results of interest. Format keywords, for example ‘pdf’, ‘xls’ or ‘xlsx’, can narrow large numbers of results to those likely to be in the form of a downloadable, processable list. The CSV format is probably the simplest and easiest list format to parse but is little used by non-technology focused websites. A small number of useful sources were found in the more modern JSON file format. Adding sequential numbers, e.g. 1345 1346 1347 1348, can reveal results that are both list-focused and likely to contain (in this case) well over a thousand names. Finally, the inclusion of other key words in the search for distinctive document classes can also be useful – as with ‘cedula’ (national identification document) for Spanish-speaking countries. Other more generic words were also useful in our searches, e.g. ‘first name’, ‘given name’, ‘forename’, ‘last name’, ‘family name’, ‘surname’, ‘ID number’, or ‘candidate’. Translating these into different languages (typically using Google Translate) was also useful.

Somewhat counterintuitively, increasing the numbers of names in a query led to more search results being returned. This is another example, as mentioned above, of the heuristics applied in Google Search and other search engines, and the ways in which key words for websites are stored in the internal indices of the search engines.

Where surnames alone did not reveal useful datasets, use of distinctive individual full
names was useful, since this focused searches on websites with a record for that person amongst a larger list of bearers of less distinctive or unusual names, or with a search function or directory index (as discussed below) that revealed additional data files. As a general point, we avoided searches using famous names since these directed focus away from general population names.

It was often the case that, once identified, a relevant dataset offered coverage of only a (regional or sectoral) part of the population. In such instances the issue of coverage was addressed by amending the relevant part of the URL until full population coverage had been achieved – for example after using every government regional identifier within a given country. In some circumstances this was achieved by identifying the parent directory of a relevant dataset. We became used to anticipating abbreviations and invoking other trial and error procedures in this process.

The Internet Archive was also useful for retrieving files that no longer existed at their original locations but which were still revealed through stale indices in search engines and related weblinks. Additionally, websites occasionally change their domain names but fail to provide forwarding links from the old domain. Sometimes, missing files referred to from a Google search or external web links can be retrieved simply by modifying the domain name to the new one.

As well as Google, specialized search engines, such as Docs Engine, are useful. In particular, we found Docs Engine good at revealing lists of PDFs, Microsoft Excel (XLS) and Microsoft Word files which are not readily found with Google Search – particularly when searching for a single non-celebrity full name.

Where the information was available as numerous web pages rather than as documents, simple Python scripting was used to automate retrieval of large numbers of webpages, supplying appropriate GET/POST parameters on a consistent, sequential or known list basis, and to carry out simple processing and name extraction on the resulting HTML files. Where webpages listed a large number of documents, a bulk downloader browser extension, ‘uSelect iDownload’, was used.

4.4.2 Targets

A minimum target number of names to be harvested was calculated for each country. In important respects, Web-based research often remains unreconciled with the established scientific apparatus of sampling and inference, in that:

1) Sample frames (in our case normally resident populations) are at best imperfectly defined. For example, not every country in the world has reliable and accurate procedures for measuring and monitoring population size.
2) There are multiple definitions of the eligible populations of any country. In our research, there is inherent ambiguity in the definition of ‘eligible’, and in practice we fall back upon identifying the full names of as many of the ‘ordinarily resident’ population as possible. The conception and measurement of ‘ordinarily resident’ nevertheless varies between jurisdictions. For example, the UK uses different definitions of ‘ordinarily resident’ (for access to health services and formerly for tax purposes), ‘indefinite leave to remain’ and ‘right of abode’.
3) Closest approximations to ‘ordinarily resident’ populations may systematically exclude some groups that have distinctive naming practices, for example lists of eligible voters will exclude many recent migrants and others that have not yet attained citizenship, voting age or other requirements for voting. In some
countries, notably in the Middle East, migrant populations may be large relative to those that are longer settled and that may consider themselves the true ‘ordinarily resident’ population.
4) The non-availability of population registers of any kind will often necessitate the use of available substitute sources, for example particular age cohorts or occupational groups. In each case, we tried to use sources that were unlikely to introduce bias into the sample number and frequency of names harvested.
5) The sample size required depends upon the heterogeneity of the phenomenon (in our case, given and family names) that is being recorded. Populations with heterogeneous characteristics require larger samples, and the diversity of forenames relative to surnames may itself vary within a country. Diversity of forenames may also vary between age cohorts within a country in line with other secular trends. The sample fraction required between countries will thus vary, subject also to a minimum sample size.
6) For all of these reasons, our source data are unlikely to represent a purely random sample of the ordinarily resident population. In the absence of any reliable population sources, it is not possible to reweight names by probability of selection when synthesising the complete population.
7) The jurisdictional partitioning of some of the world is in flux, and for this reason it may be necessary to use data that do not pertain to current geographic boundaries.

In practice our data harvesting criteria were to identify:
• 10% of 2016 World Bank population estimates for small countries/areas (less than 10,000 population)
• 1,000 people, for medium countries/areas (10,000-1 million)
• 0.1%, for large countries/areas (1 million+)

Case Study 2: Somalia

Figure 4.3
Map of Somalia.

There are no openly accessible names lists for Somalia (Figure 4.3) that provide widespread coverage of the country – not surprising in a country without an effective government controlling much of its territory. An additional problem, from the perspective of the Worldnames 2 project, is that documents are generally published in Somali, the native language, although this uses the Latin alphabet and so is relatively easily translated.

The best data source that was available was a number of PDF downloads from the Ministry of Education for the Federal Government of Somalia, at moesomalia.net/english/ (although the website has been updated since the data was extracted and so it is no longer available for direct download). The data files are lists of students that have received the national secondary school leaving certificate. The data include the full name of each pupil (with given and family names not split out), the mother’s name, year of birth, roll number and certificate number (neither unique to the country), academic year, score and issue date. Somalian names do not tend to follow the Western structure where family names are passed down each generation; instead the family name of a child is typically the given name of the father.

This is likely to be a highly selective sample, for example, completing secondary school may not be possible in large parts of the country due to safety concerns, there may be a tradition of only boys or only girls attending school in some areas, and so on.

The data sources, combined, contain 7,834 name pairs, representing just 0.07% of the population based on the 2015 UN estimate of around 11 million. Combined with the likely demographic biases discussed above, this means that the Somalia dataset will likely give a poor profile of given name and family name distributions in the country, and as such is included simply because of the desire to have as many countries as possible represented in the database – poor data being better than no data at all.

The data was extracted by using Tabula to detect the table structure and data fields in the PDFs and output as a CSV. The data was then combined in Excel, with the column ordering lined up. The first word and last word of the name were interpreted to be the given name and family name respectively, a necessary simplification to homogenise name structures globally for the project. No sub-national information (e.g. school name) was available to allow a first-level administrative geography to be developed; in any case the sample size is too small to allow for meaningful geographical division.

‘C/’ is a popular prefix for Somali names. It is short for Cabdi, and is regarded as a prefix rather than a true first name, so was not used as a first name.

Mohamed was found to be the most common last name (i.e. family name for the purposes of the project), with 8.5% of names, followed by Ali (6.5%) and Ahmed (5.4%), with 901 distinct names out of the 7,834 in total. Mohamed was also the most common given name (8.4%) followed by Ahmed (4.3%) and Abdirahman (4.2%). The tradition of the father’s given name becoming the family name of the child means that it would be expected that the top lists for family and given names would be similar.

Alternative spellings of Mohamed are also popular, with Mohmed, Mohamud, Maxamed, Mohamoud, Mohmud, Mohmoud, Mahad and Maxamuud all appearing in the top 60 most popular family names. Maxamed is the direct Somali spelling of Mohamed.

Mohamed Mohamed is the most popular name pair in the list, with 101 occurrences within the list of 7,834. There were 5,375 distinct name pairs.
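
The first-word/last-word convention described in the Somalia case study can be sketched in a few lines of Python. This is a minimal illustration only, not the project's actual pipeline (which the chapter says relied on Tabula, Excel and SQL queries). The function names `split_name` and `top_shares`, the sample records, and the assumption that ‘C/’ appears as a separate whitespace-delimited token are all hypothetical.

```python
from collections import Counter

# Tokens treated as prefixes rather than names.
# Assumption: the prefix appears as its own token, e.g. "C/ Ahmed Nur Ali".
PREFIXES = {"C/"}


def split_name(full_name):
    """Return (given, family) from the first and last word of a full name.

    Middle words are ignored. Returns None when fewer than two usable
    words remain after dropping prefix tokens.
    """
    words = [w for w in full_name.split() if w not in PREFIXES]
    if len(words) < 2:
        return None
    return words[0], words[-1]


def top_shares(names, n=3):
    """Return the n most common names with their share of the total in %."""
    counts = Counter(names)
    total = sum(counts.values())
    return [(name, round(100 * c / total, 1)) for name, c in counts.most_common(n)]


# Hypothetical sample records, not taken from the Somalia dataset.
records = [
    "Mohamed Ali Mohamed",
    "C/ Ahmed Nur Ali",
    "Abdirahman Hassan Mohamed",
    "Mohamed  Ahmed",  # repeated spaces are collapsed by str.split()
    "Maxamed",         # a single word cannot be split into a pair
]

pairs = [p for p in (split_name(r) for r in records) if p is not None]
given_names = [given for given, _ in pairs]
family_names = [family for _, family in pairs]
```

Calling `top_shares(family_names)` on a real list would give the kind of ranked percentages quoted in the case study. Spelling variants such as Mohamed and Maxamed are counted separately here, as they appear to be in the chapter.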
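
The harvesting criteria listed in Section 4.4.2 amount to a piecewise rule on population size, which can be sketched as follows. This is an illustration rather than the authors' code: the function names `target_names` and `coverage_percent` are hypothetical, and the Somalia figures are those quoted in Case Study 2.

```python
def target_names(population):
    """Minimum number of names to harvest for a country or area."""
    if population < 10_000:
        return round(0.10 * population)   # small: 10% of the population
    if population < 1_000_000:
        return 1_000                      # medium: a flat 1,000 people
    return round(0.001 * population)      # large: 0.1% of the population


def coverage_percent(names_collected, population):
    """Share of the population represented by the harvested names, in %."""
    return 100 * names_collected / population


# Somalia (Case Study 2): 7,834 name pairs, population of around 11 million.
somalia_target = target_names(11_000_000)
somalia_coverage = coverage_percent(7_834, 11_000_000)
somalia_meets_target = 7_834 >= somalia_target
```

The three rules agree at their boundaries (10% of 10,000 and 0.1% of 1 million are both 1,000), so the choice of which band a boundary value falls into does not change the result. For Somalia the rule gives a target of 11,000 names, so the 7,834 pairs collected (about 0.07%) fall short of it, which is consistent with the caveats in the case study.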

The targets are based on the most recent available estimated population of the country or sub-country administrative area. This is generally sourced through simple web searches for the current population. The accuracy of the resulting data is not critical, as the target thresholds are themselves very approximate. The data generally come from the Indicators section of the World Bank’s Data platform. Our experience suggested that smaller countries are generally more open about publishing lists of people’s names – possibly because they are inherently more open, but also possibly because their governments have not created elaborate data infrastructures. However, such smaller populations, even when considered in aggregate, are less useful for this project, as clustering and demographic prediction is only effective for high data volumes. Smaller datasets are also more prone to individual errors/omissions biasing the final result, so it is more important to have a greater proportion of the population covered. For this reason, the higher threshold targets were necessary for such smaller countries. Conversely, for very large countries, even a relatively small sample will likely be representative for the country, at least for more common names.

A time/effort limitation guideline was also adopted, over and above the number target outlined above, to protect against unnecessary effort and diminishing returns. Many countries have vast numbers of datasets including people’s names, of vastly varying quality and quantity. By contrast, other countries appear to have virtually no useable datasets on the web that contain useful ranges of first names and last names. In both cases, to provide an appropriate balance between effort and reward, a maximum of one person-day was employed to discover the datasets.

4.4.3 Data sources

At the time of writing this chapter (September 2017), around 450 distinct sources have been used across approximately 170 countries. A single source has been used for most countries but this has not always been possible, for example Pakistan’s very large population and lack of a comprehensive single source openly available on the web necessitated use of 24 sources in order to achieve the 0.1% threshold, both across the country as a whole and at Level 1 (Province) and Level 2 (Division) scales.

Where multiple sources were used, care was taken to try and avoid counting the same person twice, by looking for significant overlaps of names across sources. Where individual sources represent a very small proportion of the population, duplication concerns were, however, often disregarded, as they were expected to have only a minimal effect on the quality of the overall information about name distribution in the country.

As mentioned above, data sources generally need to be openly available on the web, or purchased for this project or its predecessor. We have only rarely used names derived from social media directories (by other projects) and have strictly avoided using data from commercial but otherwise similar pan-national projects such as Forebears (forebears.co.uk), Ancestry (www.ancestry.com) or Linked-In (www.linkedin.com).

Frequently used sources include:
• Electoral rolls (also known as voter lists or voter registers)
• Landline telephone directories (white pages)
• Government fund qualification records (e.g. rural hardship)
• School/national examination results/candidates
• University matriculation/graduation/admission lists
• Professional practice licences (doctors, lawyers, engineers)
• Candidates for elections (local/parliamentary)
• Official statistics from national statistics agencies (e.g. Census summaries)
• Government transparency employee and contractor pay lists
• Business owners (e.g. local service providers, tax registers)
• National service callup lists (jury service, military)

Less frequently used, but still useful sources, particularly for countries with a limited web presence or a culture of significantly restricting personal data publication:
• Government lottery/scheme winners (e.g. university laptops)
• Government honours lists/award winners
• Local government employee directories
• Public meeting minutes
• Private club member lists

Other less frequently used datasets, which potentially can come from super-national data sources:
• Player league tables (e.g. chess rankings)
• Match lineup lists of international footballers and other athletes
• Social network names (used by and supplied by other projects)
• Academic paper data releases and books
• Private insurance subscribers

The project does not republish full individual records (i.e. no disclosure of both the first name and last name of a single record) but anticipates that the publication of ‘most popular full name’ by region will be appropriate and possible. It will not publish personally identifying information (PII), at any level, by aggregating appropriately to ensure that the statistics published are not about a single person.

Sensitive personal information (SPI) datasets – for example medical records, or full CVs – are not collected. During the data collection process, such information, surprisingly, is encountered on the open web from time to time, but is discarded.

4.4.4 Data processing and georeferencing

For each country, files were processed once sufficient names had been retrieved, and then entered into a number of database tables – individual name pairs, aggregated tables by area for given names and family names separately, and general statistical tables.

The Tabula (tabula.technology) open source software was used to efficiently extract tables and lists of names from PDFs. Microsoft Excel, TextWrangler and a standardised set of SQL queries were also heavily used for names lists, particularly to extract the given and family name components from full names, strip initials, convert accents to an unaccented approximation, standardise apostrophes, remove/convert spaces, remove certain prefixes and suffixes (e.g. Most, Dr) and normalise the way the additional demographic information was specified across multiple datasets. For example, certain data sources omitted leading zeros for national IDs while others maintained them.

More sophisticated data cleaning was occasionally required. For example, a number of data sources were mis-encoded, or double encoded in error, requiring careful manual decoding. Transliteration websites were used, mainly for Cyrillic-alphabet to Latin-alphabet conversion, which is relatively straightforward.

Geodata for Worldnames 2, for displaying name results in an online map on the project’s website and helping understand geographic patterns of names, was mainly sourced from the GADM (www.gadm.org)
project. Some more recent (or legacy, where the names data were for jurisdictions that have recently changed) administrative boundaries were obtained from other projects – Natural Earth Data (www.naturalearthdata.com), OpenStreetMap data (osm.org) from Geofabrik Downloads (download.geofabrik.de) and the now discontinued MapZen Borders service (mapzen.com/data/borders), and various country-specific projects, often available through GitHub (github.com) repositories. QGIS (www.qgis.org) was used to process and organise the metadata associated with the geodata.

This included the creation and population of a globally consistent ID for all countries and the first and second level subdivisions, where available and used for certain countries. While other projects maintain such a list (e.g. HASC codes and ISO 3166-2 codes), our own system was used for maximum flexibility, particularly as occasionally customized topologies and aggregations had to be employed, depending on the name data available. The system is based on the ISO 3166-1 country codes, an administrative level number and a padded integer code for the unit. Occasionally, the country’s official codes were adopted for the latter part, where these were integer based.

For the world map of countries, a GeoJSON-format dataset from Natural Earth Data was used. Where countries had sub-country name data, the MapShaper website service was used to simplify the topology of the geodata, and one TopoJSON-format data file for up to two levels of sub-country area borders was created using it. TopoJSON (github.com/topojson/topojson) is a modern, flexible and highly compact file format.

Case Study 3: Nepal

Figure 4.4
Map of Nepal. Nepal is approximately 800km long (east-west) and 200km wide.

Figure 4.5
Distribution of Yadav family name in Nepal, with darker shading indicating higher proportions of the population bearing this surname.

Figure 4.6
Distribution of Tamang family name in Nepal, with darker shading indicating higher proportions of the population bearing this surname.

The Nepal data were built up from Police Clearance Report (PCR) Certificate Lists which are criminal record check application results, published in English by the Nepal Police as a PDF every few days. A complete list of nearly 1000 of these documents, published in the last three years, can be found at cid.nepalpolice.gov.np/index.php/pcr-certificate-list

By downloading the PDFs, converting them to CSVs using Tabula and combining them, 517,946 records can be retrieved, representing approximately 1.6% of the country’s current population. Full names are listed, with a double space sometimes, but not always, separating the given name from the family name. Middle names are often present, but were disregarded, with the first word and last word forming the given and family name respectively, for Worldnames. The gender is also listed, along with a non-unique sequential list sequence number, passport number, district and a unique sequential dispatch number. The passport number is useful to de-duplicate the list (as someone may have applied more than once), while the district name can be used to build up the geography of the person’s location.

Nepal (Figure 4.4) recently introduced a top-level administrative structure of seven provinces. However, geospatial data are more readily available for the previous 14 zones, which are split into 75 districts, and it is these latter two administrative areas that are adopted by Worldnames 2, particularly as the district is listed in the source data. One zone (with a single district) has only 1,132 names; however the rest of the zones are well populated in the dataset. Around 80% of the districts also have a population of at least 1,000 in the data, thus generally satisfying the target minimums discussed earlier in this chapter.

Nepal’s most common family names are Tamang (4%), Gurung (3.6%) and Shrestha (1.7%), with 11,662 distinct family names in the data, while the most common given names are Ram (2.6%), Krishna (1.2%) and Santosh (0.93%), there being 37,141 unique given names detected. Ram Yadav was the most common name pair, with 1,193 occurrences, and there were 219,161 distinct name pairs.

Looking at the sub-country level, in both Manang and Mustang districts, 61% of family names were Gurung, while 23% of the population in Siraha had the family name of Yadav. Analysis at the district level confirmed that this Indian-origin name is much more common along the southern border of Nepal (Figure 4.5).

Tamang, the most common family name, is much more common in the eastern part of the country and is almost absent further west (Figure 4.6).

4.5 Conclusion

The project thus far has collected individual data on approximately 1.7 billion individuals, plus a number of surname-only statistical breakdowns representing another 1.5 billion people (the majority of the names in this latter category being from China). Around 175 countries are represented, with an eventual aim to also include data for the remaining approximately 30 countries, albeit likely very simple statistical summaries of the most popular names.

As stated in the introduction, this project is quite different to other CDRC initiatives in its longevity (the first funding for this work was received in 2003), in the persistently high levels of public interest that it has generated – recording nearly a million visits a year to Worldnames 1 for several years following its launch (worldnames.publicprofiler.org/webstats/index.html) and articles in large-circulation media such as the Guardian and Daily Mail newspapers – and in the way that it has been conducted in spare time between funding streams. It is nonetheless important as the only CDRC project that purports to provide something approaching a global spatial data infrastructure, albeit founded on diverse, piecemeal and fragmented data sources.

This is, without doubt, a Big Data project – albeit one in which the search for and processing of appropriate data sources has been very labour intensive. We believe that this has implications for the wider practice – namely that Big Data have to be broadly understood before they are ‘ingested’, and that significant flaws in the content and coverage of data cannot be accommodated in subsequent analysis through blind application of sophisticated techniques. Spatial data are special by their very nature and geographic skills are foremost of those required to understand the possible sources and operation of bias in datasets such as Worldnames 2.

The greatest impact of the research to date has been upon the legions of amateur genealogists who are interested in understanding the geographies of their origins across the widest possible range of spatial scales. But, as set out in our introduction, the work is of wider importance precisely because a name is a statement of a number of facets to our individual identities – ranging from cultural, ethnic and linguistic group, to age and probable social standing in the world. In our future research we hope to address the ways in which names provide indicators of the movement of populations through the generations – both by the contagious diffusion of a surname from its known point of geographic origin (nearly a thousand years in the case of Anglo Saxon surnames, but less than a century for much of Turkey, for example) and by hierarchical diffusion cascading through the increasingly interconnected system of world cities. From these standpoints, georeferenced names provide valuable indicators of the legacies of successive waves of global migration through to measures of the social progression of migrants relative to their source and host communities.

Further Reading

Brunet, G. and Bideau, A. (2000). Surnames. The History of the Family, 5(2), 153–160.

Cheshire, J. A., Longley, P. A. and Mateos, P. (2010). Regionalisation and clustering of large spatially-referenced population datasets: The Case of Surnames. GIScience conference 2010.

Longley, P. A., Cheshire, J. A. and Mateos, P. (2011). Creating a regional geography of Britain through the spatial analysis of surnames. Geoforum, 42(4), 506–516.

Mateos, P., Webber, R. and Longley, P. (2007). The cultural, ethnic and linguistic classification of populations and neighbourhoods using personal names. CASA Working Papers Series 116, Centre for Advanced Spatial Analysis, University College London.

Munzert, S., Rubba, C., Meißner, P. and Nyhuis, D. (2014). Mapping the geographic distribution of names. In Automated Data Collection with R. Chapter 15, 380–395. John Wiley & Sons, Ltd.

Acknowledgements

The authors would like to thank Dr Muhammad Adnan for his work on the original Worldnames, which formed the basis of much of the research and development of this project.
PART TWO
DYNAMICS AND CONSUMER DATA INFRASTRUCTURES

5
Ethnicity and Residential Segregation
Tian Lan, Jens Kandt and Paul Longley

5.1 Introduction

The preceding chapters have demonstrated the value of consumer data as an informative component of a spatial data infrastructure. This chapter focuses on applications using UK adults’ names and addresses at individual level, which are commonly collected Big Data items in consumer registers. Names are potentially informative markers because their social properties permit inference of socio-demographic attributes such as age, sex, ethnicity, language and religion. Furthermore, many consumer databases can be turned into longitudinal data resources offering great potential to study the dynamics of individuals, households and place with novel insights into ethnic segregation, social and spatial mobility and local demographic change. In this chapter, we examine trends in ethnic diversity and residential segregation in contemporary Britain by using consumer data to underpin ethnicity classifications.

First, we elaborate on the value of consumer data viewed against prevailing approaches in measuring segregation. We will then explain the data and methods employed in order to explore highly detailed dynamics of ethnic geographies as well as the workflow of linking consumer data to other data sources. Subsequently, we present overall profiles of ethnic diversity and explore the changing ethnic compositions of contemporary Great Britain, including England, Wales and Scotland. As Britain becomes increasingly diverse, so the degree of residential segregation as measured by ethnicity becomes central to wider concerns about socio-spatial inequalities and the geography of economic and social opportunity. We conclude the chapter with findings and reflections on using consumer data in the study of ethnic diversity and segregation alongside their limitations.

5.2 Consumer data and ethnic geographies

Ethnicity is a taxonomy that categorises people into social groups based on common ancestral, linguistic, religious or cultural characteristics. It is not only an inherently complicated concept closely related to an individual’s identity but also an important demographic and socioeconomic indicator. It constantly draws public, political and academic attention and is currently strongly identified again as a paramount factor in social integration in the UK (Casey, 2016). In recent years, Britain has experienced increasing ethnic diversity, which has stimulated a wide public and policy debate as to whether Britain may be ‘sleepwalking’ into a more segregated society (Finney and Simpson, 2009). Revolving around this debate, ethnicity and residential segregation in the British context have been intensively examined in the literature. Simpson (2007) has suggested that there had been an increase in residential mixing because of growing minority populations and their more even spread across localities. Peach (2009) has also challenged the assertions that ethnic segregation in Britain was increasing and argued that Britain still does not have American-style ghettos like those of Chicago. Pointing out that increasing diversity had become an important feature of contemporary population change, Catney (2015) mapped the evolving geographies of ethnic diversity over two decades to update the knowledge of how diversity had grown. Most of these studies rely on national Census of Population data from the Office for National Statistics (ONS), which provides small area aggregated population counts for ethnic groups within pre-defined spatial units. For example, the Census Output Areas (OAs) are the lowest level of geography in the hierarchy of census boundaries and it is convenient to further aggregate OAs to higher levels of geography.

Although completion of the Census is a legal requirement, and the small area OA geography is convenient for many purposes, census data nevertheless present drawbacks for segregation research. First of all, there exist consistency issues with ethnicity classification across censuses. The questions and categorisations on ethnicity changed between the 1991, 2001 and 2011 Censuses (Table 5.1). For example, the White group in the 1991 Census is further divided into the sub-groups White British, White Irish, and any other White background in the 2001 Census. Mixed ethnic groups are added in the 2001 and 2011 Censuses. Categorisations like Gypsy/Irish Traveller and Arab are introduced in the 2011 Census. In the census context, ethnicity is often recorded using categories among which some are not ethnic groups in a strict sense, such as ‘White British’ or ‘Black British’. Because respondents are asked to choose the one categorisation that best describes their ethnic group, answers inevitably rely on subjective judgement. Numerous longitudinal studies (Simpson, Jivraj and Warren, 2016) have indicated that some individuals change the perception of their own ethnicity between censuses. For the purpose of this chapter, we use the most common categories of the latest Census, merging the rare mixed ethnicities into Mixed (see Table 5.1).

Geographic boundaries such as OAs for collecting, analysing and reporting the population counts are not consistent throughout censuses, although there is generally a strong correspondence between the 2001 and 2011 data. According to the lookup file¹ between 2001 and 2011 OAs in England and Wales, there are 4,354 (2.4%) OAs among a total of 175,434 OAs that have been either split or merged to formulate 2011 OAs. In addition, the household census form for England and Wales is different from other parts of the UK. All these issues lead to comparability problems, and precautions need to be taken before comparing ethnic groups across censuses.
5. Ethnicity and Residential Segregation 73

1991 Census ethnic groups | 2001 Census ethnic groups | 2011 Census ethnic groups | Merged census ethnic groups for this study
White | White British | White English/Welsh/Scottish/Northern Irish/British | White British
White | White Irish | White Irish | White Irish
White | Other White | White Gypsy/Irish Traveller; Other White | Other White
Indian | Indian | Indian | Indian
Pakistani | Pakistani | Pakistani | Pakistani
Bangladeshi | Bangladeshi | Bangladeshi | Bangladeshi
Chinese | Chinese | Chinese | Chinese
Black African | Black African | Black African | Black African
Black Caribbean | Black Caribbean | Black Caribbean | Black Caribbean
Any Other | Any Other | Any Other; Arab | Any Other; Arab
Other Black | Other Mixed; Mixed White-Caribbean; Mixed White-African; Other Black; Mixed White-Asian | Other Mixed; Mixed White-Caribbean; Mixed White-African; Other Black; Mixed White-Asian | Mixed
Other Asian | Other Asian | Other Asian | Other Asian

Table 5.1 Comparison of ethnic groups across censuses and merged census ethnic groups for this study. Note: This table is modified from the table in Catney (2015).

Moreover, UK Censuses are usually updated only every ten years, which undermines the merits of census data when finer-grained temporal resolution is needed to capture population change in intermediate years. For example, a Polish immigration wave after Poland’s accession to the European Union in 2004 can only be observed as accumulated results using 2011 data (see Note 2) on ethnicity. This issue is further compounded by the omnipresent Modifiable Areal Unit Problem (Openshaw, 1984), which is unlikely to be wholly negated by census zone design considerations.

In the remainder of this chapter we seek to demonstrate that the linkage of consumer registers and use of names-based ethnicity classifications offers promising ways to begin to address the shortcomings of census data. First, ethnicity is assessed at the individual level in a less subjective way with a consistent method applied over selected inter-Censual years. Second, ethnicity data can be generated almost continuously because consumer data are generated in real time and are consolidated at least once a year. Thus, population dynamics can be measured at a much higher temporal resolution. Third, since consumer data can be geocoded at address level, it enables the examination of ethnic geographies across the fullest range of scales. Taken together, used in association with conventional census data, consumer data can be used to identify highly granular geo-temporal patterns of segregation in contemporary Britain.
74 CONSUMER DATA RESEARCH: PART TWO

5.3
Deriving ethnic geographies at address level

The Consumer Data Research Centre (CDRC) currently stores consumer data for the whole UK. Consumer data are compiled from multiple data sources (see Lansley and Li, Chapter 1 in this volume), which include the full public versions of the annual Electoral Register (ER) for the earlier holdings (1998-2002) and a compendium of both the redacted public ER and consumer dynamics files obtained from a range of sources for the later years (2003, 2007, 2013-2016). In what follows we describe each annual update as a ‘Consumer Register’ (CR) for Great Britain. In these Registers, consumer residential locations are identified only by postal addresses that are captured and retained in a non-standard form, making it necessary to link the data to a common geo-referencing framework such as Ordnance Survey AddressBase (see Note 3) or the Postcode Address File (PAF). The workflow of formulating ethnicity data from these composite registers is shown in Figure 5.1. This data processing procedure mainly consists of two steps: geocoding and ethnicity classification.

Figure 5.1 Workflow of formulating address-level ethnicity data from consumer data. (Diagram: CR/ER 2016 data and AddressBase data feed the geocoding step, which outputs geocoded CR/ER data; the ethnicity classifying step then outputs ethnicity data.)

5.3.1
Geocoding

Since consumer data only contain non-geocoded postal addresses for each customer record, geographic locations need to be assigned. There are two basic input datasets for geocoding: consumer data and Ordnance Survey (Great Britain) AddressBase Premium data. The AddressBase data provide address point coordinates in both British National Grid and ETRS89 coordinate reference systems for postal addresses from the Royal Mail Postcode Address File (PAF). In this study every postal address is geocoded using coordinate pairs in the British National Grid reference system. A text matching algorithm is used to match customers’ addresses with postal addresses from AddressBase for data pertaining to every annual update of the Consumer Registers. The output from this step is geocoded CR/ER data with geographic coordinates.

5.3.2
Ethnicity classification

A standalone application Onomap (Mateos, Longley and O’Sullivan, 2011; www.onomap.org/) is used to assign every customer record to an ethnic group, based on forename and surname. Onomap was developed as a series of lookup tables to predict ethnicity using forenames and surnames with indicative cultural, ethnic and linguistic origins. A matching algorithm is used to classify names based on these tables. The resulting output file contains name pairs and their corresponding Onomap Group. To make the result more comparable, the Onomap ethnic

classification is condensed into the merged 2011 Census ethnic groups (Table 5.1) for this study. A new version of Onomap, the Ethnicity Estimator, is currently under development and its main new feature will be its calibration with micro data on self-assigned ethnicity from the 2011 UK Census.

5.4
Ethnic diversity in contemporary Britain

We begin by exploring the ethnic composition change over time as derived from the address-level ethnicity data. The White ethnic group, including White British, White Irish, and Other White, constitutes the majority of Britain’s population: White British alone accounted for 85% of the population of Britain in the 2016 consumer data. By contrast, other ethnic groups such as Pakistani, Indian, Bangladeshi and Chinese together comprised less than ten per cent (Figure 5.2). The proportion of the White majority group decreased from 95.9% in 1998 to 92.2% in 2016 according to the consumer data, although the absolute size of the White population increased over this period.

Figure 5.2 White and non-White proportion change over time. (Stacked bar chart of White and non-White proportions by year. White share: 1998 95.9%; 1999 95.8%; 2000 95.6%; 2001 95.5%; 2002 95.4%; 2003 94.9%; 2007 92.9%; 2013 92.3%; 2014 92.1%; 2015 92.1%; 2016 92.2%.)

The White and non-White bipartition in Figure 5.2 can be further divided into finer ethnic categorisations to gain a more in-depth picture of the ethnic composition of Britain. Because the White British group is so predominant that it could easily overshadow the patterns of minority ethnic groups, it is excluded from the selected groups of the 2011 Census ethnic classification in Figure 5.3. Three years are chosen from the timeframe: 2001, 2007 and 2016. Year 2001 is the only available directly comparable reference point to any Census year, although further acquisitions are in prospect.

As suggested by Figure 5.3, Indian is the largest non-White group, with around 2.1% of the population in 2016. Pakistani is the second largest non-White community in Britain with 1.9% in 2016. Except for the Black African and Bangladeshi groups, an increase in the proportion across the three years can be seen for most ethnic groups. There has been a noticeable boost for the Other White group, with an increase of around 1.1 percentage points in 2007 and 2 percentage points in 2016 compared with base year 2001. This is in accordance with the 2011 Census analysis on ethnicity of the non-UK born

population, which claimed that 71% of the residents who identified themselves as Other White arrived in the UK between 2001 and 2011. It is suggested that the drastic increase for the Other White group is mostly due to the 2004 accession of several Eastern European countries into the European Union.

Figure 5.3 Ethnic composition change of Britain in 2001, 2007 and 2016 (White British excluded). (Bar chart of percentages, 0.0% to 4.5%, for 2001, 2007 and 2016 by selected ethnic group: Other White, White Irish, Indian, Pakistani, Other, Black African, Other Asian, Bangladeshi, Chinese, Black Caribbean.)

Some facts can be summarised from the ethnicity profiles above. Over the years, the relative share of the White majority population has decreased, although it has increased in absolute size. The White British group remains the largest ethnic category in Britain followed by the Other White group. Most of the minority ethnic groups are experiencing an increase in their proportion of the population. Therefore, it can be concluded that Britain has become more and more ethnically diverse over time. It is also evident in Figure 5.3 that all of the ethnic minorities are growing in proportion over the three years, except for the Bangladeshi and Black African groups.

The year 2001 is the only point in time that is shared by both the Census and one of the consumer registers currently held by CDRC. Since the eligible age for registering on the electoral roll was 16 in Britain in 2001, the adult (aged 16+) counts are extracted and combined from the 2001 Census for England/Wales and Scotland for comparison purposes. Population counts of ethnic groups from consumer data are compared against adult counts of ethnic groups provided by the 2001 Census (Table 5.2). The comparison shows that consumer data only account for 87% of the 2001 Census adult population and that most of the individual ethnic groups are under-represented to a greater or lesser degree. In particular, the White British group is relatively well represented in the consumer data (88% of the 2001 Census count), while only 40% of the Chinese group is represented. The representative rates for the Indian and Pakistani groups are 72% and 86% respectively.

The Black Caribbean and Other Mixed groups are severely under-represented and the Arab group is not applicable in the 2001 Census ethnic categorisation. With the elimination of the above three ethnic groups, the ratios of adult counts for each ethnic group in consumer data to counts in 2001 Census data are visualised in Figure 5.4. The red dashed line indicates 1:1 representation. The White Irish group is extremely over-represented, which suggests that a considerable number of people who are classified as White Irish

Merged Census Groups for the study | Adult (16+) counts in 2001 Consumer Data | Adult (16+) counts in 2001 Census Data | Ratios of counts (Consumer data to Census Data)
Other Asian | 151,594 | 189,036 | 0.802
Bangladeshi | 106,480 | 174,257 | 0.611
Chinese | 78,532 | 198,145 | 0.396
Indian | 580,620 | 811,044 | 0.716
Pakistani | 418,401 | 486,061 | 0.861
Black African | 137,123 | 338,827 | 0.405
Black Caribbean | 11,703 | 450,498 | 0.026
Other Mixed | 291 | 337,547 | 0.001
Other | 300,890 | 245,330 | 1.226
Arab | 4 | NULL | NULL
Other White | 829,458 | 1,226,886 | 0.676
White British | 35,754,961 | 40,534,837 | 0.882
White Irish | 1,178,502 | 650,658 | 1.811
Total | 39,548,559 | 45,643,126 | 0.866
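The ratio column of the table above is simply the consumer-data count divided by the Census count for each group. As a minimal sketch (the function name `representation_ratios` and the data layout are illustrative assumptions, not the chapter's actual processing code), the figures can be reproduced from the published counts as follows:

```python
from typing import Dict, Optional, Tuple

# (consumer count, census count) for adults in 2001, copied from Table 5.2.
# The Arab group has no 2001 Census category, so its census count is None.
COUNTS: Dict[str, Tuple[int, Optional[int]]] = {
    "Other Asian": (151_594, 189_036),
    "Bangladeshi": (106_480, 174_257),
    "Chinese": (78_532, 198_145),
    "Indian": (580_620, 811_044),
    "Pakistani": (418_401, 486_061),
    "Black African": (137_123, 338_827),
    "Black Caribbean": (11_703, 450_498),
    "Other Mixed": (291, 337_547),
    "Other": (300_890, 245_330),
    "Arab": (4, None),
    "Other White": (829_458, 1_226_886),
    "White British": (35_754_961, 40_534_837),
    "White Irish": (1_178_502, 650_658),
}


def representation_ratios(
    counts: Dict[str, Tuple[int, Optional[int]]]
) -> Dict[str, float]:
    """Return consumer/census ratios, skipping groups without a census count."""
    return {
        group: round(consumer / census, 3)
        for group, (consumer, census) in counts.items()
        if census  # skips None (and zero) census counts
    }


ratios = representation_ratios(COUNTS)

# Overall ratio across all groups (the Arab group only contributes 4 records).
total_consumer = sum(consumer for consumer, _ in COUNTS.values())
total_census = sum(census for _, census in COUNTS.values() if census)
overall_ratio = round(total_consumer / total_census, 3)

# Groups with more consumer records than Census adults (ratio above 1).
over_represented = sorted(g for g, r in ratios.items() if r > 1.0)
```

A ratio above 1 marks a group with more name-classified consumer records than self-identified Census adults, which in this table applies only to the Other and White Irish groups.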

Table 5.2 Comparison between adult counts in 2001 consumer data and 2001 Census data.

based on surname and forename by Onomap identify themselves as other ethnicities, most likely as White British. The varied representative rates might relate to the eligibility and willingness of different ethnic groups to register to vote in various elections. Although these ethnic groups are more or less under-represented, consumer data still remain a powerful resource to capture more details about the changes in Britain’s ethnic composition.

Figure 5.4 Ratios of adult counts for individual ethnic group in consumer data to counts in Census data. (The red dashed line indicates 1:1 ratio). (Radar chart with a radial scale from 0.0 to 2.0 and axes for Other Asian, Bangladeshi, Chinese, Indian, Pakistani, Black African, Other, Other White, White British and White Irish.)

To examine geographic patterns of consumer data’s representativeness, ratios of adult counts in 2001 consumer data to 2001 Census data for the White British group are mapped across the Local Authority Districts (LADs) in Britain (Figure 5.5). The White British group are generally well represented for the whole of Britain, except for some districts in yellow colour such as South Cambridgeshire and Glasgow City, which have representativeness lower than 50%. Better or even over-representation (greater than 90%) of the White British group can be spotted in some

urban areas, for instance London Boroughs, Birmingham and Manchester. Brent and Newham, highlighted in dark blue in the inset map, have the highest representative rate of the White British group among all LADs in Great Britain.

Figure 5.5 Ratios (2001 Consumer data to 2001 Census data) of counts of the White British group across the LADs in Britain. (Map of Great Britain with an inset for Inner and Outer London and selected LADs labelled; legend classes for the ratio: 0.207 - 0.500, 0.501 - 0.750, 0.751 - 0.900, 0.901 - 1.100, 1.101 - 1.201; scale bar 0 to 100 km.)

5.5
Residential segregation in contemporary Britain

Trends in ethnic diversity alone cannot inform the distributions of ethnic groups within Britain. A geographical perspective on ethnicity can provide information on the important research and policy issue of residential segregation. Ethnic residential segregation can be an elusive concept (Peach, 2009) without unified definition and measurement. There are various indices examining different dimensions (Massey and Denton, 1988) of residential segregation. For example, the Index of Dissimilarity and the Information Theory index measure the Evenness dimension, while the Exposure/Isolation index measures the Exposure dimension. A single index of residential segregation could even result in different values at different spatial scales. In this study, the Index of Dissimilarity is employed to measure the evenness of ethnic residential patterns across the OAs.

A few concerns need to be addressed in advance. To demonstrate the feasibility of using consumer data in residential segregation studies, we choose the simplest and most readily interpretable index, the aspatial Index of Dissimilarity. Since the index is widely accepted in previous studies as well as governmental reports, this choice facilitates comparisons with related studies in the British context. For the same purpose of comparability, we aggregated population counts by ethnic group to the 2011 OAs. There are 227,759 OAs in Great Britain in 2011. In addition, traditionally ethnic diversity is largely an urban phenomenon and studies of ethnic segregation focus on metropolitan areas. Nonetheless, dispersal of ethnic minorities and immigrants into suburban and rural localities has been observed in the UK’s changing geographies of ethnic diversity in recent decades (Catney, 2015). Therefore, this study includes both urban and rural OAs when examining the overall trends of ethnic segregation in Britain.

The national residential segregation of individual ethnic groups is measured using the pairwise Index of Dissimilarity, denoted as D in Equation (5.1), which captures the absolute difference between the spread of a specified group and the spread of the rest of the population across spatial units nationally. Here w_i denotes the number of residents of the ethnic group under examination in the ith OA, and W denotes the total population of the ethnic group under examination in Britain.

Correspondingly, b_i is the number of residents other than the examined group in the ith OA, and B is the total number of the rest of the population in Britain. The absolute difference between the two proportions is summed over OAs and then halved so that the index value falls between 0 and 1. Sometimes it is also transformed into the percentage format, ranging from 0 to 100%. A value of 0 means that the ethnic group under examination is distributed across OAs in the same way as the rest of the population. For instance, if the ratio between w_i and b_i is constant across all OAs, the Index of Dissimilarity equals 0. At the other extreme, if every OA is occupied exclusively either by the ethnic group under examination or by the rest of the population, the Index of Dissimilarity reaches its maximum of 1, indicating complete segregation.

$$D = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{w_i}{W} - \frac{b_i}{B} \right| \qquad (5.1)$$

Using Equation (5.1), pairwise dissimilarity indices for all of the ethnic groups are calculated from year 1998 onwards, aiming to examine the extent to which ethnic groups are evenly distributed across OAs of Britain. Results for selected ethnic groups are shown in Figure 5.6. The Index of Dissimilarity can be interpreted as the proportion of residents in the ethnic group under examination who would need to be moved to other OAs to achieve an even distribution. First of all, the overall trend of residential segregation at the level of Great Britain is decreasing as Britain becomes more ethnically diverse at the same time. An obvious decline can be spotted in the temporal changes for most of the ethnic groups (Figure 5.6). However, the White British group is an exception, with a slight rise of 1.4% in its dissimilarity index. Although there is an increase in the segregation level for the White British group, Catney (2015) interpreted this phenomenon in the context of new ethnic group mixing in less diverse locales. She argued that there would be an increase of unevenness whenever members of another ethnic group moved into a district where the White British group had a 100% perfectly even distribution.

The declining trend of dissimilarity indices for most minority ethnic groups suggests an increase in spatial integration among ethnic groups. Yet, the overall baseline levels of segregation differ considerably between groups, and three clusters of segregation can be identified (see Figure 5.6). The most segregated group of all, Bangladeshi, together with the Pakistani, Indian, Chinese and Black African groups, are highly segregated ethnic communities whose dissimilarity indices are around or above 70%. In contrast, the White British and Other White groups belong to the moderately segregated communities, whose dissimilarity indices are between 40% and 50%. The White Irish group stands out as the low segregation group whose dissimilarity index is below 30%. The Irish group has a long migration and settlement history in Britain and, if names have remained an indicator of ethnic identity, they appear to be more evenly distributed across Britain. The results also indicate that the White group, including the White British, White Irish, and Other White, is more spatially integrated across the whole of Britain than other ethnic minorities.

The above findings address the concern brought up at the beginning of this section as to whether Britain is becoming more residentially segregated or mixed. They lead to the conclusion that Britain has become more ethnically diverse and more residentially mixed at the national level. It should be noted, however, that a segregation pattern is an outcome not only of selective residential mobility/migration but also of differential fertility and mortality rates among ethnic groups (Catney, 2015). Another demographic dynamic comes from international immigration during the past decade. According to the 2011 Census, 13% of the total population in England and Wales in 2011 were born outside the UK. There is a

long history of immigrants into the UK. Initially they were drawn to the UK by labour shortages in particular areas, for example road-building in the early 20th century, health and transport services in the 1960s, the textile industry in the 1970s, and agriculture in the 2000s (Simpson, 2012). These particular migrant destinations, mostly major cities of the UK such as London, Manchester and Birmingham, served as ‘gateway areas’ (Catney, 2015) for immigration flows. International immigrants settled down in these gateway areas first and then some of them spread into other areas of the UK, which results in growing ethnic diversity and more even distributions of ethnic minorities.

Figure 5.6 Changes of Index of Dissimilarity for selected ethnic groups over time. (Line chart of the Dissimilarity Index, 20% to 100%, by year for the Pakistani, Bangladeshi, Black African, Indian, Chinese, White British, White Irish and Other White groups.)

Last but not least, it is important to investigate the cause of the dramatic drop starting from year 2003 shown in Figure 5.6. It has been noted earlier in this chapter that consumer data are compiled from multiple data sources. Consumer data are mainly derived from the public registers of electoral rolls before 2003. Afterwards, the consumer data comprise both electoral rolls and other commercial data sources. Therefore, it is necessary to establish whether the dramatic drop from 2003 is caused by the vagaries of data sources or by a demographic process. The best way is to filter out the consumer register part from the consumer data after 2003 so that the filtered consumer data solely consist of the public version of the electoral roll. The filtering can be done by using one attribute of each record that indicates the general source of the data. In order to examine whether the data source vagaries have an impact on the segregation indices, comparisons are made between dissimilarity indices calculated from the original consumer data and from the filtered consumer data.

Based on this strategy, the workflow in Figure 5.1 is repeated and the dissimilarity indices of ethnic groups from the original and the filtered consumer data are compared in Figure 5.7, where different colours represent different ethnic groups, dashed lines represent the original consumer data and solid lines represent the filtered consumer data. Within each ethnic group, the segregation indices are relatively underestimated by the original consumer data after 2002 compared against the filtered consumer data. The most noticeable underestimation occurs in the Chinese

ethnic group, possibly because some of them, for instance international students from China, are not eligible to vote in any election. Thus they are more under-represented in the electoral registers than in the consumer registers. In contrast, the variations are not that large among some native ethnic groups such as the White British and White Irish, or some ethnic communities from Commonwealth countries, for example the Indian, Pakistani and Bangladeshi groups, or some of the Other White group from EU countries, all of whom are allowed to vote in at least some elections. Although slightly underestimated, the dissimilarity indices for all ethnic communities are still decreasing and the overall patterns of high, moderate and low segregation groups still exist among these different ethnic communities. It can also be concluded that the dissimilarity indices for each ethnic group prior to 2003 are overestimated since fewer non-British groups are included in the public electoral roll before 2003. Afterwards, other commercial data sources are compiled into the consumer data to include as many non-voters as possible. Changes in the data source partly explain the strong decrease in dissimilarity indices observed from 2003 to 2007. However, they do not affect the overall changing tendency of segregation indices for individual ethnic groups. This tendency is likely because of a natural demographic process, for example the increasing number of immigrants and migrant dispersal, which needs further evidence from the Census migration data.

Figure 5.7 Dissimilarity indices derived from original consumer data (dashed line) and from filtered consumer data (solid line). (Line chart of the Dissimilarity Index, 20% to 90%, by year for the Pakistani, Bangladeshi, Indian, Chinese, White British, White Irish and Other White groups.)

5.6
Conclusion

Contemporary Britain continues to experience demographic changes in its population, and these have accelerated in recent years. Ethnic diversity at various scales offers a key perspective on these dynamics. Against this background, there are concerns that a more diverse population may become more segregated. By taking the opportunity of the 2011 Census release, numerous studies have been conducted into these issues but are now beginning to look outdated and possibly overtaken by events as the Census data age. In addition, such analyses are often limited to high levels of spatial aggregation for reasons of disclosure control. Thus, the use of coded consumer data offers merit in understanding changes over time at finer spatial scales.

Our work in this area is just beginning, and there are very significant start-up costs in establishing the provenance and quality of consumer data as well as the veracity of the individual-level inference of ethnic group. The study contributes to the segregation debate with empirical findings from consumer data with finer-grained resolutions in both spatial and temporal dimensions. The results suggest that Britain is becoming more and more ethnically diverse over time with shrinking White British majorities and growing ethnic minorities. A decrease in the overall residential segregation in Britain can be identified from the changes in dissimilarity indices for most ethnic groups, except for a small increase for the White British group. These findings reinforce, with empirical evidence, the existing claims of a more mixed Britain in related studies in the literature. It is believed that these changes are the consequence of a natural demographic process of fertility, mortality, migration and immigration of the population. The debate shows that there is a need to effectively assess ethnic geographies across a full range of spatial scales using data and methods that permit robust, timely and transparent assessments of residential segregation.

Consumer data appear to be an effective supplement to census data; however, several limitations should be noted. First, although achieving a relatively high penetration rate of the population in the UK, consumer data are incomplete as they only record the adult population. The age limit depends on the eligible age for voting or applying for credit cards and loyalty cards. Second, owing to differences in willingness and eligibility to register to vote, different ethnic groups might be under-represented to different degrees (ethnic bias). Third, multiple data sources could have an impact on the results of the analysis because it is only since 2003 that persons registered on the Electoral roll can opt out of inclusion in the derivative commercial dataset. Hence, since 2003, the consumer registers have been further enhanced with other datasets. Although this might cause inconsistencies in trends between years, these seem to be limited based on the comparison between the public electoral roll subset and the full consumer registers. Consequently, the dissimilarity indices derived from the original consumer data are likely to underestimate segregation to varying degrees, as found in the comparison between the original consumer data and the filtered consumer data (Figure 5.7). Considerations and precautions should be taken to understand the possible outcomes of these limitations with respect to the purpose of the analysis.

In this chapter, both opportunities and challenges of making use of consumer data are presented by re-visiting the ethnic diversity and segregation issues of contemporary Britain. It has been demonstrated that it is feasible to address such social concerns with novel data sources. By linking to other data sources, consumer data have even greater potential to address other micro demographics in a broader spectrum. For example, house price inflation and asset accumulation can be investigated by linking consumer data with Land Registry data. Other examples of possible applications of consumer data could be issues such as changes in population density, household composition, and age structure. With more such new Big Data sources emerging, changes and dynamics of the contemporary population can be better understood in novel ways.
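The segregation findings summarised in this chapter rest on the Index of Dissimilarity of Equation (5.1). As a minimal sketch of that calculation (the function name `dissimilarity_index` and the toy area counts are illustrative assumptions, not the chapter's actual code or data), the index can be computed from per-area counts of the group under examination and of the rest of the population:

```python
from typing import Sequence


def dissimilarity_index(group: Sequence[float], rest: Sequence[float]) -> float:
    """Index of Dissimilarity D = 0.5 * sum(|w_i / W - b_i / B|).

    group[i] is the count of the examined group in area i (w_i);
    rest[i] is the count of everyone else in area i (b_i).
    """
    if len(group) != len(rest):
        raise ValueError("group and rest must cover the same areas")
    total_group = sum(group)  # W
    total_rest = sum(rest)    # B
    if total_group == 0 or total_rest == 0:
        raise ValueError("both populations must be non-empty")
    return 0.5 * sum(
        abs(w / total_group - b / total_rest) for w, b in zip(group, rest)
    )


# Hypothetical toy areas, for illustration only.
# Constant ratio between w_i and b_i in every area: perfectly even.
even = dissimilarity_index([10, 20, 30], [100, 200, 300])

# No area is shared by the two populations: completely segregated.
segregated = dissimilarity_index([50, 0, 50, 0], [0, 80, 0, 20])

# An intermediate case with two areas.
partial = dissimilarity_index([30, 10], [20, 40])
```

In the intermediate case the index is 5/12 (about 0.42), which under the usual interpretation means that roughly 42% of the group would have to move to another area for the two distributions to match. Because the index is aspatial, it depends only on the counts per area and not on where the areas are located.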

Further Reading

Casey, D. L. (2016). The Casey Review: A review into opportunity and integration. Online: https://www.gov.uk/government/publications/the-casey-review-a-review-into-opportunity-and-integration

Catney, G. (2015). Exploring a decade of small area ethnic (de-)segregation in England and Wales. Urban Studies, 53(8), 1691-1709. doi:10.1177/0042098015576855

Catney, G. (2016). The Changing Geographies of Ethnic Diversity in England and Wales, 1991–2011. Population, Space and Place, 22(8), 750-765. doi:10.1002/psp.1954

Finney, N., and Simpson, L. (2009). ‘Sleepwalking to segregation’? Challenging myths about race and migration. Policy Press at the University of Bristol.

Mateos, P., Longley, P. A. and O’Sullivan, D. A. (2011). Ethnicity and Population Structure in Personal Naming Networks. PLOS ONE, 6(9). doi:10.1371/journal.pone.0022943

Massey, D. S., and Denton, N. A. (1988). The Dimensions of Residential Segregation. Social Forces, 67(2), 281-315. doi:10.2307/2579183

Openshaw, S. (1984). The modifiable areal unit problem. Geobooks. Norwich, England: University of East Anglia.

Peach, C. (2009). Slippery segregation: Discovering or manufacturing ghettos? Journal of Ethnic and Migration Studies, 35(9), 1381-1395. doi:10.1080/13691830903125885

Simpson, L. (2007). Ghettos of the mind: The empirical behaviour of indices of segregation and diversity. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2), 405-424. doi:10.1111/j.1467-985X.2007.00465.x

Simpson, L. (2012). More segregation or more mixing? Dynamics of diversity: Evidence from the 2011 Census. Online: http://hummedia.manchester.ac.uk/institutes/code/briefingsupdated/more-segregation-or-more-mixing.pdf

Simpson, L., Jivraj, S., and Warren, J. (2016). The stability of ethnic identity in England and Wales 2001–2011. Journal of the Royal Statistical Society: Series A (Statistics in Society), 179(4), 1025-1049. doi:10.1111/rssa.12175

Acknowledgements

The authors would like to thank CACI Ltd and DataTalk Research Ltd for providing the Consumer Register and Electoral Roll data under a special research licence to enable us to carry out this research. We would also like to thank Owen Abbott and Adriana Castaldo, Office for National Statistics, for their support in developing Ethnicity Estimator. The research was also funded by Engineering and Physical Sciences Research Council grant EP/M023483/1 and the Economic and Social Research Council grant ES/L013800/1.

Notes

1. http://ons.maps.arcgis.com/home/item.html?id=471e6948594540a3bccb2678e0cf50fe
2. www.ons.gov.uk/peoplepopulationandcommunity/
3. www.ordnancesurvey.co.uk/business-and-government/products/addressbase-products.html
6

Movements in Cities: Footfall and its Spatio-Temporal Distribution
Roberto Murcio, Balamurugan Soundararaj and Karlo Lugomer

6.1
Introduction

This chapter addresses the problem of how to estimate human activities in retail centres by examining the WiFi probes in a SmartStreetSensors network at a fine spatial and temporal resolution. These sensors capture signals sent by WiFi-enabled devices present in their range. The data are then used as a proxy for estimating footfall at retail locations. An original methodology for cleaning and validating probe requests and then converting them into actual footfall counts is proposed and implemented. With these counts, a national level footfall index is proposed and, finally, the chapter concludes with a case study that uses these data to characterise different retail locations.

The accurate measurement and estimation of human activity is one of the first steps towards understanding the structure of the urban environment (Louail et al, 2014). Human activities are highly granular and dynamic in both the spatial and temporal dimensions (Steenbruggen et al, 2013) and estimating them with confidence is crucial for decision-making in numerous applications such as urban management, retail, transport planning and emergency management. Traditionally, insights into the distributions of such activities were gathered by studying the data available on night-time residence through population censuses and daytime estimates via various sample surveys such as traffic counts. The data generated by censuses, while comprehensive, are only updated once a decade in countries such as the UK. In contrast, sample surveys and traffic counts are updated more frequently but are usually very specific. The key challenge has always been to capture and understand these dynamic and complex phenomena in detail, efficiently and without compromising the privacy of those involved. This has led to a considerable volume of research in the last decade utilising various techniques and technologies. The proliferation of personal

mobile devices has generated considerable interest for research in the past two decades by opening up unprecedented avenues in gathering detailed, granular information on people carrying these devices. The general technology landscape that supports this device ecosystem has also been constantly evolving and, despite the increasing concern for privacy, users have been observed to accept the collection of their data on reasonable terms in return for incentives (Kobsa, 2014).

6.2
WiFi

There has been significant progress in employing novel technologies in measuring and analysing human activity patterns during the past two decades. A particular emphasis has been on the usage of mobile data and commonly associated technologies: cellular data, GPS (Vazquez-Prokopec et al, 2013), Bluetooth and, of course, WiFi (Schauer et al, 2014).

WiFi is a wireless network connection protocol standardised by the IEEE as the 802.11 family of standards, first published in 1997. It is a distributed server-client based system where the client connects to access points (APs). Every device in the network has a unique hardware specific MAC address, which is transmitted between the device and AP before the connection is made. The key feature of WiFi infrastructure is that the network is distributed, i.e. the APs can be set up and operated by anyone locally, unlike mobile networks. Since APs are primarily used for Internet service provision, the protocol prioritises continuity of connectivity, so devices constantly scan for new and better connections using probe requests (detailed in later sections). WiFi, therefore, offers near complete coverage, is very resilient, and can encapsulate and reinforce civic space in cities (Torrens, 2008), while providing a middle ground between using infrastructure which is largely general purpose, such as a cellular network, and an entirely special purpose infrastructure such as bespoke footfall counters.

Because WiFi is a general network protocol designed to be used by mobile devices, WiFi devices relay a range of public signals, known as probe request frames, at regular intervals throughout their operation, for the purpose of establishing and maintaining a reliable and secure connection for the mobile device (Freudiger, 2015). These probe requests can be non-intrusively captured using inexpensive customised hardware and utilised in numerous applications. In addition to a uniquely identifiable MAC address, these probe requests include a range of other information that, when combined with the temporal signatures of the probe requests received, can help us understand the nature of, and identify, the devices generating these requests. Using the semantic information present in these probe requests it is possible to understand the nature of these users on a large scale.

Because of the security and privacy risks posed by the WiFi protocol’s use of hardware based MAC addresses, various methods to strengthen security have been introduced in mobile devices over the years. One such measure, randomisation of MAC addresses, has become more mainstream in mobile devices with its introduction as a default operating system behaviour in iOS 8 by Apple Inc.

6.3
The SmartStreetSensor project

The SmartStreetSensor project is one of the most comprehensive studies carried out on consumer volume and characteristics in retail areas across the UK. The project is a collaboration between Local Data Company (LDC) and the ESRC’s Consumer Data Research Centre (CDRC). The data for the study are generated independently within the project through sensors installed at around 1,000 locations across Great Britain. It is the first comprehensive study into

national footfall patterns using automated data collection.

As a first step, various locations for the study were identified by CDRC to ensure a geographical spread, different local demographic characteristics and a range of retail centre profiles. A custom footfall counting technology using WiFi based sensors was developed by LDC and the sensors were installed in the identified locations. The sensor monitors and records signals sent by WiFi enabled mobile devices present in its range. In addition, pedestrians walking past the sensor were counted manually for short time periods during the installation. The project aims to combine these two sets of data to estimate footfall at these locations. The first sensor was installed in July 2015 and the network has grown to almost 789 total active sensors as of June 2017.

The primary aim of the project is to improve our understanding of the dynamics of high-street retailing in the UK. The key challenge in this area is the collection of data at the finest scales possible with minimal resources while not infringing on people’s privacy. This challenge, when solved, can provide immense value to occupiers, landlords, local authorities, investors and consumers within the retail industry. The project aims to facilitate decision making by stakeholders in addition to the tremendous opportunities for academic research.

6.3.1
Hardware setup

The data are collected through a network of SmartStreetSensors: WiFi based sensors that collect a specific type of packet (probe requests) relayed by mobile devices within the sensor’s signal range. The sensor is usually installed in a partnering retailer’s shop window so that its range covers the pavement in front of the shop. In a handful of cases (3%), the sensor is placed within a large shop to monitor internal footfall. Each device collects data independently and uploads the collected data to a central container at 5-minute intervals through a dedicated 3G mobile data connection. The sensor hardware has been improved over the course of the project and currently has built-in failure prevention mechanisms such as a backup battery for power failures, automatic reboot capabilities and in-device memory for holding data when the Internet is not available.

6.3.2
Data collection, data storage and data retrieval

The probe request frame is the signal sent by a WiFi capable device when it needs to obtain information from another WiFi device. For example, a smartphone would send a probe request to determine which WiFi access points are within range and suitable for connection. On receipt of a probe request, an access point sends a probe response frame that contains its capability information, supported data rates, etc. This ‘request-response’ interaction forms the first step in the connection process between these devices. The request frame has two parts: a MAC header that identifies the source device, and the frame body that contains information about the source device. As mentioned, the SmartStreetSensor collects some of the information available in the probe request frame relayed by the mobile devices, along with the time interval at which the request was collected and the number of such requests collected during that interval. The actual information present in the data collected by the SmartStreetSensor is shown in Table 6.1.

After the probe requests are collected, the MAC addresses in the data are hashed at the sensor level to preserve the privacy of the device owners and sent to LDC’s cloud storage. From there, through a secure channel, they are sent to the CDRC secure servers, where the formal translation of a probe request to footfall data is completed.
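The sensor-level hashing described above can be illustrated with a minimal sketch. The chapter does not state which hash function, salt or truncation the sensors use, so the salted SHA-256 digest, the function name `hash_mac` and the `salt` value below are assumptions made purely for illustration.

```python
import hashlib
import secrets


def hash_mac(mac: str, salt: bytes) -> str:
    """Return a salted SHA-256 hex digest of a MAC address.

    Assumption: the whole address is hashed with a secret salt. The
    sensors' real scheme is not documented in the chapter.
    """
    # Normalise formatting so "AA-BB-..." and "aa:bb:..." hash identically.
    normalised = mac.strip().lower().replace("-", ":")
    return hashlib.sha256(salt + normalised.encode("ascii")).hexdigest()


# Hypothetical secret salt. It must stay fixed for as long as repeated
# sightings of one device should map to the same hashed identifier.
salt = secrets.token_bytes(16)
print(hash_mac("AA:BB:CC:11:22:33", salt))
```

Because the salt is kept secret, the original address cannot be recovered by simply hashing every possible MAC address, while repeated sightings of one device still yield the same identifier.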

Table 6.1 Information collected by the SmartStreetSensor.

Field           Description
MAC address     The MAC address of the source device with last two digits hashed
Date            Date of the data collection
Time interval   5-minute time interval in which the data was captured
Sensor          Unique ID of the sensor that captured data
Packets         Number of times the device sent packets to the sensor
Signal          The signal strength reported by the source device

6.3.3
Metadata

From July 2015 to May 2017, there were 652 operational sensors installed across Great Britain. During this time the sensors logged in the order of 2.6 billion probe requests at a rate of 6 million new requests per day. The geographical distribution of these sensors per region is shown in Table 6.2.

6.4
From probe request to footfall counts

The probe requests received by the sensor are not a direct measure of footfall, so, in order to extract a meaningful indicator for footfall, the information present in the requests is validated through a series of steps described in this section. The prime sources of information we use to accomplish this transformation from the probe requests to footfall are:
1) The hashed/anonymised MAC address of the mobile device.
2) The time interval at which the probe request was collected.
3) The Received Signal Strength Indicator (RSSI) present in the probe request.

We carry out the transformation both internally, by looking at the patterns present in the data, and externally, by comparing them to data collected via field surveys. Before we validate the data we take into account the following:

Range of the sensor
Since the strength of the signal from a mobile device to the WiFi access point depends on various factors such as the distance between them, the nature and size of obstructions between them, interference from other electromagnetic devices etc., the exact delineation of the range is different for each and every sensor. We assume that the range of the sensor is equal in all directions and is linearly indicated by the RSSI reported by the mobile devices in range.

Probe request frequency
The frequency of probe requests generated by a device varies widely based on the manufacturer, operating system, state of the device and the number of access points already known to the device (Freudiger, 2015). These requests are also generated in short bursts rather than at regular intervals. Moreover, Android devices send probe requests even when the WiFi is turned off. With the large number of different devices available, it is impossible to predict and create a general model for this probing behaviour. For simplicity, we assume that for a probe request received with a MAC address with a known Organisationally Unique Identifier (OUI), there is a corresponding device present within the range of the sensor at that time interval, irrespective of the number of such requests received in the mentioned interval. Essentially, we are just looking for unique MAC addresses within a time period rather than the total number of requests made by them.
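The counting rule just described (one device per distinct MAC address per interval, regardless of how many requests it sent) can be sketched as follows. The record layout loosely mirrors Table 6.1; the `ProbeRecord` type, the `oui_known` flag and the function name are hypothetical and are not part of the project's actual pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class ProbeRecord(NamedTuple):
    mac_hash: str    # hashed MAC address of the source device
    sensor: str      # unique ID of the sensor
    interval: int    # index of the 5-minute interval in the day (0..287)
    packets: int     # number of probe requests seen in that interval
    oui_known: bool  # hypothetical flag: OUI registered with the IEEE


def unique_devices(records: Iterable[ProbeRecord]) -> Dict[Tuple[str, int], int]:
    """Count distinct known-OUI devices per (sensor, interval)."""
    seen: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for r in records:
        if r.oui_known:
            # The packet count is deliberately ignored: a burst of ten
            # requests from one device still counts as one device.
            seen[(r.sensor, r.interval)].add(r.mac_hash)
    return {key: len(macs) for key, macs in seen.items()}


records = [
    ProbeRecord("a1", "s1", 100, 12, True),
    ProbeRecord("a1", "s1", 100, 3, True),   # same device, same interval
    ProbeRecord("b2", "s1", 100, 1, True),
    ProbeRecord("c3", "s1", 100, 5, False),  # unknown OUI, skipped
]
print(unique_devices(records))
```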

                                             May 2017 – Basic footfall counts
Region                     Sensors   Percentage (%)   Average    Median
East Midlands                   40             6.13   210,380   178,411
East of England                 27             4.14   209,590   194,157
London                         219            33.58   415,120   332,059
North East                      16             2.45   271,600   246,210
North West                      60             9.20   306,490   250,180
Scotland                        78            11.96   196,120   138,935
South East                      65             9.96   307,370   212,813
South West                      54             8.28   171,440   149,400
Wales                           13             1.99   275,740   332,414
West Midlands                   22             3.37   367,840   271,200
Yorkshire and The Humber        58             8.89   248,100   212,282
Total                          652           100.00   270,890   212,813

Table 6.2 Regional distribution of the installed sensors in Great Britain.

MAC address collisions
From the initial analysis, we have observed that there are a few instances of MAC address collisions, where a device known to be in one place has been reported somewhere else. This might be occurring due to rogue MAC randomisation by certain devices and the hashing procedure being done at two different stages. Due to the negligible volume of such collisions (~0.01%), for the purpose of this research, we ignore these collisions and treat each distinct hashed MAC address with a known OUI as a single device.

MAC address randomisation
A MAC address has two parts. The first part is the OUI, which identifies the manufacturer of the wireless card, and the second part belongs to the device, which, along with the OUI, was originally designed to uniquely identify the device at the hardware level. But with the introduction of randomisation, there is an increased use of ‘unknown’ or ‘public’ addresses that are not registered with IEEE. Once these uncertainties are addressed, the next step is to clean the data received to transform them into a consistent indicator of activity, such as pedestrians at the location.

6.4.1
Internal validation

After the initial considerations we start transforming the data by standardising the time interval to 288 intervals of 5 minutes starting from 00:00. We do this by rounding the non-standard time intervals encountered to the nearest 5 minutes. We then aggregate the number of probe requests by their unique MAC address. This results in a count reduction of approximately 85% (Figure 6.1).

After this aggregation, we investigate the occurrence of MAC addresses over the entire day. In this example, we see that around 15% of MAC addresses repeat for more than 5 minutes. Since we are interested in pedestrian activity, we eliminate these long-dwelling MAC addresses and redo the count. The detailed account of the 5-minute interval count of probe requests collected through the day before and after cleaning is shown in Figure 6.1.

Finally, these devices randomly fail for short periods of time, leaving some intervals without any counts. At other times, retailers need to shut down the
times, retailers need to shut down the

device, which also leads to periods of zero counts until the device is switched back on. If such intervals are short (no more than half an hour), we can safely interpolate the counts to obtain better aggregated estimates of the daily counts. In practice, the estimated count c at time t is obtained by a simple linear interpolation:

$$c = c_1 + m(t - t_1) \tag{6.1}$$

where m is the slope and $c_1$ is the count at time $t_1$.

Figure 6.1 The total number of probe requests collected every 5-min interval vs the number of unique MAC addresses collected in the same interval vs the final count for the same interval after cleaning long dwelling devices. [Line chart; y-axis: Packets (0 to 1,200); x-axis: Time (minutes from 00:00); series: All packets, Unique MACs, Long dwellers removed.]

6.4.2
External validation

After addressing the aforementioned uncertainties and cleaning the data of unwanted probe requests, some measurement error will still remain. This error arises from several sources that cause sensors to undercount. The stores in which the sensors are deployed have different layouts and walls built from construction materials with different physical properties, and therefore have different WiFi propagation characteristics. As a result, the sensors will be more effective in some stores than others. The demographic characteristics of passers-by are also relevant in this context, since some population groups, such as the elderly, are less likely to have WiFi enabled phones.

The internally validated data are therefore externally validated against, and adjusted to, manual counts. The ratio between manual counts and internally validated (cleaned) sensor counts is known as the adjustment factor α:

$$\alpha = \frac{M}{\psi} \tag{6.2}$$

where α is the adjustment factor (which differs somewhat between weekdays and weekends), M is the number of passers-by counted manually on the street and ψ is the number of processed sensor counts.

Ideally, manual counts are taken in multiple periods throughout the day so that an average adjustment factor for that particular location can be derived. All the internally validated sensor counts are simply multiplied by α to provide the final estimate. This step is crucial in enabling the spatial comparability of flows measured at different locations.

Finally, it is important to mention that this method is quite sensitive to the way the manual counts are conducted, and it could lead to the omission of large groups of devices that are potentially important for a wider range of applications, such as measures of flows or measures of local activity, beyond the retail domain.
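The gap interpolation of Equation 6.1 and the adjustment of Equation 6.2 can be sketched together as below. This is an illustrative sketch rather than the project's code: missing intervals are represented as `None` (an assumption; the raw data may record them as zero counts), and all function names are hypothetical.

```python
from typing import List, Optional

MAX_GAP = 6  # six 5-minute intervals, i.e. "no more than half an hour"


def interpolate_short_gaps(counts: List[Optional[float]],
                           max_gap: int = MAX_GAP) -> List[Optional[float]]:
    """Fill short runs of missing counts linearly (Equation 6.1)."""
    filled = list(counts)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i                       # first missing interval of the run
        while i < n and filled[i] is None:
            i += 1
        end = i                         # first valid interval after the run
        # Only fill gaps bounded by valid counts on both sides.
        if start > 0 and end < n and (end - start) <= max_gap:
            t1, c1 = start - 1, filled[start - 1]
            m = (filled[end] - c1) / (end - t1)   # slope between the bounds
            for t in range(start, end):
                filled[t] = c1 + m * (t - t1)     # c = c1 + m (t - t1)
    return filled


def adjustment_factor(manual_count: float, sensor_count: float) -> float:
    """Equation 6.2: alpha = M / psi."""
    if sensor_count <= 0:
        raise ValueError("sensor count must be positive")
    return manual_count / sensor_count


def adjust(counts: List[Optional[float]], alpha: float) -> List[Optional[float]]:
    """Multiply every available cleaned count by alpha."""
    return [None if c is None else c * alpha for c in counts]


cleaned = interpolate_short_gaps([10, None, None, 40])
alpha = adjustment_factor(manual_count=120, sensor_count=80)
print(adjust(cleaned, alpha))
```

Gaps longer than `max_gap`, or gaps at the very start or end of the series, are left unfilled, in line with the text's restriction to short outages.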

At this point, once we have translated probe requests into footfall counts with a sufficient degree of certainty, we can start a proper analysis of the particular patterns generated at each location to compare trends and define different functional areas across different parts of the country. An example of this is presented in Section 6.5.

6.5
UK footfall index

One of the first analyses conducted, based on the validated footfall counts, was to look at the shift in footfall figures nationwide to establish seasonal peaks and troughs and ensure they reflect known trends. For example, footfall tends to rise in the run up to Christmas but falls during the first months of the year.

Two different indexes were therefore defined: the first to track seasonal trends in footfall, taking a particular month as a baseline, and the second to compare the change in footfall between two consecutive months. Both indexes try to detect major shifts and overall tendencies from one month to another at the national level, not to explain actual activity patterns.

For both, the counts were aggregated to each half-hour, removing those devices that were present for more than 5 minutes at every location and without applying any adjustment factor, as these indexes are more concerned with counting all the footfall activity around the sensors, and not only retail related activity.

Equation 6.3 measures the relative change in footfall from one month to another:

$$\text{Footfall index}(a,b) = \frac{b-a}{a} \times 100 \tag{6.3}$$

where b = Total footfall at month $M_b$, a = Total footfall at base month $M_a$, a≠b.

The major challenge was the actual construction of b and a, as i) the number of sensors is not the same between any two given months (at this stage of the project there are always more sensors in month $M_b$ than in month $M_a$); ii) a single sensor could be measuring H hours in month $M_a$ and K hours in month $M_b$, with K≠H; and iii) some sensors can be considered just as white noise, because they may have only a few valid measures within a particular month. These discrepancies make, in principle, these two months incomparable with each other.

To solve this, we proceed as follows:
1) Define

$$S^{d}_{M_{a,b}} = \sum_{i=1}^{H_{a,b}} h^{d}_{i}$$

where $H_{a,b}$ is the total number of half hours in months $M_{a,b}$, and $h^{d}_{i}$ is the half hour aggregated footfall count at sensor d at bin i. Put simply, $S^{d}_{M_{a,b}}$ is the sum of all the footfall in a single month at sensor d.
2) Calculate the theoretical probability distribution of all $S^{d}_{M_{a,b}}$ in a month.
a) Discard all sensors skewed to the left of the bulk of the distribution, i.e. those that are to the left of the standard deviation value. In other words, remove all sensors that didn’t work properly during months $M_{a,b}$.
b) For sensors skewed to the right, i.e. those that are two times above the standard deviation value, we firstly verify if their behaviour is the same across the previous few months or if the month in question was an anomalous one. If it is the former, we remove the counts, otherwise they are kept in.
3) With the remaining sensors, we define a and b as follows:

$$a = \frac{\sum_{i=1}^{H_a} h_i}{S_a}, \qquad b = \frac{\sum_{i=1}^{H_b} h_i}{S_b} \tag{6.4}$$

where $H_{a,b}$ is the total number of half hours in month $M_{a,b}$, $h_i$ is the half hour aggregated footfall count at bin i and $S_a$ and $S_b$ are the total number of sensors left after step 2.
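Equations 6.3 and 6.4 can be illustrated with a small worked sketch. It assumes the sensors have already been filtered as in step 2 (the filtering itself is not implemented here), and that each month is supplied as a mapping from a sensor ID to that sensor's monthly total, i.e. the quantity defined in step 1. The function names and the example figures are hypothetical.

```python
from typing import Dict


def per_sensor_footfall(sensor_totals: Dict[str, float]) -> float:
    """Equation 6.4: monthly footfall divided by the number of retained sensors.

    Summing the per-sensor monthly totals equals summing the half-hour
    bins h_i over the month, so the two formulations agree.
    """
    if not sensor_totals:
        raise ValueError("no sensors left after filtering")
    return sum(sensor_totals.values()) / len(sensor_totals)


def footfall_index(a: float, b: float) -> float:
    """Equation 6.3: percentage change from base month a to month b."""
    if a == 0:
        raise ValueError("base month footfall must be non-zero")
    return 100.0 * (b - a) / a


# Made-up monthly totals: the base month has two sensors, the later month three.
month_a = {"s1": 100.0, "s2": 200.0}
month_b = {"s1": 180.0, "s2": 180.0, "s3": 180.0}

a = per_sensor_footfall(month_a)   # 300 / 2 = 150
b = per_sensor_footfall(month_b)   # 540 / 3 = 180
print(footfall_index(a, b))        # 100 * (180 - 150) / 150
```

Dividing by the number of retained sensors is what keeps the later month, which has more sensors, from appearing busier simply because more locations were measured.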

Figure 6.2
Percentage change in
footfall over a 7-month
period with October 2016
as the base month.

Equation 6.4 captures the weighted counts at each month, which standardises the measures, making both months comparable. In the next section, we present the results obtained when a single month is set as base month, in this case, October 2016. The second index, where we compare the change in footfall between two given months, is explained in detail in the online supplementary information.

6.5.1
Footfall trends over long time periods

Defining October 2016 (with a net footfall of approximately 131 million) as the base month, we explored the percentage change in footfall across a 7-month period (Figure 6.2). November shows a marginal increase of 6% while December increases by almost 25%, which is expected due to the festive season. After this peak, in the first quarter of the year, footfall returns to the October levels, then there is an unusual increase in April 2017 (17%) before finally returning to the base month level in May. The April increase could be related to the Easter holiday period, but this is something still to be investigated.

Although both indexes were presented at national level, they can be disaggregated to, for example, retail centre level, to compare the corresponding flows between different retail areas.

6.5.2
Footfall trends over short time periods

In order to illustrate the differences in the volume of footfall across Central London, the validated sensor measurements were taken for the five-minute intervals of each day of the week over a period of ten weeks (9 January 2017 to 19 March 2017) for all the sensors for which data were available. The period was chosen to avoid holiday seasons (Christmas, Easter, summer) and Monday Bank Holidays, which would have influenced the usual weekday footfall volumes. The spatial variation of overall average five-minute footfall during the weekdays between 7am and 7pm in Central London is shown in Figure 6.3.

Areas well known for being busy include Soho (Central London) and Camden Town, as well as locations around some of the Tube and rail stations, with some notable examples labelled on the map (Victoria, Waterloo and Angel stations). The influence of station proximity is also seen on Edgware Road. Footfall around Edgware Road and Marble Arch Tube stations appears to be higher, while at the same

time sensors between them record lower and relatively consistent and spatially comparable footfall. On the other hand, stores situated in quieter side streets or less attractive areas show lower footfall, including areas that may be near main attractions but outside main corridors. Tooley Street is a good example, situated behind the far more crowded Thames path near Tower Bridge.

Figure 6.3 Average five-minute footfall in Central London during the working days (7am - 7pm) between 9 January and 19 March 2017. Source: Local Data Company (2017); Ordnance Survey Vector Map District (2017). [Map; symbol size shows average footfall per five minutes (legend classes 1, 10, 50, 100); scale bar 0 to 2 km; labelled places: Camden Town, Angel, Regent's Park, Bloomsbury, Edgware Road Station, Edgware Road, Holborn, Soho, Marble Arch Station, Piccadilly Circus, Waterloo, Hyde Park, Tooley Street, St James's Park, Victoria.]

While very important, assessment of the overall footfall may fall short of detecting some other interesting patterns of human activity, for example how a certain area of the city is being used by its residents, workers and visitors during characteristic time periods of the day and the week. In order to explore some of these differences in diurnal patterns, temporal profiles of three locations were constructed. Those locations were Holborn Station, Connaught Street (situated to the west of Edgware Road) and a pub in Tooley Street between London Bridge and Tower Bridge. Temporal patterns and volume of footfall differ among the three locations on multiple levels (Figure 6.4). First, overall volume is very high around Holborn Station and very low at the Connaught Street location. Second, general profiles differ, so that both Holborn and Tooley Street display three peaks (morning and afternoon rush hour and lunchtime), while Connaught Street has a less clear, noisier pattern, which could be owing to its low footfall. Finally, there are differences even between the profiles of Holborn and Tooley Street, with the latter experiencing a relatively higher PM rush hour peak.

In addition to exploring the spatio-temporal variations of footfall throughout a single day, weekly patterns are also of interest. In this case, weekend activity patterns were compared to the weekday activity patterns by dividing the average five-minute footfall on Saturdays and Sundays between 7am and 7pm by the average five-minute footfall on weekdays between 7am and 7pm as follows:

$$I_w = \frac{F_1(\text{Sat-Sun}, 7\text{-}19)}{F_2(\text{Mon-Fri}, 7\text{-}19)} \times 100 \tag{6.5}$$

where $I_w$ is the index of relative weekend daytime activity, $F_1$ is the average five-minute weekend footfall between 7am and 7pm and $F_2$ is the average five-minute weekday footfall between 7am and 7pm.
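As a hedged illustration of Equation 6.5, the sketch below computes the index for one location from timestamped five-minute footfall values. The input format (pairs of `datetime` and footfall), the function name and the example values are assumptions; the daytime window is taken as 07:00 up to, but not including, 19:00.

```python
from datetime import datetime
from statistics import mean
from typing import Iterable, Tuple


def weekend_activity_index(observations: Iterable[Tuple[datetime, float]]) -> float:
    """Equation 6.5: weekend daytime mean over weekday daytime mean, times 100."""
    weekend, weekday = [], []
    for timestamp, footfall in observations:
        if not 7 <= timestamp.hour < 19:
            continue  # outside the 7am to 7pm window
        # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        if timestamp.weekday() >= 5:
            weekend.append(footfall)
        else:
            weekday.append(footfall)
    if not weekend or not weekday:
        raise ValueError("need both weekend and weekday daytime observations")
    return mean(weekend) / mean(weekday) * 100


# Made-up values for a single sensor.
observations = [
    (datetime(2017, 1, 9, 12, 0), 100.0),   # Monday, daytime
    (datetime(2017, 1, 10, 8, 0), 100.0),   # Tuesday, daytime
    (datetime(2017, 1, 14, 12, 0), 150.0),  # Saturday, daytime
    (datetime(2017, 1, 14, 23, 0), 999.0),  # Saturday night, ignored
]
print(weekend_activity_index(observations))
```

Values above 100 indicate a location that is relatively busier at weekends, and values below 100 one that is relatively busier on weekdays.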


This kind of index does not necessarily tell us which areas get busiest during the weekend daytimes, but rather which areas have more pronounced daytime weekend activity relative to their daytime weekday activity.

As Figure 6.5 shows, Soho, Camden Town and a location south of Hyde Park record higher footfall during the weekends than during the weekdays, which can be attributed to their reputation as highly attractive tourist and/or recreational areas. The results for Camden Town indirectly suggest where exactly the main attractions of the area (Inverness Street Market, Buck Street Market and Camden Lock Market further north) are located. Relative weekend activity is higher north of the Camden Town underground station and it diminishes in a southerly direction, i.e. away from the markets. On the other hand, many areas appear to be much busier during the weekdays (Victoria, Waterloo, Tooley Street, etc.), where there are concentrations of workplaces, including the universities in Bloomsbury.

Figure 6.4 Temporal activity patterns at three locations in Central London on 16 January 2017. [Line chart; y-axis: Estimated footfall (per 5 min, log scale); x-axis: Hour; series: Connaught Street, Holborn Station, Tooley Street.]

6.6
Conclusion

WiFi sensors are a very rich source of continuous and up-to-date data which can be used for a variety of purposes involving human mobility in cities. These data are not free of measurement errors and associated issues and need to be handled by applying a complex validation methodology before drawing any practical conclusions. However, despite those challenges, they provide a good estimate of the footfall around the area where the sensor is located and are therefore a useful indicator of the potential conversion rates and revenue, which are essential factors to be considered in locational planning in many industries, especially retailing. One of the main advantages of WiFi sensors lies in their ability to measure flows of people on a rather fine scale, which in turn gives us the opportunity to assess the suitability of a microsite location for a particular business and its opening times. Case studies of spatio-temporal footfall patterns in Central London demonstrate that the impact of the microsite location is very significant and can be decisive, as some streets within otherwise busy areas may end up receiving relatively low footfall.

Figure 6.5 Index of relative weekend activity in Central London between 9 January and 19 March 2017. Source: Local Data Company (2017); Ordnance Survey Vector Map District (2017). [Map with a Camden Town inset (scale bar 0 to 200 m) labelling Buck Str. Market, Inverness Str. Market and Camden Town Station; main map scale bar 0 to 2 km; legend classes for the Index of Relative Weekend Activity: > 116, 101 - 116, 77 - 100, < 77; labelled places: Camden Town, Angel, Regent's Park, Bloomsbury, Edgware Road Station, Edgware Road, Holborn, Soho, Leicester Square, Marble Arch Station, Piccadilly Circus, Hyde Park, Waterloo, Tooley Street, St James's Park, Victoria.]

Further Reading

Barbera, M. V. et al. (2013). Signals from the crowd: Uncovering social relationships through smartphone probes. In Proceedings of the 2013 Conference on Internet Measurement. ACM (Association for Computing Machinery), pp. 265–276.

Cunche, M., Kaafar, M.-A. and Boreli, R. (2014). Linking wireless devices using information contained in Wi-Fi probe requests. Pervasive and Mobile Computing, 11, 56–69.

Freudiger, J. (2015). How talkative is your mobile device? An experimental study of Wi-Fi probe requests. In Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks. ACM, p. 8.

Hidalgo, C.A. and Rodriguez-Sickert, C. (2008). The dynamics of a mobile phone network. Physica A: Statistical Mechanics and its Applications, 387(12), 3017–3024.

Kobsa, A. (2014). User acceptance of footfall analytics with aggregated and anonymized mobile phone data. In Lecture Notes in Computer Science (Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 168–179. Springer International Publishing Switzerland 2014.

Louail, T. et al. (2014). From mobile phone data to the spatial structure of cities. Scientific Reports, 4, 1–14. doi:10.1038/srep05276

Schauer, L., Werner, M. and Marcus, P. (2014). Estimating crowd densities and pedestrian crowds using Wi-Fi and Bluetooth. Proceedings of the 11th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services. ICST (Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), pp. 171–177.

Steenbruggen, J. et al. (2013). Mobile phone data from GSM networks for traffic parameter and urban spatial pattern assessment: A review of applications and opportunities. GeoJournal, 78(2), 223–243.

Torrens, P. M. (2008). Wi-fi geographies. Annals of the Association of American Geographers, 98(1), 59–84.

Vazquez-Prokopec, G. M. et al. (2013). Using GPS technology to quantify human mobility, dynamic contacts and infectious disease dynamics in a resource-poor urban environment. PloS one, 8(4), e58802.

Acknowledgements

The authors would like to thank Local Data Company Ltd for providing, in partnership with CDRC, the SmartStreetSensor footfall data. The second and third authors’ PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.
7. The Geography of Online Retail Behaviour

Alexandros Alexiou, Dean Riddlesden and Alex Singleton

7.1 Introduction

The advancements of information and communications technologies (ICTs) in the last 30 years have brought fundamental changes to the way in which populations can communicate, work and interact with services. Arguably, the most significant advancement has been the arrival of the Internet, giving disparate populations the ability to connect and interact with one another without the constraints of distance. Since its inception in the mid-1980s, use of the Internet has become ubiquitous, engendering benefits for everyday life across multiple domains, such as communication, information access, education and entertainment.

Consumer behaviours have also changed as a result, with an increasing likelihood for Internet users to purchase online given the financial savings and added convenience it facilitates (Calderwood and Freathy, 2014; Beck and Rygl, 2015). The proliferation of online shopping has had a significant impact within the retail industry, especially since many products such as newspapers, magazines and music are now purchased digitally rather than physically. This has led to the demise or significant diversification of many retailers (Dholakia et al, 2010; Carlson et al, 2015; Verhoef et al, 2015). In terms of those products and services that cannot be digitised, widespread adoption of the Internet has increased competition, choice and access, and reduced prices for the typical consumer.

The associated impact of the Internet on traditional retailers brought a significant shift from bricks and mortar retailing to omni-channel digital stores as a means of adaptation. The extent to which this increases the resilience of physical stores to their online competitors has become a key theme in retail research (Wrigley and Dolega, 2011; Singleton and Dolega, 2015). Responses differ, but typically most large retailers will have invested in click-and-collect services to allow customers to order
online and collect at their convenience, or enhanced home delivery services.

Nevertheless, not all retail activities are affected equally. For consumers, the benefits of proximity still remain and clustering of economic activity prevails despite advances in ICTs (Nathan and Rosso, 2015). Furthermore, the importance of face-to-face communication is a considerable barrier constraining the use of the Internet in some circumstances (Kaufmann et al., 2003). Consumer demographics can also play an important role in the impact of online shopping on retail geography. For example, the extent to which localised populations are engaged with the Internet could be considered as an influential factor in the attractiveness and success of traditional retail centres (Dolega et al, 2016). In this setting it is important to consider the changes in consumer behaviours in more detail as well as the extent to which these may vary geographically and across life cycles.

Any exploration of online retail behaviour should include not only factors that differentiate access to and engagement with the Internet, but also quality of infrastructure, local population characteristics and contextual geography. For instance, there are spatial disparities in fixed-line broadband services, particularly as a result of the urban-rural dichotomy. Population densities seem to play an important role in the quality of broadband services since commercial providers are more likely to develop network infrastructure in densely populated areas to facilitate increased demand. Moreover, attributes pertaining to, inter alia, age, education and professional occupation have been considered as having links to the levels of engagement.

These complex sets of input data can be combined using multivariate classification techniques to produce nested typologies at the small-area level. The methodology, outlined in the following sections, involves the creation of a bespoke geodemographic classification system, using aggregate measures of Internet infrastructure, access, engagement and contextual information. This typology was created at the Lower Super Output Area (LSOA) Census geography level for England in the form of the Internet User Classification (IUC). The resulting classification is presented through a series of cluster summarisations and assessments, which describe the prevailing characteristics of each of the clusters identified. The IUC is currently openly available through the Consumer Data Research Centre (CDRC) data portal (data.cdrc.ac.uk/dataset/cdrc-2014-iuc-geodata-pack-england).

The IUC geodemographic system provides a basis for analysis regarding the characteristics of small area populations that can contribute to the wider field of research through the identification of socio-spatially differentiated patterns of Internet access and engagement. A use case of the IUC is presented here by examining the ‘e-resilience’ of retail centres in England. The analysis, detailed in Singleton et al (2016), evaluates the extent to which retail centres have spatially differentiated vulnerability to the impacts of online consumption. Retail centres are profiled by the IUC and demand factors are coupled with catchment models to create a composite index of exposure, engendering a remarkable geography of retail centres that are at high exposure to the effects of online retailing. Measures of exposure are then coupled with measures of supply vulnerability pertaining to the mix of stores within each retail catchment in order to create a composite e-resilience score.

7.2 Creating the Internet User Classification

Methodologically, building the IUC followed a conventional geodemographic approach, as presented in Harris et al (2005).
Generally, the steps required to create a classification include selecting the appropriate classification scale, selecting the input variables, preparing the data (e.g. variable transformations or weighting), applying a clustering method and finally interpreting results (clusters). Due to data availability, the IUC was built for England at the LSOA level. LSOA geography is the second most granular Census geography available, comprising 32,844 zones of between 1,000 and 3,000 people or 400 and 1,200 households. The majority of data under consideration were available at the Great Britain level, albeit that those datasets available for England were more robust. Furthermore, the nature of these geographies in Scotland and Wales varies significantly compared to England (e.g. in terms of the characteristics of rural areas) and so the decision was taken to exclude Scotland and Wales from the analysis.

Selecting the appropriate variables to be used in the classification, however, can be more challenging. The multi-dimensionality of the IUC is important; a wide range of spatially referenced input measures is essential to the success of the classification, similar to how geodemographics typically include a plethora of socio-economic attributes in order to represent neighbourhoods. If combined and summarised effectively, meaningful measures could represent a typology of Internet use and engagement. Broadly, these datasets should cover several domains, such as those listed in Table 7.1.

Domain                                               Description
Infrastructure                                       Fixed-line household infrastructure access and broadband Internet performance.
Mobile phones                                        Mobile access, connectivity and usage.
Perceptions                                          People’s attitudes and perceptions about the use and utility of the Internet.
Access patterns                                      Information on Internet access patterns, e.g. only at home, while travelling, through a mobile, etc.
Commercial applications                              Information on the use of commercial applications such as online shopping, online banking and online bill payments.
User population                                      Current Internet users, ex-users and non-users.
Demographics and attributes of contextual geography  Demographic attributes such as age, education and occupation and attributes of contextual geography such as rurality, population density, etc.

Table 7.1 Variable domains used in the IUC.

An important source of data forming input to the IUC was the Oxford Internet Survey (OxIS), which was launched by the Oxford Internet Institute in 2003. The survey, conducted every two years, is carried out by interview using a probability sample of around 2,000 people in Great Britain, enabling comparisons over time (more details can be found on the OxIS website, available at: oxis.oii.ox.ac.uk/research/methodology/). For the creation of the IUC, the 2013 study was used. The OxIS covers a broad range of topics regarding people’s perception of the Internet; given the vast number of questions that were available for analysis (there are over 500 potential lines of enquiry), it was necessary to identify a smaller subset of questions relating to key dimensions of Internet use, behaviours and attitudes.

The sample used for the 2013 OxIS is representative of the UK population, but it is too small to capture the full breadth of the survey at finer geographic scales. As such, a method for synthetic data estimation was implemented to extrapolate the survey results to national small area coverage. Projection of the survey results was carried out using a Small Area Estimation (SAE) technique. SAE was applied to each question and generated a predicted response rate at the Output Area
(OA) Census geography. The estimations are ‘indirect’, in that they borrow strength by using values of the variables of interest from related areas through a model that provides that link using secondary data, such as Census counts and administrative records (Rao, 2003). In the most basic sense, it is possible to predict results for unsampled areas by using data from sampled areas. For instance, one can profile the relationship between age structure and Internet usage and subsequently use the results to predict rates for an unsampled geography where no survey data are available, but the age structure is known (e.g. from a recent Census of population).

In practice, however, the process was more complex. Firstly, the required explanatory variables for each survey question should be identified. Predictor variables explored at this stage were based on those factors known to influence Internet use and behaviour that were identified in relevant literature, namely age (Rice and Katz, 2003; Warf, 2013), socio-economic status (Silver, 2014), ethnicity (Wilson et al., 2003), gender (Prieger and Hu, 2008), rurality (Warren, 2007), education (Helsper and Eynon, 2010) and Internet connectivity (Riddlesden and Singleton, 2014). There are a number of techniques that can be used to identify those attributes with the highest influence; in this case, a decision tree algorithm was implemented, specifically the Quick, Unbiased, Efficient Statistical Tree (QUEST) algorithm, in order to identify the relationship of those external attributes to response rates. Decision tree algorithms are commonly used in data mining and seem to perform well compared to, e.g., ecological regression analysis, which in this case provided poor results.

In total, 42 OxIS questions were selected, covering each of the 171,372 OAs in England and Wales. The described model outputs a series of rates for population sub-groups, which were then fitted to OA geography by examining the distribution of these population sub-groups within each OA nationally. Essentially, an OA rate for each question is a weighted average derived from all the population sub-groups present within it.

Two tests were carried out in order to validate estimated results. The first compared the average deviation between mean rates of the estimated data at the national level and mean rates of the original OxIS sample. The average difference was <0.1%, which suggests the estimated dataset is broadly representative, as national means are comparable to those of the original data. Furthermore, comparing distributions showed that the estimation method is not skewing the output such that it is unrepresentative of the sample it was built from. Vastly different average rates between the estimated and original data at this stage would have flagged potential problems with estimation methods.

The next stage in validation involved profiling responses geographically, to examine if variability pertained to patterns that would be expected. An external dataset was used for this purpose; each of the OA response estimates was profiled using the 2011 Output Area Classification (OAC), an open geodemographic system (available for download through the CDRC portal at: data.cdrc.ac.uk/dataset/cdrc-2011-oac-geodata-pack-uk), to ensure that the propensity for certain responses (e.g. engagement with online shopping) was in line with what would be expected given the general demographic profile of the clusters. For instance, the national average response rate of question QC30b: Buying Online for those who responded as ‘frequently’ (i.e. buying online at least monthly) is 53.5% of all Internet users. Figure 7.1 shows the deviation of frequent users from the national mean by OAC profiles.

Profiling response rates by OAC revealed significant correspondence between socio-spatial groups and prevailing levels of engagement with different domains of the Internet. In most cases, groups with
Figure 7.1 Deviation from the national average of response rates to the question QC30b: Engagement to Online Shopping, for those identified as ‘frequent’ users, by OAC Group.
[Bar chart. Vertical axis: % Difference From National Average (−12.5 to 2.5). Horizontal axis: Output Area Classification 2011, Group Level (1a: Farming communities to 8d: Migration and churn).]
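
The profiling behind Figure 7.1 can be illustrated with a minimal sketch. It assumes that each OA record carries an OAC Group label, a count of Internet users and an estimated response rate, and that the plotted quantity is a percentage-point difference from the national rate of 53.5% quoted in the text; the record layout, the function name and the weighting by Internet users are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

NATIONAL_RATE = 53.5  # % of Internet users buying online at least monthly (from the text)


def profile_by_group(oa_records, national_rate=NATIONAL_RATE):
    """Return each group's deviation from the national rate, in percentage points.

    oa_records: iterable of (oac_group, internet_users, estimated_rate_percent).
    Group rates are user-weighted means of the OA estimates (an assumption).
    """
    users = defaultdict(float)
    weighted_rate = defaultdict(float)
    for group, n_users, rate in oa_records:
        users[group] += n_users
        weighted_rate[group] += n_users * rate
    return {g: weighted_rate[g] / users[g] - national_rate for g in users}


# Hypothetical OA estimates, for illustration only.
example_records = [
    ("2a", 100, 60.0),
    ("2a", 300, 56.0),
    ("1c", 200, 45.0),
    ("1c", 200, 50.0),
]
deviations = profile_by_group(example_records)
```

Groups with positive values are more engaged than the national average, and groups with negative values less so.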

younger, urban populations and professional backgrounds displayed the highest levels of engagement with most aspects of the Internet. In contrast, those groups whose constituent populations are elderly, constrained by deprivation, or working in employment sectors that would not require significant exposure to ICT, were generally less engaged across multiple domains. Interestingly, rural constituencies displayed more mixed engagement characteristics, with higher engagement rates in some domains.

The next step of the analysis was to use estimated rates for the set of OxIS questions selected and aggregate them from OA to LSOA level. The 40 attributes provide small-area information about Internet users, information seeking, perception of the Internet, household and mobile access, access patterns and commercial applications. Finally, a range of socio-demographic indicators from the 2011 Census was collated, such as education level, employment sector, prevalence of full-time students, age structure and population density, in addition to attributes regarding local infrastructure. All the variables considered as inputs to the classification were evaluated for their discrimination potential and, where possible, were limited to those without strong correlation, as such effects can overly influence cluster assignments.

A total of 75 measures were considered for the classification, which were compiled at the aggregate level of the LSOA geography for England and Wales (Table 7.2).

Once the dataset has been assembled, the next step of the analysis is to consider transformation and normalisation procedures. In this case the classification was built using the naturally observed attribute distributions; however, values were transformed to z-scores to ensure that all variables are ascribed equal weighting in the model. Alexiou and Singleton (2015) provide further details on the normalisation and standardisation techniques commonly used in geodemographic analysis.

After input measures were assembled, a geodemographic classification was created
Domain                         Variables  Data Source
Age                            15         Census 2011
Qualifications                 7          Census 2011
Occupation                     9          Census 2011
Density                        1          Census 2011
HE Student                     1          Census 2011
Commerce, Business and Retail  12         OxIS
Mobile Usage                   8          OxIS
Engagement Attitude            7          OxIS
Access and Connectivity        13         OxIS
Infrastructure                 2          Ofcom; Broadbandspeedchecker

Table 7.2 Number of variables per domain used in the IUC, and data sources.
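
The z-score transformation applied to the measures in Table 7.2 can be sketched as follows. This is a minimal illustration, assuming one row per LSOA and one column per input measure; the use of the population standard deviation and the handling of constant columns are assumptions, as the text does not specify them.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardise each column to mean 0 and standard deviation 1.

    rows: list of equal-length numeric lists, one per LSOA.
    Constant columns (standard deviation 0) are set to 0.0, an illustrative choice.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Hypothetical inputs: population density and a response rate, on very different scales.
example_rows = [[10.0, 0.2], [20.0, 0.4], [30.0, 0.6]]
standardised = zscore_columns(example_rows)
```

After standardisation both columns contribute on the same scale to the distance calculations used in clustering, which is the sense in which variables are given equal weighting.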

using a clustering algorithm. A common clustering technique used in geodemographic analyses is the iterative allocation–reallocation algorithm, known as K-means. The algorithm aims to assign observations (in this case LSOAs) into a predetermined number of clusters, based on their similarity across the full range of input attributes. Results were evaluated in a number of ways before selecting the optimum number of clusters K. This was carried out through a combination of statistics regarding the sum of squared distances within each cluster, plots showing the configuration of clusters during iterations, mapping cluster assignments and empirical testing. The classification procedure is firstly applied in order to create an initial ‘coarse’ tier referred to as ‘Supergroups’ and then re-applied within each cluster to form a second nested ‘Group’ level. The final classification formed a two-tier hierarchy of 4 Supergroups which are further classified into a total of 11 distinct Groups.

The final stage of the geodemographic analysis required the interpretation and labelling of cluster results. Interpretation includes looking at the cluster centres in order to identify the ‘profile’ of each cluster

Supergroup                       Group
1: E-unengaged                   1a: Too Old to Engage
                                 1b: E-marginals: Not a Necessity
                                 1c: E-marginals: Opt Out
2: E-professionals and students  2a: Next Generation Users
                                 2b: Totally Connected
                                 2c: Students Online
3: Typical trends                3a: Uncommitted and Casual Users
                                 3b: Young and Mobile
4: E-rural and fringe            4a: E-fringe
                                 4b: Constrained by Infrastructure
                                 4c: Low Density but High Connectivity

Table 7.3 The structure and class labels of the IUC.
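
The nested structure in Table 7.3 results from applying K-means once to form Supergroups and again within each Supergroup to form Groups. A minimal sketch of that two-tier procedure is given below. It is a plain K-means with random initial centres, not the software used for the IUC, and it simplifies by using the same number of Groups within every Supergroup, whereas the IUC has between two and three Groups per Supergroup. The function names and the seed are illustrative.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means (iterative allocation-reallocation); returns one label per point."""
    points = [tuple(p) for p in points]
    centres = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Allocation: assign every point to its nearest centre (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        # Reallocation: move each centre to the mean of the points assigned to it.
        new_centres = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centres.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centres.append(centres[c])  # keep the centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def two_tier(points, k_super, k_group):
    """Cluster into Supergroups, then re-cluster within each to form nested Groups.

    Returns one (supergroup, group) pair per point.
    """
    super_labels = kmeans(points, k_super)
    nested = [None] * len(points)
    for s in range(k_super):
        members = [i for i, label in enumerate(super_labels) if label == s]
        if not members:
            continue
        sub_labels = kmeans([points[i] for i in members], min(k_group, len(members)))
        for i, g in zip(members, sub_labels):
            nested[i] = (s, g)
    return nested
```

In practice the inputs would be the standardised LSOA measures, and the choice of K at each tier would be informed by the within-cluster sum of squares, mapping and empirical testing described above.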
based on the values of the input attributes (usually through radial plots) and mapping results for visual analysis. These outputs informed the Supergroup and Group names (Table 7.3) as well as the ‘Pen Portraits’. These describe the typical characteristics of the areas included in each of the clusters, while also considering their variability between clusters. The complete pen portraits can be found in Singleton et al (2016).

Along with pen portraits, a series of maps is essential in order to reveal the spatial structure that emerges from the classification. Figure 7.2 demonstrates the resulting Group typology for the Greater London Region. The map clearly shows the differentiation between central London and the periphery, with the centre occupied by the highly engaged Supergroup 2 classes, such as 2a: Next Generation Users and 2b: Totally Connected. Group 3b: Young and Mobile clearly forms several clusters in East and West London, while the periphery is mostly identified by less engaged populations.

Figure 7.2 The Greater London Region by IUC Group.
[Map of the London Region coloured by Internet User Classification Group. Legend: Too Old to Engage; E-marginals: Not a Necessity; E-marginals: Opt Out; Next Generation Users; Totally Connected; Students Online; Uncommitted and Casual Users; Young and Mobile; E-fringe; Constrained by Infrastructure; Low Density but High Connectivity. Scale bar: 0 to 10 km.]

Potential uses of the IUC are broad, and fields of use may include data profiling, online survey stratification, targeted marketing, location planning, customer insight, and public policy formation and delivery. Such a classification is particularly useful in the commercial sector, as the IUC could be used in the profiling of existing customer databases to identify trends, assisting in the development of targeted marketing strategies. This may be valuable for businesses that operate online, or are interested in the aggregate Internet engagement characteristics of their customer base.

The IUC is an open product that is offered through the CDRC data portal (available for download at: data.cdrc.ac.uk/dataset/cdrc-2014-iuc-geodata-pack-england).
Furthermore, an interactive map of the classification is available on the CDRC website (maps.cdrc.ac.uk/#/geodemographics/iuc14/).

7.3 e-Resilience and the online geography of retail centres

Online shopping impacts upon retail centres in complex ways, often referred to in the literature as a ‘slow burn’ (Pendall et al, 2010). UK Government initiatives aimed at revitalisation of British high streets highlight the importance of digital technology in redefining traditional retail spaces (Digital High Street Advisory Board, 2015). In this framework, it is important to study the impacts that online shopping has on the structure of traditional high streets at a more granular level. For instance, in the UK a number of national retailers such as Borders, Zavvi, Jessops and Game have either entirely withdrawn or substantially limited their physical retail offerings within the past few years, while some other major retailers such as John Lewis, Next, Boots or Argos have successfully embraced new technologies through opening click-and-collect points, or by developing mobile applications (Turner and Gardner, 2014).

Despite evidence to suggest that factors impacting decisions about whether or not to shop online are linked to demographic and socioeconomic characteristics of populations (Longley and Singleton, 2009), there is limited knowledge about the geography of online sales (Forman et al, 2008). This study explored these challenges through the concept of ‘e-resilience’, which provided both the theoretical and methodological framework for assessing the vulnerability of retail centres to the effects of rapidly growing Internet sales, balancing characteristics of both supply and demand. E-resilience defines the vulnerability of retail centres to the effects of growing Internet sales, and estimates the likelihood that their existing infrastructure, functions and ownership will govern the extent to which they can adapt to or accommodate these changes. Essentially, e-resilience can be expressed as a balance between the propensity of localised populations to engage with online retailing and the physical retail provision and mix that might increase or constrain these effects, as not all retail categories would be equally impacted.

Measuring the vulnerability of competing retail destinations to consumers of differential Internet engagement characteristics requires an understanding of the location and geographic extent of retail centres, combined with some assessment of their composition and size. A nationally expansive record of the location, occupancy and fascia of UK retail stores is generated by the Local Data Company (London, UK), a commercial organisation that employs a large survey team to collect these data on a rolling basis. A national extract for February 2014 was made available for this research, with each record comprising the location of a retail premises with latitude and longitude coordinates, retail category and details of the current occupier. The dataset is currently available through CDRC (data.cdrc.ac.uk/dataset/local-data-company-retail-unit-address-data) with permission.

Retail unit data were used to calculate a series of measures which were identified in relevant literature to influence propensity to online shopping, e.g. physical store attractiveness or retail category vulnerability, calculated as the level of risk of the main product switching from physical to online offering channels. A composite of these measures forms a ‘supply vulnerability index’. Input measures to this index included the weighted percentage of anchor stores (Damian et al, 2011), i.e. the top 20 most attractive stores as presented by Wrigley and Dolega (2011), and leisure outlets (Reimers and Clulow, 2009), as opposed to the prevalence of ‘digitalisation retail’, such as newsagents, booksellers, computer
games and home entertainment, video and music stores etc.

As such, higher proportions of ‘digitalisation retail’ are associated with enhanced vulnerability of retail centres, whereas higher proportions of anchor store and leisure units indicated greater resilience. An index was then generated for each retail centre by computing a z-score for each variable and averaging these for each centre. The final score, referred to as the ‘supply vulnerability index’, was scaled between 1 (lowest vulnerability) and 100 (highest vulnerability).

While the above index portrays the impact of online retail on the supply side, there is still an impact that can be attributed to the demand side. For instance, retail clusters that are within close proximity of young, professional populations would be more vulnerable to the effects of online retail, as these populations have higher propensity to shop online; this kind of information can be obtained by means of the IUC.

Estimating the exposure of retail centres to populations who are active Internet users as defined through the IUC required a method of modelling consumer flows to probable retail destinations. There is a long history of well-developed literature on the ways in which such supply and demand for retail centres can be reconciled through catchment area estimation (Birkin et al, 2002; Birkin et al, 2010; Wood and Reynolds, 2012). These techniques range in sophistication, from calculating the distances that consumers are willing to travel to a retail centre in a given time (Grewal et al, 2012), through to more complex mathematical models calibrated on the basis of how attractive different retail offerings are to proximal consumers (Newing et al, 2015).

This latter group of models was adopted, which typically makes assumptions that larger towns with more extensive retail and leisure offerings are more attractive, but these effects decay with distance. Specifically, catchments were estimated using a bespoke Huff model (Huff, 1964) which uses town centre composition and vacancy to produce allocated catchments through a distance decay function. The function was calibrated using road network distance and retail centre morphology.

Catchment areas were assigned using LSOAs as the spatial unit of analysis. Once catchments had been established, exposure to online shopping was calculated by overlaying the IUC group typology (presented earlier) and extracting their profiles based on the proportions of the IUC populations identified within. An example of a catchment profile for the Milton Keynes retail centre, located north of London, is shown in Figure 7.3.

Figure 7.3 IUC profiles for the catchment area of Central Milton Keynes retail centre.
[Bar chart. Vertical axis: Difference from National Average (bars range from −11% to 23%). Horizontal axis: IUC Group.]

Since each group has a different propensity for online shopping, the attribute mean of the OxIS variable ‘Frequently Shopping for Products and Services Online’ was extracted for every IUC Group. As such, the deviation of the catchment population’s propensity to shop online compared to the national average (53%) was obtained. This
score, calculated for each retail centre catchment and scaled between 1 and 100, formed the ‘index of high exposure’.

The index of high exposure indicates a rather distinct spatial pattern; secondary and tertiary retail centres located in more rural areas, including the satellite centres of more urbanised areas, have predominantly the greatest exposure to the impacts of online sales. This trend is reiterated for other parts of the country, although the majority of the highly exposed retail centres can be found within the South East. Moreover, based on those attractiveness scores that fed into the catchment model, it is worth noting that none of the highly exposed centres were drawn from the larger, most attractive centres, unlike the fortunes of many of the surrounding smaller towns and local shopping centres.

The index of high exposure and the supply vulnerability index were then combined to ascribe a measure of e-resilience to each individual retail centre. The indices were summed, and then the final score was scaled into the range 1 to 100. The complete methodology for the creation of the e-resilience indicator can be simply represented by a flow diagram, as shown in Figure 7.4.

Figure 7.4 The model used to calculate e-resilience scores.
[Flow diagram.
INPUT DATA: OxIS Survey, 2011 Census, Internet Infrastructure; LDC Retail Data, Town Centre Boundaries, Street Network.
STAGE 1: Internet User Classification (IUC) (K-means Clustering); Retail Catchments (Bespoke Huff Model).
STAGE 2: Index of Exposure to Online Shopping; Index of Supply Vulnerability.
STAGE 3: e-Resilience Score.]

Tables 7.4 and 7.5 summarise the 10 most and least e-resilient retail centres. The most attractive retail centres, namely those in larger urban areas such as Greater London, Birmingham or Manchester, demonstrated the highest levels of e-resilience, followed by the small, local centres. Conversely, the least e-resilient centres were predominantly located in the suburban and rural areas of South East England, and to a lesser degree around other major conurbations of the country. Typically, these were the secondary and medium-sized centres, often referred to as ‘Clone Towns’ (Ryan-Collins et al, 2010). It could be argued that this is largely intertwined with the geography of Internet shopping, where customers in more remote locations, typically faced with poorer retail provision, have displayed a higher propensity for online shopping.

These findings can be associated with a polarisation effect, implying that large and attractive centres function as hubs for higher volumes of comparison shopping and leisure, whereas the small local centres provide everyday convenience shopping. However, the mid-sized centres have a less clear function. Combining such effects with higher exposure to online sales due to the local population mix, these retailers may be increasingly faced with considerable challenges, such as how to diversify their store portfolio, downsize or move to other locations.

7.4 Conclusion

The growth of Internet sales is increasingly viewed as one of the most important forces currently shaping the evolving structure of retail centres (Wrigley and Lambiri, 2014; Hart and Laing, 2014). Although current research does not suggest a death of physical space, the consequences for traditional high streets remain unclear
7. The Geography of Online Retail Behaviour 107

Town centre                              Region                      e-Resilience Score

Boughton                                 East Midlands               100
Ravenside Retail Park, Bexhill-on-Sea    South East                  97.58
Corbridge                                North East                  93.27
Torport                                  South West                  71.61
Hersham                                  South East                  70.29
Halton, Leeds                            Yorkshire and the Humber    69.29
Cinderford                               South West                  68.51
Marsh Road, Luton                        East of England             67.01
South Molton                             South West                  65.41
Parkgate Retail World                    Yorkshire and the Humber    64.37

Table 7.4
The 10 most e-resilient town centres identified within England.
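
The scoring step behind these tables (two indices summed, then scaled into the range 1 to 100) can be illustrated with a minimal sketch. The chapter does not state the scaling formula, so a linear min-max rescaling is assumed here; the function name, the input dictionaries and the assumption that a higher summed value maps to a higher score are all hypothetical.

```python
def e_resilience_scores(exposure, vulnerability, lo=1.0, hi=100.0):
    """Sum two per-centre indices and rescale the totals linearly into [lo, hi].

    `exposure` and `vulnerability` map a retail centre name to an index value.
    Assumption (not from the source): both indices are oriented so that a
    higher summed value corresponds to a higher e-resilience score.
    """
    totals = {name: exposure[name] + vulnerability[name] for name in exposure}
    t_min, t_max = min(totals.values()), max(totals.values())
    if t_max == t_min:
        # Degenerate case: every centre has the same total.
        return {name: lo for name in totals}
    span = t_max - t_min
    return {name: lo + (hi - lo) * (t - t_min) / span for name, t in totals.items()}


# Illustrative values only, not data from the study.
scores = e_resilience_scores(
    {"A": 1, "B": 2, "C": 4},
    {"A": 1, "B": 2, "C": 2},
)
```

Under this assumption the lowest-scoring centre receives exactly 1 and the highest exactly 100, which is consistent with the extreme values shown in the tables.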

Town centre                              Region                      e-Resilience Score

Rochford                                 East of England             1
London Road, Leigh-on-Sea                East of England             15.61
North Seaton Industrial Estate           North East                  16.86
Whalley                                  North West                  17.2
Oxted                                    South East                  17.25
Barnt Green                              West Midlands               17.39
Eccleshall                               West Midlands               17.39
Hurstpierpoint                           South East                  17.98
Botley Road, Oxford                      South East                  18.14
Woburn Sands                             South East                  18.52

Table 7.5
The 10 least e-resilient town centres identified within England.

as knowledge about the geography and drivers of Internet shopping is still limited. This study explores some aspects of online retail behaviour, particularly the nature and impact that Internet user behaviour is having on retail centres nationally.

The analysis of the geography of online retail provides unique insights into the apparent diversity of population groups with regard to online shopping, and into the role and future of town centres at the national scale. Certainly, one of the most influential factors is the behavioural component: whether or not to use the Internet for a given activity. The study highlighted that such decisions are influenced both by demographic effects, mainly age and socioeconomic status, and by local retail supply, including ‘softer’ factors such as convenience and accessibility.
108 CONSUMER DATA RESEARCH: PART TWO

In this framework, the geography of online behaviour can be based on the IUC. The IUC is an open, freely available product that provides comprehensive summary measures of the complexities between behaviour, infrastructure and context at the small-area level. It can be viewed as a tool for both the private and public sector. For example, the 2021 Census in the UK will largely be completed online and so the IUC can assist in highlighting areas where low response rates are likely.

A further application of the IUC is its contribution to the e-resilience indicator. The distribution of e-resilience measures revealed a geography where attractive and large retail centres such as the inner cores of large metropolitan areas, along with smaller, specialised centres, were highlighted as more resilient, while many secondary and medium-sized centres were identified as most vulnerable. One of the most defining contributions of this approach is that it provides a comprehensive classification of all retail centres based on their e-resilience levels, a resource that can be used by a wide range of stakeholders including academics, retailers and town centre managers, and can inform policy decisions.

Further Reading

Alexiou, A. and Singleton, A. D. (2015). Geodemographic analysis. In Singleton, A. D. and Brunsdon, C. (Eds.), Geocomputation: A practical primer, pp. 137–151. London: Sage.

Beck, N. and Rygl, D. (2015). Categorization of multiple channel retailing in multi-, cross-, and omni-channel retailing for retailers and retailing. Journal of Retailing and Consumer Services, 27, 170–178.

Birkin, M., Clarke, G. and Clarke, M. (2002). Retail Geography and Intelligent Network Planning. Chichester, NY: Wiley.

Birkin, M., Clarke, G. and Clarke, M. (2010). Refining and operationalizing entropy-maximizing models for business applications. Geographical Analysis, 42(4), 422–445.

Calderwood, E. and Freathy, P. (2014). Consumer mobility in the Scottish isles: The impact of internet adoption upon retail travel patterns. Transportation Research Part A: Policy and Practice, 59, 192–203.

Carlson, J., O’Cass, A. and Ahrholdt, D. (2015). Assessing customers’ perceived value of the online channel of multichannel retailers: A two country examination. Journal of Retailing and Consumer Services, 27, 90–102.

Damian, D., Curto, J. and Pinto, J. (2011). The impact of anchor stores on the performance of shopping centres: The case of Sonae Sierra. International Journal of Retail & Distribution Management, 39(6), 456–475.

Dholakia, U. M., Kahn, B. E., Reeves, R., Rindfleisch, A., Stewart, D. and Taylor, E. (2010). Consumer behavior in a multichannel, multimedia retailing environment. Journal of Interactive Marketing, 24(2), 86–95. Special Issue on Emerging Perspectives on Marketing in a Multichannel and Multimedia Retailing Environment.

Digital High Street Advisory Board (2015). Digital High Street 2020. thegreatbritishhighstreet.co.uk/digital-high-street-report-2020. Accessed 15 June 2015.

Doherty, N. and Ellis-Chadwick, F. (2010). Internet retailing: The past the present and the future. International Journal of Retail & Distribution Management, 38(11/12), 943–965.

Dolega, L., Pavlis, M. and Singleton, A. (2016). Estimating attractiveness, hierarchy and catchment area extents for a national set of retail centre agglomerations. Journal of Retailing and Consumer Services, 28, 78–90.

Forman, C., Ghose, A. and Goldfarb, A. (2008). Competition between local and electronic markets: how the benefit of buying online depends on where you live. Management Science, 55(1), 47–57.

Grewal, D., Kopalle, P., Marmorstein, H. and Roggeveen, A. (2012). Does travel time to stores matter? The role of merchandise availability. Journal of Retailing, 88(3), 437–444.

Harris, R., Sleight, P. and Webber, R. (2005). Geodemographics, GIS, and Neighbourhood Targeting. Chichester: John Wiley and Sons.

Hart, C. and Laing, A. (2014). The consumer journey through the high street in the digital area. In Evolving High Streets: Resilience and Reinvention - Perspectives from Social Science, pp. 36–39. University of Southampton, Southampton.

Helsper, E. and Eynon, R. (2010). Digital natives: Where is the evidence? British Educational Research Journal, 36(3), 503–520.

Huff, D. L. (1964). Defining and estimating a trade area. Journal of Marketing, 28(3), 34–38.

Kaufmann, A., Lehner, P. and Todtling, F. (2003). Effects of the internet on the spatial structure of innovation networks. Information Economics and Policy, 15(3), 402–424.

Longley, P. A. and Singleton, A. D. (2009). Classification through consultation: Public views of the geography of the e-society. International Journal of Geographical Information Science, 23(6), 737–763.

Nathan, M. and Rosso, A. (2015). Mapping digital businesses with big data: Some early findings from the UK. Research Policy, 44(9), 1714–1733. The New Data Frontier.

Newing, A., Clarke, G. and Clarke, M. (2015). Developing and applying a disaggregated retail location model with extended retail demand estimations. Geographical Analysis, 47(3), 219–239.

Pendall, R., Foster, K. and Cowell, M. (2010). Resilience and regions: Building understanding of the metaphor. Cambridge Journal of Regions Economy and Society, 3(1), 71–84.

Prieger, J. E. and Hu, W. M. (2008). The broadband digital divide and the nexus of race, competition, and quality. Information Economics and Policy, 20(2), 150–167.

Rao, J. (2003). Small Area Estimation. Wiley series in survey methodology. Hoboken, NJ: John Wiley.

Reimers, V. and Clulow, V. (2009). Retail centres: It’s time to make them convenient. International Journal of Retail & Distribution Management, 37(7), 541–562.

Rice, R. E. and Katz, J. E. (2003). Comparing internet and mobile phone usage: Digital divides of usage, adoption, and dropouts. Telecommunications Policy, 27(8), 597–623.

Riddlesden, D. and Singleton, A. D. (2014). Broadband speed equity: A new digital divide? Applied Geography, 52(0), 25–33.

Ryan-Collins, J., Cox, E., Potts, R., and Squires, P. (2010). Re-imagining the High Street: Escape from Clone Town Britain. London: New Economics Foundation.

Silver, M. (2014). Socio-economic status over the lifecourse and internet use in older adulthood. Ageing and Society, 34, 1019–1034.

Singleton, A. D. and Dolega, L. (2015). The e-resilience of UK town centres. In Evolving High Streets: Resilience & Reinvention, Perspectives from Social Science, pp. 40–43. Economic and Social Research Council.

Singleton, A. D., Dolega, D., Riddlesden, D. and Longley, P. A. (2016). Measuring the spatial vulnerability of retail centres to online consumption through a framework of e-resilience. Geoforum, 69(1), 5–18.

Turner, J. and Gardner, T. (2014). Critical reflections on the decline of the UK high street: Exploratory conceptual research into the role of the service encounter. In Handbook of Research on Retailer-Consumer Relationship Development, pp. 127–151. Hershey, PA: IGI Global.

Verhoef, P. C., Kannan, P. and Inman, J. J. (2015). From multi-channel retailing to omni-channel retailing: Introduction to the special issue on multi-channel retailing. Journal of Retailing, 91(2), 174–181.

Warf, B. (2013). Contemporary digital divides in the United States. Tijdschrift voor economische en sociale geografie, 104(1), 1–17.

Warren, M. (2007). The digital vicious cycle: Links between social disadvantage and digital exclusion in rural areas. Telecommunications Policy, 31(6-7), 374–388.

Wilson, K. R., Wallin, J. S. and Reiser, C. (2003). Social stratification and the digital divide. Social Science Computer Review, 21(2), 133–143.

Wood, S. and Reynolds, J. (2012). Leveraging locational insights within retail store development? Assessing the use of location planners’ knowledge in retail marketing. Geoforum, 43(6), 1076–1087.

Wrigley, N. and Dolega, L. (2011). Resilience, fragility, and adaptation: New evidence on the performance of UK high streets during global economic crisis and its policy implications. Environment and Planning A, 43(10), 2337–2363.

Wrigley, N. and Lambiri, D. (2014). High Street Performance and Evolution: A Brief Guide to the Evidence. Technical report. Southampton: University of Southampton.

Acknowledgements

The authors would like to thank the Oxford Internet Institute for providing survey data and the Local Data Company Ltd for providing retail unit data for this research. The research was also funded by the Economic and Social Research Council, grant number ES/L003546/1.
8
Smart Card Data and Human Mobility
Nilufer Sari Aslam and Tao Cheng

8.1
Introduction

Prevailing models of urban mobility seek to establish an understanding of individual activity patterns using household travel demand surveys. These surveys, representative of a subset of the population, are used to create individual travel diaries for estimating projected travel demand. The whole model is not only time-consuming and costly, but is limited to partial snapshots of the overall dynamic needs of urban transportation. The advent of large data sources such as smart cards has created new opportunities for the understanding of urban mobility and behaviour research.

This chapter presents an overview of a heuristic model for the understanding of user mobility from smart card data. An understanding of individuals’ mobility requires an appreciation of their key Points of Interest (POIs). Activities that originate around these key geographical locations are classified as either primary or secondary activities. Primary activities involve movement patterns that are regular in nature and comprise key user locations, for example work (for regular workers) or study (for students). The secondary activities are marked by unusual and infrequent activity patterns and involve movement between other POIs, for example theatres, stadiums, pubs or restaurants.

The results from the model have been validated against London Travel Demand Survey (LTDS) data. The ability of the model to accurately identify and analyse individual mobility patterns in major urban centres rests on the precise identification of these primary locations. This new activity-based modelling approach aims to provide a better understanding of human mobility for transport infrastructure planning.

8.2
Dynamic data and the analysis of human mobility

Developing an ever greater understanding of human mobility has clear benefits for the provision of transport. Traditional modelling systems have relied on information extracted from travel surveys, which, while they are designed to gather a wide range of travel use and socio-economic data from participants, are limited both in terms of the relatively short time spans they represent and their relative sample sizes. In recent years the data available regarding day-to-day movements of transport users have been greatly enriched by the transition away from paper tickets or single use tokens, towards smart card based systems whereby entries and exits are recorded as users tap their cards on readers. These systems build up detailed journey profiles per card (assumed to be a single user) that can form the basis for models to automatically generate ‘travel diaries’. Such diaries can inform transport providers about typical user behaviour within their system and have the potential to markedly improve current survey-based approaches.

Whilst the primary motivation for the implementation of smart card payment systems within transportation networks is the streamlining of their revenue collection flows, the collected data have many auxiliary benefits such as long-term cost reduction, flexibility in pricing options, and the ability to share gathered information with other parties. These benefits partly explain the enthusiasm with which smart card automated fare collection systems are being extensively implemented around the world. In Europe, the use of smart cards is well advanced. Additionally, South America and over 15 cities in North America have implemented smart card transportation systems. The application of the smart card is also growing in Asia: for example, Hong Kong has its ‘Octopus’ smart card, and Singapore the ‘EZ-Link’ card (Pelletier et al, 2011); they also produce large quantities of very detailed data on onboard transactions. These data can be very useful to transit planners, from the day-to-day operation of the transit system to the strategic long-term planning of the network. This chapter covers several aspects of smart card data use in the public transit context. First, the technologies are presented: the hardware and information systems required to operate these tools; and privacy concerns and legal issues related to the dissemination of smart card data, data storage, and encryption are addressed. Then, the various uses of the data at three levels of management are described: strategic (long-term planning), tactical (service adjustments and network development), and operational (ridership statistics and performance indicators) (Pelletier et al, 2011).

As cards are associated with individual users, there emerges the possibility of uniquely logging each individual journey, which can thus capture relatively detailed spatial and temporal attributes such as the location of origin, destination stations, and stay duration. The analysis of user behaviour can, therefore, be carried out by looking at the spatial patterns or the temporal pattern. The most comprehensive analysis would take into account the spatial and temporal aspects of the trip simultaneously.

8.3
Extracting meaning from Transport for London’s Oyster card data

The focus of this chapter is an analysis of the Oyster card operated by Transport for London (TfL). The card is valid on all London public transport systems such as London Underground, the bus network, the Docklands Light Railway (DLR), London Overground, Tramline, some river boat services, and most National Rail services within the London Fare Zones.
8. Smart Card Data and Human Mobility 113

Figure 8.1
The frequency of journeys by number of users. [Chart: journey count (horizontal axis, 1 to 60) against number of users (vertical axis, 0 to 10,000).]

The volume of data from Oyster cards on the TfL network is extremely high; more than 80 percent of the 3 million journeys carried out each day on the network make use of Oyster cards (TfL, 2016). Although the Oyster card is used on multiple modes of transportation across London, 95% of all Oyster card usage is for London’s Underground and bus journeys (Gordon, 2012). One of the limitations identified in the TfL dataset is the incomplete recording of trip information for bus journeys. As TfL does not currently capture the alighting information from its bus trips, bus journeys are often excluded from certain trip analyses. Such journeys, however, can be included with an enhancement of the model, where missing information is identified as a sub-step within the identification process.

In the following analysis, the sample data available comprised a total of 60 million journeys. Since the processing of such large volumes of data is so resource intensive, for the purpose of the study, the smart card records of 9,900 randomly selected TfL users were identified for further investigation. The sample contains a total of 1,823,906 complete journey records made by individual users for the months of October and November 2013.

8.3.1
Activity description

An understanding of human mobility requires the understanding of daily activities both at the individual and the aggregate level. The classification of this activity can be thought of as a two-step process. The first step is to use the temporal information within the commuting sequences to identify the periods of stay. A period of stay is characterised by two consecutive journeys to and from the same location. The time between these two journeys is significant as it is an indicator of the type of activity and can help discern activities from transit stops. The second step in the process is the classification of activities into predefined activity types. The activities are classified by means of their association with POIs. Stay location, stay duration and time provide the spatial-temporal context of the activity. Also significant in the inference of the activity is the distance of the POI to the transit station that is captured via smart card data. The combination of these factors could therefore explain the different characteristics of human movement. For example, people travel to work on a daily basis but only go to watch a concert at specific times. Therefore a short stay near a concert venue can only be an indicator of the activity ‘at the concert’ if it matches the temporal attributes of the POI. This chapter only discusses the identification of home and work locations along with the work related activities.

Commuting patterns provide the ability to identify regular activities such as work, once a stay location has been identified

Behaviour    Activities Classified    Activities Identified                        Commute               Activity and locations

Regular      Primary Activities       Home (H) and Work (W) related activities     H to W, W to H        Work, offices, universities, schools, college, etc.
Regular      Secondary Activities     Before Work (W) Activities                   H to X1, X1 to W      Dropping off child at school
Regular      Secondary Activities     After Work (W) Activities                    W to X2, X2 to H      Pub, dinner, shopping
Regular      Secondary Activities     Midday/Work (W) Activities                   W to X3, X3 to W      Lunch, shopping
Irregular    Unspecified              Unspecified                                  X4 to X5, X5 to X6    Unspecified location and activity

Table 8.1
A variety of travel-based activities can be identified.

with a certain degree of confidence. Therefore it is important to ensure the regularity of the usage prior to making inferences about such activities.

Figure 8.1 shows that for 20% of the users, the available journey count (defined as a numeric parameter based on the regularity of usage) is less than 10 journeys, which is too low to carry out any meaningful analysis. The right balance is, therefore, required in the selection of the threshold; a value too small will include a large number of irregular users in the dataset, and a threshold too high will be too restrictive and leave out regular users from the analysis (Hasan et al, 2012).

Table 8.1 provides examples of primary and secondary activities that are linked to work (W) and home (H) locations. It is assumed that an individual can have one or more locations classified as home locations if they fit the criteria defined for the home location identification. Similarly, one or more locations can be classified as work locations. Identification of secondary activities is not within the scope of this analysis.

Irregular activities are more challenging to model, so for the purpose of this chapter they are defined as those activities that fall outside of a regular commuter journey’s key user locations. User activities that are irregular in nature are more challenging to model. In such a scenario the spatial-temporal aspects of the trip alone do not provide significant insights into user behaviour. At the same time, spatial attributes of visited locations can provide clues for activity classification.

8.3.2
Individual mobility as a sequence of activities

Activities are representations of users’ changing presence in both space and time. Some of the activities are captured by means of the smart card data whilst others lack a digital footprint. Importantly, the states that are visible provide clues about the states that are unobserved. Figure 8.2 illustrates a simple activity sequence for an Oyster card user. The individual carries out a morning commute between the hours of 08:00 and 09:00 from a home location to a work location. The same individual uses the network for their work to home commute between 17:00 and 18:00. Observed activities for this sequence are the journeys carried out by means of the Oyster card, whilst the hidden event is the work activity.
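
The relationship between observed journeys and hidden stays, together with the journey-count threshold discussed above, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Journey` and `Stay` record layouts, the function names, the default threshold of 10 journeys and the restriction of stays to a single calendar day are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime
from typing import NamedTuple


class Journey(NamedTuple):
    """One observed smart card journey (hypothetical record layout)."""
    card_id: str
    origin: str
    destination: str
    start: datetime
    end: datetime


class Stay(NamedTuple):
    """A hidden activity inferred between two consecutive journeys."""
    card_id: str
    arrive_station: str   # destination of the earlier journey
    depart_station: str   # origin of the later journey
    arrive: datetime
    depart: datetime

    @property
    def hours(self) -> float:
        return (self.depart - self.arrive).total_seconds() / 3600


def regular_cards(journeys, min_journeys=10):
    """Return the card ids with at least `min_journeys` journeys (assumed threshold)."""
    counts = Counter(j.card_id for j in journeys)
    return {card for card, n in counts.items() if n >= min_journeys}


def extract_stays(journeys):
    """Pair each journey with the next one made on the same card and the same day."""
    by_card = {}
    for j in sorted(journeys, key=lambda j: (j.card_id, j.start)):
        by_card.setdefault(j.card_id, []).append(j)
    stays = []
    for card, trips in by_card.items():
        for prev, nxt in zip(trips, trips[1:]):
            # Assumption: only same-day gaps are treated as daytime stays.
            if prev.end.date() == nxt.start.date():
                stays.append(Stay(card, prev.destination, nxt.origin, prev.end, nxt.start))
    return stays
```

For the simple sequence in Figure 8.2, a morning journey ending at the work station and an evening journey starting from it would yield a single stay whose duration is the time between the two.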

Figure 8.2
Sample activity sequence (simple).

Figure 8.3
Sample activity sequence (complex).

The methodology described in this work explains the identification and labelling of the activities based on POIs, for example home/work, and variables such as stay duration and visit frequency.

Similarly, a more complex activity sequence is illustrated in Figure 8.3 where the home to work commute is interrupted by a short stay activity. Although the information captured within the smart card data is not sufficient to infer the exact nature of the short stay visit, the timing and location of the activities can be useful in assigning an appropriate description for such activities, for example day-care visit and school drop-offs.

8.3.3
Human mobility pattern identification

The heuristics described in this section consider the visit frequency (number of times a specific user visits a location) and stay duration (the duration between consecutive journeys) as parameters in the model to identify home and work locations. In order to explore mobility patterns of urban commuters, the data are grouped by frequency of use. An important characteristic of the temporal patterns of urban human mobility is the stay time duration. It describes the activities between locations gathered from smart card data. It serves to identify the stay duration between consecutive journeys and enables the identification of work location.

8.3.3.1
Home location

The most frequently used station is a key marker in the identification of home stations. For a large majority of users, the station for the first and last journey of the day is an indicator of a home location. In this work, an algorithm has been devised based on the frequency of the most frequently used stations coupled with the temporal information of the journey to classify home stations for users.

The origin station of the first journey of the day and the destination station of the last journey of the day are taken as the POI-classified home location. Selected stations for the available number of days are then further analysed, and if the selected station count fits the criteria required in the algorithm, the station is selected as a home location. It is possible to have more than one station classified as a user’s home station if each station fits the criteria. This type of behaviour could be due to a number of reasons, such as a user having multiple home locations, service degradation on the

TfL network, or other personal journey choices. Similarly, it is also possible that the algorithm fails to highlight any station as a home station if no location meets the expected criteria.

[Figure 8.4: map titled ‘Home Stations’. Labelled stations: Walthamstow Central, Hammersmith, Bromley and Sutton. Legend: Count (1 - 40, 41 - 132, 133 - 541) and London Borough boundaries. Scale bar: 10 km.]

Figure 8.4
The results of home locations around TfL and National Rail stations in London.

In Figure 8.4, home locations are spread evenly across London’s outer boroughs with the City of London. Some commuter belt boroughs such as Sutton and Bromley are not well represented in the results. The users are still able to travel to and from these locations using National Rail infrastructure, but the journeys are not as accurately captured with Oyster cards. Therefore these stations do not feature significantly as major home locations.

Figure 8.4 also shows that outer London stations are represented by smaller points in comparison to the inner London stations. This is representative of the high population density in the inner city. Similarly, some central London interchange stations such as Hammersmith, King’s Cross and St Pancras feature heavily as home locations in the results. This is because the final legs of many journeys to other cities and outer London stations are not captured by the Oyster network, hence the last station captured has been inaccurately marked as a home station. This is one significant limitation of the available data. To mitigate this, large transit stations can be excluded or additional rules for the identification of home locations can be devised. It is possible to add rules, such as frequency of weekend use, to the home location algorithm in order to identify such users.
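
The home-location heuristic described in this section can be sketched as follows. The chapter does not give the exact criteria, so this is only an illustration under stated assumptions: each day contributes one vote to the origin of its first journey and one to the destination of its last journey, and a station is accepted as a home station when it collects votes on at least a share `min_share` of the observed days. The function name, the tuple layout and the 0.4 default are hypothetical.

```python
from collections import Counter, defaultdict


def home_stations(journeys, min_share=0.4):
    """Return candidate home stations for ONE card, sorted alphabetically.

    `journeys` is an iterable of (day, start_time, origin, destination)
    tuples; `day` and `start_time` only need to be sortable (for example
    ISO date and time strings). `min_share` is an assumed criterion, not a
    value taken from the chapter.
    """
    by_day = defaultdict(list)
    for day, start, origin, destination in journeys:
        by_day[day].append((start, origin, destination))

    votes = Counter()
    for trips in by_day.values():
        trips.sort()
        first_origin = trips[0][1]
        last_destination = trips[-1][2]
        # A station gets at most one vote per day, even if it is both
        # the first origin and the last destination.
        for station in {first_origin, last_destination}:
            votes[station] += 1

    n_days = len(by_day)
    return sorted(s for s, v in votes.items() if v / n_days >= min_share)
```

As in the text, the sketch can return several stations when more than one meets the criterion, and an empty list when none does. Rules such as excluding large interchange stations or weighting weekend use could be added as further filters on `votes`.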
[Figure 8.5: map titled ‘Work Stations’. Labelled stations: Wembley Park, Canary Wharf, Ealing Broadway, Richmond, Brixton and Wimbledon. Legend: Count (1 - 40, 41 - 132, 133 - 541) and London Borough boundaries. Scale bar: 10 km.]

Figure 8.5
The results of work locations around TfL and National Rail stations in London.

8.3.3.2
Work location

In order to estimate work locations, all weekday journeys are considered for the regular commuter. The identification of the POI work location is based on the assertion that for the majority of users with a regular commuting pattern, the location where most time is spent away from home is the work location.

The work location identification can be based on the stay time/duration of the consecutive journeys for each user. This defines the work activity and location for the individual user. The destination station of the first journey in the pair and the origin station of the second journey for the journey pair are selected. This will give two stations for two consecutive journeys. These can be the same station or different stations. If a station has been identified more often than the defined threshold visit frequency, it is classified as a work station for that user. Based on this criterion, users can have one or more work stations. It is important to note the possibility that the algorithm fails to highlight any station as a work location if no location meets the expected criteria.

The examination of the characteristics of user journeys enabled the selection of the parameters of visit frequency and stay duration thresholds. Different values for the parameters provided a different outcome for the algorithms. The threshold for the parameters is based on the duration of data available and level of confidence required in the outcome.

Figure 8.5 presents the work locations identified, pointing to the centres of

financial and commercial services around the City of London. This is due to the close proximity of these regions to the financial districts of the City of London and Canary Wharf.

Some locations such as Ealing Broadway, Wembley, Wimbledon, Richmond and Brixton outside of central London have also been identified as work locations. These locations are also an example of commercial centres outside of central London.

8.3.4
Validation

The results of identification of home and work locations were compared with the LTDS. LTDS data capture, among other attributes, information about home and work locations of the individuals (TfL, 2011). This makes LTDS data invaluable as it provides a source of validation for travel pattern algorithms.

The results were compared with the LTDS dataset: for 82% of users, the algorithm identified the same home location as the LTDS data at the level of postcode district. For work locations, 60% were correctly identified.

The accuracy of the comparison relies heavily upon the correctness of the user data captured through the surveys. Any errors in the data gathering and entry would adversely impact the reliability of the comparison.

8.4
Prospects for understanding mobility

To present the complete picture of an individual’s mobility, so-called ‘secondary activities’ need to be identified. In this context, secondary activities include all activities which last longer than the standard transit stops but are shorter than the presumed work activities. These are particularly challenging since they may not have obvious recurring travel patterns; for example, a monthly theatre visit might fall into this category. With this in mind, approaches based on continuous learning from the data hold promise. For example, machine learning algorithms can recognise patterns in data and construct new rules dynamically (Ethem, 2004). Examples of machine learning in transportation include insurance premium calculations based on the driving patterns of individuals and tracking congestion. The most talked about of all the applications of machine learning in transportation is perhaps self-driving vehicles. Based on research by Business Insider, there will be 10 million self-driving cars on the roads by 2020 (Gerage, 2017). The technology behind these relies upon sensors, which collect data from the surrounding environment and objects, such as size and speed. The task of machine learning algorithms is the continuous interpretation of this data in order to classify objects as pedestrians, cyclists or other cars and objects as well as the forecasting of their movements (Gates, 2017; Anil, 2017).

The identification of activities can be described as a classification problem in the context of machine learning. The activities that have been extracted based on the stay duration of individuals at a location need to be classified into one of the categories, for example, weekend social visits or weekday shopping trips. With respect to smart card data, one of the inherent challenges is the unavailability of labelled data that can be used to train the classifiers. In order to address this, a number of options can be considered to generate labelled datasets:

Expert labelling: This can be done with careful analysis of information, for example, the day of the week, time of the day, attributes of the locations (shopping centre, entertainment hub, residential area, and sports venue). Expert labelling of the activities relies on the intuition of the researcher to evaluate the available information about the activity and assign a suitable classification to the activity, e.g.
8. Smart Card Data and Human Mobility 119

a two-hour activity on a Saturday evening in the vicinity of restaurants and bars is indicative of a weekend social activity.

App assisted labelling: In this approach, volunteers can be asked to install a mobile app that will record the GPS locations during the day, according to a predefined threshold (e.g. 1+ hours stay). Based on the stay locations observed during the day, each user will be prompted to answer 1-2 questions to label the activity captured, for example, two-hour stay near Piccadilly Circus was (1) socialising, (2) shopping, (3) other.

App assisted approaches have the potential to capture highly accurate information about the mobility of individual users, but they have some challenges including issues of development, fine tuning of the mobile app, and recruitment of volunteers. Labelled test data, combined with the individual user journey records from the smart card data, provide the two pieces of information necessary to classify the user activities.

8.5
Conclusion

Smart card data provide a rich and detailed window into activity patterns through their ability to capture vast quantities of information regarding daily journeys. This chapter presents a case for the usage of public transport smart card data for the characterisation of human mobility patterns. Activities of individuals and the identification of activity locations are discussed using data collected by Transport for London.

Home and work locations are important anchors since the majority of journey activities revolve around these. A better identification of these locations would provide a more effective classification of the activities of individual users. The heuristic approach to human mobility proposed in this chapter has the potential to improve our understanding of wider mobility patterns at an aggregate level. Therefore, an understanding of human mobility patterns plays an important role in addressing the problems of transportation and urban sustainability.

Further Reading

Anil, A. (2017). What kind of machine learning algorithms do the driverless cars use? Quora. Online: https://www.quora.com/What-kind-of-machine-learning-algorithms-do-the-driverless-cars-use.

Ethem, A. (2004). Introduction to Machine Learning. Cambridge, MA: MIT Press.

Gates, G. (2017). The race for self-driving cars. The New York Times. Online: www.nytimes.com/interactive/2016/12/14/technology/how-self-driving-cars-work.html.

Intelligence (2016). 10 million self-driving cars will be on the road by 2020. Online: http://uk.businessinsider.com/report-10-million-self-driving-cars-will-be-on-the-road-by-2020-2015-5-6

Gordon, J. B. (2012). Intermodal Passenger Flows on London’s Public Transport Network. MIT Press.

Hasan, S. et al (2012). Spatiotemporal patterns of urban human mobility. Journal of Statistical Physics, 151, 304–318.

Pelletier, M.-P., Trépanier, M. & Morency, C. (2011). Smart card data use in public transit: A literature review. Transportation Research Part C: Emerging Technologies, 19(4), 557–568.

TfL (2011). London Travel Demand Survey. Online: www.clocs.org.uk/wp-content/uploads/2014/05/london-travel-demand-survey-2011.pdf.

TfL (2016). Oyster. Online: tfl.gov.uk/corporate/publications-and-reports/oyster-card.

Uniman, D. L. et al (2010). Service reliability measurement using automated fare card data. Transportation Research Record: Journal of the Transportation Research Board, 2143(1), 92–99. Online: trb.metapress.com/openurl.asp?genre=article&id=doi:10.3141/2143-12.

Acknowledgements

We are grateful to Transport for London (TfL) for provision of the experimental data for this research. The first author’s PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.
9
Interpreting Smart Meter Data of UK Domestic Energy Consumers
Anastasia Ushakova and Roberto Murcio

9.1
Introduction

One of the recent innovations to domestic energy provision in Great Britain is the installation of smart meters. Given the immediate opportunities for reducing carbon emissions and helping customers who may struggle with energy bills, the government has incentivised energy providers to ensure that every home has a smart meter by 2020. Whilst the data from such meters present richness in both volume and granularity, there are a number of hurdles to overcome when understanding variability, bias and uncertainty in such datasets. Consequently, these challenges may affect the analytical strategies we consider for generating insight into the lifestyles and activities of a population. This chapter provides an overview of the CDRC dataset on gas and electricity smart meters, currently the largest collection of such data available for academic research in the UK.

There is a growing acknowledgement of the potential of commercial data for better understanding of consumer choices and behaviour at the level of the individual or household. In 2013, the UK government obligated the major domestic energy providers to roll out smart meter installation across the country. Such widespread installation of meters provides a particularly valuable resource for better understanding the geography of energy consumption. Smart meters provide continuous measures of consumption of electricity and gas and are central to a better understanding of energy consumption by suppliers and researchers alike. The data generated by smart meters are an example of the emergence of Big Data over the last decade, and characteristically provide detailed and disaggregate information without the need for routine survey collections.

As such, smart meter data are a new form of data that offers a temporal breakdown of energy consumption for both electricity and gas. Data are recorded automatically
and offer real-time updates, typically aggregated to half hourly intervals. The data source can be considered large in volume since each household with dual fuel smart meters annually generates around 17,520 readings indicative of the correspondence between household characteristics and residential property attributes. There are thus immediate advantages to the use of smart meters in research with a focus upon consumer behaviour and energy policy. However, as these data are new to industry analysts and the research community alike, this chapter focuses on the interplay of issues of content and coverage of the data in the analysis.

Smart meters present novel opportunities for small-area population analysis when triangulated with the 2011 UK Census of Population (Anderson et al, 2017). In this chapter, we dedicate more attention to gas data as being a more direct indicator of household activities. Regardless, for understanding of behavioural patterns for both sources, be it gas/electricity energy expenditure or real-time energy consumption, these consumer data play a vital role for policy-making and regulation.

While offering greater precision in understanding the differences in behaviour over time by households, smart meter data also create further challenges when it comes to the generalisation of temporal profiles. How do we identify the average or expected energy consumption profile? Should this bring focus to the activity patterns of consuming households, their properties, or neighbourhood setting? ‘Variability in residential consumption reported in the literature suggests that there is hardly a “typical” level of consumption for any energy end-use’ (Lutzenhiser, 1993, 249). We address this by looking at both aggregate and disaggregated patterns of energy consumption.

We first provide a description of the available data within the context of the UK energy sector, briefly looking at the advantages and limitations posed by the introduction of smart meter data. We review the methods for segmenting energy data with an example of the results. Finally, we build upon these preliminary investigations to set a research agenda for linking the smart meter data to other administrative and open datasets in order to better understand consumer behaviour in this domain. Using a case study of Bristol, we investigate the possibilities that may be available using UK Census data. The chapter is concluded with a concise discussion of issues for future research.

9.2
The UK energy sector and smart meter roll out

The UK energy sector is regulated by the Department for Business, Energy and Industrial Strategy (BEIS) and the Office of Gas and Electricity Markets (OFGEM). On the supply side, there are currently 12 large and 46 small energy companies. The market share is monitored by OFGEM and assessed on the basis of how many electricity meters are installed on the distribution network by a supplier. As of late 2016, British Gas was the largest provider with 23% share of the market, and Scottish and Southern Electricity (SSE) and E.ON the second and third largest providers with 15% and 14% share respectively.

The UK Government aims to ensure that by 2020 every domestic and non-domestic property will have been offered a smart meter. The regulatory environment encourages providers to roll out smart meters as quickly as possible to meet the obligation of complete installation by 2020. By the first quarter of 2017 there was a total of 6.78 million smart meters installed by energy suppliers across residential and business addresses in the UK of which six million had been installed in domestic properties by the ‘Big Six’ energy providers. Electricity meters account for more than half of the total of these installations due to wider availability over gas. BEIS (2017)
reports that despite an acceleration of smart meter rollout, most domestic properties nevertheless still have traditional meters.

It is unlikely to be the case that roll out by any energy company thus far has been to a random selection of addresses. For instance, some domestic properties are unsuitable for meter installation while the needs of disabled customers may pose challenges. The perceived wisdom is that there is a bias in successful installations towards elderly people or families. This is driven by the fact that when local installation campaigns are mounted, representatives are more likely to find households from these groups at home during normal working hours. It is also important to note that nationally, around 70% of households will have electricity and gas supplied by the same company, with 17% having dual suppliers, meaning they will have a separate supplier for gas and electricity. The remaining 12% of households will be connected only to the electricity network.

9.3
Data

The national dataset of smart meter data that we use for the analysis in this chapter is held by the Consumer Data Research Centre (CDRC) and was sourced from one of the UK Big Six energy suppliers. The data contain details of around 1,080,000 electricity and gas domestic smart meters for the year 2015, which represents 43% of the 2.3 million smart meters installed by the end of December 2015 in the UK. The spatial granularity is at postcode sector. The broader figures are shown in Table 9.1. It is important to note that throughout this section, individual figures on numbers of smart meters and measures are rounded to the nearest hundred.

Table 9.1
Gas and electricity: the number of postcode sectors with at least 10 smart gas or electricity meters in Great Britain as of December 2015.

Type          Number of meters   Number of postcode sectors   Meters per postcode sector
              installed          with at least 10 meters      Mean      Median
Electricity   600,000            8,000                        70        60
Gas           480,000            7,500                        60        50

The number of energy users per month is not constant as the rollout of smart meters is increasing from one month to another. For example, in the case of electricity, 75% of the users were already present in the first quarter of 2015 meaning that these will be the customers’ records with the full

Figure 9.1
Number of smart meters being rolled out at each quarter of 2015.
[Chart: number of smart meters added each trimester (axis from 0 to 800,000) for electricity and gas, by period: Jan-Mar, Apr-Jun, Jul-Sept, Oct-Dec.]

Table 9.2
The average annual household energy consumption compared to BEIS 2015 national estimates. As observed, our estimates are slightly lower than official statistics. This may be an indication of further bias in our dataset.

Estimate                                Electricity   Gas
Mean                                    2,130 kWh     8,480 kWh
Median                                  1,820 kWh     7,105 kWh
Standard Deviation                      1,680 kWh     6,510 kWh
BEIS 2015 Typical consumption median    3,148 kWh     13,202 kWh
BEIS 2015 Typical consumption mean      3,894 kWh     11,707 kWh

Figure 9.2
Smart electricity and gas meters by postcode sector at the end of December, 2015. These maps show the distribution of smart meters across Great Britain with the West Midlands and North West regions having the highest frequencies of meters per postcode sector.
[Two maps: Electricity Meters and Gas Meters. Legend, count per postcode sector: 1 - 24, 25 - 47, 48 - 71, 72 - 97, 98 - 127, 128 - 162, 163 - 205, 206 - 268, 269 - 393, No data. Scale bar: 0 to 100 km.]
Region             Electricity meters   % of all meters in    Gas meters    % of all meters in
                   (thousands)          the region in 2015    (thousands)   the region in 2015
East Midlands      48.6                 2.0%                  40.3          2.0%
East of England    47.0                 2.0%                  38.0          2.0%
London             65.9                 2.0%                  54.6          2.0%
North East         24.7                 2.0%                  22.6          2.0%
North West         96.1                 3.0%                  76.0          3.0%
South East         57.1                 1.3%                  47.7          1.7%
South West         41.1                 1.5%                  31.6          2.2%
West Midlands      79.2                 3.0%                  64.8          4.0%
Yorkshire-Humber   58.0                 2.0%                  45.7          3.0%
Wales              28.1                 1.8%                  18.3          2.5%
Scotland           53.3                 1.7%                  40.4          2.6%
Total              600.0                                      480.0

Percentage of total installed in Great Britain by all suppliers in Q4, 2015     69.0%     75.0%
Percentage of all domestic meters in 2017 (smart and traditional)               2.0%      2.0%
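The regional shares discussed in this section can be checked directly against the meter counts in Table 9.3. The following is a minimal sketch only: the dictionary `METERS` and the helper `regional_share` are illustrative names introduced here, not part of the CDRC dataset or any published code, and the figures are the rounded thousands printed in the table.

```python
# Meter counts in thousands as (electricity, gas), transcribed from Table 9.3.
METERS = {
    "East Midlands": (48.6, 40.3),
    "East of England": (47.0, 38.0),
    "London": (65.9, 54.6),
    "North East": (24.7, 22.6),
    "North West": (96.1, 76.0),
    "South East": (57.1, 47.7),
    "South West": (41.1, 31.6),
    "West Midlands": (79.2, 64.8),
    "Yorkshire-Humber": (58.0, 45.7),
    "Wales": (28.1, 18.3),
    "Scotland": (53.3, 40.4),
}


def regional_share(regions, fuel_index):
    """Return the percentage of all meters of one fuel found in `regions`.

    fuel_index is 0 for electricity and 1 for gas (a hypothetical convention).
    """
    total = sum(counts[fuel_index] for counts in METERS.values())
    selected = sum(METERS[name][fuel_index] for name in regions)
    return 100.0 * selected / total


over_represented = ["North West", "West Midlands"]
under_represented = ["Wales", "North East"]

for label, index in (("electricity", 0), ("gas", 1)):
    print(label,
          round(regional_share(over_represented, index), 1),
          round(regional_share(under_represented, index), 1))
```

Because the table values are rounded, shares computed this way are approximate and are best read as a consistency check on the percentages quoted in the text.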

Table 9.3
Breakdown of smart gas and electricity meters by region.

year coverage (Figure 9.1). Between April and September, less than 5% of the total were enrolled. Finally, in December around 50,000 users were added bringing the total to 600,000 users with a smart meter. We may conclude that the rollout of the electricity meters is gathering momentum. This was also confirmed by BEIS (2017). A breakdown for the rollout by quarter is shown in Figure 9.1.

In 2015, 600,000 electricity smart meter users consumed 1,200 GWh, representing just 1.1% of the total domestic electricity consumption in Great Britain for that year. For gas, 480,000 users consumed 4,000 GWh, accounting for 1.3% of the total domestic gas consumption in Great Britain (2015). Basic measures of central tendency for individual consumption are shown in Table 9.2.

The geographical distribution of meters (Figure 9.2) is slightly biased towards the North West and West Midlands regions, for both electricity and gas, where almost 30% of the smart meters are installed. In contrast, Wales and North East regions are deeply under-represented, accounting for only 8% of the total of smart meters (Table 9.3).

The data for 2015 represent the early stages of smart meter roll out. One of the potential sources of bias associated with this is the fact that the first properties to receive a smart meter were those with old energy meters. Another possible bias might arise from the fact that the first households to receive an installation were more likely to be at home during the campaign: this may skew the customer representativeness slightly towards the elderly and families. To test these ideas, we compare the distribution of property build period by region (generated by the Valuation Office Agency) with the total number of smart meters installed, particularly for the 1965 to 1972 period (Table 9.4). Although the North West region scores high in both measures and Wales scores low in both, this test is by no means conclusive and the
Table 9.4
Meters installed and properties built between 1965-1972. The correlation between number of meters installed and properties built in this period is 0.35, indicating a weak positive correlation.

Region             Electricity Meters   Properties Built 1965 to 1972
South East         57,100               430,340
North West         96,100               330,780
East of England    47,000               314,640
West Midlands      79,200               292,610
South West         41,100               259,040
Yorkshire-Humber   58,000               238,860
London             65,900               224,600
East Midlands      48,600               211,720
Wales              28,100               135,900
North East         24,700               132,820
Scotland           53,300               NA
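A correlation of the kind reported for Table 9.4 can be computed from the two columns of the table. The sketch below uses the standard Pearson product-moment formula over the ten regions that have both values; this is an assumption, since the chapter does not state which coefficient, which regions or which level of aggregation lie behind its reported figure, so the value printed here should not be expected to reproduce it. The names `REGIONS` and `pearson` are illustrative.

```python
from math import sqrt

# (electricity meters, properties built 1965 to 1972), transcribed from Table 9.4.
# Scotland is omitted because its build-period count is not available (NA).
REGIONS = {
    "South East": (57_100, 430_340),
    "North West": (96_100, 330_780),
    "East of England": (47_000, 314_640),
    "West Midlands": (79_200, 292_610),
    "South West": (41_100, 259_040),
    "Yorkshire-Humber": (58_000, 238_860),
    "London": (65_900, 224_600),
    "East Midlands": (48_600, 211_720),
    "Wales": (28_100, 135_900),
    "North East": (24_700, 132_820),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


meters = [pair[0] for pair in REGIONS.values()]
built = [pair[1] for pair in REGIONS.values()]

r = pearson(meters, built)
print("Pearson r:", round(r, 2), "r squared:", round(r * r, 2))
```

With only ten regional observations, any coefficient computed this way carries wide uncertainty, which is consistent with the chapter's caution that the test is not conclusive.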

correlation between both quantities is not particularly significant.

9.4
Comparison with the UK Census of population

To concentrate on how the data are representative of the British population we linked the dataset with the UK 2011 Census data on number of households that reside in each postcode sector. As smart meters are installed at address level, the number of households may be used as a proxy for the coverage of individuals represented in our data. We found that, in general, the percentage of households is no more than 3%, for both electricity and gas (Figure 9.3). This could imply that roll out of smart meters has started at the same time in each of the postcode sectors gradually, varying from 1 to 65 meters installed. It can also be observed that speed of installation may be greater in urban regions of the country.

In more than 80% of the postcode sectors, smart meters were installed at between 1% and 4.8% of the total number of households. The higher percentages can be found in the West Midlands, North West and the North of Wales. North West, in fact, is the second largest region by number of meters of all types present, while northern Wales is rather unusually over-represented in our sample. As in Figure 9.2, the brown areas represent the sectors with no available data.

9.5
Variability in energy consumption

Previous research on energy consumption (Huebner et al, 2015; McLoughlin et al, 2012) and energy expenditure (Druckman and Jackson, 2008) has considered aggregated consumption values at either annual or six month intervals. Previous analysis of smart meter data in the UK was performed primarily with Irish smart meter data (Silipo and Winters, 2013; Cao et al, 2013). A UK-wide national dataset may offer a more insightful approach when considering energy consumption. Previous work aimed to identify if there is any correspondence between property attributes and household characteristics that may explain variability in energy use. Huebner et al (2015) for instance found that building characteristics and socio-demographics can jointly explain only about 44% of energy use variation. Further work by Haben et al (2013) attempted to link profiled energy consumption patterns to socio-demographic classification and found little correspondence between temporal profiles and socio-economic groups. This suggested
that studying actual energy consumption at greater temporal breakdown may further inform us about behavioural patterns. For instance, variability in half hourly consumption can be used as an indicator of distinct consumption behaviours or lifestyles. As a response, we attempt to look not only at how much is consumed but where and when. Having the same total value for the day may, in fact, be associated with increased differentiation in customer profiles, which are based on the variation throughout the day and the region in which smart meter users reside. We address this by attempting to cluster temporal energy usage, to group customers not by their average total consumption, but rather with a combination of their consumption levels in and outside peak hours.

Figure 9.3
Proportion of electricity and gas meters relative to the total number of households by postcode sector.
[Two maps: Electricity Meters and Gas Meters. Legend, proportion of households: 0.1 - 1.2%, 1.3 - 2.1%, 2.2 - 3.1%, 3.2 - 4.6%, 4.7 - 10.1%, 10.2 - 21.4%, No data. Scale bar: 0 to 100 km.]

9.5.1
Classification methods and outlier detection

We assessed the suitability of some clustering methods for visualisation of variability in energy consumption profiles across postcode sectors. Potentially, this may be used for defining outlier groups of readings that represent slightly unusual behaviour compared to the majority of the sample. Several studies have attempted energy classification for electricity data. However, the samples tend to differ as well as the representations and additional features that are added to the smart meter readings. Further work may consider the clustering on a particular day or at a specific time; variation at such scales may be pre-determined by season or, even more narrowly, time of day that can be associated with different activities. The decision of whether these dimensions
can be considered simultaneously or sequentially underpins much of the CDRC research into these data.

Table 9.5
Resulting national clusters for annual half hour aggregates at postcode sector level.

Cluster name   % of total
Cluster 1      51%
Cluster 2      1%
Cluster 3      38%
Cluster 4      10%

To date, a number of methods have been developed for clustering consumer data. The majority are associated with a reliable performance on static data only, while disregarding the sequential links between variables. This poses further challenges if we are to consider spatial and temporal dimensions. One of the immediate solutions could be to transform dynamic data into static format. For example, we may calculate the mean for each of the individuals and create a numerical indicator that represents an estimate of average consumption for the individuals in our sample. The decision on the method is thus broadly driven by the data characteristics that include: discrete versus real values, uniformity of the sample, univariate versus multivariate series, as well as lengths of time series considered for the analysis (Liao, 2005).

K-means clustering is the most popular approach due to its simplicity and fast minimisation of the dissimilarities between objects and their cluster centre. It may be suitable for datasets with static features. For highly variable temporal variables, the assignment of the cluster may be highly unstable as different customers will be assigned to a different cluster subject to the day and time.

Alternatively, a Gaussian Mixture Model based on a probabilistic setting for the clustering may be proposed. Such a setting brings about the ability to handle diverse types of data, including dealing with missing or unobservable data that may have contributed to variation differences

Figure 9.4
Clusters derived from annual aggregates at postcode sector level. The whisker box plots represent the median energy consumption and the variation within four quantiles. Postcode sector differences in aggregated consumption are based mainly on the variation around expected morning and evening peaks consumption.
[Four box plot panels, Cluster 1 to Cluster 4. Horizontal axis: Time, half-hourly from 00.00 to 23.30. Vertical axis: Wh, from 0 to 5000.]
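The idea of reducing each half-hourly profile to static features and then applying k-means, as discussed in this section, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the results in this chapter: the peak windows are taken as 06:00 to 08:30 and 16:00 to 20:00, the four profiles are synthetic values invented for the example, the first k points seed the cluster centres (an arbitrary choice), and the names `slot`, `peak_features` and `kmeans` are hypothetical.

```python
import math

SLOTS_PER_DAY = 48  # half-hourly readings in one day


def slot(hhmm):
    """Convert 'HH:MM' to a half-hour slot index (00:00 -> 0, 23:30 -> 47)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 2 + int(minutes) // 30


# Peak windows as half-open slot ranges [start, end).
PEAK_WINDOWS = [(slot("06:00"), slot("08:30")), (slot("16:00"), slot("20:00"))]
PEAK_SLOTS = {s for start, end in PEAK_WINDOWS for s in range(start, end)}


def peak_features(profile):
    """Reduce a 48-value daily profile to (mean peak, mean off-peak) consumption."""
    peak = [v for i, v in enumerate(profile) if i in PEAK_SLOTS]
    off_peak = [v for i, v in enumerate(profile) if i not in PEAK_SLOTS]
    return (sum(peak) / len(peak), sum(off_peak) / len(off_peak))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm with the first k points as initial centres."""
    centres = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centre.
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return labels, centres


def synthetic_profile(base, peak_boost):
    """Invented profile in Wh: a flat base load plus a uplift in peak slots."""
    return [base + (peak_boost if i in PEAK_SLOTS else 0)
            for i in range(SLOTS_PER_DAY)]


profiles = [
    synthetic_profile(200, 1500),  # low base, strong peaks
    synthetic_profile(900, 100),   # high base, nearly flat
    synthetic_profile(250, 1400),
    synthetic_profile(950, 150),
]
points = [peak_features(p) for p in profiles]
labels, centres = kmeans(points, k=2)
print(labels)
```

Because the outcome of k-means depends on the initial centres, a different seeding could label the same profiles differently, which is one reason a probabilistic alternative may be preferred for highly variable series.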
among segmented groups. This is achieved by assigning a probability measure to the cluster. Where uncertainty about the assignment is greater, additional variables may be introduced or the individual may be treated as an outlier or uncertain group. Unlike k-means, it produces stable results and selects the number of clusters using smoothing. This is also convenient for the matter of replication as clustering results remain the same regardless of how many times we run the algorithm. Further research may implement clustering by dynamics – for example, through grouping graphical models (please see further reading for more details).

Some immediate results of clustering for gas consumption are presented in Table 9.5 and Figure 9.4. As we note, cluster 2 represents very high and variable behaviour yet represents a very limited part of the sample. Cluster 1, in contrast, represents half of the national sample variation at aggregated level. We may conclude that, on average, gas energy consumption across Great Britain does not vary significantly and there is a tendency for highly stable consumption through the day with peak hours falling into intervals of 06:00 – 08:30 and 16:00 – 20:00. Customers in postcode sectors that fall in cluster 1 are more likely to consume during the peak hours with a lower tendency to consume at night in comparison with cluster 4, for example, where there is a greater propensity to use gas both overnight and throughout the day.

To look deeper into the variation in energy profiles across postcode sector we removed the data that fall into cluster 2, treating it as an outlier group. A number of factors may have contributed to the unusually high variation captured in this cluster: non-domestic properties may be mistakenly occurring in the sample, or multiple occupations may be associated with a single smart meter address (e.g. student halls). What we observe is that by excluding highly variable observations, the algorithm can differentiate more variability and after subtraction of these outliers gives rise to nine national clusters. From Table 9.6 and Figure 9.5 we observe once again that there is a clear tendency for morning and evening peaks to be similar across profiles. Additionally, on average the consumption levels stay at the limit of 2,500 kWh per half hour across the clustered profiles. However, what we are picking up is clusters of really low consumption (clusters 1 and 7) compared to high and variable groups (clusters 3, 4 and 6) that are defined by the variability around night time consumption, early mornings and outside peak hours.

Table 9.6
Resulting national clusters for annual half hour aggregates at postcode sector level after exclusion of outlier group observations.

Cluster name   % of total
Cluster 1      30%
Cluster 2      13%
Cluster 3      7%
Cluster 4      1%
Cluster 5      2%
Cluster 6      3%
Cluster 7      19%
Cluster 8      18%
Cluster 9      7%

9.5.2
Usage during off-peak hours

As a further extension to this analysis we segment the temporal analysis of energy data in terms of peak hours as they have been shown to be important for the definition of the clusters. Figure 9.5 suggests that
Figure 9.5
Clusters derived from annual aggregates at postcode sector level after exclusion of the outlier group observations. The whisker box plots represent the median energy consumption and the variation within four quantiles. The results demonstrate increased variability across postcode sector levels once outlier groups were removed.
[Nine box plot panels, Cluster 1 to Cluster 9. Horizontal axis: Time, half-hourly from 00.00 to 23.30. Vertical axis: Wh, from 0 to 5000.]
9. Interpreting Smart Meter Data of Uk Domestic Energy Consumers 131

Figure 9.6
Clustering of ‘off-peak’ hours data. Here we assess consumption levels that are represented as the times between 11:30 and 15:00. Note: Grey areas represent no data; data provided for areas with complete annual coverage only.
[Map labels: Edinburgh, Leeds, Liverpool, Cardiff, London. Panels: Cluster 1 to Cluster 4; x-axis: Time (11.30 to 15.00); y-axis: Wh (0 to 5000).]

regardless of segmentation, there are quite similar patterns around morning and evening peak hours which vary in magnitude but are evident for each of the clusters. Examining peak and outside peak hours separately may tell us slightly more about a household’s presence at home as well as particular habits or routines (e.g. waking up early for work, late nighters). We further show that, for defining the interactions between characteristics of people living in the area and energy, concentrating on a specific time and location may reveal more information about energy consumption than when both time and space are aggregated. As may be observed in clusters 3 and 4 in Figure 9.6, consumption throughout the day is more frequent around the coastal regions and less so in central England. Further investigation

Table 9.7
Openly-available datasets that may aid the understanding of variation in energy consumption.

Source: UK Data Service (UKDS) and Department for Communities and Local Government (DCLG)
Name: English Housing Survey (EHS)
Time period: 2008-2015
Geographical reference: Households/Dwellings
Temporal granularity: Annual statistic
Description: Updated each year and provides data on energy efficiency, insulation and tenure trend. Does not cover entire Great Britain. Sample of around 6-7,000 houses drawn randomly each year for an investigation. Sample for 2014/15 has slightly better coverage: 2,297 dwellings; 11,851 households. Similar surveys for other UK countries.
URL: www.ukdataservice.ac.uk

Source: UKDS and BEIS
Name: Energy Performance Certificates (EPC)
Time period: 2005-2015
Geographical reference: Region or address level
Temporal granularity: N/A
Description: Sample is sufficiently large and covers over 15 million addresses in the UK. Data contain information on energy efficiency bands and variables such as age, type of property, floor area, annual gas and electricity consumption as fuel poverty indicators. Dates of records vary because, according to regulations, assessments should be undertaken when conditions change, e.g., when a property is rented.
URL: epc.opendatacommunities.org

Source: CDRC
Name: House Ages and Prices
Time period: 1989-2015
Geographical reference: LSOA and MSOA
Temporal granularity: ONS (quarterly) and VOA (annual)
Description: The data were collected originally by ONS and VOA. The dwelling age counts are at LSOA level and median house prices are at MSOA level.
URL: data.cdrc.ac.uk

Source: Office for National Statistics (ONS)
Name: Census 2011
Time period: 2011
Geographical reference: Census OA
Temporal granularity: Decennial statistic
Description: These data offer a fairly detailed description of all households and properties in the UK. Useful variables could include household size, employment characteristics, dwelling age, country of origin and others. However, the data have no consideration for recent (<10 year) temporal variations and may contain missing data.
URL: www.ons.gov.uk/census/2011census

Source: University of Southampton and ONS
Name: Classification of Workplace Zones
Time period: 2011
Geographical reference: Census OA
Temporal granularity: N/A
Description: A geodemographic classification of the characteristics of Workplace Zones (WZs). The categorisation of WZs is based on the similarity measures derived from a range of variables from the 2011 Census of England and Wales. Such a specific classification may be useful if we want to consider using energy consumption patterns to study the employment types of smart meter users (e.g. part-time vs full-time, unemployed, retired).
URL: cowz.geodata.soton.ac.uk
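
As a minimal sketch of the kind of linkage that the datasets in Table 9.7 allow, the following Python joins meter-level consumption outside peak hours to an area classification by Output Area code and averages within each group. It is an illustration only, not the procedure used in this chapter: the field names, OA codes, readings and group assignments are all invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical meter records: identifiers, OA codes and readings are invented.
meter_readings = [
    {"meter_id": "m1", "oa_code": "OA001", "offpeak_kwh": 12.0},
    {"meter_id": "m2", "oa_code": "OA001", "offpeak_kwh": 18.0},
    {"meter_id": "m3", "oa_code": "OA002", "offpeak_kwh": 30.0},
    {"meter_id": "m4", "oa_code": "OA999", "offpeak_kwh": 25.0},  # no match in the lookup
]

# Hypothetical lookup from OA code to an area classification group label.
oac_lookup = {"OA001": "2a", "OA002": "7b"}


def mean_offpeak_by_group(readings, lookup):
    """Join readings to groups via OA code; return group means and unmatched meter IDs."""
    grouped = defaultdict(list)
    unmatched = []
    for record in readings:
        group = lookup.get(record["oa_code"])
        if group is None:
            # Keep track of meters that cannot be linked rather than dropping them silently.
            unmatched.append(record["meter_id"])
            continue
        grouped[group].append(record["offpeak_kwh"])
    means = {group: mean(values) for group, values in sorted(grouped.items())}
    return means, unmatched


means, unmatched = mean_offpeak_by_group(meter_readings, oac_lookup)
print(means)
print(unmatched)
```

Reporting the unmatched meters separately is a design choice made here so that gaps in the lookup, one of the limitations of aggregate data sources, stay visible in the output.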

would be necessary as in general each cluster is scattered around the nation and may be well represented in each of the regions.

In this section we presented a preliminary analysis of smart meter segmentation and outlier detection by considering when and how people consume regardless of their location. One of the further research objectives is to look at how the profiles of customers are segmented geographically and whether their differences can be explained by characteristics of the areas where people live. This can be achieved with openly available administrative datasets.

9.5.3
Data linkage

Various datasets available for linkage with energy consumption are presented in Table 9.7, where we list the sources of additional data, the geographical and temporal references and the time period that these datasets cover. A number of limitations should be addressed when considering these data sources. Firstly, the vast majority of data on the population are only available in aggregate geographic units. However, some data are available at the household level, which will enable us to link them to specific trends identified by individual smart meters. One such example is the Energy Performance Certificate (EPC) data, which contain energy performance related variables for over 15.6 million households.

9.5.4
Bristol energy consumption and UK Census Output Area Classification

In this section, we present an analysis of the region around Bristol where energy consumption in 2014 is linked to Census Output Area (OA) geography. We tested the explanatory ability of some common predictors of energy consumption, such as property size and life-stage of energy customers, combined in a single indicator, the Census Output Area Classification (OAC). Based on previous research that looked at the forecasting of energy consumption, the main factor for consumption dynamics was considered to be the weather (Swan and Ugursal, 2009). Nevertheless, as more data were added to smart meter analysis, what has become clearer is that a response to the weather is affected by the type of property, tenure, house size, life stage, income group, and other customer characteristics (McLoughlin et al., 2012). A further, and perhaps no less obvious, point is that of economics and the relationship between income variables and energy consumption at both national and individual level. As we might expect, those who are wealthier tend to consume more energy as they are likely to live in more spacious properties.

Before looking at links between socio-demographic characteristics and consumption, we clustered temporal profiles in Bristol in a similar fashion as we did for the national dataset; the only difference is that we are now looking at a much finer OA level geography. The resulting clusters are presented in Table 9.8 and Figure 9.7. The number of distinct clusters is smaller than that of the national dataset. Nevertheless, some immediate correspondence with the clusters that we defined previously can be noted for clusters 1 and 2. The consumption in Bristol is observed to be differentiated by both peak hours and throughout the day patterns. Most of the OA aggregates are associated with very low or no consumption during night time. This may suggest that the variability of energy consumption at a finer geographical level may not necessarily be less representative than in the large sample. Further to this, it may help us in filling the gaps where data are missing by defining some common energy behavioural patterns that are more frequent in each of the areas in Great Britain, or as we call them, typical profiles.

Despite defining some clear relationships between household characteristics and
Table 9.8
Clusters derived from annual aggregates at OA level, Bristol.
Cluster name: Cluster 3, Cluster 2, Cluster 1
% of total: 27%, 35%, 38%

Figure 9.7
Clusters derived from annual aggregates at OA level for Bristol.
[Figure panels: Cluster 1, Cluster 2 and Cluster 3; x-axis: Time (half-hourly, 00.30 to 00.00); y-axis: Wh (0 to 3000).]

energy use, the current research suggests that over recent years energy consumption has tended to be shaped more significantly by consumer habits and lifestyle than by household size or dwelling type. Capturing this is challenging, yet the behaviour patterns clustered together may give us more direction in quantifying the lifestyles of energy customers.

Figure 9.8
An attempt to use the 2011 Census Output Area Classification (OAC) to explain the variation outside peak hours on a regular winter’s day on a sample of 1,105 meters in Bristol. Note: Interestingly, even while the number of meters representing each group is relatively small, we can identify students, rural tenants, families and the elderly as those most likely to consume outside peak hours.
[Figure: y-axis: Gas consumption (kWh) outside peak hours (0 to 60); x-axis: Census 2011 Output Area Classification (1a, 1b, 1c, 2a, 2b, 2c, 2d, 3a, 3b, 3c, 3d, 4a, 4b, 4c, 5a, 5b, 6a, 6b, 7a, 7b, 7c, 7d, 8a, 8b, 8c, 8d).]

Figure 9.8 presents some preliminary results using OAC to define the links between household characteristics and energy consumption outside peak hours as a potential proxy for households staying at home. It is important to note that the roll out of the smart meter, as outlined earlier, may bias these results, as it was suggested that in 2014-2015 people who were likely to be present at home were among the first to receive a smart meter. In the case of Bristol, we observe that around 20% of the sample falls into the category of ‘Urban Professionals and Families’, which is quite contrary to the suggestion that the composition of smart meter data may be skewed towards elderly people. Besides, a number of factors that may be associated with consumption outside peak hours need further validation. To name but a few, the households present at home may be those who work from home, are retired or are carers.

9.6
Conclusion

Research domains that investigate energy consumption range from engineering and informatics to economics and political science. The complexities of investigating energy consumption motivate the development of new research methodologies to cope with the diversity of energy data available. The method we considered in this chapter, the Gaussian Mixture Model, tends to behave in a quite stable fashion across the different sets of data, meaning that no matter how many times we run the algorithm the result will hold. Further research should consider

a more thorough design of energy consumption process-generation. We primarily used the gas data as little attention was given to it in previous research. The ability to use smart meter data to group and identify customers can help to improve energy efficiency through better utilisation of data to target interventions and policies. A more targeted policy framework may thus be designed to address simultaneously the issues of carbon emission reduction and affordability of energy.

We suggest that subsamples of the smart meter data may be taken in order to bring focus to the analysis of variability in consumption, possibly also stratified according to the study area. In this chapter, we presented work on the largest sample of smart meter data ever available for Great Britain. On this geographically extensive scale, greater heterogeneity is observed, both over time and space. In an effort to identify potential sources of bias in our dataset we compared the data to official sources. In our case, further limitations may also be posed by the specific customer base of our data provider and issues that may prevent households from receiving a smart meter (e.g. low priority house type or condition during the early stages of roll out). We further identified that clustering may be useful for visualisations of Big Data from smart meters and may provide a ground for identification of outliers or unusual behaviours.

The substantial variety in energy usage across Great Britain, and the acknowledgement that what can be true for one household may not necessarily always tell us about the residential area, provide immediate opportunities for further research. As we observed in our results, around 50% of our sample tend to follow quite typical gas consumption patterns with morning and evening peaks and relatively lower consumption outside peak hours. However, the amount of residential variation that would not meet the criteria of the ‘expected’ or ‘typical’ temporal profile remains ambiguous, unexplained and in need of further inter-disciplinary research: analysis that could integrate spatial, temporal and social components. Addressing the validation processes for investigations that attempt to infer the causes of energy consumption variation remains an important aspect for anyone who is interested in smart meter data exploration. Further to this, one of our conclusions is that perhaps ‘smaller is better’, meaning that reducing Big Data generated by smart meters to Small Data may lead to more insightful results. Researchers with a smaller sample but more data available on household characteristics or property attributes may use this national dataset to define how representative the profiles obtained on a smaller sample are of wider British trends in behaviour and lifestyle. Such analysis could further help us to define what the acceptable sample size is that can be used to study energy consumption such that the results can be acceptably generalised for the whole of the UK.

While there are many challenges surrounding smart meter data for both energy company analysts and researchers, undoubtedly there are also great possibilities to see the study of consumption behaviour in a new light. As suggested by Swan and Ugursal (2009), previous research that analysed energy consumption had tended to place a focus on private sector actors that have greater incentives and expertise for consumption reduction, as well as the need to adhere to tougher regulatory requirements. Academic research can complement such work and unlock the further potential of smart meter data, for instance to generate new insights about people’s consumption patterns, which in turn would give us a better knowledge about the areas and activities across the country and inform public policy decision-making.
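
The stability attributed to the Gaussian Mixture Model in this conclusion can be illustrated with a deliberately small sketch. It is not the implementation used in the chapter: it fits a one-dimensional, two-component mixture by expectation-maximisation to invented daily consumption totals, whereas the chapter clusters full temporal profiles. The function name `fit_gmm_1d`, the data and the starting means are all assumptions made for illustration. The point is only that, on well-separated data, different starting values lead to essentially the same fitted components.

```python
import math


def fit_gmm_1d(values, init_means, n_iter=300):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    k = len(init_means)
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    weights = [1.0 / k] * k
    means = list(init_means)          # copy so the caller's list is not modified
    variances = [overall_var] * k     # start every component with the pooled variance

    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for v in values:
            dens = [
                w * math.exp(-((v - m) ** 2) / (2 * s)) / math.sqrt(2 * math.pi * s)
                for w, m, s in zip(weights, means, variances)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, values)) / nj
            var_j = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, values)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid a collapsing component
    return weights, means, variances


# Invented daily gas totals (kWh) for a low-use and a high-use group of meters.
daily_kwh = [9.0, 10.0, 11.0, 10.0, 9.5, 10.5, 29.0, 30.0, 31.0, 30.0, 30.5, 29.5]

for start in ([5.0, 40.0], [12.0, 25.0]):
    weights, means, variances = fit_gmm_1d(daily_kwh, start)
    print(start, [round(m, 2) for m in means], [round(w, 2) for w in weights])
```

With overlapping groups or more components, EM can settle on different local optima, so in practice stability is usually checked across several initialisations rather than assumed.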

Further Reading

Anderson, B., Lin, S., Newing, A., Bahaj, A. and James, P. (2017). Electricity consumption and household characteristics: Implications for census-taking in a smart metered future. Computers, Environment and Urban Systems, 63, 58–67.

BEIS (2016). Sub-National Electricity and Gas Consumption Statistics. Regional, Local Authority, Middle and Lower Layer Super Output Area. Report, December 2016.

BEIS (2017). Smart meters, Great Britain. Quarterly report, March 2017.

Cao, H.-A., Beckel, C. and Staake, T. (2013). Are domestic load profiles stable over time? An attempt to identify target households for demand side management campaigns. In Industrial Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE. IEEE, pp. 4733–4738.

Chicco, G. (2012). Overview and performance assessment of the clustering methods for electrical load pattern grouping. Energy, 42(1), 68–80.

DECC (2015). Smart meters, Great Britain. Quarterly report, December 2015.

Druckman, A. and Jackson, T. (2008). Household energy consumption in the UK: A highly geographically and socio-economically disaggregated model. Energy Policy, 36(8), 3177–3192.

Haben, S., Rowe, M., Greetham, D. V., Grindrod, P., Holderbaum, W., Potter, B. and Singleton, C. (2013). Mathematical solutions for electricity networks in a low carbon future. In 22nd International Conference and Exhibition on Electricity Distribution (CIRED 2013), pp. 1–4.

Huebner, G. M., Hamilton, I., Chalabi, Z., Shipworth, D. and Oreszczyn, T. (2015). Explaining domestic energy consumption: The comparative contribution of building factors, socio-demographics, behaviours and attitudes. Applied Energy, 159, 589–600.

Liao, T. W. (2005). Clustering of time series data – a survey. Pattern Recognition, 38(11), 1857–1874.

Lutzenhiser, L. (1993). Social and behavioral aspects of energy use. Annual Review of Energy and the Environment, 18(1), 247–289.

McLoughlin, F., Duffy, A. and Conlon, M. (2012). Characterizing domestic electricity consumption patterns by dwelling and occupant socio-economic variables: An Irish case study. Energy and Buildings, 48, 240–248.

Silipo, R. and Winters, P. (2013). Big data, smart energy, and predictive analytics. Time Series Prediction of Smart Energy Data, 1, 37.

Swan, L. G. and Ugursal, V. I. (2009). Modeling of end-use energy consumption in the residential sector: A review of modeling techniques. Renewable and Sustainable Energy Reviews, 13(8), 1819–1835.

Acknowledgements

The authors are grateful to the ‘Domestic Energy Provider’ for providing smart meter data for this research. The first author’s PhD research is sponsored by the Economic and Social Research Council through the UCL Doctoral Training Centre.
PART THREE
NEW APPLICATIONS AND DATA LINKAGE

10. Geovisualisation of Consumer Data
Oliver O’Brien and James Cheshire

10.1
Introduction

As the volume and variety of spatially-referenced consumer data continue to grow there is an unprecedented need for their curation, analysis and communication. Consumers like to be informed about what their data say about them, retailers are keen to exploit data to drive sales and researchers see great potential in such data for deriving insights into social processes. Interactive maps are a proven tool in facilitating data access across these groups. They communicate insights, in addition to providing an interface through which subsets of large and complex databases can be downloaded for further analysis.

This chapter will share insights from a decade of research into the creation of web-mapping tools for a variety of consumer and government datasets. It will detail the developments that underpin the creation of three innovative mapping platforms before signposting a series of research priorities for more informative geovisualisation of consumer datasets, and population data more broadly.

10.1.1
Background

Central to the field of geographical information science has been the need to handle large and complex datasets. Without developments in this and associated fields there would, for example, be no efficient means of combining demographic data with their respective locations: an essential procedure in the analysis of consumer data.

What’s more, we are in an era of unprecedented change in the nature of, funding for and access to social, economic and demographic datasets. It is a pressing concern that the full potential of consumer data is realised as government-funded datasets become a diminished part of the data landscape. Such data offer the possibility of investigating new research issues in unprecedented spatiotemporal

detail, but their effective concatenation, conflation and synthesis are far from unproblematic. There is also the potential and need for more sophisticated visualisation, in the form of visual analytics, of these kinds of data both for exploratory analysis and also for the communication of results. This provides the chapter’s focus.

During the past decade the acceleration in the development and uptake of web-mapping technologies has led to a proliferation of highly advanced mapping interfaces. These are now routinely accessed across a full range of platforms, from mobile phones to desktop computers, and have expanded from navigational devices to key forms of information visualisation. This trend has been facilitated by a move away from serving image tiles to users and towards the use of vector tiles, where the web browser effectively performs the geographic information system (GIS) operations previously undertaken on the website’s servers at source. Such tiles have the advantage of being generated as and when required, which enables the inclusion of real-time data or rapid updates.

As the technology and data for map creation become more complex, a dichotomy is emerging in the skillsets required by potential users. Web maps can now be immediately accessed and utilised with limited prior experience, but the data used to create them require more programming skills as spreadsheets give way to databases. This bifurcation, in part, fuels the data science industry as it seeks to meet the increasing demand for innovative visualisations of new data to be served to a large number of non-specialist users. Companies such as Mapbox and CARTO offer demographic data maps, alongside Esri, the leader in this sector for the past three decades.

The Consumer Data Research Centre (CDRC) is an academic initiative that has entered this space and developed a bespoke mapping platform called CDRC Maps. Users can access maps generated from millions of data points depicting a range of data from deprivation through to Internet usage, with links to the raw data for use in their own analysis. A key motivation for creating the platform was the desire to share data from the CDRC and a recognition that online data repositories have limited effectiveness for users seeking to browse datasets or for raising awareness of particular data. As will be discussed below, CDRC Maps is coupled with the centre’s portal CDRC Data in order that users can download the raw data they have seen mapped if they wish to undertake more in-depth analysis. This has proved very useful to analysts in local and national government, in addition to those in the commercial and third sectors, who lack the budget and skills to produce their own maps from complex data but also wish to explore relevant subsets of larger datasets.

10.2
Web mapping

The earliest web maps were created in 1993 and became more ubiquitous in the early 2000s when the growth of real-time geographic services such as mapping, routing and location-based advertising really took off, most notably in 2005 with Google’s release of Google Maps. First-generation applications provided only unidirectional flows of data and information from websites to their user bases. Over time, this system evolved into services that facilitate bi-directional collaboration between users and sites, the outcome of which is that information is collated and made available to others. The two main technologies that stimulated this development were Asynchronous JavaScript And XML (AJAX) and Application Programming Interfaces (APIs). AJAX enabled the development of websites that retain the look and feel of desktop applications, while APIs defined and documented consistent ways of accessing assets and tools created by other projects.

They have improved the usability of web mapping significantly by enabling direct manipulation of map data where user interactions (such as ‘click and drag’) are visualised instantaneously.

Early versions of web maps were detached from the underlying data used to create them since in all cases the developer was required to pre-render the maps before loading them onto the server, so users were given access to images only. As web browsers have become more powerful and base-mapping data has become freely available through initiatives such as OpenStreetMap and government open data platforms, for example the London Data Store, image tiles have been superseded by vector-based systems. These offer the key advantage that the maps are rendered on demand, that is, they are generated at the time they need to be viewed, from data served via a series of database requests. As we demonstrate below, this enables a much greater amount of flexibility both in terms of reported statistics and the cartographic representation. In addition, maps can act as platforms for data download to the point that they can now be thought of as data services rather than simply static representations of a single dataset.

10.2.1
CensusProfiler

One of the first comprehensive web maps of population data was constructed from the UK’s 2001 Census data and called CensusProfiler. It was one of the first to offer panning and zooming controls, a revolution in comparison to the pre-2005 standard of clicking around a map’s edge to visualise the ‘next page’. The user interface had three key layers: a basemap showing context such as roads and rivers, the 2001 Census data, and a number of moveable toolbars that controlled the data shown and colour palettes. In the simplest sense CensusProfiler was a series of choropleth maps. This style of mapping is widely used for demographic data and typically colours a statistical unit area according to the proportion of the population within it that has a particular attribute, for example the proportion of the working-age population that are in full-time employment.

Choropleth colour ramps are usually scaled from the lowest to the highest proportions across all the areas and may be evenly banded (stepped/graduated, i.e. discrete) or use a continuous ramp. Alternatively, other methods of banding may be applied, such as Jenks natural breaks (Jenks, 1967). To serve the choropleth map to the user it is partitioned into square images, or tiles, that are created only when needed on a server, following a request made by the user’s browser (the ‘client’). Because of the enormous number of possible combinations, each resulting in a unique map tile that could be viewed, it is essential to be able to create the map tiles ‘on demand’ in an efficient and timely manner, as opposed to pre-rendering these maps and storing all the tiles on a server. Therefore, CensusProfiler had a system that efficiently created custom-made maps. The website architecture also employed limited ‘caching’ of the most popular map views, to avoid repeated server-intensive spatial operations on the database and accelerate response time, but for the great majority of queries the tiles were created at the time of the query. This development was one of the key drivers behind the creation of CensusProfiler’s successor: the DataShine platform.

10.2.2
DataShine (datashine.org.uk)

DataShine visualises and provides access to the UK’s 2011 Census aggregated datasets; users can access and map nearly 2,000 variables across a quarter of a million statistical unit areas. It marks a significant advancement on the technologies deployed by its predecessor by enabling more advanced cartography and map customisation, rescaling of the data on demand and data download functions.

Figure 10.1
Impact of rescaling the colour ramp based on the local values, shown for metro use in north-west Birmingham. Top: Nationally scaled colour ramp. Bottom: Locally scaled colour ramp. The rescaling allows the variations in the low (relative to national use) but still significant (in local terms) usage to be viewed.
[Map labels: Wolverhampton, Birmingham.]

One of the most useful functions that 10.2.3


exploits the dynamic rendering of map CDRC Maps (maps.cdrc.ac.uk)
tiles from the underlying database is the
ability to recalculate the values for the DataShine has demonstrated the value
colour breaks used in the display of the of maps for facilitating data access and
data. This feature was designed to address sharing insights. These are core aims of
the challenge of showing change over local the CDRC and therefore a mapping platform
areas when global values have been used was considered a crucial aspect of the
in the colour binning calculation. For initiative. CDRC Maps features a range
example, a user may be in a region where of socioeconomic data for the United
a particular demographic has very low Kingdom, such as population density,
(or high) values compared to the national broadband speeds and relative deprivation
average but these become under/over levels. Currently around 50 maps are
saturated and show a single colour. available on the platform.
DataShine therefore has the option
to take the average percentage and the corresponding standard deviation using
only data from the area shown in the extent of the web browser. This can
result in the binning strategy changing for local areas that are significantly
divergent from global averages. For example, the popularity of London’s
Underground network, together with the city’s large population, means that,
for other cities with metros or trams, their usage is harder to pick out from
the census. So, in Birmingham, the Midland Metro can be hard to spot (see
Figure 10.1). Upon rescaling, just the local results are used when calculating
the average and standard deviation, allowing usage variations, in this case
along the route of the railway, to be more clearly seen. For transport
planners in Birmingham this results in a much more useful map.

DataShine is also a data service that enables users to download the data
behind the map in a CSV format for further analysis – something not possible
if the website had been built exclusively from image tiles. This gives users
the chance to source only the data they need without the arduous process of
navigating the myriad of large and complex tables provided as the standard
statistical release. In practice this has meant that students can download
census data for their local area and utilise it within seconds – a feature
that has become particularly popular with secondary schools and universities.

A layered tiling approach is taken with the creation and display of the maps
on the CDRC Maps platform. A label layer lies on top, along with an invisible
lower-resolution gridded vector layer that provides information about the
choropleth value, current statistical area ID and name, and other useful
information. A common mantra in web development is that every click required
to view a page results in a halving of the audience. It is therefore essential
to present geodemographic and choropleth maps as simply and as attractively as
possible, minimising the clicks needed to retrieve the data or view. The
JQuery framework and its JQueryUI extensions used in CDRC Maps achieve this.
Every choropleth map can be accessed in just three clicks. All other map
customisations are selected in a single click on the appropriate buttons.
Further buttons are available to jump to key cities in the UK.

As with its predecessor websites, the map itself is the dominant feature, with
user interface controls and additional data display occupying only a small
part of the design, and not always displayed. As CDRC Data displays data from
a range of sources, rather than just census data, a number of design
simplifications are necessary in order to retain a single user interface. The
available maps are split into three categories, with the corresponding key for
each map displayed in a different way, and a different range of metadata
values shown, for each category.
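
The local rescaling described earlier in this section can be sketched in a few
lines of Python. This is a minimal illustration rather than the DataShine or
CDRC Maps implementation: the function name, the (x, y, value) tuple layout
and the break multipliers are all assumptions made for the example. It
recomputes class breaks from the mean and standard deviation of only those
areas whose centroids fall inside the current map extent.

```python
from statistics import mean, pstdev


def local_breaks(areas, bbox, multipliers=(-1.5, -0.5, 0.5, 1.5)):
    """Return class breaks from the mean and SD of the areas inside bbox.

    areas: iterable of (x, y, value) tuples, one per statistical area
           (x, y being a centroid; the layout is hypothetical).
    bbox:  (min_x, min_y, max_x, max_y) of the visible map extent.
    multipliers: illustrative SD multiples, not the platform's settings.
    """
    min_x, min_y, max_x, max_y = bbox
    # Keep only the values for areas visible in the current extent.
    visible = [
        value
        for x, y, value in areas
        if min_x <= x <= max_x and min_y <= y <= max_y
    ]
    if not visible:
        raise ValueError("no areas fall inside the current map extent")
    mu = mean(visible)
    sigma = pstdev(visible)  # population SD of the visible areas
    return [mu + m * sigma for m in multipliers]


# Two low-usage areas near the origin and two high-usage areas far away.
areas = [(0, 0, 2.0), (1, 1, 4.0), (10, 10, 40.0), (11, 11, 50.0)]
local = local_breaks(areas, bbox=(-1, -1, 2, 2))
national = local_breaks(areas, bbox=(-1, -1, 12, 12))
```

With the whole dataset in view, the high values stretch the breaks so that the
two low-usage areas fall into the same class; restricting the extent to those
two areas yields breaks that separate them.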
146 CONSUMER DATA RESEARCH: PART THREE

There is the additional UI requirement for more context and information to be
shown around some of the maps. For example, some maps show composite
indicators that partition areas into specific categories that need further
explanation. This is provided in a series of pop-ups with links to more
detailed guidance. This, again, adds to the potential complexity of the user
interface that needs to be managed.

As part of broadening access to the CDRC’s data holdings, we were keen to
include a series of eye-catching and easily interpretable layers to CDRC Maps
to drive traffic to the platform from our user groups. To this end we feature
a series of single-metric maps, showing how the metric value varies from low
to high, typically using a fixed-hue colour ramp. We also show example
geodemographic maps, where areas are assigned a category based on the
clustering of multiple metrics. As the relationship between each category is
not normally directly quantifiable, qualitative colour palettes (changing hue
for each category) are typically used. Finally, a hybrid type of map is also
included. Known as ‘Top Metric Maps’, these show, for each area, the top
category for a single qualitative metric, for example the most common industry
type or the most popular mode used to travel to work. As the categories are
qualitative, hue-varying colour palettes are used. Whilst they have proved
popular with our users, we are aware that top metric maps have to be used
cautiously, as they may not be representative of the wider population in each
area, particularly if the category break-downs are not carefully calibrated.

The entire geostack used in CDRC Maps – namely Mapnik and PostGIS for data
storage and creation; and OpenLayers, JQuery and JQueryUI for display – is
open source, and the datasets mapped are generally themselves derived from, or
simple aggregations of, open data, with the data being available at CDRC Data,
the aforementioned complementary site to CDRC Maps that acts as a data
repository and viewer. CDRC Maps therefore both raises awareness of data and
facilitates access to it.

10.3
Developments in web mapping

Techniques for showing maps of data on the web continue to evolve rapidly, as
the geostack technology continues to be in active development. Technologists
are looking to pure-vector based maps, as client-computer browsers become more
sophisticated at rendering content themselves. However, the traditional raster
approach can still lead to rich and effective mapping that cannot yet easily
be replaced with a vector pathway. Digital cartographers are starting to
consider augmenting the basic approach of displaying the data with colour
variations, by incorporating other kinds of symbology, such as texturing,
which is still best served as raster tiles.

The following section considers some of the possible advances in this area. It
will first discuss prototypes that display levels of uncertainty in a dataset.
The second example is the use of colour compositing of multiple datasets, each
represented in different hues, on an automated, rule-based basis, to generate
new ways of looking at multivariate data. Finally, we consider an alternative
approach to the ‘building mask’ technique of DataShine and CDRC Maps
(discussed above), by using colour to emphasise the population’s location.

10.3.1
Uncertainty

Consumer datasets are often inputs into, or augmented by, indicators such as
geodemographic classifications. These are also subject to a quantifiable
degree of uncertainty that is rarely mapped, but that can have important
implications for analysis and interpretation. Here we take one such indicator,
the UK Output Area Classification (OAC), and demonstrate

how the uncertainty inherent to it can be mapped. Unlike its commercial
counterparts, this free-to-use classification benefits from an open-source
methodology that facilitates the calculation of a range of uncertainty
measures. Here we use the 2011 OAC developed by geographers at UCL in
collaboration with the UK’s Office for National Statistics.

It is now possible to apply textures to web-based choropleths. We try two such
approaches in this work: the first applies an image file of noise and the
second applies hatching. The level of distortion is controlled by an
uncertainty measure in the 2011 OAC known as the ‘standard equalised distance’
(SED) that offers an indication of how close to the centre of each cluster a
single output area (OA) falls. The smaller the SED, the more certain we can be
that the bulk of an OA’s population fits its assigned ‘supergroup’ (category).
The supergroup with the smallest SED, for each area, becomes that area’s
designated supergroup classification, but the SEDs to the other supergroups
are retained. Both the absolute SED values and the relative SED between the
‘winning’ (referred to as primary) and ‘runner-up’ (secondary, tertiary etc.)
supergroups are used when visualising the uncertainty of classification of
each area.

With each approach we offer an example of the insights it can provide into the
success, or otherwise, of 2011 OAC. Across the entire OAC 2011 supergroup
dataset, the OA average SED for the dominant supergroup is 0.913, with a
population standard deviation of 0.239. We apply a screen compositing
operation that lightens the supergroup colour in a spatially randomised way by
combining it with a ‘grain’ texture supplied by a source JPEG file. The grain
effect has an opacity set based on the absolute SED score, from 0 (i.e. no
compositing effect) for SED less than 0.6, increasing linearly to 1.0 (i.e.
compositing the texture fully) for SED greater than 2.4.

Figure 10.2
Left: Manchester’s urban core, showing sharp divisions. Right: Halifax (south
part of map) and Bradford (north part of map) where different demographics are
manifest as differences in how well the central zone is defined. Both maps are
aligned with north upwards.

On examining a version of the OAC 2011 map with the textured noise applied,
untextured area boundaries show strongly on the choropleth map while
boundaries between two areas both with a high SED are much less
distinguishable. The former case often occurs if there is a linear feature
that forms a physical barrier (e.g. a river or major highway) separating the
two areas. The identification of such transitions is aided by the use of
texture as well as

lightness. As coarseness increases with SED score the blocks of colour will
appear to fade whilst evidence of the original colour allocation – and
therefore geodemographic group – will remain. In addition to the visual impact
of fading and intensity, this increased coarseness also serves to highlight
poorer quality data. Two examples are shown in Figure 10.2. In Manchester’s
case, the sharp changes in colour to the west of the city centre suggest a
physical barrier, in this case the Manchester Ship Canal. Other areas, such as
the south and north, are less geographically constrained and so the
supergroups ‘merge’ into noise and only then switch to another colour. Halifax
and Bradford show contrasts despite having similarly sized ‘Cosmopolitan’
zones (red colours). Halifax’s zone is more sharply defined, with a more
obvious transition to other colours. The amount of texturing in this city is
relatively low, showing a good individual fit for many areas to a single
supergroup.

We add white diagonal stripes of various densities to show varying SEDs. We
use four different tileable images, ranging from ¼ density (i.e. 25% white
lines, 75% underlying classification colour) for SED greater than 1.6, to 1/32
density for SED greater than 1.0 but less than 1.2. SEDs less than 1.0 were
considered to be so good a match that it was not necessary to indicate any
uncertainty on the map for such areas. These SED thresholds can be changed to
increase or decrease the impact of uncertainty on the visualisation. Figure
10.3 demonstrates the impact that the different densities of stripes have on
the perception of the classification across a wide area – Scotland shows a
noticeably higher absolute SED for rural areas, resulting in more diagonal
stripes appearing once moving north across the border. The noticeable
transition across the border suggests a poorer fit to the idealised ‘rural
residents’ classification shown in green, in the Scottish context.

10.3.2
Multivariate choropleth displays

In addition to showing uncertainty it is also now possible to combine multiple
variables in a single map. The effect can be thought of as a simple visual
index with similar areas securing similar colours in the mixing process.

For example, combining red and blue layers, each showing a separate metric
about an area, produces a hue somewhere between blue and red, typically purple
if both metrics have large values, but tending towards one or the other
otherwise. This is already a popular approach to mapping election results –
particularly in the USA – where the maps can indicate the size of the winning
result for each area. The hue mixing effect can be easily achieved using
colour compositing operations that are available in some web mapping
frameworks, such as Mapnik, along with careful layering of the component
metrics.

This technique requires careful implementation since interpretation can
quickly become difficult, particularly as many people may not be familiar with
typical colour combinations, and because different colour compositing
operations (e.g. ‘darken’, ‘lighten’, ‘difference’) will produce different
results when combining the same two or more colours together. The problem can
be partly minimised by supplying an interactive key that adapts as different
hue-based layers are viewed, and that shows all the possible combinations of
colours and the underlying metric values.

Here we show the potential for this technique with a number of simple
socioeconomic variables, using composites of multiple hues that were created
as part of a prototype website. The technique proved to be powerful in showing
areas with similar characteristics across multiple variables (similar to more
sophisticated geodemographic cluster classifications); however the hue
variations were

continuous rather than discrete as in a classification. Basic user tests
suggested it was difficult to interpret the result, particularly since
creating a key that was flexible enough to show all combinations proved
problematic. The map therefore largely relies on the user knowing what hues
are formed by combining the component hues together and knowing the
compositing operation. The effect is shown in Figure 10.4, with a ‘lighten’
compositing operation and green, purple and red source hues. For all three
source hues, grey represents the low value and the full-intensity hue
represents the highest value. Roughly, high red and high green areas show as
yellow; high red and high purple areas show as magenta; and high purple and
high green areas show as turquoise. Combining all three together shows as
white, as the three source hues are well separated from each other on a
standard RGB colour wheel.

The approach highlights areas of similarity and difference and can therefore
aid in initial exploratory data analysis. Users can then access the underlying
data to undertake more conventional quantitative analyses as they wish.

Figure 10.3
Variations in SED across the border between Scotland and England, which runs
diagonally, approximately through the middle of the image from the bottom left
to top right.

10.3.3
Population density colour palettes

One final area of active development is exploration of the most effective
means of accounting for variations in population density. As discussed above,
choropleth maps of population-related statistics tend to fill areas of low and
high population density with equally strong, distinct colours, even though the
statistic is likely only relevant to where the underlying population is
actually located. Various techniques can correct this and focus the map on the
population location. CDRC Maps takes a clipping approach, where only areas
occupied by buildings receive the colour of the choropleth. Dot density maps
are

another technique; these assign population units to a random or weighted
location within each statistical unit area. However, these maps are
fundamentally different from choropleth maps and are more computationally
intensive to produce and not as easily interpretable since they introduce a
large degree of false precision.

Figure 10.4
Combining different hues to show relatively high values of unemployment
(green), South Asian ethnicity (purple) and deprivation (red) in central
London, alongside the River Thames. The three hue-based maps (above left) are
overlaid to show the three socioeconomic variables on the single map (above).

An alternative approach, assuming that the statistical variation in the
choropleth is shown by varying the hue and/or lightness, is to vary the other
colour variable (saturation in the case of the HSL colour space described
here). By fading the choropleth colour in sparsely populated areas, and
oversaturating it in areas of relatively high population density, the eye is
drawn naturally to the latter areas on the map, regardless of the other
properties of the colour shown. An appropriate key showing the hue or
lightness variations for different metric values, superimposed across a number
of different saturations, can help emphasise this. Care should be taken,
however, to ensure that large variations in saturation don’t dominate the
overall visual appearance of the map, at the expense of showing the variation
in the main population metric that is being mapped. This technique is used in
maps.cdrc.ac.uk/#/metrics/ruralurban/ where the hue shows the category of
settlement classification – the main metric being mapped – and lightness is
used to show the variation in population density.
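
The saturation-based approach described in this section can be illustrated
with a short Python sketch using the standard library’s colorsys module (which
names the colour space HLS and orders the components hue, lightness,
saturation). This is an assumption-laden example, not the CDRC Maps code: the
function name, the density thresholds and the saturation range are
hypothetical, and a simple clamped linear ramp is assumed.

```python
import colorsys


def density_adjusted_colour(rgb, density, low=1.0, high=50.0,
                            min_sat=0.15, max_sat=1.0):
    """Fade or saturate a choropleth colour according to population density.

    rgb:     (r, g, b) floats in 0..1, the colour chosen for the metric.
    density: population density for the area (units are up to the caller).
    low, high, min_sat, max_sat: illustrative values, not platform settings.
    """
    r, g, b = rgb
    hue, lightness, _ = colorsys.rgb_to_hls(r, g, b)
    # Position of this area on the density ramp, clamped to 0..1.
    t = (density - low) / (high - low)
    t = max(0.0, min(1.0, t))
    # Hue and lightness carry the metric; only saturation is replaced.
    saturation = min_sat + t * (max_sat - min_sat)
    return colorsys.hls_to_rgb(hue, lightness, saturation)


base = (0.8, 0.2, 0.2)  # a mid-lightness red
sparse = density_adjusted_colour(base, density=0.0)
dense = density_adjusted_colour(base, density=100.0)
```

Because hue and lightness are left untouched, the key for the mapped metric
remains valid at every saturation level; only the visual weight of each area
changes.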

10.4
Conclusion

This chapter has demonstrated the progress made in using web maps to both
communicate and provide access to large and complex datasets. As these data
become more prevalent, and arguably more complex, thanks to their generation
in both the public and private sectors, so too must the maps continue to
develop in order to keep pace. The chapter discusses a range of possible
development areas, in particular for the visualisation of uncertainty and
multiple variables. There is more work to do, but in the coming years further
important technological innovations can be expected.

Further Reading

Jenks, G. F. (1967). The Data Model Concept in Statistical Mapping.
International Yearbook of Cartography 7: 186–190.
11

Geotemporal Twitter
Demographics
Alistair Leak and Guy Lansley

11.1
Introduction

The study and application of demographic data are widespread in both industry
and academia, with applications ranging from demographic profiling to supply
and demand modelling. Yet, while there exist numerous applications, in recent
years demographics has seen few major developments. This may in part reflect
continuing dependence on traditional population data, such as the UK Census of
Population. Given these limitations, there is an increased interest in the
potential of new forms of data such as are collected from online social
networks or by utilities companies. Here, we demonstrate how, given sufficient
consideration, Twitter data may be employed as an effective source of
population insight.

Since its inception, geodemographics has evolved at a rapid pace, seeking to
describe the population in ever increasing levels of detail. While the
earliest classifications, such as those of Charles Booth, depicted the
population on a single ordinal scale, classifications have increasingly sought
to split the population into ever more precise groups. In the UK, the ability
to perform such nuanced classification has been facilitated by the regular
collection of detailed population data which are made available in the public
domain. A prime example of such data is the UK Census of Population. Collected
on a decennial basis, the UK Census presents a snapshot of the UK population
across a broad range of themes including employment, education and
demographics. Using such data, analysts are able to partition the population
into parsimonious groups which exhibit generally homogeneous characteristics.
Such classifications are, however, clearly not without significant
limitations. These limitations include the regular adoption of a ‘one size
fits all applications’ methodology, the focus on residential setting, the
reliance upon infrequently published datasets and the ‘black box’ nature of
many commercial products.

This said, it may be argued that progress has been made in the commercial
sectors, with private entities such as CACI Ltd and Experian Ltd incorporating
multiple novel data sources into their classifications. As with any commercial
classification, there are issues for academic users that centre upon
transparency or reproducibility, as well as licensing and access arrangements.

The advent of social media and other new forms of data is now bringing focus
to the creation of new demographic insight, and indeed this is a central
motivation of the Consumer Data Research Centre (CDRC). The term ‘new forms of
data’ is particularly broad and includes data ranging from those that are
gleaned from social media, data that are published as open-data and data that
are generated by consumer-facing organisations. In each case, the data offer a
novel gateway through which human behaviour may be observed. However, such
data are often beset with their own unique limitations. The OECD (2013) report
‘New Data for Understanding the Human Condition’ cites accessibility,
provenance, permanence, comparability, legality, ethics, linkage and data
structure as nascent concerns. On top of this, one must also be conscious of
the representativeness of such data versus the population for which they are
to be employed.

In this chapter, we focus on geo-tagged Tweets harvested in real-time via the
Twitter online social network. Launched in 2006, Twitter has grown rapidly
with an estimated 328 million active users.1 Twitter enables users to post
short 140-character messages which may contain URLs, pictures and personal or
topical tags. Users are able to ‘follow’ the accounts of others as in most
social networks, although there is no requirement for reciprocal connections.
Further, where users so choose, a location may be recorded in the form of
latitude and longitude. Beyond providing a social platform, Twitter makes user
data generated by its service available to third parties via an Application
Programming Interface (API). The API provides both free and paid-for options,
though, for many applications, the free

service is often sufficient. The availability of such data and the ease with
which they may be accessed have led to an explosion in the publication of
academic literature demonstrating the potential insight that may be drawn, on
themes ranging from crime and security to health and mobility. However, while
significant volumes of literature have been published touting the potential of
Twitter data, it is often the case that only lip service is paid to the
limitations of the data source – specifically with respect to the demographic
of those individuals who are users of the service versus those of the
population at large. Unlike traditional demographic data, those individuals
who are users of Twitter are a self-selecting sample which is further limited
to those who choose to disclose their location. While anecdotal evidence
exists regarding the demographic bias present, there are limited means by
which said bias may be quantified. This is because users of the service are
not explicitly required to provide any identifying information as part of the
registration process.

Figure 11.1
Map showing a random sample of 1 million geo-tagged Tweets collected between
December 2012 and January 2014. Each Tweet is depicted by a single blue point.

Given this lack of demographic specificity, it is necessary that key markers
are modelled such that they may be assessed against existing data or be
employed in the study of demographics. In seeking to achieve this goal,
individual screen names may be employed in the inference of key

demographic markers whilst the location information encoded within Tweets may
be used to infer individuals’ places of residence, work and travel. In the
remainder of this chapter, we discuss how key demographic markers may be
inferred based on the novel analysis of individuals’ personal names, and how
such insight may facilitate the establishment of the data’s representativeness
versus the usually resident population of the United Kingdom. The latter half
of the chapter will develop a case study demonstrating the potential of
demographically attributed Twitter data for the observation of stocks and
flows of human populations, showcasing both the recreation of traditional
demographic insight and also various new insight facilitated by the data’s
rich attribution.

11.2
Data

For this analysis, the dataset employed is a corpus of geo-tagged Tweets
composed of 1.4 billion unique messages submitted between December 2012 and
January 2014. Illustrated in Figure 11.1, the data are global in coverage,
though, as may be observed, are clearly not consistent with the global
distribution of population. The data were harvested in real-time using the
Twitter Filtered Streaming API and stored within a PostgreSQL database. It
should be noted that the free Streaming API is limited such that only data
equivalent to 1% of the total throughput at any given time may be collected.
While this may initially appear a limiting factor, in practice only around 1%
of Tweets are attributed with location. Thus, it may be presumed that the
majority of geo-tagged Tweets are successfully obtained. The
representativeness of the sample stream versus the full stream, referred to as
the ‘Firehose’, has previously been established by Morstatter et al. (2013).
While Morstatter et al. confirmed the completeness of the geo-tagged Tweets,
they also found that the non-spatial sample was not representative of the
equivalent data from the Firehose.

11.2.1
Data enrichment

As previously noted, the raw Twitter data are devoid of any demographic
markers, necessitating that such attribution be modelled. The key to such
inference is the assumption that individuals’ personal names are a statement
of aspects of their identities. Not only does an individual’s name provide a
means of identification; analysed in a suitable manner, it may also provide an
indicator of gender, age and cultural, ethnic and linguistic identity. Such an
approach is in effect the automation of the human process of social
perception. Before these techniques may be applied, it is necessary that
individuals’ personal names are extracted. By default, users’ screen names are
a single character string with no defined structure. Individuals’ name tokens
are extracted based on the Western naming order in which forenames typically
precede surnames. Individuals’ forenames are employed in the inference of
their ages and genders using a forenames database built from birth certificate
records and consumer data (see Lansley and Longley, 2016a). Further,
individuals’ full names are processed using the Onomap (www.onomap.org)
classification tool for the inference of their ethnicities. Note that prior to
the application of the various heuristics, it is necessary that the
nationality of those users being analysed is determined. While the association
between names, genders and ethnicities is relatively stable, the association
between names and ages tends to exhibit national tendencies. We determine
individuals’ nationalities and regions of residence based on the analysis of
the location inherent in each user’s Tweets. The rule applied to assigning
nationality is that a user must have 50% of their total Tweets and five or
more Tweets in the area to which they are assigned. The key premise in such an
analysis is that an individual will tweet most frequently within the region
and country in which they are likely resident.
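
The assignment rule just described can be sketched as follows. This is an
illustrative Python fragment, not the authors’ pipeline: the function name and
the area codes are hypothetical, and the rule is read here as requiring at
least 50% of a user’s geo-tagged Tweets and at least five Tweets in the
candidate area.

```python
from collections import Counter


def assign_area(tweet_areas, min_share=0.5, min_tweets=5):
    """Assign a user to an area, or return None if the rule is not met.

    tweet_areas: list of area codes (country or region), one per geo-tagged
                 Tweet posted by a single user.
    min_share:   minimum share of the user's Tweets in the area (assumed to
                 mean "at least 50%").
    min_tweets:  minimum number of Tweets in the area.
    """
    if not tweet_areas:
        return None
    # The most frequently observed area is the only possible candidate.
    # If two areas tie on exactly half of the Tweets, the first one
    # encountered is returned.
    area, count = Counter(tweet_areas).most_common(1)[0]
    if count >= min_tweets and count / len(tweet_areas) >= min_share:
        return area
    return None


resident = assign_area(["GB"] * 6 + ["FR"] * 2)            # meets both tests
too_few = assign_area(["GB"] * 4)                          # under five Tweets
too_spread = assign_area(["GB"] * 5 + ["FR"] * 3 + ["DE"] * 3)  # share < 50%
```

Requiring both conditions guards against two failure modes: very occasional
users, for whom a handful of Tweets says little about residence, and frequent
travellers, whose Tweets are spread across several areas.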

In the UK, this approach results in 273,000 Twitter users being identified as
being residents.

11.3
Benchmarking

Prior to performing any analysis, it is imperative that the nature of the
sample being studied is understood. Such consideration, given existing
anecdotal evidence, suggests that demographic bias in the Twitter data will be
particularly important where the phenomenon being studied bears an
identifiable correspondence with age, gender or ethnicity. For example, in the
UK, political views are known to relate to age with younger people having a
greater affinity to Labour and those who are older leaning towards the
Conservative party. Given that Twitter users are believed to generally be
younger, this may lead to a left-wing bias being present in any data being
analysed. Failure to account for such bias may adversely affect interpretation
and, by extension, the conclusions drawn. In possession of demographically
attributed data, however, it is possible to assess the data’s
representativeness. In the following, benchmarking of age, gender, ethnicity
and geographic distribution is reported.

For the purpose of benchmarking, two reference datasets are employed: the 2013
Consumer Register produced by CACI Ltd and the 2011 UK Census of Population.
The Consumer Register is an augmented version of the publicly available
Electoral Register which substitutes names from other commercial sources.

Figure 11.2
Population pyramid of Twitter users in the UK versus the equivalent Office for
National Statistics data for 2011. The ONS data are illustrated in grey.
Vertical axis: age (years), in five-year bands from 10–14 to 85 plus.
Horizontal axis: proportion, from 0 to 0.075 on each side. Series: gender
(female, male).

11.3.1
Age and gender

As a first step in understanding the demographic composition of users, a
comparison is performed between the age and gender of Twitter users as
determined by the forenames database (as described in Lansley and Longley,
2016a) and the equivalent data from the 2011 UK Census of Population.

The population pyramid shown in Figure 11.2 confirms the anecdotal belief that
Twitter is predominantly used by a younger proportion of the population.
However, having differentiated the data by gender it is evident that
differences exist. While female users are more prevalent in the 10 to 19
bands, males become increasingly dominant beyond this age. Beyond the 20–24
age bracket, the proportion of male users increases significantly, suggesting
that it is older males who have chosen to adopt the platform.

11.3.2
Ethnicity

As with age and gender, it is well recognised that ethnicity may play a role
in individuals’ social attitudes, health and wellbeing. In seeking to quantify
the degree to which each ethnic group is represented, we applied the Onomap
tool to both the UK Twitter population inventory and likewise to the 2013
Consumer Register. The decision to benchmark against the Consumer Register as
opposed to the 2011 Census was designed to minimise the impact of
bias/uncertainty in ethnic classification that may be manifest within the
Onomap classification tool.

Table 11.1 presents a comparison between the ethnic composition of Twitter and
the Consumer Register as estimated by Onomap. This highlights population
segments in which Twitter users are likely to be more or less well represented
relative to the usual resident population of the UK. Clearly evident is that
the combined White Group is over-represented whilst the Asian and Black groups
are systematically under-represented. The Mixed group (arguably the hardest to
identify using our chosen data classification techniques) is the most
under-represented. Thus, in seeking to draw general inference regarding
population behaviour based on Twitter it must be recognised that the minority
groups are likely to be under-represented within the sample.

Table 11.1
Ethnicity breakdown comparison between the UK Twitter population and the 2013
Consumer Register. The quotient indicates the relative difference between the
expected and observed ethnicity proportions.

Ethnicity Group                                 Twitter %  Consumer Register %  Quotient
White – All – Gypsy-Traveller – Irish Traveller     93.36                87.20      1.07
Asian – Asian British – Indian                       1.36                 2.30      0.59
Asian – Asian British – Pakistani                    1.11                 1.90      0.58
Black – African – Caribbean – Black British          0.75                 3.00      0.25
Asian – Asian British – Other Asian                  0.77                 1.40      0.55
Asian – Asian British – Bangladeshi                  0.23                 0.70      0.33
Asian – Asian British – Chinese                      0.54                 0.70      0.77
Mixed – Multiple Ethnic Groups                     0.0006                 2.00      0.00
Other Ethnic Groups                                  1.88                 0.90      2.09

11.3.3
Geographic distribution

Geographic distribution is examined using the Location Quotient (LQ) measure.
The LQ may be considered as the quotient of the proportion of Twitter users in
a specific geographic area versus the corresponding

proportion of the population observed to be resident in the same region. A
value of 1.0 indicates the expected proportion of the population. A value of
<1.0 indicates fewer Twitter users than expected and a value of >1.0 indicates
a greater than expected volume of Twitter users.
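
As a worked illustration of the measure, the Python sketch below computes an
LQ for each area from two sets of counts. The function name, the dictionary
layout and the example figures are hypothetical and are not drawn from the
chapter’s data; areas with no recorded residents are simply skipped.

```python
def location_quotients(twitter_counts, resident_counts):
    """Return {area: LQ} from per-area Twitter user and resident counts.

    LQ = (area share of all Twitter users) / (area share of all residents).
    Both arguments are dicts keyed by area code (a hypothetical layout).
    """
    total_twitter = sum(twitter_counts.values())
    total_residents = sum(resident_counts.values())
    if total_twitter == 0 or total_residents == 0:
        raise ValueError("both populations must be non-empty")
    quotients = {}
    for area, residents in resident_counts.items():
        if residents == 0:
            continue  # the quotient is undefined without a resident base
        twitter_share = twitter_counts.get(area, 0) / total_twitter
        resident_share = residents / total_residents
        quotients[area] = twitter_share / resident_share
    return quotients


# Two equally populated areas, one holding three quarters of the users.
lq = location_quotients({"A": 30, "B": 10}, {"A": 500, "B": 500})
```

Here area A receives an LQ of 1.5 (more Twitter users than expected) and area
B an LQ of 0.5 (fewer than expected). The Quotient column of Table 11.1 is the
same kind of ratio applied to ethnic groups rather than areas, for example
1.36 / 2.30 ≈ 0.59 for the Indian group.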

Figure 11.3 illustrates the difference in the observed versus expected Twitter
population at local authority and district level in the UK. Clearly evident is
a south-to-north progression, with Scotland exhibiting the highest proportion
of Twitter users relative to the usually resident population. A second trend
is the high proportions in areas which have large student populations. An
example of such an area is Swansea in the south of Wales. While not discussed
here, it must be recognised that the uptake of Twitter varies on a global
scale within and between countries.

Figure 11.3
Location Quotient map showing the geographic distribution of Twitter users in
the UK versus the resident population.

11.3.4
Demographic summary

Through the application of a range of novel data mining techniques the data
collected via Twitter have been significantly enriched. In possession of such
knowledge, it becomes increasingly possible that Twitter may be employed in
the observation and modelling of the stocks and flows of population. Two key
considerations must be recognised: first, the importance of performing
analysis in a data-rich environment and, second, that suitable consideration
must be given to the unit and scale of analysis. Concerning the first point,
where possible, all data by users within the study areas should be sourced.
Such data facilitate the identification of critical attribution concerning
nationality and region of residence. Concerning the second point, it is
important to consider whether one wishes to analyse either the Tweet or the
user. For example, when analysing sentiment, we clearly want to examine
individuals’ Tweets in time and space. Conversely, when performing an
assessment of demographic structure, we need to focus on individual users. Failure to consider the appropriate unit of analysis can result in invalid results or distorted insight.

11.4
Application

Having established a means by which the representativeness of Twitter data may be ascertained, the focus is shifted to that of applications. Applications are considered here in two parts: first, the recreation of conventional population insight using the demographically attributed Twitter data and second, new insight not previously possible. We demonstrate the identification of airport catchments based on the analysis of Tweets and also the potential of text-based mining as a means to draw out previously unobtainable insight. In both cases, the analysis is performed using data collected from London’s Heathrow Airport. The largest of London’s six airports, Heathrow handled 72 million passengers in 2013. There are various motivations for understanding the behaviour of those individuals travelling through the airport. At the national scale there is a desire to understand the functional catchments of the airport and at the local level there is a need to understand how people move within the airport complex. Traditionally, such analysis has been performed using a selection of manual counting and survey techniques; however, such techniques are often expensive or laborious. Further, much of the data collected are unavailable in the public domain. Here, we demonstrate how such insight may be generated at zero cost in a manner that may be readily applied in a range of other contexts.

Figure 11.4
Map showing the LQ of residential locations of those UK-based Twitter users observed within the Heathrow extent.

11.4.1
Airport catchments

Typically, an airport catchment comprises the region from which the majority of domestic travellers originate. Various approaches exist to the identification of such regions, ranging from the collection of passenger surveys to the modelling of activity based on probable travel time. In the first instance, there is a requirement to perform large-scale surveys; in the second, the analysis is based on assumption rather than observation of existing behaviour. Here, we seek to replicate the approach based on passenger surveys; however, we recreate the analysis using solely spatial Tweets by those individuals recorded within the Heathrow Airport extent. We identify the probable residential location of the UK users that use the Twitter service within the airport.

Figure 11.4 highlights the functional catchment of Heathrow Airport based solely on those Twitter users identified within the Heathrow extent. Around Heathrow a clear pattern may be observed radiating out from the airport. While shown for Heathrow, it is important in such analysis to consider the interaction with other airports; this could potentially be examined. Such approaches are not without their own limitations. Of particular note is the challenge of differentiating between those individuals who are employed within the airport and those who are travelling. However, while these limitations exist, the benefits of such an approach far exceed the drawbacks.

11.4.2
Text mining

While the first application demonstrated the recreation of traditional demographic analysis, this exploits just some of the Twitter data’s facets. Beyond space and time, a key feature of the data is the availability of rich textual content. It is possible to harness useful spatio-temporal information from the contents of Twitter posts on a very large scale using text mining techniques. Spatial and temporal trends in what is tweeted about within a given location could also be informative for local service planners and the marketing industry. In the context of Heathrow, the quantitative analysis of social media data could be used to track complaints and disruptions, and harvest users’ interests to improve their service delivery and dynamic advertising portfolios. Following a methodology outlined in Lansley and Longley (2016b), we will use an unsupervised model to segment the textual data into distinctive topics and observe key trends.

The content of Twitter data is notoriously difficult to quantify due to the short length of documents and use of informal language (Andrienko et al, 2013). Consequently, there have been relatively few attempts to generate the typical trends in social media usage for a given time and location. Most textual work has focused on unusual events, often utilising given lists of key words or methods that detect anomalies (e.g. Chae et al, 2012). However, as demonstrated above and in our own previous research, with appropriate data cleaning techniques it is possible to identify ‘typical’ topics from large samples of Tweets (Lansley and Longley, 2016b). Lansley and Longley modelled Tweets from an extensive area with the intention of identifying the ‘typical’ geography of topics across the given urban extent, in that case, Inner London. Each of the topics was distinctive in terms of content, and most also conveyed temporal or spatial patterns across the city; often these trends could be linked to known activities. However, this study will focus exclusively on Tweets from Heathrow Airport. Therefore, the popular topics devised by the model will be particular to the individuals passing through Heathrow and their activities.

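As an illustration of the kind of data cleaning referred to above, the sketch below applies the steps used for the Heathrow sample: removing duplicated Tweets, stop words, punctuation and non-Latin characters, and dropping Tweets of fewer than three words. It is a simplified sketch under stated assumptions, not the authors' pipeline: the stop-word list is a tiny hypothetical one, only unaccented Latin letters are kept, the three-word threshold is applied after stop-word removal, and the function name `clean_tweets` is invented for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption; the chapter does not say
# which list was used).
STOP_WORDS = {"a", "an", "the", "to", "at", "in", "is", "of", "and", "for", "on", "my", "i"}


def clean_tweets(tweets, min_words=3):
    """Return one list of tokens per retained Tweet."""
    seen = set()
    cleaned = []
    for text in tweets:
        key = text.strip().lower()
        if key in seen:
            continue  # drop duplicated Tweets
        seen.add(key)
        # Keep runs of unaccented Latin letters only; this discards punctuation,
        # digits and non-Latin characters (a simplification).
        tokens = [t for t in re.findall(r"[a-z]+", key) if t not in STOP_WORDS]
        if len(tokens) < min_words:
            continue  # drop very short Tweets
        cleaned.append(tokens)
    return cleaned


# Invented example messages.
sample = [
    "Stuck in the queue at security again!!",
    "Stuck in the queue at security again!!",
    "Привет Лондон",
    "Bye England",
    "Coffee and bacon roll in the lounge",
]
cleaned_sample = clean_tweets(sample)
```

Of the five invented messages, only the first and the last survive: the second is a duplicate, the third contains no Latin characters and the fourth is too short.
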
Table 11.2
Frequencies of Tweets by topic as determined in the LDA analysis.

      Topic            Frequency   Percentage
  1   Destinations         3,547         8.03
  2   Anger                5,183        11.73
  3   Thoughts             3,597         8.14
  4   Anticipation         5,031        11.39
  5   Conversations        3,951         8.94
  6   England              5,792        13.11
  7   Travel               5,779        13.08
  8   Consumers            2,975         6.73
  9   Media                3,021         6.84
 10   Other                5,312        12.02
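
Table 11.2 counts each Tweet once, under its single most probable topic. The sketch below shows one way such an allocation and tabulation could be done, assuming a hypothetical `word_scores` lookup that holds one score per topic for each word; the function names and the toy scores are invented for illustration and are not output from the authors' LDA model. The final lines recompute the percentage column from the frequencies printed in the table.

```python
from collections import Counter


def most_probable_topic(tokens, word_scores, n_topics):
    """Sum per-topic word scores over a Tweet's tokens; return the best topic index."""
    totals = [0.0] * n_topics
    for token in tokens:
        for k, score in enumerate(word_scores.get(token, [0.0] * n_topics)):
            totals[k] += score
    return max(range(n_topics), key=lambda k: totals[k])


def topic_table(assignments):
    """Return {topic: (frequency, percentage)} with percentages to two decimals."""
    counts = Counter(assignments)
    n = len(assignments)
    return {k: (c, round(100 * c / n, 2)) for k, c in sorted(counts.items())}


# Toy scores for two topics (index 0 = "Travel", index 1 = "Consumers"); invented values.
word_scores = {
    "queue": [0.9, 0.1],
    "delay": [0.8, 0.2],
    "coffee": [0.1, 0.9],
    "lounge": [0.4, 0.6],
}
toy_tweets = [["queue", "delay"], ["coffee", "lounge"], ["queue", "lounge", "coffee"]]
assignments = [most_probable_topic(t, word_scores, n_topics=2) for t in toy_tweets]
toy_table = topic_table(assignments)

# Frequencies as printed in Table 11.2, in topic order 1 to 10.
table_11_2 = [3547, 5183, 3597, 5031, 3951, 5792, 5779, 2975, 3021, 5312]
percentages = [round(100 * f / sum(table_11_2), 2) for f in table_11_2]
```

Words missing from the lookup contribute nothing to any topic, which is a design choice of this sketch; a fitted model would supply a score for every word in its vocabulary.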

Using the same sample of Tweets from Heathrow as described in the preceding sections, the Tweet messages were first cleaned in order to ensure the topic model returned valid and coherent segmentations of documents. We removed duplicated Tweets under the assumption that these rarely reflect original content and are often automatically generated messages. We then removed stop words, punctuation and non-Latin characters. We also removed very short Tweets with fewer than three words, as it would be difficult to generate significant topics from very short documents. It was also necessary to remove all Tweets with uncertain coordinates. Following the data cleaning, 48,188 of the original 56,417 Tweets from Heathrow were deemed to be appropriate for topic modelling. We used Latent Dirichlet Allocation (LDA) to generate probabilistic topics from the Twitter documents. LDA creates semantic groups from large collections of documents (Blei et al, 2003). Each word (or value) is assigned a score for each topic. It is, therefore, possible to view the topics for each Tweet by looking up the scores for each of its words.

Although LDA is an unsupervised model, the researcher does need to select the number of topics (k) to be generated (see Table 11.2). As this study is intended to demonstrate the utility of topic modelling Tweets, we have produced just 10 groups. Of course, a higher number would allow more specific and intricate topics to be generated, although these would represent fewer Tweets. The results of the model were then appended to each Tweet in the data so we could detect trends. In this case, we simply allocated each Tweet to its most probable topic based on its total probability scores. The group sizes are relatively well balanced; the smallest group represents over 6.7% of all the Tweets in our final sample.

The topics are presented as a word cloud in Figure 11.5. Only the most common words from the Tweets are shown, which have been partitioned by the topics they are most commonly found in. Labels have been given to each of the topics and were manually derived from interpreting the top words and via observing a random selection of Tweets. Each of the topics is sufficiently distinctive. Whilst a small number of topics are perhaps typical of general social media usage in the UK, many of the topics are probably particular to airport activities. The group labelled Travel predominantly describes travelling. These messages range from complaints about queues and delays within the airport to flights and transport links to central London. The England topic contains many messages about people saying their farewells to the country, notably including complaints about the English weather. The group also includes comments from those who have arrived at the airport, both
tourists looking forward to their holiday and people happy to return home. The topic we labelled Consumers largely represents comments about retail and hospitality services; it is probable that a substantial proportion of these messages refer to purchases within the departure lounges. Lastly, the Other group comprises Tweets that could not be allocated into the other categories; almost all of the Tweets from this group are written in foreign languages. With more data, it would be viable to split these groups into unique subgroups.

Figure 11.5
Comparative word cloud illustrating the most common terms observed in each of the 10 groups.
[Word cloud image. Topic labels shown: Destinations, Anger, Thoughts, Anticipation, Conversations, England, Travel, Consumers, Media, Other.]

It is also possible that each of the groups has distinctive temporal and spatial patterns, and these may be more pronounced in groups that describe particular activities that are usually restricted to particular routines or locations. For example, the Tweets in the Consumers topic are more likely to be found in locations across Heathrow where the services are provided. Figure 11.6 illustrates the distribution of the Consumers Tweets (red) and the rest of the Tweets from the sample (blue). It can be observed that Tweets from the Consumers topic are very densely concentrated in a handful of locations across the site. These are departure and arrivals lounges where retail and catering facilities are located, which correspond with many of the activities discussed.

The classification we present here is fairly rudimentary as it was built with a relatively small number of Tweets and was restricted to just 10 topics. However, it is a valid demonstration of the types of trends that can be determined from unstructured
social media data. It is also inferred that what is spoken about on social media can often be associated with activities that may have distinctive geographies. Therefore, it is possible that harnessing Twitter data can be useful in identifying activity trends across time and space. These can range from unusual delays to feedback on retail offerings. Looking forwards, analysts can use the probability scores for our database of words to model future data into the pre-existing topic categories in order to identify trends on the fly.

Figure 11.6
Map of Heathrow Airport showing the distribution of Tweets relating to consumer activity (red) versus all Tweets submitted within the airport extent. Basemap supplied by Stamen.

11.5
Conclusion

At the outset, the objective of this chapter was to showcase the potential of Twitter data in the study of demography and highlight key considerations for the effective use of such data as part of the analysis process. It has been demonstrated how, through the novel analysis of personal names, key identity markers may be inferred and, subsequently, how such data may facilitate the assessment of representativeness. Using said data, it has been possible to estimate the degree to which Twitter is representative of the UK’s residential population.

Beyond illustrating potential techniques for data enrichment, we have sought to showcase the potential of Twitter in the study of demography. Using the example of London Heathrow Airport, it has been shown how one may recreate conventional forms of analysis and, further, how the richness of the data may be exploited
to extract levels of insight that previously would have been unobtainable. In particular, we have demonstrated the use of text mining as a means to exploit the textual content of the Tweets. In appreciating the above, it must be recognised that the methods employed are both repeatable and transferable. Further, unlike many other new forms of data, the ease with which the raw data may be sourced makes Twitter data the ideal starting point for the development and presentation of novel population insight.

However, for the effective use of Twitter data in the generation of actionable population insight it is necessary that the analyst remains conscious of the limitations manifest within the data. In particular, one must consider the population for whom the data are representative and, arguably more importantly, for whom they are not. We may thus conclude on a positive note. While the demographic insight generated from Twitter is by no means a perfect replacement for conventional data and methods, it does provide an exciting insight into the future of demography and population studies.

Acknowledgements

The first author would like to thank the Defence Science and Technology Laboratory (DSTL) for supporting this research (DSTL Grant No. 12/13NatPhD_61).

Note

1. An active user is considered as an individual who has used the service at least once in the preceding month. Statistic accessed via Statista for Q1 2017.

Further Reading

Andrienko, G., Andrienko, N., Bosch, H., Ertl, T., Fuchs, G., Jankowski, P. and Thom, D. (2013). Thematic patterns in georeferenced tweets through space-time visual analytics. Computing in Science & Engineering, 15(3), 72-82.

Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

Chae, J., Thom, D., Bosch, H., Jang, Y., Maciejewski, R., Ebert, D. S. and Ertl, T. (2012). Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition. In Visual Analytics Science and Technology (VAST), 2012 IEEE Conference on (pp. 143-152). IEEE.

Global Science Forum (2013). New Data for Understanding the Human Condition: International Perspectives: OECD Global Science Forum Report on Data and Research Infrastructure for the Social Sciences.

Lansley, G. and Longley, P. (2016a). Deriving age and gender from forenames for consumer analytics. Journal of Retailing and Consumer Services, 30, 271-278.

Lansley, G. and Longley, P. (2016b). The geography of Twitter topics in London. Computers, Environment and Urban Systems, 58, 85-96.

Morstatter, F., Pfeffer, J., Liu, H. and Carley, K. M. (2013). Is the sample good enough? Comparing data from Twitter’s Streaming API with Twitter’s Firehose. In Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013.

12
Developing Indicators for Measuring Health-Related Features of Neighbourhoods
Konstantinos Daras, Alec Davies, Mark A Green and Alex Singleton

12.1
Introduction

Geographical inequalities in health outcomes have long been observed. For example, in 1842 social reformer Edwin Chadwick identified that male life expectancy of labourers (i.e. low occupational group) in Rutland (38) was higher than that of professional tradesmen (high occupational group) in Liverpool (35). Fast forward 175 years and male life expectancy in Rutland is 81.4 compared to 76.4 in Liverpool. These spatial patterns piqued interest in the extent to which living in particular locations influences our health. The differences that Chadwick observed were due to urban-rural disparities, such as the presence of slum housing, outbreaks of infectious diseases and overall air quality. Today, many of these issues are no longer present and the environments we live in are very different. Therefore, it is important to be able to measure certain features of neighbourhoods and to be able to answer questions about whether they are important for our health or not.

This study details the creation of a series of national, open source, low-level geographical measures of accessibility to health-related features of the environment. There are three main domains across the indicators: the retail environment, the provision of health care and the physical environment.

12.2
Retail environment

Unhealthy foods, smoking and alcohol misuse represent important determinants of ill health. A shared feature across each of these very different harms is that they each require purchasing from retail outlets. As such, our opportunities to consume such items may be shaped by the built environment surrounding us. Accessibility to retail outlets selling these products is therefore of interest for understanding whether (and how much) they influence our consumption of such items.

We developed indicators of accessibility to ‘fast food outlets’, ‘pubs, bars and nightclubs’, ‘off-licences’ and ‘tobacconists’. Fast food outlets typically sell foods that are energy dense and nutritionally poor, and the consumption of such foods is associated with increased risk of obesity. We have two measures of accessibility to alcohol. Pubs, bars and nightclubs represent outlets that sell alcohol on-trade (i.e. alcohol is purchased and consumed on site) and off-licences are stores that primarily sell alcohol as off-trade (i.e. alcohol is purchased on site but consumed off site). These have different harms, with access to on-trade outlets typically associated with acute alcohol-related harms and off-trade outlets with chronic harms. Finally, tobacconists are specialist stores which sell primarily tobacco-related products such as cigarettes, cigars and loose form tobacco.

We also include access to ‘gambling outlets’. These represent slightly different harms compared to our other indicators. Gambling outlets represent the potential for economic losses which are indirectly related to health. Their use has also been shown to be associated with poorer mental health.

Local governments in the UK (and beyond) have sought to regulate aspects of the built environment in attempts to address the access and supply of unhealthy amenities. Planning regulations have been introduced to limit the density of fast food outlets, pubs/bars and off-licences. There is also similar interest in reducing access to gambling outlets. Therefore, the interest in such metrics is not purely academic, but has important policy relevance.

12.3
Provision of health care

Health care services provide important point of care amenities for the diagnosis, treatment and maintenance of health. One of the founding principles of the UK’s National Health Service (NHS) was that the provision of these services should be made available to all. One area of interest in implementing such a policy has been the equitable access to services with the aim of minimising geographical barriers. However, health services are not always equally spread throughout the population and there has been considerable interest in whether geographical barriers prevent the utilisation of such services.

We include measures of accessibility to features of primary and secondary health care. These include ‘General Practices (GPs)’ which are the first point of care for patients, ‘pharmacies’ which sell medicines, ‘dentists’ which provide oral health care, and finally ‘hospitals with accident and emergency (A&E) departments’ which provide more serious care. We also include accessibility to leisure services which, while not health services, offer individuals the opportunity to exercise, which is important for promoting healthy lifestyles.

12.4
Physical environment

There has been a longer-standing interest in understanding how features of the physical environment impact health compared to other domains. We focus on two important aspects of the physical domain: green space and air quality.

Green space refers to areas of natural environments including grassland, woodland, parks and other areas of vegetation. It has been demonstrated to be an important determinant of physical and mental health. Parks offer opportunities for physical activity, as well as social interactions with friends and family. Individuals residing in ‘greener’ environments also tend to have improved mental wellbeing.

Air quality is an important determinant of respiratory health and is viewed as one of
the most important determinants of ill health globally. Ideally air should be clean and free of pollutants to allow healthy respiration; however, this is often not the case. Levels of air pollution receive extensive policy interest, particularly in urban areas with busy road networks, airports and industry. We focus on levels of three important pollutants which have each been independently shown to be associated with health outcomes: Particulate Matter (PM₁₀), Sulphur Dioxide (SO₂) and Nitrogen Dioxide (NO₂).

12.5
Opening up data

It is clear that geographic context matters for both understanding health patterns and for delivering policy strategies aimed at improving health. However, there are several issues that have limited our ability to measure these features. Firstly, processing these data types at low-level spatial resolutions requires heavy data manipulation. Researchers and policy officials often don’t have the expertise available to them to process such data readily. Secondly, accessibility to these data can be restricted and often consumer data on retail outlets are either not available or must be paid for. Finally, where these previous issues have been overcome, data are often not available for all locations at a small spatial scale. The majority of studies that have explored the role of these environmental features have been undertaken in local contexts that may not be generalisable to the national level. Where they are available at the national level, this is often only for large geographical zones, which are not always useful. Our project aims to open up geographic data on health indicators at a low-level spatial resolution for Great Britain that will address each of these barriers.

We also build on prior research using these health indicators to develop a new descriptive tool. One limitation common to the majority of research investigating the role of environmental factors on health is that they are often considered in isolation. However, this is a false representation of reality as they each co-exist and interact. Developing indicators to measure the multidimensional features for how geography may influence health is important for informing future research and policy applications. Similar approaches have been useful for measuring poverty and deprivation (e.g. the Index of Multiple Deprivation).

12.6
Methodology

Firstly, we acquired data on the retail environment. Data on about half a million retail businesses throughout Great Britain were provided by the Local Data Company (LDC)¹ using the Consumer Data Research Centre (CDRC) services. The LDC dataset aims to include records for every operating retail business including a hierarchical classification of retail types (39 categories and 370 subcategories) and the address of the store. We used this dataset as it is regularly updated and therefore more accurate compared to other common sources (e.g. Ordnance Survey’s Points of Interest database) used for measuring neighbourhood features. Table 12.1 presents the categories selected for developing the retail environment indicators, including the number of retail businesses assigned to each category.

Our health services domain integrated data from multiple sources, including openly available data on the location of health services (GP practices, hospitals with A&E departments, pharmacies and dentists) from NHS Digital² (England and Wales); and the Information Services Division (ISD)³ in NHS Scotland. These data were supplemented with the location of leisure sport centres from the LDC data.

Finally, in order to provide context to the physical environment, we integrated two sources of data. As a measure of air quality,
we used modelled estimates from the Department for Environment, Food and Rural Affairs (DEFRA) for a series of air pollutants with known health implications (NO₂, PM₁₀ and SO₂). The air pollution data are modelled under DEFRA’s Modelling of Ambient Air Quality contract to provide policy support and are created at a 1x1 km resolution. Model estimates are derived from a mixture of data collected from monitoring sites and estimated levels based on the location of industry and road networks. Additionally, we acquired information on ‘green’ spaces available for use by the public from OpenStreetMap (OSM) by selecting areas with the following tags: cemetery, common, dog park, scrub, fell, forest, garden, greenfield, golf course, grass, grassland, heath, meadow, nature reserve, orchard, park, pitch, recreation ground, village green, vineyard and wood.

Accessibility to each of our indicators (other than the indicators of the physical environment domain) was created using the Routino⁴ open source software. Routino is an application for finding a route between two points using the OSM road network, and takes into account restrictions on roads as well as tagged speed limits and barriers. We measured the network distance between the centroid of each postcode in the National Statistics Postcode Lookup (NSPL) (a database containing all postcodes for Great Britain) and the coordinates of the nearest service (e.g. a postcode centroid for GP practice). However, the overall process for calculating network distances for about 2 million postcodes in Great Britain is CPU-intensive and the Routino tool computes distances sequentially. To address both these issues, we implemented a parallelisation framework using 10 Docker⁵ containers that run Routino instances in parallel for subsets of 200,000 GB postcodes. In this way, we achieved a significant decrease in processing time from roughly eight days to about eight hours per indicator!

Table 12.1
LDC categories and subcategories selected for each indicator of the Retail environment domain.

Indicator                                    LDC Category / Subcategory    Business Addresses
Accessibility to Fast Food outlets           Chinese Fast Food Takeaway                 2,855
                                             Fast Food Delivery                         1,049
                                             Fast Food Takeaway                        11,115
                                             Fish & Chip Shops                          3,829
                                             Indian Takeaway                            1,256
                                             Pizza Takeaway                             2,835
                                             Sandwich Delivery Service                    342
                                             Take Away Food Shops                       8,449
Accessibility to Gambling outlets            Casino Clubs                                 156
                                             Bookmakers                                 8,379
Accessibility to Off-licences                Off Licences                               2,770
Accessibility to Tobacconists                Tobacconists                               1,948
Accessibility to Pubs, bars and nightclubs   Night Clubs                                1,172
                                             Bars                                       4,520
                                             Public Houses & Inns                      18,775

The indicators for the physical environment domain required a different approach. For measuring access to green space, we defined accessibility as a measure of the overall area of green space available to each postcode that falls within a 900 metre buffer zone. We selected this
measure following the recommendation of the European Environment Agency, which argues that each person should have access to green space no further than 900 metres (or a 15 min walk) from their home (Stanners and Bourdeau, 1995). We additionally performed sensitivity testing of additional buffer sizes; however, the results did not significantly alter. We did not measure access to our air pollution measures but used their modelled values from DEFRA and aggregated them at the LSOA level.

Measured network distances for each indicator were aggregated from postcode into an aggregate geography. For England and Wales, these were Lower Super Output Areas (LSOAs), and in Scotland, Data Zones. We selected these geographies since they are relatively small zones which are regularly used in research, local government or health, and could be easily aggregated to other statistical geographies if required. To give an idea of scale, LSOAs contain a mean population size of 1,500 people with a minimum of 1,000 and maximum of 3,000 people per LSOA. For Scotland, we used ‘Data Zones’, which are the equivalent geographical scale (we refer to these as LSOAs for simplicity in the rest of the chapter) although they are slightly smaller with population sizes between 500 and 1,000 people.

Each indicator was then individually standardised by ranking LSOAs from best to worst. The direction of each variable was dictated by the literature (e.g. accessibility to fast food outlets was identified as health negating, whereas accessibility to GP practices was health promoting; see Table 12.2). Each variable was then transformed to the standard normal distribution. The indicators within each domain were combined with equal weights forming an overall domain score. We chose to equally weight each indicator since there was no clear justification for different weightings, which would otherwise emphasise the relative importance of the composite score versus those others considered.

Table 12.2
Indicator weights and direction for each indicator of the Access to Healthy Assets and Hazards (AHAH) index.

Domain                 Indicator                                    Health promoting
                                                                    Low value   High value
Retail Environment     Accessibility to Fast food outlets               -           +
                       Accessibility to Gambling outlets                -           +
                       Accessibility to Off-licences                    -           +
                       Accessibility to Tobacconists                    -           +
                       Accessibility to Pubs, bars and nightclubs       -           +
Health Services        Accessibility to GP practices                    +           -
                       Accessibility to A&E hospitals                   +           -
                       Accessibility to Pharmacies                      +           -
                       Accessibility to Dentist practices               +           -
                       Accessibility to Leisure services                +           -
Physical Environment   Accessibility to Green spaces                    -           +
                       Nitrogen Dioxide (NO₂)                           +           -
                       PM₁₀ Particles                                   +           -
                       Sulphur Dioxide (SO₂)                            +           -

To calculate our overall index (and domain-specific values), we followed an aspect of the methodology from the 2015 English Index of Multiple Deprivation (Smith et al, 2015). We ranked each domain R and scaled it to the range [0,1]. R=1/N was defined as
172 CONSUMER DATA RESEARCH: PART THREE

Domain | Indicator (unit) | Great Britain LSOAs: All | Urban | Rural | England LSOAs: All | Urban | Rural
Retail Environment | Accessibility to Fast Food outlets (km) | 1.30 | 1.08 | 7.41 | 1.22 | 1.03 | 5.36
Retail Environment | Accessibility to Gambling outlets (km) | 1.21 | 1.02 | 5.85 | 1.19 | 1.02 | 5.98
Retail Environment | Accessibility to Off-licences (km) | 2.58 | 2.02 | 9.55 | 2.24 | 1.85 | 8.34
Retail Environment | Accessibility to Tobacconists (km) | 3.08 | 2.48 | 10.90 | 2.89 | 2.43 | 10.01
Retail Environment | Accessibility to Pubs, bars and nightclubs (km) | 1.12 | 0.96 | 3.70 | 1.03 | 0.92 | 3.24
Health Services | Accessibility to GP practices (km) | 1.05 | 0.93 | 2.92 | 0.99 | 0.89 | 3.02
Health Services | Accessibility to A&E hospitals (km) | 7.45 | 6.06 | 18.61 | 7.12 | 6.04 | 17.62
Health Services | Accessibility to Pharmacies (km) | 0.85 | 0.77 | 2.25 | 0.83 | 0.76 | 2.70
Health Services | Accessibility to Dentist practices (km) | 1.05 | 0.92 | 3.65 | 1.00 | 0.90 | 3.78
Health Services | Accessibility to Leisure services (km) | 2.45 | 1.98 | 8.92 | 2.23 | 1.88 | 8.09
Physical Environment | Accessibility to Green spaces (km2) | 0.55 | 0.58 | 0.42 | 0.53 | 0.56 | 0.37
Physical Environment | Nitrogen Dioxide (NO2) (µg m-3) | 10.60 | 11.55 | 7.23 | 11.44 | 12.20 | 8.23
Physical Environment | PM10 Particles (µg m-3) | 12.74 | 12.99 | 11.76 | 13.36 | 13.48 | 12.82
Physical Environment | Sulphur Dioxide (SO2) (µg m-3) | 1.15 | 1.19 | 0.97 | 1.21 | 1.23 | 1.07

Table 12.3
Median values of each indicator for LSOAs by urban/rural status.

the most ‘health promoting’ LSOA and R=N/N for the least promoting (N is the number of LSOAs in Great Britain). Exponential transformation of the ranked domain scores was then applied to LSOA values to reduce ‘cancellation effects’ (Smith et al, 2015). So, for example, high levels of accessibility in one domain are not completely cancelled out by low levels of accessibility in a different domain. The exponential transformation applied also puts more emphasis on the LSOAs at the health demoting end of the distribution and so facilitates identification of the neighbourhoods with the worst health promoting aspects. The exponentially transformed indicator score X is given by:

X = -23 ln(1 - R(1 - exp(-100/23)))

where ‘ln’ denotes the natural logarithm and ‘exp’ the exponential function.

The main domains across our indicators (retail environment, health services and the physical environment) were then combined to form an overall index of ‘Access to Healthy Assets & Hazards’ (AHAH).

12.7
Results

Table 12.3 presents descriptive statistics for each of our indicators. These reveal for the first time low-level differences in access to various health-related features for the whole of Great Britain. Many features of the retail environment are located on average within less than 1.5 km of the population. Pubs, bars and nightclubs were the most accessible. This was followed by gambling and fast food outlets, which both demonstrated high accessibility. Off-licences and tobacconists were the least accessible premises in the retail environment, particularly in Scotland

and Wales. Each of these services was more accessible in urban than in rural areas, partly as these amenities cluster within settings where there are greater populations and demand. Pubs, bars and nightclubs had the smallest difference between urban and rural areas, with each of the other amenities fairly uncommon in rural areas.

Each of the primary health services (GPs, dentists and pharmacies) demonstrated high access and was more accessible than any feature of the retail environment. While access in rural areas was poorer than in urban areas, average distances were not large. Access to hospitals with A&E departments was poor in rural areas of Great Britain, and they were the least accessible of any of our indicators. In particular, the Scottish and the Welsh population in rural areas had to travel an extra 3-5 kilometres to get access to an A&E hospital.

Table 12.3 (continued)
Domain | Indicator (unit) | Scotland LSOAs: All | Urban | Rural | Wales LSOAs: All | Urban | Rural
Retail Environment | Accessibility to Fast Food outlets (km) | 1.63 | 1.25 | 5.83 | 2.14 | 1.55 | 6.68
Retail Environment | Accessibility to Gambling outlets (km) | 1.26 | 0.99 | 4.73 | 1.57 | 1.21 | 6.59
Retail Environment | Accessibility to Off-licences (km) | 4.89 | 3.14 | 14.25 | 6.61 | 3.86 | 13.15
Retail Environment | Accessibility to Tobacconists (km) | 4.08 | 2.69 | 15.19 | 4.61 | 2.98 | 12.74
Retail Environment | Accessibility to Pubs, bars and nightclubs (km) | 1.64 | 1.29 | 5.57 | 1.48 | 1.15 | 4.60
Health Services | Accessibility to GP practices (km) | 1.32 | 1.16 | 2.57 | 1.23 | 1.06 | 2.99
Health Services | Accessibility to A&E hospitals (km) | 8.66 | 5.64 | 22.00 | 12.15 | 9.51 | 20.20
Health Services | Accessibility to Pharmacies (km) | 0.93 | 0.82 | 1.49 | 1.05 | 0.89 | 2.04
Health Services | Accessibility to Dentist practices (km) | 1.20 | 1.00 | 2.99 | 1.40 | 1.18 | 4.14
Health Services | Accessibility to Leisure services (km) | 3.45 | 2.46 | 12.37 | 5.25 | 3.23 | 12.34
Physical Environment | Accessibility to Green spaces (km2) | 0.66 | 0.68 | 0.59 | 0.48 | 0.48 | 0.47
Physical Environment | Nitrogen Dioxide (NO2) (µg m-3) | 7.18 | 8.44 | 4.27 | 7.31 | 8.26 | 4.73
Physical Environment | PM10 Particles (µg m-3) | 9.02 | 9.18 | 8.34 | 11.43 | 11.75 | 10.17
Physical Environment | Sulphur Dioxide (SO2) (µg m-3) | 0.76 | 0.80 | 0.61 | 1.32 | 1.41 | 1.00

Finally, each of our measures of the physical environment promoted healthier locations in rural versus urban areas. This makes sense since the main sources of pollution (e.g. industrial sites) are located in urban areas, and rural areas are ‘greener’. In Scotland and Wales, physical environment indicators such as NO2 and PM10 show a less polluted environment compared to the indicators of England (Table 12.3). We next mapped each of the domain scores and explored their geographic distributions (Figure 12.1). The physical environment domain (Figure 12.1.a) demonstrates better physical environments in rural areas (as supported by Table 12.3). There are vast expanses of areas grouped in the ‘best’ quintile across Scotland, Cumbria and North Yorkshire, Wales, Devon and Cornwall. Smaller areas can also be detected on the map representing the locations of national parks, such as the Peak District or the South Downs, which vividly stick out from the surrounding urbanised areas which perform poorly. Considering the domain indicators, this is to be expected, as these areas combine good accessibility to green spaces with legislation against development that protects them from the pollution associated with urbanisation.

The worst locations identified through our physical environment domain are clearly outlined urban areas. The largest expanse tracks from Humberside and follows the M1 motorway past Doncaster and Sheffield to Nottingham. This M1 corridor features a large number of power stations alongside large industrial sites. Other industrialised areas can be clearly picked out, such as Newcastle, Liverpool, Manchester and Leeds. Birmingham and London are the two largest conurbations with poorest physical environments outside of the expanse in the East Midlands, and Southampton is another clearly defined area. A more rural area that exhibits a poor environment is that encompassing Boston and Peterborough around the Wash, where land use is predominantly large-scale agriculture. Although this area features

good green space accessibility it also has high scores for the indicators of SO2 and PM10 from the farming process.

Figure 12.1
Quintiles of accessibility in GB: a) Physical environment domain, b) Health services domain, c) Retail environment domain.

The health services domain (Figure 12.1.b) has a contrasting pattern to that seen in the physical environment (Figure 12.1.a). Rural areas have poorer accessibility to health services than urban areas (as shown in Table 12.3). Urban areas are more clearly defined in Figure 12.1.b, which is expected due to the distinct differences in infrastructure provision and population density. Plotting quintiles hides some variation between areas, particularly in rural areas where remote regions in Wales and Scotland have very poor access to health services.

The retail services domain is very similar to the health domain, with urban areas once again clearly defined. The relationships are reversed, however: urban areas have higher accessibility to health negating retail features, in contrast to health services, whereas rural areas have poorer access to these retail outlets in comparison with urban areas.

Figure 12.2 shows our overall index of ‘Access to Healthy Assets & Hazards’ (AHAH). The figure shows that the most remote rural areas are identified clearly as ‘unhealthy’ areas in terms of accessibility in our measure. While they typically performed well on our physical environment and retail domains (although not always, e.g. Lincolnshire), they perform poorly on accessibility to health services, due to their remoteness and being sparsely populated. By contrast, most urban cores of cities such as central London, central Birmingham, and the city centres of areas such as Liverpool, Leeds and Manchester also perform poorly on our index. These urban centres have high volumes of health services, but have poor accessibility due to the high number of ‘unhealthy’ services

from the retail indicator and higher levels of air pollution.

Figure 12.2
Quintiles of Access to Healthy Assets & Hazards in GB.

The LSOA identified by the index that performed the worst was ‘Camden 037B’. The LSOA is located in Holborn, Camden and incorporates Hatton Garden and Farringdon station. The LSOA scored highly on the retail and environment domains, due to its high levels of pollution and high accessibility to most of the retail outlets we measure (e.g. on average individuals were only 0.11 km from their nearest fast food outlet). Each of the other LSOAs within the top 10 unhealthiest environments was also located in Inner London.

The areas that were identified as the most health promoting through our index are typically smaller towns and suburban areas on the outskirts of cities. These areas perform well since they were generally located near to health services and green spaces, but further away from polluted environments or retail services that were unhealthy. The LSOA that performed the best on our overall index was ‘Torridge 006B’. The area comprises ‘Great Torrington’ which is a small town in north Devon in the South West of England. It has low levels of pollution, good access to parks and green space, few retail outlets that may encourage poor health-related behaviours, and good access to health services, in particular, with a hospital located in the centre of the town. Only two of the top ten best areas were not located in Scotland (with the other one being the Isle of Portland near Weymouth in the South West of England). The Scottish areas were mainly small towns and villages in rural areas located across the Central and Lowlands areas of Scotland, with two of them being located in the Greater Glasgow region.

Figures 12.3 and 12.4 move away from the national level and look at more local distributions of the index and domains (Liverpool and London respectively). Figure 12.3.a shows AHAH for Liverpool and demonstrates distinct geographical patterns. There is a clear region that performs poorly on our index in the city centre and north west of the city. Figures 12.3.b, 12.3.c and 12.3.d help to provide explanation for this. Figure 12.3.b shows the environment domain is worse in the city centre due to heavy traffic flow and in the northern docklands up towards Bootle, which is the old industrial area of the city. For retail services (Figure 12.3.d) the city centre has the best accessibility because of the volume of infrastructure; however this also extends into the north west of the city. These patterns exist despite good access to health services (Figure 12.3.c).

The two best performing areas of the city are Sefton Park and Aigburth, south west of the city centre, and Croxteth in the north east. These regions seem to perform well across each of the domains, particularly the Croxteth region. South Liverpool also performs well overall, given the particular

performance on the physical environment and health services domains.

Figure 12.3
Quintiles of accessibility in Liverpool: 1) Bootle area, 2) Croxteth area, 3) Smithdown area, 4) Sefton Park and 5) Aigburth area.

Our indicators also pick out more fine scale patterns. Figure 12.3.b demonstrates an area of poor physical environment in the east of the city. This region covers where the M62 motorway enters the city, and combined with a major train junction, represents the main arterial flow of traffic into Liverpool. Figure 12.3.a also demonstrates a small area that is poor performing just south of the city centre adjacent to the Sefton Park area which performs well. This is the Smithdown Road area which attracts Liverpool’s student population, and has a high concentration of pubs, bars and fast food outlets. Out-of-town shopping centres and other high streets can be clearly seen in Figure 12.3.d as well, as they represent areas of greater access to unhealthy aspects of the retail environment outside of the city centre.

A second local focus is that of London, shown in Figure 12.4. Much like Liverpool, the inner core and central region of London exhibits the lowest access to healthy choices (Figure 12.4.a). These higher values are driven by the high level of air pollution (as represented by the poor scores on the physical environment domain; Figure 12.4.b) and high access to the unhealthy aspects of the retail environment (Figure 12.4.d). By contrast, the area does have good access to health services (Figure 12.4.c).

The areas that perform poorly are not restricted to just the urban core but also extend out to the east and west. In the east, areas in the lowest quintile extend along the River Thames representing the location of industry and river traffic. The west is characterised by Heathrow Airport which has high levels of pollution and poor access to health services.

The areas that perform best on AHAH can be found in the periphery/outskirts of the city. These areas are characterised by good access to health services and high quality physical environments. They do not perform well on the retail environment domain; however this is common across London due to the high density of infrastructure and people. Outside of the Greater London metropolitan area, accessibility to retail outlets is by contrast poorer (i.e. further away) since these areas are predominantly more rural / less densely populated. There is one area within Inner London that does perform well on AHAH despite being surrounded by areas that do not perform well. This is Richmond Park and Wimbledon Common, two large expanses of green space and parkland. This area performs best on the physical environment domain (and to a lesser extent on the retail environment domain).
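
The index construction that underlies these results (ranking each indicator in its health promoting direction, transforming to normal scores, combining indicators with equal weights within a domain, then ranking, scaling and exponentially transforming each domain) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy distances, the (r - 0.5) / N position used for the normal scores and the equal weighting of domains in the final step are all hypothetical choices made for the example, and ties are not handled.

```python
import math
from statistics import NormalDist, mean


def rank_best_to_worst(values, higher_is_better):
    """Rank areas 1..N so that rank 1 is the most health promoting (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def normal_scores(ranks):
    """Map ranks to standard normal quantiles; the (r - 0.5) / N position is an assumption."""
    n = len(ranks)
    return [NormalDist().inv_cdf((r - 0.5) / n) for r in ranks]


def domain_score(indicators):
    """Equal-weight mean of normal scores; indicators is a list of (values, higher_is_better)."""
    columns = [normal_scores(rank_best_to_worst(v, hib)) for v, hib in indicators]
    return [mean(column[i] for column in columns) for i in range(len(columns[0]))]


def exponential_transform(r):
    """X = -23 ln(1 - R(1 - exp(-100/23))) for a scaled rank R in (0, 1]."""
    return -23.0 * math.log(1.0 - r * (1.0 - math.exp(-100.0 / 23.0)))


def transformed_domain(scores):
    """Rank domain scores (lowest = best), scale to R = rank / N, then transform."""
    ranks = rank_best_to_worst(scores, higher_is_better=False)
    return [exponential_transform(r / len(ranks)) for r in ranks]


def overall_index(domains):
    """Combine transformed domains with equal weights (an assumption of this sketch)."""
    return [mean(values) for values in zip(*domains)]


# Hypothetical toy data for three areas (distances in km).
fast_food_km = [0.2, 1.5, 4.0]  # further away is health promoting
gp_km = [0.5, 1.0, 6.0]         # closer is health promoting

retail = transformed_domain(domain_score([(fast_food_km, True)]))
health = transformed_domain(domain_score([(gp_km, False)]))
ahah = overall_index([retail, health])
```

In this toy case the middle area, which is average on both domains, receives a lower (healthier) combined score than the two areas that are best on one domain and worst on the other. A plain average of ranks would tie all three, which is the cancellation effect the exponential transformation is meant to reduce.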

Figure 12.4
Quintiles of accessibility in London: 1) Heathrow Airport, 2) Richmond Park & Wimbledon Common and 3) East London.

12.8
Conclusion

Our study details the creation of a series of national open source low-level geographical measures on the accessibility to health-related features of the environment. These measures are combined to create an index of ‘healthiness’ for areas (‘Access to Healthy Assets & Hazards’) and help to summarise the complex geographical patterns demonstrated across our indicators. The data are available at indicators.cdrc.ac.uk/health where they can be viewed and downloaded. The website will be updated over time with new indicators that we develop or update.

Further Reading

Smith, T., Noble, M., Noble, S., Wright, G., McLennan, D. and Plunkett, E. (2015). The English Indices of Deprivation 2015, Department for Communities and Local Government. Online: https://www.gov.uk/government/publications/english-indices-of-deprivation-2015-technical-report [Accessed 10 Dec 2016]

Stanners, D. and Bourdeau, P. (1995). The urban environment. In Stanners, D. and Bourdeau, P. (Eds.), Europe’s Environment: The Dobris Assessment. European Environment Agency, Copenhagen, pp. 261–296.

Notes

1. data.cdrc.ac.uk/product/local-data-company-retail-data
2. digital.nhs.uk/
3. www.isdscotland.org/
4. www.routino.org/
5. www.docker.com/what-docker

Acknowledgements

The authors would like to thank the Local Data Company Ltd for providing the retail unit data, the NHS of England, Wales and Scotland and the DEFRA for providing the health data and the air pollution data respectively under the OGL license and the OpenStreetMap Foundation (OSMF) for providing the GB network data under the Open Data Commons Open Database License. The second author’s PhD research is sponsored by the Economic and Social Research Council through the North West Doctoral Training Centre.
13

Consumers in their Built Environment Context
Alexandros Alexiou and Alex Singleton

13.1
Introduction

Within consumer analytics, geodemographic classifications imbued with a variety of data are used widely as one of the most powerful discriminators of consumer behaviour (Graham, 2005). These divide customers into homogeneous groups and have a long lineage as a basic strategy of marketing, often in order to identify population types and their correlation to product uptake (neighbourhood targeting). The advantages of such approaches were identified very early on, for example in the analysis carried out by Green and colleagues (1967) examining the relationships between newspaper circulation and city type.

Identifying socio-spatial patterns through geodemographic classification has proven utility over a range of disciplines. While most of these spatial classification systems include a plethora of socio-economic attributes, there is arguably little to no input regarding attributes of the built environment or physical space, and their relationship to socio-economic profiles within this context has not been evaluated in any systematic way. There is, however, an abundance of variables that might be collected on the built forms and relative locations that underpin neighbourhood differentiation.

The rationale for this research drew upon strong evidence that residential preference holds a significant relationship to the form of the built environment, suggesting that there is an important dimension to residential differentiation beyond a desire to live in areas that contain other people we deem ‘like us’. Proximity to certain amenities is important to residential decisions, for example, transport nodes, parks, retail or healthcare facilities, and such attributes may have varying importance between different segments of the population. For instance, families with children often favour greenspace and recreational opportunities nearby, while

those without may prefer smaller residences closer to the city centre. As a result, consumption patterns can be inferred from the characteristics of residential location. Although geodemographic frameworks can incorporate a variety of input attributes, built and physical environment variables are typically limited to housing conditions or types. As such, this chapter presents the results of a project that explores the generation of a neighbourhood typology with a focus on such characteristics of urban morphology, through integration of a range of spatial data from open sources.

13.2
Data sources

Currently, there are several providers of built and physical environment data in the UK. One of the main providers of geographical data for England and Wales is the national mapping agency Ordnance Survey (OS), and there are many datasets available within their repository, with varying degrees of granularity, depending on whether they are publicly accessible or available for purchase. As this research focuses on Open Data sources, a variety of open vector data sources that can be used directly or as supplements, such as OpenStreetMap (www.openstreetmap.org), were considered. Nevertheless, in order to maintain a consistent level of accuracy, the OS Open Map - Local product was used, the most recent and detailed open OS vector data product currently available (Ordnance Survey, 2015). This particular vector data product provides a variety of information, including outlines of buildings, street network with hierarchy, railways, woodland areas, surface water and important functional sites.

While OS Open Map - Local provides the main source of these data, there were a few other sources within England and Wales deemed useful. These included data about listed buildings and historic parks and gardens supplied by the Historic England Archive (services.historicengland.org.uk/NMRDataDownload/), which is regularly updated (November 2015 update used here) and is also under an Open Data License. For Wales, the corresponding provider is the Cadw heritage organisation (available through the UK Data Service: data.gov.uk/dataset/listed-buildings-in-wales-gis-point-dataset), although the data are slightly outdated (September 2011). Commercial buildings for local retail centres were identified using data from the Local Data Company, an Open version of which is available through the ESRC Consumer Data Research Centre (CDRC) (available at: data.cdrc.ac.uk/dataset/cdrc-maps-retail-centre-locations).

Finally, the selected datasets include aggregated data on housing type from the 2011 Census supplied by the Office for National Statistics. Unfortunately, there are currently no Open Data available on building age or height. The UK Environment Agency has recently started providing raw LIDAR datasets that can offer such possibilities (data.gov.uk/publisher/environment-agency), but these still do not offer complete coverage. Future updates of this classification product may include more attributes such as roof types, car parks, delineated retail clusters and Energy Performance Certificate (EPC) data.

Table 13.1 summarises the range of inputs used to derive measures featured in this analysis.

The selection of the Output Area (OA) zonal level offers an advantage over other administrative units in England and Wales since many other socio-economic classifications are offered at the OA level, such as the 2011 Output Area Classification (OAC), thus making comparisons possible. Additionally, such geography allows the incorporation of Census data which are distributed for these units. However, for the range of the derived measures that are described in the remainder of this section, there are problems with this approach (and

Dataset Name | Dataset Description | Source
D1: OA Boundaries | 181,408 Output Area (OA) boundaries, as defined by the 2011 Census. All other data were spatially joined with respective OAs that they fall into (data features were split when falling into more than one OA). | Ordnance Survey
D1: Building Units | 12,878,666 Building objects represented as polygons. Note that these areas do not represent individual households. | Ordnance Survey
D2: Road Network | Road network is represented as line segments, approximate to the road centre. The categories include ‘Motorway’, ‘Primary Road’, ‘A Road’, ‘B Road’, ‘Minor Road’, ‘Pedestrianised Street’, ‘Local Street’ and ‘Private Road Publicly Accessible’, as well as their ‘Collapsed Dual Carriageway’ counterparts. | Ordnance Survey
D3: Woodland | Areas of trees represented as polygons, described as coniferous and non-coniferous. | Ordnance Survey
D4: Functional Sites / Important Buildings | Functional sites comprised of 120,677 building polygons. They are categorised into themes such as ‘Air Transport’, ‘Education’, ‘Medical Care’, ‘Road Transport’ and ‘Water Transport’, which are further classified into more discrete classes. | Ordnance Survey
D5: Railway Stations and Tracks | Railway tracks and tunnels represented as lines and railway stations represented as points. | Ordnance Survey
D6: Surface water | Polygons of surface water. Small rivers and streams are represented as lines and are not included in the dataset. The dataset was also supplemented with a polygon for ‘sea water’, derived from the country’s coastline. | Ordnance Survey
D7: Registered Historic Buildings | 406,496 listed historic buildings defined as points, which were geolocated. | Historic England Archive; Cadw
D8: Registered Parks and Gardens | 2,007 Polygon features with extents of the parks / gardens, classified as I, II*, or II, from most to least important. For Wales, the 372 sites were identified from points from a ‘Named Places’ dataset and given an approximate 200m radius. | Historic England Archive; Cadw
D9: Retail Centres | 1,312 Retail Centres across England and Wales. There is no recent update for this dataset which dates back to 2004. The centres are only depicted as points and have no typology attached. We assumed an average radius of 200m to convert them to areas. | Local Data Company (CDRC)
D10: Housing Type | Percentage of households that are classified by the Census as ‘Detached’, ‘Semi-detached’, ‘Terraced’ or ‘Flat’. | Office for National Statistics
D11: Population | Population of total persons per OA. | Office for National Statistics

Table 13.1
Description of the spatial dataset compiled.

as a matter of fact, any other Census geography). OA borders were designed to maximise within-zone homogeneity in population characteristics (population normalisation), without regard to the geographical features of the area (Martin et al, 2001; see Figure 13.1). As such, for proximity based inputs there were challenges about how such measures might be calculated, and to which area they should be attributed.

To address these methodological shortcomings, three different types of attribute measures are introduced for each OA that relate to either two types of

proximity measures (adjacency effects or intermediate effects) or, additionally, direct measures. The last of these are simply attributes captured at the OA level, while the first two take buildings as the initial unit of analysis, which are then later assigned to OAs. Building polygon features serve as observations in this input dataset, and represent homogeneous built-up areas which can include one or more households. A graphical representation of the model is given in Figure 13.2.

Figure 13.1
Map looking at the un-generalised OA borders (blue lines) in the Sefton Park area, Liverpool. Notice how the area of the park is divided arbitrarily between proximal OAs (pink hashed line pattern). Moreover, OA borders usually coincide with the street network, making any street network-to-area measurements impracticable.

For both types of proximity measures, a series of spatial queries were used that identified buildings that fulfil certain criteria, for instance, ‘Which buildings are within a set distance of a major street?’. The surface area of the buildings that met each criterion was then aggregated per OA and calculated as a ratio to the total building surface. Thus, within each OA, a ratio of the area of buildings meeting the criteria relative to the total built area was calculated for each of the attributes considered in the analysis.

This research defined adjacency effects as features measured within 100m linear distance, as commonly used in the literature on negative externality effects of built environment features, such as noise or pollution from roads (Rijnders et al, 2001). For intermediate effects a distance of 600m was used, on the basis of various western international definitions of ‘within walking distance’. The distance figure generally varies depending on the context of analysis, but distances between 300m and 900m are considered appropriate for urban features (Hui et al, 2007; Barbosa et al, 2007).

Outside of these distances, it is assumed there are no effects. The delineation of adjacency effects or intermediate effects brings additional practical considerations which relate to the overall density of the built environment features being considered. In common with practice when creating inputs to multidimensional classifications, preference should be for those attributes which, in addition to theoretical rationale, also provide useful differentiation between areas (Spielman

Figure 13.2
The spatial data model used to process data and produce OA zonal inputs to the classification.
[Diagram labels: Individual-level Building Data; Proximity Measures; Natural Environment; Infrastructure; Public and Private Services; Building Characteristics; Housing and Density; Output Area Aggregation; Output Classification.]

and Singleton, 2015). For example, in this application, when 600m buffers were used for major roads, this resulted in more than 50% of buildings meeting this criterion, providing a weak differentiation. These tasks were computationally expensive, as the complete dataset contained more than 12.8 million observations (building polygons). Therefore, the database was processed within the R coding language.

Finally, there were two further types of direct measure: those that were derived from building-level geographic features, and those that were simple inputs from secondary data. The derived direct measures included listed buildings (Figure 13.3) and cul-de-sacs. The latter were defined geocomputationally as the end of a line segment that did not intersect with any other such segment. A sensitivity of 10m was applied to this criterion in order to avoid topological errors and intermittent street segments. Results show that such measures can capture specific urban morphologies even at the small-area level.

For the other non-derived direct measures, the variables were simply aggregated directly at the OA level, such as housing

Figure 13.3
The total surface area of listed (registered) buildings (ha) per OA within the Greater Manchester metropolitan area.
[Map legend, total surface area per OA: < 0.05 ha; 0.05 - 0.25 ha; 0.25 - 0.75 ha; 0.75 - 1.5 ha; > 1.5 ha.]
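
The cul-de-sac definition given in the text (the end of a street line segment that does not meet any other segment, with a 10m sensitivity) can be illustrated with a short Python sketch. The chapter states that the original processing was done in R; the code below is not that implementation, and how the 10m sensitivity was applied is an assumption here: an endpoint is treated as connected if any other segment passes within 10m of it. The function names, coordinates and OA area are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment a-b (coordinates in metres)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its two ends.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def dead_ends(segments, tolerance_m=10.0):
    """Return segment endpoints that no other segment approaches within tolerance_m."""
    ends = []
    for i, (start, end) in enumerate(segments):
        for point in (start, end):
            connected = any(
                point_segment_distance(point, s, e) <= tolerance_m
                for j, (s, e) in enumerate(segments)
                if j != i
            )
            if not connected:
                ends.append(point)
    return ends


# Hypothetical street segments within one OA (metres).
streets = [
    ((0, 0), (200, 0)),      # main street
    ((100, 4), (100, 80)),   # side street drawn 4m short of the main street
    ((200, 0), (200, 100)),  # second side street, properly connected
]
cul_de_sacs = dead_ends(streets)
oa_area_ha = 6.0  # hypothetical OA area
cul_de_sac_density = len(cul_de_sacs) / oa_area_ha  # points per hectare
```

The 4m gap at the start of the first side street is absorbed by the tolerance, so only its far end is counted, which is the kind of topological error the sensitivity threshold is described as guarding against.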

Variable: Description (aggregated per OA)

Adjacency effects
1. Major Roads: Building area whose centroid is within 100m of a major road, as a percentage of the total building area. Major roads are defined as those of type ‘Motorway’, ‘A Road’ and ‘Primary Road’.
2. Arterial Roads: Building area whose centroid is within 100m of an arterial road, as a percentage of the total building area. Arterial roads are defined as those of type ‘B Road’.
3. Pedestrian Roads: Building area whose centroid is within 100m of a pedestrian road or footway, as a percentage of the total building area.
4. Railway Tracks: Building area whose centroid is within 100m of railway tracks (excluding tunnels), as a percentage of the total building area.
5. Woodland Areas: Building area whose centroid is within 100m of woodland features, as a percentage of the total building area.
6. Surface Water: Building area whose centroid is within 100m of surface water (inland) or the seafront (calculated by the distance from the coastline), excluding small rivers and streams, as a percentage of the total building area.

Intermediate effects
7. Railway Stations: Building area whose centroid is within 600m of the centroid of a railway station, as a percentage of the total building area.
8. Parks and Gardens: Building area whose centroid is within 600m of the registered site extents, as a percentage of the total building area.
9. Retail Centres: Building area whose centroid is within 600m from the retail centre centroid plus 200m, as a percentage of the total building area.
10. Schools: Building area whose centroid is within 600m of sites identified as primary through secondary education, as a percentage of the total building area.
11. Higher Education: Building area whose centroid is within 600m of sites identified as further and higher education, as a percentage of the total building area.

Direct measures
12. Detached Ratio: Percentage of unshared households that are classified by the 2011 Census as detached housing to the total building area.
13. Semi-Detached Ratio: Percentage of unshared households that are classified by the 2011 Census as semi-detached housing to the total building area.
14. Terraced Ratio: Percentage of unshared households that are classified by the 2011 Census as terraced housing to the total building area.
15. Flat Ratio: Percentage of unshared households that are classified by the 2011 Census as flats to the total building area.
16. Density: Ratio of persons to total building area (people/ha).
17. Cul-de-sac: Ratio of cul-de-sacs (dead-end street points) to the total OA area (points/ha).
18. Registered Buildings: Ratio of listed buildings to the total OA area (points/ha).

Table 13.2 Built and physical environment attributes used in the classification.
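The adjacency measures in Table 13.2 can be illustrated with a minimal sketch. It assumes planar coordinates in metres and roads simplified to straight segments; the function names and the input values are hypothetical, and the chapter does not specify the GIS workflow that was actually used.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance (in metres) from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both end points coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def share_of_building_area_near(buildings, segments, threshold_m=100.0):
    """Percentage of total building area whose centroid lies within
    threshold_m of any of the given feature segments."""
    total = sum(area for _, area in buildings)
    if total == 0:
        return 0.0
    near = sum(
        area
        for centroid, area in buildings
        if any(point_segment_distance(centroid, a, b) <= threshold_m
               for a, b in segments)
    )
    return 100.0 * near / total


# Hypothetical OA with three buildings: (centroid (x, y) in metres, area in m^2).
oa_buildings = [
    ((50.0, 20.0), 200.0),
    ((50.0, 150.0), 300.0),
    ((400.0, 90.0), 500.0),
]
# Hypothetical major road, simplified to one straight segment.
major_road = [((0.0, 0.0), (500.0, 0.0))]

major_roads_pct = share_of_building_area_near(oa_buildings, major_road)
print(major_roads_pct)
```

The same function would cover the intermediate effects by passing `threshold_m=600.0`; point features such as station centroids can be supplied as degenerate segments whose two end points coincide.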

type. Population density was calculated using a ratio of persons per total building area, which potentially would give more accurate results regarding housing dynamics. The final OA attributes along with their descriptions are provided in Table 13.2.

13.3 A small-area classification of urban morphology features

Methodologically, the cluster analysis follows the conventional geodemographic approach, as detailed in Harris et al. (2005); however, only the physical and built environment data, detailed above, are used to create the typology. A common clustering technique used in geodemographic analyses is the iterative allocation–reallocation algorithm, known as K-means. Although this algorithm has been used in a variety of geodemographic applications, this dataset is characterised by very sparsely populated attribute values, which makes it unsuitable for K-means. Essentially, the majority of values are zero, indicating the absence of the particular built environment or physical characteristic from that area.

Due to these shortcomings, an alternative technique was used: a Self-Organizing Map (SOM). A SOM is an unsupervised classifier that uses artificial neural networks to classify multidimensional observations in two-dimensional space based on their similarities (Kohonen, 2001). A SOM typically organises observations by projecting them as grid units onto a plane, and through consecutive iterations finds the best configuration of observations so that every observation is most similar to the others closest to it. Typically, the SOM mapping process employs a lattice of squares or hexagons as the output layer, and the results are therefore easily mapped as they retain their topology. SOMs have many applications in a broad range of fields, from medicine and biology to image analysis and computer science. SOMs have also been tested as an alternative classifier of Census data where they seem to perform well for socio-economic data at the US Census tract scale (Spielman and Thill, 2008).

Prior to clustering, the input data, consisting of 18 variables and summarised in Table 13.2, were transformed into z-scores in order to standardise their measurement scales.

This SOM implements a hexagonal grid within which OAs are projected, thus creating the classification based on the resulting topology. A relatively unexplored built environment classification with too many clusters would be difficult to interpret, so a 4-by-2 hexagonal grid was selected, which produced eight distinct clusters. Once areas were assigned to clusters, mean attribute values were drawn on radial plots in order to map cluster characteristics and label them accordingly, as seen in Figure 13.4.

Radial plots are used extensively in geodemographics as they are very intuitive in identifying the nature of formed clusters. A radial plot essentially depicts the cluster centre; it is a vector representing each attribute mean (in this case for 18 variables) within the cluster. Each attribute mean can be traced along its radial axis, and together the means form a unique pattern for every cluster. Since values were standardised to z-scores, values of zero suggest that the cluster attribute mean is equal to the national mean, while values above or below zero suggest that cluster attribute means are above or below the national average respectively. It also means that the values shown are measured in standard deviations.

To illustrate, assume that Cluster C is under consideration. The radial plot shows that Cluster C has an above average prevalence of major roads (1.0), pedestrian streets (0.4), parks and gardens (1.4) and retail sites (1.5). It has below average values of detached and semi-detached housing ratios (-1.6 and -1.7), but a high concentration of flats and

[Figure 13.4 panels: one radial plot per cluster (High Streets and Promenades; Central Business District; The Old Town; Railway Buzz; Victorian Terraces; Suburban Landscapes; Countryside Sceneries; Waterside Settings). Each plot has 18 axes: Major Roads, Arterial Roads, Pedestrian Roads, Railway Tracks, Green Areas, Surface Water, Railway Stations, Parks Gardens, Retail Centres, Schools, Universities, Detached Ratio, Semi-Detached Ratio, Terraced Ratio, Flat Ratio, Density, Cul-de-sac, Listed Buildings.]
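The values traced on each radial plot are cluster centres, that is, the per-cluster means of the z-scored attributes. A minimal sketch of that calculation follows. It assumes cluster labels are already available (in the chapter they come from the SOM), uses the population standard deviation, and works on hypothetical attribute values and function names; the chapter does not state which standard deviation estimator was used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardise one attribute: subtract the mean, divide by the
    (population) standard deviation. This choice is an assumption."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant attribute carries no information; map it to zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def cluster_centres(rows, labels):
    """Mean of every z-scored attribute within each cluster.

    rows   -- one list of attribute z-scores per OA
    labels -- the cluster label assigned to each OA
    """
    centres = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        # zip(*members) regroups the member rows into attribute columns.
        centres[label] = [mean(column) for column in zip(*members)]
    return centres


# Hypothetical data: four OAs, two attributes, two clusters.
flat_ratio = [10.0, 20.0, 80.0, 90.0]
detached_ratio = [60.0, 40.0, 0.0, 0.0]
labels = ["A", "A", "C", "C"]

columns = [z_scores(flat_ratio), z_scores(detached_ratio)]
rows = [list(row) for row in zip(*columns)]
centres = cluster_centres(rows, labels)

for label, centre in centres.items():
    print(label, [round(value, 2) for value in centre])
```

In this toy input the centre of cluster C is positive on the flat ratio and negative on the detached ratio, which is read in the same way as the plots: above and below the overall mean respectively, in units of standard deviations.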

terraced housing (1.4 and 1). The defining aspect of this cluster, however, is the listed buildings attribute, which has an average value of 5.1 within the cluster. From the mean values of attributes of Cluster C, it is suggested that these neighbourhoods are in the periphery of the city centre, proximal to some major roads and retail activities. The number of historical buildings and the presence of flats and semi-detached housing suggest neighbourhoods that have been historically affluent, potentially with a strong presence of churches or administrative buildings that have been repurposed to housing (e.g. flats) or recreational facilities (e.g. pubs and restaurants).

Figure 13.4 Radial plots of cluster attribute centres, as produced by the SOM.

Mapping the classification can also provide further insights for cluster labelling. For instance, looking at Liverpool city centre, some of the OAs of Cluster C are located within the Georgian Quarter, a historic affluent housing neighbourhood built in the 1800s. Cluster C appears to dominate the geographical extent of the City of London as well, possibly due to the high number of historical sites in the area. In a similar manner, the rest of the clusters were examined in order to identify defining characteristics. This enabled cluster types to be labelled and the following short descriptions to be created:

1. High Streets and Promenades
These clearly depicted areas represent the main retail centres of urban regions located along the main commercial streets. The main characteristic of this cluster is the very high ratio of pedestrianised street networks, not only around retail clusters but also along seafronts, where traditionally a lot of recreational and leisure venues can be found.

2. Central Business District
The area often called the city centre. Typically, these are high-rise buildings with a lot of commercial and office space, hence the relatively low net population density. These areas have proximity to the majority of public amenities, and have plenty of access via major roads and railways. For moderate-size cities the title holds true, but in areas such as London they tend to be too expansive to be labelled as central.

3. The Old Town
The traditional town centre or historically affluent residential developments, usually in the periphery of the main high street. The cluster is strongly defined by the number of registered buildings. Typically, a lot of recreational facilities can be found here, like pubs and restaurants, along with many administrative buildings and some historical major roads. Although the cluster does have a considerable number of flats, densities remain low, potentially due to refurbishments and change of usage.

4. Railway Buzz
The areas that are dominated by railway tracks and railway stations. They have no other major distinguishing attributes, which may suggest that they are actually rather heterogeneous in physical and socio-economic structure.

5. Victorian Terraces
These are typical neighbourhoods with terraced housing, average densities and moderate access to public and private services. In general, this is one of the most central clusters in the classification; excluding housing types, all attributes are very close to average. It is also one of the few typologies that can be found anywhere.

6. Suburban Landscapes
These areas typically consist of semi-detached houses, with good access to parks. They tend to be quite distant from retail centres. Densities are higher than average as a result of the few non-domestic properties found within (since population density is calculated per building surface). They are primarily residential areas, and tend to be close to schools. Cul-de-sacs are relatively common, possibly because of organised developments and gated communities.
These areas have proximity to the majority

7. Countryside Sceneries
These areas are dotted with detached houses, and are located either near or within open countryside. This typology is also defined by higher than average access to green spaces. Most rural villages fall into this category, along with some city fringe developments that lie beyond the classic suburbs.

8. Waterside Settings
The principal defining attribute of these neighbourhoods is their proximity to surface water such as rivers, canals or sea (these are very distinctive in the East of England). Some of these neighbourhoods, however, can also be found within close proximity of ports, industrial or post-industrial sites (hence the low densities). Among the distinctive infrastructure are arterial roads, i.e. secondary roads wide enough to be used by lorries for the distribution of goods.

A visual interpretation of the classification is always meaningful in evaluating emergent clusters, as illustrated by the map of the Greater London Region (Figure 13.5), as identified by the MODUM classification. As discussed previously, the core of the metropolitan region is identified as Cluster C: The Old Town, expanding outwards along major transport corridors as Cluster B: Central Business District (although in the case of London, this cluster may be too expansive to provide any useful differentiation). In general, axial zones show up much more strongly in an urban morphology classification derived from built environment and physical features which are linear in nature, such as roads, railways and rivers.

Figure 13.5 The Greater London Region as identified by the MODUM Classification. Map legend classes: High Streets and Promenades; Central Business District; The Old Town; Railway Buzz; Victorian Terraces; Suburban Landscapes; Countryside Sceneries; Waterside Settings.

13.4 Conclusion

The development of the MODUM classification illustrates that the production and analysis of a classification of the built environment using Big and Open Data can offer unique insights into some aspects of the geodemographic structure of urban areas. The results capture, through the multidimensionality of the data, both

microscopic and mesoscopic identifiers of urban morphology. Potential applications of the MODUM classification involve not only enhancing current socio-economic classifications by appending it to conventional geodemographic systems, but also using it in its own right; it can provide a simplified structure of the physical properties of geographic space that can be used to explore correlations with other spatial phenomena, potentially in a variety of applications, from real estate and house prices to health and wellbeing. In a dynamic sense, it can be used by urban planners and investors in the built environment to identify the areas in which the physical preconditions exist for neighbourhood renewal or upscaling.

On the other hand, the classification process described here is very specific to the underlying data and methodology. An inherent disadvantage of all geodemographic classifications is the lack of a single global optimisation function, making them highly susceptible to the operational decisions during the classification procedure (Openshaw and Gillard, 1978). Nevertheless, this type of classification can be valuable in many circumstances. The classification is easy to use, and offers the ability to append and update data as they become available, while keeping the same model infrastructure intact. In general, it meets the growing need for geodemographic systems that are open and versatile enough to handle the abundance of big data that are currently available.

Further Reading

Barbosa, O., Tratalos, J. A., Armsworth, P. R., Davies, R. G., Fuller, R. A., Johnson, P. and Gaston, K. J. (2007). Who benefits from access to green space? A case study from Sheffield, UK. Landscape and Urban Planning, 83, 187–195.

Graham, S. D. N. (2005). Software-sorted geographies. Progress in Human Geography, 29(5), 562–580.

Green, P. E., Frank, R. E. and Robinson, P. J. (1967). Cluster analysis in test market selection. Management Science, 13, 387–400.

Harris, R., Sleight, P. and Webber, R. (2005). Geodemographics, GIS, and Neighbourhood Targeting. Chichester, UK: John Wiley & Sons.

Hui, E., Chau, C., Pun, L. and Law, M. (2007). Measuring the neighboring and environmental effects on residential property value: Using spatial weighting matrix. Building and Environment, 42(6), 2333–2343.

Kohonen, T. (2001). Self-organizing Maps. Berlin: Springer.

Martin, D., Nolan, A. and Tranmer, M. (2001). The application of zone-design methodology in the 2001 UK Census. Environment and Planning A, 33, 1949–1962.

Openshaw, S. and Gillard, A. A. (1978). On the stability of a spatial classification of census enumeration district data. In P. W. S. Batey (Ed.) Theory and Methods in Urban and Regional Analysis, London: Pion, 101–119.

Ordnance Survey (2015). Open Map – User guide and technical specification v1.4. Crown Copyright, London: HMSO.

Rijnders, E., Janssen, N. A., van Vliet, P. H. and Brunekreef, B. (2001). Personal and outdoor nitrogen dioxide concentrations in relation to degree of urbanization and traffic density. Environmental Health Perspectives, 109(3), 411–41.

Spielman, S. E. and Folch, D. C. (2015). Social area analysis with self-organizing maps. In A. Singleton and C. Brunsdon (Eds.) Geocomputation, London: SAGE Press, 152–169.

Spielman, S. E. and Singleton, A. D. (2015). Studying neighborhoods using uncertain data from the American community survey: A contextual approach. Annals of the Association of American Geographers, 105(5), 1003–1025.

Spielman, S. E. and Thill, J. C. (2008). Social area analysis, data mining, and GIS. Computers, Environment and Urban Systems, 32(2), 110–122.

Acknowledgements

The authors would like to thank Local Data Company Ltd for providing retail unit data for this research. This research was also funded by the Economic and Social Research Council award 1390251.

Epilogue:
Researching Consumer Data
Paul Longley, James Cheshire and Alex Singleton

The contributions to this book provide wide-ranging evidence that consumer data are pervasive and have the potential to generate a deeper understanding of our society. This extends their importance far beyond the realm of identifying customer tastes and preferences into substantive contributions to the social sciences in particular. For example, the chapters in this volume demonstrate that consumer data can help to provide insight into issues as diverse as urban vitality, community carbon footprints, or the collective consumption of public transport services. However, for their potential to be fully realised in tackling issues of broader societal concern, the quality and provenance of consumer datasets need to be fully understood. Developing this understanding is part of the process of assimilation and documentation of diverse sources and forms of consumer ‘Big Data’ into appropriate digital data infrastructure, a core mission of the Consumer Data Research Centre (CDRC). For example, the desire to generalise patterns recorded within consumer data to the population at large necessitates triangulation and validation with more conventional sources of data, such as the Census of Population and Mid Year Population Estimates. Whilst this work integrates consumer data into the national data infrastructure routinely utilised by the academic community, it also offers insights relevant to data providers, many of whom may not be fully aware of the precise sectors of society that they serve. It is also of relevance to government in its efforts to integrate consumer data sources into official statistics. From this perspective, consumer data are in significant part a public, rather than a private, good. They are also non-rival – that is, their use does not undermine the competition concerns of individual business organisations, but they contribute to the overall competitiveness of the economy.

For this to be operationalised in the best interests of data providers as well as society as a whole, the CDRC is finding it helpful to subsume particular consumer data sources into composite indicators, similar to existing widely used indices of multiple deprivation and geo-demographic classifications. The CDRC research agenda thus includes the creation and maintenance of indicators relating to retail dynamics, use of digital channels and media, demographic structure, mobility characteristics, local health and carbon footprints.

In these respects, it is important to recognise a key divergence of CDRC interests from those of commercial data providers and those of government. Academic research is concerned not only with the short-term gyrations of consumer-led markets but also with their long-term evolution and potential socio-economic implications. In consolidating large assemblages of data into summary indicators, the CDRC is aware that it is important to have well-founded data infrastructure that is also enduring and facilitates comparisons across time and space. As such, the diverse data sources and case studies reported in this volume coalesce into a long-term vision that reshapes the way in which we think of digital data infrastructure in the social sciences.
First published in 2018 by
UCL Press
University College London
Gower Street
London WC1E 6BT

Available to download free: www.ucl.ac.uk/ucl-press

Text © Contributors, 2018

Images © Copyright holders named in the captions, 2018

The authors have asserted their rights under the Copyright, Designs and Patents Act 1988 to be identified as the authors of this work.

A CIP catalogue record for this book is available from The British Library.

This book is published under a Creative Commons 4.0 International license (CC BY 4.0). This license allows you to share, copy, distribute and transmit the work; to adapt the work and to make commercial use of the work providing attribution is made to the authors (but not in any way that suggests that they endorse you or your use of the work). Attribution should include the following information:

Longley, P. A., Cheshire, J. A., and Singleton, A. D. (eds.). 2018. Consumer Data Research. London: UCL Press. DOI: https://doi.org/10.14324/111.9781787353886

Further details about Creative Commons licenses are available at http://creativecommons.org/licenses/

ISBN: 978‑1‑78735‑389‑3 (Pbk.)


ISBN: 978‑1‑78735‑388‑6 (PDF)
DOI: https://doi.org/10.14324/111.9781787353886

Printed by Albe de Coker, Antwerp, Belgium


Big Data collected by customer-facing organisations
– such as smartphone logs, store loyalty card
transactions, smart travel tickets, social media posts,
or smart energy meter readings – account for most of
the data collected about citizens today. As a result,
they are transforming the practice of social science.
Consumer Big Data are distinct from conventional
social science data not only in their volume, variety
and velocity, but also in terms of their provenance
and fitness for ever more research purposes. The
contributors to this book, all from the Consumer
Data Research Centre, provide a first consolidated
statement of the enormous potential of consumer data
research in the academic, commercial and government
sectors – and a timely appraisal of the ways in which
consumer data challenge scientific orthodoxies.

“An insightful, state-of-the-art guide into the social
and commercial value of applying geographical
thinking to the study of consumer data.”
Professor Richard Harris, University of Bristol

“An excellent guide to leveraging the value of
academic research on valid data. Partnerships
based around consumer data should be encouraged
and supported by all and their outputs used to
better the way we manage the world we live in.”
Bill Grimsey, retailer and author of
The Vanishing Highstreet

“The use of data from everyday consumer
transactions is a potential game-changer for
understanding economic and social patterns and
trends. This is an excellent overview of the field.”
Dr. Tom Smith, Managing Director, Office for
National Statistics Data Science Campus
