Hidalgo

THREE EMPIRICAL STUDIES ON THE AGGREGATE DYNAMICS OF HUMANLY
DRIVEN COMPLEX SYSTEMS
A Dissertation
Submitted to the Graduate School
of the University of Notre Dame
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
by
César A. Hidalgo
Albert-László Barabási, Director
Graduate Program in Physics
Notre Dame, Indiana
July 2008
THREE EMPIRICAL STUDIES ON THE AGGREGATE DYNAMICS OF HUMANLY
DRIVEN COMPLEX SYSTEMS
Abstract
by
César A. Hidalgo
Complex systems are characterized by having emergent properties that cannot be
explained from their large number of interacting and heterogeneous components. Different
aspects of human society can be described as a complex system, as large numbers of people
aggregate into a host of complex structures.
Here we empirically study three different aspects of humanly driven complex
systems. First, we study the dynamics of a mobile phone network reconstructed from
millions of individual phone calls. By looking at time resolved data we show that the
structure of the mobile phone network is coupled to the dynamics of mobile phone links.
Second, we study the statistical properties of human mobility patterns and show that the
characteristic distance travelled by individuals follows a heterogeneous distribution which
explains the previously observed Lévy-flight properties of human mobility. Third, we
construct a network summarizing world trade to study the dynamics of countries'
productive structures and show that the structure of the product space conditions the
industrial development of nations.

These three studies illustrate how large data sets can be used to empirically study
humanly driven complex systems. Individually, they present relevant information that can
be used to benchmark future models for each one of these complex systems or can be used
as empirical rules characterizing them.

To my family, who supported me closely while more than 5,000 miles away.
CONTENTS
FIGURES
TABLES
ACKNOWLEDGMENTS
FOREWORD
CHAPTER 1: INTRODUCTION
1.1 The Statistical Physics of Society
1.2 Physics and Economy
CHAPTER 2: THE DYNAMICS OF A MOBILE PHONE NETWORK
2.1 Introduction
2.2 Data
2.3 The Persistence of Ties
2.4 Global Analysis of the Persistence of Ties
2.5 Network Structure and the Persistence of Ties
2.6 Multivariate Analysis
2.7 Using Topology to Infer Future Ties
CHAPTER 3: UNDERSTANDING HUMAN MOBILITY PATTERNS
3.1 Introduction
3.2 Source Data
3.3 The Heterogeneity of Human-Mobility Patterns
3.4 Testing the Power-Law Curve Fits
3.5 The Periodicity of Human Mobility Patterns
3.6 The Shape of Human Mobility Patterns
3.7 The Anisotropy of Human Mobility Patterns
CHAPTER 4: THE PRODUCT SPACE CONDITIONS THE DEVELOPMENT OF NATIONS
4.1 Introduction
4.2 Product Proximity
4.3 The Product Space
4.4 Generating a Network Representation of the Product Space
4.5 The Product Space and the Patterns of Comparative Advantage
4.6 Discussion
CHAPTER 5: DISCUSSION
5.1 Physics and People
5.2 The Product Space
5.3 Every Tune in the Guitar
CHAPTER 6: APPENDIXES
6.1 Appendix I: Papers Published During My PhD
6.2 Appendix II: Product Space Properties
6.3 Appendix III: Simulating Diffusion
REFERENCES
FIGURES
Figure 1 Definition of Persistence
Figure 2 Persistence across a cellular phone network
Figure 3 Network structure and the persistence of ties
Figure 4 Predicting future ties
Figure 5 Interevent time distribution P(ΔT) of calling activity
Figure 6 Basic human mobility patterns
Figure 7 Kolmogorov-Smirnov goodness of fit test
Figure 8 The bounded nature of human trajectories
Figure 9 The shape of human trajectories
Figure 10 Hierarchically clustered proximity matrix representing the 1998-2000 product space
Figure 11 Network representation of the 1998-2000 product space
Figure 12 Earliest version of the MST representing the "skeleton" of the product space
Figure 13 Representation of the product space based on the MST plus all links with a proximity above 0.55
Figure 14 Network representation of the product space. Layout uses a force spring algorithm
Figure 15 Localization of the productive structure for different regions of the world
Figure 16 Empirical evolution of countries
Figure 17 Simulated diffusion process and inequality
Figure 18 Sketch of the GG&S product space. Links are not scientifically accurate
Figure 19 Distribution of proximity for links connecting products with the same Leamer classification (blue) and with a different one (red)
Figure 20 Network representation of the product space in which node sizes are proportional to PRODY
Figure 21 PRODY as a function of distance for six different products in the space
Figure 22 Average PRODY as a function of the distance for products with a given Leamer Annotation
Figure 23 One step diffusion process for Korea and Chile
Figure 24 Iterated diffusion process for Chile and Korea
Figure 25 Distribution for the average PRODY of the top 50 products reached after 20 diffusion steps at three different proximities
TABLES
TABLE 1 DATA PANELS AVAILABLE FOR THIS STUDY
TABLE 2 PERSISTENCE OF TIES AND LINK ATTRIBUTES
TABLE 3 CORRELATIONS AND REGRESSIONS BETWEEN NODE ATTRIBUTES AND PERSEVERANCE
TABLE 4 STRENGTH OF THE LINKS BETWEEN AND WITHIN PRODUCTS AS CLASSIFIED BY LEAMER
TABLE 5 PEARSON'S CORRELATION COEFFICIENT BETWEEN THE PRODUCT SPACES GENERATED WITH DATA FROM 1985, 1990 AND 1998
ACKNOWLEDGMENTS
I would like to thank Albert-László Barabási for accepting me into his group, for having
confidence in me from my early years in graduate school, and for the numerous
discussions and advice we exchanged during these four years. I am positive that I would
not have been able to have such a wonderful graduate school experience if it were not for
László's support and personality. I would also like to thank Ricardo Hausmann for the
innumerable conversations, advice and time spent discussing many scientific
and non-scientific issues with me. He has greatly inspired the evolutionary view of systems
I present in this dissertation and has encouraged me to be innovative and honest in
scientific terms.
I am also enormously grateful to Christine Teutsch for her support and love during
the more than two years we have been together, and to Alejandra Castro for her support in
my early years in graduate school and all along my undergraduate education.
I would also like to thank all my collaborators during these four years: Marta
Gonzalez, Bailey Klinger, Carlos Rodriguez-Sickert, Luigi Cuccia, Denis Dupuy, Nicolas
Bertin, Nicholas Blumm, Pu Wang, Kavitha Venkatesan, Vanessa Vermeirssen, Marc
Vidal and Nicholas Christakis, for being wonderful collaborators.
I am also grateful for many interactions with KI Goh, Andrea Asztalos, Alexei
Vazquez, Muhammed Yildirim, Anne-Ruxandra Carvunis, Deok-Sun Lee, Juyong Park,
Julian Candia, Zehui Qu, Natali Gulbahce, Maximilian Schich, Marcio Argollo de
Menezes, Zoltan Toroczkai, Sameet Sreenivasan and Chaoming Song.
Special thanks go to the CCNR staff, especially to Suzanne Aleva, who has given
countless hours to the advancement, promotion and management of the complex
networks field, as well as to Nicole Halley, Nicole Leete and Agnes Petrozcky.
I would also like to thank the Notre Dame Physics Department, especially Professor
Kathie Newman, who has been there to help me all along graduate school.
Additionally, I would like to thank Professor Francisco Claro for his support and
several conversations during my graduate studies and during previous years. I would also
like to thank Professor Pablo Marquet for inviting me to the Santa Fe Institute, among
other things, and Carlos Rodriguez-Sickert, who has been a great collaborator and friend for
several years.
Finally, I would like to thank the Kellogg Institute at Notre Dame for their financial
support during my graduate studies.
The Dynamics of a Mobile Phone Network
C.A. Hidalgo was partly supported by the Kellogg Institute at Notre Dame and
acknowledges support from NSF grant ITR DMR-0426737, IIS-0513650 and the James S.
McDonnell Foundation 220020084. C. Rodriguez-Sickert acknowledges Sam Bowles and
the S.F.I. We thank Nicole Leete for proofreading our manuscript. Special
acknowledgments to A.-L. Barabasi for providing the source data and discussing the
manuscript.
Human Mobility Patterns
We thank D. Brockmann, T. Geisel, J. Park, S. Redner, Z. Toroczkai and P. Wang
for discussions and comments on the manuscript. This work was supported by the James S.
McDonnell Foundation 21st Century Initiative in Studying Complex Systems, the National
Science Foundation within the DDDAS (CNS-0540348), ITR (DMR-0426737) and
IIS-0513650 programs, and the U.S. Office of Naval Research Award N00014-07-C. Data
analysis was performed on the Notre Dame Biocomplexity Cluster supported in part by
NSF MRI Grant No. DBI-0420980. C.A. Hidalgo acknowledges support from the Kellogg
Institute at Notre Dame.
The Product Space Conditions the Development of Nations
We would like to thank the following for valuable comments: Philippe Aghion,
Laura Alfaro, Olivier Blanchard, Ricardo Caballero, Oded Galor, Elhanan Helpman, Asim
Khwaja, Jim Lahey, Robert Lawrence, Daniel Lederman, Lant Pritchett, Roberto Rigobon,
Dani Rodrik, Andres Rodriguez-Clare, Charles Sabel, Ernesto Stein, Federico
Sturzenegger, and David Weil. C.A.H. acknowledges support from the Kellogg Institute at
Notre Dame. C.A.H. and A.-L.B. acknowledge support from NSF grant ITR
DMR-0426737, IIS-0513650 and the James S. McDonnell Foundation 220020084.
A Network View of Development
We would like to thank Melissa Wojciechowski for help editing this manuscript.
FOREWORD
In this dissertation I present some of the research I conducted as a Physics graduate
student at the Center for Complex Network Research at the University of Notre Dame.
This dissertation has been formatted to satisfy the requirements of the Physics Department.
I refer anyone interested in the historical context in which this work was conducted, as
well as those interested in my view of the epistemological background underlying this
research, to search for the original copy submitted for review on my personal webpage (just
google it).
Best wishes,
Cesar A. Hidalgo
CHAPTER 1:
INTRODUCTION
In this document I present my contribution to a few problems in the field of
complexity or complex systems. While many physicists have contributed to this field for
several decades, they are not the only ones to study complexity. The field of complexity
has a strong interdisciplinary nature and a great number of contributions have emerged
from the interactions between scientists trained in different fields. Throughout this
document, we will present some examples mixing Physics, Biology, Economics, Computer
Science, Psychology and Social Sciences to answer questions that lie between traditional
disciplines and at the edge of emerging scientific paradigms.
Complexity has helped unite different branches of science by implicitly
demonstrating that some problems should be worked from many different angles. There
are no constraints forbidding the use of theories and methods inspired by a particular field
in a different scientific discipline. This creates the need for a field connecting seemingly
distant branches of science. The science of complexity has arisen in part to fulfill that
particular need. Moreover, complexity science has created scientific value which is
different from that of the particular fields where its adherents were originally trained. This
dual purpose makes the field of complexity attractive from an applied as well as a
fundamental perspective.
1.1 The Statistical Physics of Society
One scientific combination that has gained recent popularity is that of physicists
studying people. The recent surge of physicists into the realms of social science has been
fueled largely by the availability of data collected by network administrators and corporate
database managers in recent years. Physicists, mainly from statistical mechanics, have
rushed into the strangest datasets looking to describe systems and answer questions that
could only be speculated about ten years ago. From a historical perspective this is hardly
unexpected, as statistical physicists have been flirting with less conventional topics for
several decades, such as fractals [1,2,3], stock-exchange time series [4,5] and more recently
massive communications records [6,7,8,9].
In this dissertation I present two of my first contributions to the study of social
phenomena. Both of these contributions have been made possible by the availability of
millions of mobile phone records. In Chapter 2 we will study the dynamics of a mobile
phone network [10], whereas in Chapter 3 we show how people can be characterized by
their mobility patterns [11].
1.1.1 Why study this?
Before presenting these two topics, we will briefly discuss two of the main
applications of studies characterizing individuals by their social network structure,
dynamics or mobility patterns. While in their purest forms such studies might appear to be
simple curiosities, marketing executives are becoming increasingly interested in new
ways to classify people. The motivation here is obvious, as people in the marketing field
have constantly searched for new ways to segment populations and identify the individuals
who are most likely to purchase specific products. Heretofore, marketing
segmentation has been based on demographic and socio-economic attributes, such as a
person's gender, age, ZIP code and income. While all these variables are
important for marketing segmentation strategies, they cannot be expected to exhaustively
represent a person, as two individuals of the same age and gender, living in the same ZIP
code and having comparable incomes, can behave in extremely different ways. This is
where these new layers of data fall into place, as the structure of an individual's social
network, its dynamics and the individual's mobility patterns could be better proxies for
a person's behavior than demographic and socio-economic variables, and could therefore
be used to explore new marketing segmentation strategies.
While the previous paragraph presents a well-defined and concrete industrial
application of studies of social structure and dynamics, several other applications
open up thanks to this type of study. Epidemiology is probably the most important of
these, given the real threat of infectious diseases and software viruses [12,13]. This fuels the
need for studies providing data that can be used to understand the spread of biological and
digital pathogens from an empirical or theoretical perspective. Ultimately, the goal of these
studies is to create a future in which epidemiological forecasts [14,15] could be a reality.
Beyond the potential applications of quantitative studies of social structure and
dynamics, there is also a fundamental angle from which to address these questions. Social
systems are extremely complex and exhibit behaviors that are interesting to study purely out
of scientific curiosity. Together, applied and fundamental studies can help advance our
understanding of the universe at this high level of complexity; research conducted
from applied and from fundamental perspectives are thus complements rather than
substitutes. Here we present some research conducted from an empirical perspective with
some fundamental and applied flavors.
1.1.2 A little bit of novelty
Large-scale data on people's social networks and mobility have only become
available during the last few years. Hence, exploring ways to statistically describe millions
of individuals based on their social network structure, dynamics or spatial mobility patterns
is by definition a new field. Of even more recent appearance are data allowing us to
study the dynamics of the structures defined by individuals' social relationships and
mobility patterns. In the next chapters we present two studies that are among the first in the
literature to explore the dynamical properties of individuals' social networks and spatial
dynamics. The first of these studies concentrates on the stability of social ties and its
connection to network structure. This study was published in Physica A in May 2008 [10].
The second study discusses how to characterize the spatial patterns defined by the
movement of individuals. This study was published in Nature in June 2008 [11].
1.2 Physics and Economy
In Chapter 4 we will present a study combining the physics of networks with
development economics and industrial policy. While this approach can be considered
innovative, it is not a completely novel mixture, as physics and economics have been in
dialogue for more than a century. This dialogue long precedes complexity science and goes
back at least to the times of Léon Walras and William Stanley Jevons, two nineteenth-century
scientists credited with establishing the mathematical foundations of classical economics.
Léon Walras, a late bloomer in scientific terms [16], has been credited with introducing the
notion of equilibrium in classical economic theory [16] in his 1872 book Elements of a Pure
Economics [17]. It is believed that Walras adapted the notion of equilibrium from Louis
Poinsot's Éléments de Statique. On the same front, William Stanley Jevons defined the
problem of economic choice as an exercise in constrained optimization in which consumers
calculate which amount of goods will make them “happier.” Like Walras, Jevons was also
inspired by theoretical physics. This is evident in his Theory of Political Economy, in which
Jevons used equations from field theory in an attempt to describe human behavior in a form
as predictable as gravity [16].
Despite this nineteenth-century kinship, physics and economics have evolved mostly
separately ever since. During the last century, economics has branched out into several
different fields, many of which have adopted a view of the world that resembles
mathematics rather than physics. The field of finance, however, has been an exception, as its
use of random walks has kept it somewhat closer to physics. Random walks were first
proposed as a model for financial markets by Louis Bachelier in his doctoral thesis of
1900¹ [18], written under the supervision of Henri Poincaré. Bachelier's work was ignored for
many years and was brought back to light by Benoit Mandelbrot during the 1950s, when he
re-discovered the power-law nature of stock returns [19]. This portion of Mandelbrot's
work was also ignored during the first years after its publication, as the power-law behavior
introduced difficulties in the construction of analytically solvable financial models.
Yet describing a system with random walks and power laws is not the way to keep
physicists out of the loop. During the last decades a large number of physicists entered
financial research. This process was driven by the similarities between the two disciplines
and was greatly catalyzed during the 1990s, as the end of the Cold War created an excess
supply of physicists that was absorbed in part by the financial sector.
¹ Note that this is earlier than Einstein's 1905 paper on random walks.
Financial data have also been studied by physicists in academic settings. A superb
example of this is the work that Rosario Mantegna and Gene Stanley started at Boston
University in the mid-1990s. Several of their findings are summarized in their book
Econophysics [4], including the existence of scaling laws for the growth of firms [20,4], the
interpretation of financial markets as an anomalous diffusion process [4,21] and some of the
best evidence for the power-law nature of stock returns [4].
Another successful physicist in the field of finance is Doyne Farmer of the Santa Fe
Institute. His approach has been somewhat different from that of Mantegna and Stanley, as
he has concentrated on explaining different observations in the stock market using
agent-based simulations [22] or simple logic [23]. Still, an important part of his work resembles
that of the Boston University physicists, as it is grounded in empirical observations of
scaling and universal relationships [24].
The fact that financial markets have been studied using random walk models and
time series analysis can partly explain the voyage of physicists into such apparently distant
academic waters. Development economics, however, has been studied as the accumulation
of factors and through abstract production functions that resemble physics only in the
simplicity of some of their mathematical expressions, which has not been fertile enough
ground for collaborations to occur. The introduction of a network view of development not
only opens a new angle from which information can be revealed, but also opens a new
window for collaboration between scientists in the field of networks and an important branch
of economics.
CHAPTER 2: THE DYNAMICS OF A MOBILE PHONE NETWORK
2.1 Introduction
Physicists are no strangers to the study of social networks. During the last decade
several groups have explored the structure of social networks captured by e-mails [6,7],
cellular phones [8,9] and professional relationships such as being costars in a movie [25] or
collaborators in a paper [26]. Studying the dynamics of such social systems, however, has
been limited by the lack of longitudinal data, and as a result, only a few studies on the
dynamics of interpersonal connections have been published [6,8,27].
In principle there are many factors that could affect the stability of a social link
[28,29,30]. The aim of the subsequent sections is not to review all these factors, but to study
the coupling between the structure of the network as characterized in previous studies
[31,32] and the temporal stability of the links.
Here we use a year's worth of mobile phone data as a proxy for the structure and
dynamics of a social network involving close to two million people. Automatically
collected communication records have been proposed as a source of reliable data about
personal connections [33]. Email data, for example, have been used to study social processes
such as the formation of social links, or ties [6], and social structure [7], whereas blog data
have been used to study the spread of political opinions [34]. Communication records
overcome problems of survey data such as the subjective biases of respondents and the
intrinsic limitations of ego-centered networks, such as their unreliability in measuring a
social network's structure.
It is not our intention to claim that cellular phone communications fully capture
social exchange. A social network is expressed through a host of interactions ranging from
e-mails to face-to-face contacts. People in close social contact tend to express their ties to
others through multiple interaction channels [35], such as email, cell phone
communications, instant messaging and face-to-face interaction. There are arguments,
however, favoring the use of cellular phone calls as a relevant proxy for large-scale social
networks. Specifically, it has been shown that objective measures such as the one we use in
our study can accurately predict self-reported friendships [36]. Moreover, from a scientific
perspective, interest in mobile-phone studies has been expressed through the emergence of
a literature on mobile-phone networks in which people have studied the strength of social
ties in cross sections of the network [9] and the dynamics of social groups [6,8].
There are also some technical aspects that favor the use of mobile phone records
as a proxy for social interactions. Mobile phone numbers are unlisted, thus knowing them
reveals some sort of social connection between caller and callee. Also, cellular phones
were the most widespread information technology at the time these data were collected, with
a penetration larger than 40% worldwide and close to 100% in developed countries, such as
the one considered in this study. During the same time period, internet penetration was just
over 13% worldwide and 51% for developed countries (MDGS indicators U.N.
http://mdgs.un.org/unsd/mdg), making cellular phones the most complete method to study
social interactions on the population scale. In addition, mobile-phone usage has been
particularly democratic to the extent that it has homogeneously penetrated different social
strata [37].
2.2 Data
Our data consist of 7,948,890 voice calls between 1,950,426 users of a service
provider holding approximately 25% of an industrialized country's market. The data
are organized into ten panels, or data cross-sections, collected between April 15, 2004 and
March 31, 2005. Each panel summarizes 15 days of mobile phone calls between the members
serviced by the provider who supplied the data. Not every 15-day period is covered (see
Table 1), as this was the way in which the data were made available to us. We consider only
agents that made or received at least one call in each panel to avoid dealing with dropouts or
new subscribers. We hereafter assume that at high service penetration levels (~100%) people
serviced by a particular provider are equivalent to a random sample of the population. In our
network, nodes are mobile phone numbers, which we interpret as people, and links are the
calls connecting them.
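The selection rule described above (keep only agents seen in every panel) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the toy call records, the user labels and the function names `active_in_all_panels` and `restrict` are hypothetical, and restricting calls to pairs in which both endpoints are retained is an assumption made here for the example.

```python
def active_in_all_panels(panels):
    """Return the users who made or received at least one call in every panel."""
    users_per_panel = [
        {user for call in panel for user in call} for panel in panels
    ]
    if not users_per_panel:
        return set()
    return set.intersection(*users_per_panel)


def restrict(panels, users):
    """Keep only calls whose caller and callee are both in `users`.

    Assumption for this sketch: a call is kept only if both endpoints survive.
    """
    return [
        [(a, b) for a, b in panel if a in users and b in users]
        for panel in panels
    ]


# Hypothetical toy data: each panel is a list of (caller, callee) pairs.
panels = [
    [("a", "b"), ("b", "c"), ("d", "a")],
    [("a", "b"), ("c", "a")],
    [("b", "a"), ("c", "b"), ("e", "a")],
]

stable_users = active_in_all_panels(panels)
filtered = restrict(panels, stable_users)
```

In this toy example users "d" and "e" each appear in a single panel, so they are treated as dropouts or new subscribers and removed together with their calls.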
TABLE 1 DATA PANELS AVAILABLE FOR THIS STUDY.

Time Period                           In Our Study?
April 16 to April 30 2004             Yes
May 1 to May 15 2004                  No
May 16 to May 31 2004                 Yes
June 1 to June 15 2004                No
June 16 to June 30 2004               Yes
July 1 to July 15 2004                No
July 16 to July 31 2004               Yes
August 1 to August 15 2004            Yes
August 16 to August 31 2004           No
September 1 to September 15 2004      No
September 16 to September 30 2004     Yes
October 1 to October 15 2004          Yes
October 16 to October 31 2004         No
November 1 to November 15 2004        No
November 16 to November 30 2004       No
December 1 to December 15 2004        No
December 16 to December 31 2004       No
January 1 to January 15 2005          Yes
January 16 to January 31 2005         No
February 1 to February 15 2005        No
February 16 to February 28 2005       Yes
March 1 to March 15 2005              No
March 15 to March 31 2005             Yes
2.3 The Persistence of Ties
We measure the stability of social ties across time as the number of panels in which
a link is observed, divided by the total number of data panels available. We denote this
measure as persistence, which can be expressed as
\[ P_{ij} = \frac{\sum_{T} A_{ij}(T)}{M}, \tag{1} \]
where A_{ij}(T) is 1 if nodes i and j communicated in panel T and 0 otherwise, and M is the
total number of panels. Persistence is the probability of observing a tie when observing a
network for a given time. Because of the discrete nature of panel data, our definition of
persistence has a resolution that depends on the panel's duration. For example, if we consider
panels with a duration comparable to that of the links (~ minutes in the case of phone calls),
our definition of persistence gives us the number of times a tie or connection appeared.
When we instead consider panels lasting considerably longer than the typical duration of a
link, our definition of persistence captures the stability of a link on a larger, coarse-grained
temporal scale. Our data set consists of 10 panels, each summarizing 15 days of voice call
activity. Thus in this study we measure persistence on a monthly to yearly time scale.
We illustrate our definition of persistence using four different panels of a five node
network (Fig. 1 a). In this example, the link between nodes 2 and 4 is present in all panels
while the one between nodes 1 and 2 is present only in half of them. We say that the
persistence of the link between nodes 2 and 4 is 4/4 while the persistence of the link
connecting nodes 1 and 2 is 2/4. Each panel gives a binary representation of the network,
where a link is either present or not. Our definition of persistence summarizes the dynamics
of all binary panels by assigning a weight to each link. Thus, persistence is a change of
representation that allows us to map many network panels into a single weighted network
(Fig. 1 b).
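The mapping from binary panels to a single weighted network can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `persistence` and the toy panels are assumptions, and only the links between nodes 2 and 4 and between nodes 1 and 2 follow the example in the text (the remaining links are made up).

```python
from collections import Counter
from itertools import chain

def persistence(panels):
    """Map a list of binary panels (collections of undirected links) to link weights.

    A link's persistence is the fraction of panels in which it appears.
    """
    m = len(panels)
    # Sort each pair so (1, 2) and (2, 1) coincide, and use a set per panel
    # so that a link counts at most once per panel.
    normalized = [{tuple(sorted(link)) for link in panel} for panel in panels]
    counts = Counter(chain.from_iterable(normalized))
    return {link: n / m for link, n in counts.items()}

# Hypothetical four-panel, five-node example.
panels = [
    [(1, 2), (2, 4), (3, 5)],
    [(2, 4), (4, 5)],
    [(1, 2), (2, 4)],
    [(2, 4), (2, 3)],
]
weights = persistence(panels)
print(weights[(2, 4)])  # 1.0, present in 4 of 4 panels
print(weights[(1, 2)])  # 0.5, present in 2 of 4 panels
```

The returned dictionary is the weighted representation: one entry per link ever observed, with the weight playing the role of the persistence of that link.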
Our measure of persistence weakly increases with the number of times a link is
observed; hence persistence indicates stability as understood in previous studies [38,39].
However, given that we measure whether the link is observed in N>2 panels, it will not
describe a link dichotomously as stable or unstable, but will give the degree of stability 1/N
≤ P ≤ 1, rewarding those links expressed consistently in many panels.
Figure 1 Definition of Persistence. a Four panels of a five node network in which not all
links are equally persistent. b Persistence representation of the four panels presented in a.
Persistence is a tie attribute that can be defined for a particular node as the average
persistence of all its ties. We denote this as perseverance and define it as
$$p_i = \frac{1}{k_i} \sum_{j} P_{ij}, \qquad (2)$$
where ki is the degree, or number of connections of the ith node. We will use this
quantity to study the characteristics of nodes carrying persistent ties.
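As a rough illustration, the node-level average can be computed directly from a dictionary of link persistences. The sketch below assumes that the degree of a node is the number of distinct links attached to it in the aggregated network; the function name `perseverance` and the toy values are hypothetical.

```python
from collections import defaultdict

def perseverance(weights):
    """Average persistence of the ties attached to each node.

    `weights` maps undirected links (i, j) to their persistence.
    Assumption: the degree of a node is the number of distinct links
    touching it in the aggregated (weighted) network.
    """
    total = defaultdict(float)
    degree = defaultdict(int)
    for (i, j), p in weights.items():
        for node in (i, j):
            total[node] += p
            degree[node] += 1
    return {node: total[node] / degree[node] for node in total}

# Hypothetical persistence values for a three-link network.
link_persistence = {(1, 2): 0.5, (2, 4): 1.0, (2, 3): 0.25}
node_perseverance = perseverance(link_persistence)
print(node_perseverance[4])  # 1.0, its only tie is always present
```

Node 2 in this toy network averages over three ties of unequal persistence, which is the situation in which perseverance differs most from any single tie's persistence.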
Our definition of persistence has limitations. One could claim we are unfairly
punishing newly formed links. An alternative strategy would be to consider only the links
involved in the first panel; however an exercise in this line showed us that there is a strong
selection bias towards stable links when we consider such an option. For example, links
appearing only once, in the second to tenth panel, will not be considered if we set our
benchmark on the first panel only. Our definition also does not differentiate between links
active half of the time or those active during a particular half of the year. We do not
propose our measure as the ultimate way to reduce a set of network panels into a weighted
network, but as a simple way to do so, allowing us to characterize to first approximation the
stability of a network's links.
2.4 Global Analysis of the Persistence of Ties
Figure 2a shows the persistence histogram for the voice call network. The
distribution is bimodal, meaning that ties tend to be either active most of the time or rarely
expressed. This is known in social network analysis as a core-periphery structure [13],
where stable ties compose a person's social core and unstable ties connect people to the
more peripheral actors in their life.
Figure 2 Persistence across a cellular phone network. a Distribution of persistence for all
links. b Fraction of surviving ties as a function of time. The inset shows the same plot in a
double logarithmic scale. The continuous line is $t^{-1/4}$.
The decay of ties as a function of time can be approximated by a power law
(Fig. 2b), in agreement with the 4-year study performed by Burt [40]. The fact that the
survival probability of a tie can be approximated by $\sim t^{-\alpha}$ with $\alpha = 0.25 \pm 0.07$ indicates that a
great number of ties disappear quickly, while others tend to stay for very long periods of
time. On average, less than 40% of the ties are conserved after 15 days. After this initial
drop however, ties disappear slowly allowing more than 20% of the ties to remain after a
year. We note that the discreteness and sparseness of our data do not allow us to prove
that tie decay follows a power-law. Yet the graphical analysis of Figure 2 b can be
considered as suggestive evidence motivating a hypothesis and further study in this
direction.
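A graphical estimate of the decay exponent can be reproduced with a straight-line fit in double logarithmic scale. The sketch below is only an illustration of that heuristic on synthetic data; it is not the fitting procedure used in this study, the function name is hypothetical, and a log-log fit by itself does not establish that a decay follows a power law.

```python
import numpy as np

def fit_decay_exponent(t, surviving_fraction):
    """Least-squares slope of log(fraction) versus log(t); returns alpha."""
    slope, _intercept = np.polyfit(np.log(t), np.log(surviving_fraction), 1)
    return -slope

# Synthetic illustration only: elapsed time and an exact t^(-1/4) decay.
t = np.array([1.0, 3.0, 6.0, 9.0, 12.0])
fraction = 0.4 * t ** -0.25
alpha = fit_decay_exponent(t, fraction)
print(round(alpha, 3))  # 0.25
```

With real panel data, `fraction` would be the share of first-panel ties still observed in each later panel, and the scatter around the fitted line would indicate how far the power-law approximation can be trusted.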
2.5 Network Structure and the Persistence of Ties
2.5.1 Bivariate Analysis
Figure 3a shows a fragment of the mobile-call network extracted by considering all
connections up to 3 links from a randomly chosen user. Although this example shows less
than 0.0008% of our network, it visually summarizes the correlations between
persistence, perseverance and the topological attributes of the mobile-call network. In
particular, we find that these temporal attributes correlate with topological variables such
as the number of connections or degree of a node ki, the average reciprocity of a node r
(fraction of ties containing both incoming and outgoing calls) and the clustering
coefficient of a node Ci defined as:
$$C_i = \frac{2\Delta}{k_i (k_i - 1)}, \qquad (3)$$
where Δ is the number of triads, or fully connected triangles, in which the node is
involved. Figure 3b shows a histogram of persistence split into 9 different degree
categories revealing that persistent links represent a large fraction of the connections for
low degree nodes while transient links are more common for large degree nodes. The
number of persistent ties grows, however, as a function of degree, meaning that although
on average the persistence of high degree nodes is lower, in absolute terms their core is
larger (Figure 3 c).
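The clustering coefficient of Eq. (3) can be computed from an adjacency structure by counting the linked pairs among a node's neighbors. The sketch below is a minimal illustration; the function name, the toy graph and the convention of returning 0 for nodes with fewer than two neighbors are assumptions.

```python
from itertools import combinations

def clustering_coefficient(neighbors, i):
    """C_i = 2 * triangles / (k_i * (k_i - 1)) for node i.

    `neighbors` maps each node to the set of nodes it is linked to
    (undirected, no self-loops). Nodes with fewer than two neighbors
    get 0.0 by convention (an assumption; the text does not cover them).
    """
    k = len(neighbors[i])
    if k < 2:
        return 0.0
    # A triangle through i exists for every pair of its neighbors that is linked.
    triangles = sum(1 for u, v in combinations(neighbors[i], 2) if v in neighbors[u])
    return 2 * triangles / (k * (k - 1))

# Hypothetical toy graph: node 1 is linked to 2, 3 and 4; only 2 and 3 are linked.
neighbors = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
c1 = clustering_coefficient(neighbors, 1)  # one triangle out of three possible pairs
c2 = clustering_coefficient(neighbors, 2)  # its two neighbors are linked
```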
Figure 3 d shows the distribution of persistence divided by clustering coefficient
categories, indicating that highly clustered nodes tend to have relatively large cores. In the
core periphery context, this means that persevering nodes are located in dense parts of the
social network (Fig. 3a I) while those in sparser parts tend to have non-persistent ties acting
as bridges which intermittently connect different parts of the network (Fig. 3a II). Finally,
we split the distribution of persistence by reciprocity (Figure 3e) and observe that nodes
with more reciprocated ties tend to be more persistent.
Figure 3 Network structure and the persistence of ties a A fragment of the network extracted by considering up to the second
neighbor of a randomly chosen node (indicated by a black arrow). (b-e summarize statistics for the entire network) b Distribution
of persistence divided into nine degree categories c Number of persistent links defined as those with a persistence of, from top to
bottom: 6/10, 7/10, 8/10, 9/10 and 10/10. d Distribution of persistence divided into nine clustering categories. e Distribution of
persistence divided into five different reciprocity segments.
2.6 Multivariate Analysis
2.6.1 At the link level
In the previous section we presented a bivariate analysis in which we analyzed the
effect of three single structural variables and found that persistence depends monotonically
on all of them (degree, clustering coefficient and reciprocity). The observed correlations
however, might well be redundant. To check if this is the case we perform a multivariate
analysis to quantify the effect of each of these variables on the persistence of ties. Because
of the large number of observations considered (∼ 2 million nodes, ∼ 8 million ties) the
confidence intervals of the regressions do not spread far from the predicted values. Hence
we concentrate our discussion on the relative magnitude of the effects rather than on their
significance.
In social networks, it is well known that agents tend to connect to other
agents that have a similar degree [41,42]. It is not known, however, whether links connecting
same-degree agents tend to be more stable than those connecting different-degree agents.
To study this effect we performed a regression in which we study the persistence of a link
as a function of the difference in degree between the nodes adjacent to the ends of each
link. Furthermore, we also include in the regression the difference in clustering and
average reciprocity of nodes connected by a particular link. In addition to this, we consider
two link attributes, the reciprocity of links R (was there ever a panel in which caller and
callee reciprocally called each other?) and the topological overlap (TO) associated with
that link which is defined as
, (4)
where Oij is the number of neighbors that agents i and j have in common and ki and
kj are their respective degrees. Topological overlap is a local measure of betweenness
indicating the number of neighbors shared by two nodes at the ends of a link.
2.6.2 Performing Multivariate Analysis
Multivariate analysis is a standard statistical technique used to separate the
correlation between different variables. The technique is an extension of the bi-variate case
which can be used to study the variance shared by a pair of variables.
We can illustrate how multiple regression works by using an example with two
explanatory variables (x1 and x2) and one dependent variable (y). Multiple regression
analysis is based on linear regression, as many other functional forms can be linearized by
performing a change of variables.
Regression analysis begins by performing the least square fit:
y=B1x1+B2x2+A, (5)
where B1 and B2 are the regression coefficients and A is the intercept: y(x1=0,x2=0).
From the definition of the correlation coefficient we can interpret B1 and B2 as the change
in y associated with a standard deviation of x1 or x2.
To separate the effect on y from x1 and x2 we first calculate the correlation
coefficient between x1 and x2, which we denote as r12.
$$r_{12} = \frac{\mathrm{cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}, \qquad (6)$$
where cov is the covariance and σ stands for the standard deviation. We also use (6)
to calculate the correlation between y, x1 and x2, which we denote ry1 and ry2 respectively.
The total variance in y explained by x1 and x2 is then given by:
$$R^2(x_1, x_2) = \frac{r_{y1}^2 + r_{y2}^2 - 2 r_{y1} r_{y2} r_{12}}{1 - r_{12}^2}. \qquad (7)$$
We can split the effects of x1 and x2 on y by calculating the partial regression
coefficients which are given by:
$$pr_1 = \left(R^2 - r_{y2}^2\right)^{1/2} \qquad (8)$$
$$pr_2 = \left(R^2 - r_{y1}^2\right)^{1/2} \qquad (9)$$
and indicate the amount of variance in y explained by each one of these variables.
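The two-predictor case can be checked numerically. The sketch below uses synthetic data (not data from this study) and the standard identity that expresses the squared multiple correlation through the three pairwise Pearson correlations, then compares it with the value obtained from an explicit least-squares fit of Eq. (5). All variable names are illustrative assumptions.

```python
import numpy as np

def r_squared_from_correlations(ry1, ry2, r12):
    """Two-predictor R^2 from pairwise Pearson correlations."""
    return (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)

rng = np.random.default_rng(0)  # synthetic data, for illustration only
x1 = rng.normal(size=5000)
x2 = 0.6 * x1 + rng.normal(size=5000)          # correlated predictors
y = 1.5 * x1 - 0.5 * x2 + rng.normal(size=5000)

corr = np.corrcoef([y, x1, x2])
ry1, ry2, r12 = corr[0, 1], corr[0, 2], corr[1, 2]
r2_formula = r_squared_from_correlations(ry1, ry2, r12)

# Same quantity from an explicit least-squares fit y = B1*x1 + B2*x2 + A.
design = np.column_stack([x1, x2, np.ones_like(x1)])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residual = y - design @ coef
r2_fit = 1 - residual.var() / y.var()

# Variance attributable to each predictor beyond what the other explains.
unique_x1 = r2_formula - ry2**2
unique_x2 = r2_formula - ry1**2
```

The two estimates of the explained variance agree to numerical precision, and the differences `unique_x1` and `unique_x2` show how much each predictor adds once the other is already in the model, which is the sense in which redundant correlations are separated.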
We use multiple regression analysis to show that R, TO, ΔC, Δk and Δr explain
40% of the variance in persistence (Table 2, $R^2 = 0.397$). The contribution of each one
of them can be isolated by considering the partial
regression coefficients [43], which are a way to quantify how much of the variance is
explained by each one of the covariates used in a regression. This technique shows that
assortative mixing is not associated with the persistence of ties, whereas the reciprocity of
the links (0 non-reciprocal, 1 reciprocal) explains 26% of the variance in persistence, followed by
topological overlap, which explains 3.4% of the variance in persistence.
TABLE 2 PERSISTENCE OF TIES AND LINK ATTRIBUTES
Pearson’s Correlation      ΔC       Δk       Δr       R        TO       Persistence
ΔC                         1        0.023    0.15     0.11     0.23     0.15
Δk                                  1        0.02     -0.13    -0.19    -0.16
Δr                                           1        -0.68    -0.073   0.033
R                                                     1        0.2964   0.5886
TO                                                             1        0.3537
Regression Coefficients    0.09     0.002    0.15     0.35     0.56
Partial Correlations       0.0027   0.0032   0.007    0.26     0.034
2.6.3 Perseverance and local topology
In the previous section we showed that high degree agents had on average less
persistent ties than low degree agents. We also saw that highly clustered agents tended to
have a larger number of persistent connections and that reciprocal ties tend to be more
persistent on average. Again, we explore the redundancy of such statements using linear
regression and split the contribution to perseverance from each of these variables by
calculating their partial correlations (Table 3). Together, these variables explain almost
50% of the variance in perseverance ($R^2 = 0.49$). Their contributions are quite uneven,
however. When we look at the partial correlation coefficients extracted from our linear
model we find that most correlations vanish and the biggest contribution to perseverance is
given by the average reciprocity r of an agent's ties, which explains 27% of the variance.
The negative effect of degree on the persistence of an agent's ties is still present, but greatly
attenuated. This means that high degree agents which reciprocate their ties have more
persistent ties as well. The negative effect of an agent's degree on the persistence of its ties
is in large part explained by the fact that high degree agents tend to reciprocate fewer of their
ties. Similarly, the clustering coefficient C, which appeared as the strongest predictor in the
bi-variate case, explains only 6% of the variance when reciprocity and degree are taken
into account. This shows that cliques are formed by reciprocal ties minimizing the
additional information about persistence carried by cliques themselves.
TABLE 3 CORRELATIONS AND REGRESSIONS BETWEEN NODE ATTRIBUTES AND PERSEVERANCE
Pearson’s Correlation      C        k        R        Perseverance
C                          1        -0.51    0.49     0.64
k                                   1        -0.34    -0.45
R                                            1        0.62
Regression Coefficients    0.0598   0.0122   0.3626
Partial Correlation        0.062    0.11     0.27
2.7 Using topology to infer future ties
We finish our discussion by asking: How well can we predict the stability of ties
starting from a single panel? As mentioned before, persistence is a time-like, vertical
variable and is not constrained to correlate with space-like, horizontal variables. As we saw
from our multivariate analysis, the information carried by structural variables can be
redundant [44,45], thus, it is important to take into account their correlations to unveil their
real contribution to the persistence of ties. Can we use this information to predict which ties
persist in time? To answer this question we looked at our first data panel and used different
criteria to predict which ties will be stable. We then looked at the fraction of these ties
appearing after 1, 3, 6, 9 and 12 months and gauged the accuracy of our predictions by
measuring their Positive Predictive Value (PPV) defined as:
$$\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad (10)$$
where TP is the number of true positives and FP is the number of false positives.
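In terms of sets of links, the positive predictive value is the share of predicted ties that are actually observed later. A minimal sketch, with a hypothetical function name and made-up toy data:

```python
def positive_predictive_value(predicted, observed):
    """TP / (TP + FP) for a set of ties predicted to persist.

    TP counts predicted ties that are observed later; FP counts predicted
    ties that are not. Returns NaN when nothing is predicted.
    """
    predicted, observed = set(predicted), set(observed)
    tp = len(predicted & observed)
    fp = len(predicted - observed)
    return tp / (tp + fp) if predicted else float("nan")

# Hypothetical toy data: ties predicted from a first panel, ties seen later.
predicted = {(1, 2), (2, 4), (3, 5), (4, 5)}
observed_later = {(2, 4), (3, 5), (4, 5), (1, 5)}
ppv = positive_predictive_value(predicted, observed_later)
print(ppv)  # 0.75, three of the four predicted ties reappear
```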
We begin by testing the prediction that all ties observed as reciprocal in the first
panel will be conserved in the future. For this hypothesis, the PPV ranges from 70% after
one month to 43% after a year (Fig 4 a). For comparison, we picked a random set of ties
and found a PPV of 35% after a month and 20% after a year.
We can improve our predictive power by using a more stringent criterion. If we
consider all reciprocal links that also have a topological overlap TO ≥ 0.01 we
improve the PPV of our prediction by 5%, while an even more stringent criterion based on
a TO ≥ 0.1 gives us an extra percent that allows us to predict with a PPV larger than 50%
after one year.
Figure 4 Predicting future ties a Accuracy of tie prediction by randomly choosing ties
(orange), choosing reciprocal ties (red), reciprocal ties with a T.O.>0.01 (green), reciprocal
ties with a T.O.>0.01, ties with a T.O.>0.01 (blue) and a T.O>0.1 (purple). b Sensitivity of
the predictive methods presented in a. using the same color scheme.
The increase in accuracy brought by more stringent criteria reduces the number of
links predicted to be persistent. Thus the sensitivity of our method, defined as:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad (11)$$
where FN is the number of false negatives, decreases with the stringency of the
criteria used but increases with time (Fig. 4 b). Hence there is a tradeoff between the
accuracy of our prediction and the number of predictions we can make. Using the simple
method presented above, an increase in accuracy comes with a decrease in sensitivity so
more accurate predictions can be made only if we accept a reduction in the number of
predictions being made.
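The tradeoff can be made concrete with a toy example in which a stricter threshold on topological overlap raises the PPV while lowering the sensitivity. The sketch assumes the standard definition of sensitivity, TP / (TP + FN); the function names, overlap values and thresholds applied to this toy data are hypothetical.

```python
def ppv(predicted, observed):
    """TP / (TP + FP): share of predicted ties that are observed later."""
    predicted, observed = set(predicted), set(observed)
    return len(predicted & observed) / len(predicted) if predicted else float("nan")

def sensitivity(predicted, observed):
    """TP / (TP + FN): share of ties observed later that were predicted."""
    predicted, observed = set(predicted), set(observed)
    return len(predicted & observed) / len(observed) if observed else float("nan")

# Hypothetical first-panel ties, each with a made-up topological overlap.
first_panel = {(1, 2): 0.30, (2, 4): 0.15, (3, 5): 0.05, (4, 5): 0.02, (1, 3): 0.00}
observed_later = {(1, 2), (2, 4), (4, 5)}

results = {}
for threshold in (0.01, 0.1):
    predicted = {link for link, overlap in first_panel.items() if overlap >= threshold}
    results[threshold] = (ppv(predicted, observed_later),
                          sensitivity(predicted, observed_later))
```

In this toy case the looser threshold predicts four ties (PPV 0.75, sensitivity 1.0), while the stricter one predicts only two (PPV 1.0, sensitivity 2/3): fewer predictions, but each more likely to be right.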
Reciprocity appears to be the best predictor of persistence; however, it is not the
only one. The fact that the variance explained by other structural variables was redundant
with that explained by reciprocity allows us to use other structural variables as alternative
predictors of a tie. Figure 4 a also shows the PPV obtained when we use topological
overlap as our only predictive criterion. In this case we see that although the accuracy is
lower, it is still significantly better than random. Thus the redundancy observed in the
system can be turned into a predictive advantage and in the absence of information about
the reciprocity of links we can use redundant measures to make good educated guesses
about the existence of future ties.
2.7.1 Discussion
We have defined and measured the persistence of ties in a one year period using 10
panels of data summarizing the activity of all voice calls carried by a mobile phone carrier
from an industrialized country. We showed that the persistence of ties and perseverance of
nodes depend on topological variables (degree, clustering, reciprocity and topological
overlap). In our study, topological variables explain almost half of the variance in
persistence. The stability of social ties is likely a behavioral attribute; thus, it is not
surprising that the local structure of the social network, which is likely also a result of social
behavior, predicts the persistence of ties.
Social connections ultimately affect processes such as collective decisions [46,47]
and coordinated consumption [48]. But not all social connections are equally important;
some ties are stronger than others [49]. The strength of a social tie is not an absolute
measure. Hence there is a need to quantify the strength of ties using ad-hoc measures.
Persistence is a way to quantify the temporal stability of ties, and therefore their strength, in
one of the many possible dimensions along which tie strength can be quantified. As longitudinal
data becomes available, methods like the one introduced here can be used to quantify the
strength of links and ultimately determine its effects on network dynamics.
The relationships shown here demonstrate that the temporal dynamics of social
interactions are intrinsically coupled to the social network structure in such a way that the
existence of a tie can be predicted, with a respectable accuracy, using a simple criterion.
CHAPTER 3: UNDERSTANDING HUMAN MOBILITY PATTERNS
3.1 Introduction
Despite their importance for urban planning [50], traffic forecasting [51], and the
spread of biological [52,53,54] and mobile viruses [55], our understanding of the basic
laws governing human motion remains limited owing to the lack of tools to monitor the
time-resolved location of individuals. Here we study the trajectory of 100,000 anonymized
mobile-phone users whose position is tracked for a six-month period. We find that in
contrast with the random trajectories predicted by the prevailing Lévy-flight and
random-walk models [56] (see Box 1), human trajectories show a high degree of temporal
and spatial regularity, each individual being characterized by a time independent
characteristic length scale and a significant probability to return to a few highly frequented
locations. After correcting for differences in travel distances and the inherent anisotropy of
each trajectory, the individual travel patterns collapse into a single spatial probability
distribution, indicating that despite the diversity of their travel history, humans follow
simple reproducible patterns. This inherent similarity in travel patterns could impact all
phenomena driven by human mobility, from epidemic prevention to emergency response,
urban planning and agent based modelling.
Given the many unknown factors that influence a population's mobility patterns,
ranging from means of transportation to job and family imposed restrictions and priorities,
human trajectories are often approximated with various random walk or diffusion models
[56,57]. Indeed, early measurements on albatrosses, bumblebees, deer and monkeys [58,59]
and more recent ones on marine predators [60] suggested that an animal trajectory can be
approximated by a Lévy flight [61, 62], a random walk whose step size Δr follows a
power-law distribution $P(\Delta r) \sim \Delta r^{-\alpha}$ with α < 3. While the Lévy statistics for some
animals require further study [63], Brockmann et al. [56] generalized this finding to humans,
documenting that the distribution of distances between consecutive sightings of nearly
half a million bank notes is fat tailed. Given that money is carried by individuals, bank-note
dispersal is a proxy for human movement, suggesting that human trajectories are best
modelled as a continuous time random walk with fat tailed displacements and waiting time
distributions [56]. A particle following a Lévy flight has a significant probability to travel
very long distances in a single step [61,62], which appears to be consistent with human travel
patterns: most of the time we travel only over short distances, between home and work,
while occasionally we take longer trips.
Random Walk (RW): A random walk is a mathematical formalization of a
trajectory that consists of taking successive steps in random directions.
Lévy Flight (LF): A Lévy flight is a type of random walk in which the step
sizes are distributed according to a "heavy-tailed" distribution.
Truncated Lévy Flight (TLF): A truncated Lévy flight is a random walk in which
the increments are distributed according to a heavy-tailed distribution multiplied by an
exponential decay factor ($P(x) \sim x^{-\alpha} \exp(-\beta x)$).
Heavy-Tailed Distribution, Fat-Tail or Power-Law: A heavy-tailed distribution
is a probability distribution that has infinite variance. One of the most common forms it
takes is a power law, which falls to zero as $x^{-\alpha}$ where 0 < α < 3.
Each pair of consecutive sightings of a bank note reflects the composite motion of two or
more individuals who owned the bill between two reported sightings. Thus, it is not clear
if the observed distribution reflects the motion of individual users, or some hitherto
unknown convolution between population-based heterogeneities and individual human
trajectories. Contrary to bank notes, mobile phones are carried by the same individual
during his/her daily routine, offering the best proxy to capture individual human
trajectories [8,9,10,64,65].
We used two data sets to explore the mobility pattern of individuals. The first (D1)
consists of the mobility patterns recorded over a six-month period for 100,000 individuals
selected randomly from a sample of over 6 million anonymized mobile-phone users. Each
time a user initiates or receives a call or text message, the location of the tower routing the
communication is recorded, allowing us to reconstruct the user's time-resolved trajectory
(Figure 6 a and b). The time between consecutive calls follows a bursty pattern [66] (Figure
5) indicating that while most consecutive calls are placed soon after a previous call,
occasionally there are long periods without any call activity. To make sure that the
obtained results are not affected by the irregular call pattern, we also study a data set (D2)
that captures the location of 206 mobile-phone users, recorded every two hours for an
entire week. In both datasets the spatial resolution is determined by the local density of the
more than $10^4$ mobile towers, registering movement only when the user moves between
areas serviced by different towers.
3.2 Source Data
The D1 dataset was collected by a European mobile phone carrier for billing and
operational purposes. It contains the date, time and coordinates of the phone tower routing
the communication for each phone call and text message sent or received by 6 million
customers. The dataset summarizes 6 months of activity. To guarantee anonymity, each
user is identified with a security key (hash code). Furthermore, we only know the
coordinates of the tower routing the communication, hence a user's location remains
unknown within a tower's service area. Each tower serves an area of approximately 3 km².
Due to tower coverage limitations driven by geographical constraints and national frontiers
no jumps exceeding 1,000 km can be observed in the dataset.
The research was performed on a random set of 100,000 users selected from those
making or receiving at least one phone call or SMS during the first and last month of the
study, translating to 16,364,308 recorded positions. We removed all jumps that took users
outside the continental territory. We did not impose any additional criterion regarding the
calling activity to avoid possible selection biases in the mobility pattern.
The D2 dataset was collected for the operation of some services provided by the
mobile phone carrier, like pollen and traffic forecasts, which rely on the approximate
knowledge of customers' locations at all times of the day. For customers that signed up for
location dependent services, the date, time and the closest tower coordinates are recorded
on a regular basis, independent of their phone usage. We were provided such records for
1,000 users, among which we selected the group of users whose coordinates were recorded
every two hours during an entire week, resulting in 206 users for which we have 10,613
recorded positions. Given that these users were selected based on their actions (signed up
to the service), in principle the sample cannot be considered unbiased, but we have not
detected any particular bias for this data set.
For each user in D1 and D2 we sorted the time resolved sequence of positions and
constructed individual trajectories.
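The construction of trajectories and of the displacements between consecutive records can be sketched as follows. The record layout (hashed user id, timestamp, tower latitude and longitude), the function names and the use of the haversine great-circle distance are assumptions made for illustration; the text only states that tower coordinates were recorded.

```python
from collections import defaultdict
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def build_trajectories(records):
    """Group (user, timestamp, lat, lon) records by user, sorted by time."""
    by_user = defaultdict(list)
    for user, timestamp, lat, lon in records:
        by_user[user].append((timestamp, lat, lon))
    return {user: sorted(points) for user, points in by_user.items()}

def displacements(trajectory):
    """Distances between consecutive recorded positions of one user."""
    return [haversine_km(a[1], a[2], b[1], b[2])
            for a, b in zip(trajectory, trajectory[1:])]

# Hypothetical records, deliberately out of time order.
records = [
    ("u1", 3, 48.0, 11.0), ("u1", 1, 48.0, 11.0), ("u1", 2, 48.1, 11.0),
    ("u2", 1, 50.0, 8.0), ("u2", 5, 50.0, 8.0),
]
trajectories = build_trajectories(records)
jumps = {user: displacements(t) for user, t in trajectories.items()}
n_positions = sum(len(t) for t in trajectories.values())
n_jumps = sum(len(j) for j in jumps.values())
```

A user with n recorded positions contributes n - 1 displacements, so the number of displacements equals the number of positions minus the number of users. This is consistent with the counts reported for the two datasets: 16,364,308 positions for 100,000 users give 16,264,308 displacements, and 10,613 positions for 206 users give 10,407.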
Figure 5 Interevent time distribution P(ΔT) of calling activity. ΔT is the time elapsed
between consecutive communication records (phone calls and SMS, sent or received) for
the same user. Different symbols indicate the measurements done over groups of users
with different activity levels (# calls). The inset shows the unscaled version of this plot
3.3 The Heterogeneity of Human-Mobility Patterns
To explore the statistical properties of the population's mobility patterns we
measured the distance between users' positions at consecutive calls, capturing 16,264,308
displacements for the D1 and 10,407 displacements for the D2 datasets. We find that the
distribution of displacements over all users is well approximated by a truncated power-law
$$P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta} \exp(-\Delta r / \kappa), \qquad (12)$$
with β = 1.75 ± 0.15, $\Delta r_0$ = 1.5 km and cut-off values $\kappa|_{D_1}$ = 400 km and $\kappa|_{D_2}$ = 80 km
(Figure 6 c). Note that the observed scaling exponent is not far from β = 1.59 observed in
Ref. [56] for bank-note dispersal, suggesting that the two distributions may capture the
same fundamental mechanism driving human-mobility patterns.
Equation (12) suggests that human motion follows a truncated Lévy flight [56]. Yet,
the observed shape of P(Δr) could be explained by three distinct hypotheses: A. Each
individual follows a Lévy trajectory with jump size distribution given by (12). B. The
observed distribution captures a population-based heterogeneity, corresponding to the
inherent differences between individuals. C. A population-based heterogeneity coexists
with individual Lévy trajectories, hence (12) represents a convolution of hypotheses A and
B.
To distinguish between hypotheses A, B and C we calculated the radius of gyration
for each user as:
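$$r_g = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ (x_i - x_{cm})^2 + (y_i - y_{cm})^2 \right]}, \qquad (13)$$
where $(x_i, y_i)$ is the i-th recorded position, $x_{cm}$ and $y_{cm}$ are the coordinates of the
centre of mass defined by a user's positions, and the sum goes over all positions (N)
recorded for a user. The radius of gyration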
can be interpreted as the typical distance travelled by user a when observed up to time t
(Figure 6 b). Next, we determined the radius of gyration distribution P(rg) by calculating rg
for all users in samples D1 and D2, finding that they also can be approximated with a
truncated power-law
$$P(r_g) = (r_g + r_g^0)^{-\beta_r} \exp(-r_g / \kappa), \qquad (14)$$
with $r_g^0$ = 5.8 km, $\beta_r$ = 1.65 ± 0.15 and κ = 350 km. Lévy flights are characterized
by a high degree of intrinsic heterogeneity, raising the possibility that (14) could emerge
from an ensemble of identical agents, each following a Lévy trajectory. Therefore, we
determined P(rg) for an ensemble of agents following a Random Walk (RW), Lévy-Flight
(LF) or Truncated Lévy-Flight (TLF) (Figure 6 d) [57,61,62]. The ensemble of random
walkers was normalized such that the mean of the distribution matches that observed in our
data, whereas the ensemble of Lévy-Flight walkers had steps drawn from a distribution
with the same exponent as that found in (12). The steps of the Truncated Lévy-Flight
walkers were extracted from the distribution presented in (12).
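The comparison with null models can be illustrated with a small simulation. The sketch below computes the radius of gyration as the root-mean-square distance of a trajectory from its centre of mass and applies it to ensembles of identical random walkers and Lévy-flight walkers. It is an illustration under stated assumptions, not the simulation used for Figure 6d: the ensemble sizes are arbitrary, headings are uniform and independent, the Lévy steps are Pareto distributed with a tail exponent of 1.75, and the offset and exponential cut-off of (12) are not modelled.

```python
import numpy as np

def radius_of_gyration(xy):
    """Root-mean-square distance of positions from their centre of mass.

    `xy` is an (N, 2) array-like of planar coordinates.
    """
    xy = np.asarray(xy, dtype=float)
    centre = xy.mean(axis=0)
    return np.sqrt(((xy - centre) ** 2).sum(axis=1).mean())

def walk(step_lengths, rng):
    """Planar walk with the given step lengths and uniform random headings."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=len(step_lengths))
    steps = np.column_stack([step_lengths * np.cos(angles),
                             step_lengths * np.sin(angles)])
    return np.vstack([[0.0, 0.0], np.cumsum(steps, axis=0)])

rng = np.random.default_rng(1)
n_agents, n_steps = 200, 500   # illustrative sizes, not those of the study

# Random walk: unit steps. Levy flight: 1 + Lomax(a) is Pareto distributed
# with density ~ step^-(1 + a); a = 0.75 gives a tail exponent of 1.75.
rg_rw = [radius_of_gyration(walk(np.ones(n_steps), rng)) for _ in range(n_agents)]
rg_lf = [radius_of_gyration(walk(1.0 + rng.pareto(0.75, n_steps), rng))
         for _ in range(n_agents)]

# Spread of r_g across statistically identical agents, relative to the median.
spread_rw = np.std(rg_rw) / np.median(rg_rw)
spread_lf = np.std(rg_lf) / np.median(rg_lf)
```

Even though every agent in each ensemble follows the same rule, the Lévy ensemble shows a much broader spread of radii of gyration than the random-walk ensemble, which is why identical Lévy agents must be ruled out explicitly as an explanation for the observed P(rg).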
We find that an ensemble of Lévy agents displays a significant degree of
heterogeneity in rg, yet it is not sufficient to explain the truncated power-law distribution
P(rg) exhibited by the mobile-phone users. Taken together, Figs. 6c and d suggest that the
difference in the range of typical mobility patterns of individuals (rg) has a strong impact
on the truncated Lévy behavior seen in (12), ruling out hypothesis A.
If individual trajectories are described by a LF or TLF, then the radius of gyration
should increase in time as $r_g(t) \sim t^{3/(2+\beta)}$ [67,68] while for a RW $r_g(t) \sim t^{1/2}$. That is, the longer
we observe a user, the higher the chances that she/he will travel to areas not visited before.
To check the validity of these predictions we measured the time dependence of the radius
of gyration for users whose gyration radius would be considered small (rg(T) ≤ 3 km),
medium (20 < rg(T) ≤ 30 km) or large (rg(T) > 100 km) at the end of our observation period
(T = 6 months). The results indicate that the time dependence of the average radius of
gyration of mobile phone users is better approximated by a logarithmic increase, not only a
manifestly slower dependence than the one predicted by a power law, but one that may
appear similar to a saturation process (Figure 8 a).
Figure 6 Basic human mobility patterns. a, Week-long trajectory of 40 mobile phone users
indicate that most individuals travel only over short distances, but a few regularly move
over hundreds of kilometres. Panel b, displays the detailed trajectory of a single user. The
different phone towers are shown as green dots, and the Voronoi lattice in grey marks the
approximate reception area of each tower. The dataset studied by us records only the
identity of the closest tower to a mobile user, thus we cannot identify the position of a user
within a Voronoi cell. The trajectory of the user shown in b is constructed from 186
two-hourly reports, during which the user visited a total of 12 different locations (tower
vicinities). Among these, the user is found on 96 and 67 occasions in the two most preferred
locations, the frequency of visits for each location being shown as a vertical bar. The circle
represents the radius of gyration centred on the trajectory's centre of mass. c, Probability
density function P(Δr) of travel distances obtained for the two studied datasets D1 and D2.
The solid line indicates a truncated power law whose parameters are provided in the text
(see Eq. 12). d, The distribution P(rg) of the radius of gyration measured for the users, where
rg(T) was measured after T = 6 months of observation. The solid line represents a similar
truncated power law fit (see Eq. 14). The dotted, dashed and dot-dashed curves show P(rg)
obtained from the standard null models (RW, LF and TLF), where for the TLF we used the
same step size distribution as the one measured for the mobile phone users.
In Figure 8 b, we have chosen users with similar asymptotic rg(T) after T = 6
months, and measured the jump size distribution P(Δr|rg) for each group. As the inset of
Figure 8 b shows, users with small rg travel mostly over small distances, whereas those
with large rg tend to display a combination of many small and a few larger jump sizes.
Once we rescale the distributions with rg (Figure 8 b), we find that the data collapses into a
single curve, suggesting that a single jump size distribution characterizes all users,
independent of their rg. This indicates that P(Δr|rg) ~ rg^{-α} F(Δr/rg), where α ≈ 1.2 ± 0.1 and
F(x) is an rg-independent function with asymptotic behavior F(x < 1) ~ x^{-α} and rapidly
decreasing for x >> 1. Therefore the travel patterns of individual users may be
approximated by a Lévy flight up to a distance characterized by rg. Most important,
however, is the fact that the individual trajectories are bounded beyond rg, thus large
displacements which are the source of the distinct and anomalous nature of Lévy flights,
are statistically absent. To understand the relationship between the different exponents, we
note that the measured probability distributions are related by
$$P(\Delta r) = \int_0^{\infty} P(\Delta r \mid r_g)\, P(r_g)\, dr_g ,$$
which suggests that, to leading order, β = βr + α − 1, consistent, within error
bars, with the measured exponents. This indicates that the observed jump size distribution
P(Δr) is in fact the convolution between the statistics of individual trajectories P(Δr|rg) and
the population heterogeneity P(rg), consistent with hypothesis C.
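The leading-order relation between the exponents can be made explicit with a short scaling argument. The sketch below is an illustration rather than the derivation used in the original analysis: it assumes that P(rg) behaves as a pure power law rg^{-βr} over the relevant range (its cutoff is ignored) and that βr > 1, so that the integral over u converges.

```latex
\begin{align}
P(\Delta r) &= \int_0^\infty P(\Delta r \mid r_g)\, P(r_g)\, dr_g
  \sim \int_0^\infty r_g^{-\alpha}\, F\!\left(\frac{\Delta r}{r_g}\right) r_g^{-\beta_r}\, dr_g \\
&= \Delta r^{\,1-\alpha-\beta_r} \int_0^\infty u^{\alpha+\beta_r-2}\, F(u)\, du ,
  \qquad u \equiv \Delta r / r_g , \\
&\sim \Delta r^{-(\beta_r+\alpha-1)} .
\end{align}
```

The remaining integral does not depend on Δr: near u = 0 the integrand behaves as u^{βr−2}, because F(u) ~ u^{-α} there, and for u >> 1 it is suppressed by the rapid decay of F. Identifying the result with P(Δr) ~ Δr^{-β} gives β = βr + α − 1.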
3.4 Testing the power-law curve fits
We tested whether the empirical data could come from the fitted distributions by
performing a stringent variant of the Kolmogorov-Smirnov (KS) goodness of fit test [69].
The KS statistic is a simple measure of the distance between two distributions. In this
case, we use it to test the hypothesis: could the empirically observed distribution come
from the distribution found as its best fit? To test this, we generated synthetic data
from the fitted distribution and then used the KS test to see whether our data behaves
as well as synthetic data generated from the fitted distribution. We use two variants of the
KS statistic to compare empirical data with the fitted distribution and synthetic data with
the fitted distribution. The first is the standard KS statistic, given by:
$$KS = \max\left(|F - P|\right) \tag{15}$$
where F is the cumulative distribution of the best fit and P is the cumulative
distribution of the empirical or synthetic data. The regular KS statistic is not very sensitive
at the edges of the distribution. Hence, we also use the weighted KS statistic, defined as:
$$KS_W = \max\left(\frac{|F - P|}{\sqrt{P(1 - P)}}\right) \tag{16}$$
To test whether the empirical data behaves as well as the synthetic data we
calculated the KS and KSW statistics between the empirical data and its best fit and
compared these values with those obtained by calculating KS and KSW for 1,000 sets of
synthetic data generated from the best fit. If the values obtained for KS and KSW for the
empirical data are as small as or smaller than those obtained for the synthetic data, then we
can conclude that the empirical data is statistically consistent with its best fit. The results of
the KS test can be summarized using a p-value, obtained by integrating the distribution of KS
values generated with the synthetic data, starting from the value representing the empirical
distribution. Integrating from that value towards larger KS values, the p-value can be
interpreted as the fraction of synthetic data sets that lie at least as far from the best fit as
the observed data. A p-value close to 1 indicates that the empirical distribution matches its
best fit as well as synthetic data generated from the fit itself [69], whereas a relatively small
p-value (typically taken as p < 0.01) suggests that the empirical distribution is unlikely to be
the result of its best fit.
The p-values for the KS tests can be read from the caption of Figure 7.
Figure 7 Kolmogorov-Smirnov goodness of fit test. The figures compare the KS and KSW
statistics with those of 1,000 sets of synthetic data coming from the same distribution. a. Red
line indicates the KS value for Figure 6 c D1 (p(KS)=1). b. Red line indicates the KSW value
for Figure 6 c D1 (p(KSW)=1). c. Red line indicates the KS value for Figure 6 d D1
(p(KS)=0.62). d. Red line indicates the KSW value for Figure 6 d D1 (p(KSW)=0.82).
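The test procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function names `ks_statistics` and `ks_p_values` are hypothetical, an exponential distribution stands in for the fitted truncated power law, and the empirical cumulative distribution is evaluated at midpoints so that the weight in the KSW statistic stays finite.

```python
import numpy as np

def ks_statistics(sample, cdf):
    """Return (KS, KSW) between a sample and a fitted cumulative distribution."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    F = cdf(x)                                  # cumulative distribution of the fit
    # Assumption: midpoint empirical CDF, so P is never exactly 0 or 1 and the
    # weight 1/sqrt(P(1-P)) of the weighted statistic stays finite.
    P = (np.arange(1, n + 1) - 0.5) / n
    diff = np.abs(F - P)
    return diff.max(), (diff / np.sqrt(P * (1.0 - P))).max()

def ks_p_values(sample, cdf, sampler, n_sets=1000, seed=0):
    """Fraction of synthetic data sets at least as far from the fit as the data."""
    rng = np.random.default_rng(seed)
    ks_emp, ksw_emp = ks_statistics(sample, cdf)
    synthetic = np.array([ks_statistics(sampler(rng, len(sample)), cdf)
                          for _ in range(n_sets)])
    return (float(np.mean(synthetic[:, 0] >= ks_emp)),
            float(np.mean(synthetic[:, 1] >= ksw_emp)))

# Stand-in fit (assumption): an exponential distribution with unit scale.
def exp_cdf(x):
    return 1.0 - np.exp(-x)

def exp_sampler(rng, n):
    return rng.exponential(1.0, size=n)

# Made-up "empirical" data drawn from the stand-in fit itself.
data = np.random.default_rng(1).exponential(1.0, size=500)
p_ks, p_ksw = ks_p_values(data, exp_cdf, exp_sampler, n_sets=200)
print(p_ks, p_ksw)
```

Data that really come from the fitted distribution give p-values spread between 0 and 1, while data from a clearly different distribution give p-values near 0.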
3.5 The periodicity of human mobility patterns
To uncover the mechanism stabilizing rg we measured the return probability for
each individual Fpt(t) [68], defined as the probability that a user returns to the position
where they were first observed after t hours (Figure 8 c). For a two-dimensional random walk
Fpt(t) should follow ~ 1/(t ln²(t)) [68]. In contrast, we find that the return probability is
characterized by several peaks at 24 h, 48 h, and 72 h, capturing a strong tendency of
humans to return to locations they visited before, describing the recurrence and temporal
periodicity inherent to human mobility [70,71].
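A minimal sketch of how such a return probability could be estimated from hourly location records is given below. The function name `return_probability` and the toy trajectories are hypothetical, and the sketch assumes one location label per hour and counts a return only after the user has left the initial location; the exact conventions used for the data in this chapter may differ.

```python
import numpy as np

def return_probability(trajectories, t_max):
    """Estimate Fpt(t): fraction of users first seen back at their initial
    location exactly t time steps (hours) after the first observation."""
    counts = np.zeros(t_max + 1)
    for locations in trajectories:
        start = locations[0]
        has_left = False
        for t in range(1, min(len(locations), t_max + 1)):
            if locations[t] != start:
                has_left = True
            elif has_left:              # first return after having left
                counts[t] += 1
                break
    return counts / len(trajectories)

# Toy hourly trajectories (made-up location labels, not data from the study).
users = [
    ["home", "work", "work", "home"],
    ["home", "home", "shop", "home"],
    ["a", "b", "c", "d"],
]
fpt = return_probability(users, t_max=3)
print(fpt)
```

With real data, peaks of this histogram at multiples of 24 hours would reflect the daily periodicity discussed above.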
Figure 8 The bounded nature of human trajectories. a, Radius of gyration, rg(t), vs time
for mobile phone users separated in three groups according to their final rg(T), where T = 6
months. The black curves correspond to the analytical predictions for the random walk
models, increasing in time as (solid), and (dotted).
The dashed curves correspond to a logarithmic fit of the form A + B ln(t), where A and B
depend on rg. b, Probability density function of individual travel distances P(Δr|rg) for
users with rg = 4, 10, 40, 100 and 200 km. As the inset shows, each group displays a quite
different P(Δr|rg) distribution. After rescaling the distance and the distribution with rg
(main panel), the different curves collapse. The solid line (power law) is shown as a guide
to the eye. c, Return probability distribution, Fpt(t). The prominent peaks capture the
tendency of humans to regularly return to the locations they visited before, in contrast with
the smooth asymptotic behavior ~ 1/(t ln²(t)) (solid line) predicted for random walks. d, A
Zipf plot showing the frequency of visiting different locations. The symbols correspond to
users that have been observed to visit nL = 5, 10, 30, and 50 different locations. Denoting
with L the rank of the location listed in the order of the visit frequency, the data is well
approximated by R(L) ~ L^{-1}. The inset is the same plot in linear scale, illustrating that 40% of
the time individuals are found at their first two preferred locations.
To explore if individuals return to the same location over and over, we ranked each
location based on the number of times an individual was recorded in its vicinity, such that a
location with L = 3 represents the third most visited location for the selected individual. We
find that the probability of finding a user at a location with a given rank L is well
approximated by P(L) ~ 1/L, independent of the number of locations visited by the user
(Figure 8 d). Therefore people devote most of their time to a few locations, while spending
their remaining time in 5 to 50 places, visited with diminished regularity. The
observed logarithmic saturation of rg(t) is thus rooted in the high degree of regularity of
daily travel patterns, captured by the high return probabilities (Figure 8 c) to a few highly
frequented locations (Figure 8 d).
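The ranking described above can be illustrated with a short Python sketch. The helper names `rank_frequency` and `zipf_reference` and the toy visit list are hypothetical; the reference curve is simply a normalized 1/L law over the same number of ranks, used here only for comparison.

```python
from collections import Counter

def rank_frequency(visits):
    """Visit frequency by rank L (L = 1 is the most visited location)."""
    counts = sorted(Counter(visits).values(), reverse=True)
    total = sum(counts)
    return [c / total for c in counts]

def zipf_reference(n_locations):
    """Normalized 1/L law over n_locations ranks, for comparison."""
    weights = [1.0 / rank for rank in range(1, n_locations + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

# Made-up sequence of recorded locations for one user.
visits = ["home"] * 3 + ["work"] * 2 + ["gym"]
observed = rank_frequency(visits)
expected = zipf_reference(len(observed))
for rank, (p_obs, p_zipf) in enumerate(zip(observed, expected), start=1):
    print(rank, round(p_obs, 3), round(p_zipf, 3))
```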
3.6 The Shape of Human-Mobility Patterns
An important quantity for modeling human mobility patterns is the probability
Φa(x, y) of finding an individual a in a given position (x, y). As is evident from Figure 6 b,
individuals live and travel in different regions, yet each user can be assigned to a
well-defined area, defined by home and workplace, where she or he can be found most of
the time. We can compare the trajectories of different users by diagonalizing each
trajectory's inertia tensor, providing the probability of finding a user in a given position
(Figure 9 a) in the user's intrinsic reference frame.
To compare the trajectories of different users we calculate an individual reference
frame for each user. We do this by finding the set of axes in which the inertia tensor defined
by the collection of points visited by each user takes a diagonal form.
The moment of inertia tensor is given by:
(17)
where
(18)
We define the x axis of the user's intrinsic reference frame as the eigenvector
associated with the smaller eigenvalue of the inertia tensor. Thus, we look for a reference
frame such that:
. (19)
This can be achieved by performing a rotation:
(20)
such that II
(21)
(22)
This leads us to three possible solutions:
(A). If
and
then
(23)
We select one of the roots to make sure.
(B). and
, then from (22) we must have sin(
, hence there is no valid solution in this case.
(C). , we derive from (22)
. We select 0 or
according to which of these angles gives the minimum moment of inertia.
Finally, we make a conditional rotation of π to make sure the most frequent position
has a positive value on its horizontal component.
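The construction of the intrinsic reference frame can also be carried out numerically. The sketch below is one possible implementation under stated assumptions, not the procedure used for the figures: it assumes the standard two-dimensional inertia tensor of unit-mass points measured from the centre of mass, uses an eigen-decomposition instead of an explicit rotation angle, and the function name `intrinsic_frame` is hypothetical.

```python
import numpy as np

def intrinsic_frame(points):
    """Rotate recorded positions (one row per observation) into principal axes."""
    xy = np.asarray(points, dtype=float)
    xy = xy - xy.mean(axis=0)                     # origin at the centre of mass
    x, y = xy[:, 0], xy[:, 1]
    # Assumed standard 2D inertia tensor of unit-mass points.
    inertia = np.array([[np.sum(y * y), -np.sum(x * y)],
                        [-np.sum(x * y), np.sum(x * x)]])
    eigenvalues, eigenvectors = np.linalg.eigh(inertia)   # ascending order
    e_x = eigenvectors[:, 0]                      # axis of the smaller eigenvalue
    e_y = np.array([-e_x[1], e_x[0]])             # perpendicular axis
    rotated = np.column_stack((xy @ e_x, xy @ e_y))
    # Conditional rotation by pi: the most frequent position gets positive x.
    unique_rows, counts = np.unique(xy, axis=0, return_counts=True)
    most_frequent = unique_rows[np.argmax(counts)]
    if most_frequent @ e_x < 0:
        rotated = -rotated
    return rotated

# Made-up positions lying along the diagonal; (2, 2) is the most visited one.
points = np.array([[2.0, 2.0]] * 4 + [[-1.0, -1.0]] * 2 + [[0.0, 0.0], [-3.0, -3.0]])
rotated = intrinsic_frame(points)
print(rotated.round(3))
```

In this toy example all positions end up on the x axis of the intrinsic frame, with the most visited position on the positive side.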
3.7 The anisotropy of Human Mobility Patterns
A striking feature of Φ(x, y) is its prominent spatial anisotropy in this intrinsic
reference frame (note the different scales in Figure 9 a). Here we find that the larger an
individual's rg, the more pronounced is this anisotropy. To quantify this effect we defined
the anisotropy ratio S ≡ σy/σx, where σx and σy represent the standard deviations of the
trajectory measured in the user's intrinsic reference frame. We find that S decreases
monotonically with rg (Figure 9 c), being well approximated by S ~ rg^{-η}, with η ≈ 0.12.
Given the small value of the scaling exponent, other functional forms may offer an equally
good fit, thus mechanistic models are required to identify if this represents a true scaling
law, or only a reasonable approximation to the data.
To compare the trajectories of different users we remove the individual
anisotropies, rescaling each user trajectory with its respective σx and σy. The rescaled
distribution Φ(x/σx, y/σy) (Figure 9 b) is similar for groups of users with considerably
different rg, i.e., after the anisotropy and the rg dependence are removed all individuals
appear to follow the same universal probability distribution Φ(x/σx, y/σy). This is particularly
evident in Figure 9 d, where we show the cross section Φ(x/σx, 0) for the three groups of
users, finding that apart from the noise in the data the curves are indistinguishable.
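The quantities used in this section are simple to compute once positions are expressed in the intrinsic frame. The following sketch is illustrative only: the function names are hypothetical, the exponent is estimated by an ordinary least-squares fit in log-log space (one of several possible fitting choices), and the numbers are synthetic values generated with η = 0.12 rather than measured data.

```python
import numpy as np

def anisotropy_ratio(x, y):
    """S = sigma_y / sigma_x for positions given in the intrinsic frame."""
    return np.std(y) / np.std(x)

def rescale(x, y):
    """Divide each coordinate by its standard deviation."""
    return np.asarray(x) / np.std(x), np.asarray(y) / np.std(y)

def fit_eta(rg, s):
    """Least-squares slope of log S against log rg; returns eta in S ~ rg**(-eta)."""
    slope, _ = np.polyfit(np.log(rg), np.log(s), 1)
    return -slope

# Synthetic check (made-up numbers): S values generated with eta = 0.12.
rg = np.array([3.0, 10.0, 30.0, 100.0, 300.0])
s = 0.8 * rg ** -0.12
print(fit_eta(rg, s))
```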
Figure 9 The shape of human trajectories. a, The probability density function Φ(x, y) of
finding a mobile phone user in a location (x, y) in the user's intrinsic reference frame (see
Section 3.6 for details). The three plots, from left to right, were generated for 10,000 users with: rg
≤ 3, 20 < rg ≤ 30 and rg > 100 km. The trajectories become more anisotropic as rg
increases. b, After scaling each position with σx and σy the resulting Φ(x/σx, y/σy) has
approximately the same shape for each group. c, The change in the shape of Φ(x, y) can be
quantified by calculating the anisotropy ratio S ≡ σy/σx as a function of rg, which decreases as S
~ rg^{-0.12} (solid line). Error bars represent the standard error. d, Φ(x/σx, 0) representing the
x-axis cross section of the rescaled distribution Φ(x/σx, y/σy) shown in b.
Taken together, our results suggest that the Lévy statistics observed in bank note
measurements capture a convolution of the population heterogeneity (Eq. 9) and the motion of
individual users. Individuals display significant regularity, as they return to a few highly
frequented locations, like home or work. This regularity does not apply to the bank notes: a
bill always follows the trajectory of its current owner, i.e. dollar bills diffuse, but humans
do not.
The fact that individual trajectories are characterized by the same rg-independent
two-dimensional probability distribution Φ(x/σx, y/σy) suggests that key statistical
characteristics of individual trajectories are largely indistinguishable after rescaling.
Therefore, our results establish the basic ingredients of realistic agent-based models,
requiring us to place users in numbers proportional to the population density of a given
region and assign each user an rg taken from the observed P(rg) distribution. Using the
predicted anisotropic rescaling, combined with the density function, we can obtain the
likelihood of finding a user in any location. Given the known correlations between spatial
proximity and social links, our results could help quantify the role of space in network
development and evolution [72,73,74,75] and improve our understanding of diffusion
processes [57,76].
CHAPTER 4: THE PRODUCT SPACE CONDITIONS THE DEVELOPMENT OF
NATIONS
4.1 Introduction
Does the type of product a country exports matter for subsequent economic
performance? The fathers of development economics held that it does, suggesting that
industrialization creates 'spill-over' benefits that fuel subsequent growth [77,78,79]. Yet,
lacking formal models, mainstream economic theory has been unable to incorporate these
ideas. Instead, two approaches have been used to explain a country's pattern of
specialization. The first focuses on the relative proportion between productive factors (i.e.
physical capital, labor, land, skills or human capital, infrastructure, and institutions [80]).
Hence, poor countries specialize in goods intensive in unskilled labor and land while richer
countries specialize in goods requiring infrastructure, institutions, human and physical
capital. The second approach emphasizes technological differences [81] and has to be
complemented with a theory of what underlies them. The varieties and quality ladders
models [82,83] assume that there is always a slightly more advanced product or just a
different one that countries can move to, disregarding product similarities when thinking
about structural transformation and growth.
Think of a product as a tree and the set of all products as a forest. A country is
composed of a collection of firms, i.e. of monkeys that live on different trees and exploit
those products. The process of growth implies moving from a poorer part of the forest,
where trees have little fruit, to better parts of the forest. This implies that monkeys would
have to jump distances, i.e. redeploy (human, physical and institutional) capital towards
goods that are different from those currently under production. Traditional growth theory
assumes there is always a tree within reach; hence the structure of this forest is
unimportant. However, if this forest is heterogeneous, with some dense areas and other
more deserted ones, and monkeys can jump limited distances, then countries may be
unable to move through the product space. If this is the case, the structure of this space and
a country's orientation within it become of great importance to the development of
countries.
4.2 Product Proximity
In theory, many possible factors may cause relatedness between products, i.e.
closeness between trees, such as the intensity of labor, land, and capital [84], the level of
technological sophistication [85,86], the inputs or outputs involved in a product's value chain
(e.g. cotton, yarn, cloth, garments) [87] or requisite institutions [88,89]. All of these are a priori
notions of what dimensions of similarity are most important, and assume that factors of
production, technological sophistication or institutional quality exhibit little specificity.
Instead, we take an agnostic approach and use an outcomes-based measure, based on the idea
that if two goods are related, because they require similar institutions, infrastructure, physical
factors, technology, or some combination thereof, they will tend to be produced in tandem,
whereas dissimilar goods are less likely to be produced together. We call this measure
proximity, which formalizes the intuitive idea that the ability of a country to produce a product
depends on its ability to produce other products. For example, a country with the ability to
export apples will probably have most of the conditions suitable to export pears. It would
certainly have the soil, climate, packing technologies, refrigerated trucks and containers. In
addition, it would have skilled agronomists, phytosanitary laws and trade agreements that
could be easily redeployed to the pear business. If instead we consider a different product such
as copper wires or home appliance manufacture, all or most of the capabilities developed for
the apple business are rendered useless. We introduce proximity as the concept that captures
this intuitive notion.
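Formally, we define the proximity φ between products i and j as the minimum of
the pairwise conditional probabilities of a country exporting a good given that it exports
another:
$$\varphi_{i,j} = \min\left\{ P(RCAx_i \mid RCAx_j),\; P(RCAx_j \mid RCAx_i) \right\} , \tag{24}$$
where RCA stands for Revealed Comparative Advantage [90],
$$RCA_{c,i} = \frac{x(c,i) \big/ \sum_{i} x(c,i)}{\sum_{c} x(c,i) \big/ \sum_{c,i} x(c,i)} , \tag{25}$$
which measures whether a country c exports more of good i, as a share of its total
exports, than the 'average' country (RCA ≥ 1) or not (RCA < 1). Here x(c, i) denotes the value
of the exports of good i by country c.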
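As an illustration of these definitions, the sketch below computes RCA values and the proximity matrix from a small export table. It is a minimal example rather than the code used for the analysis: the function names and the toy export values are made up, every country and product is assumed to have non-zero total exports, and a product counts as exported when RCA ≥ 1.

```python
import numpy as np

def rca_matrix(exports):
    """RCA[c, i] from exports[c, i], the export value of product i by country c."""
    x = np.asarray(exports, dtype=float)
    country_share = x / x.sum(axis=1, keepdims=True)   # share of i in c's exports
    world_share = x.sum(axis=0) / x.sum()              # share of i in world trade
    return country_share / world_share

def proximity_matrix(exports, threshold=1.0):
    """phi[i, j] = min(P(i | j), P(j | i)) over countries with RCA >= threshold."""
    m = (rca_matrix(exports) >= threshold).astype(float)
    both = m.T @ m                      # countries exporting both i and j
    ubiquity = m.sum(axis=0)            # countries exporting each product
    # min(both/ubiquity_i, both/ubiquity_j) equals both / max(ubiquity_i, ubiquity_j)
    denom = np.maximum.outer(ubiquity, ubiquity)
    return np.divide(both, denom, out=np.zeros_like(both), where=denom > 0)

# Toy export table: 3 countries (rows) and 3 products (columns), made-up values.
exports = np.array([[10.0, 10.0, 0.0],
                    [10.0, 0.0, 10.0],
                    [0.0, 10.0, 10.0]])
phi = proximity_matrix(exports)
print(phi)
```

In this toy table each pair of products is co-exported by one of the two countries that export each product, so every off-diagonal proximity equals 0.5.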
4.3 The Product Space
International trade data is taken from Feenstra, Lipsey, Deng, Ma, & Mo's "World
Trade Flows: 1962-2000" dataset [91]. This dataset consists of imports and exports both by
country of origin and by destination, with products disaggregated to the SITC revision 2,
four-digit level. The authors built this dataset from the United Nations COMTRADE
database. They cleaned the data by calculating exports using the records of the
importing country, when available, assuming that data on imports is more accurate than
data from exporters. This is likely, as imports are more tightly controlled in order to
enforce safety standards and collect customs fees. In addition, the authors corrected the UN
data for flows to and from the United States, Hong Kong, and China. We focus only on
export data, and do not disaggregate by country of destination. More information on this
dataset can be found in NBER Working Paper #11040, and the dataset itself is available at
www.nber.org/data and http://cid.econ.ucdavis.edu/data/undata/undata.html.
Using this we calculate the 775 by 775 matrix of revealed proximities between
every pair of goods using (24) and (25).
Figure 10 Hierarchically clustered proximity matrix representing the 1998-2000 product space.
Figure 11 Network representation of the 1998-2000 product space. Links are color coded
with their proximity value. The size of the nodes is proportional to world trade and their
colors are chosen according to the classification introduced by Leamer.
Figure 10 shows a hierarchically clustered version of the matrix. A smooth and
homogeneous product space would imply uniform values (homogeneous coloring), while a
product-ladder model [83] would suggest a matrix with high values (or bright coloring) only
along the diagonal. Instead the product space of Figure 10 appears to be modular [92,93],
with some goods highly connected and others disconnected. Furthermore, as a whole the
product space is sparse, with φij distributed according to a broad distribution with 5% of its
elements equal to zero, 32% of them smaller than 0.1 and 65% of the entries taking values
below 0.2. This significant number of negligible connections calls for a network
representation [94,73], allowing us to explore the structure of the product space, together
with the proximity between products of given classifications and participation in world
trade.
4.4 Generating a network representation of the product space
The matrix representing the product space has many small values which represent
weak connections between products. That is why a network representation becomes an
adequate way to lay out the products, giving us a quick visual way to show the relevant
links and to determine where countries are located and where they could be headed.
4.4.1 Maximum Spanning Tree (MST)
To include all products in our network we generated a "skeleton" for it: the
Maximum Spanning Tree (MST). This is the spanning tree whose sum of link weights is
maximal. In other words, it is the set of N-1 links (N being the number of nodes) that
connects all nodes in the network and maximizes the sum of the proximities in it.
We generated the MST by first taking the strongest non-diagonal value of the
proximity matrix and then adding the strongest link connected to that dyad. We then
added the strongest link connecting a new node to our triad and continued adding links
in this way until all the nodes of the network were included (Figure 12).
Figure 12 Earliest version of the MST representing the "skeleton" of the product space.
In our visualization we also wanted to include the strongest links, which are not
necessarily in the MST. We did this by considering the MST plus all the links above a
certain threshold. A suitable visualization was obtained by keeping all links with a
proximity value of 0.55 or larger (Figure 13). This resulted in a network with 775 nodes
and 1525 links. Lower thresholds gave rise to crowded network representations
while higher thresholds resulted in sparse networks. As a rule of thumb, a good network
visualization can be achieved with an average degree equal to 4, that is, when the number
of links is twice the number of nodes, which is approximately the case for the φ=0.55 threshold.
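The construction described in this section (grow the tree from the strongest link, always adding the strongest link that reaches a new node, then add every link above the threshold) can be sketched as follows. The function names and the toy matrix are hypothetical, and the dense search over the whole matrix at every step is chosen for clarity rather than speed.

```python
import numpy as np

def maximum_spanning_tree(phi):
    """Greedy tree growth: return the N-1 links of the maximum spanning tree."""
    w = np.asarray(phi, dtype=float).copy()
    n = w.shape[0]
    np.fill_diagonal(w, -np.inf)                   # ignore self-links
    i, j = np.unravel_index(np.argmax(w), w.shape)  # strongest non-diagonal value
    in_tree = np.zeros(n, dtype=bool)
    in_tree[[i, j]] = True
    edges = [(int(min(i, j)), int(max(i, j)))]
    while not in_tree.all():
        # Strongest link joining a node already in the tree to a new node.
        candidates = np.where(in_tree[:, None] & ~in_tree[None, :], w, -np.inf)
        a, b = np.unravel_index(np.argmax(candidates), candidates.shape)
        in_tree[b] = True
        edges.append((int(min(a, b)), int(max(a, b))))
    return edges

def skeleton_plus_strong_links(phi, threshold=0.55):
    """Links of the MST together with all links of proximity >= threshold."""
    phi = np.asarray(phi, dtype=float)
    edges = set(maximum_spanning_tree(phi))
    rows, cols = np.where(np.triu(phi, k=1) >= threshold)
    edges.update(zip(rows.tolist(), cols.tolist()))
    return sorted(edges)

# Toy symmetric proximity matrix for four products (made-up values).
phi = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.6, 0.3],
                [0.2, 0.6, 1.0, 0.4],
                [0.1, 0.3, 0.4, 1.0]])
links = skeleton_plus_strong_links(phi, threshold=0.55)
print(links)
print("average degree:", 2 * len(links) / phi.shape[0])
```

For the network reported above, the same bookkeeping gives an average degree of 2 × 1525 / 775 ≈ 3.9.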
Figure 13 Representation of the product space based on the MST plus all links with a
proximity above 0.55.
4.4.2 Network Layout
Good network visualization requires an appropriate layout. We lay out the network
using a force spring algorithm. Here nodes are represented as equally charged particles and
links are assumed to be springs. The layout is determined by the relaxed positions.
The force spring layout is not the ultimate solution, but it brings us close to a good
one. After this we retouch the layout manually to avoid overlapping links and untangle
dense clusters.
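The idea behind such a layout can be illustrated with a small relaxation routine. This is a generic sketch, not the specific software used to produce the figures: all nodes repel each other like equal charges, each link pulls its two end nodes together like a spring of zero rest length, and positions are moved along the net force in small capped steps. The function name `spring_layout` and all parameter values are assumptions.

```python
import numpy as np

def spring_layout(n_nodes, edges, iterations=300, step=0.02, max_move=0.05,
                  charge=1.0, stiffness=1.0, seed=0):
    """Relax node positions under pairwise repulsion and spring attraction."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_nodes, 2))
    edges = np.asarray(edges, dtype=int).reshape(-1, 2)
    for _ in range(iterations):
        delta = pos[:, None, :] - pos[None, :, :]          # pairwise displacements
        dist = np.linalg.norm(delta, axis=2) + 1e-9
        # Coulomb-like repulsion between every pair of equally charged nodes.
        force = charge * (delta / dist[:, :, None] ** 3).sum(axis=1)
        # Hooke-like attraction along each link.
        stretch = pos[edges[:, 0]] - pos[edges[:, 1]]
        np.add.at(force, edges[:, 0], -stiffness * stretch)
        np.add.at(force, edges[:, 1], stiffness * stretch)
        # Move along the force, capping the displacement for stability.
        length = np.linalg.norm(force, axis=1, keepdims=True) + 1e-12
        pos += force / length * np.minimum(step * length, max_move)
    return pos

# Toy example: a chain of four nodes.
layout = spring_layout(4, [(0, 1), (1, 2), (2, 3)])
print(layout.round(2))
```

With these force laws two linked nodes settle at the distance where repulsion and attraction balance, which for unit charge and stiffness is 1. Manual retouching, as described above, would still be needed for a dense network.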
Figure 14 Network representation of the product space. Layout uses a force spring
algorithm.
4.4.3 Node Sizes and Colors
An advantage of using a network representation is that we can simultaneously look
at the structure of the space and other covariates. In our case we colored the nodes using
the product classification introduced by Leamer [84], and made the size of the nodes
proportional to the world trade associated with that particular industry. To give a sense of
the proximity of the links involved in our network representation we color coded them,
using dark red and blue for strong links and yellow and light blue for weaker ones.
4.5 The product space and the patterns of comparative advantage
To offer a visualization in which all 775 products are included, we reach all nodes
by calculating the maximum spanning tree, which includes the 774 links maximizing the
tree's added proximity, and superpose on it all links with a proximity larger than 0.55, as
we explained above. This set of 1525 links is used to visualize the structure of the full
proximity matrix, which, far from homogeneous, appears to have a core-periphery structure
(Figure 11). The core is formed by metal products, machinery and chemicals while the
periphery is formed by the rest of the product classes. The products in the top of the
periphery belong to fishing, animal, tropical and cereal agriculture. To the left there is a
strong peripheral cluster formed by garments and another one belonging to textiles,
followed by animal agriculture. The bottom of the network shows a large electronics
cluster followed to the right by mining and forest and paper products.
The network shows clusters of products somewhat related to the classification
introduced by Leamer [84], which is based on relative factor intensities (Table 4, Figure
19), i.e. the relative amount of capital, labor, land or skills required to produce each
product. Although the classification performed by Leamer was done using a different
methodology, the agreement between it and the structure of the product space is striking.
Yet it also introduces a more detailed split of some product classes. For example,
machinery is naturally split into two clusters, one consisting of vehicles and heavy
machinery, and another one belonging to electronics. The machinery cluster is interwoven
with some capital intensive metal products, but is not tightly connected to similarly
classified products such as textiles.
The map obtained can be used to analyze the evolution of a country's productive
structure. For this purpose we hold the product space fixed and study the dynamics of
production within it, although changes in the product space represent an interesting avenue
for future research.
Figure 15 shows the pattern of specialization for four regions in the product space².
Products exported by a region with RCA>1 are shown with black squares. Industrialized
countries occupy the core, composed of machinery, metal products and chemicals. They
also participate in more peripheral products such as textiles, forest products and animal
agriculture. East-Asian countries have developed RCA in the garments, electronics and
textile clusters while Latin America and the Caribbean are further out in the periphery in
mining, agriculture and the garments sector. Finally sub-Saharan Africa exports few
product types, all of which are in the far periphery of the product space, indicating that each
region has a distinguishable pattern of specialization clearly visible in the product space (to
see a discussion of how the structure of the product space is correlated with product income
see appendix II).
Next, we show how the structure of the product space affects a country's pattern of
specialization. Figure 16 A shows how comparative advantage evolved in Malaysia and
Colombia between 1980 and 2000 in the electronics and garments sectors respectively. We
see that both countries follow a diffusion process in which comparative advantage moves
preferentially towards products close to existing goods: garments in Colombia and
electronics in Malaysia.
² The network shown here represents the structure of the product space as determined from the
1998-2000 period. Holding the product space fixed is a good first approximation, as the dynamics of the
network are much slower than those of countries. The Pearson correlation coefficients (PCC) between the
proximities of all links present in this network and the ones obtained from the same network in 1990 and 1985
are 0.69 and 0.66 respectively (see supplementary material). This indicates that although the network
changes over time, after 15 years, the strength of past links still predicts the strength of the current links to a
considerable extent.
Figure 15 Localization of the productive structure for different regions of the world. The
products for which the region has an RCA > 1 are denoted by black squares.
Beyond this graphical illustration, is it true that countries develop comparative
advantage preferentially in nearby goods? We use two different approaches to this
question. First, we measure the average proximity of a new potential product j to a
country's current productive structure, which we call density and define as
$$\omega_j^k = \frac{\sum_i x_i\, \varphi_{ij}}{\sum_i \varphi_{ij}} , \tag{26}$$
where ω_j^k is the density around good j given the export basket of the kth country and
xi = 1 if RCAki > 1 and 0 otherwise. A high density value means that the kth country has
many developed products surrounding the jth product. To study the evolution of
comparative advantage we consider transition products as those with an RCAc,i < 0.5 in
1990 and an RCAc,i > 1 in 1995. As a control, we consider undeveloped products to be those that
in 1990 and 1995 had an RCAc,i < 0.5 and disregard those cases that do not fit either of these
two criteria. Figure 16 B shows how density is distributed around transition products
(yellow) and compares it to densities around undeveloped products (red). Clearly, these
distributions are very distinct, with a higher density around transition products than around
undeveloped ones (ANOVA (analysis of variance) p < 10^{-30}).
At the single product level, we consider the ratio between the average density of all
countries in which the jth product was a transition product and the average density of all
countries in which the jth product was not developed. Formally, we define the discovery
factor Hj as
$$H_j = \frac{\sum_{k=1}^{T} \omega_j^k \big/ T}{\sum_{k=T+1}^{N} \omega_j^k \big/ (N - T)} , \tag{27}$$
where T is the number of countries in which the jth good was a transition product
and N is the total number of countries. Figure 16 C shows the frequency distribution of this
ratio. For 79 percent of products, this ratio is greater than 1, indicating that ω_j^k is greater in
countries that transitioned into the jth good than in those that did not, often substantially.
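Both quantities are straightforward to compute from a proximity matrix and a binary matrix of developed products. The sketch below is illustrative and uses hypothetical names and made-up numbers; it takes density as the proximity-weighted share of a country's developed products around each good, assumes the diagonal of the proximity matrix has been set to zero, and lets the caller say which countries count as transition or undeveloped cases for a given product.

```python
import numpy as np

def density(phi, x):
    """omega[k, j]: proximity-weighted share of country k's products around j.

    phi[i, j] is the proximity matrix and x[k, i] = 1 if country k has
    developed product i (RCA > 1), 0 otherwise.
    """
    phi = np.asarray(phi, dtype=float)
    x = np.asarray(x, dtype=float)
    return (x @ phi) / phi.sum(axis=0)

def discovery_factor(omega_j, is_transition, is_undeveloped):
    """Ratio of the mean density in transition countries to that in
    countries where the product stayed undeveloped."""
    omega_j = np.asarray(omega_j, dtype=float)
    transition = np.asarray(is_transition, dtype=bool)
    undeveloped = np.asarray(is_undeveloped, dtype=bool)
    return omega_j[transition].mean() / omega_j[undeveloped].mean()

# Toy proximity matrix (zero diagonal, assumption) and two country baskets.
phi = np.array([[0.0, 0.5, 0.2],
                [0.5, 0.0, 0.4],
                [0.2, 0.4, 0.0]])
x = np.array([[1, 0, 0],
              [1, 1, 0]])
omega = density(phi, x)
print(omega.round(3))
print(discovery_factor([0.4, 0.6, 0.1, 0.3],
                       [True, True, False, False],
                       [False, False, True, True]))
```

In the toy example the second country has developed both neighbours of the third product, so its density around that product equals 1.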
An alternative way of illustrating that countries develop RCA in goods close to
those they already had, is to calculate the conditional probability of transitioning into a
product given that the nearest product with RCA>1 is at a given φ. Figure 16 D shows a
monotonic relationship between the proximity of the nearest developed good and the
probability of transitioning into it. While the probability of moving into a good at φ=0.1 in
the course of 5 years is almost nil, the probability is about 15 percent if the closest good is
at φ=0.8.³
³ We repeated the same exercise using the rank of proximity instead of proximity itself in order to
assess whether what matters is absolute or relative proximity. We found that absolute proximity appears to be
what matters most: while transition probabilities increase linearly with proximity, they decay
with rank as a power law. Moreover, the rank effect is stronger for products in sparser parts of the product
space, where transitions are also less frequent. Thus, densely connected products can develop RCA through
more paths than sparsely connected ones, indicating the importance of absolute proximity.
Figure 16 Empirical evolution of countries. A. Examples of RCA spreading for Colombia
(COL) and Malaysia (MYS). The color code shows when these countries first developed
RCA>1 for products in the garments sector in Colombia and the electronics cluster for
Malaysia. B. Distribution of density for transition products and undeveloped products C.
Distribution for the relative increase in density for products undergoing a transition with
respect to the same products when they remain undeveloped. D. Probability of developing
RCA given that the closest connected product is at proximity φ. E. Relative size of the
largest connected component NG with respect to the total number of products in the system
N as a function of link proximity φ.
Since production shifts to nearby products, we ask whether the product space is
sufficiently connected that given enough time, all countries can reach most of it,
particularly the richest parts. Lack of connectedness may explain the difficulties faced by
countries trying to converge to the income levels of rich countries: they may not be able to
undergo structural transformation because proximities are just too low. A simple
approach is to calculate the relative size of the largest connected component as a function
of φ. Figure 16 E shows that at φ≥0.6 the largest connected component has a negligible size
compared with the total number of products while for φ≤0.3 the product space is almost
fully connected, meaning that there is always a path between two different products.
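The relative size of the largest connected component at a given proximity threshold can be computed with a simple graph traversal. The sketch below is a minimal illustration with a hypothetical function name and a made-up four-product matrix; links are kept when their proximity is at least the threshold.

```python
import numpy as np

def largest_component_fraction(phi, threshold):
    """Size of the largest connected component, relative to the number of
    products, when only links with proximity >= threshold are kept."""
    adjacency = np.asarray(phi, dtype=float) >= threshold
    np.fill_diagonal(adjacency, False)
    n = adjacency.shape[0]
    seen = np.zeros(n, dtype=bool)
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, size = [start], 0
        while stack:                                   # depth-first traversal
            node = stack.pop()
            size += 1
            for neighbour in np.flatnonzero(adjacency[node] & ~seen):
                seen[neighbour] = True
                stack.append(int(neighbour))
        best = max(best, size)
    return best / n

# Toy symmetric proximity matrix for four products (made-up values).
phi = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.6, 0.3],
                [0.2, 0.6, 1.0, 0.4],
                [0.1, 0.3, 0.4, 1.0]])
for threshold in (0.3, 0.55, 0.95):
    print(threshold, largest_component_fraction(phi, threshold))
```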
We study the impact of the product space structure by simulating how the positions
of countries evolve when they are allowed to repeatedly move to products with proximities greater
than a certain φ0. If countries diffuse to nearby products and these are sufficiently
connected to others, then after several iterations, 20 in our exercise, countries would be
able to reach richer parts of the product space. On the other hand, if the product space is
disconnected, countries will not be able to find paths to the richer part of the product space,
independently of how many steps they are allowed to make.
The results of our simulation for Chile and Korea are presented in Figure 17 A. At a
relatively low proximity (φ0=0.55) both countries are able to diffuse through to the core of
the product space; however, Korea is able to do so much faster thanks to its positioning in
core products. For higher proximities the question becomes whether a country can spread
at all. At φ0=0.6 Chile is able to spread slowly throughout the space while Korea is still
able to populate the core after 4 rounds. At φ0=0.65, Chile is not able to diffuse, lacking
any close enough products, while Korea develops RCA slowly in a few products close to
the machinery and electronics cluster.
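The diffusion exercise can be sketched as an iterative expansion of a country's export basket. The code below is an illustration under stated assumptions rather than the simulation code used here: the function name `diffuse` is hypothetical, a product is acquired when its proximity to any product already in the basket is at least φ0, and the toy matrix and starting basket are made up.

```python
import numpy as np

def diffuse(phi, basket, phi0, rounds=20):
    """Expand a basket by repeatedly adding products within proximity phi0.

    Returns the final basket and, for each product, the round in which it
    was acquired (0 for initial products, -1 if never reached).
    """
    phi = np.asarray(phi, dtype=float)
    have = np.asarray(basket, dtype=bool).copy()
    first_round = np.where(have, 0, -1)
    for r in range(1, rounds + 1):
        # Products close enough to at least one product already in the basket.
        reachable = (phi[have] >= phi0).any(axis=0) & ~have
        if not reachable.any():
            break                      # diffusion has stopped
        first_round[reachable] = r
        have |= reachable
    return have, first_round

# Toy symmetric proximity matrix for four products (made-up values).
phi = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.6, 0.3],
                [0.2, 0.6, 1.0, 0.4],
                [0.1, 0.3, 0.4, 1.0]])
start = [True, False, False, False]
final, acquired_in = diffuse(phi, start, phi0=0.55)
print(final, acquired_in)
```

In the toy example the country reaches the second and third products in rounds 1 and 2 but never reaches the fourth, whose strongest link (0.4) lies below the threshold.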
Figure 17 Simulated diffusion process and inequality. A. Simulated diffusion process for
Chile and Korea in which we allow countries to develop RCA in all products with proximity
φ≥0.55, 0.6 and 0.65. The number of steps required to develop RCA can be read from the
color code on the top right corner of the figure. B. Distribution of the average PRODY of
the best 50 products in a country's basket before and after 20 rounds of diffusion. The
original distribution is shown in green, while the distribution after
20 diffusion rounds with φ=0.65 is presented in yellow and that with φ=0.55 in red. C. Interquartile
range of the distribution of the best 50 products after diffusing with a given φ, normalized
by the interquartile range of the best 50 products in the absence of diffusion.
To generalize this analysis for the whole world, we need a measure to summarize
the position of a country in the product space. We adopt a measure based on Hausmann,
Hwang and Rodrik [95], which involves a two-stage process. First, for every product we
assign a value, called PRODY [95], which is the RCA-weighted GDP per capita of countries with comparative
advantage in that good. We then average the PRODYs of the top N
products that a country has access to after M iterations at φο and denote it by
⟨PRODY⟩_M^N(φο). Figure 17 B shows the distribution of ⟨PRODY⟩_M^N(φο) for N=50, M=20
and φο=1 (green), φο=0.65 (yellow) and φο=0.55 (red). The distribution for φο=1 allows us
to characterize the current distribution of countries in the product space, which shows a
bimodal distribution, signature of a world divided into rich and poor countries with few
countries occupying the center of the distribution. When we allow countries to diffuse up to
φο=0.65, this distribution does not change significantly: it shifts slightly to the right due to
the acquisition of a limited number of sophisticated products by some countries. This
diffusion process, however, stops after a few rounds and the world maintains a degree of
inequality similar to its current state. Contrarily, when we consider φο=0.55, most countries
are able to diffuse and reach the most sophisticated basket in the long run. Only a few
countries are left behind, which unsurprisingly make up the poorest end of the income
distribution (more details on the simulated diffusion process can be found in Appendix III).
To quantify the level of convergence we calculated the interquartile range (IQR)
of the distribution of the average PRODY of each country's top N products after diffusion and normalized this quantity by dividing it by the
IQR for the original distribution. Figure 17 C shows that the convergence of the system
goes through an abrupt transition and that convergence is possible if countries are able to
diffuse to products located at a proximity φ>0.65.
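The two summary quantities used in this section can be sketched as follows. This is an illustrative sketch only: the function names, the toy PRODY values and the percentile convention (NumPy's default linear interpolation) are assumptions, not details taken from the original analysis.

```python
import numpy as np

def top_n_mean_prody(prody, basket, n_top=50):
    """Average PRODY of the n_top most sophisticated products
    in a country's basket (given as a list of product indices)."""
    values = np.sort(np.asarray(prody)[basket])[::-1]  # descending order
    return values[:n_top].mean()

def normalized_iqr(after, before):
    """IQR of the distribution after diffusion divided by the IQR
    of the original distribution (one value per country)."""
    q75_after, q25_after = np.percentile(after, [75, 25])
    q75_before, q25_before = np.percentile(before, [75, 25])
    return (q75_after - q25_after) / (q75_before - q25_before)

# Hypothetical PRODY values for four products and a three-product basket.
prody = np.array([10.0, 30.0, 20.0, 40.0])
best_two = top_n_mean_prody(prody, [0, 1, 2], n_top=2)

# Hypothetical country-level distributions before and after diffusion.
before = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
after = 0.5 * before
ratio = normalized_iqr(after, before)
```

A ratio close to 1 indicates that inequality is essentially unchanged by diffusion, while a ratio close to 0 indicates convergence.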
4.6 Discussion
The bi-modal distribution of international income levels and the lack of convergence
of the poor towards the rich have been explained using geographic [96] and institutional
[88,89] arguments. Here, we introduced a new factor to this discussion: the difficulties
involved in moving through the product space. The detailed structure of the product space
is shown here for the very first time and, together with the location of the countries and the
characteristics of the diffusion process undergone by them, strongly suggests that not all
countries face the same opportunities when it comes to development. Poorer countries tend
to be located in the periphery where moving towards new products is harder to achieve.
More interestingly, among countries with a similar level of development and seemingly
similar levels of production and export sophistication, there is significant variation in the
option set implied by their current productive structure, with some on a path to continued
structural transformation and growth, while others are stuck in a dead end.
These findings have important consequences for economic policy, as the incentives
to promote structural transformation in the presence of proximate opportunities are quite
different from those required when a country hits a dead end. It is quite difficult for
production to shift to products far away in the space, and therefore policies to promote
large jumps are more challenging. Yet, it is precisely these long jumps that generate
subsequent structural transformation, convergence, and growth.
CHAPTER 5: DISCUSSION
5.1 Physics and People
For some people, studies mixing physics and people like the ones presented above may
be hard to classify. On one hand, some people wonder where the physics is, while others would
most certainly claim that this is definitely not sociology.
At the beginning of the last century, physicists were interested in understanding the
statistical properties of microscopic collections of particles. In the study presented in
Chapter 3, more than 100 years later, we simply extend this question to a different, more
complex, type of “particle.” While this could be seen as a trivial question to formulate, our
opportunity to jump into it was made possible only as a spin-off of one of the world's
largest industries. As a research project, it would have been impossible to fund tens of
thousands of antennas and hand devices to millions of people to participate in such an
experiment. Yet, there are many places where research similar to this one is also taking
place. Several research groups have begun collaborations with communication companies
to study closely related problems [97,98,99,100]. These groups, however, are not formed only
by traditional social scientists, but by physicists, mathematicians, computer scientists,
biologists, ecologists, architects and possibly scientists trained in other disciplines as well.
Natural scientists are interested in kinematic questions, even if the observed particles are
people. This example illustrates that scientific disciplines can be defined by the approach
followed by those who practice them, rather than by the object of study. Writing a poem
about the rain does not make you a meteorologist. Similarly, studying the statistical
properties of individuals' kinematics does not make you a sociologist.
The same rationale can be illustrated in our study of the social network's dynamics.
While the study on human mobility patterns dealt with relatively uncharted scientific
territory, the literature on the empirical study of social network dynamics consisted of
several papers [27,28,29,30,38,39,40]. Putting some technical differences aside,4 there are clear
differences between the approaches taken by us and that of more traditional social
sciences. Papers published in sociological journals have results concentrating heavily on
the personal factors affecting social decay, like marriage and divorce [28] or entering
college [30]. Our study, however, concentrated on discerning the correlations between the
structure of the network and its dynamics and is more closely related to the papers
published by Palla, Barabási and Vicsek [8], Onnela et al. [9] and Kossinets and Watts [6],
which also use massive communication records to study structural and dynamical
attributes of social interactions. In our study we showed that the coupling between a
social network's structure and dynamics is strong enough that predictions can be made with
extremely naive approaches, alerting the community that there is a fertile ground for
predictive theories and mechanisms to be used in the study of social networks.
If it is valid to classify scientific disciplines by their approach, rather than by the
objects that they study, the number of interdisciplinary collaborations should increase in
the coming years. As the world self-organizes into a more globalized and interconnected
state, the boundaries between scientific disciplines will blur, shift, reorganize and evolve;
new scientific disciplines will be created and value will gradually emerge from the work of
scientists trading skills and problems across disciplines.
4 Like the fact that we used millions of automatically collected records rather than survey data on
tens or a few hundred individuals and that the data used in sociological studies usually consists of only two
panels.
5.2 The Product Space
Traditionally, development has been measured through a host of aggregated
variables, mainly gross domestic product (GDP) adjusted by purchasing power parity. Yet,
as a concept, development has always been associated with an increase in diversity that
cannot be captured by such averages. As the human body develops, cells differentiate into
neurons, muscles, bones and several other cell types. Similarly, as nations develop,
different industries and products are born. Assessing the health of an economy solely based
on its wealth is as correct as assessing the health of a child solely based on its weight. A
more detailed view of development should ultimately concentrate on understanding how
nations develop different industries and products, rather than trying to predict how they
accumulate different types of capital. But how do we describe such a complex process?
A GDP view of development can be seen as a ramp or ladder. In such a metaphor,
development is measured by looking at the step of the ladder that each nation is on,
regardless of the products and services that allowed it to get there. Development,
however, may not be as one-dimensional as this picture suggests. An alternative metaphor
would represent nations as being spread on a rugged landscape rather than a ladder,
exploring its valleys and crossing mountains and oceans in search of
new products and services – a Sewall Wright type of metaphor, for those familiar with the
great geneticist [101].
Although inspiring, assuming an entire landscape to study development may seem
impractical. We can overcome this by replacing the landscape with a network. This
approach is far from new, as it was used by Euler to abstract and solve the famous
Königsberg bridge problem [102]. In fact, network representations of physical landscapes
are ubiquitous. Trivial examples are a subway map or the highway network. Hence, if
describing economies as a set of nomadic tribes wandering on a product landscape is as
valid an analogy as describing them as a progression over a scalar function, then a network
view of development is at least as valid as a scalar one-dimensional representation.
We can illustrate how a network view of economics might look through an example
inspired by the view of the world presented in Jared Diamond's masterpiece Guns, Germs
and Steel (GG&S) [103]. For those not familiar with the book, it is a fascinating view of our
civilization's origins, documenting how our society arose at the time that hunters and
gatherers discovered plant and animal domestication. The book is full of beautifully
documented facts and anecdotes disclosing the history of many of our civilization's first
economic products, like wheat, barley, pork, flax and corn. Through a careful and well
documented discussion, the book shows how our world was shaped by a few civilizations,
which happened to be in the right place at the right time. These civilizations were able to
develop primitive farming economies enabling them to produce enough surplus to allow
individuals to specialize into soldiers and bureaucrats. Consequently, these tribes
dominated their neighbors, physically and/or culturally, and transformed our world from a
myriad of thousands of independent family groups, into a few large dominant civilizations.
But why did some of these advanced civilizations prevail over the others?
According to Diamond's argument, since climate changes little with longitude but a lot
with latitude, domesticated plants and animals can diffuse more easily if they travel East or
West than if they travel North or South. Since Eurasia is a large expanse spread out on an
East – West axis, innovations in one part could travel easily across the whole continent.
However, Africa and the Americas are spread on a North – South axis and consequently
there are fewer areas with similar latitudes that could share new varieties of plants and
animals. As a consequence, there were more products available to the Eurasians than to the
Amerindians and Africans.
Figure 18 Sketch of the GG&S product space. Links are not scientifically accurate.
We can use a network view of development to describe Jared Diamond's
explanation of such disparity. Figure 18 shows a simplified graphical representation of the
product landscape faced by our ancestors. Civilizations grew by discovering products, i.e.
domesticating plants and animals. These in turn allowed them to create more complex
products, such as garments, tools and weapons. Yet not all civilizations started in equally
dense parts of the product space. Eurasian populations had access to a broader set of
opportunities because of the larger base on which they could experiment and share.
Eurasian civilizations also had more starting points, as the set of agricultural
products available in the Eurasian continent was considerably more diverse than that of the
Americas [103]. Omitting details on the nature of the links connecting different products, it
is accurate to say that Eurasian populations were located in a denser part of the product
space -- where many goods were close to each other -- allowing them to expand quickly
over it. On the other hand, civilizations located in the Americas were located in a much
sparser part of the product space where product diffusion was limited by geographical
constraints. This limited the economic diversification of early American civilizations and
consequently, their ability to jump to products located further in the product space.
Clues about the nature of the links connecting different products can be gathered by
looking at how products are discovered and rediscovered by different populations. Some
jumps, like the domestication of apples, required important technological improvements
– in this case grafting – that, once achieved, opened the door to other fruits like pears and
plums [103]. Hence, even in the most ancient of times, links between some products or
industries were driven by technology. In other cases, some products or industries may be
connected to each other by input/output relationships, like flax and linen or olives and oil.
Yet a third way in which products may be connected is similarity in required infrastructure,
like the silos used to store wheat and barley. A network view of development does not
require a unique definition of a link, but rather accepting as a reasonable assumption the
fact that there are links connecting some products more strongly than others; links through
which knowledge, inputs and workers can flow, links that could be traversed by endeavor
or serendipity.
5.2.1 Exploring the network
In a recent paper we showed that it is possible to use export data to study
development as a diffusion process over a network [104]. To do this, we first created a
measure of distance between a pair of products based on the probability that they were
exported by the same countries. This simple method allowed us to construct a network
on which we showed that countries tend to diversify by developing products that are close in
the product space to those they already export [104]. In that publication, we simplified our
discussion by concentrating on the case in which the product space was fixed and countries
spread over it, which we found to be a valid assumption for short enough time scales. We
showed that apparently similar countries face very different opportunities for
diversification because they are at very different distances from other products. We also
showed that, given the structure of the product space today, most poor countries can only
converge to the levels of development of rich countries if they are able to jump distances
that are quite infrequent in the historical record. In other words, the “stairway to heaven”
has some very tall steps that are hard to overcome in one move.
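One way to turn "the probability that two products are exported by the same countries" into a number is sketched below. The symmetrization used here (the minimum of the two conditional probabilities) is an assumption made for this illustration, as are the function name `proximity_matrix` and the toy country-by-product matrix; the passage above does not spell out the formula.

```python
import numpy as np

def proximity_matrix(exports):
    """exports[c, p] = 1 if country c exports product p with RCA, else 0.
    Returns a symmetric product-by-product proximity matrix."""
    m = np.asarray(exports, dtype=float)
    together = m.T @ m              # countries exporting both products
    ubiquity = np.diag(together)    # countries exporting each product
    with np.errstate(divide="ignore", invalid="ignore"):
        # conditional[i, j] = P(country exports i | it exports j)
        conditional = np.where(ubiquity[None, :] > 0,
                               together / ubiquity[None, :], 0.0)
    # Assumed symmetrization: keep the smaller conditional probability.
    return np.minimum(conditional, conditional.T)

# Hypothetical data: three countries (rows) and three products (columns).
exports = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])
phi = proximity_matrix(exports)
```

Taking the minimum keeps a rare product from appearing close to a ubiquitous one merely because every exporter of the rare product also exports the common one.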
There are many ways in which this analysis can be extended. It may be interesting
to study the product space from a labor perspective. One could relate products based on the
similarity of the labor skills required to make them. This would allow companies to
exchange skilled workers. A new product can more easily be developed if it uses labor
skills that are similar to products already in production, as new firms can poach trained
workers from older firms. One could also study the patterns of mobility of labor between
industries as workers try to adjust to changes in the demand for their skills.
The product space evolves over time, as new products and new ways of making old
products are introduced. Cell phones went from not existing, to being made in rich
countries, to being assembled in poor countries. Cell phone service is now ubiquitous in the
world. The internet allows for an exchange of information that was hitherto unimaginable.
Does this facilitate or make it harder for countries to transform themselves?
We can also study the robustness of an economy based on its position in the product
space and its ability to move in it [105].
These are just some examples of the questions that could be studied from a
network perspective. This view opens new avenues to diagnose a country's problems and chart a
policy strategy. To properly do this, we will need to redeploy network techniques and
concepts developed in other branches of science and adapt them to economics.
Additionally, we will need to develop new techniques tailored especially for economic
questions and develop a common language that can be used to bridge new ideas and more
traditional approaches. As large data sets become more ubiquitous, the creation of network
maps will also become more common, as they represent a useful way to surf over new
waves of data.
5.2.2 Our own skepticism
Proposing a network description of the economy is bound to create skepticism.
Time will judge its usefulness, as the creation of a sensible and complete description of the
world economy as an evolving network is a task requiring many minds and years. From a
theoretical perspective, suggesting that economics should be described as a spreading
process over an evolving network is as groundbreaking as proposing that economics could
be studied using scalar functions and differential calculus. We often forget that our
“Newtonian” view of economics, pioneered by Walras and Jevons and continued by
Samuelson and many others, requires us to assume that the economy can be best described
by looking for numerical quantities and functional relationships between them. Most of us
forget that assumption because we never made it; we inherited it as college freshmen. Our
approach is not against the use of traditional mathematical methods. On the contrary, it
looks to complement them by incorporating tools that can be used to study development
from a different perspective.
There are no guarantees that this approach will be useful, as there were no
guarantees for the benefits of using calculus and physically inspired equilibrium processes
to describe economics at the beginning of the last century. The proof of the proverbial
pudding will have to be revealed by further research. Yet, markets have taught us the
importance of leaving room for innovation. A network view of development may be just
one such innovation.
5.3 Every tune in the guitar
After all, a scientist's work is a dance with ignorance. We have adapted our minds
to constantly describe, abstract and attempt to explain a few things around us. While
everything is uncountable, we look for configurations in systems designed by ourselves
through ingenuity, serendipity and wisdom. Yet, the goal is not to look for every possible
configuration, but to find the few that appear to matter for everyone around us. In this
process, scientists explore their interests, skills and intuition; not with the goal to play
every tune in the guitar, but to discover those that sing to them. This, keeping always in
mind, that tunes cannot understand the silence. As most futures cannot be predicted, a
scientist needs to be modest against the greater concept of ignorance, as his or her actions will add
only a few notes to the tune of the world, a tune that might have an end unforeseen from a
scientist's intentions.
CHAPTER 6: APPENDIXES
6.1 APPENDIX I: Papers Published During my PhD
6.1.1 Presented in this dissertation
“The Dynamics of a Mobile Phone Network”
CA Hidalgo, C Rodriguez-Sickert
Physica A, 387(12): 3017-3024
Abstract:
The empirical study of network dynamics has been limited by the lack of
longitudinal data. Here we introduce a quantitative indicator of link persistence to explore
the correlations between the structure of a mobile phone network and the persistence of its
links. We show that persistent links tend to be reciprocal and are more common for people
with low degree and high clustering. We study the redundancy of the associations between
persistence, degree, clustering and reciprocity and show that reciprocity is the strongest
predictor of tie persistence. The method presented can be easily adapted to characterize the
dynamics of other networks and can be used to identify the links that are most likely to
survive in the future.
“Understanding Human Mobility Patterns”
MC Gonzalez, CA Hidalgo, A-L Barabasi
Nature (2008) 453: 779-782
Abstract:
Despite their importance for urban planning, traffic forecasting, and the spread of
biological and mobile viruses, our understanding of the basic laws governing human
motion remains limited owing to the lack of tools to monitor the time resolved location of
individuals. Here we study the trajectory of 100,000 anonymized mobile phone users
whose position is tracked for a six month period. We find that in contrast with the random
trajectories predicted by the prevailing Lévy flight and random walk models, human
trajectories show a high degree of temporal and spatial regularity, each individual being
characterized by a time independent characteristic length scale and a significant probability
to return to a few highly frequented locations. After correcting for differences in travel
distances and the inherent anisotropy of each trajectory, the individual travel patterns
collapse into a single spatial probability distribution, indicating that despite the diversity of
their travel history, humans follow simple reproducible patterns. This inherent similarity in
travel patterns could impact all phenomena driven by human mobility, from epidemic
prevention to emergency response, urban planning and agent based modelling.
“The Product Space Conditions the Development of Nations”
CA Hidalgo, B Klinger, A-L Barabasi, R Hausmann
Science (2007) 317: 482-487
Abstract:
Economies grow by upgrading the products they produce and export. The
technology, capital, institutions, and skills needed to make newer products are more easily
adapted from some products than from others. Here, we study this network of relatedness
between products, or “product space,” finding that more-sophisticated products are located
in a densely connected core whereas less sophisticated products occupy a less-connected
periphery. Empirically, countries move through the product space by developing goods
close to those they currently produce. Most countries can reach the core only by traversing
empirically infrequent distances, which may help explain why poor countries have trouble
developing more competitive exports and fail to converge to the income levels of rich
countries.
“A Network View of Development”
CA Hidalgo, R Hausmann
Development Alternatives (2008) In Press
Abstract:
No Abstract
6.1.2 Not presented in this dissertation
"Genome-scale analysis of in vivo spatiotemporal promoter activity in C. elegans"
D Dupuy, N Bertin, CA Hidalgo, K Venkatesan, D Tu, D Lee, J Rosenberg, N Svrzikapa, A
Blanc, A Carnec, A-R Carvunis, R Pulak, J Shingles, J Reece-Hoyes, R Newbury, R Viveiros,
WA Mohler, C Le Peuch, IA Hope, R Johnsen, D Moerman, A-L Barabási, D Baillie & M
Vidal.
Nature Biotechnology (2007) 25: 663 - 668
Abstract:
Differential regulation of gene expression is essential for cell fate specification in
metazoans. Characterizing the transcriptional activity of gene promoters, in time and in
space, is therefore a critical step toward understanding complex biological systems. Here
we present an in vivo spatiotemporal analysis for ~900 predicted C. elegans promoters
(~5% of the predicted protein-coding genes), each driving the expression of green
fluorescent protein (GFP). Using a flow-cytometer adapted for nematode profiling, we
generated 'chronograms', two-dimensional representations of fluorescence intensity along
the body axis and throughout development from early larvae to adults. Automated
comparison and clustering of the obtained in vivo expression patterns show that genes
coexpressed in space and time tend to belong to common functional categories. Moreover,
integration of this data set with C. elegans protein-protein interactome data sets enables
prediction of anatomical and temporal interaction territories between protein partners.
"Transcription Factor Modularity in a Gene-Centered C. elegans Protein-DNA Interaction
Network"
V Vermeirssen, MI Barrasa, CA Hidalgo, JAB Babon, R Sequerra, L Doucette-Stamm,
A-L Barabási, AJM Walhout
Genome Research (2007) 17:1061-1071
Abstract:
Transcription regulatory networks play a pivotal role in the development, function,
and pathology of metazoan organisms. Such networks are comprised of protein-DNA
interactions between transcription factors (TFs) and their target genes. An important
question pertains to how the architecture of such networks relates to network functionality.
Here, we show that a Caenorhabditis elegans core neuronal protein-DNA interaction
network is organized into two TF modules. These modules contain TFs that bind to a
relatively small number of target genes and are more systems specific than the TF hubs that
connect the modules. Each module relates to different functional aspects of the network.
One module contains TFs involved in reproduction and target genes that are expressed in
neurons as well as in other tissues. The second module is enriched for paired homeodomain
TFs and connects to target genes that are often exclusively neuronal. We find that paired
homeodomain TFs are specifically expressed in C. elegans and mouse neurons, indicating
that the neuronal function of paired homeodomains is evolutionarily conserved. Taken
together, we show that a core neuronal C. elegans protein-DNA interaction network
possesses TF modules that relate to different functional aspects of the complete network.
"Conditions for the Emergence of Scaling in the Inter-Event Time of Uncorrelated and
Seasonal Systems"
CA Hidalgo
Physica A. (2006) 369(2): 877-883.
Abstract:
Inter-event times have been studied across various disciplines in search for
correlations. In this paper, we show analytical and numerical evidence that at the
population level a power-law can be obtained by assuming Poissonian agents with
different characteristic times, and at the individual level by assuming Poissonian agents
that change the rates at which they perform an event in a random or deterministic fashion.
The range in which we expect to see this behavior and the possible deviations from it are
studied by considering the shape of the rate distribution.
“The effect of social interactions in the primary consumption life cycle of motion pictures”
CA Hidalgo, A Castro, C Rodriguez-Sickert
New Journal of Physics (2006) 8 52
Abstract:
We develop a 'basic principles' model which accounts for the primary life cycle
consumption of films as a social coordination problem in which information transmission
is governed by word of mouth. We fit the analytical solution of such a model to aggregated
consumption data from the film industry and derive a quantitative estimator of its quality
based on the structure of the life cycle.
6.2 APPENDIX II: Product Space Properties
Using a network representation of the product space we can not only see which
products are close to each other and the groups they form, but also their classifications and
values. However, the network representation is nothing more than a powerful visualization
technique, and we still need to complement it by studying the properties of the space using
the entire proximity matrix.
6.2.1 The Product Space Can Classify Products
The first property we study is the ability of the product space to classify goods into
different classes. We compare our network representation with the clusters introduced by
Leamer, as it is shown in figure 1, by using a different color for each product class. We see
that the product space is not colored at random. Products in the same classes lie close to
each other and tend to form clusters.
Although the classification performed by Leamer was done using a different
methodology, the agreement between it and the structure of the product space is striking.
Beyond the intuitive proof of Figure 7s we can test the strength of these correlations by
taking the average proximity between and within the products belonging to one of the
clusters defined by Leamer (Table 4).
TABLE 4 STRENGTH OF THE LINKS BETWEEN AND WITHIN PRODUCTS AS
CLASSIFIED BY LEAMER.
Table 4 shows that the average proximity of products belonging to the same cluster
is always higher than the proximity for products belonging to different clusters. But not all
clusters have the same size, thus we look at the distribution of proximities for all links
connecting products with the same or different Leamer classifications. Figure 19 shows the
distribution of proximity for links connecting nodes with the same Leamer classification
(blue) and for links connecting nodes annotated differently (red). It is clear from the figure that
nodes with the same classification are connected by links with higher proximity values,
and because of the large number of links present in the system (L>200,000), the difference
between these two distributions is highly significant (log(P-value)<-300, ANOVA).
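The comparison between links within and across classes can be sketched as follows. The function name, the toy matrix and the two-class labels are hypothetical, and the sketch only reproduces the averaging step (the significance test mentioned above is not included).

```python
import numpy as np

def within_between_proximity(proximity, labels):
    """Average proximity of links joining products with the same label
    and of links joining products with different labels."""
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)  # each pair counted once
    same = labels[i] == labels[j]
    values = proximity[i, j]
    return values[same].mean(), values[~same].mean()

# Hypothetical toy proximity matrix and class labels for four products.
toy = np.array([
    [1.0, 0.7, 0.2, 0.0],
    [0.7, 1.0, 0.4, 0.0],
    [0.2, 0.4, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])
labels = ["machinery", "machinery", "cereals", "cereals"]

within, between = within_between_proximity(toy, labels)
```

The arrays `values[same]` and `values[~same]` are also the two samples whose histograms would be compared in a figure such as Figure 19.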
Figure 19 Distribution of proximity for links connecting products with the same Leamer
classification (blue) and with a different one (red).
6.2.2 Correlations between the Position and Value of Goods
All products have a value, which in this work we consider to be the average income
per capita associated with that good, or PRODY. It is natural to ask: Are rich goods located in
particular parts of the product space? By looking at its network representation and setting
the size of the nodes proportional to the PRODY of a product (Figure 20), we see that the
largest nodes are located either in the center or the bottommost portion of the network. At
first glance, we can say that there is a rich region of the product space, composed of
machinery, electronics and chemicals, and a poor, peripheral region, made of some
agricultural and labor intensive goods.
Figure 20 Network representation of the product space in which node sizes are proportional
to PRODY.
We can look beyond the actual value of products and study the value of goods as a
function of the distance between them. Basically we ask: Is this particular product at the
top or at the bottom of the PRODY sophistication scale? To answer this we study the
average PRODY of products at a given distance from a particular node. We define distance as
-log(Proximity). Figure 21 shows six examples of products. Three of them are at the bottom of
the sophistication scale (Footwear, Cotton Undergarments and Coats and Jackets); they
belong to the labor intensive cluster, and thus products far from them are richer or more
attractive. On the other hand, chemicals such as organo sulphur compounds, phenols and
cyclic alcohols appear at the top of the sophistication scale and see all other products as less
sophisticated.
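A minimal sketch of this profile calculation is shown below. Several details are assumptions made for illustration: the natural logarithm is used for the distance, products with zero proximity to the reference product are dropped (their distance would be infinite), and the function name, bin edges and toy values are hypothetical.

```python
import numpy as np

def prody_vs_distance(proximity, prody, product, bins):
    """Average PRODY of the other products, grouped by their
    distance -log(proximity) from the reference product."""
    prody = np.asarray(prody, dtype=float)
    others = np.arange(len(prody)) != product
    phi = proximity[product, others]
    values = prody[others]

    keep = phi > 0                      # -log(0) is undefined
    distance = -np.log(phi[keep])       # assumed natural logarithm
    values = values[keep]

    # Bin k (1-based) collects distances in [bins[k-1], bins[k]).
    which = np.digitize(distance, bins)
    return [values[which == k].mean() if np.any(which == k) else np.nan
            for k in range(1, len(bins))]

# Hypothetical toy proximity matrix and PRODY values for four products.
toy = np.array([
    [1.0, 0.7, 0.2, 0.0],
    [0.7, 1.0, 0.4, 0.0],
    [0.2, 0.4, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
])
prody = [10.0, 20.0, 30.0, 40.0]

profile = prody_vs_distance(toy, prody, product=0, bins=[0.0, 1.0, 2.0])
```

A profile that increases with distance, as in this toy case, corresponds to a product at the bottom of the sophistication scale; a decreasing profile corresponds to a product at the top.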
Figure 21 PRODY as a function of distance for six different products in the space. Plots were
calculated using the full proximity matrix.
We performed the same analysis for each product class and found that there are
products at the top of the scale, at the bottom, and at local maxima (Figure 22). If the
structural transformation only moves countries toward more sophisticated goods, a local
maximum would trap countries. Examples of these are cereals and animal agriculture
products, which are located in the periphery of the product space but have a relatively
large PRODY compared to their neighbors.
Figure 22 Average PRODY as a function of the distance for products with a given Leamer annotation.
Panels: 1. Petroleum; 2. Raw Materials; 3. Forest Products; 4. Tropical Agriculture; 5. Animal
Agriculture; 6. Cereals; 7. Labor Intensive; 8. Capital Intensive; 9. Machinery; 10. Chemicals.
6.2.3 Changes in Time
How fast does the product space change in time? We can take a simple look at
this by calculating the Pearson correlation coefficient (PCC) between the matrices
representing the product space in 1985, 1990 and 1998. Table 5 shows that the structure of
the product space appears to be stable: although links do change in time, after 10 or
13 years strong links remain strong and weak links remain weak. Thus products that are
close tend to remain close and the ones that are far tend to stay far. The correlation was
calculated over each pair of corresponding proximities between different time periods.
Proximity values equal to zero were excluded from the calculation.
TABLE 5 PEARSON'S CORRELATION COEFFICIENT BETWEEN THE PRODUCT
SPACES GENERATED WITH DATA FROM 1985, 1990 AND 1998.
PCC     1985    1990    1998
1985            .702    .696
1990                    .616
1998
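A minimal sketch of the correlation between two product spaces, assuming symmetric proximity matrices stored as lists of lists. The function name pearson_nonzero and the toy matrices are hypothetical. The sketch also assumes that a pair of corresponding proximities is dropped when either of its two values is zero, which is one possible reading of the exclusion rule stated above.

```python
import math


def pearson_nonzero(prox_a, prox_b):
    """Pearson correlation between corresponding entries of two proximity matrices.

    Only entries above the diagonal (i < j) are compared, assuming the matrices
    are symmetric. A pair is skipped when either proximity is zero; this
    exclusion rule is an assumption made for this sketch.
    """
    xs, ys = [], []
    n = len(prox_a)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = prox_a[i][j], prox_b[i][j]
            if a == 0 or b == 0:
                continue
            xs.append(a)
            ys.append(b)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Toy matrices (hypothetical): every link in the later period is half as strong.
space_early = [
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]
space_late = [
    [1.0, 0.1, 0.2],
    [0.1, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
pcc = pearson_nonzero(space_early, space_late)
```

Because the toy proximities are rescaled uniformly, the ranking of strong and weak links is preserved and the coefficient is 1 up to floating-point rounding.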
6.3 APPENDIX III: Simulating Diffusion
6.3.1 One diffusion step
Empirically, we showed using examples and statistics that the products in which
countries develop RCA tend to lie close to other products for which these countries have
already developed RCA.
Using this observation, we try to anticipate how a country will diffuse across the product space.
As an example, Figure 23 highlights with black squares all
products within a given proximity of the ones already developed by Chile and Korea. We refer
to this example as one diffusion step.
In this case we tuned the proximity of the jump and show that for high proximities
the set of options available is small, while for low proximities it is large, although
different for each country.
The available options are strongly conditioned by current exports. Korea is a
country that has developed RCA in several branches of machinery and therefore can
diffuse from the center of the space. At a proximity of 0.5 its options include the entire core
of the network plus the entire electronics and garments clusters, among other things. Chile
diffuses from the periphery and, to achieve a similar set of options, needs to diffuse to
proximities as low as 0.3.
In summary, we find that the set of options available to a country is strongly
conditioned by its position in the product space and its ability to diffuse into products up to
given proximities.
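One diffusion step can be sketched as follows. The function name one_diffusion_step, the integer product indices and the toy four-product matrix are hypothetical and serve only as an illustration. The sketch assumes that a product becomes an option when its proximity to at least one currently exported product is at or above the chosen threshold.

```python
def one_diffusion_step(exports, proximity, threshold):
    """Products reachable in a single step from the current export basket.

    exports:   set of product indices in which the country already has RCA.
    proximity: square matrix (list of lists) of proximities between products.
    A product is reachable if its proximity to at least one exported product
    is greater than or equal to `threshold`.
    """
    reachable = set()
    for candidate in range(len(proximity)):
        if candidate in exports:
            continue  # already exported, so it is not a new option
        if any(proximity[p][candidate] >= threshold for p in exports):
            reachable.add(candidate)
    return reachable


# Toy product space (hypothetical): four products arranged roughly as a chain.
chain = [
    [1.0, 0.6, 0.2, 0.1],
    [0.6, 1.0, 0.6, 0.2],
    [0.2, 0.6, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
]
options_high = one_diffusion_step({0}, chain, threshold=0.5)  # {1}
options_low = one_diffusion_step({0}, chain, threshold=0.2)   # {1, 2}
```

As in the Chile and Korea example, lowering the threshold enlarges the set of options, and the result depends on where in the matrix the starting basket sits.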
Figure 23. One-step diffusion process for Korea and Chile. The black squares denote all
products closer than a given proximity considering their export baskets in the year 2000.
Figure 24 Iterated diffusion process for Chile and Korea
6.3.2 Iterated diffusion
We can refine the diffusion process presented above by choosing a particular
proximity and iterating the one-step diffusion process. This yields the set of products
potentially available to countries after diffusing to close products iteratively. At this point
we ask ourselves: Is there a critical value of proximity at which countries will be able to
diffuse across the product space? To explore this question we simulate a diffusion process
in which a country "jumps" to all goods reachable from its current export basket, such that
the proximity to them is greater than or equal to a given value. Figure 24 illustrates through a
color code the products available to Chile and Korea after diffusing iteratively at different
proximities for 4 time steps. We observe that at relatively low proximities (φ = 0.55) both
countries are able to diffuse; however, Chile does so much more slowly and reaches the core in
the second and third rounds, whereas Korea does so in the first and second. At
larger proximities the diffusion process halts. At φ = 0.65 Chile is unable to diffuse at all,
while Korea diffuses slowly, staying close to the core of the product space.
Figure 25. Distribution for the average PRODY of the top 50 products reached after 20
diffusion steps at three different proximities.
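The iterated diffusion process of Section 6.3.2 can be sketched as a repeated application of the one-step rule. The function name iterated_diffusion and the toy matrix are hypothetical. Recording the round in which each product is first reached is an illustrative choice that mimics a color code by round. It is not necessarily how the figures in this work were produced.

```python
def iterated_diffusion(exports, proximity, threshold, max_steps):
    """Round at which each product is first reached by iterated diffusion.

    Starting from `exports` (round 0), each round adds every product whose
    proximity to at least one already reached product is greater than or
    equal to `threshold`. The loop stops early when a round adds nothing.
    """
    first_reached = {p: 0 for p in exports}
    for step in range(1, max_steps + 1):
        current = set(first_reached)
        new_products = {
            c
            for c in range(len(proximity))
            if c not in current
            and any(proximity[p][c] >= threshold for p in current)
        }
        if not new_products:
            break  # the diffusion process has halted
        for c in new_products:
            first_reached[c] = step
    return first_reached


# Toy product space (hypothetical): four products arranged roughly as a chain.
chain = [
    [1.0, 0.6, 0.2, 0.1],
    [0.6, 1.0, 0.6, 0.2],
    [0.2, 0.6, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
]
spread = iterated_diffusion({0}, chain, threshold=0.5, max_steps=4)
halted = iterated_diffusion({0}, chain, threshold=0.65, max_steps=4)
print(spread)  # {0: 0, 1: 1, 2: 2}
print(halted)  # {0: 0}
```

In the toy case a slightly higher threshold is enough to stop the process at the starting basket, which is the qualitative behavior described for Chile at φ = 0.65.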
6.3.3 Economic Convergence
We characterize the value of a certain configuration by considering the value of its
top products. We can assign value to a good by following the work of Hausmann, Hwang
and Rodrik, in which the value or sophistication of a good is equal to the average GDP per
capita associated with that good. This quantity is called PRODY, and in our particular
example we consider the average PRODY of the top N products of a country's export
basket after M diffusion steps with proximity φ.
Figure 25 shows that the original distribution of this quantity is bimodal, indicating a world in
which countries are divided into those producing sophisticated goods and those producing
unsophisticated ones. If we allow countries to diffuse in this space to acquire only goods that
are really close by (φ = 0.65), this distribution remains practically unchanged, evidencing the
structural constraints imposed by the product space. If instead we allow countries to
diffuse into products at relatively low proximities (φ = 0.55), we find that after a large
number of rounds most countries are able to reach the most attractive parts of the space,
except for a few of them that remain stuck in the lowest bracket of this distribution.
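The value assigned to a configuration can be sketched as the mean PRODY of its top N products. The function name top_n_average_prody and the toy values are hypothetical. The sketch assumes that "top" means the N products with the highest PRODY in the basket, which the text does not state explicitly.

```python
def top_n_average_prody(basket, prody, n):
    """Average PRODY of the n highest-PRODY products in a basket.

    basket: iterable of product indices available to the country.
    prody:  list with the PRODY of each product.
    Ranking products by PRODY is an assumption made for this sketch. If the
    basket holds fewer than n products, all of them are averaged.
    """
    values = sorted((prody[p] for p in basket), reverse=True)[:n]
    if not values:
        raise ValueError("the basket is empty")
    return sum(values) / len(values)


# Toy values (hypothetical): a basket before and after reaching a richer product.
toy_prody = [1000.0, 3000.0, 5000.0, 9000.0]
before = top_n_average_prody({0, 1, 2}, toy_prody, n=2)     # 4000.0
after = top_n_average_prody({0, 1, 2, 3}, toy_prody, n=2)   # 7000.0
```

Evaluating this quantity for every country, on the basket reached after a fixed number of diffusion steps, would give a distribution of the kind shown in Figure 25.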
REFERENCES
1 B.B. Mandelbrot The Fractal Geometry of Nature. New York: W. H. Freeman and
Co., (1982)
2 T. Vicsek Fractal Growth Phenomena World Scientific (1991)
3 A.-L. Barabási, H.E. Stanley, Fractal Concepts in Surface Growth, Cambridge
University Press, Cambridge (1995)
4 R.N. Mantegna, H.E. Stanley, An Introduction to Econophysics: Correlations and
Complexity in Finance, Cambridge University Press, Cambridge UK (1999)
5 J McCauley, Dynamics of Markets, Econophysics and Finance, Cambridge
University Press, Cambridge UK (2004)
6 G. Kossinets, D.J. Watts, Science 311: 88-90 (2006).
7 H. Ebel, L.-I. Mielsch, S. Bornholdt, Phys. Rev. E 66:035103 (2002)
8 G. Palla, A.-L. Barabasi, T. Vicsek, Nature 446 :664-667 (2007)
9 J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész,
A.-L. Barabási, PNAS 104 7332 (2007)
10 C.A. Hidalgo, C Rodriguez-Sickert. Physica A 387:3017-3024 (2008)
11 MC Gonzalez, CA Hidalgo, A.-L. Barabási. Nature 453:779-782 (2008)
12 C Biever. New Scientist 185: 25-26 (2005)
13 M. Nekovee. New J. Phys. 9:189 (2007).
14 V. Colizza, A. Barrat, M. Barthélemy, A. Vespignani, BMC Med. 5: 34 (2007)
15 V. Colizza, A. Barrat, M. Barthélemy, A. Vespignani, PNAS 103: 2015–2020
(2006)
16 E. Beinhocker, The Origin of Wealth, HBS Press, Cambridge MA (2006)
17 L. Walras, Éléments d'économie politique pure, ou théorie de la richesse sociale
(Elements of Pure Economics, or the theory of social wealth, transl. W. Jaffé), (1874)
18 L. Bachelier, Annales Scientifiques de l'École Normale Supérieure 3: 21-86
(1900)
19 B. Mandelbrot Journal of Business. 36 (1963)
20 M.H.R. Stanley, L.A.N. Amaral, S. V. Buldyrev, S. Havlin, H. Leschhorn, P.
Maass, M. A. Salinger, H.E. Stanley, Nature 379:804-806 (1996)
21 V. Plerou, P. Gopikrishnan, L.A.N. Amaral, X. Gabaix, H.E. Stanley Phys. Rev.
E 62:3023-3026 (2000)
22 J.D. Farmer, Ind. & Corp. Change 11:895-953 (2002)
23 J.D. Farmer, L. Gillemot, F. Lillo, S. Mike, A. Sen. Quant. Fin. 4:383-397 (2004)
24 J.D. Farmer, D. E. Smith, M. Shubik. Physics Today 58:37-42 (2005)
25 H. Jeong, Z. Néda, A.-L. Barabási, Europhysics Letters 61: 567-572 (2003)
26 A.-L. Barabási, H. Jeong, R. Ravasz, Z. Néda, T. Vicsek, A. Schubert, Physica A
311: 590-614 (2002)
27 P. Holme, C.R. Edling, F. Liljeros, Social Networks 26:155-174 (2004)
28 B. Wellman, R.Y. Wong, D. Tindall, N. Nazer, Social Networks 19:27-50 (1997)
29 J.L. Martin, K.-T. Yeung, Social Networks 28:331-362 (2006)
30 J.J. Suitor, S. Keeton, Social Networks 19:51-62 (1997)
31 A.-L. Barabasi, R. Albert, Science 286:509-512 (1999)
32 D.J. Watts, S.H. Strogatz, Nature 393:440-442 (1998)
33 M.E.J. Newman Phys. Rev. E. 67:026126 (2003)
34 L. Adamic, N. Glance, Proceedings of the 3rd international workshop on Link
discovery 36-43 (2005)
35 C. Haythornthwaite, Information, Communication, & Society 8:125-147
(2005)
36 N. Eagle, A. Pentland, D. Lazer, Inferring social network structure using mobile
phone data, PNAS (in submission).
37 H. Geser, Toward a Sociological Theory of Mobile Phone, University of Zurich,
Zurich (2004).
38 D.L. Morgan, M.B. Neal, P. Carder, Social Networks 19:9-25 (1996)
39 S.L. Feld, Social Networks 19:91-95 (1997)
40 R.S. Burt, Social Networks 22:1-28 (2000)
41 M.E.J. Newman, Phys. Rev. Lett 89:208701 (2002)
42 M.E.J. Newman, Phys. Rev. E. 67:026126 (2003)
43 J. Cohen, P. Cohen, S.G. West, L.S. Aiken, Applied Multiple
Regression/Correlation Analysis for the Behavioral Sciences (3rd edition) LEA, Mahwah,
New Jersey (2003)
44 M.E.J. Newman, J. Park, Phys. Rev. E 70:066117 (2004)
45 A. Vazquez, R. Dobrin, D. Sergi, J.-P. Eckmann, Z.N. Oltvai, A.-L. Barabasi
PNAS 101:17940-17945 (2004)
46 C.A. Hidalgo, F. Claro, P.A. Marquet, Physica A 35:674 (2005)
47 K. Sznajd-Weron, J. Sznajd, IJMPC 6 (2000).
48 C.A. Hidalgo, A. Castro, C. Rodriguez-Sickert, New Journal of Physics 8:52
(2006)
49 M.S. Granovetter, The American Journal of Sociology 78:1360-80 (1973)
50 M.W. Horner, M.E.S. O'Kelly, Journal of Transportation Geography 9:255-265
(2001)
51 R. Kitamura, C. Chen, R.M. Pendyala, R. Narayaran, Transportation 27:25-51
(2000)
52 V. Colizza, A. Barrat, M. Barthélémy, A.-J. Valleron, A. Vespignani, PLoS
Medicine 4:095-0110 (2007)
53 S. Eubank, H. Guclu, V.S.A. Kumar, M.V. Marathe, A. Srinivasan, Z. Toroczkai,
N. Wang, Nature 429:180 (2004)
54 L. Hufnagel, D. Brockmann, T. Geisel, PNAS 101:15124-15129 (2004)
55 J. Kleinberg, Nature 449:287-288 (2007)
56 D. Brockmann, L. Hufnagel, T. Geisel, Nature 439:462-465 (2006)
57 S. Havlin, D. ben-Avraham, Advances in Physics 51:187-292 (2002).
58 G.M. Viswanathan, V. Afanasyev, S.V. Buldyrev, E.J. Murphy, P.A. Prince,
H.E. Stanley, Nature 381:413-415 (1996)
59 G. Ramos-Fernandez, J.L. Mateos, O. Miramontes, G. Cocho, H. Larralde, B.
Ayala-Orozco, Behavioral Ecology and Sociobiology 55:223-230 (2004)
60 D.W. Sims Nature 451:1098-1102 (2008)
61 J. Klafter, M.F. Shlesinger, G. Zumofen, Physics Today 49:33-39 (1996)
62 R.N. Mantegna, H.E. Stanley, Physical Review Letters 73:2946-2949 (1994)
63 A.M. Edwards, R.A. Phillips, N.W. Watkins, M.P. Freeman, E.J. Murphy, V.
Afanasyev, S.V. Buldyrev, M.G.E da Luz, E.P. Raposo, H.E. Stanley, G.M. Viswanathan,
Nature 449:1044-1049 (2007)
64 T. Sohn, A. Varshavsky, A. LaMarca, M.Y. Chen, T. Choudhury, I. Smith, S.
Consolvo, J. Hightower, W.G. Griswold, E. de Lara, Proc. 8th International Conference
UbiComp , Springer, Berlin, (2006)
65 M.C. González, A.L. Barabási, Nature Physics 3:224-225 (2007)
66 A.-L. Barabási, Nature 435:207-211 (2005)
67 B.D. Hughes, Random Walks and Random Environments, Oxford University
Press, USA, (1995)
68 S. Redner, A Guide to First-Passage Processes. Cambridge University Press, UK
(2001)
69 A. Clauset, C.R. Shalizi, M.E.J. Newman, arXiv:0706.1062 (2007)
70 R. Schlich, K.W. Axhausen, Transportation 30:13-36 (2003)
71 N. Eagle, A. Pentland, Behavioral Ecology and Sociobiology (2007)
72 S.H. Yook, H. Jeong, A.-L. Barabási PNAS 99:13382-13386 (2002)
73 G. Caldarelli, Scale-Free Networks: Complex Webs in Nature and Technology
Oxford University Press, USA (2007)
74 S.N. Dorogovtsev, J.F.F. Mendes, Evolution of Networks: From Biological Nets
to the Internet and WWW. Oxford University Press, USA, (2003)
75 C.M. Song, S. Havlin, H.A. Makse, Nature 433:392-395 (2005)
76 F. Cecconi, M. Marsili, J.R. Banavar, A. Maritan, Physical Review Letters
89:088102 (2002)
77 A. Hirschman, The Strategy of Economic Development Yale University press,
New Haven, CT. (1958)
78 P. Rosenstein-Rodan, Economic Journal 53 (1943)
79 K. Matsuyama, Journal of Economic Theory 58 (1992)
80 E. Heckscher, B. Ohlin, Heckscher-Ohlin Trade Theory, MIT Press, Cambridge
MA, (1991)
81 P. Romer. Journal of Political Economy 94:5 (1986)
82 P. Aghion, P. Howitt. Econometrica 60:2 (1992)
83 G. Grossman, E. Helpman. Review of Economic Studies 58:1 (1991)
84 E.Leamer, Sources of Comparative Advantage: Theory and Evidence. MIT
Press, Cambridge MA, (1984)
85 S. Lall, Oxford Development Studies 28:337 (2000).
86 R. Caballero, A. Jaffe. NBER macroeconomics annual, O. Blanchard, S. Fischer,
Eds. 15 (1993)
87 E. Dietzenbacher, M. Lahr, Input-output analysis: frontiers and extensions.
Palgrave, New York, NY, (2001)
88 D. Rodrik, A. Subramanian, F. Trebbi, NBER Working Paper 9305, Cambridge
MA (2002)
89 D. Acemoglu, S. Johnson, J.A. Robinson. American Economic Review,
91:1369-1401 (2001)
90 B. Balassa, The Review of Economics and Statistics 68:315 (1986)
91 R.R. Feenstra, H. Lipsey, A. Deng, A. Ma, H. Mo. NBER working paper 11040.
Cambridge, MA (2005)
92 E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai, A.-L. Barabási, Science
297:1551 (2002)
93 G. Palla, I. Derenyi, I. Farkas, T. Vicsek, Nature 435:814 (2005)
94 R. Albert, A.-L. Barabási Review of Modern Physics 74:47-97 (2002)
95 R. Hausmann, J. Hwang, D. Rodrik, NBER Working Paper 11905, Cambridge
MA (2006)
96 J. Gallup, J. Sachs, A. Mellinger, International Regional Science Review
22:179-232 (1999)
97 Real Time Rome, MIT, http://senseable.mit.edu/realtimerome/
98 Human Dynamics Lab, MIT, http://hd.media.mit.edu/
99 Smart Cities, MIT, http://cities.media.mit.edu/
100 Intellione and Roger Wireless have signed an agreement to use cell phones to
create traffic maps http://www.intellione.com/Newsroom/Press/intellione-presa.html
101 S. Wright, Proceedings of the Sixth International Congress on Genetics, (1932)
102 In fact he showed that the problem had no solution.
103 J. Diamond. Guns, Germs, and Steel: The Fates of Human Societies. W.W.
Norton & Company (1997)
104 C.A. Hidalgo, B. Klinger, A.-L. Barabasi, R. Hausmann. Science 317:482-487 (2007)
105 Hausmann, Rodriguez and Wagner (2008) show that the position of a country in
the product space strongly affects the speed at which it recovers from economic crises.