
THREE EMPIRICAL STUDIES ON THE AGGREGATE DYNAMICS OF HUMANLY

DRIVEN COMPLEX SYSTEMS

A Dissertation

Submitted to the Graduate School

of the University of Notre Dame

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

by

César A. Hidalgo

Albert-László Barabási, Director

Graduate Program in Physics

Notre Dame, Indiana

July 2008
THREE EMPIRICAL STUDIES ON THE AGGREGATE DYNAMICS OF HUMANLY

DRIVEN COMPLEX SYSTEMS

Abstract

by

César A. Hidalgo

Complex systems are characterized by having emergent properties that cannot be

explained from their large number of interacting and heterogeneous components. Different

aspects of human society can be described as a complex system, as large numbers of people

aggregate into a host of complex structures.

Here we empirically study three different aspects of humanly driven complex

systems. First, we study the dynamics of a mobile phone network reconstructed from

millions of individual phone calls. By looking at time resolved data we show that the

structure of the mobile phone network is coupled to the dynamics of mobile phone links.

Second, we study the statistical properties of human mobility patterns and show that the

characteristic distance travelled by individuals follows a heterogeneous distribution which

explains the previously observed Lévy-flight properties of human mobility. Third, we

construct a network summarizing world trade to study the dynamics of countries'

productive structures and show that the structure of the product space conditions the

industrial development of nations.



These three studies illustrate how large data sets can be used to empirically study

humanly driven complex systems. Individually, they present relevant information that can

be used to benchmark future models for each one of these complex systems or can be used

as empirical rules characterizing them.


To my family, who supported me closely while more than 5,000 miles away.

CONTENTS

FIGURES .............................................................................................................................v

TABLES ........................................................................................................................... vii

ACKNOWLEDGEMENTS ............................................................................................. viii

FOREWORD ..................................................................................................................... xi

CHAPTER 1: INTRODUCTION ........................................................................................1

1.1 The Statistical Physics of Society ......................................................................2

1.2 Physics and Economy ........................................................................................4

CHAPTER 2: THE DYNAMICS OF A MOBILE PHONE NETWORK ..........................7

2.1 Introduction ........................................................................................................7

2.2 Data ....................................................................................................................9

2.3 The Persistence of Ties ....................................................................................11

2.4 Global Analysis of the Persistence of Ties ......................................................13

2.5 Network Structure and the Persistence of Ties ................................................15

2.6 Multivariate Analysis .......................................................................................18

2.7 Using Topology to Infer Future Ties ...............................................................22

CHAPTER 3: UNDERSTANDING HUMAN MOBILITY PATTERNS ........................26

3.1 Introduction ......................................................................................................26

3.2 Source Data ......................................................................................................29

3.3 The Heterogeneity of Human-Mobility Patterns .............................................30

3.4 Testing the Power-Law Curve Fits ..................................................................35

3.5 The Periodicity of Human Mobility Patterns ...................................................37

3.6 The Shape of Human Mobility Patterns...........................................................39

3.7 The Anisotropy of Human Mobility Patterns .................................................42

CHAPTER 4: THE PRODUCT SPACE CONDITIONS THE DEVELOPMENT OF


NATIONS ..............................................................................................................45

4.1 Introduction ......................................................................................................45

4.2 Product Proximity ............................................................................................46

4.3 The Product Space ...........................................................................................48

4.4 Generating a Network Representation of the Product Space ...........................51

4.5 The Products Space and the Patterns of Comparative Advantage ...................55

4.6 Discussion ........................................................................................................64

CHAPTER 5: DISCUSSION .............................................................................................65

5.1 Physics and People ...........................................................................................65

5.2 The Product Space ...........................................................................................67

5.3 Every Tune in the Guitar..................................................................................73

CHAPTER 6: APPENDIXES ............................................................................................75

6.1 Appendix I: Papers Published During My PhD ...............................................75

6.2 Appendix II: Product Space Properties ............................................................81

6.3 Appendix III: Simulating Diffusion .................................................................89

REFERENCES ..................................................................................................................94

FIGURES

Figure 1 Definition of Persistence .....................................................................................12

Figure 2 Persistence across a cellular phone network .......................................................14

Figure 3 Network structure and the persistence of ties. .....................................................17

Figure 4 Predicting future ties............................................................................................23

Figure 5 Interevent time distribution P(ΔT) of calling activity. ........................................30

Figure 6 Basic human mobility patterns ............................................................................33

Figure 7 Kolmogorov-Smirnov goodness of fit test. ...........................................................36

Figure 8 The bounded nature of human trajectories. .........................................................38

Figure 9 The shape of human trajectories.. ........................................................................43

Figure 10 Hierarchically clustered proximity matrix representing the 1998-2000 product


space. ......................................................................................................................49

Figure 11 Network representation of the 1998-2000 product space. .................................50

Figure 12 Earliest version of the MST representing the "skeleton" of the product space.
................................................................................................................................52

Figure 13 Representation of the product space based on the MST plus all links with a
proximity above 0.55 .............................................................................................53

Figure 14 Network representation of the product space. Layout uses a force spring
algorithm. ...............................................................................................................54

Figure 15 Localization of the productive structure for different regions of the world ......57

Figure 16 Empirical evolution of countries. ......................................................................60

Figure 17 Simulated diffusion process and inequality. ......................................................62

Figure 18 Sketch of the GG&S product space. Links are not scientifically accurate
................................................................................................................................69

Figure 19 Distribution of proximity for links connecting products with the same Leamer
classification (blue) and with a different one (red) ................................................83

Figure 20 Network representation of the product space in which node sizes are proportional
to PRODY ..............................................................................................................84

Figure 21 Prody as a function of distance for six different products in the space. ............85

Figure 22 Average PRODY as a function of the distance for products with a given Leamer
Annotation..............................................................................................................87

Figure 23 One step diffusion process for Korea and Chile ................................................90

Figure 24 Iterated diffusion process for Chile and Korea..................................................91

Figure 25 Distribution for the average PRODY of the top 50 products reached after 20
diffusion steps at three different proximities .........................................................92

TABLES

TABLE 1 DATA PANELS AVAILABLE FOR THIS STUDY. .....................................10

TABLE 2 PERSISTENCE OF TIES AND LINK ATTRIBUTES. ..................................21

TABLE 3 CORRELATIONS AND REGRESSIONS BETWEEN NODE ATTRIBUTES


AND PERSEVERANCE .......................................................................................22

TABLE 4 STRENGTH OF THE LINKS BETWEEN AND WITHIN PRODUCTS AS


CLASSIFIED BY LEAMER. ................................................................................82

TABLE 5 PEARSON'S CORRELATION COEFFICIENT BETWEEN THE PRODUCT


SPACES GENERATED WITH DATA FROM 1985, 1990 AND 1998 ..............88

ACKNOWLEDGMENTS

I would like to thank Albert-Laszlo Barabasi, for accepting me in his group, having

confidence in me from my early years in Graduate School and for the numerous

discussions and advice we exchanged during these four years. I am positive that I would

not have been able to have such a wonderful Graduate School experience if it were not for

Laszlo's support and personality. I would also like to thank Ricardo Hausmann for the

innumerable conversations, advice and time spent discussing several different scientific

and non-scientific issues with me. He has greatly inspired the evolutionary view of systems

I present in this dissertation and has encouraged me to be innovative and honest in

scientific terms.

I am also enormously grateful to Christine Teutsch for her support and love during

the more than two years we have been together and to Alejandra Castro for her support in

my early years in graduate school and all along my undergraduate education.

I would also like to thank all my collaborators during these four years: Marta

Gonzalez, Bailey Klinger, Carlos Rodriguez-Sickert, Luigi Cuccia, Denis Dupuy, Nicolas

Bertin, Nicholas Blumm, Pu Wang, Kavitha Venkatesan, Vanessa Vermeirseen, Marc

Vidal and Nicholas Christakis for being wonderful collaborators.

I am also grateful for many interactions with KI Goh, Andrea Asztalos, Alexei

Vazquez, Muhammed Yildrim, Anne-Ruxandra Carvunis, Deok-Sun Lee, Juyong Park,

Julian Candia, Zehui Qu, Natali Gulbahce, Maximilian Schich, Marcio Argollo de

Menezes, Zoltan Toroczkai, Sameet Sreenivasan and Chaoming Song.

Special thanks go to the CCNR staff, especially to Suzanne Aleva, who has given

countless hours to the advancement, promotion and management of the complex

networks field, as well as to Nicole Halley, Nicole Leete and Agnes Petrozcky.

I would also like to thank the Notre Dame Physics department, especially Professor

Kathie Newman, who has been there to help me all along graduate school.

Additionally, I would like to thank Professor Francisco Claro, for his support and

several conversations during my graduate studies and during previous years. I would also

like to thank Professor Pablo Marquet for inviting me to the Santa Fe Institute, among

other things, and Carlos Rodriguez-Sickert, who has been a great collaborator and friend for

several years.

Finally, I would like to thank the Kellogg Institute at Notre Dame for their financial

support during my graduate studies.

The Dynamics of a Mobile Phone Network

C.A. Hidalgo was partly supported by the Kellogg Institute at Notre Dame and

acknowledges support from NSF grant ITR DMR-0426737, IIS-0513650 and the James S.

McDonnell Foundation 220020084. C. Rodriguez-Sickert acknowledges Sam Bowles and

the S.F.I. We thank Nicole Leete for proofreading our manuscript. Special

acknowledgments to A.-L. Barabasi for providing the source data and discussing the

manuscript.

Human Mobility Patterns

We thank D. Brockmann, T. Geisel, J. Park, S. Redner, Z. Toroczkai and P. Wang

for discussions and comments on the manuscript. This work was supported by the James S.

McDonnell Foundation 21st Century Initiative in Studying Complex Systems, the National

Science Foundation within the DDDAS (CNS-0540348), ITR (DMR-0426737) and

IIS-0513650 programs, and the U.S. Office of Naval Research Award N00014-07-C. Data

analysis was performed on the Notre Dame Biocomplexity Cluster supported in part by NSF

MRI Grant No. DBI-0420980. C.A. Hidalgo acknowledges support from the Kellogg

Institute at Notre Dame.

The Products Space Conditions the Development of Nations

We would like to thank the following for valuable comments: Philippe Aghion,

Laura Alfaro, Olivier Blanchard, Ricardo Caballero, Oded Galor, Elhanan Helpman, Asim

Khwaja, Jim Lahey, Robert Lawrence, Daniel Lederman, Lant Pritchett, Roberto Rigobon,

Dani Rodrik, Andres Rodriguez-Clare, Charles Sabel, Ernesto Stein, Federico

Sturzenegger, and David Weil. C.A.H. acknowledges support from the Kellogg Institute at

Notre Dame. C.A.H. and A.-L.B. acknowledge support from NSF grant ITR

DMR-0426737, IIS-0513650 and the James S. McDonnell Foundation 220020084.

A Network View of Development

We would like to thank Melissa Wojciechowski for help editing this manuscript.

FOREWORD

In this dissertation I present some of the research conducted as a Physics graduate

student at the Center for Complex Network Research at the University of Notre Dame.

This Dissertation has been formatted to satisfy the requirements of the Physics department.

I refer anyone interested in the historical context in which this work was conducted, as

well as those interested in my view of the epistemological background underlying this

research, to the original copy submitted for review, available on my personal webpage (just

google it).

Best Wishes

Cesar A. Hidalgo

CHAPTER 1:

INTRODUCTION

In this document I present my contribution to a few problems in the field of

complexity or complex systems. While many physicists have contributed to this field for

several decades, they are not the only ones to study complexity. The field of complexity

has a strong interdisciplinary nature and a great number of contributions have emerged

from the interactions between scientists trained in different fields. Throughout this

document, we will present some examples mixing Physics, Biology, Economics, Computer

Science, Psychology and Social Sciences to answer questions that lie between traditional

disciplines and at the edge of emerging scientific paradigms.

Complexity has helped unite different branches of science by implicitly

demonstrating that some problems should be worked from many different angles. There

are no constraints forbidding the use of theories and methods inspired by a particular field

in a different scientific discipline. This creates the need for a field connecting seemingly

distant branches of science. The science of complexity has arisen in part to fulfill that

particular need. Moreover, complexity science has created scientific value which is

different from that of the particular fields where its adherents were originally trained. This

dual purpose makes the field of complexity attractive from an applied as well as a

fundamental perspective.

1.1 The Statistical Physics of Society

One scientific combination that has gained recent popularity is that of physicists

studying people. The recent surge of physicists into the realms of social science has been

fueled largely by the availability of data collected by network administrators and corporate

database managers in recent years. Physicists, mainly from statistical mechanics, have

rushed into the strangest datasets looking to describe systems and answer questions that

could only be speculated about ten years ago. From a historical perspective this is hardly

unexpected, as statistical physicists have been flirting with less conventional topics for

several decades, such as fractals [1,2,3], stock-exchange time series [4,5] and more recently

massive communications records [6,7,8,9].

In this dissertation I present two of my first contributions to the study of social

phenomena. Both of these contributions have been made possible by the availability of

millions of mobile phone records. In Chapter 2 we will study the dynamics of a mobile

phone network [10], whereas in Chapter 3 we show how people can be characterized by

their mobility patterns [11].

1.1.1 Why study this?

Before presenting these two topics, we will briefly discuss two of the main

applications of studies characterizing individuals, either by their social network structure,

dynamics or mobility patterns. While in their purest forms, such studies might appear as

simple curiosities, marketing executives are becoming increasingly interested in new

ways to classify people. The motivation here is obvious, as people in the marketing field

have constantly searched for new ways to segment populations and identify the individuals

that are most likely to purchase specific products. Heretofore, marketing

segmentation has been based on demographic and socio-economic attributes, such as a

person's gender, age, ZIP code and income information. While all these variables are

important for marketing segmentation strategies, they cannot be expected to exhaustively

represent a person, as two individuals of the same age and gender, living in the same ZIP

code and having comparable incomes can behave very differently. This is

where these new layers of data come into play, as the structure of an individual's social

network, its dynamics and mobility patterns could be better proxies for a person's behavior

than demographic and socioeconomic variables and could therefore be used to explore new

marketing segmentation strategies.

While the previous paragraph presents a very well defined and concrete industrial

application of studies in social structure and dynamics, there are several other applications

that open up thanks to this type of study. Epidemiology is probably the most important of

these given the real threat of infectious diseases and software viruses [12,13]. This fuels the

need for studies providing data that can be used to understand the spread of biological and

digital pathogens from an empirical or theoretical perspective. Ultimately, the goal of these

studies is to create a future in which epidemiological forecasts [14,15] could be a reality.

Beyond the potential applications of quantitative studies in social structure and

dynamics, there is also a fundamental angle from which to address these questions. Social

systems are extremely complex and exhibit behaviors that are interesting to study solely for

scientific curiosity. Together, applied and fundamental studies can help advance our

understanding of the universe at this high level of complexity; studies conducted

from applied and fundamental perspectives are therefore complements rather than substitutes. Here

we present some research conducted from an empirical perspective with some fundamental

and applied flavors.

1.1.2 A little bit of novelty

Large-scale data on a person's social network and mobility have only become

available during the last few years. Hence, exploring ways to statistically describe millions

of individuals based on their social network structure, dynamics or spatial mobility patterns

is by definition a new field. Even more recent are data allowing us to

study the dynamics of the structures defined by individuals' social relationships and

mobility patterns. In the next chapters we present two studies that are among the first in the

literature to explore the dynamical properties of individuals' social networks and spatial

dynamics. The first of these studies concentrates on the stability of social ties and its

connection to network structure. This study was published in Physica A in May 2008 [10].

The second study will discuss how to characterize the spatial patterns defined by the

movement of individuals. This study was published in Nature in June 2008 [11].

1.2 Physics and Economy

In Chapter 4 we will present a study combining the physics of networks with

developmental economy and industrial policy. While this approach can be considered

innovative, it is not a completely novel mixture, as physics and economy have been in dialogue

for more than a century. This dialogue long precedes complexity science and goes back at least to

the times of Leon Walras and William Stanley Jevons, two XIX century scientists credited

with establishing the mathematical foundations of classical economics. Leon Walras, a late

bloomer in scientific terms [16], has been credited with introducing the notion of equilibrium

in classical economic theory [16] in his 1872 book Elements of Pure Economics [17]. It is

believed that Walras adapted the notion of equilibrium from Louis Poinsot's Elements de

Statique. On the same front, William Stanley Jevons defined the problem of economic

choice as an exercise in constrained optimization where consumers calculate which

amount of goods will make them “happier.” Like Walras, Jevons was also inspired by

theoretical physics. This is evident in his Theory of Political Economy, as Jevons used

equations from field theory in an attempt to describe human behavior in a form as

predictable as gravity [16].

Despite this XIX century sisterhood, physics and economy have evolved mostly

separately ever since. During the last century, economy has branched out into several

different fields, many of which have adopted a view of the world that resembles

mathematics rather than physics. The field of finance, however, has been an exception, as its

use of random walks has kept it somewhat closer to physics. Random walks were first

proposed as a model for financial markets by Louis Bachelier in his doctoral thesis in the

year 1900¹ [18] under the supervision of Henri Poincare. Bachelier's work was ignored for

many years and resurfaced through Benoit Mandelbrot during the 1950s, when he

re-discovered the power-law nature of stock returns [19]. This portion of Mandelbrot's

work was also ignored during the first years after its publication, as the power-law behavior

introduced difficulties in the construction of analytically solvable financial models.

Yet describing a system with random walks and power laws is not the way to keep

physicists out of the loop. During the last decades a large number of physicists entered

financial research. This is a process driven by the similarities of the two disciplines that

was catalyzed greatly during the 1990's as the end of the cold war created an excess supply

of physicists that was absorbed in part by the financial sector.

¹ Note this is earlier than Einstein's 1905 paper on Random Walks.

Financial data has also been studied by physicists in academic settings. A superb

example of this is the work that Rosario Mantegna and Gene Stanley started at Boston

University in the mid 1990s. Several of their findings are summarized in their book

Econophysics [4], including the existence of scaling laws for the growth of firms [20,4], the

interpretation of financial markets as an anomalous diffusion process [4,21] and some of the

best evidence for the power-law nature of stock returns [4].

Another successful physicist in the field of finance is Doyne Farmer of the Santa Fe

Institute. His approach has been somewhat different from that of Mantegna and Stanley, as

he has concentrated on explaining different observations in the stock market using agent

based simulations [22] or simple logic [23]. Still, an important part of his work resembles

that of the Boston University physicists, as it is grounded in empirical observations of

scaling and universal relationships [24].

The fact that financial markets have been studied using random walk models and

time series analysis can partly explain the voyage of physicists into such apparently distant

academic waters. Developmental economy, however, has been studied as the accumulation

of factors and through abstract production functions that resemble physics only in the

simplicity of some of their mathematical expressions, which is not fertile enough ground for

collaborations to occur. The introduction of a network view of development not only

opens a new angle from which information can be revealed, but also opens a new window for

collaboration between scientists in the fields of networks and an important branch of

economy.

CHAPTER 2: THE DYNAMICS OF A MOBILE PHONE NETWORK

2.1 Introduction

Physicists are no strangers to the study of social networks. During the last decade

several groups have explored the structure of social networks captured by e-mails [6,7],

cellular phones [8,9] and professional relationships such as being costars in a movie [25] or

collaborators in a paper [26]. Studying the dynamics of such social systems, however, has

been limited by the lack of longitudinal data, and as a result, only a few studies on the

dynamics of interpersonal connections have been published [6,8,27].

In principle there are many factors that could affect the stability of a social link

[28,29,30]. The aim of the subsequent sections is not to review all these factors, but to study

the coupling between the structure of the network as characterized in previous studies

[31,32] and the temporal stability of the links.

Here we use a year's worth of mobile phone data as a proxy for the structure and

dynamics of a social network involving close to two million people. Automatically

collected communication records have been proposed as a source of reliable data about

personal connections [33]. Email data, for example, has been used to study social processes

such as social link, or tie, formation [6] and social structure [7], whereas blog data has been

used to study the spread of political opinions [34]. Communication records overcome

problems of survey data such as subjective biases of the respondents and the intrinsic

limitations of ego-centered networks, such as their unreliability in measuring a social network's

structure.

It is not our intention to claim that cellular phone communications fully capture

social exchange. A social network is expressed through a host of interactions ranging from

e-mails to face-to-face contacts. People in close social contact tend to express their ties to

others through multiple interaction channels [35], such as email, cell phone

communications, instant messaging and face-to-face interaction. There are arguments,

however, favoring the use of cellular phone calls as a relevant proxy for large-scale social

networks. Specifically, it has been shown that objective measures such as the one we use in our

study can accurately predict self-reported friendships [36]. Moreover, from a scientific

perspective, interest in mobile-phone studies has been expressed through the emergence of

a literature on mobile-phone networks in which people have studied the strength of social

ties in cross sections of the network [9] and the dynamics of social groups [6,8].

There are also some technical aspects that favor the use of mobile phone records

as a proxy for social interactions. Mobile phone numbers are unlisted, thus knowing them

reveals some sort of social connection between caller and callee. Also, cellular phones

were the most widespread information technology at the time this data was collected, with

a penetration larger than 40% worldwide and close to 100% in developed countries, such as

the one considered in this study. During the same time period, internet penetration was just

over 13% worldwide and 51% for developed countries (MDGS indicators U.N.

http://mdgs.un.org/unsd/mdg), making cellular phones the most complete method to study

social interactions on the population scale. In addition, mobile-phone usage has been

particularly democratic to the extent that it has homogeneously penetrated different social

strata [37].

2.2 Data

Our data consists of 7,948,890 voice calls between 1,950,426 users of a service

provider holding approximately 25% of an industrialized country's market. The data

consist of ten panels, or data cross-sections, collected between April 15, 2004 and March

31, 2005. Each panel summarizes 15 days of mobile phone calls between the members

serviced by the provider who facilitated the data. Not every panel is available (see Table 1),

as this was the way in which data was made available to us. We consider only agents that

made or received at least one call in each panel to avoid dealing with dropouts or new

subscribers. We hereafter assume that at high service penetration levels (~100%) people

serviced by a particular provider are equivalent to a random sample. In our network, nodes

are mobile phone numbers, which we interpret as people, and links are the calls connecting

them.
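
The filtering step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used in the study: the record format (a pair of caller and callee identifiers), the function names and the toy identifiers are all hypothetical, and links are treated as undirected.

```python
from typing import FrozenSet, List, Set, Tuple

Call = Tuple[str, str]  # hypothetical record format: (caller_id, callee_id)


def users_in_every_panel(panels: List[List[Call]]) -> Set[str]:
    """Return users who made or received at least one call in each panel."""
    if not panels:
        return set()
    per_panel = [{user for call in panel for user in call} for panel in panels]
    return set.intersection(*per_panel)


def restrict_to_users(
    panels: List[List[Call]], keep: Set[str]
) -> List[Set[FrozenSet[str]]]:
    """Keep only calls between retained users, storing each link as an unordered pair."""
    return [
        {
            frozenset(call)
            for call in panel
            if call[0] in keep and call[1] in keep and call[0] != call[1]
        }
        for panel in panels
    ]


# Toy data with hypothetical identifiers (not taken from the study).
toy_panels = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("c", "d")],
    [("b", "a")],
]
kept_users = users_in_every_panel(toy_panels)
filtered_panels = restrict_to_users(toy_panels, kept_users)
```

In this toy example only users "a" and "b" appear in all three panels, so every other user, together with the calls involving them, is dropped before any link statistics are computed.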

TABLE 1 DATA PANELS AVAILABLE FOR THIS STUDY.

Time Period                            In Our Study?

April 16 to April 30 2004              Yes
May 1 to May 15 2004                   No
May 16 to May 31 2004                  Yes
June 1 to June 15 2004                 No
June 16 to June 30 2004                Yes
July 1 to July 15 2004                 No
July 16 to July 31 2004                Yes
August 1 to August 15 2004             Yes
August 16 to August 31 2004            No
September 1 to September 15 2004       No
September 16 to September 30 2004      Yes
October 1 to October 15 2004           Yes
October 16 to October 31 2004          No
November 1 to November 15 2004         No
November 16 to November 30 2004        No
December 1 to December 15 2004         No
December 16 to December 31 2004        No
January 1 to January 15 2005           Yes
January 16 to January 31 2005          No
February 1 to February 15 2005         No
February 16 to February 28 2005        Yes
March 1 to March 15 2005               No
March 15 to March 31 2005              Yes

2.3 The Persistence of Ties

We measure the stability of social ties across time as the number of panels in which

a link is observed, over the total number of data panels available. We call this measure

persistence, which can be expressed as:

\[ P_{ij} = \frac{\sum_{T} A_{ij}(T)}{M}, \tag{1} \]

where Aij(T) is 1 if nodes i and j communicated in panel T and 0 otherwise, and M is the

total number of panels. Persistence is the probability of observing a tie when observing a

network for a given time. Because of the discrete nature of panel data, our definition of

persistence has a resolution that depends on the panel's duration. For example, if we consider

panels with a duration comparable to the one of links, (~ minutes in the case of phone calls),

our definition of persistence gives us the number of times a tie or connection appeared.

In contrast, when we consider panels lasting considerably longer than the typical duration of a

link, our definition of persistence will capture the stability of a link on a larger, coarse-grained

temporal scale. Our data set consists of 10 panels, each summarizing 15 days of voice call

activity. Thus in this study we measure persistence on a monthly to yearly time scale.

We illustrate our definition of persistence using four different panels of a five node

network (Fig. 1 a). In this example, the link between nodes 2 and 4 is present in all panels

while the one between nodes 1 and 2 is present only in half of them. We say that the

persistence of the link between nodes 2 and 4 is 4/4 while the persistence of the link

connecting nodes 1 and 2 is 2/4. Each panel gives a binary representation of the network,

where a link is either present or not. Our definition of persistence summarizes the dynamics

of all binary panels by assigning a weight to each link. Thus, persistence is a change of

representation that allows us to map many network panels into a single weighted network

(Fig. 1 b).
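The mapping from a set of binary panels to a single weighted network can be sketched in a few lines of Python. The panels below are hypothetical: they reproduce the persistence values quoted above for the links 2-4 and 1-2, not the exact links drawn in Fig. 1. Links are treated as undirected and the function name is ours.

```python
from collections import Counter


def persistence(panels):
    """Map each undirected link to the fraction of panels in which it appears."""
    counts = Counter()
    for links in panels:
        # frozenset makes (i, j) and (j, i) the same link; the set
        # comprehension counts a link at most once per panel.
        counts.update({frozenset(link) for link in links})
    total = len(panels)
    return {link: n / total for link, n in counts.items()}


# Four hypothetical panels of a five-node network.
panels = [
    [(2, 4), (1, 2), (3, 5)],
    [(2, 4), (4, 5)],
    [(2, 4), (1, 2)],
    [(4, 2), (3, 4)],
]
P = persistence(panels)
print(P[frozenset((2, 4))])  # 1.0, present in 4 of 4 panels
print(P[frozenset((1, 2))])  # 0.5, present in 2 of 4 panels
```

The resulting dictionary is the weighted representation: one entry per link ever observed, with its persistence as the weight.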

Our measure of persistence weakly increases with the number of times a link is

observed; hence persistence indicates stability as understood in previous studies [38,39].

However, given that we measure whether the link is observed in M>2 panels, it will not

describe a link dichotomously as stable or unstable, but will give the degree of stability 1/M

≤ P ≤ 1, rewarding those links expressed consistently in many panels.

Figure 1 Definition of Persistence. a Four panels of a five node network in which not all
links are equally persistent. b Persistence representation of the four panels presented in a.

Persistence is a tie attribute that can be defined for a particular node as the average

persistence of all its ties. We denote this as perseverance and define it as

\[ p_i = \frac{1}{k_i} \sum_{j} P_{ij}, \qquad (2) \]

where ki is the degree, or number of connections of the ith node. We will use this

quantity to study the characteristics of nodes carrying persistent ties.
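As an illustration, the node-level average can be computed directly from a dictionary of link persistences. This is a sketch under the reading that perseverance is the mean persistence over all ties attached to a node, with the degree counted on the aggregated network. The input values and names are hypothetical.

```python
from collections import defaultdict


def perseverance(link_persistence):
    """Average the persistence of every link attached to each node.

    `link_persistence` maps a pair (i, j) to the persistence of that link.
    The degree k_i is the number of distinct links touching node i.
    """
    totals = defaultdict(float)
    degree = defaultdict(int)
    for (i, j), p in link_persistence.items():
        for node in (i, j):
            totals[node] += p
            degree[node] += 1
    return {node: totals[node] / degree[node] for node in totals}


# Hypothetical persistence values for a small weighted network.
weights = {(1, 2): 0.5, (2, 4): 1.0, (3, 5): 0.25, (4, 5): 0.25, (3, 4): 0.25}
p = perseverance(weights)
```

Here node 2 carries ties of persistence 0.5 and 1.0, so its perseverance is 0.75, while node 4 averages one persistent and two transient ties.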

Our definition of persistence has limitations. One could claim we are unfairly

punishing newly formed links. An alternative strategy would be to consider only the links

involved in the first panel; however an exercise in this line showed us that there is a strong

selection bias towards stable links when we consider such an option. For example, links

appearing only once, in the second to tenth panel, will not be considered if we set our

benchmark on the first panel only. Our definition also does not differentiate between links

active half of the time or those active during a particular half of the year. We do not

propose our measure as the ultimate way to reduce a set of network panels into a weighted

network, but as a simple way to do so, allowing us to characterize to first approximation the

stability of a network's links.

2.4 Global Analysis of the Persistence of Ties

Figure 2a shows the persistence histogram for the voice call network. The

distribution is bimodal, meaning that ties tend to be either active most of the time or rarely

expressed. This is known in social network analysis as a core-periphery structure [13],

where stable ties compose a person's social core and unstable ties connect people to the

more peripheral actors in their life.

Figure 2 Persistence across a cellular phone network a Distribution of persistence for all
links b Fraction of surviving ties as a function of time. The inset shows the same plot in a
double logarithmic scale. The continuous line is t^{-1/4}.

The decay of ties as a function of time can be approximated by a power-function

(Fig. 2b), in agreement with the 4-year study performed by Burt [40]. The fact that the

survival probability of a tie can be approximated by ~t^{-α} with α = 0.25 ± 0.07 indicates that a

great number of ties disappear quickly, while others tend to stay for very long periods of

time. On average, less than 40% of the ties are conserved after 15 days. After this initial

drop however, ties disappear slowly allowing more than 20% of the ties to remain after a

year. We note that the discreteness and sparseness of our data do not allow us to prove

that tie decay follows a power-law. Yet the graphical analysis of Figure 2 b can be

considered as suggestive evidence motivating a hypothesis and further study in this

direction.
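The kind of graphical estimate discussed above can be sketched as follows. We assume that a tie "survives" at a later panel if it was present in the first panel and is observed again in that later panel, and we estimate the exponent as the least-squares slope on a double logarithmic scale. Both choices are assumptions of this sketch, and a log-log slope is only suggestive of a power law, in line with the caveat above.

```python
import math


def surviving_fraction(panels):
    """Fraction of first-panel ties that are observed again in each later panel."""
    first = {frozenset(link) for link in panels[0]}
    fractions = []
    for links in panels[1:]:
        current = {frozenset(link) for link in links}
        fractions.append(len(first & current) / len(first))
    return fractions


def loglog_slope(times, values):
    """Least-squares slope of log(values) against log(times)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(v) for v in values]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Synthetic check: values that decay exactly as t^(-1/4) give a slope of -0.25.
times = [1, 2, 4, 8, 16]
values = [t ** -0.25 for t in times]
slope = loglog_slope(times, values)
```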

2.5 Network Structure and the Persistence of Ties

2.5.1 Bivariate Analysis

Figure 3a shows a fragment of the mobile-call network extracted by considering all

connections up to 3 links from a randomly chosen user. Although this example shows less

than 0.0008% of our network, it visually summarizes the correlations between

persistence, perseverance and the topological attributes of the mobile-call network. In

particular, we find that these temporal attributes correlate with topological variables such

as the number of connections or degree of a node ki, the average reciprocity of a node r

(fraction of ties containing both incoming and outgoing calls) and the clustering

coefficient of a node Ci defined as:


\[ C_i = \frac{2\Delta}{k_i (k_i - 1)}, \qquad (3) \]

where Δ is the number of triads, or fully connected triangles, in which the node is

involved. Figure 3b shows a histogram of persistence split into 9 different degree

categories revealing that persistent links represent a large fraction of the connections for

low degree nodes while transient links are more common for large degree nodes. The

number of persistent ties grows, however, as a function of degree, meaning that although

on average the persistence of high degree nodes is lower, in absolute terms their core is

larger (Figure 3 c).
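For concreteness, the local clustering coefficient can be computed from a link list as sketched below. We assume the standard form, twice the number of triangles through a node divided by k(k-1), and assign a value of zero to nodes with fewer than two neighbors. That convention and the toy network are ours.

```python
from itertools import combinations


def clustering_coefficients(links):
    """Return {node: C_i} with C_i = 2 * triangles_i / (k_i * (k_i - 1))."""
    neighbors = {}
    for i, j in links:
        if i != j:
            neighbors.setdefault(i, set()).add(j)
            neighbors.setdefault(j, set()).add(i)
    result = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            result[node] = 0.0  # convention assumed here for nodes with k < 2
            continue
        # A triangle through `node` is a pair of its neighbors that are linked.
        triangles = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        result[node] = 2 * triangles / (k * (k - 1))
    return result


# Hypothetical network: a triangle (1, 2, 3) with a pendant node 4.
links = [(1, 2), (1, 3), (2, 3), (3, 4)]
C = clustering_coefficients(links)
```

Node 3 has three neighbors but only one linked pair among them, so its coefficient is 1/3, whereas nodes 1 and 2 sit in a closed triangle and reach 1.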

Figure 3 d shows the distribution of persistence divided by clustering coefficient

categories, indicating that highly clustered nodes tend to have relatively large cores. In the

core periphery context, this means that persevering nodes are located in dense parts of the

social network (Fig. 3a I) while those in sparser parts tend to have non-persistent ties acting

as bridges which intermittently connect different parts of the network (Fig. 3a II). Finally,

we split the distribution of persistence by reciprocity (Figure 3e) and observe that nodes

with more reciprocated ties tend to be more persistent.


Figure 3 Network structure and the persistence of ties a A fragment of the network extracted by considering up to the second
neighbor of a randomly chosen node (indicated by a black arrow). (b-e summarize statistics for the entire network) b Distribution
of persistence divided into nine degree categories c Number of persistent links defined as those with a persistence of, from top to
bottom: 6/10, 7/10, 8/10, 9/10 and 10/10. d Distribution of persistence divided into nine clustering categories. e Distribution of
persistence divided into five different reciprocity segments.
2.6 Multivariate Analysis

2.6.1 At the link level

In the previous section we presented a bivariate analysis in which we analyzed the

effect of three single structural variables and found that persistence depends monotonically

on all of them (degree, clustering coefficient and reciprocity). The observed correlations

however, might well be redundant. To check if this is the case we perform a multivariate

analysis to quantify the effect of each of these variables on the persistence of ties. Because

of the large number of observations considered (∼ 2 million nodes, ∼ 8 million ties) the

confidence intervals of the regressions do not spread far from the predicted values. Hence

we concentrate our discussion on the relative magnitude of the effects rather than on their

significance.

On a social network, it is a well-known fact that agents tend to connect to other

agents that have a similar degree [41,42]. It is not known however, whether links connecting

same degree agents tend to be more stable than those connecting different degree agents.

To study this effect we performed a regression in which we study the persistence of a link

as a function of the difference in degree between the nodes adjacent to the ends of each

link. Furthermore, we also include in the regression the difference in clustering and

average reciprocity of nodes connected by a particular link. In addition to this, we consider

two link attributes, the reciprocity of links R (was there ever a panel in which caller and

callee reciprocally called each other?) and the topological overlap (TO) associated with

that link which is defined as

, (4)

where Oij is the number of neighbors that agents i and j have in common and ki and

kj are their respective degrees. Topological overlap is a local measure of betweenness

indicating the number of neighbors shared by two nodes at the ends of a link.

2.6.2 Performing Multivariate Analysis

Multivariate analysis is a standard statistical technique used to separate the

correlation between different variables. The technique is an extension of the bi-variate case

which can be used to study the variance shared by a pair of variables.

We can illustrate how multiple regression works by using an example with two

explanatory variables (x1 and x2) and one dependent variable (y). Multiple regression

analysis is based on linear regression, as many other functional forms can be linearized by

performing a change of variables.

Regression analysis begins by performing the least-squares fit:

\[ y = B_1 x_1 + B_2 x_2 + A, \qquad (5) \]

where B1 and B2 are the regression coefficients and A is the intercept: y(x1=0,x2=0).

From the definition of the correlation coefficient we can interpret B1 and B2 as the change

in y associated with a standard deviation of x1 or x2.

To separate the effect on y from x1 and x2 we first calculate the correlation

coefficient between x1 and x2, which we denote as r12.

\[ r_{12} = \frac{\mathrm{cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}, \qquad (6) \]

where cov is the covariance and σ stands for the standard deviation. We also use (6)

to calculate the correlation between y, x1 and x2, which we denote ry1 and ry2 respectively.

The total variance in y explained by x1 and x2 is then given by:

\[ R^2(x_1, x_2) = \frac{r_{y1}^2 + r_{y2}^2 - 2 r_{y1} r_{y2} r_{12}}{1 - r_{12}^2}. \qquad (7) \]

We can split the effects of x1 and x2 on y by calculating the partial regression

coefficients which are given by:

\[ pr_1 = \left( R^2 - r_{y2}^2 \right)^{1/2} \qquad (8) \]

\[ pr_2 = \left( R^2 - r_{y1}^2 \right)^{1/2} \qquad (9) \]

and indicate the amount of variance in y explained by each one of these variables.
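The two-predictor case can be worked through numerically. The sketch below computes the three pairwise correlations, the total explained variance, and the share of variance that each predictor adds beyond the other. We assume the standard expressions in which the correlations enter squared, so that the unique contribution of x1 is R^2 minus the squared correlation between y and x2; the square roots of these differences correspond to the quantities in (8) and (9) as we read them. The data and function names are hypothetical.

```python
import math


def mean(v):
    return sum(v) / len(v)


def pearson(a, b):
    """Pearson correlation: cov(a, b) / (sigma_a * sigma_b)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def two_predictor_summary(y, x1, x2):
    """Return R^2 and the unique variance share of each predictor."""
    r12 = pearson(x1, x2)
    ry1 = pearson(y, x1)
    ry2 = pearson(y, x2)
    r_squared = (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)
    # Variance explained by x1 beyond x2 alone, and vice versa.
    unique_1 = r_squared - ry2**2
    unique_2 = r_squared - ry1**2
    return r_squared, unique_1, unique_2


# Hypothetical data: x1 and x2 are uncorrelated and y = 2*x1 + x2 exactly.
x1 = [1, 1, -1, -1]
x2 = [1, -1, 1, -1]
y = [2 * a + b for a, b in zip(x1, x2)]
r_squared, unique_1, unique_2 = two_predictor_summary(y, x1, x2)
```

Because the two predictors are uncorrelated here, the unique shares (about 0.8 and 0.2) add up to the total explained variance. With correlated predictors they add up to less, which is the redundancy the multivariate analysis is meant to expose.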

We use multiple regression analysis to show that R, TO, ΔC, Δk and Δr explain

40% of the variance in persistence (Table 2, R^2 = 0.397). The contribution of each one of

them can be isolated by considering the partial

regression coefficients [43], which are a way to quantify how much of the variance is

explained by each one of the covariates used in a regression. This technique shows that

assortative mixing is not associated with the persistence of ties, whereas the reciprocity of

the links (0 non-reciprocal, 1 reciprocal) explains 26% of the variance in persistence, followed by

topological overlap, which explains 3.4% of the variance in persistence.

TABLE 2 PERSISTENCE OF TIES AND LINK ATTRIBUTES

Pearson’s Correlation      ΔC       Δk       Δr       R        TO       Persistence
ΔC                         1        0.023    0.15     0.11     0.23     0.15
Δk                                  1        0.02     -0.13    -0.19    -0.16
Δr                                           1        -0.68    -0.073   0.033
R                                                     1        0.2964   0.5886
TO                                                             1        0.3537
Regression Coefficients    0.09     0.002    0.15     0.35     0.56
Partial Correlations       0.0027   0.0032   0.007    0.26     0.034

2.6.3 Perseverance and local topology

In the previous section we showed that high degree agents had on average less

persistent ties than low degree agents. We also saw that highly clustered agents tended to

have a larger number of persistent connections and that reciprocal ties tend to be more

persistent on average. Again, we explore the redundancy of such statements using linear

regression and split the contribution to perseverance from each of these variables by

calculating their partial correlations (Table 3). Together, these variables explain almost

50% of the variance in perseverance (R2=0.49). Their contributions are quite uneven,

however. When we look at the partial correlation coefficients extracted from our linear

model we find that most correlations vanish and the biggest contribution to perseverance is

given by the average reciprocity r of an agent's ties, which explains 27% of the variance.

The negative effect of degree on the persistence of an agent's ties is still present, but greatly

attenuated. This means that high degree agents which reciprocate their ties have more

persistent ties as well. The negative effect of an agent's degree on the persistence of its ties

is in large part explained by the fact that high degree agents tend to reciprocate less of their

ties. Similarly, the clustering coefficient C, which appeared as the strongest predictor in the

bi-variate case, explains only 6% of the variance when reciprocity and degree are taken

into account. This shows that cliques are formed by reciprocal ties minimizing the

additional information about persistence carried by cliques themselves.

TABLE 3 CORRELATIONS AND REGRESSIONS BETWEEN NODE ATTRIBUTES AND PERSEVERANCE

Pearson’s Correlation      C        k        R        Perseverance
C                          1        -0.51    0.49     0.64
k                                   1        -0.34    -0.45
R                                            1        0.62
Regression Coefficients    .0598    .0122    .3626
Partial Correlation        .062     .11      .27

2.7 Using topology to infer future ties

We finish our discussion by asking: How well can we predict the stability of ties

starting from a single panel? As mentioned before, persistence is a time-like, vertical

variable and is not constrained to correlate with space-like, horizontal variables. As we saw

from our multivariate analysis, the information carried by structural variables can be

redundant [44,45], thus, it is important to take into account their correlations to unveil their

real contribution to the persistence of ties. Can we use this information to predict which ties

persist in time? To answer this question we looked at our first data panel and used different

criteria to predict which ties will be stable. We then looked at the fraction of these ties

appearing after 1, 3, 6, 9 and 12 months and gauged the accuracy of our predictions by

measuring their Positive Predictive Value (PPV) defined as:

\[ \mathrm{PPV} = \frac{TP}{TP + FP}, \qquad (10) \]

where TP is the number of true positives and FP is the number of false positives.

We begin by testing the prediction that all ties observed as reciprocal in the first

panel will be conserved in the future. For this hypothesis, the PPV ranges from 70% after

one month to 43% after a year (Fig 4 a). For comparison, we picked a random set of ties

and found a PPV of 35% after a month and 20% after a year.

We can improve our predictive power by using a more stringent criterion. If we

consider all reciprocal links that also have a topological overlap TO ≥ 0.01, we

improve the PPV of our prediction by 5%, while an even more stringent criterion based on

a TO ≥ 0.1, gives us an extra percent that allows us to predict with a PPV larger than 50%

after one year.

Figure 4 Predicting future ties a Accuracy of tie prediction by randomly choosing ties
(orange), choosing reciprocal ties (red), reciprocal ties with a T.O.>0.01 (green), reciprocal
ties with a T.O.>0.01, ties with a T.O.>0.01 (blue) and a T.O>0.1 (purple). b Sensitivity of
the predictive methods presented in a. using the same color scheme.

The increase in accuracy brought by more stringent criteria reduces the number of

links predicted to be persistent. Thus the sensitivity of our method, defined as:

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad (11) \]

where FN is the number of false negatives, decreases with the stringency of the

criteria used but increases with time (Fig. 4 b). Hence there is a tradeoff between the

accuracy of our prediction and the number of predictions we can make. Using the simple

method presented above, an increase in accuracy comes with a decrease in sensitivity so

more accurate predictions can be made only if we accept a reduction in the number of

predictions being made.
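The tradeoff between accuracy and sensitivity can be illustrated with sets of ties. In the sketch below a prediction is a set of first-panel ties expected to persist, and the "actual" set holds the ties observed in a later panel. Both sets are hypothetical, and treating every later tie as a positive is an assumption of this example.

```python
def ppv_and_sensitivity(predicted, actual):
    """Compare a set of ties predicted to persist with the ties observed later."""
    tp = len(predicted & actual)   # predicted and observed
    fp = len(predicted - actual)   # predicted but not observed
    fn = len(actual - predicted)   # observed but not predicted
    ppv = tp / (tp + fp) if predicted else float("nan")
    sensitivity = tp / (tp + fn) if actual else float("nan")
    return ppv, sensitivity


# Hypothetical ties: five reciprocal ties in the first panel, four ties seen later.
reciprocal = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")}
high_overlap = {("a", "b"), ("a", "c")}  # stricter subset of the reciprocal ties
later = {("a", "b"), ("a", "c"), ("b", "d"), ("e", "f")}

loose = ppv_and_sensitivity(reciprocal, later)     # (0.6, 0.75)
strict = ppv_and_sensitivity(high_overlap, later)  # (1.0, 0.5)
```

The stricter criterion raises the PPV from 0.6 to 1.0 in this toy case, but it recovers only half of the ties that are actually observed later.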

Reciprocity appears to be the best predictor of persistence; however, it is not the

only one. The fact that the variance explained by other structural variables was redundant

with that explained by reciprocity allows us to use other structural variables as alternative

predictors of a tie. Figure 4 a also shows the PPV obtained when we use topological

overlap as our only predictive criterion. In this case we see that although the accuracy is

lower, it is still significantly better than random. Thus the redundancy observed in the

system can be turned into a predictive advantage and in the absence of information about

the reciprocity of links we can use redundant measures to make good educated guesses

about the existence of future ties.

2.7.1 Discussion

We have defined and measured the persistence of ties in a one year period using 10

panels of data summarizing the activity of all voice calls carried by a mobile phone carrier

from an industrialized country. We showed that the persistence of ties and perseverance of

nodes depend on topological variables (degree, clustering, reciprocity and topological

overlap). In our study, topological variables explain almost half of the variance in

persistence. The stability of social ties is likely a behavioral attribute, thus, it is not

surprising that the local structure of the social network, which is likely also a result of social

behavior, predicts the persistence of ties.

Social connections ultimately affect processes such as collective decisions [46,47]

and coordinated consumption [48]. But not all social connections are equally important;

some ties are stronger than others [49]. The strength of a social tie is not an absolute

measure. Hence there is a need to quantify the strength of ties using ad-hoc measures.

Persistence is a way to quantify the temporal stability of ties, and therefore their strength, in

one of the many possible dimensions along which tie strength can be quantified. As longitudinal

data becomes available, methods like the one introduced here can be used to quantify the

strength of links and ultimately determine its effects on network dynamics.

The relationships shown here demonstrate that the temporal dynamics of social

interactions are intrinsically coupled to the social network structure in such a way that the

existence of a tie can be predicted, with a respectable accuracy, using a simple criterion.

CHAPTER 3: UNDERSTANDING HUMAN MOBILITY PATTERNS

3.1 Introduction

Despite their importance for urban planning [50], traffic forecasting [51], and the

spread of biological [52,53,54] and mobile viruses [55], our understanding of the basic

laws governing human motion remains limited owing to the lack of tools to monitor the

time resolved location of individuals. Here we study the trajectory of 100,000 anonymized

mobile-phone users whose position is tracked for a six-month period. We find that in

contrast with the random trajectories predicted by the prevailing Lévy-flight and

random-walk models [56] (see Box 1), human trajectories show a high degree of temporal

and spatial regularity, each individual being characterized by a time independent

characteristic length scale and a significant probability to return to a few highly frequented

locations. After correcting for differences in travel distances and the inherent anisotropy of

each trajectory, the individual travel patterns collapse into a single spatial probability

distribution, indicating that despite the diversity of their travel history, humans follow

simple reproducible patterns. This inherent similarity in travel patterns could impact all

phenomena driven by human mobility, from epidemic prevention to emergency response,

urban planning and agent based modelling.

Given the many unknown factors that influence a population's mobility patterns,

ranging from means of transportation to job and family imposed restrictions and priorities,

human trajectories are often approximated with various random walk or diffusion models

[56,57]. Indeed, early measurements on albatrosses, bumblebees, deer and monkeys [58,59]

and more recent ones on marine predators [60] suggested that an animal trajectory can be

approximated by a Lévy flight [61, 62], a random walk whose step size Δr follows a

power-law distribution P(Δr) ~ Δr^{-α} with α < 3. While the Lévy statistics for some

animals require further study [63], Brockmann et al. [56] generalized this finding to humans,

documenting that the distribution of distances between consecutive sightings of nearly

half a million bank notes is fat tailed. Given that money is carried by individuals, bank-note

dispersal is a proxy for human movement, suggesting that human trajectories are best

modelled as a continuous time random walk with fat tailed displacements and waiting time

distributions [56]. A particle following a Lévy flight has a significant probability to travel

very long distances in a single step [61,62], which appears to be consistent with human travel

patterns: most of the time we travel only over short distances, between home and work,

while occasionally we take longer trips.

Random Walk (RW): A random walk is a mathematical formalization of a

trajectory that consists of taking successive steps in random directions.

Lévy-Flights (LF): A Lévy flight is a type of random walk in which the size of the

steps are distributed according to a "heavy-tailed" distribution.

Truncated Lévy Flight (TLF): A truncated Lévy flight is a random walk in which

the increments are distributed according to a heavy-tailed distribution multiplied by an

exponential decay factor (P(x) ∼ x^{-α} exp(-βx)).

Heavy Tailed Distribution, Fat-Tail or Power-Law: A heavy-tailed distribution

is a probability distribution that has infinite variance. One of the most common forms it

takes is a power-law, which falls to zero as x^{-α} where 0 < α < 3.

Each pair of consecutive sightings of a bank note reflects the composite motion of two or

more individuals, who owned the bill between two reported sightings. Thus, it is not clear

if the observed distribution reflects the motion of individual users, or some hitherto

unknown convolution between population-based heterogeneities and individual human

trajectories. Contrary to bank notes, mobile phones are carried by the same individual

during his/her daily routine, offering the best proxy to capture individual human

trajectories [8,9,10,64,65].

We used two data sets to explore the mobility pattern of individuals. The first (D1)

consists of the mobility patterns recorded over a six-month period for 100,000 individuals

selected randomly from a sample of over 6 million anonymized mobile-phone users. Each

time a user initiates or receives a call or text message, the location of the tower routing the

communication is recorded, allowing us to reconstruct the user's time resolved trajectory

(Figure 6 a and b). The time between consecutive calls follows a bursty pattern [66] (Figure

5) indicating that while most consecutive calls are placed soon after a previous call,

occasionally there are long periods without any call activity. To make sure that the

obtained results are not affected by the irregular call pattern, we also study a data set (D2)

that captures the location of 206 mobile-phone users, recorded every two hours for an

entire week. In both datasets the spatial resolution is determined by the local density of the

more than 10^4 mobile towers, registering movement only when the user moves between

areas serviced by different towers.

3.2 Source Data

The D1 dataset was collected by a European mobile phone carrier for billing and

operational purposes. It contains the date, time and coordinates of the phone tower routing

the communication for each phone call and text message sent or received by 6 million

customers. The dataset summarizes 6 months of activity. To guarantee anonymity, each

user is identified with a security key (hash code). Furthermore, we only know the

coordinates of the tower routing the communication, hence a user's location remains

unknown within a tower's service area. Each tower serves an area of approximately 3 km^2.

Due to tower coverage limitations driven by geographical constraints and national frontiers

no jumps exceeding 1,000 km can be observed in the dataset.

The research was performed on a random set of 100,000 users selected from those

making or receiving at least one phone call or SMS during the first and last month of the

study, translating to 16,364,308 recorded positions. We removed all jumps that took users

outside the continental territory. We did not impose any additional criterion regarding the

calling activity to avoid possible selection biases in the mobility pattern.

The D2 dataset was collected for the operation of some services provided by the

mobile phone carrier, like pollen and traffic forecasts, which rely on the approximate

knowledge of a customer's location at all times of the day. For customers that signed up for

location dependent services, the date, time and the closest tower coordinates are recorded

on a regular basis, independent of their phone usage. We were provided such records for

1,000 users, among which we selected the group of users whose coordinates were recorded

every two hours during an entire week, resulting in 206 users for which we have 10,613

recorded positions. Given that these users were selected based on their actions (signed up

to the service), in principle the sample cannot be considered unbiased, but we have not

detected any particular bias for this data set.

For each user in D1 and D2 we sorted the time resolved sequence of positions and

constructed individual trajectories.

Figure 5 Interevent time distribution P(ΔT) of calling activity. ΔT is the time elapsed
between consecutive communication records (phone calls and SMS, sent or received) for
the same user. Different symbols indicate the measurements done over groups of users
with different activity levels (# calls). The inset shows the unscaled version of this plot

3.3 The Heterogeneity of Human-Mobility Patterns

To explore the statistical properties of the population's mobility patterns we

measured the distance between each user's positions at consecutive calls, capturing 16,264,308

displacements for the D1 and 10,407 displacements for the D2 datasets. We find that the

distribution of displacements over all users is well approximated by a truncated power-law

\[ P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta} \exp(-\Delta r / \kappa), \qquad (12) \]

with β=1.75 ± 0.15, Δr0=1.5 km and cut-off values κ|D1 = 400 km and κ|D2 = 80 km

(Figure 6 c). Note that the observed scaling exponent is not far from β = 1.59 observed in

Ref. [56] for bank-note dispersal, suggesting that the two distributions may capture the

same fundamental mechanism driving human-mobility patterns.
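The construction of displacements from time-stamped tower records can be sketched as follows. The record layout `(user, time, x_km, y_km)` and the use of planar coordinates in kilometres are assumptions of this example, as are the function names. Zero-length displacements are kept, which is consistent with the counts reported in the text, where the number of displacements equals the number of recorded positions minus the number of users in both D1 and D2.

```python
import math
from collections import defaultdict


def displacements(records):
    """Distances between consecutive positions of each user.

    `records` is an iterable of (user, time, x_km, y_km); this layout is
    hypothetical and chosen only for the illustration.
    """
    by_user = defaultdict(list)
    for user, time, x, y in records:
        by_user[user].append((time, x, y))
    jumps = []
    for track in by_user.values():
        track.sort()  # order each trajectory by time
        for (_, x0, y0), (_, x1, y1) in zip(track, track[1:]):
            jumps.append(math.hypot(x1 - x0, y1 - y0))
    return jumps


def truncated_power_law(dr, dr0=1.5, beta=1.75, kappa=400.0):
    """Unnormalised truncated power law, with the D1 parameters as defaults."""
    return (dr + dr0) ** -beta * math.exp(-dr / kappa)


records = [
    ("u1", 2, 3.0, 4.0), ("u1", 1, 0.0, 0.0), ("u1", 3, 3.0, 4.0),
    ("u2", 1, 10.0, 0.0), ("u2", 5, 10.0, 6.0),
]
jumps = displacements(records)  # [5.0, 0.0, 6.0]
```

A histogram of `jumps` over all users, compared against `truncated_power_law`, is the kind of comparison summarized in Figure 6 c.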

Equation (12) suggests that human motion follows a truncated Lévy flight [56]. Yet,

the observed shape of P(Δr) could be explained by three distinct hypotheses: A. Each

individual follows a Lévy trajectory with jump size distribution given by (12). B. The

observed distribution captures a population based heterogeneity, corresponding to the

inherent differences between individuals. C. A population based heterogeneity coexists

with individual Lévy trajectories, hence (12) represents a convolution of hypothesis A and

B.

To distinguish between hypotheses A, B and C we calculated the radius of gyration

for each user as:

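\[ r_g^a(t) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ (x_i - x_{cm})^2 + (y_i - y_{cm})^2 \right]}, \qquad (13) \]

where xi and yi are the coordinates of the i-th recorded position, xcm and ycm are the coordinates of the centre of mass defined by a user's

positions, and the sum goes over all positions (N) recorded for a user. The radius of gyration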

can be interpreted as the typical distance travelled by user a when observed up to time t

(Figure 6 b). Next, we determined the radius of gyration distribution P(rg) by calculating rg

for all users in samples D1 and D2, finding that they also can be approximated with a

truncated power-law

\[ P(r_g) = (r_g + r_g^0)^{-\beta_r} \exp(-r_g / \kappa), \qquad (14) \]

with rg0 = 5.8 km, βr = 1.65 ± 0.15 and κ = 350 km. Lévy flights are characterized

by a high degree of intrinsic heterogeneity, raising the possibility that (14) could emerge

from an ensemble of identical agents, each following a Lévy trajectory. Therefore, we

determined P(rg) for an ensemble of agents following a Random Walk (RW), Lévy-Flight

(LF) or Truncated Lévy-Flight (TLF) (Figure 6 d) [57,61,62]. The ensemble of random

walkers was normalized such that the mean of the distribution matches that observed in our

data, whereas the ensemble of Lévy-Flight walkers had steps drawn from a distribution

with the same exponent as that found in (12). The steps of the Truncated Lévy-Flight

walkers were extracted from the distribution presented in (12).
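A null-model ensemble of this kind can be sketched in Python. The example below draws step lengths from a truncated power law by rejection sampling, moves in uniformly random directions, and computes the radius of gyration of each synthetic trajectory. The sampling scheme, the isotropic directions, the number of steps and the fixed seed are assumptions of this sketch, not a description of the simulations behind Figure 6 d.

```python
import math
import random


def radius_of_gyration(points):
    """Root-mean-square distance of the points from their centre of mass."""
    n = len(points)
    x_cm = sum(x for x, _ in points) / n
    y_cm = sum(y for _, y in points) / n
    return math.sqrt(sum((x - x_cm) ** 2 + (y - y_cm) ** 2 for x, y in points) / n)


def truncated_levy_step(rng, dr0=1.5, beta=1.75, kappa=400.0):
    """Draw a step length with density ~ (dr + dr0)^(-beta) * exp(-dr / kappa).

    Rejection sampling: propose from the pure power law by inverting its
    survival function, then accept with probability exp(-dr / kappa).
    """
    while True:
        u = 1.0 - rng.random()  # uniform in (0, 1]
        dr = dr0 * (u ** (-1.0 / (beta - 1.0)) - 1.0)
        if rng.random() < math.exp(-dr / kappa):
            return dr


def walk(rng, n_steps):
    """Trajectory with truncated power-law step lengths and uniform directions."""
    x = y = 0.0
    points = [(x, y)]
    for _ in range(n_steps):
        dr = truncated_levy_step(rng)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x += dr * math.cos(angle)
        y += dr * math.sin(angle)
        points.append((x, y))
    return points


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ensemble_rg = [radius_of_gyration(walk(rng, 100)) for _ in range(50)]
```

Because every walker uses the same step distribution, any spread in `ensemble_rg` comes from the intrinsic heterogeneity of the process itself, which is the quantity compared against the empirical P(rg).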

We find that an ensemble of Lévy agents displays a significant degree of

heterogeneity in rg, yet this is not sufficient to explain the truncated power-law distribution

P(rg) exhibited by the mobile-phone users. Taken together, Figures 6 c and d suggest that the

difference in the range of typical mobility patterns of individuals (rg) has a strong impact

on the truncated Lévy behavior seen in (12), ruling out hypothesis A.

If individual trajectories are described by a LF or TLF, then the radius of gyration

should increase in time as rg(t) ~ t^{3/(2+β)} [67,68] while for a RW rg(t) ~ t^{1/2}. That is, the longer

we observe a user, the higher the chances that she/he will travel to areas not visited before.

To check the validity of these predictions we measured the time dependence of the radius

of gyration for users whose gyration radius would be considered small (rg(T) ≤ 3 km),

medium (20 < rg(T) ≤ 30 km) or large (rg(T) > 100 km) at the end of our observation period

(T = 6 months). The results indicate that the time dependence of the average radius of

gyration of mobile phone users is better approximated by a logarithmic increase, not only a

manifestly slower dependence than the one predicted by a power law, but one that may

appear similar to a saturation process (Figure 8 a).

Figure 6 Basic human mobility patterns. a, Week-long trajectories of 40 mobile phone users
indicate that most individuals travel only over short distances, but a few regularly move
over hundreds of kilometres. Panel b, displays the detailed trajectory of a single user. The
different phone towers are shown as green dots, and the Voronoi lattice in grey marks the
approximate reception area of each tower. The dataset studied by us records only the
identity of the closest tower to a mobile user, thus we cannot identify the position of a user
within a Voronoi cell. The trajectory of the user shown in b is constructed from 186 two
hourly reports, during which the user visited a total of 12 different locations (tower
vicinities). Among these, the user is found on 96 and 67 occasions in the two most preferred
locations, the frequency of visits for each location being shown as a vertical bar. The circle
represents the radius of gyration centred on the trajectory's centre of mass. c, Probability
density function P(Δr) of travel distances obtained for the two studied datasets D1 and D2.
The solid line indicates a truncated power law whose parameters are provided in the text
(see Eq. 12). d, The distribution P(rg) of the radius of gyration measured for the users, where
rg(T) was measured after T = 6 months of observation. The solid line represents a similar
truncated power law fit (see Eq. 14). The dotted, dashed and dot-dashed curves show P(rg)
obtained from the standard null models (RW, LF and TLF), where for the TLF we used the
same step size distribution as the one measured for the mobile phone users.

In Figure 8 b, we have chosen users with similar asymptotic rg(T) after T = 6

months, and measured the jump size distribution P(Δr|rg) for each group. As the inset of

Figure 8 b shows, users with small rg travel mostly over small distances, whereas those

with large rg tend to display a combination of many small and a few larger jump sizes.

Once we rescale the distributions with rg (Figure 8 b), we find that the data collapses into a

single curve, suggesting that a single jump size distribution characterizes all users,

independent of their rg. This indicates that P(Δr|rg) ~ rg^(−α) F(Δr/rg), where α ≈ 1.2 ± 0.1 and

F(x) is an rg-independent function with asymptotic behavior F(x < 1) ∼ x^(−α) and rapidly

decreasing for x >> 1. Therefore the travel patterns of individual users may be

approximated by a Lévy flight up to a distance characterized by rg. Most important,

however, is the fact that the individual trajectories are bounded beyond rg, thus large

displacements, which are the source of the distinct and anomalous nature of Lévy flights,

are statistically absent. To understand the relationship between the different exponents, we

note that the measured probability distributions are related by

P(\Delta r) = \int_0^\infty P(\Delta r \mid r_g)\, P(r_g)\, dr_g ,

which suggests that up to the leading order we have β = βr + α − 1, consistent, within error

bars, with the measured exponents. This indicates that the observed jump size distribution

P(Δr) is in fact the convolution between the statistics of individual trajectories P(Δr|rg) and

the population heterogeneity P(rg), consistent with hypothesis C.
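The relation between the exponents can be checked with a short scaling argument. The sketch below is only a leading-order estimate: it assumes pure power-law forms, P(rg) ~ rg^(−βr) and P(Δr) ~ Δr^(−β), ignores the cutoffs of the truncated power laws, and assumes that the remaining integral over the scaling function F converges.

```latex
\begin{align}
P(\Delta r) &= \int_0^\infty P(\Delta r \mid r_g)\, P(r_g)\, dr_g
  \sim \int_0^\infty r_g^{-\alpha}\, F\!\left(\frac{\Delta r}{r_g}\right) r_g^{-\beta_r}\, dr_g \\
&= \Delta r^{\,1-\alpha-\beta_r} \int_0^\infty u^{\alpha+\beta_r-2}\, F(u)\, du
  \qquad \left(u = \Delta r / r_g,\; dr_g = -\Delta r\, u^{-2}\, du\right) \\
&\sim \Delta r^{-(\beta_r+\alpha-1)}
  \quad\Longrightarrow\quad \beta = \beta_r + \alpha - 1 .
\end{align}
```

Since F(u) ~ u^(−α) for u < 1, the integrand behaves as u^(βr−2) near the origin, so under these assumptions the integral is finite only for βr > 1.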

3.4 Testing the power-law curve fits

We tested whether the empirical data could come from the fitted distributions by

performing a stringent variant of the Kolmogorov-Smirnov (KS) goodness of fit test [69].

The KS statistic is a simple way to test whether two distributions are the same. In this

case, we use it to test the hypothesis: Could the empirically observed distributions come

from the distribution found as its best fit? To test this we generated synthetic data

starting from the fitted distribution and then used the KS test to see whether our data behaves

as well as synthetic data generated from the fitted distribution. We use two variants of the

KS statistic to compare empirical data with the fitted distribution and synthetic data with

the fitted distribution. The first is the standard KS statistic, given by:

KS = max (|F − P|) (15)

where F is the cumulative distribution of the best fit and P is the cumulative

distribution of the empirical or synthetic data. The regular KS statistic is not very sensitive

on the edges of the distribution. Hence, we also use the weighted KS statistic, defined as:

KSW = max(|F − P| / (P(1 − P))^(1/2)) (16)

Figure 7 Kolmogorov-Smirnov goodness of fit test. The figures compare the KS and KSW
statistics with those of 1000 sets of synthetic data coming from the same distribution. a. Red
line indicates the KS value for Figure 6 c D1 (p(KS)=1). b. Red line indicates the KSW value
for Figure 6 c D1 (p(KSW)=1). c. Red line indicates the KS value for Figure 6 d D1
(p(KS)=0.62). d. Red line indicates the KSW value for Figure 6 d D1 (p(KSW)=0.82).

To test whether the empirical data behaves as well as the synthetic data we

calculated the KS and KSW statistics between the empirical data and its best fit and

compared these values with those obtained by calculating KS and KSW for 1,000 sets of

synthetic data generated from the best fit. If the values obtained for KS and KSW for the

empirical data are as good as or better than those obtained for the synthetic data, then we

can conclude that the empirical data is statistically consistent with its best fit. The results of

the KS test can be summarized using a p–value by integrating the distribution of KS values

generated with the synthetic data from the value representing the empirical distribution.

When integrating such distributions from left to right we can interpret the p−value as the

probability that the observed data was the result of its best fit. A p−value close to 1 will

indicate that the empirical distribution matches its best fit as well as synthetic data

generated from the fit itself [69], whereas a relatively small p−value (typically taken p < 0.01)

would suggest that the empirical distribution cannot be the result of its best fit.

The p-values for the KS tests can be read from the caption of Figure 7.
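The procedure of this section can be sketched as follows. This is one possible reading of it, not the original implementation: the function names are hypothetical, the empirical cumulative distribution P is evaluated with a mid-point convention so that the weight P(1 − P) never vanishes, and the p-value is taken as the fraction of synthetic data sets whose statistic is at least as large as the empirical one. The fitted distribution is passed in as a cumulative distribution function `cdf` and a `sampler` drawing from it.

```python
import math
import random

def ks_statistics(sample, cdf):
    """Return (KS, KSW) between the empirical CDF of `sample` and the model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    ks = ksw = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)            # F: cumulative distribution of the best fit
        p = (i + 0.5) / n     # P: empirical cumulative distribution (mid-point convention)
        d = abs(f - p)
        ks = max(ks, d)
        ksw = max(ksw, d / math.sqrt(p * (1.0 - p)))
    return ks, ksw

def monte_carlo_p_values(data, cdf, sampler, n_sets=1000, seed=0):
    """Fraction of synthetic sets whose KS (KSW) is at least the empirical value."""
    rng = random.Random(seed)
    ks_emp, ksw_emp = ks_statistics(data, cdf)
    count_ks = count_ksw = 0
    for _ in range(n_sets):
        synthetic = [sampler(rng) for _ in data]
        ks_syn, ksw_syn = ks_statistics(synthetic, cdf)
        count_ks += ks_syn >= ks_emp
        count_ksw += ksw_syn >= ksw_emp
    return count_ks / n_sets, count_ksw / n_sets
```

As a stand-in for the truncated power law, an exponential model could be tested with `cdf=lambda x: 1 - math.exp(-x)` and `sampler=lambda rng: rng.expovariate(1.0)`.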

3.5 The periodicity of human mobility patterns

To uncover the mechanism stabilizing rg we measured the return probability for

each individual Fpt(t) [68], defined as the probability that a user returns to the position

where he or she was first observed after t hours (Figure 8 c). For a two dimensional random walk

Fpt(t) should follow ~ 1/(t ln(t)^2) [68]. In contrast, we find that the return probability is

characterized by several peaks at 24 h, 48 h, and 72 h, capturing a strong tendency of

humans to return to locations they visited before, describing the recurrence and temporal

periodicity inherent to human mobility [70,71].

Figure 8 The bounded nature of human trajectories. a, Radius of gyration versus time
for mobile phone users separated in three groups according to their final rg(T), where T = 6
months. The black curves correspond to the analytical predictions for the random walk
models, increasing in time as (solid), and (dotted).
The dashed curves correspond to a logarithmic fit of the form A+B ln(t), where A and B
depend on rg. b, Probability density function of individual travel distances P(Δr|rg) for
users with rg = 4, 10, 40, 100 and 200 km. As the inset shows, each group displays a quite
different P(Δr|rg) distribution. After rescaling the distance and the distribution with rg
(main panel), the different curves collapse. The solid line (power law) is shown as a guide
to the eye. c, Return probability distribution, Fpt(t). The prominent peaks capture the
tendency of humans to regularly return to the locations they visited before, in contrast with
the smooth asymptotic behavior ~1/(t ln(t)^2) (solid line) predicted for random walks. d, A
Zipf plot showing the frequency of visiting different locations. The symbols correspond to
users that have been observed to visit nL = 5, 10, 30, and 50 different locations. Denoting
with L the rank of the location listed in the order of the visit frequency, the data is well
approximated by R(L) ~ L^(−1). The inset is the same plot in linear scale, illustrating that 40% of
the time individuals are found at their first two preferred locations.

To explore if individuals return to the same location over and over, we ranked each

location based on the number of times an individual was recorded in its vicinity, such that a

location with L = 3 represents the third most visited location for the selected individual. We

find that the probability of finding a user at a location with a given rank L is well

approximated by P(L) ~ 1/L, independent of the number of locations visited by the user

(Figure 8 d). Therefore people devote most of their time to a few locations, while spending

their remaining time in 5 to 50 places, visited with diminished regularity. Thus, the

observed logarithmic saturation of rg(t) is rooted in the high degree of regularity in their

daily travel patterns, captured by the high return probabilities (Figure 8 c) to a few highly

frequented locations (Figure 8 d).
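The ranking used above is straightforward to reproduce for a single user. The following is a minimal sketch with hypothetical names, assuming that each record carries an identifier of the location (tower vicinity) where the user was found; `zipf_reference` gives the normalized 1/L curve against which the measured frequencies can be compared.

```python
from collections import Counter

def rank_frequencies(visits):
    """Visit frequency of each location, ordered from most to least visited (L = 1, 2, ...)."""
    counts = Counter(visits)
    total = len(visits)
    return [count / total for _, count in counts.most_common()]

def zipf_reference(n_locations):
    """Normalized P(L) = (1/L) / sum_{L'} (1/L') for L = 1..n_locations."""
    norm = sum(1.0 / rank for rank in range(1, n_locations + 1))
    return [1.0 / (rank * norm) for rank in range(1, n_locations + 1)]
```

For example, a user recorded three times at one location, twice at a second and once at a third has rank frequencies of 1/2, 1/3 and 1/6.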

3.6 The Shape of Human-Mobility Patterns

An important quantity for modeling human mobility patterns is the probability density Φa(x, y) to

find an individual a in a given position (x, y). As is evident from Figure 6 b,

individuals live and travel in different regions, yet each user can be assigned to a

well-defined area, defined by home and workplace, where she or he can be found most of

the time. We can compare the trajectories of different users by diagonalizing each

trajectory's inertia tensor, providing the probability of finding a user in a given position

(Figure 9 a) in the user's intrinsic reference frame.

To compare the trajectories of different users we calculate an individual reference

frame for each user. We do this by finding the set of axes in which the inertia tensor defined

by the collection of points visited by each user takes a diagonal form.

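The moment of inertia tensor is given by:

I = \begin{pmatrix} I_{xx} & I_{xy} \\ I_{yx} & I_{yy} \end{pmatrix} (17)

where the sums run over the positions (x_i, y_i) visited by the user:

I_{xx} = \sum_i y_i^2 , \quad I_{yy} = \sum_i x_i^2 , \quad I_{xy} = I_{yx} = -\sum_i x_i y_i (18)

We define the x axis of the user's intrinsic reference frame as the eigenvector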

associated with the smaller eigenvalue of the inertia tensor. Thus, we look for a reference

frame such that:

. (19)

This can be achieved by performing a rotation:

(20)

such that II

(21)

(22)

This leads us to three possible solutions:

(A). If

and

then

(23)

We select one of the roots to make sure.

(B). and

, then from (22) we must have sin(

, hence there is no valid solution in this case.

(C). , we derive from (22)

. We select 0 or

according to in which of these angles, the momentum of inertia is minimum.

Finally, we make a conditional rotation of π to make sure the most frequent position

has a positive value on its horizontal component.
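The construction of the intrinsic reference frame can also be carried out numerically. The sketch below is an illustration under stated assumptions rather than the derivation above: positions are measured from the trajectory's centre of mass (an assumption), the rotation angle is obtained in closed form with `atan2` instead of the case analysis, and the function names are hypothetical. The last function computes the ratio of the standard deviations along the two intrinsic axes.

```python
import math
from collections import Counter

def intrinsic_frame(points):
    """Rotate a trajectory so that x is the axis of smallest moment of inertia."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centred = [(x - cx, y - cy) for x, y in points]
    ixx = sum(y * y for _, y in centred)    # moment of inertia about the x axis
    iyy = sum(x * x for x, _ in centred)    # moment of inertia about the y axis
    ixy = -sum(x * y for x, y in centred)   # off-diagonal element of the tensor
    # Direction of the principal axis with the smallest moment (largest spread).
    theta = 0.5 * math.atan2(-2.0 * ixy, iyy - ixx)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(c * x + s * y, -s * x + c * y) for x, y in centred]
    # Conditional rotation by pi: the most frequent position must have positive x.
    (mx, my), _ = Counter(points).most_common(1)[0]
    if c * (mx - cx) + s * (my - cy) < 0:
        rotated = [(-x, -y) for x, y in rotated]
    return rotated

def sigma_ratio(rotated):
    """Ratio sigma_y / sigma_x of a trajectory already in its intrinsic frame."""
    n = len(rotated)
    sx = math.sqrt(sum(x * x for x, _ in rotated) / n)
    sy = math.sqrt(sum(y * y for _, y in rotated) / n)
    return sy / sx
```

Points must be hashable (tuples) so that the most frequent position can be counted.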

3.7 The anisotropy of Human Mobility Patterns

A striking feature of Φa(x, y) is its prominent spatial anisotropy in this intrinsic

reference frame (note the different scales in Figure 9 a). Here we find that the larger an

individual's rg the more pronounced is this anisotropy. To quantify this effect we defined

the anisotropy ratio S ≡ σy/σx, where σx and σy represent the standard deviation of the

trajectory measured in the user's intrinsic reference frame. We find that S decreases

monotonically with rg (Figure 9 c), being well approximated with S ~ rg^(−η), for η ≈ 0.12.

Given the small value of the scaling exponent, other functional forms may offer an equally

good fit, thus mechanistic models are required to identify if this represents a true scaling

law, or only a reasonable approximation to the data.

To compare the trajectories of different users we remove the individual

anisotropies, rescaling each user trajectory with its respective σx and σy. The rescaled

distribution Φ(x/σx, y/σy) (Figure 9 b) is similar for groups of users with considerably

different rg, i.e., after the anisotropy and the rg dependence are removed all individuals

appear to follow the same universal probability distribution Φ(x/σx, y/σy). This is particularly

evident in Figure 9 d, where we show the cross section Φ(x/σx, 0) for the three groups of

users, finding that apart from the noise in the data the curves are indistinguishable.

Figure 9 The shape of human trajectories. a, The probability density function Φ(x, y) of
finding a mobile phone user in a location (x, y) in the user's intrinsic reference frame (see
Section 3.6 for details). The three plots, from left to right, were generated for 10,000 users with: rg
≤ 3, 20 < rg ≤ 30 and rg > 100 km. The trajectories become more anisotropic as rg
increases. b, After scaling each position with σx and σy the resulting Φ(x/σx, y/σy) has
approximately the same shape for each group. c, The change in the shape of Φ(x, y) can be
quantified calculating the anisotropy ratio S ≡ σy/σx as a function of rg, which decreases as S
~ rg^(−0.12) (solid line). Error bars represent the standard error. d, Φ(x/σx, 0) representing the
x-axis cross section of the rescaled distribution Φ(x/σx, y/σy) shown in b.

Taken together, our results suggest that the Lévy statistics observed in bank note

measurements capture a convolution of the population heterogeneity (Eq. 9) and the motion of

individual users. Individuals display significant regularity, as they return to a few highly

frequented locations, like home or work. This regularity does not apply to the bank notes: a

bill always follows the trajectory of its current owner, i.e. dollar bills diffuse, but humans

do not.

The fact that individual trajectories are characterized by the same rg-independent

two dimensional probability distribution Φ(x/σx, y/σy) suggests that key statistical

characteristics of individual trajectories are largely indistinguishable after rescaling.

Therefore, our results establish the basic ingredients of realistic agent based models,

requiring us to place users in numbers proportional to the population density of a given

region and assign each user an rg taken from the observed P(rg) distribution. Using the

predicted anisotropic rescaling, combined with the density function, we can obtain the

likelihood of finding a user in any location. Given the known correlations between spatial

proximity and social links, our results could help quantify the role of space in network

development and evolution [72,73,74,75] and improve our understanding of diffusion

processes [57,76].

CHAPTER 4: THE PRODUCT SPACE CONDITIONS THE DEVELOPMENT OF

NATIONS

4.1 Introduction

Does the type of product a country exports matter for subsequent economic

performance? The fathers of development economics held it does, suggesting that

industrialization creates 'spill-over' benefits that fuel subsequent growth [77,78,79]. Yet,

lacking formal models, mainstream economic theory has been unable to incorporate these

ideas. Instead, two approaches have been used to explain a country's pattern of

specialization. The first focuses on the relative proportion between productive factors (i.e.

physical capital, labor, land, skills or human capital, infrastructure, and institutions [80]).

Hence, poor countries specialize in goods intensive in unskilled labor and land while richer

countries specialize in goods requiring infrastructure, institutions, human and physical

capital. The second approach emphasizes technological differences [81] and has to be

complemented with a theory of what underlies them. The varieties and quality ladders

models [82,83] assume that there is always a slightly more advanced product or just a

different one that countries can move to, disregarding product similarities when thinking

about structural transformation and growth.

Think of a product as a tree and the set of all products as a forest. A country is

composed of a collection of firms, i.e. of monkeys that live on different trees and exploit

those products. The process of growth implies moving from a poorer part of the forest,

where trees have little fruit, to better parts of the forest. This implies that monkeys would

have to jump distances, i.e. redeploy (human, physical and institutional) capital towards

goods that are different from those currently under production. Traditional growth theory

assumes there is always a tree within reach; hence the structure of this forest is

unimportant. However, if this forest is heterogeneous, with some dense areas and other

more deserted ones, and monkeys can jump limited distances, then countries may be

unable to move through the product space. If this is the case, the structure of this space and

a country's orientation within it become of great importance to the development of

countries.

4.2 Product Proximity

In theory, many possible factors may cause relatedness between products, i.e.

closeness between trees; such as the intensity of labor, land, and capital [84], the level of

technological sophistication [85,86], the inputs or outputs involved in a product's value chain

(e.g. cotton, yarn, cloth, garments) [87] or requisite institutions [88,89]. All of these are a priori

notions of what dimensions of similarity are most important, and assume that factors of

production, technological sophistication or institutional quality exhibit little specificity.

Instead, we take an agnostic approach and use an outcomes-based measure, based on the idea

that if two goods are related, because they require similar institutions, infrastructure, physical

factors, technology, or some combination thereof, they will tend to be produced in tandem,

whereas dissimilar goods are less likely to be produced together. We call this measure

proximity, which formalizes the intuitive idea that the ability of a country to produce a product

depends on its ability to produce other products. For example, a country with the ability to

export apples will probably have most of the conditions suitable to export pears. It would

certainly have the soil, climate, packing technologies, refrigerated trucks and containers. In

addition, it would have skilled agronomists, phytosanitary laws and trade agreements that

could be easily redeployed to the pear business. If instead we consider a different product such

as copper wires or home appliance manufacture, all or most of the capabilities developed for

the apple business are rendered useless. We introduce proximity as the concept that captures

this intuitive notion.

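Formally, we define the proximity φ between products i and j as the minimum of

the pairwise conditional probabilities of a country exporting a good given that it exports

another:

\varphi_{i,j} = \min\{ P(RCA_i \mid RCA_j),\; P(RCA_j \mid RCA_i) \} , (24)

where P(RCA_i | RCA_j) is the probability that a country has revealed comparative advantage

in good i given that it has it in good j, and RCA stands for Revealed Comparative Advantage [90],

RCA_{c,i} = \frac{x(c,i) / \sum_{i'} x(c,i')}{\sum_{c'} x(c',i) / \sum_{c',i'} x(c',i')} , (25)

with x(c,i) denoting the value of good i exported by country c,

which measures whether a country c exports more of good i, as a share of its total

exports, than the 'average' country (RCA ≥ 1) or not (RCA < 1).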
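The two definitions above translate directly into code. The following is a minimal sketch, not the original implementation: the function names and the layout of `exports` (a list of rows, one per country, with one export value per product) are assumptions, every country and every product is assumed to have non-zero total exports, and a country counts as an exporter of a product when its RCA exceeds 1. The minimum of the two conditional probabilities equals the number of countries exporting both goods divided by the larger of the two exporter counts.

```python
def rca_matrix(exports):
    """exports[c][i]: value of product i exported by country c. Returns RCA[c][i]."""
    n_countries, n_products = len(exports), len(exports[0])
    country_total = [sum(row) for row in exports]
    product_total = [sum(exports[c][i] for c in range(n_countries))
                     for i in range(n_products)]
    world_total = sum(country_total)
    return [[(exports[c][i] / country_total[c]) / (product_total[i] / world_total)
             for i in range(n_products)]
            for c in range(n_countries)]

def proximity_matrix(rca, threshold=1.0):
    """phi[i][j] = min(P(i | j), P(j | i)); the diagonal is left at zero."""
    n_countries, n_products = len(rca), len(rca[0])
    m = [[1 if rca[c][i] > threshold else 0 for i in range(n_products)]
         for c in range(n_countries)]
    exporters = [sum(m[c][i] for c in range(n_countries)) for i in range(n_products)]
    phi = [[0.0] * n_products for _ in range(n_products)]
    for i in range(n_products):
        for j in range(n_products):
            if i == j:
                continue
            both = sum(m[c][i] * m[c][j] for c in range(n_countries))
            largest = max(exporters[i], exporters[j])
            phi[i][j] = both / largest if largest else 0.0
    return phi
```

Taking the minimum makes the matrix symmetric and avoids assigning a high proximity to a pair of goods merely because one of them is exported by very few countries.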

4.3 The Product Space

International trade data is taken from Feenstra, Lipsey, Deng, Ma, & Mo's "World

Trade Flows: 1962-2000" dataset [91]. This dataset consists of imports and exports both by

country of origin and by destination, with products disaggregated to the SITC revision 4,

four-digit level. The authors built this dataset using the United Nations COMTRADE

database. The authors cleaned that dataset by calculating exports using the records of the

importing country, when available, assuming that data on imports is more accurate than

data from exporters. This is likely, as imports are more tightly controlled in order to

enforce safety standards and collect customs fees. In addition, the authors correct the UN

data for flows to and from the United States, Hong Kong, and China. We focus only on

export data, and do not disaggregate by country of destination. More information on this

dataset can be found in NBER Working Paper #11040, and the dataset itself is available at

www.nber.org/data and http://cid.econ.ucdavis.edu/data/undata/undata.html.

Using this we calculate the 775 by 775 matrix of revealed proximities between

every pair of goods using (24) and (25).

Figure 10 Hierarchically clustered proximity matrix representing the 1998-2000 product space.

Figure 11 Network representation of the 1998-2000 product space. Links are color coded
with their proximity value. The size of the nodes is proportional to world trade and their
colors are chosen according to the classification introduced by Leamer.

Figure 10 shows a hierarchically clustered version of the matrix. A smooth and

homogeneous product space would imply uniform values (homogeneous coloring), while a

product-ladder model [83] would suggest a matrix with high values (or bright coloring) only

along the diagonal. Instead the product space of Figure 10 appears to be modular [92,93],

with some goods highly connected and others disconnected. Furthermore, as a whole the

product space is sparse, with φij distributed according to a broad distribution with 5% of its

elements equal to zero, 32% of them smaller than 0.1 and 65% of the entries taking values

below 0.2. This significant number of negligible connections calls for a network

representation [94,73], allowing us to explore the structure of the product space, together

with the proximity between products of given classifications and participation in world

trade.

4.4 Generating a network representation of the product space

The matrix representing the product space has many small values which represent

weak connections between products. That is why a network representation becomes an

adequate way to lay out the products, giving us a quick visual way to show the relevant

links and to determine where countries are located and where they could be headed.

4.4.1 Maximum Spanning Tree (MST)

To include all products in our network we generated a "skeleton" for it: the

Maximum Spanning Tree (MST). This is the spanning tree whose sum of link weights is

maximal. In other words, it is the set of N-1 links (N being the number of nodes) that

connects all nodes in the network and maximizes the sum of the proximities in it.

We generated the MST by taking the strongest non-diagonal value of the

proximity matrix and then adding the strongest link connected to that dyad. We then

picked the strongest link connecting a new node to our triad and continued adding links

until all the nodes of the network were included (Figure 12).

Figure 12 Earliest version of the MST representing the "skeleton" of the product space.

In our visualization we also wanted to consider the strongest links which are not

necessarily in the MST. We did this by considering the MST plus all the links above a

certain threshold. A suitable visualization was obtained by keeping all links with a

proximity value of 0.55 or larger (Figure 13). This resulted in a network with 775 nodes

and 1525 links. Lower threshold values gave rise to crowded network representations

while higher values resulted in sparse networks. As a rule of thumb, a good network

visualization can be achieved with an average degree equal to 4. This is when the number

of links is twice the number of nodes, which is the case for the φ=0.55 threshold.
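The two steps described in this section, growing the maximum spanning tree one strongest link at a time and then superposing all links above a threshold, can be sketched as follows. The function names are hypothetical, `phi` is assumed to be a symmetric proximity matrix given as a list of lists with at least two products, and ties between equally strong links are broken arbitrarily.

```python
def maximum_spanning_tree(phi):
    """Grow the tree from the strongest link, always adding the strongest
    link that connects a new node to the nodes already in the tree."""
    n = len(phi)
    _, a, b = max((phi[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    in_tree = {a, b}
    links = [(a, b)]
    while len(in_tree) < n:
        _, i, j = max((phi[i][j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        in_tree.add(j)
        links.append((i, j))
    return links

def network_links(phi, threshold=0.55):
    """Links of the MST plus every link with proximity >= threshold."""
    n = len(phi)
    links = {tuple(sorted(link)) for link in maximum_spanning_tree(phi)}
    links |= {(i, j) for i in range(n) for j in range(i + 1, n)
              if phi[i][j] >= threshold}
    return sorted(links)
```

The average degree of the resulting network is twice the number of links divided by the number of nodes, which gives a quick check of the rule of thumb for any chosen threshold.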

Figure 13 Representation of the product space based on the MST plus all links with a
proximity above 0.55.

4.4.2 Network Layout

Good network visualization requires an appropriate layout. We lay out the network

using a force spring algorithm. Here nodes are represented as equally charged particles and

links are assumed to be springs. The layout is determined by the relaxed positions.

The force spring layout is not the ultimate solution, but it brings us close to a good

one. After this we retouch the layout manually to avoid overlapping links and untangle

dense clusters.

Figure 14 Network representation of the product space. Layout uses a force spring
algorithm.

4.4.3 Node Sizes and Colors

An advantage of using a network representation is that we can simultaneously look

at the structure of the space and other covariates. In our case we painted the network using

the product classifications performed by Leamer [84], and made the size of the nodes

proportional to the World Trade associated with that particular industry. To give a sense of

the proximity of the links involved in our network representation we color coded them by

using dark red and blue for strong links; and yellow and light blue for weaker ones.

4.5 The product space and the patterns of comparative advantage

To offer a visualization in which all 775 products are included, we reach all nodes

by calculating the maximum spanning tree, which includes the 774 links maximizing the

tree's added proximity, and superposing on it all links with a proximity larger than 0.55, as

we explained above. This set of 1525 links is used to visualize the structure of the full

proximity matrix, which, far from homogeneous, appears to have a core-periphery structure

(Figure 11). The core is formed by metal products, machinery and chemicals while the

periphery is formed by the rest of the product classes. The products in the top of the

periphery belong to fishing, animal, tropical and cereal agriculture. To the left there is a

strong peripheral cluster formed by garments and another one belonging to textiles,

followed by animal agriculture. The bottom of the network shows a large electronics

cluster followed to the right by mining and forest and paper products.

The network shows clusters of products somewhat related to the classification

introduced by Leamer [84], which is based on relative factor intensities (Table 4, Figure

19), i.e. the relative amount of capital, labor, land or skills required to produce each

product. Although the classification performed by Leamer was done using a different

methodology, the agreement between it and the structure of the product space is striking.

Yet it also introduces a more detailed split of some product classes. For example,

machinery is naturally split into two clusters, one consisting of vehicles and heavy

machinery, and another one belonging to electronics. The machinery cluster is interwoven

with some capital intensive metal products, but is not tightly connected to similarly

classified products such as textiles.

The map obtained can be used to analyze the evolution of a country's productive

structure. For this purpose we hold the product space fixed and study the dynamics of

production within it, although changes in the product space represent an interesting avenue

for future research.

Figure 15 shows the pattern of specialization for four regions in the product space2.

Products exported by a region with RCA>1, are shown with black squares. Industrialized

countries occupy the core, composed of machinery, metal products and chemicals. They

also participate in more peripheral products such as textiles, forest products and animal

agriculture. East-Asian countries have developed RCA in the garments, electronics and

textile clusters while Latin America and the Caribbean are further out in the periphery in

mining, agriculture and the garments sector. Finally sub-Saharan Africa exports few

product types, all of which are in the far periphery of the product space, indicating that each

region has a distinguishable pattern of specialization clearly visible in the product space (to

see a discussion of how the structure of the product space is correlated with product income

see appendix II).

Next, we show how the structure of the product space affects a country's pattern of

specialization. Figure 16 A shows how comparative advantage evolved in Malaysia and

Colombia between 1980 and 2000 in the electronics and garments sectors respectively. We

see that both countries follow a diffusion process in which comparative advantage moves

preferentially towards products close to existing goods: garments in Colombia and

electronics in Malaysia.

2
The network shown here represents the structure of the product space as determined from the
1998-2000 period. Holding the product space as fixed is a good first approximation, as the dynamics of the
network is much slower than the one of countries. The Pearson correlation coefficients (PCC) between the
proximity of all links present in this network and the ones obtained from the same network in 1990 and 1985
are 0.69 and 0.66 respectively (see supplementary material). This indicates that although the network
changes over time, after 15 years, the strength of past links still predicts the strength of the current links to a
considerable extent.

Figure 15 Localization of the productive structure for different regions of the world. The
products for which the region has an RCA > 1 are denoted by black squares.

Beyond this graphical illustration, is it true that countries develop comparative

advantage preferentially in nearby goods? We use two different approaches to this

question. First, we measure the average proximity of a new potential product j to a

country's current productive structure, which we call density and define as

\omega_j^k = \frac{\sum_i x_i \varphi_{ij}}{\sum_i \varphi_{ij}} , (26)

where ω_j^k is the density around good j given the export basket of the kth country and

x_i = 1 if RCA_{k,i} > 1 and 0 otherwise. A high density value means that the kth country has

many developed products surrounding the jth product. To study the evolution of

comparative advantage we consider transition products as those with an RCAc,i < 0.5 in

1990 and an RCAc,i > 1 in 1995. As a control, we consider undeveloped products those that

in 1990 and 1995 had an RCAc,i < 0.5 and disregard the cases that fit neither of these

two criteria. Figure 16 B shows how density is distributed around transition products

(yellow) and compares it to densities around undeveloped products (red). Clearly, these

distributions are very distinct, with a higher density around transition products than among

undeveloped ones (ANOVA (analysis of variance) p < 10^(−30)).
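A minimal sketch of the density measure and of the classification into transition and undeveloped products follows. The function names are hypothetical, `phi` is assumed to be the proximity matrix as a list of lists with zeros on the diagonal, and `exported` is assumed to be the list of indicators x_i (1 if the country has RCA > 1 in product i, 0 otherwise).

```python
def density(phi, exported, j):
    """Proximity-weighted share of the products around j that the country exports."""
    n = len(phi)
    total = sum(phi[i][j] for i in range(n))
    developed = sum(phi[i][j] for i in range(n) if exported[i])
    return developed / total if total else 0.0

def classify(rca_1990, rca_1995):
    """Label a (country, product) pair using the thresholds quoted in the text."""
    if rca_1990 < 0.5 and rca_1995 > 1.0:
        return "transition"
    if rca_1990 < 0.5 and rca_1995 < 0.5:
        return "undeveloped"
    return None  # disregarded: fits neither criterion
```

Comparing the densities of the pairs labelled "transition" with those labelled "undeveloped" gives the two distributions discussed above.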

At the single product level, we consider the ratio between the average density of all

countries in which the jth product was a transition product and the average density of all

countries in which the jth product was not developed. Formally, we define the discovery

factor Hj as

H_j = \frac{\sum_{k=1}^{T} \omega_j^k / T}{\sum_{k=T+1}^{N} \omega_j^k / (N - T)} , (27)

where T is the number of countries in which the jth good was a transition product

(taken as the first T countries in the sums) and N is the total number of countries. Figure 16 C shows the frequency distribution of this

ratio. For 79 percent of products, this ratio is greater than 1, indicating that ω_j^k is greater in

countries that transitioned into the jth good than in those that did not, often substantially.
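The discovery factor of a single product can be computed as a ratio of two averages. The sketch below uses hypothetical names and assumes that `densities[k]` holds the density around the product in country k and `transitioned[k]` flags the countries in which the product was a transition product; all remaining countries in the input form the comparison group, and both groups are assumed to be non-empty.

```python
def discovery_factor(densities, transitioned):
    """Average density where the product was a transition product, divided by
    the average density in the remaining countries."""
    moved = [d for d, flag in zip(densities, transitioned) if flag]
    stayed = [d for d, flag in zip(densities, transitioned) if not flag]
    return (sum(moved) / len(moved)) / (sum(stayed) / len(stayed))
```

A value above 1 means that the product was, on average, surrounded by more developed products in the countries that transitioned into it.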

An alternative way of illustrating that countries develop RCA in goods close to

those they already had, is to calculate the conditional probability of transitioning into a

product given that the nearest product with RCA>1 is at a given φ. Figure 16 D shows a

monotonic relationship between the proximity of the nearest developed good and the

probability of transitioning into it. While the probability of moving into a good at φ=0.1 in

the course of 5 years is almost nil, the probability is about 15 percent if the closest good is

at φ=0.8.3

3
We repeated the same exercise using the rank of proximity instead of proximity itself in order to
assess whether what matters is absolute or relative proximity. We found that absolute distance appears to be
what matters most. We found that while transition probabilities increase linearly with proximity, they decay
with rank as a power law. Moreover, the rank effect is stronger for products in sparser parts of the product
space, where transitions are also less frequent. Thus, densely connected products can develop RCA through
more paths than sparsely connected ones, indicating the importance of absolute proximity.

Figure 16 Empirical evolution of countries. A. Examples of RCA spreading for Colombia
(COL) and Malaysia (MYS). The color code shows when these countries first developed
RCA>1 for products in the garments sector in Colombia and the electronics cluster for
Malaysia. B. Distribution of density for transition products and undeveloped products. C.
Distribution for the relative increase in density for products undergoing a transition with
respect to the same products when they remain undeveloped. D. Probability of developing
RCA given that the closest connected product is at proximity φ. E. Relative size of the
largest connected component NG with respect to the total number of products in the system
N as a function of link proximity φ.

Since production shifts to nearby products, we ask whether the product space is

sufficiently connected that given enough time, all countries can reach most of it,

particularly the richest parts. Lack of connectedness may explain the difficulties faced by

countries trying to converge to the income levels of rich countries: they may not be able to

undergo structural transformation because proximities are just too low. A simple

approach is to calculate the relative size of the largest connected component as a function

of φ. Figure 16 E shows that at φ≥0.6 the largest connected component has a negligible size

compared with the total number of products while for φ≤0.3 the product space is almost

fully connected, meaning that there is always a path between two different products.
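The relative size of the largest connected component at a given φ can be computed with a union-find structure. This is a minimal sketch with a hypothetical function name, assuming that `phi` is the symmetric proximity matrix as a list of lists and that two products are linked when their proximity is at least the chosen threshold.

```python
def largest_component_fraction(phi, threshold):
    """Size of the largest connected component divided by the number of products,
    keeping only links with proximity >= threshold."""
    n = len(phi)
    parent = list(range(n))

    def find(a):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if phi[i][j] >= threshold:
                parent[find(i)] = find(j)

    sizes = {}
    for a in range(n):
        root = find(a)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n
```

Evaluating this function over a range of thresholds yields a curve of the kind shown in Figure 16 E.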

60
We study the impact of the product space structure by simulating how the position

of countries evolve when allowed to repeatedly move to products with proximities greater

than a certain φο. If countries diffuse to nearby products and these are sufficiently

connected to others, then after several iterations, 20 in our exercise, countries would be

able to reach richer parts of the product space. On the other hand, if the product space is

disconnected, countries will not be able to find paths to the richer part of the product space,

independently of how many steps they are allowed to make.
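
The diffusion exercise described above can be sketched as follows, assuming a symmetric proximity matrix and a boolean vector marking the products in which a country starts with RCA. In each round the country develops every product linked, at proximity of at least φ₀, to a product it already has. The function name, the toy data and the choice of an inclusive threshold are assumptions made for illustration, not the implementation behind Figure 17.

```python
import numpy as np

def diffuse(proximity, initial, phi0, rounds=20):
    """Return, for each product, the round in which it is first reached
    (0 = developed at the start, -1 = never reached)."""
    links = proximity >= phi0       # assumption: threshold is inclusive
    np.fill_diagonal(links, False)
    developed = np.asarray(initial, dtype=bool).copy()
    reached_at = np.where(developed, 0, -1)
    for step in range(1, rounds + 1):
        # Products adjacent to any developed product and not yet developed.
        reachable = links[developed].any(axis=0) & ~developed
        if not reachable.any():
            break                   # diffusion has stopped
        reached_at[reachable] = step
        developed |= reachable
    return reached_at

# Hypothetical 4-product proximity matrix and starting basket.
toy = np.array([
    [1.0, 0.7, 0.2, 0.1],
    [0.7, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
])
start = [True, False, False, False]
for phi0 in (0.3, 0.5, 0.8):
    print(phi0, diffuse(toy, start, phi0))
```

In this toy case, raising φ₀ progressively cuts the paths out of the initial product, which is the qualitative behavior the simulation is meant to probe.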

The results of our simulation for Chile and Korea are presented in Figure 17 A. At a

relatively low proximity (φ₀=0.55) both countries are able to diffuse through to the core of

the product space; however, Korea is able to do so much faster thanks to its positioning in

core products. For higher proximities the question becomes whether a country can spread

at all. At φ₀=0.6 Chile is able to spread slowly throughout the space while Korea is still

able to populate the core after 4 rounds. At φ₀=0.65, Chile is not able to diffuse, lacking

any close enough products, while Korea develops RCA slowly to a few products close to

the machinery and electronics cluster.

Figure 17 Simulated diffusion process and inequality. A. Simulated diffusion process for
Chile and Korea in which we allow countries to develop RCA in all products with proximity
φ≥0.55, 0.6 and 0.65. The number of steps required to develop RCA can be read from the
color code on the top right corner of the figure. B. Distribution for the average PRODY of
the best 50 products in a country's basket before and after 20 rounds of diffusion. The
original distribution is shown in green while the one associated with the distribution after
20 diffusion rounds with φ=0.65 is presented in yellow and φ=0.55 in red. C. Interquartile
range of the distribution of the best 50 products after diffusing with a given φ normalized
by the interquartile range of the best 50 products in absence of diffusion.

To generalize this analysis for the whole world, we need a measure to summarize

the position of a country in the product space. We adopt a measure based on Hausmann,

Hwang and Rodrik [95], which involves a two-stage process. First, for every product we

assign a value, which is the RCA-weighted GDP per capita of countries with comparative

advantage in that good, called PRODY [95]. We then average the PRODYs of the top N

products that a country has access to after M iterations at φ₀ and denote it by

$\langle \mathrm{PRODY} \rangle_M^N$. Figure 17 B shows the distribution of $\langle \mathrm{PRODY} \rangle_M^N$ for N = 50, M = 20

and φ₀=1 (green), φ₀=0.65 (yellow) and φ₀=0.55 (red). The distribution for φ₀=1 allows us

to characterize the current distribution of countries in the product space, which shows a

bimodal distribution, signature of a world divided into rich and poor countries with few

countries occupying the center of the distribution. When we allow countries to diffuse up to

φ₀=0.65, this distribution does not change significantly: it shifts slightly to the right due to

the acquisition of a limited number of sophisticated products by some countries. This

diffusion process, however, stops after a few rounds and the world maintains a degree of

inequality similar to its current state. In contrast, when we consider φ₀=0.55, most countries

are able to diffuse and reach the most sophisticated basket in the long run. Only a few

countries are left behind, which unsurprisingly make up the poorest end of the income

distribution (more details on the simulated diffusion process can be found in Appendix III).
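
The two-stage measure can be sketched as below. We assume here that PRODY is the average GDP per capita across countries, weighted by each country's RCA in the product and normalized so that the weights sum to one for every product, and that the set of products a country can access is given as a boolean mask. The function names and the toy numbers are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

def prody(rca, gdp_per_capita):
    """RCA-weighted average GDP per capita for each product.
    rca: array (countries, products); gdp_per_capita: array (countries,)."""
    weights = rca / rca.sum(axis=0, keepdims=True)  # columns sum to 1
    return weights.T @ gdp_per_capita

def top_n_mean_prody(prody_values, accessible, n):
    """Mean PRODY of the n most valuable products a country can access.
    If fewer than n products are accessible, all of them are averaged."""
    values = np.sort(prody_values[accessible])[::-1]
    return values[:n].mean()

# Hypothetical data: 3 countries, 2 products.
rca = np.array([[2.0, 0.0],
                [1.0, 1.0],
                [1.0, 3.0]])
gdp = np.array([1000.0, 2000.0, 4000.0])
print(prody(rca, gdp))
```

Combining `top_n_mean_prody` with the set of products reached after M rounds of a diffusion simulation would give one value per country, whose distribution can then be compared across thresholds.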

To quantify the level of convergence we calculated the interquartile range (IQR)

of the distribution of the average PRODY of each country's top N products after diffusion and normalized this quantity by dividing it by the

IQR of the original distribution. Figure 17 C shows that the convergence of the system

goes through an abrupt transition and that convergence is possible if countries are able to

diffuse to products located at a proximity φ>0.65.
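
The normalization used to quantify convergence is simple enough to state in code. The sketch below assumes two arrays holding one value per country, before and after diffusion; the names and numbers are illustrative only.

```python
import numpy as np

def iqr(values):
    """Interquartile range: 75th minus 25th percentile."""
    q75, q25 = np.percentile(values, [75, 25])
    return q75 - q25

def normalized_iqr(after, before):
    """IQR after diffusion divided by the IQR of the original distribution.
    Values well below 1 indicate that countries have converged."""
    return iqr(after) / iqr(before)

# Hypothetical per-country values before and after diffusion.
before = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
after = np.array([3.0, 3.5, 4.0, 4.5, 5.0])
print(normalized_iqr(after, before))
```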

4.6 Discussion

The bi-modal distribution of international income levels and a lack of convergence

of the poor towards the rich have been explained using geographic [96] and institutional

[88,89] arguments. Here, we introduced a new factor to this discussion: the difficulties

involved in moving through the product space. The detailed structure of the product space

is shown here for the very first time and, together with the location of the countries and the

characteristics of the diffusion process undergone by them, strongly suggests that not all

countries face the same opportunities when it comes to development. Poorer countries tend

to be located in the periphery where moving towards new products is harder to achieve.

More interestingly, among countries with a similar level of development and seemingly

similar levels of production and export sophistication, there is significant variation in the

option set implied by their current productive structure, with some on a path to continued

structural transformation and growth, while others are stuck in a dead end.

These findings have important consequences for economic policy, as the incentives

to promote structural transformation in the presence of proximate opportunities are quite

different from those required when a country hits a dead end. It is quite difficult for

production to shift to products far away in the space, and therefore policies to promote

large jumps are more challenging. Yet, it is precisely these long jumps that generate

subsequent structural transformation, convergence, and growth.

CHAPTER 5: DISCUSSION

5.1 Physics and People

For some people, studies mixing physics and people like the ones presented above may

be hard to classify. On one hand, some people wonder where the physics is, while others would

most certainly claim that this is definitely not sociology.

At the beginning of the last century, physicists were interested in understanding the

statistical properties of microscopic collections of particles. In the study presented in

chapter 3, more than 100 years later, we simply extend this question to a different, more

complex, type of “particle.” While this could be seen as a trivial question to formulate, our

opportunity to jump into it was made possible only as a spin-off of one of the world's

largest industries. As a research project, it would have been impossible to fund tens of

thousands of antennas and hand out devices to millions of people to participate in such an

experiment. Yet, there are many places where research similar to this one is also taking

place. Several research groups have begun collaborations with communication companies

to study closely related problems [97,98,99,100]. These groups, however, are not formed only

by traditional social scientists, but by physicists, mathematicians, computer scientists,

biologists, ecologists, architects and possibly scientists trained in other disciplines as well.

Natural scientists are interested in kinematic questions, even if the observed particles are

people. This example illustrates that scientific disciplines can be defined by the approach

followed by those who practice them, rather than by the object of study. Writing a poem

about the rain does not make you a meteorologist. Similarly, studying the statistical

properties of individuals' kinematics does not make you a sociologist.

The same rationale can be illustrated in our study of the social network's dynamics.

While the study on human mobility patterns dealt with relatively uncharted scientific

territory, the literature on the empirical study of social network dynamics consisted of

several papers [27,28,29,30,38,39,40]. Putting some technical differences aside,4 there are clear

differences between the approach taken by us and that of the more traditional social

sciences. Papers published in sociological journals have results concentrating heavily on

the personal factors affecting social decay, like marriage and divorce [28] or entering

college [30]. Our study, however, concentrated on discerning the correlations between the

structure of the network and its dynamics and is more closely related to the papers

published by Palla, Barabási and Vicsek [8], Onnela et al. [9] and Kossinets and Watts [6],

which also use massive communication records to study structural and dynamical

attributes of social interactions. In our study we showed that the coupling between a

social network's structure and dynamics is strong enough that predictions can be made with

extremely naive approaches, alerting the community that there is a fertile ground for

predictive theories and mechanisms to be used in the study of social networks.

If it is valid to classify scientific disciplines by their approach, rather than by the

objects that they study, the number of interdisciplinary collaborations should increase in

the coming years. As the world self-organizes into a more globalized and interconnected

state, the boundaries between scientific disciplines will blur, shift, reorganize and evolve;

new scientific disciplines will be created and value will gradually emerge from the work of

scientists trading skills and problems across disciplines.

4
Like the fact that we used millions of automatically collected records rather than survey data on
tens or a few hundred individuals and that the data used in sociological studies usually consists of only two
panels.

5.2 The Product Space

Traditionally, development has been measured through a host of aggregated

variables, mainly gross domestic product (GDP) adjusted by purchasing power parity. Yet,

as a concept, development has always been associated with an increase in diversity that

cannot be captured by such averages. As the human body develops, cells differentiate into

neurons, muscles, bones and several other cell types. Similarly, as nations develop,

different industries and products are born. Assessing the health of an economy solely based

on its wealth is as correct as assessing the health of a child solely based on its weight. A

more detailed view of development should ultimately concentrate on understanding how

nations develop different industries and products, rather than trying to predict how they

accumulate different types of capital. But how do we describe such a complex process?

A GDP view of development can be seen as a ramp or ladder. In such a metaphor,

development is measured by looking at the step of the ladder that each nation is on,

regardless of the products and services that allowed them to get there. Development,

however, may not be as one-dimensional as this picture suggests. An alternative metaphor

would represent nations as being spread on a rugged landscape rather than a ladder,

searching for new products in its valleys and crossing mountains and oceans in search of

new products and services – a Sewall Wright type of metaphor, for those familiar with the

great geneticist [101].

Although inspiring, assuming an entire landscape to study development may seem

impractical. We can overcome this by replacing the landscape with a network. This

approach is far from new, as it was used by Euler to abstract and solve the famous

Königsberg bridge problem [102]. In fact, network representations of physical landscapes

are ubiquitous. Trivial examples are a subway map or the highway network. Hence, if

describing economies as a set of nomadic tribes wandering on a product landscape is as

valid an analogy as describing them as a progression over a scalar function, then a network

view of development is at least as valid as a scalar one-dimensional representation.

We can illustrate how a network view of economics might look through an example

inspired by the view of the world presented in Jared Diamond's masterpiece Guns, Germs

and Steel (GG&S) [103]. For those not familiar with the book, it is a fascinating view of our

civilization's origins documenting how our society arose at the time that hunters and

gatherers discovered plant and animal domestication. The book is full of beautifully

documented facts and anecdotes disclosing the history of many of our civilization's first

economic products, like wheat, barley, pork, flax and corn. Through a careful and well

documented discussion, the book shows how our world was shaped by a few civilizations,

which happened to be in the right place at the right time. These civilizations were able to

develop primitive farming economies enabling them to produce enough surplus to allow

individuals to specialize into soldiers and bureaucrats. Consequently, these tribes

dominated their neighbors, physically and/or culturally, and transformed our world from a

myriad of thousands of independent family groups, into a few large dominant civilizations.

But why did some of these advanced civilizations prevail over the others?

According to Diamond's argument, since climate changes little with longitude but a lot

with latitude, domesticated plants and animals can diffuse more easily if they travel East or

West than if they travel North or South. Since Eurasia is a large expanse spread out on an

East – West axis, innovations in one part could travel easily across the whole continent.

However, Africa and the Americas are spread on a North – South axis and consequently

there are fewer areas with similar latitudes that could share new varieties of plants and

animals. As a consequence, there were more products available to the Eurasians than to the

Amerindians and Africans.

Figure 18 Sketch of the GG&S product space. Links are not scientifically accurate.

We can use a network view of development to describe Jared Diamond's

explanation of such disparity. Figure 18 shows a simplified graphical representation of the

product landscape faced by our ancestors. Civilizations grew by discovering products, i.e.

domesticating plants and animals. These in turn allowed them to create more complex

products, such as garments, tools and weapons. Yet not all civilizations started in equally

dense parts of the product space. Eurasian populations had access to a broader set of

opportunities because of the larger base on which they could experiment and share.

Eurasian civilizations also had more starting points, as the set of agricultural

products available in the Eurasian continent was considerably more diverse than that of the

Americas [103]. Omitting details on the nature of the links connecting different products, it

is accurate to say that Eurasian populations were located in a denser part of the product

space, where many goods were close to each other, allowing them to expand quickly

over it. On the other hand, civilizations located in the Americas were located in a much

sparser part of the product space where product diffusion was limited by geographical

constraints. This limited the economic diversification of early American civilizations and

consequently, their ability to jump to products located further in the product space.

Clues about the nature of the links connecting different products can be gathered by

looking at how products are discovered and rediscovered by different populations. Some

jumps, like the domestication of apples, can require important technological improvements

– in this case grafting – that, once achieved, opened the door to other fruits like pears and

plums [103]. Hence, even in the most ancient of times, links between some products or

industries were driven by technology. In other cases, some products or industries may be

connected to each other by input/output relationships, like flax and linen or olives and oil.

Yet a third way in which products may be connected is similarity in required infrastructure,

like the silos used to store wheat and barley. A network view of development does not

require a unique definition of a link, but rather accepting as a reasonable assumption the

fact that there are links connecting some products more strongly than others; links through

which knowledge, inputs and workers can flow, links that could be traversed by endeavor

or serendipity.

5.2.1 Exploring the network

In a recent paper we showed that it is possible to use export data to study

development as a diffusion process over a network [104]. To do this, we first created a

measure of distance between a pair of products based on the probability that they were

exported by the same countries. This simple method allowed us to construct a network

in which we showed that countries tend to diversify by developing products that are close in

the product space to those they already export [104]. In that publication, we simplified our

discussion by concentrating on the case in which the product space was fixed and countries

spread over it, which we found to be a valid assumption for short enough time scales. We

showed that apparently similar countries face very different opportunities for

diversification because they are at very different distances from other products. We also

showed that, given the structure of the product space today, most poor countries can only

converge to the levels of development of rich countries if they are able to jump distances

that are quite infrequent in the historical record. In other words, the “stairway to heaven”

has some very tall steps that are hard to overcome in one move.
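
One way to make the distance measure concrete, taken here as an assumption rather than as a restatement of the method, is to define the proximity between two products as the smaller of the two conditional probabilities of exporting one given that the other is exported. The sketch below computes this from a binary country-by-product matrix marking where a country exports a product with comparative advantage. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def proximity_matrix(has_rca):
    """Proximity between products from a binary (countries, products)
    matrix: phi[i, j] = min(P(i | j), P(j | i)), which equals the number
    of countries exporting both divided by the larger of the two counts."""
    m = np.asarray(has_rca, dtype=float)
    co = m.T @ m                       # co[i, j]: countries exporting both
    ubiquity = np.diag(co)             # countries exporting each product
    denom = np.maximum.outer(ubiquity, ubiquity)
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.where(denom > 0, co / denom, 0.0)
    return phi

# Hypothetical data: 3 countries (rows), 3 products (columns).
has_rca = np.array([[1, 1, 0],
                    [1, 0, 0],
                    [1, 1, 1]])
phi = proximity_matrix(has_rca)
print(np.round(phi, 3))
```

Taking the minimum keeps the matrix symmetric and avoids assigning a high proximity to a pair merely because one of the two products is exported by very few countries.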

There are many ways in which this analysis can be extended. It may be interesting

to study the product space from a labor perspective. One could relate products based on the

similarity of the labor skills required to make them. This would allow companies to

exchange skilled workers. A new product can more easily be developed if it uses labor

skills that are similar to products already in production, as new firms can poach trained

workers from older firms. One could also study the patterns of mobility of labor between

industries as workers try to adjust to changes in the demand for their skills.

The product space evolves over time, as new products and new ways of making old

products are introduced. Cell phones went from not existing, to being made in rich

countries, to being assembled in poor countries. Cell phone service is now ubiquitous in the

world. The internet allows for an exchange of information that was hitherto unimaginable.

Does this facilitate or make it harder for countries to transform themselves?

We can also study the robustness of an economy based on its position in the product

space and its ability to move in it [105].

These are just some examples of the questions that could be studied from a

network perspective. It opens new avenues to diagnose a country's problems and chart a

policy strategy. To properly do this, we will need to redeploy network techniques and

concepts developed in other branches of science and adapt them to economics.

Additionally, we will need to develop new techniques tailored especially for economic

questions and develop a common language that can be used to bridge new ideas and more

traditional approaches. As large data sets become more ubiquitous, the creation of network

maps will also become more common, as they represent a useful way to surf over new

waves of data.

5.2.2 Our own skepticism

Proposing a network description of the economy is bound to create skepticism.

Time will judge its usefulness, as the creation of a sensible and complete description of the

world economy as an evolving network is a task requiring many minds and years. From a

theoretical perspective, suggesting that economics should be described as a spreading

process over an evolving network is as groundbreaking as proposing that economics could

be studied using scalar functions and differential calculus. We often forget that our

“Newtonian” view of economics, pioneered by Walras and Jevons and continued by

Samuelson and many others, requires us to assume that the economy can be best described

by looking for numerical quantities and functional relationships between them. Most of us

forget that assumption because we never made it; we inherited it as college freshmen. Our

approach is not against the use of traditional mathematical methods. On the contrary, it

looks to complement them by incorporating tools that can be used to study development

from a different perspective.

There are no guarantees that this approach will be useful, as there were no

guarantees for the benefits of using calculus and physically inspired equilibrium processes

to describe economics at the beginning of the last century. The proof of the proverbial

pudding will have to be revealed by further research. Yet, markets have taught us the

importance of leaving room for innovation. A network view of development may be just

one such innovation.

5.3 Every tune in the guitar

After all, a scientist's work is a dance with ignorance. We have adapted our minds

to constantly describe, abstract and attempt to explain a few things around us. While

everything is uncountable, we look for configurations in systems designed by ourselves

through ingenuity, serendipity and wisdom. Yet, the goal is not to look for every possible

configuration, but to find the few that appear to matter for everyone around us. In this

process, scientists explore their interests, skills and intuition; not with the goal to play

every tune in the guitar, but to discover those that sing to them. This, keeping always in

mind, that tunes cannot understand the silence. As most futures cannot be predicted, a

scientist needs to be modest against the greater concept of ignorance, as its actions will add

only a few notes in the tune of the world, a tune that might have an end unforeseen from a

scientist's intentions.

CHAPTER 6: APPENDIXES

6.1 APPENDIX I: Papers Published During my PhD

6.1.1 Presented in this dissertation

“The Dynamics of a Mobile Phone Network”

CA Hidalgo, C Rodriguez-Sickert

Physica A, 387(12): 3017-3024

Abstract:

The empirical study of network dynamics has been limited by the lack of

longitudinal data. Here we introduce a quantitative indicator of link persistence to explore

the correlations between the structure of a mobile phone network and the persistence of its

links. We show that persistent links tend to be reciprocal and are more common for people

with low degree and high clustering. We study the redundancy of the associations between

persistence, degree, clustering and reciprocity and show that reciprocity is the strongest

predictor of tie persistence. The method presented can be easily adapted to characterize the

dynamics of other networks and can be used to identify the links that are most likely to

survive in the future.

“Understanding Human Mobility Patterns”

MC Gonzalez, CA Hidalgo, A-L Barabasi

Nature (2008) 453: 779-782

Abstract:

Despite their importance for urban planning, traffic forecasting, and the spread of

biological and mobile viruses, our understanding of the basic laws governing human

motion remains limited owing to the lack of tools to monitor the time resolved location of

individuals. Here we study the trajectory of 100,000 anonymized mobile phone users

whose position is tracked for a six month period. We find that in contrast with the random

trajectories predicted by the prevailing Lévy flight and random walk models, human

trajectories show a high degree of temporal and spatial regularity, each individual being

characterized by a time independent characteristic length scale and a significant probability

to return to a few highly frequented locations. After correcting for differences in travel

distances and the inherent anisotropy of each trajectory, the individual travel patterns

collapse into a single spatial probability distribution, indicating that despite the diversity of

their travel history, humans follow simple reproducible patterns. This inherent similarity in

travel patterns could impact all phenomena driven by human mobility, from epidemic

prevention to emergency response, urban planning and agent based modelling.

“The Product Space Conditions the Development of Nations”

CA Hidalgo, B Klinger, A-L Barabasi, R Hausmann

Science (2007) 317: 482-487

Abstract:

Economies grow by upgrading the products they produce and export. The

technology, capital, institutions, and skills needed to make newer products are more easily

adapted from some products than from others. Here, we study this network of relatedness

between products, or “product space,” finding that more-sophisticated products are located

in a densely connected core whereas less sophisticated products occupy a less-connected

periphery. Empirically, countries move through the product space by developing goods

close to those they currently produce. Most countries can reach the core only by traversing

empirically infrequent distances, which may help explain why poor countries have trouble

developing more competitive exports and fail to converge to the income levels of rich

countries.

“A Network View of Development”

CA Hidalgo, R Hausmann

Development Alternatives (2008) In Press

Abstract:

No Abstract

6.1.2 Not presented in this dissertation

"Genome-scale analysis of in vivo spatiotemporal promoter activity in C. elegans"

D Dupuy, N Bertin, CA Hidalgo, K Venkatesan, D Tu, D Lee, J Rosenberg, N Svrzikapa, A

Blanc, A Carnec, A-R Carvunis, R Pulak, J Shingles, J Reece-Hoyes, R Newbury, R Viveiros,

WA Mohler, C Le Peuch, IA Hope, R Johnsen, D Moerman, A-L Barabási, D Baillie & M

Vidal.

Nature Biotechnology (2007) 25: 663 - 668

Abstract:

Differential regulation of gene expression is essential for cell fate specification in

metazoans. Characterizing the transcriptional activity of gene promoters, in time and in

space, is therefore a critical step toward understanding complex biological systems. Here

we present an in vivo spatiotemporal analysis for ~900 predicted C. elegans promoters

(~5% of the predicted protein-coding genes), each driving the expression of green

fluorescent protein (GFP). Using a flow-cytometer adapted for nematode profiling, we

generated 'chronograms', two-dimensional representations of fluorescence intensity along

the body axis and throughout development from early larvae to adults. Automated

comparison and clustering of the obtained in vivo expression patterns show that genes

coexpressed in space and time tend to belong to common functional categories. Moreover,

integration of this data set with C. elegans protein-protein interactome data sets enables

prediction of anatomical and temporal interaction territories between protein partners.

"Transcription Factor Modularity in a Gene-Centered C. elegans Protein-DNA Interaction

Network"

V Vermeirssen, MI Barrasa, CA Hidalgo, JAB Babon, R Sequerra, L Doucette-Stamm,

A-L Barabási, AJM Walhout

Genome Research (2007) 17:1061-1071

Abstract:

Transcription regulatory networks play a pivotal role in the development, function,

and pathology of metazoan organisms. Such networks are comprised of protein-DNA

interactions between transcription factors (TFs) and their target genes. An important

question pertains to how the architecture of such networks relates to network functionality.

Here, we show that a Caenorhabditis elegans core neuronal protein-DNA interaction

network is organized into two TF modules. These modules contain TFs that bind to a

relatively small number of target genes and are more systems specific than the TF hubs that

connect the modules. Each module relates to different functional aspects of the network.

One module contains TFs involved in reproduction and target genes that are expressed in

neurons as well as in other tissues. The second module is enriched for paired homeodomain

TFs and connects to target genes that are often exclusively neuronal. We find that paired

homeodomain TFs are specifically expressed in C. elegans and mouse neurons, indicating

that the neuronal function of paired homeodomains is evolutionarily conserved. Taken

together, we show that a core neuronal C. elegans protein-DNA interaction network

possesses TF modules that relate to different functional aspects of the complete network.

"Conditions for the Emergence of Scaling in the Inter-Event Time of Uncorrelated and

Seasonal Systems"

CA Hidalgo

Physica A. (2006) 369(2): 877-883.

Abstract:

Inter-event times have been studied across various disciplines in search for

correlations. In this paper, we show analytical and numerical evidence that at the

population level a power-law can be obtained by assuming Poissonian agents with

different characteristic times, and at the individual level by assuming Poissonian agents

that change the rates at which they perform an event in a random or deterministic fashion.

The range in which we expect to see this behavior and the possible deviations from it are

studied by considering the shape of the rate distribution.

“The effect of social interactions in the primary consumption life cycle of motion pictures”

CA Hidalgo, A Castro, C Rodriguez-Sickert

New Journal of Physics (2006) 8: 52

Abstract:

We develop a 'basic principles' model which accounts for the primary life cycle

consumption of films as a social coordination problem in which information transmission

is governed by word of mouth. We fit the analytical solution of such a model to aggregated

consumption data from the film industry and derive a quantitative estimator of its quality

based on the structure of the life cycle.

6.2 APPENDIX II: Product Space Properties

Using a network representation for the product space we can not only see which

products are close to each other and the groups they form, but also their classifications and

values. However, the network representation is nothing more than a powerful visualization

technique and we still need to study the space properties using the entire proximity matrix

complemented.

6.2.1 The Product Space Can Classify Products

The first property we study is the ability of the product space to classify goods into

different classes. We compare our network representation with the clusters introduced by

Leamer, as it is shown in figure 1, by using a different color for each product class. We see

that the product space is not colored at random. Products in the same classes lie close to

each other and tend to form clusters.

Although the classification performed by Leamer was done using a different

methodology, the agreement between it and the structure of the product space is striking.

Beyond the intuitive proof of Figure 7s we can test the strength of these correlations by

taking the average proximity between and within the products belonging to one of the

clusters defined by Leamer (Table 4).

TABLE 4 STRENGTH OF THE LINKS BETWEEN AND WITHIN PRODUCTS AS

CLASSIFIED BY LEAMER.

Table 4 shows that the average proximity of products belonging to the same cluster

is always higher than the proximity for products belonging to different clusters. But not all

clusters have the same size, thus we look at the distribution of proximities for all links

connecting products with the same or different Leamer classifications. Figure 19 shows the

distribution of proximity for links connecting nodes with the same Leamer classification

(blue) and for links connecting nodes annotated differently (red). It is clear from the figure that

nodes with the same classification are connected by links with higher proximity values,

and because of the large number of links present in the system (L>200,000), the difference

between these two distributions is highly significant (log(P-value)<-300 ANOVA).
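
The comparison between links within and between classes can be sketched as follows, assuming a symmetric proximity matrix and one class label per product. Each pair of products is counted once. The function name, labels and toy matrix are illustrative assumptions; the significance test reported above is not reproduced here.

```python
import numpy as np

def within_between_proximity(proximity, labels):
    """Mean proximity of links joining products with the same label and
    of links joining products with different labels."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Upper triangle without the diagonal, so each pair appears once.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    within = proximity[same & upper]
    between = proximity[~same & upper]
    return within.mean(), between.mean()

# Hypothetical 4-product proximity matrix with two classes.
toy = np.array([
    [1.0, 0.7, 0.2, 0.1],
    [0.7, 1.0, 0.4, 0.1],
    [0.2, 0.4, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
])
classes = ["a", "a", "b", "b"]
print(within_between_proximity(toy, classes))
```

The two arrays `within` and `between` inside the function are also what one would histogram to compare the full distributions rather than only their means.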

Figure 19 Distribution of proximity for links connecting products with the same Leamer
classification (blue) and with a different one (red).

6.2.2 Correlations between the Position and Value of Goods

All products have a value, which in this work we consider as the average income

per capita associated with that good, or PRODY. This raises the question: are rich goods located in

particular parts of the product space? By looking at its network representation and setting

the size of the nodes proportional to the PRODY of a product (Figure 20), we see that the

largest nodes are located either in the center or the bottom portion of the network. At

first glance, we can say that there is a rich region of the product space, composed of

machinery, electronics and chemicals, and a poor, peripheral region, made of some

agricultural and labor intensive goods.

Figure 20 Network representation of the product space in which node sizes are proportional
to PRODY.

We can look beyond the actual value of products and study the value of goods as a function
of the distance between them. Basically we ask: Is this particular product at the top or at
the bottom of the PRODY sophistication scale? To answer this, we study the average PRODY of
products at a given distance from a particular node. We define distance as
-log(Proximity). Figure 21 shows six examples of products. Three of them (Footwear, Cotton
Undergarments, and Coats and Jackets) belong to the labor-intensive cluster and sit at the
bottom of the sophistication scale, so products far from them are richer or more
attractive. On the other hand, chemicals such as organo-sulphur compounds, phenols and
cyclic alcohols appear at the top of the sophistication scale and see all other products as
less sophisticated.

Figure 21 PRODY as a function of distance for six different products in the space. Plots were
calculated using the full proximity matrix.
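
A profile of this kind can be sketched as follows. The sketch assumes a natural logarithm in the distance definition (the base is not stated in the text) and groups products into distance bins of fixed width; the function name `prody_by_distance`, the bin width and the toy numbers are hypothetical choices made for illustration.

```python
import math


def prody_by_distance(proximity_row, prody, focal, bin_width=0.5):
    """Mean PRODY of the other products, binned by distance from `focal`.

    Distance is -log(proximity), with the natural log assumed here.
    Products with zero proximity have no finite distance and are skipped.
    Returns {k: mean PRODY}, where bin k covers [k * w, (k + 1) * w).
    """
    sums, counts = {}, {}
    for j, phi in enumerate(proximity_row):
        if j == focal or phi <= 0.0:
            continue
        k = int(-math.log(phi) // bin_width)
        sums[k] = sums.get(k, 0.0) + prody[j]
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}


# Toy data, made up for illustration: row of proximities to product 0.
proximity_row = [1.0, 0.8, 0.7, 0.2, 0.1, 0.0]
prody = [3000.0, 4000.0, 6000.0, 12000.0, 16000.0, 9000.0]
profile = prody_by_distance(proximity_row, prody, focal=0)
print(profile)  # {0: 5000.0, 3: 12000.0, 4: 16000.0}
```

In this toy case the focal product sits at the bottom of the scale: the average PRODY rises with distance, as it does for the labor-intensive examples discussed above.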

We performed the same analysis for each product class and found that there are products at
the top of the scale, at the bottom, and at local maxima (Figure 22). If the structural
transformation only moves countries toward more sophisticated goods, a local maximum would
trap countries. Examples are cereals and animal agriculture products, which are located in
the periphery of the product space but have a relatively large PRODY compared to their
neighbors.

Figure 22 panels: 1. Petroleum; 2. Raw Materials; 3. Forest Products; 4. Tropical Agriculture; 5. Animal Agriculture; 6. Cereals; 7. Labor Intensive; 8. Capital Intensive; 9. Machinery; 10. Chemicals.

Figure 22 Average PRODY as a function of the distance for products with a given Leamer Annotation.

6.2.3 Changes in Time

How fast does the product space change in time? We can take a simple look at this by
calculating the Pearson correlation coefficient (PCC) between the matrices representing
the product space in 1985, 1990 and 1998. Table 5 shows that the structure of the product
space appears to be stable: although links do change in time, after 10 or 13 years strong
links remain strong and weak links remain weak. Thus products that are close tend to remain
close and those that are far tend to stay far. The correlation was calculated over each
pair of corresponding proximities in the two time periods being compared. Proximity values
equal to zero were excluded from the calculation.

TABLE 5 PEARSON'S CORRELATION COEFFICIENT BETWEEN THE PRODUCT

SPACES GENERATED WITH DATA FROM 1985, 1990 AND 1998.

PCC     1985    1990    1998
1985    -       .702    .696
1990    -       -       .616
1998    -       -       -
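
The comparison between two proximity matrices can be sketched as below. This is an illustrative sketch rather than the original calculation: the name `pearson_nonzero` and the toy matrices are hypothetical, and it assumes that a pair of entries is dropped when the proximity is zero in either year, which is one possible reading of the exclusion rule stated above.

```python
import math


def pearson_nonzero(phi_a, phi_b):
    """Pearson correlation between corresponding off-diagonal proximities.

    Only the upper triangle is visited, since proximity matrices are
    symmetric. Pairs with a zero proximity in either matrix are skipped
    (an assumption about how zeros were excluded).
    """
    xs, ys = [], []
    n = len(phi_a)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = phi_a[i][j], phi_b[i][j]
            if a > 0.0 and b > 0.0:
                xs.append(a)
                ys.append(b)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy matrices, made up for illustration (not the 1985 or 1998 data).
phi_early = [
    [0.0, 0.8, 0.2, 0.0],
    [0.8, 0.0, 0.5, 0.3],
    [0.2, 0.5, 0.0, 0.6],
    [0.0, 0.3, 0.6, 0.0],
]
phi_late = [
    [0.0, 0.7, 0.3, 0.1],
    [0.7, 0.0, 0.4, 0.0],
    [0.3, 0.4, 0.0, 0.5],
    [0.1, 0.0, 0.5, 0.0],
]
pcc = pearson_nonzero(phi_early, phi_late)
```

A value of `pcc` close to 1 means that strong links stay strong and weak links stay weak between the two periods.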

6.3 APPENDIX III: Simulating Diffusion

6.3.1 One diffusion step

Empirically, we showed using examples and statistics that products in which

countries develop RCA tend to lie close to other products for which these countries have

already developed RCA.

Using this observation, we try to anticipate how a country will diffuse across the product
space. As an example, Figure 23 highlights with black squares all products within a given
proximity of the ones already developed by Chile and Korea. We refer to this example as one
diffusion step.

In this case we tuned the proximity of the jump and show that for high proximities the set
of available options is small, while for low proximities it is large, although different
for each country.

The available options are strongly conditioned by current exports. Korea is a country that
has developed RCA in several branches of machinery and can therefore diffuse from the
center of the space. At a proximity of 0.5 its options include the entire core of the
network plus the entire electronics and garments clusters, among other things. Chile
diffuses from the periphery and, to achieve a similar set of options, needs to diffuse as
far as proximities of 0.3.

In summary, we find that the set of options available to a country is strongly conditioned
by its position in the product space and by its ability to diffuse into products up to a
given proximity.
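
One diffusion step, as described above, can be sketched as a simple set operation. The sketch is only an illustration: the function name `one_diffusion_step`, the product indices and the toy proximity matrix are hypothetical, and a product is taken to be reachable when its proximity to at least one product in the current basket is at or above the chosen threshold.

```python
def one_diffusion_step(basket, proximity, phi_min):
    """Products outside `basket` reachable in a single jump.

    A product j is reachable if proximity[i][j] >= phi_min for at least
    one product i already in the basket.
    """
    n = len(proximity)
    return {
        j
        for j in range(n)
        if j not in basket and any(proximity[i][j] >= phi_min for i in basket)
    }


# Toy proximity matrix for five products, made up for illustration.
proximity = [
    [1.00, 0.70, 0.35, 0.10, 0.10],
    [0.70, 1.00, 0.60, 0.10, 0.10],
    [0.35, 0.60, 1.00, 0.40, 0.10],
    [0.10, 0.10, 0.40, 1.00, 0.60],
    [0.10, 0.10, 0.10, 0.60, 1.00],
]
basket = {0}
options_high = one_diffusion_step(basket, proximity, phi_min=0.65)
options_low = one_diffusion_step(basket, proximity, phi_min=0.30)
print(options_high, options_low)  # {1} {1, 2}
```

Lowering the threshold enlarges the set of options, which is the effect of tuning the proximity of the jump.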

Figure 23. One step diffusion process for Korea and Chile. The black squares denote all
products closer than a given proximity considering their export baskets in the year 2000.

Figure 24 Iterated diffusion process for Chile and Korea

6.3.2 Iterated diffusion

We can refine the diffusion process presented above by choosing a particular proximity and
iterating the one-step diffusion process. This yields the set of products potentially
available to countries after diffusing to close products iteratively. At this point we ask
ourselves: Is there a critical value of proximity at which countries will be able to
diffuse across the product space? To explore this question we simulate a diffusion process
in which a country "jumps" to all goods reachable from its current export basket, such that
the proximity to them is greater than or equal to a given value. Figure 24 illustrates
through a color code the products available to Chile and Korea after diffusing iteratively
at different proximities for 4 time steps. We observe that at relatively low proximities
(φ = 0.55) both countries are able to diffuse, although Chile does so much more slowly,
reaching the core in the second and third rounds, whereas Korea does so in the first and
second. At larger proximities the diffusion process halts. At φ = 0.65 Chile is unable to
diffuse at all, while Korea slowly does so close to the core of the product space.

Figure 25. Distribution for the average PRODY of the top 50 products reached after 20
diffusion steps at three different proximities.
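
The iterated process can be sketched by repeating the one-step jump until no new product is reachable or a maximum number of rounds is used. As before, this is a hedged illustration: `iterate_diffusion` and the toy matrix are hypothetical and are not the simulation code behind Figure 24.

```python
def iterate_diffusion(basket, proximity, phi_min, steps):
    """Return the list of baskets after each round, starting with the input.

    In every round the country jumps to all products whose proximity to
    some product already reached is at least phi_min. The loop stops early
    if a round adds nothing, which is how the process halts.
    """
    reached = set(basket)
    history = [set(reached)]
    n = len(proximity)
    for _ in range(steps):
        new = {
            j
            for j in range(n)
            if j not in reached
            and any(proximity[i][j] >= phi_min for i in reached)
        }
        if not new:
            break
        reached |= new
        history.append(set(reached))
    return history


# Toy proximity matrix for five products, made up for illustration.
proximity = [
    [1.00, 0.70, 0.35, 0.10, 0.10],
    [0.70, 1.00, 0.60, 0.10, 0.10],
    [0.35, 0.60, 1.00, 0.40, 0.10],
    [0.10, 0.10, 0.40, 1.00, 0.60],
    [0.10, 0.10, 0.10, 0.60, 1.00],
]
strict = iterate_diffusion({0}, proximity, phi_min=0.65, steps=4)
medium = iterate_diffusion({0}, proximity, phi_min=0.55, steps=4)
loose = iterate_diffusion({0}, proximity, phi_min=0.40, steps=4)
print(strict[-1], medium[-1], loose[-1])  # {0, 1} {0, 1, 2} {0, 1, 2, 3, 4}
```

In this toy network the weak link between products 2 and 3 acts as a bottleneck: only the lowest threshold lets the basket spread over the whole space.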

6.3.3 Economic Convergence

We characterize the value of a certain configuration by considering the value of its top
products. We can assign value to a good by following the work of Hausmann, Hwang and
Rodrik, in which the value or sophistication of a good is equal to the average GDP per
capita associated with that good. This quantity is called PRODY, and in our particular
example we consider the average PRODY of the top N products of a country's export basket
after M diffusion steps with proximity φ.

Figure 25 shows that the original distribution of this quantity is bimodal, indicating a
world in which countries are divided into those producing sophisticated goods and those
producing unsophisticated ones. If we allow countries to diffuse in this space to acquire
only goods that are really close by (φ = 0.65), this distribution remains practically
unchanged, evidencing the structural constraints imposed by the product space. If, instead,
we allow countries to diffuse into products at relatively low proximities (φ = 0.55), that
is, into more distant products, we find that after a large number of rounds most countries
are able to reach the most attractive parts of the space, except for a few that remain
stuck in the lowest bracket of this distribution.
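
The summary quantity used here, the average PRODY of the top N products in a basket, can be sketched as follows. The function name `top_n_mean_prody` and the numbers are hypothetical; in the text N is 50 and the basket is the one reached after M diffusion steps.

```python
def top_n_mean_prody(basket, prody, n):
    """Average PRODY of the n most valuable products in `basket`.

    If the basket holds fewer than n products, all of them are averaged.
    """
    if not basket:
        raise ValueError("basket must contain at least one product")
    top = sorted((prody[j] for j in basket), reverse=True)[:n]
    return sum(top) / len(top)


# Toy PRODY values for five products, made up for illustration.
prody = [3000.0, 4000.0, 6000.0, 12000.0, 16000.0]
before = top_n_mean_prody({0, 1, 2}, prody, n=2)        # basket before diffusion
after = top_n_mean_prody({0, 1, 2, 3, 4}, prody, n=2)   # basket after diffusion
print(before, after)  # 5000.0 14000.0
```

Computing this value for every country before and after diffusion gives the kind of distribution compared across proximities in Figure 25.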

REFERENCES

1 B.B. Mandelbrot The Fractal Geometry of Nature. New York: W. H. Freeman and

Co., (1982)

2 T. Vicsek Fractal Growth Phenomena World Scientific (1991)

3 A.-L. Barabási, H.E. Stanley, Fractal Concepts in Surface Growth, Cambridge

University Press, Cambridge (1995)

4 R.N. Mantegna, H.E. Stanley, An Introduction to Econophysics: Correlations and

Complexity in Finance, Cambridge University Press, Cambridge UK (1999)

5 J McCauley, Dynamics of Markets, Econophysics and Finance, Cambridge

University Press, Cambridge UK (2004)

6 G. Kossinets, D.J. Watts, Science 311: 88-90 (2006).

7 H. Ebel, L.-I. Mielsch, S. Bornholdt, Phys. Rev. E 66:035103 (2002)

8 G. Palla, A.-L. Barabási, T. Vicsek, Nature 446:664-667 (2007)

9 J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész,

A.-L. Barabási, PNAS 104 7332 (2007)

10 C.A. Hidalgo, C Rodriguez-Sickert. Physica A 387:3017-3024 (2008)

11 MC Gonzalez, CA Hidalgo, A.-L. Barabási. Nature 453:779-782 (2008)

12 C Biever. New Scientist 185: 25-26 (2005)

13 M. Nekovee. New J. Phys. 9:189 (2007).

14 V. Colizza, A. Barrat, M. Barthélemy, A. Vespignani, BMC Med. 5: 34 (2007)

15 V. Colizza, A. Barrat, M. Barthélemy, A. Vespignani, PNAS 103: 2015–2020

(2006)

16 E. Beinhocker, The Origin of Wealth, HBS Press, Cambridge MA (2006)

17 L. Walras, Éléments d'économie politique pure, ou théorie de la richesse sociale

(Elements of Pure Economics, or the theory of social wealth, transl. W. Jaffé), (1874)

18 L. Bachelier, Annales Scientifiques de l'École Normale Supérieure 3: 21-86

(1900)

19 B. Mandelbrot Journal of Business. 36 (1963)

20 M.H.R. Stanley, L.A.N. Amaral, S. V. Buldyrev, S. Havlin, H. Leschhorn, P.

Maass, M. A. Salinger, H.E. Stanley, Nature 379:804-806 (1996)

21 V. Plerou, P. Gopikrishnan, L.A.N. Amaral, X. Gabaix, H.E. Stanley Phys. Rev.

E 62:3023-3026 (2000)

22 J.D. Farmer, Ind. & Corp. Change 11:895-953 (2002)

23 J.D. Farmer, L. Gillemot, F. Lillo, S. Mike, A. Sen. Quant. Fin. 4:383-397 (2004)

24 J.D. Farmer, D. E. Smith, M. Shubik. Physics Today 58:37-42 (2005)

25 H. Jeong, Z. Néda, A.-L. Barabási, Europhysics Letters 61: 567-572 (2003)

26 A.-L. Barabási, H. Jeong, R. Ravasz, Z. Néda, T. Vicsek, A. Schubert, Physica A

311: 590-614 (2002)

27 P. Holme, C.R. Edling, F. Liljeros, Social Networks 26:155-174 (2004)

28 B. Wellman, R.Y. Wong, D. Tindall, N. Nazer, Social Networks 19:27-50 (1997)

29 J.L. Martin, K.-T. Yeung, Social Networks 28:331-362 (2006)

30 J.J. Suitor, S. Keeton, Social Networks 19:51-62 (1997)

31 A.-L. Barabasi, R. Albert, Science 286:509-512 (1999)

32 D.J. Watts, S.H. Strogatz, Nature 393:440-442 (1998)

33 M.E.J. Newman Phys. Rev. E. 67:026126 (2003)

34 L. Adamic, N. Glance, Proceedings of the 3rd international workshop on Link

discovery 36-43 (2005)

35 C. Haythornthwaite, Information, Communication, & Society 8:125-147

(2005)

36 N. Eagle, A. Pentland, D. Lazer, Inferring social network structure using mobile

phone data, PNAS (in submission).

37 H. Geser, Towards a Sociological Theory of the Mobile Phone, University of Zurich,

Zurich (2004).

38 D.L. Morgan, M.B. Neal, P. Carder, Social Networks 19:9-25 (1996)

39 S.L. Feld, Social Networks 19:91-95 (1997)

40 R.S. Burt, Social Networks 22:1-28 (2000)

41 M.E.J. Newman, Phys. Rev. Lett 89:208701 (2002)

42 M.E.J. Newman, Phys. Rev. E. 67:026126 (2003)

43 J. Cohen, P. Cohen, S.G. West, L.S. Aiken, Applied Multiple

Regression/Correlation Analysis for the Behavioral Sciences (3rd edition) LEA, Mahwah,

New Jersey (2003)

44 M.E.J. Newman, J. Park, Phys. Rev. E 70:066117 (2004)

45 A. Vazquez, R. Dobrin, D. Sergi, J.-P. Eckmann, Z.N. Oltvai, A.-L. Barabasi

PNAS 101:17940-17945 (2004)

46 C.A. Hidalgo, F. Claro, P.A. Marquet, Physica A 35:674 (2005)

47 K. Sznajd-Weron, J. Sznajd, IJMPC 6 (2000).

48 C.A. Hidalgo, A. Castro, C. Rodriguez-Sickert, New Journal of Physics 8:52

(2006)

49 M.S. Granovetter, The American Journal of Sociology 78:1360-80 (1973)

50 M.W. Horner, M.E.S. O'Kelly Journal of Transportation Geography 9:255-265

(2001)

51 R. Kitamura, C. Chen, R.M. Pendyala, R. Narayaran, Transportation 27:25-51

(2000)

52 V. Colizza, A. Barrat, M. Barthélémy, A.-J. Valleron, A. Vespignani, PLoS

Medicine 4:095-0110 (2007)

53 S. Eubank, H. Guclu, V.S.A. Kumar, M.V. Marathe, A. Srinivasan, Z. Toroczkai,

N. Wang, Nature 429:180 (2004)

54 L. Hufnagel, D. Brockmann, T. Geisel, PNAS 101:15124-15129 (2004)

55 J. Kleinberg, Nature 449:287-288 (2007)

56 D. Brockmann, L. Hufnagel, T. Geisel, Nature 439:462-465 (2006)

57 S. Havlin, D. ben-Avraham, Advances in Physics 51:187-292 (2002).

58 G.M. Viswanathan, V. Afanasyev, S.V. Buldyrev, E.J. Murphy, P.A. Prince,

H.E. Stanley, Nature 381:413-415 (1996)

59 G. Ramos-Fernandez, J.L. Mateos, O. Miramontes, G. Cocho, H. Larralde, B.

Ayala-Orozco, Behavioral Ecology and Sociobiology 55:223-230 (2004)

60 D.W. Sims Nature 451:1098-1102 (2008)

61 J. Klafter, M.F. Shlesinger, G. Zumofen, Physics Today 49:33-39 (1996)

62 R.N. Mantegna, H.E. Stanley, Physical Review Letters 73:2946-2949 (1994)

63 A.M. Edwards, R.A. Phillips, N.W. Watkins, M.P. Freeman, E.J. Murphy, V.

Afanasyev, S.V. Buldyrev, M.G.E da Luz, E.P. Raposo, H.E. Stanley, G.M. Viswanathan,

Nature 449:1044-1049 (2007)

64 T. Sohn, A. Varshavsky, A. LaMarca, M.Y. Chen, T. Choudhury, I. Smith, S.

Consolvo, J. Hightower, W.G. Griswold, E. de Lara, Proc. 8th International Conference

UbiComp, Springer, Berlin, (2006)

65 M.C. González, A.L. Barabási, Nature Physics 3:224-225 (2007)

66 A.-L. Barabási, Nature 435:207-211 (2005)

67 B.D. Hughes, Random Walks and Random Environments, Oxford University

Press, USA, (1995)

68 S. Redner, A Guide to First-Passage Processes. Cambridge University Press, UK

(2001)

69 A. Clauset, C.R. Shalizi, M.E.J. Newman, arXiv:0706.1062 (2007)

70 R. Schlich, K.W. Axhausen, Transportation 30:13-36 (2003)

71 N. Eagle, A. Pentland, Behavioral Ecology and Sociobiology (2007)

72 S.H. Yook, H. Jeong, A.-L. Barabási PNAS 99:13382-13386 (2002)

73 G. Caldarelli, Scale-Free Networks: Complex Webs in Nature and Technology

Oxford University Press, USA (2007)

74 S.N. Dorogovtsev, J.F.F. Mendes, Evolution of Networks: From Biological Nets

to the Internet and WWW. Oxford University Press, USA, (2003)

75 C.M. Song, S. Havlin H.A. Makse. Nature 433:392-395 (2005)

76 F. Cecconi, M. Marsili, J.R. Banavar, A. Maritan, Physical Review Letters

89:088102 (2002)

77 A. Hirschman, The Strategy of Economic Development Yale University press,

New Haven, CT. (1958)

78 P. Rosenstein-Rodan, Economic Journal 53 (1943)

79 K. Matsuyama, Journal of Economic Theory 58 (1992)

80 E. Heckscher, B. Ohlin, Heckscher-Ohlin Trade Theory, MIT Press, Cambridge

MA, (1991)

81 P. Romer. Journal of Political Economy 94:5 (1986)

82 P. Aghion, P. Howitt. Econometrica 60:2 (1992)

83 G. Grossman, E. Helpman. Review of Economic Studies 58:1 (1991)

84 E.Leamer, Sources of Comparative Advantage: Theory and Evidence. MIT

Press, Cambridge MA, (1984)

85 S. Lall, Oxford Development Studies 28:337 (2000).

86 R. Caballero, A. Jaffe. NBER macroeconomics annual, O. Blanchard, S. Fischer,

Eds. 15 (1993)

87 E. Dietzenbacher, M. Lahr, Input-output analysis: frontiers and extensions.

Palgrave, New York, NY, (2001)

88 D. Rodrik, A. Subramanian, F. Trebbi, NBER Working Paper 9305, Cambridge

MA (2002)

89 D. Acemoglu, S. Johnson, J.A. Robinson. American Economic Review,

91:1369-1401 (2001)

90 B. Balassa, The Review of Economics and Statistics 68:315 (1986)

91 R. Feenstra, R. Lipsey, H. Deng, A. Ma, H. Mo. NBER working paper 11040.

Cambridge, MA (2005)

92 E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai, A.-L. Barabási, Science

297:1551 (2002)

93 G. Palla, I. Derenyi, I. Farkas, T. Vicsek, Nature 435:814 (2005)

94 R. Albert, A.-L. Barabási Review of Modern Physics 74:47-97 (2002)

95 R. Hausmann, J. Hwang, D. Rodrik, NBER Working Paper 11905, Cambridge

MA (2006)

96 J. Gallup, J. Sachs, A. Mellinger, International Regional Science Review

22:179-232 (1999)

97 Real Time Rome, MIT, http://senseable.mit.edu/realtimerome/

98 Human Dynamics Lab, MIT, http://hd.media.mit.edu/

99 Smart Cities, MIT, http://cities.media.mit.edu/

100 Intellione and Roger Wireless have signed an agreement to use cell phones to

create traffic maps http://www.intellione.com/Newsroom/Press/intellione-presa.html

101 S. Wright, Proceedings of the Sixth International Congress on Genetics, (1932)

102 In fact he showed that the problem had no solution.

103 J. Diamond. Guns, Germs, and Steel: The Fates of Human Societies. W.W.

Norton & Company (1997)

104 C.A. Hidalgo, B. Klinger, A.-L. Barabasi, R. Hausmann. Science 317:482-487 (2007)

105 Hausmann, Rodriguez and Wagner (2008) show that the position of a country in

the product space strongly affects the speed at which it recovers from economic crises.
