A Dissertation
Doctor of Philosophy
by
César A. Hidalgo
July 2008
THREE EMPIRICAL STUDIES ON THE AGGREGATE DYNAMICS OF HUMANLY
Abstract
by
César A. Hidalgo
explained from their large number of interacting and heterogeneous components. Different
aspects of human society can be described as a complex system, as large numbers of people
systems. First, we study the dynamics of a mobile phone network reconstructed from
millions of individual phone calls. By looking at time resolved data we show that the
structure of the mobile phone network is coupled to the dynamics of mobile phone links.
Second, we study the statistical properties of human mobility patterns and show that the
productive structures and show that the structure of the product space conditions the
These three studies illustrate how large data sets can be used to empirically study
humanly driven complex systems. Individually, they present relevant information that can
be used to benchmark future models for each one of these complex systems or can be used
CONTENTS
FIGURES
FOREWORD
3.4 Testing the Power-Law Curve Fits
4.5 The Products Space and the Patterns of Comparative Advantage
REFERENCES
FIGURES
Figure 12 Earliest version of the MST representing the "skeleton" of the product space.
Figure 13 Representation of the product space based on the MST plus all links with a proximity above 0.55
Figure 14 Network representation of the product space. Layout uses a force spring algorithm.
Figure 15 Localization of the productive structure for different regions of the world
Figure 18 Sketch of the GG&S product space. Links are not scientifically accurate
Figure 19 Distribution of proximity for links connecting products with the same Leamer classification (blue) and with a different one (red)
Figure 20 Network representation of the product space in which node sizes are proportional to PRODY
Figure 21 PRODY as a function of distance for six different products in the space.
Figure 22 Average PRODY as a function of the distance for products with a given Leamer Annotation
Figure 23 One step diffusion process for Korea and Chile
Figure 25 Distribution for the average PRODY of the top 50 products reached after 20 diffusion steps at three different proximities
TABLES
ACKNOWLEDGMENTS
I would like to thank Albert-Laszlo Barabasi for accepting me into his group, for having
confidence in me from my early years in graduate school, and for the numerous
discussions and advice we exchanged during these four years. I am positive that I would
not have been able to have such a wonderful graduate school experience were it not for
Laszlo's support and personality. I would also like to thank Ricardo Hausmann for the
innumerable conversations, advice and time spent discussing several different scientific
and non-scientific issues with me. He has greatly inspired the evolutionary view of systems
scientific terms.
I am also enormously grateful to Christine Teutsch for her support and love during
the more than two years we have been together and to Alejandra Castro for her support in
I would also like to thank all my collaborators during these four years: Marta
I am also grateful for many interactions with KI Goh, Andrea Asztalos, Alexei
Julian Candia, Zehui Qu, Natali Gulbahce, Maximilian Schich, Marcio Argollo de
Special thanks go to the CCNR staff, especially to Suzanne Aleva, who has given
networks field; as well as to Nicole Halley, Nicole Leete and Agnes Petrozcky.
I would also like to thank the Notre Dame Physics department, especially Professor
Kathie Newman, who has been there to help me all along graduate school.
Additionally, I would like to thank Professor Francisco Claro, for his support and
several conversations during my graduate studies and during previous years. I would also
like to thank Professor Pablo Marquet for inviting me to the Santa Fe Institute, among
other things, and Carlos Rodriguez-Sickert, who has been a great collaborator and friend for
several years.
Finally, I would like to thank the Kellogg Institute at Notre Dame for their financial
C.A. Hidalgo was partly supported by the Kellogg Institute at Notre Dame and
acknowledges support from NSF grant ITR DMR-0426737, IIS-0513650 and the James S.
the S.F.I. We thank Nicole Leete for proofreading our manuscript. Special
acknowledgments to A.-L. Barabasi for providing the source data and discussing the
manuscript.
Human Mobility Patterns
for discussions and comments on the manuscript. This work was supported by the James S.
McDonnell Foundation 21st Century Initiative in Studying Complex Systems, the National
Science Foundation within the analysis was performed on the Notre Dame Biocomplexity
Cluster supported in part by NSF MRI Grant No. DBI-0420980. C.A. Hidalgo
ITR (DMR-0426737) and IIS-0513650 programs, and the U.S. Office of Naval Research
We would like to thank the following for valuable comments: Philippe Aghion,
Laura Alfaro, Olivier Blanchard, Ricardo Caballero, Oded Galor, Elhanan Helpman, Asim
Khwaja, Jim Lahey, Robert Lawrence, Daniel Lederman, Lant Pritchett, Roberto Rigobon,
Sturzenegger, and David Weil. C.A.H. acknowledges support from the Kellogg Institute at
Notre Dame. C.A.H. and A.-L.B. acknowledge support from NSF grant ITR
We would like to thank Melissa Wojciechowski for help editing this manuscript.
FOREWORD
student at the Center for Complex Network Research at the University of Notre Dame.
This Dissertation has been formatted to satisfy the requirements of the Physics department.
I refer anyone interested in the historical context in which this work was conducted, as
research, to search for the original copy submitted for review on my personal webpage (just
google it).
Best Wishes
Cesar A. Hidalgo
CHAPTER 1:
INTRODUCTION
complexity or complex systems. While many physicists have contributed to this field for
several decades, they are not the only ones to study complexity. The field of complexity
has a strong interdisciplinary nature and a great number of contributions have emerged
from the interactions between scientists trained in different fields. Throughout this
document, we will present some examples mixing Physics, Biology, Economics, Computer
Science, Psychology and Social Sciences to answer questions that lie between traditional
demonstrating that some problems should be worked from many different angles. There
are no constraints forbidding the use of theories and methods inspired by a particular field
in a different scientific discipline. This creates the need for a field connecting seemingly
distant branches of science. The science of complexity has arisen in part to fulfill that
particular need. Moreover, complexity science has created scientific value which is
different from that of the particular fields where its adherents were originally trained. This
dual purpose makes the field of complexity attractive from an applied as well as a
fundamental perspective.
1.1 The Statistical Physics of Society
One scientific combination that has gained recent popularity is that of physicists
studying people. The recent surge of physicists into the realms of social science has been
fueled largely by the availability of data collected by network administrators and corporate
database managers in recent years. Physicists, mainly from statistical mechanics, have
rushed into the strangest datasets looking to describe systems and answer questions that
could only be speculated about ten years ago. From a historical perspective this is hardly
unexpected, as statistical physicists have been flirting with less conventional topics for
several decades, such as fractals [1,2,3], stock-exchange time series [4,5] and more recently
phenomena. Both of these contributions have been made possible by the availability of
millions of mobile phone records. In Chapter 2 we will study the dynamics of a mobile
phone network [10], whereas in Chapter 3 we show how people can be characterized by
Before presenting these two topics, we will briefly discuss two of the main
dynamics or mobility patterns. While in their purest forms, such studies might appear as
simple curiosities, marketing executives are becoming increasingly interested in new
ways to classify people. The motivation here is obvious, as people in the marketing field
have constantly searched for new ways to segment populations and identify the individuals
segmentation has been based on demographic and socioeconomic attributes, such as a
person's gender, age, ZIP code and income information. While all these variables are
represent a person, as two individuals of the same age and gender, living in the same ZIP
code and having comparable incomes, can have extremely different behaviors. This is
where these new layers of data fall into place, as the structure of an individual's social
network, its dynamics and mobility patterns could be better proxies for a person's behavior
than demographic and socioeconomic variables and could therefore be used to explore new
While the previous paragraph presents a very well defined and concrete industrial
application of studies in social structure and dynamics, there are several other applications
that open up thanks to this type of study. Epidemiology is probably the most important of
these given the real threat of infectious diseases and software viruses [12,13]. This fuels the
need for studies providing data that can be used to understand the spread of biological and
digital pathogens from an empirical or theoretical perspective. Ultimately, the goal of these
dynamics, there is also a fundamental angle from which to address these questions. Social
systems are extremely complex and exhibit behaviors that are interesting to study solely for
scientific curiosity. Together, applied and fundamental studies can help advance our
understanding of the universe at this high level of complexity; studies conducted
from applied and fundamental perspectives are therefore complements rather than substitutes. Here
we present some research conducted from an empirical perspective with some fundamental
Large-scale data on a person's social network and mobility has only become
available during the last few years. Hence, exploring ways to statistically describe millions
of individuals based on their social network structure, dynamics or spatial mobility patterns
is by definition a new field. Even more recent are data allowing us to
study the dynamics of the structures defined by individuals' social relationships and
mobility patterns. In the next chapters we present two studies that are among the first in the
literature to explore the dynamical properties of individuals' social networks and spatial
dynamics. The first of these studies concentrates on the stability of social ties and its
connection to network structure. This study was published in Physica A in May 2008 [10].
The second study discusses how to characterize the spatial patterns defined by the
movement of individuals. This study was published in Nature in June 2008 [11].
development economics and industrial policy. While this approach can be considered
innovative, it is not a completely novel mixture, as physics and economics have been in dialogue
for more than a century. This dialogue long precedes complexity science and goes back at least to
the times of Leon Walras and William Stanley Jevons, two nineteenth-century scientists credited
with establishing the mathematical foundations of classical economics. Leon Walras, a late
bloomer in scientific terms [16], has been credited with introducing the notion of equilibrium
in classical economic theory [16] in his 1872 book Elements of Pure Economics [17]. It is
believed that Walras adapted the notion of equilibrium from Louis Poinsot's Elements de
Statique. On the same front, William Stanley Jevons defined the problem of economic
amount of goods will make them “happier.” Like Walras, Jevons was also inspired by
theoretical physics. This is evident in his Theory of Political Economy, as Jevons used
Despite this nineteenth-century kinship, physics and economics have evolved mostly
separately ever since. During the last century, economics has branched out into several
different fields, many of which have adopted a view of the world that resembles
mathematics rather than physics. The field of finance, however, has been an exception, as its
use of random walks has kept it somewhat closer to physics. Random walks were first
proposed as a model for financial markets by Louis Bachelier in his doctoral thesis in the
year 1900¹ [18] under the supervision of Henri Poincaré. Bachelier's work was ignored for
many years and resurfaced thanks to Benoit Mandelbrot during the 1950s, when he
re-discovered the power-law nature of stock returns [19]. This portion of Mandelbrot's
work was also ignored during the first years after its publication, as the power-law behavior
Yet describing a system with random walks and power laws is not the way to keep
physicists out of the loop. During the last decades a large number of physicists entered
financial research. This is a process driven by the similarities of the two disciplines that
¹ Note that this is earlier than Einstein's 1905 paper on random walks.
was catalyzed greatly during the 1990s, as the end of the Cold War created an excess supply
Financial data has also been studied by physicists in academic settings. A superb
example of this is the work that Rosario Mantegna and Gene Stanley started at Boston
University in the mid-1990s. Several of their findings are summarized in their book
Econophysics [4], including the existence of scaling laws for the growth of firms [20,4], the
interpretation of financial markets as an anomalous diffusion process [4,21] and some of the
Another successful physicist in the field of finance is Doyne Farmer of the Santa Fe
Institute. His approach has been somewhat different from that of Mantegna and Stanley, as
he has concentrated on explaining different observations in the stock market using
agent-based simulations [22] or simple logic [23]. Still, an important part of his work resembles
The fact that financial markets have been studied using random walk models and
time series analysis can partly explain why physicists have sailed into such apparently distant
academic waters. Development economics, however, has been studied as the accumulation
of factors and through abstract production functions that resemble physics only in the
simplicity of some of their mathematical expressions, which is not fertile enough ground for
collaborations to occur. The introduction of a network view of development not only
opens a new angle from which information can be revealed, but also opens a new window for
economy.
CHAPTER 2: THE DYNAMICS OF A MOBILE PHONE NETWORK
2.1 Introduction
Physicists are no strangers to the study of social networks. During the last decade
several groups have explored the structure of social networks captured by e-mails [6,7],
cellular phones [8,9] and professional relationships such as being costars in a movie [25] or
collaborators in a paper [26]. Studying the dynamics of such social systems, however, has
been limited by the lack of longitudinal data, and as a result, only a few studies on the
In principle there are many factors that could affect the stability of a social link
[28,29,30]. The aim of the subsequent sections is not to review all these factors, but to study
the coupling between the structure of the network as characterized in previous studies
Here we use a year's worth of mobile phone data as a proxy for the structure and
collected communication records have been proposed as a source of reliable data about
personal connections [33]. Email data, for example, has been used to study social processes
such as the formation of social links, or ties [6], and social structure [7], whereas blog data has been
used to study the spread of political opinions [34]. Communication records overcome
problems of survey data such as subjective biases of the respondents and the intrinsic
limitations of ego-centered networks, like their unreliability in measuring a social network's
structure.
It is not our intention to claim that cellular phone communications fully capture
social exchange. A social network is expressed through a host of interactions ranging from
e-mails to face-to-face contacts. People in close social contact tend to express their ties to
others through multiple interaction channels [35], such as email, cell phone
however, favoring the use of cellular phone calls as a relevant proxy for large-scale social
networks. Specifically, it has been shown that objective measures such as the one we use in our
study can accurately predict self-reported friendships [36]. Moreover, from a scientific
perspective, interest in mobile-phone studies has been expressed through the emergence of
a literature on mobile-phone networks in which people have studied the strength of social
ties in cross sections of the network [9] and the dynamics of social groups [6,8].
There are also some technical aspects that favor the use of mobile phone records
as a proxy for social interactions. Mobile phone numbers are unlisted, thus knowing them
reveals some sort of social connection between caller and callee. Also, cellular phones
were the most widespread information technology at the time this data was collected, with
a penetration larger than 40% worldwide and close to 100% in developed countries, such as
the one considered in this study. During the same time period, internet penetration was just
over 13% worldwide and 51% for developed countries (MDGS indicators U.N.
social interactions on the population scale. In addition, mobile-phone usage has been
particularly democratic to the extent that it has homogeneously penetrated different social
strata [37].
2.2 Data
Our data consists of 7,948,890 voice calls between 1,950,426 users of a service
consist of ten panels, or data cross-sections, collected between April 15, 2004 and March
31, 2005. Each panel summarizes 15 days of mobile phone calls between the members
serviced by the provider that supplied the data. Not every panel is available (see Table 1),
as this was the way in which data was made available to us. We consider only agents that
made or received at least one call in each panel to avoid dealing with dropouts or new
subscribers. We hereafter assume that at high service penetration levels (~100%) people
serviced by a particular provider are equivalent to a random sample. In our network, nodes
are mobile phone numbers, which we interpret as people, and links are the calls connecting
them.
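
The selection rule described above (keeping only agents that made or received at least one call in every panel) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used on the provider's records: each panel is assumed to be a list of (caller, callee) pairs, and the function names and toy user ids are hypothetical.

```python
Call = tuple[str, str]  # (caller, callee); hypothetical anonymized user ids


def users_active_in_all_panels(panels: list[list[Call]]) -> set[str]:
    """Return the users who made or received at least one call in every panel."""
    if not panels:
        return set()
    active_per_panel = [{user for call in panel for user in call} for panel in panels]
    return set.intersection(*active_per_panel)


def restrict_to_users(panels: list[list[Call]], keep: set[str]) -> list[list[Call]]:
    """Keep only the calls whose caller and callee both belong to `keep`."""
    return [[(a, b) for a, b in panel if a in keep and b in keep] for panel in panels]


# Toy data: user "d" shows up in the second panel only and is dropped.
panels = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("b", "c"), ("c", "d")],
    [("b", "a"), ("c", "a")],
]
keep = users_active_in_all_panels(panels)
filtered = restrict_to_users(panels, keep)
print(sorted(keep))
```

Restricting the panels in a single pass is a simplification: on real data, removing a user can leave one of that user's contacts without calls in some panel, so the filter may need to be repeated until the set of users stops changing.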
TABLE 1 DATA PANELS AVAILABLE FOR THIS STUDY.
2.3 The Persistence of Ties
We measure the stability of social ties across time as the number of panels in which
a link is observed, over the total number of data panels available. We denote this measure as
$$P_{ij} = \frac{\sum_{T} A_{ij}(T)}{M}, \qquad (1)$$
where Aij(T) is 1 if nodes i and j communicated in panel T and 0 otherwise, and M is the
total number of panels. Persistence is the probability of observing a tie when observing a
network for a given time. Because of the discrete nature of panel data, our definition of
persistence has a resolution that depends on the panel's duration. For example, if we consider
panels with a duration comparable to that of links (~minutes in the case of phone calls),
our definition of persistence gives us the number of times a tie or connection appeared.
When we instead consider panels lasting considerably longer than the typical duration of a
link, our definition of persistence captures the stability of a link on a larger, coarse-grained
temporal scale. Our data set consists of 10 panels, each summarizing 15 days of voice call
activity. Thus in this study we measure persistence on a monthly to yearly time scale.
We illustrate our definition of persistence using four different panels of a five node
network (Fig. 1 a). In this example, the link between nodes 2 and 4 is present in all panels
while the one between nodes 1 and 2 is present only in half of them. We say that the
persistence of the link between nodes 2 and 4 is 4/4 while the persistence of the link
connecting nodes 1 and 2 is 2/4. Each panel gives a binary representation of the network,
where a link is either present or not. Our definition of persistence summarizes the dynamics
of all binary panels by assigning a weight to each link. Thus, persistence is a change of
representation that allows us to map many network panels into a single weighted network
(Fig. 1 b).
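
As a minimal sketch of Eq. (1), the snippet below collapses a list of binary panels into a single weighted network. It assumes that links are undirected and that each panel is a list of node pairs; the function name is hypothetical, and apart from the links between nodes 2 and 4 and between nodes 1 and 2, the links in the toy panels are made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def persistence(panels):
    """Map each undirected link to (panels containing it) / (total panels)."""
    total = len(panels)
    counts = Counter()
    for panel in panels:
        # A link counts at most once per panel, however many calls it carried.
        links = {frozenset(pair) for pair in panel}
        counts.update(links)
    return {link: Fraction(n, total) for link, n in counts.items()}


# Four panels of a five node network: link 2-4 appears in every panel,
# link 1-2 appears in half of them.
panels = [
    [(1, 2), (2, 4), (3, 5)],
    [(2, 4), (4, 5)],
    [(1, 2), (2, 4)],
    [(2, 4), (3, 5)],
]
P = persistence(panels)
print(P[frozenset((2, 4))], P[frozenset((1, 2))])
```

Using exact fractions keeps the result readable as "number of panels out of M"; floating point values would serve equally well for large networks.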
Our measure of persistence weakly increases with the number of times a link is
However, given that we measure whether the link is observed in N>2 panels, it will not
describe a link dichotomously as stable or unstable, but will give the degree of stability 1/N
Figure 1 Definition of Persistence. a Four panels of a five node network in which not all
links are equally persistent. b Persistence representation of the four panels presented in a.
Persistence is a tie attribute that can be defined for a particular node as the average
$$p_i = \frac{1}{k_i} \sum_{j} P_{ij}, \qquad (2)$$
where ki is the degree, or number of connections of the ith node. We will use this
Our definition of persistence has limitations. One could claim we are unfairly
punishing newly formed links. An alternative strategy would be to consider only the links
involved in the first panel; however, an exercise along these lines showed us that there is a strong
selection bias towards stable links when we consider such an option. For example, links
appearing only once, in the second to tenth panel, will not be considered if we set our
benchmark on the first panel only. Our definition also does not differentiate between links
active half of the time throughout the year and those active only during one particular half of the year. We do not
propose our measure as the ultimate way to reduce a set of network panels into a weighted
network, but as a simple way to do so, allowing us to characterize to first approximation the
Figure 2a shows the persistence histogram for the voice call network. The
distribution is bimodal, meaning that ties tend to be either active most of the time or rarely
where stable ties compose a person's social core and unstable ties connect people to the
Figure 2 Persistence across a cellular phone network a Distribution of persistence for all
links b Fraction of surviving ties as a function of time. The inset shows the same plot in a
double logarithmic scale. The continuous line is t^{-1/4}.
(Fig. 2b), in agreement with the 4-year study performed by Burt [40]. The fact that the
survival probability of a tie can be approximated by ~t^{-α} with α = 0.25 ± 0.07 indicates that a
great number of ties disappear quickly, while others tend to stay for very long periods of
time. On average, less than 40% of the ties are conserved after 15 days. After this initial
drop, however, ties disappear slowly, allowing more than 20% of the ties to remain after a
year. We note that the discreteness and sparseness of our data do not allow us to prove
that tie decay follows a power-law. Yet the graphical analysis of Figure 2 b can be
direction.
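
The kind of graphical check discussed above can be sketched in a few lines. The snippet assumes that a tie "survives" at a later panel if it is observed again in that panel, and it estimates α as the negative slope of a straight line fitted in double logarithmic scale; both the function names and the synthetic numbers are illustrative and are not taken from the data. Consistent with the caveat above, a log-log least-squares fit is only a rough indication and not a test of power-law behavior.

```python
import numpy as np


def surviving_fraction(panels):
    """Fraction of first-panel links that are observed again in each later panel."""
    first = {frozenset(pair) for pair in panels[0]}
    return [
        len(first & {frozenset(pair) for pair in later}) / len(first)
        for later in panels[1:]
    ]


def power_law_exponent(times, fractions):
    """Slope of log(fraction) against log(time); returns alpha in t**(-alpha)."""
    slope, _intercept = np.polyfit(np.log(times), np.log(fractions), 1)
    return -slope


# Toy panels: four links in the first panel, two and then one observed later.
toy_panels = [[(1, 2), (2, 3), (3, 4), (4, 5)], [(1, 2), (2, 3)], [(2, 1)]]
toy_survival = surviving_fraction(toy_panels)

# Synthetic survival curve that follows t**(-1/4) exactly (time in months).
times = np.array([0.5, 1.0, 3.0, 6.0, 9.0, 12.0])
fractions = 0.4 * times ** -0.25
alpha = power_law_exponent(times, fractions)
print(toy_survival, round(float(alpha), 3))
```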
2.5 Network Structure and the Persistence of Ties
connections up to 3 links from a randomly chosen user. Although this example shows less
than 0.0008% of our network, it visually summarizes the correlations between
particular, we find that these temporal attributes correlate with topological variables such
as the number of connections or degree of a node ki, the average reciprocity of a node r
(fraction of ties containing both incoming and outgoing calls) and the clustering
$$C_i = \frac{2\Delta}{k_i (k_i - 1)}, \qquad (3)$$
where Δ is the number of triads, or fully connected triangles, in which the node is
categories revealing that persistent links represent a large fraction of the connections for
low degree nodes while transient links are more common for large degree nodes. The
number of persistent ties grows, however, as a function of degree, meaning that although
on average the persistence of high degree nodes is lower, in absolute terms their core is
Figure 3 d shows the distribution of persistence divided by clustering coefficient
categories, indicating that highly clustered nodes tend to have relatively large cores. In the
core-periphery context, this means that persevering nodes are located in dense parts of the
social network (Fig. 3a I) while those in sparser parts tend to have non-persistent ties acting
as bridges which intermittently connect different parts of the network (Fig. 3a II). Finally,
we split the distribution of persistence by reciprocity (Fig. 3e) and observe that nodes
Figure 3 Network structure and the persistence of ties a A fragment of the network extracted by considering up to the second
neighbor of a randomly chosen node (indicated by a black arrow). (b-e summarize statistics for the entire network) b Distribution
of persistence divided into nine degree categories c Number of persistent links defined as those with a persistence of, from top to
bottom: 6/10, 7/10, 8/10, 9/10 and 10/10. d Distribution of persistence divided into nine clustering categories. e Distribution of
persistence divided into five different reciprocity segments.
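
The three node-level variables used in this section (degree, clustering coefficient as in Eq. (3), and average reciprocity) can be computed from a list of directed calls as sketched below. This is an illustrative sketch under assumptions: calls are (caller, callee) pairs, ties are treated as undirected when counting neighbors and triangles, nodes with fewer than two neighbors are assigned a clustering of zero, and all function names are hypothetical.

```python
from itertools import combinations


def build_neighbors(calls):
    """Undirected neighbor sets built from directed (caller, callee) pairs."""
    nbrs = {}
    for a, b in calls:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return nbrs


def clustering(nbrs, i):
    """C_i = 2 * Delta / (k_i * (k_i - 1)), with Delta the triangles through i."""
    k = len(nbrs[i])
    if k < 2:
        return 0.0  # convention assumed here; Eq. (3) is undefined for k_i < 2
    delta = sum(1 for u, v in combinations(nbrs[i], 2) if v in nbrs[u])
    return 2 * delta / (k * (k - 1))


def reciprocity(calls, nbrs, i):
    """Fraction of node i's ties that carry calls in both directions."""
    directed = set(calls)
    mutual = sum(1 for j in nbrs[i] if (i, j) in directed and (j, i) in directed)
    return mutual / len(nbrs[i])


calls = [("a", "b"), ("b", "a"), ("a", "c"), ("b", "c"), ("a", "d")]
nbrs = build_neighbors(calls)
k_a = len(nbrs["a"])                  # degree of node "a"
c_a = clustering(nbrs, "a")           # one triangle (a, b, c) out of three pairs
r_a = reciprocity(calls, nbrs, "a")   # only the tie with "b" is reciprocal
print(k_a, c_a, r_a)
```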
2.6 Multivariate Analysis
effect of three single structural variables and found that persistence depends monotonically
on all of them (degree, clustering coefficient and reciprocity). The observed correlations
however, might well be redundant. To check if this is the case we perform a multivariate
analysis to quantify the effect of each of these variables on the persistence of ties. Because
of the large number of observations considered (∼ 2 million nodes, ∼ 8 million ties) the
confidence intervals of the regressions do not spread far from the predicted values. Hence
we concentrate our discussion on the relative magnitude of the effects rather than on their
significance.
agents that have a similar degree [41,42]. It is not known however, whether links connecting
same-degree agents tend to be more stable than those connecting different-degree agents.
To study this effect we performed a regression in which we study the persistence of a link
as a function of the difference in degree between the nodes at the ends of each
link. Furthermore, we also include in the regression the difference in clustering and
two link attributes, the reciprocity of links R (was there ever a panel in which caller and
callee reciprocally called each other?) and the topological overlap (TO) associated with
, (4)
where Oij is the number of neighbors that agents i and j have in common and ki and
indicating the number of neighbors shared by two nodes at the ends of a link.
correlation between different variables. The technique is an extension of the bi-variate case
We can illustrate how multiple regression works by using an example with two
explanatory variables (x1 and x2) and one dependent variable (y). Multiple regression
analysis is based on linear regression, as many other functional forms can be linearized by
$$y = B_1 x_1 + B_2 x_2 + A, \qquad (5)$$
where B1 and B2 are the regression coefficients and A is the intercept: y(x1=0,x2=0).
From the definition of the correlation coefficient we can interpret B1 and B2 as the change
$$r_{12} = \frac{\mathrm{cov}(x_1, x_2)}{\sigma_{x_1} \sigma_{x_2}}, \qquad (6)$$
where cov is the covariance and σ stands for the standard deviation. We also use (6)
to calculate the correlation between y, x1 and x2, which we denote ry1 and ry2 respectively.
$$pr_1 = \left(R^2 - r_{y2}^2\right)^{1/2}, \qquad (8)$$
$$pr_2 = \left(R^2 - r_{y1}^2\right)^{1/2}, \qquad (9)$$
and indicate the amount of variance in y explained by each one of these variables.
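
The two-variable example can be made concrete with a short numerical sketch. It fits Eq. (5) by ordinary least squares and then evaluates Eqs. (8) and (9), reading them as the square root of R^2 minus the squared correlation between y and the other variable; that reading, the function names and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = B1*x1 + B2*x2 + A; returns (B1, B2, A, R2)."""
    design = np.column_stack([x1, x2, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    r2 = 1.0 - residual.var() / y.var()
    return coef[0], coef[1], coef[2], r2


def unique_contributions(x1, x2, y):
    """Return (pr1, pr2) = sqrt(R2 - r_y2**2), sqrt(R2 - r_y1**2)."""
    *_, r2 = fit_two_predictors(x1, x2, y)
    ry1 = np.corrcoef(y, x1)[0, 1]
    ry2 = np.corrcoef(y, x2)[0, 1]
    # max(..., 0) guards against tiny negative values caused by rounding.
    return np.sqrt(max(r2 - ry2**2, 0.0)), np.sqrt(max(r2 - ry1**2, 0.0))


# Synthetic data with correlated explanatory variables.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.5 * x1 + rng.normal(size=500)
y = 2.0 * x1 - 1.0 * x2 + 3.0 + 0.1 * rng.normal(size=500)

b1, b2, a, r2 = fit_two_predictors(x1, x2, y)
pr1, pr2 = unique_contributions(x1, x2, y)
print(b1, b2, a, r2, pr1, pr2)
```

Because x1 and x2 are correlated, part of the variance they explain is shared; pr1 and pr2 isolate what each variable adds once the other is already in the model, which is the sense in which redundant correlations are separated in the analysis that follows.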
We use multiple regression analysis to show that R, TO, ΔC, Δk and ΔR explain
40% of the variance in persistence (Table 2, R² =
0.397). The contribution of each one of them can be isolated by considering the partial
regression coefficients [43], which are a way to quantify how much of the variance is
explained by each one of the covariates used in a regression. This technique shows that
assortative mixing is not associated with the persistence of ties. Whereas the reciprocity of
TABLE 2 PERSISTENCE OF TIES AND LINK ATTRIBUTES
In the previous section we showed that high degree agents had on average less
persistent ties than low degree agents. We also saw that highly clustered agents tended to
have a larger number of persistent connections and that reciprocal ties tend to be more
persistent on average. Again, we explore the redundancy of such statements using linear
regression and split the contribution to perseverance from each of these variables by
calculating their partial correlations (Table 1). Together, these variables explain almost
50% of the variance in perseverance (R² = 0.49). Their contributions are quite uneven,
however. When we look at the partial correlation coefficients extracted from our linear
model we find that most correlations vanish and the biggest contribution to perseverance is
given by the average reciprocity r of an agent's ties, which explains 27% of the variance.
The negative effect of degree on the persistence of an agent's ties is still present, but greatly
attenuated. This means that high degree agents that reciprocate their ties have more
persistent ties as well. The negative effect of an agent's degree on the persistence of its ties
is in large part explained by the fact that high degree agents tend to reciprocate less of their
ties. Similarly, the clustering coefficient C, which appeared as the strongest predictor in the
bi-variate case, explains only 6% of the variance when reciprocity and degree are taken
into account. This shows that cliques are formed by reciprocal ties minimizing the
We finish our discussion by asking: How well can we predict the stability of ties
variable and is not constrained to correlate with space-like, horizontal variables. As we saw
from our multivariate analysis, the information carried by structural variables can be
redundant [44,45]; thus it is important to take into account their correlations to unveil their
real contribution to the persistence of ties. Can we use this information to predict which ties
persist in time? To answer this question we looked at our first data panel and used different
criteria to predict which ties will be stable. We then looked at the fraction of these ties
appearing after 1, 3, 6, 9 and 12 months and gauged the accuracy of our predictions by
$$\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad (10)$$
where TP is the number of true positives and FP is the number of false positives.
We begin by testing the prediction that all ties observed as reciprocal in the first
panel will be conserved in the future. For this hypothesis, the PPV ranges from 70% after
one month to 43% after a year (Fig 4 a). For comparison, we picked a random set of ties
and found a PPV of 35% after a month and 20% after a year.
consider all reciprocal links that also have a topological overlap TO ≥ 0.01 we
improve the PPV of our prediction by 5%, while an even more stringent criterion based on
a TO ≥ 0.1, gives us an extra percent that allows us to predict with a PPV larger than 50%
Figure 4 Predicting future ties. a, Accuracy of tie prediction by randomly choosing ties
(orange), choosing reciprocal ties (red), reciprocal ties with a T.O.>0.01 (green), reciprocal
ties with a T.O.>0.01, ties with a T.O.>0.01 (blue) and a T.O.>0.1 (purple). b, Sensitivity of
the predictive methods presented in a, using the same color scheme.
The increase in accuracy brought by more stringent criteria reduces the number of
links predicted to be persistent. Thus the sensitivity of our method, defined as:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \tag{11}$$
where FN is the number of false negatives, decreases with the stringency of the
criteria used but increases with time (Fig. 4 b). Hence there is a tradeoff between the
accuracy of our prediction and the number of predictions we can make. Using the simple
more accurate predictions can be made only if we accept a reduction in the number of
only one. The fact that the variance explained by other structural variables was redundant
with that explained by reciprocity allows us to use other structural variables as alternative
predictors of a tie. Figure 4 a also shows the PPV obtained when we use topological
overlap as our only predictive criterion. In this case we see that although the accuracy is
lower, it is still significantly better than random. Thus the redundancy observed in the
system can be turned into a predictive advantage and in the absence of information about
the reciprocity of links we can use redundant measures to make good educated guesses
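The two measures used in this section, the positive predictive value and the sensitivity, can be computed directly from sets of ties. The following is a minimal sketch, not the code used in the study: the function names and the toy tie lists are hypothetical, and ties are assumed to be represented as (caller, callee) pairs.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: fraction of predicted ties that persisted."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else float("nan")


def sensitivity(true_positives, false_negatives):
    """Sensitivity: fraction of persistent ties that were predicted."""
    persistent = true_positives + false_negatives
    return true_positives / persistent if persistent else float("nan")


def score_prediction(predicted_ties, persistent_ties):
    """Score a set of predicted ties against the ties observed in a later panel."""
    predicted, persistent = set(predicted_ties), set(persistent_ties)
    tp = len(predicted & persistent)   # predicted and observed later
    fp = len(predicted - persistent)   # predicted but not observed later
    fn = len(persistent - predicted)   # observed later but not predicted
    return ppv(tp, fp), sensitivity(tp, fn)


# Hypothetical toy data: four predicted ties, three ties present in the later panel.
predicted = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
persistent = {("a", "b"), ("b", "c"), ("e", "f")}
toy_ppv, toy_sensitivity = score_prediction(predicted, persistent)
```

In this toy case two of the four predicted ties persist (PPV = 0.5) and two of the three persistent ties were predicted (sensitivity = 2/3). A stricter criterion would shrink the predicted set, which tends to raise the first number and lower the second.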
2.7.1 Discussion
We have defined and measured the persistence of ties in a one year period using 10
panels of data summarizing the activity of all voice calls carried by a mobile phone carrier
from an industrialized country. We showed that the persistence of ties and perseverance of
nodes depend on topological variables (degree, clustering, reciprocity and topological
overlap). In our study, topological variables explain almost half of the variance in
persistence. The stability of social ties is likely a behavioral attribute; thus, it is not
surprising that the local structure of the social network, which is likely also a result of social
and coordinated consumption [48]. But not all social connections are equally important;
some ties are stronger than others [49]. The strength of a social tie is not an absolute
measure. Hence there is a need to quantify the strength of ties using ad-hoc measures.
Persistence is a way to quantify the temporal stability of ties, and therefore their strength, along
one of the many dimensions in which tie strength can be quantified. As longitudinal
data becomes available, methods like the one introduced here can be used to quantify the
The relationships shown here demonstrate that the temporal dynamics of social
interactions are intrinsically coupled to the social network structure in such a way that the
existence of a tie can be predicted, with a respectable accuracy, using a simple criterion.
CHAPTER 3: UNDERSTANDING HUMAN MOBILITY PATTERNS
3.1 Introduction
Despite their importance for urban planning [50], traffic forecasting [51], and the
spread of biological [52,53,54] and mobile viruses [55], our understanding of the basic
laws governing human motion remains limited owing to the lack of tools to monitor the
time resolved location of individuals. Here we study the trajectory of 100,000 anonymized
mobile-phone users whose position is tracked for a six-month period. We find that in
contrast with the random trajectories predicted by the prevailing Lévy-flight and
random-walk models [56] (see Box 1), human trajectories show a high degree of temporal
characteristic length scale and a significant probability to return to a few highly frequented
locations. After correcting for differences in travel distances and the inherent anisotropy of
each trajectory, the individual travel patterns collapse into a single spatial probability
distribution, indicating that despite the diversity of their travel history, humans follow
simple reproducible patterns. This inherent similarity in travel patterns could impact all
Given the many unknown factors that influence a population's mobility patterns,
ranging from means of transportation to job and family imposed restrictions and priorities,
human trajectories are often approximated with various random walk or diffusion models
[56,57]. Indeed, early measurements on albatrosses, bumblebees, deer and monkeys [58,59]
and more recent ones on marine predators [60] suggested that an animal trajectory can be
approximated by a Lévy flight [61,62], a random walk whose step size Δr follows a
power-law distribution P(Δr) ~ Δr^{-α} with α < 3. While the Lévy statistics for some
animals require further study [63], Brockmann et al. [56] generalized this finding to humans,
half-million bank notes is fat tailed. Given that money is carried by individuals, bank-note
dispersal is a proxy for human movement, suggesting that human trajectories are best
modelled as a continuous time random walk with fat tailed displacements and waiting time
distributions [56]. A particle following a Lévy flight has a significant probability to travel
very long distances in a single step [61,62], which appears to be consistent with human travel
patterns: most of the time we travel only over short distances, between home and work,
Lévy-Flights (LF): A Lévy flight is a type of random walk in which the size of the
Truncated Lévy Flight (TLF): A truncated Lévy flight is a random walk in which
is a probability distribution that has infinite variance. One of the most common forms it
Each pair of consecutive sightings of a bank note reflects the composite motion of two or
more individuals who owned the bill between the two reported sightings. Thus, it is not clear
if the observed distribution reflects the motion of individual users, or some hitherto
trajectories. In contrast to bank notes, mobile phones are carried by the same individual
during his/her daily routine, offering the best proxy to capture individual human
trajectories [8,9,10,64,65].
We used two data sets to explore the mobility pattern of individuals. The first (D1)
consists of the mobility patterns recorded over a six-month period for 100,000 individuals
selected randomly from a sample of over 6 million anonymized mobile-phone users. Each
time a user initiates or receives a call or text message, the location of the tower routing the
(Figure 6 a and b). The time between consecutive calls follows a bursty pattern [66] (Figure
5) indicating that while most consecutive calls are placed soon after a previous call,
occasionally there are long periods without any call activity. To make sure that the
obtained results are not affected by the irregular call pattern, we also study a data set (D2)
that captures the location of 206 mobile-phone users, recorded every two hours for an
entire week. In both datasets the spatial resolution is determined by the local density of the
more than 10^4 mobile towers, registering movement only when the user moves between
3.2 Source Data
The D1 dataset was collected by a European mobile phone carrier for billing and
operational purposes. It contains the date, time and coordinates of the phone tower routing
the communication for each phone call and text message sent or received by 6 million
user is identified with a security key (hash code). Furthermore, we only know the
coordinates of the tower routing the communication, hence a user's location remains
unknown within a tower's service area. Each tower serves an area of approximately 3 km².
Due to tower coverage limitations driven by geographical constraints and national frontiers
The research was performed on a random set of 100,000 users selected from those
making or receiving at least one phone call or SMS during the first and last month of the
study, translating to 16,364,308 recorded positions. We removed all jumps that took users
outside the continental territory. We did not impose any additional criterion regarding the
The D2 dataset was collected for the operation of some services provided by the
mobile phone carrier, like pollen and traffic forecasts, which rely on the approximate
knowledge of customers' locations at all times of the day. For customers who signed up for
location dependent services, the date, time and the closest tower coordinates are recorded
on a regular basis, independent of their phone usage. We were provided such records for
1,000 users, among which we selected the group of users whose coordinates were recorded
every two hours during an entire week, resulting in 206 users for which we have 10,613
recorded positions. Given that these users were selected based on their actions (signed up
to the service), in principle the sample cannot be considered unbiased, but we have not
Figure 5 Interevent time distribution P(ΔT) of calling activity. ΔT is the time elapsed
between consecutive communication records (phone calls and SMS, sent or received) for
the same user. Different symbols indicate the measurements done over groups of users
with different activity levels (# calls). The inset shows the unscaled version of this plot
For each user in D1 and D2 we sorted the time resolved sequence of positions and
measured the distance between the user's positions at consecutive calls, capturing 16,264,308
displacements for the D1 and 10,407 displacements for the D2 datasets. We find that the
$$P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta} \exp(-\Delta r/\kappa), \tag{12}$$
with β = 1.75 ± 0.15, Δr0 = 1.5 km and cut-off values κ|D1 = 400 km and κ|D2 = 80 km
(Figure 6 c). Note that the observed scaling exponent is not far from β = 1.59 observed in
Ref. [56] for bank-note dispersal, suggesting that the two distributions may capture the
Equation (12) suggests that human motion follows a truncated Lévy flight [56]. Yet,
the observed shape of P(Δr) could be explained by three distinct hypotheses: A. Each
individual follows a Lévy trajectory with jump size distribution given by (12). B. The
with individual Lévy trajectories, hence (12) represents a convolution of hypotheses A and
B.
$$r_g^a(t) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ (x_i - x_{cm})^2 + (y_i - y_{cm})^2 \right]}, \tag{13}$$
where xcm and ycm are the coordinates of the centre of mass defined by a user's
positions and the sum goes over all positions (N) recorded for a user. The radius of gyration
can be interpreted as the typical distance travelled by user a when observed up to time t
(Figure 6 b). Next, we determined the radius of gyration distribution P(rg) by calculating rg
for all users in samples D1 and D2, finding that they also can be approximated with a
truncated power-law
$$P(r_g) = (r_g + r_g^0)^{-\beta_r} \exp(-r_g/\kappa), \tag{14}$$
with rg0 = 5.8 km, βr = 1.65 ± 0.15 and κ = 350 km. Lévy flights are characterized
by a high degree of intrinsic heterogeneity, raising the possibility that (14) could emerge
determined P(rg) for an ensemble of agents following a Random Walk (RW), Lévy-Flight
walkers was normalized such that the mean of the distribution matches that observed in our
data, whereas the ensemble of Lévy-Flight walkers had steps drawn from a distribution
with the same exponent as that found in (12). The steps of the Truncated Lévy-Flight
heterogeneity in rg, yet is not sufficient to explain the truncated power-law distribution
P(rg) exhibited by the mobile-phone users. Taken together, Figure 6 c and d suggest that the
difference in the range of typical mobility patterns of individuals (rg) has a strong impact
should increase in time as rg(t) ~ t^{3/(2+β)} [67,68] while for a RW rg(t) ~ t^{1/2}. That is, the longer
we observe a user, the higher the chances that she/he will travel to areas not visited before.
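The radius of gyration of Eq. (13), and its growth as more positions are observed, can be computed along the following lines. This is a minimal sketch under stated assumptions: positions are taken to be planar (x, y) coordinates in kilometres (a projection from tower latitude and longitude is assumed to have been done beforehand), and the function names and the toy trajectory are hypothetical.

```python
import numpy as np


def radius_of_gyration(positions):
    """Root-mean-square distance of recorded positions from their centre of mass."""
    positions = np.asarray(positions, dtype=float)
    centre_of_mass = positions.mean(axis=0)
    squared_distances = ((positions - centre_of_mass) ** 2).sum(axis=1)
    return float(np.sqrt(squared_distances.mean()))


def radius_of_gyration_vs_time(positions):
    """r_g computed from the first n recorded positions, for n = 1 ... N."""
    positions = np.asarray(positions, dtype=float)
    return np.array([radius_of_gyration(positions[:n])
                     for n in range(1, len(positions) + 1)])


# Hypothetical trajectory (km): a user alternating between two places 6 km apart.
commuter = np.array([[0.0, 0.0], [6.0, 0.0]] * 10)
rg_commuter = radius_of_gyration(commuter)        # 3 km: half the commute length
rg_growth = radius_of_gyration_vs_time(commuter)  # starts at 0 and stops growing
```

For this bounded toy trajectory rg(t) saturates after a few records, which is the qualitative behaviour that distinguishes a user returning to the same places from a random walker whose rg keeps growing.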
To check the validity of these predictions we measured the time dependence of the radius
of gyration for users whose gyration radius would be considered small (rg(T) ≤ 3 km),
medium (20 < rg(T) ≤ 30 km) or large (rg(T) > 100 km) at the end of our observation period
(T = 6 months). The results indicate that the time dependence of the average radius of
gyration of mobile phone users is better approximated by a logarithmic increase, not only a
manifestly slower dependence than the one predicted by a power law, but one that may
Figure 6 Basic human mobility patterns. a, Week-long trajectory of 40 mobile phone users
indicate that most individuals travel only over short distances, but a few regularly move
over hundreds of kilometres. Panel b displays the detailed trajectory of a single user. The
different phone towers are shown as green dots, and the Voronoi lattice in grey marks the
approximate reception area of each tower. The dataset studied by us records only the
identity of the closest tower to a mobile user, thus we cannot identify the position of a user
within a Voronoi cell. The trajectory of the user shown in b is constructed from 186
two-hourly reports, during which the user visited a total of 12 different locations (tower
vicinities). Among these, the user is found on 96 and 67 occasions in the two most preferred
locations, the frequency of visits for each location being shown as a vertical bar. The circle
represents the radius of gyration centred on the trajectory's centre of mass. c, Probability
density function P(Δr) of travel distances obtained for the two studied datasets D1 and D2.
The solid line indicates a truncated power law whose parameters are provided in the text
(see Eq. 12). d, The distribution P(rg) of the radius of gyration measured for the users, where
rg(T) was measured after T = 6 months of observation. The solid line represents a similar
truncated power law fit (see Eq. 14). The dotted, dashed and dot-dashed curves show P(rg)
obtained from the standard null models (RW, LF and TLF), where for the TLF we used the
same step size distribution as the one measured for the mobile phone users.
months, and measured the jump size distribution P(Δr|rg) for each group. As the inset of
Figure 8 b shows, users with small rg travel mostly over small distances, whereas those
with large rg tend to display a combination of many small and a few larger jump sizes.
Once we rescale the distributions with rg (Figure 8 b), we find that the data collapses into a
single curve, suggesting that a single jump size distribution characterizes all users,
independent of their rg. This indicates that P(Δr|rg) ~ rg^{-α} F(Δr/rg), where α ≈ 1.2 ± 0.1 and
F(x) is an rg-independent function with asymptotic behavior F(x < 1) ~ x^{-α} and rapidly
decreasing for x >> 1. Therefore the travel patterns of individual users may be
however, is the fact that the individual trajectories are bounded beyond rg, thus large
displacements which are the source of the distinct and anomalous nature of Lévy flights,
are statistically absent. To understand the relationship between the different exponents, we
which suggests that up to the leading order we have β=βr+α-1, consistent, within error
bars, with the measured exponents. This indicates that the observed jump size distribution
P(Δr) is in fact the convolution between the statistics of individual trajectories P(Δr|rg) and
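The relation between the three exponents can be recovered with a short calculation. The sketch below is an illustration under simplifying assumptions that are not taken from the text: pure power-law forms are used, so the offset rg0 and the exponential cut-off of Eq. (14) are dropped, and the population average is written as an integral over rg.

```latex
\begin{aligned}
P(\Delta r) &= \int_0^{\infty} P(\Delta r \mid r_g)\, P(r_g)\, dr_g
  \;\sim\; \int_0^{\infty} r_g^{-\alpha}\, F\!\left(\frac{\Delta r}{r_g}\right) r_g^{-\beta_r}\, dr_g \\
 &= \Delta r^{\,1-\alpha-\beta_r} \int_0^{\infty} x^{\alpha+\beta_r-2}\, F(x)\, dx
  \qquad \left(x = \Delta r / r_g\right) \\
 &\sim \Delta r^{-(\beta_r+\alpha-1)} .
\end{aligned}
```

The remaining integral is a constant: near x = 0 the integrand behaves as x^{βr-2}, which is integrable for βr > 1, and F(x) decays rapidly for x >> 1. With the measured values, βr + α - 1 = 1.65 + 1.2 - 1 = 1.85, to be compared with β = 1.75 ± 0.15.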
3.4 Testing the power-law curve fits
We tested whether the empirical data could come from the fitted distributions by
performing a stringent variant of the Kolmogorov-Smirnov (KS) goodness of fit test [69].
The KS statistic is a simple way to test whether two distributions are the same. In this
case, we use it to test the hypothesis: Could the empirically observed distributions come
from the distribution found as its best fit? To test this we generated synthetic data
from the fitted distribution and then used the KS test to see whether our data behaves
as well as synthetic data generated from the fitted distribution. We use two variants of the
KS statistic to compare empirical data with the fitted distribution and synthetic data with
the fitted distribution. The first is the standard KS statistic, given by:
$$KS = \max_x |F(x) - P(x)|,$$
where F is the cumulative distribution of the best fit and P is the cumulative
distribution of the empirical or synthetic data. The regular KS statistic is not very sensitive
at the edges of the distribution. Hence, we also use the weighted KS statistic defined as:
To test whether the empirical data behaves as well as the synthetic data we
calculated the KS and KSW statistics between the empirical data and its best fit and
compared these values with those obtained by calculating KS and KSW for 1,000 sets of
synthetic data generated from the best fit. If the values obtained for KS and KSW for the
empirical data are as good as or better than those obtained for the synthetic data, then we
can conclude that the empirical data is statistically consistent with its best fit. The results of
the KS test can be summarized using a p-value by integrating the distribution of KS values
generated with the synthetic data from the value representing the empirical distribution.
When integrating such distributions from left to right we can interpret the p-value as the
probability that the observed data was the result of its best fit. A p-value close to 1
indicates that the empirical distribution matches its best fit as well as synthetic data
generated from the fit itself [69], whereas a relatively small p-value (typically taken as p < 0.01)
would suggest that the empirical distribution cannot be the result of its best fit.
The p-values for the KS tests can be read from the caption of Figure 7.
Figure 7 Kolmogorov-Smirnov goodness of fit test. The figures compare the KS and KSW
statistics with those of 1,000 sets of synthetic data coming from the same distribution. a, Red
line indicates the KS value for Figure 6 c D1 (p(KS)=1). b, Red line indicates the KSW value
for Figure 6 c D1 (p(KSW)=1). c, Red line indicates the KS value for Figure 6 d D1
(p(KS)=0.62). d, Red line indicates the KSW value for Figure 6 d D1 (p(KSW)=0.82).
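The procedure described in this section can be sketched as follows. This is an illustration, not the code used in the study. Several pieces are assumptions: the weighted statistic is taken to divide the CDF gap by sqrt(F(1-F)), a tail-emphasising weighting that may differ from the definition used here; the p-value is taken as the fraction of synthetic data sets whose statistic is at least as large as the empirical one; and an exponential distribution stands in for the fitted truncated power law purely to keep the example short. All function names are hypothetical.

```python
import numpy as np


def ks_statistics(sample, cdf):
    """Return the standard and a weighted KS distance between a sample and a CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    fitted = cdf(x)
    # Empirical CDF just after and just before each sorted data point.
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    gap = np.maximum(np.abs(ecdf_hi - fitted), np.abs(ecdf_lo - fitted))
    ks = float(gap.max())
    # ASSUMPTION: weight the gap by 1/sqrt(F(1-F)) to emphasise the tails.
    weight = np.sqrt(np.clip(fitted * (1.0 - fitted), 1e-12, None))
    ksw = float((gap / weight).max())
    return ks, ksw


def synthetic_p_values(sample, cdf, sampler, n_sets=1000, seed=0):
    """Fraction of synthetic data sets whose KS / KSW is at least the empirical one."""
    rng = np.random.default_rng(seed)
    ks_emp, ksw_emp = ks_statistics(sample, cdf)
    synthetic = np.array([ks_statistics(sampler(rng, len(sample)), cdf)
                          for _ in range(n_sets)])
    p_ks = float(np.mean(synthetic[:, 0] >= ks_emp))
    p_ksw = float(np.mean(synthetic[:, 1] >= ksw_emp))
    return p_ks, p_ksw


# Hypothetical stand-in for the fitted distribution: an exponential with mean 1.
def exp_cdf(x):
    return 1.0 - np.exp(-x)


def exp_sampler(rng, size):
    return rng.exponential(1.0, size)


rng = np.random.default_rng(1)
consistent = rng.exponential(1.0, 500)      # drawn from the "fit" itself
inconsistent = rng.uniform(0.0, 2.0, 500)   # clearly not exponential
p_consistent = synthetic_p_values(consistent, exp_cdf, exp_sampler, n_sets=200)
p_inconsistent = synthetic_p_values(inconsistent, exp_cdf, exp_sampler, n_sets=200)
```

With this convention a p-value near 1 means the empirical statistic is smaller than almost all synthetic ones, and a p-value near 0 means the data sits further from the fit than synthetic data ever does, as is the case for the uniform sample above.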
3.5 The periodicity of human mobility patterns
each individual Fpt(t) [68], defined as the probability that a user returns to the position
where he or she was first observed after t hours (Figure 8 c). For a two dimensional random walk
Fpt(t) should follow ~ 1/(t ln^2(t)) [68]. In contrast, we find that the return probability is
humans to return to locations they visited before, describing the recurrence and temporal
Figure 8 The bounded nature of human trajectories. a, Radius of gyration rg(t) vs time
for mobile phone users separated in three groups according to their final rg(T), where T = 6
months. The black curves correspond to the analytical predictions for the random walk
models, increasing in time as t^{3/(2+β)} (solid) and t^{1/2} (dotted).
The dashed curves correspond to a logarithmic fit of the form A + B ln(t), where A and B
depend on rg. b, Probability density function of individual travel distances P(Δr|rg) for
users with rg = 4, 10, 40, 100 and 200 km. As the inset shows, each group displays a quite
different P(Δr|rg) distribution. After rescaling the distance and the distribution with rg
(main panel), the different curves collapse. The solid line (power law) is shown as a guide
to the eye. c, Return probability distribution, Fpt(t). The prominent peaks capture the
tendency of humans to regularly return to the locations they visited before, in contrast with
the smooth asymptotic behavior ~1/(t ln^2(t)) (solid line) predicted for random walks. d, A
Zipf plot showing the frequency of visiting different locations. The symbols correspond to
users that have been observed to visit nL = 5, 10, 30, and 50 different locations. Denoting
by L the rank of the location listed in the order of the visit frequency, the data is well
approximated by R(L) ~ L^{-1}. The inset is the same plot in linear scale, illustrating that 40% of
the time individuals are found at their first two preferred locations.
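The rank-frequency curve of panel d can be computed from a user's sequence of visited towers, and the share of time spent at the top-ranked locations under an exact 1/L law can be worked out for comparison. The following is a minimal sketch; the function names and the toy record are hypothetical, and the pure 1/L form is an idealisation of the approximate scaling reported here.

```python
from collections import Counter


def rank_frequency(visited_locations):
    """Fraction of records at each location, ordered from most to least visited."""
    counts = Counter(visited_locations)
    total = sum(counts.values())
    return [count / total for _, count in counts.most_common()]


def zipf_share(n_locations, top=2):
    """Share of time in the `top` ranked locations if P(L) is proportional to 1/L."""
    weights = [1.0 / rank for rank in range(1, n_locations + 1)]
    return sum(weights[:top]) / sum(weights)


# Hypothetical record of tower identifiers for one user.
records = ["home"] * 6 + ["work"] * 3 + ["gym"] * 2 + ["shop"]
frequencies = rank_frequency(records)   # ranks L = 1, 2, 3, 4
shares = {n: zipf_share(n) for n in (5, 10, 30, 50)}
```

Under an exact 1/L law the two most visited locations would account for roughly two thirds of the time for a user with 5 locations and roughly one third for a user with 50, so the 40% figure quoted in the caption lies within the range this idealisation allows.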
To explore if individuals return to the same location over and over, we ranked each
location based on the number of times an individual was recorded in its vicinity, such that a
location with L = 3 represents the third most visited location for the selected individual. We
find that the probability of finding a user at a location with a given rank L is well
approximated by P(L) ~ 1/L, independent of the number of locations visited by the user
(Figure 8 d). Therefore people devote most of their time to a few locations, while spending
their remaining time in 5 to 50 places, visited with diminished regularity. Thus, the
observed logarithmic saturation of rg(t) is rooted in the high degree of regularity in
individuals' daily travel patterns, captured by the high return probabilities (Figure 8 c) to a few highly
individuals live and travel in different regions, yet each user can be assigned to a
well-defined area, defined by home and workplace, where she or he can be found most of
the time. We can compare the trajectories of different users by diagonalizing each
trajectory's inertia tensor, providing the probability of finding a user in a given position
frame for each user. We do this by finding the set of axes in which the inertia tensor defined
(17)
where
(18)
associated with the smaller eigenvalue of the inertia tensor. Thus, we look for a reference
. (19)
(20)
such that II
(21)
(22)
(A). If
and
then
(23)
(B). and
. We select 0 or
Finally, we make a conditional rotation of π to make sure the most frequent position
3.7 The anisotropy of Human Mobility Patterns
reference frame (note the different scales in Figure 9 a). Here we find that the larger an
individual's rg the more pronounced is this anisotropy. To quantify this effect we defined
the anisotropy ratio S ≡ σy/σx, where σx and σy represent the standard deviation of the
trajectory measured in the user's intrinsic reference frame. We find that S decreases
monotonically with rg (Figure 9 c), being well approximated with S ~ rg^{-η}, for η ≈ 0.12.
Given the small value of the scaling exponent, other functional forms may offer an equally
good fit, thus mechanistic models are required to identify if this represents a true scaling
anisotropies, rescaling each user trajectory with its respective σx and σy. The rescaled
different rg, i.e., after the anisotropy and the rg dependence are removed all individuals
evident in Figure 9 d, where we show the cross section Φ(x/σx, 0) for the three groups of
users, finding that apart from the noise in the data the curves are indistinguishable.
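The rotation into a user's intrinsic reference frame and the anisotropy ratio S can be sketched as follows. This is an illustration under assumptions, not the procedure of the study: the inertia tensor is the standard one for unit point masses about the centre of mass, the principal axis is taken as the eigenvector with the smaller eigenvalue, and the final conditional rotation by π is omitted. Function names and the synthetic trajectory are hypothetical.

```python
import numpy as np


def to_intrinsic_frame(positions):
    """Centre a 2-D trajectory and rotate it so its principal axis lies along x."""
    xy = np.asarray(positions, dtype=float)
    xy = xy - xy.mean(axis=0)
    x, y = xy[:, 0], xy[:, 1]
    # 2-D inertia tensor of unit point masses about the centre of mass.
    inertia = np.array([[np.sum(y * y), -np.sum(x * y)],
                        [-np.sum(x * y), np.sum(x * x)]])
    eigenvalues, eigenvectors = np.linalg.eigh(inertia)  # ascending eigenvalues
    principal = eigenvectors[:, 0]  # smaller eigenvalue: direction of largest spread
    theta = np.arctan2(principal[1], principal[0])
    c, s = np.cos(-theta), np.sin(-theta)
    rotation = np.array([[c, -s], [s, c]])               # rotation by -theta
    return xy @ rotation.T


def anisotropy_ratio(positions):
    """S = sigma_y / sigma_x measured in the intrinsic reference frame."""
    rotated = to_intrinsic_frame(positions)
    sigma_x, sigma_y = rotated.std(axis=0)
    return float(sigma_y / sigma_x)


# Hypothetical trajectory (km), elongated along the diagonal of the map.
rng = np.random.default_rng(0)
along = rng.normal(0.0, 10.0, 2000)    # spread along the main direction
across = rng.normal(0.0, 2.0, 2000)    # spread across it
angle = np.pi / 4
trajectory = np.column_stack([along * np.cos(angle) - across * np.sin(angle),
                              along * np.sin(angle) + across * np.cos(angle)])
S = anisotropy_ratio(trajectory)
```

For this synthetic trajectory S comes out close to 0.2, the ratio of the two spreads used to generate it, independently of the orientation of the trajectory on the map.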
Figure 9 The shape of human trajectories. a, The probability density function Φ(x, y) of
finding a mobile phone user in a location (x, y) in the user's intrinsic reference frame (see
SM for details). The three plots, from left to right, were generated for 10,000 users with: rg
≤ 3, 20 < rg ≤ 30 and rg > 100 km. The trajectories become more anisotropic as rg
increases. b, After scaling each position with σx and σy the resulting Φ(x/σx, y/σy) has
approximately the same shape for each group. c, The change in the shape of Φ(x, y) can be
quantified by calculating the anisotropy ratio S ≡ σy/σx as a function of rg, which decreases as S
~ rg^{-0.12} (solid line). Error bars represent the standard error. d, Φ(x/σx, 0) representing the
x-axis cross section of the rescaled distribution Φ(x/σx, y/σy) shown in b.
Taken together, our results suggest that the Lévy statistics observed in bank note
measurements capture a convolution of the population heterogeneity (14) and the motion of
individual users. Individuals display significant regularity, as they return to a few highly
frequented locations, like home or work. This regularity does not apply to the bank notes: a
bill always follows the trajectory of its current owner, i.e. dollar bills diffuse, but humans
do not.
The fact that individual trajectories are characterized by the same rg-independent
Therefore, our results establish the basic ingredients of realistic agent based models,
requiring us to place users in numbers proportional to the population density of a given
region and assign each user an rg taken from the observed P(rg) distribution. Using the
predicted anisotropic rescaling, combined with the density function, we can obtain the
likelihood of finding a user in any location. Given the known correlations between spatial
proximity and social links, our results could help quantify the role of space in network
processes [57,76].
CHAPTER 4: THE PRODUCT SPACE CONDITIONS THE DEVELOPMENT OF
NATIONS
4.1 Introduction
Does the type of product a country exports matter for subsequent economic
industrialization creates 'spill-over' benefits that fuel subsequent growth [77,78,79]. Yet,
lacking formal models, mainstream economic theory has been unable to incorporate these
ideas. Instead, two approaches have been used to explain a country's pattern of
specialization. The first focuses on the relative proportion between productive factors (i.e.
physical capital, labor, land, skills or human capital, infrastructure, and institutions [80]).
Hence, poor countries specialize in goods intensive in unskilled labor and land while richer
capital. The second approach emphasizes technological differences [81] and has to be
complemented with a theory of what underlies them. The varieties and quality ladders
models [82,83] assume that there is always a slightly more advanced product or just a
different one that countries can move to, disregarding product similarities when thinking
Think of a product as a tree and the set of all products as a forest. A country is
composed of a collection of firms, i.e. of monkeys that live on different trees and exploit
those products. The process of growth implies moving from a poorer part of the forest,
where trees have little fruit, to better parts of the forest. This implies that monkeys would
have to jump distances, i.e. redeploy (human, physical and institutional) capital towards
goods that are different from those currently under production. Traditional growth theory
assumes there is always a tree within reach; hence the structure of this forest is
unimportant. However, if this forest is heterogeneous, with some dense areas and other
more deserted ones, and monkeys can jump limited distances, then countries may be
unable to move through the product space. If this is the case, the structure of this space and
countries.
In theory, many possible factors may cause relatedness between products, i.e.
closeness between trees; such as the intensity of labor, land, and capital [84], the level of
technological sophistication [85,86], the inputs or outputs involved in a product's value chain
(e.g. cotton, yarn, cloth, garments) [87] or requisite institutions [88,89]. All of these are a priori
notions of what dimensions of similarity are most important, and assume that factors of
Instead, we take an agnostic approach and use an outcomes-based measure, based on the idea
that if two goods are related, because they require similar institutions, infrastructure, physical
factors, technology, or some combination thereof, they will tend to be produced in tandem,
whereas dissimilar goods are less likely to be produced together. We call this measure
proximity, which formalizes the intuitive idea that the ability of a country to produce a product
depends on its ability to produce other products. For example, a country with the ability to
export apples will probably have most of the conditions suitable to export pears. It would
certainly have the soil, climate, packing technologies, refrigerated trucks and containers. In
addition, it would have skilled agronomists, phytosanitary laws and trade agreements that
could be easily redeployed to the pear business. If instead we consider a different product such
as copper wires or home appliance manufacture, all or most of the capabilities developed for
the apple business are rendered useless. We introduce proximity as the concept that captures
the pairwise conditional probabilities of a country exporting a good given that it exports
another.
(24)
(25)
which measures whether a country c exports more of good i, as a share of its total
4.3 The Product Space
International trade data is taken from Feenstra, Lipsey, Deng, Ma, & Mo's "World
Trade Flows: 1962-2000" dataset [91]. This dataset consists of imports and exports both by
country of origin and by destination, with products disaggregated to the SITC revision 4,
four-digit level. The authors build this dataset using the United Nations COMTRADE
database. The authors cleaned that dataset by calculating exports using the records of the
importing country, when available, assuming that data on imports is more accurate than
data from exporters. This is likely, as imports are more tightly controlled in order to
enforce safety standards and collect customs fees. In addition, the authors correct the UN
data for flows to and from the United States, Hong Kong, and China. We focus only on
export data, and do not disaggregate by country of destination. More information on this
dataset can be found in NBER Working Paper #11040, and the dataset itself is available at
Using this we calculate the 775 by 775 matrix of revealed proximities between
Figure 10 Hierarchically clustered proximity matrix representing the 1998-2000 product space.
Figure 11 Network representation of the 1998-2000 product space. Links are color coded
with their proximity value. The size of the nodes is proportional to world trade and their
colors are chosen according to the classification introduced by Leamer.
homogeneous product space would imply uniform values (homogeneous coloring), while a
product-ladder model [83] would suggest a matrix with high values (or bright coloring) only
along the diagonal. Instead the product space of Figure 11 appears to be modular [92,93],
with some goods highly connected and others disconnected. Furthermore, as a whole the
product space is sparse, with φij distributed according to a broad distribution with 5% of its
elements equal to zero, 32% of them smaller than 0.1 and 65% of the entries taking values
below 0.2. This significant number of negligible connections calls for a network
representation [94,73], allowing us to explore the structure of the product space, together
with the proximity between products of given classifications and participation in world
trade.
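A proximity matrix of the kind discussed above can be computed from a country-by-product export table along the following lines. This is a sketch under assumptions that are not confirmed by the text as it stands: a country is counted as an exporter of a product when its revealed comparative advantage (in the Balassa form, a product's share of the country's exports divided by its share of world exports) exceeds 1, and proximity is taken as the smaller of the two pairwise conditional probabilities. The function names, the threshold parameter, the zeroed diagonal and the toy export table are hypothetical.

```python
import numpy as np


def revealed_comparative_advantage(exports):
    """Balassa RCA for an exports matrix with countries as rows, products as columns."""
    exports = np.asarray(exports, dtype=float)
    country_share = exports / exports.sum(axis=1, keepdims=True)
    world_share = exports.sum(axis=0) / exports.sum()
    return country_share / world_share


def proximity_matrix(exports, rca_threshold=1.0):
    """phi[i, j] = min{P(i exported | j exported), P(j exported | i exported)}."""
    # ASSUMPTION: "exports a good" means RCA above the threshold.
    exported = (revealed_comparative_advantage(exports) > rca_threshold).astype(float)
    both = exported.T @ exported            # number of countries exporting both i and j
    ubiquity = np.diag(both)                # number of countries exporting each product
    larger = np.maximum.outer(ubiquity, ubiquity)
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.where(larger > 0, both / larger, 0.0)
    np.fill_diagonal(phi, 0.0)              # self-proximity is not of interest here
    return phi


# Hypothetical export values: 4 countries (rows) by 3 products (columns).
exports = [[10, 8, 0],
           [6, 9, 0],
           [0, 1, 12],
           [1, 0, 9]]
phi = proximity_matrix(exports)
```

In this toy table the first two products are always exported together and never with the third, so their proximity is 1 while their proximity to the third product is 0. Dividing the co-export count by the larger of the two ubiquities is what yields the minimum of the two conditional probabilities.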
The matrix representing the product space has many small values which represent
adequate way to lay out the products, giving us a quick visual way to show the relevant
links and to determine where countries are located and where they could be headed.
To include all products in our network we generated a "skeleton" for it: the
Maximum Spanning Tree (MST). This is the spanning tree whose sum of weights is
maximal. In other words, it is the set of N-1 links (N being the number of nodes) that
connects all nodes in the network and maximizes the sum of the proximities in it.
proximity matrix and then considered the strongest link connected to that dyad. We then
picked the strongest link connecting a new node to our triad and continued adding links
until all the nodes on the network were considered (Figure 12).
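The greedy construction described above, which starts from the strongest link and repeatedly adds the strongest link that reaches a new node, corresponds to a Prim-style algorithm. The following is a minimal sketch, assuming a symmetric proximity matrix with at least two nodes; the function name and the toy matrix are hypothetical.

```python
import numpy as np


def maximum_spanning_tree(phi):
    """Greedy (Prim-style) maximum spanning tree of a symmetric weight matrix."""
    phi = np.asarray(phi, dtype=float)
    n = phi.shape[0]
    masked = phi.copy()
    np.fill_diagonal(masked, -np.inf)       # never link a node to itself
    # Start from the strongest link in the whole matrix.
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[[i, j]] = True
    edges = [(int(i), int(j))]
    while not in_tree.all():
        # Strongest link joining a node already in the tree to a new node.
        candidate = np.where(in_tree[:, None] & ~in_tree[None, :], masked, -np.inf)
        i, j = np.unravel_index(np.argmax(candidate), candidate.shape)
        edges.append((int(i), int(j)))
        in_tree[j] = True
    return edges


# Hypothetical 4-product proximity matrix.
toy_phi = [[0.0, 0.9, 0.2, 0.1],
           [0.9, 0.0, 0.6, 0.3],
           [0.2, 0.6, 0.0, 0.5],
           [0.1, 0.3, 0.5, 0.0]]
tree = maximum_spanning_tree(toy_phi)
```

For the toy matrix the tree is the chain 0-1, 1-2, 2-3 with total proximity 2.0. The dense candidate matrix keeps the sketch short at the cost of O(N^3) work, which is acceptable for a few hundred products.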
Figure 12 Earliest version of the MST representing the "skeleton" of the product space.
In our visualization we also wanted to consider the strongest links which are not
necessarily in the MST. We did this by considering the MST plus all the links above a
certain threshold. A suitable visualization was obtained by keeping all links with a
proximity value of 0.55 or larger (Figure 13). This resulted in a network with 775 nodes
and 1525 links. Lower proximity values gave rise to crowded network representations
while higher values resulted in sparse networks. As a rule of thumb, a good network
visualization can be achieved with an average degree equal to 4. This is when the number
of links is twice the number of nodes, which is the case for the φ=0.55 threshold.
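As a rough check of the rule of thumb, the sketch below superposes the strong links on a given tree and computes the average degree. The helper names `overlay_strong_links` and `average_degree`, the toy matrix and the toy tree are assumptions made for illustration only.

```python
def overlay_strong_links(tree_links, phi, threshold=0.55):
    """Union of the spanning-tree links and every link with phi >= threshold."""
    n = len(phi)
    links = {tuple(sorted((i, j))) for i, j, _ in tree_links}
    links |= {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if phi[i][j] >= threshold
    }
    return links


def average_degree(n_nodes, n_links):
    # Each undirected link adds one to the degree of two nodes.
    return 2 * n_links / n_nodes


# Toy example: the link (0, 2) is not in the tree but exceeds the threshold.
toy_phi = [
    [0.0, 0.9, 0.58, 0.1],
    [0.9, 0.0, 0.6, 0.3],
    [0.58, 0.6, 0.0, 0.7],
    [0.1, 0.3, 0.7, 0.0],
]
toy_tree = [(0, 1, 0.9), (2, 3, 0.7), (1, 2, 0.6)]
network = overlay_strong_links(toy_tree, toy_phi)
print(sorted(network))  # [(0, 1), (0, 2), (1, 2), (2, 3)]

# Figures quoted in the text: 775 nodes and 1525 links.
print(round(average_degree(775, 1525), 2))  # 3.94
```

With the numbers quoted in the text the average degree is about 3.94, which is close to the target value of 4.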
Figure 13 Representation of the product space based on the MST plus all links with a
proximity above 0.55.
Good network visualization requires an appropriate layout. We lay out the network
using a force spring algorithm. Here nodes are represented as equally charged particles and
links are assumed to be springs. The layout is determined by the relaxed positions.
The force spring layout is not the ultimate solution, but it brings us close to a good
one. After this we retouch the layout manually to avoid overlapping links and untangle
dense clusters.
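A minimal sketch of such a layout is given below. It is not the algorithm used to produce the figures: the function name `spring_layout`, the parameter values, the capped step and the random starting positions are all assumptions chosen to keep the example short and numerically stable.

```python
import math
import random


def spring_layout(n_nodes, links, steps=300, charge=0.05, stiffness=0.1,
                  rest_length=1.0, step_size=0.1, max_move=0.1, seed=0):
    """Relax node positions under pairwise repulsion and spring attraction."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n_nodes)]
    for _ in range(steps):
        force = [[0.0, 0.0] for _ in range(n_nodes)]
        # Equal charges: every pair of nodes repels with strength ~ 1/d^2.
        for a in range(n_nodes):
            for b in range(a + 1, n_nodes):
                dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = charge / dist ** 2
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Springs: each link moves its two ends toward the rest length.
        for a, b in links:
            dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = stiffness * (dist - rest_length)
            force[a][0] += pull * dx / dist
            force[a][1] += pull * dy / dist
            force[b][0] -= pull * dx / dist
            force[b][1] -= pull * dy / dist
        # Move each node along its net force, capping the step for stability.
        for a in range(n_nodes):
            mx, my = step_size * force[a][0], step_size * force[a][1]
            norm = math.hypot(mx, my)
            if norm > max_move:
                mx, my = mx * max_move / norm, my * max_move / norm
            pos[a][0] += mx
            pos[a][1] += my
    return pos


# Toy chain of four nodes.
positions = spring_layout(4, [(0, 1), (1, 2), (2, 3)])
```

The relaxed positions returned by a routine of this kind would still need the manual retouching described above.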
Figure 14 Network representation of the product space. Layout uses a force spring
algorithm.
at the structure of the space and other covariates. In our case we painted the network using
the product classifications performed by Leamer [84], and made the size of the nodes
proportional to the World Trade associated with that particular industry. To give a sense of
the proximity of the links involved in our network representation we color coded them by
using dark red and blue for strong links; and yellow and light blue for weaker ones.
4.5 The products space and the patterns of comparative advantage
To offer a visualization in which all 775 products are included, we reach all nodes
by calculating the maximum spanning tree, which includes the 774 links maximizing the
tree's added proximity, and superpose on it all links with a proximity larger than 0.55, as
we explained above. This set of 1525 links is used to visualize the structure of the full
proximity matrix, which, far from homogeneous, appears to have a core-periphery structure
(Figure 11). The core is formed by metal products, machinery and chemicals while the
periphery is formed by the rest of the product classes. The products in the top of the
periphery belong to fishing, animal, tropical and cereal agriculture. To the left there is a
strong peripheral cluster formed by garments and another one belonging to textiles,
followed by animal agriculture. The bottom of the network shows a large electronics
cluster followed to the right by mining and forest and paper products.
introduced by Leamer [84], which is based on relative factor intensities (Table 4, Figure
19), i.e. the relative amount of capital, labor, land or skills required to produce each
product. Although the classification performed by Leamer was done using a different
methodology, the agreement between it and the structure of the product space is striking.
Yet it also introduces a more detailed split of some product classes. For example,
machinery is naturally split into two clusters, one consisting of vehicles and heavy
machinery, and another one belonging to electronics. The machinery cluster is interwoven
with some capital intensive metal products, but is not tightly connected to similarly
The map obtained can be used to analyze the evolution of a country's productive
structure. For this purpose we hold the product space fixed and study the dynamics of
production within it, although changes in the product space represent an interesting avenue
Figure 15 shows the pattern of specialization for four regions in the product space (see footnote 2).
Products exported by a region with RCA>1 are shown with black squares. Industrialized
countries occupy the core, composed of machinery, metal products and chemicals. They
also participate in more peripheral products such as textiles, forest products and animal
agriculture. East-Asian countries have developed RCA in the garments, electronics and
textile clusters while Latin America and the Caribbean are further out in the periphery in
mining, agriculture and the garments sector. Finally sub-Saharan Africa exports few
product types, all of which are in the far periphery of the product space, indicating that each
region has a distinguishable pattern of specialization clearly visible in the product space (to
see a discussion of how the structure of the product space is correlated with product income
Next, we show how the structure of the product space affects a country's pattern of
Colombia between 1980 and 2000 in the electronics and garments sector respectively. We
see that both countries follow a diffusion process in which comparative advantage moves
2
The network shown here represents the structure of the product space as determined from the
1998-2000 period. Holding the product space fixed is a good first approximation, as the dynamics of the
network are much slower than those of countries. The Pearson correlation coefficients (PCC) between the
proximity of all links present in this network and the ones obtained from the same network in 1990 and 1985
are 0.69 and 0.66, respectively (see supplementary material). This indicates that although the network
changes over time, after 15 years, the strength of past links still predicts the strength of the current links to a
considerable extent.
Figure 15 Localization of the productive structure for different regions of the world. The
products for which the region has an RCA > 1 are denoted by black squares.
preferentially towards products close to existing goods: garments in Colombia and
electronics in Malaysia.
$$\omega_j^k = \frac{\sum_i x_i \phi_{ij}}{\sum_i \phi_{ij}}, \qquad (26)$$
where $\omega_j^k$ is the density around good j given the export basket of the kth country and
$x_i = 1$ if $RCA_{k,i} > 1$ and 0 otherwise. A high density value means that the kth country has
many developed products surrounding the jth product. To study the evolution of
1990 and an RCAc,i>1 in 1995. As a control, we consider undeveloped products those that
in 1990 and 1995 had an RCAc,i< 0.5 and disregard those cases not fitting either of these
two criteria. Figure 16 B shows how density is distributed around transition products
(yellow) and compares it to densities around undeveloped products (red). Clearly, these
distributions are very distinct, with a higher density around transition products than among
At the single product level, we consider the ratio between the average density of all
countries in which the jth product was a transition product and the average density of all
countries in which the jth product was not developed. Formally, we define the discovery
factor Hj as
$$H_j = \frac{\sum_{k=1}^{T} \omega_j^k / T}{\sum_{k=T+1}^{N} \omega_j^k / (N-T)}, \qquad (27)$$
where T is the number of countries in which the jth good was a transition product
and N is the total number of countries. Figure 16 C shows the frequency distribution of this
ratio. For 79 percent of products, this ratio is greater than 1, indicating that the density around the jth good is greater in
countries that transitioned into the jth good than in those that did not, often substantially.
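The two quantities discussed above can be illustrated with a short sketch. It assumes that density is the proximity-weighted fraction of developed products around good j, that the diagonal of the proximity matrix is zero, and that the discovery factor is the ratio of the two mean densities described in the text. The function names and toy numbers are hypothetical.

```python
def density(phi, developed, j):
    """Proximity-weighted share of developed products around product j.

    developed is the set of products in which the country has RCA > 1.
    The diagonal of phi is assumed to be zero.
    """
    total = sum(phi[i][j] for i in range(len(phi)))
    near = sum(phi[i][j] for i in developed)
    return near / total


def discovery_factor(transition_densities, undeveloped_densities):
    """Mean density where product j transitioned over mean density where it did not."""
    mean_t = sum(transition_densities) / len(transition_densities)
    mean_u = sum(undeveloped_densities) / len(undeveloped_densities)
    return mean_t / mean_u


toy_phi = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.6, 0.3],
    [0.2, 0.6, 0.0, 0.7],
    [0.1, 0.3, 0.7, 0.0],
]
# A hypothetical country that has developed products 0 and 1.
omega_2 = density(toy_phi, {0, 1}, 2)  # (0.2 + 0.6) / 1.5
omega_3 = density(toy_phi, {0, 1}, 3)  # (0.1 + 0.3) / 1.1

# Hypothetical densities around one product in two groups of countries.
h = discovery_factor([0.6, 0.4], [0.2, 0.3])
```

In this toy case product 2 sits in a denser neighborhood than product 3, and a discovery factor above 1 would indicate that transitions happened where density was higher.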
those they already had, is to calculate the conditional probability of transitioning into a
product given that the nearest product with RCA>1 is at a given φ. Figure 16 D shows a
monotonic relationship between the proximity of the nearest developed good and the
probability of transitioning into it. While the probability of moving into a good at φ=0.1 in
the course of 5 years is almost nil, the probability is about 15 percent if the closest good is
at φ=0.8 (see footnote 3).
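One way to estimate such a conditional probability is to bin candidate products by the proximity of their nearest developed product and count the fraction that transitioned in each bin. The sketch below assumes a hypothetical input layout of (developed set, product, transitioned) triples and a bin width of 0.1; neither is taken from the original analysis.

```python
import math
from collections import defaultdict


def transition_probability(phi, cases, bin_width=0.1):
    """Estimate P(transition | proximity of nearest developed product), per bin.

    cases is an iterable of (developed, j, transitioned): the set of products a
    country had developed, a candidate product j, and whether j was developed
    five years later (hypothetical input layout).
    """
    tally = defaultdict(lambda: [0, 0])  # bin index -> [transitions, cases]
    for developed, j, transitioned in cases:
        nearest = max(phi[i][j] for i in developed)
        # Small offset guards against float error at bin edges (0.3 / 0.1 < 3).
        bin_index = math.floor(nearest / bin_width + 1e-9)
        tally[bin_index][0] += int(transitioned)
        tally[bin_index][1] += 1
    return {b: hits / total for b, (hits, total) in sorted(tally.items())}


toy_phi = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.6, 0.3],
    [0.2, 0.6, 0.0, 0.7],
    [0.1, 0.3, 0.7, 0.0],
]
toy_cases = [
    ({0, 1}, 2, True),   # nearest developed product at 0.6
    ({0, 1}, 3, False),  # nearest developed product at 0.3
    ({2}, 3, True),      # nearest developed product at 0.7
    ({0}, 3, False),     # nearest developed product at 0.1
]
probabilities = transition_probability(toy_phi, toy_cases)
```

The keys of the returned dictionary are bin indices, so key 6 covers proximities from 0.6 up to, but not including, 0.7.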
3
We repeated the same exercise using the rank of proximity instead of proximity itself in order to
assess whether what matters is absolute or relative proximity. We found that absolute distance appears to be
what matters most. We found that while the transition probability increases linearly with proximity, it decays
with rank as a power law. Moreover, the rank effect is stronger for products in sparser parts of the product
space, where transitions are also less frequent. Thus, densely connected products can develop RCA through
more paths than sparsely connected ones, indicating the importance of absolute proximity.
Figure 16 Empirical evolution of countries. A. Examples of RCA spreading for Colombia
(COL) and Malaysia (MYS). The color code shows when these countries first developed
RCA>1 for products in the garments sector in Colombia and the electronics cluster for
Malaysia. B. Distribution of density for transition products and undeveloped products. C.
Distribution for the relative increase in density for products undergoing a transition with
respect to the same products when they remain undeveloped. D. Probability of developing
RCA given that the closest connected product is at proximity φ. E. Relative size of the
largest connected component NG with respect to the total number of products in the system
N as a function of link proximity φ.
Since production shifts to nearby products, we ask whether the product space is
sufficiently connected that given enough time, all countries can reach most of it,
particularly the richest parts. Lack of connectedness may explain the difficulties faced by
countries trying to converge to the income levels of rich countries: they may not be able to
undergo structural transformation because proximities are just too low. A simple
approach is to calculate the relative size of the largest connected component as a function
of φ. Figure 16 E shows that at φ≥0.6 the largest connected component has a negligible size
compared with the total number of products while for φ≤0.3 the product space is almost
fully connected, meaning that there is always a path between two different products.
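The connectivity check described above can be sketched with a union-find structure: keep only links at or above a given proximity and measure the largest group of mutually reachable products. The function name `largest_component_fraction` and the toy matrix are assumptions for illustration.

```python
def largest_component_fraction(phi, threshold):
    """Relative size of the largest connected component for links >= threshold."""
    n = len(phi)
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of every pair joined by a strong enough link.
    for i in range(n):
        for j in range(i + 1, n):
            if phi[i][j] >= threshold:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n


toy_phi = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.6, 0.3],
    [0.2, 0.6, 0.0, 0.7],
    [0.1, 0.3, 0.7, 0.0],
]
for threshold in (0.6, 0.65, 0.95):
    print(threshold, largest_component_fraction(toy_phi, threshold))
# 0.6 1.0
# 0.65 0.5
# 0.95 0.25
```

Sweeping the threshold over a fine grid of proximity values would trace a curve of the kind shown in Figure 16 E.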
We study the impact of the product space structure by simulating how the position
of countries evolves when they are allowed to repeatedly move to products with proximities greater
than a certain φ₀. If countries diffuse to nearby products and these are sufficiently
connected to others, then after several iterations, 20 in our exercise, countries would be
able to reach richer parts of the product space. On the other hand, if the product space is
disconnected, countries will not be able to find paths to the richer part of the product space,
The results of our simulation for Chile and Korea are presented in Figure 17 A. At a
relatively low proximity (φ₀=0.55) both countries are able to diffuse through to the core of
the product space; however, Korea is able to do so much faster thanks to its positioning in
core products. For higher proximities the question becomes whether a country can spread
at all. At φ₀=0.6 Chile is able to spread slowly throughout the space while Korea is still
able to populate the core after 4 rounds. At φ₀=0.65, Chile is not able to diffuse, lacking
any close enough products, while Korea develops RCA slowly to a few products close to
Figure 17 Simulated diffusion process and inequality. A. Simulated diffusion process for
Chile and Korea in which we allow countries to develop RCA in all products with proximity
φ≥0.55, 0.6 and 0.65. The number of steps required to develop RCA can be read from the
color code on the top right corner of the figure. B. Distribution for the average PRODY of
the best 50 products in a country's basket before and after 20 rounds of diffusion. The
original distribution is shown in green while the one associated with the distribution after
20 diffusion rounds with φ=0.65 is presented in yellow and φ=0.55 in red. C. Interquartile
range of the distribution of the best 50 products after diffusing with a given φ, normalized
by the interquartile range of the best 50 products in the absence of diffusion.
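The simulated diffusion shown in panel A can be sketched as a simple iterative spreading process. The version below is an assumption-laden illustration: the function name `diffuse`, the toy matrix and the use of "greater than or equal to" for the proximity cutoff are our choices, and every eligible product is developed in the same round.

```python
def diffuse(phi, initial_basket, phi0, rounds=20):
    """Return {product: round in which the country first reaches it}.

    In each round the country develops every product whose proximity to
    at least one already developed product is at least phi0.
    """
    reached = {i: 0 for i in initial_basket}
    for step in range(1, rounds + 1):
        new = [
            j
            for j in range(len(phi))
            if j not in reached and any(phi[i][j] >= phi0 for i in reached)
        ]
        if not new:
            break  # nothing within reach: the diffusion has stopped
        for j in new:
            reached[j] = step
    return reached


toy_phi = [
    [0.0, 0.9, 0.2, 0.1],
    [0.9, 0.0, 0.6, 0.3],
    [0.2, 0.6, 0.0, 0.7],
    [0.1, 0.3, 0.7, 0.0],
]
print(diffuse(toy_phi, {0}, phi0=0.6))   # {0: 0, 1: 1, 2: 2, 3: 3}
print(diffuse(toy_phi, {0}, phi0=0.65))  # {0: 0, 1: 1}
```

In the toy example a small increase in the cutoff is enough to strand the country after one round, which mirrors the qualitative contrast between the 0.6 and 0.65 panels.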
To generalize this analysis for the whole world, we need a measure to summarize
the position of a country in the product space. We adopt a measure based on Hausmann,
Hwang and Rodrik [95], which involves a two-stage process. First, for every product we
assign a value, called PRODY [95], which is the RCA-weighted GDP per capita of countries with comparative
advantage in that good. We then average the PRODYs of the top N
and φ₀=1 (green), φ₀=0.65 (yellow) and φ₀=0.55 (red). The distribution for φ₀=1 allows us
to characterize the current distribution of countries in the product space, which shows a
bimodal distribution, signature of a world divided into rich and poor countries with few
countries occupying the center of the distribution. When we allow countries to diffuse up to
φ₀=0.65, this distribution does not change significantly: it shifts slightly to the right due to
diffusion process, however, stops after a few rounds and the world maintains a degree of
inequality similar to its current state. In contrast, when we consider φ₀=0.55, most countries
are able to diffuse and reach the most sophisticated basket in the long run. Only a few
countries are left behind, which unsurprisingly make up the poorest end of the income
distribution (more details on the simulated diffusion process can be found in Appendix III).
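The summary measures used in Figure 17 B and C can be sketched as follows. This is a simplified reading of the description above, not the exact procedure of [95]: weights are taken to be proportional to each country's RCA in the product, the helper names and numbers are hypothetical, and the quartiles use the default method of Python's `statistics.quantiles`.

```python
import statistics


def prody(rca_by_country, gdp_per_capita):
    """RCA-weighted average GDP per capita of the countries exporting a product."""
    total = sum(rca_by_country.values())
    return sum(rca * gdp_per_capita[c] for c, rca in rca_by_country.items()) / total


def top_n_average(values, n=50):
    """Average of the n largest values (or of all values if fewer than n)."""
    best = sorted(values, reverse=True)[:n]
    return sum(best) / len(best)


def interquartile_range(values):
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical product exported by two countries.
value = prody({"A": 2.0, "B": 1.0}, {"A": 30000.0, "B": 3000.0})
print(value)  # 21000.0

# Hypothetical country scores before and after diffusion.
before = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
after = [v / 2 + 6.0 for v in before]
normalized_iqr = interquartile_range(after) / interquartile_range(before)
```

A normalized value below 1, as in this toy case, would correspond to a distribution that has narrowed after diffusion.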
To quantify the level of convergence we calculated the interquartile range (IQR)
for the distribution and normalized this quantity by dividing it by the
IQR for the original distribution. Figure 17 C shows that the convergence of the system
goes through an abrupt transition and that convergence is possible if countries are able to
4.6 Discussion
of the poor towards the rich has been explained using geographic [96] and institutional
[88,89] arguments. Here, we introduced a new factor to this discussion: the difficulties
involved in moving through the product space. The detailed structure of the product space
is shown here for the very first time and together with the location of the countries and the
characteristics of the diffusion process undergone by them, strongly suggests that not all
countries face the same opportunities when it comes to development. Poorer countries tend
to be located in the periphery where moving towards new products is harder to achieve.
More interestingly, among countries with a similar level of development and seemingly
similar levels of production and export sophistication, there is significant variation in the
option set implied by their current productive structure, with some on a path to continued
structural transformation and growth, while others are stuck in a dead end.
These findings have important consequences for economic policy, as the incentives
different from those required when a country hits a dead end. It is quite difficult for
production to shift to products far away in the space, and therefore policies to promote
large jumps are more challenging. Yet, it is precisely these long jumps that generate
CHAPTER 5: DISCUSSION
For some people, studies mixing physics and people like the ones presented above may
be hard to classify. On the one hand, some people wonder where the physics is, while others would
At the beginning of the last century, physicists were interested in understanding the
chapter 3, more than 100 years later, we simply extend this question to a different, more
complex, type of “particle.” While this could be seen as a trivial question to formulate, our
opportunity to jump into it was made possible only as a spin-off of one of the world's
largest industries. As a research project, it would have been impossible to fund tens of
experiment. Yet, there are many places where research similar to this one is also taking
place. Several research groups have begun collaborations with communication companies
to study closely related problems [97,98,99,100]. These groups, however, are not formed only
biologists, ecologists, architects and possibly scientists trained in other disciplines as well.
Natural scientists are interested in kinematic questions, even if the observed particles are
people. This example illustrates that scientific disciplines can be defined by the approach
followed by those who practice them, rather than by the object of study. Writing a poem
about the rain does not make you a meteorologist. Similarly, studying the statistical
The same rationale can be illustrated in our study of the social network's dynamics.
While the study on human mobility patterns dealt with relatively uncharted scientific
territory, the literature on the empirical study of social network dynamics consisted of
several papers [27,28,29,30,38,39,40]. Putting some technical differences aside (see footnote 4), there are clear
differences between the approaches taken by us and that of more traditional social
the personal factors affecting social decay, like marriage and divorce [28] or entering
college [30]. Our study, however, concentrated on discerning the correlations between the
structure of the network and its dynamics and is more closely related to the papers
published by Palla, Barabási and Vicsek [8], Onnela et al. [9] and Kossinets and Watts [6],
which also use massive communication records to study structural and dynamical
attributes of social interactions. In our study we showed that the coupling between a
social network's structure and dynamics is strong enough that predictions can be made with
extremely naive approaches, alerting the community that there is a fertile ground for
objects that they study, the number of interdisciplinary collaborations should increase in
the coming years. As the world self-organizes into a more globalized and interconnected
4
Like the fact that we used millions of automatically collected records rather than survey data on
tens or a few hundred individuals and that the data used in sociological studies usually consists of only two
panels.
state, the boundaries between scientific disciplines will blur, shift, reorganize and evolve;
new scientific disciplines will be created and value will gradually emerge from the work of
variables, mainly gross domestic product (GDP) adjusted by purchasing power parity. Yet,
as a concept, development has always been associated with an increase in diversity that
cannot be captured by such averages. As the human body develops, cells differentiate into
neurons, muscles, bones and several other cell types. Similarly, as nations develop,
different industries and products are born. Assessing the health of an economy solely based
on its wealth is as correct as assessing the health of a child solely based on its weight. A
nations develop different industries and products, rather than trying to predict how they
accumulate different types of capital. But how do we describe such a complex process?
development is measured by looking at the step of the ladder that each nation is on,
regardless of the products and services that allowed them to get there. Development,
however, may not be as one dimensional as this picture suggests. An alternative metaphor
would represent nations as being spread on a rugged landscape rather than a ladder,
searching for new products in its valleys and crossing mountains and oceans in search of
new products and services – a Sewall Wright type of metaphor, for those familiar with the
Although inspiring, assuming an entire landscape to study development may seem
impractical. We can overcome this by replacing the landscape with a network. This
approach is far from new, as it was used by Euler to abstract and solve the famous
are ubiquitous. Trivial examples are a subway map or the highway network. Hence, if
We can illustrate how a network view of economics might look through an example
inspired by the view of the world presented in Jared Diamond's masterpiece Guns, Germs
and Steel (GG&S) [103]. For those not familiar with the book, it is a fascinating view of our
civilization's origins documenting how our society arose at the time that hunters and
gatherers discovered plant and animal domestication. The book is full of beautifully
documented facts and anecdotes disclosing the history of many of our civilization's first
economic products, like wheat, barley, pork, flax and corn. Through a careful and well
documented discussion, the book shows how our world was shaped by a few civilizations,
which happened to be in the right place at the right time. These civilizations were able to
develop primitive farming economies enabling them to produce enough surplus to allow
dominated their neighbors, physically and/or culturally, and transformed our world from a
myriad of thousands of independent family groups, into a few large dominant civilizations.
But why did some of these advanced civilizations prevail over the others?
According to Diamond's argument, since climate changes little with longitude but a lot
with latitude, domesticated plants and animals can diffuse more easily if they travel East or
West than if they travel North or South. Since Eurasia is a large expanse spread out on an
East – West axis, innovations in one part could travel easily across the whole continent.
However, Africa and the Americas are spread on a North – South axis and consequently
there are fewer areas with similar latitudes that could share new varieties of plants and
animals. As a consequence, there were more products available to the Eurasians than to the
Figure 18 Sketch of the GG&S product space. Links are not scientifically accurate.
product landscape faced by our ancestors. Civilizations grew by discovering products, i.e.
domesticating plants and animals. These in turn allowed them to create more complex
products, such as garments, tools and weapons. Yet not all civilizations started in equally
dense parts of the product space. Eurasian populations had access to a broader set of
opportunities because of the larger base on which they could experiment and share.
Eurasian civilizations also had more starting points, as the set of agricultural
products available in the Eurasian continent was considerably more diverse than that of the
Americas [103]. Omitting details on the nature of the links connecting different products, it
is accurate to say that Eurasian populations were located in a denser part of the product
space -- where many goods were close to each other -- allowing them to expand quickly
over it. On the other hand, civilizations in the Americas were located in a much
sparser part of the product space where product diffusion was limited by geographical
constraints. This limited the economic diversification of early American civilizations and
consequently, their ability to jump to products located further in the product space.
Clues about the nature of the links connecting different products can be gathered by
looking at how products are discovered and rediscovered by different populations. Some
jumps, like the domestication of apples, can require important technological improvements
– in this case grafting – that once achieved, opened the door to other fruits like pears and
plums [103]. Hence, even in the most ancient of times, links between some products or
industries were driven by technology. In other cases, some products or industries may be
connected to each other by input/output relationships, like flax and linen or olives and oil.
Yet a third way in which products may be connected is similarity in required infrastructure,
like the silos used to store wheat and barley. A network view of development does not
require a unique definition of a link, but rather accepting as a reasonable assumption the
fact that there are links connecting some products more strongly than others; links through
which knowledge, inputs and workers can flow, links that could be traversed by endeavor
or serendipity.
measure of distance between a pair of products based on the probability that they were
exported by the same countries. This simple method allowed us to construct a network
where we showed that countries tend to diversify by developing products that are close in
the product space to those they already export [104]. In that publication, we simplified our
discussion by concentrating on the case in which the product space was fixed and countries
spread over it, which we found to be a valid assumption for short enough time scales. We
showed that apparently similar countries face very different opportunities for
diversification because they are at very different distances from other products. We also
showed that, given the structure of the product space today, most poor countries can only
converge to the levels of development of rich countries if they are able to jump distances
that are quite infrequent in the historical record. In other words, the “stairway to heaven”
has some very tall steps that are hard to overcome in one move.
There are many ways in which this analysis can be extended. It may be interesting
to study the product space from a labor perspective. One could relate products based on the
similarity of the labor skills required to make them. This would allow companies to
exchange skilled workers. A new product can more easily be developed if it uses labor
skills that are similar to products already in production, as new firms can poach trained
workers from older firms. One could also study the patterns of mobility of labor between
industries as workers try to adjust to changes in the demand for their skills.
The product space evolves over time, as new products and new ways of making old
products are introduced. Cell phones went from not existing, to being made in rich
countries, to being assembled in poor countries. Cell phone service is now ubiquitous in the
world. The internet allows for an exchange of information that was hitherto unimaginable.
We can also study the robustness of an economy based on its position in the product
These are just some examples of the questions that could be studied from a
network perspective. It opens new avenues to diagnose a country's problems and chart a
policy strategy. To properly do this, we will need to redeploy network techniques and
Additionally, we will need to develop new techniques tailored especially for economic
questions and develop a common language that can be used to bridge new ideas and more
traditional approaches. As large data sets become more ubiquitous, the creation of network
maps will also become more common, as they represent a useful way to surf over new
waves of data.
Time will judge its usefulness, as the creation of a sensible and complete description of the
world economy as an evolving network is a task requiring many minds and years. From a
theoretical perspective, suggesting that economics should be described as a spreading
be studied using scalar functions and differential calculus. We often forget that our
Samuelson and many others, requires us to assume that the economy can be best described
by looking for numerical quantities and functional relationships between them. Most of us
forget that assumption because we never made it; we inherited it as college freshmen. Our
approach is not against the use of traditional mathematical methods. On the contrary, it
looks to complement them by incorporating tools that can be used to study development
There are no guarantees that this approach will be useful, as there were no
guarantees for the benefits of using calculus and physically inspired equilibrium processes
to describe economics at the beginning of the last century. The proof of the proverbial
pudding will have to be revealed by further research. Yet, markets have taught us the
importance of leaving room for innovation. A network view of development may be just
After all, a scientist's work is a dance with ignorance. We have adapted our minds
to constantly describe, abstract and attempt to explain a few things around us. While
through ingenuity, serendipity and wisdom. Yet, the goal is not to look for every possible
configuration, but to find the few that appear to matter for everyone around us. In this
process, scientists explore their interests, skills and intuition; not with the goal to play
every tune on the guitar, but to discover those that sing to them. This, always keeping in
mind that tunes cannot understand the silence. As most futures cannot be predicted, a
scientist needs to be modest against the greater concept of ignorance, as his or her actions will add
only a few notes to the tune of the world, a tune that might have an end unforeseen from a
scientist's intentions.
CHAPTER 6: APPENDIXES
CA Hidalgo, C Rodriguez-Sickert
Abstract:
The empirical study of network dynamics has been limited by the lack of
the correlations between the structure of a mobile phone network and the persistence of its
links. We show that persistent links tend to be reciprocal and are more common for people
with low degree and high clustering. We study the redundancy of the associations between
persistence, degree, clustering and reciprocity and show that reciprocity is the strongest
predictor of tie persistence. The method presented can be easily adapted to characterize the
dynamics of other networks and can be used to identify the links that are most likely to
“Understanding Human Mobility Patterns”
Abstract:
Despite their importance for urban planning, traffic forecasting, and the spread of
biological and mobile viruses, our understanding of the basic laws governing human
motion remains limited owing to the lack of tools to monitor the time resolved location of
individuals. Here we study the trajectory of 100,000 anonymized mobile phone users
whose position is tracked for a six month period. We find that in contrast with the random
trajectories predicted by the prevailing Lévy flight and random walk models, human
trajectories show a high degree of temporal and spatial regularity, each individual being
to return to a few highly frequented locations. After correcting for differences in travel
distances and the inherent anisotropy of each trajectory, the individual travel patterns
collapse into a single spatial probability distribution, indicating that despite the diversity of
their travel history, humans follow simple reproducible patterns. This inherent similarity in
travel patterns could impact all phenomena driven by human mobility, from epidemic
“The Product Space Conditions the Development of Nations”
Abstract:
Economies grow by upgrading the products they produce and export. The
technology, capital, institutions, and skills needed to make newer products are more easily
adapted from some products than from others. Here, we study this network of relatedness
between products, or “product space,” finding that more-sophisticated products are located
periphery. Empirically, countries move through the product space by developing goods
close to those they currently produce. Most countries can reach the core only by traversing
empirically infrequent distances, which may help explain why poor countries have trouble
developing more competitive exports and fail to converge to the income levels of rich
countries.
CA Hidalgo, R Hausmann
Abstract:
No Abstract
6.1.2 Not presented in this dissertation
Vidal.
Abstract:
space, is therefore a critical step toward understanding complex biological systems. Here
(~5% of the predicted protein-coding genes), each driving the expression of green
the body axis and throughout development from early larvae to adults. Automated
comparison and clustering of the obtained in vivo expression patterns show that genes
coexpressed in space and time tend to belong to common functional categories. Moreover,
integration of this data set with C. elegans protein-protein interactome data sets enables
"Transcription Factor Modularity in a Gene-Centered C. elegans Protein-DNA Interaction
Network"
Abstract:
interactions between transcription factors (TFs) and their target genes. An important
question pertains to how the architecture of such networks relates to network functionality.
network is organized into two TF modules. These modules contain TFs that bind to a
relatively small number of target genes and are more systems specific than the TF hubs that
connect the modules. Each module relates to different functional aspects of the network.
One module contains TFs involved in reproduction and target genes that are expressed in
neurons as well as in other tissues. The second module is enriched for paired homeodomain
TFs and connects to target genes that are often exclusively neuronal. We find that paired
homeodomain TFs are specifically expressed in C. elegans and mouse neurons, indicating
possesses TF modules that relate to different functional aspects of the complete network.
"Conditions for the Emergence of Scaling in the Inter-Event Time of Uncorrelated and
Seasonal Systems"
CA Hidalgo
Abstract:
Inter-event times have been studied across various disciplines in search for
correlations. In this paper, we show analytical and numerical evidence that at the
different characteristic times, and at the individual level by assuming Poissonian agents
that change the rates at which they perform an event in a random or deterministic fashion.
The range in which we expect to see this behavior and the possible deviations from it are
“The effect of social interactions in the primary consumption life cycle of motion pictures”
Abstract:
We develop a 'basic principles' model which accounts for the primary life cycle
is governed by word of mouth. We fit the analytical solution of such a model to aggregated
consumption data from the film industry and derive a quantitative estimator of its quality
6.2 APPENDIX II: Product Space Properties
Using a network representation of the product space, we can see not only which
products are close to each other and the groups they form, but also their classifications and
values. However, the network representation is nothing more than a powerful visualization
technique, and we still need to study the properties of the space using the entire proximity matrix
complemented.
The first property we study is the ability of the product space to classify goods into
different classes. We compare our network representation with the clusters introduced by
Leamer, as shown in figure 1, by using a different color for each product class. We see
that the product space is not colored at random. Products in the same classes lie close to
methodology, the agreement between it and the structure of the product space is striking.
Beyond the intuitive proof of Figure 7s, we can test the strength of these correlations by
taking the average proximity between and within the products belonging to one of the
TABLE 4 STRENGTH OF THE LINKS BETWEEN AND WITHIN PRODUCTS AS
CLASSIFIED BY LEAMER.
Table 4 shows that the average proximity of products belonging to the same cluster
is always higher than the proximity for products belonging to different clusters. Because not all
clusters have the same size, we also look at the distribution of proximities for all links
connecting products with the same or different Leamer classifications. Figure 8s shows the
distribution of proximity for links connecting nodes with the same Leamer classification
(blue) and for links connecting nodes annotated differently (red). It is clear from the figure that
nodes with the same classification are connected by links with higher proximity values,
and because of the large number of links present in the system (L>200'000), the difference
Figure 19 Distribution of proximity for links connecting products with the same Leamer
classification (blue) and with a different one (red).
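The within-class versus between-class comparison described above can be sketched as follows. This is a minimal illustration, assuming a symmetric proximity matrix and one class label per product; the function name, the toy matrix and the labels are hypothetical and are not taken from the data used in this work.

```python
import numpy as np

def within_between_proximity(phi, labels):
    """Mean proximity over product pairs sharing a label and over pairs that do not."""
    phi = np.asarray(phi, dtype=float)
    labels = np.asarray(labels)
    # Visit each unordered pair once and skip self-links (assumes phi is symmetric).
    i, j = np.triu_indices(len(labels), k=1)
    same = labels[i] == labels[j]
    values = phi[i, j]
    return values[same].mean(), values[~same].mean()

# Toy data (hypothetical): four products in two classes.
phi = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
])
labels = ["machinery", "machinery", "cereals", "cereals"]
within, between = within_between_proximity(phi, labels)
```

The same boolean mask `same` can be used to split the full set of link proximities into the two distributions compared in the figure.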
All products have a value, which in this work we take to be the average income
per capita associated with that good, or PRODY. A natural question follows: Are rich goods located in
particular parts of the product space? By looking at its network representation and setting
the size of the nodes proportional to the PRODY of a product (figure 9s), we see that the
largest nodes are located either in the center or in the lowermost portion of the network. At
first glance, we can say that there is a rich region of the product space, composed of
machinery, electronics and chemicals, and a poor, peripheral region, made of some
Figure 20 Network representation of the product space in which node sizes are proportional
to PRODY.
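As an illustration of how an income level can be associated with a good, the sketch below computes PRODY as a weighted average of GDP per capita, where the weights are each country's export share of the product, normalized across countries. This weighting is assumed here to follow the definition of Hausmann, Hwang and Rodrik; the function name and the two-country, two-product data are hypothetical.

```python
import numpy as np

def prody(exports, gdp_per_capita):
    """exports: countries x products matrix; gdp_per_capita: one value per country."""
    exports = np.asarray(exports, dtype=float)
    gdp = np.asarray(gdp_per_capita, dtype=float)
    # Share of each product in each country's export basket.
    shares = exports / exports.sum(axis=1, keepdims=True)
    # Normalize the shares across countries so the weights of each product sum to 1.
    weights = shares / shares.sum(axis=0, keepdims=True)
    return weights.T @ gdp

# Toy data (hypothetical): a poor country exporting mostly product 0
# and a rich country exporting mostly product 1.
exports = np.array([[90.0, 10.0],
                    [10.0, 90.0]])
gdp = np.array([1000.0, 20000.0])
values = prody(exports, gdp)
```

In this toy case product 1 receives the higher PRODY because it weighs more in the basket of the richer country.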
We can look beyond the actual value of products and study the value of goods as a
function of the distance between them. Basically we ask: Is this particular product at the
top or at the bottom of the PRODY sophistication scale? To answer this we study the
-log(Proximity). Figure 10s shows six examples of products, three of them at the bottom of
the sophistication scale (Footwear, Cotton Undergarments and Coats and Jackets) which
belong to the labor intensive cluster and thus products far from them are richer or more
attractive. On the other hand, chemicals such as organo sulphur compounds, phenols and
cyclic alcohols appear at the top of the sophistication scale and see all other products as less
sophisticated.
Figure 21 PRODY as a function of distance for six different products in the space. Plots were
calculated using the full proximity matrix.
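A PRODY-versus-distance profile of this kind can be sketched as follows, taking the distance from a focal product to be -log(Proximity). This is a minimal sketch: the equal-width binning, the function name and the toy values are assumptions made for illustration and may differ from the procedure used to produce the plots.

```python
import numpy as np

def prody_vs_distance(phi, prody, focal, n_bins=2):
    """Average PRODY of the other products, binned by distance from `focal`."""
    phi = np.asarray(phi, dtype=float)
    prody = np.asarray(prody, dtype=float)
    others = np.arange(len(prody)) != focal
    others &= phi[focal] > 0                 # -log is undefined at zero proximity
    dist = -np.log(phi[focal, others])       # distance = -log(proximity)
    vals = prody[others]
    edges = np.linspace(dist.min(), dist.max(), n_bins + 1)
    # digitize places the maximum in an extra bin, so clip it into the last one.
    idx = np.clip(np.digitize(dist, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.array([vals[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    return centers, means

# Toy data (hypothetical): product 0 is the least sophisticated good.
phi = np.array([
    [1.0, 0.8, 0.4, 0.1],
    [0.8, 1.0, 0.5, 0.2],
    [0.4, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
])
prody_values = [1000.0, 2000.0, 5000.0, 9000.0]
centers, means = prody_vs_distance(phi, prody_values, focal=0, n_bins=2)
```

A profile that rises with distance, as in this toy case, corresponds to a product at the bottom of the sophistication scale; a falling profile corresponds to a product at the top.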
We performed the same analysis for each product class and found that there are
products at the top of the scale, at the bottom and in local maxima (Figure 11s). If the
maximum would trap countries. Examples of these are cereals and animal agriculture
products which are goods located in the periphery of the product space but have a relatively
[Figure 22 panels, one per Leamer class: 1. Petroleum; 2. Raw Materials; 3. Forest Products; 4. Tropical Agriculture; 6. Cereals; 7. Labor Intensive; 8. Capital Intensive; 9. Machinery]
Figure 22 Average PRODY as a function of the distance for products with a given Leamer Annotation.
6.2.3 Changes in Time
How fast does the product space change in time? We can take a simple look at
this by calculating the Pearson correlation coefficient (PCC) between the matrices
representing the product space in 1985, 1990 and 1998. Table 2s shows that the structure of
the product space appears to be stable and that although links do change in time, after 10 or
13 years strong links remain strong and weak links remain weak. Thus products that are
close tend to remain close and the ones that are far tend to stay far. The correlation was
calculated over each pair of corresponding proximities between different time periods.
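The correlation between two snapshots of the product space can be sketched as below. We assume here that the proximity matrices are symmetric, so that each pair of products is counted once and the diagonal is excluded; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def proximity_correlation(phi_a, phi_b):
    """Pearson correlation over corresponding off-diagonal proximities."""
    phi_a = np.asarray(phi_a, dtype=float)
    phi_b = np.asarray(phi_b, dtype=float)
    # Upper triangle only: each product pair once, no self-proximities.
    i, j = np.triu_indices(phi_a.shape[0], k=1)
    return np.corrcoef(phi_a[i, j], phi_b[i, j])[0, 1]

# Toy data (hypothetical): two snapshots of a three-product space.
phi_early = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])
phi_late = np.array([
    [1.0, 0.7, 0.3],
    [0.7, 1.0, 0.4],
    [0.3, 0.4, 1.0],
])
r = proximity_correlation(phi_early, phi_late)
```

A value of `r` close to 1 indicates that strong links stayed strong and weak links stayed weak between the two periods.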
1990 .616
1998
6.3 APPENDIX III: Simulating Diffusion
countries develop RCA tend to lie close to other products for which these countries have
Using these we try to anticipate how a country will diffuse across the product space.
As an example, we show Figure 23, in which we highlighted with black squares all
products at a given proximity of the ones already developed by Chile and Korea. We refer
In this case we tuned the proximity of the jump and show that for high proximities
the set of available options is small, while for low proximities it is large, though different for each country.
country that has developed RCA in several branches of machinery and therefore can
diffuse from the center of the space. At a proximity of 0.5 its options include the entire core
of the network plus the entire electronics and garments clusters, among other things. Chile
diffuses from the periphery and to achieve a similar set of options needs to diffuse as far as
proximities of 0.3.
In summary, we find that the set of options available to a country is strongly
conditioned by its position in the product space and its ability to diffuse into products up to
given proximities.
Figure 23. One step diffusion process for Korea and Chile. The black squares denote all
products closer than a given proximity considering their export baskets in the year 2000.
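The one-step diffusion process can be sketched as follows: a product becomes an option when its proximity to at least one product already in the export basket is at or above a chosen threshold. This is a minimal sketch; the function name, the toy proximity matrix and the basket are hypothetical.

```python
import numpy as np

def one_step_options(phi, basket, threshold):
    """Products not yet exported whose proximity to some exported product >= threshold."""
    phi = np.asarray(phi, dtype=float)
    basket = np.asarray(basket, dtype=bool)
    if not basket.any():
        return np.zeros_like(basket)
    # Highest proximity from any product in the basket to each product.
    nearest = phi[basket].max(axis=0)
    return (nearest >= threshold) & ~basket

# Toy data (hypothetical): four products, the country exports only product 0.
phi = np.array([
    [1.00, 0.70, 0.35, 0.10],
    [0.70, 1.00, 0.60, 0.10],
    [0.35, 0.60, 1.00, 0.40],
    [0.10, 0.10, 0.40, 1.00],
])
basket = np.array([True, False, False, False])
options_high = one_step_options(phi, basket, threshold=0.55)
options_low = one_step_options(phi, basket, threshold=0.30)
```

Lowering the threshold enlarges the set of options, which mirrors the tuning of the jump proximity described above.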
Figure 24 Iterated diffusion process for Chile and Korea
proximity and iterate the one step diffusion process. This represents a set of products
potentially available to countries after diffusing to close products iteratively. At this point
we ask ourselves: Is there a critical value of proximity at which countries will be able to
diffuse across the product space? To explore this question we simulate a diffusion process
in which a country "jumps" to all goods reachable from its current export basket, such that
the proximity to them is larger than or equal to a given value. Figure 24 illustrates through a
color code the products available to Chile and Korea after diffusing iteratively at different
Figure 25. Distribution for the average PRODY of the top 50 products reached after 20
diffusion steps at three different proximities.
proximities for 4 time steps. We observe that at relatively low proximities (φ = 0.55) both
countries are able to diffuse; however, Chile does so much more slowly and reaches the core in
the second and third rounds, whereas Korea does so in the first and second. At
larger proximities the diffusion process halts. At φ = 0.65 Chile is unable to diffuse at all,
while Korea slowly does so close to the core of the product space.
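The iterated process can be sketched by repeating the one-step jump and recording the round in which each product is first reached, which is the kind of information a color code per round would display. This is a minimal sketch under the same jump rule (proximity at or above φ); the function name and the toy data are hypothetical.

```python
import numpy as np

def arrival_round(phi, basket, threshold, steps):
    """Round at which each product is first reached (0 = initial basket, -1 = never)."""
    phi = np.asarray(phi, dtype=float)
    reached = np.asarray(basket, dtype=bool).copy()
    arrival = np.where(reached, 0, -1)
    for step in range(1, steps + 1):
        if not reached.any():
            break
        # Products within the threshold of anything reached so far.
        reachable = phi[reached].max(axis=0) >= threshold
        new = reachable & ~reached
        if not new.any():
            break                      # the diffusion process has halted
        arrival[new] = step
        reached |= new
    return arrival

# Toy data (hypothetical): four products, the country starts from product 0.
phi = np.array([
    [1.00, 0.70, 0.35, 0.10],
    [0.70, 1.00, 0.60, 0.10],
    [0.35, 0.60, 1.00, 0.40],
    [0.10, 0.10, 0.40, 1.00],
])
basket = [True, False, False, False]
rounds_055 = arrival_round(phi, basket, threshold=0.55, steps=4)
rounds_065 = arrival_round(phi, basket, threshold=0.65, steps=4)
```

In this toy example the process reaches one more product at φ = 0.55 than at φ = 0.65, where it halts after the first round.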
top products. We can assign value to a good by following the work of Hausmann, Hwang
and Rodrik, in which the value or sophistication of a good is equal to the average GDP per
capita associated with that good. This quantity is called PRODY and in our particular
example we consider the average PRODY of the top N products of a country's export
basket after M diffusion steps with proximity φ. We denote this quantity by
which countries are divided into those producing sophisticated goods and unsophisticated
ones. If we allow countries to diffuse in this space to acquire only goods that are really
diffuse into products at relatively large proximities (φ=0.55) we find that after a large
number of rounds most countries are able to reach the most attractive parts of the space,
except for a few of them that remain stuck in the lowest bracket of this distribution.
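The average PRODY of the top N products after M diffusion steps at proximity φ can be sketched as below. This is a minimal sketch assuming the same jump rule as in the diffusion process described above; the function name, the argument names and the toy values are hypothetical.

```python
import numpy as np

def top_n_prody_after_diffusion(phi, prody, basket, threshold, steps, top_n):
    """Mean PRODY of the top_n most valuable products reached after `steps` rounds."""
    phi = np.asarray(phi, dtype=float)
    prody = np.asarray(prody, dtype=float)
    reached = np.asarray(basket, dtype=bool).copy()
    for _ in range(steps):
        if not reached.any():
            break
        # Add every product within the threshold of something already reached.
        reached = reached | (phi[reached].max(axis=0) >= threshold)
    best = np.sort(prody[reached])[::-1][:top_n]
    return best.mean() if best.size else float("nan")

# Toy data (hypothetical): four products, the country starts from product 0.
phi = np.array([
    [1.00, 0.70, 0.35, 0.10],
    [0.70, 1.00, 0.60, 0.10],
    [0.35, 0.60, 1.00, 0.40],
    [0.10, 0.10, 0.40, 1.00],
])
prody_values = [1000.0, 3000.0, 8000.0, 20000.0]
basket = [True, False, False, False]
before = top_n_prody_after_diffusion(phi, prody_values, basket, 0.55, steps=0, top_n=2)
after = top_n_prody_after_diffusion(phi, prody_values, basket, 0.55, steps=2, top_n=2)
```

Computing this quantity for every country and building a histogram of the results gives a distribution of the kind shown in Figure 25.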
REFERENCES
1 B.B. Mandelbrot The Fractal Geometry of Nature. New York: W. H. Freeman and
Co., (1982)
15 V. Colizza, A. Barrat, M. Barthélemy, A. Vespignani, PNAS 103: 2015–2020
(2006)
(Elements of Pure Economics, or the theory of social wealth, transl. W. Jaffé), (1874)
(1900)
E 62:3023-3026 (2000)
23 J.D. Farmer, L. Gillemot, F. Lillo, S. Mike, A. Sen. Quant. Fin. 4:383-397 (2004)
32 D.J. Watts, S.H. Strogatz, Nature 393:440-442 (1998)
(2005)
Zurich (2004).
Regression/Correlation Analysis for the Behavioral Sciences (3rd edition) LEA, Mahwah,
48 C.A. Hidalgo, A. Castro, C. Rodriguez-Sickert, New Journal of Physics 8:52
(2006)
(2001)
(2000)
63 A.M. Edwards, R.A. Phillips, N.W. Watkins, M.P. Freeman, E.J. Murphy, V.
Afanasyev, S.V. Buldyrev, M.G.E da Luz, E.P. Raposo, H.E. Stanley, G.M. Viswanathan,
(2001)
89:088102 (2002)
77 A. Hirschman, The Strategy of Economic Development Yale University Press,
MA, (1991)
Eds. 15 (1993)
MA (2002)
91:1369-1401 (2001)
91 R.R. Feenstra, H. Lipsey, A. Deng, A. Ma, H. Mo. NBER working paper 11040.
Cambridge, MA (2005)
92 E. Ravasz, A.L. Somera, D.A. Mongru, Z.N. Oltvai, A.-L. Barabási, Science
297:1551 (2002)
MA (2006)
22:179-232 (1999)
100 Intellione and Roger Wireless have signed an agreement to use cell phones to
103 J. Diamond. Guns, Germs, and Steel: The Fates of Human Societies. W.W.
104 C.A. Hidalgo, B. Klinger, A.-L. Barabási, R. Hausmann. Science 317:482-487 (2007)
105 Hausmann, Rodriguez and Wagner (2008) show that the position of a country in
the product space strongly affects the speed at which it recovers from economic crises.