
Fragoso, S. Mapping Brazil's Connectivity – do we really get more than we give?

Presented
at the IR 6.0, 6th International Conference of the Association of Internet Researchers,
Chicago, USA, October 2005.

Mapping Brazil’s connectivity – do we really get more than we give?1

Suely Fragoso, Unisinos, Brazil


<suely@unisinos.br>

This paper presents and discusses the first steps – and above all the first impasses
– of research that started in January 2005². The research set out to identify and
discuss the hyperlinks connecting sites on the World Wide Web that are registered
within Brazilian domains (.br) with sites of other nationalities. It is unlikely
that a country code Top Level Domain3 (ccTLD) includes all of the sites created by
people and institutions of that nationality, or all of the pages hosted on servers
located in that country's territory. Many Brazilians host their sites
under other ccTLDs, one reason being the disproportionate amount
of bureaucracy required to register a .br domain in comparison with that required in

1
This paper presents the partial results of research sponsored by the Conselho Nacional
de Desenvolvimento Científico e Tecnológico - CNPq, a Brazilian government organ which
promotes scientific and technological development.
2
The research team consists of: Rosana Vieira de Souza, M.A. (Associate Researcher);
Theo Lucas de S. Felizolla, Maria Cândida Lucca di Primio and Ana Lúcia Migowski
(Research Assistants).
3
Domain Names refer to specific computers on the Internet and distinguish each one from
all others. The last part of a Domain Name is a Top Level Domain (TLD). There are two
main types of TLD: generic and country code. Generic Top Level Domains (gTLDs), such
as .com, .org or .net, are to be used by the general Internet public, in principle
distinguishing a particular type of association. Country Code Top Level Domains (ccTLDs),
such as .br, .ca or .ar, identify a particular country or geographical territory.

other nations. On the other hand it is not difficult to find sites belonging to other
nationalities in the .br domain. It is known for example that “multinational
companies commonly register their name under many domains to protect their
brands” (Gomes e Silva, 2005, p. 2). It is considered here, however, that the set of
pages hosted under the ccTLD .br is representative of the pages published by the
social actors of Brazilian nationality on the World Wide Web, not just because this
is indicated by everyday experience, but more importantly because, despite the
initial uptake of the Internet in Brazil having been rather late, the number of .br
domains registered increased very rapidly (Figure 1) and reached meaningful
figures in a few years. The country is currently ninth on the worldwide list of nations
by the number of hosts present (second in the Americas, behind only the USA, and
first in Latin America, with more than three times as many hosts as second placed
Argentina).
Due in part to the lack of an empirical tradition in communications research, not
even the Google4 phenomenon was capable of leading Brazilian researchers in the
direction of hyperlink analysis. It is not uncommon, however, to find sites with a .br
Top Level Domain in the samples taken by researchers of other nationalities. The
fact that these investigations have been predominantly carried out by authors from
the Northern hemisphere has resulted in the particular sets of data concerning
Brazil not being examined or discussed at great length.

4
Google was the first search engine to make use of an algorithm that used the link
structure of the web to predict the best quality matching pages. After Google’s success,
the efficacy of Google’s PageRank algorithm has even been taken as a given by related
research that has borrowed aspects of its functionality (cf. Thelwall inpraiseofgoogle.pdf),
notably the analogy between the creation of a link to a site and an academic citation as a
measure of popularity and/or importance. (googlepaper).

Figure 1: Number of hosts with ccTLD .br per year. Source: ISC Internet Domain Survey, Internet
Systems Consortium. Data available from http://www.isc.org [18th September, 2005]

The results of some surveys in which .br domain sites appear incidentally are
particularly interesting. In his Master’s dissertation, Halavais (1998) carried out one
of the first investigations of the relationship between the structure of web linkage
and national and territorial borders.
Given that the total number of hosts registered with .br domains was still relatively
low (Figure 1) and that the tool used by Halavais for the construction of his
quasi-random sample5 probably implied a bias toward English language sites, the study
registered just 9 sites with their domain registered in Brazil, the only developing
nation to be named in the sample6. According to the data presented, these 9 sites
with .br domains received 0.2% (123) of the total international inlinks verified in the
5
Halavais’ sample consisted of 4,000 sites drawn with a randomizer which was a feature
of Excite’s Webcrawler search engine (the ‘roulette’ page) (Halavais, 1998, p. 62).

sample, which implies an average of 13.67 links per .br site, placing Brazilian
websites in sixth place with respect to their international connectedness7 (Table 1).

Country        Total n° of   % of sites   Total n° of   Average n°   % of links
               sites in      (of whole    links by      of links     (of total n°
               sample        sample)      domain in     per site     of links in
                                          sample                     whole sample)
Australia           43          1.2           861       20.02326         1.6
Switzerland         18          0.5           288       16               0.5
Japan               27          0.7           410       15.18519         0.8
USA               2874         78           41208       14.3382         77.2
Canada              88          2.4          1241       14.10227         2.3
Brazil               9          0.2           123       13.66667         0.2
South Africa         8          0.2            98       12.25            0.2
Germany            101          2.7          1166       11.54455         2.2
Netherlands         49          1.3           546       11.14286         1
France              25          0.7           262       10.48            0.5
UK                 157          4.3          1586       10.10191         3
Sweden              62          1.7           623       10.04839         1.2
Italy               37          1             357        9.648649        0.7
New Zealand          7          0.2            63        9               0.1
Norway              20          0.5             1        0.05            0
TOTAL             3686        100           53366       N/A            100
Others             161          4.4          4533       28.15528         8.5
Table 1: Data from the sample taken by Halavais (1998, p. 62) re-presented to highlight the aspects
most relevant to the present argument. The countries are ordered by the average number of links
found in the sites of each domain, in descending order. The category Others was not included in the
calculation of the totals.
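
The averages in Table 1 follow directly from the site and link counts, so the ordering can be checked with a few lines of code. The following is a minimal sketch in Python that copies the counts from the table; the names TABLE_1, average_links and rank_by_average are illustrative and do not come from Halavais' work.

```python
# Site and link counts per country, copied from Table 1 as (sites, links).
TABLE_1 = {
    "Australia": (43, 861),
    "Switzerland": (18, 288),
    "Japan": (27, 410),
    "USA": (2874, 41208),
    "Canada": (88, 1241),
    "Brazil": (9, 123),
    "South Africa": (8, 98),
    "Germany": (101, 1166),
    "Netherlands": (49, 546),
    "France": (25, 262),
    "UK": (157, 1586),
    "Sweden": (62, 623),
    "Italy": (37, 357),
    "New Zealand": (7, 63),
    "Norway": (20, 1),
}


def average_links(table):
    """Average number of links per site for each country."""
    return {country: links / sites for country, (sites, links) in table.items()}


def rank_by_average(table):
    """Countries ordered by average links per site, highest first."""
    averages = average_links(table)
    return sorted(averages, key=averages.get, reverse=True)


averages = average_links(TABLE_1)
ranking = rank_by_average(TABLE_1)
print(round(averages["Brazil"], 2), ranking.index("Brazil") + 1)
```

With these counts, the .br sites average 123 / 9 links each and occupy the sixth position in the ordering.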

For a developing country with low indices of digital inclusion and education,
whose population is far from being proficient in English, the position is
surprising. Of the 5 countries whose domains contain more connected sites than
6
The sample may, however, have included sites from other developing nations, which
were aggregated in the ‘Others’ category.
7
It is worthwhile noting that the connectivity indicators used by Halavais do not describe
the link totals for a given country, but correspond to the average proportion of links for
sites in a given country (Halavais, 1998, p. 62).

those of .br, in only 2 cases is English not the principal language: Switzerland and
Japan. Switzerland, as is widely known, is a particularly multicultural country, with as
many as three different official languages (German, French and Italian). It is also a
highly developed country, described in the CIA World Factbook as a “stable
modern market economy with low unemployment, a highly skilled labor force, and
a per capita GDP larger than that of the big Western European economies” (10th
place among the world’s highest GDPs per capita in 2004, est. $33,800). Located in the
heart of Europe, it has a comprehensive state educational system that supports the 99%
literacy rate index8 estimated in 1988. Japan, despite having a very different
cultural and ethnic composition to Switzerland, has a very similar profile in terms of
education and economic prosperity: 99% literacy rate (data for 2002) and a GDP
per capita that was estimated at $29,400 in 2004 (CIA, 2005, s.p.). In fact, the only
other ‘developing nation’ to appear in the table of connectivity presented by
Halavais is The Republic of South Africa9, which, as with the other Southern
Hemisphere nations that are included (Australia and New Zealand) has English as
one of its principal languages.
Barnett, writing with Park and others in 2003 and with Jun in 2004, likewise
projected patterns of linkage onto the national frontiers given in
‘traditional’ geopolitical maps. He based his considerations concerning the
international information flow on the Internet on data that indicated separately the
number of inlinks (received) and outlinks (sent) by domain for 47 different
8
This value from the CIA World Factbook uses as its basis the proportion of people above
15 years of age that are capable of reading and writing.
9
According to the CIA World Factbook, South Africa has an estimated GDP per capita of
$11,100, a literacy rate of 86.4% and 50% of the population living below the poverty line
(2000 est.). Brazil’s GDP per capita is estimated by the same source at $8,100, the literacy
rate is also estimated at 86.4% and the percentage of the population living below the
poverty line was estimated at 22% (1998).

nationalities10. Working from this data, the authors estimate the degree of centrality
of different nodes in the network11 according to two values – the total number of
links with other nodes (Freeman, 1979) and the eigenvector measure (Borgatti,
2005, p. 61)12. Reworking this information with a multidimensional scaling of the
network, Barnett et al. found a completely interconnected system in which the US
occupies the most central position. Next most central are Australia, UK, China,
Japan, Canada and Germany. The authors call attention to Norway’s position in
the two-dimensional graphical representation of the network, which is more central
than other Nordic nations and located closer to the US than expected, which they
attribute to Norwegian efforts to market through the web. According to the two-
dimensional representation and the color code used by the authors, Brazil appears
to be positioned practically as centrally as Norway (Figure 2). The only comment
possibly related to this positioning that figures in the Barnett et al. paper concerns
the fact that links between Brazil and Portugal are particularly strong, as pointed
out by previous authors (such as Bharat et al., 2001, p. 5)13.
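
The two centrality values mentioned above can be illustrated on a toy network. The following is a minimal sketch in Python, not the procedure used by Barnett et al.: the edge list is invented, the links are treated as undirected and unweighted, and the eigenvector measure is approximated by power iteration. Adding each node's own score at every step is an implementation choice made here to keep the iteration from oscillating; it does not change which nodes score highest.

```python
# Hypothetical toy network: undirected links between five invented nodes.
EDGES = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b"), ("c", "d")]


def neighbours(edges):
    """Map each node to the set of nodes it is linked with."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def degree_centrality(adjacency):
    """Number of other nodes each node is linked with."""
    return {node: len(linked) for node, linked in adjacency.items()}


def eigenvector_centrality(adjacency, iterations=200):
    """Approximate eigenvector centrality by power iteration."""
    score = {node: 1.0 for node in adjacency}
    for _ in range(iterations):
        # A node's new score is its own score plus its neighbours' scores.
        new = {
            node: score[node] + sum(score[other] for other in adjacency[node])
            for node in adjacency
        }
        largest = max(new.values())
        # Rescale so that the most central node has a score of 1.
        score = {node: value / largest for node, value in new.items()}
    return score


adjacency = neighbours(EDGES)
degree = degree_centrality(adjacency)
eigen = eigenvector_centrality(adjacency)
print(degree)
print({node: round(value, 3) for node, value in eigen.items()})
```

In this toy network the nodes "a" and "c" have the same degree, but "a" obtains the higher eigenvector score because its neighbours are themselves better connected.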

10
The initial sample of Barnett et al. and Barnett and Jun did not include Brazilian TLDs,
being comprised of the TLDs of the member nations of the OECD (except Poland) and six
generic TLDs (.com, .net, .edu, .mil, .org, .gov).
11
Understanding as nodes the countries to which the domains encountered pertain, the
centrality of each domain reflects its importance, influence and pre-eminence in the
network.
12
A node has a high eigenvector centrality when it is connected to many nodes which are
themselves connected to many nodes.
13
Bharat et al. did not include, in this 2001 work, data about the indegree or weighted
indegree of websites in the .br domain.

Figure 2: International Internet Hyperlink Structure. Reproduced from Barnett and Jun, 2004, slide
15 (figures also in Barnett et al., 2003). Thickness of the connection line is proportional to the
number of hyperlinks between two countries (50,000 links is the minimum value for a connection to
be indicated). The intensity of the circle representing each country indicates its centrality in the
network. Black arrows and country names were added to emphasize the features which interest the
argument herein developed.


It is also worth noticing that the analysis of the bandwidth network by the same
authors resulted in a graphical representation in which once more the US was the
most central country, followed by UK, Germany, Hong Kong, Singapore, Japan,
France and Italy. This time, Brazil occupies what appears to be a less central
position and is definitely more closely related to its neighbors in Latin America
(Figure 3).
Barnett et al. (2003) and Barnett and Jun (2004) consider that such results
corroborate the claims of World-System Theory (Wallerstein, 1979) in indicating
the existence of an information flow in the center-periphery direction, “with the
United States and the wealthier nations of Western Europe at the center and the
poor less developed nations of Latin America, Asia and Africa along the margins”
(Barnett et al., 2003, p. 11). With regard to bandwidth, the data concerning Brazil
seem to confirm the existence of a hierarchy between the central and peripheral
nations, in that there are structural connections from the periphery to the hub, but
not among the peripheral nations themselves. “The U.S. dominates internet flows
due to its central position in the network. While there are some flows entirely within
Europe or the Asian-Pacific region and limited flows within Latin America, flows
between these localities primarily go through the U.S.” (Barnett et al., 2003, p. 11).


Figure 3: International Internet Infrastructure. Reproduced from Barnett and Jun, 2004, slide 16
(figures also in Barnett et al., 2003). Thickness of the connection line is proportional to the
bandwidth capacity between two countries (13Mbps is the minimum value for the presence of a
connection to be indicated). Colors indicate membership in a cluster. Black arrows and country
names were added to emphasize the features which interest the argument herein developed.

Returning to Figure 2, it is reiterated that the degree of centrality of Brazilian
domains in the projection of the international hyperlink structure is greater than
would be expected. Looking at the detail of the inlinks and outlinks found for each
nationality (ccTLD) in the work of Barnett et al. and Barnett and Jun, a still more
intriguing result appears: the sites with a .br domain feature as receiving more
international links than they provide (Table 2).
At first sight it can appear that the larger number of links into sites with a .br
domain than the number of links out of these is indicative of an information flow
that runs from the informationally rich nations to the informationally poor ones,
corroborating once more World-System Theory. Links between websites,
however, are not equivalent to tracks through which goods, capital, humans - or
even information itself – flow, as if going from the source anchor to the link target.
On the contrary: at least since the seminal work in which Brin and Page
presented the prototype of Google (1998), the quantity of inlinks received by a web
page has been widely accepted as an indicator of its importance. According to the
rationale behind this argument, the creation of a link functions as an endorsement
of the destination page by the publisher who established the connection. Thus,
when I place a link to AoIR on my personal webpage, I give the AoIR site a link
with which I declare it to be a destination that I consider pertinent to the readers of
my Web page. In the terms of Walker (2002), with this reference I “create value” for
the AoIR site. It is evident that the value that a given page is capable of
adding to another by the establishment of a link is proportional to the value of
the page that contains the outlink: a connection on the first page of Yahoo! (which
receives a high number of daily visitors) adds more value to the AoIR site
than a connection from my personal page (which occasionally goes days without a
single visit).

Country              OutDegree           InDegree            (OutDegree-InDegree)
                     (n° of links sent   (n° of links
                     by node)            received by node)
Norway                 1325859             4071733               -2745874
USA                   12870134            15604977               -2734843
Sweden                 1300896             3158267               -1857371
Italy                  3449312             4839254               -1389942
Finland                1075524             2417304               -1341780
Spain                  1220562             2509513               -1288951
Brazil                 1531697             2602113               -1070416
Belgium                1262965             2314083               -1051118
France                 3902700             4810245                -907545
Netherlands            1727226             2519543                -792317
Switzerland            2644175             2785815                -141640
Canada                 3095233             3093532                   1701
Denmark                 975896              946539                  29357
New Zealand             810539              451855                 358684
Republic of Korea      2073988             1183832                 890156
Taiwan                 2326265             1054423                1271842
Australia              5426344             2560601                2865743
Japan                  4903376             1258347                3645029
United Kingdom        13199222             3158211               10041011
Germany               21057460             1654674               19402786
Table 2: data from Barnett et al. (2003, s.p.) and Barnett and Jun (2004, s.p.) partially reproduced
and reorganized (in increasing order of the OutDegree - InDegree difference). The intention is to
illustrate the difference between the quantity of inlinks and outlinks.
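
The difference column of Table 2 can be recomputed from the other two columns. The following is a minimal sketch in Python using five of the rows; the names TABLE_2 and link_balance are illustrative only.

```python
# (OutDegree, InDegree) for a subset of the rows in Table 2.
TABLE_2 = {
    "Norway": (1325859, 4071733),
    "USA": (12870134, 15604977),
    "Brazil": (1531697, 2602113),
    "Canada": (3095233, 3093532),
    "Germany": (21057460, 1654674),
}


def link_balance(table):
    """Links sent minus links received, per country."""
    return {country: sent - received for country, (sent, received) in table.items()}


balance = link_balance(TABLE_2)
# Same ordering as the table: increasing OutDegree - InDegree.
ordered = sorted(balance, key=balance.get)
# A negative balance means the country receives more links than it sends.
net_receivers = [country for country in ordered if balance[country] < 0]
print(balance["Brazil"], net_receivers)
```

A negative value, as in the case of Brazil, indicates a domain that receives more international links than it provides.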

Seen through this lens, the pre-eminence of inlinks to sites in the .br domain found
in the samples of Barnett et al. (2003) and Barnett and Jun (2004) appears
somewhat more paradoxical: what attributes would lead publishers of various
nationalities to create a profusion of links to a developing, Latin American country,
which speaks Portuguese? It is true that Brazil was not part of the set of
nationalities initially selected to comprise the authors’ sample, which probably
created a bias toward the inclusion of .br pages that possessed international inlinks
(to the exclusion of the many that do not receive such links). There is no reason,
though, for the methods and criteria in the sampling of Barnett et al. and Barnett
and Jun to have induced selection of pages with lower numbers of international
outlinks. Further, with respect to the degree of outlinkage of sites in the .br domain it


must be added that Brazil also appears among the 20 ccTLDs with the highest
weighted outdegree14 in the research done by Bharat and Ruhl (2001, p. 4-5).
Veloso et al. (2000, p. 5) observe that most of the content within websites with a .br
domain is in Portuguese (more than 75%). With this being so, the existence of a
significantly larger number of international inlinks in comparison to outlinks in
Brazilian websites contradicts the supposed concentration of the international
structure of the web around websites in English (Keniston, 1999). It also goes
against the current conception of the Brazilian people and institutions as being
particularly open and desirous of international contact but without the country being
able to attract the interest of other nations around the world.

Impasses and Resolutions


It was with these paradoxes in mind that we proposed an investigation of the
structure of the hyperlinks connecting sites on the World Wide Web that are
registered with Brazilian TLDs (.br) with sites of other nationalities. Our
fundamental premise was that the configuration of the sites that represent different
countries would provide an important portrait of the relationships established
between these countries (within and without the digital communication networks).
Such configurations would thus serve as complementary geographical variables,
which would help reveal and provide understanding of how information and
communication technologies are reconfiguring the proximities and distances of the
contemporary world.

14
Outdegree being the value that represents the number of distinct hosts to which the host
in question provides links, the weighted outdegree is the total number of hyperlinks
established by the host in question to the pages of other hosts. Conversely, the indegree is
the number of distinct hosts which link to the corresponding host and weighted indegree is
the number of hyperlinks to pages on corresponding host from other hosts (Bharat and
Ruhl, 2001, p. 3)
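
The four measures defined in this note can be computed from a plain list of hyperlinks. The following is a minimal sketch in Python; the host names are invented, each tuple stands for one hyperlink from a page on the first host to a page on the second, and the function name degree_measures is illustrative rather than taken from Bharat and Ruhl.

```python
from collections import defaultdict

# Hypothetical hyperlinks as (source host, target host) pairs, one per link.
LINKS = [
    ("a.example.br", "x.example.org"),
    ("a.example.br", "x.example.org"),
    ("a.example.br", "y.example.org"),
    ("b.example.br", "x.example.org"),
]


def degree_measures(links):
    """Outdegree, weighted outdegree, indegree and weighted indegree per host."""
    targets_of = defaultdict(set)   # distinct hosts each host links to
    sources_of = defaultdict(set)   # distinct hosts linking to each host
    links_out = defaultdict(int)    # total hyperlinks sent to other hosts
    links_in = defaultdict(int)     # total hyperlinks received from other hosts
    for source, target in links:
        if source == target:
            continue  # only links between different hosts are counted
        targets_of[source].add(target)
        sources_of[target].add(source)
        links_out[source] += 1
        links_in[target] += 1
    hosts = set(targets_of) | set(sources_of)
    return {
        host: {
            "outdegree": len(targets_of.get(host, ())),
            "weighted_outdegree": links_out.get(host, 0),
            "indegree": len(sources_of.get(host, ())),
            "weighted_indegree": links_in.get(host, 0),
        }
        for host in hosts
    }


measures = degree_measures(LINKS)
print(measures["a.example.br"])
```

In this example the first host links to two distinct hosts through three hyperlinks, so its outdegree is 2 and its weighted outdegree is 3.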

Our first objective was to produce a mapping of the pattern of links between
Brazilian websites and those of other nationalities, projecting the potential
information flows (corresponding to the international hyperlinks) onto a traditional
world map. Having done this we would target our second objective, namely the
qualitative characterization, by sample, of the hyperlinks found in the most visible
sites (those that receive the highest number of international inlinks) within the most
significant flows (initially understood as the most numerous).
The first stage of the research required the building of a sample of websites in the
.br domain that receive (or emit) links from (or to) sites in other countries. So that
we could ensure the collection of all the data needed to carry out the second stage,
we needed to know not only the address (URL) of all of the .br sites that made up
the sample, but also be certain that we also recorded the addresses of the sites
that originated or were the destination of the verified international inlinks and
outlinks relating to the .br sites. We believed that this could be done quickly: it
would be sufficient to reproduce the method adopted by Barnett et al. (2003) and
Barnett and Jun (2004), which was to search via AltaVista using the query syntax
<domain:xx AND link:yy> where .xx and .yy correspond to the most frequent
TLDs15 and to the .br domain. We were aware of some concerns about the use of
AltaVista for collecting this type of data, but it appeared to us, at first, that the
simplicity of the procedure compensated for any inaccuracies that we
might find.
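
The set of queries implied by this procedure can be written out mechanically. The following is a minimal sketch in Python that only builds the query strings in the <domain:xx AND link:yy> form quoted above; it does not contact any search engine, the function name build_queries is hypothetical, and the short TLD list is an example rather than the list used in the research.

```python
def build_queries(tlds, focus="br"):
    """Build one query per direction between the focus TLD and every other TLD.

    The key (source, target) reads as: pages under .source that link to .target.
    """
    queries = {}
    for tld in tlds:
        if tld == focus:
            continue  # links within the focus domain are not international
        queries[(focus, tld)] = f"domain:{focus} AND link:{tld}"
        queries[(tld, focus)] = f"domain:{tld} AND link:{focus}"
    return queries


queries = build_queries(["uk", "de", "br"])
for pair, query in queries.items():
    print(pair, query)
```

Each TLD other than .br yields two queries, one for outlinks from .br and one for inlinks to .br.
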
The results we obtained, however, were shockingly inconsistent, indicating a level
of instability in the search engine that was impossible to ignore (Figure 4).

15
In the preliminary conception of our investigation, these were the top 20 TLD names by
host count in January 2005 according to the Internet Systems Consortium (http://www.isc.org/).

Figure 4: At top left the screen of AltaVista with the result of a search using the syntax indicated by
Barnett et al. <domain:br AND link:uk>. The engine found 0 pages. In the centre above, the result of
a search with just the command <domain:br> and a result of 194,000,000 pages. To the right
above, the result of a search with the command <link:uk> and the outcome: 448 pages. A second
trial was made with the command <domain:.br AND link:.br> (37 pages, lower left) and <link:.br>
(65 pages, lower right). Evidently there should not exist only 65 pages with links to other .br domain
items amongst the 194,000,000 pages listed by AltaVista. The majority of the pages found using the
query term link contained the text <link> and <.br> but, curiously, not all of them proved to have this
connection with the search command used.

A series of complementary tests using AltaVista confirmed the stability of queries
using <domain:>. The term <link:> proved to be the source of the inconsistencies.
According to the special search terms description given by AltaVista Help,
searches with <link:> require the specification of a complete destination URL
(<link:URL> as in <link:www.unisinos.br>), with just the definition of a TLD name
(<link:.br>) not being sufficient (Overture Services, 2005, s.p.). The evidence
suggests that there has been a change in the functionality of the query term <link:>
since the construction of the sample used by Barnett et al. (2003) and Barnett and
Jun (2004), which may well have been linked to the acquisition of AltaVista by
Overture in February 2003 (Richardson, 2003, s.p.) and the later acquisition of
Overture by Yahoo! in the second half of 2003 (Cullen, 2003, s.p.).
Tests combining <domain:> and <link:> using a complete URL for this latter term
(for example <domain:.ca AND link:www.unisinos.br>) produced more consistent
results. The specification of complete URLs rather than TLDs was, however,
incompatible with the aims of our research. Despite not encountering any
wildcard16 character indicated as valid for AltaVista, we tried some of the more
common alternatives, substituting parts of the URL with <*>, <?>, <$>, <%> and
<&> (Figure 5). It was not possible to regain the functionality of the <domain:.yy>
syntax this simply. Nor did we find any other search engine that allowed us to use
in conjunction query terms with functionality equivalent to the syntax <domain:xx
AND link:yy>.

16
A wildcard is a special symbol which stands for one or more characters.

Figure 5: To the left, the results of searching for sites with .br domain that offer links to
http://www.aoir.org (9 results) and sites within the .ar domain that provide links to
http://www.unisinos.br (281 results). To the right, corresponding searches that attempted potential
wildcards (0 results).


The impossibility of straightforwardly reproducing the methods used by
Barnett et al. (2003) and Barnett and Jun (2004) caused us to consider the use of
crawlers17 as an alternative for data collection. Our first concern with this was the
inexperience of the team: we are well aware that a badly formed or badly instructed
crawler could have effects similar to those of a distributed denial of service (DDoS)
attack, causing disruption and damage to sites and servers. We could obviously
have run a pre-built, pre-tested, publicly available crawler (such as Halavais’
Informicant (2001) or Thelwall’s SocSciBot (2004)). This would not, however, solve
another problem inherent in the use of crawlers to collect samples from the World
Wide Web, namely the fact that whatever starting point is chosen for a crawl, it
functions as a strong ‘gravitational source’, generating a bias towards the subset of
pages that are close to it. This can be mitigated by choosing a large and diverse
set of initial starting points for the crawl and collating the results or, preferably,
by performing a sufficiently long crawl to overcome the gravitational pull of the
starting point. Either of these strategies implied more demand on our universities’
computing facilities than we could reasonably expect to be made available to us. In
addition, the literature indicates that small and random samples from the Web are
inadequate for investigations concerning the pattern of international linkage not just
due to their low representativity but also because “any small pseudorandom
sample of pages from the World Wide Web is likely to consist entirely of isolates”
(Halavais, 2003, p. 1).
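
The trade-off between starting points and crawl length can be made concrete with a depth-limited crawl. The following is a minimal sketch in Python under stated assumptions: it walks an invented in-memory link table instead of fetching real pages, the names limited_crawl and FAKE_WEB are hypothetical, and the delay parameter merely stands in for the politeness measures a real crawler would need. It is not the crawler used by any of the authors cited.

```python
import time
from collections import deque


def limited_crawl(seeds, fetch_links, max_depth=2, delay=0.0):
    """Breadth-first crawl from several seeds, stopping at max_depth.

    fetch_links(page) must return the pages that `page` links to.
    Returns a dict mapping each page reached to its distance from a seed.
    """
    depth = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if depth[page] >= max_depth:
            continue  # do not follow links beyond the depth limit
        if delay:
            time.sleep(delay)  # pause between requests to spare the servers
        for target in fetch_links(page):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


# Hypothetical link structure standing in for real pages.
FAKE_WEB = {"s1": ["p1"], "p1": ["p2"], "p2": ["p3"], "s2": ["q1"], "q1": []}

reached = limited_crawl(["s1", "s2"], lambda page: FAKE_WEB.get(page, []), max_depth=2)
print(reached)
```

Every page reached lies within two links of one of the seeds, which is the ‘gravitational’ bias described above; raising max_depth or adding seeds widens the sample at the cost of more requests.
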
We returned, therefore, to our decision to use search engines to construct our
samples. Search engines certainly do not index more than a fraction of the WWW,
tend to be biased along linguistic or national lines (Vaughan and Thelwall, 2004, p.
13) and provide results with variable degrees of inconsistency more often than not.
17
A crawler (also known as a spider) is a software agent that follows hyperlinks
systematically and according to a set of heuristics. In the process it can record
information about the documents these links are embedded in, which is usually done for the
purpose of building an index.

On the other hand, as it was already planned that the second stage of the research
would concentrate on the sites with the highest international visibility, it appeared
reasonable to us to consider the indexing and retrieval by the search engines as an
additional indicator of the visibility of the .br sites in the sample selection. There
are, after all, three basic methods of finding a website: (a) using search engines,
(b) following hyperlinks given on other sites – which is the reasoning behind the
presupposition that there is a directly proportional relationship between the number
of inlinks and the visibility of the site – or (c) entering a previously known URL
(supplied by a person or institution either online or offline, remembered from a
previous visit to the site, etc.). The analysis of hyperlinks is based, above all, on
access of type (b). The use of search engines for the composition of the sample involves
access of type (a), and in so doing emphasizes the considerably higher visibility of
the sites indexed by the search engines when compared to those not indexed. The
use of inlinks as a measure of visibility by the search engines themselves, which is the
basis of Google’s PageRank and is now also used by Yahoo!, reinforces the
presence of the access modes (a) and (b) in our sample. To include, minimally,
access of type (c)18 we also added to our sample some of the Top 100 Third-Level
Domain Names as indicated by the Internet Systems Consortium for July 2005,
whose countries of origin (indicated by the ccTLDs) were included in our initial list
of ccTLDs for analysis.
None of these considerations means, however, that the largest of all the obstacles
encountered to date has been overcome. As with AltaVista, none of the other
search engines that we tested was capable of carrying out searches that combined
a restriction of domains with the location of inlinks in the way that our research
required. We opted therefore to adopt a mixed procedure, combining the potential
of the search engines with that of individual site mappings made using
limited-reach crawlers. This way we ended up formulating an alternative set of techniques
to collect and process data for hyperlink samples intended for qualitative analysis.
This multi-stage process, conceived for this research, seeks not only to analyze
quantitatively the international information flow suggested by the
hyperlinks, but above all to provide a qualitative approach to international
hyperlinks.

18
Which, it is worth noting, emphasizes a type of visibility whose inducers can be located
on or off the Web.

To guarantee the greatest possible confidence in the sample, and bearing in mind
that only a minority of searchers use one search engine exclusively (Dierkes and
Yen, 2005, n.p.), we chose to work with the two most popular search engines,
Google and Yahoo!. Together they represent almost 70% of online searches done by
US home and work web surfers (Sullivan, 2005, n.p.). These two search engines
are also the most popular amongst Brazilian users (Ibope/Netratings, 2003, n.p.)
and allow for searches restricted by domain (using the command <site:.br>). Last
but not least, they are genuinely distinct search providers, not just different search
engines that use one or more databases in common (Sullivan, 2004, n.p.).
The searches for the ccTLD .br were complemented by sweeps made with both tools
that addressed the Brazilian Second Level Domains (SLDs) with the largest number of
hosts, with hosts that supposedly have access frequencies higher than expected
(for example the .gov.br sites (Ibope/Netratings, 14/07/2005)) and/or with an
expectation of higher numbers of international links (for example the .edu.br
sites19). In total, seven searches were made by TLD or SLD (.br, .com.br, .org.br,
.gov.br, .edu.br, .ind.br and .inf.br) with each engine. To increase the redundancy of
the sample, the same procedures were carried out on two separate occasions (on
the 21st and 23rd of June, 2005).
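
As an illustration only, the search plan described above (seven domains, two engines, two dates) could be enumerated as in the following Python sketch. The function name, the record layout and the date strings are assumptions made for this example; the searches themselves were run by hand in the engines' own interfaces.

```python
from itertools import product

# Values taken from the text; the ISO date format is an assumption of this sketch.
ENGINES = ["Google", "Yahoo!"]
DOMAINS = [".br", ".com.br", ".org.br", ".gov.br", ".edu.br", ".ind.br", ".inf.br"]
DATES = ["2005-06-21", "2005-06-23"]


def build_search_plan(engines=ENGINES, domains=DOMAINS, dates=DATES):
    """Return one record per (date, engine, domain) search (hypothetical helper)."""
    return [
        {"date": date, "engine": engine, "query": f"site:{domain}"}
        for date, engine, domain in product(dates, engines, domains)
    ]


plan = build_search_plan()
print(len(plan))  # 28 searches: 7 domains x 2 engines x 2 dates
```

Each record corresponds to one of the result lists that make up the raw sample.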
Before starting each series of searches, the total number of DNS hosts registered20
was checked and recorded by TLD and SLD for that date. The ratio between the
number of results found by each search engine and the total number of registered
hosts should function as an indication of the representativity of the sample with
respect to the complete universe of hosts. It is not reasonable to use for this
calculation the number of results indicated by the engines at the end of each
search (in the form <1-100 results of about n,000 for…>). After all, just as the
total number of pages that a search engine claims to have indexed in its database is
always questionable, the page estimates listed at the top of the web results page
have never been accurate either21. Additionally, the degree of success of the
clustering22 done by each search engine would greatly influence the total number
of results indicated and the number of unique DNS hosts effectively located.

19
A good part of the Brazilian universities do not use the SLD .edu.br, being registered as
just .br (Comitê Gestor da Internet no Brasil, 20/09/2005, n.p.).
20
Public access data, available at http://registro.br.

21
A good listing and discussion of the factors that compromise the reliability of this data
can be found, for example, in Price, 2005, n.p.
22
In the present context, clustering means showing only one page from the same DNS host per
results page.

Figure 6: Partial view of Google’s first results page for site:.edu.br on 23rd June, 2005.

The number of results effectively offered by the engines23 varied between 800 and
1,000 addresses. These were stored as HTML files (Figure 6). The raw initial
sample was composed of 28 lists with an average of 900 addresses per list
(around 25,200 URLs). The URLs of the host pages were manually extracted from
the HTML files (Figure 7) and then organized and counted using a Perl script
(repeated addresses were replaced by an indication of how often they appeared
in the original list). Repetitions of host pages were infrequent, indicating that
the clustering of both search engines, Google and Yahoo!, is efficient. The total
number of results with unique URLs listed was considered to be the real sample
size obtained for each TLD or SLD and will be used to calculate the quantitative
representativity of the samples.
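
The counting step can be sketched as follows. This is not the Perl script used in the research, which is not reproduced in this paper, but a minimal Python illustration under stated assumptions: addresses are reduced to their host names before counting, and the host names and figures in the usage lines are hypothetical. All function names are invented for this example.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of an address (scheme optional)."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).hostname or ""


def count_hosts(urls):
    """Collapse repeated addresses into (host, occurrences) pairs,
    ordered by decreasing number of occurrences."""
    hosts = (host_of(u) for u in urls)
    return Counter(h for h in hosts if h).most_common()


def representativity(unique_hosts_found, registered_hosts):
    """Ratio of unique hosts located to hosts registered on the search date."""
    return unique_hosts_found / registered_hosts


# Hypothetical addresses, for illustration only.
sample = [
    "http://www.example.edu.br/",
    "http://www.example.edu.br/about",
    "http://other.example.edu.br/",
]
counted = count_hosts(sample)
print(counted)
print(representativity(len(counted), 100))  # 100 is a made-up host total
```

Dividing the number of unique hosts by the registered total for the same date, rather than using the engines' own estimates, follows the reasoning given above about the unreliability of announced result counts.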

23
In other words, those results whose URLs were really made available by the engine, in
contrast to the total results that the engine announced as having been located but the
addresses of which were not made available to the user.

Figure 7: Partial view of three phases of the cleaning process for Google’s first results page for
site:.edu.br on 23rd June, 2005.

A comparison of the cleaned and counted results listings indicated important
variations in the results obtained by Google on the two search dates. Differences of
the same scale were noted throughout the entire first half of 2005, on various
occasions and at different times of each month. Because of this, we discarded the
hypothesis that our sample had been coincidentally affected by an event similar to
the phenomenon known as ‘Google Dance’24. Seeking greater consistency in the
sample, we chose to work with the collated list from both search dates for each
category and search engine. We thus set to work with 14 lists, each with an initial
count of about 1,800 URLs. After reduction of duplicates in the combined lists and
refinement of the clustering, the results for each category over the two days each
contained between 1,000 and 1,500 addresses (Figure 8).
These final cleaned lists were compared using another Perl script that had been
specially written for the research. Cross-referencing the data from the different lists
made it possible to identify which sites were indicated by which search engine
(just by Yahoo!, just by Google, or by both) and, in each case, how often a
given site figured in the results.
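
A comparison of this kind might look like the Python sketch below. It is an illustration and not the script written for the research; the function name, the labels "google", "yahoo" and "both", and the sample data are assumptions of the example.

```python
def cross_lists(google_counts, yahoo_counts):
    """Classify each site by the engine(s) that returned it.

    Both arguments map a site address to its number of occurrences.
    Returns {site: (source, google_occurrences, yahoo_occurrences)}.
    """
    result = {}
    for site in sorted(set(google_counts) | set(yahoo_counts)):
        g = google_counts.get(site, 0)
        y = yahoo_counts.get(site, 0)
        if g and y:
            source = "both"
        elif g:
            source = "google"
        else:
            source = "yahoo"
        result[site] = (source, g, y)
    return result


# Hypothetical counted lists for one category, for illustration only.
google = {"www.example.edu.br": 2, "a.example.edu.br": 1}
yahoo = {"www.example.edu.br": 1, "b.example.edu.br": 3}
for site, row in cross_lists(google, yahoo).items():
    print(site, row)
```

Keeping the per-engine occurrence counts alongside the label preserves the information on how often each site figured in the results.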

24
‘Google Dance’ was a commonly used name for the index update of the Google
search engine, which was undertaken about once a month and during which the Google
search results varied significantly. It has been reported that since 2003 Google has been
updating its index continuously, so that the Google Dance no longer happens. There are
claims, however, that there has to be an update of the complete index once in a while and
that this could still cause similar disruptions in Google’s search results.

Figure 8: Google’s results for site:.edu.br on 23rd June, 2005 after counting (left) and organized in
decreasing number of occurrences (right).

We are currently working on cross-referencing the lists by domain type and between
search engines. Once we have the results of this process, we shall move on to
mapping the sites that have appeared with the highest frequency in our searches.
We believe that a local crawler, covering up to five levels of depth, will be
sufficient to obtain a first indication of the existence or otherwise of international
outlinks. To locate inlinks, searches will be made with the query syntax
<domain:.xx AND link:URL> targeting the host page of each site and a random set
of pages from the five levels identified through the mapping.
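
The mapping step could be sketched as a breadth-first walk limited by depth, as in the Python example below. It is a sketch under several assumptions, not a description of the crawler to be used: the caller supplies a get_links function that returns the links found on a page, any link whose host is not under .br is treated as international (the text does not settle how generic TLDs should be handled), and pages at levels 0 to 5 are examined. All names and example addresses are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit

MAX_DEPTH = 5  # the host page counts as level 0


def is_br(url):
    """True when the host of the URL is under the .br ccTLD."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".br")


def find_international_outlinks(start_url, get_links, max_depth=MAX_DEPTH):
    """Walk one site breadth-first and record links that leave .br.

    get_links(url) must return the list of URLs linked from that page.
    Returns a list of (depth, source_page, target_url) tuples.
    """
    site_host = urlsplit(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    found = []
    while queue:
        page, depth = queue.popleft()
        for target in get_links(page):
            if not is_br(target):
                found.append((depth, page, target))
            elif (urlsplit(target).hostname == site_host
                  and target not in seen and depth < max_depth):
                # Same-site page not yet visited: schedule it one level deeper.
                seen.add(target)
                queue.append((target, depth + 1))
    return found


def inlink_query(cctld, url):
    """Build the inlink query string in the syntax quoted in the text."""
    return f"domain:{cctld} AND link:{url}"


# Hypothetical in-memory link graph standing in for fetched pages.
graph = {
    "http://www.example.edu.br/": ["http://www.example.edu.br/a",
                                   "http://www.example.org.ar/"],
    "http://www.example.edu.br/a": ["http://www.example.ac.uk/x",
                                    "http://www.example.edu.br/"],
}
links = find_international_outlinks("http://www.example.edu.br/",
                                    lambda u: graph.get(u, []))
print(links)
print(inlink_query(".ar", "http://www.example.edu.br/"))
```

Passing the page-fetching function as a parameter keeps the depth logic separate from network access, so the same walk can be checked against a small hand-made graph before being pointed at real sites.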
With the mapping in hand we will move on to the qualitative analysis of the verified
international hyperlinks, in principle characterizing them by function (structural or
associative, as in Obendorf and Weinreich, 2003), the types of elements that
function as source and destination anchors (text or image), the meaning of the
source and destination anchors and their context (in the middle of text, at the start
of text, at the end of text, in a list, etc.). The type of website will also be analyzed
(looking, over and above the SLD type, at the effective content of the page), as will
the depth at which the international hyperlink is located (the host page being level
0).
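
One possible way of recording these characteristics for each hyperlink is sketched below in Python. The record layout, the field names and the category labels are assumptions made for illustration; the coding scheme itself is still to be defined in the course of the analysis.

```python
from dataclasses import dataclass
from enum import Enum


class LinkFunction(Enum):
    STRUCTURAL = "structural"
    ASSOCIATIVE = "associative"


class AnchorType(Enum):
    TEXT = "text"
    IMAGE = "image"


class AnchorContext(Enum):
    START_OF_TEXT = "start of text"
    MIDDLE_OF_TEXT = "middle of text"
    END_OF_TEXT = "end of text"
    LIST = "list"
    OTHER = "other"


@dataclass(frozen=True)
class CodedHyperlink:
    """One international hyperlink with its qualitative coding (hypothetical layout)."""
    source_url: str
    target_url: str
    function: LinkFunction
    source_anchor: AnchorType
    destination_anchor: AnchorType
    anchor_meaning: str      # free-text note on what the anchors convey
    context: AnchorContext
    site_type: str           # judged from the effective content, not only the SLD
    depth: int               # host page = level 0

    def __post_init__(self):
        if self.depth < 0:
            raise ValueError("depth must be 0 or greater")


# Hypothetical example record.
link = CodedHyperlink(
    source_url="http://www.example.edu.br/a",
    target_url="http://www.example.ac.uk/x",
    function=LinkFunction.ASSOCIATIVE,
    source_anchor=AnchorType.TEXT,
    destination_anchor=AnchorType.TEXT,
    anchor_meaning="partner institution",
    context=AnchorContext.LIST,
    site_type="university",
    depth=1,
)
print(link.function.value, link.depth)
```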

Conclusion
In the conception of our research, we planned to reproduce the procedures
adopted by Barnett et al. (2003) and Barnett and Jun (2004), considering this
replication to be a preliminary step that would be completed quickly. However, on
attempting to replicate the procedure we found that it was now impossible to obtain
the desired results using the techniques described. Reviewing the literature, we
considered and discussed some alternative techniques for the collection of
international linkage data – all of which demanded resources that we did not have
available.
Faced with the non-viability of collecting a quantitatively representative sample of
.br sites with international hyperlinks, we sought to develop a set of procedures
that would allow us to obtain data from samples that may be smaller than
usual. The methodological strategies that we are proposing are based upon
chaining together a sequence of qualitative selections and, as a result, are
particularly labor intensive.

We consider that, despite being small by the quantitative standards of
representativity, the sample that we are close to completing is
representative of the universe of .br domain sites that are effectively accessible to,
and most accessed by, World Wide Web users. It will be particularly
interesting to compare it with the results obtained by researchers who have used
the traditional techniques of Hyperlink Analysis or of Webometrics to construct their
samples, above all in what it reveals with respect to the identification of the main
potential flows defined by international hyperlinks to and from the .br sites.

Bibliography

Barnett, G. A. and S. J. Jun, “An Examination of the Determinants of International Internet
Structure”, Proceedings of IR 5.0: ubiquity?, 2004. Association of Internet
Researchers, available online at http://www.aoir.org/ (13th September, 2005)
[restricted access]

Barnett, G. A. et al., “The Structure of International Internet Flows”, Proceedings of IR 4.0:
broadening the band, 2003. Association of Internet Researchers, available online at
http://www.aoir.org/ (21st March, 2004) [restricted access]

Borgatti, S.P., “Centrality and network flow”, Social Networks, volume 27 number 1, p. 55-
71. 2005: Analytic Technologies, available online at
http://www.analytictech.com/borgatti/papers/centflow.pdf (28th September, 2005).

Brin, S. and Page, L., "The Anatomy of a Large-Scale Hypertextual Web Search Engine".
Proceedings of the Seventh International World-Wide Web Conference, Elsevier
Science B.V., 1998. Available at
http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm (12th June,
2005).

CIA, CIA World Factbook. Available online at http://www.cia.gov/cia/publications/factbook/
(20th September, 2005).

Comitê Gestor da Internet no Brasil, Registro.br. Available online at
http://registro.br/estatisticas.html (20th September, 2005)

Cullen, D., Yahoo!buys!Overture!, The Register: Sci/Tech News for the World, 14th July
2003. Available online at
http://www.theregister.co.uk/2003/07/14/yahoo_buys_overture/ (14th September,
2005)

Dierkes, M. and T. Yen, Nielsen//NetRatings MegaView Search, Press Release, 28th
February, 2005. Available online at http://www.nielsen-netratings.com/news.jsp
(20th March, 2005)

Freeman, L.C., “Centrality in social networks: conceptual clarification”, Social Networks,
volume 1, number 3, 1979. Science Direct, available online at
http://www.sciencedirect.com/science?_ob=IssueURL&_tockey=%23TOC%235969
%231978%23999989996%23327546%23FLP%23&_auth=y&view=c&_acct=C000
050221&_version=1&_urlVersion=0&_userid=10&md5=948e9f9229196e2d3b50a1
216080ca55 (21st March, 2004). [restricted access]

Gomes, D. and M. J. Silva, “Characterizing a National Community Web”. ACM
Transactions on Internet Technology, volume 5, number 3, August 2005. ACM
Digital Libraries, available online at http://www.acm.org/dl.cfm (3rd September,
2005) [restricted access]

Halavais, A. M. C., Measuring National Borders on the World Wide Web. Thesis submitted
in partial fulfillment of the requirements for the degree of Master of Arts, University
of Washington, 1998. Available online at http://alex.halavais.net/research/thesis.pdf
(20th September, 2003).

Halavais, A. M. C., The Slashdot Effect: Analysis of a Large-Scale Public Conversation on
the World Wide Web. Dissertation submitted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy, University of Washington,
2001. Available online at http://alex.halavais.net/news/archives/halavais-ir3b.pdf
(14th September, 2005).

Halavais, A. M. C., Networks and Flows of Content on the World Wide Web. International
Communication Association Convention, 2003. Available online at
http://alex.halavais.net/research/halavais-ica03a.pdf (12th March, 2005).

Ibope/Netratings, IBOPE eRatings: brasileiros batem recorde de tempo de navegação, 11
horas e 48 minutos. Press Release, 1st August, 2003. Available online at
http://www.ibope.com.br/calandraWeb/servlet/CalandraRedirect?temp=5&proj=Port
alIBOPE&pub=T&db=caldb&comp=Internet&docid=0C4A6C58E5C42D3483256EC
A00657AC7 (14th August, 2005)

Ibope/Netratings, Relatório Analisa uso da Internet no Brasil, Estados Unidos e Espanha.
Press Release, 14th July, 2005. Available online at
http://www.ibope.com.br/calandraWeb/servlet/CalandraRedirect?temp=5&proj=Port
alIBOPE&pub=T&db=caldb&comp=Not%EDcias&docid=4325C3D9D3EDC999832
5703E00567FC7 (14th August, 2005)

Internet Systems Consortium, ISC HomePage. Available online at http://www.isc.org (30th
September 2005).

Keniston, K., “Language, Power and Software” in C. Ess and F. Sudweeks (eds.) Culture,
Technology, Communication: Towards an Intercultural Global Village, 1999. New
York, Suny Press. Available online at
http://web.mit.edu/~kken/Public/PDF/Language%20Power%20Software.pdf (20th
September, 2005)

Obendorf, H. and H. Weinreich, “Comparing link marker visualization techniques: changes
in reading behavior”. Proceedings of WWW2003: 12th International World Wide
Web Conference, 2003, Budapest. Available online at
http://www2003.org/cdrom/index.html (14th September, 2005).

Overture Services, AltaVista Help, Search, Special Search Items. Available online at
http://www.altavista.com/help/search/syntax (30th May, 2005).

Page, L. et al., The PageRank Citation Ranking: Bringing Order to the Web, Stanford
Digital Library Technologies Project, 1998. Available at
http://dbpubs.stanford.edu/pub/1999-66 (10th March, 2005).

Price, G., More on the Total Database Size Battle and Googlewhacking With Yahoo.
Search Engine Watch Blog, 11th August, 2005. Available online at
http://blog.searchenginewatch.com/blog/050811-231448 (17th September, 2005)

Richardson, T., “Altavista flogged to Overture”, The Register: Sci/Tech News for the World,
19th February 2003. Available online at
http://www.theregister.co.uk/2003/02/19/altavista_flogged_to_overture/ (14th
September, 2005)

Sullivan, D., Who Powers Whom? Search Providers Chart. Search Engine Watch Reports,
23rd July, 2004. Available online at
http://searchenginewatch.com/reports/article.php/2156401 (7th June, 2005)

Thelwall, M., Research Note: in praise of Google finding law journal websites. Online
Information Review, volume 26, number 4, 2002, p. 271-272. Available online at
http://www.scit.wlv.ac.uk/~cm1993/papers/2002_In_praise_of_Google.pdf (18th
September, 2005).

Thelwall, M., SocSciBot 3, Link crawler for the social sciences. Available online at
http://socscibot.wlv.ac.uk/ (29th September, 2005)

Thelwall, M., The Responsiveness of Search Engine Indexes. Cybermetrics, International
Journal of Scientometrics, Informetrics and Bibliometrics, volume 5, 2001. Available
online at http://www.cindoc.csic.es/cybermetrics/cybermetrics.html (15th July,
2005)

Vaughan, L. and M. Thelwall, “Search Engine Coverage Bias: Evidence and Possible
Causes”. Information Processing and Management: an International Journal,
volume 40, issue 4, May, 2004. ACM Digital Libraries, available online at
http://www.acm.org/dl.cfm (25th September, 2005) [restricted access]

Veloso, E. et al., “Um retrato da web brasileira”, Anais do XXI Seminário Integrado de
Hardware e Software (SEMISH 00), 2000, Curitiba, Paraná, Brazil. Available online
at http://stat.akwan.com.br/~golgher/semish00.ps.gz. (15th July, 2005)

Walker, J., "Links and Power: The Political Economy of Linking on the Web", Proceedings
of the thirteenth ACM conference on Hypertext and hypermedia - Hypertext 2002.
Baltimore: ACM Press, 2002. 78-79. ACM Digital Libraries, available online at
http://www.acm.org/dl.cfm (7th June, 2005) [restricted access]

Wallerstein, I., El Moderno Sistema Mundial. Madrid, Siglo Veintiuno Editores, 1979.

Weinreich, H. et al., The Look of the Link - Concepts for the User Interface of Extended
Hyperlinks. Proceedings of the Twelfth ACM conference on Hypertext and
Hypermedia – Hypertext 2001. Denmark, ACM Press, 2001. 19-28. ACM Digital
Libraries, available online at http://www.acm.org/dl.cfm (3rd September, 2005)
[restricted access].
