
Jointly published by Akadémiai Kiadó, Budapest and Kluwer Academic Publishers, Dordrecht

Scientometrics, Vol. 54, No. 1 (2002) 119–130

A new rank-size distribution of Zipf's Law and its applications

GUOHUA JIANG,1 SHI SHAN,2 LAN JIANG,2 XUESONG XU2
1 China National Institute for Educational Research, Beijing (China)
2 Department of Management and Information Engineering, Shanghai University, Shanghai (China)

Developing a probability function to describe rank-size Zipfian phenomena, i.e., a form like $P(R = r) \sim c/r^{\beta}$ ($\beta > 0$) with a rank-type random variable $R$, has been an important problem in scientometrics and informetrics. In this article a new rank-size distribution of Zipf's law is presented and applied to an actual distribution of scientific productivities in Chinese universities.

Introduction

This article presents an exact theoretical model for the rank-size form of Zipf's law. By the rank-size form of Zipf's law we mean a mathematical relationship in which the $r$th largest of a set of quantities is, in some specified sense, approximately proportional to $r^{-\beta}$, where $\beta > 0$. In many different fields a plot of size against rank commonly yields a close fit to such a law. For example, a plot against $r$ of the scientific productivity of the $r$th most productive university, or of the income of the $r$th richest unit, or of the number of articles on a given subject in the journal having the $r$th largest number of such articles, seems to fit the Zipf form.

We use the article-university terminology for vividness, but the argument is of course general (a more general treatment will be given in a later article). Consider the set of universities that published at least one article in a given period, denoted by set A, and the set of articles produced by the universities of set A in that period, denoted by set B. Let the number of universities in set A be $m$ and the number of articles in set B be $n$. We define the size of a university in set A by its scientific productivity (i.e., the number of its articles in set B). Let the units in set A be arranged in order of decreasing size and let $r$ be the position in that list, called the rank; the ranks of units of equal size are assigned arbitrarily among them.
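The ranking construction described above can be illustrated with a minimal sketch. The function name `empirical_rank_distribution`, the toy data, and the choice to break ties by unit name are assumptions made for illustration; the text only requires that ties be broken arbitrarily.

```python
from collections import Counter

def empirical_rank_distribution(article_owners):
    """Return (rank, size, share) tuples, largest unit first.

    article_owners: one entry per article in set B, naming the unit
    (university) in set A that produced it.
    """
    sizes = Counter(article_owners)      # size = number of articles per unit
    n = sum(sizes.values())              # total number of articles in set B
    # Sort by decreasing size; ties are broken by name, which is one
    # arbitrary but reproducible way to rank units of equal size.
    ordered = sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(r, size, size / n) for r, (_, size) in enumerate(ordered, start=1)]

# Hypothetical toy data: six articles produced by three universities.
toy = ["U1", "U2", "U1", "U3", "U1", "U2"]
dist = empirical_rank_distribution(toy)
```

Here `len(dist)` plays the role of $m$, the sum of the sizes is $n$, and the third component of each tuple is the observed share of articles at rank $r$, the empirical counterpart of $P(R = r)$.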

0138–9130/2002/US $ 15.00 Copyright © 2002 Akadémiai Kiadó, Budapest. All rights reserved

G. JIANG et al.: A new rank-size distribution of Zipf's Law

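The probability of the event that an article in set B is produced by the university having rank $r$ in set A is denoted by $P(R = r)$, $r = 1, 2, \ldots, m$. $P(R = r)$ is called the rank-size distribution. We present a new exact rank-size distribution as follows:

$$P(R = r) = \left\{1 + \frac{\alpha}{\beta-1}\left[1 - \frac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]\right\}^{-1} \frac{\alpha^{(r-1)}}{(\alpha+\beta)^{(r-1)}}, \qquad (1)$$

for $r = 1, 2, \ldots, m$, where $\alpha$ ($> 0$) and $\beta$ ($> 0$) are parameters, and $\alpha^{(r)} = \alpha(\alpha+1)\cdots(\alpha+r-1)$. It can be shown that

$$\sum_{r=1}^{m} P(R = r) = 1$$

and $P(R = r) \sim c\,r^{-\beta}$ as $r \to \infty$ (with $m \to \infty$ as well), where $c$ is a constant. The latter property holds through the following relation

$$\frac{\alpha^{(r-1)}}{(\alpha+\beta)^{(r-1)}} = \frac{\Gamma(\alpha+r-1)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\alpha+\beta+r-1)} \sim \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)}\, r^{-\beta},$$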

where $\Gamma(x)$ is the gamma function of $x$.

A historical note about rank-size laws

Lotka (1925), in mentioning the rank-size relations observed in the populations of American cities, refers to Auerbach's law and erroneously gives the date of publication of the relevant paper as 1923. In fact, the paper was published ten years earlier (Auerbach, 1913). In it an approximate constancy of the product rank × size was established for the 47 largest cities in Germany (census of 1910). Condon (1928) refers to word counts by Ayres (1915) and ventures an explanation of the rank-size relation in terms of Weber's law of psychology. Zipf (1935) discussed the constancy of rank × size. Zipf himself later suggested a generalization of his rank-size law (Zipf, 1949), namely, $(\text{rank})^{\beta} \times \text{size} = \text{constant}$, $\beta > 0$.


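With $\beta$ as an extra free parameter, clearly more of the observed rank-size curves could be fitted than without it. Mandelbrot (1953) was able to derive Zipf's rank-size curve as a consequence of an assumption related to Zipf's principle of least effort, namely, the minimization of cost, given the amount of information to be conveyed. Mandelbrot obtained from his model the formula $P_r = c/(r+b)^{\beta}$, in which $P_r$, being the frequency of occurrence, represents size, $r$ is rank, and $c$, $b$ and $\beta$ are constants. This formula is called Zipf's generalized rank-size law. For the historical context of Zipf's work, see Rousseau (2000), in which a brief biography of Zipf is also given.

A new rank-size distribution

From the following formula

$$\frac{s\left[1 - (1-\theta)^{m} s^{m}\right]}{1 - (1-\theta)s} = \sum_{r=1}^{m} (1-\theta)^{r-1} s^{r},$$

where $0 < \theta < 1$, we have

$$f(s) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \frac{s\left[1 - (1-\theta)^{m} s^{m}\right]}{1 - (1-\theta)s}\, \theta^{\beta-1}(1-\theta)^{\alpha-1}\, d\theta$$

$$= \sum_{r=1}^{m} \left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \theta^{\beta-1}(1-\theta)^{\alpha+r-2}\, d\theta\right] s^{r} = \sum_{r=1}^{m} \frac{\alpha^{(r-1)}}{(\alpha+\beta)^{(r-1)}}\, s^{r}, \qquad (2)$$

where $\alpha > 0$, $\beta > 0$ and

$$\alpha^{(r-1)} = \begin{cases} 1 & \text{if } r = 1, \\ \alpha(\alpha+1)\cdots(\alpha+r-2) & \text{if } r > 1. \end{cases}$$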
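The step from the Beta integral to the ratio of ascending factorials in Eq. (2) can be checked numerically. The following is a minimal sketch; the function names and the parameter values are illustrative assumptions, and the integral is evaluated through the identity $B(x, y) = \Gamma(x)\Gamma(y)/\Gamma(x+y)$ rather than by quadrature.

```python
import math

def coeff_product(alpha, beta, r):
    """Coefficient of s^r in f(s) as a ratio of ascending factorials."""
    value = 1.0
    for k in range(r - 1):               # empty product when r = 1
        value *= (alpha + k) / (alpha + beta + k)
    return value

def coeff_beta_integral(alpha, beta, r):
    """Same coefficient from the Beta integral, evaluated through log-gamma."""
    # Gamma(a+b) / (Gamma(a) Gamma(b)) * B(b, a + r - 1), with
    # B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y).
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_beta_fn = (math.lgamma(beta) + math.lgamma(alpha + r - 1)
                   - math.lgamma(alpha + beta + r - 1))
    return math.exp(log_norm + log_beta_fn)

alpha, beta = 4.78, 1.35                 # illustrative values only
max_gap = max(abs(coeff_product(alpha, beta, r) - coeff_beta_integral(alpha, beta, r))
              for r in range(1, 51))
```

The two routes should agree up to floating-point error, so `max_gap` is expected to be negligibly small for any positive `alpha` and `beta`.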


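This finite power series, $f(s)$, is called the generating function of the sequence $\{\alpha^{(r-1)}/(\alpha+\beta)^{(r-1)},\ r = 1, 2, \ldots, m\}$. Obviously, we have

$$f(1) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \left[\theta^{\beta-2}(1-\theta)^{\alpha-1} - \theta^{\beta-2}(1-\theta)^{\alpha+m-1}\right] d\theta = 1 + \frac{\alpha}{\beta-1}\left[1 - \frac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]. \qquad (3)$$

The new rank-size distribution (i.e., the formula given in Eq. (1)) can be expressed as

$$P(R = r) = \frac{\alpha^{(r-1)}}{f(1)\,(\alpha+\beta)^{(r-1)}}, \qquad r = 1, 2, \ldots, m.$$

Putting $g(s) = f(s)/f(1)$, we obtain the generating function of the sequence $\{P(R = r),\ r = 1, 2, \ldots, m\}$ as

$$g(s) = \frac{\Gamma(\alpha+\beta)}{f(1)\,\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \frac{s\left[1 - (1-\theta)^{m} s^{m}\right]}{1 - (1-\theta)s}\, \theta^{\beta-1}(1-\theta)^{\alpha-1}\, d\theta = \sum_{r=1}^{m} P(R = r)\, s^{r}, \qquad (4)$$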

where $P(R = r)$ is the new rank-size distribution. The generating function determines the new rank-size distribution uniquely. Since the right side of Eq. (4) is a power series in $s$, we can use the properties of power series to great advantage. The function $g(s)$ is infinitely differentiable for $|s| < \infty$ and the $k$th derivative of $g(s)$ is given by

$$g^{(k)}(s) = \sum_{r=k}^{m} r(r-1)\cdots(r-k+1)\, P(R = r)\, s^{r-k},$$

where $k \le m$. In particular,

$$g'(s) = \sum_{r=1}^{m} r\, P(R = r)\, s^{r-1}$$

and $g'(1) = E(R)$. We have the following property (the proof is given in the appendix).

Property. Let $R$ be the rank-type random variable having the rank-size distribution given in Eq. (1). It holds that


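$$E(R) = g'(1) = 1 + \frac{\alpha(\alpha+\beta-1)}{(\beta-1)(\beta-2)} \cdot \frac{1 - \dfrac{(\alpha+1)^{(m)}}{(\alpha+\beta-1)^{(m)}}\left[1 + \dfrac{(\beta-2)\,m}{\alpha+m}\right]}{1 + \dfrac{\alpha}{\beta-1}\left[1 - \dfrac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]} = h(\alpha, \beta), \qquad (5)$$

with

$$E(R) = \begin{cases} \displaystyle\lim_{\beta\to 1} h(\alpha,\beta) = \frac{1 + \alpha\left[m - 1 + (1-\alpha)\sum_{i=1}^{m-1}\dfrac{1}{\alpha+i}\right]}{1 + \alpha\sum_{i=1}^{m-1}\dfrac{1}{\alpha+i}} & (\beta = 1), \\[3ex] \displaystyle\lim_{\beta\to 2} h(\alpha,\beta) = 1 - \alpha + \frac{\alpha(\alpha+m)}{m}\sum_{i=0}^{m-1}\frac{1}{\alpha+1+i} & (\beta = 2) \end{cases}$$

as two special cases for $\beta = 1$ and $\beta = 2$. Thus Eq. (5) is a reasonable formula for the expected value of the rank-type random variable $R$, which follows the new rank-size distribution.

Using the gamma function and noting that

$$\alpha^{(r-1)} = \frac{\Gamma(\alpha+r-1)}{\Gamma(\alpha)},$$

we have

$$\frac{\alpha^{(r-1)}}{(\alpha+\beta)^{(r-1)}} = \frac{\Gamma(\alpha+r-1)\,\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\alpha+\beta+r-1)}.$$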

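The gamma function has the property

$$\lim_{r\to\infty} \frac{\Gamma(\beta+r)}{\Gamma(r)\, r^{\beta}} = 1.$$

Hence for large $r$

$$\frac{\alpha^{(r-1)}}{(\alpha+\beta)^{(r-1)}} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)} \cdot \frac{1 + o(1)}{(r+\alpha-1)^{\beta}}. \qquad (6)$$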
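The power-law behaviour in Eq. (6) can be illustrated numerically. The sketch below compares the exact ratio of ascending factorials with its leading term $\Gamma(\alpha+\beta)\,(r+\alpha-1)^{-\beta}/\Gamma(\alpha)$; the function names and the parameter values are assumptions chosen for illustration, and logarithms of gamma functions are used to avoid overflow.

```python
import math

def log_exact_ratio(alpha, beta, r):
    """log of alpha^(r-1) / (alpha+beta)^(r-1), written with gamma functions."""
    return (math.lgamma(alpha + r - 1) + math.lgamma(alpha + beta)
            - math.lgamma(alpha) - math.lgamma(alpha + beta + r - 1))

def log_leading_term(alpha, beta, r):
    """log of Gamma(alpha+beta) / Gamma(alpha) * (r + alpha - 1)^(-beta)."""
    return (math.lgamma(alpha + beta) - math.lgamma(alpha)
            - beta * math.log(r + alpha - 1))

alpha, beta = 4.78, 1.35                 # illustrative values only
ranks = (10, 100, 1000, 10000)
# Exact value divided by the leading term; this should approach 1 as r grows.
relative = [math.exp(log_exact_ratio(alpha, beta, r) - log_leading_term(alpha, beta, r))
            for r in ranks]
deviation = [abs(x - 1.0) for x in relative]
```

For these values the entries of `deviation` shrink as the rank increases, which is what the factor $1 + o(1)$ in Eq. (6) expresses.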

For an infinite population, i.e., $m \to \infty$, the new rank-size distribution holds only if the parameter $\beta > 1$, and it has the form

$$P(R = r) = \frac{(\beta-1)\,\alpha^{(r-1)}}{(\alpha+\beta-1)\,(\alpha+\beta)^{(r-1)}}. \qquad (7)$$
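A minimal sketch of the finite distribution of Eq. (1) and the infinite form of Eq. (7) follows. The function names and the test parameters ($\alpha = 2$, $\beta = 3$, $m = 2000$) are assumptions for illustration. The terms are built with the recurrence $t_{k+1} = t_k(\alpha+k)/(\alpha+\beta+k)$, which avoids overflowing the ascending factorials for large $m$.

```python
def ratio_terms(alpha, beta, m):
    """Return [alpha^(r-1) / (alpha+beta)^(r-1) for r = 1..m] via a recurrence."""
    terms, t = [], 1.0
    for k in range(m):
        terms.append(t)
        t *= (alpha + k) / (alpha + beta + k)
    return terms

def normalizer_closed_form(alpha, beta, m):
    """1 + alpha/(beta-1) * (1 - (alpha+1)^(m-1)/(alpha+beta)^(m-1)); needs beta != 1."""
    phi = 1.0
    for i in range(m - 1):
        phi *= (alpha + 1 + i) / (alpha + beta + i)
    return 1.0 + alpha * (1.0 - phi) / (beta - 1.0)

def finite_pmf(alpha, beta, m):
    """P(R = r) for r = 1..m, normalized by direct summation."""
    terms = ratio_terms(alpha, beta, m)
    total = sum(terms)
    return [t / total for t in terms]

def infinite_pmf(alpha, beta, r):
    """Eq. (7) for a single rank r; requires beta > 1."""
    if beta <= 1:
        raise ValueError("Eq. (7) needs beta > 1")
    return (beta - 1.0) / (alpha + beta - 1.0) * ratio_terms(alpha, beta, r)[-1]

alpha, beta, m = 2.0, 3.0, 2000          # illustrative values only
p = finite_pmf(alpha, beta, m)
norm_gap = abs(sum(ratio_terms(alpha, beta, m)) - normalizer_closed_form(alpha, beta, m))
```

With $\beta > 1$ and large $m$ the finite probabilities approach those of Eq. (7); the convergence is slow when $\beta$ is only slightly above 1, because the truncated tail decays like $m^{-(\beta-1)}$.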


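For a finite population, i.e., when $m$ is finite, the new distribution is valid for any positive value of the parameter $\beta$. In particular, when $\beta = 1$, using the relation

$$\frac{d}{d\beta}\, \frac{1}{(\alpha+\beta)^{(m-1)}} = -\frac{1}{(\alpha+\beta)^{(m-1)}} \sum_{i=0}^{m-2} \frac{1}{\alpha+\beta+i} \qquad (\alpha > 0,\ \beta > 0),$$

we can show by l'Hôpital's rule that

$$\lim_{\beta\to 1} \left\{1 + \frac{\alpha}{\beta-1}\left[1 - \frac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]\right\} = 1 + \alpha \sum_{i=1}^{m-1} \frac{1}{\alpha+i},$$

then the new distribution reduces to

$$P(R = r) = \frac{\alpha/(\alpha+r-1)}{1 + \alpha \sum_{i=1}^{m-1} \dfrac{1}{\alpha+i}}$$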

as a limiting case for $\beta \to 1$.

From Eq. (6), putting

$$c = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\left\{1 + \dfrac{\alpha}{\beta-1}\left[1 - \dfrac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]\right\}}$$

and expressing the new distribution as

$$P(R = r) = c\, \frac{1 + o(1)}{(r+\alpha-1)^{\beta}}$$

for large $r$ (and large $m$, $r \le m$), it can be concluded that the random variable $R$ of the new distribution approximately follows Zipf's generalized rank-size law.

It is worth noting that the formula given in Eq. (7) takes the form (put $\rho = \beta - 1$, $\beta > 1$, in Eq. (7))

$$P(R = r) = \frac{\rho\, \alpha^{(r-1)}}{(\alpha+\rho)^{(r)}}, \qquad (8)$$

which bears a great resemblance to the Waring distribution in form but not in content. The latter is defined by

$$P(X = k) = \frac{a\, b^{(k-1)}}{(a+b)^{(k)}} \qquad (k = 1, 2, \ldots),$$

where $a > 0$, $b > 0$, and $X$ is a size-type random variable (while $R$ in Eq. (8) is a rank-type random variable). The underlying chance mechanisms, from which the rank-size distribution is derived, will be considered in another paper.
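Two statements of this section lend themselves to a numerical check: the limiting form of the distribution for $\beta \to 1$, and the formal identity between Eq. (7) and the Waring probabilities when $a = \beta - 1$ and $b = \alpha$. The sketch below is illustrative only; the function names and parameter values are assumptions.

```python
def finite_pmf(alpha, beta, m):
    """P(R = r) for r = 1..m, built from the term recurrence and normalized."""
    terms, t = [], 1.0
    for k in range(m):
        terms.append(t)
        t *= (alpha + k) / (alpha + beta + k)
    total = sum(terms)
    return [x / total for x in terms]

def limiting_pmf_beta_one(alpha, m):
    """Limiting form for beta -> 1, with a harmonic-type normalizer."""
    norm = 1.0 + alpha * sum(1.0 / (alpha + i) for i in range(1, m))
    return [alpha / (alpha + r - 1) / norm for r in range(1, m + 1)]

def infinite_pmf(alpha, beta, r):
    """Infinite-population probability of rank r (beta > 1)."""
    value = (beta - 1.0) / (alpha + beta - 1.0)
    for j in range(r - 1):
        value *= (alpha + j) / (alpha + beta + j)
    return value

def waring_pmf(a, b, k):
    """Waring probability a * b^(k-1) / (a+b)^(k)."""
    value = a / (a + b)
    for j in range(k - 1):
        value *= (b + j) / (a + b + 1 + j)
    return value

alpha, m = 4.78, 50                      # illustrative values only
near_one = finite_pmf(alpha, 1.0 + 1e-8, m)
limit = limiting_pmf_beta_one(alpha, m)
limit_gap = max(abs(x - y) for x, y in zip(near_one, limit))

beta = 2.5
waring_gap = max(abs(waring_pmf(beta - 1.0, alpha, r) - infinite_pmf(alpha, beta, r))
                 for r in range(1, 31))
```

The agreement in `waring_gap` concerns the formula only; as noted above, the Waring variable is a size, whereas $R$ is a rank.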


Applications

The rank-size distribution is used to describe the actual distribution of international articles (those written in foreign languages) in Chinese universities (the actual data are available in the book Ranking List of Scientific Productivities for Chinese Universities, edited by the Institute of Scientific and Technical Information of China).

Table 1. Observed and expected rank-size distribution of international articles in Chinese universities

Rank    Observed value    Expected value
1       230               296.0
2       174               230.7
3       168               186.9
4       143               155.8
5       130               132.7
6       98                115.0
7       98                101.0
8       82                89.7
9       77                80.5
10      72                72.8
11      70                66.3
12      69                60.7
13      65                55.9
14      58                51.7
15      57                48.1
16      54                44.9
17      54                41.2
18      47                39.4
19      47                37.1
20      43                35.0
...     ...               ...
(108 rank classes)
Total   3248

Parameter estimates: $\beta = 1.35$, $\alpha = 4.78$
$\chi^2$ fit test: $\chi^2 = 119.49$, d.f. = 125, $P \approx 0.45$


In Table 1 we give the detailed results for the 20 highest annual scientific productivities among Chinese universities in 1992, together with the summary statistics. As seen from the table, the rank-size distribution represents the data extremely well. Although the sample size is large ($n = 3248$) and the observed tail is very long, we could not reject the new rank-size distribution hypothesis, as the associated $\chi^2$ probability is $p \approx 0.45$ for 125 degrees of freedom. A regression estimation model for the parameter $\beta$, based on the relation $\log P(R = r) \sim \log c - \beta \log r$, is developed, which enables one to estimate $\beta$ in advance. The parameter $\alpha$ is then estimated by equating the sample mean with the corresponding population moment $E(R)$.

This new distribution has also been used as a model for word rank-size counts. Some observed distributions quoted in the literature were fitted and the results seem encouraging. This work will be reported elsewhere.
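The two-step estimation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: ordinary least squares on the upper ranks stands in for the regression model, bisection stands in for the moment equation solver, and the data are synthetic probabilities generated from the model itself, so the true parameters are known. All function names are hypothetical.

```python
import math

def model_pmf(alpha, beta, m):
    """P(R = r), r = 1..m, for the rank-size model (terms built by recurrence)."""
    terms, t = [], 1.0
    for k in range(m):
        terms.append(t)
        t *= (alpha + k) / (alpha + beta + k)
    total = sum(terms)
    return [x / total for x in terms]

def model_mean(alpha, beta, m):
    """E(R) by direct summation."""
    return sum(r * p for r, p in enumerate(model_pmf(alpha, beta, m), start=1))

def estimate_beta(probs, first_rank):
    """Negative OLS slope of log P(R = r) on log r, using ranks >= first_rank."""
    pts = [(math.log(r), math.log(p))
           for r, p in enumerate(probs, start=1) if r >= first_rank and p > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    return -sxy / sxx

def estimate_alpha(sample_mean, beta, m, lo=1e-6, hi=1e4, iters=100):
    """Solve model_mean(alpha, beta, m) = sample_mean by bisection.

    Relies on E(R) increasing with alpha for fixed beta and m.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if model_mean(mid, beta, m) < sample_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic check with known parameters (illustrative values only).
true_alpha, true_beta, m = 4.78, 1.35, 200
probs = model_pmf(true_alpha, true_beta, m)
beta_hat = estimate_beta(probs, first_rank=100)
alpha_hat = estimate_alpha(model_mean(true_alpha, true_beta, m), true_beta, m)
```

On such synthetic input the regression slope recovers $\beta$ only approximately, because $\log P(R = r)$ is linear in $\log(r + \alpha - 1)$ rather than in $\log r$, whereas the moment equation recovers $\alpha$ to numerical precision once $\beta$ is fixed.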
We are grateful to Dr. Mari Davis for inviting us to present this paper at the conference. This work has been supported by a research grant from the Shanghai Education Foundation.

References

AUERBACH, F. (1913), Das Gesetz der Bevölkerungskonzentration, Petermanns Geographische Mitteilungen, 59 : 74–76.
AYRES, L. P. (1915), A Measuring Scale for Ability in Spelling. New York: Russell Sage.
CONDON, E. U. (1928), Statistics of vocabulary, Science, 67 : 300.
GLÄNZEL, W., TELCS, A., SCHUBERT, A. (1984), Characterization by truncated moments and its application to Pearson-type distributions, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 66 : 173–183.
HILL, M. (1974), The rank-frequency form of Zipf's Law, Journal of the American Statistical Association, 69 : 1017–1026.
KRETSCHMER, H. (2000), Configurations in international coauthorship, (in Chinese) In: JIANG GUO-HUA (Ed.), Research Evaluation and Its Indicators, Beijing: Red Flag Press, pp. 95–117.
LOTKA, A. J. (1925), Elements of Mathematical Biology, New York: Dover.
MANDELBROT, B. (1953), An information theory of the statistical structure of language. In: W. JACKSON (Ed.), Communication Theory. New York: Academic Press, pp. 486–502.
ROUSSEAU, R. (2000), George Kingsley Zipf: Life, ideas and recent development of his theories, (in Chinese) In: JIANG GUO-HUA (Ed.), Research Evaluation and Its Indicators, Beijing: Red Flag Press, pp. 458–467.
ZIPF, G. K. (1935), The Psycho-biology of Language: An Introduction to Dynamic Philology. Cambridge, Mass.: M.I.T. Press. Reprinted: (1968).
ZIPF, G. K. (1949), Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology, Cambridge, Mass.: Addison-Wesley Press.


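Appendix

Noting the generating function

$$g(s) = \frac{\Gamma(\alpha+\beta)}{f(1)\,\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \frac{s - (1-\theta)^{m} s^{m+1}}{1 - (1-\theta)s}\, \theta^{\beta-1}(1-\theta)^{\alpha-1}\, d\theta,$$

where $f(1) = 1 + \alpha\left[1 - (\alpha+1)^{(m-1)}/(\alpha+\beta)^{(m-1)}\right]/(\beta-1)$, and using the formula

$$\int_0^1 \theta^{\beta-1}(1-\theta)^{\alpha+i-1}\, d\theta = \frac{\Gamma(\beta)\,\Gamma(\alpha+i)}{\Gamma(\alpha+\beta+i)} = \frac{\Gamma(\alpha)\Gamma(\beta)\,\alpha^{(i)}}{\Gamma(\alpha+\beta)\,(\alpha+\beta)^{(i)}} \qquad (i = 1, 2, \ldots),$$

we obtain

$$g'(1) = E(R) = \frac{1 + \dfrac{\alpha}{\beta-1}\left[1 - (m+1)\dfrac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right] + \dfrac{\alpha}{(\beta-1)(\beta-2)}\left[(\alpha+\beta-1) - (\alpha+1)\dfrac{(\alpha+2)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]}{1 + \dfrac{\alpha}{\beta-1}\left[1 - \dfrac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]}. \qquad (A1)$$

We get, with some algebra,

$$E(R) = 1 + \frac{\dfrac{\alpha(\alpha+\beta-1)}{(\beta-1)(\beta-2)} - \dfrac{\alpha(\alpha+1)\,(\alpha+2)^{(m-1)}}{(\beta-1)(\beta-2)\,(\alpha+\beta)^{(m-1)}} - \dfrac{\alpha\, m\,(\alpha+1)^{(m-1)}}{(\beta-1)\,(\alpha+\beta)^{(m-1)}}}{1 + \dfrac{\alpha}{\beta-1}\left[1 - \dfrac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]}$$

$$= 1 + \frac{\alpha(\alpha+\beta-1)}{(\beta-1)(\beta-2)} \cdot \frac{1 - \dfrac{(\alpha+1)^{(m)}}{(\alpha+\beta-1)^{(m)}}\left[1 + \dfrac{(\beta-2)\,m}{\alpha+m}\right]}{1 + \dfrac{\alpha}{\beta-1}\left[1 - \dfrac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}}\right]}. \qquad (A2)$$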
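The closed form for $E(R)$ can be compared with a direct evaluation of $\sum_r r\,P(R = r)$. The sketch below uses one algebraic arrangement of the closed form, taken here as an assumption to be tested, namely $1 + \frac{\alpha(\alpha+\beta-1)}{(\beta-1)(\beta-2)}\,[1 - \text{tail}]/f(1)$ with $\text{tail} = \frac{(\alpha+1)^{(m)}}{(\alpha+\beta-1)^{(m)}}\left[1 + \frac{(\beta-2)m}{\alpha+m}\right]$. Function names and parameter values are illustrative.

```python
def rising_ratio(a, b, k):
    """a^(k) / b^(k) for ascending factorials, multiplied factor by factor."""
    value = 1.0
    for i in range(k):
        value *= (a + i) / (b + i)
    return value

def mean_direct(alpha, beta, m):
    """E(R) as sum of r * P(R = r), with P normalized by direct summation."""
    terms = [rising_ratio(alpha, alpha + beta, r - 1) for r in range(1, m + 1)]
    return sum(r * t for r, t in enumerate(terms, start=1)) / sum(terms)

def mean_closed_form(alpha, beta, m):
    """Closed form for E(R); undefined at beta = 1 and beta = 2."""
    phi = rising_ratio(alpha + 1, alpha + beta, m - 1)
    f1 = 1.0 + alpha * (1.0 - phi) / (beta - 1.0)
    tail = (rising_ratio(alpha + 1, alpha + beta - 1, m)
            * (1.0 + (beta - 2.0) * m / (alpha + m)))
    prefactor = alpha * (alpha + beta - 1.0) / ((beta - 1.0) * (beta - 2.0))
    return 1.0 + prefactor * (1.0 - tail) / f1

# Illustrative parameter sets (alpha, beta, m); beta avoids 1 and 2.
cases = [(1.0, 3.0, 3), (4.78, 1.35, 100), (2.5, 2.7, 40), (1.0, 1.5, 2)]
gaps = [abs(mean_direct(a, b, m) - mean_closed_form(a, b, m)) for a, b, m in cases]
```

The closed form has removable singularities at $\beta = 1$ and $\beta = 2$, so values of `beta` very close to those points would suffer from cancellation in floating point; the limiting expressions are preferable there.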


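Let

$$\varphi(\beta) = \frac{(\alpha+1)^{(m-1)}}{(\alpha+\beta)^{(m-1)}} \quad\text{and}\quad \psi(\beta) = \frac{(\alpha+1)\,(\alpha+2)^{(m-1)}}{(\alpha+\beta-1)\,(\alpha+\beta)^{(m-1)}}$$

in (A1), then we have

$$E(R) = h(\alpha,\beta) = \frac{1 + \dfrac{\alpha}{\beta-1}\left[1 - (m+1)\varphi(\beta) + \dfrac{\alpha+\beta-1}{\beta-2}\bigl(1 - \psi(\beta)\bigr)\right]}{1 + \dfrac{\alpha}{\beta-1}\bigl(1 - \varphi(\beta)\bigr)}. \qquad (A3)$$

From

$$\lim_{\beta\to 1} \varphi(\beta) = 1 \quad\text{and}\quad \lim_{\beta\to 1} \psi(\beta) = \frac{\alpha+m}{\alpha},$$

it is obvious that

$$\lim_{\beta\to 1} \left[1 - (m+1)\varphi(\beta) + \frac{\alpha+\beta-1}{\beta-2}\bigl(1 - \psi(\beta)\bigr)\right] = 0.$$

Using l'Hôpital's rule and the relation

$$\frac{d}{dx}\, \frac{1}{(\alpha+x)^{(m-1)}} = -\frac{1}{(\alpha+x)^{(m-1)}} \sum_{i=0}^{m-2} \frac{1}{\alpha+x+i},$$

we have

$$\lim_{\beta\to 1} \frac{1 - (m+1)\varphi(\beta) + \dfrac{\alpha+\beta-1}{\beta-2}\bigl(1 - \psi(\beta)\bigr)}{\beta-1} = \lim_{\beta\to 1} \left[-(m+1)\varphi'(\beta) - \frac{\alpha+1}{(\beta-2)^2}\bigl(1 - \psi(\beta)\bigr) - \frac{\alpha+\beta-1}{\beta-2}\,\psi'(\beta)\right]$$

$$= m - 1 + (1-\alpha) \sum_{i=1}^{m-1} \frac{1}{\alpha+i} \qquad (A4)$$

and

$$\lim_{\beta\to 1} \frac{1 - \varphi(\beta)}{\beta-1} = \lim_{\beta\to 1} \bigl(-\varphi'(\beta)\bigr) = \sum_{i=1}^{m-1} \frac{1}{\alpha+i}. \qquad (A5)$$

From (A3), it follows that

$$E(R) = \lim_{\beta\to 1} h(\alpha,\beta) = \frac{1 + \alpha\left[m - 1 + (1-\alpha)\sum_{i=1}^{m-1}\dfrac{1}{\alpha+i}\right]}{1 + \alpha\sum_{i=1}^{m-1}\dfrac{1}{\alpha+i}} \qquad (A6)$$

for $\beta = 1$. Substituting

$$\lim_{\beta\to 2} \frac{1 - \psi(\beta)}{\beta-2} = \lim_{\beta\to 2} \bigl(-\psi'(\beta)\bigr) = \frac{1}{\alpha+1} + \sum_{i=0}^{m-2} \frac{1}{\alpha+2+i}$$

and

$$\lim_{\beta\to 2} \varphi(\beta) = \frac{\alpha+1}{\alpha+m}$$

into (A3) for the special case of $\beta = 2$ yields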


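$$E(R) = \lim_{\beta\to 2} h(\alpha,\beta) = \frac{1 + \alpha\left[1 - \dfrac{(m+1)(\alpha+1)}{\alpha+m} + (\alpha+1)\left(\dfrac{1}{\alpha+1} + \sum_{i=0}^{m-2}\dfrac{1}{\alpha+2+i}\right)\right]}{1 + \alpha\left(1 - \dfrac{\alpha+1}{\alpha+m}\right)}$$

$$= 1 - \alpha + \frac{\alpha(\alpha+m)}{m} \sum_{i=0}^{m-1} \frac{1}{\alpha+1+i} \qquad (A7)$$

for $\beta = 2$.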
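The two limiting expressions for $E(R)$ can be checked against direct summation, which needs no special treatment at $\beta = 1$ or $\beta = 2$. The sketch below is illustrative; the function names and parameter values are assumptions, and the $\beta = 2$ expression is used in the simplified form $1 - \alpha + \frac{\alpha(\alpha+m)}{m}\sum_{i=0}^{m-1}\frac{1}{\alpha+1+i}$.

```python
def mean_direct(alpha, beta, m):
    """E(R) by direct summation over r = 1..m, using the term recurrence."""
    total, weighted, t = 0.0, 0.0, 1.0
    for k in range(m):
        total += t
        weighted += (k + 1) * t          # rank r = k + 1
        t *= (alpha + k) / (alpha + beta + k)
    return weighted / total

def mean_beta_one(alpha, m):
    """Limiting value of E(R) for beta = 1."""
    h = sum(1.0 / (alpha + i) for i in range(1, m))
    return (1.0 + alpha * (m - 1 + (1.0 - alpha) * h)) / (1.0 + alpha * h)

def mean_beta_two(alpha, m):
    """Limiting value of E(R) for beta = 2, in simplified form."""
    h = sum(1.0 / (alpha + 1 + i) for i in range(m))
    return 1.0 - alpha + alpha * (alpha + m) / m * h

# Illustrative parameter sets (alpha, m).
cases = [(4.78, 108), (0.5, 10), (1.0, 2)]
gap_one = max(abs(mean_direct(a, 1.0, m) - mean_beta_one(a, m)) for a, m in cases)
gap_two = max(abs(mean_direct(a, 2.0, m) - mean_beta_two(a, m)) for a, m in cases)
```

For $\alpha = 1$ and $m = 2$ the values can be confirmed by hand: the direct sum gives $4/3$ for $\beta = 1$ and $5/4$ for $\beta = 2$.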

Received October 12, 2001.

Address for correspondence:
JIANG GUOHUA
China National Institute for Educational Research
Beijing 100088, China
