2010 - Ioan Iovitz Popescu - Zipfslawanotherview (Retrieved-2014!07!26)

Qual Quant (2010) 44:713731
DOI 10.1007/s11135-009-9234-y
Zipfs lawanother view

Ioan-Iovitz Popescu Gabriel Altmann
Reinhard Khler
Published online: 9 May 2009

Springer Science+Business Media B.V. 2009
Abstract In many scientific disciplines phenomena are observed which are usually
described using one of the various versions of the so-called Zipfs Law. As no single formula
could be found so far which would be able to sufficiently fit to all data more and more variants
(modifications, ad-hoc formulas, derivations etc.) are created, each of which agrees quite well
with some given data sets but fail with others. The present paper proposes a new approach to
the problem, based on the assumption that every data set which displays a Zipf-like structure
is composed of several system components. A corresponding model is presented and tested
on data from 100 texts from 20 languages.
Keywords
Zipfs law Rank-frequency distribution Synthetic language
The present paper argues on the background of linguistics but the line of argumentation holds
for a broad range of phenomena from economy over self-organising criticality (cf. sand-pile
experiments, geological structures etc., cf. Gell-Mann 1994; Bak 1999) to small world
systems.
George Kingsley Zipf, linguist and academic teacher of German, was the first to systematically investigate certain phenomena of the statistical structure of data sets from language,
demography and other areas. He found a stable relationship between the frequency rank of
words and their frequency or, represented in spectral form, between the word frequencies
and the size of the corresponding frequency classes.1 He gave a theoretical derivation of
his model but his argumentation was mathematically incorrect and suffered from ill-defined
1 In economics, this relationship became known as Paretos Law.
I.-I. Popescu
Bucharest, Romania
G. Altmann (B)
Sprachwissenschaftsliches Institut, Universitt Bouchum, Stttinghauser Ringstr. 44,
58515 Ldenscheid, Germany
e-mail: RAM-Verlag@t-online.de
R. Khler
Trier, Germany
123
714
I.-I. Popescu et al.
linguistic concepts (cf. Rapoport 1982). Nevertheless, his findings and the basic idea of his
model survived the flood of criticisms and denials, which appear in the literature until today.
Zipfs flawed model was later modified, extended, and transferred by numerous researchers
in various fields of application. It is not easy to get an overview over the diversity of the existing approaches. In Chitashvili and Baayen (1993) a few of them are reviewed and an own
approach is added. Some of the formulae which have been obtained were based on plausible justifications; others lacked any theoretical background and are therefore useful only for
descriptive purposes. The models with a theoretical derivation can be divided into two principally different types: (1) The Simon type (cf. Simon 1955). These models are based on
the assumption that the observed phenomena are due to specific processes in the generation
of the objects under study (in linguistic terms: cognitive processes of text generation). (2)
The Mandelbrot type (cf. Mandelbrot 1953). These approaches assume that the statistical
structure of the individual objects reflects the organization of a higher order system, a supersystem (in terms of linguistics: statistical text structure reflects the statistical organization
of a language as a wholein particular, the optimization of word production cost). Both
assumptions are, to some degree, plausible and at the same time unsatisfactory: If, under
assumption (1), the process of generation is responsible then an a priori probability distribution must be postulated, which remains unexplained. In the case of texts, this is the a piori
probability of words to occur in a text (e.g., in Simons derivation one of the two factors his
models consists of). Under assumption (2), if a super-system such as a language organizes
itself according to some principle such as word cost optimization then why should exactly
the probability distribution which results from an overall optimization be found in every
individual text, where deviations from the structure of the super-system would not do any
harm? We should be able to find at least some texts in which rare words occur rather often and
frequent words are infrequent or even absent and which display a form considerably deviant
from the Zipf-like shape.
In reality however, all texts in all languages seem to conform to the Zipfian shape more
or less closely; the observed differences can be taken account of by using a specific variant
of the many modifications and extensions of the original model. Hence, the general idea of
a universal principle behind the varying patterns seems to be justified. However, the current
situation is unsatisfactory because instead of a general model only a number of unrelated
alternatives, some of them ad-hoc creations, is available.
Several decades of research have shown that the frequency distributions of words in texts
show characteristics of individual texts as well as of languages. The degree of syntheticism of
a language influences the shape of the distributions: the more different word-forms a language
has the flatter becomes the rank-frequency distribution. The more analytic languages concentrate the frequency of the words in a limited number of classes. Therefore, the longer the
tail of a rank distribution (or, the larger the relative size of the first classes in the frequency
spectrum) the more synthetic is the given language. This typological characteristic disappears if the texts are lemmatized (i.e. if the word-forms are replaced by the corresponding
lemmasthe dictionary forms). There are many more sampling options which can change
the appearance of the distribution. We should also keep in mind that language data differ
considerably from most other kinds of data in many respects. Thus, a population in the sense
of a language as a whole does not exist in reality. Hence, texts can by no means be considered as samples from a population, and many generally accepted statistical methods may not
be applied to linguistic problems. Another consequence of the particularity of language is
that the law of large numbers is not valid for texts, i.e. an increase of text or corpus size does
not improve reliability; on the contrary, the longer a text the less its homogeneity, caused by
inevitably changing boundary conditions.
123
715
In any case, linguistic material seems to be composed of a number of components of

different origin, partly due to characteristics of individual languages, mainly, however, due
to factors of grammatical, semantic, and pragmatic (including stylistic) necessity and variability. Simple examples of grammatical factors are the difference between function words
(such as the, and, at) and content words (such as love, table, beautiful) and the existence
of word classes. Semantic factors can be illustrated by polysemy (many words have several
meanings, over which the number of occurrences of a word in a text may be distributed)
and by synonymy (one meaning can be expressed in several ways). Pragmatics includes the
view that the same expression can be used for various purposes and can be interpreted in
different ways. How many and which kinds of components should be taken into account for
an analysis depends on the purpose of the study and, in the first line, on the model applied and
the concepts related with it. Different views and different degrees of resolution are always
possible.
We do not disregard the fact that actually there are not any crisp boundary lines between,
e.g. function words (synsemantics) and content words (autosemantics) which could form
dichotomies, or between classes such as parts-of-speech. All these properties can be considered as continua, as features with gradual dimensions. Nevertheless, simplified models
which press continua in categories are often necessary in order to be useful and operable. We
will see below that even very simplistic assumptions can yield almost perfect results.
In this way, any linguistic data taken from authentic material can be considered as a composition of a number of components. Each component has a statistical structure on its own
right, and therefore a text must be regarded as a mixture of distributions. Text corpora, with
an even more inhomogeneous character, must of course be composed of an extremely large
number of components, which explains why the frequency distribution of text corpora does
not coincide with the distributions of individual texts.
Now, seeing the text as a stratified construct we may approach the problem in different
ways. The components we take into account represent a part of the boundary conditions.
If we reduce the text components to a unique view, i.e. press all components in just one
rank frequency mould, then usually the Zipf (or Zeta) distribution is the first and sufficient
approximation. But if the components vary in their proportions, the Zipfian approach does
not allow us to state how many components have been deranged, i.e. which or how many
boundary conditions must be taken into account. In such cases the usual strategy results in
modifications, extensions etc. to the classical formulae. If one works with distributions, the
following possibilities are given principally:
(1)
Compounding two distributions in such a way that one parameter of the parent distribution is itself a random variable with its own distribution (discrete or continuous).
One symbolizes compounding (of two distributions with two parameters each) as
P(x; a, b) P(a; c, d)
a
(2)
(1)
and obtains a new distribution P(x; b, c, d) after summing or integrating according

to a. This is, e.g. the origin of the negative hypergeometric distribution which is the
result of compounding the binomial distribution with the beta distribution. It is used
very frequently as a very general rank-frequency distribution. Compounding is a kind
of mixing.
Feller-type generalization of a distribution by another in such a way that the argument
t in the probability generating function of the parent distribution
is replaced by another
probability generating distribution. Symbolically, if G (t) = x Px t x is the probability
123
716
Table 1 Ranking of different

components
Rank
f1
f2
f3
f4
1
2
3
4
5
6
7
8
9
10
30
20
12
7
4
2
1
1
1
1
16
12
8
5
2
1
1
1
1
1
8
5
2
1
1
1
4
2
1
1
1
generating function of the parent distribution, then

G (H (t)) =
Px (H (t))x
(2)
(3)
is the probability generating function of the generalized distribution. The best known
class is the generalized Poisson family, defined as G(t) = exp(a(H (t) 1)). The
probabilities can be obtained by derivation.
Convolution of two distributions symbolized as P1 (x) P2 (x) that can be obtained
from the product of two probability generating functions G(t)H (t). The result is
Px =
x
P1 (x j)P2 ( j)
(3)
(4)
(5)
In all these forms the probabilities of two distributions are combined in such a way that
one obtains a new valid distribution.
Partitioning the curve in two or more sections, hereby separating sharply the vocabulary of the text in parts, which lack any linguistic significances (cf. Ferrer I Cancho and
Sol 2001; Ha and Smith 2004; Naranan and Balasubrahmanyan 1998; Tuldava 1996).
The result is

P1 (x), for x = 1, . . . , K
Px =
(4)
P2 (x), for x = K + 1, K + 2, . . . , V
Mixing is a plain addition of weighted portions of two distributions of the same or
different sort. Symbolically
Px = (1 )P1 (x) + P2 (x)
(5)
All these modifications can be extended to more variables. The first three yield mostly complex formulas and are not easy to justify linguistically. Nevertheless, some of the well known
distributions used in linguistics, though derived sometimes in a different way, are of the first
three kinds (Waring, Yule, Simon, Sichel, negative binomial, negative hypergeometric etc.).
The fourth method is easy but unrealistic because the rank-frequency sequence is a whole
in which classes (of any kind) are not separated, on the contrary, they are interwoven, a fact
that is ignored by (4).
The philosophy presented in our contribution corresponds to case (5), however not in
distribution form but as a function Hence the weighting parameter is not necessary and
instead of P we can write f . We consider only absolute frequencies and consider text as
123
717
Fig. 1 Mixing four frequency components
composed of frequency layers of different, however, not defined classes.2 One can illustrate
this situation using the hypothetical, meaningless data in Table 1, in which there are four
components with different frequencies of elements. The frequencies are ranked separately
and each ranking displays automatically a decreasing trend. The same situation is displayed
in Fig. 1.
Now, if we sum the components up then a greater frequency of the element of one component attains automatically the respective rank and the smaller frequencies of other components
are placed at the next ranks. In this way, we obtain the usual picture of rank-frequency distributions. In the above example we obtain the sequence: 30, 20, 16, 12, 12, 8, 8, 7, 5, 5, 2,
2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.
But now we consider it as a mixture of different layers which can be expressed in form
of a sum of functions or sequences
f (r ) = f 1 (r ) + f 2 (r ) + f 3 (r ) + + f k (r ) =
k
f i (r )
(6)
i=1
Now, f i (r ) can be chosen as a zeta function, a geometric sequence or, as is usual in decay
processes, as an exponential function (which can be presented as a geometrical sequence,
too), etc. There is no linguistically compelling reason to use the zeta function, hence we shall
adhere to the simplest special case of the Wimmer and Altmann (2005) given here by the
differential equation in which the relative rate of change of frequency equals the relative rate
of change of rank
d f (r )
dr
=
f (r )
ri
2 Miller et al. (1968) differentiated layers of function words and content words but studied only their lengths.
123
718
yielding the exponential function, and define the common ranking sequence as the sum of
these results, namely
f (r ) =
k
Ai exp(r/ri )
(7)
i=1
Since the individual elements of this sequence converge to zero, we add 1 without loss of
generality because there are no smaller frequencies than 1. Thus we obtain finally
f (r ) = 1 +
k
Ai exp(r/ri )
(8)
i=1
where the Ai s are the amplitudes and ri s are the rank scale constants of the individual
sequences. Formula (7) could be presented also as a distribution but since there is no software for fitting a multiple mixture of exponential distributions of this kind, we shall use (8)
as a continuous function. For the sake of illustration, we fit (8) using two mixed exponential
components to the rank-frequency sequence of word forms of 100 texts in 20 languages (see
Table 2). Thus, for instance, for Goethes Erlknig (G 17) we obtain R 2 = 0.9645 with
one component, R 2 = 0.9824 with two components, while the usual Zipfian fitting (power
curve) yields only R 2 = 0.9349. Also further modified Zipf fitting functions such as Zipf
Mandelbrot (R 2 = 0.9617), the right truncated zeta function (R 2 = 0.9202), and the negative hypergeometric distribution (R 2 = 0.7660) cannot compete with the presently proposed
Table 2 Fitting (8) with two mixed exponential components to 100 texts in 20 languages
ID
B 01
B 02
A1
r1
47.6626
11.8851
1.8310
9.8591
r2
R 2 present
R 2 Zipf
11.7204
1.8616
24.6430
22.8208
0.9878
0.9858
0.9837
0.8705
A2
B 03
11.3883
10.6940
3.8979
30.8570
0.9905
0.8790
B 04
18.2279
2.0296
9.5736
17.9197
0.9901
0.9619
B 05
12.2884
12.7005
10.4218
1.9141
0.9888
0.9367
Cz 01
89.8092
1.5034
10.0834
29.8773
0.9918
0.9764
0.9767
Cz 02
139.1469
0.7742
17.2807
19.6472
0.9821
Cz 03
425.0694
0.8345
54.2745
20.1402
0.9845
0.9832
Cz 04
70.6745
0.7588
7.1370
23.8965
0.9744
0.9537
Cz 05
194.2721
0.9509
15.1312
20.7528
0.9919
0.9715
E 01
142.7565
3.7490
16.3789
53.1611
0.9956
0.9620
E 02
166.6249
2.3215
47.3363
30.8223
0.9819
0.9661
E 03
262.9265
3.0351
34.0817
37.5747
0.9873
0.9752
E 04
507.9130
2.2501
39.3783
49.7372
0.9881
0.9870
0.9822
E 05
339.0010
2.4760
58.7252
33.4563
0.9802
E 07
255.4869
4.8007
35.4756
54.8410
0.9933
0.9347
E 13
793.2101
3.1952
106.0313
52.8682
0.9682
0.9800
G 05
32.3930
3.4098
4.6964
29.8436
0.9938
0.9646
G 09
25.7526
3.3468
8.1681
25.4876
0.9850
0.9626
G 10
17.3132
4.1391
4.6533
26.4403
0.9870
0.9402
G 11
17.1652
2.6471
5.8477
23.9797
0.9796
0.9593
123
719
Table 2 continued
ID
A1
r1
A2
r2
R 2 present
R 2 Zipf
G 12
25.1615
0.6162
8.9478
9.1158
0.9873
0.9514
G 14
7.0453
7.6272
5.3720
1.5982
0.9867
0.9349
G 17
6.3872
14.9185
6.1660
2.4572
0.9824
0.9349
H 01
692.3412
0.8109
21.2372
23.5043
0.9847
0.9600
H 02
1047.3379
0.4214
36.7323
6.3047
0.9809
0.9365
H 03
165.3035
0.7452
3.9648
14.8825
0.9962
0.8864
H 04
132.8165
1.5791
5.3124
33.1712
0.9923
0.9451
H 05
Hw 03
53.8344
1.5057
4.4435
15.7462
0.9732
0.9093
335.2463
2.6859
63.4549
32.2045
0.9866
0.9489
Hw 04
523.6375
3.6290
156.9179
30.5724
0.9845
0.9154
Hw 05
414.8637
7.2201
74.9028
52.8651
0.9911
0.8742
Hw 06
865.3743
3.0457
275.3415
27.1237
0.9868
0.9352
I 01
381.6416
6.4239
52.9336
84.5937
0.9886
0.9336
I 02
251.6319
5.5728
26.3236
86.1026
0.9906
0.9559
I 03
832.7973
0.3395
20.6710
13.6196
0.9854
0.9523
I 04
118.2232
5.0067
24.1761
53.5882
0.9884
0.9385
I 05
39.6387
5.7866
8.8788
46.0830
0.9912
0.9293
In 01
12.5099
2.4489
7.0419
19.6398
0.9821
0.9486
In 02
14.2350
3.6192
4.8314
26.4565
0.9722
0.9583
In 03
10.0663
2.8763
5.5343
24.6120
0.9805
0.9565
In 04
9.5023
2.5123
3.8019
31.2883
0.9712
0.9574
In 05
232.2849
0.2549
10.9070
21.3090
0.9901
0.8843
Kn 003
90.7911
2.1976
10.4363
106.3604
0.9730
0.9775
Kn 004
22.7563
2.4168
5.7317
48.8011
0.9713
0.9699
Kn 005
133.3226
4.0828
11.0701
162.9435
0.9576
0.9105
Kn 006
60.1808
9.2704
11.7639
184.1050
0.9884
0.9522
Kn 011
51.6393
8.4017
9.1712
169.9285
0.9886
0.9666
Lk 01
13.5811
3.7493
8.3654
16.0503
0.9863
0.9348
Lk 02
107.5765
3.8529
28.6775
25.5444
0.9843
0.9510
Lk 03
58.5481
3.3771
16.6441
21.9140
0.9929
0.9527
0.9801
Lk 04
20.5803
1.1772
8.9748
10.1449
0.9888
Lt 01
423.6948
0.7583
18.3468
32.9066
0.9684
0.9078
Lt 02
751.1495
0.6394
32.6985
31.7868
0.9837
0.9335
Lt 03
88.3092
4.8911
15.2459
101.7612
0.9720
0.9832
Lt 04
95.6604
6.4113
16.2275
101.1050
0.9895
0.9463
Lt 05
33.0904
3.2367
6.6660
52.5286
0.9862
0.9713
Lt 06
12.6603
4.4508
5.3476
33.0414
0.9667
0.9325
M 01
171.7642
4.3245
19.7968
52.0033
0.9879
0.9225
M 02
533.0407
0.5286
48.7393
14.2415
0.9791
0.9693
M 03
146.7155
3.2391
21.3397
36.0318
0.9944
0.9557
M 04
419.3777
0.6036
60.6588
11.5735
0.9666
0.9763
M 05
252.6075
4.1815
46.6180
45.1287
0.9947
0.9306
123
720
Table 2 continued
ID
A1
r1
A2
r2
R 2 present
R 2 Zipf
Mq 01
951.7665
0.5637
89.5533
17.7090
0.9863
0.9588
Mq 02
40.6507
1.5969
20.4135
12.6506
0.9926
0.9655
Mq 03
353.1811
1.7158
22.5446
31.8669
0.9931
0.9856
Mr 001
88.2739
2.6860
13.1552
87.4096
0.9842
0.9815
Mr 018
155.3090
2.1488
24.7275
67.2207
0.9799
0.9863
Mr 026
68.5407
7.3158
13.1804
115.1534
0.9863
0.9633
Mr 027
90.9947
4.9799
20.1624
107.3354
0.9846
0.9456
Mr 288
77.1276
5.0675
15.4370
92.4166
0.9750
0.9683
R 01
58.4179
3.4730
17.2254
36.6148
0.9745
0.9571
R 02
143.5423
1.5010
37.3312
19.1053
0.9780
0.9802
R 03
150.1334
0.8192
20.1034
20.6761
0.9820
0.9778
R 04
53.0530
2.7114
10.7788
39.8437
0.9919
0.9798
R 05
52.8419
1.3771
19.3840
18.2109
0.9783
0.9743
R 06
780.6832
0.2459
16.7087
14.2653
0.9835
0.9350
Rt 01
149.5175
2.2242
20.1494
24.3286
0.9875
0.9645
Rt 02
63.7045
3.7346
19.8587
21.9803
0.9946
0.9316
Rt 03
66.1102
3.1313
18.7931
27.3862
0.9844
0.9465
Rt 04
51.4438
3.1368
14.4738
21.9869
0.9873
0.9358
Rt 05
67.5625
2.8563
25.8076
27.2637
0.9950
0.9469
Ru 01
32.4246
3.8477
5.9434
38.7897
0.9894
0.9604
Ru 02
164.8501
2.2460
23.2986
39.4434
0.9829
0.9915
Ru 03
159.8507
1.4801
59.9352
23.2243
0.9750
0.9620
Ru 04
1238.3964
0.4439
101.5338
20.6088
0.9614
0.9571
Ru 05
729.2764
3.9273
78.0923
76.1505
0.9823
0.9807
Sl 01
74.4823
1.4018
9.1997
24.1027
0.9873
0.9760
Sl 02
78.1667
1.7823
19.2220
32.4544
0.9916
0.9823
Sl 03
125.5762
3.3095
11.1064
60.4119
0.9899
0.9604
Sl 04
466.1931
2.0011
36.1672
36.6358
0.9900
0.9912
Sl 05
205.4954
4.8968
27.2849
75.8361
0.9905
0.9490
Sm 01
184.5525
1.8381
55.9253
15.6411
0.9927
0.9678
0.9450
Sm 02
111.5533
3.5886
20.4309
29.3903
0.9910
Sm 03
39.1556
7.4565
9.0093
24.0300
0.9928
0.8708
Sm 04
85.2935
2.8670
18.1053
21.9164
0.9961
0.9563
27.5367
1.6852
25.0664
11.8320
0.9930
0.9263
105.9746
5.7859
7.9899
49.1539
0.9887
0.8817
Sm 05
T 01
T 02
135.6945
5.2386
8.7591
53.5378
0.9759
0.8685
T 03
136.0871
6.1231
14.8813
41.0608
0.9907
0.8923
B Bulgarian, Cz Czech, E English, G German, H Hungarian, Hw Hawaiian, I Italian, In Indonesian, Kn

Kannada, Lk Lakota, Lt Latin, M Maori, Mq Marquesan, Mr Marathi, R Romanian, Rt Rarotongan, Ru
Russian, Sl Slovenian, Sm Samoan, T Tagalog) (data taken from Popescu et al. 2008)
123
721
Fig. 2 Comparing the present fit with the Zipfian (the abscissa does not represent a variable but the number
of the text from Table 2)
two exponential component function (R 2 = 0.9824). The last two columns of Table 2 give
a comparison of the fitting determination coefficients of the present two exponential components function and that of Zipfs f (r ) = c/r a function. The corresponding column mean
values are mean R 2 present = 0.9848 and mean R 2 Zipf = 0.9498. The gain of the present
approach against that of Zipf is presented graphically in Fig. 2.
A better result could be expected simply by the fact that we have now four parameters
as compared with the two with Zipf but the Zipfian function does not take into account linguistic boundary conditions e.g. the degree of synthetism resulting in the long tail of hapax
legomena (words which appear only once in a given text), which may be captured by the
second and further components of the present function. Nevertheless, the Zipfian approach
is an excellent first approximation to the situation in texts.
Table 2 shows that only in three cases out of 100 A2 A1 (texts B 05, G 14, and
G 17), otherwise always A1 > A2, that means, the first component of (8) represents the most
frequent class, in our case perhaps that of synsemantics (function words). This is especially
conspicuous in the Romanian text R 06 in which the ratio A1 /A2 attains a high number
caused by a synsemantic outlier, as can be seen in Fig. 3. A similar linguistic phenomenon
occurs in the Hungarian text H 03, in the Italian text I 03, and in many other languages, as
illustrated in Table 3 as excerpted from Table 2 in decreasing order of the A1 /A2 ratio. In all
these cases, our conjecture is corroborated by the fact that we always have r2 > r1 meaning
that the component responsible for the hapax legomena is always represented by the second
exponential.
As can be seen in the general Table 2, in all cases two components are sufficient. None of
the determination coefficients is smaller than about 0.96. If one would set a fixed necessary
R 2 , then one should add as many components to (8) as needed in order to attain this criterion. However, this does not apply to all cases. If one adds a third component, the mean R 2
(of all texts) surely increases but the fitting does not necessarily improve in every text. This
123
722
Fig. 3 Text R 06 with a synsemantic outlier causing great A1
Table 3 Excerpted from Table 2 in decreasing order of the A1 /A2 ratio, with texts showing a synsemantic
outlier at the top of the rank frequency distribution
ID
A1
r1
A2
R2
r2
A1/A2
R 06
780.6832
0.2459
16.7087
14.2653
0.9835
46.7232
H 03
165.3035
0.7452
3.9648
14.8825
0.9962
41.6929
I 03
832.7973
0.3395
20.6710
13.6196
0.9854
40.2881
H 01
692.3412
0.8109
21.2372
23.5043
0.9847
32.6004
H 02
1047.3379
0.4214
36.7323
6.3047
0.9809
28.5128
H 04
132.8165
1.5791
5.3124
33.1712
0.9923
25.0013
Lt 01
423.6948
0.7583
18.3468
32.9066
0.9684
23.0937
Lt 02
751.1495
0.6394
32.6985
31.7868
0.9837
22.9720
In 05
232.2849
0.2549
10.9070
21.3090
0.9901
21.2969
Mq 03
353.1811
1.7158
22.5446
31.8669
0.9931
15.6659
T 02
135.6945
5.2386
8.7591
53.5378
0.9759
15.4919
T 01
105.9746
5.7859
7.9899
49.1539
0.9887
13.2636
E 04
507.9130
2.2501
39.3783
49.7372
0.9881
12.8983
Sl 04
466.1931
2.0011
36.1672
36.6358
0.9900
12.8899
12.8392
Cz 05
194.2721
0.9509
15.1312
20.7528
0.9919
Ru 04
1238.3964
0.4439
101.5338
20.6088
0.9614
12.1969
53.8344
1.5057
4.4435
15.7462
0.9732
12.1154
12.0435
H 05
Kn 005
133.3226
4.0828
11.0701
162.9435
0.9576
Sl 03
125.5762
3.3095
11.1064
60.4119
0.9899
11.3067
M 02
533.0407
0.5286
48.7393
14.2415
0.9791
10.9366
Mq 01
951.7665
0.5637
89.5533
17.7090
0.9863
10.6279
123

Table 4 Cases in which the
determination coefficients for
fitting (8) with two and three
components are identical
723
ID
R2
B 02
0.9858
B 03
0.9905
E 01
0.9956
G 05
0.9938
G 10
0.9870
G 11
0.9796
G 14
0.9867
G 17
0.9824
H 05
0.9732
Hw 03
0.9866
In 01
0.9821
In 03
0.9805
In 04
0.9712
In 05
0.9901
Kn 005
0.9576
Lk 01
0.9863
M 01
0.9879
M 03
0.9944
M 05
0.9947
Rt 01
0.9875
Rt 02
0.9946
Ru 92
0.9894
Sl 03
0.9899
Sm 03
0.9928
Sm 04
0.9961
T 01
0.9887
T 02
0.9759
is best illustrated in Table 4 where are given 27 cases out of 100 in which the determination
coefficient for fitting (8) with two and three components are identical. At the same time two
of three ri -values in all these three component fittings are equal, thus confirming that the
texts of Table 4 are actually composed by only two word components.
It should be pointed out that adding a further component and improving R 2 does not
necessarily mean a real improvement. On the contrary, the result could be even worse because
the number of degrees of freedom changes, as can be proved by an F-test. For example the
Latin text Lt 06 (Horace) fitted with two components yields R 2 = 0.9667, with three components R 2 = 0.9822 but the respective F-tests yield both P < 0.00001, the difference
consisting only in the decimal place at which numbers greater than zero begin. Hence, practically no improvement has been achieved but the number of parameters increased.
The disadvantage of approach (8) as compared with the power curve is the fact that while
zeta as rank-frequency distribution can easily be transformed in frequency spectrum, with
(8) a simple result follows only if there is one component only.
123
123
DU
MIT
ER
DEN
SO
MIR
ICH
ES
IN
NICHT
DICH
DEM
MEINE
HAT
EIN
MICH
MANCH
WAS
IHN
1.7260
1.8216
1.9297
2.0521
2.1906
2.3473
2.5247
2.7254
2.9525
3.2095
3.5004
3.8296
4.2020
4.6236
5.1006
5.6404
6.2512
6.9425
7.7247
9.6118
8.6100
9
5
MEIN
UND
11
9
WEIDEN
WIEGEN
WILLIG
WILLST
WOHL
ALTEN
CHZENDE
ARMEN
ERLENKNIG
ERLKNIG
FAT
HLT
REITET
RUHIG
SCHNE
SIEHST
WIND
TCHTER
SOHN
VATER
KIND
1.0021
1.0031
1.0047
1.0070
1.0106
1.0159
1.0239
1.0360
1.0542
1.0816
1.1229
1.1850
1.2784
1.4191
1.6309
1.9497
2.4295
3.1519
4.2393
8.3401
5.8761
Autosem fit
Freq
Word
Freq
Word
R 2 = 0.9304
R 2 = 0.9547
Synsem fit
Autosemantics
f 2 (r ) = 1 + 11.0491 exp(r /2.4450)
Synsemantics
f 1 (r ) = 1 + 9.7454 exp(r /8.0862)
Table 5 Partitioning of Goethes Erlknig in two components
21
20
19
18
17
16
15
14
13
12
11
10
1
2
Rank
Total real
11
9
Freq
HAT
WIND
MEINE
DEM
DICH
TCHTER
NICHT
IN
ES
ICH
SOHN
MIR
SO
DEN
KIND
ER
MIT
DU
VATER
MEIN
UND
Word
2.3473
2.4295
2.5247
2.7254
2.9525
3.1519
3.2095
3.5004
3.8296
4.2020
4.2393
4.6236
5.1006
5.6404
5.8761
6.2512
6.9425
7.7247
8.3401
9.6118
8.6100
Total reconstructed (ranked

by this column)
724
DAS
AN
IST
WER
SEINEN
SEINEM
SIND
WAR
SEI
GAR
DURCH
DORT
DIR
DIE
DEINE
DEIN
BIST
AM
1.0693
1.0784
1.0887
1.1004
1.1136
1.1285
1.1455
1.1646
1.1863
1.2108
1.2385
1.2699
1.3055
1.3457
1.3912
1.4427
1.5010
1.5669
1.6415
REIHN
ORT
NOT
NEBELSTREIF
REIZT
SUSELT
SCHEINEN
SCHN
SCHWEIF
SEH
SICHER
SINGEN
SOLLEN
SPT
SPIEL
SPIELE
STRAND
TANZEN
TOT
VERSPRICHT
WARM
WARTEN
DER
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0001
1.0001
1.0001
1.0002
1.0003
1.0004
1.0006
1.0009
1.0014
Autosem fit
Freq
Word
Freq
Word
R 2 = 0.9304
R 2 = 0.9547
Synsem fit
Autosemantics
f 2 (r ) = 1 + 11.0491 exp(r /2.4450)
Synsemantics
f 1 (r ) = 1 + 9.7454 exp(r /8.0862)
Table 5 continued
43
42
41
40
39
38
37
36
35
34
33
32
31
30
29
28
27
26
25
24
23
22
Rank
Total real
Freq
DURCH
HLT
GAR
SEI
WAR
SIND
REITET
SEINEM
SEINEN
WER
RUHIG
IST
AN
DAS
SCHNE
DER
IHN
WAS
MANCH
SIEHST
MICH
EIN
Word
1.1646
1.1850
1.1863
1.2108
1.2385
1.2699
1.2784
1.3055
1.3457
1.3912
1.4191
1.4427
1.5010
1.5669
1.6309
1.6415
1.7260
1.8216
1.9297
1.9497
2.0521
2.1906

by this column)

725
123
123
NCHTLICHEN
NACHT
MUTTER
MHE
LIEBES
LIEBE
LEISE
LEIDS
KRON
KOMM
KNABEN
KNABE
JETZT
HREST
HOF
GLDEN
GRAUSETS
GRAU
GEWAND
GEWALT
GETAN
GESTALT
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
Autosem fit
Word
Freq
Freq
Synsem fit
R 2 = 0.9304
R 2 = 0.9547
Word
Autosemantics
f 2 (r ) = 1 + 11.0491 exp(r /2.4450)
Synsemantics
f 1 (r ) = 1 + 9.7454 exp(r /8.0862)
Table 5 continued
65
64
63
62
61
60
59
58
57
56
55
54
53
52
51
50
49
48
47
46
45
44
Rank
Total real
Freq
TOT
VERSPRICHT
WARM
WARTEN
WEIDEN
WIEGEN
WILLIG
WILLST
WOHL
ALTEN
CHZENDE
ARMEN
ERLENKNIG
AM
BIST
ERLKNIG
DEIN
DEINE
DIE
FAT
DIR
DORT
Word
1.0004
1.0006
1.0009
1.0014
1.0021
1.0031
1.0047
1.0070
1.0106
1.0159
1.0239
1.0360
1.0542
1.0693
1.0784
1.0816
1.0887
1.1004
1.1136
1.1229
1.1285
1.1455

by this column)
726
GESICHT
GESCHWIND
GENAU
GEHN
GEH
FHREN
FEINER
ERREICHT
ERLKNIGS
DSTERN
DRREN
BUNTE
BRAUCH
BLUMEN
BLEIBE
BLTTERN
BIRGST
BANG
ARM
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
Autosem fit
Word
Freq
Freq
Synsem fit
R 2 = 0.9304
R 2 = 0.9547
Word
Autosemantics
f 2 (r ) =1 + 11.0491 exp(r /2.4450)
Synsemantics
f 1 (r ) = 1 + 9.7454 exp(r /8.0862)
Table 5 continued
1
1
1
85
86
Freq
84
83
82
81
80
79
78
77
76
75
74
73
72
71
70
69
68
67
66
Rank
Total real
MUTTER
NACHT
NCHTLICHEN
NEBELSTREIF
NOT
ORT
REIHN
REIZT
SUSELT
SCHEINEN
SCHN
SCHWEIF
SEH
SICHER
SINGEN
SOLLEN
SPT
SPIEL
SPIELE
STRAND
TANZEN
Word
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0001
1.0001
1.0001
1.0002
1.0003

by this column)

727
123
123
Word
Autosem fit
Freq
Freq
Synsem fit
R 2 = 0.9304
R 2 = 0.9547
Word
Autosemantics
f 2 (r ) =1 + 11.0491 exp(r /2.4450)
Synsemantics
f 1 (r ) = 1 + 9.7454 exp(r /8.0862)
Table 5 continued
1
1
108
101
107
100
99
106
98
97
96
105
95
104
94
93
92
103
91
102
1
1
88
89
87
90
Freq
Rank
Total real
GENAU
GESCHWIND
GESICHT
GESTALT
GETAN
GEWALT
GEWAND
GRAU
GRAUSETS
GLDEN
HOF
HREST
JETZT
KNABE
KNABEN
KOMM
KRON
LEIDS
LEISE
LIEBE
LIEBES
MHE
Word
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000

by this column)
728
Word
Autosem fit
Freq
Freq
Synsem fit
R 2 = 0.9304
R 2 = 0.9547
Word
Autosemantics
f 2 (r ) = 1 + 11.0491 exp(r /2.4450)
Synsemantics
f 1 (r ) = 1 + 9.7454 exp(r /8.0862)
Table 5 continued
1
1
1
123
124
120
119
122
118
121
1
1
115
117
114
116
1
1
113
1
1
110
112
109
111
Freq
Rank
Total real
ARM
BANG
BIRGST
BLTTERN
BLEIBE
BLUMEN
BRAUCH
BUNTE
DRREN
DSTERN
ERLKNIGS
ERREICHT
FEINER
FHREN
GEH
GEHN
Word
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000
1.0000

by this column)

729
123
730
Fig. 4 Distribution of synsemantics in Goethes Erlknig
Fig. 5 Distribution of autosemantics in Goethes Erlknig
The present proposal trespasses the linguistic commandment shun mixtures (because
they can capture everything) but seeing text as a composition of components of heterogeneous sets of linguistic entities does not allow another interpretation. Besides its explanative
power and better fitting, it evidently does not depend on language type as is the case of the
power functionthe original Zipfs law (see Popescu and Altmann 2008) but captures this
boundary condition with one of the components. However, a deep (linguistic or textological)
analysis is necessary to show which component captures a given component. Further, it will
123
731
be necessary to scrutinize why the parameters of (8) are so different even for texts of the
same genre or the same author. Thus, the results themselves may be further analysed.
For the sake of illustration we show two components in Goethes Erlknig, that of synsemantics and that of autosemantics as presented in Table 5 and Figs. 4 and 5. Evidently, the
components taken separately display a similar behaviour with different parameters. When
mixed, the frequencies themselves do not change but the ranking variable is adapted to the
mixing, hence the parameters of the common distribution must be adapted, too. It can be
supposed that further components will be discovered.
Since different components are possible, this approach can help to discover them and
characterize different textual entities (author, genre, language etc.).
References
Bak, P.: How Nature Works. The Science of Self-Organized Criticality. Copernicus Springer, New York (1999)
Chitashvili, R.J., Baayen, R.H.: Word frequency distributions of text and corpora as large number of rare
event distributions. In: Altmann, G., Hrebcek, L. (eds.) Quantitative Text Analysis, pp. 54135. WVT,
Trier (1993)
Ferrer I Cancho, R., Sol, R.V.: Two regions in the frequency of words and the origin of complex lexicons:
Zipfs law revisited. J. Quant. Linguist. 8, 165173 (2001)
Gell-Mann, M.: The Quark and the Jaguar. Freeman, New York (1994)
Ha, L.Q., Smith, F.J.: Zipf and type-token rules for the English and Irish language. http://www.nslij-genetics.
org/wli/zipf/ha04.pdf. Cited 10 October 2007 (2004)
Mandelbrot, B.: An information theory of the statistical structure of language. In: Jackson, W. (ed.) Communication Theory, pp. 486502. Butterworth, London (1953)
Miller, G.A., Newman, E.B., Friedman, E.A.: Length-frequency statistics for written English. Information
and Control 1(1958), 370389 (1968)
Naranan, S., Balasubrahmanyan, V.K.: Models for power relations in linguistics and information science.
J. Quant. Linguist. 5, 3561 (1998)
Popescu, I.-I., Altmann, G.: Hapax legomena and language typology. J. Quant. Linguist. 15, 370378 (2008)
Popescu, I.-I., Vidya, M.N., Uhlrov, L., Pustet, R., Mehler, A., Macutek, J., Krupa, V., Khler, R., Jayaram,
B.D., Grzybek, P., Altmann, G.: Word Frequency Studies. Berlin, New York: Mouton de Gruyter (2008)
(in print)
Rapoport, A.: Zipfs Law Re-visited. In: Guiter, H., Arapov, M.V. (eds.) Studies on Zipfs Law, pp. 128.
Brockmeyer, Bochum (1982)
Simon, H.: On a class of skew distribution functions. Biometrika 42, 435440 (1955)
Wimmer, G. , Altmann, G.: Unified derivation of some linguistic laws. In: Khler, R., Altmann, G., Piotrowski, R.G. (eds.) Quantitative Linguistics. An International Handbook, pp. 791807. De Gruyter, Berlin,
New York (2005)
Tuldava, J.: The frequency spectrum of text and vocabulary. J. Quant. Linguist. 3, 3850 (1996)
123

2010 - Ioan Iovitz Popescu - Zipfslawanotherview (Retrieved-2014!07!26)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2010 - Ioan Iovitz Popescu - Zipfslawanotherview (Retrieved-2014!07!26)

Uploaded by

Copyright:

Available Formats

Qual Quant (2010) 44:713731

Zipfs lawanother view

Published online: 9 May 2009

Zipfs law Rank-frequency distribution Synthetic language

I.-I. Popescu et al.

Zipfs lawanother view

In any case, linguistic material seems to be composed of a number of components of

and obtains a new distribution P(x; b, c, d) after summing or integrating according

I.-I. Popescu et al.

Table 1 Ranking of different

generating function of the parent distribution, then

Zipfs lawanother view

Fig. 1 Mixing four frequency components

I.-I. Popescu et al.

Zipfs lawanother view

I.-I. Popescu et al.

B Bulgarian, Cz Czech, E English, G German, H Hungarian, Hw Hawaiian, I Italian, In Indonesian, Kn

Zipfs lawanother view

I.-I. Popescu et al.

Fig. 3 Text R 06 with a synsemantic outlier causing great A1

Zipfs lawanother view

Table 5 Partitioning of Goethes Erlknig in two components

Total reconstructed (ranked

Total reconstructed (ranked

Zipfs lawanother view

Total reconstructed (ranked

Total reconstructed (ranked

Zipfs lawanother view

Total reconstructed (ranked

Total reconstructed (ranked

Zipfs lawanother view

I.-I. Popescu et al.

Fig. 4 Distribution of synsemantics in Goethes Erlknig

Fig. 5 Distribution of autosemantics in Goethes Erlknig

Zipfs lawanother view

You might also like