Parallel Pseudo-Random Number Generation
Nimalan Nandapalan and Richard Brent
30 March 2011
(revised 2 April 2011)
Acknowledgements
Thanks to Jiří Jaroš and Alistair Rendell for their assistance with Chapter 3.
Contents

1 Introduction and Conclusions
2 Pseudo-Random Number Generators
  2.1 Requirements
  2.2 Spurious Requirements
  2.3 Statistical Testing
  2.4 Linear Congruential Generators
  2.5 LFSR Generators and Generalizations
  2.6 The Mersenne Twister
  2.7 Combined Tausworthe Generators
  2.8 xorshift Generators
  2.9 xorgens
    2.9.1 Cryptographic applications
3 Contemporary Multicore Systems
  3.1 Intel Sandy Bridge
  3.2 AMD Fusion
  3.3 NVIDIA Tegra
  3.4 OpenCL
4 Parallel PRNG
  4.1 Statistical Testing of Parallel PRNGs
  4.2 The Contiguous Subsequences Technique
  4.3 The Leapfrog Technique
  4.4 Independent Subsequences
  4.5 Combined and Hybrid Generators
  4.6 Mersenne Twister for Graphic Processors
  4.7 Parallel Lagged Fibonacci Generators
  4.8 xorgensGP
  4.9 Other Parallel PRNG
    4.9.1 CUDA SDK Mersenne Twister
    4.9.2 CURAND
    4.9.3 Park-Miller on GPU
    4.9.4 Parallel LCG
    4.9.5 Parallel MRG
    4.9.6 Other Parallel Lagged Fibonacci Generators
    4.9.7 Additional Notes
  4.10 Summary
Bibliography
Chapter 1
Introduction and Conclusions
This is a preliminary report on parallel pseudo-random number generation.
It was written under tight time constraints, so it makes no claim to being
an exhaustive survey of the field, which is already extensive and in a state
of flux as new computer architectures are introduced.
Chapter 2 summarises the requirements of a good pseudo-random number
generator (PRNG), whether serial or parallel, and describes some popular
classes of generators, including LFSR generators, the Mersenne Twister,
and xorgens. This chapter also includes comments on statistical testing of
PRNGs.
In Chapter 3 we summarise the features and constraints of contemporary
multicore computer systems, including GPUs, with particular reference to the
constraints that their hardware and APIs impose on parallel implementations
of PRNGs.
Some specific implementations of parallel PRNGs are described in Chapter 4.
Our conclusion is that the jury is still out on the best parallel PRNG;
quite likely what is best depends on the precise architecture on which
it is to be implemented. The same conclusion was reached by Thomas et
al [64] in their recent comparison of CPUs, GPUs, FPGAs, and massively
parallel processor arrays for pseudo-random number generation. There is
also a definite tradeoff between speed and quality/large-state-space, so it is
hard to say what is best in all applications.
We include (in Chapter 4) some specific recommendations for the testing
of parallel PRNGs, which is a more demanding task than testing a single
sequential PRNG.
Finally, we include a bibliography which is by no means exhaustive, but
does include both important historical references and a sample of recent
references on both serial and parallel pseudo-random number generation.
Chapter 2
Pseudo-Random Number Generators
In this chapter we survey popular and/or historically important
pseudo-random number generators (PRNGs) and comment on their speed, quality,
and suitability for parallel implementation. Specific comments on some
parallel PRNGs are given in Chapter 4.
Applications require random numbers with various distributions (e.g. uniform,
normal, exponential, Poisson, ...), but the algorithms used to generate
these random numbers, and other random objects such as random permutations,
almost invariably require a good uniform random number generator. In
this report we consider only the generation of uniformly distributed numbers.
For other distributions see, for example, [3, 5, 6, 7, 11, 25, 57, 63]. Usually
we are concerned with real numbers u_n which are intended to be uniformly
distributed on the interval [0, 1). Often it is convenient to consider
integers U_n in some range 0 ≤ U_n < m. In this case we require
u_n = U_n/m to be (approximately) uniformly distributed. Typically m is a
power of two, e.g. 2^63 or 2^64 on a machine that supports 64-bit integer
arithmetic.
Pseudo-random numbers generated in a deterministic manner on a digital
computer can not be truly random [65]. We should always say "pseudo-random"
to distinguish such numbers from truly random numbers (generated, for
example, by tossing dice, counting cosmic rays, or observing quantum
fluctuations). However, for brevity we often omit the qualifier "pseudo"
when it is clear from the context. What is required of a PRNG is that finite
segments of the sequence u_0, u_1, ... behave in a manner indistinguishable
from a truly random sequence. In practice, this means that they pass all
statistical tests which are relevant to the problem at hand. Since the
problems to which a library routine will be applied are not known in advance,
random number generators in subroutine libraries should pass a number of
stringent statistical tests (and not fail any) before being released for
general use.
A sequence u_0, u_1, ... depending on a finite state must eventually be
periodic, i.e. there is a positive integer ρ such that u_{n+ρ} = u_n for all
sufficiently large n. The minimal such ρ is called the period. If the
generator has s state bits, then clearly ρ ≤ 2^s. It is often (but not
always) true that u_ρ = u_0, i.e. the sequence generated is a pure cycle
with no "tail".
2.1 Requirements
Following are the main requirements for a good uniform PRNG and its
implementation in a subroutine library:
Uniformity. The sequence of random numbers should pass statistical
tests for uniformity of distribution. In one dimension this is easy to
achieve. Most generators in common use are provably uniform (apart
from discretization due to the finite wordlength) when considered over
their full period.
Independence. Subsequences of the full sequence u_0, u_1, ... should be
independent. For example, members of the even subsequence u_0, u_2, u_4, ...
should be independent of their odd neighbours u_1, u_3, ... This implies
that the sequence of pairs (u_{2n}, u_{2n+1}) should be uniformly
distributed in the unit square. More generally, random numbers are often
used to sample a d-dimensional space, so the sequence of d-tuples
(u_{dn}, u_{dn+1}, ..., u_{dn+d−1}) should be uniformly distributed in the
d-dimensional cube [0, 1]^d for all reasonable values of d (certainly for
all d ≤ 6).
Passing Statistical Tests. As a generalisation of the two requirements
above (uniformity and independence), we can ask for our PRNG not
to fail statistical tests that should be passed by truly random numbers.
Of course, there is an infinite number of conceivable statistical tests;
in practice a finite battery of tests is chosen. This is discussed further
in §2.3.
Long Period versus Small State Space. A long simulation on a parallel
computer might use 10^20 random numbers. In such a case the period ρ
must exceed 10^20. For many generators there are strong correlations
between u_0, u_1, ... and u_m, u_{m+1}, ..., where m = ρ/2 or (ρ+1)/2 (and
similarly for other simple fractions of the period). Thus, in practice
the period should be much larger than the number of random numbers
which will ever be used. A good rule of thumb is that at most ρ^{1/2}
random numbers should be used from a generator with period ρ. To
be conservative, ρ^{1/2} might be reduced to ρ^{1/3} if we are concerned
with passing certain statistical tests such as the "birthday-spacings"
test [39].

On the other hand, the state space required by the PRNG should not
be too large, especially if multiple copies are to be implemented on a
parallel machine where the memory per processor is small. For period
ρ, the state space must be at least log_2 ρ bits. As a tradeoff between
long period and small state space, we recommend generators with log_2 ρ
in the range 256 to 1024. Generators with smaller ρ will probably fail
some statistical tests, and generators with larger ρ will consume too
much memory on certain parallel machines (e.g. GPGPUs). Generators
with very large ρ may also take a significant time to initialise. In
practice, it is convenient to have a family of related generators with
different periods ρ, so a suitable selection can be made, depending on
the statistical requirements and memory constraints.
Repeatability. For testing and development it is useful to be able to
repeat a run with exactly the same sequence of random numbers as
was used in an earlier run [23]. This is usually easy if the sequence
is restarted from the beginning (u_0). It may not be so easy if the
sequence is to be restarted from some other value, say u_m for a large
integer m, because this requires saving the state information associated
with the random number generator. In some applications it is desirable
to incorporate some physical source of randomness (e.g. as a seed for a
deterministic PRNG), but this clearly rules out repeatability.
Unpredictability. In cryptographic applications we usually require the
sequence of random numbers to be unpredictable, in the sense that it
is computationally very difficult, if not impossible, to predict the next
number u_n in the sequence, given the preceding numbers u_0, ..., u_{n−1}.
This requires the generator to have some hidden state bits; also, u_n
must not be a linear function of u_0, ..., u_{n−1}. If we insist on
unpredictability, then several important classes of generators are ruled out
(e.g. any that are linear over GF(2)). In the following we generally
assume that unpredictability is not a requirement. For more on
unpredictable generators, see [2].
Portability. For testing and development purposes, it is useful to be
able to generate exactly the same sequence of random numbers on two
different machines, possibly with different wordlengths. In practice it
will be expensive to simulate a long wordlength on a machine with a
short wordlength, but the converse should be easy: a machine with
a long wordlength (say w = 64) should be able to simulate a machine
with a smaller wordlength (say w = 32) without loss of efficiency.
Disjoint Subsequences. If a simulation is to be run on a machine with
several processors, or if a large simulation is to be performed on several
independent machines, it is essential to ensure that the sequences of
random numbers used by each processor are disjoint. Two methods
of subdivision are commonly used [35]. Suppose, for example, that
we require 4 disjoint subsequences for a machine with 4 processors.
One processor could use the subsequence (u_0, u_4, u_8, ...), another the
subsequence (u_1, u_5, u_9, ...), etc. For efficiency each processor should
be able to "skip over" the terms which it does not require. Alternatively
(and generally easier to implement), processor j could use the
subsequence (u_{m_j}, u_{m_j+1}, ...), where the indices m_0, m_1, m_2, m_3
are sufficiently widely separated that the (finite) subsequences do not
overlap. This requires some efficient method of generating u_m for large m
without generating all the intermediate values u_1, ..., u_{m−1} (this is
called "jumping ahead" below).
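The first subdivision scheme (the "leapfrog" technique of §4.3) can be sketched with a toy generator. In the sketch below, the LCG step and its constants (Knuth's MMIX parameters) are placeholders for illustration only, not a recommended generator:

```c
#include <stdint.h>

/* Toy generator used only to illustrate the subdivision scheme;
   the constants are Knuth's MMIX LCG parameters. */
static uint64_t lcg_next(uint64_t *s) {
    *s = *s * 6364136223846793005ULL + 1442695040888963407ULL;
    return *s;
}

/* Leapfrog: a processor owning residue class 0 of p consumes
   u_0, u_p, u_2p, ... by generating and discarding the terms
   belonging to the other p-1 processors. */
uint64_t leapfrog_next(uint64_t *s, int p) {
    uint64_t v = lcg_next(s);   /* this processor's term */
    for (int i = 1; i < p; i++)
        lcg_next(s);            /* skip the other processors' terms */
    return v;
}
```

In practice, skipping one term at a time wastes work; a generator whose recurrence can be iterated p steps at once (e.g. an LCG with precomputed a^p and the matching additive constant) avoids this overhead.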
Efficiency. It should be possible to implement the method efficiently,
so that only a few arithmetic operations are required to generate each
random number, all vector/parallel capabilities of the machine are used,
and overheads such as those for subroutine calls are minimal. This
implies that the random number routine should (optionally) return an
array of several numbers at a time, not just one, to avoid procedure
call overheads.
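A sketch of such an interface follows. The state type and step function are illustrative placeholders (again Knuth's MMIX LCG constants); the point is the buffer-filling entry point, which amortises the call overhead over many numbers:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative state and step function, chosen only so the
   example is self-contained. */
typedef struct { uint64_t s; } prng_t;

static uint64_t prng_step(prng_t *g) {
    g->s = g->s * 6364136223846793005ULL + 1442695040888963407ULL;
    return g->s;
}

/* Fill a caller-supplied array in one call, paying the procedure-call
   overhead once per n numbers instead of once per number.  The inner
   loop is also a natural target for vectorisation. */
void prng_fill(prng_t *g, uint64_t *buf, size_t n) {
    for (size_t i = 0; i < n; i++)
        buf[i] = prng_step(g);
}
```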
Proper Initialisation. It is critically important that the PRNG is
initialised correctly. A surprising number of problems observed with
implementations of PRNGs are due, not to the choice of a bad generator
or a faulty implementation of the generator itself, but to inadequate
initialisation. For example, Matsumoto et al [49] tested 58 generators
in the GNU Scientific Library, and found some kind of initialisation
defect in 40 of them.

A common situation on a parallel machine is that different processors
will use the same PRNG, initialised with consecutive seeds. It is a
requirement that the sequences generated be independent. (Unfortunately,
implementations often fail to meet this requirement; see §4.1.)
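One common remedy is to pass consecutive seeds through an integer mixing function before using them as initial states, so that nearby seeds yield well-separated states. The sketch below uses the 64-bit finaliser from MurmurHash3; this is our illustrative choice, not part of any generator discussed in this report:

```c
#include <stdint.h>

/* Mix a small integer seed into a full 64-bit state.  This is the
   64-bit finaliser from MurmurHash3, used here only as an example of
   seed scrambling.  It is a bijection on 64-bit words, so distinct
   seeds always produce distinct states. */
uint64_t mix_seed(uint64_t seed) {
    uint64_t x = seed;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}
```

Note that scrambling the seed is no substitute for the generator's own initialisation code doing its job; it merely avoids handing the generator a family of nearly identical states.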
2.2 Spurious Requirements
In the literature one sometimes sees requirements for random number
generators that we might class as "spurious" because, although attractive or
convenient, they are not necessary. In this category we mention the
following.
Equidistribution. This is a property that, for certain popular classes
of generators, can be tested efficiently without generating a complete
cycle of random numbers, even though the definition is expressed in
terms of a complete cycle. We have argued in [12] that equidistribution
is neither necessary nor sufficient for a good pseudo-random number
generator. Briefly, the equidistribution test would be failed by any
genuine (physical) random number generator; also, we have noted above
(under "Long Period") that one should not use the whole period provided by
a pseudo-random number generator, so a criterion based on behaviour
over the whole period is not necessarily relevant.
Ability to Jump Ahead. Under "Disjoint Subsequences" above we mentioned
that when using a pseudo-random number generator on several
(say p) processors we need to split the cycle generated into p distinct,
non-overlapping segments. This can be done if we know in advance
how many random numbers will be used by each processor (or an upper
bound on this number, say b) and we have the ability to jump
ahead from u_0 to u_b and restart the generator from index b (also 2b,
3b, ..., (p−1)b). Recently Haramoto et al [18, 19] proposed an efficient
way of doing this for the important class of F_2-linear generators.¹

If the period of the generator is ρ and we have p processors, then by a
birthday paradox argument we can take p randomly chosen seeds to
start the generator at p different points in its cycle, and the probability
that the segments of length b starting from these points are not disjoint
is O(p²b/ρ). For example, if p ≤ 2^20, b ≤ 2^64, and ρ ≥ 2^128, this
probability is O(2^−24), which is negligible. We need ρ to be at least this
large for the reasons stated above under "Long Period". Thus, the ability to
jump ahead is unnecessary. Since jumping ahead is non-trivial to implement
and imposes a significant startup overhead at runtime, it is best
avoided in a practical implementation. That is why we did not include
such a feature in our recent random number generator xorgens [9, 10].

¹We note that this is nothing new: essentially the same idea, and an efficient
implementation via polynomial instead of matrix operations, was proposed in [4] and
implemented in RANU4 (1991) in the Fujitsu SSL for the VP series of vector processors.
The idea of jumping ahead is called "fast leap-ahead" by Mascagni et al [43], who give
examples, but use matrix operations instead of (equivalent but faster) polynomial
operations to implement the idea.
2.3 Statistical Testing
Theoretically, the performance of some PRNGs on certain statistical tests
can be predicted, but usually this only applies if the test is performed over a
complete period of the PRNG. In practice, statistical testing of PRNGs over
realistic subsets of their periods requires empirical methods [31, 32, 40].
For a given statistical test and PRNG to be tested, a test statistic is
computed using a finite number of outputs from the PRNG. It is necessary
that the distribution of the test statistic, when the test is applied to a
sequence of uniform, independently distributed random numbers, is known, or
at least that a sufficiently good approximation can be computed [34].
Typically, a p-value is computed, which gives the probability that the test
statistic exceeds the observed value.
The p-value can be thought of as the probability that the test statistic
or a larger value would be observed for perfectly uniform and independent
input. Thus the p-value itself should be distributed uniformly on (0, 1). If
the p-value is extremely small, for example of the order 10^−10, then the
PRNG definitely fails the test. Similarly if 1 − p is extremely small. If
the p-value is close to neither 0 nor 1, then the PRNG is said to pass the
test, although this only says that the test failed to detect any problem
with the PRNG.

Typically a whole battery of tests is applied, so there are many p-values,
not just one. We need to be cautious in interpreting the results of such
a battery of tests. For example, if 1000 tests are applied, we should not
be surprised to observe at least one p-value smaller than 0.001 or larger
than 0.999. A χ²-test might be used to test the hypothesis that the p-values
themselves look like a random sample from a uniform distribution.
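The last step can be sketched as follows: bin the p-values into k equal cells and compute the χ² statistic against the expected count n/k. The cap of 64 bins is an arbitrary assumption to keep the example allocation-free; critical values come from χ² tables with k − 1 degrees of freedom (e.g. about 16.92 at the 5% level for k = 10):

```c
#include <stddef.h>

/* Chi-square statistic for testing that n p-values look uniform on
   (0, 1).  Under the uniformity hypothesis the statistic is
   approximately chi-square distributed with k-1 degrees of freedom. */
double pvalue_chi_square(const double *p, size_t n, int k) {
    double count[64] = {0};            /* assumes k <= 64 */
    for (size_t i = 0; i < n; i++) {
        int c = (int)(p[i] * k);       /* cell index 0..k-1 */
        if (c == k) c = k - 1;         /* guard against p == 1.0 */
        count[c] += 1.0;
    }
    double expected = (double)n / k, stat = 0.0;
    for (int j = 0; j < k; j++) {
        double d = count[j] - expected;
        stat += d * d / expected;      /* sum of (O - E)^2 / E */
    }
    return stat;
}
```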
For comments on the statistical testing of parallel PRNGs, and on testing
packages such as TestU01, see §4.1.
2.4 Linear Congruential Generators
Linear congruential generators (LCGs) are PRNGs of the form

    U_{n+1} = (aU_n + c) mod m,

where m > 0 is the modulus, a is the multiplier (0 < a < m), and c
is an additive constant. Usually the modulus is chosen to be a power of 2,
say m = 2^w where w is close to the wordlength, or close to a power of 2,
say m = 2^w − δ, where a small δ is chosen so that m is prime.
LCGs have period at most m, which is too small for demanding applications.
Thus, although there is much theory regarding them, and historically
they have been used widely, we regard them as obsolete, except possibly when
used in combination with another generator.² Some references on LCGs are
[4, 17, 25, 38, 56].
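The recurrence above can be sketched in a few lines of C. With m = 2^64 the reduction happens for free via unsigned overflow; the multiplier and constant below are Knuth's MMIX parameters, used purely to make the sketch concrete:

```c
#include <stdint.h>

/* LCG step U_{n+1} = (a*U_n + c) mod m with m = 2^64, so the mod is
   implicit in unsigned 64-bit arithmetic.  Constants are Knuth's
   MMIX parameters (an illustration, not a recommendation). */
static const uint64_t A = 6364136223846793005ULL;
static const uint64_t C = 1442695040888963407ULL;

uint64_t lcg_next(uint64_t *state) {
    *state = A * *state + C;   /* mod 2^64 is implicit */
    return *state;
}

/* Real number in [0, 1): u_n = U_n / m. */
double lcg_next_double(uint64_t *state) {
    return lcg_next(state) / 18446744073709551616.0;   /* / 2^64 */
}
```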
The add-with-carry and multiply-with-carry generators of Marsaglia and
Zaman [42] can be regarded as a way of implementing a multiple-precision
modulus in a linear congruential generator. However, such generators tend
to be slower than LFSR generators, and may suffer from statistical problems
due to the special form of the multiplier.
2.5 LFSR Generators and Generalizations
Menezes et al [50, §6.2.1] give the following definition of a linear
feedback shift register:
A linear feedback shift register (LFSR) of length L consists of L
stages (or delay elements) numbered 0, 1, ..., L−1, each capable
of storing one bit and having one input and one output; and a
clock which controls the movement of data. During each unit of
time the following operations are performed:

1. the content of stage 0 is output and forms part of the output
sequence;
2. the content of stage i is moved to stage i−1 for each i,
1 ≤ i ≤ L−1; and
3. the new content of stage L−1 is the feedback bit s_j, which
is calculated by adding together modulo 2 the previous
contents of a fixed subset of stages 0, 1, ..., L−1.
The definition is illustrated in Figure 2.1. The c_1, c_2, ..., c_L in
Figure 2.1 are either zeros or ones. These c_i values are used in
combination with the logical AND gates (semi-circles): when c_i is 1, the
output of the gate is the output of the i-th stage; when c_i is 0, the
output of the gate is 0. The feedback bit s_j is then the modulo-2 sum of
the outputs of the stages where c_i is 1.
²For example, the Weyl generator, sometimes used in combination with LFSR
generators to hide their defect of linearity over GF(2) (e.g. in xorgens [9, 10]), is a
special case of an LCG with multiplier 1.
Figure 2.1: Diagram of an LFSR: an illustration of the definition of an LFSR.
The c_i allow the selection of the fixed subset of stages that are added.
Source: Menezes et al [50].
The definition above is convenient for a hardware implementation. For
a software implementation, it is convenient to extend it so the data at each
stage is a vector of bits rather than a single bit. For example, this vector
might consist of the 32 or 64 bits in a computer word.
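One clock of the bit-level definition above can be sketched by packing the stage contents into a machine word, with bit i holding stage i. The length-4 example in the test corresponds to the primitive trinomial x^4 + x + 1 (period 15); that register and tap choice are an illustrative assumption, not taken from any generator in this report:

```c
#include <stdint.h>

/* One clock of an LFSR.  The stage contents are the low L bits of
   *state (bit i = stage i); taps is a bit mask with bit i set when
   stage i contributes to the feedback.  Returns the output bit,
   i.e. the old content of stage 0. */
int lfsr_step(uint32_t *state, uint32_t taps, int L) {
    int out = *state & 1;                    /* stage 0 is the output */
    uint32_t shifted = *state >> 1;          /* stage i -> stage i-1  */
    /* feedback bit: modulo-2 sum of the tapped stages */
    uint32_t fb = *state & taps;
    fb ^= fb >> 16; fb ^= fb >> 8; fb ^= fb >> 4;
    fb ^= fb >> 2;  fb ^= fb >> 1; fb &= 1;  /* parity of tapped bits */
    *state = shifted | (fb << (L - 1));      /* new content of stage L-1 */
    return out;
}
```

With L = 4 and taps = 0b1001 (stages 0 and 3, matching the recurrence s_j = s_{j−1} + s_{j−4} mod 2) every nonzero state lies on a single cycle of length 2^4 − 1 = 15, the maximum possible.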
More generally still, L'Ecuyer and Panneton [30, 55] consider a general
framework for representing linear generators over GF(2). Consider the matrix
linear recurrence:

    x_i = A x_{i−1},
    y_i = B x_i,
    u_i = Σ_{ℓ=1}^{w} y_{i,ℓ−1} 2^{−ℓ},

where x_i is a k-bit state vector, y_i is a w-bit output vector, u_i ∈ [0, 1)
is the output at step i, A is a k × k transition matrix, and B is a w × k
output transformation matrix.
By making appropriate choices of the matrices A and B, several well-known
types of generators can be obtained as special cases, including the
Tausworthe, linear feedback shift register (LFSR), generalized (linear)
feedback shift register (GFSR), twisted GFSR, Mersenne Twister, WELL,
xorshift, etc.
These generators have the advantage that they can be implemented efficiently
using full-word logical operations (since the full-word exclusive-or
operation corresponds to addition of vectors of size w over GF(2), if w is
the computer wordlength). They also have the property that the output is a
linear function of the initial state x_0 (by linear here we mean linear over
GF(2)). To avoid linearity, we can combine these generators with a cheap
generator of another class, e.g. a Weyl generator, or combine two successive
outputs using a nonlinear function (e.g. addition mod 2^w − 1, which is
nonlinear over GF(2)).
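A minimal sketch of the first option: a GF(2)-linear step combined additively with a Weyl generator. The xorshift triple and the Weyl constant (an odd integer near 2^64/φ) are illustrative assumptions, not the parameters of any generator discussed in this report:

```c
#include <stdint.h>

typedef struct { uint64_t x, w; } cg_t;  /* x must be seeded nonzero */

/* One step of the combined generator. */
uint64_t combined_next(cg_t *g) {
    /* GF(2)-linear part: a 64-bit xorshift step */
    g->x ^= g->x << 13;
    g->x ^= g->x >> 7;
    g->x ^= g->x << 17;
    /* Weyl part: w_k = w_{k-1} + (odd constant) mod 2^64 */
    g->w += 0x9E3779B97F4A7C15ULL;
    /* integer addition carries between bit positions, so the combined
       output is not a GF(2)-linear function of the initial state */
    return g->x + g->w;
}
```

The carry propagation in the final addition is exactly the cheap nonlinearity being paid for; each component on its own remains fast and easy to analyse.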
In practice, the matrix A is usually sparse, an extreme case being
generators based on primitive trinomials over GF(2), and the matrix B is
often the identity I (in which case k = w). The transformation from x_i to
y_i when B ≠ I is often called tempering.
2.6 The Mersenne Twister
The Mersenne Twister is a PRNG and a member of the class of LFSR
generators. It was introduced by Matsumoto and Nishimura [46]
(although the algorithm has been improved several times, so there are now
several variants). This particular PRNG has attracted a lot of attention
and has become quite popular, although it has also attracted some
criticism [36, 37]. It is the default generator used by a number of
different software packages, including Goose, SPSS, EViews, GNU, R, and
VSL [32].
If M_p = 2^p − 1 is a (Mersenne) prime and p ≢ ±1 (mod 8), then it is
usually possible to find a primitive trinomial of degree p. For examples,
see [13]. Any such primitive trinomial can be used to give a PRNG with
period M_p. For example, RANU4 (1991) and the closely related ranut (2001)
implemented a choice of 25 different primitive trinomials, with degrees
ranging from 127 to 3021377, depending on how much memory the user wished
to allocate. The Mersenne Twister uses degree 19937. This gives a very long
period, 2^19937 − 1, but a correspondingly large state space (at least
19937 bits). The exponent 19937 might be reduced to, say, 607 for a GPU
implementation.
We omit details of the Mersenne Twister, except to note that the first
version did not include a tempering step; this was introduced in later
versions to ameliorate some problems shared by all generators based on
primitive trinomials [45] (or, more generally, primitive polynomials with
small Hamming weight; see [55, §7]).
Although generally well accepted, the Mersenne Twister has received
some criticism. For example, Marsaglia [36, 37] and others have observed
that it is not a particularly elegant algorithm, that there is considerable
complexity in its implementation, and that its state space is rather large.
The Mersenne Twister (the version with tempering) generally performs well
on statistical tests, but it has been observed to fail two of the tests in
the TestU01 BigCrush benchmark. Specifically, it can fail linear dependency
tests, because of its linearity over GF(2). If this is regarded as a
problem, it can be avoided by combining the Mersenne Twister with a generator
that is nonlinear over GF(2). Similar remarks applied to the first version
of the xorgens generator; the current version combines an LFSR generator
with a Weyl generator to avoid linearity, and passes all tests in the
TestU01 BigCrush benchmark.
2.7 Combined Tausworthe Generators
Tausworthe PRNGs based on primitive trinomials can have a long period
and a fast implementation, but have bad statistical properties unless
certain precautions are taken. Roughly speaking, this is because each output
depends on only two previous outputs (the very fact that makes the
implementation efficient). Tezuka and L'Ecuyer showed how to combine several
"bad" generators to get a generator with good statistical properties.

Of course, the combined generator is slower than each of its component
generators, but still quite fast. For further details see L'Ecuyer [29],
Tezuka and L'Ecuyer [62].
2.8 xorshift Generators
The class of xorshift (exclusive OR-shift) generators was first proposed by
Marsaglia [41]. These generators are relatively simple in design and
implementation when compared to generators like the Mersenne Twister (§2.6).
This is because they are designed to be easy to implement with full-word
logical operations, whereas the Mersenne Twister uses a trinomial of
degree 19937, which is not a convenient multiple of the wordlength (usually
32 or 64).
An xorshift PRNG produces its sequence by repeatedly executing a series
of simple operations: the exclusive OR (XOR) of a word, and left or right
shifts of a word. In the C programming language these are basic operations
(^, <<, >>) which map well onto the instruction set of most computers.
The xorshift class of generators is a sub-class of the more general and
better known LFSR class of PRNG generators [8, 54]. Xorshift generators
inherit the theoretical properties of LFSR generators, both good and
bad. From the viewpoint of software implementations, xorshift generators
have a number of advantages over equivalent LFSR generators. The period
of an xorshift generator is usually chosen to be 2^n − 1, where n is some
multiple of the word length; see [41, 8]. This is in contrast with
generators like the Mersenne Twister, which require n to be a Mersenne
exponent. In the most common implementation of the Mersenne Twister,
n = 19937.

Unlike the Mersenne Twister, xorshift generators do not require a tempering
step, because each output bit depends on more than two previous output
bits. Because of this, xorshift generators have the potential to execute
faster than other LFSR implementations, as they require fewer CPU operations
and are more efficient in their memory operations per random number.
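As a concrete sketch, one of Marsaglia's full-period 64-bit shift triples, (13, 7, 17), gives the following generator. It is a minimal illustration rather than a recommended production generator, since on its own it is linear over GF(2) and fails some statistical tests:

```c
#include <stdint.h>

/* Marsaglia-style xorshift generator [41].  The state must be seeded
   to a nonzero value; the sequence then visits all 2^64 - 1 nonzero
   words before repeating. */
uint64_t xorshift64(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;   /* xor with left shift  */
    x ^= x >> 7;    /* xor with right shift */
    x ^= x << 17;   /* xor with left shift  */
    return *state = x;
}
```

Each step is three shifts and three XORs on a single word of state, which is why this class maps so well onto ordinary CPU instruction sets.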
2.9 xorgens
Marsaglia's original paper [41] only gave xorshift generators with periods
up to 2^192 − 1. xorgens [9] is a recently proposed family of PRNGs that
generalises the idea, with period 2^n − 1, where n can be chosen to be any
convenient power of two up to 4096. The xorgens generator has been released
as a free software package, in a C language implementation (most recently
xorgens version 3.05 [10]).
Compared to previous xorshift generators, the xorgens family has several
advantages:

- A family of generators with different periods and corresponding memory
requirements, instead of just one.
- Parameters are chosen optimally, subject to certain criteria designed
to give the best quality output.
- The defect of linearity over GF(2) is overcome efficiently by combining
the output with that of a Weyl generator.
- Attention has been paid to the initialisation code (see the comments in
§2.1 on proper initialisation), so the generators are suitable for use in
a parallel environment.
For details of the design and implementation of the xorgens family, we
refer to [9, 10]. Here we just comment on the combination with a Weyl
generator.
This step is performed to avoid the problem of linearity over GF(2) that is
common to all LFSR generators. A Weyl generator has the following simple
form:

    w_k = (w_{k−1} + α) mod 2^w,

where α is some odd constant (a recommended choice is an odd integer close
to 2^{w−1}(√5 − 1)). The output of the combined generator is

    (w_k ⊕ R^γ(w_k)) + x_k mod 2^w,    (2.1)

where x_k is the output before addition of the Weyl generator, γ is some
integer constant close to w/2, and R is the right-shift operator. The
inclusion of the term R