
Task 2: Parallel Pseudo-Random Number Generation
Nimalan Nandapalan and Richard Brent
30 March 2011 (revised 2 April 2011)
Acknowledgements
Thanks to Jiri Jaros and Alistair Rendell for their assistance with Chapter 3.
Contents
1 Introduction and Conclusions
2 Pseudo-Random Number Generators
2.1 Requirements
2.2 Spurious Requirements
2.3 Statistical Testing
2.4 Linear Congruential Generators
2.5 LFSR Generators and Generalizations
2.6 The Mersenne Twister
2.7 Combined Tausworthe Generators
2.8 xorshift Generators
2.9 xorgens
2.9.1 Cryptographic applications
3 Contemporary Multicore Systems
3.1 Intel Sandy Bridge
3.2 AMD Fusion
3.3 NVIDIA Tegra
3.4 OpenCL
4 Parallel PRNG
4.1 Statistical Testing of Parallel PRNGs
4.2 The Contiguous Subsequences Technique
4.3 The Leapfrog Technique
4.4 Independent Subsequences
4.5 Combined and Hybrid Generators
4.6 Mersenne Twister for Graphic Processors
4.7 Parallel Lagged Fibonacci Generators
4.8 xorgensGP
4.9 Other Parallel PRNG
4.9.1 CUDA SDK Mersenne Twister
4.9.2 CURAND
4.9.3 Park-Miller on GPU
4.9.4 Parallel LCG
4.9.5 Parallel MRG
4.9.6 Other Parallel Lagged Fibonacci Generators
4.9.7 Additional Notes
4.10 Summary
Bibliography
Chapter 1
Introduction and Conclusions
This is a preliminary report on parallel pseudo-random number generation.
It was written under tight time constraints, so makes no claim to being an
exhaustive survey of the field, which is already extensive, and in a state of
flux as new computer architectures are introduced.
Chapter 2 summarises the requirements of a good pseudo-random number
generator (PRNG), whether serial or parallel, and describes some popular
classes of generators, including LFSR generators, the Mersenne Twister,
and xorgens. This chapter also includes comments on statistical testing of
PRNGs.
In Chapter 3 we summarise the features and constraints of contemporary
multicore computer systems, including GPUs, with particular reference to the
constraints that their hardware and APIs impose on parallel implementations
of PRNGs.
Some specific implementations of parallel PRNGs are described in Chap-
ter 4. Our conclusion is that the jury is still out on the best parallel PRNG,
but quite likely what is best depends on the precise architecture on which
it is to be implemented. The same conclusion was reached by Thomas et
al [64] in their recent comparison of CPUs, GPUs, FPGAs, and massively
parallel processor arrays for pseudo-random number generation. There is
also a definite tradeoff between speed and quality/large-state-space, so it is
hard to say what is best in all applications.
We include (in Chapter 4) some specic recommendations for testing
of parallel PRNGs, which is a more demanding task than testing a single
sequential PRNG.
Finally, we include a bibliography which is by no means exhaustive, but
does include both important historical references and a sample of recent
references on both serial and parallel pseudo-random number generation.
Chapter 2
Pseudo-Random Number
Generators
In this chapter we survey popular and/or historically important pseudo-
random number generators (PRNGs) and comment on their speed, quality,
and suitability for parallel implementation. Specific comments on some par-
allel PRNGs are given in Chapter 4.
Applications require random numbers with various distributions (e.g. uni-
form, normal, exponential, Poisson, . . .) but the algorithms used to generate
these random numbers, and other random objects such as random permuta-
tions, almost invariably require a good uniform random number generator. In
this report we consider only the generation of uniformly distributed numbers.
For other distributions see, for example, [3, 5, 6, 7, 11, 25, 57, 63]. Usually
we are concerned with real numbers u_n which are intended to be uniformly
distributed on the interval [0, 1). Often it is convenient to consider integers
U_n in some range 0 ≤ U_n < m. In this case we require u_n = U_n/m to be
(approximately) uniformly distributed. Typically m is a power of two, e.g.
2^63 or 2^64 on a machine that supports 64-bit integer arithmetic.
Pseudo-random numbers generated in a deterministic manner on a digital
computer can not be truly random [65]. We should always say "pseudo-random"
to distinguish such numbers from truly random numbers (generated,
for example, by tossing dice, counting cosmic rays, or observing quantum
fluctuations). However, for brevity we often omit the qualifier "pseudo"
when it is clear from the context. What is required of a PRNG is that finite
segments of the sequence u_0, u_1, . . . behave in a manner indistinguishable
from a truly random sequence. In practice, this means that they pass all
statistical tests which are relevant to the problem at hand. Since the problems
to which a library routine will be applied are not known in advance, random
number generators in subroutine libraries should pass a number of stringent
statistical tests (and not fail any) before being released for general use.
A sequence u_0, u_1, . . . depending on a finite state must eventually be
periodic, i.e. there is a positive integer ρ such that u_{n+ρ} = u_n for all
sufficiently large n. The minimal such ρ is called the period. If the generator
has s state bits, then clearly ρ ≤ 2^s. It is often (but not always) true that
u_ρ = u_0, i.e. the sequence generated is a pure cycle with no "tail".
2.1 Requirements
Following are the main requirements for a good uniform PRNG and its
implementation in a subroutine library:
Uniformity. The sequence of random numbers should pass statistical
tests for uniformity of distribution. In one dimension this is easy to
achieve. Most generators in common use are provably uniform (apart
from discretization due to the finite wordlength) when considered over
their full period.
Independence. Subsequences of the full sequence u_0, u_1, . . . should be
independent. For example, members of the even subsequence u_0, u_2, u_4, . . .
should be independent of their odd neighbours u_1, u_3, . . .. This implies
that the sequence of pairs (u_{2n}, u_{2n+1}) should be uniformly distributed
in the unit square. More generally, random numbers are often used to
sample a d-dimensional space, so the sequence of d-tuples
(u_{dn}, u_{dn+1}, . . . , u_{dn+d−1}) should be uniformly distributed in the
d-dimensional cube [0, 1]^d for all reasonable values of d (certainly for
all d ≤ 6).
Passing Statistical Tests. As a generalisation of the two requirements
above (uniformity and independence), we can ask for our PRNG not
to fail statistical tests that should be passed by truly random numbers.
Of course, there is an infinite number of conceivable statistical tests;
in practice a finite battery of tests is chosen. This is discussed further
in §2.3.
Long Period versus Small State Space. A long simulation on a parallel
computer might use 10^20 random numbers. In such a case the period ρ
must exceed 10^20. For many generators there are strong correlations
between u_0, u_1, . . . and u_m, u_{m+1}, . . ., where m = ρ/2 or (ρ+1)/2 (and
similarly for other simple fractions of the period). Thus, in practice
the period should be much larger than the number of random numbers
which will ever be used. A good rule of thumb is that at most ρ^{1/2}
random numbers should be used from a generator with period ρ. To
be conservative, ρ^{1/2} might be reduced to ρ^{1/3} if we are concerned
with passing certain statistical tests such as the "birthday-spacings"
test [39].
On the other hand, the state space required by the PRNG should not
be too large, especially if multiple copies are to be implemented on a
parallel machine where the memory per processor is small. For period
ρ, the state space must be at least log_2 ρ bits. As a tradeoff between
long period and small state space, we recommend generators with
log_2 ρ in the range 256 to 1024. Generators with smaller ρ will probably
fail some statistical tests, and generators with larger ρ will consume too
much memory on certain parallel machines (e.g. GPGPUs). Generators
with very large ρ may also take a significant time to initialise. In
practice, it is convenient to have a family of related generators with
different periods ρ, so a suitable selection can be made, depending on
the statistical requirements and memory constraints.
Repeatability. For testing and development it is useful to be able to
repeat a run with exactly the same sequence of random numbers as
was used in an earlier run [23]. This is usually easy if the sequence
is restarted from the beginning (u_0). It may not be so easy if the
sequence is to be restarted from some other value, say u_m for a large
integer m, because this requires saving the state information associated
with the random number generator. In some applications it is desirable
to incorporate some physical source of randomness (e.g. as a seed for a
deterministic PRNG), but this clearly rules out repeatability.
Unpredictability. In cryptographic applications we usually require the
sequence of random numbers to be unpredictable, in the sense that it
is computationally very difficult, if not impossible, to predict the next
number u_n in the sequence, given the preceding numbers u_0, . . . , u_{n−1}.
This requires the generator to have some hidden state bits; also, u_n
must not be a linear function of u_0, . . . , u_{n−1}. If we insist on
unpredictability, then several important classes of generators are ruled out
(e.g. any that are linear over GF(2)). In the following we generally
assume that unpredictability is not a requirement. For more on
unpredictable generators, see [2].
Portability. For testing and development purposes, it is useful to be
able to generate exactly the same sequence of random numbers on two
different machines, possibly with different wordlengths. In practice it
will be expensive to simulate a long wordlength on a machine with a
short wordlength, but the converse should be easy: a machine with
a long wordlength (say w = 64) should be able to simulate a machine
with a smaller wordlength (say w = 32) without loss of efficiency.
Disjoint Subsequences. If a simulation is to be run on a machine with
several processors, or if a large simulation is to be performed on several
independent machines, it is essential to ensure that the sequences of
random numbers used by each processor are disjoint. Two methods
of subdivision are commonly used [35]. Suppose, for example, that
we require 4 disjoint subsequences for a machine with 4 processors.
One processor could use the subsequence (u_0, u_4, u_8, . . .), another the
subsequence (u_1, u_5, u_9, . . .), etc. For efficiency each processor should
be able to skip over the terms which it does not require. Alternatively
(and generally easier to implement), processor j could use the
subsequence (u_{m_j}, u_{m_j+1}, . . .), where the indices m_0, m_1, m_2, m_3 are
sufficiently widely separated that the (finite) subsequences do not overlap.
This requires some efficient method of generating u_m for large m without
generating all the intermediate values u_1, . . . , u_{m−1} (this is called
"jumping ahead" below).
Efficiency. It should be possible to implement the method efficiently
so that only a few arithmetic operations are required to generate each
random number, all vector/parallel capabilities of the machine are used,
and overheads such as those for subroutine calls are minimal. This
implies that the random number routine should (optionally) return an
array of several numbers at a time, not just one, to avoid procedure
call overheads.
Proper Initialisation. It is critically important that the PRNG is
initialised correctly. A surprising number of problems observed with
implementations of PRNGs are due, not to the choice of a bad generator
or a faulty implementation of the generator itself, but to inadequate
initialisation. For example, Matsumoto et al [49] tested 58 generators
in the GNU Scientific Library, and found some kind of initialisation
defect in 40 of them.
A common situation on a parallel machine is that different processors
will use the same PRNG, initialised with consecutive seeds. It is a
requirement that the sequences generated will be independent. (Unfortunately,
implementations often fail to meet this requirement, see §4.1.)
2.2 Spurious Requirements
In the literature one sometimes sees requirements for random number
generators that we might class as "spurious" because, although attractive or
convenient, they are not necessary. In this category we mention the
following.
Equidistribution. This is a property that, for certain popular classes
of generators, can be tested efficiently without generating a complete
cycle of random numbers, even though the definition is expressed in
terms of a complete cycle. We have argued in [12] that equidistribution
is neither necessary nor sufficient for a good pseudo-random number
generator. Briefly, the equidistribution test would be failed by any
genuine (physical) random number generator; also, we have noted above
(under "Period") that one should not use the whole period provided by
a pseudo-random number generator, so a criterion based on behaviour
over the whole period is not necessarily relevant.
Ability to Jump Ahead. Under "Disjoint Subsequences" above we
mentioned that when using a pseudo-random number generator on several
(say p) processors we need to split the cycle generated into p distinct,
non-overlapping segments. This can be done if we know in advance
how many random numbers will be used by each processor (or an upper
bound on this number, say b) and we have the ability to jump
ahead from u_0 to u_b and restart the generator from index b (also 2b,
3b, . . . , (p−1)b). Recently Haramoto et al [18, 19] proposed an efficient
way of doing this for the important class of F_2-linear generators.¹ If
the period of the generator is ρ and we have p processors, then by a
birthday paradox argument we can take p randomly chosen seeds to
start the generator at p different points in its cycle, and the probability
that the segments of length b starting from these points are not disjoint
is O(p²b/ρ). For example, if p ≤ 2^20, b ≤ 2^64, and ρ ≥ 2^128, this
probability is O(2^{−24}), which is negligible. We need ρ to be at least this
large for the reasons stated above under "Long Period". Thus, the ability to
jump ahead is unnecessary. Since jumping ahead is non-trivial to
implement and imposes a significant startup overhead at runtime, it is best
avoided in a practical implementation. That is why we did not include
such a feature in our recent random number generator xorgens [9, 10].

¹ We note that this is nothing new, since essentially the same idea and efficient
implementation via polynomial instead of matrix operations was proposed in [4] and
implemented in RANU4 (1991) in the Fujitsu SSL for the VP series of vector processors.
The idea of jumping ahead is called "fast leap-ahead" by Mascagni et al [43], who give
examples, but use matrix operations instead of (equivalent but faster) polynomial
operations to implement the idea.
2.3 Statistical Testing
Theoretically, the performance of some PRNGs on certain statistical tests
can be predicted, but usually this only applies if the test is performed over a
complete period of the PRNG. In practice, statistical testing of PRNGs over
realistic subsets of their periods requires empirical methods [31, 32, 40].
For a given statistical test and PRNG to be tested, a test statistic is
computed using a finite number of outputs from the PRNG. It is necessary
that the distribution of the test statistic for the test when applied to a
sequence of uniform, independently distributed random numbers is known, or
at least that a sufficiently good approximation can be computed [34].
Typically, a p-value is computed, which gives the probability that the test
statistic exceeds the observed value.
The p-value can be thought of as the probability that the test statistic
or a larger value would be observed for perfectly uniform and independent
input. Thus the p-value itself should be distributed uniformly on (0, 1). If the
p-value is extremely small, for example of the order 10^{−10}, then the PRNG
definitely fails the test. Similarly if 1 − p is extremely small. If the p-value
is neither close to 0 nor 1, then the PRNG is said to pass the test, although
this only says that the test failed to detect any problem with the PRNG.
Typically a whole battery of tests is applied, so there are many p-values,
not just one. We need to be cautious in interpreting the results of such
a battery of tests. For example, if 1000 tests are applied, we should not
be surprised to observe at least one p-value smaller than 0.001 or larger
than 0.999. A χ²-test might be used to test the hypothesis that the p-values
themselves look like a random sample from a uniform distribution.
For comments on the statistical testing of parallel PRNGs, and on testing
packages such as TestU01, see §4.1.
2.4 Linear Congruential Generators
Linear congruential generators (LCGs) are PRNGs of the form

U_{n+1} = (aU_n + c) mod m,

where m > 0 is the modulus, a is the multiplier (0 < a < m), and c
is an additive constant. Usually the modulus is chosen to be a power of 2,
say m = 2^w where w is close to the wordlength, or close to a power of 2, say
m = 2^w − δ, where a small δ > 0 is chosen so that m is prime.
LCGs have period at most m, which is too small for demanding applications.
Thus, although there is much theory regarding them, and historically
they have been used widely, we regard them as obsolete, except possibly when
used in combination with another generator.² Some references on LCGs are
[4, 17, 25, 38, 56].

² For example, the Weyl generator, sometimes used in combination with LFSR
generators to hide their defect of linearity over GF(2), e.g. in xorgens [9, 10], is a special
case of an LCG with multiplier a = 1.
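To make this concrete, a minimal C implementation of an LCG with modulus
m = 2^32 is sketched below. The multiplier and additive constant are the widely
published "Numerical Recipes" values, used here purely as an illustration, not
as a recommendation.

    #include <stdint.h>

    /* Minimal 32-bit LCG: U_{n+1} = (a*U_n + c) mod 2^32.  Unsigned
     * arithmetic wraps modulo 2^32, so no explicit reduction is needed.
     * The parameters are illustrative only. */
    static uint32_t lcg_state = 1u;           /* the seed U_0 */

    uint32_t lcg_next(void)
    {
        lcg_state = 1664525u * lcg_state + 1013904223u;
        return lcg_state;
    }

    /* A real number u_n in [0, 1) is obtained as lcg_next() / 4294967296.0 */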
The add-with-carry and multiply-with-carry generators of Marsaglia and
Zaman [42] can be regarded as a way of implementing a multiple-precision
modulus in a linear congruential generator. However, such generators tend
to be slower than LFSR generators, and may suffer from statistical problems
due to the special form of multiplier.
2.5 LFSR Generators and Generalizations
Menezes et al [50, §6.2.1] give the following definition of a linear feedback
shift register:
A linear feedback shift register (LFSR) of length L consists of L
stages (or delay elements) numbered 0, 1, . . . , L−1, each capable
of storing one bit and having one input and one output; and a
clock which controls the movement of data. During each unit of
time the following operations are performed:
1. the content of stage 0 is output and forms part of the output
sequence;
2. the content of stage i is moved to stage i − 1 for each i,
1 ≤ i ≤ L − 1; and
3. the new content of stage L − 1 is the feedback bit s_j which
is calculated by adding together modulo 2 the previous
contents of a fixed subset of stages 0, 1, . . . , L − 1.
The definition is illustrated in Figure 2.1. The c_1, c_2, . . . , c_L in Figure 2.1
are either zeros or ones. These c_i values are used in combination with the
logical AND gates (semi-circles). When c_i is 1, the output of the gate is the
output of the i-th stage. When c_i is 0, the output of the gate is 0. This means
the feedback bit s_j is the modulo-2 sum of the outputs of the stages where
c_i is 1.
Figure 2.1: Diagram of an LFSR: an illustration of the definition of an LFSR.
The c_i allow the selection of the fixed subset of stages that are added. Source:
Menezes et al [50].
The definition above is convenient for a hardware implementation. For
a software implementation, it is convenient to extend it so the data at each
stage is a vector of bits rather than a single bit. For example, this vector
might consist of the 32 or 64 bits in a computer word.
More generally still, L'Ecuyer and Panneton [30, 55] consider a general
framework for representing linear generators over GF(2). Consider the matrix
linear recurrence:

x_i = A x_{i−1},
y_i = B x_i,
u_i = Σ_{ℓ=1}^{w} y_{i,ℓ−1} 2^{−ℓ},

where x_i is a k-bit state vector, y_i is a w-bit output vector, u_i ∈ [0, 1) is the
output at step i, A is a k × k transition matrix, and B is a w × k output
transformation matrix.
By making appropriate choices of the matrices A and B, several well-
known types of generators can be obtained as special cases, including the
Tausworthe, linear feedback shift register (LFSR), generalized (linear) feed-
back shift register (GFSR), twisted GFSR, Mersenne Twister, WELL, xor-
shift, etc.
These generators have the advantage that they can be implemented
efficiently using full-word logical operations (since the full-word exclusive-or
operation corresponds to addition of vectors of size w over GF(2) if w is the
computer wordlength). They also have the property that the output is a
linear function of the initial state x_0 (by "linear" here we mean linear over
GF(2)). To avoid linearity, we can combine these generators with a cheap
generator of another class, e.g. a Weyl generator, or combine two successive
outputs using a nonlinear function (e.g. addition mod 2^w − 1, which is
nonlinear over GF(2)).
In practice, the matrix A is usually sparse, an extreme case being generators
based on primitive trinomials over GF(2), and the matrix B is often
the identity I (in which case k = w). The transformation from x_i to y_i when
B ≠ I is often called tempering.
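To make the word-level idea concrete, here is a minimal C sketch of a
generalized feedback shift register (GFSR) using the classic "R250" lags 250
and 103, which correspond to a primitive trinomial of degree 250. In the
framework above this is an extreme case of a sparse A with B = I (no
tempering). The seeding shown is deliberately naive and only a placeholder;
as discussed in §2.1 and §2.6, such 3-term generators need careful
initialisation and have known statistical weaknesses unless tempered or combined.

    #include <stdint.h>

    /* Word-level GFSR with the "R250" recurrence x_n = x_{n-103} XOR x_{n-250}.
     * Each stage holds a 32-bit word, so 32 bit-level LFSRs run in parallel.
     * `pos` indexes the oldest word in the circular buffer, i.e. x_{n-250}. */
    enum { R = 250, S = 103 };

    static uint32_t state[R];
    static int pos = 0;

    void gfsr_seed(uint32_t seed)       /* naive seeding, for illustration only */
    {
        for (int i = 0; i < R; i++)
            state[i] = seed = 1664525u * seed + 1013904223u;
        pos = 0;
    }

    uint32_t gfsr_next(void)
    {
        uint32_t x = state[pos] ^ state[(pos + R - S) % R]; /* x_{n-250} ^ x_{n-103} */
        state[pos] = x;                 /* overwrite the oldest word with x_n */
        pos = (pos + 1) % R;
        return x;
    }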
2.6 The Mersenne Twister
The Mersenne Twister is a PRNG and member of the class of LFSR gen-
erators. This generator was introduced by Matsumoto and Nishimura [46]
(although the algorithm has been improved several times, so there are now
several variants). This particular PRNG has attracted a lot of attention
and has become quite popular, although it has also attracted some crit-
icism [36, 37]. It is the default generator used by a number of different
software packages including Goose, SPSS, EViews, GNU, R and VSL [32].
If M_p = 2^p − 1 is a (Mersenne) prime and p ≡ ±1 (mod 8), then it is usually
possible to find a primitive trinomial of degree p. For examples, see [13]. Any
such primitive trinomial can be used to give a PRNG with period M_p. For
example, RANU4 (1991) and the closely related ranut (2001) implemented a
choice of 25 different primitive trinomials, with degrees ranging from 127 to
3021377, depending on how much memory the user wished to allocate. The
Mersenne Twister uses degree 19937. This gives a very long period, 2^19937 − 1,
but a correspondingly large state space (at least 19937 bits). The exponent
19937 might be reduced to say 607 for a GPU implementation.
We omit details of the Mersenne Twister, except to note that the first
version did not include a "tempering" step, but this was introduced in later
versions to ameliorate some problems shared by all generators based on
primitive trinomials [45] (or, more generally, primitive polynomials with small
Hamming weight, see [55, 7]).
Although generally well accepted, the Mersenne Twister has received
some criticism. For example, Marsaglia [36, 37] and others have observed
that it is not a particularly elegant algorithm, there is considerable complex-
ity in its implementation, and its state space is rather large.
The Mersenne Twister (version with tempering) generally performs well
on statistical tests, but it has been observed to fail two of the tests in the
TestU01 BigCrush benchmark. Specifically, it can fail linear dependency
tests, because of its linearity over GF(2). If this is regarded as a problem,
it can be avoided by combining the Mersenne Twister with a generator
that is nonlinear over GF(2). Similar remarks applied to the first version
of the xorgens generator; the current version combines an LFSR generator
with a Weyl generator to avoid linearity, and passes all tests in the TestU01
BigCrush benchmark.
2.7 Combined Tausworthe Generators
Tausworthe PRNGs based on primitive trinomials can have a long period
and fast implementation, but have bad statistical properties unless certain
precautions are taken. Roughly speaking, this is because each output
depends on only two previous outputs (the very fact that makes the
implementation efficient). Tezuka and L'Ecuyer showed how to combine several
"bad" generators to get a generator with good statistical properties.
Of course, the combined generator is slower than each of its component
generators, but still quite fast. For further details see L'Ecuyer [29], Tezuka
and L'Ecuyer [62].
2.8 xorshift Generators
The class of xorshift (exclusive OR-shift) generators was first proposed by
Marsaglia [41]. These generators are relatively simple in design and
implementation when compared to generators like the Mersenne Twister (§2.6).
This is because they are designed to be easy to implement with full-word
logical operations, whereas the Mersenne Twister uses a trinomial of degree
19937, which is not a convenient multiple of the wordlength (usually 32
or 64).
An xorshift PRNG produces its sequence by repeatedly executing a series
of simple operations: the exclusive OR (XOR) of a word, and the left or
right shift of a word. In the C programming language these are basic
operations (^, <<, >>) which map well onto the instruction set of
most computers.
The xorshift class of generators is a sub-class of the more general and bet-
ter known LFSR class of PRNG generators [8, 54]. Xorshift class generators
inherit the theoretical properties of LFSR class generators, both good and
bad. From the viewpoint of software implementations, xorshift generators
have a number of advantages over equivalent LFSR generators. The period
of an xorshift generator is usually chosen to be 2^n − 1, where n is some
multiple of the word length, see [41, 8].
This is in contrast with generators like the Mersenne Twister which
require n to be a Mersenne exponent. In the case of the most common
implementation of the Mersenne Twister, n = 19937.
Unlike the Mersenne Twister, xorshift generators do not require a tempering
step, because each output bit depends on more than two previous output
bits. Because of this, xorshift generators have the potential to execute faster
than other LFSR implementations, as they require fewer CPU operations
and are more efficient in their memory operations per random number.
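As an illustration, one of the 64-bit xorshift generators listed in Marsaglia's
paper [41] takes only a few lines of C; the shift triple (13, 7, 17) is taken
from that paper, and the state must be initialised to any nonzero value.

    #include <stdint.h>

    /* 64-bit xorshift generator with shift triple (13, 7, 17); period 2^64 - 1.
     * The state must never be zero (zero is a fixed point of the recurrence). */
    static uint64_t xs_state = 88172645463325252ULL;   /* any nonzero seed */

    uint64_t xorshift64(void)
    {
        uint64_t x = xs_state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return xs_state = x;
    }

Like all members of the LFSR class, this generator is linear over GF(2),
which is the defect that the combination step described in the next section
is designed to remove.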
2.9 xorgens
Marsaglia's original paper [41] only gave xorshift generators with periods up
to 2^192 − 1. xorgens [9] is a recently proposed family of PRNGs that generalise
the idea and have period 2^n − 1, where n can be chosen to be any convenient
power of two up to 4096. The xorgens generator has been released as a free
software package, in a C language implementation (most recently xorgens
version 3.05 [10]).
Compared to previous xorshift generators, the xorgens family has several
advantages:
A family of generators with different periods and corresponding memory
requirements, instead of just one.
Parameters are chosen optimally, subject to certain criteria designed
to give the best quality output.
The defect of linearity over GF(2) is overcome efficiently by combining
the output with that of a Weyl generator.
Attention has been paid to the initialisation code (see comments in
§2.1 on proper initialisation), so the generators are suitable for use in
a parallel environment.
For details of the design and implementation of the xorgens family, we
refer to [9, 10]. Here we just comment on the combination with a Weyl
generator.
This step is performed to avoid the problem of linearity over GF(2) that is
common to all LFSR generators. A Weyl generator has the following simple
form:
w_k = w_{k−1} + ω mod 2^w,

where ω is some odd constant (a recommended choice is an odd integer close
to 2^{w−1}(√5 − 1)). The final output of an xorgens generator is given by

(w_k(I + R^γ) + x_k) mod 2^w,      (2.1)

where x_k is the output before the addition of the Weyl generator, γ is some
integer constant close to w/2, and R is the right-shift operator. The inclusion
of the term R^γ ensures that the least-significant bits have high linear
complexity (if we omitted this term, the Weyl generator would do little to
improve the quality of the least-significant bit, since (w_k mod 2) is periodic
with period 2).
As addition mod 2^w is a non-linear operation over GF(2), the result is
a mixture of operations from two different algebraic structures, allowing the
sequence produced by this generator to pass all of the empirical tests in
BigCrush, including those failed by the Mersenne Twister. A bonus is that
the period is increased by a factor 2^w (though this is not free, since the state
space size is increased by w bits).
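A minimal C sketch of this combination step (for w = 64) is shown below. It
is not the released xorgens code: the constant OMEGA and the shift GAMMA are
merely plausible values chosen to match the description above (an odd
constant close to 2^63(√5 − 1) and a shift close to w/2); see [9, 10] for the
actual parameters and the full generator.

    #include <stdint.h>

    /* Sketch of the combination step (2.1) for w = 64: the raw LFSR output
     * x_k is added (mod 2^64) to w_k(I + R^gamma), where w_k is a Weyl
     * sequence.  OMEGA and GAMMA are illustrative, not the xorgens values. */
    #define OMEGA 0x9E3779B97F4A7C15ULL  /* odd constant near 2^63*(sqrt(5)-1) */
    #define GAMMA 32                     /* an integer close to w/2 */

    static uint64_t weyl = 0;            /* Weyl state w_k */

    uint64_t combine_with_weyl(uint64_t x_k)
    {
        weyl += OMEGA;                         /* w_k = w_{k-1} + omega mod 2^64 */
        return x_k + (weyl ^ (weyl >> GAMMA)); /* (w_k(I + R^gamma) + x_k) mod 2^64 */
    }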
2.9.1 Cryptographic applications
xorgens is not recommended for cryptographic applications, since an
opponent who observed an initial segment of output might be able to subtract
the Weyl generator component and then use linearity over GF(2) to predict
the subsequent output. At the cost of a slowdown by a factor of about two,
we can combine two successive outputs of xorgens using a suitable nonlinear
operation, for example addition mod (2^w − 1). This generator [implemented
but not yet released] is believed to be cryptographically secure if n ≥ 256,
and it is certainly faster than most other cryptographically secure generators
(e.g. those obtained by feeding the output of an insecure generator into a
secure hash function, see for example [47, 48]).
Chapter 3
Contemporary Multicore
Systems
Contemporary multicore systems can be loosely categorized into systems
that have evolved based on replicating on a single die what was previously
considered to be a single general purpose CPU, and those that have evolved
based on the concept of having multiple units executing the same instruction
on different data in a single instruction multiple data (SIMD) programming
model. The former began to emerge in the x86 systems offered by AMD and
Intel around 2005 through chips such as the model 865 Opteron and Pentium
D respectively. The latter is most closely associated with the modern Graphic
Processing Units (GPUs), which arguably began to transition from dedicated
graphics engines into more general purpose computing systems around 2006
with the release of the NVIDIA GTX8800 system.
Figure 3.1: Structural comparison between GPUs and CPUs [51].
The different philosophies behind the CPU and GPU architectures are
further illustrated in Figure 3.1. This shows that the GPU architecture
reserves a significantly higher proportion of its transistors for data processing
compared to the CPU [58]. That is, the GPU has evolved into a massively
parallel, multi-threaded and multi-core processor, while CPUs have evolved
to have a small number of cores where each is a highly efficient serial,
single-threaded processor.
Existing work on the efficient implementation of random number generators
on contemporary processors has tended to focus on either CPU or GPU
type systems. In reality, however, the boundary between these two paradigms
is somewhat blurred, with conventional x86 processors being capable of
performing some SIMD-like instructions while GPUs contain multiple SIMD
execution units each capable of executing a different instruction stream.
In the future the boundary between CPU and GPU will blur even fur-
ther. Thus in this chapter rather than review existing CPU and GPU sys-
tems we instead focus on three emerging multicore systems from Intel, AMD
and NVIDIA that incorporate concepts from the CPU and GPU paradigms;
namely the Intel Sandy-Bridge, AMD Fusion and NVIDIA Tegra systems.
We then briey consider OpenCL as a portable programming language for
multicore systems.
3.1 Intel Sandy Bridge
The Sandy Bridge architecture [22] represents a new processor architecture
proposed by Intel this year. Sandy Bridge is a 32nm CPU with an on-
die GPU. While Clarkdale/Arrandale architectures have a 45nm GPU on
package, Sandy Bridge moves the transistors on die, enabling the GPU to
share the L3 cache with the CPU.
There are many improvements in the Sandy Bridge architecture, such as a
redesigned CPU front-end, new power-saving and turbo modes, and an
improved memory controller. However, from the high performance computing
point of view the most important are the new Advanced Vector Extensions
(AVX) instructions and more powerful GPU integration.
The AVX [21] instructions support 256-bit operands, twice the width of the
standard SSE registers. The extension is done at minimal die expense: this
minimizes the impact of AVX on the execution die area while enabling twice
the floating-point operation throughput. Intel AVX improves performance
through wider vectors, new extensible syntax, and richer functionality. This
results in better management of data in general purpose applications such as
image and audio/video processing, scientific simulations, financial analytics,
and 3D modeling and analysis.
The largest performance improvement of Sandy Bridge over the current
Westmere architecture actually has nothing to do with the CPU; it is all in
the graphics. While the CPU cores show a 10-30% improvement in
performance, Sandy Bridge graphics performance is easily double.
The GPU is treated like an equal citizen in the Sandy Bridge world: it
gets equal access to the L3 cache. The graphics driver controls what gets
into the L3 cache, and you can even limit how much cache the GPU is able
to use. Storing graphics data in the cache is particularly important as it
saves trips to main memory, which are costly from both a performance and a
power standpoint. Redesigning a GPU to make use of a cache is not a simple
task. It usually requires the sort of complete re-design that NVIDIA did with
GF100.
Sandy Bridge graphics is the anti-Larrabee. While Larrabee focused on
extensive use of fully programmable hardware (with the exception of the
texture hardware), Sandy Bridge graphics (internally referred to as Gen 6
graphics) makes extensive use of fixed-function hardware. The design
mentality was that anything that could be described by a fixed function should
be implemented in fixed-function hardware. The benefit is
performance/power/die area efficiency, at the expense of flexibility. Keeping much
of the GPU fixed function conforms to Intel's CPU-centric view of the world.
By contrast, making the GPU as programmable as possible makes more sense
for a GPU-focused company like NVIDIA.
The programmable shader hardware is composed of shaders/cores/execution
units that Intel calls Execution Units (EUs). Each EU can dual-issue,
picking instructions from multiple threads. The internal ISA maps
one-to-one with most DirectX 10 API instructions, resulting in a very
CISC-like architecture. Moving to one-to-one API-to-instruction mapping
increases IPC by effectively increasing the width of the EUs.
There are other improvements within the EU. Transcendental functions
are handled by hardware in the EU and their performance has been sped up
considerably. Intel informed us that sine and cosine operations are several
orders of magnitude faster now than in current HD Graphics.
3.2 AMD Fusion
The Advanced Micro Devices (AMD) Fusion Family [14, 16] has a design
philosophy that represents another example of the blurring between CPUs
and GPUs. These processors are classed by AMD as Accelerated Processing
Units (APUs). The AMD Fusion processor is currently available in a 40nm
microarchitecture, with a 32nm version in development. The philosophy
taken by AMD makes the Fusion processors similar to Sandy Bridge in that
they both combine an x86 based processor with a vector processor, similar
to those on GPUs, on the same die.
Other improvements that the Fusion design offers include direct support
for advanced memory controllers, input and output controllers, video
decoders, display outputs, and bus interfaces, and other features such as SSE.
The main advantage, however, is the inclusion and direct support of both
scalar and vector hardware as high-level programmable processors. This
allows high-level industry-standard tools such as DirectCompute and OpenCL
to be fully utilised on this device. The advantage of this level of support for
high-level tools such as OpenCL is that code designed on other OpenCL-capable
devices is readily portable to these architectures and should offer
a notable performance increase because of the integrated vector processing
elements and redesigned northbridge.
As Fusion is a high-abstraction design philosophy for processors, many
of the specific details of actual processors are not readily available. The
few Fusion based processors that have been released are targeted at small
form factor desktop machines and portable devices (Bobcat based).
Consequently, these processors are made of lightweight serial and vector cores.
The development plan for Fusion includes the provision for larger, more
complex, more powerful desktop based processors with multiple serial cores
(Bulldozer based).
3.3 NVIDIA Tegra
NVIDIA are one of the largest producers of GPU processors and devices and
up until recently developed exclusively for this platform. The Tegra plat-
form [53] is a recently announced architecture from NVIDIA that represents
a shift from their GPU focus. The Tegra is designed with portable devices
as a particular consideration.
Like the Intel Sandy Bridge and AMD Fusion based processors, the Tegra
gains performance by incorporating serial general-purpose CPU based
processors with vectorised GPU-based processors on the same chip. The
difference with the Tegra based processors is that they also include the
northbridge, southbridge and complete memory controller on the same die.
This approach is otherwise known as a System On a Chip (SOC), and is
notably different from the APUs and Sandy Bridge architecture discussed
earlier.
The serial CPU based processor in the Tegra system uses an ARM based
processor. NVIDIA have announced that a quad-core system will be available.
Like the Fusion systems, all Tegra systems will be fully programmable.
3.4 OpenCL
OpenCL is a standardized, cross-platform, parallel-computing API based on
the C language. It is designed to enable the development of portable parallel
applications for systems with heterogeneous computing devices. It addresses
signicant limitations of the previous programming models for heterogeneous
parallel-computing systems. Many of the concepts in OpenCL derive from
NVIDIAs CUDA and can be found in [51].
The CPU-based parallel programming models have typically been based
on standards such as OpenMP but usually do not encompass the use of
special memory types or SIMD execution by high-performance programmers.
Joint CPU/GPU heterogeneous parallel programming models such as CUDA
address complex memory hierarchies and SIMD execution but are typically
platform, vendor, or hardware specic. These limitations make it dicult
for an application developer to access the computing power of CPUs, GPUs,
and other types of processing units from a single multiplatform source-code
base.
The development of OpenCL was initiated by Apple and developed by the
Khronos Group, the same group that manages the OpenGL standard. On
one hand, it draws heavily on CUDA in the areas of supporting a single code
base for heterogeneous parallel computing, data parallelism, and complex
memory hierarchies. This is the reason why CUDA programmers will find
these aspects of OpenCL familiar once they connect the terminologies (e.g. a
CUDA thread corresponds to an OpenCL work-item).
On the other hand, OpenCL has a more complex platform and device
management model that reflects its support for multiplatform and
multivendor portability. This leads to a more complex initialization phase, device
context creation, and the maintenance of several command queues for
different devices.
OpenCL implementations already exist on AMD and NVIDIA GPUs as well
as x86 CPUs and IBM Cell.
OpenCL programs must be prepared to deal with much greater hardware
diversity and thus will exhibit more complexity. Also, many OpenCL features
are optional and may not be supported on all devices, so a portable OpenCL
code must avoid using these optional features. Some of these optional
features, though, allow applications to achieve significantly more performance in
devices that support them. As a result, a portable OpenCL code may not be
able to achieve its performance potential on any of the devices; therefore, one
should expect that a portable application that achieves high performance on
multiple devices will employ sophisticated runtime tests and choose among
multiple code paths according to the capabilities of the actual device used.
Chapter 4
Parallel PRNG
In this chapter we review some of the recent literature relating to the
implementation of parallel PRNGs on GPUs and multi-core CPUs, and include a
discussion of some of the general techniques used. First we discuss statistical
testing of parallel PRNGs. Then we mention some approaches for generating
random numbers on parallel architectures as suggested by Aluru [1].
These are the contiguous subsequence, leapfrog, and independent subsequence
techniques. We then present several current examples of parallel PRNGs and
comment upon their design and performance.
4.1 Statistical Testing of Parallel PRNGs
We discussed the statistical testing of PRNGs briefly in §2.3. Here we
continue that discussion with an emphasis on the testing of parallel PRNGs.
Often statistical tests are applied to the sequence generated by a PRNG
with one initial seed. This is inadequate because it does not test the "Proper
Initialisation" requirement stated in §2.1. Suppose a PRNG generates a
sequence (x_{s,j})_{j≥0} when initialised with seed s. We can think of

X = (x_{s,j})_{s≥0, j≥0}

as a two-dimensional semi-infinite array. Typical testing only tests the rows
of X, i.e. the tests are applied with s fixed. We should also test columns, i.e.
keep j fixed and vary s. The case j = 0, s varying, is important: in this
case the PRNG is being used as a pseudo-random function of the seed s;
specifically,

f(s) = x_{s,0}

should behave like a pseudo-random function.
Most common generators will fail this test unless it is considered as a
possibility when the code for initialisation is implemented.¹ Similarly, we should
test sequences obtained by interleaving rows of X, since such sequences might
be generated on a parallel machine. For example, we could test the sequence

x_{0,0}, x_{1,0}, x_{0,1}, x_{1,1}, x_{0,2}, x_{1,2}, . . .

obtained by interleaving rows 0 and 1, which could correspond to the sequence
generated on a machine with two processors. Judging from a lack of mention
in the literature, we surmise that relatively little such testing of parallel
PRNGs has yet been done.

¹ For example, Pedro Gimenes and Richard Brent [unpublished] found such a defect
in a PRNG recommended in the (draft) third edition of Knuth's The Art of Computer
Programming, Volume 2 [25]. Fortunately, the defect was found and corrected just before
the book was printed.
In some applications a relatively small number of random numbers are
used, then the PRNG is re-initialised with a different seed. For example, a
PRNG might be used to shuffle a pack of cards, with a different seed for each
shuffle. Thus we should test the sequence

x_{0,0}, . . . , x_{0,ℓ}, x_{1,0}, . . . , x_{1,ℓ}, . . . , x_{k,0}, . . . , x_{k,ℓ}

for various choices of the parameters (k, ℓ).
TestU01: A Software Library for Testing PRNGs
The TestU01 software library was created by L'Ecuyer and Simard to assist
in the empirical statistical testing of uniform PRNGs [32].
The number of possible statistical tests is clearly infinite [31]. It is also
clear that no single PRNG can pass all conceivable tests. However, statistical
tests are still extremely useful, as a bad PRNG will fail simple tests, while a
good PRNG will only fail overly complicated tests (perhaps designed with
inside knowledge of the algorithm used in the PRNG), or tests that would
take an impractically long time to run.
Perhaps the first de facto standard set of tests for PRNGs was suggested by
Knuth in the first edition of The Art of Computer Programming, Vol. 2. Since
then, many authors have contributed other tests for detecting regularities
within the sequences generated by PRNGs [34]. The TestU01 library includes
many of these and represents a good framework for the testing of PRNGs in
a general environment.
Apart from TestU01, the main public-domain testing package for general
PRNGs is DIEHARD [40]. However, the DIEHARD package suffers from a
number of limitations. Firstly, the suite of tests run, and the parameters they
run under, are fixed inside the code of the package. Furthermore, the sample
sizes for the tests are relatively small and so the package is not as stringent as
TestU01. The interface for DIEHARD also requires the output of the PRNG
to be provided in a formatted le of 32 bit blocks, which is slower and quite
restrictive when compared with the interface of TestU01. The tests within
DIEHARD are available in TestU01 in a much less restrictive form, as the
parameters and tests used can be changed during runtime. Thus, we can
regard DIEHARD as obsolete.
The TestU01 library includes three benchmarks with increasing levels
of stringency for the purpose of testing PRNGs [31]. These are the SmallCrush,
Crush, and BigCrush benchmarks, which require from a few seconds
to many hours on present desktop personal computers. The BigCrush
suite requires approximately 2^38 random samples and has 106 tests producing
160 test statistics and corresponding p-values.
Note that TestU01 is designed to test a single, serial PRNG. It is not
specifically designed to test a parallel PRNG. Thus, in order to test various
combinations of rows, columns, merged rows, etc. of the array X = (x_{s,j})_{s≥0, j≥0}
above, some coding effort by the user is required in order to generate the
proper input for TestU01.
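As a sketch of the kind of glue code required, the following C fragment
interleaves the outputs of two differently seeded instances of a generator under
test and presents the result to TestU01 as an external generator. It assumes
TestU01's unif01_CreateExternGenBits/bbattery interface; the my_prng_*
functions are a hypothetical stand-in for whatever PRNG is being tested.

    #include "unif01.h"
    #include "bbattery.h"
    #include <stdint.h>

    /* Hypothetical interface of the PRNG under test (placeholder, not a real API). */
    typedef struct my_prng my_prng;
    extern my_prng *my_prng_create(uint32_t seed);
    extern uint32_t my_prng_next(my_prng *g);

    static my_prng *gen[2];          /* two streams with consecutive seeds */
    static int which = 0;

    /* Return the interleaved sequence x_{0,0}, x_{1,0}, x_{0,1}, x_{1,1}, ... */
    static unsigned int interleaved_bits(void)
    {
        unsigned int u = my_prng_next(gen[which]);
        which ^= 1;
        return u;
    }

    int main(void)
    {
        gen[0] = my_prng_create(1);  /* consecutive seeds: a deliberately hard case */
        gen[1] = my_prng_create(2);
        unif01_Gen *g = unif01_CreateExternGenBits("interleaved pair", interleaved_bits);
        bbattery_SmallCrush(g);      /* or Crush / BigCrush for more stringency */
        unif01_DeleteExternGenBits(g);
        return 0;
    }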
4.2 The Contiguous Subsequences Technique
Now we mention some approaches for generating random numbers on
parallel architectures as suggested by Aluru [1] and others, starting with the
contiguous subsequences technique.
In this approach each processor is responsible for producing and processing
a contiguous subsequence of length b of the original sequence. Thus, if
the original sequence is (x_j)_{j≥0}, then processor k is responsible for producing
x_{kb}, . . . , x_{kb+b−1}. In particular, processor k starts at x_{kb}.
There are a number of implications. Clearly b should be chosen sufficiently
large, so that the subsequence of pseudo-random numbers required by
each processor does not exceed b terms (if this happened, then the subsequences
used on each processor would overlap, with disastrous consequences for their
statistical independence). If it is not known in advance how many random
numbers will be required, the only option is to set b very large. Producing
the initial state for each processor to generate the correct subsequence is a
nontrivial task, and is only feasible for certain classes of PRNGs (including,
however, the common classes LCG and LFSR).
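The bookkeeping itself is trivial, as the C sketch below indicates; prng_seek
is a placeholder for whatever jump-ahead facility the chosen PRNG provides,
not a real API.

    #include <stdint.h>

    /* Contiguous-subsequence split: processor k consumes block [k*b, (k+1)*b)
     * of the base sequence.  prng_seek() stands for the PRNG's jump-ahead
     * routine (feasible for the LCG and LFSR classes, as noted above). */
    typedef struct prng_state prng_state;                  /* opaque state */
    extern void     prng_seek(prng_state *g, uint64_t n);  /* position at x_n */
    extern uint32_t prng_next(prng_state *g);

    void init_block(prng_state *g, uint64_t k, uint64_t b)
    {
        prng_seek(g, k * b);   /* processor k starts at x_{kb} */
    }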
4.3 The Leapfrog Technique
The Leapfrog technique is another approach to allocating subsequences
between parallel processors. The motivation for Leapfrog is to preserve the
statistical qualities of the underlying PRNG. If there are p processors, then
processor i, 0 ≤ i < p, is responsible for generating x_i, x_{i+p}, x_{i+2p}, . . . This
implies that we need an efficient way of generating every p-th member of
the original PRNG sequence. In general, this is difficult or impossible without
generating all the intermediate values, but in certain special cases it is
feasible.
For example, consider a linear congruential generator (LCG) of the form

x_j = (a x_{j−1} + c) mod m.

Then it is easy to see that the subsequence (x_{i+jp})_{j≥0} has the same form:

x_{i+jp} = (A x_{i+(j−1)p} + B) mod m,

where A = a^p mod m and B = c(a^p − 1)/(a − 1) mod m.
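The following short C program sketches the computation: it derives A and B
for a 32-bit LCG with m = 2^32 (the multiplier and constant are the same
illustrative values used in §2.4) and checks that one leapfrog step agrees with
p ordinary steps.

    #include <stdint.h>
    #include <stdio.h>

    /* Compute A = a^p mod 2^32 and B = c*(1 + a + ... + a^(p-1)) mod 2^32,
     * i.e. the constants of the leapfrogged LCG.  Unsigned 32-bit arithmetic
     * wraps modulo 2^32, so the reductions are implicit. */
    static void leapfrog_constants(uint32_t a, uint32_t c, unsigned p,
                                   uint32_t *A, uint32_t *B)
    {
        uint32_t ap = 1, sum = 0;        /* a^k and the partial geometric sum */
        for (unsigned k = 0; k < p; k++) {
            sum += ap;
            ap *= a;
        }
        *A = ap;
        *B = c * sum;
    }

    int main(void)
    {
        const uint32_t a = 1664525u, c = 1013904223u;  /* illustrative LCG */
        const unsigned p = 4;                          /* number of processors */
        uint32_t A, B, x = 12345u, y = 12345u;
        leapfrog_constants(a, c, p, &A, &B);

        for (unsigned k = 0; k < p; k++) x = a * x + c;  /* p serial steps */
        y = A * y + B;                                   /* one leapfrog step */
        printf("%lu %lu\n", (unsigned long)x, (unsigned long)y); /* identical */
        return 0;
    }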
Unfortunately, we saw in §2.4 that LCGs are not recommended for use as
PRNGs, because of their short periods and poor statistical qualities. They
are satisfactory in some undemanding applications, but unsuitable for in-
clusion in a library of (serial or parallel) PRNGs intended for demanding
applications.
Provided p is a power of two, the Leapfrog technique is applicable to parallel
lagged Fibonacci generators, for which see §4.7. However, the statistical
quality of these PRNGs is also dubious [55, 7], since they are based on
3-term recurrences (see the discussion in §2.6).
4.4 Independent Subsequences
The final technique we consider as a general approach to parallel PRNG is
the independent subsequence technique [15]. This is a conceptually simpler
approach to distributing random number generation on parallel architectures.
The Leapfrog and Contiguous Subsequences techniques are limited in that
they are dependent on the efficient calculation of certain elements of the
sequence, which is only possible in special cases.
We already outlined the independent subsequence technique when
discussing the ability to jump ahead in §2.2.
The independent subsequence technique is similar to the contiguous
subsequence technique in that each processor produces a contiguous subsequence
of the original sequence. The difference is that these subsequences start at
pseudo-random points in the full period of the underlying PRNG. We simply
choose distinct pseudo-random seeds, one per processor, and each processor
generates its sequence starting with its particular seed. In practice the seeds
might be chosen to be some simple function of the processor number, but
this is hardly a random choice, so care has to be taken with the initialisation
of a PRNG to be used in this way; see the discussion of proper initialisation
in §2.1.
In §2.2 we showed that, if there are p processors and each produces b
pseudo-random numbers, then the period ρ of the underlying PRNG should
be much larger than p²b in order to ensure that the chance of different
processors producing overlapping sequences is negligible. In practice this means
ρ ≥ 2^128 (approximately), which is not a severe constraint since we want a
period at least as large as 2^128 for other reasons discussed in §2.1.
4.5 Combined and Hybrid Generators
In this section we present two closely related techniques that can be used
to aid the creation of PRNGs suitable for parallel architectures. These
are combined generators and hybrid generators. The goal of these techniques
is to amalgamate the output of two different generators to improve
the properties of the sequence of random numbers produced. This possibility
has been known, both empirically and theoretically, for some time [15].
Hybrid generators differ from combined generators in that the underlying
algorithms behind the PRNGs in the combination are different.² An example
of a combined generator is the Combined Tausworthe Generator (cf. §2.7) of
L'Ecuyer [29].

² The terminology used in the literature is inconsistent: hybrid generators are often
called combined generators.
There are several ways that the output of two generators can be combined.
Some comments on this have been made by Marsaglia in [39], by
L'Ecuyer [28] in the context of LCGs, and later in the context of Tausworthe
generators [29]. Some simple ways to combine outputs, say x and y, are to
take the exclusive or x ⊕ y, the sum mod 2^w (where w is the wordlength and
x, y are regarded as unsigned integers), and the sum mod 2^w − 1 (which has
the advantage that the least-significant bit is modified by a carry from the
most-significant bit position).
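For w = 32 these three operations can be written in C as follows (a minimal
sketch; the mod 2^32 − 1 version folds the carry out of the top bit back into
the bottom bit, which is exactly the property mentioned above).

    #include <stdint.h>

    static uint32_t combine_xor(uint32_t x, uint32_t y) { return x ^ y; }
    static uint32_t combine_add(uint32_t x, uint32_t y) { return x + y; }  /* mod 2^32 */

    static uint32_t combine_add_m1(uint32_t x, uint32_t y)   /* mod 2^32 - 1 */
    {
        uint64_t s = (uint64_t)x + y;          /* at most 2^33 - 2 */
        s = (s & 0xFFFFFFFFu) + (s >> 32);     /* fold the carry back in */
        if (s >= 0xFFFFFFFFu) s -= 0xFFFFFFFFu;
        return (uint32_t)s;
    }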
In essence, the combination of two generators can be described as follows.
If one generator produces the sequence x_1, x_2, x_3, . . . and another generator
produces the sequence y_1, y_2, y_3, . . . in some finite set which has some binary
operator ◦ that forms a Latin square, then the combination generator defined
by producing the sequence x_1 ◦ y_1, x_2 ◦ y_2, x_3 ◦ y_3, . . . should be better than,
or at least no worse than, each of the original generators. Marsaglia [39]
showed that, if a suitable operator ◦ is used to create a random variable
z = x ◦ y, then the distribution of z should be no less uniform than that
of x or y. A useful side-effect of combining generators is an increase in the
period: the period of the combined generator is the least common multiple
of the periods of the component generators.
An example which uses both techniques is the Hybrid Combined Taus-
worthe Generator. This generator was developed in the context of GPU
implementations and Monte Carlo simulations and is presented by Howes
et al [20]. The Hybrid Combined Tausworthe Generator is a combination
of two earlier generators: taus88 (L'Ecuyer), a three-component combined
Tausworthe generator, and "Quick and Dirty" (Press et al), a 32-bit LCG.
The defects of each generator make them unsuitable individually as PRNGs.
However, in combination the properties of one generator mask the statistical
defects of the other, making this combined generator satisfactory (according
to the authors).
This generator has a combined period of 2^121. Because of the simplicity
of the Tausworthe generator and LCG designs, the Hybrid Combined
Tausworthe Generator only requires a state space of four 32-bit words, for
the 32-bit generator.
Howes et al compare their implementation of the Hybrid Combined
Tausworthe Generator in CUDA with another generator that they implemented
in CUDA on a G80 GPU. The second generator is a Wallace Gaussian
Generator, which returns random numbers that are normally distributed. A
Box-Muller transformation [25] was used to convert the output of the Hybrid
Tausworthe generator to the normal distribution. The Wallace generator is
slightly faster, at 5.2×10^9 RN/s, than the Hybrid Combined Tausworthe
Generator plus Box-Muller transformation, at 4.3×10^9 RN/s, but both are
fast: one hundred times faster than generating random numbers on the CPU.
There are a number of other notable hybrid/combined generators. The xorgens generator (section 2.9) and its GPU implementation, xorgensGP (section 4.8), are examples of hybrid generators showing improvements over simpler designs such as plain xorshift generators (section 2.8). The xorgens class of generators combines xorshift generators with a Weyl generator. Another example is the KISS family of PRNGs. One instance of this class of generator, KISS99 [32], combines three different types of generators: an LCG, three different LFSR generators, and two multiply-with-carry generators. Perhaps a total of six component generators is overkill!
The techniques for hybrid and combined generators have several implications which make them suitable for use in parallel PRNGs. If we recall the contiguous subsequence technique, the increased period of the combined generators allows b, the size of the subsequences, to be set comfortably large. Although not implemented in the Hybrid Combined Tausworthe Generator, near maximally equidistributed generators can be found dynamically for a specified number of combinations [29]. This means that a large number of distinct combined generators can be created from a set of generators suitable for this technique, with an obvious application to parallel random number generation. In the extreme case, we could use a different PRNG on each processor.
4.6 Mersenne Twister for Graphic Processors
The Mersenne Twister for Graphic Processors (MTGP) is a recently released
variant of the Mersenne Twister [46, 60]. As its name suggests, it was de-
signed for GPGPU applications. In particular, it was designed with parallel
Monte Carlo simulations in mind. It is released with a parameter generator
for the Mersenne Twister algorithm to supply users with distinct generators
on request (MTGPs with different sequences). The MTGP is implemented in
NVIDIA CUDA [51] in both 32-bit and 64-bit versions. Following the popu-
larity of the original Mersenne Twister PRNG, this generator has become a
de facto standard for comparing GPU-based PRNGs.
The approach taken by the MTGP to make the Mersenne Twister parallel can be explained as follows. The next element of the sequence, x_i, is expressed as some function, h, of a number of previous elements of the sequence:

    x_i = h(x_{i-N}, x_{i-N+1}, x_{i-N+M})
The parallelism that can be exploited in this algorithm becomes apparent when we consider the pattern of dependency between further elements of the sequence:

    x_i         = h(x_{i-N},   x_{i-N+1}, x_{i-N+M})
    x_{i+1}     = h(x_{i-N+1}, x_{i-N+2}, x_{i-N+M+1})
        ...
    x_{i+N-M-1} = h(x_{i-M-1}, x_{i-M},   x_{i-1})
    x_{i+N-M}   = h(x_{i-M},   x_{i-M+1}, x_i)                  (4.1)

The last element in this list, x_{i+N-M}, requires the value of x_i, which has not been calculated yet. Thus, only N - M elements of the sequence produced by a Mersenne Twister can be calculated in parallel.
As N is fixed by the Mersenne prime chosen for the algorithm, all that is required to maximise the parallel efficiency of the MTGP is the careful selection of the constant M. This constant M, specific to each generator, determines the selection of one of the previous elements in the sequence in the recurrence that defines the MTGP. Thus, it has a direct impact on the quality of the random numbers generated and the distribution of the sequence. A candidate set of parameters for an instance of the MTGP can be evaluated via equidistribution theory. This provides a measure called the total dimension defect. The Dynamic Creator for MTGP (MTGPDC) uses this to provide 128 parameter sets for each of the available periods (defined by different Mersenne primes: 2^11213 - 1, 2^23209 - 1, and 2^44497 - 1), for the initialisation of distinct MTGPs on request by applications at runtime. While the selection of N and M allows for N - M elements to be calculated in parallel, there is also the restriction imposed by the number of previous elements available. The MTGP maintains these previous elements in an internal state-space buffer. This is implemented as a circular array, A, so that older elements are over-written once no further terms in the recurrence depend on them. Thus, for a given size |A| of the state-space array, we can only produce |A| - N numbers in parallel. As an example, for the 32-bit (w = 32) MTGP with p = 11213, we have N = 351 and N - M typically greater than or equal to 256. If we choose |A| = 512 then we can only calculate at most |A| - N = 161 numbers in parallel. The MTGP restricts |A| to powers of two to simplify memory access. Consequently, to achieve the full level of parallel computation supported by these parameters, this MTGP requires |A| = 1024 32-bit variables (4 KB of memory). At the time of this generator's development, the CUDA architecture (version 1.3) supported 512 threads per block and a maximum of 16 KB of shared memory per multiprocessor on the device [51].
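The kernel fragment below sketches the pattern just described: each block keeps its state in a circular shared-memory buffer and, on each iteration, its threads cooperatively compute a batch of new elements bounded by min(N - M, |A| - N). The function h() is only a placeholder for the MTGP recurrence, the constants are illustrative rather than MTGP parameters, and the seeding is a crude stand-in; this is a sketch of the data flow, not the MTGP code.

    // Block-parallel generation with a circular shared-memory state buffer,
    // in the style described above. Assumes blockDim.x >= N_STATE - M_LAG.
    #define N_STATE 351          // degree of recurrence N (example value)
    #define M_LAG    87          // middle lag M, so N - M threads can work per step
    #define BUF    1024          // circular buffer size |A|, a power of two

    __device__ unsigned int h(unsigned int a, unsigned int b, unsigned int c) {
        // Stand-in for the MT-style recurrence on (x_{i-N}, x_{i-N+1}, x_{i-N+M}).
        return a ^ (b >> 1) ^ (c << 1);
    }

    __global__ void generate_block_parallel(unsigned int *out, int iterations) {
        __shared__ unsigned int state[BUF];
        int tid = threadIdx.x;                          // one new element per thread
        if (tid == 0)
            for (int k = 0; k < BUF; ++k)               // crude per-block seeding,
                state[k] = k * 2654435769u + blockIdx.x; // for illustration only
        __syncthreads();

        const int batch = min(N_STATE - M_LAG, BUF - N_STATE);
        int pos = 0;                                    // buffer index of x_{i-N}

        for (int it = 0; it < iterations; ++it) {
            if (tid < batch) {
                int i = pos + tid;
                unsigned int v = h(state[i % BUF],
                                   state[(i + 1) % BUF],
                                   state[(i + M_LAG) % BUF]);
                state[(i + N_STATE) % BUF] = v;         // overwrite an unneeded old slot
                out[((size_t)blockIdx.x * iterations + it) * batch + tid] = v;
            }
            __syncthreads();                            // finish the batch before the next
            pos = (pos + batch) % BUF;
        }
    }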
This relatively large state space would ultimately be a limiting factor in achieving the full computational throughput possible on a GPU device. The author of MTGP notes in the most recent release a change in the CUDA sample from the larger periods (2^23209 - 1 and 2^44497 - 1 for the 32-bit and 64-bit versions respectively) to a smaller one (2^11213 - 1 for both versions) in order to improve the utilisation of the device, given the increased memory requirements of the larger periods.
The author of MTGP provides performance results for the 32-bit MTGP on NVIDIA's GeForce GTX 260 GPU. 5 × 10^7 random numbers are generated across 108 blocks of 256 threads. The time claimed for this to complete was measured at 4.6 ms. This translates to a random number generation rate of approximately 1.1 × 10^10 RN/s. No information regarding the quality of the MTGP sequence is reported. However, the MTGP is still a variant of the Mersenne Twister, so it can be assumed that it will suffer, at some level, from the same problems in the linear recurrence tests as described in [32].
With respect to a programming model for the implementation of the MTGP, this PRNG loosely conforms to the general CUDA programming model, which encourages a contiguous divide-and-conquer approach. The problem of producing a sequence of random numbers is divided such that each block computes a sub-sequence. For example, if only two blocks are used, each block computes a contiguous half of the 5 × 10^7 random numbers. Each block is then initialised as a distinct MTGP and can independently do its portion of the work. This avoids the issue of overlapping sub-sequences found in other contiguous sub-sequence approaches. Each thread within the block is responsible for generating one random number per iteration of the algorithm. The threads work synchronously to produce sub-sequences of the work allocated to the block.
4.7 Parallel Lagged Fibonacci Generators
The Lagged Fibonacci Generator (LFG) is a well-documented class of generators. A number of authors have noted the advantages of LFGs over LCGs and have implemented and analysed LFGs on a number of different parallel architectures, including distributed systems [1, 44, 61, 33] and GPU systems [24]. LFGs have been used in a wide variety of applications, including Monte Carlo algorithms, particle filtering, and machine-learning inspired computer vision algorithms. Here we discuss a few different distributed implementations of this generator and one NVIDIA CUDA [51] implementation.
Worthy of note is the leapfrog variant of this generator class. To explain the application of the leapfrog technique to the LFG, let us first recall the recurrence

    x_i = x_{i-r} ◦ x_{i-r+s} mod m,

where

    ◦ is a suitable operator, e.g. the exclusive or operator (XOR, ⊕),
    m is the modulus (usually 2^32 or 2^64), irrelevant in the case ◦ = ⊕,
    r and s are integer parameters, r > s.

In the recurrence, x_{i-r} is the older of the two terms used to define x_i, and as we do not need to store any older terms, this defines the size of the state space. If r and s are chosen suitably, and ◦ is ⊕, the period is 2^r - 1. (If ◦ is a different operator, such as addition or multiplication mod m, then the period may be larger, but the leapfrog technique is not applicable, so we assume here that ◦ is ⊕.)
It is not difficult to show, by consideration of Pascal's triangle or by using properties of the generating function in GF(2)[x], that

    x_i = x_{i-pr} ⊕ x_{i-pr+ps}                                 (4.2)

whenever p is a power of two.
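Relation (4.2) is what makes a leapfrog implementation possible: p streams, here one per thread, each apply the same XOR recurrence to every p-th element of the global sequence. The sketch below is ours; the lags shown (607, 273) correspond to a known primitive trinomial but are not necessarily those used in [24], and the seeding of the p × r initial values is assumed to have been done already.

    // Leapfrog XOR lagged Fibonacci sketch: thread t produces elements
    // t, t+p, t+2p, ... of the global sequence via relation (4.2),
    // x_i = x_{i-pr} XOR x_{i-pr+ps}, valid when p is a power of two.
    #define LAG_R 607      // lag r (one suggested value; [24] also mentions 1279)
    #define LAG_S 273      // lag s < r, chosen so x^r + x^s + 1 is primitive

    __global__ void leapfrog_lfg(unsigned int *state,  // p * LAG_R seeded words;
                                                       // state[t*LAG_R + k] belongs to stream t
                                 unsigned int *out, int count) {
        int p = gridDim.x * blockDim.x;                  // total number of streams
        int t = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's stream index
        unsigned int *x = state + t * LAG_R;             // circular buffer of LAG_R words
        int j = 0;                                       // position of x_{i - p*r}

        for (int n = 0; n < count; ++n) {
            unsigned int v = x[j] ^ x[(j + LAG_S) % LAG_R];  // one XOR per output word
            x[j] = v;                                        // overwrite the oldest element
            out[(size_t)n * p + t] = v;                      // interleaved: global index n*p + t
            j = (j + 1) % LAG_R;
        }
    }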
Janowczyk et al. [24] provide an implementation of this leapfrog LFG on a GPU architecture using NVIDIA CUDA [51]. Their implementation requires the initialisation of p × r random numbers, which can either be loaded from a file or created in parallel in a seeding step. As r also defines the size of the state space, its selection is important. Some suggested values of r for good statistical properties of the generator range into the tens of thousands, which is undesirable on a GPU architecture where memory is limited.
The experiments presented in the paper by Janowczyk et al. were conducted on the NVIDIA GeForce 8800GTS GPU. In this implementation, values of r equal to 607 and 1279 are cited as being used, but the authors found that the larger value decreased the rate of random numbers per second (RN/s) achieved. They compared their results to the version of the Mersenne Twister found in the NVIDIA CUDA SDK. A speed of 1.26 × 10^9 RN/s is claimed. This figure includes the time taken to copy the random numbers to the device memory. In separate tests where the results are not copied, mimicking a situation where a simulation may execute inline with the generation of random numbers, the leapfrog LFG performs at 3.5 × 10^9 RN/s, and the Mersenne Twister at 2.59 × 10^9 RN/s. The difference is explained by the increased computational cost per random number due to the Mersenne Twister's tempering step, whereas the LFG requires only one XOR operation per random number (but may produce numbers of statistically lower quality as a result).
As the sequence produced by this generator is identical to that of the sequential LFG, its quality is exactly the same, and all comments on its quality in comparison to other generators hold, for example those made by L'Ecuyer [32] and Marsaglia [39].
4.8 xorgensGP
In 2010 one of us [Nandapalan] implemented the xorgens generator in NVIDIA CUDA. Extending the xorgens PRNG to be applicable in the GPGPU domain was not a trivial pursuit, as a number of design considerations had to be taken into account in order to create an efficient implementation. This provided some insight into the problem of adapting algorithms to a GPU environment.
We are essentially seeking to exploit some level of parallelism inherent in the flow of data. To realise this, let us first recall the recursion relation describing the xorgens algorithm:

    x_i = x_{i-r}(I + L^a)(I + R^b) + x_{i-s}(I + L^c)(I + R^d)
In this equation, the parameter r represents the degree of recurrence, and consequently the size of the state space (in words, not counting a small constant for the Weyl generator and a circular array index). L and R represent left-shift and right-shift operators, respectively. If we conceptualise this state space as a circular buffer of r elements we can reveal some structure in the flow of data. In a circular buffer, x, of r elements, where x[i] denotes the i-th element, x_i, the indices i and i + r access the same position within the buffer. This means that as each new element x_i in the sequence is calculated from x[i - r] and x[i - s], the result replaces the r-th oldest element in the state space, which is no longer needed for calculating future elements.
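Since multiplication by (I + L^a) corresponds to t ^= t << a, and by (I + R^b) to t ^= t >> b, one step of the recurrence reduces to a handful of shifts and XORs plus a circular-buffer update. The sketch below follows that structure; the parameter values and the simplified Weyl combination on output are placeholders, not the published xorgens parameters, which should be taken from [10].

    #include <cstdint>

    // One xorgens-style step on 64-bit words. The parameters (R, S, A, B, C, D)
    // and the Weyl constant are illustrative placeholders only.
    #define XG_R 128      // degree of recurrence r (state of R words)
    #define XG_S 65       // second lag s, with gcd(R, S) = 1
    #define XG_A 17
    #define XG_B 13
    #define XG_C 15
    #define XG_D 11

    struct XorgensLike {
        uint64_t x[XG_R];   // circular state buffer
        uint64_t weyl;      // Weyl generator state
        int i;              // buffer index of x_{n-r}
    };

    __host__ __device__ inline uint64_t xorgens_like_next(XorgensLike &g) {
        uint64_t t = g.x[g.i];                          // x_{n-r}
        uint64_t v = g.x[(g.i + XG_R - XG_S) % XG_R];   // x_{n-s}
        t ^= t << XG_A;  t ^= t >> XG_B;                // t := x_{n-r} (I + L^a)(I + R^b)
        v ^= v << XG_C;  v ^= v >> XG_D;                // v := x_{n-s} (I + L^c)(I + R^d)
        v ^= t;                                         // '+' over GF(2) is XOR
        g.x[g.i] = v;                                   // new element replaces the oldest
        g.i = (g.i + 1) % XG_R;
        g.weyl += 0x9E3779B97F4A7C15ULL;                // odd Weyl increment (illustrative)
        return v + g.weyl;                              // combine with the Weyl generator
    }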
Now we can begin to consider the parallel computation of a sub-sequence of xorgens. Writing A = (I + L^a)(I + R^b) and B = (I + L^c)(I + R^d), let us examine the dependencies of the data flow within the buffer x as a sequence is being produced:

    x_i         = x_{i-r} A + x_{i-s} B
    x_{i+1}     = x_{i-r+1} A + x_{i-s+1} B
        ...
    x_{i+(r-s)} = x_{i-r+(r-s)} A + x_{i-s+(r-s)} B
                = x_{i-s} A + x_{i+r-2s} B
        ...
    x_{i+s}     = x_{i-r+s} A + x_{i-s+s} B
                = x_{i-r+s} A + x_i B                            (4.3)
If we consider the concurrent computation of the sequence, we observe that the maximum number of terms that can be computed in parallel is min(s, r - s). Here r is fixed by the period required, but we have some freedom in the choice of s. It is best to choose s ≈ r/2 to maximise the inherent parallelism.
However, the constraint GCD(r, s) = 1 implies that the best we can do is s = r/2 ± 1, except in the case r = 2, s = 1. This provides one additional constraint, in the context of xorgensGP versus (serial) xorgens, on the parameter set {r, s, a, b, c, d} defining a generator. These properties provided sufficient motivation for two different implementations of xorgensGP to be tested, each of which exploits the data parallelism with non-overlapping contiguous blocks of threads. The two implementations are based on the independent subsequences technique and a variation of it.
Let us first consider the independent subsequences technique. With the block-of-threads architecture of the CUDA interface and the technique's goal of creating subsequences, it is a logical and natural decision to allocate each subsequence to a block within the grid of blocks. This can be implemented by providing each block with its own local copy of a state space. Each local state space then represents the same distinct generator, but at a different point within its period.
It should be noted that each generator is identical, in that the same parameter set {r, s, a, b, c, d} is used for each. An advantage of this is that the parameters are known at compile time, allowing compile-time optimisations that would not be available if the parameters were dynamically allocated, and thus known only at runtime.
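For example, the shared parameter set can be baked in as compile-time constants and each block given its own copy of the state in shared memory, with min(s, r - s) elements produced per synchronous step as in (4.3). The sketch below illustrates the layout only; the parameter values are placeholders, and the crude per-block seeding stands in for the proper procedure of placing every block at a different point of the same generator's period.

    // Independent subsequences, one per block: identical compile-time parameters
    // everywhere, per-block state in shared memory, min(S, R-S) elements per step.
    // All values below are illustrative, not the actual xorgensGP parameter set.
    #define XG_R     64
    #define XG_S     33                  // gcd(R, S) = 1, close to R/2
    #define XG_A     17
    #define XG_B     13
    #define XG_C     15
    #define XG_D     11
    #define XG_BATCH 31                  // min(XG_S, XG_R - XG_S)
    #define XG_WEYL  0x9E3779B97F4A7C15ULL

    __global__ void xorgensgp_like(unsigned long long *out, int batches) {
        __shared__ unsigned long long x[XG_R];
        if (threadIdx.x == 0) {
            // Crude per-block seeding for illustration only; a real implementation
            // would instead jump ahead within a single generator's period.
            unsigned long long s = 0x2545F4914F6CDD1DULL + blockIdx.x;
            for (int k = 0; k < XG_R; ++k) {
                s ^= s << 13;  s ^= s >> 7;  s ^= s << 17;
                x[k] = s;
            }
        }
        __syncthreads();

        int pos = 0;                                    // buffer index of x_{n-r}
        for (int b = 0; b < batches; ++b) {
            unsigned long long v = 0;
            int k = threadIdx.x;                        // thread k computes x_{n+k}
            if (k < XG_BATCH) {
                unsigned long long t = x[(pos + k) % XG_R];                 // x_{n-r+k}
                unsigned long long u = x[(pos + k + XG_R - XG_S) % XG_R];   // x_{n-s+k}
                t ^= t << XG_A;  t ^= t >> XG_B;
                u ^= u << XG_C;  u ^= u >> XG_D;
                v = t ^ u;
            }
            __syncthreads();                            // all reads done before overwriting
            if (k < XG_BATCH) {
                x[(pos + k) % XG_R] = v;                // x_{n+k} replaces x_{n-r+k}
                unsigned long long n = (unsigned long long)b * XG_BATCH + k;
                out[(size_t)blockIdx.x * batches * XG_BATCH + n] = v + n * XG_WEYL;
            }
            __syncthreads();
            pos = (pos + XG_BATCH) % XG_R;
        }
    }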
Now we outline the implementation using a variation of the independent subsequences technique to make xorgens parallel. Whereas the previous technique uses the same parameter set for each generator and the same generator on each block, this approach achieves independent subsequences by providing each block with a distinct generator, that is, with a unique set of parameters {r, s, a, b, c, d}. As r is fixed, this requires searching through a five-dimensional space of parameters to find those that give period 2^r - 1 and satisfy other constraints designed to give output with high statistical quality. This search is computationally expensive (requiring of the order of minutes on a single CPU), so suitable parameter sets were precomputed.
Each generator can be started with the same seed, since the unique parameters of each generator should be sufficient to produce independent and uncorrelated outputs.
We now present an evaluation of the results obtained in testing the two implementations of xorgensGP. All experiments were performed on a single GPU of the NVIDIA GeForce GTX 295, which is a dual-GPU device.
First, the performance of the two candidate solutions with respect to random number throughput (RN/s) was tested; see Table 4.1.

    Implementation             RN/s
    Independent Subsequence    6.3 × 10^9
    Variation                  2.6 × 10^8

    Table 4.1: Comparison of implementations of xorgensGP.

Second, a report was produced on the memory resource allocation of each implementation's kernel functions, known at the time of code compilation. This report is optionally produced by the CUDA compiler. The report for each kernel includes the number of registers required per thread, the total statically allocated shared memory consumed by the user and compiler (dynamically allocated shared memory can only be calculated at runtime), and the utilisation of the constant memory banks. The reports for the two implementations are presented in Table 4.2. It can be seen that the standard independent subsequences implementation is more memory efficient, by a significant number of registers per thread. It should be noted that this does not include the dynamically allocated shared memory of size r for the state space, which is the same for both generators. In the case of 32-bit words and n set to 4096, this gives the state space a size of 128 words.

    Implementation             Registers    Shared Memory       Constant Memory
    Independent Subsequence    14           32+16 bytes [1]     8 bytes
    Variation                  21           32+16 bytes [14]    4 bytes

    Table 4.2: Comparison of memory resource allocation, known at compilation time, between the two candidate implementations of xorgensGP.
The period of this generator is greater than 2^4096 - 1 due to the combination with the Weyl generator. The xorshift-based component guarantees a period of at least 2^4096 - 1. The period of the Weyl generator is either 2^32 or 2^64, depending on the use of 32-bit or 64-bit precision respectively. Thus the overall period is 2^32 (2^4096 - 1) or 2^64 (2^4096 - 1).
The two implementations were each subjected to the BigCrush battery for empirical statistical PRNG testing, as offered by the TestU01 software library [32]. Both implementations pass all tests within the battery, so it can be concluded that, by accepted benchmarks, there is no appreciable difference in the quality of the two sequences. The difference in performance between the two implementations can be attributed partly to the compiler optimisations that can be made when the parameters are known at compile time. Another factor is that the code is most efficient when multiples of 32 threads are run for each block; since s must be odd, the best we can achieve is usually 31 threads per block.
A preliminary test of the MTGP was conducted on the same device as the xorgensGP implementations for comparison. The throughput of the MTGP was found to be about 1.2 × 10^9 RN/s. This suggests that the xorgensGP implementation with independent subsequences is approximately four times faster than the MTGP, at least on this and similar devices. (Note that the figures given in Table 4.3 use different platforms, so are not comparable.)
4.9 Other Parallel PRNG
We now present a survey of some other recent parallel PRNGs. These implementations do not offer any further insight into the design of parallel PRNGs beyond the examples described so far. We present a summary of their performance and any interesting features.
4.9.1 CUDA SDK Mersenne Twister
The NVIDIA CUDA SDK provides a number of examples to assist developers. One of these examples is an implementation of a Mersenne Twister and is described in [59]. This is not to be confused with the MTGP described in Section 4.6. The approach taken by this generator for the GPU architecture is similar to the variant of the independent subsequences method attempted for xorgensGP. This implementation takes advantage of the dynamic creator for the original Mersenne Twister (DCMT) to produce many distinct Mersenne Twisters that execute in parallel. This is required because the Mersenne Twister, unlike the xorgens generator, cannot avoid correlations between subsequences even for "very different" (by any definition) seeds or initial values.
The CUDA SDK Mersenne Twister implementation makes use of the unique thread identifier as an input to the dynamic Mersenne Twister generator, allowing each thread to update a Mersenne Twister independently. To reiterate, every thread is associated with its own state space, and so there are as many generators as threads running on the device. Consequently, each Mersenne Twister in this implementation is comparatively small, having a period of just 2^607 - 1, which is quite small in comparison to the period of the MTGP. This generator has been shown to be four times slower than the MTGP.
4.9.2 CURAND
The CUDA CURAND Library is NVIDIA's parallel PRNG framework and library. It is documented in [52]. The default generator for this library is based on the XORWOW algorithm introduced by Marsaglia in [41]. It is an example of a xorshift-class generator. No performance data appears to be available for this particular implementation.
4.9.3 Park-Miller on GPU
The Park-Miller generator, also known as a Lehmer generator, can be viewed as a special case of the LCG class of generators. Langdon presents and releases code for a GPU implementation of this generator in CUDA in [26] and revises it in [27] with double precision and a pre-production model of an NVIDIA Tesla T10P unit. A peak random number throughput of 3.5 × 10^9 RN/s is claimed at double precision for this device.
4.9.4 Parallel LCG
A parallel implementation of a 64-bit LCG is provided in the Scalable Library for Pseudorandom Number Generation (SPRNG) presented in [44]. Performance information for this parallel LCG is made available by Tan in [61] and was obtained on a cluster of twenty DEC Alpha 21164 processors. The throughput of this generator measured by Tan was 12.8 × 10^9 RN/s. This LCG was made parallel by a variant of the independent subsequences technique, using unique parameters to create distinct generators and subsequences.
4.9.5 Parallel MRG
The SPRNG library also provides a parallel implementation of a Combined Multiple Recursive Generator (MRG). Similarly, performance information is available from Tan in [61]. The throughput of this generator was measured at 4.2 × 10^9 RN/s. The same variant of the independent subsequences technique was used for this implementation.
4.9.6 Other Parallel Lagged Fibonacci Generators
The LFG has been implemented in parallel a number of times; see Aluru [1], Mascagni [44], and Tan et al. [61]. Aluru and Mascagni do not present performance information for any specific implementation, but performance information for Mascagni's implementation is available in Tan's paper.
Tan presents a portable implementation, called PLFG, using MPI in C. Tan's implementation uses the independent sequences technique, independently seeding multiple generators using the output of the MT19937 Mersenne Twister. The period of the implementation is at least 2^29 (2^23209 - 1), and the authors validate the statistical quality of their implementation by applying the generator to a Monte Carlo based simulation and confirming the results using a different parallel PRNG (SPRNG version 1.0 [44]). They provide performance results by measuring the time taken to produce 10^6 random numbers. This gives a random number throughput of 4.0 × 10^9 RN/s. This result was achieved on a cluster of twenty DEC Alpha 21164 processors.
Two of the generators proposed by Mascagni that are compared against Tan's PLFG are LFG variants: a Multiplicative LFG and an Additive LFG. The throughputs of these generators on the DEC Alpha cluster, as measured by Tan, were 5.4 × 10^9 RN/s and 3.8 × 10^9 RN/s respectively. Mascagni's generators are parallelised by a variant of the independent subsequences technique, using unique parameters to create distinct generators and subsequences.
4.9.7 Additional Notes
All of the generators offered in the SPRNG library were re-implemented by Lee et al. [33]. Their implementation targeted reconfigurable computing platforms, including Field Programmable Gate Arrays (FPGAs). Specifically, they provide performance information for the Cray XD1. This platform can provide some level of parallel acceleration which is conceptually similar to GPUs. They claim an average speed increase by a factor of 1.7 across all the generators implemented in the SPRNG library.
4.10 Summary
Table 4.3 presents a summary of most of the performance information men-
tioned in this chapter.
    Generator                  RN/s (× 10^9)   Period        State Space   Platform
    Wallace (normal)           5.2             2^80          2048          G80 based GPU
    Hybrid Comb. Tausworthe    4.3             2^121         4             G80 based GPU
    MTGP                       10.1            2^11213       1024          GeForce GTX 260
                               1.2                                         GeForce GTX 295
    Leapfrog                   3.5             2^607         607           GeForce 8800GTS
    CUDA SDK MT                2.59            2^607         19            GeForce 8800GTS
    xorgensGP                  6.5             2^(4096+w)    4096/w        GeForce GTX 295
    Park-MillerGP              3.5                                         Tesla T10P
    SPRNG LCG                  12.8                                        DEC Alpha 21164
    SPRNG Combined MRG         4.2                                         DEC Alpha 21164
    PLFG                       4.0             2^23238                     DEC Alpha 21164
    SPRNG MLFG                 5.4                                         DEC Alpha 21164
    SPRNG ALFG                 3.8                                         DEC Alpha 21164

    Table 4.3: Summary of implementations surveyed. Periods and state space sizes (in words) are approximate (and in some cases surmised). The wordsize w = 32 or 64 for xorgensGP.

NB: The throughputs of the generators should not be directly compared, in view of the different platforms on which each generator was implemented.
Bibliography
[1] S. Aluru. Lagged Fibonacci random number generators for distributed memory parallel computers. Journal of Parallel and Distributed Computing, 45(1):1–12, 1997.
[2] E. Barker and J. Kelsey. Recommendation for Random Number Generation Using Deterministic Random Bit Generators (Revised). NIST Special Publication 800-90, 2007.
[3] R. P. Brent. Algorithm 488: A Gaussian pseudo-random number generator. Communications of the ACM, 17:704–706, 1974.
[4] R. P. Brent. Uniform random number generators for supercomputers. In Proceedings Fifth Australian Supercomputer Conference, Melbourne, December 1992, pages 95–104, 1992.
[5] R. P. Brent. Fast normal random number generators on vector processors. Technical Report TR-CS-93-04, ANU, 1993. arXiv:1004.3105v2.
[6] R. P. Brent. A fast vectorised implementation of Wallace's normal random number generator. Technical Report TR-CS-97-07, ANU, 1997. arXiv:1004.3114v1.
[7] R. P. Brent. Random number generation and simulation on vector and parallel computers. Lecture Notes in Computer Science, 1470:1–20, 1998.
[8] R. P. Brent. Note on Marsaglia's xorshift random number generators. Journal of Statistical Software, 11(5):1–4, 2004.
[9] R. P. Brent. Some long-period random number generators using shifts and xors. ANZIAM Journal, 48 (CTAC2006):C188–C202, 2007.
[10] R. P. Brent. xorgens version 3.05, 2008. http://maths.anu.edu.au/brent/random.html.
[11] R. P. Brent. Some comments on C. S. Wallace's random number generators. Computer Journal, 51(5):579–584, 2008.
[12] R. P. Brent. The myth of equidistribution, 2010. arXiv:1005.1320v1.
[13] R. P. Brent and P. Zimmermann. The great trinomial hunt. Notices of the American Mathematical Society, 58(2):233–239, 2011.
[14] N. Brookwood. AMD Fusion family of APUs: Enabling a superior, immersive PC experience. Insight 64, 1:1–8, March 2010.
[15] P. D. Coddington. Random number generators for parallel computers. The NHSE Review, 2, 1997.
[16] Advanced Micro Devices. http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx.
[17] G. S. Fishman and L. R. Moore. An exhaustive analysis of multiplicative congruential random number generators with modulus 2^31 - 1. SIAM J. Scientific and Statistical Computing, 7:24–45, 1986.
[18] H. Haramoto, M. Matsumoto, and P. L'Ecuyer. A fast jump ahead algorithm for linear recurrences in a polynomial space. In Proceedings SETA 2008, pages 290–298, 2008.
[19] H. Haramoto, M. Matsumoto, T. Nishimura, F. Panneton, and P. L'Ecuyer. Efficient jump ahead for F_2-linear random number generators. INFORMS J. on Computing, 20(3):385–390, 2008.
[20] L. Howes and D. Thomas. Efficient random number generation and application using CUDA. GPU Gems, 3:805–830, 2007.
[21] Intel Corp. Intel advanced vector extensions programming reference. http://software.intel.com/file/33301, 2010.
[22] Intel Corp. Intel microarchitecture codename Sandy Bridge. http://www.intel.com/technology/architecture-silicon/2ndgen/index.htm, 2011.
[23] F. James. A review of pseudorandom number generators. Computer Physics Communications, 60:329–334, 1990.
[24] A. Janowczyk, S. Chandran, and S. Aluru. Fast, processor-cardinality agnostic PRNG with a tracking application. In Computer Vision, Graphics and Image Processing, 2008: Proc. Sixth Indian Conference, ICVGIP'08, pages 171–178. IEEE, 2009.
[25] D. E. Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley, third edition, 1997.
[26] W. B. Langdon. PRNG Random Numbers on GPU. Technical Report CES-477, University of Essex, UK, 2007.
[27] W. B. Langdon. A fast high quality pseudo random number generator for nVidia CUDA. In Proceedings of GECCO'09, pages 2511–2514, 2009.
[28] P. L'Ecuyer. Efficient and portable combined random number generators. Communications of the ACM, 31(6):742–751, 1988.
[29] P. L'Ecuyer. Maximally equidistributed combined Tausworthe generators. Mathematics of Computation, 65(213):203–213, 1996.
[30] P. L'Ecuyer and F. Panneton. Construction of equidistributed generators based on linear recurrences mod 2. In K.-T. Fang et al., editors, Monte Carlo and Quasi-Monte Carlo Methods 2000, pages 318–330. Springer-Verlag, 2002.
[31] P. L'Ecuyer and R. Simard. TestU01: A software library in ANSI C for empirical testing of random number generators. Technical report, Departement d'Informatique et de Recherche Operationnelle, Universite de Montreal, 17 August 2009, 219 pp.
[32] P. L'Ecuyer and R. Simard. TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software (TOMS), 33(4), 2007.
[33] J. Lee, Y. Bi, G. D. Peterson, R. J. Hinde, and R. J. Harrison. HASPRNG: Hardware accelerated scalable parallel random number generators. Computer Physics Communications, 180(12):2574–2581, 2009.
[34] P. Leopardi. Testing the tests: using random number generators to improve empirical tests. Monte Carlo and Quasi-Monte Carlo Methods 2008, pages 501–512, 2009.
[35] J. Makino and O. Miyamura. Generation of shift register random numbers on vector processors. Computer Physics Communications, 64:363–368, 1991.
[36] G. Marsaglia. http://groups.google.com/group/comp.lang.c/msg/e3c4ea1169e463ae, 14 May 2003.
[37] G. Marsaglia. http://groups.google.com/group/sci.crypt/msg/12152f657a3bb219, 14 July 2005.
[38] G. Marsaglia. Random numbers fall mainly on the planes. Proc. National Academy of Science USA, 61(1):25–28, 1968.
[39] G. Marsaglia. A current view of random number generators. In L. Billard, editor, Computer Science and Statistics: The Interface, pages 3–10. Elsevier Science Publishers B. V. (North-Holland), 1985.
[40] G. Marsaglia. DIEHARD: a battery of tests of randomness. http://stat.fsu.edu/geo/diehard.html, 1996.
[41] G. Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14):1–6, 2003.
[42] G. Marsaglia and A. Zaman. A new class of random number generators. Annals of Applied Probability, 1(3):462–480, 1991.
[43] M. Mascagni, S. A. Cuccaro, D. V. Pryor, and M. L. Robinson. Recent developments in parallel pseudorandom number generation. In Proceedings of PPSC 1993, pages 524–529, 1993.
[44] M. Mascagni and A. Srinivasan. Algorithm 806: SPRNG: a scalable library for pseudorandom number generation. ACM Transactions on Mathematical Software (TOMS), 26(3):436–461, 2000.
[45] M. Matsumoto and Y. Kurita. Strong deviations from randomness in m-sequences based on trinomials. ACM Trans. on Modeling and Computer Simulation, 6(2):99–106, 1996.
[46] M. Matsumoto and T. Nishimura. Mersenne Twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):3–30, 1998.
[47] M. Matsumoto, T. Nishimura, M. Hagita, and M. Saito. Mersenne Twister and Fubuki stream/block cipher, 2005. http://eprint.iacr.org/2005/165.pdf.
[48] M. Matsumoto, M. Saito, T. Nishimura, and M. Hagita. CryptMT stream cipher ver. 3: description. Lecture Notes in Computer Science, 4986:7–19, 2008.
[49] M. Matsumoto, I. Wada, A. Kuramoto, and H. Ashihara. Common defects in initialization of pseudorandom number generators. ACM Trans. on Modeling and Computer Simulation, 17(4), Article 15, 2007.
[50] A. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996 (fifth printing 2001).
[51] NVIDIA. Compute Unified Device Architecture (CUDA) Programming Guide. NVIDIA Corp., Santa Clara, CA, 2010.
[52] NVIDIA. CUDA CURAND Library. NVIDIA Corp., Santa Clara, CA, 2010.
[53] NVIDIA Corp. Tegra. http://www.nvidia.com/object/tegra.html, 2011.
[54] F. Panneton and P. L'Ecuyer. On the xorshift random number generators. ACM Transactions on Modeling and Computer Simulation, 15(4):346–361, 2005.
[55] F. Panneton, P. L'Ecuyer, and M. Matsumoto. Improved long-period generators based on linear recurrences modulo 2. ACM Transactions on Mathematical Software, 32(1):1–16, 2006.
[56] S. K. Park and K. W. Miller. Random number generators: good ones are hard to find. Communications of the ACM, 31:1192–1201, 1988.
[57] W. P. Petersen. Some vectorized random number generators for uniform, normal, and Poisson distributions for CRAY X-MP. Journal of Supercomputing, 1:327–335, 1988.
[58] M. Pharr and R. Fernando. GPU Gems 2. Addison-Wesley, 2005.
[59] V. Podlozhnyuk. Parallel Mersenne Twister. NVIDIA white paper, 2007.
[60] M. Saito. A variant of Mersenne Twister suitable for graphic processors, 2010. arXiv:1005.4973.
[61] C. Tan and J. R. Blais. PLFG: A highly scalable parallel pseudo-random number generator for Monte Carlo simulations. In M. Bubak, H. Afsarmanesh, B. Hertzberger, and R. Williams, editors, High Performance Computing and Networking, volume 1823 of Lecture Notes in Computer Science, pages 127–135. Springer, Berlin, Heidelberg, 2000.
[62] S. Tezuka and P. L'Ecuyer. Efficient and portable combined Tausworthe random number generators. ACM Transactions on Modeling and Computer Simulation, 1(2):99–112, 1991.
[63] D. B. Thomas, W. Luk, P. H. W. Leong, and J. D. Villasenor. Gaussian random number generators. ACM Computing Surveys, 39(4):11:1–36, 2007.
[64] D. B. Thomas, L. Howes, and W. Luk. A comparison of CPUs, GPUs, FPGAs, and massively parallel processor arrays for random number generation. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 63–72. ACM, 2009.
[65] J. von Neumann. Various techniques used in connection with random digits. In The Monte Carlo Method, pages 36–38, 1951. Reprinted in John von Neumann Collected Works, volume 5, 768–770.