All content following this page was uploaded by Ivonne Leonor Lino on 08 December 2015.
G14SCD
MSc Dissertation in
Scientific Computation
2014/15
I have read and understood the School and University guidelines on plagiarism. I confirm
that this work is my own, apart from the acknowledged references.
Contents
Abstract 3
Acknowledgements 3
Dedication 3
1 Introduction 4
5 Results 32
5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Numerical results of the serial curve . . . . . . . . . . . . . . . . . . . . . . 32
5.2.1 Coefficients in serial and in parallel . . . . . . . . . . . . . . . . . . 33
5.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4 Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Discussion of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6 Conclusions 42
B CUDA Code 46
C Matlab Code 61
D Error Code 69
References 73
Abstract
This work provides an analysis of the performance of the Discontinuous Galerkin Finite
Element Method (DGFEM) for a 1D elliptic problem in parallel using GPU technology.
DGFEMs were chosen as the numerical method due to their stability and robustness, and
mainly because the method is very well suited to parallelization thanks to its large degree
of locality and its intensive arithmetic operations. It is well known that the time needed to
run a DGFEM code and obtain results increases as more elements are used. The aim of this
work is to develop a CUDA code for an elliptic problem in 1D using DGFEMs to reduce
this time, and to compare its performance with its serial counterpart. At the end of the
work some benchmarks are presented which show that the parallel code is much faster
than the serial code.
Acknowledgements
I would like to thank Professor Paul Houston, who supported me from the very beginning,
when I started the MSc. He has been the most patient lecturer and always very
approachable. It has been an honor and a great pleasure being his pupil. I appreciate and
thank him deeply for all his help. I also want to thank my dear family and friends who
have supported me and helped me to cope with all the emotional and economic difficulties
during the whole process. Finally, I would like to thank my country and the Mexican
government (through CONACYT), who trusted me and gave me the opportunity to study
this Master's degree to have a better future in my career and life.
Dedication
To my husband Benjamin, my son Eduardo Emanuel and my mom Maria del Carmen.
1 Introduction
Discontinuous Galerkin Finite Element Methods (DGFEMs) were introduced for the nu-
merical solution of first-order hyperbolic problems in 1973 [23]. In more recent times
this method has enjoyed considerable success due to its robustness, stability and local
conservativeness. Also because it is easy to implement and is highly parallelizable [27],
[17], [19].
Depending on the input parameters, we obtain several variations of DGFEMs: the
Symmetric Interior Penalty Galerkin (SIPG) method, introduced in the late 1970s [3], [31];
the Global Element Method [10]; the Nonsymmetric Interior Penalty Galerkin (NIPG)
method [25], [22]; the Incomplete Interior Penalty Galerkin (IIPG) method [9]; and the case
of elliptic problems written as a first-order system [5]. DGFEMs for elliptic equations were
proposed, with variants, by [12], [4], [31] and [2], and have been called interior penalty
(IP) methods since that time. Examples of the method applied to elliptic and parabolic
problems can also be seen in [4], [31] and [2]. Works on the pure elliptic problem can be
seen in [8], [6] and [7]. For this work we will use the symmetric interior penalty (SIP)
method, which will be described in detail in the next section.
The main aim of this work is to study the performance of a code which solves an
elliptic problem with a Discontinuous Galerkin method. The comparison was made between
a serial code presented in [24] and its counterpart in parallel. CUDA and its libraries
were used for the parallel implementation. The parallel code was developed by the author
of this work, inspired by the serial code of [24], except for the Conjugate Gradient (CG)
solver, which can be found in the Nvidia CUDA Toolkit 7.5 [1].
In Section 2, the Discontinuous Galerkin Finite Element Method for an elliptic problem
is presented in general for a multidimensional problem, with the model taken from [3].
In Section 3 the particular problem of DGFEMs for the elliptic problem in 1D is studied
in detail, taking the discretization from [24]. In Section 4, the parallel programming
paradigm is studied and explained through the particular model of the CUDA language,
which was developed for Nvidia GPU hardware. In Section 5, results of the serial and
parallel codes are shown, as well as some benchmarking between them using different
input parameters. Finally, in Appendix A a tutorial for installing the CUDA technology
is described in detail, in Appendix B the parallel CUDA code is presented, and in
Appendix C we present the serial code with some modifications from the original.
One of the most important achievements of this work was the implementation of the
model in parallel using GPU technology and the verification that this code performs more
efficiently than its counterpart in serial.
Discontinuous Galerkin Finite Element Methods (DGFEMs) combine features of the
finite volume method and the finite element method, and can be applied to solve hyperbolic,
elliptic and parabolic problems.
2.2 Flux formulation
As is usually done for the finite element method [16], we multiply (2.2) by τ and v,
respectively, and integrate over a subset K of Ω. Then we have [3]

\int_K \phi \cdot \tau \, dx = -\int_K u \, \nabla \cdot \tau \, dx + \int_{\partial K} u \, n_K \cdot \tau \, ds, \qquad (2.3)

\int_K \phi \cdot \nabla v \, dx = \int_K f v \, dx + \int_{\partial K} \phi \cdot n_K \, v \, ds, \qquad (2.4)

where the space of polynomial functions is given by P(K) = P_p(K), of degree at most
p ≥ 1 on K, and Σ(K) = [P_p(K)]^n. Let us consider the general formulation given by [8]:

\int_K \phi_h \cdot \tau \, dx = -\int_K u_h \, \nabla \cdot \tau \, dx + \int_{\partial K} \bar{u}_K \, n_K \cdot \tau \, ds, \qquad (2.5)

\int_K \phi_h \cdot \nabla v \, dx = \int_K f v \, dx + \int_{\partial K} \bar{\phi}_K \cdot n_K \, v \, ds. \qquad (2.6)
2.3 Derivation of the primal formulation
Now, following [3], we denote by H^l(T_h) the space of functions on Ω whose restriction
to each element K belongs to the Sobolev space H^l(K).
Let Γ denote the union of the boundaries of the elements K of T_h. Functions in
T(\Gamma) := \prod_{K \in T_h} L^2(\partial K) are evaluated twice on Γ_0 := Γ \setminus ∂Ω and once on ∂Ω.
Therefore, L^2(Γ) is identified with the subspace of T(Γ) consisting of the functions
whose values coincide on all internal edges.
The scalar numerical flux \bar{u} = (\bar{u}_K)_{K \in T_h} and the vector numerical flux
\bar{\phi} = (\bar{\phi}_K)_{K \in T_h} are both linear functions of the discrete solution.
Let e be an interior edge shared by elements K_1 and K_2, and let n_1 and n_2 be the unit
normal vectors on e pointing exterior to K_1 and K_2, respectively. We define the average
{v} as [2]

\{v\} = \frac{1}{2}(v_1 + v_2), \qquad (2.8)

and the jump [v] as

[v] = v_1 n_1 + v_2 n_2. \qquad (2.9)

For vector-valued functions ν the average is defined analogously:

\{\nu\} = \frac{1}{2}(\nu_1 + \nu_2). \qquad (2.10)
2.4 Primal formulation
Again following [3], summing equations 2.5 and 2.6 over all elements, using the average
and jump operators, and applying integration by parts together with suitable identities,
we obtain the following result (for details see [3]):
B_h(u_h, v) = \int_\Omega f v \, dx \qquad \forall v \in V_h, \qquad (2.12)

where

B_h(u_h, v) := \int_\Omega \nabla_h u_h \cdot \nabla_h v \, dx
+ \int_\Gamma \big( [\bar{u} - u_h] \cdot \{\nabla_h v\} - \{\bar{\phi}\} \cdot [v] \big) \, ds
+ \int_{\Gamma_0} \big( \{\bar{u} - u_h\} [\nabla_h v] - [\bar{\phi}] \{v\} \big) \, ds
\qquad \forall u_h, v \in H^2(T_h). \qquad (2.13)

Here q ∈ T(Γ) and ϕ ∈ [T(Γ)]^2, and we have the lifting operators
r : [L^2(\Gamma)]^2 \to \Sigma_h and l : L^2(\Gamma_0) \to \Sigma_h given by

\int_\Omega r(\varphi) \cdot \tau \, dx = -\int_\Gamma \varphi \cdot \{\tau\} \, ds, \qquad (2.14)

\int_\Omega l(q) \cdot \tau \, dx = -\int_{\Gamma_0} q \, [\tau] \, ds \qquad \forall \tau \in \Sigma_h. \qquad (2.15)
The form B_h is bilinear, and the proof of (2.13) can be seen in [3]. Equation 2.12 is
called the "primal formulation" and 2.13 the "primal form"; recalling the definition of
φ_h given in Section 2.2, the form can be written entirely in terms of u_h.
All these general concepts can be applied to the one dimensional case as we will see
in the next section.
3 The elliptic problem in 1D
In this section we will use and apply the concepts of the "primal formulation" defined
and deduced in Section 2 for the one-dimensional case, using the deduction in [24].
In this part of the text we aim to deduce the discretization of the elliptic problem in 1D
and obtain the linear algebra problem, similarly to what we did in general in Section 2.1.
This deduction can be seen in more detail in [24].
Let us start with some definitions [16].
Now, we consider the boundary value problem in the interval (0, 1) [24]:

-(a(x) u'(x))' = g(x), \qquad x \in (0, 1), \qquad (3.1)

u(0) = 1, \qquad (3.2)

u(1) = 0, \qquad (3.3)

where a ∈ C^1(0, 1) and g ∈ C^0(0, 1), using Definition 3. It is assumed that there
are two constants a_0 and a_1 such that

0 < a_0 \le a(x) \le a_1.
u is said to be the classical solution of 3.1-3.3 if u ∈ C^2(0, 1) and u satisfies
equations 3.1-3.3 pointwise [24].
Let E_h be a partition of (0, 1) given by 0 = x_0 < x_1 < \ldots < x_N = 1 and let
I_n = (x_n, x_{n+1}), n = 0, 1, \ldots, N-1. With this definition we have

h_n = x_{n+1} - x_n,
h_{n-1,n} = \max(h_{n-1}, h_n),
h = \max_{0 \le n \le N-1} h_n.
The one-sided limits of v at a node x_n are

v(x_n^+) = \lim_{\epsilon \to 0,\, \epsilon > 0} v(x_n + \epsilon), \qquad (3.6)

v(x_n^-) = \lim_{\epsilon \to 0,\, \epsilon > 0} v(x_n - \epsilon), \qquad (3.7)
and just as we did in Section 2.3.1 we define the jump and average of v at the endpoints
of I_n [24]:

[v(x_n)] = v(x_n^-) - v(x_n^+),

\{v(x_n)\} = \frac{1}{2}\big( v(x_n^-) + v(x_n^+) \big), \qquad \forall n = 1, \ldots, N-1.
This definition is extended to the end points as:

[v(x_0)] = v(x_0^+), \qquad \{v(x_0)\} = v(x_0^+),

[v(x_N)] = v(x_N^-), \qquad \{v(x_N)\} = v(x_N^-).
Given the definitions above, the penalty term is defined as [24]:

J_0(v, w) = \sum_{n=0}^{N} \frac{\alpha_0}{h_{n-1,n}} [v(x_n)][w(x_n)], \qquad (3.8)

where α_0 and α_1 are real nonnegative numbers.
At this point we apply the same process as in the finite element method to 3.1,
multiplying by v ∈ D_k(E_h) and integrating over each interval I_n:

\int_{x_n}^{x_{n+1}} a(x) u'(x) v'(x) \, dx - a(x_{n+1}) u'(x_{n+1}) v(x_{n+1}^-) + a(x_n) u'(x_n) v(x_n^+)
= \int_{x_n}^{x_{n+1}} f(x) v(x) \, dx, \qquad n = 0, \ldots, N-1. \qquad (3.9)

Using the identity

[a(x_n) u'(x_n) v(x_n)] = \{a(x_n) u'(x_n)\}[v(x_n)] + \{v(x_n)\}[a(x_n) u'(x_n)], \qquad (3.10)

we then have

\sum_{n=0}^{N-1} \int_{x_n}^{x_{n+1}} a(x) u'(x) v'(x) \, dx - \sum_{n=0}^{N} \{a(x_n) u'(x_n)\}[v(x_n)]
+ \epsilon \sum_{n=0}^{N} \{a(x_n) v'(x_n)\}[u(x_n)] = \int_0^1 f(x) v(x) \, dx. \qquad (3.11)
This work will be restricted to the case ε = −1, for which the DGFEMs form is
symmetric, i.e.

a_{-1}(v, w) = a_{-1}(w, v) \qquad \forall v, w, \qquad (3.12)

with

a_{-1}(v, v) = \sum_{n=0}^{N-1} \int_{x_n}^{x_{n+1}} a(x) (v'(x))^2 \, dx
- 2 \sum_{n=0}^{N} \{a(x_n) v'(x_n)\}[v(x_n)] + J_0(v, v), \qquad (3.13)

and the linear form

L(v) = \int_0^1 g(x) v(x) \, dx - a(x_0) v'(x_0) + \frac{\alpha_0}{h_{0,1}} v(x_0). \qquad (3.15)
Piecewise quadratic polynomials are used as the monomial basis functions, given by:

\phi_{n1}(x) = 2\, \frac{x - x_{n+1/2}}{x_{n+1} - x_n}, \qquad (3.17)

\phi_{n2}(x) = 4\, \frac{(x - x_{n+1/2})^2}{(x_{n+1} - x_n)^2}, \qquad (3.18)

with the midpoint of the interval given by x_{n+1/2} = \frac{1}{2}(x_n + x_{n+1}). On a uniform
mesh we also have

x_n = x_0 + nh, \qquad h = \frac{1}{N}. \qquad (3.19)
On this uniform mesh the basis functions and their derivatives reduce to

\phi_{n1}(x) = \frac{2}{h}\big( x - (n + 1/2)h \big), \qquad (3.21)

\phi_{n2}(x) = \frac{4}{h^2}\big( x - (n + 1/2)h \big)^2, \qquad (3.22)

\phi_{n0}'(x) = 0, \qquad (3.23)

\phi_{n1}'(x) = \frac{2}{h}, \qquad (3.24)

\phi_{n2}'(x) = \frac{8}{h^2}\big( x - (n + 1/2)h \big). \qquad (3.25)
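These basis functions and their derivatives translate directly into code. A minimal C sketch (the function names are illustrative, not taken from the dissertation's codes):

```c
#include <math.h>

/* Monomial basis phi_{n,j} on element I_n = (n*h, (n+1)*h), j = 0,1,2,
   per equations (3.21)-(3.22); phi_{n,0} is the constant 1. */
double phi(int n, int j, double x, double h) {
    double t = x - (n + 0.5) * h;   /* distance to the element midpoint */
    if (j == 0) return 1.0;
    if (j == 1) return 2.0 * t / h;
    return 4.0 * t * t / (h * h);
}

/* Derivatives, per equations (3.23)-(3.25). */
double dphi(int n, int j, double x, double h) {
    double t = x - (n + 0.5) * h;
    if (j == 0) return 0.0;
    if (j == 1) return 2.0 / h;
    return 8.0 * t / (h * h);
}
```

At the element endpoints phi_{n,1} takes the values ∓1 and phi_{n,2} the value 1, which is what makes the local matrices below so simple.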
Then the DGFEMs solution is given by

u_h(x) = \sum_{m=0}^{N-1} \sum_{j=0}^{2} U_{jm} \phi_{jm}(x), \qquad (3.26)

which leads to the linear system

A U = b, \qquad (3.27)

where b is the vector with components L(\phi_{in}) and A is the matrix with entries
a_{-1}(\phi_{jm}, \phi_{in}). The local volume contribution has entries

(A_n)_{ij} = \int_{I_n} \phi_{ni}'(x)\, \phi_{nj}'(x) \, dx. \qquad (3.28)
Computing the corresponding coefficients of A_n we get

A_n = \frac{1}{h} \begin{pmatrix} 0 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 16/3 \end{pmatrix}.
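These entries can be checked numerically with a two-point Gauss rule, which is exact here since the integrands are at most quadratic. A C sketch (helper names are ours):

```c
#include <math.h>

/* Derivatives of the monomial basis on an element of width h, written in
   the local coordinate s in [-h/2, h/2] (equations (3.23)-(3.25)). */
static double dphi_local(int j, double s, double h) {
    if (j == 0) return 0.0;
    if (j == 1) return 2.0 / h;
    return 8.0 * s / (h * h);
}

/* Entry (A_n)_{ij} = integral over I_n of phi'_i phi'_j, computed with
   the 2-point Gauss rule (nodes ±1/sqrt(3), weights 1 on [-1,1]). */
double local_stiffness(int i, int j, double h) {
    const double g = 1.0 / sqrt(3.0);
    const double pts[2] = { -g, g };
    double sum = 0.0;
    for (int q = 0; q < 2; q++) {
        double s = 0.5 * h * pts[q];      /* map node to [-h/2, h/2]  */
        sum += dphi_local(i, s, h) * dphi_local(j, s, h) * (0.5 * h);
    }
    return sum;
}
```

For any h, `local_stiffness(1, 1, h)` returns 4/h and `local_stiffness(2, 2, h)` returns 16/(3h), while all other entries vanish, in agreement with the matrix above.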
We can also compute the contributions at the interior nodes x_n. Expanding the jump
and average terms we get [24]

-\{(P^{DG})'(x_n)\}[v(x_n)] + \epsilon \{v'(x_n)\}[P^{DG}(x_n)] + \frac{\alpha_0}{h}[P^{DG}(x_n)][v(x_n)]
= \tfrac{1}{2}(P^{DG})'(x_n^+) v(x_n^+) - \tfrac{\epsilon}{2} v'(x_n^+) P^{DG}(x_n^+) + \tfrac{\alpha_0}{h} P^{DG}(x_n^+) v(x_n^+)
- \tfrac{1}{2}(P^{DG})'(x_n^-) v(x_n^-) + \tfrac{\epsilon}{2} v'(x_n^-) P^{DG}(x_n^-) + \tfrac{\alpha_0}{h} P^{DG}(x_n^-) v(x_n^-)
- \tfrac{1}{2}(P^{DG})'(x_n^+) v(x_n^-) - \tfrac{\epsilon}{2} v'(x_n^-) P^{DG}(x_n^+) - \tfrac{\alpha_0}{h} P^{DG}(x_n^+) v(x_n^-)
+ \tfrac{1}{2}(P^{DG})'(x_n^-) v(x_n^+) + \tfrac{\epsilon}{2} v'(x_n^+) P^{DG}(x_n^-) - \tfrac{\alpha_0}{h} P^{DG}(x_n^-) v(x_n^+).
Using definitions 3.16 to 3.25 it is possible to compute the local matrices for the interior
nodes, given by:

B_n = \frac{1}{h} \begin{pmatrix}
\alpha_0 & 1-\alpha_0 & -2+\alpha_0 \\
-\epsilon-\alpha_0 & -1+\epsilon+\alpha_0 & 2-\epsilon-\alpha_0 \\
2\epsilon+\alpha_0 & 1-2\epsilon-\alpha_0 & -2+2\epsilon+\alpha_0
\end{pmatrix},

C_n = \frac{1}{h} \begin{pmatrix}
\alpha_0 & -1+\alpha_0 & -2+\alpha_0 \\
\epsilon+\alpha_0 & -1+\epsilon+\alpha_0 & -2+\epsilon+\alpha_0 \\
2\epsilon+\alpha_0 & 1+2\epsilon+\alpha_0 & -2+2\epsilon+\alpha_0
\end{pmatrix},

D_n = \frac{1}{h} \begin{pmatrix}
-\alpha_0 & -1+\alpha_0 & 2-\alpha_0 \\
-\epsilon-\alpha_0 & -1+\epsilon+\alpha_0 & 2-\epsilon-\alpha_0 \\
-2\epsilon-\alpha_0 & -1+2\epsilon+\alpha_0 & 2-2\epsilon-\alpha_0
\end{pmatrix},

E_n = \frac{1}{h} \begin{pmatrix}
-\alpha_0 & 1-\alpha_0 & 2-\alpha_0 \\
\epsilon+\alpha_0 & -1+\epsilon+\alpha_0 & -2+\epsilon+\alpha_0 \\
-2\epsilon-\alpha_0 & 1-2\epsilon-\alpha_0 & -2-2\epsilon-\alpha_0
\end{pmatrix}.
And also for the boundary nodes:

F_0 = \frac{1}{h} \begin{pmatrix}
\alpha_0 & 2-\alpha_0 & -4+\alpha_0 \\
-2\epsilon-\alpha_0 & -2+2\epsilon+\alpha_0 & 4-2\epsilon-\alpha_0 \\
4\epsilon+\alpha_0 & 2-4\epsilon-\alpha_0 & -4+4\epsilon+\alpha_0
\end{pmatrix},

F_N = \frac{1}{h} \begin{pmatrix}
\alpha_0 & -2+\alpha_0 & -4+\alpha_0 \\
2\epsilon+\alpha_0 & -2+2\epsilon+\alpha_0 & 4+2\epsilon+\alpha_0 \\
4\epsilon+\alpha_0 & -2+4\epsilon+\alpha_0 & -4+4\epsilon+\alpha_0
\end{pmatrix}.
We define the following matrices to assemble the global matrix [24]:

T_n = A_n + B_n + C_{n+1}, \qquad (3.29)

T_0 = A_0 + F_0 + C_1, \qquad (3.30)

T_N = A_{N-1} + F_N + B_{N-1}. \qquad (3.31)
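The assembly of these 3×3 blocks into the (3N)×(3N) global matrix follows the usual finite element scatter pattern. The C sketch below shows only the indexing (the block contents come from the matrices above; the function name is ours):

```c
/* Scatter a 3x3 local block into the (3N)x(3N) global matrix, stored
   row-major. tb and eb are the element indices of the test and trial
   blocks: diagonal blocks (T_0, T_n, T_N) use tb == eb, while coupling
   blocks between neighbouring elements land one block off the diagonal,
   giving the block-tridiagonal structure of the global matrix. */
void scatter_block(double *A, int N, int tb, int eb, double blk[3][3]) {
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            A[(3 * tb + i) * (3 * N) + (3 * eb + j)] += blk[i][j];
}
```

Because each element only couples with its two neighbours, the global matrix is banded, which is what makes iterative solvers attractive for large N.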
3.1.2 Right Hand Side
The right-hand side vector has components

L(\phi_{in}) = \int_0^1 g(x) \phi_{in}(x) \, dx - a(x_0) \phi_{in}'(x_0) + \frac{\alpha_0}{h} \phi_{in}(x_0). \qquad (3.32)

Using the definition of the local basis \phi_{ni} and making a change of variable, we obtain

\int_{x_n}^{x_{n+1}} g(x) \phi_{ni}(x) \, dx = \frac{h}{2} \int_{-1}^{1} g\!\left( \frac{h}{2} t + (n + 1/2) h \right) t^i \, dt. \qquad (3.33)

The last integral is approximated using the Gauss quadrature rule; if the integrand is a
polynomial of degree 2Q_G - 1 the Gauss quadrature is exact. We obtain

\int_{x_n}^{x_{n+1}} g(x) \phi_{ni}(x) \, dx \approx \frac{h}{2} \sum_{j=1}^{Q_G} w_j \, g\!\left( \frac{h}{2} s_j + (n + 1/2) h \right) s_j^i. \qquad (3.34)

The entries are gathered element by element into the global right-hand side vector

b = \big( b_0^0, b_1^0, b_2^0, b_0^1, b_1^1, b_2^1, \ldots, b_0^{N-1}, b_1^{N-1}, b_2^{N-1} \big). \qquad (3.35)
With the results obtained above we can write a computer program to calculate the
numerical solution of the elliptic equation. At the end of the program it is necessary to
solve the Linear Algebra Problem (LAP) Ax = b, with A the global matrix, x the
coefficients of the solution and b the right-hand side. As we know from Computational
Linear Algebra (CLA) [11], [28], [14], we can choose among different solvers to get faster
results; this is important when we have a huge amount of data, i.e., when we want to
calculate the solution for a large number of elements. In this case we have to choose a
fast and parallelizable solver. One solver which is highly parallelizable is the Conjugate
Gradient algorithm (CG) [15]. Also from CLA [11], [28], [14], we know that the part of
the code which needs the most floating-point operations (FLOPs) is the solution of the
LAP. Therefore in this work we will use the parallel paradigm via CUDA to compute the
LAP with the CG. In the next chapter we will explain what the CUDA language is and
how to use it.
The great need for large-scale computing is well known in all areas of science, and
especially in Scientific Computation, where almost every problem involves a Partial
Differential Equation (PDE) that must be solved. The PDE leads to a linear algebra
problem, that is, to solving Ax = b with matrices that sometimes have hundreds,
thousands or millions of entries per matrix per time step. It is on this area that this
work focuses.
4.1 Background
Less than 10 years ago (November 2006), Nvidia opened its language CUDA
(Compute Unified Device Architecture) to the general community. CUDA is a
general-purpose parallel programming model and computing platform for Nvidia
Graphics Processing Units (GPUs), which come with a highly parallel, many-core,
multithreaded processor. These GPUs have enormous computational horsepower and
great memory bandwidth in comparison with the classical CPU, as shown in Figure 1.
To explain this great difference in performance, we have to look at the differences
between the architectures of the CPU and the GPU. The main difference is the way they
are built: as Figure 2 shows, the GPU has many more transistors devoted to data
processing than the CPU, which devotes more of its area to data caching and flow
control.
Figure 1: Floating-point operations per second for the CPU and GPU (taken from [1]).
Figure 2: The GPU devotes more transistors to data processing (taken from [1]).
To understand and learn the CUDA language it is important to have a clear knowledge
of the hardware, because the software and the hardware have a very close relationship;
the way the code is written has a direct connection with the architecture of the GPU.
A GPU has a large number of arithmetic logic units. In the CUDA model the work
given to them is organized into grids, these grids are divided into blocks, and the blocks
are finally subdivided into

Figure 3: Memory hierarchy (taken from [1]).

threads; this idea is shown schematically in Figure 3. Every thread is able to run a
function, which in CUDA programming is called a "kernel". After declaring a kernel,
it is necessary to specify the number of threads per block and the number of blocks per
grid within the "chevron" syntax <<< ... >>>; these values can be of type int or dim3.
4.3 Programming model
The CUDA programming model, as we have described above, relies on the architecture
of the GPU, which mainly consists of a processor chip with a parallel system. At the
bottom of Figure 4 we can see a pair of physical devices, also called graphics cards.
The graphics card is managed by the driver Application Programming Interface (API),
which can be found on the Nvidia web page. As in many languages, CUDA has libraries
and APIs to help the coding process; NVIDIA also provides the runtime API and the
libraries, and the user or developer can install them (see Appendix A). These libraries
are very similar to the libraries in other languages. CUDA itself was built on C, and in
general the CUDA language and its statements are very similar to C. This gives CUDA
a great advantage for C and C++ programmers, who can learn it very quickly.
In Figure 5 (also called the software stack) we can see the general structure of the device,
which is organized in cores connected to the shared memory, and this in turn to the
texture memory. This physical organization has a very close relationship with the
structure of the language. The counterpart of the device is the CPU or "host", which
runs the code that classically executes in serial. CUDA also has support for other
languages like Fortran and Python, as can be seen in [26]. We can mention that at the
beginning of this dissertation we tried to work in CUDA Fortran, but it does not have
much support for this kind of calculation and its bindings do not have enough
documentation, so we decided to switch to CUDA C.
4.3.1 Memory hierarchy
There are three principal concepts to be learned about the CUDA programming model:
the grouping of threads and its hierarchy, the concept of shared memory, and finally
synchronization. All of these concepts have a simple representation in the syntax of the
C language.
We can say that the aim behind parallelization is to find out which parts of the
algorithm are independent of the rest of the calculations; in the code this is reflected as
loops whose iterations carry no information from one cycle to another and can therefore
be executed in parallel.
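The contrast can be made concrete with two small C loops: the first has fully independent iterations and maps directly to one CUDA thread per index, while the second carries a dependence between cycles and would need a parallel reduction instead (function names are ours):

```c
/* Each iteration depends only on x[i] and y[i], so the loop body could
   be handed to one thread per index i. */
void saxpy(int n, float a, const float *x, const float *y, float *out) {
    for (int i = 0; i < n; i++)        /* independent: parallelizable */
        out[i] = a * x[i] + y[i];
}

/* Here every iteration reads the value of s written by the previous
   one, a loop-carried dependence; parallelizing it requires a
   reduction rather than a plain one-thread-per-iteration mapping. */
double running_sum(int n, const double *x) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}
```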
Every independent calculation can be done by a thread. The number of threads that
can be used is currently 1024 per block, but there can be millions of blocks containing
threads within the grids; Figure 6 shows how threads, blocks and grids are organized
within the device. This architecture is easily scalable, which allows the GPU architecture
to increase the number of processors available for calculations in future models of
dedicated hardware.
The model itself guides the programmer to partition the problem into blocks of threads,
and then into finer pieces that can share information and cooperate within the same
block. Each block of threads can execute sequentially or concurrently on any of the
multiprocessors available in the GPU, and only the multiprocessor count needs to be
known by the runtime system.
Figure 6: Grid of Thread Blocks (Taken from [1])
A very important concept that appears in the actual code is the "kernel", which is
equivalent to a "function" in C but with the important characteristic that a kernel is
executed in parallel, once by each of the threads defined for that specific kernel, all at
the same time. The syntax for a kernel begins with the keyword __global__, and the
chevron syntax contains the number of threads and blocks used for that kernel. Each
thread in the kernel is identified by a unique thread id given by the built-in threadIdx
variable. Source Code 1 shows an explicit example of the kernel syntax:
Source Code 1: Addition of two vectors using a single block of threads (taken from [1])
1 // Kernel definition
2 __global__ void VecAdd(float* a, float* b, float* c)
3 {
4 int i = threadIdx.x;
5 c[i] = a[i] + b[i];
6 }
7 int main()
8 {
9 ...
10 // Kernel invocation with N threads
11 VecAdd<<<1, N>>>(a, b, c);
12 ...
13 }
In this example we can see how the kernel is declared (line 2) with the keyword
__global__, with input parameters the vector pointers a and b and output parameter c.
This kernel calculates a simple sum of two vectors. Consider that in a serial code we
would use a for loop with i as a dummy variable. In this CUDA code we still use the
dummy variable i (line 4), but instead of looping we use the built-in threadIdx.x
variable, which means that all the i operations are done in parallel, i.e., at the same
time. In the function main (line 11) the kernel is called like a normal function, but
including the chevron syntax <<< , >>> with the number of blocks (1) and threads per
block (N) inside the brackets. Figure 7, taken from [18], illustrates this kernel: every
entry of vector a is added to the corresponding entry of vector b. It is important to note
that every sum is performed by a different thread, and we usually have enough threads
available for every entry of the vector (the maximum number of threads that can be
launched in a CUDA kernel is of order 10^17), so we can perform all the additions at
the same time.
In Source Code 2 (taken from [1]) we can see the addition of two matrices A and B of
size N×N, with the result stored in matrix C. The maximum number of threads per

Figure 7: Vector summation (taken from [18])

block that can be used is currently 1024 but, depending on the architecture of the GPU,
the number of blocks can run into the millions.
Figure 8: A 2D hierarchy of blocks and threads (taken from [18])
Blocks and threads can be organized in two or three dimensions, as schematically shown
in Figure 8. The dimensions of a grid of blocks can be specified with the CUDA types
int or dim3 and are accessible through the built-in variable blockIdx. The dimensions
of the threads can also be specified with int or dim3 and are accessible through the
built-in variable threadIdx.
Source Code 2: Addition of two matrices (Taken from [1])
1 // Kernel definition
2 __global__ void MatAdd(float A[N][N], float B[N][N],
3 float C[N][N])
4 {
5 int i = threadIdx.x;
6 int j = threadIdx.y;
7 C[i][j] = A[i][j] + B[i][j];
8 }
9 int main()
10 {
11 ...
12 // Kernel invocation with one block of N * N * 1 threads
13 int numBlocks = 1;
14 dim3 threadsPerBlock(N, N);
15 MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
16 ...
17 }
In the same way as in Source Code 1 above, in Source Code 2 we define the kernel
(line 2), but now with two dummy variables: one for the first index of the matrix and
one for the second. Here every pairwise addition is performed by a thread of the kernel
MatAdd. The two variables are built in as before, and one uses .x and the other .y
(lines 5 and 6) to distinguish between them. We call the kernel at line 15, but we
previously define the int variable numBlocks (line 13) and the CUDA-specific dim3
variable threadsPerBlock(N, N) (line 14); we suppose N is already defined. The CUDA
type dim3 can be used as a one-, two- or three-dimensional variable to specify the
dimensions of a block, as blocks can have up to three dimensions. Finally, in line 15 the
number of blocks per grid and the number of threads per block are passed via the
variables numBlocks and threadsPerBlock (here two-dimensional) to the kernel MatAdd.
Source Code 3: Addition of two matrices (Taken from [1])
1 // Kernel definition
2 __global__ void MatAdd(float A[N][N], float B[N][N],
3 float C[N][N])
4 {
5 int i = blockIdx.x * blockDim.x + threadIdx.x;
6 int j = blockIdx.y * blockDim.y + threadIdx.y;
7 if (i < N && j < N)
8 C[i][j] = A[i][j] + B[i][j];
9 }
10 int main()
11 {
12 ...
13 // Kernel invocation
14 dim3 threadsPerBlock(16, 16);
15 dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
16 MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
17 ...
18 }
In Source Code 3 we can see the use of blockIdx. Here we perform the same sum of
matrices as before, but in the definition of i and j the built-in variables blockIdx and
blockDim are combined with threadIdx to compute each thread's global index, so the
work is spread over a two-dimensional grid of blocks.
These are the first examples of the parallelization of simple algorithms. In the next
pages we explain more about the architecture and about the other levels of memory.
CUDA memory is divided into three different levels with different levels of accessibility.
The first one is "global memory", which all threads can access. The second one is
"shared memory", through which threads from the same block can share information.
The last one is "constant and texture memory", which can be accessed by all threads
and is read-only.
In Figure 7, together with Source Code 1, we can see a simple example of the use of
global memory. In this work shared and texture memory are not used explicitly; more
about the topic can be found in [18].
4.5 Data transfer
Source Code 4: Vector addition code sample (Taken from [1])
1 // Device code
2 __global__ void VecAdd(float* A, float* B, float* C, int N)
3 {
4 int i = blockDim.x * blockIdx.x + threadIdx.x;
5 if (i < N)
6 C[i] = A[i] + B[i];
7 }
8 // Host code
9 int main()
10 {
11 int N = ...;
12 size_t size = N * sizeof(float);
13 // Allocate input vectors h_A and h_B in host memory
14 float* h_A = (float*)malloc(size);
15 float* h_B = (float*)malloc(size);
16 // Initialize input vectors
17 ...
18 // Allocate vectors in device memory
19 float* d_A;
20 cudaMalloc(&d_A, size);
21 float* d_B;
22 cudaMalloc(&d_B, size);
23 float* d_C;
24 cudaMalloc(&d_C, size);
25 // Copy vectors from host memory to device memory
26 cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
27 cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
28 // Invoke kernel
29 int threadsPerBlock = 256;
30 int blocksPerGrid =
31 (N + threadsPerBlock - 1) / threadsPerBlock;
32 VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
33 // Copy result from device memory to host memory
34 // h_C contains the result in host memory
35 cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
36 // Free device memory
37 cudaFree(d_A);
38 cudaFree(d_B);
39 cudaFree(d_C);
40 // Free host memory
41 ...
42 }
5 Results
5.1 Methodology
The main purpose of the results of this work is to show the performance of the parallel
code versus the performance of the serial code. To accomplish this we took the code
in [24] as a reference, which can be seen in Appendix C; the parallel code can be seen in
Appendix B. For this work we used CUDA C, an Nvidia GeForce GTX 765M graphics
card and an Intel Core i7 processor. To solve the linear system in parallel we used the
Conjugate Gradient algorithm that can be found in the Nvidia CUDA Toolkit 7.5. A
benchmark was made to compare the performance of both codes with the same input
parameters, and at the end the L2 norm of the error was calculated for the Matlab code.
In the codes the symmetrization parameter ε was called ss, which for this model (SIPG)
equals −1; the penalty parameter α0 was called penal, and nel is the number of elements.
The serial code of [24] gives the coefficients of the numerical solution of the linear
system. We added a loop to evaluate the numerical solution so that it could be compared
directly with the analytical solution. As a result we get u_r, the solution at the right
boundary of each element, and u_l, the solution at the left boundary. With these two
points per element we get the numerical solution of the elliptic problem. In Figure 9 we
have plotted u_r(x) with asterisks in red and u_l(x) with circles in blue.
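With the monomial basis of Section 3, these endpoint values follow directly from the three coefficients of each element, since φ_{n0} = 1, φ_{n1} = ∓1 and φ_{n2} = 1 at the left and right endpoints. A small C sketch of such a post-processing step (names are ours, not the dissertation's):

```c
/* Traces of the DG solution at the endpoints of one element: with the
   monomial basis, phi_0 = 1 everywhere, phi_1 = -1 (left) / +1 (right)
   and phi_2 = 1 at both ends, so
       u_l = U0 - U1 + U2,   u_r = U0 + U1 + U2.
   U points at the element's three coefficients. */
void element_traces(const double *U, double *ul, double *ur) {
    *ul = U[0] - U[1] + U[2];
    *ur = U[0] + U[1] + U[2];
}
```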
In Figure 10 we compare this last result with the analytic solution of the elliptic
problem, which is [24]:

p(x) = (1 - x)e^{-x^2}; \qquad (5.1)

we can see in this figure that the numerical result is in agreement with the analytical
result.

Figure 9: Numerical solution of the DG method (plot of u(x) against x).
Figure 12 shows the plot of the coefficients α_j^m from the codes in parallel and in
serial. The coefficients obtained from the serial code [24] are given as red circles and the
coefficients from the CUDA code as green crosses. This first comparison was made using
the original solver, i.e. the direct method, for the serial code, and CG for the CUDA
code. As the graph shows, the results of both codes are the same, as we expected.
33
Figure 10: Numerical solution from the Matlab code (plot of u(x) against x).
Figure 11: Numerical results of the DGFEMs elliptic problem from the code in parallel.
Figure 12: Coefficients given in the serial code and in the parallel code (input: nel=10,
ss=-1, penal=2).
5.3 Benchmarking
We also ran some benchmarks to compare the performance of the serial code versus the
performance of the parallel code. The original serial code used the Matlab direct solver
mldivide (MLD). However, in order to make a fairer comparison, we changed this solver
to the already implemented Matlab functions CGS, the Matlab implementation of the
Conjugate Gradient Squared algorithm, and PCG, the Matlab implementation of the
Conjugate Gradient algorithm. The function PCG was used without a preconditioner,
which is exactly the same algorithm used in the CUDA code.
Figure 13: Benchmark with the CGS solver (ss=-1, penal=3.0); time in seconds against
number of elements, for the Matlab and CUDA codes.
In Figure 13 we can see the comparison of the performance of the codes with the
function CGS and the input parameters ss=-1, penal=3.0. The number of elements is
plotted on the abscissa and the time in seconds on the ordinate. We ran the program
once for every point. The results obtained from the Matlab code are in pink crosses and
the results from the CUDA code in red crosses. It can be seen from the graph that at
the beginning, i.e. for few elements, the times are almost the same, but for many
elements the Matlab code takes more than 200 times longer to get the same result.
Figure 14: Benchmark with the CGS solver (ss=-1.0, penal=5.0); time in seconds against
number of elements, for the Matlab and CUDA codes.
Figure 14 shows a graph with the same solver as Figure 13 but with parameters
ss=-1 and penal=5.0. It is interesting to note that in the first part of the graph the
points from the CUDA code take more time than the points from the Matlab code. At
2000 elements both codes have the same performance, and from that point on the
Matlab code takes progressively more time. The behaviour at the beginning of the
graph can be explained by taking into account the data-transfer rate between host and
device, that is, the time the information takes to pass from the host to the device, i.e.
from the CPU to the GPU. Although we did not measure this rate, from the figure we
can say it is the reason the CUDA code takes more time than the Matlab code for few
elements. This overhead matters less when the amount of information is bigger, i.e.
when the number of elements increases, and for more than 7000 elements the Matlab
code takes much more time than the CUDA code.
[Figure 15: Benchmark with the function PCG, input ss=-1, penal=5. Time [s] vs. number of elements (0 to 8000) for the Matlab and CUDA codes.]
[Figure 16: Benchmark with the function PCG, ss=-1.0, penal=2000.0. Time [s] vs. number of elements (0 to 8000) for the Matlab and CUDA codes.]
Figure 15 shows a benchmark using the Matlab PCG function as the solver for the serial code, which is a fairer comparison than the benchmark with CGS. For this graph, PCG was used without any preconditioner. We took as input parameters ss=-1 and penal=5.0. The subtle behavior described above, caused by the host-device data transfer, is again visible at the beginning of the figure. At the last point, computed for 7500 elements, the performance of the CUDA code is more than 35 times better. Figure 16 shows a benchmark for the input ss=-1.0 and penal=2000; the behavior is the same as in Figure 15, and the performance of the CUDA code is more than 70 times better than that of the Matlab code for 7500 elements.
5.4 Error
The L2 norm of the error [29], [16], defined in equations (5.3) to (5.6), was plotted against the mesh size h in order to verify the correctness of the DGFEMs solution. According to the theory [29], [16], for a piecewise polynomial basis of degree 2 a slope of 3 is expected in the convergence plot. This result was achieved using a quadrature rule for the integral of the error. Since the error function is not a polynomial (it involves the exponential function from the analytic solution of the problem), the integral is not exact; nevertheless, a large number of quadrature points was used:
\[
\int_{x_i}^{x_{i+1}} (u - u_h)^2 \, dx \approx \sum_{q=1}^{N_q} \left( u(x_q) - u_h(\hat{x}_q) \right)^2 \hat{w}_q \, |J|, \tag{5.2}
\]
where \(x_q\) is the value of \(x\) mapped from the reference domain \((-1,1)\), \(\hat{x}_q\) are the roots of the Legendre polynomials in the interval \((-1,1)\), and \(\hat{w}_q\) are the weights. This quantity was computed for every element and each contribution was accumulated in a loop. The error code can be seen in Appendix D. The subroutine lgwt.m was taken from [30].
\[
\| u - u_h \|_{L^2(0,1)}^2 = \sum_{i=1}^{N} \int_{x_{i-1}}^{x_i} (u - u_h)^2 \, dx \tag{5.3}
\]
Taking into account expression (5.2), we can compute the squared error using the following expression:
\[
\| u - u_h \|_{L^2(0,1)}^2 \approx \sum_{i=1}^{N} \sum_{q=1}^{N_q} \left( u(x_q) - u_h(\hat{x}_q) \right)^2 \hat{w}_q \, |J| \tag{5.4}
\]
The variable \(x\) mapped into the local domain has the following expression, which depends on the endpoints of each element and the quadrature points:
\[
x_q = \frac{1 - \hat{x}_q}{2} \, x_{i-1} + \frac{1 + \hat{x}_q}{2} \, x_i \tag{5.5}
\]
The Jacobian is computed using the following expression:
\[
|J| = \frac{dx}{d\xi} = \frac{x_i - x_{i-1}}{2} \tag{5.6}
\]
The DGFEMs approximation is given by:
\[
u_h(\hat{x}_q) = \sum_{j=1}^{3} U_j \, (\hat{x}_q)^{j-1}, \tag{5.7}
\]
where \((\hat{x}_q)^{j-1}\) are the mapped basis polynomials \(1, x, x^2\) on the interval \((-1,1)\) and \(U_j\) are the coefficients. Finally, the expected convergence plot is obtained:
[Figure 17: Convergence plot, L2 norm of the error vs. mesh size h on log-log axes.]
5.5 Discussion of the results
As we could see from Figure 12 results from the CUDA code are in good agreement with
the results from the Matlab code [24] and numerical results coincide with the analytical
solution shown in Figure 10 as we expected. From the second part of the results which
is the benchmarking we also can see a much better performance from the CUDA code in
comparison with the Matlab code as it is expected. An important result is the fact that
when we try to calculate the solution for more elements the performance of the CUDA
core is much better in comparison with the Matlab code. This could be explained because
we are using more FLOP’s as we increase the number of elements, when the difference
between the code in serial and in parallel is more evident. This could be more beneficial
when we try to get the solution for multidimensional cases of the problem when the
number of elements grows faster. We could also see qualitatively that the data transfer
rate between host and device affects the performance of the CUDA code and explain
the beginning of Figures 14, 15 and 16. Another important issue was the benchmarking
technique, for this results we use the tic and toc Matlab timing function and the basic
shell command time for measuring the performance of the CUDA code but as we know
from [18] CUDA has its own specific function to measure different parts of the code. For
future works we could improve this area to obtain better results. The last result is shown
in Figure 17, this shows us the accuracy of the numerical results. We obtains a linear
curve in a log-log graphic with slope 3 as we use polynomials of degree 2 for our basis
function which is consistent with the theory.
6 Conclusions
This research has attempted to show the performance of a CUDA code of a Discontinuous
Galerkin Finite Element Methods (DGFEMs) for an elliptic problem in 1D. A comparison
was made between a serial code in Matlab and the same problem implemented in parallel in
CUDA. Some benchmarks showed the performance of the code in serial was substantially
42
slower than the code in parallel, as we expected. Using exactly the same solver we found
that the CUDA code could be more than 70 time faster than the code in Matlab. We
also obtain that numerical results for the code in parallel and in serial are in agreement.
Finally, the L2 error norm of the serial code was calculated, obtaining a straight line
of slope 3 as it was expected. Future studies should concentrate on the use of better
profiling for a fairer benchmarks, especially, it could be use in the development of a
multidimensional version of the problem. This work may well represent a step forward in
the use of GPU in the implementation of the DGFEMs for elliptic problems. Application
of these findings will make it easier to perform DGFEMs in more dimensions where its
LAP is numerically and computationally more demanding.
A Tutorial to install Cuda 7.5 on Ubuntu 15.04
First of all you must have an NVIDIA CUDA-capable card; you can check your Ubuntu release with:
$ lsb_release -a
The CUDA toolkit can be downloaded from:
https://developer.nvidia.com/cuda-downloads
and create a file at
/etc/modprobe.d/blacklist-nouveau.conf with the following
contents:
blacklist nouveau
Reboot into text mode (runlevel 3). This can usually be accomplished by adding the number "3" to the end of the system's kernel boot parameters. Since the NVIDIA drivers are not yet installed, the text terminals may not display correctly. Temporarily adding "nomodeset" to the system's kernel boot parameters may fix this issue. Consult
https://wiki.ubuntu.com/Kernel/KernelBootParameters
Two steps that are not described in the official guide but worked for me are:
Eliminate the file:
.X0-lock
session. If your specific application uses the OpenGL library, this tutorial may not help you; otherwise it should work.
Say yes and accept all the other options.
If you either get an error or the installer skips something, read the error message it prints at the end. It will give you a hint to solve the problem.
If everything is done properly, you should have your CUDA Toolkit installation complete. After this, make sure you add the following paths at the end of the .bashrc file in your home folder:
for 32-bit systems include these lines:
export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib:$LD_LIBRARY_PATH
for 64-bit systems include these lines:
export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH
Then, close the terminal. Open another terminal and go to the NVIDIA samples. We recommend starting with the deviceQuery example, which is included in the NVIDIA CUDA Toolkit; it reports all the information about your NVIDIA card and its architecture. Finally, run make to get the executable and then run the resulting binary.
Good luck with the installation!
B CUDA Code
This code was developed by the author of this work and inspired by [24], except for the implementation of the CG algorithm, which can be found in the CUDA Toolkit Version 7.5. For more details please check https://developer.nvidia.com/cuda-toolkit
1 /*
2 * Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
3 *
4 * Please refer to the NVIDIA end user license agreement (EULA) associated
5 * with this source code for terms and conditions that govern your use of
6 * this software. Any use, reproduction, disclosure, or distribution of
7 * this software and related documentation outside the terms of the EULA
8 * is strictly prohibited.
9 *
10 */
11
12 /*
13 * This sample implements a conjugate gradient solver on GPU
14 * using CUBLAS and CUSPARSE
15 *
16 */
17
18 // includes, system
19 #include <stdlib.h>
20 #include <stdio.h>
21 #include <string.h>
22
32 #include "common.h"
33 #include <cusparse_v2.h>
34 #include <cuda.h>
35
36
43
52 return yval;
53 }
54
73 float *dA,*A;
74 int *dNnzPerRow,*nnzh=NULL;
75 int totalNnz;
76
77 int nel=200,mz=3;
78 int glodim = nel * mz;
79 // const int glodim2=1500;
80 // float Aglobal[1500][1400];
81 float rhsglobal[glodim];
82
83
88 if (devID < 0)
89 {
90 printf("exiting...\n");
91 exit(EXIT_SUCCESS);
92 }
93
94 checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));
95
128 // printf("\nglodim=%d\n",glodim);
129
130 // float Aglobal[glodim][glodim];
131 // float rhsglobal[glodim];
132
133
134 /*
135 for(int j=0;j<glodim2;j++)
136 {
137 for(int i=0;i<glodim2;i++)
138 {
139 Aglobal[i][j]=0.0;
140 }
141 }
142 */
143
157
158
159 Amat[0][0]=0.0;
160 Amat[0][1]=0.0;
161 Amat[0][2]=0.0;
162 Amat[1][0]=0.0;
163 Amat[1][1]=4.0;
164 Amat[1][2]=0.0;
165 Amat[2][0]=0.0;
166 Amat[2][1]=0.0;
167 Amat[2][2]=(16.0/3.0);
168
175 }
176
177 Bmat[0][0]=penal;
178 Bmat[0][1]=1.0-penal;
179 Bmat[0][2]=-2.0+penal;
180 Bmat[1][0]=-ss-penal;
181 Bmat[1][1]=-1.0+ss+penal;
182 Bmat[1][2]=2.0-ss-penal;
183 Bmat[2][0]=2.0*ss+penal;
184 Bmat[2][1]=1.0-2.0*ss-penal;
185 Bmat[2][2]=-2.0+2.0*ss+penal;
186
187
196 Cmat[0][0]=penal;
197 Cmat[0][1]=-1+penal;
198 Cmat[0][2]=-2+penal;
199 Cmat[1][0]=ss+penal;
200 Cmat[1][1]=-1+ss+penal;
201 Cmat[1][2]=-2+ss+penal;
202 Cmat[2][0]=2*ss+penal;
203 Cmat[2][1]=-1+2*ss+penal;
204 Cmat[2][2]=-2+2*ss+penal;
205
214 Dmat[0][0]=-penal;
215 Dmat[0][1]=-1+penal;
216 Dmat[0][2]=2-penal;
217 Dmat[1][0]=-ss-penal;
218 Dmat[1][1]=-1+ss+penal;
219 Dmat[1][2]=2-ss-penal;
220 Dmat[2][0]=-2*ss-penal;
221 Dmat[2][1]=-1+2*ss+penal;
222 Dmat[2][2]=2-2*ss-penal;
223
232 Emat[0][0]=-penal;
233 Emat[0][1]=1-penal;
234 Emat[0][2]=2-penal;
235 Emat[1][0]=ss+penal;
236 Emat[1][1]=-1+ss+penal;
237 Emat[1][2]=-2+ss+penal;
238 Emat[2][0]=-2*ss-penal;
239 Emat[2][1]=1-2*ss-penal;
240 Emat[2][2]=2-2*ss-penal;
241
250 F0mat[0][0]=penal;
251 F0mat[0][1]=2-penal;
252 F0mat[0][2]=-4+penal;
253 F0mat[1][0]=-2*ss-penal;
254 F0mat[1][1]=-2+2*ss+penal;
255 F0mat[1][2]=4-2*ss-penal;
256 F0mat[2][0]=4*ss+penal;
257 F0mat[2][1]=2-4*ss-penal;
258 F0mat[2][2]=-4+4*ss+penal;
259
265 }
266 }
267
268 FNmat[0][0]=penal;
269 FNmat[0][1]=-2+penal;
270 FNmat[0][2]=-4+penal;
271 FNmat[1][0]=2*ss+penal;
272 FNmat[1][1]=-2+2*ss+penal;
273 FNmat[1][2]=-4+2*ss+penal;
274 FNmat[2][0]=4*ss+penal;
275 FNmat[2][1]=-2+4*ss+penal;
276 FNmat[2][2]=-4+4*ss+penal;
277
298 Aglobal[ii][jj]=Aglobal[ii][jj]+Amat[ii][jj]+F0mat[ii][jj]+Cmat[ii][jj];
299 je=mz+jj;
300 Aglobal[ii][je]=Aglobal[ii][je]+Dmat[ii][jj];
301 }
302 }
303
309 for(int ig=0;ig<2;ig++)
310 {
311 rhsglobal[0]=rhsglobal[0]+wg[ig]*sourcef((sg[ig]+1)/(2*nel))/(2*nel);
312 rhsglobal[1]=rhsglobal[1]+wg[ig]*sg[ig]*sourcef((sg[ig]+1)/(2*nel))/(2*nel);
313 rhsglobal[2]=rhsglobal[2]+wg[ig]*sg[ig]*sg[ig]*sourcef((sg[ig]+1)/(2*nel))/(2*nel);
314
315 }
316
327 je=jj+(i-1)*mz;
328 Aglobal[ie][je]=Aglobal[ie][je]+Emat[ii][jj];
329
330 je=jj+(i+1)*mz;
331 Aglobal[ie][je]=Aglobal[ie][je]+Dmat[ii][jj];
332 }
333
345
352 Aglobal[ie][je]=Aglobal[ie][je]+Amat[ii][jj]+FNmat[ii][jj]+Bmat[ii][jj];
353 je=jj+(nel-2)*mz;
354 Aglobal[ie][je]=Aglobal[ie][je]+Emat[ii][jj];
355 }
356 for(int ig=0;ig<2;ig++)
357 {
358 float c=(pow(sg[ig],(ii)));
359 float d=sourcef((sg[ig]+2*(nel-1)+1.0)/(2*nel))/(2.0*nel);
360 rhsglobal[ie]=rhsglobal[ie]+wg[ig]*c*d;
361 }
362 }
363
378
379
384
396
397
411
419
425
437 checkCudaErrors(cublasStatus);
438
441 cusparseStatus_t cusparseStatus;
442 cusparseStatus = cusparseCreate(&cusparseHandle);
443
444 checkCudaErrors(cusparseStatus);
445
449 checkCudaErrors(cusparseStatus);
450
451 cusparseSetMatType(descr,CUSPARSE_MATRIX_TYPE_GENERAL);
452 cusparseSetMatIndexBase(descr,CUSPARSE_INDEX_BASE_ZERO);
453
485 CHECK(cudaMalloc((void **)&d_val, sizeof(float) * totalNnz));
486 CHECK(cudaMalloc((void **)&d_row, sizeof(int) * (M + 1)));
487 CHECK(cudaMalloc((void **)&d_col, sizeof(int) * totalNnz));
488 val = (float *)malloc(sizeof(float)*totalNnz);
489 I = (int *)malloc(sizeof(int)*(M + 1));
490 J = (int *)malloc(sizeof(int)*totalNnz);
491
492
493 // Convert A from dense format to CSR format, using the GPU
494 CHECK_CUSPARSE(cusparseSdense2csr(cusparseHandle, M, N, descr, dA, M, dNnzPerRow,
495 d_val, d_row, d_col));
496
497
498 nz=totalNnz;
499
504
505
506
524
525
529
530
541 k = 1;
542
564 r0 = r1;
565 cublasStatus = cublasSdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
566 cudaThreadSynchronize();
567 // printf("iteration = %3d, residual = %e\n", k, sqrt(r1));
568 k++;
569 }
570
572
597 cusparseDestroy(cusparseHandle);
598 cublasDestroy(cublasHandle);
599
600 free(I);
601 free(J);
602 free(val);
603 free(x);
604 free(rhs);
605 cudaFree(d_col);
606 cudaFree(d_row);
607 cudaFree(d_val);
608 cudaFree(d_x);
609 cudaFree(d_r);
610 cudaFree(d_p);
611 cudaFree(d_Ax);
612
617 // flushed before the application exits
618 cudaDeviceReset();
619
623 }
C Matlab Code
This code was taken from [24] and modified from line 132 to line 327 in order to compare properly with the analytical result and with the CUDA code.
1 function [ysol,ul,ur,xl,xr]=DGsimplesolve3(nel,ss,penal)
2 %function [x0,fl0,rr0,it0,rv0,S,rhsglobal] = DGsimplesolve3(nel,ss,penal)
3 %function DGsimplesolve(nel,ss,penal)
4
5 format long
6
25 wg(2) = 1.0;
26 sg(1) = -0.577350269189;
27 sg(2) = 0.577350269189;
28 % assemble global matrix and right-hand side
29 % first block row
30 for ii=1:locdim
31 for jj=1:locdim
32 %fprintf('\n');
33 %fprintf('%d , %d',ii,jj);
34 Aglobal(ii,jj) = Aglobal(ii,jj)+Amat(ii,jj)+F0mat(ii,jj)+Cmat(ii,jj);
35 je = locdim+jj ;
36 %fprintf('%d , %d',ii,je);
37 Aglobal(ii,je) = Aglobal(ii,je)+Dmat(ii,jj);
38 end; %jj
39 end; %ii
40
41 %
42 %%for ii=1:locdim
43 %% for jj=1:locdim
44 %% fprintf('%6.2f ',Aglobal(ii,jj));
45 %% end; %jj
46 %% end; %ii
47
48
54 for ig=1:2
55 rhsglobal(1) = rhsglobal(1) + wg(ig)*sourcef((sg(ig)+1)/(2*nel))/(2*nel);
56 rhsglobal(2) = rhsglobal(2) + wg(ig)*sg(ig)*sourcef((sg(ig)+1)/(2*nel))/(2*nel);
57 rhsglobal(3) = rhsglobal(3) + wg(ig)*sg(ig)*sg(ig)*sourcef((sg(ig)+1)/(2*nel))/(2*nel);
58 end; %ig
59 rhsglobal(1);
60 rhsglobal(2);
61 rhsglobal(3);
62
68 for jj=1:locdim
69 fprintf('\n');
70 je = jj+(i-1)*locdim;
71 %fprintf('%d , %d',ie,je);
72 Aglobal(ie,je) = Aglobal(ie,je)+Amat(ii,jj)+Bmat(ii,jj)+Cmat(ii,jj);
73 %fprintf(' %d, %f ',je,Aglobal);
74 %Aglobal(ie,je)
75 je = jj+(i-2)*locdim;
76 Aglobal(ie,je) = Aglobal(ie,je)+Emat(ii,jj);
77 je = jj+(i)*locdim;
78 %fprintf('%d , %d',ie,je);
79 Aglobal(ie,je)=Aglobal(ie,je)+Dmat(ii,jj);
80 % fprintf('%d , %d, %f ',ie,je,Aglobal(ie,je));
81 end; %jj
82 % compute right-hand side
83 for ig=1:2
84 rhsglobal(ie) = rhsglobal(ie)+wg(ig)*(sg(ig)^(ii-1))*sourcef((sg(ig)+2*(i-1)+1.0)/(2*nel))/(2*nel);
85 % fprintf('\n');
86 %fprintf('ig= %d, sg(ig)= %f,i= %d nel=%d, 2*nel= %d, b=%f',ig,sg(ig),i,nel,2*nel, sourcef((sg(ig)+2*(i-1)+1.0)/(2*nel))/(2*nel));
87 % fprintf(' %d, %d, %f, %f, %f, %f, %f ', ie, ii, wg(ig), sg(ig), (sg(ig)^(ii-1)),sourcef((sg(ig)+2*(i-1)+1.0)/(2*nel))/(2*nel), rhsglobal(ie));
88 end; %
89 end; %ii
90 end; %i
91 %fprintf('s=%f',sourcef(1.0));
92
108 % compute right-hand side
109 for ig=1:2
110 % fprintf('\n');
111 rhsglobal(ie) = rhsglobal(ie)+wg(ig)*(sg(ig)^(ii-1))*sourcef((sg(ig)+2*(nel-1)+1.0)/(2*nel))/(2*nel);
112 %fprintf('ig=%d, sg(ig)= %f, ii-1= %d c=%f, d=%f ',ig,sg(ig),ii-1,sg(ig)^(ii-1),sourcef((sg(ig)+2*(nel-1)+1.0)/(2*nel))/(2*nel));
113 % fprintf('%f \n',rhsglobal(ie));
114 end; %ig
115 end; %ii
116
117
118 % fprintf('Aglobal(11,11)=%f',Aglobal(12,12));
119 % for i=1:glodim
120 % fprintf('\n');
121 % for j=1:glodim
122 % fprintf(' %f ',Aglobal(i,j));
123 % end;
124 % end;
125
131
132 n1=length(rhsglobal);
133 M1 = spdiags((1:n1)’,0,n1,n1);
134
135
139 %pcg(A,b1,tol,maxit,M1);
140 % solve linear system
141 S = sparse(Aglobal);
142 %ysol2 = cgs(Aglobal,rhsglobal,tol,maxit);
143 %ysol3 = cgs(S,rhsglobal,tol,maxit);
144
145 x1 = pcg(S,rhsglobal,tol,maxit);
146
151 %[x0,fl0,rr0,it0,rv0] = pcg(Aglobal,rhsglobal,1e-8,100);
152
153
154 x=linspace(0,1,nel*3);
155 yanal=(1-x).*exp(-x.*x);
156
157 %plot(x,ysol,x,yanal)
158
159 %plot(x,ysol)
160
161
162 j=1;
163
164 %fprintf('xm(32)=%f',xm(32));
165 % for i=1:31
166 % fprintf('\n');
167 % fprintf(' %f \n',xm(i));
168 % end;
169
170
173 %A = fscanf(fileID,'%f');
174
184
185
195
196
197 plot(xl(1),ul(1),'bo',xr(1),ur(1),'r*')
198 hold on
199 plot(xl(2),ul(2),'bo',xr(2),ur(2),'r*')
200 hold on
201 plot(xl(3),ul(3),'bo',xr(3),ur(3),'r*')
202 hold on
203 plot(xl(4),ul(4),'bo',xr(4),ur(4),'r*')
204 hold on
205 plot(xl(5),ul(5),'bo',xr(5),ur(5),'r*')
206 hold on
207 plot(xl(6),ul(6),'bo',xr(6),ur(6),'r*')
208 hold on
209 plot(xl(7),ul(7),'bo',xr(7),ur(7),'r*')
210 hold on
211 plot(xl(8),ul(8),'bo',xr(8),ur(8),'r*')
212 hold on
213 plot(xl(9),ul(9),'bo',xr(9),ur(9),'r*')
214 hold on
215 plot(xl(10),ul(10),'bo',xr(10),ur(10),'r*')
216 hold on
217 %plot(x,yanal)
218 %hold on
219
220 xx(1)=xl(1);
221 xx(2)=xr(1);
222 xx(3)=xl(2);
223 xx(4)=xr(2);
224 xx(5)=xl(3);
225 xx(6)=xr(3);
226 xx(7)=xl(4);
227 xx(8)=xr(4);
228 xx(9)=xl(5);
229 xx(10)=xr(5);
230 xx(11)=xl(6);
231 xx(12)=xr(6);
232 xx(13)=xl(7);
233 xx(14)=xr(7);
234 xx(15)=xl(8);
235 xx(16)=xr(8);
236 xx(17)=xl(9);
237 xx(18)=xr(9);
238 xx(19)=xl(10);
239 xx(20)=xr(10);
240
241 ll(1)=ul(1);
242 ll(2)=ur(1);
243 ll(3)=ul(2);
244 ll(4)=ur(2);
245 ll(5)=ul(3);
246 ll(6)=ur(3);
247 ll(7)=ul(4);
248 ll(8)=ur(4);
249 ll(9)=ul(5);
250 ll(10)=ur(5);
251 ll(11)=ul(6);
252 ll(12)=ur(6);
253 ll(13)=ul(7);
254 ll(14)=ur(7);
255 ll(15)=ul(8);
256 ll(16)=ur(8);
257 ll(17)=ul(9);
258 ll(18)=ur(9);
259 ll(19)=ul(10);
260 ll(20)=ur(10);
261
262 %plot(x,yanal,'b')
263 %hold on
264 %plot(xx,ll,'r*')
265 %hold on
266
267
268 %xr=linspace(0,1,nel);
269 %plot(x,yanal,'b',xr,ur,'r',xl,ul,'g','linewidth',2)
270 %plot(x,ysol,'r',x,yanal,'b')
271 %plot(xr,ur)
272
273
274 %j=1;
275
283 %j=1;
284
285
286
287
288 %xl=linspace(0,1,nel);
289 %plot(xl,ul,'r',x,yanal,'b')
290 %plot(x,ysol,'r',x,yanal,'b')
291
292
293 %k=3;
294 %j=1;
295 %for i=1:nel-1
296 % ur(i)=ysol(k)*PC(j)+ysol(k+1)*PC(j+1)+ysol(k+2)*PC(j+2);
297 % j=j+3;
298 % k=k+3;
299 %end;
300
301 %xr=linspace(0,1,nel-1);
302 %plot(xr,ur,'r',x,yanal,'b')
303
311
312 return;
313
314
315 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
316 function yval = sourcef(xval)
317 % source function for the exact solution (1-x)*exp(-x^2)
318 yval = -(2*xval-2*(1-2*xval)+4*xval*(xval-xval^2))*exp(-xval*xval);
319 return;
320
330
331 function pol3=P3(x,nel,n)
332 h=1/nel;
333 %fprintf('h=%f\n',h);
334 %fprintf('4/(h*h)=%f\n',4/(h*h));
335 %fprintf('((x-(n+0.5)*h)^2)=%f\n',((x-(n+0.5)*h)^2));
336 pol3=(4/(h*h))*((x-(n+0.5)*h)^2);
337 return;
338
339
340
341
342
343
344
345
346
D Error Code
1 function [ l2norm ] = errorv2( nq,uh,nOn,a,b)
2 %ERRORV2 Compute the L2 norm of the error of the DGFEMs solution
3 % (see Section 5.4)
4
5 %%--------------------------------------------------------------------
6 % This function computes the L2 norm of the error
7 % || u-uh ||_{ L2(0,L) } given an analytical solution u
8 % of the Poisson equation in 1d and a mesh size h
9 %
10 %---------------------------------------------------------------------
11 % nOn number of nodes
12 % u is the analytical solution function
13 % L length of the domain.
14 % nq number of quadrature points necessary to obtain an exact
15 % integral, given that (u-uh)^{2} is a polynomial of 4th degree.
16 % nq = 2(3)-1 = 5, so in our case 3 is enough.
17 % uh finite element approximation
25 us=0.0; % u start
26 ue=0.0; % u end
27 J=0.0; % Jacobian
28
29
32 for i=1:nOe
33 xs=x(i);
34 xe=x(i+1);
35 us=uh(i);
36 ue=uh(i+1);
37 J=(xe-xs)*0.5;
38 sum=0.0;
39 x_map=zeros(1,nq);
40 f_map=zeros(1,nq);
41 [eta,w]=lgwt(nq,-1,1);
42 u_map=zeros(1,nq);
43 N1=0.0;
44 N2=0.0;
45 error_map = 0.0;
46
47 for j=1:nq
48
52 x_map(j)=N1 * xs + N2 * xe;
53 u_map(j)=N1 * us + N2 * ue;
54
59 end
60
61 Integrals(i)=sum;
62
63
64 end
65
66 I=0;
67 for k=1:nOe
68 I=Integrals(k)+I;
69 end
70
71 l2norm=sqrt(I);
1 function [x,w]=lgwt(N,a,b)
2
3 % lgwt.m
4 %
5 % This script is for computing definite integrals using Legendre-Gauss
6 % Quadrature. Computes the Legendre-Gauss nodes and weights on an interval
7 % [a,b] with truncation order N
8 %
9 % Suppose you have a continuous function f(x) which is defined on [a,b]
10 % which you can evaluate at any x in [a,b]. Simply evaluate it at all of
11 % the values contained in the x vector to obtain a vector f. Then compute
12 % the definite integral using sum(f.*w);
13 %
14 % Written by Greg von Winckel - 02/25/2004
15 N=N-1;
16 N1=N+1; N2=N+2;
17
18 xu=linspace(-1,1,N1)’;
19
20 % Initial guess
21 y=cos((2*(0:N)’+1)*pi/(2*N+2))+(0.27/N1)*sin(pi*xu*N/N2);
22
26 % Derivative of LGVM
27 Lp=zeros(N1,N2);
28
32 y0=2;
33
34 % Iterate until new points are uniformly within epsilon of old points
35 while max(abs(y-y0))>eps
36
37
38 L(:,1)=1;
39 Lp(:,1)=0;
40
41 L(:,2)=y;
42 Lp(:,2)=1;
43
44 for k=2:N1
45 L(:,k+1)=( (2*k-1)*y.*L(:,k)-(k-1)*L(:,k-1) )/k;
46 end
47
50 y0=y;
51 y=y0-L(:,N2)./Lp;
52
53 end
54
References
[2] Douglas N. Arnold. An Interior Penalty Finite Element Method with Discontinuous
Elements, 1982.
[3] Douglas N Arnold, Franco Brezzi, Bernardo Cockburn, and L Donatella Marini. Unified analysis of discontinuous Galerkin methods for elliptic problems. SIAM Journal on Numerical Analysis, 39(5):1749–1779, 2002.
[4] Garth A Baker. Finite element methods for elliptic equations using nonconforming
elements. Mathematics of Computation, 31(137):45–59, 1977.
[5] Francesco Bassi and Stefano Rebay. A high-order accurate discontinuous finite element method for the numerical solution of the compressible Navier-Stokes equations. Journal of Computational Physics, 131(2):267–279, 1997.
[6] Francesco Bassi and Stefano Rebay. A high-order accurate discontinuous finite element method for the numerical solution of the compressible Navier-Stokes equations. Journal of Computational Physics, 131(2):267–279, 1997.
[7] Franco Brezzi, Gianmarco Manzini, Donatella Marini, Paola Pietra, and Alessandro Russo. Discontinuous Galerkin approximations for elliptic problems. Numerical Methods for Partial Differential Equations, 16(4):365–378, 2000.
[8] Bernardo Cockburn and Chi-Wang Shu. The Runge-Kutta discontinuous Galerkin method for conservation laws V. Journal of Computational Physics, 141(2):199–224, 1998.
[9] Clint Dawson, Shuyu Sun, and Mary F Wheeler. Compatible algorithms for cou-
pled flow and transport. Computer Methods in Applied Mechanics and Engineering,
193(23):2565–2580, 2004.
[10] LM Delves and CA Hall. An implicit matching principle for global element calcula-
tions. IMA Journal of Applied Mathematics, 23(2):223–234, 1979.
[12] Jim Douglas and Todd Dupont. Interior penalty procedures for elliptic and parabolic Galerkin methods. In Computing Methods in Applied Sciences, pages 207–216. Springer, 1976.
[13] Rob Farber. CUDA Application Design and Development. Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 1st edition, 2012.
[14] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press,
2012.
[15] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
[16] Paul Houston. Lectures notes in variational methods. The University of Nottingham,
2014.
[17] Paul Houston, Emmanuil H Georgoulis, and Edward Hall. Adaptivity and a posteriori error estimation for DG methods on anisotropic meshes. In Proceedings of the International Conference on Boundary and Interior Layers (BAIL)-Computational and Asymptotic Methods, 2006.
[18] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley, USA, 2011.
[19] Andreas Klöckner, Timothy Warburton, and Jan S Hesthaven. High-order discontinuous Galerkin methods by GPU metaprogramming. In GPU Solutions to Multi-scale Problems in Science and Engineering, pages 353–374. Springer, 2013.
[20] CUDA Nvidia. Cublas library. NVIDIA Corporation, Santa Clara, California, 15,
2015.
[21] CUDA NVIDIA. Cusparse library. NVIDIA Corporation, Santa Clara, California,
2015.
[22] J Tinsley Oden, Ivo Babuška, and Carlos Erik Baumann. A discontinuous hp finite element method for diffusion problems. Journal of Computational Physics, 146(2):491–519, 1998.
[24] Béatrice Rivière. Discontinuous Galerkin methods for solving elliptic and parabolic
equations: theory and implementation. Society for Industrial and Applied Mathe-
matics, 2008.
[25] Béatrice Rivière, Mary F Wheeler, and Vivette Girault. Improved energy estimates for interior penalty, constrained and discontinuous Galerkin methods for elliptic problems. Part I. Computational Geosciences, 3(3-4):337–360, 1999.
[26] Gregory Ruetsch and Massimiliano Fatica. CUDA Fortran for Scientists and Engi-
neers: Best Practices for Efficient CUDA Fortran Programming. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA, 1st edition, 2013.
[27] Endre Süli, Christoph Schwab, and Paul Houston. hp-DGFEM for partial differential
equations with nonnegative characteristic form. Springer, 2000.
[28] Lloyd N Trefethen and David Bau III. Numerical Linear Algebra. SIAM: Society for Industrial and Applied Mathematics, 1997.
[31] Mary Fanett Wheeler. An elliptic collocation-finite element method with interior
penalties. SIAM Journal on Numerical Analysis, 15(1):152–161, 1978.