

Discontinuous Galerkin Method for
a 1D Elliptic Equation in CUDA

G14SCD

MSc Dissertation in

Scientific Computation

2014/15

School of Mathematical Sciences


University of Nottingham

Ivonne Leonor Medina Lino.

Supervisor: Professor Paul Houston

I have read and understood the School and University guidelines on plagiarism. I confirm
that this work is my own, apart from the acknowledged references.
Contents

Abstract 3

Acknowledgements 3

Dedication 3

1 Introduction 4

2 Discontinuous Galerkin Finite Element Methods (DGFEMs) for elliptic problems 5
2.1 Mathematical model of the DGFEMs . . . . . . . . . . . . . . . . . . . . . 5
2.2 Flux formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Derivation of the primal formulation . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Jump and Average . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4 Primal formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 The elliptic problem in 1D 9


3.1 Mathematical model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Local and global matrices . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Right Hand Side . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 DGFEMs in CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 CUDA: The parallel computing language 17


4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 CUDA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Programming model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.3.1 Memory hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.4 CUDA memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.5 Data transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Results 32
5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.2 Numerical results of the serial curve . . . . . . . . . . . . . . . . . . . . . . 32
5.2.1 Coefficients in serial and in parallel . . . . . . . . . . . . . . . . . . 33
5.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.4 Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.5 Discussion of the results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6 Conclusions 42

A Tutorial to install Cuda 7.5 on Ubuntu 15.04 44

B CUDA Code 46

C Matlab Code 61

D Error Code 69

References 73

Abstract

This work provides an analysis of the performance of the Discontinuous Galerkin Finite Element Method (DGFEM) for a 1D elliptic problem in parallel using GPU technology. The DGFEM was chosen as the numerical method due to its stability and robustness, and mainly because it is very suitable for parallelization thanks to its large degree of locality and its intensive arithmetic operations. It is well known that the time needed to run a code and obtain results increases as more elements are used in the DGFEM. The aim of this work is to develop a CUDA code for a 1D elliptic problem using the DGFEM in order to reduce this time, and to compare its performance with its serial counterpart. At the end of the work some benchmarks are presented which show that the parallel code is much faster than the serial code.

Acknowledgements

I would like to thank Professor Paul Houston, who supported me from the very beginning, when I started the MSc. He has been the most patient lecturer and always very approachable. It has been an honor and a great pleasure being his pupil. I appreciate and thank him deeply for all his help. I also want to thank my dear family and friends, who have supported me and helped me to cope with all the emotional and economic difficulties during the whole process. Finally, I would like to thank my country and the Mexican government (through CONACYT), who trusted me and gave me the opportunity to study this Master's degree and to build a better future in my career and life.

Dedication

To my husband Benjamin, my son Eduardo Emanuel and my mom Maria del Carmen.

1 Introduction

Discontinuous Galerkin Finite Element Methods (DGFEMs) were introduced for the numerical solution of first-order hyperbolic problems in 1973 [23]. In more recent times the method has enjoyed considerable success due to its robustness, stability and local conservativity, and also because it is easy to implement and highly parallelizable [27], [17], [19].
Depending on the input parameters, we obtain several variants of DGFEMs: the Symmetric Interior Penalty Galerkin (SIPG) method, introduced in the late 1970s [3], [31], the Global Element Method [10], the Nonsymmetric Interior Penalty Galerkin (NIPG) method [25], [22], the Incomplete Interior Penalty Galerkin (IIPG) method [9], and the case of elliptic problems written as a first-order system [5]. Discontinuous Galerkin Finite Element Methods for elliptic equations were proposed, with variants, by [12], [4], [31] and [2], and have been called interior penalty (IP) methods since that time. Examples of the method applied to elliptic and parabolic problems can also be seen in [4], [31] and [2]; works on the pure elliptic problem can be found in [8], [6] and [7]. For this work we will use the symmetric interior penalty (SIPG) method, which will be described in detail in the next section.
The main aim of this work is to study the performance of a code which solves a Discontinuous Galerkin method for an elliptic problem. The comparison was made between a serial code presented in [24] and its counterpart in parallel. CUDA and its libraries were used for the parallel implementation. The parallel code was developed by the author of this work, inspired by the serial code of [24], except for the Conjugate Gradient (CG) algorithm, which can be found in the Nvidia CUDA Toolkit 7.5 [1].
In Section 2, the Discontinuous Galerkin Finite Element Method (DGFEM) for an elliptic problem is presented in general for a multidimensional problem, with the model taken from [3]. In Section 3 the particular problem of DGFEMs for the elliptic problem in 1D is studied in detail, taking the discretization from [24]. In Section 4, the parallel programming paradigm is studied and explained through the particular model of the CUDA language, which was developed for Nvidia GPU hardware. In Section 5, results of the serial and parallel codes are shown, as well as some benchmarking between them using different input parameters. Finally, in Appendix A a tutorial for installing the CUDA technology is described in detail, in Appendix B the parallel CUDA code is presented, and in Appendix C we present the serial code with some modifications from the original.
One of the most important achievements of this work was the implementation of the model in parallel using GPU technology and the verification that this code performs more efficiently than its serial counterpart.

2 Discontinuous Galerkin Finite Element Methods (DGFEMs) for elliptic problems

Discontinuous Galerkin Finite Element Methods (DGFEMs) combine features of the finite volume method and the finite element method and can be applied to solve hyperbolic, elliptic and parabolic problems.

2.1 Mathematical model of the DGFEMs

The model problem of DGFEMs is introduced [3] as

−∆u = f in Ω, u = 0 on ∂Ω, (2.1)

where f is a given function in L2(Ω) and Ω ⊆ R^n, n > 1, is a polygonal domain. The problem can be rewritten as [3]

φ = ∇u, −∇ · φ = f in Ω, u = 0, on ∂Ω. (2.2)

2.2 Flux formulation

As is usually done for the finite element method [16], we multiply (2.2) by test functions τ and v, respectively, and integrate over a subset K of Ω. Then we have [3]

∫_K φ · τ dx = −∫_K u ∇·τ dx + ∫_{∂K} u n_K · τ ds,   (2.3)

∫_K φ · ∇v dx = ∫_K f v dx + ∫_{∂K} φ · n_K v ds,   (2.4)

where n_K is the outward unit normal vector to ∂K.


Assume that the elements K are shape-regular triangles and let Th = {K} be a triangulation of the domain Ω. The finite element spaces associated with Th are [3]

Vh := {v ∈ L2 (Ω) : v|K ∈ P (K) ∀K ∈ Th },

Σh := {τ ∈ [L2 (Ω)]n : τ |K ∈ Σ(K) ∀K ∈ Th },

where the space of polynomial functions is given by P (K) = Pp (K) of degree at most
p > 1 on K and Σ(K) = [Pp (K)]n . Let us consider the general formulation given by [8]:

Find uh ∈ Vh and φh ∈ Σh such that for all K ∈ Th we have

∫_K φ_h · τ dx = −∫_K u_h ∇·τ dx + ∫_{∂K} ū_K n_K · τ ds,   (2.5)

∫_K φ_h · ∇v dx = ∫_K f v dx + ∫_{∂K} φ̄_K · n_K v ds,   (2.6)

∀τ ∈ Σ(K) and ∀v ∈ P (K),


with φ̄_K the numerical approximation of φ and ū_K the numerical approximation of u on the boundary of K, which are called “numerical fluxes”. This formulation is called the “flux formulation”, as the DGFEM must be expressed in terms of the numerical fluxes φ̄_K and ū_K.

2.3 Derivation of the primal formulation

Now, following [3], we denote by H^l(Th) the space of functions on Ω whose restriction to each element K belongs to the Sobolev space H^l(K).
Let Γ denote the union of the boundaries of the elements K of Th. Functions in T(Γ) := Π_{K∈Th} L2(∂K) are evaluated twice on Γ0 := Γ \ ∂Ω and once on ∂Ω. Therefore, L2(Γ) is identified with the subspace of T(Γ) consisting of the functions whose values coincide on all internal edges.
The scalar numerical flux is defined as ū = (ūK )K∈Th and the vector numerical flux
φ̄ = (φ̄K )K∈Th , both of them linear functions given by

ū : H 1 (Th ) → T (Γ), φ̄ : H 2 (Th ) × [H 1 (Th )]2 → [T (Γ)]2 . (2.7)

2.3.1 Jump and Average

Let e be an interior edge shared by K1 and K2, and let n1 and n2 be the unit normal vectors on e pointing outwards from K1 and K2, respectively. We define the average {v} as [2]

{v} = ½ (v1 + v2),   (2.8)
and the jump [v] as

[v] = v1 n1 + v2 n2 on e ∈ E_h^0,   (2.9)

where E_h^0 is the set of interior edges e.


For ν ∈ [T(Γ)]^2, with ν1, ν2 defined analogously, we have

{ν} = ½ (ν1 + ν2),   (2.10)

[ν] = ν1 · n1 + ν2 · n2 on e ∈ E_h^0.   (2.11)

2.4 Primal formulation

Again, following [3], summing equations 2.5 and 2.6 over all the elements, using the average and jump operators, and applying integration by parts and suitable identities, we get the following result (for details see [3]):

B_h(u_h, v) = ∫_Ω f v dx   ∀v ∈ V_h,   (2.12)

B_h(u_h, v) := ∫_Ω ∇_h u_h · ∇_h v dx + ∫_Γ ([ū − u_h] · {∇_h v} − {φ̄} · [v]) ds + ∫_{Γ0} ({ū − u_h}[∇_h v] − [φ̄]{v}) ds   ∀u_h, v ∈ H^2(T_h),   (2.13)

where q ∈ T(Γ) and ϕ ∈ [T(Γ)]^2, and we have the lifting operators r : [L2(Γ)]^2 → Σ_h and l : L2(Γ0) → Σ_h given by

∫_Ω r(ϕ) · τ dx = −∫_Γ ϕ · τ ds,   (2.14)

∫_Ω l(q) · τ dx = ∫_{Γ0} q [τ] ds   ∀τ ∈ Σ_h.   (2.15)

The form B_h is bilinear and the proof of (2.13) can be seen in [3]. Equation 2.12 is called the “primal formulation” and 2.13 the “primal form”; recalling the definition of φ_h given in Section 2.2, we have

φ_h = φ_h(u_h) := ∇_h u_h − r([ū(u_h) − u_h]) − l({ū(u_h) − u_h}).   (2.16)

All these general concepts can be applied to the one dimensional case as we will see
in the next section.

3 The elliptic problem in 1D

In this section we will use and apply the concepts of the “primal formulation” defined and deduced in Section 2 to the one-dimensional case, using the deduction in [24].

3.1 Mathematical model

In this part of the text we aim to deduce the discretization of the elliptic problem in 1D and to obtain the corresponding linear algebra problem, similarly to what we did in general in Section 2.1. This deduction can be seen in more detail in [24].
Let us start with some definitions [16]:

Definition 1 Let Ω ⊂ R^n, n ≥ 1, be an open bounded set. We write C(Ω) to denote the set of all continuous (real-valued) functions defined on Ω.

Let α = (α_1, ..., α_n) ∈ N^n. The length of a multi-index is defined as |α| = α_1 + ... + α_n.

Definition 2 D^α = (∂/∂x_1)^{α_1} ··· (∂/∂x_n)^{α_n} = ∂^{|α|} / (∂x_1^{α_1} ··· ∂x_n^{α_n}).

Definition 3 Let Ω ⊂ R^n, n ≥ 1, be an open bounded set; then for m ≥ 0, we write

C^m(Ω) = {v ∈ C(Ω) : D^α v ∈ C(Ω) ∀|α| ≤ m}.

Now, we consider the boundary value problem in the interval (0, 1) [24]:

∀x ∈ (0, 1),   −(a(x)u′(x))′ = g(x),   (3.1)

u(0) = 1,   (3.2)

u(1) = 0,   (3.3)

where a ∈ C^1(0, 1) and g ∈ C^0(0, 1), using Definition 3. It is assumed that there are two constants a_0 and a_1 such that

∀x ∈ (0, 1),   0 < a_0 ≤ a(x) ≤ a_1.   (3.4)
u is said to be the classical solution of 3.1-3.3 if u ∈ C^2(0, 1) and u satisfies the equations 3.1-3.3 pointwise [24].
Let E_h be a partition of (0,1) given by 0 = x_0 < x_1 < . . . < x_N = 1 and let I_n = (x_n, x_{n+1}), n = 0, 1, ..., N − 1. With this definition we have

h_n = x_{n+1} − x_n,

h_{n−1,n} = max(h_{n−1}, h_n),

h = max_{0≤n≤N−1} h_n.

The space of piecewise discontinuous polynomials of degree k is denoted by D_k(E_h):

D_k(E_h) = {v : v|_{I_n} ∈ P_k(I_n), ∀n = 0, ..., N − 1},   (3.5)

where P_k(I_n) is the space of polynomials of degree k on the interval I_n.


We can define

v(x_n^+) = lim_{ε→0, ε>0} v(x_n + ε),   (3.6)

v(x_n^−) = lim_{ε→0, ε>0} v(x_n − ε),   (3.7)
and just as we did in Section 2.3.1 we define here the jump of v at the endpoints of I_n [24]:

[v(x_n)] = v(x_n^−) − v(x_n^+),

and the average of v at the endpoints of I_n as:

{v(x_n)} = ½ (v(x_n^−) + v(x_n^+)),   ∀n = 1, ..., N − 1.

This definition is extended to the end points as:

[v(x_0)] = v(x_0^+),   {v(x_0)} = v(x_0^+),

[v(x_N)] = v(x_N^−),   {v(x_N)} = v(x_N^−).

Given the definitions above, the penalty terms are defined as [24]:

J_0(v, w) = Σ_{n=0}^{N} (α_0 / h_{n−1,n}) [v(x_n)][w(x_n)],   (3.8)

where α_0 and α_1 are real nonnegative numbers.
At this point we apply the same procedure as for the finite element method to (3.1), multiplying by v ∈ D_k(E_h) and integrating over each interval I_n:

∫_{x_n}^{x_{n+1}} a(x)u′(x)v′(x) dx − a(x_{n+1})u′(x_{n+1})v(x_{n+1}^−) + a(x_n)u′(x_n)v(x_n^+) = ∫_{x_n}^{x_{n+1}} f(x)v(x) dx,   n = 0, ..., N − 1.   (3.9)

From this we can derive

[a(x_n)u′(x_n)v(x_n)] = {a(x_n)u′(x_n)}[v(x_n)] + {v(x_n)}[a(x_n)u′(x_n)].   (3.10)

Then we have

Σ_{n=0}^{N−1} ∫_{x_n}^{x_{n+1}} a(x)u′(x)v′(x) dx − Σ_{n=0}^{N} {a(x_n)u′(x_n)}[v(x_n)] + ε Σ_{n=0}^{N} {a(x_n)v′(x_n)}[u(x_n)]
= ∫_0^1 f(x)v(x) dx − a(x_0)v′(x_0)u(x_0) + a(x_N)v′(x_N)u(x_N)
= ∫_0^1 f(x)v(x) dx − a(x_0)v′(x_0).   (3.11)
For this work ε will be restricted to the case ε = −1, for which the DGFEM form is symmetric, i.e.

∀v, w,   a_{−1}(v, w) = a_{−1}(w, v),   (3.12)

a_{−1}(v, v) = Σ_{n=0}^{N−1} ∫_{x_n}^{x_{n+1}} a(x)(v′(x))^2 dx − 2 Σ_{n=0}^{N} {a(x_n)v′(x_n)}[v(x_n)] + J_0(v, v).   (3.13)

Finally, the statement of the DGFEM method is given by [24]:

Find u_h^{DG} ∈ D_k(E_h) such that

∀v ∈ D_k(E_h),   a_ε(u_h^{DG}, v) = L(v),   (3.14)

where L : D_k(E_h) → R is given by

L(v) = ∫_0^1 g(x)v(x) dx − a(x_0)v′(x_0) + (α_0 / h_{0,1}) v(x_0).   (3.15)
Piecewise quadratic polynomials are used as the monomial basis functions, given by:

φ_{n0}(x) = 1,   (3.16)

φ_{n1}(x) = 2 (x − x_{n+1/2}) / (x_{n+1} − x_n),   (3.17)

φ_{n2}(x) = 4 (x − x_{n+1/2})^2 / (x_{n+1} − x_n)^2,   (3.18)

with the midpoint of the interval given by x_{n+1/2} = ½ (x_n + x_{n+1}). On a uniform mesh we also have

x_n = x_0 + nh,   h = 1/N,   (3.19)

φ_{n0}(x) = 1,   (3.20)

φ_{n1}(x) = (2/h)(x − (n + 1/2)h),   (3.21)

φ_{n2}(x) = (4/h^2)(x − (n + 1/2)h)^2,   (3.22)

φ′_{n0}(x) = 0,   (3.23)

φ′_{n1}(x) = 2/h,   (3.24)

φ′_{n2}(x) = (8/h^2)(x − (n + 1/2)h).   (3.25)
Then the DGFEM solution is given by

u_h(x) = Σ_{m=0}^{N−1} Σ_{j=0}^{2} U_{jm} φ_{jm}(x).   (3.26)

Finally a linear system is obtained:

A U = b,   (3.27)

where U is the vector of the unknown real numbers U_{jm} to be solved for, b is the vector with components L(φ_{in}) and A is the matrix with entries a_ε(φ_{jm}, φ_{in}).

3.1.1 Local and global matrices

We can also write the linear system in terms of the local blocks A_n U^n, with [24]

U^n = (u_0^n, u_1^n, u_2^n)^T,

(A_n)_{ij} = ∫_{I_n} φ′_{ni}(x) φ′_{nj}(x) dx.   (3.28)

Computing the corresponding coefficients of A_n we get

A_n = (1/h) [ 0   0   0
              0   4   0
              0   0   16/3 ].
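As an illustration, the entries of A_n can also be obtained numerically from the basis derivatives (3.23)-(3.25) with a two-point Gauss rule. The following CUDA C sketch (the function name local_stiffness is hypothetical and not part of the serial or parallel codes of this work) reproduces the matrix above for a uniform mesh of nel elements.

#include <stdio.h>

/* Hypothetical sketch: assemble the 3x3 local matrix (A_n)_ij = integral over I_n of
   phi'_i phi'_j dx for the monomial basis (3.20)-(3.22) on a uniform mesh with h = 1/nel.
   In reference coordinates s in (-1,1): phi'_0 = 0, phi'_1 = 2/h, phi'_2 = 4*s/h. */
void local_stiffness(int nel, double An[3][3])
{
    double h     = 1.0 / nel;
    double sg[2] = { -0.577350269189626, 0.577350269189626 };   /* 2-point Gauss points  */
    double wg[2] = { 1.0, 1.0 };                                 /* 2-point Gauss weights */

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            An[i][j] = 0.0;

    for (int q = 0; q < 2; q++) {
        double dphi[3] = { 0.0, 2.0 / h, 4.0 * sg[q] / h };      /* basis derivatives     */
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                An[i][j] += 0.5 * h * wg[q] * dphi[i] * dphi[j]; /* dx = (h/2) ds         */
    }
}

int main(void)
{
    double An[3][3];
    local_stiffness(10, An);          /* expected: (1/h) diag(0, 4, 16/3) with h = 1/10 */
    for (int i = 0; i < 3; i++)
        printf("%8.3f %8.3f %8.3f\n", An[i][0], An[i][1], An[i][2]);
    return 0;
}

The two-point rule is exact here because the products of the basis derivatives are polynomials of degree at most two.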

We can also compute the contributions at the interior nodes x_n. The terms to be expanded there are

−{(P^{DG})′(x_n)}[v(x_n)] + ε {v′(x_n)}[P^{DG}(x_n)] + (α_0/h)[P^{DG}(x_n)][v(x_n)].

Expanding the jumps and averages into the one-sided traces P^{DG}(x_n^±) and v(x_n^±) gives one group of terms for each of the four combinations of traces at x_n; each group contributes to one of the local coupling matrices below [24].

Using definitions 3.16 to 3.25 it is possible to compute the local matrices for the interior nodes, given by:

B_n = (1/h) [  α_0          1 − α_0         −2 + α_0
              −ε − α_0      −1 + ε + α_0     2 − ε − α_0
               2ε + α_0      1 − 2ε − α_0   −2 + 2ε + α_0 ],

C_n = (1/h) [  α_0         −1 + α_0         −2 + α_0
               ε + α_0      −1 + ε + α_0    −2 + ε + α_0
               2ε + α_0     −1 + 2ε + α_0   −2 + 2ε + α_0 ],

D_n = (1/h) [ −α_0         −1 + α_0          2 − α_0
              −ε − α_0      −1 + ε + α_0     2 − ε − α_0
              −2ε − α_0     −1 + 2ε + α_0    2 − 2ε − α_0 ],

E_n = (1/h) [ −α_0          1 − α_0          2 − α_0
               ε + α_0      −1 + ε + α_0    −2 + ε + α_0
              −2ε − α_0      1 − 2ε − α_0    2 − 2ε − α_0 ],

and also for the boundary nodes:

F_0 = (1/h) [  α_0          2 − α_0         −4 + α_0
              −2ε − α_0     −2 + 2ε + α_0    4 − 2ε − α_0
               4ε + α_0      2 − 4ε − α_0   −4 + 4ε + α_0 ],

F_N = (1/h) [  α_0         −2 + α_0         −4 + α_0
               2ε + α_0     −2 + 2ε + α_0   −4 + 2ε + α_0
               4ε + α_0     −2 + 4ε + α_0   −4 + 4ε + α_0 ].
We define the following matrices to assemble the global matrix as [24]:

T = A_n + B_n + C_{n+1},   (3.29)

T_0 = A_0 + F_0 + C_1,   (3.30)

T_N = A_{N−1} + F_N + B_{N−1}.   (3.31)

Then the global matrix is block tridiagonal and given by:

A = [ T_0    D_1
      E_1    T      D_2
             E_2    T      D_3
                    ...    ...     ...
                           E_{N−2}   T        D_{N−1}
                                     E_{N−1}  T_N     ].
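A minimal sketch of this block assembly in C is given below. It mirrors the loop structure of the CUDA code in Appendix B; the function name assemble_global and the dense storage of the global matrix G are assumptions made for illustration only.

/* Hypothetical sketch of the block-tridiagonal assembly: every element contributes a
   3x3 diagonal block (T_0, T or T_N) plus the coupling blocks D (right) and E (left).
   A, B, C, D, E, F0, FN are the 3x3 local matrices of Section 3.1.1 and
   G is the dense (3*nel) x (3*nel) global matrix, stored row-major and zero-initialised. */
void assemble_global(int nel, double A[3][3], double B[3][3], double C[3][3],
                     double D[3][3], double E[3][3], double F0[3][3], double FN[3][3],
                     double *G)
{
    int n = 3 * nel;
    for (int e = 0; e < nel; e++) {                  /* loop over elements */
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                int r = 3 * e + i, c = 3 * e + j;
                /* diagonal block: A + F0 + C (first element), A + B + C (interior),
                   A + FN + B (last element) */
                double diag = A[i][j]
                            + (e == 0       ? F0[i][j] : B[i][j])
                            + (e == nel - 1 ? FN[i][j] : C[i][j]);
                G[r * n + c] += diag;
                if (e > 0)       G[r * n + (c - 3)] += E[i][j];  /* coupling to element e-1 */
                if (e < nel - 1) G[r * n + (c + 3)] += D[i][j];  /* coupling to element e+1 */
            }
        }
    }
}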

3.1.2 Right Hand Side

The expansion of the right hand side vector is given by [24]

L(φ_{in}) = ∫_0^1 f(x)φ_{in}(x) dx − a(x_0)(φ_{in})′(x_0) + (α_0/h) φ_{in}(x_0).   (3.32)

Using the definition of the local basis φ_{ni} and making a change of variable, we obtain

∫_0^1 g(x)φ_{ni}(x) dx = (h/2) ∫_{−1}^{1} g((h/2)t + (n + 1/2)h) t^i dt.   (3.33)

The last integral is approximated using the Gauss quadrature rule. If the integrand is a polynomial of degree at most 2Q_G − 1 the Gauss quadrature is exact. We obtain

∫_0^1 f(x)φ_{ni}(x) dx ≈ (h/2) Σ_{j=1}^{Q_G} w_j f((h/2)s_j + (n + 1/2)h) s_j^i,   (3.34)

and finally the right hand side vector b is given by [24]

b = (b_0^0, b_1^0, b_2^0, b_0^1, b_1^1, b_2^1, b_0^2, b_1^2, b_2^2, ..., b_0^{N−1}, b_1^{N−1}, b_2^{N−1}).   (3.35)

3.2 DGFEMs in CUDA

With the results obtained above we can write a computer program to calculate the numerical solution of the elliptic equation. At the end of the program it is necessary to solve the Linear Algebra Problem (LAP) Ax = b, with A the global matrix, x the coefficients of the solution and b the right hand side. As we know from Computer Linear Algebra (CLA) [11], [28], [14], we could choose among different solvers to get faster results; this is important when we have a huge amount of data, i.e., when we want to calculate the solution for a large number of elements. In this case we have to choose a fast and parallelizable solver. A solver which is highly parallelizable is the Conjugate Gradient algorithm (CG) [15], sketched below. Also from CLA [11], [28], [14] we know that the part of the code which needs the most floating-point operations (FLOP) is the solution of the LAP. Therefore, in this work we will use the parallel paradigm via CUDA to compute the LAP together with the CG. In the next chapter we will explain what the CUDA language is and how to use it.
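Before moving on, and only for reference, a minimal sketch of the unpreconditioned CG iteration in plain C is given here. It is the textbook algorithm [15] written for a dense matrix; it is not the Nvidia implementation used later, and the function name cg_dense is hypothetical.

#include <math.h>

/* Hypothetical sketch of unpreconditioned CG for a dense SPD system A x = b.
   A is n x n (row-major), x holds the initial guess and returns the solution;
   r, p, Ap are caller-provided work vectors of length n. */
void cg_dense(int n, const double *A, const double *b, double *x,
              double *r, double *p, double *Ap, double tol, int max_iter)
{
    /* r = b - A*x and p = r */
    for (int i = 0; i < n; i++) {
        double Ax = 0.0;
        for (int j = 0; j < n; j++) Ax += A[i * n + j] * x[j];
        r[i] = b[i] - Ax;
        p[i] = r[i];
    }
    double rsold = 0.0;
    for (int i = 0; i < n; i++) rsold += r[i] * r[i];

    for (int k = 0; k < max_iter && sqrt(rsold) > tol; k++) {
        double pAp = 0.0;
        for (int i = 0; i < n; i++) {               /* Ap = A*p and p'Ap            */
            Ap[i] = 0.0;
            for (int j = 0; j < n; j++) Ap[i] += A[i * n + j] * p[j];
            pAp += p[i] * Ap[i];
        }
        double alpha = rsold / pAp;
        double rsnew = 0.0;
        for (int i = 0; i < n; i++) {               /* update solution and residual */
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
            rsnew += r[i] * r[i];
        }
        for (int i = 0; i < n; i++)                 /* new search direction         */
            p[i] = r[i] + (rsnew / rsold) * p[i];
        rsold = rsnew;
    }
}

Every operation in this loop (matrix-vector product, dot products and vector updates) is data-parallel, which is why CG maps so well onto the GPU libraries used later.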

4 CUDA: The parallel computing language

The need for large-scale computing is well known in all areas of science, and especially in Scientific Computation, where almost every problem involves a Partial Differential Equation (PDE) that has to be solved. The PDE leads to a linear algebra problem, that is, solving Ax = b with matrices that may have hundreds, thousands or millions of entries per matrix per time step. It is on this area that this work focuses.

4.1 Background

Less than 10 years ago (November 2006), Nvidia opened its language called CUDA (Compute Unified Device Architecture) to the general community. The language is a general purpose parallel programming model and computing platform that is used on Nvidia Graphics Processing Units (GPUs), which come with a highly parallel, many-core, multithreaded architecture. These GPUs have enormous computational horsepower and great memory bandwidth in comparison with the classical CPU, as shown in Figure 1.
To explain this great difference in performance, we have to look at the differences between the architecture of the CPU and the GPU. The main difference between them is the way they are built. As Figure 2 shows, the GPU has many more transistors devoted to data processing than the CPU, which devotes more of them to data caching and flow control.

Figure 1: Floating-point operations per second for the CPU and the GPU (taken from [1]).

Figure 2: The GPU devotes more transistors to data processing (taken from [1]).

4.2 CUDA Architecture

To understand and learn the CUDA language it is important to have a clear knowledge of the hardware, because the software and the hardware have a very close relationship, and the way the code is written has a direct connection with the architecture of the GPU.
A GPU has a certain number of arithmetic logic units, which are divided into smaller groups called grids; these grids are divided into blocks, which are finally subdivided into threads. This idea is shown schematically in Figure 3. Every thread is able to run a function, which in CUDA programming is called a “kernel”. After declaring a kernel, it is necessary to specify the number of threads per block and the number of blocks per grid within the “chevron” syntax given by <<< ... >>>; these values can be of type int or dim3.

Figure 3: Memory Hierarchy (taken from [1]).

4.3 Programming model

The CUDA programming model, as we have described above, relies on the architecture of the GPU, which mainly consists of a processor chip with a parallel system. At the bottom of Figure 4 we can see a pair of physical devices, also called graphics cards. The graphics card is managed by the driver Application Programming Interface (API), which can be found on the Nvidia web page. As in many languages, CUDA has libraries and APIs to help the process of coding. NVIDIA also provides the runtime API and the libraries. The user or developer can install the APIs and the libraries (see Appendix A). These libraries are very similar to the libraries in other languages. Actually, CUDA itself was built on C, and in general the CUDA language and its statements are very similar to C. This gives CUDA a great advantage for C and C++ programmers, who can learn it in a very straightforward way.

Figure 4: Components of the ’device’ or software stack (Taken from [13]).

In Figure 5 (also called the software stack) we can see the general structure of the device, which is organized in cores connected to the shared memory, and this in turn to the texture memory. This physical organization has a very close relationship with the structure of the language. The counterpart of the device is the CPU or “host”, which runs the code that classically runs in serial. CUDA also has support for other languages like Fortran and Python, as can be seen in [26]. We can mention that at the beginning of this dissertation we tried to work with CUDA Fortran, but it does not have much support for this kind of calculation and its bindings do not have enough documentation, so we decided to switch to CUDA C.

Figure 5: Schematic representation of the device (taken from [13]).

4.3.1 Memory hierarchy

There are three principal concepts to be learned about the CUDA programming model: the grouping of threads and its hierarchy, the concept of shared memory and, finally, synchronization. All of these concepts have a simple representation in the syntax of the C language.
We can say that the aim behind the idea of parallelization is to find out which part of the algorithm involves information that is independent of the rest of the calculations; in the code this is reflected in loops whose iterations are independent of one another and can therefore be parallelized.
Every independent calculation can be done by a thread. The maximum number of threads per block is currently 1024, but there can be millions of blocks containing threads within the grids; Figure 6 shows how threads, blocks and grids are organized within the device. This architecture is easily scalable, which allows the GPU architecture to increase the number of processors that can be used for calculations in future models of dedicated hardware.
The model itself guides the programmer to partition the problem into blocks of threads, and then into finer pieces that can share information and cooperate within the same block. Each block of threads can do calculations sequentially or concurrently on any of the multiprocessors available in the GPU, and only the multiprocessor count will be known by the runtime system.
Figure 6: Grid of Thread Blocks (Taken from [1])

A very important concept that is defined in the actual code is the “kernel” which is
equivalent to a “function” in C but with the important characteristic that kernels will
be executed in parallel as many times as the number of threads defined for that specific
kernel and all of them at the same time. The syntax for a kernel begins with the keyword
__global__ and the chevron syntax that contains the number of threads and blocks used
for that kernel. Each thread in the kernel is identified by a unique thread id given by
a built-in threadIdx variable. Source Code 1 shows an explicit example of the kernel
syntax:

Source Code 1: Addition of two vector launching one single thread (Taken from [1])

1 // Kernel definition
2 __global__ void VecAdd(float* a, float* b, float* c)
3 {
4 int i = threadIdx.x;
5 c[i] = a[i] + b[i];
6 }
7 int main()
8 {
9 ...
10 // Kernel invocation with N threads
11 VecAdd<<<1, N>>>(a, b, c);
12 ...
13 }

In this example we can see how the kernel is declared (line 2) with the keyword __global__, the input parameters, which are the vector pointers a and b, and the output parameter, which is c. This kernel calculates a simple sum of vectors. Consider that if we did this with a serial code we would use a loop, with a do or a for, with i as a dummy variable. In this CUDA code we still use the dummy variable i (line 4), but instead of using a loop we use the built-in threadIdx.x variable, which means that all the i operations are done in parallel, i.e. at the same time. In the function main (line 11) the kernel is called as a normal function but with the chevron syntax <<< , >>>, containing the number of blocks (1) and threads per block (N) inside the brackets. Figure 7, taken from [18], illustrates this kernel: every entry of vector a is added to the corresponding entry of vector b. It is important to note that every sum is performed by a different thread, and we usually have enough threads available for every entry of the vector (the maximum number of threads that can be launched in a CUDA kernel is of the order of 10^17), so we can use them to perform all the additions at the same time.
Figure 7: Vector summation (taken from [18]).

In Source Code 2 (taken from [1]) we can see the addition of two matrices A and B of size N×N, with the result stored in matrix C. The maximum number of threads per block that can be used is currently 1024 but, depending on the architecture of the GPU, the number of blocks can be in the millions.

Figure 8: A 2D hierarchy of blocks and threads (taken from [18]).

Blocks and threads can be organized in two or three dimensions, as schematically shown in Figure 8. The dimensions of the blocks can be specified in the CUDA variables int or dim3 and are accessible through the built-in variable blockIdx. The dimensions of the threads can also be specified in the CUDA variables int or dim3 and are accessible through the built-in variable threadIdx.

Source Code 2: Addition of two matrices (Taken from [1])

1 // Kernel definition
2 __global__ void MatAdd(float A[N][N], float B[N][N],
3 float C[N][N])
4 {
5 int i = threadIdx.x;
6 int j = threadIdx.y;
7 C[i][j] = A[i][j] + B[i][j];
8 }
9 int main()
10 {
11 ...
12 // Kernel invocation with one block of N * N * 1 threads
13 int numBlocks = 1;
14 dim3 threadsPerBlock(N, N);
15 MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
16 ...
17 }

In the same way as in Source Code 1 above, in Source Code 2 we define the kernel (line 2), but now with two dummy variables: one for the first index of the matrix and the other for the second. Here every pairwise addition is performed by one thread of the kernel MatAdd. The two variables are built-in as before, and one uses .x and the other .y (lines 5 and 6) to distinguish between them. We call the kernel at line 15, but we previously define the int variable numBlocks (line 13) and the CUDA-specific dim3 variable threadsPerBlock(N, N) (line 14). We suppose N is already defined. The CUDA type dim3 can be used as a one-, two- or three-dimensional variable to specify the dimensions of the block, as blocks can have up to three dimensions. Finally, in line 15 the number of blocks per grid and the number of threads per block are passed to the kernel MatAdd via the variables numBlocks (an int) and threadsPerBlock (a two-dimensional dim3), respectively.

Source Code 3: Addition of two matrices (Taken from [1])

1 // Kernel definition
2 __global__ void MatAdd(float A[N][N], float B[N][N],
3 float C[N][N])
4 {
5 int i = blockIdx.x * blockDim.x + threadIdx.x;
6 int j = blockIdx.y * blockDim.y + threadIdx.y;
7 if (i < N && j < N)
8 C[i][j] = A[i][j] + B[i][j];
9 }
10 int main()
11 {
12 ...
13 // Kernel invocation
14 dim3 threadsPerBlock(16, 16);
15 dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
16 MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
17 ...
18 }

In Source Code 3 we can see the use of blockIdx. Here we perform the same matrix addition as before, but now the grid contains more than one block: in the definition of i and j the built-in variables blockIdx and blockDim are combined with threadIdx to obtain a unique global index for each thread, and the guard if (i < N && j < N) prevents out-of-range accesses.
These are the first examples of the parallelization of simple algorithms. In the next pages we explain more about the architecture and about the other levels of memory.

4.4 CUDA memory.

CUDA memory is divided into three different levels with different levels of accessibility. The first one is “global memory”, which all threads can access. The second one is “shared memory”, through which threads from the same block can share information. The last one is “constant and texture memory”, which can be accessed by all threads but is read-only.

In Figure 7, together with Source Code 1, we can see a simple example of the use of global memory. In this work shared and texture memory are not used explicitly, but more about the topic can be seen in [18]; a small sketch of a kernel that does use shared memory is shown below.
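The sketch is a block-level sum reduction; it is not part of the code of this work, and the kernel name blockSum is hypothetical. It only illustrates how __shared__ memory and __syncthreads() are used within a block.

// Hypothetical sketch: each block sums 256 consecutive entries of in[] using
// shared memory and writes one partial sum per block to out[].
// It assumes the kernel is launched with exactly 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float cache[256];            // visible to all threads of this block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();                        // wait until the whole block has written

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];         // one partial sum per block
}

// Launch example (assuming d_in and d_out are already allocated on the device):
// blockSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);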

4.5 Data transfer

It is important to know the basics of the programming model described above in order to understand how to develop a program properly and what is behind the scenes in more complex problems. However, for these complex problems it is very common to use APIs (Section 4.3). For this work these APIs and their libraries are extensively used, and they handle the problem-specific task of calculating and passing the number of threads and blocks that should be used; but it is still necessary to allocate and transfer memory from the CPU to the GPU and to take the results back from the GPU to the CPU.
In Source Code 4 we can see a complete example of this data transfer. Lines 1 to 7 declare the kernel VecAdd. In lines 14 and 15 memory is allocated on the host for the vectors A and B. In lines 19 to 24 memory is allocated on the device. In lines 26 and 27 the data is copied from host to device. In line 32 the kernel is invoked. In line 35 the result is copied from the device to the host and, finally, in lines 37 to 39 the device memory is released.
The same process is followed when we use libraries; some explicit examples of the corresponding implementation can be seen in the Nvidia CUDA Toolkit. Specifically for this work we used two libraries: CUBLAS [20] and CUSPARSE [21]. A small sketch of the same allocate/copy pattern with a CUBLAS call is given after Source Code 4.

Source Code 4: Vector addition code sample (Taken from [1])

1 // Device code
2 __global__ void VecAdd(float* A, float* B, float* C, int N)
3 {
4 int i = blockDim.x * blockIdx.x + threadIdx.x;
5 if (i < N)
6 C[i] = A[i] + B[i];
7 }
8 // Host code
9 int main()
10 {
11 int N = ...;
12 size_t size = N * sizeof(float);
13 // Allocate input vectors h_A and h_B in host memory
14 float* h_A = (float*)malloc(size);
15 float* h_B = (float*)malloc(size);
16 // Initialize input vectors
17 ...
18 // Allocate vectors in device memory
19 float* d_A;
20 cudaMalloc(&d_A, size);
21 float* d_B;
22 cudaMalloc(&d_B, size);
23 float* d_C;
24 cudaMalloc(&d_C, size);
25 // Copy vectors from host memory to device memory
26 cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
27 cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
28 // Invoke kernel
29 int threadsPerBlock = 256;
30 int blocksPerGrid =
31 (N + threadsPerBlock - 1) / threadsPerBlock;
32 VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
33 // Copy result from device memory to host memory
34 // h_C contains the result in host memory
35 cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
36 // Free device memory
37 cudaFree(d_A);
38 cudaFree(d_B);
39 cudaFree(d_C);
40 // Free host memory
41 ...
42 }
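As a small illustration of how the libraries follow the same allocate/copy pattern, the sketch below computes a dot product with CUBLAS. It is not taken from the code of this work; only standard CUBLAS and CUDA runtime calls are used, and error checking is omitted for brevity.

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1000;
    float h_x[1000], h_y[1000], result = 0.0f;
    for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));                       // allocate on the device
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                                     // create the CUBLAS context
    cublasSdot(handle, n, d_x, 1, d_y, 1, &result);            // result = x . y, returned to the host
    printf("dot = %f\n", result);                              // expected 2000.0

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}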

5 Results

5.1 Methodology

The main purpose of the results of this work is to show the performance of the code in parallel versus the performance of the code in serial. To accomplish this we took the code in [24] as a reference, which can be seen in Appendix C; the code in parallel can be seen in Appendix B. For this work we used CUDA C, the graphics card was a GeForce GTX 765M and the processor an Intel Core i7. To solve the linear system in parallel we used the Conjugate Gradient algorithm that can be found in the Nvidia CUDA Toolkit 7.5. Benchmarks were made to compare the performance of both codes with the same input parameters, and at the end the L2 norm of the error was calculated for the Matlab code. In the codes, ε (the symmetrization parameter) was called ss, which for this method (SIPG) equals −1. The penalty parameter α_0 was called penal, and nel is the number of elements.

5.2 Numerical results of the serial curve

The serial code given by [24] gave us the coefficients for the numerical result of the
solution of the linear system. We added a loop to calculate the numerical solution and
could compare directly with the analytical solution. As a result of this we get: ur , which
is the solution at the right boundary of the element, and ul , which is the solution at the
left boundary of the element. With this two points we get the numerical solution of the
elliptic problem. In Figure 9 we have plotted ur (x) with asterisks in red and ul (x) with
circles in blue.
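The reconstruction of u_r and u_l from the coefficients is straightforward: from (3.20)-(3.22) the three basis functions take the values (1, −1, 1) at the left endpoint of an element and (1, 1, 1) at the right endpoint. A short C sketch of this step is given below; it assumes the coefficients are stored three per element, as in (3.35), and the function name endpoint_values is hypothetical (the loop added to the serial code does the equivalent in Matlab).

/* Hypothetical sketch: evaluate the DG solution at the endpoints of every element.
   U is the coefficient vector of length 3*nel returned by the solver;
   ul[n] and ur[n] receive the values at the left and right endpoint of element n. */
void endpoint_values(int nel, const float *U, float *ul, float *ur)
{
    for (int n = 0; n < nel; n++) {
        float u0 = U[3 * n], u1 = U[3 * n + 1], u2 = U[3 * n + 2];
        ul[n] = u0 - u1 + u2;   /* basis values (1, -1, 1) at the left endpoint  */
        ur[n] = u0 + u1 + u2;   /* basis values (1,  1, 1) at the right endpoint */
    }
}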
In Figure 10 we compare this last result with the analytic solution of the elliptic problem, which is [24]:

p(x) = (1 − x) e^{−x^2};   (5.1)

we can see in this figure that the numerical result is in agreement with the analytical result.
[Figure 9 plot: “Numerical Solution of the DG”, u(x) against x.]

Figure 9: u_r(x) and u_l(x) curves.

5.2.1 Coefficients in serial and in parallel

Figure 12 shows the plot of the coefficients α_j^m obtained from the codes in parallel and in serial. The coefficients obtained from the serial code [24] are given as red circles and the coefficients from the CUDA code as green crosses. This first comparison was made using the original solver, i.e. the direct method, for the serial code and CG for the CUDA code. As the plot shows, the results of both codes are the same, as we expected.

[Figure 10 plot: “Numerical Solution from the Matlab Code”, u(x) against x.]

Figure 10: Analytic, u_r and u_l curves.

[Figure 11 plot: “Numerical solution from the CUDA code”, u(x) against x.]

Figure 11: Numerical results of the DGFEMs elliptical problem from the code in parallel.

[Figure 12 plot: “Coefficients in serial and in parallel. Input: nel=10, ss=-1, penal=2”; coefficients α_j^m against x; legend: Matlab, Cuda.]

Figure 12: Coefficients given in the serial code and in the parallel code.

5.3 Benchmarking

We also did some benchmarks to compare the performance of the serial code versus the performance of the parallel code. The original serial code used the Matlab mldivide (backslash) solver. However, in order to make a fairer comparison, we changed this solver to the already implemented Matlab functions CGS, which is the Matlab implementation of the Conjugate Gradient Squared algorithm, and PCG, which is the Matlab implementation of the Conjugate Gradient algorithm. The function PCG was used without a preconditioner, which is exactly the same implementation used for the CUDA code.

[Figure 13 plot: “Benchmark: ss=-1, penal=3.0”; time [s] against number of elements; legend: Matlab, Cuda.]

Figure 13: Benchmark with CGS function. Input: ss=-1, penal=3.0.

In Figure 13 we can see the comparison of the performance of the two codes with the function CGS and the input parameters ss=-1, penal=3.0. The abscissa shows the number of elements and the ordinate the time in seconds. We ran each program once to get every point. The results obtained from the Matlab code are plotted as pink crosses and the results from the CUDA code as red crosses. It can be seen from the graph that at the beginning, i.e. for few elements, the times are almost the same, but for many elements the Matlab code takes more than 200 times as long to get the same result.

[Figure 14 plot: “Benchmark ss=-1.0, penal=5.0”; time [s] against number of elements; legend: Matlab, Cuda.]

Figure 14: Benchmark with CGS function. Input: ss=-1, penal=5.0.

Figure 14 shows a graph with the same solver as in Figure 13 but with parameters ss=-1 and penal=5.0. It is interesting to note that in the first part of the graph the points from the CUDA code take more time than the points from the Matlab code. At about 2000 elements both codes have the same performance, and from that point on the Matlab code takes more and more time until the end. The behaviour at the beginning of the graph can be explained by the data transfer between host and device, i.e. the time the information takes to pass from the CPU to the GPU. Although we did not measure this transfer time, from the figure we can say that it is the reason the CUDA code takes more time than the Matlab code for few elements. This overhead becomes unimportant when the amount of information is larger, i.e. when the number of elements increases, and for more than 7000 elements the Matlab code takes much more time than the CUDA code.

[Figure 15 plot: “Benchmark: Function PCG. Input: ss=-1, penal=5”; time [s] against number of elements; legend: Matlab, Cuda.]

Figure 15: Benchmark with PCG function. Input: ss=-1, penal=5.0.

[Figure 16 plot: “Benchmark. Function PCG ss=-1.0, penal=2000.0”; time [s] against number of elements; legend: Matlab, Cuda.]

Figure 16: Benchmark with PCG function. Input: ss=-1, penal=2000.

Figure 15 shows a benchmark using the Matlab PCG function as the solver for the serial code, which is a fairer comparison than the benchmark with CGS. For this graph, PCG was used without any preconditioner. We took as input parameters ss=-1 and penal=5.0. We can also see, at the beginning of the figure, the subtle behaviour described above regarding the data transfer.
At the end of the figure we can see that the performance is more than 35 times better for the last point, which was for 7500 elements. Figure 16 shows a benchmark for the input ss=-1.0 and penal=2000; the behaviour is the same as in Figure 15, and it can be seen that the performance of the CUDA code is more than 70 times better than that of the Matlab code for 7500 elements.

5.4 Error

The L2 norm of the error [29], [16], defined in equations 5.3 to 5.6, was plotted against the mesh size h in order to verify the correctness of the DGFEMs solution. According to theory [29], [16], for a piecewise polynomial basis of degree 2 a slope of 3 is expected in the convergence plot. This result was achieved using a quadrature rule for the integral of the error. Since the error function is not a polynomial (the analytic solution of the problem contains an exponential), the integral is not exact; nevertheless, a large number of quadrature points was used for the integral:

∫_{x_i}^{x_{i+1}} (u − u_h)^2 dx ≈ Σ_{q=1}^{N_q} (u(x_q) − u_h(x̂_q))^2 ŵ_q |J|,   (5.2)

where x_q is the quadrature point mapped from the reference domain (−1,1) to the element, x̂_q are the roots of the Legendre polynomials in the interval (−1,1) and ŵ_q are the corresponding weights; this quantity was computed for all elements and each contribution was added in a loop. The error code can be seen in Appendix D. The subroutine lgwt.m was taken from [30].

‖u − u_h‖²_{L²(0,1)} = Σ_{i=1}^{N} ∫_{x_{i−1}}^{x_i} (u − u_h)^2 dx.   (5.3)

Taking into account expression 5.2, we can compute the squared error using the following expression:

‖u − u_h‖²_{L²(0,1)} ≈ Σ_{i=1}^{N} Σ_{q=1}^{N_q} (u(x_q) − u_h(x̂_q))^2 ŵ_q |J|.   (5.4)

The variable x mapped into the local domain has the following expression, which depends on the endpoints of each element and the quadrature points:

x_q = x_{i−1} (1 − x̂_q)/2 + x_i (1 + x̂_q)/2.   (5.5)

The Jacobian is computed using the following expression:

|J| = dx/dξ = (x_i − x_{i−1})/2.   (5.6)

The DGFEMs approximation is given by

u_h(x̂_q) = Σ_{j=1}^{3} U_j (x̂_q)^{j−1},   (5.7)

where (x̂_q)^{j−1} are the mapped basis polynomials 1, x, x² on the interval (−1,1) and U_j are the coefficients of the element. A small sketch of this error computation is given below; the expected convergence plot, shown in Figure 17, is then obtained.

[Figure 17 plot: “L2 norm vs h”, log-log axes.]

Figure 17: L2 norm of the error vs the mesh size h.

5.5 Discussion of the results

As we could see from Figure 12, the results of the CUDA code are in good agreement with the results of the Matlab code [24], and the numerical results coincide with the analytical solution shown in Figure 10, as we expected. From the second part of the results, the benchmarking, we can also see a much better performance of the CUDA code in comparison with the Matlab code, as expected. An important result is the fact that when we calculate the solution for more elements the performance of the CUDA code becomes much better in comparison with the Matlab code. This can be explained by the larger number of floating-point operations needed as the number of elements increases, which is when the difference between the serial and the parallel code becomes more evident. This should be even more beneficial when we try to compute the solution of multidimensional versions of the problem, where the number of elements grows much faster. We could also see, qualitatively, that the data transfer between host and device affects the performance of the CUDA code and explains the behaviour at the beginning of Figures 14, 15 and 16. Another important issue was the benchmarking technique: for these results we used the tic and toc Matlab timing functions and the basic shell command time for measuring the performance of the CUDA code, but, as we know from [18], CUDA has its own specific functions to time different parts of the code; a small sketch using CUDA events is given below, and in future work we could improve this area to obtain better measurements. The last result is shown in Figure 17, which shows the accuracy of the numerical results. We obtain a straight line in a log-log plot with slope 3, as we use polynomials of degree 2 for our basis functions, which is consistent with the theory.
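The sketch below uses the standard CUDA event API referred to in [18]; it was not used for the benchmarks of this work and is only an indication of how a specific region of device work could be timed in a future version.

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical sketch: time a region of device work (e.g. a kernel launch or the CG
// solve) with CUDA events instead of tic/toc or the shell command time.
void time_device_region(void)
{
    cudaEvent_t start, stop;
    float elapsed_ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // mark the beginning on the default stream
    // ... launch the kernels / library calls to be measured here ...
    cudaEventRecord(stop, 0);           // mark the end
    cudaEventSynchronize(stop);         // wait until the device reaches 'stop'

    cudaEventElapsedTime(&elapsed_ms, start, stop);
    printf("GPU time: %f ms\n", elapsed_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}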

6 Conclusions

This research has attempted to show the performance of a CUDA implementation of the Discontinuous Galerkin Finite Element Method (DGFEM) for an elliptic problem in 1D. A comparison was made between a serial code in Matlab and the same problem implemented in parallel in CUDA. Several benchmarks showed that the serial code was substantially slower than the parallel code, as we expected. Using exactly the same solver we found that the CUDA code could be more than 70 times faster than the Matlab code. We also found that the numerical results of the parallel and serial codes are in agreement. Finally, the L2 error norm of the serial code was calculated, obtaining a straight line of slope 3, as expected. Future studies should concentrate on the use of better profiling for fairer benchmarks, and especially on the development of a multidimensional version of the problem. This work may well represent a step forward in the use of GPUs in the implementation of DGFEMs for elliptic problems. Application of these findings will make it easier to apply DGFEMs in more dimensions, where the associated linear algebra problem is numerically and computationally more demanding.

A Tutorial to install Cuda 7.5 on Ubuntu 15.04

First of all you must have an Nvidia CUDA-capable card; you can check this with:

$ lspci | grep NVIDIA

You can also check your Ubuntu version with:

$ lsb_release -a

It is recommended to have a fresh installation of Ubuntu to avoid possible conflicts with other compilers or Nvidia drivers.
Then you should download the Nvidia CUDA installer from:

https://developer.nvidia.com/cuda-downloads

Install all these libraries:


sudo apt-get install freeglut3-dev build-essential libx11-dev
libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa
libglu1-mesa-dev libglu1-mesa glew-utils mesa-utils

Open the next file:

sudo gedit /etc/modprobe.d/blacklist.conf

And add these lines to the end of the file:


blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv

A very important step is to disable Nouveau. Create a file at /etc/modprobe.d/blacklist-nouveau.conf with the following contents:

blacklist nouveau
options nouveau modeset=0

and regenerate the kernel initramfs:


sudo update-initramfs -u

Reboot into text mode (runlevel 3). This can usually be accomplished by adding the number "3" to the end of the system's kernel boot parameters. Since the NVIDIA drivers are not yet installed, the text terminals may not display correctly. Temporarily adding "nomodeset" to the system's kernel boot parameters may fix this issue. Consult

https://wiki.ubuntu.com/Kernel/KernelBootParameters

on how to make the above boot parameter changes.


The reboot is required to completely disable the Nouveau drivers and prevent the
graphical interface from loading. The CUDA driver cannot be installed while the Nouveau
drivers are loaded or while the graphical interface is active.
Then change the permission to the executable with:

$chmod 777 -R cuda_<version>_linux.run

Two steps that are not described in the official guide but worked for me are the following.
Delete the file .X0-lock in the /tmp directory.


After the step above, reboot the computer.
Another key step: when the installer is running you have to write yes and accept all the options, but when it asks whether you want to install the OpenGL library you must say no, otherwise you will get into an infinite loop at the log-in stage and will not be able to start the session. If your specific application uses the OpenGL library, this tutorial may not help you; otherwise it will work.
Say yes and accept all the other options.
If you either get an error or the installer skips something in the installation, read the error message it gives you at the end; it will give you a hint for solving the problem.
If everything is done properly, you should have your CUDA Toolkit installation complete. After this, make sure you add the following paths at the end of the .bashrc file in your home folder.
For 32 bits include these lines:

export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib:$LD_LIBRARY_PATH

and for 64 bits:

export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH

Then close the terminal. Open another terminal and go to the Nvidia samples. We recommend starting with the deviceQuery example, which is in the NVIDIA CUDA Toolkit; it prints all the information about your Nvidia card and its architecture. Finally, run make to get the executable and then run the resulting binary.
Good luck with the installation!

B CUDA Code

This code was developed by the author of this work, inspired by [24], except for the implementation of the CG algorithm, which can be found in the CUDA Toolkit Version 7.5. For more details please check https://developer.nvidia.com/cuda-toolkit

1 /*
2 * Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
3 *
4 * Please refer to the NVIDIA end user license agreement (EULA) associated
5 * with this source code for terms and conditions that govern your use of
6 * this software. Any use, reproduction, disclosure, or distribution of
7 * this software and related documentation outside the terms of the EULA
8 * is strictly prohibited.
9 *
10 */
11

12 /*
13 * This sample implements a conjugate gradient solver on GPU
14 * using CUBLAS and CUSPARSE
15 *
16 */
17

18 // includes, system
19 #include <stdlib.h>
20 #include <stdio.h>
21 #include <string.h>
22

23 /* Using updated (v2) interfaces to cublas */


24 #include <cuda_runtime.h>
25 #include <cusparse.h>
26 #include <cublas_v2.h>
27

28 // Utilities and system includes


29 #include <helper_functions.h> // helper for shared functions common to CUDA Samples
30 #include <helper_cuda.h> // helper functions for CUDA error checking and initialization
31

32 #include "common.h"
33 #include <cusparse_v2.h>
34 #include <cuda.h>
35

36

37 #define WIDTH 600


38 #define HEIGHT 600
39

40 float Aglobal [HEIGHT][WIDTH];


41 int n,m;
42

43

44 const char *sSDKname = "conjugateGradient";


45

46 float sourcef(float xval)


47 {
48 //source function for exact solution = (1-x)e^(-x^2)
49 float yval;
50 yval=-(2*xval-2*(1-2*xval)+4*xval*(xval-pow(xval,2)))*exp(-xval*xval);
51

52 return yval;
53 }
54

55 // const int nel=500,mz=3;


56 // int glodim = nel * mz;
57

58 int main(int argc, char **argv)


59 {
60 int M = 0, N = 0, nz = 0, *I = NULL, *J = NULL;
61 float *val = NULL;
62 const float tol = 1e-5f;
63 const int max_iter = 10000;
64 float *x;
65 float *rhs;
66 float a, b, na, r0, r1;
67 int *d_col, *d_row;
68 float *d_val, *d_x, dot;
69 float *d_r, *d_p, *d_Ax;
70 int k;
71 float alpha, beta, alpham1;
72

73 float *dA,*A;
74 int *dNnzPerRow,*nnzh=NULL;
75 int totalNnz;
76

77 int nel=200,mz=3;
78 int glodim = nel * mz;
79 // const int glodim2=1500;
80 // float Aglobal[1500][1400];
81 float rhsglobal[glodim];
82

83

84 // This will pick the best possible CUDA capable device


85 cudaDeviceProp deviceProp;
86 int devID = findCudaDevice(argc, (const char **)argv);
87

88 if (devID < 0)
89 {
90 printf("exiting...\n");
91 exit(EXIT_SUCCESS);
92 }
93

94 checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));
95

96 // Statistics about the GPU device


97 // printf("> GPU device has %d Multi-Processors, SM %d.%d compute capabilities\n\n",
98 //        deviceProp.multiProcessorCount, deviceProp.major, deviceProp.minor);
99

100 // int version = (deviceProp.major * 0x10 + deviceProp.minor);


101

102 // if (version < 0x11)


103 // {
104 // printf("%s: requires a minimum CUDA compute 1.1 capability\n",
,→ sSDKname);
105

106 // cudaDeviceReset causes the driver to clean up all state. While


107 // not mandatory in normal operation, it is good practice. It is also
108 // needed to ensure correct operation when the application is being
109 // profiled. Calling cudaDeviceReset causes all profile data to be
110 // flushed before the application exits
111 // cudaDeviceReset();
112 // exit(EXIT_SUCCESS);
113 // }
114

115 //////////////////////////Riviere code//////////////////////////////////////////


116

117 int je,ie;


118 float ss,penal;
119 float Amat[3][3],Bmat[3][3],Cmat[3][3],Dmat[3][3],Emat[3][3],F0mat[3][3],FNmat[3][3];
120

121 //symmetric interior penalty Galerkin (SIPG) method


122 ss=-1.0;
123 penal=5.0;
124

125 //dimension of global matrix


126 //glodim = nel * mz;
127

128 // printf("\nglodim=%d\n",glodim);
129

130 // float Aglobal[glodim][glodim];
131 // float rhsglobal[glodim];
132

133

134 /*
135 for(int j=0;j<glodim2;j++)
136 {
137 for(int i=0;i<glodim2;i++)
138 {
139 Aglobal[i][j]=0.0;
140 }
141 }
142 */
143

144 for (n=0; n<HEIGHT; n++)


145 for (m=0; m<WIDTH; m++)
146 {
147 Aglobal[n][m]=0.0;
148 }
149

150 for(int k=0; k<glodim ; k++)


151 {
152 //printf("i=%d\n",k);
153 rhsglobal[k]=0.0;
154 // printf("i=%d,rhsglobal=%f \n",k,rhsglobal[k]);
155 }
156

157

158

159 Amat[0][0]=0.0;
160 Amat[0][1]=0.0;
161 Amat[0][2]=0.0;
162 Amat[1][0]=0.0;
163 Amat[1][1]=4.0;
164 Amat[1][2]=0.0;
165 Amat[2][0]=0.0;
166 Amat[2][1]=0.0;
167 Amat[2][2]=(16.0/3.0);
168

169 for(int j=0;j<mz;j++)


170 {
171 for(int i=0;i<mz;i++)
172 {
173 Amat[i][j]=nel*Amat[i][j];
174 }

175 }
176

177 Bmat[0][0]=penal;
178 Bmat[0][1]=1.0-penal;
179 Bmat[0][2]=-2.0+penal;
180 Bmat[1][0]=-ss-penal;
181 Bmat[1][1]=-1.0+ss+penal;
182 Bmat[1][2]=2.0-ss-penal;
183 Bmat[2][0]=2.0*ss+penal;
184 Bmat[2][1]=1.0-2.0*ss-penal;
185 Bmat[2][2]=-2.0+2.0*ss+penal;
186

187

188 for(int i=0;i<mz;i++)


189 {
190 for(int j=0;j<mz;j++)
191 {
192 Bmat[i][j]=nel*Bmat[i][j];
193 }
194 }
195

196 Cmat[0][0]=penal;
197 Cmat[0][1]=-1+penal;
198 Cmat[0][2]=-2+penal;
199 Cmat[1][0]=ss+penal;
200 Cmat[1][1]=-1+ss+penal;
201 Cmat[1][2]=-2+ss+penal;
202 Cmat[2][0]=2*ss+penal;
203 Cmat[2][1]=-1+2*ss+penal;
204 Cmat[2][2]=-2+2*ss+penal;
205

206 for(int i=0;i<mz;i++)


207 {
208 for(int j=0;j<mz;j++)
209 {
210 Cmat[i][j]=nel*Cmat[i][j];
211 }
212 }
213

214 Dmat[0][0]=-penal;
215 Dmat[0][1]=-1+penal;
216 Dmat[0][2]=2-penal;
217 Dmat[1][0]=-ss-penal;
218 Dmat[1][1]=-1+ss+penal;
219 Dmat[1][2]=2-ss-penal;

220 Dmat[2][0]=-2*ss-penal;
221 Dmat[2][1]=-1+2*ss+penal;
222 Dmat[2][2]=2-2*ss-penal;
223

224 for(int i=0;i<mz;i++)


225 {
226 for(int j=0;j<mz;j++)
227 {
228 Dmat[i][j]=nel*Dmat[i][j];
229 }
230 }
231

232 Emat[0][0]=-penal;
233 Emat[0][1]=1-penal;
234 Emat[0][2]=2-penal;
235 Emat[1][0]=ss+penal;
236 Emat[1][1]=-1+ss+penal;
237 Emat[1][2]=-2+ss+penal;
238 Emat[2][0]=-2*ss-penal;
239 Emat[2][1]=1-2*ss-penal;
240 Emat[2][2]=2-2*ss-penal;
241

242 for(int i=0;i<mz;i++)


243 {
244 for(int j=0;j<mz;j++)
245 {
246 Emat[i][j]=nel*Emat[i][j];
247 }
248 }
249

250 F0mat[0][0]=penal;
251 F0mat[0][1]=2-penal;
252 F0mat[0][2]=-4+penal;
253 F0mat[1][0]=-2*ss-penal;
254 F0mat[1][1]=-2+2*ss+penal;
255 F0mat[1][2]=4-2*ss-penal;
256 F0mat[2][0]=4*ss+penal;
257 F0mat[2][1]=2-4*ss-penal;
258 F0mat[2][2]=-4+4*ss+penal;
259

260 for(int i=0;i<mz;i++)


261 {
262 for(int j=0;j<mz;j++)
263 {
264 F0mat[i][j]=nel*F0mat[i][j];

265 }
266 }
267

268 FNmat[0][0]=penal;
269 FNmat[0][1]=-2+penal;
270 FNmat[0][2]=-4+penal;
271 FNmat[1][0]=2*ss+penal;
272 FNmat[1][1]=-2+2*ss+penal;
273 FNmat[1][2]=-4+2*ss+penal;
274 FNmat[2][0]=4*ss+penal;
275 FNmat[2][1]=-2+4*ss+penal;
276 FNmat[2][2]=-4+4*ss+penal;
277

278 for(int i=0;i<mz;i++)


279 {
280 for(int j=0;j<mz;j++)
281 {
282 FNmat[i][j]=nel*FNmat[i][j];
283 }
284 }
285

286 //Gauss quadrature weights and points


287 float wg[2],sg[2];
288 wg[0] = 1.0;
289 wg[1] = 1.0;
290 sg[0] = -0.577350269189;
291 sg[1] = 0.577350269189;
292

293 //first block row


294 for(int ii=0;ii<mz;ii++)
295 {
296 for (int jj=0;jj<mz;jj++)
297 {
298 Aglobal[ii][jj]=Aglobal[ii][jj]+Amat[ii][jj]+F0mat[ii][jj]+Cmat[ii][jj];
299 je=mz+jj;
300 Aglobal[ii][je]=Aglobal[ii][je]+Dmat[ii][jj];
301 }
302 }
303

304 //compute right-hand side


305 rhsglobal[0]=nel*penal;
306 rhsglobal[1]=nel*penal*(-1.0)-ss*2.0*nel;
307 rhsglobal[2]=nel*penal+ss*4*nel;
308

309 for(int ig=0;ig<2;ig++)
310 {
311 rhsglobal[0]=rhsglobal[0]+wg[ig]*sourcef((sg[ig]+1)/(2*nel))/(2*nel);
312 rhsglobal[1]=rhsglobal[1]+wg[ig]*sg[ig]*sourcef((sg[ig]+1)/(2*nel))/(2*nel);
313 rhsglobal[2]=rhsglobal[2]+wg[ig]*sg[ig]*sg[ig]*sourcef((sg[ig]+1)/(2*nel))/(2*nel);
314

315 }
316

317 for (int i=1;i<nel-1;i++)


318 {
319 for(int ii=0;ii<mz;ii++)
320 {
321 ie=ii+(i)*mz;
322 for(int jj=0;jj<mz;jj++)
323 {
324 je=jj+(i)*mz;
325 Aglobal[ie][je]=Aglobal[ie][je]+Amat[ii][jj]+Bmat[ii][jj]+Cmat[ii][jj];
326

327 je=jj+(i-1)*mz;
328 Aglobal[ie][je]=Aglobal[ie][je]+Emat[ii][jj];
329

330 je=jj+(i+1)*mz;
331 Aglobal[ie][je]=Aglobal[ie][je]+Dmat[ii][jj];
332 }
333

334 //compute right-hand side


335 for(int ig=0;ig<2;ig++)
336 {
337

338 float a=pow(sg[ig],ii);


339 float b=sourcef((sg[ig]+2*(i)+1.0)/(2*nel))/(2*nel);
340 rhsglobal[ie]=rhsglobal[ie]+wg[ig]*a*b;
341 }
342 }
343 }
344

345

346 for(int ii=0;ii<mz;ii++)


347 {
348 ie=ii+(nel-1)*mz;
349 for(int jj=0;jj<mz;jj++)
350 {
351 je=jj+(nel-1)*mz;

352 Aglobal[ie][je]=Aglobal[ie][je]+Amat[ii][jj]+FNmat[ii][jj]+Bmat[ii][jj];
353 je=jj+(nel-2)*mz;
354 Aglobal[ie][je]=Aglobal[ie][je]+Emat[ii][jj];
355 }
356 for(int ig=0;ig<2;ig++)
357 {
358 float c=(pow(sg[ig],(ii)));
359 float d=sourcef((sg[ig]+2*(nel-1)+1.0)/(2*nel))/(2.0*nel);
360 rhsglobal[ie]=rhsglobal[ie]+wg[ig]*c*d;
361 }
362 }
363

364 // for(int i=0;i<glodim;i++)


365 // {
366 // printf("\n");
367 // for(int j=0;j<glodim;j++)
368 // {
369 // printf(" %f ",Aglobal[i][j]);
370 // }
371 // }
372

373 //for(int i=0;i<glodim;i++)


374 // {
375 // printf(" %f \n",rhsglobal[i]);
376 // }
377

378

379

380 //////////////////// end of assembly code adapted from Riviere [24] ////////////////////


381

382 // convert the dense matrix Aglobal to CSR format (see line 443)


383

384

385 A = (float *)malloc(sizeof(float)*((glodim)*(glodim)));


386

387 for(int i=0;i<glodim;i++)


388 {
389 // printf("\n");
390 for(int j=0;j<glodim;j++)
391 {
392 A[i*glodim+j]=Aglobal[i][j];
393 // printf("A[%d]=%f \n",i*glodim+j,A[i*glodim+j]);
394 }
395 }

396

397

398 // Generate a random tridiagonal symmetric matrix in CSR format


399 //M = N = 1048576;
400 M=N=glodim;
401 // nz = (N-2)*3 + 4;
402 //printf("nz original=%d\n",nz);
403 I = (int *)malloc(sizeof(int)*(N+1));
404 J = (int *)malloc(sizeof(int)*nz);
405 val = (float *)malloc(sizeof(float)*nz);
406 //genTridiag(I, J, val, N, nz);
407

408 x = (float *)malloc(sizeof(float)*N);


409 rhs = (float *)malloc(sizeof(float)*N);
410

411

412 rhs = (float *)malloc(sizeof(float)*((N)));


413 for(int i=0;i<N;i++)
414 {
415 rhs[i]=rhsglobal[i];
416 // printf(" rhs[%d]=%f \n",i,rhs[i]);
417 }
418

419

420 for (int i = 0; i < N; i++)


421 {
422 x[i] = 0.0;
423 }
424

425

426 // for (int i = 0; i < N; i++)


427 // {
428 // rhs[i] = 1.0;
429 // x[i] = 0.0;
430 // }
431

432 // Get handle to the CUBLAS context


433 cublasHandle_t cublasHandle = 0;
434 cublasStatus_t cublasStatus;
435 cublasStatus = cublasCreate(&cublasHandle);
436

437 checkCudaErrors(cublasStatus);
438

439 // Get handle to the CUSPARSE context


440 cusparseHandle_t cusparseHandle = 0;

441 cusparseStatus_t cusparseStatus;
442 cusparseStatus = cusparseCreate(&cusparseHandle);
443

444 checkCudaErrors(cusparseStatus);
445

446 cusparseMatDescr_t descr = 0;


447 cusparseStatus = cusparseCreateMatDescr(&descr);
448

449 checkCudaErrors(cusparseStatus);
450

451 cusparseSetMatType(descr,CUSPARSE_MATRIX_TYPE_GENERAL);
452 cusparseSetMatIndexBase(descr,CUSPARSE_INDEX_BASE_ZERO);
453

454 // checkCudaErrors(cudaMalloc((void **)&d_col, nz*sizeof(int)));


455 // checkCudaErrors(cudaMalloc((void **)&d_row, (N+1)*sizeof(int)));
456 // checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
457

458 checkCudaErrors(cudaMalloc((void **)&d_x, N*sizeof(float)));


459 checkCudaErrors(cudaMalloc((void **)&d_r, N*sizeof(float)));
460 checkCudaErrors(cudaMalloc((void **)&d_p, N*sizeof(float)));
461 checkCudaErrors(cudaMalloc((void **)&d_Ax, N*sizeof(float)));
462

463 //allocate memory for function dense2csr


464 CHECK(cudaMalloc((void **)&dA, sizeof(float) * M * N));
465 CHECK(cudaMalloc((void **)&dNnzPerRow, sizeof(int) * M));
466 // transfer the dense matrix A to the device array dA
467

468 CHECK(cudaMemcpy(d_x, x, sizeof(float) * N, cudaMemcpyHostToDevice));


469 CHECK(cudaMemcpy(d_r, rhs, sizeof(float) * M, cudaMemcpyHostToDevice));
470 CHECK(cudaMemcpy(dA, A, sizeof(float) * M * N, cudaMemcpyHostToDevice));
471

472 // Compute the number of non-zero elements in A


473 CHECK_CUSPARSE(cusparseSnnz(cusparseHandle, CUSPARSE_DIRECTION_ROW, M, N, descr, dA, M, dNnzPerRow, &totalNnz));
474 // this part checks that the results returned by cusparseSnnz are correct
475 // printf("totalNnz=%d\n",totalNnz);
476 nnzh = (int *)malloc(sizeof(int)*N);
477 cudaMemcpy(nnzh,dNnzPerRow, sizeof(int)*N, cudaMemcpyDeviceToHost);
478

479 // for( int i=0 ; i<N ;i++)


480 // {
481 // printf("nnzh[%d]=%d\n",i,nnzh[i]);
482 // }
483

484 // Allocate device memory to store the sparse CSR representation of A

485 CHECK(cudaMalloc((void **)&d_val, sizeof(float) * totalNnz));
486 CHECK(cudaMalloc((void **)&d_row, sizeof(int) * (M + 1)));
487 CHECK(cudaMalloc((void **)&d_col, sizeof(int) * totalNnz));
488 val = (float *)malloc(sizeof(float)*totalNnz);
489 I = (int *)malloc(sizeof(int)*(M + 1));
490 J = (int *)malloc(sizeof(int)*totalNnz);
491

492

493 // Convert A from dense format to CSR format using the GPU
494 CHECK_CUSPARSE(cusparseSdense2csr(cusparseHandle, M, N, descr, dA, M, dNnzPerRow,
495 d_val, d_row, d_col));
496

497

498 nz=totalNnz;
499

500 // cudaMemcpy(d_col, J, nz*sizeof(int), cudaMemcpyHostToDevice);


501 // cudaMemcpy(d_row, I, (N+1)*sizeof(int), cudaMemcpyHostToDevice);
502 // cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);
503

504

505

506

507 cudaMemcpy(val,d_val, sizeof(float) * totalNnz, cudaMemcpyDeviceToHost);


508 cudaMemcpy(I,d_row, sizeof(float) * (N+1), cudaMemcpyDeviceToHost);
509 cudaMemcpy(J,d_col , sizeof(float) * totalNnz, cudaMemcpyDeviceToHost);
510

511 // for( int i=0 ; i<totalNnz ;i++)


512 // {
513 // printf("val[%d]=%f\n",i,val[i]);
514 // }
515 // for( int i=0 ; i<N+1 ;i++)
516 // {
517 // printf("I[%d]=%d\n",i,I[i]);
518 // }
519 // for( int i=0 ; i<totalNnz ;i++)
520 // {
521 // printf("J[%d]=%d\n",i,J[i]);
522 // }
523

524

525

526 cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);


527 cudaMemcpy(d_r, rhs, N*sizeof(float), cudaMemcpyHostToDevice);
528

529

530

531 alpha = 1.0;


532 alpham1 = -1.0;
533 beta = 0.0;
534 r0 = 0.;
535

536 cusparseScsrmv(cusparseHandle,CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &alpha, descr, d_val, d_row, d_col, d_x, &beta, d_Ax);
537

538 cublasSaxpy(cublasHandle, N, &alpham1, d_Ax, 1, d_r, 1);


539 cublasStatus = cublasSdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
540

541 k = 1;
542

543 while (r1 > tol*tol && k <= max_iter)


544 {
545 if (k > 1)
546 {
547 b = r1 / r0;
548 cublasStatus = cublasSscal(cublasHandle, N, &b, d_p, 1);
549 cublasStatus = cublasSaxpy(cublasHandle, N, &alpha, d_r, 1, d_p, 1);
550 }
551 else
552 {
553 cublasStatus = cublasScopy(cublasHandle, N, d_r, 1, d_p, 1);
554 }
555

556 cusparseScsrmv(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nz, &alpha, descr, d_val, d_row, d_col, d_p, &beta, d_Ax);
557 cublasStatus = cublasSdot(cublasHandle, N, d_p, 1, d_Ax, 1, &dot);
558 a = r1 / dot;
559

560 cublasStatus = cublasSaxpy(cublasHandle, N, &a, d_p, 1, d_x, 1);


561 na = -a;
562 cublasStatus = cublasSaxpy(cublasHandle, N, &na, d_Ax, 1, d_r, 1);
563

564 r0 = r1;
565 cublasStatus = cublasSdot(cublasHandle, N, d_r, 1, d_r, 1, &r1);
566 cudaThreadSynchronize();
567 // printf("iteration = %3d, residual = %e\n", k, sqrt(r1));
568 k++;
569 }
570

571 cudaMemcpy(x, d_x, N*sizeof(float), cudaMemcpyDeviceToHost);

572

573 // for(int i=0;i<N;i++)


574 // {
575 // printf("x[%d]=%f\n",i,x[i]);
576 // }
577

578 float rsum, diff, err = 0.0;


579

580 for (int i = 0; i < N; i++)


581 {
582 rsum = 0.0;
583

584 for (int j = I[i]; j < I[i+1]; j++)


585 {
586 rsum += val[j]*x[J[j]];
587 }
588

589 diff = fabs(rsum - rhs[i]);


590

591 if (diff > err)


592 {
593 err = diff;
594 }
595 }
596

597 cusparseDestroy(cusparseHandle);
598 cublasDestroy(cublasHandle);
599

600 free(I);
601 free(J);
602 free(val);
603 free(x);
604 free(rhs);
605 cudaFree(d_col);
606 cudaFree(d_row);
607 cudaFree(d_val);
608 cudaFree(d_x);
609 cudaFree(d_r);
610 cudaFree(d_p);
611 cudaFree(d_Ax);
612

613 // cudaDeviceReset causes the driver to clean up all state. While


614 // not mandatory in normal operation, it is good practice. It is also
615 // needed to ensure correct operation when the application is being
616 // profiled. Calling cudaDeviceReset causes all profile data to be

617 // flushed before the application exits
618 cudaDeviceReset();
619

620 printf("Test Summary: Error amount = %f\n", err);


621 exit((k <= max_iter) ? 0 : 1);
622

623 }

C Matlab Code

This code was taken from [24] and modified from line 132 to line 327 so that its output can be
compared with the analytical solution and with the CUDA code.
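A minimal usage sketch is given below for orientation; the values chosen for nel, ss and penal are illustrative assumptions only (they are not the settings used in the benchmark runs), and the analytical solution u(x) = (1-x)exp(-x^2) is the one used throughout this work.

nel   = 10;    % number of elements (assumed value)
ss    = -1;    % symmetrization parameter (assumed value)
penal = 10;    % penalty parameter (assumed value)
[ysol,ul,ur,xl,xr] = DGsimplesolve3(nel,ss,penal);
% plot the left/right trace values of the DG solution on each element
% together with the analytical solution
x = linspace(0,1,200);
plot(x,(1-x).*exp(-x.^2),'b',xl,ul,'bo',xr,ur,'r*');

The full listing follows.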

1 function [ysol,ul,ur,xl,xr]=DGsimplesolve3(nel,ss,penal)
2 %function [x0,fl0,rr0,it0,rv0,S,rhsglobal] = DGsimplesolve3(nel,ss,penal)
3 %function DGsimplesolve(nel,ss,penal)
4

5 format long
6

7 Amat = (nel)*[0 0 0;0 4 0;0 0 16/3];


8 Bmat = (nel)*[penal 1-penal -2+penal; -ss-penal -1+ss+penal 2-ss-penal; 2*ss+penal 1-2*ss-penal -2+2*ss+penal];
9 Cmat = (nel)*[penal -1+penal -2+penal; ss+penal -1+ss+penal -2+ss+penal; 2*ss+penal -1+2*ss+penal -2+2*ss+penal];
10 Dmat = (nel)*[-penal -1+penal 2-penal; -ss-penal -1+ss+penal 2-ss-penal; -2*ss-penal -1+2*ss+penal 2-2*ss-penal];
11 Emat = (nel)*[-penal 1-penal 2-penal; ss+penal -1+ss+penal -2+ss+penal; -2*ss-penal 1-2*ss-penal 2-2*ss-penal];
12 F0mat =(nel)*[penal 2-penal -4+penal; -2*ss-penal -2+2*ss+penal 4-2*ss-penal; 4*ss+penal 2-4*ss-penal -4+4*ss+penal];
13 FNmat =(nel)*[penal -2+penal -4+penal; 2*ss+penal -2+2*ss+penal -4+2*ss+penal; 4*ss+penal -2+4*ss+penal -4+4*ss+penal];
14 % dimension of local matrices
15 locdim = 3;
16 % dimension of global matrix
17 glodim = nel * locdim;
18 %number of subintervals
19 n=glodim;
20 % initialize to zero matrix and right-hand side vector
21 Aglobal = zeros(glodim,glodim);
22 rhsglobal = zeros(glodim,1);
23 % Gauss quadrature weights and points
24 wg(1) = 1.0;

25 wg(2) = 1.0;
26 sg(1) = -0.577350269189;
27 sg(2) = 0.577350269189;
28 % assemble global matrix and right-hand side
29 % first block row
30 for ii=1:locdim
31 for jj=1:locdim
32 %fprintf('\n');
33 %fprintf('%d , %d',ii,jj);
34 Aglobal(ii,jj) = Aglobal(ii,jj)+Amat(ii,jj)+F0mat(ii,jj)+Cmat(ii,jj);
35 je = locdim+jj ;
36 %fprintf('%d , %d',ii,je);
37 Aglobal(ii,je) = Aglobal(ii,je)+Dmat(ii,jj);
38 end; %jj
39 end; %ii
40

41 %
42 %%for ii=1:locdim
43 %% for jj=1:locdim
44 %% fprintf('%6.2f ',Aglobal(ii,jj));
45 %% end; %jj
46 %% end; %ii
47

48

49 % compute right-hand side


50 rhsglobal(1) = nel*penal;
51 rhsglobal(2) = nel*penal*(-1) - ss*2*nel;
52 rhsglobal(3) = nel*penal+ss*4*nel;
53

54 for ig=1:2
55 rhsglobal(1) = rhsglobal(1) + wg(ig)*sourcef((sg(ig)+1)/(2*nel))/(2*nel);
56 rhsglobal(2) = rhsglobal(2) + wg(ig)*sg(ig)*sourcef((sg(ig)+1)/(2*nel))/(2*nel);
57 rhsglobal(3) = rhsglobal(3) + wg(ig)*sg(ig)*sg(ig)*sourcef((sg(ig)+1)/(2*nel))/(2*nel);
58 end; %ig
59 rhsglobal(1);
60 rhsglobal(2);
61 rhsglobal(3);
62

63 % intermediate block rows


64 % loop over elements
65 for i=2:(nel-1)
66 for ii=1:locdim
67 ie = ii+(i-1)*locdim;

68 for jj=1:locdim
69 fprintf('\n');
70 je = jj+(i-1)*locdim;
71 %fprintf('%d , %d',ie,je);
72 Aglobal(ie,je) = Aglobal(ie,je)+Amat(ii,jj)+Bmat(ii,jj)+Cmat(ii,jj);
73 %fprintf(' %d, %f ',je,Aglobal);
74 %Aglobal(ie,je)
75 je = jj+(i-2)*locdim;
76 Aglobal(ie,je) = Aglobal(ie,je)+Emat(ii,jj);
77 je = jj+(i)*locdim;
78 %fprintf(’%d , %d’,ie,je);
79 Aglobal(ie,je)=Aglobal(ie,je)+Dmat(ii,jj);
80 % fprintf('%d , %d, %f ',ie,je,Aglobal(ie,je));
81 end; %jj
82 % compute right-hand side
83 for ig=1:2
84 rhsglobal(ie) = rhsglobal(ie)+wg(ig)*(sg(ig)^(ii-1))*sourcef((sg(ig)+2*(i-1)+1.0)/(2*nel))/(2*nel);
85 % fprintf('\n');
86 %fprintf('ig= %d, sg(ig)= %f,i= %d nel=%d, 2*nel= %d, b=%f',ig,sg(ig),i,nel,2*nel, sourcef((sg(ig)+2*(i-1)+1.0)/(2*nel))/(2*nel));
87 % fprintf(' %d, %d, %f, %f, %f, %f, %f ', ie, ii, wg(ig), sg(ig) ,(sg(ig)^(ii-1)),sourcef((sg(ig)+2*(i-1)+1.0)/(2*nel))/(2*nel), rhsglobal(ie));
88 end; %
89 end; %ii
90 end; %i
91 %fprintf('s=%f',sourcef(1.0));
92

93 % last block row


94 for ii=1:locdim
95 ie = ii+(nel-1)*locdim;
96 for jj=1:locdim
97 je = jj+(nel-1)*locdim;
98 %if((ie==12)&&(je==12))
99 % fprintf('ie=%d, je=%d, Aglobal=%f, ii=%d,jj=%d, Amat=%f, FNmat=%f, Bmat=%f',ie, je, Aglobal(ie,je),ii,jj,Amat(ii,jj),FNmat(ii,jj),Bmat(ii,jj));
100 %end;
101 Aglobal(ie,je) = Aglobal(ie,je)+Amat(ii,jj)+FNmat(ii,jj)+Bmat(ii,jj);
102 je = jj+(nel-2)*locdim;
103 Aglobal(ie,je) = Aglobal(ie,je)+Emat(ii,jj);
104 % fprintf('%f', Aglobal(ie,je));
105 % fprintf('\n');
106 end; %jj
107

108 % compute right-hand side
109 for ig=1:2
110 % fprintf('\n');
111 rhsglobal(ie) = rhsglobal(ie)+wg(ig)*(sg(ig)^(ii-1))*sourcef((sg(ig)+2*(nel-1)+1.0)/(2*nel))/(2*nel);
112 %fprintf('ig=%d, sg(ig)= %f, ii-1= %d c=%f, d=%f ',ig,sg(ig),ii-1,sg(ig)^(ii-1),sourcef((sg(ig)+2*(nel-1)+1.0)/(2*nel))/(2*nel));
113 % fprintf('%f \n',rhsglobal(ie));
114 end; %ig
115 end; %ii
116

117

118 % fprintf('Aglobal(11,11)=%f',Aglobal(12,12));
119 % for i=1:glodim
120 % fprintf('\n');
121 % for j=1:glodim
122 % fprintf(' %f ',Aglobal(i,j));
123 % end;
124 % end;
125

126 % for i=1:glodim


127 % fprintf('\n');
128 % fprintf(' %f \n',rhsglobal(i));
129 % end;
130

131

132 n1=length(rhsglobal);
133 M1 = spdiags((1:n1)’,0,n1,n1);
134

135

136 tol = 1e-10;


137 maxit = 100000;
138

139 %pcg(A,b1,tol,maxit,M1);
140 % solve linear system
141 S = sparse(Aglobal);
142 %ysol2 = cgs(Aglobal,rhsglobal,tol,maxit);
143 %ysol3 = cgs(S,rhsglobal,tol,maxit);
144

145 x1 = pcg(S,rhsglobal,tol,maxit);
146

147 %x1 = pcg(S,rhsglobal,tol,maxit,M1)


148

149 ysol = Aglobal\rhsglobal;


150

151 %[x0,fl0,rr0,it0,rv0] = pcg(Aglobal,rhsglobal,1e-8,100);
152

153

154 x=linspace(0,1,nel*3);
155 yanal=(1-x).*exp(-x.*x);
156

157 %plot(x,ysol,x,yanal)
158

159 %plot(x,ysol)
160

161

162 j=1;
163

164 %fprintf('xm(32)=%f',xm(32));
165 % for i=1:31
166 % fprintf('\n');
167 % fprintf(' %f \n',xm(i));
168 % end;
169

170

171 %fileID = fopen('CudaOut.txt','r');


172

173 %A = fscanf(fileID,'%f');
174

175 %for i=1:nel


176 % ul(i)=A(j)-A(j+1)+A(j+2);
177 % ur(i)=A(j)+A(j+1)+A(j+2);
178 % j=j+3;
179 % xl(i)=(i-1)/nel;
180 % xr(i)=(i)/nel;
181 % plot(xl,ul,xl,ur)
182 %end;
183

184

185

186 for i=1:nel


187 ul(i)=ysol(j)-ysol(j+1)+ysol(j+2);
188 ur(i)=ysol(j)+ysol(j+1)+ysol(j+2);
189 j=j+3;
190 xl(i)=(i-1)/nel;
191 xr(i)=(i)/nel;
192 % plot(xl,ul,xl,ur)
193 end;
194

195

196

197 plot(xl(1),ul(1),'bo',xr(1),ur(1),'r*')
198 hold on
199 plot(xl(2),ul(2),'bo',xr(2),ur(2),'r*')
200 hold on
201 plot(xl(3),ul(3),'bo',xr(3),ur(3),'r*')
202 hold on
203 plot(xl(4),ul(4),'bo',xr(4),ur(4),'r*')
204 hold on
205 plot(xl(5),ul(5),'bo',xr(5),ur(5),'r*')
206 hold on
207 plot(xl(6),ul(6),'bo',xr(6),ur(6),'r*')
208 hold on
209 plot(xl(7),ul(7),'bo',xr(7),ur(7),'r*')
210 hold on
211 plot(xl(8),ul(8),'bo',xr(8),ur(8),'r*')
212 hold on
213 plot(xl(9),ul(9),'bo',xr(9),ur(9),'r*')
214 hold on
215 plot(xl(10),ul(10),'bo',xr(10),ur(10),'r*')
216 hold on
217 %plot(x,yanal)
218 %hold on
219

220 xx(1)=xl(1);
221 xx(2)=xr(1);
222 xx(3)=xl(2);
223 xx(4)=xr(2);
224 xx(5)=xl(3);
225 xx(6)=xr(3);
226 xx(7)=xl(4);
227 xx(8)=xr(4);
228 xx(9)=xl(5);
229 xx(10)=xr(5);
230 xx(11)=xl(6);
231 xx(12)=xr(6);
232 xx(13)=xl(7);
233 xx(14)=xr(7);
234 xx(15)=xl(8);
235 xx(16)=xr(8);
236 xx(17)=xl(9);
237 xx(18)=xr(9);
238 xx(19)=xl(10);
239 xx(20)=xr(10);
240

241 ll(1)=ul(1);
242 ll(2)=ur(1);
243 ll(3)=ul(2);
244 ll(4)=ur(2);
245 ll(5)=ul(3);
246 ll(6)=ur(3);
247 ll(7)=ul(4);
248 ll(8)=ur(4);
249 ll(9)=ul(5);
250 ll(10)=ur(5);
251 ll(11)=ul(6);
252 ll(12)=ur(6);
253 ll(13)=ul(7);
254 ll(14)=ur(7);
255 ll(15)=ul(8);
256 ll(16)=ur(8);
257 ll(17)=ul(9);
258 ll(18)=ur(9);
259 ll(19)=ul(10);
260 ll(20)=ur(10);
261

262 %plot(x,yanal,'b')
263 %hold on
264 %plot(xx,ll,'r*')
265 %hold on
266

267

268 %xr=linspace(0,1,nel);
269 %plot(x,yanal,'b',xr,ur,'r',xl,ul,'g','linewidth',2)
270 %plot(x,ysol,'r',x,yanal,'b')
271 %plot(xr,ur)
272

273

274 %j=1;
275

276 %for i=1:nel


277 % PC(j)=P1(x(i));
278 % PC(j+1)=P2(x(i),nel,n);
279 % PC(j+2)=P3(x(i),nel,n);
280 % j=j+3;
281 % end;
282

283 %j=1;
284

285

286

287

288 %xl=linspace(0,1,nel);
289 %plot(xl,ul,'r',x,yanal,'b')
290 %plot(x,ysol,'r',x,yanal,'b')
291

292

293 %k=3;
294 %j=1;
295 %for i=1:nel-1
296 % ur(i)=ysol(k)*PC(j)+ysol(k+1)*PC(j+1)+ysol(k+2)*PC(j+2);
297 % j=j+3;
298 % k=k+3;
299 %end;
300

301 %xr=linspace(0,1,nel-1);
302 %plot(xr,ur,'r',x,yanal,'b')
303

304 %for i=1:nel-1


305 % um(i)=(ul(i)-ur(i))/2;
306 % end;
307 %plot(xr,um,'r',x,yanal,'b')
308 %plot(xr,um)
309 %plot(xm,u)
310

311

312 return;
313

314

315 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
316 function yval = sourcef(xval)
317 % source function for exact solution u(x)=(1-x)*exp(-x^2)
318 yval = -(2*xval-2*(1-2*xval)+4*xval*(xval-xval^2))*exp(-xval*xval);
319 return;
320

321 function pol1=P1(x)


322 pol1=1;
323 return;
324

325 function pol2=P2(x,nel,n)


326 h=1/nel;
327 pol2=(2/h)*(x-(n+0.5)*h);
328 return;
329

330

331 function pol3=P3(x,nel,n)
332 h=1/nel;
333 %fprintf('h=%f\n',h);
334 %fprintf('4/(h*h)=%f\n',4/(h*h));
335 %fprintf('((x-(n+0.5)*h)^2)=%f\n',((x-(n+0.5)*h)^2));
336 pol3=(4/(h*h))*((x-(n+0.5)*h)^2);
337 return;
338

339

340

341

342

343

344

345

346

D Error Code
1 function [ l2norm ] = errorv2( nq,uh,nOn,a,b)
2 %ERRORV2 Computes the L2 norm of the error ||u-uh||
3 % See the detailed description below.
4

5 %%--------------------------------------------------------------------
6 % This function computes the L2 norm of the error
7 % || u-uh ||_{ L2(0,L) } given an analytical solution u
8 % of the Poisson equation in 1d and a mesh size h
9 %
10 %---------------------------------------------------------------------
11 % nOn number of nodes
12 % u is the analytical solution function
13 % L length of the domain.
14 % nq number of quadrature points needed to integrate (u-uh)^{2} exactly;
15 % since (u-uh)^{2} is a polynomial of degree 4 and nq Gauss points are
16 % exact up to degree 2*nq-1, nq=3 is enough (2*3-1=5 >= 4).
17 % uh finite element approximation
18

19 nOe=nOn-1; % number of elements


20 f=@(x) 0.5 * x * ( (b-a) - x ); % analytical solution
21

22 x=linspace(a,b,nOe+1); % the 1D domain divided into nOe elements


23 xs=0.0; % x start of element
24 xe=0.0; % x end of element

25 us=0.0; % u start
26 ue=0.0; % u end
27 J=0.0; % Jacobian
28

29

30 Integrals=zeros(1,nOe); % We set a vector for the integral of each element


31

32 for i=1:nOe
33 xs=x(i);
34 xe=x(i+1);
35 us=uh(i);
36 ue=uh(i+1);
37 J=(xe-xs)*0.5;
38 sum=0.0;
39 x_map=zeros(1,nq);
40 f_map=zeros(1,nq);
41 [eta,w]=lgwt(nq,-1,1);
42 u_map=zeros(1,nq);
43 N1=0.0;
44 N2=0.0;
45 error_map = 0.0;
46

47 for j=1:nq
48

49 N1=( 1.0 + eta(j) )*0.5;


50 N2=( 1.0 - eta(j) )*0.5;
51

52 x_map(j)=N1 * xs + N2 * xe;
53 u_map(j)=N1 * us + N2 * ue;
54

55 error_map= ( f(x_map(j)) - u_map(j) )*( f(x_map(j)) - u_map(j) ); % (u-uh)^2
56

57 sum= error_map * w ( j ) * J + sum;


58

59 end
60

61 Integrals(i)=sum;
62

63

64 end
65

66 I=0;
67 for k=1:nOe
68 I=Integrals(k)+I;

69 end
70

71 l2norm=sqrt(I);

1 function [x,w]=lgwt(N,a,b)
2

3 % lgwt.m
4 %
5 % This script is for computing definite integrals using Legendre-Gauss
6 % Quadrature. Computes the Legendre-Gauss nodes and weights on an interval
7 % [a,b] with truncation order N
8 %
9 % Suppose you have a continuous function f(x) which is defined on [a,b]
10 % which you can evaluate at any x in [a,b]. Simply evaluate it at all of
11 % the values contained in the x vector to obtain a vector f. Then compute
12 % the definite integral using sum(f.*w);
13 %
14 % Written by Greg von Winckel - 02/25/2004
15 N=N-1;
16 N1=N+1; N2=N+2;
17

18 xu=linspace(-1,1,N1)’;
19

20 % Initial guess
21 y=cos((2*(0:N)’+1)*pi/(2*N+2))+(0.27/N1)*sin(pi*xu*N/N2);
22

23 % Legendre-Gauss Vandermonde Matrix


24 L=zeros(N1,N2);
25

26 % Derivative of LGVM
27 Lp=zeros(N1,N2);
28

29 % Compute the zeros of the N+1 Legendre Polynomial


30 % using the recursion relation and the Newton-Raphson method
31

32 y0=2;
33

34 % Iterate until new points are uniformly within epsilon of old points
35 while max(abs(y-y0))>eps
36

37

38 L(:,1)=1;
39 Lp(:,1)=0;

40

41 L(:,2)=y;
42 Lp(:,2)=1;
43

44 for k=2:N1
45 L(:,k+1)=( (2*k-1)*y.*L(:,k)-(k-1)*L(:,k-1) )/k;
46 end
47

48 Lp=(N2)*( L(:,N1)-y.*L(:,N2) )./(1-y.^2);


49

50 y0=y;
51 y=y0-L(:,N2)./Lp;
52

53 end
54

55 % Linear map from[-1,1] to [a,b]


56 x=(a*(1-y)+b*(1+y))/2;
57

58 % Compute the weights


59 w=(b-a)./((1-y.^2).*Lp.^2)*(N2/N1)^2;

References

[1] Nvidia cuda programming guide, v 7.5. NVIDIA Corporation, 2015.

[2] Douglas N. Arnold. An interior penalty finite element method with discontinuous
elements. SIAM Journal on Numerical Analysis, 19(4):742–760, 1982.

[3] Douglas N Arnold, Franco Brezzi, Bernardo Cockburn, and L Donatella Marini. Uni-
fied analysis of discontinuous galerkin methods for elliptic problems. SIAM journal
on numerical analysis, 39(5):1749–1779, 2002.

[4] Garth A Baker. Finite element methods for elliptic equations using nonconforming
elements. Mathematics of Computation, 31(137):45–59, 1977.

[5] Francesco Bassi and Stefano Rebay. A high-order accurate discontinuous finite ele-
ment method for the numerical solution of the compressible navier–stokes equations.
Journal of computational physics, 131(2):267–279, 1997.

[6] Francesco Bassi and Stefano Rebay. A high-order accurate discontinuous finite ele-
ment method for the numerical solution of the compressible navier–stokes equations.
Journal of computational physics, 131(2):267–279, 1997.

[7] Franco Brezzi, Gianmarco Manzini, Donatella Marini, Paola Pietra, and Alessan-
dro Russo. Discontinuous galerkin approximations for elliptic problems. Numerical
Methods for Partial Differential Equations, 16(4):365–378, 2000.

[8] Bernardo Cockburn and Chi-Wang Shu. The Runge-Kutta Discontinuous Galerkin
Method for Conservation Laws V. Journal of Computational Physics, 141(2):199–224,
1998.

[9] Clint Dawson, Shuyu Sun, and Mary F Wheeler. Compatible algorithms for cou-
pled flow and transport. Computer Methods in Applied Mechanics and Engineering,
193(23):2565–2580, 2004.

[10] LM Delves and CA Hall. An implicit matching principle for global element calcula-
tions. IMA Journal of Applied Mathematics, 23(2):223–234, 1979.

[11] James W Demmel. Applied numerical linear algebra. Siam, 1997.

[12] Jim Douglas and Todd Dupont. Interior penalty procedures for elliptic and parabolic
galerkin methods. In Computing methods in applied sciences, pages 207–216.
Springer, 1976.

[13] Rob Farber. CUDA Application Design and Development. Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 1st edition, 2012.

[14] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press,
2012.

[15] Magnus Rudolph Hestenes and Eduard Stiefel. Methods of conjugate gradients for
solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.

[16] Paul Houston. Lecture notes in variational methods. The University of Nottingham,
2014.

[17] Paul Houston, Emmanuil H Georgoulis, and Edward Hall. Adaptivity and a poste-
riori error estimation for dg methods on anisotropic meshes. In Proceedings of the
International Conference on Boundary and Interior Layers (BAIL)-Computational
and Asymptotic Methods, 2006.

[18] Jason Sanders and Edward Kandrot. CUDA by example: An introduction to general-
purpose GPU programming. Addison-Wesley, USA, 2011.

[19] Andreas Klöckner, Timothy Warburton, and Jan S Hesthaven. High-order discontin-
uous galerkin methods by gpu metaprogramming. In GPU Solutions to Multi-scale
Problems in Science and Engineering, pages 353–374. Springer, 2013.

[20] CUDA Nvidia. Cublas library. NVIDIA Corporation, Santa Clara, California, 15,
2015.

[21] CUDA NVIDIA. Cusparse library. NVIDIA Corporation, Santa Clara, California,
2015.

[22] J Tinsley Oden, Ivo Babuška, and Carlos Erik Baumann. A discontinuous hp finite
element method for diffusion problems. Journal of computational physics, 146(2):491–
519, 1998.

[23] Wm H Reed and TR Hill. Triangular mesh methods for the neutron transport equation.
Los Alamos Report LA-UR-73-479, 1973.

[24] Béatrice Rivière. Discontinuous Galerkin methods for solving elliptic and parabolic
equations: theory and implementation. Society for Industrial and Applied Mathe-
matics, 2008.

[25] Béatrice Rivière, Mary F Wheeler, and Vivette Girault. Improved energy estimates
for interior penalty, constrained and discontinuous galerkin methods for elliptic prob-
lems. part i. Computational Geosciences, 3(3-4):337–360, 1999.

[26] Gregory Ruetsch and Massimiliano Fatica. CUDA Fortran for Scientists and Engi-
neers: Best Practices for Efficient CUDA Fortran Programming. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA, 1st edition, 2013.

[27] Endre Süli, Christoph Schwab, and Paul Houston. hp-DGFEM for partial differential
equations with nonnegative characteristic form. Springer, 2000.

[28] Lloyd N Trefethen and David Bau III. Numerical linear algebra. SIAM: Society for Industrial
and Applied Mathematics, 1997.

[29] Kristoffer G Van der Zee. Goal-adaptive discretization of fluid-structure interaction.


TU Delft, Delft University of Technology, 2009.

[30] Greg von Winckel. lgwt.m: Legendre-Gauss quadrature nodes and weights, 2004.

[31] Mary Fanett Wheeler. An elliptic collocation-finite element method with interior
penalties. SIAM Journal on Numerical Analysis, 15(1):152–161, 1978.

