big data
a short introduction
Buzzword
High-dimensional data
Memory-intensive data and/or algorithms
Bioinformatics
Genomics and other "Omics"
Astronomy
Meteorology
Environmental Research
Multiscale physics simulations
Economic and financial simulations
Social Networks
Text Mining
Large Hadron Collider
MapReduce
Distributed File Systems
Parallel File Systems
Distributed Databases
Task Queues
Memory Attached Files
http://www.sdsc.edu/supercomputing/gordon/
(cache consistency, multi-threading, etc.)
(intel 2000)
12-D Hypercube
Rmax: 20 GFLOP/s
Supercomputer: SMP
SMP Machine:
Example: gvs1
shared memory
128 GB RAM
16 cores
threaded programs
bus interconnect
Example: uv3.cos.lrz.de
2000 GB RAM
1120 cores
in R: library(multicore) and inlined code
Supercomputer: MPI
Cluster of machines:
Example: coolMUC
distributed memory
4700 GB RAM
2030 cores
Levels of Parallelism
Node Level (e.g. SuperMUC has approx. 10000 nodes)
each node has 2 sockets
Socket Level
each socket contains 8 cores
Core Level
each core has 16 vector registers
Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)
Pipeline Level (how many simultaneous pipelines)
hyperthreading
Instruction Level (instructions per cycle)
out of order execution, branch prediction
Latencies, scaled by a factor of 10^10 into everyday waiting times:

CPU register     1 ns         10 s                fridge
L2 cache        10 ns        100 s ~ 2 min        microwave
memory          80 ns        800 s ~ 15 min       pizza service
network (IB)   200 ns       2000 s ~ 0.5 h        city mall
GPU (PCIe)  50,000 ns    500,000 s ~ 6 days
Computing MFlop/s
mflops.internal <- function(np) {
  a <- matrix(runif(np^2), np, np)
  b <- matrix(runif(np^2), np, np)
  nflops <- np^2 * (2*np - 1)   # np^2 entries, each costing 2*np-1 flops
  time <- system.time(a %*% b)[[3]]
  nflops / time / 1e6
}
Amdahl's law
Computing time for N processors: T(N) = T(1) * (s + (1-s)/N), where s is the serial fraction of the program
Speedup factor: T(1)/T(N) = 1 / (s + (1-s)/N)
small N: T(1)/T(N) ~ N
large N: T(1)/T(N) -> 1/s
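As a quick sketch, the speedup curve can be computed directly in R (the serial fraction s = 0.1 below is an illustrative assumption):

```r
# Amdahl's law: speedup with N processors for serial fraction s
amdahl <- function(N, s) 1 / (s + (1 - s) / N)

amdahl(1, 0.1)     # 1: a single processor gives no speedup
amdahl(4, 0.1)     # about 3.08, already below the ideal factor 4
amdahl(1e6, 0.1)   # approaches the limit 1/s = 10
```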
Amdahl's Law II
R on the HLRB-II
Hardware @ LRZ
Capacity computing
Special equipment
Backup and Archiving Centre (10 petabyte, more than 6 billion files)
Distributed File Systems
Competence centre (Networks, HPC, IT Management)
Hardware @ LRZ
http://www.lrz.de/services/compute/linux-cluster/overview/
SuperMUC
SuperMIG
8200 cores
SuperMUC
147456 cores
supzero
80 cores
supermuc
16 cores
SGI UV
2080 cores
SGI ICE
512 cores
gvs1...4
64 cores
CoolMUC
4300 cores
lx64ia2
8 cores
login
lx64ia3
8 cores
login
ia64
x86_64
GPU
$HOME
25 GB per group, with backup and snapshots
cd $HOME/.snapshot
$OPT_TMP
temporary scratch space (beware!)
High Watermark Deletion
When the filling of the file system exceeds some limit (typically between 80% and 90%), files will be deleted, starting with the oldest and largest files, until a filling of between 60% and 75% is reached. The precise values may vary.
$PROJECT
project space (max 1TB), no automatic backup, use dsmc
module system@LRZ
http://www.lrz.de/services/software/utilities/modules/
module avail
module list
module load <name>
e.g. module load matlab
module unload <name>
module show <name>
batch system@LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel
#SBATCH --time=00:05:00
. /etc/profile    # load the standard environment (see below)
cd mydir
./myprog.exe      # start executable
echo $JOB_ID
batch system@LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel
sbatch jobfile.sh
squeue -u <userid>
scancel <jobid>      # delete my job
programming:
time
Why scripting?
A scripting language. . .
is discoverable and interactive.
has comprehensive built-in functionality.
manages resources automatically.
is dynamically typed.
works well for gluing lower-level blocks together.
examples: tcl/tk, perl, python, ruby, R, MATLAB
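Taking R as the running example, a few of these properties can be seen directly at the prompt (a minimal sketch):

```r
# dynamically typed: the same name can hold values of different types
x <- 42
x <- "now a string"
class(x)     # "character"

# comprehensive built-in functionality, no imports needed
mean(1:10)   # 5.5

# discoverable and interactive: inspect objects at runtime
exists("x")  # TRUE
args(mean)   # shows the signature of mean
```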
R functions
R can define named and anonymous functions
Define a (named or anonymous) function:
todB <- function(X) {10*log10(X)}
Functions can even return (anonymous) functions
The last value evaluated is the return value
Variables from the calling namespace are visible
All other variables are local unless specified
Variable number of inputs:
myfunc <- function(...) list(...)
Variable names and predefined values
myfunc <- function(a,b=1,c=a*b) c+1
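The points above in one compact sketch (todB and myfunc are from the slide; make.scaler and g are made-up names for illustration):

```r
# named function: the last value evaluated is the return value
todB <- function(X) { 10 * log10(X) }
todB(100)                      # 20

# a function returning an anonymous function (a closure over f)
make.scaler <- function(f) function(x) f * x
double <- make.scaler(2)
double(21)                     # 42

# variable number of inputs via ...
myfunc <- function(...) list(...)
length(myfunc(1, "a", TRUE))   # 3

# predefined values; defaults may refer to earlier arguments
g <- function(a, b = 1, c = a * b) c + 1
g(2)      # c defaults to 2*1, returns 3
g(2, 3)   # c defaults to 2*3, returns 7
```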
Use It!
You can write multi-machine, multi-core, GPGPU-accelerated, client-server based, web-enabled applications using R
Parallel R Packages

foreach          parallel abstraction
pnmath/MKL       parallel intrinsic functions
multicore        SMP programming
snow             Simple Network of Workstations
Rmpi             Message Passing Interface
rgpu, gputools   GPGPU programming
R webservices    client/server webservices
sqldf            SQL server for R
rredis           noSQL server for R
mapReduce        large scale parallelization
Abstraction: the foreach package provides one loop construct with interchangeable backends (doMC, doMPI, doSNOW, doRedis).

Example:
library(doMC)
registerDoMC(cores=5)
foreach(i=1:10) %dopar% sqrt(i)
SMP programming
library(multicore)
send tasks into the background with parallel
wait for completion and gather results with collect
library(multicore)
# spawn two tasks
p1 <- parallel(sum(runif(10000000)))
p2 <- parallel(sum(runif(10000000)))
# gather results, blocking
collect(list(p1,p2))
# gather results, non-blocking
collect(list(p1,p2), wait=FALSE)
library(multicore)
Extension of the apply function family in R
function-function or functional
utilizes SMP:
library(multicore)
doit <- function(x,np) sum(sort(runif(np)))
# single call
system.time( doit(0,10000000) )
# serial loop
system.time( lapply(1:16, doit, 10000000) )
# parallel loop
system.time( mclapply(1:16, doit, 10000000, mc.cores=4) )
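In current R versions the same functionality ships in the base package parallel, so a variant of the loop above runs without installing multicore (a sketch; note that mc.cores > 1 is ignored on Windows):

```r
library(parallel)
doit <- function(x, np) sum(sort(runif(np)))
# parallel loop over 4 tasks using 2 cores
res <- mclapply(1:4, doit, np = 1000000, mc.cores = 2)
length(res)   # 4, one result per task
```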
doMC

# R
> library(foreach)
> library(doMC)
> registerDoMC(cores=4)

%do% (serial):      system 2.652  elapsed 12.002
%dopar% (parallel): system 7.216  elapsed 3.296
multithreading with R

library(foreach)
library(doMC)
registerDoMC()

foreach(i=1:N) %dopar% {
  mmult.f()
}   # thread execution

foreach(i=1:N) %do% {
  mmult.f()
}   # serial execution
Cluster Programming

doSNOW

# R
> library(doSNOW)
> registerDoSNOW(makeSOCKcluster(4))

%do% (serial):     system 0.928  elapsed 16.303
%dopar% (cluster): system 0.000  elapsed 4.865
SNOW with R

library(foreach)
library(doSNOW)
registerDoSNOW(makeSOCKcluster(4))

foreach(i=1:N) %do% {
  mmult.f()
}   # serial execution

foreach(i=1:N) %dopar% {
  mmult.f()
}   # cluster execution
Job Scheduler
noSQL databases
Redis is an open source, advanced key-value store. It is often referred to as a data structure server, since keys can contain strings, hashes, lists, sets, and sorted sets.
http://www.redis.io
doRedis / workers
start redis worker:
> echo "require('doRedis');redisWorker('jobs')" | R
doRedis

# R
> library(doRedis)
> registerDoRedis("jobs")

%do% (serial):   system 0.928  elapsed 16.303
%dopar% (redis): system 0.000  elapsed 4.865
redisConnect()
redisSet('x', runif(5))   # store a value
redisGet('x')
redisClose()              # close connection
redisAuth(pwd)            # simple authentication

redisConnect()
redisLPush('x', 1)
redisLPush('x', 2)
redisLPush('x', 3)
redisLRange('x', 0, 2)
R shell
R gui
math notebook
automatic latex/pdf
vtk
.C("funcname", args...)
.Fortran("test", args...)
.jcall("class", args...)
.Call
.External
dyn.load("lib.so")
.C("fname", args)
.Fortran("fname", args)
C integration

shared object libraries can be used in R out of the box
R arrays are mapped to C pointers.

Example:
  integer    ->  int*
  numeric    ->  double*
  character  ->  char*

use in R:
> dyn.load("test.so")
> .C("test", args)
Fortran 90 Example
program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
dx=v*dt;
dv=-x*dt
x=x+dx;
v=v+dv
end do
print*, " total energy: ",sum(x**2+v**2)
end program
Fortran Compiler
use Intel fortran compiler
$ time ./myprog.exe
R subroutine

subroutine mysub(x,v,nstep)
! simulate harmonic oscillator
integer, parameter :: np=1000000
real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
integer :: i,j, nstep
forall(i=1:np) x(i)=real(i)/np
forall(i=1:np) v(i)=real(i)/np
do j=1,nstep
  dx=v*dt; dv=-x*dt
  x=x+dx; v=v+dv
end do
return
end subroutine
subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j,k
do k=1, np
  forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
end do
return
end subroutine
system.time(
  mmult.f(
    a = matrix(numeric(np*np), np, np),
    b = matrix(numeric(np*np) + 1., np, np),
    c = matrix(numeric(np*np) + 1., np, np)
  )
)
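For comparison, the same product with R's built-in matrix multiply (a sketch; np is kept small here so it runs quickly, and the matrices of ones mirror the Fortran benchmark's inputs):

```r
np <- 200
b <- matrix(1, np, np)   # matrix of ones
c <- matrix(1, np, np)
elapsed <- system.time(a <- b %*% c)[[3]]
a[1, 1]   # equals np: each entry is a dot product of two vectors of ones
```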
Big Memory
[diagram: several nodes, each with its own memory (MEM) and disk, connected over a network]
library(bigmemory)
shared memory regions for several
processes in SMP
file backed arrays for several node over
network file systems
library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
sum(x[1,1:1000])
Part II
Applications
S(t|x) = S0(t)^exp(bx)
   user  system elapsed
 34.635   0.002  34.686

   user  system elapsed
 26.531   0.020  26.676

> max(abs(out$coefs-coefs[,1]))
[1] 1.004459e-07
performing computations in C/Fortran, i.e. optimizing sequential code, often yields significant speed-ups
in principle difficult to program and quite error-prone
C functions for single variables are usually available, and wrappers are usually easy to program
   user  system elapsed
  1.825   0.291   7.215
Exercise:
Problem 2: Example
Normalization of Gene-Expression-Microarrays:
source: Markus Schmidberger (): Parallel Computing for Biological Data, Dissertation
minor speedup
Problem 2: Exercise
Exercise for you:
1. Perform a microarray background correction using serial code (ReadAffy(), bg.correct() in package affy)
2. Use top to observe the memory consumption of the process.
3. Additionally, measure its runtime.
4. Perform the background correction as a distributed-data approach using snow (you can pass a character vector of filenames to ReadAffy() in order to load specific CEL files)
5. Compare memory consumption and runtime to the sequential code
R cannot handle data indices larger than about 2 billion (16 GB of doubles; 4 GB in Windows XP)
modern biological data can comprise several dozen GB (e.g. Next Generation Sequencing)
If the R object representing the data set grows larger than the available RAM, R stops with an error reading "Cannot allocate vector of xx bytes".
Possible solution: R package bigmemory (based on C++ libraries for big data objects)
2 areas of usage:
R-Package bigmemory

Essential functions:
big.matrix(): creates a big matrix (useful if RAM is large enough but several processes have to access the matrix)
filebacked.big.matrix(): creates a file-backed matrix (necessary if main memory is too small)
describe(): creates a descriptor file for an existing (file-backed) big.matrix object
x[i1,i2]: big.matrix objects can be handled in R code like normal matrix objects, i.e. their elements can be accessed using brackets
### write
library(bigmemory)
library(multtest)   # assumed source of the golub data set
data(golub)
setwd('~/tmp/bigmem')
X <- as.matrix(golub[,-1])
# create a filebacked.big.matrix and write data into its elements
z <- filebacked.big.matrix(nrow=30*5000, ncol=ncol(X), type='double',
                           backingfile="magolub.bin", descriptorfile="magolub.desc")
k <- 0
for(i in 1:5000){
  inds <- sample(1:nrow(X), 30)
  z[(1:30)+(k*30),] <- X[inds,]
  k <- k+1
}
# create and save the descriptor file for later usage
desc <- describe(z)
save(desc, file='desc_z.RData')
### read
library(bigmemory)
setwd('~/tmp/bigmem')
# load the descriptor file
load('desc_z.RData')
# attach the big.matrix object using the descriptor file
y <- attach.big.matrix(desc)
# access elements
y[1:10,7]
# read the element in row 5, column 7
b <- y[5,7]
# compute the sum of a submatrix
(sum1 <- sum(y[1:10,5:20]))
bigmemory: exercise
Exercise for you:
Master process:
registerDoRedis(jobqueue, host): connects to the redis server at 'host' and specifies a job queue for the tasks to come
foreach(j=1:n) %dopar% {FUN(j)}: sends subtasks to the redis database
redisFlushAll(): clears the database
removeQueue(): removes a queue from the database
Worker process:
registerDoRedis(jobqueue, host): registers a job queue whose tasks shall be processed
startLocalWorkers(n, jobqueue, host): starts n local worker processes which process the tasks specified in jobqueue (uses multicore)
redisWorker(jobqueue, host): useful in MPI environments
usually users do not get or set the database values directly
typical parallelization as known from the other "do" packages
Worker processes can run on any R-compatible hardware and can connect at any time
[diagram: the doRedis master sends jobs and objects to the redis-server, which distributes them among worker processes (worker 1a ... worker 4z) running on NODE 1 ... NODE 4; the setup is robust, flexible, and dynamic]
doRedis: exercise
redis and doRedis provide high flexibility for performing independent subtasks
worker processes can connect at any time
errors in individual processes do not stop the entire computation (robustness)
worker processes can run on totally different architectures
worker processes can run all around the world
solution: separation of large data objects (bigmemory) and job tasks (redis)
IBE (NFS)
R - bigmemory - implementation
doRedis/bigmemory: Exercise
Exercise:
1. Run the previous example using only two doRedis workers which perform only a single task.
2. Rewrite the previous example such that the proportion of class 1 predictions is returned.
3. Try to rewrite the previous example such that each worker process reads 10 subdatasets at a time and then constructs a classifier for each of the ten read-in subdatasets.
4. Create a larger bigmemory matrix of gene expression data (e.g. 1500 matrices of dimension 200x10000) using random numbers and run the previous example using that input 'bigmatrix'.
The End
Further questions?