
Big Data Analysis

Christoph Bernau and Ferdinand Jamitzky
jamitzky@lrz.de
http://goo.gl/kS31X

Contents

1. A short introduction to big data
2. Parallel programming is hard
3. Hardware @ LRZ
4. Functional Programming
5. Available packages for R
6. Parallel Programming Tools
7. SMP Programming
8. Cluster Programming
9. Job Scheduler
10. Calling external binary code
Calling external binary code

big data
a short introduction

What is Big Data?


In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
(from Wikipedia)

Buzz Word
High dimensional data
Memory intensive data and/or algorithms

Who does Big Data?

Bioinformatics
Genomics and other "Omics"
Astronomy
Meteorology
Environmental Research
Multiscale physics simulations
Economic and financial simulations
Social Networks
Text Mining
Large Hadron Collider

Hardware for Big Data

Large arrays of hard disks
Solid state disks as temp storage
Large RAM
Manycore
Multicore
Accelerators
Tape archives

Software Middleware for Big Data

MapReduce
Distributed File Systems
Parallel File Systems
Distributed Databases
Task Queues
Memory Attached Files

Supercomputer for Big Data


(Flash) Gordon: Data-Intensive Supercomputing at the San Diego Supercomputing Centre

1,024 dual-socket Intel Sandy Bridge nodes, each with 64 GB DDR3-1333 memory
Over 300 TB of high-performance Intel flash memory SSDs via 64 dual-socket Intel Westmere I/O nodes
Large memory supernodes capable of presenting over 2 TB of cache-coherent memory
Dual-rail QDR InfiniBand network

http://www.sdsc.edu/supercomputing/gordon/

SuperMUC as Big Data System


SuperMUC

9,216 dual-socket Intel Sandy Bridge nodes, each with 32 GB DDR3-1333 memory
Parallel file system GPFS
FDR10 InfiniBand network
Bandwidth to GPFS: 200 GByte/s
No flash :-(

parallel programming is hard

Why parallel programming?


The end of the free lunch:

Moore's law now means no longer faster processors, only more of them. But beware:

2 x 3 GHz < 6 GHz

(cache consistency, multi-threading, etc.)

The future is parallel


Moore's law is still valid: the number of transistors doubles every 2 years
Clock speed saturates at 3 to 4 GHz
multi-core processors vs many-core processors
grid/cloud computing
clusters
GPGPUs

(Intel, 2000)

The future is massively parallel


Connection Machine CM-1 (1983)
12-D hypercube
65,536 1-bit cores (AND, OR, NOT)
Rmax: 20 GFLOP/s

The future is massively parallel


JUGENE Blue Gene/P (2007)
3-D torus or tree
65,536 64-bit cores (PowerPC 450)
Rmax: 222 TFLOP/s

Supercomputer: SMP

SMP machine:
shared memory
typically 10s of cores
threaded programs
bus interconnect

in R: library(multicore) and inlined code

Example: gvs1 (128 GB RAM, 16 cores)
Example: uv3.cos.lrz.de (2000 GB RAM, 1120 cores)

Supercomputer: MPI

Cluster of machines:
distributed memory
typically 100s of cores
message passing interface
InfiniBand interconnect

in R: library(Rmpi) and inlined code

Example: coolMUC (4700 GB RAM, 2030 cores)
Example: superMUC (320,000 GB RAM, 160,000 cores)

Levels of Parallelism

Node level (e.g. SuperMUC has approx. 10,000 nodes)
  each node has 2 sockets
Socket level
  each socket contains 8 cores
Core level
  each core has 16 vector registers
Vector level (e.g. the lxgp1 GPGPU has 480 vector registers)
Pipeline level (how many simultaneous pipelines)
  hyperthreading
Instruction level (instructions per cycle)
  out-of-order execution, branch prediction

Problems: Access Times

Getting data from:               Getting some food from:

CPU register    1 ns             fridge           10 s
L2 cache        10 ns            microwave        100 s ~ 2 min
memory          80 ns            pizza service    800 s ~ 15 min
network (IB)    200 ns           city mall        2,000 s ~ 0.5 h
GPU (PCIe)      50,000 ns        mum sends cake   500,000 s ~ 1 week

Computing MFlop/s
mflops.internal <- function(np) {
a=matrix(runif(np**2),np,np)
b=matrix(runif(np**2),np,np)
nflops=np**2*(2*np-1)
time=system.time(a %*% b)[[3]]
nflops/time/1000000}

This function computes a matrix-matrix multiplication using np x np random matrices.

The number of floating point operations: each of the np x np result elements requires np multiplications and (np-1) additions, resulting in

np x np x (np + np - 1) = np**2*(2*np-1) FLOPs
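A hedged usage sketch (the rate returned depends strongly on the BLAS library R is linked against, so no representative number is given here):

> mflops.internal(1000)   # ~2e9 floating point operations for np = 1000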

Amdahl's law

Computing time for N processors:

T(N) = T(1)/N + Tserial + Tcomm * N

Acceleration factor:

T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)

small N: T(1)/T(N) ~ N
large N: T(1)/T(N) ~ 1/N

Amdahl's Law II

Acceleration factor for Tserial/T(1) = 0.01

Amdahl's Law III

> N <- 1:1000   # processor counts, e.g. 1 to 1000
> plot(N, type="l")
> lines(N/(1+0.01*N), col="red")
> lines(N/(1+0.01*N+0.001*N**2), col="green")

R on the HLRB-II

Strong scaling for up to 120 cores; beyond that, the computing time per core is too low.

Leibniz Supercomputing Centre

Hardware @ LRZ

The Leibniz Supercomputing Centre is:

Computer Centre (~175 employees) for all Munich Universities with
  more than 80,000 students and
  more than 26,000 employees, including 8,500 scientists
Regional Computer Centre for all Bavarian Universities
  Capacity computing
  Special equipment
  Backup and Archiving Centre (10 petabyte, more than 6 billion files)
  Distributed File Systems
  Competence centre (Networks, HPC, IT Management)
National Supercomputing Centre
  Gauss Centre for Supercomputing
  Integrated in European HPC and Grid projects

Hardware @ LRZ
http://www.lrz.de/services/compute/linux-cluster/overview/

The LRZ Linux Cluster:

Heterogeneous cluster of Intel-compatible systems
  lx64ia, lx64ia2, lx64ia3 (login nodes)
  gvs1, gvs2, gvs3, gvs4 (remote visualisation nodes, 8 GPUs)
  uv2, uv3 (SMP nodes, 1,040 cores)
  ice1-login (cluster)
  lxa1 (coolMUC, MPP cluster)
The SuperMUC
  superMIG (migration system and fat island, 8,200 cores)
  superMUC (cluster of thin islands, 147,456 cores available in Sept 2012)

Hardware @ LRZ (new Sept 2012)

Overview (diagram): login nodes plus ia64, x86_64 and GPU partitions

SuperMUC: SuperMIG (8,200 cores), SuperMUC (147,456 cores), supzero (80 cores), supermuc login (16 cores)
Linux Cluster: SGI UV (2,080 cores), SGI ICE (512 cores), gvs1...4 (64 cores), CoolMUC (4,300 cores), lx64ia2 and lx64ia3 login nodes (8 cores each)

File space @ LRZ

http://www.lrz.de/services/compute/backup/

$HOME
  25 GB per group, with backup and snapshots
  cd $HOME/.snapshot

$OPT_TMP
  temporary scratch space (beware!)
  High-watermark deletion: when the filling of the file system exceeds some limit (typically between 80% and 90%), files are deleted, starting with the oldest and largest files, until a filling of between 60% and 75% is reached. The precise values may vary.

$PROJECT
  project space (max 1 TB), no automatic backup, use dsmc

module system @ LRZ
http://www.lrz.de/services/software/utilities/modules/

module avail
module list
module load <name>     e.g. module load matlab
module unload <name>
module show <name>

To make the module system available inside a qsub job, source the profile:

. /etc/profile
or
. /etc/profile.d/modules.sh

What our users do: Usage 2010 by Research Area

Performance per core by Research Area

batch system @ LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel

simple SLURM script:

#!/bin/bash                       # ignored by SGE, but used if executed normally
#SBATCH -J myjob                  # (placeholder) name of job
#SBATCH --mail-user=me@my_domain  # (placeholder) e-mail address (don't forget!)
#SBATCH --time=00:05:00           # maximum run time; may be increased up to the queue limit

. /etc/profile                    # load the standard environment (see below)
cd mydir                          # change to working directory
./myprog.exe                      # start executable
echo $JOB_ID

batch system @ LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel

sbatch jobfile.sh      # submit job to SLURM
squeue -u <userid>     # get status of my jobs
scancel <jobid>        # delete my job

Start an interactive shell:

srun --ntasks=32 --partition=uv2_batch xterm

R makes life easier

functional programming matters

How are High-Performance Codes constructed?

Traditional construction of high-performance codes:
  C/C++/Fortran
  Libraries

Alternative construction of high-performance codes:
  Scripting for brains
  GPUs/multicore for inner loops
  Play to the strengths of each programming environment.

Hybrid programming: use cluster and task parallelism at the same time
  cluster parallelism: separated memory
  task parallelism: shared memory

Why scripting?

A scripting language...
is discoverable and interactive.
has comprehensive built-in functionality.
manages resources automatically.
is dynamically typed.
works well for gluing lower-level blocks together.
examples: tcl/tk, perl, python, ruby, R, MATLAB

Why functional matters...

for parallel programming:
  no side effects
  code as data
for structured programming:
  late binding
  recursion
  lazy evaluation
  very high abstraction
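A minimal sketch of these ideas in plain R (the names compose, squares and total are ours, chosen for illustration):

# code as data: functions passed as arguments, no mutable state
compose <- function(f, g) function(x) f(g(x))   # returns a new function
squares <- Map(function(x) x^2, 1:5)            # apply an anonymous function
total   <- Reduce(`+`, squares)                 # fold without side effects (55)
compose(sqrt, sum)(1:10)                        # evaluated only when called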

R functions
R can define named and anonymous functions
Define a (named or anonymous) function:
todB <- function(X) {10*log10(X)}
Functions can even return (anonymous) functions
The last value evaluated is the return value
Variables from the calling namespace are visible
All other variables are local unless specified
Variable number of inputs:
myfunc <- function(...) list(...)
Variable names and predefined values
myfunc <- function(a,b=1,c=a*b) c+1
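A hedged sketch of these rules (make_scaler is an illustrative name, not part of any package):

make_scaler <- function(factor) function(x) factor * x  # returns an anonymous function
double <- make_scaler(2)     # 'factor' is captured from the defining environment
double(21)                   # 42; the last evaluated value is returned
myfunc <- function(a, b=1, c=a*b) c + 1   # defaults may refer to other arguments
myfunc(3)                    # 4: c defaults to a*b = 3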

Available packages for R

How to use multiple cores with R


R provides modularization
R provides high level abstractions
R provides mixing of programming paradigms
R provides dynamic libraries
R provides vector expressions

Use It!

You can write multi-machine, multi-core, GPGPU-accelerated, client-server based, web-enabled applications using R.

Parallel R Packages

foreach          parallel abstraction
pnmath/MKL       parallel intrinsic functions
multicore        SMP programming
snow             Simple Network of Workers
Rmpi             Message Passing Interface
rgpu, gputools   GPGPU programming
R webservices    client/server webservices
sqldf            SQL server for R
rredis           noSQL server for R
mapReduce        large scale parallelization

Parallel programming with R

Parallel APIs:
  SMP - multicore
  MPP/MPI - mpi
  ssh/sockets - snow

Abstraction: the foreach package
  doMC
  doMPI
  doSNOW
  doREDIS

Example:

library(doMC)
registerDoMC(cores=5)
roots <- foreach(i=1:10) %dopar% sqrt(i)

SMP programming

library(multicore)
send tasks into the background with parallel()
wait for completion and gather results with collect()

library(multicore)
# spawn two tasks
p1 <- parallel(sum(runif(10000000)))
p2 <- parallel(sum(runif(10000000)))
# gather results, blocking
collect(list(p1,p2))
# gather results, non-blocking
collect(list(p1,p2), wait=F)

library(multicore)
Extension of the apply function family in R
a function-function, or functional
utilizes SMP:

library(multicore)
doit <- function(x,np) sum(sort(runif(np)))
# single call
system.time( doit(0,10000000) )
# serial loop
system.time( lapply(1:16, doit, 10000000) )
# parallel loop
system.time( mclapply(1:16, doit, 10000000, mc.cores=4) )

doMC

# R
> library(foreach)
> library(doMC)
> registerDoMC(cores=4)

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
  9.352   2.652  12.002

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  7.228   7.216   3.296

multithreading with R
library(foreach)

library(foreach)
library(doMC)
registerDoMC()

foreach(i=1:N) %dopar%   # thread execution
{
  foreach(i=1:N) %do%    # serial execution
  {
    mmult.f()
  }
}

Cluster Programming

doSNOW

# R
> library(doSNOW)
> registerDoSNOW(makeSOCKcluster(4))

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
 15.377   0.928  16.303

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  4.864   0.000   4.865

SNOW with R
library(foreach)

library(foreach)
library(doSNOW)
registerDoSNOW()

foreach(i=1:N) %do%       # serial execution
{
  foreach(i=1:N) %dopar%  # cluster execution
  {
    mmult.f()
  }
}

Job Scheduler

noSQL databases

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

http://www.redis.io

Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk, Tcl.

doRedis / workers

start a redis worker:
> echo "require('doRedis');redisWorker('jobs')" | R

The workers can be distributed over the internet:

> startRedisWorkers(100)

doRedis

# R
> library(doRedis)
> registerDoRedis("jobs")

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
 15.377   0.928  16.303

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  4.864   0.000   4.865

doMC

# R
> library(doMC)
> registerDoMC(cores=4)

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
  9.352   2.652  12.002

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  7.228   7.216   3.296

doSNOW

# R
> library(doSNOW)
> cl <- makeSOCKcluster(4)
> registerDoSNOW(cl)

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
 15.377   0.928  16.303

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  4.864   0.000   4.865

redis and R: rredis, doREDIS

redisConnect()            # connect to redis store
redisSet('x', runif(5))   # store a value
redisGet('x')             # retrieve value from store
redisClose()              # close connection

redisAuth(pwd)            # simple authentication
redisConnect()
redisLPush('x', 1)        # push numbers into a list
redisLPush('x', 2)
redisLPush('x', 3)
redisLRange('x', 0, 2)

Calling external binary code

One R to rule them all

C/C++/Objective-C
Fortran
Java
MPI
Threads
OpenGL
ssh
web server/client
Linux, Mac, MS Windows

R shell
R GUI
math notebook
automatic LaTeX/PDF
vtk
vtk

One R to bind them

C/C++/Objective-C    .C("funcname", args...)
Fortran              .Fortran("test", args...)
Java                 .jcall("class", args...)
R objects            .Call
R objects            .External

Use R as scripting language


R can dynamically load shared objects:

dyn.load("lib.so")

these functions can then be called via

.C("fname", args)

.Fortran("fname", args)

C integration

shared object libraries can be used in R out of the box
R arrays are mapped to C pointers:

R            C
integer      int*
numeric      double*
character    char*

Example:

R CMD SHLIB -o test.so test.c

use in R:

> dyn.load("test.so")
> .C("test", args)

Fortran 90 Example

program myprog
! simulate harmonic oscillator
integer, parameter :: np=1000, nstep=1000
real :: x(np), v(np), dx(np), dv(np), dt=0.01
integer :: i,j
forall(i=1:np) x(i)=i
forall(i=1:np) v(i)=i
do j=1,nstep
  dx=v*dt; dv=-x*dt
  x=x+dx; v=v+dv
end do
print*, " total energy: ", sum(x**2+v**2)
end program

Fortran Compiler

use the Intel Fortran compiler:

$ ifort -o myprog.exe myprog.f90
$ time ./myprog.exe

exercise for you (a sketch of the arithmetic follows below):

compute MFlop/s (floating point operations: 4 * np * nstep)
optimize (hint: -fast, -O3)
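A hedged sketch of the MFlop/s arithmetic in R (the elapsed time of 0.05 s is invented; substitute your own measurement):

np <- 1000; nstep <- 1000
elapsed <- 0.05                  # your measured run time in seconds
4 * np * nstep / elapsed / 1e6   # = 80 MFlop/s for this invented timing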

R subroutine

subroutine mysub(x,v,nstep)
! simulate harmonic oscillator
integer, parameter :: np=1000000
real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
integer :: i,j, nstep
forall(i=1:np) x(i)=real(i)/np
forall(i=1:np) v(i)=real(i)/np
do j=1,nstep
  dx=v*dt; dv=-x*dt
  x=x+dx; v=v+dv
end do
return
end subroutine

Matrix Multipl. in FORTRAN

subroutine mmult(a,b,c,np)
integer np
real*8 a(np,np), b(np,np), c(np,np)
integer i,j,k
do k=1, np
  forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
end do
return
end subroutine

Call FORTRAN from R

# compile f90 to shared object library
system("ifort -shared -fPIC -o mmult.so mmult.f90")

# dynamically load library
dyn.load("mmult.so")

# define multiplication function
mmult.f <- function(a,b,c) .Fortran("mmult", a=a, b=b, c=c, np=as.integer(dim(a)[1]))

Call FORTRAN binary

np = 100

system.time(
  mmult.f(
    a = matrix(numeric(np*np), np, np),
    b = matrix(numeric(np*np)+1., np, np),
    c = matrix(numeric(np*np)+1., np, np)
  )
)

Exercise: make a plot of system time vs. matrix dimension
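One possible sketch for this exercise (the chosen dimensions are arbitrary; mmult.f is the wrapper defined above):

dims  <- c(100, 200, 400, 800)
times <- sapply(dims, function(np) system.time(
  mmult.f(a = matrix(0, np, np),
          b = matrix(1, np, np),
          c = matrix(1, np, np)))[[3]])
plot(dims, times, type="b", xlab="matrix dimension np", ylab="elapsed time [s]")
# expect roughly cubic growth with np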

Big Memory

Logical setup of a node (diagrams): without shared memory (each process has its own MEM)
Logical setup of a node: with shared memory
Logical setup of a node: with file-backed memory (MEM plus disk)
Logical setup of a node: with network-attached file-backed memory (MEM, disk and network)

library(bigmemory)
shared memory regions for several processes on an SMP node
file-backed arrays for several nodes over network file systems

library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
sum(x[1,1:1000])

Part II

Applications

Potential Problems with Big Data Sets

1. many small tasks have to be performed for each of many thousands of variables (long run time)
2. analysis/processing needs more main memory than available
3. several R processes on a node need to process the same big data set and each process creates its own big R object
4. the data set cannot be loaded into R because the R object representing it would be too big for the available main memory (worst case)

Approaches for Big Data Problems

1. C function (shared library)
2. Accelerators (GPGPUs, MICs)
3. SMP parallelisation
4. Cluster parallelisation
5. distributed data
6. in-memory data files (arrays as big as available memory)
7. parallel file systems (file-backed arrays, no size limit)
8. hierarchical and heterogeneous file systems

Problem 1: Example (Microarray Data)

gene expressions for approximately 20,000 genes

the influence of each variable on a survival response shall be tested
compute a Cox survival model for each variable:

S(t|x) = S0(t)^exp(b*x)

In R: function coxph() in package survival (shipped with the standard R distribution)

an even more challenging problem: test all second-order interactions (all pairs, 20000 choose 2)

Problem 1: Example (Microarray Data)

First approach: a for-loop in R using the function coxph() [which itself calls a C function via dyn.load to compute the Cox model]:

library(survHD)
data(beer.survival)
data(beer.exprs)
set.seed(123)
X <- t(as.matrix(beer.exprs))
y <- Surv(beer.survival[,2], beer.survival[,1])
coefs <- c()
system.time(
for(j in 1:ncol(X)){
  fit <- coxph( y ~ X[,j])
  coefs <- rbind(coefs, summary(fit)$coefficients[1, c(1, 3, 5)])})

   user  system elapsed
 34.635   0.002  34.686

Second approach: using apply

system.time(output <- apply(t(X), 1, function(xrow){
  fit <- coxph( y ~ xrow )
  summary(fit)$coefficients[1, c(1, 3, 5)]
}))

   user  system elapsed
 26.531   0.020  26.676

Problem 1: Example (Microarray Data)

Faster approach: passing the matrix to C and performing the for-loop inside C

only coefficients and corresponding p-values are returned for each variable
function rowCoxTests in R package survHD

time <- y[,1]
status <- y[,2]
sorted <- order(time)
time <- time[sorted]
status <- status[sorted]
X <- X[sorted,]

##compute columnwise cox models
#dyn.load not necessary, because 'coxmat.so' is integrated into survHD
system.time(out <- .C('coxmat', regmat=as.double(X), ncolmat=as.integer(ncol(X)),
  nrowmat=as.integer(nrow(X)), reg=as.double(X[,1]), zscores=as.double(numeric(ncol(X))),
  coefs=as.double(numeric(ncol(X))), maxiter=as.integer(20), ...))

   user  system elapsed
  0.229   0.000   0.229

max(abs(out$coefs - coefs[,1]))
[1] 1.004459e-07

performing computations in C/Fortran, i.e. optimizing sequential code, often yields a significant speed-up
but it is principally difficult to program and quite error-prone
C functions for single variables are usually available, and wrappers are usually easy to program

Comparison to parallel programming:

Parallelization of the for-loop using snow:

#create cluster
library(snow)
cl <- makeSOCKcluster(10)
#broadcast X
Z <- X
clusterExport(cl=cl, list=list('Z'))
#function to be applied in parallel
parcoxph <- function(ind, y){
  require(survHD)
  zcol <- Z[,ind]
  fit <- coxph( y ~ zcol )
  summary(fit)$coefficients[1, c(1, 3, 5)]}
#run function on 10 cores
system.time(result <- parLapply(cl=cl, x=1:ncol(Z), fun=parcoxph, y=y))

   user  system elapsed
  0.031   0.003   3.474

parallelization of very small and short tasks is usually not efficient
possible improvement: rewrite the code such that bunches of tests are performed per task
Combining both approaches:

For really big data sets (>100,000 variables) one can combine both approaches:

X2 <- X
for(i in 1:30){
  X2 <- cbind(X2, X)}
colnames(X2) <- 1:ncol(X2)
system.time(tt <- rowCoxTests(t(X2), y, option='fast'))

   user  system elapsed
  0.593   0.010   0.606

system.time(rowCoxTests(t(X), y, option='fast'))

   user  system elapsed
  0.303   0.000   0.303

##using snow
#create cluster
library(snow)
cl <- makeSOCKcluster(10)
#function to be applied in parallel
parfun <- function(ind, Z, y){
  require(survHD)
  rowCoxTests(X=t(Z), y=y, option='fast')}
#run function on 10 cores
system.time(result <- parLapply(cl=cl, x=1:30, fun=parfun, Z=X, y=y))

   user  system elapsed
  1.825   0.291   7.215

X2 <- cbind(X,X,X)
system.time(result <- parLapply(cl=cl, x=1:10, fun=parfun, Z=X2, y=y))

   user  system elapsed
  2.255   0.206   3.436

Combining both approaches: Exercise

In the current example, however, parallel computing is less effective anyway.

Exercise:
1. Create a large data set by concatenating the gene-expression matrix 20 times (use cbind).
2. Apply the function rowCoxTests() and measure the runtime.
3. Use snow in order to send the expression matrix to 20 cores and let each core perform rowCoxTests() on its own matrix.
4. Measure the runtime.

Problem 2: Example

Normalization of gene-expression microarrays:

approximately 500k measurements per array
background correction has to be performed
ca. 50 measurements have to be summarized to a single value representing one gene expression (summarization step)
R functions: rma() or vsn() in Bioconductor package affy
high memory requirements as soon as the number of observations exceeds 100 arrays (>10 GB RAM)
Distributed data approach (Bioconductor package affyPara)

Problem 2: Example

source: Markus Schmidberger: Parallel Computing for Biological Data, Dissertation

Distributed data approach for background correction

AffyPara: Code Example

#load packages and initialize snow-cluster (for affyPara)
library(snow)     #parallelization
library(affyPara) #parallel preprocessing
library(affy)     #for reading in affy batches
ncpusaffy <- 7    #number of cpus
cl <- makeSOCKcluster(ncpusaffy) #create cluster
#reading AffyBatch from cel-files
setwd('~/dataCEL/wang05/cel')    #directory containing cel files
aboall <- ReadAffy()             #reading
#create subcluster of length ncores
ncores <- 7
cll <- cl[1:ncores]
#perform preprocessing using subcluster cll
res <- system.time(arrs.out <- preproPara(aboall, bgcorrect=T, bgcorrect.method='rma',
  normalize=T, normalize.method='quantiles', pmcorrect.method='pmonly',
  summary.method='avgdiff', cluster=cll))
###stop cluster / finalize MPI
stopCluster(cl)

single core: RAM > 6 GB
7 cores: ca. 1.5 GB/core
minor speedup

Problem 2: Exercise

Exercise for you:
1. Perform a microarray background correction using serial code (ReadAffy(), bg.correct() in package affy).
2. Use top to observe the memory consumption of the process.
3. Additionally, measure its runtime.
4. Perform the background correction as a distributed data approach using snow (you can pass a character vector of filenames to ReadAffy() in order to load specific cel-files).
5. Compare memory consumption and runtime to the sequential code.

Problem 3/4: Data set too large for RAM

R cannot handle data indices larger than 2 billion (16 GB of doubles; 4 GB in Windows XP)
modern biological data can have several dozen GB (e.g. Next Generation Sequencing)
if the R object representing the data set grows larger than the available RAM, R stops with an error reading "Cannot allocate vector of xx byte"
possible solution: R package bigmemory (based on C++ libraries for big data objects)

2 areas of usage:
if several processes operate on the same big matrix
file-backed matrices if data sets are larger than the available main memory
and the combination of both situations

R-Package bigmemory

Essential functions:

big.matrix(): creates a big matrix (useful if RAM is large enough but several processes have to access the matrix)
filebacked.big.matrix(): creates a file-backed matrix (necessary if main memory is too small)
describe(): creates a descriptor for an existing (file-backed) big.matrix object
bigmatrix[i1,i2]: big.matrix objects can be handled in R code like normal matrix objects, i.e. their elements can be accessed using brackets

bigmemory: code example

###write
data(golub)
library(bigmemory)
setwd('~/tmp/bigmem')
X <- as.matrix(golub[,-1])
#create filebacked.big.matrix and write data into its elements
z <- filebacked.big.matrix(nrow=30*5000, ncol=ncol(X), type='double',
  backingfile="magolub.bin", descriptorfile="magolub.desc")
k <- 0
for(i in 1:5000){
  inds <- sample(1:nrow(X), 30)
  z[(1:30)+(k*30),] <- X[inds,]
  k <- k+1}
#create and save descriptor file for later usage
desc <- describe(z)
save(desc, file='desc_z.RData')

bigmemory: code example

###read
library(bigmemory)
setwd('~/tmp/bigmem')
#load descriptor file
load('desc_z.RData')
#attach bigmatrix object using the descriptor file
y <- attach.big.matrix(desc)
#access elements
y[1:10,7]
#read the element in the 5th row, 7th column
b <- y[5,7]
#compute sum of a submatrix
(sum1 <- sum(y[1:10,5:20]))

bigmemory: exercise

Exercise for you:
1. create a big.matrix object using big.matrix()
2. create a descriptor and save it
3. start another R session on the same node
4. load the descriptor file and attach the big.matrix
5. use the big.matrix object for communication between both R processes (see the sketch below)
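A minimal sketch for this exercise, assuming both R sessions run on the same node and share the working directory (file names are ours):

## session 1
library(bigmemory)
x <- big.matrix(nrow=10, ncol=10, type='double', init=0)
x[1,1] <- 42                        # value for the other process
desc <- describe(x)
save(desc, file='comm_desc.RData')  # hand the descriptor over

## session 2 (started separately on the same node)
library(bigmemory)
load('comm_desc.RData')
y <- attach.big.matrix(desc)
y[1,1]                              # 42, written by session 1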

Gaining Flexibility: doRedis

separates job administration and execution
subtasks are stored in a redis data base
the master process sends subtasks of a computation to the server
workers can log in and request the tasks
all necessary R objects are stored in the redis server, too
necessary software:
  R packages: rredis, doRedis
  data base: redis-server (debian package)

doRedis: essential functionality

Master process:
registerDoRedis(jobqueue, host): connects to the redis-server at 'host' and specifies a job queue for the tasks to come
foreach(j=1:n) %dopar% {FUN(j)}: sends subtasks to the redis data base
redisFlushAll(): clears the data base
removeQueue(): removes a queue from the data base

Worker process:
registerDoRedis(jobqueue, host): registers a job queue whose tasks shall be processed
startLocalWorkers(n, jobqueue, host): starts n local worker processes which process the tasks specified in jobqueue (uses multicore)
redisWorker(jobqueue, host): useful in MPI environments

usually users do not request or set the data base values directly
typical parallelization as known from other "do" packages

Worker processes can run on any R-compatible hardware and can connect at any time.

(Diagram) The master process (doRedis) sends jobs and objects to the redis-server, which distributes them to worker processes on the nodes (workers 1a...1z on node 1, 2a...2z on node 2, and so on); results are eventually returned to the master. The setup is robust, flexible and dynamic.

doRedis: code example

Master (sends subtasks to the redis-server and waits for results):

#redis-server ~/redis/redis-2.2.14/redis.conf (in a linux shell, starts the redis-server)
#cross-validation of classification on microarray data
library(CMA)
X <- as.matrix(golub[,-1])
y <- golub[,1]
ls <- GenerateLearningsets(y=y, method='CV', fold=10, niter=10000)
#function to be applied on each node
cl2 <- function(j){
  require(CMA)
  ttt <- system.time(cl <- svmCMA(y=y, X=X, learnind=ls@learnmatrix[j,], cost=10))
  list(cl, ttt, Sys.info())}
#connect to redis-server, send subtasks and wait for results
library(doRedis)
redisFlushAll()
registerDoRedis('jobscmanew')
numtodo <- nrow(ls@learnmatrix)
lll3 <- foreach(j=1:numtodo) %dopar% {cl2(j)}

doRedis: code example

Worker processes (connect to the server, receive subtasks and objects, return results):

###using multicore (just two lines)
#register job queue from the redis-server
registerDoRedis('jobscmanew', host='bernau1.ibe.med.uni-muenchen.de')
#start 10 local workers
startLocalWorkers(n=10, queue='jobscmanew')

###using MPI
#function to be run by each mpi process
startdr <- function(ll){
  library(doRedis)
  redisWorker('jobscmanew', host='bernau1.ibe.med.uni-muenchen.de')
}
#start Rmpi
library(Rmpi)
numworker <- mpi.universe.size()
mpi.spawn.Rslaves()
#let each mpi process connect to the redis-server and perform subtasks
mpi.apply(1:numworker, startdr)

doRedis: exercise

1. connect to the redis server in R
2. submit a job queue
3. start workers to perform the subtasks
4. set a value for the variable xnewinteger (use redisSet())
5. request the value of the variable xnewinteger (use redisGet())

A sketch follows below.
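A hedged sketch of the exercise steps (queue name 'myqueue' and a local redis-server are assumptions):

library(rredis)
library(doRedis)                         # attaches foreach as well
redisConnect()                           # step 1: connect (local redis-server assumed)
registerDoRedis('myqueue')               # step 2: submit a job queue
startLocalWorkers(n=2, queue='myqueue')  # step 3: start workers
foreach(i=1:4) %dopar% i^2               # subtasks executed by the workers
redisSet('xnewinteger', 5L)              # step 4: set a value
redisGet('xnewinteger')                  # step 5: request it back
removeQueue('myqueue')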

Combining doRedis and bigmemory

redis and doRedis provide high flexibility for performing independent subtasks:
worker processes can connect at any time
errors in individual processes do not stop the entire computation (robustness)
worker processes can run on totally different architectures
worker processes can run all around the world

disadvantage: the database can become a bottleneck if large R objects have to be stored/sent

solution: separation of large data objects (bigmemory) and job tasks (redis)

Combining doRedis and bigmemory


Separate task and data channel:

doredis/bigmemory: Code Example


worker process:
redisbigreadwrite<-function(procind){
require(CMA)
require(bigmemory)
j<-procind
setwd('~/tmp/bigmemlrz')
load('desc_z.RData')
#big data object containing many gene expression sets
load('desc_out.RData') #big data file for misclassification rates
z<-attach.big.matrix(desc)
out<-attach.big.matrix(descout)
load('descresmat.RData')
resmat<-attach.big.matrix(descresmat) #big data object for simulating large
writing operation
for(iter in 1:10){
start<-(j-1)*30*10*10+(iter-1)*30*10+1
X<-z[start:(start+299),] #read gene expression matrix
cl<-svmCMA(y=sample(c(1,2),nrow(X),replace=T),X=X,learnind=1:25,cost=10))
#construct classifier
out[(j-1)*10+iter]<-mean(abs(cl@y-cl@yhat)) #compute misclassification rate
resmat[start:(start+299)]<-X #write X
}
#flush
flush(resmat);flush(out)}

doredis/bigmemory: Code Example

master process:

###create bigmatrix (gene expressions)
library(bigmemory)
setwd('~/tmp/bigmemlrz')
X <- as.matrix(golub[,-1])
z <- filebacked.big.matrix(nrow=30*1500, ncol=ncol(X), type='double',
  backingfile="magolub.bin", descriptorfile="magolub.desc")
for(i in 1:1500){
  inds <- sample(1:nrow(X), 30)
  z[(1:30)+((i-1)*30),] <- X[inds,]}
#create descriptor file and save it for other processes
desc <- describe(z)
save(desc, file='desc_z.RData')
###doredis part
library(doRedis)
registerDoRedis('rwbigmem')
lll3 <- foreach(j=1:1500) %dopar% redisbigreadwrite(j)

results are returned in a file-backed object, so the master could quit

doredis/bigmemory: code example

Benchmark environments: LRZ (NAS) vs. IBE (NFS); the main difference is the underlying network and network file system.

Comparison to a standard MPI-IO approach (Fortran90 MPI-IO implementation vs. R bigmemory implementation). Difference: MPI is less flexible and not robust, and requires collective open/close calls.

doRedis/bigmemory: Exercise

Exercise:
1. run the previous example using only two doRedis workers which perform only a single task
2. rewrite the previous example such that the proportion of class 1 predictions is returned
3. try to rewrite the previous example such that each worker process reads 10 subdatasets at a time and then constructs a classifier for each of the ten read-in subdatasets
4. create a larger bigmemory matrix of gene expression data (e.g. 1500 matrices of dimension 200x10000) using random numbers and run the previous example using that input 'bigmatrix'

The End

Thanks for your attention.

Further questions?
