
IBM General Parallel File System (GPFS) 3.5 and SoNAS
Scalable high-performance file system

Klaus Gottschalk, HPC Architect, IBM Germany
Anselm Hruschka, IT Specialist, IBM Germany

© 2012 IBM Corporation


Agenda: What's New in GPFS 3.5

Active File Management


GPFS Native RAID
Independent Filesets
Improved Quotas
ILM Improvements



Evolution of the global namespace:
GPFS Active File Management (AFM)

1993: GPFS introduced concurrent file system access from multiple nodes.
2005: Multi-cluster expands the global namespace by connecting multiple sites.
2011: AFM takes the global namespace truly global by automatically managing asynchronous replication of data.



AFM Use Cases

HPC
  Grid computing: allows data to move transparently during grid workflows
  Facilitates content distribution for global enterprises and follow-the-sun engineering teams

Distributed NAS
  WAN Caching: caching NAS across the WAN, between SoNAS clusters or between SoNAS and another NAS vendor
  Data Migration: online cross-vendor data migration
  Disaster Recovery: multi-site fileset-level replication/failover
  Shared Namespace: across SoNAS clusters

Storage Cloud
  Key building block of cloud storage architecture
  Enables edge caching in the cloud
  DR support within cloud data repositories
  Peer-to-peer data access among cloud edge sites
  Global wide-area filesystem spanning multiple sites in the cloud





AFM Architecture

Fileset on the home cluster is associated with a fileset on one or more cache clusters

If data is in cache
  Cache hit at local disk speeds
  Client sees local GPFS performance if the file or directory is in cache

If data is not in cache
  Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
  Uses NFS/pNFS for WAN data transfer

If data is modified at home
  Revalidation is done at a configurable timeout
  Close to NFS-style close-to-open consistency across sites
  POSIX strong consistency within the cache site

If data is modified at cache
  Writes see no WAN latency: writes are done to the cache (i.e. local GPFS), then asynchronously pushed home

If the network is disconnected
  Cached data can still be read, and writes to the cache are written back after reconnection
  There can be conflicts
[Diagram: Cache Cluster Site 1 and Cache Cluster Site 2 (GPFS + Panache, SoNAS layer) connect through gateway (GW) nodes over pNFS/NFS across the WAN to the Home Cluster Site (any NAS box or SOFS); pull on cache miss, push on write.]
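As a loose sketch of how a cache fileset is tied to a home export at the GPFS level (the file system, fileset and NFS target names below are made up, and the exact AFM parameters of mmcrfileset can differ by release):

  # hypothetical names: create a single-writer cache fileset backed by the home cluster's NFS export
  mmcrfileset fs1 webcache --inode-space=new -p afmTarget=nfs://homecluster/gpfs/home/web -p afmMode=single-writer
  mmlinkfileset fs1 webcache -J /gpfs/fs1/webcache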



Remote site read caching

A remote user reads a file from the local edge device:
  The edge device caches the file locally on disk
  Can run disconnected
  Auto-read from the home site on a cache miss

[Diagram: global namespace /home/appl/data/web containing important_big_spreadsheet.xls, big_architecture_drawing.ppt and unstructured_big_video.mpg; edge sites running Panache with Scale Out NAS read from the home IBM Scale Out NAS (policy engine, interface nodes, storage nodes; Tier 1: SAS drives, Tier 2: SATA drives).]



Remote site write caching, update home site

A remote user writes a file to the local edge device:
  1. The write is cached locally on disk at the edge
  2. The update is pushed to the home site periodically, or when the network is reconnected

[Diagram: global namespace /home/appl/data/web; edge site running Panache with Scale Out NAS writes locally and then updates the home IBM Scale Out NAS (policy engine, interface nodes, storage nodes; Tier 1: SAS drives, Tier 2: SATA drives).]





Panache Modes

Single Writer (SW)
  Only the cache can write data. The home site cannot change. Other peer caches have to
  be set up as read-only caches.

Read Only (RO)
  The cache can only read data; no data changes are allowed.

Local Update (LU)
  Data is cached from home and changes are allowed as in SW mode, but changes are
  not pushed to home.
  Once data is changed, the relationship is broken, i.e. cache and home are no longer in
  sync for that file.

Change of Modes (see the sketch below)
  SW and RO mode caches can be changed to any other mode.
  An LU cache cannot be changed: too many complications/conflicts to deal with.
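A minimal sketch of a mode change, assuming the cache fileset has to be unlinked before its AFM mode can be switched; the names (fs1, cache1) are made up and the exact mmchfileset parameter handling may differ by release:

  # hypothetical names: convert a single-writer cache to a read-only cache
  mmunlinkfileset fs1 cache1
  mmchfileset fs1 cache1 -p afmMode=read-only
  mmlinkfileset fs1 cache1 -J /gpfs/fs1/cache1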





Pre-fetching

Policy-based pre-population
  Periodically runs a parallel inode scan at home
  Selects files/dirs based on policy criteria
  Includes any user-defined metadata in xattrs or other file attributes
  SQL-like construct to select:
    RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND
      MODIFICATION_TIME > CURRENT_TIME - 3600 AND
      USER_ATTR1 = 'sat-photo' OR USER_ATTR2 = 'classified'

Cache then pre-fetches the selected objects (see the sketch below)
  Runs asynchronously in the background
  Parallel multi-node prefetch
  Can call out when completed
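A rough sketch of driving such a prefetch from the cache cluster; the policy file, list file and fileset names are hypothetical, and the exact mmafmctl prefetch options vary by release:

  # hypothetical names: build the candidate list with the policy engine, then prefetch it into the cache fileset
  mmapplypolicy fs1 -P prefetch.pol -I defer -f /tmp/prefetch
  mmafmctl fs1 prefetch -j webcache --list-file /tmp/prefetch.list.prefetchlist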





Expiration of Data

Staleness Control
  Defined based on time since disconnection
  Once the cache is expired, no access to the cache is allowed
  Manual expire/unexpire option for the admin (see the sketch below)
    mmafmctl expire/unexpire; ctlcache in SoNAS
  Allowed only for RO mode caches
  Disabled for SW & LU as they are sources of data themselves
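A minimal sketch of the manual control at the GPFS level, assuming the command takes the file system and the cache fileset (names here are made up):

  # hypothetical names: manually expire a read-only cache fileset, then bring it back
  mmafmctl fs1 expire -j ro_cache
  mmafmctl fs1 unexpire -j ro_cache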





Use Case: Central/Branch Office

The central site is where data is created, maintained and updated/changed.
This is typical in customer situations where data is ingested via satellite, data warehousing, etc.
Branch/edge sites can periodically prefetch (via policy) or pull on demand.
Data is revalidated when accessed.
A typical scenario: music sites, where data is maintained at a central location and other sites in various locations pull data into cache and serve it locally.
Customer use cases: BofA, NGA, US Army

[Diagram: HQ primary site (writer) feeds edge sites (readers) via periodic prefetch or on-demand pull.]





Use Case: Independent Writers

Local sites dedicated to researchers within a campus, sharing a dedicated home filesystem but with individual home directories.
  Each site/system has its own fileset, which will be its local cluster.
  This use case often has a central system which holds all home directories; backup/HSM is managed out of it.

A company spread across various countries maintains logs of phone calls/SMS etc. as needed.
  The headquarters needs to maintain logs/records of all calls/SMS from all countries.
  Per each country's requirements, the data needs to be maintained to process any queries received from customers/government etc.
  Typically the company headquarters contains all records from all countries/branches of its office, and each location maintains logs/records for that country and any country it requires.

[Diagram: User A's and User B's home directories (writers) and a backup site sharing the namespace.]



Global Namespace

Three file systems (store1, store2, store3) each hold two filesets locally (/data1 through /data6 in total) and define cache filesets for the filesets whose home is on the other clusters.
Clients on every cluster access the same namespace: /global/data1 through /global/data6.

See all data from any cluster.
Cache as much data as required, or fetch data on demand.



Why build GPFS Native RAID?

Disk rebuilding is a fact of life at petascale
  With 100,000 disks and an MTBF(disk) of 600 Khrs, a rebuild is triggered about
  four times a day (see the arithmetic below)
  A 24-hour rebuild implies four concurrent, continuous rebuilds at all times
  With larger disks, rebuild times grow, increasing the risk of a second disk failure

Disk integrity issues
  Silent I/O drops, etc.
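The four-per-day figure follows directly from the slide's numbers: with a per-disk MTBF of 600,000 hours, 100,000 disks fail at an expected rate of 100,000 / 600,000 = 1/6 disks per hour, i.e. roughly 24 x 1/6 = 4 disk failures, and therefore about 4 rebuilds, per day.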



What is GPFS Native RAID?

Software RAID on the I/O servers
  SAS-attached JBOD
  Special JBOD storage drawer for very dense drive packing
  Solid-state drives (SSDs) for metadata storage

Features
  Auto rebalancing
  Only 2% rebuild performance hit
  Reed-Solomon erasure code, 8 data + 3 parity strips
  ~10^5-year MTTDL for a 100-PB file system
  End-to-end, disk-to-GPFS-client data checksums

[Diagram: NSD servers on the local area network (LAN) serve vdisks built from SAS-attached JBODs.]
Declustered RAID

Data, parity and spare strips are uniformly and independently distributed across the disk array
  [Figure: conventional vs. declustered RAID layout]

Supports an arbitrary number of disks per array
  Not restricted to an integral number of RAID track widths



GPFS Native RAID algorithm

Each block of each file is striped
Two types of RAID
  2-fault and 3-fault tolerant codes (RAID-D2, RAID-D3)
  3- or 4-way replication
  8 + 2 or 8 + 3 parity

2-fault tolerant codes: 8 + 2p Reed-Solomon, or 3-way replication (1 + 2)
3-fault tolerant codes: 8 + 3p Reed-Solomon, or 4-way replication (1 + 3)

For the Reed-Solomon codes, a GPFS block is split into 8 strips plus 2 or 3 redundancy strips; for replication, 1 strip (the GPFS replicated block) is stored with 2 or 3 strips in total. A sketch of how the code is chosen per vdisk follows.
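As a rough illustration of where these codes appear in practice, the erasure code is chosen per vdisk. The stanza lines below are a sketch with made-up names and sizes; exact stanza keywords for mmcrvdisk can differ by release:

  # hypothetical vdisk stanzas: an 8+3p data vdisk and a 3-way-replicated metadata vdisk
  %vdisk: vdiskName=rg1_data raidCode=8+3p blocksize=8m size=200t rg=rg1 da=DA1
  %vdisk: vdiskName=rg1_meta raidCode=3WayReplication blocksize=1m size=4t rg=rg1 da=DA1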



Component Hierarchy

A recovery group can have
  Max 512 disks
  Up to 16 declustered arrays
  At least 1 SSD log vdisk
  Max 64 vdisks

A declustered array can contain up to 128 pdisks
  Smallest is 4 disks
  Must have one large array of >= 11 disks
  Needs 1 or more pdisks' worth of spare space

Vdisks (= NSDs)
  Block size: 1 MiB, 2 MiB, 4 MiB, 8 MiB or 16 MiB

[Diagram: recovery groups contain declustered arrays (DA) of pdisks; vdisks, exported as NSDs, are carved out of the declustered arrays. A sketch of creating these objects follows.]
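A minimal sketch of how the hierarchy might be created bottom-up, assuming stanza-file driven commands; server names, device paths and all stanza values are made up, and exact keywords can vary by release:

  # hypothetical stanza file rg1.stanza: pdisks grouped into a declustered array
  %da:    daName=DA1
  %pdisk: pdiskName=p000 device=/dev/sdb da=DA1
  %pdisk: pdiskName=p001 device=/dev/sdc da=DA1
  # ... one %pdisk line per drive in the JBOD drawer

  # create the recovery group on its primary/backup I/O servers, then carve vdisks out of it
  mmcrrecoverygroup rg1 -F rg1.stanza --servers server1,server2
  mmcrvdisk -F rg1_vdisks.stanza   # vdisk stanzas like those on the previous slide
  # the resulting vdisks are then defined as NSDs and added to a GPFS file system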



Independent Filesets

Independent filesets (see the sketch below)
  Own inode space; dynamic expansion of inodes
  Efficient file management operations
  Fileset-level snapshots
  Per-user/group quotas per fileset
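A small sketch of creating an independent fileset and taking a fileset-level snapshot; the file system, fileset and snapshot names are made up, and option spellings may differ slightly by release:

  # hypothetical names: independent fileset with its own inode space, then a fileset snapshot
  mmcrfileset fs1 projA --inode-space=new --inode-limit=1000000
  mmlinkfileset fs1 projA -J /gpfs/fs1/projA
  mmcrsnapshot fs1 projA_snap1 -j projA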



What's new in 3.5?

New event callbacks
  Tiebreaker callback to let the customer decide which side survives in case of a network
  partition
  diskDown to ensure the desired action is taken when a disk goes down

Performance enhancements
  NSD multi-queue
    Provides more pipelining and parallelism in I/O scheduling
    Better I/O performance in large SMP configs
  Data in inode for small files
  Striped log files provide balanced disk usage in small clusters

ILM enhancements (see the sketch below)
  Scope option allows scans to be limited to a fileset, filesystem or inode space
  choice-algorithm (best, exact, fast)
  split-margin specifies how much deviation is allowed when using the fast choice
  algorithm, in terms of THRESHOLD usage etc.
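For illustration, the ILM options above surface on the policy scan command roughly as follows (the fileset path and policy file name are made up; check the release documentation for exact syntax):

  # hypothetical invocation: scan only one fileset and use the fast, approximate candidate selection
  mmapplypolicy /gpfs/fs1/projA -P tiering.pol --scope fileset --choice-algorithm fast --split-margin 5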



Misc

Snapshot clones (see the sketch below)
  Quick, efficient way of making a file copy by creating a clone
  Doesn't copy data blocks (e.g. fits well with VM images)
IPv6 support
Windows
  GPFS daemon no longer needs SUA (SUA is still required for GPFS admin commands)
SELinux support
API to access the xattrs of a file
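A brief sketch of the clone workflow with the mmclone command; the file names are made up:

  # hypothetical file names: freeze a read-only clone parent, then create a writable copy-on-write clone
  mmclone snap vm_master.img vm_master.snap
  mmclone copy vm_master.snap vm_clone1.img
  mmclone show vm_master.snap vm_clone1.img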

