
IBM General Parallel File System (GPFS) 3.5 and SoNAS
Scalable high-performance file system

Klaus Gottschalk, HPC Architect, IBM Germany
Anselm Hruschka, IT Specialist, IBM Germany

© 2012 IBM Corporation


Agenda: What's New in GPFS 3.5

Active File Management


GPFS Native RAID
Independent Filesets
Improved Quotas
ILM Improvements



Evolution of the global namespace:
GPFS Active File Management (AFM)

1993: GPFS introduced concurrent file system access from multiple nodes.
2005: Multi-cluster expands the global namespace by connecting multiple sites.
2011: AFM takes the global namespace truly global by automatically managing asynchronous replication of data.



AFM Use Cases

HPC
  Grid computing: allows data to move transparently during grid workflows
  Facilitates content distribution for global enterprises and follow-the-sun engineering teams

Distributed NAS
  WAN Caching: caching NAS across the WAN, between SoNAS clusters or between SoNAS and another NAS vendor
  Data Migration: online cross-vendor data migration
  Disaster Recovery: multi-site fileset-level replication/failover
  Shared Namespace: across SoNAS clusters

Storage Cloud
  Key building block of cloud storage architecture
  Enables edge caching in the cloud
  DR support within cloud data repositories
  Peer-to-peer data access among cloud edge sites
  Global wide-area filesystem spanning multiple sites in the cloud





AFM Architecture

Fileset on the home cluster is associated with a fileset on one or more cache clusters

If data is in cache
  Cache hit at local disk speeds
  Client sees local GPFS performance if the file or directory is in cache

If data is not in cache
  Data and metadata (files and directories) are pulled on demand at network line speed and written to GPFS
  Uses NFS/pNFS for WAN data transfer

If data is modified at home
  Revalidation is done at a configurable timeout
  Close to NFS-style close-to-open consistency across sites
  POSIX strong consistency within the cache site

If data is modified at cache
  Writes see no WAN latency: writes are done to the cache (i.e. local GPFS), then asynchronously pushed home

If the network is disconnected
  Cached data can still be read, and writes to the cache are written back after reconnection
  There can be conflicts
[Diagram: Cache Cluster Site 1 and Cache Cluster Site 2 (GPFS + Panache, SoNAS layer) connect through gateway (GW) nodes over pNFS/NFS across the WAN to the Home Cluster Site (any NAS box or SOFS); pull on cache miss, push on write.]
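As a loose sketch of how a cache fileset is tied to a home export at the GPFS level (the file system, fileset and NFS target names below are made up, and the exact AFM parameters of mmcrfileset can differ by release):

  # hypothetical names: create a single-writer cache fileset backed by the home cluster's NFS export
  mmcrfileset fs1 webcache --inode-space=new -p afmTarget=nfs://homecluster/gpfs/home/web -p afmMode=single-writer
  mmlinkfileset fs1 webcache -J /gpfs/fs1/webcache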



Remote site read caching

A remote user reads a file from the local edge device:
  The edge device caches the file locally on disk
  Can run disconnected
  Auto-read from the home site on a cache miss

[Diagram: global namespace /home/appl/data/web containing important_big_spreadsheet.xls, big_architecture_drawing.ppt and unstructured_big_video.mpg; edge sites running Panache with Scale Out NAS read from the home IBM Scale Out NAS (policy engine, interface nodes, storage nodes; Tier 1: SAS drives, Tier 2: SATA drives).]



Remote site write caching, update home site

A remote user writes a file to the local edge device:
  1. The write is cached locally on disk at the edge
  2. The update is pushed to the home site periodically, or when the network is reconnected

[Diagram: global namespace /home/appl/data/web; edge site running Panache with Scale Out NAS writes locally and then updates the home IBM Scale Out NAS (policy engine, interface nodes, storage nodes; Tier 1: SAS drives, Tier 2: SATA drives).]





Panache Modes

Single Writer (SW)
  Only the cache can write data. The home site cannot change. Other peer caches have to
  be set up as read-only caches.

Read Only (RO)
  The cache can only read data; no data changes are allowed.

Local Update (LU)
  Data is cached from home and changes are allowed as in SW mode, but changes are
  not pushed to home.
  Once data is changed, the relationship is broken, i.e. cache and home are no longer in
  sync for that file.

Change of Modes (see the sketch below)
  SW and RO mode caches can be changed to any other mode.
  An LU cache cannot be changed: too many complications/conflicts to deal with.
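A minimal sketch of a mode change, assuming the cache fileset has to be unlinked before its AFM mode can be switched; the names (fs1, cache1) are made up and the exact mmchfileset parameter handling may differ by release:

  # hypothetical names: convert a single-writer cache to a read-only cache
  mmunlinkfileset fs1 cache1
  mmchfileset fs1 cache1 -p afmMode=read-only
  mmlinkfileset fs1 cache1 -J /gpfs/fs1/cache1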





Pre-fetching

Policy-based pre-population
  Periodically runs a parallel inode scan at home
  Selects files/dirs based on policy criteria
  Includes any user-defined metadata in xattrs or other file attributes
  SQL-like construct to select:
    RULE LIST 'prefetchlist' WHERE FILESIZE > 1GB AND
      MODIFICATION_TIME > CURRENT_TIME - 3600 AND
      USER_ATTR1 = 'sat-photo' OR USER_ATTR2 = 'classified'

Cache then pre-fetches the selected objects (see the sketch below)
  Runs asynchronously in the background
  Parallel multi-node prefetch
  Can call out when completed
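A rough sketch of driving such a prefetch from the cache cluster; the policy file, list file and fileset names are hypothetical, and the exact mmafmctl prefetch options vary by release:

  # hypothetical names: build the candidate list with the policy engine, then prefetch it into the cache fileset
  mmapplypolicy fs1 -P prefetch.pol -I defer -f /tmp/prefetch
  mmafmctl fs1 prefetch -j webcache --list-file /tmp/prefetch.list.prefetchlist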





Expiration of Data

Staleness Control
  Defined based on time since disconnection
  Once the cache is expired, no access to the cache is allowed
  Manual expire/unexpire option for the admin (see the sketch below)
    mmafmctl expire/unexpire; ctlcache in SoNAS
  Allowed only for RO mode caches
  Disabled for SW & LU as they are sources of data themselves
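A minimal sketch of the manual control at the GPFS level, assuming the command takes the file system and the cache fileset (names here are made up):

  # hypothetical names: manually expire a read-only cache fileset, then bring it back
  mmafmctl fs1 expire -j ro_cache
  mmafmctl fs1 unexpire -j ro_cache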





Use Case: Central/Branch Office

The central site is where data is created, maintained and updated/changed.
This is typical in customer situations where data is ingested via satellite, data warehousing, etc.
Branch/edge sites can periodically prefetch (via policy) or pull on demand.
Data is revalidated when accessed.
A typical scenario: music sites, where data is maintained at a central location and other sites in various locations pull data into cache and serve it locally.
Customer use cases: BofA, NGA, US Army

[Diagram: HQ primary site (writer) feeds edge sites (readers) via periodic prefetch or on-demand pull.]





Use Case: Independent Writers

Local sites dedicated to researchers within a campus, sharing a dedicated home filesystem but with individual home directories.
  Each site/system has its own fileset, which will be its local cluster.
  This use case often has a central system which holds all home directories; backup/HSM is managed out of it.

A company spread across various countries maintains logs of phone calls/SMS etc. as needed.
  The headquarters needs to maintain logs/records of all calls/SMS from all countries.
  Per each country's requirements, the data needs to be maintained to process any queries received from customers/government etc.
  Typically the company headquarters contains all records from all countries/branches of its office, and each location maintains logs/records for that country and any country it requires.

[Diagram: User A's and User B's home directories (writers) and a backup site sharing the namespace.]



Global Namespace

Three file systems (store1, store2, store3) each hold two filesets locally (/data1 through /data6 in total) and define cache filesets for the filesets whose home is on the other clusters.
Clients on every cluster access the same namespace: /global/data1 through /global/data6.

See all data from any cluster.
Cache as much data as required, or fetch data on demand.



Why build GPFS Native RAID?

Disk rebuilding is a fact of life at petascale
  With 100,000 disks and an MTBF(disk) of 600 Khrs, a rebuild is triggered about
  four times a day (see the arithmetic below)
  A 24-hour rebuild implies four concurrent, continuous rebuilds at all times
  With larger disks, rebuild times grow, increasing the risk of a second disk failure

Disk integrity issues
  Silent I/O drops, etc.
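The four-per-day figure follows directly from the slide's numbers: with a per-disk MTBF of 600,000 hours, 100,000 disks fail at an expected rate of 100,000 / 600,000 = 1/6 disks per hour, i.e. roughly 24 x 1/6 = 4 disk failures, and therefore about 4 rebuilds, per day.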



What is GPFS Native RAID?

Software RAID on the I/O servers
  SAS-attached JBOD
  Special JBOD storage drawer for very dense drive packing
  Solid-state drives (SSDs) for metadata storage

Features
  Auto rebalancing
  Only 2% rebuild performance hit
  Reed-Solomon erasure code, 8 data + 3 parity strips
  ~10^5-year MTTDL for a 100-PB file system
  End-to-end, disk-to-GPFS-client data checksums

[Diagram: NSD servers on the local area network (LAN) serve vdisks built from SAS-attached JBODs.]
Declustered RAID

Data, parity and spare strips are uniformly and independently distributed across the disk array
  [Figure: conventional vs. declustered RAID layout]

Supports an arbitrary number of disks per array
  Not restricted to an integral number of RAID track widths



GPFS Native RAID algorithm

Each block of each file is striped
Two types of RAID
  2-fault and 3-fault tolerant codes (RAID-D2, RAID-D3)
  3- or 4-way replication
  8 + 2 or 8 + 3 parity

2-fault tolerant codes: 8 + 2p Reed-Solomon, or 3-way replication (1 + 2)
3-fault tolerant codes: 8 + 3p Reed-Solomon, or 4-way replication (1 + 3)

For the Reed-Solomon codes, a GPFS block is split into 8 strips plus 2 or 3 redundancy strips; for replication, 1 strip (the GPFS replicated block) is stored with 2 or 3 strips in total. A sketch of how the code is chosen per vdisk follows.
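As a rough illustration of where these codes appear in practice, the erasure code is chosen per vdisk. The stanza lines below are a sketch with made-up names and sizes; exact stanza keywords for mmcrvdisk can differ by release:

  # hypothetical vdisk stanzas: an 8+3p data vdisk and a 3-way-replicated metadata vdisk
  %vdisk: vdiskName=rg1_data raidCode=8+3p blocksize=8m size=200t rg=rg1 da=DA1
  %vdisk: vdiskName=rg1_meta raidCode=3WayReplication blocksize=1m size=4t rg=rg1 da=DA1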



Component Hierarchy

A recovery group can have
  Max 512 disks
  Up to 16 declustered arrays
  At least 1 SSD log vdisk
  Max 64 vdisks

A declustered array can contain up to 128 pdisks
  Smallest is 4 disks
  Must have one large array of >= 11 disks
  Needs 1 or more pdisks' worth of spare space

Vdisks (= NSDs)
  Block size: 1 MiB, 2 MiB, 4 MiB, 8 MiB or 16 MiB

[Diagram: recovery groups contain declustered arrays (DA) of pdisks; vdisks, exported as NSDs, are carved out of the declustered arrays. A sketch of creating these objects follows.]
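A minimal sketch of how the hierarchy might be created bottom-up, assuming stanza-file driven commands; server names, device paths and all stanza values are made up, and exact keywords can vary by release:

  # hypothetical stanza file rg1.stanza: pdisks grouped into a declustered array
  %da:    daName=DA1
  %pdisk: pdiskName=p000 device=/dev/sdb da=DA1
  %pdisk: pdiskName=p001 device=/dev/sdc da=DA1
  # ... one %pdisk line per drive in the JBOD drawer

  # create the recovery group on its primary/backup I/O servers, then carve vdisks out of it
  mmcrrecoverygroup rg1 -F rg1.stanza --servers server1,server2
  mmcrvdisk -F rg1_vdisks.stanza   # vdisk stanzas like those on the previous slide
  # the resulting vdisks are then defined as NSDs and added to a GPFS file system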



Independent Filesets

Independent filesets (see the sketch below)
  Own inode space; dynamic expansion of inodes
  Efficient file management operations
  Fileset-level snapshots
  Per-user/group quotas per fileset
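A small sketch of creating an independent fileset and taking a fileset-level snapshot; the file system, fileset and snapshot names are made up, and option spellings may differ slightly by release:

  # hypothetical names: independent fileset with its own inode space, then a fileset snapshot
  mmcrfileset fs1 projA --inode-space=new --inode-limit=1000000
  mmlinkfileset fs1 projA -J /gpfs/fs1/projA
  mmcrsnapshot fs1 projA_snap1 -j projA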



What's new in 3.5?

New event callbacks
  Tiebreaker callback to let the customer decide which side survives in case of a network
  partition
  diskDown to ensure the desired action is taken when a disk goes down

Performance enhancements
  NSD multi-queue
    Provides more pipelining and parallelism in I/O scheduling
    Better I/O performance in large SMP configs
  Data in inode for small files
  Striped log files provide balanced disk usage in small clusters

ILM enhancements (see the sketch below)
  Scope option allows scans to be limited to a fileset, filesystem or inode space
  choice-algorithm (best, exact, fast)
  split-margin specifies how much deviation is allowed when using the fast choice
  algorithm, in terms of THRESHOLD usage etc.
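For illustration, the ILM options above surface on the policy scan command roughly as follows (the fileset path and policy file name are made up; check the release documentation for exact syntax):

  # hypothetical invocation: scan only one fileset and use the fast, approximate candidate selection
  mmapplypolicy /gpfs/fs1/projA -P tiering.pol --scope fileset --choice-algorithm fast --split-margin 5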



Misc

Snapshot clones (see the sketch below)
  Quick, efficient way of making a file copy by creating a clone
  Doesn't copy data blocks (e.g. fits well with VM images)
IPv6 support
Windows
  GPFS daemon no longer needs SUA (SUA is still required for GPFS admin commands)
SELinux support
API to access the xattrs of a file
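A brief sketch of the clone workflow with the mmclone command; the file names are made up:

  # hypothetical file names: freeze a read-only clone parent, then create a writable copy-on-write clone
  mmclone snap vm_master.img vm_master.snap
  mmclone copy vm_master.snap vm_clone1.img
  mmclone show vm_master.snap vm_clone1.img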

