
J.E.D.I.

1 ZFS
1.1 Objectives

This chapter gives an overview of the Solaris ZFS file system. It discusses the features of ZFS
as well as the basic ZFS commands.

1.2 Chapter Outline

This chapter is divided into the following sections:
• Overview
• Capacity
• Pooled Storage
• Data Integrity
• Mirroring and RAID-Z
• Administration

1.3 Overview

ZFS was developed by a team led by Jeff Bonwick as a new file system for Solaris. Announced in
September 2004, it was integrated into the official build of OpenSolaris in November 2005 and
distributed with the 6/06 update of Solaris 10 in June 2006.
ZFS aims to be the "last word in file systems". Its developers describe it on the Solaris
community webpage as "a new kind of file system that provides simple administration,
transactional semantics, end-to-end data integrity, and immense scalability."1
This chapter discusses these features in detail and shows how they improve on the file systems
in common use today.

1.4 Capacity

A file system's capacity can be described by the number of bits the system uses to store
information about files. Current file systems are 64-bit systems, meaning that 64-bit values
are used to store each file's information, such as its location on a device, file permissions,
directory contents, and so on. Being a 64-bit file system implies a theoretical maximum file
system size of 2^64 bytes. This is approximately 1.8 × 10^10 GB, or nearly 200 million hard
disks (of 100 GB each).
This is only a theoretical maximum. The following are the limits of some well-known file
systems.
File system      Operating System      Capacity
FAT16            DOS-Win3.11           2 GB
FAT32            Win95-WinME           8 TB (8000 GB)
ext3             Linux                 32 TB (32000 GB)
NTFS             WinXP-Vista           2 TB (2000 GB)

1 http://www.opensolaris.org/os/community/zfs/ taken Aug 21 2007.

This may seem large enough, but some computer companies already have data in the petabyte
range, approximately 10^6 GB worth of storage space. If the trend predicted by Moore's law
continues, we can expect the 64-bit limit to be pushed by the middle of the next decade.2
Worldwide storage has already exceeded 161 exabytes (an exabyte is 10^18 bytes) and is
forecast to reach 988 exabytes by 2010.3 In other words, the world's combined storage has
already gone past the roughly 18 exabyte (1.8 × 10^10 GB) theoretical limit of a single 64-bit
file system.
ZFS is the first 128-bit file system. Being a 128-bit file system means that it can theoretically
store 2^128 bytes. This is about 3 × 10^29 GB, more than a billion billion times the total size
of the world's storage today. The developers have in fact computed, using the physical limits of
computation, that fully populating a 128-bit file system would require a storage device with a
minimum mass of 136 billion kilograms, and filling it would take more than enough energy to
boil the oceans away.4
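If you want to check these powers of two yourself, they are easy to reproduce with the bc
calculator that ships with Solaris (any system with bc will do); dividing the first result by 10^9
gives the roughly 1.8 × 10^10 GB figure quoted above:

# echo '2^64' | bc
18446744073709551616
# echo '2^128' | bc
340282366920938463463374607431768211456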

1.5 Pooled Storage

1.5.1 The New Hard Disk Problem

If you have had your computer for long enough (and connected to the internet, no doubt), you
will have run into the problem of running out of disk space. If you considered buying a new
hard disk, you would have two strategies.
First, you might attach the new device as a secondary hard disk (like a drive E: in Windows, or
mounted as a directory in Linux/Solaris systems). You would then place big files or big
applications on the new hard disk. In that case, users have to remember which drive their files
are on, and applications start from different directories, leaving your filesystem disorganized.
To avoid that problem, perhaps you would simply copy the contents of the old hard disk to the
new one (or at least the user documents, as these are what use up the most space). However,
doing this has its own problems. What if a user or an application can no longer find a file
because it has been moved to another drive (no such problem in Linux and Solaris systems,
where the new disk can be mounted at the old path)? What if some files were left behind?
In the end, adding a new hard disk to your computer causes a good amount of headaches due
to reorganization.

1.5.2 ZFS' answer: Pooled Storage

Other file systems exist on a single device, and having multiple devices means having a
volume manager. Volume managers show the different devices as "drives" in Windows
systems. In Linux systems, a single root directory spans all the devices, and devices (or
directories on those devices) can be mounted as directories under the root directory.
ZFS, on the other hand, uses a pooled storage system. When you set up a ZFS pool, you can
assign one, two, or more devices to it. The combined capacity of all the devices is accessible
through the pool.

2 http://blogs.sun.com/bonwick/entry/128_bit_storage_are_you taken Aug 22 2007.
3 EMC Corporation. The Expanding Digital Universe: A Forecast of Worldwide Information
Growth Through 2010.
4 Jeff Bonwick. 128-bit storage: are you high? (see note 2).
This pool can then be mounted as a directory on the regular Solaris filesystem. From the user's
standpoint, you are simply saving to another directory in the Solaris system. Internally,
however, the data you save may be written to one device and extended onto the next device if
space is lacking. It could even be mirrored on another disk or distributed across several hard
disks for greater redundancy. The user does not have to worry about where and how the data
is stored.
To put it in simple terms, ZFS "does to storage what virtual memory did to RAM".5 As was
discussed in a previous chapter, virtual memory hides from applications the details of how they
are stored in memory. An application does not know where it resides in physical memory,
whether it has actually been allocated contiguous memory, or whether it is currently not in
memory at all but temporarily stored on the hard disk. Files in ZFS are the same: the user is
not made aware of how a file is actually stored, only that it can be accessed in a given
directory.
As for the problem we discussed earlier, adding a new device in ZFS means just adding it to
the storage pool. Your computer automatically gains the additional capacity without you having
to transfer files or reorganize your filesystem. You can have a maximum of 2^64 devices in a
single storage pool.
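As a quick illustration of how a pool grows, the following sketch uses small file-backed devices
so that no spare hard disks are needed (the file names and the pool name demo are only
examples):

# mkfile 100m /tmp/disk1 /tmp/disk2
# zpool create demo /tmp/disk1
# zpool add demo /tmp/disk2
# zpool list

After the zpool add, zpool list shows the pool with the combined capacity of both backing files,
and anything saved under the pool's mount point can immediately use the extra space.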

1.6 Data Integrity

Secondary storage is far from a reliable means of keeping your data intact. Problems can occur
at any time, destroying information.
Bit rot occurs when parts of the magnetic medium of your hard disk fail due to simple wear and
tear. A phantom write occurs when the hard disk claims to have written data but actually has
not. The hard disk may also accidentally read or write data from the wrong portion of the
platter. These and other faults may cause your data to suddenly become unreadable.
ZFS has many features that aim to maintain the integrity of data.

1.6.1 End-to-end Checksums

Filesystems store information in blocks. Traditional file systems append checksums to these
blocks in order to detect errors. This catches bit rot, which shows up when the data and the
checksum no longer match.

However, what happens when correct data and a correct checksum are placed on the wrong
portion of the hard disk? A per-block checksum cannot tell; this scheme can detect bit rot and
nothing more.

5 Jeff Bonwick. ZFS: The Last Word in File Systems. Sun Microsystems

ZFS goes a step further by placing a checksum at each level of the block tree.
Placing checksums all the way up through the parent blocks ensures that each data block is
consistent, that each block sits where it is supposed to be, and that the pool as a whole is
consistent. Any operation that results in a bad checksum anywhere along the tree means that
the pool is in an inconsistent state and requires correction.


ZFS also separates the data from its checksum. In traditional file systems, where the data and
checksum are placed in the same block, there is a chance that a hard disk error could modify
both data and checksum in such a way that the system can no longer determine whether there
was an error. By physically separating data and checksum, a hard disk error is likely to affect
only the data or only the checksum, making the error easier to detect.
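The checksum algorithm is exposed as an ordinary ZFS property, which can be inspected or
changed with the zfs command discussed later in this chapter (mypool here is just an example
pool name):

# zfs get checksum mypool
# zfs set checksum=sha256 mypool

By default ZFS uses a fletcher checksum; sha256 is a stronger but more computationally
expensive alternative.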

1.6.2 Disk scrubbing

To make sure that data remains consistent, ZFS checks data blocks against their checksums in
a process known as disk scrubbing. A scrub goes through the entire disk, making sure that the
data matches the checksums. In case of errors, ZFS can repair the information automatically
using redundant copies, for example from a mirror, which we will discuss in a later section.
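A scrub is started on demand with the zpool command (mypool is again just an example pool
name); zpool status then reports the scrub's progress and any errors that were found and
repaired:

# zpool scrub mypool
# zpool status mypool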

1.6.3 Copy-on-write Transactional I/O

Have you ever experienced a power failure while you were saving a very important document?
To your horror, you find that the document's file is corrupted and you have to start from
scratch.
The file was corrupted because it was left in an inconsistent state: it consists of data blocks
from the new version as well as data blocks from the old version, which would have been
overwritten if not for the power failure.

Some traditional filesystems make use of a journaling system. All intended I/O operations are
first written into the file system journal before being executed. This ensures that in the event
of a power failure or other error, the system can simply replay the operations recorded in the
journal until the file system is once more in a consistent state. Operations are said to be
atomic: either they are all executed successfully (or replayed from the journal in the event of a
crash), or they did not happen at all.
Journaling, however, slows down I/O because of the extra step of recording every operation
before executing it. Over a long period of time, the journal itself also takes up significant space
in the file system.
ZFS goes the extra step by implementing copy-on-write transactional I/O. No live data is ever
overwritten by ZFS; any change takes place on a copy of the data, so a system failure affects
only the copy. In the event of a failure, the file system still has the original, consistent data
blocks that existed before the operations began. Only when the changes reach the root block
are they committed to the system. Either all of the grouped I/O operations take effect, or the
transaction did not happen at all.


1.6.4 Linear time snapshots

As a side effect of the copy-on-write system, file system snapshots come almost for free.
Snapshots are a copy of the file system at some point in the past and are used for backup
purposes. Because every operation in ZFS writes new blocks rather than overwriting the old
ones, the old version of the data is already on disk; keeping it as a snapshot is actually faster
than overwriting the old data, which would require an extra step.


1.7 Mirroring and RAID-Z

1.7.1 Mirroring

ZFS allows for nearly effortless setup of a mirrored file system. A mirrored file system uses a
second hard disk to completely replicate the data of the first hard disk. Mirrors are often used
for redundancy: if the first hard disk fails, the system still operates with the data from the
second hard disk. Mirrors also speed up reading, since data can be retrieved from the second
hard disk while the first is busy.
Traditional mirroring implementations cannot detect bad blocks. Even though a backup copy
exists in the mirror, such a system is unable to tell whether a block has been corrupted in any
way. Because ZFS checksums every block, it silently determines when a block has gone bad. If
one has, it automatically retrieves the correct block from the second disk and repairs the bad
block on the first disk. This is done silently, without the user ever being bothered by the
problem.
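As a sketch of how little effort this takes in practice (the pool and device names below are only
examples), an existing single-disk pool can be turned into a mirror by attaching a second disk,
and a failed disk can later be swapped out with zpool replace:

# zpool attach mypool c1d0 c1d1
# zpool replace mypool c1d0 c2d0
# zpool status mypool

zpool attach copies ("resilvers") the existing data onto the new disk in the background, and
zpool status shows when the mirror is fully synchronized.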

1.7.2 RAID-Z

RAID stands for Redundant Array of Inexpensive Disks. Setting up a RAID system means
adding more hard disks and following a particular scheme (called a RAID level) for how
information is stored among the disks.
The traditional RAID levels are numbered RAID 0 through RAID 5.
In RAID 0, data is simply striped over multiple disks. A stripe consists of several data blocks
joined together. This has a performance advantage, as reads can be done in parallel. However,
it does not provide any kind of data security: the failure of one disk means the loss of all data.
In RAID 1, data is mirrored, or completely duplicated, on a second disk (or on several disks).
Reads can also be done in parallel, and the data is protected, since if one disk fails it can still
be read from the second disk. However, this RAID level is expensive, as you need twice the
amount of hard disk to store a given amount of data.
In RAID 2, data is written one bit at a time per disk, with a special disk dedicated to storing
parity information for data recovery. This level is not used in practice, as striping per bit is very
impractical.
Like RAID 2, RAID 3 divides the data among the disks with a dedicated parity disk; however,
the data this time is divided into stripes.
RAID 4 divides data into file system data blocks instead of striping a single block among
multiple disks. A dedicated parity disk contains information that can be used to detect and
correct errors on the data disks.
The problem with RAID 4 is that any write to a data block also means recomputing the parity.
Every write therefore involves two writes, the data block and the new parity block, which
creates a bottleneck at the parity disk. RAID 5 distributes the parity across the disks, allowing
faster writes.
The remaining problem with RAID 4 and 5 is that whenever a data block is written, the other
data blocks in the stripe have to be read in order to recompute the parity block, which slows
writes down. Also, data can become corrupted if a power failure occurs while the parity block is
being written: since the new parity block was not written correctly, the data blocks no longer
match the parity block, and the array incorrectly concludes that the data is corrupted. This is
known as the "write hole".
ZFS includes RAID-Z, a modified RAID 5 implementation. RAID-Z makes each file system data
block its own stripe. This way, recomputing the parity requires only that single data block;
there is no need to read any other block. And because file system operations are transaction
based, the write hole is avoided: either the entire stripe (including the parity) gets written, or
it does not get written at all.
Dynamic striping is an additional feature of RAID-Z. In old RAID implementations the number
of hard disks is fixed once the system is set up, because stripe sizes are fixed. In ZFS, when a
new disk is added, all new data is striped to make use of the new disk. There is no need to
migrate the old data; over time, ZFS migrates the old data into the new stripe format. This
migration is done automatically and behind the scenes.

1.8 ZFS Administration

1.8.1 Disk naming convention

Before we can discuss ZFS administration, we must first be familiar with the disk notation used
in Solaris.
All devices in Solaris are represented as files. These files are stored in the directory /devices.
# ls /devices
iscsi                iscsi:devctl         options              pci@1f,2000
pci@1f,2000:devctl   pci@1f,2000:intr     pci@1f,2000:reg      pci@1f,4000
pci@1f,4000:devctl   pci@1f,4000:intr     pci@1f,4000:reg      pseudo
pseudo:devctl        scsi_vhci            scsi_vhci:devctl
These are all the devices attached to the system, including the keyboard, monitor, USB devices
and the like. To differentiate hard disks (including CD drives) from other devices, Solaris
provides the separate directories /dev/dsk and /dev/rdsk for disk devices.
If you list the contents of /dev/dsk, you will see files following a particular format: c#d#s#,
c#t#d#s#, or c#d#p#. These strings describe the complete address of a disk slice.
# ls /dev/dsk
c0d0p0     c0d0p1     c0d0p2     c0d0p3     c0d0p4     c0d0s0
c0d0s1     c0d0s10    c0d0s11    c0d0s12    c0d0s13    c0d0s14
c0d0s15    c0d0s2     c0d0s3     c0d0s4     c0d0s5     c0d0s6
c0d0s7     c0d0s8     c0d0s9     c1t0d0p0   c1t0d0p1   c1t0d0p2
c1t0d0p3   c1t0d0p4   c1t0d0s0   c1t0d0s1   c1t0d0s10  c1t0d0s11
c1t0d0s12  c1t0d0s13  c1t0d0s14  c1t0d0s15  c1t0d0s2   c1t0d0s3
c1t0d0s4   c1t0d0s5   c1t0d0s6   c1t0d0s7   c1t0d0s8   c1t0d0s9
c1t1d0p0   c1t1d0p1   c1t1d0p2   c1t1d0p3   c1t1d0p4   c1t1d0s0
c1t1d0s1   c1t1d0s10  c1t1d0s11  c1t1d0s12  c1t1d0s13  c1t1d0s14
c1t1d0s15  c1t1d0s2   c1t1d0s3   c1t1d0s4   c1t1d0s5   c1t1d0s6
c1t1d0s7   c1t1d0s8   c1t1d0s9   c1t2d0p0   c1t2d0p1   c1t2d0p2
c1t2d0p3   c1t2d0p4   c1t2d0s0   c1t2d0s1   c1t2d0s10  c1t2d0s11
c1t2d0s12  c1t2d0s13  c1t2d0s14  c1t2d0s15  c1t2d0s2   c1t2d0s3
c1t2d0s4   c1t2d0s5   c1t2d0s6   c1t2d0s7   c1t2d0s8   c1t2d0s9
c1t3d0p0   c1t3d0p1   c1t3d0p2   c1t3d0p3   c1t3d0p4   c1t3d0s0
c1t3d0s1   c1t3d0s10  c1t3d0s11  c1t3d0s12  c1t3d0s13  c1t3d0s14
c1t3d0s15  c1t3d0s2   c1t3d0s3   c1t3d0s4   c1t3d0s5   c1t3d0s6
c1t3d0s7   c1t3d0s8   c1t3d0s9   c1t4d0p0   c1t4d0p1   c1t4d0p2
c1t4d0p3   c1t4d0p4   c1t4d0s0   c1t4d0s1   c1t4d0s10  c1t4d0s11
c1t4d0s12  c1t4d0s13  c1t4d0s14  c1t4d0s15  c1t4d0s2   c1t4d0s3
c1t4d0s4   c1t4d0s5   c1t4d0s6   c1t4d0s7   c1t4d0s8   c1t4d0s9
c# represents the controller number. Controllers are numbered c0, c1, c2, and so on; hard
disks are connected to a controller.
d# represents the disk number on that controller.
s# represents the slice number. Slice numbers can go from s0 to s15.
p#, or partition number, is sometimes used instead of a slice number. Partition numbers go
from p0 to p4.
A regular (IDE-based) computer shows four devices on bootup: the primary master, primary
slave, secondary master, and secondary slave devices. Your primary hard disk is more often
than not the primary master device. A CD or DVD drive is often attached as the primary slave
device. If you have additional hard disks, they would be the secondary master and secondary
slave.
For Solaris, the disk notation for these would be:
primary master: c0d0s0 (controller 0, disk 0, slice 0)
primary slave: c0d1s0 (controller 0, disk 1, slice 0)
secondary master: c1d0s0 (controller 1, disk 0, slice 0)
secondary slave: c1d1s0 (controller 1, disk 1, slice 0)
Some computers may have a SCSI interface. SCSI allows up to 16 devices to be attached to a
single controller. Often, the SCSI controller numbering starts from c2 (as c0 and c1 are the
primary and secondary IDE controllers respectively).
SCSI disks use the t# value to indicate the target number of a disk. t# can range from t0 to
t15. SCSI addresses also use the d# notation, but it is usually set to zero (d0). For example,
the files c2t2d0s0 to c2t2d0s15 represent all the slices of the device assigned to target 2 on
controller 2.
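For instance, to see all of the device files that belong to a single disk, you can use a shell
wildcard (the disk name below is taken from the listing shown earlier):

# ls /dev/dsk/c1t2d0*

This prints only that disk's p0 to p4 and s0 to s15 entries.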

1.8.2 Pool administration

Pools are maintained by the zpool command. Subcommands of zpool allow for creation,
deletion, adding of new devices, listing and modification of pools.
To create a basic ZFS pool, you run the zpool create command. Its basic syntax is as follows:
# zpool create <poolname> [<vdev type>] <devices>
Poolname is the name of the pool you are going to create. The optional vdev type (such as
mirror or raidz) describes how the pool should organize its storage; for a plain pool it is simply
omitted. Devices indicate which hard disks you are going to use for the pool. You can create
pools from disk slices (or even files), but usually pools are made from whole disks.
Note that making a disk part of a pool reformats that disk.
For example, the following command creates a basic pool named myfirstpool1 using the
secondary master hard disk:
# zpool create myfirstpool1 c1d0
To create a mirrored pool, add the mirror keyword. Note that you have to provide more than
one disk to create a mirror:
# zpool create mymirroredpool mirror c1d0 c1d1
A collection of disks can be set up in a RAID-Z configuration by using raidz as the vdev type.
Note that the recommended number of disks for RAID-Z is between 3 and 9:
# zpool create myraidzpool raidz c0d1 c1d0 c1d1
Note that ZFS also allows pools to be created from regular files. Instead of specifying a disk,
you can specify a regular file. For example:

# zpool create mypool1 /export/home/alice/mypoolfile

# zpool create mypool2 mirror /export/home/alice/p1 /export/home/alice/p2

You first have to create the pool files with the mkfile command:

# mkfile 1g /export/home/alice/mypoolfile

Here 1g stands for a 1 GB file, 100m for a 100 MB file, and so on.

This feature allows you to test ZFS without needing additional disks. You will use this for the
exercises.
To add a device to an existing pool:
# zpool add myfirstpool c1d0
You can add a mirror or an additional RAID-Z vdev to an existing pool by writing mirror or
raidz, respectively, before the new devices.
The command
# zpool list
lists all the ZFS pools available, along with their space usage information and status. A pool
can have three status values: online, degraded, or faulted. An online pool has all its devices in
working order. A degraded pool has a failed device, but data can still be recovered through
redundancy. A faulted pool has a failed device and data cannot be recovered.
Exporting a pool means preparing its devices for transfer. To export a pool, simply run the
command:
# zpool export mypool
After migrating the devices to a new computer, run the command:
# zpool import mypool
And finally, to destroy a pool, run the command:
#zpool destroy mypool
This destroys a pool, making the devices that used to be part of this pool available for other
uses.
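Putting these commands together, here is a small end-to-end example using file-backed
devices (the file and pool names are only examples), which you can try without any spare
disks:

# mkfile 100m /tmp/d1 /tmp/d2
# zpool create testpool mirror /tmp/d1 /tmp/d2
# zpool list
# zpool export testpool
# zpool import -d /tmp testpool
# zpool destroy testpool

The -d option tells zpool import which directory to search for the file-based devices, since by
default it only looks for real disks under /dev/dsk.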

1.8.3 Basic ZFS pool usage

Once you have created a pool, you can mount it to be part of the regular Solaris file system.
Users simply save to the mounted directory as usual, not knowing that the directory is now
using ZFS. Users are likewise unaware that their directories are mirrored or stored using
RAID-Z; it is all business as usual.
To mount the pool as part of the regular Solaris filesystem, use the command:
# zfs set mountpoint=/target/directory/in/regular/filesystem poolname
For example, we will mount mypool to store user files:
# zfs set mountpoint=/export/home mypool
You can create additional file systems inside mypool. For example, we create file systems for
user1 and user2:
# zfs create mypool/user1
# zfs create mypool/user2
As we have already mounted mypool at /export/home, user1 is automatically mounted as
/export/home/user1 and user2 as /export/home/user2.
Additional options, such as compression, disk quotas, and disk space guarantees, can easily be
enabled with the following commands:
# zfs set compression=on mypool
# zfs set quota=5g mypool/user1
# zfs set reservation=10g mypool/user2
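To confirm how the pool is laid out and which properties are in effect, you can use the zfs list
and zfs get subcommands (the dataset names follow the example above):

# zfs list -r mypool
# zfs get compression,quota,reservation mypool/user1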

1.8.4 Snapshots and clones

As was discussed earlier, ZFS allows for the creation of snapshots. Snapshots are a read-only
copy of the file system at a given point in time, which can be used for backup purposes.
To create a snapshot, simply run the command zfs snapshot, indicating the ZFS file system you
wish to take a snapshot of together with a name for that snapshot. For example, the following
command creates a snapshot of the projects directory of user1. We will name the snapshot
ver3backup.
#zfs snapshot mypool/user1/projects@ver3backup
Due to the way ZFS stores data, snapshots are created instantly and initially require no
additional space. No extra processing is necessary: ZFS simply preserves the original data
blocks whenever changes are made to the target file system.
All snapshots are stored in the .zfs/snapshot directory located in the root of each file system.
This allows users to look at their snapshots themselves. For example, the snapshot we just
created is now available at
/export/home/user1/.zfs/snapshot/ver3backup
which user1 can access without having to be a system administrator.
In addition, you can roll back your file system to a snapshot. Rolling back reverts the file
system to the snapshot, discarding all changes made since the snapshot was taken. For
example, suppose user1 made a lot of errors in version 4 of the project and needs to go back
to the version 3 backup. To revert to the ver3backup snapshot, simply run the command:
# zfs rollback -r mypool/user1/projects@ver3backup
ZFS clones are a writable copy of a snapshot. To create a clone, indicate the snapshot name
and the name of the new file system where the snapshot is to be cloned:
# zfs clone mypool/user1/projects@ver3backup mypool/user1/projects/ver3copy
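To see which snapshots exist, or to view the file systems and clones under a user's directory,
zfs list can filter by type or list recursively (the names again follow the running example):

# zfs list -t snapshot
# zfs list -r mypool/user1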
There are many more zfs commands. To find out additional zfs command options and how to
use them, simply run the man command on zfs.
#man zfs
