
Improving Performance of a Distributed File System Using OSDs and Cooperative Cache

PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF B.Sc(H) Computer Science

Hans Raj College, University of Delhi, Delhi 110 007, India

Submitted by: Parvez Gupta Roll No. - 6010027

Varenya Agrawal Roll No. - 6010044

Certificate
This is to certify that the project work entitled Improving Performance of a Distributed File System Using OSDs and Cooperative Cache being submitted by Parvez Gupta and Varenya Agrawal, in partial fulfillment of the requirement for the award of the degree of B.Sc (Hons) Computer Science, University of Delhi, is a record of work carried out under the supervision of Ms. Baljeet Kaur at Hans Raj College, University of Delhi, Delhi.

It is further certied that we have not submitted this report to any other organization for any other degree.

Parvez Gupta Roll No: - 6010027 Varenya Agrawal Roll No: - 6010044

Project Supervisor Ms. Baljeet Kaur

Principal Dr. S.R. Arora Hans Raj College University of Delhi

Dept. of Computer Science Hans Raj College University of Delhi

Acknowledgment
We would sincerely like to thank Ms. Baljeet Kaur for her invaluable support and guidance in carrying out this project to successful completion. We would also like to thank the Head of the Computer Science Department, Ms. Harmeet Kaur, who was always there with her invaluable knowledge and experience that helped us greatly during the research work. We would further like to extend our gratitude and special thanks to Mr. I.P.S. Negi, Mr. Sanjay Narang and Ms. Anita Mittal for their help in the computer laboratory. Lastly, we would like to thank all our friends and well-wishers who directly or indirectly influenced the successful compilation of this project.

Table of Contents

List of Figures

Chapter 1 Introduction
  1.1 Background
  1.2 About the Work

Chapter 2 z-Series File System
  2.1 Prominent Features
  2.2 Architecture
    2.2.1 Object Store
    2.2.2 Front End
    2.2.3 Lease Manager
    2.2.4 File Manager
    2.2.5 Cooperative Cache
    2.2.6 Transaction Server

Chapter 3 Cooperative Cache
  3.1 Working of Cooperative Cache
  3.2 Cooperative Cache Algorithm
    3.2.1 Node Failure
    3.2.2 Network Delays
  3.3 Choosing the Proper Third Party Node
  3.4 Pre-fetching Data in zFS

Chapter 4 Testing
  4.1 Test Environment
  4.2 Comparing zFS and NFS

Conclusion
Bibliography

List of Figures
Figure 1: zFS Architecture
Figure 2: Delayed Move Notification Messages
Figure 3: System configuration for testing zFS performance
Figure 4: System configuration for testing NFS performance
Figure 5: Performance results for large server cache
Figure 6: Performance results for small server cache

Chapter 1 Introduction

1.1 Background
As computer networks started to evolve in the 1980s, it became evident that the old file systems had many limitations that made them unsuitable for multiuser environments.

In the beginning, many users started to use FTP to share files. Although this method avoided the time-consuming physical movement of removable media, files still needed to be copied twice: once from the source computer onto a server, and a second time from the server onto the destination computer. Additionally, users had to know the physical addresses of every computer involved in the file-sharing process.

As computer companies tried to solve the shortcomings above, distributed file systems were developed and new features such as file locking were added to existing file systems. The new systems were not replacements for the old file systems, but an additional layer between the disk file system and the user processes.

In a Distributed File System (DFS), a single file system can be distributed across several physical computer nodes. Separate nodes have direct access to only a part of the entire file system. With DFS, system administrators can make files distributed across multiple servers appear to users as if they reside in one place on the network.

zFS (z-Series File System), a distributed file system developed by IBM, is used in the z/OS operating system. zFS evolved from the DSF (Data Sharing Facility) project, which aimed at building a server-less file system that distributes all aspects of file and storage management over cooperating machines interconnected by a fast switched network. zFS was designed to achieve a scalable file system that operates equally well on a few machines or on thousands, and in which the addition of new machines leads to a linear increase in performance.

1.2 About the Work


This work describes a cooperative cache algorithm used in zFS, which can withstand network delays and node failures. The work explores the effectiveness of this algorithm and of zFS as a file system. This is done by comparing the system's performance to NFS using the IOZONE benchmark. The researchers have also investigated whether using a cooperative cache results in better performance, despite the fact that the object store devices have their own caches. Their results show that zFS performs better than NFS when the cooperative cache is activated, and that zFS provides better performance even though the OSDs have their own caches. They have also demonstrated that using pre-fetching in zFS increases performance significantly. Thus, zFS performance scales well as the number of participating clients increases.

There are several other related works that have researched cooperative caching in network file systems. Another file system, xFS, uses a central server to coordinate between the various clients, and the load of the server increases as the number of clients increases. Thus, the scalability of xFS is limited by the strength of the server. However, xFS is more scalable than AFS and NFS due to four different caching techniques it uses, which contribute significantly to load reduction.

There are three major differences between the zFS architecture and the xFS architecture:

zFS does not have a central server and the management of files is distributed among several file managers. There is no hierarchy of cluster servers; if two clients work on the same file they interact with the same file manager.

In zFS, caching is done on a per-page basis rather than on whole files. This increases sharing, since different clients can work on different parts of the same file.

In zFS, no caching is done on the local disk.

Thus, zFS is more scalable because it has no central server, and file managers can be added or removed dynamically to respond to load changes in the cluster. Moreover, performance is better due to zFS's stronger sharing capability. Since there is no central server to become a bottleneck, all control information is exchanged directly between clients and file managers, the set of file managers dynamically adapts itself to the load on the cluster, and clients pass data among themselves (in cooperative cache mode).

Chapter 2 z-Series File System

2.1 Prominent Features


zFS is a scalable file system which uses Object Store Devices (OSDs) and a set of cooperative machines for distributed file management. These are its two most prominent features.

zFS integrates the memory of all participating machines into one coherent cache. Thus, instead of going to disk for a block of data that is already in one of the machines' memories, zFS retrieves the data block from the remote machine. To maintain file system consistency, zFS uses distributed transactions and leases to implement metadata operations and coordinate shared access to data. zFS achieves its high performance and scalability by avoiding group-communication mechanisms and clustering software, using distributed transactions and leases instead.

The design and implementation of zFS is aimed at achieving a scalable file system beyond those that exist today. More specifically, the objectives of zFS are:

A file system that operates equally well on a few or on thousands of machines

Built from off-the-shelf components with object disks (ObSs)

Makes use of the memory of all participating machines as a global cache to increase performance

The addition of machines leads to an almost linear increase in performance

zFS achieves scalability by separating storage management from file management and by dynamically distributing file management. Having ObSs handle storage management implies that functions usually handled by file systems are performed in the ObS itself and are transparent to the other components of zFS. The Object Store recognizes only objects, which are sparse streams of bytes; it does not distinguish between files and directories. It is the responsibility of the file system management to handle them correctly.

2.2 Architecture
zFS has six components: a Front End (FE), a Cooperative Cache (Cache), a File Manager (FMGR), a Lease Manager (LMGR), a Transaction Server (TSVR), and an Object Store (ObS). These components work together to provide applications or users with a distributed file system. The following subsections describe the functionality of each component and how it interacts with the other components.

2.2.1 Object Store


The object disk (ObS) is the storage device on which files and directories are created, and from which they are retrieved. The ObS API enables creation and deletion of objects (files), and writing and reading byte ranges from an object. Object disks provide file abstractions, security, safe writes and other capabilities. Using object disks allows zFS to focus on management and scalability issues, while letting the ObS handle the physical disk chores of block allocation and mapping.


2.2.2 Front End


The zFS front end (FE) runs on every workstation on which a client wants to use zFS. It presents the client with the standard POSIX file system API and provides access to zFS files and directories.

2.2.3 Lease Manager


The need for a Lease Manager (LMGR) stems from the following facts. File systems use one form or another of locking mechanism to control access to the disks in order to maintain data integrity when several users work on the same files. To work in SAN file systems, where clients can write directly to object disks, the ObSs themselves have to support some form of locking; otherwise, two clients could damage each other's data.

In distributed environments, where network connections and even machines themselves can fail, it is preferable to use leases rather than locks. Leases are locks with an expiration period that is set up in advance. Thus, when a machine holding a lease on a resource fails, we are able to acquire a new lease after the lease of the failed machine expires. Obviously, the use of leases incurs the overhead of lease renewal on the client that acquired the lease and still needs the resource.
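To make the idea concrete, here is a minimal Python sketch (illustrative names only, not zFS code) of a lease as a lock with an expiration period that its holder must keep renewing:

```python
# Minimal sketch of a lease as "a lock with an expiration period" (hypothetical
# names; not the actual zFS implementation). A holder must renew before the
# lease expires; if the holder crashes, the resource frees itself once the
# expiry time passes, so no manual recovery is needed.
import time

class Lease:
    def __init__(self, resource, holder, duration_s):
        self.resource = resource
        self.holder = holder
        self.duration_s = duration_s
        self.expires_at = time.time() + duration_s

    def renew(self):
        # Called periodically by a holder that still needs the resource.
        self.expires_at = time.time() + self.duration_s

    def expired(self):
        return time.time() >= self.expires_at

def try_acquire(current_lease, resource, new_holder, duration_s):
    """Grant a new lease only if the resource is free or the old lease expired."""
    if current_lease is None or current_lease.expired():
        return Lease(resource, new_holder, duration_s)
    return None  # still held by a live holder; the caller retries later

# Usage: node A takes a lease and crashes without releasing it; node B can
# acquire the same resource once A's lease has expired.
lease_a = try_acquire(None, "object-42", "nodeA", duration_s=0.1)
time.sleep(0.2)                       # A "fails" and never renews
lease_b = try_acquire(lease_a, "object-42", "nodeB", duration_s=0.1)
assert lease_b is not None
```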

To reduce the overhead on the ObS, the following mechanism is used: each ObS maintains one major lease for the whole disk, and each ObS has one lease manager (LMGR) which acquires and renews the major lease. Leases for specific objects (files or directories) on the ObS are managed by the ObS's LMGR. Thus, the majority of the lease management overhead is offloaded from the ObS, while the ability to protect data is still maintained. The ObS stores in memory the network address of the current holder of the major lease. To find out which machine is currently managing a particular ObS O, a client simply asks O for the network address of its current LMGR.

After acquiring the major lease, the lease manager grants exclusive leases on objects residing on the ObS. It also maintains in memory the current network address of each object-lease owner, which allows file managers to be looked up. Any machine that needs to access an object obj on ObS O first figures out who O's LMGR is. If one exists, the object lease for obj is requested from that LMGR. If one does not exist, the requesting machine creates a local instance of an LMGR to manage O for it.
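The discovery flow just described can be sketched as follows; this is a simplified in-memory model with hypothetical names, not the actual zFS protocol:

```python
# Sketch of lease-manager discovery (in-memory stand-ins, not the real
# protocol). Each ObS remembers who currently holds its major lease; a machine
# that wants an object lease asks the ObS for that address, and spins up a
# local lease manager only if nobody holds the major lease yet.
class ObjectStore:
    def __init__(self):
        self.major_lease_holder = None     # network address of the current LMGR

def get_object_lease(obs, obj_id, my_address, lmgrs):
    lmgr = obs.major_lease_holder
    if lmgr is None:
        # No LMGR yet: become one by taking the major lease for the whole disk.
        obs.major_lease_holder = my_address
        lmgrs[my_address] = {"object_leases": {}}
        lmgr = my_address
    leases = lmgrs[lmgr]["object_leases"]
    if obj_id in leases and leases[obj_id] != my_address:
        raise RuntimeError("object lease held exclusively by another machine")
    leases[obj_id] = my_address            # LMGR grants the exclusive lease
    return lmgr

# Usage: the first machine to touch the ObS becomes its lease manager; a
# second machine is then directed to that same manager for object leases.
obs, lmgrs = ObjectStore(), {}
assert get_object_lease(obs, "obj-1", "nodeA", lmgrs) == "nodeA"
assert get_object_lease(obs, "obj-2", "nodeB", lmgrs) == "nodeA"
```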

2.2.4 File Manager


Each opened file in zFS is managed by a single file manager assigned to the file when it is opened. The set of all currently active file managers manages all opened zFS files. Initially, no file has an associated file manager (FMGR). The first machine to open a file creates an instance of a file manager for that file; for better performance, this instance is created locally on that machine. From then on, and until that file manager is shut down, each lease request for any part of the file is mediated by that FMGR.

The FMGR keeps track of each completed open() and read() request, and maintains in internal data structures the information about where each file's blocks reside. When an open() request arrives at the file manager, it checks whether the file has already been opened by another client (on another machine). If not, the FMGR acquires the proper exclusive lease from the lease manager and directs the request to the object disk. If the requested data resides in the cache of another machine, the FMGR directs the cache on that machine to forward the data to the requesting cache.


The file manager interacts with the lease manager of the ObS where the file resides to obtain an exclusive lease on the file. It also creates and keeps track of all range leases it distributes. These leases are kept in internal FMGR tables and are used to control and provide proper access to files by the various clients.

2.2.5 Cooperative Cache


The cooperative cache (Cache) of zFS is a key component in achieving high scalability. Due to the fast increase in network speeds, it now takes less time to retrieve data from another machine's memory than from a local disk, and this is where a cooperative cache is useful. When a client on machine A requests a block of data via FE_A, and the file manager (FMGR_B, on machine B) realizes that the requested block resides in the cache of machine M, Cache_M, it sends a message to Cache_M to send the block to Cache_A and updates the information on the location of that block in FMGR_B.

The cache on A then receives the block, updates its internal tables (for future accesses to the block) and passes the data to FE_A, which passes it to the client.

2.2.6 Transaction Server


In zFS, directory operations are implemented as distributed transactions. For example, a create-file operation includes, at the very least, (a) creating a new entry in the parent directory, and (b) creating a new file object. Each of these operations can fail independently, and the initiating host can fail as well. Such occurrences can corrupt the file system. Hence, each directory operation is protected inside a transaction, so that in the event of failure the consistency of the file system can be restored by rolling the transaction either forward or backward.

The most complicated directory operation is rename(). This requires, at the very least, (a) locking the source directory, target directory, and file (to be moved), (b) creating a new directory entry at the target, (c) erasing the old entry, and (d) releasing the locks.

Since such transactions are complex, zFS uses a special component to manage them: a transaction server (TSVR). The TSVR works on a per-operation basis. It acquires all required leases and performs the transaction. The TSVR attempts to hold onto acquired leases for as long as possible and releases them only for the benefit of other hosts.
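As an illustration of the roll-backward behaviour described above, the following Python sketch models a rename() transaction over in-memory stand-ins (plain dicts and a lease set); it is not the actual TSVR code:

```python
# Illustrative sketch of rename() as a transaction with rollback (in-memory
# stand-ins, not the actual TSVR). Directories are plain dicts; leases are a
# set of currently held resources. If a step fails, completed steps are undone.
class RenameError(Exception):
    pass

def rename(held_leases, src_dir, dst_dir, name, inject_failure=False):
    # Step (a): lock source directory, target directory and the file itself.
    for resource in ("src", "dst", name):
        held_leases.add(resource)
    undo = []
    try:
        dst_dir[name] = src_dir[name]                 # step (b): new entry
        undo.append(lambda: dst_dir.pop(name))
        if inject_failure:
            raise RenameError("simulated crash between steps (b) and (c)")
        del src_dir[name]                             # step (c): erase old entry
    except Exception:
        for action in reversed(undo):                 # roll the transaction backward
            action()
        raise
    finally:
        for resource in ("src", "dst", name):         # step (d): release the locks
            held_leases.discard(resource)

# Usage: a failure after step (b) leaves both directories as they started.
src, dst, leases = {"report.txt": "obj-7"}, {}, set()
try:
    rename(leases, src, dst, "report.txt", inject_failure=True)
except RenameError:
    pass
assert "report.txt" in src and "report.txt" not in dst and not leases
```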


Figure 1: zFS Architecture

Each participating node runs the front end and the cooperative cache. Each OSD has only one lease manager associated with it. Several file managers and transaction managers run on various nodes in the cluster.

Chapter 3 Cooperative Cache


3.1 Working of Cooperative Cache


In zFS, the cooperative cache is integrated with the Linux kernel page cache for two main reasons. First, the operating system does not have to maintain two separate caches with different cache policies that may interfere with each other. Second, it provides comparable local performance between zFS and the other local file systems supported by Linux, all of which use the kernel page cache. As a result, the researchers achieved the following:

The kernel invokes page eviction according to its internal algorithm when free memory is low; there is no need for a special zFS mechanism to detect it.

Caching is not done on whole files but on a per-page basis.

The pages of zFS and of other file systems are treated equally by the kernel algorithm, regardless of the file system type, leading to fairness between the file systems.

When a file is closed, its pages remain in the cache until memory pressure causes the kernel to discard them.

When eviction is invoked and a zFS page is the candidate for eviction, the decision is passed to a specific zFS routine, which decides whether to forward the page to the cache of another node or to discard it.

The implementation of the zFS page cache supports the following optimizations:

An application using a zFS file to write a whole page acquires only the write lease, since no read from the OSD is needed. If one application or user on a machine holds a write lease, all other applications and users on that machine can read and write the page using the same lease, without asking the file manager for another lease. The kernel then checks the permission to read or write based on the mode (read, write, or both) specified when the file was opened; if the mode bits allow the operation, zFS allows it.

When a client has a write lease and another client requests a read lease for the same page, the page is written to the object store device if it has been modified, and the lease on the first client is downgraded from a write lease to a read lease without discarding the page. This increases the probability of a cache hit by a client requesting the same page, thus increasing performance.
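The write-lease downgrade described above can be sketched as follows (a hypothetical in-memory model; the real logic lives in the modified kernel):

```python
# Sketch of the write-lease downgrade (hypothetical in-memory model, not
# kernel code). When another client asks to read a page that a first client
# holds with a write lease, the dirty page is flushed to the OSD and the first
# client's lease is downgraded to read instead of being revoked, so the page
# stays cached on both clients.
class CachedPage:
    def __init__(self, data):
        self.data = data
        self.lease = "write"
        self.dirty = False

def handle_read_request(osd, page_id, holder_cache, requester_cache):
    page = holder_cache[page_id]
    if page.lease == "write":
        if page.dirty:
            osd[page_id] = page.data       # write back before sharing
            page.dirty = False
        page.lease = "read"                # downgrade, do not discard
    # Forward a read-leased copy to the requesting client.
    copy = CachedPage(page.data)
    copy.lease = "read"
    requester_cache[page_id] = copy

# Usage: client A modified page 3; client B then requests it for reading.
osd, cache_a, cache_b = {}, {}, {}
cache_a[3] = CachedPage(b"new contents")
cache_a[3].dirty = True
handle_read_request(osd, 3, cache_a, cache_b)
assert osd[3] == b"new contents" and cache_a[3].lease == "read" and 3 in cache_b
```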

3.2 Cooperative Cache Algorithm


In this paper, a data block is considered to be a page. Each page that exists in the cooperative cache is said to be either singlet or replicated. A singlet page is one that is present in the memory of only one node in the network. A replicated page is one that is present in the memory of several nodes.

When a client wants to open a file for reading, the local cache is checked for the page. In case of a cache miss, zFS requests the page and its read lease from the zFS file manager. The file manager checks whether a range of pages starting with the requested page has already been read into the memory of another machine in the network. If not, zFS grants the leases to the client A, which enables it to read the range of pages from the OSD directly. Client A then reads the range of pages from the OSD, marking each page as a singlet (as A is the only node having this range of pages in its cache). If the file manager finds that the requested range of pages resides in the memory of some other node, say B, it sends a message to B requesting that B send the range of pages and leases to A. In this case, the file manager internally records the fact that A also has this particular range, and both A and B mark the pages as replicated. Node B is called a third-party node, since A gets the requested data not from the OSD but from a third party.
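The read path just described can be summarised in a short sketch; the structures and names below are illustrative stand-ins for the file manager's internal tables, not the actual implementation:

```python
# Sketch of the read path (in-memory stand-ins, not the real file manager).
# location[] maps a page to the set of nodes caching it; the file manager
# either lets the requester read from the OSD (page becomes a singlet) or asks
# a third-party node that already caches the page to forward it (replicated).
def handle_read(location, page, requester):
    holders = location.setdefault(page, set())
    if not holders:
        # No cached copy anywhere: grant the lease, client reads the OSD.
        holders.add(requester)
        return {"source": "OSD", "state": "singlet"}
    third_party = next(iter(holders))      # some node B already has the page
    holders.add(requester)                 # the requester is now a holder too
    return {"source": third_party, "state": "replicated"}

# Usage: node A misses and reads from the OSD; node C is then served from A.
location = {}
assert handle_read(location, page=10, requester="A")["source"] == "OSD"
assert handle_read(location, page=10, requester="C")["source"] == "A"
assert location[10] == {"A", "C"}
```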

When memory becomes scarce on a client, the Linux kernel invokes the kswapd daemon, which scans and discards inactive pages from the client's memory. In the modified kernel, if the candidate page is a replicated zFS page, a message is sent to the zFS file manager indicating that the machine no longer holds the page, and the page is discarded.

If the zFS page is a singlet, the page is forwarded to another node using the following steps:

1. A message is sent to the zFS file manager indicating that the page is sent to another machine B, the node with the largest free memory known to A.
2. The page is forwarded to B.
3. The page is discarded from the page cache of A.

zFS uses a recirculation counter, and if the singlet page has not been accessed after two hops, it is discarded. Once the page has been accessed, the recirculation counter is reset. When a file manager is notified about a discarded page, it updates the lease and page location and checks whether the page has become a singlet. If only one other node N holds the page, the file manager sends a singlet message to N to that effect.
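A compact sketch of the eviction decision, following the steps listed above (a hypothetical in-memory model; in zFS this logic runs inside the modified Linux kernel):

```python
# Sketch of the eviction decision (hypothetical in-memory model). A replicated
# page is simply dropped after telling the file manager; a singlet page is
# forwarded to the node with the most free memory, unless it has already made
# two hops without being accessed, in which case it is discarded.
MAX_HOPS = 2

def evict(page, file_manager, nodes, here):
    if page["state"] == "replicated":
        file_manager.notify_discard(page["id"], here)
        return "discarded"
    if page["hops"] >= MAX_HOPS:                  # recirculated without access
        file_manager.notify_discard(page["id"], here)
        return "discarded"
    target = max(nodes, key=lambda n: nodes[n]["free_mem"])
    file_manager.notify_move(page["id"], here, target)   # step 1: tell the FM
    page["hops"] += 1
    nodes[target]["pages"].append(page)                  # step 2: forward to B
    return f"forwarded to {target}"                      # step 3: A drops it

class FileManagerStub:
    def notify_discard(self, page_id, node):
        print(f"FM: {node} discarded page {page_id}")
    def notify_move(self, page_id, src, dst):
        print(f"FM: page {page_id} moved {src} -> {dst}")

# Usage: a singlet page under memory pressure on node A goes to node C,
# which currently has the most free memory known to A.
nodes = {"B": {"free_mem": 100, "pages": []}, "C": {"free_mem": 400, "pages": []}}
page = {"id": 7, "state": "singlet", "hops": 0}
print(evict(page, FileManagerStub(), nodes, here="A"))
```

Note that, as in Section 3.2.1, the file manager is notified before the page is actually forwarded, so at worst it believes a page exists where it does not.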

The effects of node failure and network delays are also considered in this algorithm.


3.2.1 Node Failure


To handle node failure, the researchers take the approach that it is acceptable for the file manager to assume the existence of pages on nodes even if this is not true, but it is unacceptable to have pages on nodes where the file manager is unaware of their existence. If the file manager is wrong in its assumption that a page exists on a node, its request will be rejected and it will eventually update its records. However, pages on nodes that the file manager is not aware of may cause data trashing, and thus this is not allowed.

Because of this, the order of steps for forwarding a singlet page to another node is important and is to be followed as described above.

1. The node fails before Step 1: The file manager will eventually detect the failure and update its data to reflect that the node no longer holds pages and leases. If the node fails to execute Step 1 and notify the file manager, it does not forward the page but only discards it. Thus, we end up with a situation where the file manager assumes the page exists on node A although it does not. This is acceptable, since it can be corrected without data corruption.

2. The node fails after Step 1: In this case, the file manager is informed that the page is on B, but node A may have crashed before it was able to forward the page to B. Again, we have a situation where the file manager assumes the page is on B, although in reality that is not true.

3. Failure after Step 2 does not pose any problem.


3.2.2 Network Delays


In this paper, the following cases of network delays are considered:

1. The first case the authors consider is where a replicated page residing on two nodes M and N is discarded from the memory of M:

When the zFS file manager sees that a page has become a singlet and now resides only in the memory of N, it sends a message to N with this information. However, due to network delays, this message may arrive after memory pressure has developed on N. On node N the page is still marked as replicated, while in reality it is a singlet and should have been forwarded to another node.

They handle this situation as follows:

If a singlet message arrives at N and the page is not in the cache of N, the cooperative cache algorithm on N will ignore the singlet message. Because the file manager still believes that the page resides on N, it may ask N to forward the page to a requesting client B. In this case, N will send back a reject message to the file manager. Upon receiving a reject message, the file manager updates its internal tables and retries to respond to the request from B, either by finding another client that has in the meantime read the page from the OSD or by telling B to read the page from the OSD itself. In such cases, network delays cause performance degradation, but not inconsistency.

2. Another possible scenario is that no memory pressure occurred on N, the page has not yet arrived, and a singlet message arrived and was ignored. The file manager then asked N to forward the page to B, and N sent a reject message back to the file manager. If the page never arrives at N, due to sender failure or network failure, there is no problem.

However, if the page arrives after the reject message was sent, a consistency problem may occur if a write lease exists. Because the file manager is not aware of the page on N, another node may get the write lease and the page from the OSD. This would leave two clients on two different nodes holding the same page with write leases.

To avoid this situation, a reject list is kept on node N, recording the pages (and their corresponding leases) that were requested but rejected. When a forwarded page arrives at N and the page is on the reject list, the page and its entry on the reject list are discarded, thus keeping the information in the file manager accurate. The reject list is scanned periodically (by the FE) and each entry whose time on the list exceeds T is deleted, where T is the maximum time it can take a page to reach its destination node, determined experimentally for the network topology.
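The reject-list handling can be sketched as follows; the names and the in-memory list are illustrative, and T is shown as a constant for brevity:

```python
# Sketch of the reject-list handling (hypothetical names; the real list lives
# in the zFS front end). Pages that were requested by the file manager but
# rejected are remembered for at most T seconds; if the forwarded page arrives
# while its entry is still on the list, both the page and the entry are
# dropped, keeping the file manager's view accurate.
import time

T_SECONDS = 2.0          # max time a forwarded page can take to arrive;
                         # tuned experimentally for the network topology

reject_list = {}         # page_id -> time the reject message was sent

def on_forward_request_rejected(page_id):
    reject_list[page_id] = time.time()

def on_page_arrival(page_id, cache):
    if page_id in reject_list:
        del reject_list[page_id]      # drop the page; the FM was already told "no"
        return "dropped"
    cache[page_id] = "page data"
    return "cached"

def scan_reject_list():
    # Run periodically by the front end: expire entries older than T.
    now = time.time()
    for page_id in [p for p, t in reject_list.items() if now - t > T_SECONDS]:
        del reject_list[page_id]

# Usage: N rejected a request for page 5; the delayed page then finally shows up.
cache = {}
on_forward_request_rejected(5)
assert on_page_arrival(5, cache) == "dropped" and 5 not in cache
```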

An alternative method for handling these network delay issues is to use a complicated synchronization mechanism to keep track of the state of each page in the cluster. This is unacceptable for two reasons. First, it incurs overhead from extra messages, and second, this synchronization delays the kernel when it needs to evict pages quickly.

3. Another problem caused by network delays arises when node N notifies the zFS file manager upon forwarding a page to M, and M does the same upon forwarding the page to O. If the link from N to the file manager is slow compared to the other links, the file manager may receive the message that the page was moved from M to O before receiving the message that the singlet page was moved from N to M. Moreover, the file manager does not have in its records that this specific page and lease reside on M. The problem is further complicated by the fact that M may decide to discard the page, and this notification may arrive at the file manager before the move notification.

To solve this problem, the researchers used the following data structures:

Each lease on a node has a hop_count, which counts the number of times the lease and its corresponding page were moved from one node to another.

Initially, when the page is read from the OSD, the hop_count in the corresponding lease is set to zero and is incremented whenever the lease and page are transferred to another node.

When a node initiates a move, the move notification passed to the file manager includes the hop_count and the target_node.

Two fields are reserved in each lease record in the file manager's internal tables for handling move notification messages: last_hop_count, initially set to -1, and target_node, initially set to NULL.


Figure 2: Delayed Move Notification Messages

If message (3) arrives first, its hop count and target node are saved in the lease record, since node M is not registered as holding the lease and page. When message (1) arrives, N is the registered node; therefore, the lease is moved to the target node stored in the target_node field, by updating the information in the internal tables of the file manager. If message (3) arrives first and then message (5) arrives, the information from message (5) is stored due to its larger hop count and is used when message (1) arrives; the information from message (3) is superseded due to its smaller hop count. In other words, using the hop count enables late messages that have become irrelevant to be ignored.
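A simplified sketch of how the hop count lets the file manager tolerate out-of-order move notifications; this is an illustrative model, not the actual lease-record code:

```python
# Sketch of delayed move-notification handling using hop counts (hypothetical
# structures, not the actual file manager tables). A move notice from a node
# that is not yet the registered holder is parked in the lease record; when
# the notice from the registered holder finally arrives, the lease is advanced
# using the parked information with the largest hop count.
def handle_move_notice(lease, sender, hop_count, target_node):
    if sender != lease["holder"]:
        # Out-of-order notice: remember the furthest hop seen so far.
        if hop_count > lease["last_hop_count"]:
            lease["last_hop_count"] = hop_count
            lease["target_node"] = target_node
        return
    # Notice from the registered holder: apply it, then any parked move.
    lease["holder"] = target_node
    if lease["last_hop_count"] > hop_count and lease["target_node"]:
        lease["holder"] = lease["target_node"]
    lease["last_hop_count"], lease["target_node"] = -1, None

# Usage: the page moved N -> M (hop 1) and then M -> O (hop 2), but M's notice
# arrives first; the file manager still ends up recording O as the holder.
lease = {"holder": "N", "last_hop_count": -1, "target_node": None}
handle_move_notice(lease, sender="M", hop_count=2, target_node="O")   # early
handle_move_notice(lease, sender="N", hop_count=1, target_node="M")   # late
assert lease["holder"] == "O"
```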

4. Suppose the page was moved from N to M and then to O, where it was discarded because memory pressure occurred on O and its recirculation count had exceeded its limit. O then sends a release_lease message, which arrives at the file manager before the move notifications.


This case is resolved as follows:

Since O is not registered as holding the page and lease, the release_lease message is placed on a pending queue and a flag is raised in the lease record. When the move operation is resolved and this flag is set, the release_lease message is moved to the input queue and executed.

3.3 Choosing the Proper Third Party Node


The zFS file manager uses an enhanced round-robin method to choose the third-party node that holds a range of pages starting with the requested page. For each range granted to a node N, the file manager records the time it was granted, t(N). When a request arrives, the file manager scans the list of all nodes holding a potential range, N0, ..., Nk. For each node Ni, the file manager checks whether currentTime - t(Ni) > C, i.e., whether enough time has passed for the range of pages granted to Ni to reach that node. If this is true, Ni is marked as a potential provider for the requested range; in either case the next node, Ni+1, is then checked. Once all nodes are checked, the marked node with the largest range, Nmax, is chosen. The next time the file manager is asked for a page and lease, it starts the scan from node Nmax+1. Two goals are achieved by this algorithm. First, no single node is overloaded with requests and becomes a bottleneck. Second, the pages are certain to reside at the chosen node, thus reducing the probability of reject messages.
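The selection logic can be sketched as follows (illustrative Python; the list of holders, the constant C and the helper names are stand-ins for the file manager's internal state):

```python
# Sketch of the enhanced round-robin choice of a third-party node
# (hypothetical structures, not the actual file manager code). A node is only
# eligible if enough time (C) has passed since its range was granted, so the
# pages are sure to have reached it; among eligible nodes, the one holding the
# largest range wins, and the next scan starts just after the winner.
import time

C_SECONDS = 1.0

def choose_third_party(holders, start_index, now=None):
    """holders: list of (node, range_len, grant_time) candidates."""
    now = time.time() if now is None else now
    best, best_index = None, None
    n = len(holders)
    for offset in range(n):                       # round-robin scan order
        i = (start_index + offset) % n
        node, range_len, granted_at = holders[i]
        if now - granted_at > C_SECONDS:          # pages have surely arrived
            if best is None or range_len > best[1]:
                best, best_index = holders[i], i
    next_start = (best_index + 1) % n if best is not None else start_index
    return (best[0] if best else None), next_start

# Usage: N2 was granted its range too recently, so N3 (the largest eligible
# range) is chosen, and the next scan will start just after N3.
now = time.time()
holders = [("N1", 4, now - 5), ("N2", 16, now - 0.1), ("N3", 8, now - 5)]
node, next_start = choose_third_party(holders, start_index=0, now=now)
assert node == "N3" and next_start == 0
```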


3.4 Pre-fetching Data in zFS


The Linux kernel uses a read-ahead mechanism to improve file reading performance. Based on the read pattern of each file, the kernel dynamically calculates how many pages to read ahead, n, and invokes the readpage() routine n times.

This method of operating is not efficient when the pages are transmitted over the network. The overhead for transmitting a data block is composed of two parts: the network setup overhead and the transmission time of the data block itself. For comparatively small blocks, the setup overhead is a significant part of the total overhead.

Intuitively, it is more efficient to transmit k pages in one message rather than in a separate message for each page, as the setup overhead is amortized over the k pages.

To confirm this, the researchers wrote client and server programs that test the time it takes to read a file residing entirely in memory from one node to another. Using a file size of N pages, they tested reading it in chunks of 1 to k pages per TCP message, that is, reading the file in N to N/k messages. They found that the best results are achieved for k = 4 and k = 8. When k is smaller, the setup time is significant, and when k is larger (16 and above), the size of the L2 cache starts to affect the performance: TCP performance decreases when the transmitted block size exceeds the size of the L2 cache.

Similar performance gains were achieved by the zFS pre-fetching mechanism. When the file manager is instantiated, it is passed a pre-fetching parameter, R, indicating the maximum range of pages to grant. When a client A requests a page (and lease), the file manager searches for a client B having the largest contiguous range of pages, r, starting with the requested page p, with r <= R. If such a client B is found, the file manager sends B a message to send the r pages (and their leases) to A. The selected range r can be smaller than R if the file manager finds a page with a conflicting lease before reaching the full range R. If no range is found on any client, the file manager grants R leases to client A and instructs A to read R pages from the OSD. The requested page may reside on client A, while the next one resides on client B and the next on client C; in this case, the granted range will be only the requested page from client A. The next request, initiated by the kernel read-ahead mechanism, will be granted from client B, and the next from client C. Thus, there is no interference with the kernel read-ahead mechanism. However, if the file manager finds that client A has a range of k pages, it will ignore the subsequent requests that are initiated by the kernel read-ahead mechanism and covered by the granted range.
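The range-granting decision can be sketched as follows (in-memory stand-ins and hypothetical names; not the actual file manager code):

```python
# Sketch of the range-granting decision in the pre-fetching path. For a
# requested page p, the file manager looks for the client with the longest
# contiguous run of pages starting at p (capped at R and cut short at a
# conflicting lease); if nobody has the page, the requester is told to read R
# pages from the OSD itself.
def grant_range(client_pages, conflicting, p, R, requester):
    """client_pages: node -> set of cached page numbers."""
    best_node, best_len = None, 0
    for node, pages in client_pages.items():
        if node == requester or p not in pages:
            continue
        length = 0
        while length < R and (p + length) in pages and (p + length) not in conflicting:
            length += 1
        if length > best_len:
            best_node, best_len = node, length
    if best_node is None:
        return ("read from OSD", requester, R)     # grant R leases to the requester
    return ("forward", best_node, best_len)        # B sends best_len pages to A

# Usage: with R=8, client B holds pages 10..14 but page 13 has a conflicting
# lease, so only the range 10..12 (3 pages) is forwarded from B.
clients = {"B": {10, 11, 12, 13, 14}, "C": {10}}
assert grant_range(clients, conflicting={13}, p=10, R=8, requester="A") == ("forward", "B", 3)
```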


Chapter 4 Testing


4.1 Test Environment


The zFS performance test environment consisted of a cluster of client PCs and one server PC. Each of the PCs in the cluster had an 800 MHz Pentium III processor with 256 MB memory, a 256 KB L2 cache and a 15 GB IDE (Integrated Drive Electronics) disk. All of the PCs in the cluster ran the Linux operating system. The kernel was a modified 2.4.19 kernel, with a VFS (Virtual File System) implementation of zFS and some patches to enable the integration of zFS with the kernel's page cache. The server PC had a 2 GHz Pentium 4 processor with 512 MB memory and a 30 GB IDE disk, running a vanilla Linux kernel. The server PC ran a simulator of the Antara OSD when the researchers tested zFS performance, and ran an NFS (Network File System) server when they compared the results to NFS. The PCs in the cluster and the server PC were connected via a 1 Gbit LAN.

On the clients, the zFS front end was implemented as a kernel-mode process, while all other components were implemented as user-mode processes. The file manager and lease manager were fully implemented. The transaction manager implemented all operations in memory, without writing the log to the OSD. However, this does not influence the results, because only the results of read operations using the cooperative cache were recorded, not the metadata operations.

To begin testing zFS, the researchers configured the system much like a SAN (Storage Area Network) file system. The server PC ran an OSD simulator, a separate PC ran the lease manager, file manager and transaction manager processes (thus acting as a metadata server), and four PCs ran the zFS front end.

When testing NFS performance, they configured the system differently. The server PC ran an NFS server with eight NFS daemons (nfsd) and the four PCs ran NFS clients. The final results are an average over several runs, where the caches of the machines were cleared before each run.

To evaluate zFS performance relative to an existing file system, the researchers compared it to the widely used NFS, using the IOZONE benchmark. IOZONE is a filesystem benchmark tool that generates and measures a variety of file operations; it is useful for performing a broad filesystem analysis of a computer platform, testing file I/O performance for operations such as read and write.

The comparison to NFS was difficult because NFS does not carry out pre-fetching. To make up for this, IOZONE was configured to read the NFS-mounted file using record sizes of n = 1, 4, 8, 16 pages, and its performance was compared with reading zFS-mounted files with a record size of one page but with pre-fetching parameter R = 1, 4, 8, 16 pages.




Figure 3: System configuration for testing zFS performance


Figure 4: System configuration for testing NFS performance


4.2 Comparing zFS and NFS


The primary aim of this research was to test whether and how much performance is gained when the total amount of free memory in the cluster exceeds the server's cache size. To this end, two scenarios were investigated. In the first one, the file size was smaller than the cache of the server and all the data resided in the server's cache. In the second, the file size was much larger than the size of the server's cache. The results appear in Figure 5 and Figure 6 below, respectively.


Chart data: file smaller than available server memory (256 MB file, 512 MB server memory); throughput in KB/sec versus range (1, 4, 8, 16 pages) for NFS, zFS without cooperative cache, and zFS with cooperative cache.

Figure 5: Performance results for large server cache


This figure shows the performance results when the data fits entirely in the server's memory. The graphs show the relative performance of zFS and NFS, with and without cooperative cache.


Chart data: file larger than available server memory (1 GB file, 512 MB server memory); throughput in KB/sec versus range (1, 4, 8, 16 pages) for NFS, zFS without cooperative cache, and zFS with cooperative cache.


Figure 6: Performance results for small server cache

This figure shows the performance results when the data size is greater than the server cache size and the server's local disk has to be used. We see that the cooperative cache provides much better performance than NFS, while deactivating the cooperative cache results in worse performance than NFS.

In both cases, it was observed that the performance of NFS was almost the same for different block sizes. However, the performance is greatly influenced by the data size compared to the available memory. When the file fits entirely into memory, the performance of NFS is almost four times better than when the file size is larger than the available memory.

When the file fits entirely into memory (Figure 5), the performance of zFS with cooperative cache is much better than that of NFS. But when the cooperative cache was deactivated, different behaviors were observed for different ranges of pages. This is due to the fact that extra messages are passed between the file manager and the client for larger ranges of pages. Hence, the performance of zFS for R=1 is lower than that of NFS. However, for larger ranges, there are fewer messages to the file manager (due to pre-fetching in zFS) and the performance of zFS was slightly better than that of NFS.

The researchers also observed that when the cooperative cache was used, the performance for a range of 16 was lower than for ranges of 4 and 8. This is because IOZONE starts the requests of each client with a fixed time delay relative to the other clients, so each new request was for a different 256 KB of data. This stems from the following calculation: for four clients with 16 pages of 4 KB each, we get 4 x 16 x 4 KB = 256 KB, which is the size of the L2 cache. Since almost the entire file is in memory, the L2 cache is cleared and reloaded for each new granted request, resulting in reduced performance.

When the cache of the server was smaller than the requested data, it was expected that memory pressure would occur on the server (both NFS and OSD) and the server's local disk would be used. In such a case, the anticipation that the cooperative cache would exhibit improved performance proved to be correct. The results are shown in Figure 6.

We can see that zFS performance, when the cooperative cache is deactivated, is lower than that of NFS, but it gets better for larger ranges. When the cooperative cache is active, zFS performance is significantly better than NFS and increases with increasing range.

The performance with the cooperative cache enabled is lower in this case than in the case when the file fits into memory. This is due to the fact that the file was larger than the available memory; hence the clients suffered memory pressure, discarded pages, and responded to the file manager with reject messages. Thus, sending data blocks to clients was interleaved with reject messages to the file manager, and the probability that the requested data was in memory was also smaller than when the file was almost entirely in memory.

Conclusion
The results show that using the cache of all the clients as one cooperative cache gives better performance compared both to NFS and to the case when the cooperative cache is not used. This is evident when using pre-fetching with a range of one page. It is also noted from the results that using pre-fetching with ranges of four and eight pages results in much better performance. In zFS, the selection of the target node for forwarding pages during page eviction is done by the file manager, which chooses the node with the largest free memory as the target node. However, the file manager chooses target nodes only from the ones interacting with it. It may be the case that there is an idle machine with a large amount of free memory that is not connected to this file manager and thus will not be used.


Bibliography
A. Teperman and A. Weit. "Improving Performance of a Distributed File System Using OSDs and Cooperative Cache." IBM Haifa Labs, Haifa University Campus, Mount Carmel, Haifa 31905, Israel.

O. Rodeh and A. Teperman. "zFS - A Scalable Distributed File System Using Object Disks." In Proceedings of the IEEE Mass Storage Systems and Technologies Conference, pages 207-218, San Diego, CA, USA, 2003.

T. Cortes, S. Girona and J. Labarta. "Avoiding the Cache Coherence Problem in a Parallel/Distributed File System." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona.

M. D. Dahlin, R. Y. Wang, T. E. Anderson and D. A. Patterson. "Cooperative Caching: Using Remote Client Memory to Improve File System Performance." In Proceedings of the First Symposium on Operating Systems Design and Implementation, 1994.

V. Drezin, N. Rinetzky, A. Tavory and E. Yerushalmi. "The Antara Object-disk Design." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2001.

Z. Dubitzky, I. Gold, E. Henis, J. Satran and D. Sheinwald. "DSF - Data Sharing Facility." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2000. http://www.haifa.il.ibm.com/projects/systems/dsf.html

IOZONE filesystem benchmark. http://iozone.org/

Lustre white paper. http://www.lustre.org/docs/whitepaper.pdf

P. Sarkar and J. Hartman. "Efficient Cooperative Caching Using Hints." Department of Computer Science, University of Arizona, Tucson.
