PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF B.Sc(H) Computer Science
Certificate
This is to certify that the project work entitled Improving Performance of a Distributed File System Using OSDs and Cooperative Cache, being submitted by Parvez Gupta and Varenya Agrawal in partial fulfillment of the requirement for the award of the degree of B.Sc (Hons) Computer Science, University of Delhi, is a record of work carried out under the supervision of Ms. Baljeet Kaur at Hans Raj College, University of Delhi, Delhi.
It is further certified that we have not submitted this report to any other organization for any other degree.
Parvez Gupta (Roll No. 6010027), Varenya Agrawal (Roll No. 6010044)
Acknowledgment
We would sincerely like to thank Ms. Baljeet Kaur for her invaluable support and guidance in carrying out this project to successful completion. We would also like to thank the Head of the Computer Science Department, Ms. Harmeet Kaur, who was always there with her invaluable knowledge and experience that helped us greatly during the research work. We would also like to extend our gratitude and special thanks to Mr. I.P.S. Negi, Mr. Sanjay Narang and Ms. Anita Mittal for their help in the computer laboratory. Lastly, we would like to thank all our friends and well-wishers who directly or indirectly influenced the successful compilation of the project.
Table of Contents

Chapter 1: Introduction
1.1 Background
2.2 Architecture
3.3 Choosing the Proper Third Party Node
3.4 Pre-fetching Data in zFS
Chapter 4: Testing
4.1 Test Environment
4.2 Comparing zFS and NFS
Conclusion
Bibliography
List of Figures
Figure 1: zFS Architecture
Figure 2: Delayed Move Notification Messages
Figure 3: System configuration for testing zFS performance
Figure 4: System configuration for testing NFS performance
Figure 5: Performance results for large server cache
Figure 6: Performance results for small server cache
Chapter 1 Introduction
1.1 Background
As computer networks started to evolve in the 1980s, it became evident that the old file systems had many limitations that made them unsuitable for multi-user environments.
In the beginning, many users started to use FTP to share files. Although this method avoided the time-consuming physical movement of removable media, files still needed to be copied twice: once from the source computer onto a server, and a second time from the server onto the destination computer. Additionally, users had to know the physical addresses of every computer involved in the file sharing process.
As computer companies tried to solve the shortcomings above, distributed file systems were developed and new features such as file locking were added to existing file systems. The new systems were not replacements for the old file systems, but an additional layer between the disk file system and the user processes.
In a Distributed File System (DFS), a single file system can be distributed across several physical computer nodes. Separate nodes have direct access to only a part of the entire file system. With DFS, system administrators can make files distributed across multiple servers appear to users as if they reside in one place on the network.
zFS (z-Series File System), a distributed file system developed by IBM, is used in the z/OS operating system. zFS evolved from the DSF (Data Sharing Facility) project, which aimed at building a server-less file system that distributes all aspects of file and storage management over cooperating machines interconnected by a fast switched network. zFS was designed to be a scalable file system that operates equally well on a few machines and on thousands of them, and in which the addition of new machines leads to a linear increase in performance.
Several other related works have researched cooperative caching in network file systems. Another file system, xFS, uses a central server to coordinate between the various clients, so the load on the server increases as the number of clients increases. Thus, the scalability of xFS is limited by the strength of the server. However, xFS is more scalable than AFS and NFS due to four different caching techniques that contribute significantly to reducing the server load.
There are three major differences between the zFS architecture and the xFS architecture:
zFS does not have a central server, and the management of files is distributed among several file managers. There is no hierarchy of cluster servers; if two clients work on the same file, they interact with the same file manager.

In zFS, caching is done on a per-page basis rather than using whole files. This increases sharing, since different clients can work on different parts of the same file.

Thus, zFS is more scalable because it has no central server, and file managers can dynamically be added or removed to respond to load changes in the cluster. Moreover, performance is better due to zFS's stronger sharing capability. zFS does not have a central server that can become a bottleneck. All control information is exchanged between clients and file managers. The set of file managers dynamically adapts itself to the load on the cluster. Clients in zFS only pass data among themselves (in cooperative cache mode).
zFS integrates the memory of all participating machines into one coherent cache. Thus, instead of going to the disk for a block of data already in one of the machines' memories, zFS retrieves the data block from the remote machine. To maintain file system consistency, zFS uses distributed transactions and leases to implement metadata operations and coordinate shared access to data. zFS achieves its high performance and scalability by avoiding group-communication mechanisms and clustering software, using distributed transactions and leases instead.
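The lookup order described above (local memory first, then a peer machine's memory, then the object disk) can be sketched as follows. This is a minimal illustration, not zFS's actual API; the class and method names, and the use of plain dictionaries for caches, are assumptions made for the example.

```python
class CooperativeCache:
    """Illustrative sketch: resolve a block through the global cooperative cache."""

    def __init__(self, local_cache, remote_caches, disk):
        self.local = local_cache        # dict: block_id -> bytes
        self.remotes = remote_caches    # list of dicts, one per peer machine
        self.disk = disk                # dict standing in for the object store

    def read_block(self, block_id):
        # 1. Local memory first.
        if block_id in self.local:
            return self.local[block_id], "local"
        # 2. A peer holding the block serves it from memory over the network.
        for peer in self.remotes:
            if block_id in peer:
                data = peer[block_id]
                self.local[block_id] = data   # keep a copy for future accesses
                return data, "remote"
        # 3. Fall back to the object store (disk).
        data = self.disk[block_id]
        self.local[block_id] = data
        return data, "disk"
```

The essential point is step 2: a block already resident in any machine's memory is served over the fast network instead of from disk.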
The design and implementation of zFS is aimed at achieving a scalable file system beyond those that exist today. More specifically, the objectives of zFS are:

To make use of the memory of all participating machines as a global cache to increase performance
zFS will achieve scalability by separating storage management from file management and by dynamically distributing file management. Having ObSs handle storage management implies that functions usually handled by file systems are done in the ObS itself, and are transparent to other components of zFS. The Object Store recognizes only objects that are sparse streams of bytes. Thus, it does not distinguish between files and directories; it is the responsibility of the file system management to handle them correctly.
2.2 Architecture
zFS has six components: a Front End (FE), a Cooperative Cache (Cache), a File Manager (FMGR), a Lease Manager (LMGR), a Transaction Server (TSVR), and an Object Store (ObS). These components work together to provide applications or users with a distributed file system. Now we describe the functionality of each component and how it interacts with the other components.
In distributed environments, where network connections and even machines themselves can fail, it is preferable to use leases rather than locks. Leases are locks with an expiration period that is set up in advance. Thus, when a machine holding a lease on a resource fails, we are able to acquire a new lease after the lease of the failed machine expires. Obviously, the use of leases incurs the overhead of lease renewal on the client that acquired the lease and still needs the resource.
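The difference between a lock and a lease can be illustrated with a short sketch. The expiry logic is the essential point; the class names and the use of a monotonic timer are assumptions made for the example, not zFS code.

```python
import time

class Lease:
    """A lock with an expiration period, set when the lease is granted."""

    def __init__(self, holder, duration):
        self.holder = holder
        self.duration = duration
        self.expires_at = time.monotonic() + duration

    def is_valid(self):
        return time.monotonic() < self.expires_at

    def renew(self):
        # The holder must periodically renew while it still needs the resource;
        # this is the overhead that leases incur compared to plain locks.
        self.expires_at = time.monotonic() + self.duration

def try_acquire(current_lease, new_holder, duration):
    """Grant a new lease only if no valid lease exists. A failed holder's
    lease simply expires, so no explicit lock-recovery protocol is needed."""
    if current_lease is not None and current_lease.is_valid():
        return None
    return Lease(new_holder, duration)
```

If the machine holding the lease crashes, no cleanup message is required: once the expiration time passes, `try_acquire` succeeds for the next requester.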
To reduce the overhead on the ObS, the following mechanism is used: each ObS maintains one major lease for the whole disk, and each ObS has one lease manager (LMGR) which acquires and renews the major lease. Leases for specific objects (files or directories) on the ObS are managed by the ObS's LMGR. Thus, the majority of the lease management overhead is offloaded from the ObS, while still maintaining the ability to protect data. The ObS stores in memory the network address of the current holder of the major lease. To find out which machine is currently managing a particular ObS O, a client simply asks O for the network address of its current LMGR.
The lease manager, after acquiring the major lease, grants exclusive leases on objects residing on the ObS. It also maintains in memory the current network address of each object-lease owner, which allows looking up file managers. Any machine that needs to access an object obj on ObS O first figures out who its LMGR is. If one exists, the object lease for obj is requested from the LMGR. If one does not exist, the requesting machine creates a local instance of an LMGR to manage O for it.
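The lookup-or-create protocol for lease managers might be sketched as follows. All names are hypothetical, and the shared `local_lmgrs` dictionary stands in for the network interactions between machines.

```python
class ObS:
    """Object store: remembers the address of its current lease manager."""
    def __init__(self, name):
        self.name = name
        self.lmgr_addr = None

class LMGR:
    """Lease manager: holds the major lease and grants object leases."""
    def __init__(self, addr, obs):
        self.addr = addr
        self.obs = obs
        self.object_leases = {}          # object id -> owner address

    def grant_object_lease(self, obj_id, owner_addr):
        if obj_id in self.object_leases:
            return None                  # exclusive lease already taken
        self.object_leases[obj_id] = obj_id and owner_addr
        return obj_id

def get_object_lease(obs, obj_id, requester_addr, local_lmgrs):
    """Ask the ObS who manages it; create a local LMGR if nobody does."""
    if obs.lmgr_addr is None:
        # No LMGR exists: create a local instance and take the major lease.
        local_lmgrs[obs.name] = LMGR(requester_addr, obs)
        obs.lmgr_addr = requester_addr
    # In reality this is a network request to obs.lmgr_addr; the shared
    # dictionary stands in for that here.
    lmgr = local_lmgrs[obs.name]
    return lmgr.grant_object_lease(obj_id, requester_addr)
```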
The FMGR keeps track of each accomplished open() and read() request, and maintains the information regarding where each file's blocks reside in internal data structures. When an open() request arrives at the file manager, it checks whether the file has already been opened by another client (on another machine). If not, the FMGR acquires the proper exclusive lease from the lease manager and directs the request to the object disk. In case the data requested resides in the cache of another machine, the FMGR directs the cache on that machine to forward the data to the requesting cache.
The file manager interacts with the lease manager of the ObS where the file resides to obtain an exclusive lease on the file. It also creates and keeps track of all range leases it distributes. These leases are kept in internal FMGR tables, and are used to control and provide proper access to files by various clients.
The Cache on A then receives the block, updates its internal tables (for future accesses to the block) and passes the data to the FE on A, which passes it to the client.
so that the consistency of the file system can be restored. This means either rolling the transaction forward or backward.
The most complicated directory operation is rename(). This requires, at the very least, (a) locking the source directory, target directory, and file (to be moved), (b) creating a new directory entry at the target, (c) erasing the old entry, and (d) releasing the locks.
Since such transactions are complex, zFS uses a special component to manage them: a transaction server (TSVR). The TSVR works on a per-operation basis. It acquires all required leases and performs the transaction. The TSVR attempts to hold on to acquired leases for as long as possible and releases them only for the benefit of other hosts.
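The rename() steps (a) through (d) above can be sketched as a toy transaction. Here `TSVR.acquire`/`release` stand in for lease acquisition, and the whole structure is illustrative, not the zFS implementation.

```python
class Dir:
    """A directory is just a mapping from names to file objects here."""
    def __init__(self, entries=None):
        self.entries = dict(entries or {})

class TSVR:
    """Toy transaction server: tracks which objects are currently leased."""
    def __init__(self):
        self.held = set()
    def acquire(self, obj):
        assert id(obj) not in self.held, "lease already held"
        self.held.add(id(obj))
    def release(self, obj):
        self.held.discard(id(obj))

def rename(tsvr, src_dir, dst_dir, name, new_name):
    """Sketch of the rename() transaction: lock, link, unlink, unlock."""
    file_obj = src_dir.entries[name]
    # (a) Acquire leases on the source directory, target directory and file.
    for obj in (src_dir, dst_dir, file_obj):
        tsvr.acquire(obj)
    try:
        # (b) Create the new directory entry at the target.
        dst_dir.entries[new_name] = file_obj
        # (c) Erase the old entry.
        del src_dir.entries[name]
    finally:
        # (d) Release the leases.
        for obj in (src_dir, dst_dir, file_obj):
            tsvr.release(obj)
```

A real TSVR would additionally log each step so that, on failure, the transaction can be rolled forward or backward as described above.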
In addition to the exclusive file lease, the file manager manages ranges of leases, which it grants to the clients (FEs).
[Figure 1: zFS Architecture]

Every file opened in zFS is managed by a single file manager that is assigned to the file when it is first opened. The set of all currently active file managers manage all opened zFS files.
When eviction is invoked and a zFS page is the candidate page for eviction, the decision is passed to a specific zFS routine, which decides whether to forward the page to the cache of another node or to discard it.
An application using a zFS file to write a whole page acquires only the write lease when no read is done from the OSD. If one application or user on a machine has a write lease, all other applications/users on that machine can try to read and write to the page using the same lease, without requesting another lease from the file manager. The kernel then checks the permission to read/write, based on the permissions specified in the mode (read or write or both) parameter when the file is opened. If the mode bits allow the operation, zFS allows it. When a client has a write lease and another client requests a read lease for the same page, a write to the object store device is done if the page has been modified, and the lease on the first client is downgraded from write to read without discarding the page. This increases the probability of a cache hit by a client requesting the same page, thus increasing performance.
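The downgrade-on-read behaviour can be sketched as follows. The names are illustrative; zFS itself performs this inside the kernel.

```python
class OSD:
    """Stand-in for the object store device."""
    def __init__(self):
        self.store = {}
    def write(self, page_id, data):
        self.store[page_id] = data

class Page:
    def __init__(self, page_id, data):
        self.id = page_id
        self.data = data
        self.lease = "write"   # the current client holds a write lease
        self.dirty = True      # the page has been modified locally
        self.readers = set()

def handle_read_request(page, requester, osd):
    """A client holds a write lease on `page`; another client asks to read.
    The page is flushed if dirty and the lease is downgraded, not revoked."""
    if page.lease == "write":
        if page.dirty:
            osd.write(page.id, page.data)   # flush modifications to the OSD
            page.dirty = False
        page.lease = "read"                 # downgrade; the page stays cached
    page.readers.add(requester)
    return page.data
```

Because the page is kept in the first client's cache after the downgrade, a later request for the same page can still be served from memory.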
When a client A wants to open a file for reading, the local cache is checked for the page. In case of a cache miss, zFS requests the page and its read lease from the zFS file manager. The file manager checks whether a range of pages starting with the requested page has already been read into the memory of another machine in the network. If not, zFS grants the leases to the client A, which enables the client to read the range of pages from the OSD directly. The client A then reads the range of pages from the OSD, marking each page as a singlet (as A is the only node having this range of pages in its cache). If the file manager finds that the range of pages requested resides in the memory of some other node, say B, it sends a message to B requesting that B send the range of pages and leases to A. In this case, zFS records internally the fact that A also has this particular range, and both A and B mark the pages as replicated. Node B is called a third-party node, since A gets the requested data not from the OSD but from a third party.
When memory becomes scarce on a client, the Linux kernel invokes the kswapd() daemon. This daemon scans and discards inactive pages from the memory of the client. In the modified kernel, if the page is a replicated zFS page, a message is sent to the zFS file manager indicating that machine A no longer holds the page, and the page is discarded.
If the zFS page is a singlet, the page is forwarded to another node using the following steps:
1. A message is sent to the zFS file manager indicating that the page is sent to another machine B, the node with the largest free memory known to A.
2. The page is forwarded to B.
3. The page is discarded from the page cache of A.
zFS uses a recirculation counter: if a singlet page has not been accessed after two hops, it is discarded. Once the page has been accessed, the recirculation counter is reset. When a file manager is notified about a discarded page, it updates the lease and page location and checks whether the page has become a singlet. If only one other node N holds the page, the file manager sends a singlet message to N to that effect.
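The three eviction steps and the recirculation limit can be put together in a sketch; all class names here are stand-ins for zFS's kernel structures, not the real implementation.

```python
MAX_HOPS = 2   # a singlet page not accessed within two hops is dropped

class Peer:
    def __init__(self, name, free_memory):
        self.name = name
        self.free_memory = free_memory
        self.received = []
    def receive(self, page):
        self.received.append(page)

class FileManagerStub:
    """Records the notifications a real file manager would act on."""
    def __init__(self):
        self.events = []
    def notify_discard(self, page):
        self.events.append(("discard", page.page_id))
    def notify_move(self, page, target):
        self.events.append(("move", page.page_id, target.name))

class EvictPage:
    def __init__(self, page_id, replicated, recirculation=0):
        self.page_id = page_id
        self.replicated = replicated
        self.recirculation = recirculation

def evict(page, cache, fm, peers):
    """Sketch of the zFS eviction decision for a page leaving a node's cache."""
    if page.replicated or page.recirculation >= MAX_HOPS:
        # Replicated pages and over-travelled singlets are simply dropped;
        # the file manager is told so it can update its page-location tables.
        fm.notify_discard(page)
    else:
        # Forward the singlet to the peer with the largest free memory.
        target = max(peers, key=lambda pr: pr.free_memory)
        fm.notify_move(page, target)   # step 1: notify the file manager first
        page.recirculation += 1
        target.receive(page)           # step 2: forward the page
    cache.discard(page.page_id)        # step 3: drop the local copy
```

Notifying the file manager before forwarding (step 1 before step 2) matters for the failure cases analysed next.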
The effects of node failure and network delays are also considered in this algorithm. Because of this, the order of steps for forwarding a singlet page to another node is important and must be followed as described above.
1. Node fails before Step 1: The file manager will eventually detect this and update its data to reflect that the respective node holds no pages and leases. If the node fails to execute Step 1 and notify the file manager, it does not forward the page and only discards it. Thus, we end up with a situation where the file manager assumes the page exists on node A, although it does not. This is acceptable, since it can be corrected without data corruption.

2. Node fails after Step 1: In this case, the file manager is informed that the page is on B, but node A may have crashed before it was able to forward the page to B. Again, we have a situation where the file manager assumes the page is on B, although in reality that is not true.
1. The first case that the authors have considered is where a replicated page residing on two nodes M and N is discarded from the memory of M:

When the zFS file manager sees that the page has become a singlet and now resides only in the memory of N, it sends a message to N with this information. However, due to network delays, this message may arrive after memory pressure has developed on N. On node N this page is still marked as replicated, while in reality it is a singlet and should have been forwarded to another node.

If a singlet message arrives at N and the page is not in the cache of N, the cooperative cache algorithm on N will ignore the singlet message. Because the file manager still believes that the page resides on N, it may ask N to forward the page to a requesting client B. In this case, N will send back a reject message to the file manager. Upon receiving a reject message, the file manager updates its internal tables and retries to respond to the request from B, either by finding another client who in the meantime read the page from the OSD, or by telling B to read the page from the OSD. In such cases, network delays cause performance degradation, but not inconsistency.
2. Another possible scenario is that no memory pressure occurred on N, the page has not arrived yet, and a singlet message arrived and was ignored. The file manager asked N to forward the page to B, and N sent a reject message back to the file manager. If the page never arrives at N due to sender failure or network failure, there is no problem.
However, if the page arrives after the reject message was sent, a consistency problem may occur if a write lease exists. Because the file manager is not aware of the page on N, another node may get the write lease and the page from the OSD. This leaves two clients on two different nodes holding the same page with write leases.
To avoid this situation, a reject list is kept on node N, which records the pages (and their corresponding leases) that were requested but rejected. When a forwarded page arrives at N and the page is on the reject list, the page and its entry on the reject list are discarded, thus keeping the information in the file manager accurate. The reject list is scanned periodically (by the FE), and each entry whose time on the list exceeds T is deleted. T is the maximum time it can take a page to reach its destination node, and is determined experimentally depending on the network topology.
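A minimal sketch of the reject list follows, assuming a monotonic clock and an illustrative value of T; the method names are invented for the example.

```python
import time

T = 5.0   # max page transit time; determined experimentally in practice

class RejectList:
    """Per-node reject list that resolves late-arriving forwarded pages."""

    def __init__(self):
        self.entries = {}                   # page id -> time of rejection

    def record_reject(self, page_id):
        self.entries[page_id] = time.monotonic()

    def on_page_arrival(self, page_id, cache):
        if page_id in self.entries:
            # The page was already rejected: drop both the page and the entry,
            # keeping the file manager's view of page locations accurate.
            del self.entries[page_id]
            return False                    # page not admitted to the cache
        cache[page_id] = True
        return True

    def expire(self):
        # Periodic scan (done by the FE): drop entries older than T.
        now = time.monotonic()
        self.entries = {p: t for p, t in self.entries.items() if now - t <= T}
```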
An alternative method for handling these network delay issues would be a complicated synchronization mechanism that keeps track of the state of each page in the cluster. This is unacceptable for two reasons: first, it incurs overhead from extra messages, and second, the synchronization delays the kernel when it needs to evict pages quickly.
3. Another problem caused by network delays: suppose node N notifies the zFS file manager upon forwarding a page to M, and M does the same when forwarding the page to O. However, the link from N to the file manager is slow compared to the other links. Thus, the file manager may receive the message that the page was moved from M to O before receiving the message that the singlet page was moved from N to M. Moreover, the file manager does not have in its records that this specific page and lease reside on M. The problem is further complicated by the fact that M may decide to discard the page, and this notification may arrive at the file manager before the move notification.
To solve this problem, the researchers used the following data structures:
Each lease on a node has a hop_count, which counts the number of times the lease and its corresponding page were moved from one node to another.
Initially, when the page is read from the OSD, the hop_count in the corresponding lease is set to zero and is incremented whenever the lease and page are transferred to another node.
When a node initiates a move, the move notification passed to the file manager includes the hop_count and the target_node.
Two fields are reserved in each lease record in the file manager's internal tables for handling move notification messages: last_hop_count, initially set to -1, and target_node, initially set to NULL.
If message (3) arrives first, its hop count and target node are saved in the lease record. This is done since node M is not registered as holding the lease and page. When message (1) arrives, N is the registered node; therefore, the lease is moved to the target node stored in the target_node field. This is done by updating the information stored in the internal tables of the file manager. If message (3) arrives first and message (5) arrives next, the information from message (5) overwrites it due to the larger hop count and is used when message (1) arrives; conversely, if message (5) arrives before message (3), message (3) is ignored due to its smaller hop count. In other words, using the hop count enables us to ignore late messages that are irrelevant.
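One possible reconstruction of this bookkeeping in Python, keeping a single saved (hop count, target) pair per lease record as the text describes; a real implementation would need to chain longer sequences of pending moves, and the message numbers refer to the delayed-move scenario above.

```python
class LeaseRecord:
    """Per-lease state in the file manager's tables (sketch)."""
    def __init__(self, holder):
        self.holder = holder          # node currently registered for the lease
        self.last_hop_count = -1      # highest hop count seen in early messages
        self.target_node = None       # where the lease should finally land

def on_move_notification(rec, source, hop_count, target):
    """Apply a move message that may arrive out of order."""
    if source == rec.holder:
        # Message from the registered holder: move the lease, then chase any
        # later move we already saw from a downstream (unregistered) node.
        rec.holder = target
        if rec.last_hop_count > hop_count and rec.target_node is not None:
            rec.holder = rec.target_node
            rec.last_hop_count, rec.target_node = -1, None
    elif hop_count > rec.last_hop_count:
        # Early message from an unregistered node: remember the newest one.
        rec.last_hop_count, rec.target_node = hop_count, target
    # Messages with a smaller hop count than one already saved are ignored.
```

For the scenario in the text: the page moves N to M (hop count 1) and then M to O (hop count 2), but the M-to-O message arrives first; when the delayed N-to-M message finally lands, the lease is routed straight through to O.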
4. Suppose the page was moved from N to M and then to O, where it was discarded due to memory pressure on O because its recirculation count exceeded its limit. O then sends a release_lease message, which arrives at the file manager before the move notifications.
Since O is not registered as holding the page and lease, the release_lease message is placed on a pending queue and a flag is raised in the lease record. When the move operation is resolved and this flag is set, the release_lease message is moved to the input queue and executed.
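A sketch of this pending-queue handling, with plain dictionaries standing in for the file manager's tables; the function names are invented for the example.

```python
def on_release_lease(records, pending, node, page_id):
    """A release from a node not yet registered as the holder is parked on a
    pending queue until the outstanding move is resolved."""
    rec = records[page_id]
    if rec["holder"] == node:
        del records[page_id]               # normal release
    else:
        pending.append((node, page_id))    # park the early release
        rec["pending_release"] = True      # flag in the lease record

def on_move_resolved(records, pending, page_id, new_holder):
    """When the move finally lands, replay any parked release for the page."""
    rec = records[page_id]
    rec["holder"] = new_holder
    if rec.get("pending_release"):
        for node, pid in list(pending):
            if pid == page_id and node == new_holder:
                pending.remove((node, pid))
                del records[page_id]       # now execute the release
```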
This method of operating is not efficient when the pages are transmitted over the network. The overhead for transmitting a data block is composed of two parts: the network setup overhead and the transmission time of the data block itself. For comparatively small blocks, the setup overhead is a significant part of the total overhead.
Intuitively, it seems more efficient to transmit k pages in one message rather than in a separate message for each page, since the setup overhead is amortized over the k pages.
To confirm this, the researchers wrote client and server programs that test the time it takes to read a file residing entirely in memory from one node to another. Using a file size of N pages, they tested reading it in chunks of k pages in each TCP message; that is, reading the file in N/k messages. They found that the best results are achieved for k=4 and k=8. When k is smaller, the setup time is significant, and when k is larger (16 and above), the size of the L2 cache starts to affect the performance: TCP performance decreases when the transmitted block size exceeds the size of the L2 cache.
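The amortization argument can be made concrete with a toy cost model. The setup and per-page times below are illustrative numbers chosen for the example, not the paper's measurements.

```python
import math

def transfer_time(n_pages, k, setup, per_page):
    """Total time to send n_pages in messages of k pages each:
    one setup cost per message plus the raw per-page transmission time."""
    n_messages = math.ceil(n_pages / k)
    return n_messages * setup + n_pages * per_page

# Illustrative numbers: 100 us setup per message, 30 us per 4 KB page.
t1 = transfer_time(1024, 1, setup=100e-6, per_page=30e-6)  # one page/message
t8 = transfer_time(1024, 8, setup=100e-6, per_page=30e-6)  # eight pages/message
```

With these numbers, batching eight pages per message pays the setup cost 128 times instead of 1024 times, so the total time drops substantially even though the raw transmission time is unchanged. The model does not capture the L2-cache effect that penalizes very large k.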
Similar performance gains were achieved by the zFS pre-fetching mechanism. When the file manager is instantiated, it is passed a pre-fetching parameter, R, indicating the maximum range of pages to grant. When a client A requests a page (and lease), the file manager searches for a client B having the largest contiguous range of pages, r, starting with the requested page p, where r <= R. If such a client B is found, the file manager sends B a message to send the r pages (and their leases) to A. The selected range r can be smaller than R if the file manager finds a page with a conflicting lease before reaching the full range R. If no range is found in any client, the file manager grants R leases to client A and instructs A to read R pages from the OSD. The requested page may reside on client A, while the next one resides on client B, and the next on client C. In this case, the granted range will be only the requested page from client A. The next request, initiated by the kernel read-ahead mechanism, will be granted from client B and the next from client C. Thus, there is no interference with the kernel read-ahead mechanism. However, if the file manager finds that client A has a range of k pages, it will ignore the subsequent requests that are initiated by the kernel read-ahead mechanism and covered by the granted range.
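The file manager's range search can be sketched as follows. The data structures are illustrative (clients as a mapping from name to the set of page numbers cached there), and the requesting client itself is not modelled.

```python
def grant_range(clients, p, R, conflicts=frozenset()):
    """Find the client with the longest contiguous run of cached pages
    p, p+1, ..., capped at R and stopped at any page with a conflicting lease.
    Returns (source, range_length); source is None when the requester must
    read from the OSD instead."""
    best_client, best_len = None, 0
    for client, pages in clients.items():
        r = 0
        while r < R and (p + r) in pages and (p + r) not in conflicts:
            r += 1
        if r > best_len:
            best_client, best_len = client, r
    if best_client is None:
        return None, R          # no cached range: read R pages from the OSD
    return best_client, best_len
```

For example, if client B caches pages 5 through 7 while client A caches only page 5, a request for page 5 with R=8 is served as a 3-page range from B.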
Chapter 4 Testing
The client running the zFS front end was implemented as a kernel-mode process, while all other components were implemented as user-mode processes. The file manager and lease manager were fully implemented. The transaction manager implemented all operations in memory, without writing the log to the OSD. However, this does not influence the results, because only the results of read operations using the cooperative cache are recorded, and not the metadata operations.
To begin testing zFS, the researchers configured the system much like a SAN (Storage Area Network) file system. The server PC ran an OSD simulator; a separate PC ran the lease manager, file manager and transaction manager processes (thus acting as a metadata server); and four PCs ran the zFS front end.
When testing NFS performance, they configured the system differently. The server PC ran an NFS server with eight NFS daemons (nfsd), and the four PCs ran NFS clients. The final results are an average over several runs, where the caches of the machines were cleared before each run.
To evaluate zFS performance relative to an existing file system, the researchers compared it to the widely used NFS system, using the IOZONE benchmark. IOZONE is a filesystem benchmark tool that generates and measures a variety of file operations, and is useful for performing a broad filesystem analysis of a computer platform. The benchmark tests file I/O performance for operations such as read, write, etc.
The comparison to NFS was difficult because NFS does not carry out pre-fetching. To make up for this, IOZONE was configured to read the NFS-mounted file using record sizes of n = 1, 4, 8, 16 pages, and its performance was compared with reading zFS-mounted files with a record size of one page but with pre-fetching parameter R = 1, 4, 8, 16 pages.
[Figure 5: Performance results for large server cache. File smaller than available server memory: 256 MB file, 512 MB server memory. The graph shows throughput (KB/sec) for NFS and for zFS with and without cooperative cache, for ranges of 1, 4, 8 and 16 pages.]
This figure shows the performance results when the data fits entirely in the server's memory. The graphs show the relative performance of zFS to NFS, with and without cooperative cache.
32
[Figure 6: Performance results for small server cache. File larger than available server memory: 1 GB file, 512 MB server memory. The graph shows throughput (KB/sec) for ranges of 1, 4, 8 and 16 pages.]
This figure shows the performance results when the data size is greater than the server cache size and the server's local disk has to be used. We see that cooperative cache provides much better performance than NFS, while deactivating cooperative cache results in worse performance than NFS.

In both cases, it was observed that the performance of NFS was almost the same for different block sizes. However, the performance is greatly influenced by the data size compared to the available memory. When the file fits entirely into memory, the performance of NFS is almost four times better compared to the case when the file size is larger than the available memory.

When the file fits entirely into memory (Figure 5), the performance of zFS with cooperative cache is much better than that of NFS. But when cooperative cache was deactivated, different behaviors were observed for different ranges of pages. This is due to the fact that extra messages are passed between the file manager and the client for larger ranges of pages. Hence, the performance of zFS for R=1 is lower than that of NFS. However, for larger ranges, there are fewer messages to the file manager (due to pre-fetching in zFS) and the performance of zFS was slightly better than that of NFS.
The researchers also observed that when cooperative cache was used, the performance for a range of 16 was lower than for ranges of 4 and 8. This is because IOZONE starts the requests of each client with a fixed time delay relative to the other clients, and each new request was for a different 256 KB. This stems from the following calculation: for four clients with 16 pages each, we get 256 KB, the size of the L2 cache. Since almost the entire file is in memory, the L2 cache is cleared and reloaded for each new granted request, resulting in reduced performance.
When the cache of the server was smaller than the requested data, it was expected that memory pressure would occur in the server (NFS and OSD) and the server's local disk would be used. In such a case, the expectation that the cooperative cache would exhibit improved performance proved to be correct. The results are shown in Figure 6.
We can see that zFS performance when cooperative cache is deactivated is lower than that of NFS, but it gets better for larger ranges. When the cooperative cache is active, zFS performance is significantly better than NFS and increases with increasing range.
The performance with cooperative cache enabled is lower in this case compared to the case when the file fits into memory. This is because the file was larger than the available memory; hence the clients suffered memory pressure, discarded pages, and responded to the file manager with reject messages. Thus, sending data blocks to clients was interleaved with reject messages to the file manager, and the probability that the requested data was in memory was also smaller than when the file was almost entirely in memory.
Conclusion
The results show that using the caches of all the clients as one cooperative cache gives better performance compared to NFS, as well as to the case when cooperative cache is not used. This is evident when using pre-fetching with a range of one page. It is also noted from the results that using pre-fetching with ranges of four and eight pages results in much better performance. In zFS, the selection of the target node for forwarding pages during page eviction is done by the file manager, which chooses the node with the largest free memory as the target node. However, the file manager chooses target nodes only from the ones interacting with it. It may be the case that there is an idle machine with a large free memory that is not connected to this file manager and thus will not be used.
Bibliography
A. Teperman and A. Weit. "Improving Performance of a Distributed File System Using OSDs and Cooperative Cache." IBM Haifa Labs, Haifa University Campus, Mount Carmel, Haifa 31905, Israel.

O. Rodeh and A. Teperman. "zFS - A Scalable Distributed File System Using Object Disks." In Proceedings of the IEEE Mass Storage Systems and Technologies Conference, pages 207-218, San Diego, CA, USA, 2003.

T. Cortes, S. Girona and J. Labarta. "Avoiding the Cache Coherence Problem in a Parallel/Distributed File System." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona.

M. D. Dahlin, R. Y. Wang, T. E. Anderson and D. A. Patterson. "Cooperative Caching: Using Remote Client Memory to Improve File System Performance." In Proceedings of the First Symposium on Operating Systems Design and Implementation, 1994.

V. Drezin, N. Rinetzky, A. Tavory and E. Yerushalmi. "The Antara Object-disk Design." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2001.

Z. Dubitzky, I. Gold, E. Henis, J. Satran and D. Sheinwald. "DSF - Data Sharing Facility." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2000. http://www.haifa.il.ibm.com/projects/systems/dsf.html

IOZONE filesystem benchmark. http://iozone.org/

Lustre File System whitepaper. http://www.lustre.org/docs/whitepaper.pdf

P. Sarkar and J. Hartman. "Efficient Cooperative Caching Using Hints." Department of Computer Science, University of Arizona, Tucson.