Alex Sherman, Philip A. Lisiecki, Andy Berkheimer, and Joel Wein. Akamai Technologies, Inc.; Columbia University; Polytechnic University.
Abstract
An important trend in information technology is the use of increasingly large distributed systems to deploy increasingly complex and mission-critical applications. In order for these systems to achieve the ultimate goal of having similar ease-of-use properties as centralized systems, they must allow fast, reliable, and lightweight management and synchronization of their configuration state. This goal poses numerous technical challenges in a truly Internet-scale system, including varying degrees of network connectivity, inevitable machine failures, and the need to distribute information globally in a fast and reliable fashion. In this paper we discuss the design and implementation of a configuration management system for the Akamai Network. It allows reliable yet highly asynchronous delivery of configuration information, is significantly fault-tolerant, and can scale if necessary to hundreds of thousands of servers. The system is fully functional today, providing configuration management to over 15,000 servers deployed in 1200+ different networks in 60+ countries.
1 Introduction
Akamai Technologies operates a system of 15,000+ widely dispersed servers on which its customers deploy their web content and applications in order to increase the performance and reliability of their web sites. When a customer extends their web presence from their own server or server farm to a third-party Content Delivery Network (CDN), a major concern is the ability to maintain close control over the manner in which their web content is served. Most customers require a level of control over their distributed presence that rivals that achievable in a centralized environment. Akamai's customers can configure many options that determine how their content is served by the CDN. These options may include HTML cache timeouts, whether to allow cookies, and whether to store session data for their web applications, among many other settings. Configuration files that capture these settings must be propagated quickly to all of the Akamai servers upon update. In addition to configuring customer profiles, Akamai also runs many internal services and processes which require frequent updates or reconfigurations. One example is the mapping services, which assign users to Akamai servers based on network conditions. Subsystems that measure frequently-changing network connectivity and latency must distribute their measurements to the mapping services. In this paper we describe the Akamai Configuration Management System (ACMS), which was built to support customers' and internal services' configuration propagation requirements. ACMS accepts distributed submissions of configuration information (captured in configuration files) and disseminates this information to the Akamai CDN. ACMS is highly available through significant fault-tolerance, allows reliable yet highly asynchronous and consistent delivery of configuration information, provides persistent storage of configuration updates, and can scale if necessary to hundreds of thousands of servers. The system is fully functional today, providing configuration management to over 15,000 servers deployed in 1200+ different ISP networks in 60+ countries. Further, as a lightweight mechanism for making configuration changes, it has evolved into a critical element of how we administer our network in a flexible fashion. Elements of ACMS bear resemblance to or draw from numerous previous efforts in distributed systems, from reliable messaging/multicast in wide-area systems, to fault-tolerant data replication techniques, to Microsoft's Windows Update functionality; we present a detailed comparison in Section 8. We believe, however, that our system is designed to work in a relatively unique environment, due to a combination of the following factors. The set of end clients (our 15,000+ servers) is very widely dispersed.
At any point in time a nontrivial fraction of these servers may be down or may have nontrivial connectivity problems to the rest of the system. An individual server may be out of commission for several months before being returned to active duty, and will need to get caught up in a sane fashion. Configuration changes are generated from widely dispersed places: for certain applications, any server in the system can generate configuration information that needs to be dispersed via ACMS. We have relatively strong consistency requirements. When a server that has been out of touch regains contact, it needs to become up to date quickly or risk serving customer content in an outdated mode. Our solution is based on a small set of front-end distributed Storage Points and a back-end process that manages downloads from the front-end. We have designed and implemented a set of protocols that deal with our particular availability and consistency requirements. The major contributions of this paper are as follows: We describe the design of a live working system that meets the requirements of configuration management in a very large distributed network. We present performance data and detail some lessons learned from building and deploying such a system. We discuss in detail the distributed synchronization protocols we introduced to manage the front end's Storage Points. While these protocols bear similarity to several previous efforts, they are targeted at a different combination of reliability and availability requirements and thus may be of interest in other settings.
We assume that each submission of a configuration file foo completely overwrites the earlier submitted version of foo. Thus, we do not need to store older versions of foo, but the system must correctly synchronize to the latest version. Finally, we assume that for each configuration file there is either a single writer or multiple idempotent (non-competing) writers. Based on the motivation and assumptions described above, we formulate the following requirements for ACMS: High Fault-Tolerance and Availability. In order to support all applications that dynamically submit configuration updates, the system must operate 24x7 and experience virtually no downtime. The system must be able to tolerate a number of machine failures and network partitions, and still accept and deliver configuration updates. Thus, the system must have multiple entry points for accepting and storing configuration updates such that failure of any one of them will not halt the system. Furthermore, these entry points must be located in distinct ISP networks so as to guarantee availability even if one of these networks becomes partitioned from the rest of the Internet. Efficiency and Scalability. The system must deliver updates efficiently to a network of the size of the Akamai CDN, and all parts of the system must scale effectively to any anticipated growth. Since updates, such as a customer's profile, directly affect how each Akamai node serves that customer's content, it is imperative that the servers synchronize relatively quickly with respect to the new updates. The system must guarantee that propagation of updates to all alive nodes takes place within a few minutes of submission. (Provided, of course, that there is network connectivity to such alive or functioning nodes from some of our entry points.) Persistent Fault-Tolerant Storage. In a large network some machines will always be experiencing downtime due to power and network outages or process failures. Therefore, it is unlikely that a configuration update can be delivered synchronously to the entire CDN at the time of submission. Instead, the system must be able to store the updates permanently and deliver them asynchronously to machines as they become available. Correctness. Since configuration file updates can be submitted to any of the entry points, it is possible that two updates for the same file foo arrive at different entry points simultaneously. We require that ACMS provide a unique ordering of all versions and that the system synchronize to the latest version for each configuration file. Since slight clock skews are possible among our machines, we relax this requirement and show that we allow a very limited, but bounded, reordering. (See section 3.4.2.) Acceptance Guarantee. ACMS accepts a submission request only when the system has agreed on this version of the update. The agreement in ACMS is based on a quorum of entry points. (The quorum used in ACMS is at the core of our architecture and is discussed in great detail throughout the paper.) The agreement is necessary because if the entry point that receives an update submission becomes cut off from the Internet, it will not be able to propagate the update to the rest of the system. In essence, the Acceptance Guarantee stipulates that if a submission is accepted, a quorum has agreed to propagate the submission to the Akamai CDN. Security. Configuration updates must be authenticated and encrypted so that ACMS cannot be spoofed nor updates read by any third parties. The techniques that we use to accomplish this are standard, and we do not discuss them further in this document.
request them. We observed that the Akamai CDN itself is fully optimized for HTTP download, making the pull-based approach over HTTP download a natural choice. Since many configuration updates must be delivered to virtually every Akamai server, this allows us to use Akamai caches effectively for common downloads and thus reduce network bandwidth requirements. This natural choice helps ACMS scale with the growing size of the Akamai network. As an optimization we add an additional set of machines (the Download Points) to the front-end. Download Points offer additional sites for HTTP download and thus alleviate the bandwidth demand placed on the Storage Points. To further improve the efficiency of the HTTP download, we create an index hierarchy that concisely describes all configuration files available on the SPs. A downloading agent can start by downloading the root of the hierarchical index tree and work its way down to detect changes in any particular configuration files it is interested in. The rest of this paper is organized as follows. We give an architecture overview in section 2. We discuss our distributed techniques of quorum-based replication and recovery in sections 3 and 4. Section 5 describes the delivery mechanism. We share our operational experience and evaluation in sections 6 and 7. Section 8 discusses related work. We conclude in section 9.
2 Architecture Overview
The architecture of ACMS is depicted in Figure 1. First, an application submitting an update (also known as a publisher) contacts an ACMS Storage Point. The publisher transmits a new version of a given configuration file. The SP that receives an update submission is also known as the Accepting SP for that submission. Before replying to the client, the Accepting SP makes sure to replicate the message on at least a quorum (a majority) of the Storage Points. Storage Points store the message persistently on disk as a file. In addition to copying the data, ACMS runs an algorithm called Vector Exchange that allows a quorum of SPs to agree on a submission. Only after the agreement is reached does the Accepting SP acknowledge the publisher's request, by replying with Accept. Once the agreement among the SPs is reached, the data can also be offered for download. The Storage Points upload the data to their local HTTP servers (i.e., HTTP servers run on the same machines as the SPs). Since only a quorum of SPs is required to reach an agreement on a submission, some SPs may miss an occasional update due to downtime. To account for replication messages missed due to downtime, the SPs run a recovery scheme called Index Merging. Index Merging helps the Storage Points recover any missed updates from their peers.

Figure 1: ACMS: Publishers, Storage Points, and Receivers (Subscribers)

To subscribe to configuration updates, each server (also known as a node) on the Akamai CDN runs a process called Receiver that coordinates subscriptions for that node. Services on each node subscribe with their local Receiver process to receive configuration updates. Receivers periodically make HTTP IMS (If-Modified-Since) requests for these files from the SPs. Receivers send these requests via the Akamai CDN, and most of the requests are served from nearby Akamai caches, reducing network traffic requirements. We add an additional set of a few well-positioned machines to the front-end, called the Download Points (DPs). DPs never participate in initial replication of updates and rely entirely on Index Merging to obtain the latest configuration files. DPs alleviate some of the download bandwidth requirements from the SPs. In this way, data replication between the SPs does not need to compete as much for bandwidth with the download requests from subscribers.
3 Quorum-based Replication

The fault-tolerance of ACMS is based on the use of a simple quorum. In order for an Accepting SP to accept an update submission, we require that the update be both replicated to and agreed upon by a quorum of the ACMS SPs. We define quorum as a majority. As long as a majority of the SPs remain functional and not partitioned from one another, this majority subset will intersect with the initial quorum that accepted a submission. Therefore, this latter subset will collectively contain the knowledge of all previously accepted updates. This approach is deeply rooted in our assumption that ACMS can maintain a majority of operational and connected SPs. If there is no quorum of SPs that are functional and can communicate with one another, ACMS will halt and refuse to accept new updates until a connected quorum of SPs is re-established. Each SP maintains connectivity by exchanging liveness messages with its peers. Liveness messages also indicate whether the SPs are fully functional or healthy. Each SP reports whether it has pairwise connectivity to a quorum (including itself) of healthy SPs. The reports arrive at the Akamai NOCC (Network Operations Command Center) [2]. If a majority of ACMS SPs fails to report pairwise connectivity to a quorum, a red alert is generated in the NOCC and operations engineers perform immediate connectivity diagnosis and attempt to fix the network or server problem(s). By placing SPs inside distinct ISP networks, we reduce the probability of an outage that would disrupt a quorum of these machines. (See some statistics in section 6.) Since we require only a majority of SPs to be connected, we can tolerate a number of failures due to partitioning, hardware, or software malfunctions. For example, with an initial set containing five SPs, we can tolerate two SP failures or partitions and still maintain a viable majority of three SPs. When any single SP malfunctions, a lesser-priority alert also triggers corrective action from the NOCC engineers. ACMS operational experience with maintaining a connected quorum and various failure cases is discussed in detail in section 6. The rest of the section describes the quorum-based Acceptance Algorithm in detail. We also explain how ACMS replication and agreement methods satisfy the Correctness and Acceptance requirements outlined in section 1.1, and discuss maintenance of the ACMS SPs.

one request per file per second. The Accepting SP then sends the file that contains the update, along with its MD5 hash, to a number of SPs over a secure TCP connection. Each SP that receives the file stores it persistently on disk (under the UID name), verifies the hash, and acknowledges that it has stored the file. If the Accepting SP fails to replicate the data to a quorum after a timeout, it replies with an error to the publishing application. The timeout is based on the size of the update and a very low estimate of available bandwidth between this SP and its peers. (If the Accepting SP does not have connectivity to a quorum, it replies much sooner and does not wait for a timeout to expire.) Otherwise, once at least a quorum of SPs (including the Accepting SP) has stored the temporary file, the Accepting SP initiates the second phase to obtain an agreement from the Storage Points on the submitted update.
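The replication phase described above can be sketched roughly as follows. This is our own minimal illustration, not the production code: `send_to_peer` is a hypothetical stand-in for the secure TCP transfer, and the peer is assumed to store the file under the UID, verify the hash, and acknowledge.

```python
import hashlib

QUORUM = 3  # majority of five Storage Points, as in the paper's example

def replicate(uid, data, peers, send_to_peer):
    """Replicate `data` under temporary name `uid`; True once a quorum stores it."""
    digest = hashlib.md5(data).hexdigest()
    acks = 1  # the Accepting SP itself has already stored the file on disk
    for peer in peers:
        try:
            # The peer stores the file under `uid`, checks the MD5, then acks.
            if send_to_peer(peer, uid, data, digest):
                acks += 1
        except ConnectionError:
            continue  # unreachable peers are skipped; a quorum may still ack
    return acks >= QUORUM
```

Only on a `True` result would the Accepting SP proceed to the Vector Exchange phase; a `False` result corresponds to the timeout/error path described above.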
of the other SPs via the recovery routine (section 4). Note that it is possible for the Accepting SP to become cut off from the quorum after it initiates the VE phase. In this case it does not know whether its broadcasts were received and whether the agreement took place. It is then forced to reply only with Possible Accept rather than Accept to the publishing application. We recommend that a publisher that gets cut off from the Accepting SP or receives a Possible Accept try to re-submit its update to another SP. (From a publisher's perspective, the reply of Possible Accept is equivalent to Reject. The distinction was made initially purely for the purpose of monitoring this condition.) As in many agreement schemes, the purpose of the VE protocol is to deal with some Byzantine network or machine failures [18]. In particular, VE prevents an individual SP (or a minority subset of SPs) from uploading new data and then becoming disconnected from the rest of the SPs. A quorum of SPs could then continue to operate successfully without the knowledge that the minority is advertising a new update. This new update would become available only to a small subset of the Akamai nodes that can reach the minority subset, possibly causing discord in the Akamai network with respect to the latest updates. VE is based on earlier ideas of vector clocks introduced by Fidge [10] and Mattern [24]. Section 8 compares the Acceptance Algorithm with Two-Phase Commit and other agreement schemes used in common distributed systems.
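The publisher-side behavior recommended above (treat Possible Accept exactly like Reject and resubmit elsewhere) could look like the following sketch. The reply strings and the `submit_to` helper are our assumptions for illustration only.

```python
def publish(filename, data, storage_points, submit_to):
    """Try each SP in turn; only a definite "Accept" counts as success.

    `submit_to(sp, filename, data)` is assumed to return one of
    "Accept", "Possible Accept", or "Reject".
    """
    for sp in storage_points:
        try:
            reply = submit_to(sp, filename, data)
        except ConnectionError:
            continue  # cut off from this SP: try the next one
        if reply == "Accept":
            return sp  # definite acceptance by a quorum
        # "Possible Accept" and "Reject" both fall through to a resubmit.
    return None  # no SP gave a definite Accept
```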
3.3 An Example
We give an example to demonstrate both phases of the Acceptance Algorithm. Imagine that our system contains five Storage Points named A, B, C, D, and E, with SP D down temporarily for a software upgrade. With five SPs, the quorum required for the Acceptance Algorithm is three SPs. SP A receives a submission update from publisher P for configuration file foo. To use the example from section 3.1, SP A stores the file under a temporary UID: foo.A.1234. SP A initiates the replication phase by sending the file in parallel to as many SPs as it can reach. SPs B, C, and E store the temporary update under the UID name. (SP D is down and does not respond.) SPs B and C happen to be the first SPs to acknowledge the reception of the file and the MD5 hash check. Now A knows that the majority (A, B, and C) have stored the file, and A is ready to initiate the agreement phase. SP A broadcasts the following VE message to the other SPs: foo.A.1234 A:1 B:0 C:0 D:0 E:0
This message contains the UID of the pending update and the vector that has only A's bit set. (A stores this vector state persistently on disk prior to sending it out.) When SP B receives this message, it adds its bit to the vector, stores the vector, and broadcasts it: foo.A.1234 A:1 B:1 C:0 D:0 E:0 After a couple of rounds, all four live SPs store the following message with all bits set except for D's: foo.A.1234 A:1 B:1 C:1 D:0 E:1 At this point, as each SP sees that the majority of bits is set, A, B, C, and E upload the temporary file in place of the permanent configuration file foo, and store in their local database the UID of the latest agreed-upon version of file foo: foo.A.1234. All older records of foo can be discarded.
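The five-SP example above can be reproduced with a toy simulation of Vector Exchange. This is a deliberately simplified model (synchronous merge rounds instead of real broadcasts) that we include only to make the bit-vector mechanics concrete; it is not the production algorithm.

```python
SPS = ["A", "B", "C", "D", "E"]
QUORUM = len(SPS) // 2 + 1  # majority = 3

def run_ve(uid, initiator, live):
    """Each live SP sets its own bit and re-merges until vectors stabilize.

    Returns the set of SPs that see a majority of bits and therefore upload.
    """
    vectors = {sp: {s: 0 for s in SPS} for sp in live}
    vectors[initiator][initiator] = 1  # the Accepting SP's initial broadcast
    changed = True
    while changed:  # "a couple of rounds" in the paper's phrasing
        changed = False
        for sp in live:
            # Merge everything this SP could have heard, then add its own bit.
            merged = {s: max(v[s] for v in vectors.values()) for s in SPS}
            merged[sp] = 1
            if merged != vectors[sp]:
                vectors[sp] = merged
                changed = True
    return {sp for sp in live if sum(vectors[sp].values()) >= QUORum_check(vectors[sp])}

def QUORum_check(vector):
    return QUORUM

# With D down, the four live SPs converge and all decide to upload foo.A.1234.
uploaders = run_ve("foo.A.1234", "A", live=["A", "B", "C", "E"])
```

A minority partition (say only A and B live) never reaches three bits, so no SP uploads, which is exactly the property VE is designed to enforce.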
3.4 Guarantees

We now show that our Acceptance Algorithm satisfies the acceptance and correctness requirements, provided that our quorum assumption holds.

3.4.1 Acceptance Guarantee

Having introduced the quorum-based scheme, we now restate the acceptance guarantee more precisely than in section 1.1. The acceptance guarantee states that if the Accepting SP has accepted a submission, it will be uploaded by a quorum of SPs. Proof: The Accepting SP accepts only when the update has been replicated to a quorum AND when the Accepting SP can see a majority of bits set in the VE vector. Now if the Accepting SP can see a majority of bits set in the VE vector, it means that at least a majority of the SPs have stored a partially filled VE vector during the agreement phase. Therefore, any future quorum will include at least one SP that stores the VE vector for this update. Once such an SP is part of a quorum, after a few re-broadcast rounds, all of the SPs in this future quorum will have their bits set. Therefore, all the SPs in the latter quorum will decide to upload. So based on our assumption that a quorum of connected SPs can be reasonably maintained, acceptance by ACMS implies a future decision by at least a quorum to upload the update. The converse of the acceptance guarantee does not necessarily hold. If the quorum decides to upload, it does not mean that the Accepting SP will accept. As stated earlier, the Accepting SP may be cut off from the quorum after the VE phase is initiated, but before it completes. In that case the Accepting SP replies with Possible Accept, because acceptance is likely but not definite. The publishing application treats this reply as Reject and tries to resubmit to another SP. The probability of a Possible Accept is very small, and we have never seen it occur in the real system. The reason is that in order for the VE phase to be initiated, the replication phase must succeed. If the replication is successful, it most likely means that the lighter VE phase, which also requires connectivity to a quorum (but less bandwidth), will also succeed. If the replication phase fails, ACMS replies with a definite Reject.

3.4.2 Correctness

The Correctness requirements state that ACMS provides a unique ordering of all update versions for a given configuration file AND that the system synchronizes to the latest submitted update. We later relaxed that guarantee to state that ACMS allows limited re-ordering in deciding which update is the latest, due to clock skews. More precisely, accepted updates for the same file submitted at least 2T + 1 seconds apart will be ordered correctly, where T is the maximum allowed clock skew between any two communicating SPs. The unique ordering of submitted updates is guaranteed by the UID assigned to a submission as soon as it is received by ACMS (regardless of whether it will be accepted). The UID contains both a UTC timestamp from the SP's clock and the SP's name. Submissions for the same configuration file are first ordered by time and then by the Accepting SP name. So foo.B.1234 is considered to be more recent than foo.A.1234, and it is kept as the later version. A Storage Point accepts only one update per second for a given configuration file. Since we do not use logically synchronized clocks, slight clock skews and reordering of updates are possible. We now explain how we bound such reordering, and why any small reordering is acceptable in ACMS. We bound the possible skew between any two communicating SPs by T seconds (where T is usually set to 20 seconds). Our communication protocols enforce this bound by rejecting liveness messages from SPs whose clocks are at least T seconds apart. (I.e., such pairs of servers appear virtually dead to each other.) As a result, it follows that no two SPs that accept updates for the same file can have a clock skew of more than 2T seconds. Proof: Imagine SPs A and B that are both able to accept updates. This means both A and B are able to replicate these updates to a majority of SPs. These majorities must overlap by at least one SP. Moreover, neither A nor B can have more than a T-second clock skew from that SP. So A and B cannot be more than 2T seconds apart. Developers of the Akamai subsystems that submit configuration files to Akamai nodes via ACMS are advised to avoid mis-ordering by submitting updates to the same configuration file at intervals of at least 2T + 1 seconds. In addition, we use NTP [3] to synchronize our server clocks, and in practice we find very rare instances when our servers are more than one second apart. Finally, with ACMS it is actually acceptable to reorder updates within a small bound such as 2T. We are not dealing with competing editors of a distributed filesystem. Subsystems that are involved in configuring a large CDN such as Akamai must and do cooperate with each other. In fact, we considered two cases of such subsystems that update the same configuration file. Either there is only one process that submits updates for file foo, or there are redundant processes that submit the same or idempotent updates for file foo. In the case of a single publishing process, it can easily abide by the 2T rule and therefore avoid reordering. In the case of redundant writers that exist for fault-tolerance, we do not care whose update within the 2T period is submitted first, as these updates are idempotent. Any more complex distributed systems that publish to ACMS use leader election to select a publishing process, effectively reducing these systems to one-publisher systems.
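The ordering rule (timestamp first, then Accepting SP name) can be captured in a few lines. The `file.SP.timestamp` UID layout below is inferred from the paper's foo.A.1234 examples; the exact on-disk format may differ.

```python
def uid_key(uid):
    """Sort key for a UID of the assumed form 'file.SP.timestamp'."""
    filename, sp, timestamp = uid.rsplit(".", 2)
    return (int(timestamp), sp)  # order by time, break ties by SP name

def latest(uids):
    """Pick the version the system synchronizes to."""
    return max(uids, key=uid_key)

# Same second: foo.B.1234 beats foo.A.1234 because B sorts after A.
assert latest(["foo.A.1234", "foo.B.1234"]) == "foo.B.1234"
# A later timestamp always wins regardless of SP name.
assert latest(["foo.B.1234", "foo.A.1235"]) == "foo.A.1235"
```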
3.6 Maintenance

Software or OS upgrades performed on individual Storage Points must be coordinated to prevent an outage of a quorum. Such upgrades are scheduled independently on individual Storage Points so that the remaining system still contains a connected quorum. Adding and removing machines with quorum-based systems is a theoretically tricky problem. Rambo [19] is an example of a quorum-based system that solves dynamic set configuration changes by having an old quorum agree on a new configuration. Since adding or removing SPs is extremely rare, we chose not to complicate the system to allow dynamic configuration changes. Instead, we halt the system temporarily by disallowing accepts of new updates, change the set configuration on all machines, wait for a new quorum to sync up on all state (via the Recovery algorithm), and allow all SPs to resume operation. Replacing a dead SP is a simpler procedure where we bring up a new SP with the same SP ID as the old one and clean state.

The recovery protocol is called Index Merging. The SPs merge their index trees to pick up any missed updates from one another. The Download Points also need to sync up state. These machines do not participate in the Acceptance Algorithm and instead rely entirely on the recovery protocol on the Storage Points to pick up all state.
sions. Typically, Receivers are only interested in a subset of the index tree that describes their subscriptions. Receivers also download index files from the SPs via HTTP IMS requests. Using HTTP IMS is efficient but also problematic, because each SP generates its own snapshot and assigns its own timestamps to the index files that it uploads. Thus it is possible for an SP A to generate an index file with a more recent timestamp than SP B, but less recent information. If a Receiver is unlucky and downloads the index file from A first, it will not download an index with a lower timestamp from B until the timestamp increases. It may take a while for it to get all the necessary changes. There are two solutions to this problem. In one solution we could require a Receiver to download an index tree independently from each SP, or at least a quorum of the SPs. Having each Receiver download multiple index trees is an unnecessary waste of bandwidth. Furthermore, requiring each Receiver to be able to reach a quorum of SPs reduces system availability. Ideally, we only require that a Receiver be able to reach one SP that itself is part of a quorum. We implemented an alternative solution, where the SPs merge their index timestamps, not just the data listed in those indexes.
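One plausible reading of "merging index timestamps" is a per-entry maximum: when an SP merges a peer's index, it keeps, for each file, the newest (timestamp, UID) pair it has seen, so a Receiver never has to fall back to a lower-timestamped index. The exact semantics are not spelled out in the text, so this sketch is our assumption.

```python
def merge_index(local, remote):
    """Merge two indexes mapping filename -> (timestamp, uid), keeping the newer entry."""
    merged = dict(local)
    for name, (ts, uid) in remote.items():
        if name not in merged or ts > merged[name][0]:
            merged[name] = (ts, uid)  # the peer has a fresher version
    return merged
```

Because the merge is a pointwise maximum, it is commutative and idempotent, which is what lets every SP converge on the same view regardless of merge order.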
Receiver). If so, it stores the timestamp listed for that index as the target timestamp, and keeps making IMS requests until it downloads the index that is at least as recent as the target timestamp. Finally, it parses that index and checks whether any files in its subscription tree (that belong to this index) have been updated. If so, the Receiver then tries to download a changed file until it gets one at least as recent as the target timestamp. There are a few reasons why a Receiver may need to attempt multiple IMS requests before it gets a file with a target timestamp. First, some Storage Points may be a bit behind with Index Merging and not contain the latest files. Second, an old file may be cached by the Akamai network for a short while. The Receiver retries its downloads frequently until it gets the required file. Once the Receiver downloads the latest update for a subscription, it places the data in a file on local disk and points a local subscriber to it. The Receiver must know how to find the SPs. The Domain Name Service provides a natural mechanism to distribute the list of SP and DP addresses.
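The retry-until-fresh-enough loop described above can be sketched as follows. `fetch_ims` is a hypothetical helper standing in for the HTTP IMS request; we assume it returns `(timestamp, data)` on a 200 response and `None` on 304 Not Modified.

```python
import time

def download_until(path, target_ts, fetch_ims, last_ts=0, retry_s=5):
    """Keep issuing IMS requests until the copy is at least as recent as target_ts."""
    while True:
        result = fetch_ims(path, since=last_ts)
        if result is not None:
            ts, data = result
            last_ts = ts
            # A lagging SP or a stale CDN cache may return an older copy;
            # only a copy at or past the target timestamp ends the loop.
            if ts >= target_ts:
                return data
        time.sleep(retry_s)  # retry frequently, as the Receiver does
```

The same loop serves both stages: first to fetch an index at least as recent as the target timestamp, then to fetch each changed subscription file.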
5 Data Delivery
In addition to providing high fault-tolerance and availability, the system must scale to support download by thousands of Akamai servers. We naturally use the Akamai CDN (Content Distribution Network), which is optimized for file download. In this section we describe the Receiver process, its use of the hierarchical index data, and the use of the Akamai CDN itself.
6 Operational Experience
The design of ACMS has been an iterative process between implementation and field experience, where our assumptions of persistent storage, network connectivity, and OS/software fault-tolerance were tested.
Since ACMS runs automatic recovery routines, replacing damaged or old hardware on ACMS is trivial. The SP process running on a clean disk quickly recovers all of the ACMS state from other SPs via Index Merging.
individual SP corruption or downtime helps decrease the probability of full quorum failures.
7 Evaluation
To evaluate the effectiveness of the system, we gathered data from the live ACMS system accepting and delivering configuration updates on the actual Akamai network.
Figure 2: Propagation time distribution for a large number of configuration updates delivered to a sampling of thousands of machines.

connectivity issues; the files were delivered promptly after connectivity was restored. These delivery times meet our objectives of distributing files within several minutes. The figure shows a high propagation time for especially small files. Although one would expect that the propagation time increases monotonically with the file size, CDN caching slows down files submitted more frequently. We believe that many smaller files are updated frequently on ACMS. As a result, the caching TTL of the CDN is more heavily reflected in the propagation delay. The use of caching reduces bandwidth on the Storage Points anywhere from 90% to 99%, increasing in general with system activity and with the size of the file being pushed, allowing large updates to be propagated to tens of thousands of machines without significant impact on Storage Point traffic. Finally, to analyze general connectivity and the tail of the propagation distribution, we looked at the propagation of short files (under 20KB) to another random sample of 300 machines over a 4-day period. We found that 99.8% of the time a file was received within 2 minutes of becoming available, and 99.96% of the time it was received within 4 minutes.
7.2 Scalability
We analyzed the overhead of the Acceptance Algorithm and its effect on the scalability of the front-end. Over a recent 6-day period we recorded 43,504 successful file submissions with an average file size of 121KB. In a system with 5 SPs, the Accepting SP needs to replicate data to 4 other SPs, requiring 484 KBytes per file on average. The size of a VE message is roughly 100 bytes. With n(n-1) VE messages exchanged per submission, VE uses 2 KB per file, or 0.4% of the replication bandwidth.
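The overhead figures above follow directly from the message counts; a quick back-of-the-envelope check, using the paper's 5 SPs, 121 KB average file, and ~100-byte VE messages:

```python
def overhead(n_sps, avg_file_kb, ve_msg_bytes=100):
    """Per-submission replication traffic vs. VE agreement traffic, in KB."""
    replication_kb = (n_sps - 1) * avg_file_kb          # copies to the other SPs
    ve_kb = n_sps * (n_sps - 1) * ve_msg_bytes / 1024   # n(n-1) VE messages
    return replication_kb, ve_kb

rep, ve = overhead(5, 121)
# 484 KB of replication vs. roughly 2 KB of VE traffic (about 0.4%).
```

The same function reproduces the 15-SP extrapolation discussed below: about 1.7 MB of replication and roughly 21 KB of VE traffic, or about 1.2%.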
Figure 3: Propagation times for various size files. The dashed line shows the average time for each file to propagate to 95% of its recipients. The solid line shows the average propagation time.

For our purposes we chose 5 SPs, so that during a software upgrade of one machine the system could tolerate one failure and still maintain a majority quorum of 3. Extending the calculation to 15 SPs, for example, with an average file size of 121 KB the system would require 1.7 MB for replication and 21KB for VE. The VE overhead becomes 1.2%, which is higher, but not significant. Such a system is conceivable if one chooses not to rely on a CDN for efficient propagation, but instead to offer more download sites (SPs). The VE overhead can be further reduced as described in section 3.5. However, the minimum bandwidth required to replicate the data to all 15 machines may grow to be prohibitive. In such a system one could still allow each Server to maintain all indexes, but split the actual storage into subsets based on some hashing function such as Consistent Hashing [4]. For ACMS, choosing the Akamai CDN itself for propagation is the natural choice. The cacheability of the system grows as the CDN penetrates more ISP networks, and the system scales naturally with its own growth. Also, as the CDN grows, the reachability of receivers inside more remote ISPs improves.
Similar to ACMS Index Merging, these filesystems run recovery algorithms that synchronize the data among replicas, such as Bayou's anti-entropy algorithm. However, all of these systems attempt to improve the availability of data at the expense of consistency. The aim is to allow file operations by clients on a set of disconnected machines. ACMS, on the other hand, must provide a very high level of consistency across the Akamai network and cannot allow a single SP to accept and upload a new update independently. The two-phase Acceptance Algorithm used by ACMS is similar in nature to Two-Phase Commit [12]. Two-phase commit also separates a transaction phase from a commit phase, but its failure modes make it more suitable to a local environment. Vector Exchange (the agreement phase of our algorithm) was inspired by the concept of vector clocks introduced by Fidge [10] and Mattern [24], which are used to determine causality of events in a distributed system. Bayou also uses vectors to represent the latest known commit sequence numbers for each server. In our algorithm, the vectors' contents are simply bits, since each message has only two interesting states: known to a server or not. Each subsequent agreement is a separate instance of the protocol. VE uses a quorum-based scheme similar to Paxos [16] and BFS [17]. Paxos defines quorum as a strict majority, while BFS defines it as more than 2/3. VE allows quorum to be configurable as long as it is at least a majority. All these algorithms consider Byzantine failures and rely on persistent storage by a quorum to enable a later quorum to recover state. This strong property precludes scenarios, allowed by a simpler two-phase commit protocol, in which a minority of partitioned replicas commit a transaction. Other quorum systems include weighted voting [11] and hierarchical quorum consensus [15]. At the same time, VE is simpler than Paxos and BFS and does not implement full Byzantine Fault-Tolerance.
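The bit-vector agreement described above can be sketched as follows. This is an illustrative model only, under our own naming: it keeps one bit per SP and declares agreement at a configurable quorum of at least a majority, while omitting the broadcast loop, message loss, persistence, and recovery that the real protocol handles.

```python
# Sketch of one Vector Exchange instance: one bit per Storage Point,
# recording which SPs are known to have seen (and persisted) the update.
# Agreement is reached once a quorum of bits is set.
class VEInstance:
    def __init__(self, sps, quorum=None):
        self.sps = list(sps)
        # Quorum is configurable but must be at least a strict majority.
        self.quorum = quorum if quorum is not None else len(self.sps) // 2 + 1
        assert self.quorum > len(self.sps) // 2
        self.vector = {sp: False for sp in self.sps}  # one bit per SP

    def learn(self, sp):
        # SP 'sp' has seen the update: set its bit. In the real protocol
        # each SP broadcasts its vector and ORs in the vectors it receives.
        self.vector[sp] = True

    def merge(self, other_vector):
        # OR in a vector received from a peer.
        for sp, bit in other_vector.items():
            self.vector[sp] = self.vector[sp] or bit

    def agreed(self):
        return sum(self.vector.values()) >= self.quorum

inst = VEInstance(["sp1", "sp2", "sp3", "sp4", "sp5"])
for sp in ["sp1", "sp2", "sp3"]:
    inst.learn(sp)
print(inst.agreed())  # True: 3 of 5 bits set meets the majority quorum
```

Because each bit only ever moves from unset to set, merging vectors in any order is safe, which is what lets the exchange proceed fully asynchronously.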
It does not require an auxiliary protocol to determine a leader or a primary as in Paxos or BFS, respectively. This relaxation stems from the nature of ACMS applications, where only a single writer (or redundant writers) exists for each file and thus some bounded reordering is permissible, as explained in section 3.4.2. No leader enforces ordering. OceanStore [31] is an example of a storage system that implements Byzantine Fault-Tolerance to have replicas agree on the order of updates that originate from different sources. ACMS, on the other hand, must complete agreement at the time of an update submission. This is primarily due to an important aspect of the Akamai network: an application that publishes a new configuration file must know that the system has agreed to upload and propagate the new update. (Otherwise it will keep retrying.)
the Akamai network. However, polling intervals for such updates are not as critical. Some Windows users take days to activate their updates, while each Akamai node is responsible for serving requests from tens of thousands of users and thus must synchronize to the latest updates very efficiently. Moreover, systems such as Windows Update use a rigorous, centralized process to push out new updates. ACMS accepts submissions from dynamic publishers dispersed throughout the Akamai network. Thus, highly fault-tolerant, available, and consistent storage of updates is required.
9 Conclusion
In this paper we have presented the Akamai Configuration Management System, which successfully manages configuration updates for the Akamai network of 15,000+ nodes. Through the use of simple quorum-based algorithms (Vector Exchange and Index Merging), ACMS provides highly available, distributed, and fault-tolerant management of configuration updates. Although these algorithms are based on earlier ideas, they were specifically adapted to suit a configuration publishing environment and to provide a high level of consistency and easy recovery for the ACMS Storage Points. These schemes offer much flexibility and may be useful in other distributed systems. Just like ACMS, any other management system could benefit from using a CDN such as Akamai's to propagate updates. First, a CDN managed by a third party offers a convenient overlay that can span thousands of networks effectively; a solution such as multicast requires much management and simply does not scale across different ISPs. Second, a CDN's caching and reach allow the system to scale to hundreds of thousands of nodes and beyond. Most importantly, we have presented valuable lessons learned from our operational experience. Redundancy of machines, networks, and even algorithms helps a distributed system such as ACMS cope with network and machine failures, and even human errors. Despite 36 network failures recorded in the last 9 months that affected some ACMS Storage Points, the system continued to operate successfully. Finally, active monitoring of any critical distributed system is invaluable; we relied heavily on the NOCC infrastructure to maintain a high level of fault-tolerance.
Acknowledgements
We would like to thank William Weihl, Chris Joerg, and John Dilley among many other Akamai engineers for their advice and suggestions during the design. We want to thank Gong Ke Shen for her role as a developer on this
project. We would like to thank Professor Jason Nieh for his motivation and advice with the paper. Finally, we want to thank all of the reviewers and especially our NSDI shepherd Jeff Mogul for their valuable comments.
References

[1] Akamai Technologies, Inc., http://www.akamai.com/.
[2] Network Operations Command Center, http://www.akamai.com/en/html/technology/nocc.html.
[3] http://www.ntp.org/.
[4] D. Karger, E. Lehman, T. Leighton, M. Levine, D. Lewin, and R. Panigrahy, "Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web," Proc. 29th Annual ACM Symposium on Theory of Computing, pp. 654-663, 1997.
[5] M. Castro, M. B. Jones, A.-M. Kermarrec, A. Rowstron, M. Theimer, H. Wang, and A. Wolman, "An Evaluation of Scalable Application-level Multicast Built Using Peer-to-Peer Overlays," Proc. INFOCOM, 2003.
[6] Y. Chawathe, S. McCanne, and E. Brewer, "RMX: Reliable Multicast for Heterogeneous Networks," Proc. INFOCOM, March 2000, pp. 795-804.
[7] Y. H. Chu, S. G. Rao, and H. Zhang, "A case for end system multicast," Proc. ACM Sigmetrics, June 2000, pp. 1-12.
[8] S. B. Davidson, H. Garcia-Molina, and D. Skeen, "Consistency in partitioned networks," ACM Computing Surveys, 1985.
[9] S. E. Deering and D. R. Cheriton, "Host extensions for IP multicasting," RFC 1112, Network Working Group, August 1989.
[10] C. J. Fidge, "Timestamps in Message Passing Systems that Preserve the Partial Ordering," Proc. 11th Australian Computer Science Conf., 1988, pp. 56-66.
[11] D. K. Gifford, "Weighted Voting for Replicated Data," Proc. 7th ACM Symposium on Operating Systems Principles, 1979.
[12] J. Gray, "Notes on database operating systems," Operating Systems: An Advanced Course, pp. 394-481, 1978.
[13] IBM Corporation, WebSphere MQ family, http://www-306.ibm.com/software/integration/mqfamily/.
[14] J. Jannotti, D. K. Gifford, K. L. Johnson, F. Kaashoek, and J. W. O'Toole, "Overcast: Reliable Multicasting with an Overlay Network," Proc. OSDI, October 2000, pp. 197-212.
[15] A. Kumar, "Hierarchical quorum consensus: A new algorithm for managing replicated data," IEEE Trans. Computers, 1991.
[16] L. Lamport, "The Part-Time Parliament," ACM Transactions on Computer Systems, May 1998.
[17] M. Castro and B. Liskov, "Practical Byzantine Fault Tolerance," OSDI, 1999.
[18] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem," ACM Transactions on Programming Languages and Systems, July 1982.
[19] N. Lynch and A. Shvartsman, "RAMBO: A Reconfigurable Atomic Memory Service for Dynamic Networks," DISC, October 2002.
[20] M. Satyanarayanan, "Scalable, Secure, and Highly Available Distributed File Access," IEEE Computer, May 1990.
[21] Y. Saito, C. Karamanolis, M. Karlsson, and M. Mahalingam, "Taming Aggressive Replication in the Pangaea Wide-area File System," OSDI, 2002.
[22] K. Petersen, M. Spreitzer, and D. Terry, "Flexible Update Propagation for Weakly Consistent Replication," SOSP, 1997.
[23] S. Paul, K. Sabnani, J. C. Lin, and S. Bhattacharyya, "Reliable Multicast Transport Protocol (RMTP)," IEEE Journal on Selected Areas in Communications, April 1997.
[24] F. Mattern, "Virtual Time and Global States of Distributed Systems," Proc. Parallel and Distributed Algorithms Conf., Elsevier Science, 1988.
[25] Microsoft Corporation, Microsoft Message Queuing (MSMQ) Center, http://www.microsoft.com/windows2000/technologies/communications/msmq/default.asp.
[26] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A Scalable Content-Addressable Network," Proc. ACM SIGCOMM, August 2001.
[27] S. Ratnasamy, M. Handley, R. Karp, and S. Shenker, "Application-level multicast using content-addressable networks," Proc. NGC, November 2001.
[28] A. Rowstron and P. Druschel, "Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems," Proc. Middleware, November 2001.
[29] A. Rowstron, A.-M. Kermarrec, M. Castro, and P. Druschel, "Scribe: The design of a large-scale event notification infrastructure," Proc. NGC, November 2001.
[30] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," Proc. ACM SIGCOMM, August 2001.
[31] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An architecture for global-scale persistent storage," ASPLOS, November 2000.
[32] Sun Microsystems, Java Message Service, http://java.sun.com/products/jms.
[33] B. Zhao, J. Kubiatowicz, and A. Joseph, "Tapestry: An infrastructure for fault-tolerant wide-area location and routing," U.C. Berkeley Tech. Rep., April 2001.
[34] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubiatowicz, "Bayeux: An Architecture for Scalable and Fault-Tolerant Wide-Area Data Dissemination," Proc. NOSSDAV, June 2001.
[35] http://www.lcfg.org/.
[36] http://www2.novadigm.com/hpworld/.
[37] Microsoft Windows Update, http://windowsupdate.microsoft.com.