
Veritas Cluster Server

Learning Document for HA Concepts & Veritas Cluster Server

By Enterprise Services Wipro Infotech Delhi

Confidentiality
This document is being submitted to Adobe Pvt. Ltd. by Wipro Infotech, with the explicit understanding that the contents would not be divulged to any third party without prior written consent from Wipro Infotech.

Wipro Confidential



VERITAS Cluster Server


This topic provides an overview of the key concepts, features, and benefits of VERITAS Cluster Server (VCS).

An Overview of VCS
VCS is an architecture-independent availability management solution focused on proactive management of service groups, or application services. It is equally applicable in simple shared disk, shared nothing, or SAN configurations of up to 32 nodes and is compatible with single-node, parallel, and distributed applications. Cascading and multidirectional application failover is supported, and application services can also be manually migrated to alternate nodes for maintenance purposes.

VCS provides a comprehensive availability management solution designed to minimize both planned and unplanned downtime. Designed with a modular and extensible architecture to make it easy to install, configure, and modify, VCS can be used to enhance the availability of any application service with its fully automated, application-level fault detection, isolation, and recovery. All fault monitors, implemented in software, are themselves monitored and can be automatically restarted in the event of a monitor process failure. Monitored service groups and resources can either be restarted locally or migrated to another node and restarted. A service group may include an unlimited number of resources.

Various off-the-shelf agents are available from VERITAS to monitor specific applications such as file services, RDBMS, and enterprise resource planning, or the product can be customized to monitor any hardware component or software-based service. An SNMP agent allows VCS to generate SNMP traps so that resource state changes can be communicated to any SNMP-based management tool such as HP OpenView, CA Unicenter, Tivoli TME, and others. Although applicable to any application service that requires higher availability, VCS is most often deployed in mission-critical enterprise environments such as file serving, database, and enterprise resource planning (ERP).
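The restart-locally-or-migrate behavior described above can be pictured as a simple monitoring loop. The sketch below is a conceptual illustration only, not the actual VCS agent framework or engine logic; the probe function, retry limit, and the node and resource names are all hypothetical.

```python
# Conceptual sketch of application-level fault handling: probe a resource,
# retry locally a limited number of times, then fail the service group over.
# Illustration only; not VCS's real agent or engine implementation.
import time

def probe_resource(name):
    """Hypothetical health check; a real agent would test the actual service."""
    return True  # pretend the resource is healthy

def restart_locally(name):
    print(f"restarting {name} on the local node")

def fail_over(group, target_node):
    print(f"migrating service group {group} to {target_node}")

def monitor(group, resource, standby_node, max_local_restarts=2, interval=60):
    failures = 0
    while True:
        if probe_resource(resource):
            failures = 0                       # healthy: reset the failure count
        elif failures < max_local_restarts:
            failures += 1
            restart_locally(resource)          # first try to recover in place
        else:
            fail_over(group, standby_node)     # then move the whole service group
            return
        time.sleep(interval)

# Example with hypothetical names: monitor("db_group", "listener", "nodeB")
```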

The Industry's Most Scalable Availability Management Solution


Conventional cluster products rely on inefficient, point-to-point fault management and heartbeat mechanisms that do not scale well to large cluster configurations. To ensure scalability, VCS leverages a unique internode communication mechanism, called ClusterStat, which supports global atomic broadcast across a very low latency transport. This internode communication protocol is faster, more reliable, and significantly more scalable than the protocols in any of today's existing cluster products. In addition, all fault management has been multi-threaded to speed recovery in large configurations, and efficient multi-level fault management ensures very low overhead in configurations that may include thousands of managed resources. VCS supports 32 nodes today, but VERITAS expects this product to support hundreds of nodes in the future.

Other features that support very large configurations include a Cluster Registry based on a single configuration file auto-replicated between all nodes, support for an unlimited number of service groups, and a scalable, Java-based management GUI. A syntax checker built into the Cluster Registry minimizes operator error during configuration, and the registry supports dependency definitions between managed resources. During recovery, resources may either be started in parallel to speed recovery or according to the defined dependency hierarchy. An auto-discovery capability automatically recognizes new nodes as they are added to the cluster and replicates the registry to them.

Through the use of a scrollable, Windows Explorer-like management interface, the VCS Cluster Server Manager (CSM) can easily provide a comprehensive view of the status of all service groups in a single cluster, with the ability to drill down for more detailed information or to perform administrative tasks with the click of a mouse button. It can also manage multiple clusters, if so configured, across up to 32 nodes in a SAN configuration from a single management console.

VCS's ability to scale efficiently and manageably sets it apart from other availability management products on the market today. As enterprises move to SAN architectures, the scalability of cluster management software will play a key role in efficiently leveraging large, centralized disk stores. More scalable software will allow more nodes to share centralized storage, thus optimizing the use of storage and minimizing availability management and administrative costs. It will also provide for a much better long-term growth path, allowing more nodes and disk arrays to be added to accommodate even very rapid business expansion over a period of years.
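To make the choice between parallel and dependency-ordered startup concrete, the sketch below derives a start order from a set of "resource requires resource" definitions. It is a generic topological-sort illustration using hypothetical resource names; it is not the Cluster Registry's actual file format or VCS's engine code.

```python
# Conceptual sketch: compute a startup order from dependency definitions so that a
# resource starts only after everything it requires is online. Resources that become
# ready in the same round could be started in parallel. Hypothetical names throughout.
from graphlib import TopologicalSorter  # Python 3.9+

# resource -> set of resources it requires
dependencies = {
    "nic":         set(),
    "disk_group":  set(),
    "ip_address":  {"nic"},
    "mount_point": {"disk_group"},
    "database":    {"mount_point", "ip_address"},
}

ts = TopologicalSorter(dependencies)
ts.prepare()
round_no = 1
while ts.is_active():
    ready = list(ts.get_ready())      # everything here can be started in parallel
    print(f"round {round_no}: start {ready}")
    ts.done(*ready)
    round_no += 1
```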


VERITAS SANPoint Foundation Suite HA


This topic provides an overview of the key concepts, features, and benefits of VERITAS SANPoint Foundation Suite HA (SPFSHA).

Overview of SPFSHA
SPFSHA extends VERITAS File System and VERITAS Volume Manager to support shared data in a SAN environment. Using SANPoint Foundation Suite HA, multiple servers can access shared storage and files, transparently to the applications and concurrently with each other. SANPoint Foundation Suite HA incorporates VERITAS Cluster Server to provide cluster failover capabilities as well as internode communications across the servers.

Features and Benefits of SPFSHA


SANPoint Foundation Suite HA makes shared storage possible and practical for a wide variety of applications. Failover is faster in highly available configurations because a shared file system remains running even during a single server failure. Web serving gains manageability and scalability by accessing a common set of files for content serving on a site; in the event of a server failure, applications can redistribute load by reassigning network addresses. Workflow applications with large files, such as video production and CAD, can eliminate network traffic and data copying for improved performance and easier manageability. A backup process running on a separate server can access shared storage directly, reducing the impact of backups on production systems and networks.

Transparent access to shared files
Using SANPoint Foundation Suite HA, multiple servers can mount and access the same file system on shared media. No modifications to existing applications are required.

File system integrity in a shared environment
SANPoint Foundation Suite HA ensures the integrity of the shared file system by controlling access to the file system structure using the global lock manager. It also manages cache coherence and locking, so that systems accessing shared file systems always see the most current information.

Faster failover for high availability environments
SANPoint Foundation Suite HA includes the robust application-level failover capabilities of VERITAS Cluster Server. Application failover is very fast, as another server can start a failed server's application without having to restart the file system.

Cluster-wide management of SAN data
SANPoint Foundation Suite HA simplifies the management of shared data, with cluster-wide logical device naming, volume, and file system operations.

Course Overview
System availability continues to receive wide attention as many organizations run their critical business applications on Local Area Networks (LANs). The primary reason to address availability issues is the cost of downtime. You can establish an annual cost of downtime for every system and measure the benefits obtained by solving the problems that cause a system to fail. You can then select among the various available options to improve server uptime, based upon a reasonable cost and effort as well as a reasonable return on your investment.


Course Objectives
The overall goal of this learning experience is to provide a basic understanding of the concepts related to HA. This course will build the foundation on which to base more advanced courses on VERITAS HA products. During this course you will:
- Define the general concept of high availability.
- Identify HA storage management solutions at the disk level, such as hardware Redundant Array of Independent Disks (RAID) and volume management software.
- Describe the concept of clustering and investigate common clustering configurations.
- Identify HA methods at the network level, such as redundant network connections and redundant networks.
- Describe VERITAS HA products.

Lessons
Defining High Availability
- What is High Availability? Describe the concept of high availability.
- The Need for High Availability: Identify the need for increased data availability in today's computer environments.
- Types of Faults and Failures: Identify different types of faults and failures that can occur.
- High Availability vs. Disaster Planning: Differentiate between the goals and functions of high availability and disaster planning.
- High Availability vs. Fault Tolerance: Differentiate between the goals and functions of high availability and fault tolerant availability methods.
- High Availability Planning: Identify guidelines to consider when planning a high availability solution.
- The Layered Approach to Availability: Describe the concept of the layered availability approach.

Online Storage Management
- General RAID Levels: Describe the various RAID levels.
- Software RAID vs. Hardware RAID: Identify the advantages and disadvantages of hardware and software RAID.
- Defining a Volume: Describe volumes and identify the advantages of using them.
- VERITAS Volume Management: Virtual Objects: Describe the relationships between the virtual objects in VERITAS Volume Manager.
- VERITAS Volume Management: Volume Layouts: Identify the volume layouts that are available in VERITAS Volume Manager.
- VERITAS Volume Management: Hot Relocation: Describe the hot relocation process.

High Availability Clustering
- Fault Resilient Clustering Concepts: Describe the general characteristics of fault resilient HA solutions.
- Asymmetric 1 to 1 Configurations: Describe an asymmetric 1 to 1 configuration.
- Symmetric 1 to 1 Configurations: Describe a symmetric 1 to 1 configuration.
- N to 1 Clustering: Describe a traditional N to 1 networked cluster configuration.
- N to 1 SAN Clustering: Describe clustering techniques in a Storage Area Network environment.
- Failover Granularity in Clusters: Describe how resource and service groups enable application-level failover.

Highly Available Networks
- Networking Overview: Describe general network components, concepts, and common topologies.
- Public Network Failures: Describe failures that may affect the public service network.
- Heartbeat Network Failures: Describe challenges to maintaining proper heartbeat communication between nodes in a cluster.
- Redundant Networks and Network Connections: Describe how to configure redundant networks and network connections.

VERITAS Comprehensive Availability Solutions
- VERITAS Comprehensive Availability: Identify the role VERITAS software components play in an overall high availability solution.
- VERITAS Volume Manager: Provide an overview of the key concepts, features, and benefits of VERITAS Volume Manager.
- VERITAS Storage Replicator: Provide an overview of the key concepts, features, and benefits of VERITAS Storage Replicator.
- VERITAS NetBackup: Provide an overview of the key concepts, features, and benefits of VERITAS NetBackup.
- VERITAS Cluster Server: Provide an overview of the key concepts, features, and benefits of VERITAS Cluster Server.
- VERITAS SANPoint Foundation Suite HA: Provide an overview of the key concepts, features, and benefits of VERITAS SANPoint Foundation Suite HA.

What is High Availability?


You design a system, utilizing software and hardware components and implementing appropriate procedures, to satisfy the basic functional requirements of your organization. This system functions properly assuming that no faults or failures occur. However, whenever a fault or failure occurs that requires some type of maintenance operation, an outage is observed by your users. An HA solution enables you to design, implement, and deploy software and hardware components that satisfy your functional requirements while providing sufficient redundancy to mask faults and failures from your users. This topic describes the general concept of HA solutions.

Defining HA
HA is defined as the ability of a system to perform its function without interruption for an extended period of time. HA can be accomplished through special HA software and the implementation of redundant system and network hardware components. In a properly designed HA system, all of the possible failure modes for critical applications, network connections, and data storage have been identified and the recovery times have been analyzed. Therefore, you can determine how long the system will be down for any given failure. You can scale an HA system to an appropriate level so that in the event of a fault or failure, the system can recover to a known, consistent state in an acceptable period of time.


Availability Statistics
System availability is expressed as a measure of the period of time that the system is functioning normally. This involves determining the various component failure rates that factor into the overall rate of system failure. It is important to note that there is a distinction between component failure statistics and system failure statistics. The basic availability equation is used to determine the availability of a specific system component:

Availability = MTBF / (MTBF + MTTR)

where MTBF is the mean time between failures and MTTR is the mean time to repair.

MTBF
MTBF = Total actual operating time / Total number of failures

The MTBF is an expected future performance based on the past performance of a system component. If the component is new, there is no historical data on which to base the MTBF. When determining the MTBF of new hardware components, you should obtain these statistics from the particular vendor. However, these statistics may be inflated or may have been calculated using a high standard deviation.

MTTR
The MTTR is the average amount of time that it takes to repair a component, based upon actual statistical data. When calculating the MTTR, you can consider only the amount of on-site time that it takes to recover the component from the time when it failed. You can also calculate the MTTR including factors such as unavailability, response time, and travel time, in addition to on-site repair time. Many aspects of MTTRs are out of your control. For example, you may need to replace a specific part of a server. If this part is not currently in your stock, you will have to purchase the replacement component from the vendor or some other source and rely solely upon their ability to deliver the part in a short amount of time.

System Availability
As stated earlier, to calculate the availability of a system, you must take into account the availability of the individual system components such as servers, disks, and I/O cards. The more hardware the system features, the more likely the system is to fail. It is here that having many units of a single type of component affects the availability of a system. For example, suppose a new disk has a quoted manufacturer's MTBF of 600,000 hours, which indicates that a disk would be expected to fail once in about 70 years. This MTBF is calculated rather than based on actual failures. In addition, this MTBF value considers only the disk mechanism itself. If you factor in the power supply, controller, and fans, the MTBF becomes about 150,000 hours, or about 17 years. If your system utilizes 500 disks, the failure rates add across disks, so the MTBF for 500 disks is 150,000 hours divided by 500, or only 300 hours. This means that the system would fail about 30 times a year due to disk failure. The best way to reduce the frequency and duration of failures that affect the system is to employ a properly designed HA solution.
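As a quick worked check of the figures above, the Python sketch below reproduces the 500-disk arithmetic; the 4-hour MTTR used in the availability call is a hypothetical value chosen for illustration, not a figure from this course.

```python
# A minimal sketch of the availability arithmetic described above.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def aggregate_mtbf(component_mtbf_hours, component_count):
    """Failure rates add across identical components, so MTBF divides by the count."""
    return component_mtbf_hours / component_count

disk_mtbf = 150_000  # disk mechanism plus power supply, controller, and fans
print(f"One disk assembly: about one failure every {disk_mtbf / HOURS_PER_YEAR:.0f} years")

system_disk_mtbf = aggregate_mtbf(disk_mtbf, 500)         # 300 hours
failures_per_year = HOURS_PER_YEAR / system_disk_mtbf     # roughly 30 per year
print(f"500 disks: MTBF {system_disk_mtbf:.0f} hours, about {failures_per_year:.0f} failures/year")

# Availability of one disk assembly assuming a hypothetical 4-hour mean time to repair.
print(f"Availability with a 4-hour MTTR: {availability(disk_mtbf, 4):.6f}")
```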

The Rule of the Nines


Availability is often measured by the "rule of the nines".

Percentage Uptime     Percentage Downtime    Downtime Per Year       Downtime Per Week
98%                   2%                     7.3 days                3 hours, 22 minutes
99% (2 nines)         1%                     3.65 days               1 hour, 41 minutes
99.8%                 0.2%                   17 hours, 13 minutes    20 minutes, 10 seconds
99.9% (3 nines)       0.1%                   8 hours, 45 minutes     10 minutes, 5 seconds
99.99% (4 nines)      0.01%                  52.5 minutes            1 minute
99.999% (5 nines)     0.001%                 5.25 minutes            6 seconds
99.9999% (6 nines)    0.0001%                31.5 seconds            0.6 seconds
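The downtime figures in the table follow directly from the uptime percentage; a minimal sketch of that arithmetic is shown below.

```python
# Reproduce the "rule of the nines" figures: downtime implied by an uptime percentage.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_WEEK = 7 * 24 * 60

def downtime_minutes(uptime_percent):
    """Return (minutes of downtime per year, minutes of downtime per week)."""
    down_fraction = 1 - uptime_percent / 100
    return MINUTES_PER_YEAR * down_fraction, MINUTES_PER_WEEK * down_fraction

for pct in (98, 99, 99.8, 99.9, 99.99, 99.999, 99.9999):
    per_year, per_week = downtime_minutes(pct)
    print(f"{pct:>8}% uptime: {per_year:10.2f} min/year, {per_week:8.3f} min/week")
```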

For most environments, 99% availability is adequate. This level of availability results in less than 2 hours of downtime a week. It is important to consider when this downtime takes place. For example, if a typical business system is down on a Sunday between 3 A.M. and 4:30 A.M., this is more acceptable than if the system is down on a Tuesday afternoon between 2 P.M. and 3:30 P.M. It is also important to consider when 100% availability is required. For example, suppose that a brokerage house performs all stock transactions between 9 A.M. and 4 P.M. on weekdays. If the system is designed for 99% availability, it is crucial that you ensure that no system downtime occurs during the most critical business hours.

HA Requirements
There is a trade-off in costs and benefits for various degrees of availability. When designing a system with HA requirements, the initial requirements often include:
- System availability at all times with no perceived loss of service
- No loss of data at any time
- Maintenance and upgrade activities that do not interfere with operational service

Without being properly informed of the total costs and consequences of implementing a system that satisfies these requirements, it is natural to want an HA solution to satisfy these lofty goals. 100% data availability is an ideal concept, but the implementation of this solution results in very high monetary, performance, and complexity costs. As you move from lower to higher degrees of availability, the costs can increase dramatically. In most environments, a step from one level to the next (for example, from 99% to 99.9%) increases costs 5 to 10 times. It is ultimately the responsibility of an HA system designer to determine:
- The degree of availability that is actually required by the users, as opposed to what they might like to have
- The technological alternatives that can be used to meet these requirements
- All the costs: not only monetary, but also performance degradation and system complexity

The High Availability Equation


One way to look at high availability is to view it as a simple equation. The effectiveness of any HA system must include reducing the time required to recover from a fault and simplifying management of the system to help enable you to scale and grow your system.

Time to Recovery
Most enterprise environments feature a wide range of systems, ranging from online e-commerce systems to less-critical human resources (HR) systems. It is important to analyze the required recovery times of the various systems in your enterprise by performing a business impact analysis. Currently, a lot of work is being done in this area by organizations in the analyst community such as the Gartner Group, META Group, and International Data Corporation (IDC), among others. Typically, you can break the systems in an enterprise down into five basic levels based upon their time-to-recovery requirements:
- Safety critical
- Mission critical
- Business critical
- Task critical
- Task noncritical

Examples of safety critical applications include systems that manage a nuclear reactor or maintain patients' heartbeats at a hospital. At the other end of the spectrum are task noncritical systems, such as an HR system, that can probably withstand an extended outage without significant impact on the overall enterprise.

Levels of Availability
It may be acceptable for a task noncritical system to have a recovery time measured in days or tens of hours. For these systems, basic availability, such as a traditional offline tape backup, is sufficient. If you lose your HR system, you can simply recover it from a secondary copy of the data on tape and bring the system back online in a number of hours. If the recovery process takes a day or two, the downtime will not significantly impact users. For business and mission critical systems, you should use a different availability approach. For example, rather than restoring from an offline copy, you can recover from an online copy of the data. You can utilize technology such as replication, snapshots, and mirroring to reduce the time to recovery to tens of minutes up to a couple of hours. For even more critical systems, you can reduce the recovery time to minutes or seconds by using clustering. There is a wide range of data availability possible. However, this range can be divided into four common levels of availability:

Basic Availability

A basic availability environment requires no specific planning for downtime. Backups might be taken to protect data, but the time required to restore the data can be quite extensive in this environment. Basic availability can be adequate for many applications, but if downtime causes any significant costs, you should consider a higher level of availability. Task noncritical systems would probably feature a basic availability solution.
Increased Availability

This level of availability is achieved by employing RAID (redundant array of independent disks) technology to provide online data protection in addition to the advantages of basic availability. RAID is an array of disks in which redundant data is stored in different places on multiple disks. RAID technology is described in detail in the "RAID Basics" section of this course. A task critical system might employ an increased availability solution.
High Availability
In an HA architecture, hardware and software failures may occur. However, the intent is to mask the failure from the user and to reduce the time needed to recover from that failure to several minutes or less. It is important to note that HA solutions are not fault tolerant. It is possible for all of the systems in an HA configuration to fail. The goal of an HA strategy is to recover as soon as possible from a system failure, rather than to ensure that a failure never occurs. In a simple example of an HA system, two independent servers are logically connected to form a cluster. One server stores a copy of every component of the other system. If a failure occurs on the primary server, files or services can be transferred to the secondary server. In addition to masking the failure, HA methods enable you to significantly reduce recovery times to a matter of minutes in the event of a major system failure. Typically, business critical and mission critical systems would need to use an HA solution.

Continuous Availability
The most advanced level of availability is continuous availability (CA). CA is defined as an environment explicitly designed to eliminate all computer downtime, both unplanned and planned. Today, CA environments approach 99.999% availability, or less than 5 minutes of downtime per year. However, it is important to note that the costs for CA systems can range into the millions of dollars. Examples of industries that most often utilize continuous availability solutions include air-traffic control and stock-floor trading systems.

Advanced CA architectures usually feature proprietary, large, hardware-based fault tolerant host machines. In a fault tolerant system, hardware is designed to perform self-checking diagnostics and all of the main hardware components are physically duplicated. Self-checking resides on each major hardware component and detects and isolates failures instantly. This ensures that erroneous data cannot corrupt other system areas. In fact, some diagnostics built into specific CA architectures automatically detect problems before they lead to failures and initiate service instantaneously should a component fail. Component duplication enables normal processing to continue even in the event of a hardware failure, with no performance degradation. Safety critical systems would require a CA solution.

Simplified Management
In a typical data center environment, you may have a number of servers that run different operating systems: Solaris, HP-UX, Windows NT, and Windows 2000. The system might feature a number of network connections as well, such as traditional Ethernet or SCSI connections, fibre-type connections, or storage area networking (SAN). There are also various types of storage devices in the system. Today's enterprise is a very heterogeneous environment. In addition, almost every environment is growing at tremendous speed. This requires more disk storage, different types of storage, more systems, applications, networks, and so on. How do you manage all of this? The second part of the high availability equation is simplifying management.




It is important for the enterprise to have an infrastructure that provides the scalability required by future demands. In addition, you need to implement a solution that enables you to perform automated tasks, virtualization, and consolidation across all systems in the enterprise, no matter the platform or operating system.

The Need For High Availability


Historically, only a select number of applications were considered critical enough to require an HA solution. In the past several years, the cost of systems has been significantly reduced and many new technologies have emerged in the business landscape, such as fibre-based Storage Area Networks (SANs). Modern applications have improved user productivity and increased the speed of business transactions. Modern businesses are much more dependent on the availability of their computer systems. This topic identifies the need for increased data availability in today's computer environments.

To make a business successful, employees and customers need to have access to their data and the services to work with that data. In today's E-commerce environment, customer expectations require round-the-clock data availability. The maintenance of corporate data and access to the data is a business necessity. Critical applications and services include:
- Database servers
- File servers and filesystems
- Web servers
- Enterprise Resource Planning (ERP)
- Application servers

There are many different reasons for implementing an HA solution. Typically, there are two situations that an HA solution is designed to address:
- The system crashes due to an unforeseen fault or failure.
- The system is brought down intentionally for system maintenance and upgrade.

Originally, it was the utility companies that led the way toward more available systems and applications. Now, global business and E-commerce are having a significant impact on the definition of acceptable system availability.


Data must be available round-the-clock. Regular business hours do not exist in our contemporary global marketplace. For example, an Internet service organization must account for customers arriving at their site at any hour of any day. In addition, most modern organizations depend on networking technologies. More and more business-critical data is available through networks. Access to corporate information and shared knowledge has significantly improved productivity and communication. However, this reliance on network solutions has also helped to create a need for an HA solution to ensure that the network is resilient to failures. These new requirements are creating greater demands on the corporate information technology (IT) infrastructure. In the past, it was acceptable to expect 99% system availability. This would equate to about 3.5 days of downtime per year. However, the growth of E-commerce, greater demands for customer service, an increased dependence on network solutions, and a competitive global market have contributed to a need for high availability. When you consider the new costs of downtime, 99% system availability is no longer acceptable.

The Costs of Downtime


Before you can analyze the costs of an HA solution, you should consider the cost of not implementing such a solution. For example, in the highly competitive world of Web-based brokerage houses, one hour of downtime can cost a firm an estimated $6.5 million. Gartner Group and Dataquest studies indicate that in 2000, downtime cost United States firms over $4.6 billion.

Industry          Business Operation                  Average Downtime Cost Per Hour
Financial         Brokerage operations                $6.45M
Financial         Credit card/sales authorization     $2.6M
Media             Pay per view TV                     $150K
Retail            Home shopping (TV)                  $113K
Retail            Home catalog sales                  $90K
Transportation    Airline reservations                $89.5K
Media             Telephone ticket sales              $69K
Transportation    Package shipping                    $28K
Financial         ATM fees                            $14.5K

These numbers represent only direct monetary losses. They do not include less obvious losses, such as lost opportunities or customers moving their business to a competitor. Downtime can adversely affect your corporate image in the industry as well. Competitors may discover this loss and spread the news through the corporate community. Today, many companies find themselves relying on their systems to provide data continually to facilitate employee productivity, improve corporate image, and better serve their customers.
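Combining a per-hour figure from the table with the downtime implied by a given availability level yields a rough annual cost of downtime. The sketch below is illustrative only: the 99.9% availability level is an assumed example, and the hourly cost is the brokerage-operations figure from the table above.

```python
# Rough annual cost of downtime = expected downtime hours per year x cost per hour.
HOURS_PER_YEAR = 24 * 365

def annual_downtime_cost(availability_percent, cost_per_hour):
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours, downtime_hours * cost_per_hour

# Assumed example: a brokerage operation ($6.45M/hour) running at 99.9% availability.
hours, cost = annual_downtime_cost(99.9, 6_450_000)
print(f"About {hours:.1f} hours of downtime per year, costing roughly ${cost:,.0f}")
```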

Types of Faults and Failures


Before learning about HA solutions that can be used to recover from a fault or failure, it is useful to explore faults and failures in more detail. This topic identifies different types of faults and failures that can occur. A distinction can be made between faults and failures: faults are often defined as non-compliances within the system that may or may not be externally visible to the end user, whereas failures are those faults that are externally visible. Within this course, the terms fault and failure are used interchangeably.

Defining a Failure


A failure is a deviation from the expected behavior of the system. In other words, if the system is specified to exhibit a certain functionality, and in the process of execution the system produces a discernibly different functionality, a failure has occurred. Functionality is typically delivered from the system by running a procedure to execute the logic contained in software that runs in a hardware environment containing client and server machines, networks, data storage, and other peripherals. Failures can occur in any of these software procedures or the hardware in a system. Failures can be classified as either:

Reproducible
A prescribed set of actions leads to the observance of the failure in a predictable manner. Hard reproducible failures occur identically on every execution with the same input. Soft reproducible failures might occur with a certain probability on identical executions.

Nonreproducible
The appearance of the failure is random, or is linked to a root cause outside of the environment for which the system was engineered.

HA solutions are useful in dealing with soft reproducible and nonreproducible failures, but less effective with hard reproducible failures.

Types of Possible Nonreproducible Failures


There are several different types of failures:

Physical Hardware Failure
Although the industry has come a long way in increasing the MTBF rates for individual hardware packaging and mechanical components, hardware is still vulnerable to faults. Hardware failures are typically nonreproducible. For example, a hard drive crashes or a tape library breaks. The most common examples of hardware failures include:

System memory or CPU failures

Some contemporary computer systems have the ability to reconfigure a failed component without requiring a reboot of the system. This capability helps increase data availability in the event of CPU or memory failures.

Backplane failure


Backplanes, or motherboards, are the large circuit boards that contain sockets for expansion cards and provide the general pathway for all data in a computer system. These components rarely fail, but they can fail in some circumstances. In addition to the expansion sockets, active backplanes also contain logical circuitry that performs CPU operations. Passive backplanes contain almost no computing circuitry. Usually, the CPU is inserted on an additional card in the passive backplane. Passive backplanes enable you to repair failed components or upgrade to new components easily.

Disk failure


Disks are very prone to failures because of the high rotation speed, low tolerances, and possible problems with the controller boards or cables.
Tape device failure

Tape devices have similar characteristics to disks, such as high speeds and low tolerances, and are also failure-prone. In addition, tape devices are repeatedly stopping and starting. These actions may strain or overheat the motor and lead to motor failure.
Fan failure

Fans can also fail. If the cooling system fails, the effects may not be immediately visible, but over time excessive heat can cause a system to act unpredictably or fail at an undesirable point in the future.

Power supply failure


Power supplies often have the worst MTBF of all components in a system. They can fail instantly or over time. The gradual failure of a power supply can cause intermittent failures or unpredictable behavior in other components. Failures in power supplies are caused by excessive switching, varying voltage levels, or other stress-inducing factors.

Network Interface Card (NIC) failure

NICs are expansion boards inserted into a computer so the computer can be connected to a network. If a NIC fails, network connectivity is lost. It may be difficult to detect a NIC failure. A simple method used to detect these failures is to initiate some network traffic, and then use a command to display the packet count. If the packet count does not increase, it is likely that the NIC has failed. Redundant NICs should be used to avoid any loss of network connectivity due to the failure of a single NIC.
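The packet-count check described above can be scripted. The sketch below is a minimal illustration that assumes a Linux-style /proc/net/dev counter file; on other platforms (for example, Solaris) you would parse the output of a tool such as netstat -i instead. The interface name and ping target are hypothetical.

```python
# Minimal sketch of a NIC health check: generate traffic, then verify that the
# interface's received-packet counter increased. Assumes Linux /proc/net/dev;
# the interface name "eth0" and the ping target are illustrative values.
import subprocess
import time

def rx_packets(interface="eth0"):
    """Read the received-packet counter for one interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if ":" not in line:
                continue
            name, stats = line.split(":", 1)
            if name.strip() == interface:
                return int(stats.split()[1])  # second field after the colon = RX packets
    raise ValueError(f"interface {interface} not found")

def nic_seems_alive(interface="eth0", target="192.168.1.1"):
    before = rx_packets(interface)
    subprocess.run(["ping", "-c", "3", target], stdout=subprocess.DEVNULL)  # generate traffic
    time.sleep(1)
    return rx_packets(interface) > before

if __name__ == "__main__":
    print("NIC appears healthy" if nic_seems_alive() else "NIC may have failed")
```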
Environmental Failures
Failures can be caused not only by internal system components, but also by environmental forces beyond your control. Such environmental failures include:

Power fluctuations or outages
The most common external source of system failures is power outages. Things to consider in determining the probability of power outages should include, but not be limited to, the history of local utility companies providing uninterrupted service, the history of brownouts due to high temperatures in your area, and your proximity to major power sources.

Cooling system failure
The environmental cooling system can fail. This would cause massive overheating of some of your crucial system components. You should analyze your facilities' environmental control system for the likelihood of failure.


Structural failure
Structural failures can range from the complete collapse of the building's support structure to the structural failure of a single computer rack or cabinet.

Natural disasters
Natural disasters are occurrences such as fires, floods, earthquakes, typhoons, or hurricanes. Considerations when identifying your organization's susceptibility to natural disasters can include geographic location, the topography of the land, or the history of natural disasters in the local area.

Human Error or Acts of Terrorism
The causes of system failures are not limited to natural causes; failures can also be caused by human error or acts of terrorism.

Human error
A failure can result from an operator or administrator issuing an inappropriate command, or from an individual disrupting the system by accidentally tripping over a cable or unplugging a power supply.

Acts of terrorism
Unfortunately, in the contemporary computing world, there are many examples of terrorism: sabotage, vandalism, arson, robbery, vehicle crashes, hazardous waste, civil disorder, war, or malicious computer crimes. Human threats are difficult to identify, but you might consider such things as your proximity to major highways that might transport combustible or otherwise hazardous materials, the implementation of virus protection software, or the degree of employees' access to your computer facilities.

Network Failures
Networks are susceptible to failures in every component within the network. Network failures include physical failures that can take place in many network-specific hardware components, such as switches, network cables, or NICs (network interface cards). In addition to these physical components, networks feature complex configuration and service information that, if misconfigured or not authenticated properly, can lead to countless failures that can bring down a network.

Database Failures
A database system typically features a large source of data and many sub-applications used to extract specific information from this data based on specified conditions. Failures can occur at any level in a database system, ranging from a catastrophic failure of the main database engine to a temporary hang in the client-side application. Web servers often interact with back-end database servers, and therefore all the possible failures that can occur in a database application can adversely affect a Web server as well.

High Availability vs. Disaster Planning


This topic differentiates between the goals and functions of high availability and disaster planning.

Disaster Recovery
The ability to recover from a natural disaster, such as fire, flood, or earthquake, in a short time is called disaster recovery. The results of these disasters include physical damage to systems, and loss of data, telecommunication, power, and work space. Recovery time might be as short as minutes or hours, or as long as days or weeks. Frequently, recovery time is directly related to how quickly a system can be accessed, the data and applications loaded, and telecommunications restored. Redundancy is usually provided by a duplicate system at a different, geographically remote site. The need for disaster recovery solutions and services is increasing rapidly. The costs after a disaster become quite large, and the need to restore access to systems and applications becomes very important. Two important issues associated with disaster recovery are the replication of data and the currency of the data. The replication of data to an alternate site is affected by distance and speed of the links. The slower the replication method, the more data will be lost in case of disaster. The impact of a disaster on the organization must be assessed along with the cost of providing for disaster recovery.

Comparing Disaster Recovery and HA


Disaster Planning                                 High Availability
Critical                                          Urgent
Testing and planning for a theoretical event      Optimizing and responding to current events
Offsite/offline data storage                      Everything online
Notification of loss by event;                    Notification of loss by users
possible advance notification

HA involves system optimization and the ability to respond to current events. HA is an urgent requirement for most organizations. Disaster planning and recovery should be a critical concern to most organizations. Disaster planning requires continual testing and refinement. Opportunities to conduct real world drills are scarce or non-existent. In most cases, testing your disaster recovery plan is more of a theoretical exercise than a real world experience. HA is much more of a day-to-day operation and therefore, more organizations neglect disaster planning in favor of HA. A properly developed disaster recovery plan should involve offsite and offline storage for recent copies of your data. HA strives to keep all data available at all times. In the event of a disaster, you will often know about the loss of data before your users will. In an HA environment, users will often inform you in the event of the loss of data or any system downtime.

HA and DR Together
When defining a disaster recovery plan, your top priority is your mission critical applications. Mission critical applications are required to be available at all times. While backup and recovery technology ensures data protection, recovery methods are often not fast enough to handle the recovery of data used by mission critical applications. HA methods such as replication and clustering can help to ensure immediate recovery whenever a disaster strikes.

This example illustrates a plan that addresses HA and disaster planning. By implementing a configuration with cluster management and replication concerns, you can effectively maintain and protect your end-users and information. You can manage clusters and move applications running at a primary site to a secondary site, while maintaining access to critical information through the continuous replication of data between sites. Clustering and replication are covered in more detail later in this course.

Disaster Planning Using Storage Level Data Replication


Storage level data replication is a popular choice for disaster planning. However, replication solutions must ensure that your data is replicated with full integrity. Replicated data should be consistent, up to date, and ready to use at a moment's notice, while also being transparent to the application. The replication of data should also be seamless, such that the application data can be sent from one primary site to multiple secondary sites for greater protection. Replication products should not rely on any dedicated networks or vendor-specific storage hardware platforms, in order to offer better protection against a single point of failure and offer greater flexibility for change and growth. Clustering with replication offers mission critical applications the optimal mechanism for immediate recovery. If both the machine and the disks fail, recovery can occur in such a way that the application can fail over to another machine in the cluster using the replicated data.

High Availability vs. Fault Tolerance


This topic differentiates between the goals and functions of high availability and fault tolerant availability methods. A system described as fault tolerant contains multiple hardware components that function concurrently, replicating all of the I/O. This type of system protects against hardware failures by incorporating redundant hardware components in a single system. Fault tolerant systems can cost as much as ten times more than highly available solutions that are not fault tolerant.

Defining Fault Tolerance


Fault tolerance extends the definition of high availability. This term is used for systems that can tolerate nearly any type of possible fault without going down. This is a solution used by industries such as power companies and telephone companies. Fault tolerance guarantees 99.9999% availability, or approximately 30 seconds of downtime per year. Fault tolerant systems are very expensive because of the way they are designed. They include complete hardware redundancy with no single point of failure from a hardware perspective. The only situations that can cause downtime in a fault tolerant system are software or application failure, or a catastrophic environmental disaster.

Hardware redundancy is necessary, but not sufficient, for fault tolerance. A fault tolerant system must also feature some sort of redundancy management. For example, a system may provide redundant hardware components to ensure that at least one result is correct in the presence of a fault. If a user must somehow examine the results and select the correct one, then the only fault tolerance is performed by the user. However, if the system selects the correct redundant result for the user, then the system is not only redundant, but also fault tolerant. Fault tolerant systems cannot run on typical configurations because their specialized applications must communicate directly with the hardware, sometimes for each transaction. Although 99.9999% availability is appealing, the costs of this solution are well beyond the affordability of most companies.

Characteristics of a Fault Tolerant System


It is important to note that fault tolerant solutions and HA solutions are two very different concepts. A fault tolerant system:
- Is not impacted in the event of a fault or failure.
- Features no loss of access.
- Enables immediate and transparent recovery.
- Includes replacement, or spare, hardware components that are online and running in sync with the primary system.
- Is expensive.
- Is limited in scalability.
- Does not use off-the-shelf hardware and software. The hardware usually has very specific software hooks, and applications need to be written to a specific API of the operating system.
- Requires a specially modified operating environment.
- Features inherent redundancy management.

Fault Tolerance Processes


Fault tolerance involves the following actions in the event of a fault or failure:

Detection
The system determines that a fault or failure has occurred.

Diagnosis
The system identifies the precise subsystem or system component that failed, and determines the immediate cause of the fault.

Containment
The system prevents the propagation of faults from their origin to a point in the system where the fault can have any effect on the service to the user.

Masking
The system ensures that the correct output is passed to the user in spite of the failed component.

Compensation
It may be necessary for the system to provide a proper response to compensate for the output of the faulty component.

Repair
The system removes the fault from the system or recovers the system.

In well-designed fault tolerant systems, faults are contained before they propagate to the extent that the delivery of system service is affected, but this can leave a portion of the system unusable because of residual faults. If subsequent faults occur, the system may be unable to cope because of this loss of resources, unless these resources are reclaimed through a recovery process which ensures that no faults remain in system resources or in the system state.

High Availability Planning


Determining an organization's availability requirements and architecting a system to meet them is a complicated process. This topic identifies guidelines to consider when planning a high availability solution.

Guidelines When Planning an HA Implementation


When you are planning your HA system, you need to consider many different factors. Because every environment is unique and every business has different needs, it is difficult to create an all-encompassing checklist for planning an HA system. This list addresses the most important guidelines to consider:

Determine the cost of downtime.
It is difficult to estimate the cost to an organization if the system goes down. In general, the consequences of a serious system failure will vary depending on the characteristics of the specific business. The investment in a high availability solution should match the cost and risk of unavailability.



Understand the recovery point and recovery time.
It is important for you to determine when recovery is necessary in your system's operations and how long a time exists between the point of failure and recovery. The recovery point is more significant in data-centric operations where any loss of data is unacceptable. Recovery time is most important in transaction-centric environments. For example, this simple diagram illustrates how the estimated recovery times and recovery points relate in CA, electronic vaulting, off-site storage, and standard HA scenarios.

Protect appropriate system components.
When you design your high availability solution, you should allocate more money to protect specific system components. When determining the specific system components to protect, select the components that would have the most impact in the event of a failure, are most likely to fail, or are the most expensive to replace. Focus on the areas that can have a significant, negative impact on the ability to keep an application and your organization up and running if they fail. You should consider which components are most likely to fail, because these will have the most harmful effect on the MTBF values. Protect components that may be expensive to replace in the event of failure.

Isolate and eliminate any single points of failure (SPOFs).
A SPOF is any system component that will cause downtime if it fails. It is important to investigate the path of execution in your system and identify all the weak links in the chain. If one link breaks, the whole system fails, no matter how well constructed the rest of the system is. You should walk through the whole process from your servers and disk storage, to the applications, through the network, and to the client systems. Common SPOFs are:
- The computer system. Clustering software can be used to link several systems that can each run the others' applications in time of failure of the primary system.
- Disks. Disk mirroring or disk array technology can be used to protect data.
- Host adapters and cables. Host adapter failures can be protected against with operating system features and redundant host adapters.
- Networks. Networking has many hardware components; each could be a SPOF. The key to eliminating failures within the network is understanding the topologies being used, understanding the failure points within those topologies, and removing these failure points from the network. There are many hardware and software products which provide increased network availability.
- Electrical power. Uninterruptible Power Supplies (UPSs) and/or multiple power sources can protect against electrical power failures.

Ensure the security of the system.
Prevent data corruption and unauthorized access to your system. Security is an issue that is often overlooked in discussions of HA management, because it does not immediately reduce the impact of failure. However, it is important to any HA solution. The management center must be secured, so that only authorized personnel have access to it. The management systems, or applications, also need to support some type of user authentication, such as user IDs and passwords. Secure transactions between the applications and the system components are available through Remote Procedure Calls (RPC) or some other protocol. Secure communications should be implemented whenever possible in an HA configuration.

Centralize similar applications and services on large servers.
It should be noted that this is not a steadfast rule; sometimes many small machines running single instances of databases or single applications can be a more appropriate configuration. In general, by consolidating similar applications and services on centralized large servers, you can significantly reduce the complexity of your system, the number of backups that are required, and the number of components that can fail.

Automate repetitive tasks.
You can significantly reduce the number of hours required for hands-on operations by automating the tasks that are standard and repetitive. In addition, automation reduces the number of possible faults due to human error, such as mis-typed commands or accidental file deletion. You can also update and maintain consistent policies and procedures in a single centralized location.

Perform a thorough test initially and perform additional tests on a regular basis once the system is up and running.
Before you deploy your HA solution, you should perform a thorough test that investigates every level in your system, from hardware component faults to network failures. The testing environment should mimic the eventual system environment as closely as possible: the same hardware, software, services, networks, configurations, loads on the system, and users.

It is also important to perform tests on a regular basis once the system is up and running. Systems and environments are constantly changing. The only way to ensure that the system can react to failures appropriately at any given point in time is to test the system throughout its life cycle.

Account for future growth.

It is important that any HA solution account for scalability. All data systems will expand with time. It is much less expensive and easier to manage this growth if it is planned for early in development. For example, it is much easier and more cost-effective to add disks to large servers with many empty slots than it would be to purchase additional servers.

Document policies and procedures.
While you are initially planning your HA system and while your system is being implemented, it is important to document every policy or procedure that you develop. This documentation can serve as an official archive of system information, a source for any troubleshooting actions that may be required in the future, and it can ensure that other individuals can access vital system information in the event that you are unavailable.

This documentation can be in a variety of formats:

HTML
This is probably the most common choice. This format is extremely portable and can be read in any browser. The major consideration with this format is to ensure that you make relative references to the servers rather than hard links to particular URLs. HTML may take up a little more room than other formats.

PDF
Adobe PDF documents are very compressed and platform-independent. The only major drawback is the limited ability to edit the documents in their native format.

Word processor
This format is the easiest to manage; however, access to an appropriate reader may be an issue. This is not as portable as other formats.

Paper documentation (hard copies)
Soft copies of your documentation are much easier to update than hard copies. However, you may want to print a limited number of copies of your documentation once in a while to refer to quickly, in case of a complete or extended system outage.

Select the appropriate software and hardware.
You should select the appropriate software to maximize data availability for your organization. There are many considerations when selecting this software. Your data management software should feature capabilities for clustering, load-balancing, application-level recovery, intelligent system and application monitoring, and centralized management. You should also select software that will be easy to troubleshoot through mature customer/technical support and consulting organizations. You should always arrange for on-site consulting to help you implement your HA solution. It is also a good idea to take advantage of other resources such as the product documentation, user groups and news groups, the software company's Website, and classroom training on the products.

There is a direct correlation between the reliability of your hardware and your overall system reliability. It is important that you obtain appropriate reliability data from hardware vendors, such as mean time between failure figures that are proven and realistic. There are several other hardware considerations in addition to reliability, such as ease of repair, ease of access, cost, compatibility, and storage capacity. It is also a good idea to purchase spare hardware for components that may be more prone to fail than others.

Do not overcomplicate the system.

This is a very important guideline to consider when designing an HA solution, and it is half of the availability equation. There are many points in any system at which failures can occur. You should always try to keep the design simple. For example, you should eliminate any extraneous system components, maintain servers that are running only a single application or service, and choose a naming convention throughout your system that is easy to remember and organize.

Reduce planned downtime.
Downtime is best defined as the period of time in which a user is unable to perform tasks in an efficient and timely manner due to poor system performance or system failure. In data centers worldwide, a lot of attention and investment has been made to ensure redundancy and high availability of hardware system components, the vessels which process and hold corporate data.

In a study published by the IEEE (Institute of Electrical and Electronics Engineers), hardware failures are the cause of only 10% of total system downtime. As much as 30% of all downtime is prescheduled, and most of this time is required due to the inability of system tools to permit online administration of systems. Another 40% of downtime is due to software errors. Some of these errors are as simple as a database running out of space on disk and stopping its operations as a result. Any comprehensive HA solution has to be able to deliver application and information availability in the event of any cause of downtime. Examples of planned downtime include those times when the system is shut down to add additional hardware, upgrade the operating system, rearrange or repartition disk space, or clean up log files and memory. If you implement an effective HA strategy, you can significantly reduce the amount of planned downtime. For example, you can provide for backups, maintenance, and upgrades while the system is up and running. You can also reduce the time required to perform the tasks that can only be done while the system is down.


Balance the cost of the availability solution with the rewards.
The cost of purchasing, implementing, and managing the HA solution should be consistent with the operational loss you wish to prevent. Achieve an appropriate trade-off between the cost and the rewards of an HA system. The relationship between cost and return on investment in HA systems can be viewed as a curve that illustrates the law of diminishing returns: as you move from a less expensive, simple solution to more advanced solutions, the costs increase dramatically.

The Layered Approach to High Availability


This topic describes the layered availability approach and introduces the concepts and terminology involved in the availability issues posed by each layer. The layers include the application layer and the storage management layer, which enables you to manage logical, or virtual, storage volumes. The storage network infrastructure layer features such components as hubs, switches, and Fibre Channel connectivity. Finally, the disk and data storage layer contains the tape libraries, intelligent disk arrays, and other storage devices. The concepts and terminology introduced in this topic are covered in greater detail in other topics throughout the course.


To simplify the management of a complicated system, you can break the system down into four basic layers:
Application layer
Storage management layer
Storage network infrastructure layer
Data storage layer
In order to reduce the time of recovery, you need to determine the level of service that each layer must deliver to the others. You can also simplify management by logically organizing the resources in each layer.

Application Layer


The application layer is the direct interface between the system and the client machines, such as a database, an email application, or a custom application. HA solutions feature functionality that provides continuous service or access to applications in a transparent manner in the event of a fault or failure. Throughout this course, it is important to view your system from an application-based viewpoint. In other words, no matter what components, structure, policies, and procedures are implemented in your HA solution, the most important consideration at any time is to minimize the impact of a fault or failure on the user's ability to access data through the application or service. HA issues involved in this layer include clustering, application-level failovers, simplified management of large server farms, common availability management, and replication of data to multiple sites.

Storage Management Layer

The storage management layer refers to the method by which the server manages the storage devices or disks. This management is performed by the building blocks of an HA solution: volume management and a journaling file system.

Volume Management
Often, the first step taken towards increasing a system's availability is to enable software-based redundancy of disks, or software RAID. Software RAID defines a logical volume. A volume is a logical object on which file systems are written or to which databases write their data. Software RAID is often packaged with volume management software.

Journaling Filesystem
A file system is a collection of directories organized into a structure that enables you to locate and store files. All information processed is eventually stored in a file system. When a system or server fails, the file system can be left in an inconsistent state or lost, and a tape backup may be required to restore it. A journaling file system journals the changes to the file system structure (and occasionally the data). If the system crashes and is rebooted, the journal is replayed to ensure the correctness of the file system structure. Data recovery is dependent upon the specific application. For example, recovery of an Oracle database would require the use of Oracle log files.

Storage Network Infrastructure Layer


This layer refers to storage network connectivity. This layer is becoming more and more of a concern to the modern enterprise. Originally, most environments simply connected a server to a storage device through a SCSI connection. Now, organizations are using other more advanced network connection technology such as Fibre Channel technology and storage area networking (SAN). Rather than viewing this layer simply as a server connecting to a piece of storage, you should consider multiple paths between servers and storage. You need to investigate the possibility of implementing some sort of network redundancy to ensure that if you lose an access route between the system and storage, there is another access path available.

Data Storage Layer

In addition to application availability, managing storage effectively, and ensuring that you maintain network connectivity, there are data availability concerns in the storage pool itself. In this layer, you can enable online, dynamic reconfiguration of the storage pool. You need to account for growth and scalability. No matter how many disk arrays you have, you will inevitably require more in the future. You should also consider the capacity management aspects of your storage devices and determine how to optimize storage space across common disk hardware.

General RAID Levels


RAID is an array of disks in which redundant data is stored in different places on multiple disks. The redundant information enables regeneration of user data in the event that one of the disks in the array, or the access path to it, fails. By placing data on multiple disks, I/O operations can overlap in a balanced way, improving performance. RAID also increases fault tolerance and the effective MTBF of the array. RAID employs disk striping, or partitioning of each drive's storage space into units. The stripes of all the disks are interleaved and addressed in order. In this topic, you learn about the various RAID levels.

General RAID Levels


There are five basic RAID levels that are commonly recognized. In addition, there are several other RAID levels that are less common variations on these five basic levels. There are also several common RAID combinations that can be configured. The most appropriate RAID configuration for a specific file system or database tablespace must be determined based on data access patterns and an appropriate trade-off between cost and performance.


RAID-0 (striping)

This RAID level features disk striping, but no redundancy of data. In this configuration, a collection of data is divided into small chunks that are written to separate disks in the array. This RAID level supplies performance acceleration at no increased storage cost, because individual disks can perform concurrent write operations. RAID-0 offers no increase in data availability. In fact, if implemented by itself, RAID-0 decreases overall data availability, because the stripe is usable only if every disk in it is functioning. Any failure of an individual disk in the stripe will result in the inability to perform any read or write operations in the entire stripe. RAID-0 would be an option for applications requiring high bandwidth such as video production and editing, image editing, or pre-press applications.
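As a rough illustration of how striping spreads a logical address space across member disks, the following Python sketch maps a logical block number to a (disk, offset) pair, assuming one block per stripe unit; the function name and layout are hypothetical and not any vendor's implementation.

def raid0_location(logical_block, num_disks):
    """Return (disk_index, block_offset_on_disk) for a striped array."""
    disk_index = logical_block % num_disks      # stripe units rotate round-robin
    block_offset = logical_block // num_disks   # depth within that disk
    return disk_index, block_offset

# With 4 disks, logical blocks 0..3 land on disks 0..3, and block 4 wraps
# back to disk 0 at offset 1.
print([raid0_location(b, 4) for b in range(6)])
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]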

RAID-1 (mirroring)

RAID-1 requires at least double the disk capacity of RAID-0. In RAID-1, the data is replicated on a separate disk, or on multiple disks. No disk striping occurs. Every byte on one disk is copied, block for block, to a separate disk that acts as a peer and is completely in sync with the original disk. In the event of an individual disk failure, the other disk maintains operation without any service interruption. RAID-1 provides the highest performance for redundant storage, because it does not require read-modify-write cycles
to update data, and because multiple copies of data can be used to accelerate read-intensive applications. However, resyncing or creating a new RAID-1 copy requires time and a significant amount of I/O. Therefore, one disadvantage of RAID-1 is that write performance may suffer. RAID-1 requires 100% additional disk capacity for each mirror copy; therefore, another major disadvantage is cost. This RAID level is recommended for applications requiring increased availability such as accounting, payroll, or other financial applications.

RAID-2 (Hamming encoding)

RAID-2 features disk striping. This RAID level detects errors that occur and determines which part is in error by using error checking and correcting (ECC) information. RAID-2 detects 2-bit errors and corrects 1-bit errors on the fly. Each data disk has its Hamming Code ECC information recorded on ECC disks. On read operations, the ECC code verifies data or corrects single disk errors. You need a high ratio of ECC disks to data disks with smaller word sizes. It has no clear advantages over RAID-3, and is not used in practice.
RAID-3 (byte striped across a group of disks)


RAID-3 uses disk striping in a parallel fashion, with each virtual disk block distributed across all the disks in the array except for one that stores the parity check. The parity disk permits the regeneration and rebuilding of data in the event of a disk failure. In RAID-3, the stripe depth of an N+1 array is equal to 1/N of the virtual block size, and each disk drive must be on its own separate I/O channel. For example, if the virtual block size for a 4+1 set is 512 bytes, then the stripe depth is 128 bytes (512/4). The RAID volume can only process one disk I/O at a time. All I/O operations access all disks, because the bytes are distributed across multiple disks (parallel transfer). For this reason, RAID-3 is best for applications that are single-stream and bandwidth-oriented. It is not a good choice for a database server, because databases tend to read and write smaller blocks. RAID-3 is likely to perform significantly better in a controller-based implementation.

RAID-4 (dedicated parity disk)


RAID-4 uses large stripes and dedicates one drive to storing parity information. RAID-4 is very similar to RAID-3. The major difference is that where in a RAID-3 array the stripe and logical block size are equal, RAID-4 arrays implement variable stripe sizes. In RAID-4, the stripe depth is an integer multiple of the virtual block size. This means that multiple virtual blocks can be placed within a single stripe in the RAID-4 array. You can read records from any single drive, which enables you to take advantage of overlapped I/O for read operations. Since all write operations have to update the parity drive, however, no overlapping is possible for writes. RAID-4 offers no advantage over RAID-5. As with RAID-3, a RAID-4 implementation is ideal for systems performing large file transfers. It does not perform well when used in applications that require small file writes at high I/O rates.

RAID-5 (block striped across a group of disks)

RAID-5 removes a possible bottleneck on the parity drive by rotating parity across all drives in the set. RAID-5 requires at least three and usually five disks for the array. All read and write operations can be overlapped. RAID-5 stores parity information but not
redundant data. Recovery from a RAID-5 disk failure requires a complete read of all the disks in the stripe. The recovery process can be time-consuming, and system performance will suffer during recovery. This is the most complex and versatile of the basic RAID architectures. RAID-5 is best suited for file and application servers, database servers in a data warehousing environment, Web servers, and e-mail servers. The performance overhead for writes can be substantial in a RAID-5 configuration, because a write can involve much more than simply writing to a data block. A write can involve reading the old data and parity, computing the new parity, and writing the new data and parity.
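To make the write penalty concrete, the following Python sketch (with made-up block values, assuming simple byte-wise XOR parity) shows the read-modify-write update of a single data block: the old data and old parity are read, the new parity is computed, and both the new data and new parity are written.

def raid5_small_write(old_data, old_parity, new_data):
    """Return the new parity after overwriting one data block."""
    # new_parity = old_parity XOR old_data XOR new_data
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

old_data   = bytes([0x11, 0x22])                 # block being overwritten
other_data = bytes([0x0F, 0xF0])                 # the untouched data block
old_parity = bytes(a ^ b for a, b in zip(old_data, other_data))
new_data   = bytes([0xAA, 0xBB])
new_parity = raid5_small_write(old_data, old_parity, new_data)
# The updated parity still equals the XOR of all current data blocks.
assert new_parity == bytes(a ^ b for a, b in zip(new_data, other_data))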
RAID Level Variations

RAID-6
RAID-6 is similar to RAID-5, but with additional independently computed check data. It includes a second parity scheme that is distributed across different drives and offers very high fault tolerance. Currently, there are very few commercial examples of RAID-6.

RAID-7
RAID-7 includes a real-time embedded operating system as a controller, caching data through a high-speed bus, and other characteristics of a stand-alone computer. This RAID level is not common.

RAID Combinations

RAID-01 (mirrored stripes)
RAID-01 is a mirrored pair made from two RAID-0 stripe sets. It is configured by creating two RAID-0 sets and adding RAID-1 mirroring between them. If you lose a drive on one side of a RAID-01 array, then lose another drive on the other side of that array before the first side is recovered, you will suffer complete data loss. It is also important to note that in the event of a single disk failure, all drives in the surviving mirror are involved in rebuilding the entire damaged stripe set. Performance is severely degraded during recovery unless the RAID subsystem allows adjusting the priority of recovery. However, shifting the priority toward production will lengthen recovery time and increase the risk of the kind of catastrophic data loss mentioned earlier.


Example of RAID01 Failure

In this example, if Disks A and D fail, all the disks are unavailable.

RAID-10 (striped mirrors)

RAID-10 is a stripe set made up from a number of mirrored pairs. Only the loss of both drives in the same mirrored pair can result in any data loss, and the loss of that particular drive is 1/Nth as likely as the loss of some drive on the opposite mirror in RAID-01. Recovery involves only the replacement drive and its mirror, so the rest of the array performs at 100% capacity during recovery. Since only the single drive needs recovery, bandwidth requirements during recovery are lower and recovery takes far less time, reducing the risk of catastrophic data loss. The performance of RAID-10 and RAID-01 is identical, but they have different levels of data integrity.

Example of RAID10 Failure

In this example, first Disk A fails and all the other disks are available. If Disk D then fails, only the data on Disks A and D is offline.

RAID-53


RAID-53 offers an array of stripes in which each stripe is a RAID-3 array of disks. This offers higher performance than RAID-3, but at much higher cost, and requires at least five drives.

Software RAID vs. Hardware RAID


The basic characteristics and configurations of RAID levels are the same in both software and hardware RAID. The main difference is the point at which the disk management operations occur. This topic identifies the advantages and disadvantages of software and hardware RAID.

Hardware RAID (Controller-Based)


In hardware RAID, the management operations required to implement the RAID disk array occur within the disk array itself. The host system does not perform the operations, but an interface program runs on the host system that enables you to monitor the disk management operations. A hardware RAID operation creates a logical unit (LUN) that can be monitored regardless of the operating system of the host system. It is often a safe assumption that the disks are managed properly, no matter what the RAID level, within a hardware RAID configuration. A hardware RAID system is basically a specialized, single-purpose system that features a controller that does nothing but aggregate storage disks, stripe and mirror data across these disks, and calculate parity.

The advantages of using hardware RAID over software RAID:

Increased performance on the host system
Performance is increased because the disk management operations are off-loaded onto the disk array. For example, in a mirrored controller-based configuration, the host would need to pass only one write request through the disk driver and across the I/O bus, where the controller would decompose it into two separate writes.

Enhanced features
Hardware RAID manufacturers often add enhanced functionality to their hardware. Such enhancements include additional internal memory in the disk array, and the abilities to replicate data over a WAN, share specific disks between multiple host systems, and lock out other hosts while a single host is accessing a disk. Enterprise-class hardware RAID systems also often include redundant power supplies and cooling fans.

Efficiency
Hardware RAID systems tend to be very efficient because they feature hardware that is only concerned with performing RAID operations. The RAID controller does not have to concern itself with graphical user interfaces (GUIs) and other aspects of a general-purpose operating system.

The disadvantages of using hardware RAID over software RAID:

Dependence on one RAID hardware vendor
Every RAID manufacturer uses a different management interface, and once you familiarize yourself with one, it will be difficult to switch to a different vendor.

Inability to combine disks from different arrays into a single array
This can create another SPOF in the system.

Hard-to-resize LUNs
In most cases, once a LUN is full, you cannot simply increase the size of the LUN to accommodate new data. You have to destroy the original LUN, create another, larger LUN, and then restore the original data to the new LUN.

Hardware limits on the number and size of LUNs
Often, RAID vendors enforce hardware limits that might restrict your ability to configure your system for optimal performance.

Cost
Hardware RAID is more expensive than software RAID.

No inter-box protection
A specific RAID controller has no visibility to other RAID boxes or storage devices.

Software RAID (Host-Based)


Rather than utilizing a dedicated hardware controller to perform the various management operations required to implement a RAID array, in software RAID the operations are performed by the host system processor using special software. Disk array management is a somewhat low-level activity that is performed underneath the other applications that run on the host system. Therefore, software RAID is usually implemented at the operating system level. Software RAID is supported on Windows NT and 2000 platforms, as well as a majority of the various UNIX platforms. The output of software RAID is a logical volume. A volume is a logical object on which file systems are written or to which databases write their data.

Advantages of using software RAID over hardware RAID:

Cost
If you are already running an operating system that supports software RAID, you have no additional costs for controller hardware. However, you may be required to add more system memory.

Simplicity
You are not required to install, configure, or manage a hardware RAID controller.

Flexibility in hardware
By moving the management operations off the hardware, you are allowed more flexibility in selecting appropriate hardware. In fact, you can use a wide range of online storage, such as just a bunch of disks (JBOD), enterprise RAID, and smaller RAID systems.

Flexibility in disk configuration
Software RAID implementations can build RAID objects from partitions of disks rather than being restricted to whole disks, so they can use a disk pool to meet a diverse set of performance and availability requirements. For instance, one might create a small high-performance striped file system by using only a few cylinders on a very large number of drives, and use the remaining space on those same drives for concatenated, mirrored, or RAID-5 volumes with different I/O characteristics.

Increased redundancy
A duplexed RAID-1 array can sometimes be implemented in software RAID, but not in hardware RAID, depending on the controller. Building redundant layouts using disks with separate connections to the host can enhance availability, eliminating the single points of failure introduced by non-redundant host connections.

The disadvantages of using software RAID over hardware RAID:

Performance
The most significant drawback of software RAID is that it provides lower overall system performance than hardware RAID. Cycles are taken from the CPU of the host system to manage the RAID array. In reality, the impact of these operations is not that excessive for simple RAID levels like RAID-1. However, the impact on performance can be substantial, particularly with any RAID levels that involve striping with parity, such as RAID-5.

Boot volume limitations
The operating system cannot boot from the RAID array, due to the fact that the operating system has to be running to enable the RAID array. A separate partition needs to be created for the operating system. This segments the system capacity, lowers the performance, and increases the time required to boot the system.

RAID level limitations
Software RAID is usually limited to RAID-0, RAID-1, RAID-5, RAID-01, and RAID-10.

Advanced feature support
Software RAID normally does not include support for the advanced features that may be available to hardware RAID arrays.

Operating system (OS) compatibility issues
Generally, if you enable software RAID by using a particular operating system, only that particular operating system can access that array. This creates problems with multiple-OS environments.

Software compatibility issues
Some software utilities, such as partitioning and formatting utilities, may have conflicts with software RAID arrays.

Reliability
Implementing software RAID increases the chance of potential bugs that might compromise the integrity and reliability of the array.

Combining Software and Hardware RAID


You should not consider there to be a distinct choice between software and hardware RAID solutions. Host-based volume management offers all the advantages of software RAID to complement hardware RAID systems. By combining hardware and software RAID, you can realize the best features of both solutions: the off-loaded processing, reduced I/O transfer requirements, and redundant components of most hardware RAID subsystems, coupled with the configuration flexibility added by the inclusion of software-based RAID. Combining hardware and software RAID solutions offers several key benefits:

Increased availability
Many hardware RAID solutions retain single points of failure (SPOFs), allowing data to become unavailable if a non-disk component of the array fails. When software RAID is used to build configurations that incorporate hardware RAID units in separate arrays, many of these vulnerabilities can be eliminated.

Increased performance
A single hardware RAID controller may present a bottleneck to data access because of limited array bus and host-to-array bandwidth, as well as CPU cycles needed for parity calculations. Efficient controller-based algorithms can be combined with multiple host connections and supplementary software RAID processing to increase bandwidth and throughput.

Improved manageability
The limited set of configuration options and the static configuration utilities for hardware RAID subsystems may make initial setup seem simpler than setting up a software RAID configuration. However, after running the system, the configuration may need to be modified to reflect the actual I/O pattern of the applications. With a controller-based setup, this is usually achieved by backing up the data, reconfiguring the array, and reloading the data, which requires interruption of data access. The on-line reconfiguration capabilities of most software RAID solutions can be used to enhance the performance monitoring, tuning, and reconfiguration of hardware RAID, simplifying administration while increasing uptime and performance.


Defining a Volume
The basis for any volume management solution is a volume. This topic defines a volume and identifies the advantages of using volumes to manage storage.

What Is a Volume?

Volumes enable an application to view a number of disks as a single logical unit, no matter where the disks are physically located. A volume has the performance, reliability, and other attributes of its individual components. Each volume records and retrieves data from one or more physical disks. Volumes are accessed by file systems, databases, or other applications in the same way that physical disks are accessed. Volumes are composed of other virtual objects that can be used to change the volume configuration; volumes and these components are collectively called virtual objects. Volumes can be used to perform administrative tasks on disks without interrupting applications and users.

Advantages of Volumes
There are several advantages to using volumes:

Ability to combine RAID levels
Volumes enable you to combine any number of different RAID levels. For example, if the important consideration is cost, you might implement a RAID-5 solution. Alternatively, if you require very high performance, you might use striped mirrors.

Scalability
Virtual volumes also offer the flexibility to grow the storage capacity without disrupting the system. Instead of taking the server off-line or physically moving data from point A to point B, you can simply add more storage to the volume.

Increased performance and failure tolerance
You can combine enterprise RAID and JBOD and your system will feature the advantages of both. You can take advantage of a hardware controller and the flexibility of host-based volume management.

VERITAS Volume Management: Virtual Objects


There are several basic methods to manage online storage to increase data availability. Before you can understand the specific principles involved in each of these methods, it is important to define some of the basic virtual objects and their relationships to each other. This topic provides an overview of VERITAS Volume Manager (VxVM) and describes the relationships between the various VxVM objects.

Overview of VxVM
VxVM provides easy-to-use online disk storage management for computing environments. Traditional disk storage management often requires that systems be taken offline, at a major inconvenience to users. VxVM provides the tools to improve performance and ensure data availability and integrity. VxVM also enables you to dynamically configure disk storage while the system is active. The connection between physical objects and VxVM objects is made when you place a physical disk under VxVM control. VxVM creates virtual objects and makes logical connections between the objects. The virtual objects are then used by VxVM to perform storage management tasks. VxVM objects include:

VxVM disks

When you place a physical disk under VxVM control, a VxVM disk is assigned to the physical disk. Each VxVM disk corresponds to at least one physical disk. A VxVM disk typically includes a public region where user data is stored, and a private region where VxVM internal configuration information is stored.
Disk groups

A disk group is a collection of VxVM disks. You group disks into disk groups for management purposes, such as to hold the data for a specific application or set of applications. For example, data for accounting applications can be organized in a disk group called "acctdg". A disk group configuration is a set of records with detailed information about related VxVM objects, their attributes, and their connections. Disk groups are configured by the system administrator and represent management and configuration boundaries. You can create additional disk groups as necessary. Disk groups enable high availability, because a disk group and its components can be moved as a unit from one host system to another. Disk drives can be shared by two or more hosts, but accessed by only one host at a time. If one host crashes, the other host can take over the failed host's disk drives, as well as its disk groups.
Subdisks

A subdisk is a set of contiguous disk blocks. VxVM allocates disk space by dividing a VxVM disk into one or more subdisks. Each subdisk represents a specific portion of a VxVM disk, which is mapped to a specific region of a physical disk. A VxVM disk can contain multiple subdisks, but subdisks cannot overlap or share the same portions of a VxVM disk.
Plexes (mirrors)


VxVM uses subdisks to build virtual objects called plexes (or mirrors). A plex consists of one or more subdisks located on one or more physical disks. To organize data on the subdisks to form a plex, use the following methods:
Concatenation
Striping (RAID-0)
Mirroring (RAID-1)
Striping with parity (RAID-5)

Volumes


A volume consists of one or more plexes, each holding a copy of the data in the volume. Due to its virtual nature, a volume is not restricted to a particular disk or a specific area of a disk. The configuration of a volume can be changed by using the VxVM user interfaces. Configuration changes can be made without causing disruption to applications or file systems that are using the volume. For example, a volume can be mirrored on separate disks or moved to use different disk storage. A volume can consist of up to 32 plexes, each of which contains one or more subdisks. A volume must have at least one associated plex with a complete copy of the data in the volume and at least one associated subdisk.

VxVM Object Relationships
VxVM virtual objects are combined to build volumes. The virtual objects contained in volumes are:
VxVM disks
Disk groups
Subdisks
Plexes
Volume Manager objects have the following connections:
VxVM disks are grouped into disk groups.
One or more subdisks (each representing a specific region of a disk) are combined to form plexes.
A volume is composed of one or more plexes.
In this example, a disk group has two VxVM disks. One disk has a volume with one plex and two subdisks. The other disk has a volume with one plex and a single subdisk.
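To illustrate the containment relationships just listed, here is a minimal Python sketch using hypothetical classes; it models only the hierarchy described above and is not a VxVM interface.

from dataclasses import dataclass, field

@dataclass
class Subdisk:                 # contiguous region of a VxVM disk
    disk: str
    offset: int
    length: int

@dataclass
class Plex:                    # one copy (mirror) of the volume's data
    subdisks: list = field(default_factory=list)

@dataclass
class Volume:                  # what file systems and databases actually use
    name: str
    plexes: list = field(default_factory=list)

@dataclass
class DiskGroup:               # management boundary that can move between hosts
    name: str
    disks: list = field(default_factory=list)
    volumes: list = field(default_factory=list)

# The example above: one disk group, two VxVM disks, two single-plex volumes.
dg = DiskGroup("acctdg", disks=["acctdg01", "acctdg02"])
vol1 = Volume("vol01", plexes=[Plex([Subdisk("acctdg01", 0, 1024),
                                     Subdisk("acctdg01", 1024, 1024)])])
vol2 = Volume("vol02", plexes=[Plex([Subdisk("acctdg02", 0, 2048)])])
dg.volumes += [vol1, vol2]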


VERITAS Volume Management: Volume Layouts


A volume's layout refers to the organization of plexes in a volume. Volume layout is the way plexes are configured to remap the volume address space through which I/O is redirected at run-time. Volume layouts are based on the concept of disk spanning, which is the ability to logically combine physical disks in order to store data across multiple disks. This topic identifies the volume layouts that are available in VERITAS Volume Manager and relates these layouts to the appropriate, general RAID level.

Common Volume Layouts Available in VxVM


A variety of volume layouts are available, and each layout has different advantages and disadvantages. The layouts that you choose depend on the levels of performance and reliability required by your system. With VxVM, you can change the volume layout without disrupting applications or file systems that are using the volume. A volume layout can be configured, reconfigured, resized, and tuned while the volume remains accessible. Common volume layouts include:
Concatenated (No RAID)
Mirrored (RAID-1)
Striped (RAID-0)
RAID-5
Layered volumes (RAID-01 and RAID-10)

Concatenated Layout
A concatenated volume layout maps data in a linear manner onto one or more subdisks in a plex. Subdisks do not have to be physically contiguous and can belong to more than one VxVM disk. Storage is allocated completely from one subdisk before using the next subdisk in the span. Data is accessed in the remaining subdisks sequentially until the end of the last subdisk. For example, if you have 14GB of data, a concatenated volume can logically map the volume address space across subdisks on different disks. The addresses 0GB to 8GB of volume address space map to the first 8-gigabyte subdisk, and addresses 8GB to 14GB map to the second 6-gigabyte subdisk. An address offset of 12GB therefore maps to an address offset of 4GB in the second subdisk.

Concatenation removes the restriction on size of storage devices imposed by physical disk size. It also enables better utilization of free space on disks by providing for the ordering of available discrete disk space on multiple disks into a single addressable volume. In addition, large file systems can be created to reduce overall system administration complexity. However, concatenation does not protect against disk failure. A single disk failure may result in the failure of the entire volume.
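A small Python sketch of the address mapping described above, assuming subdisk sizes expressed in gigabytes; the function name and interface are purely illustrative.

def concat_map(volume_offset_gb, subdisk_sizes_gb):
    """Map a volume offset to (subdisk_index, offset_within_subdisk)."""
    start = 0
    for i, size in enumerate(subdisk_sizes_gb):
        if volume_offset_gb < start + size:
            return i, volume_offset_gb - start
        start += size
    raise ValueError("offset beyond end of volume")

# The 14GB example above: an 8GB subdisk followed by a 6GB subdisk.
print(concat_map(12, [8, 6]))   # -> (1, 4): 4GB into the second subdisk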

Striped Layout
A striped volume layout maps data so that the data is interleaved, or allocated in stripes, among two or more subdisks on two or more physical disks. Data is allocated alternately and evenly to the subdisks of a striped plex.


The subdisks are grouped into "columns". Each column contains one or more subdisks and can be derived from one or more physical disks. To obtain the performance benefits of striping, each column within a striped volume should not be allocated space from any disk used by any other column within that volume. All columns must be the same size. The size of a column should equal the size of the volume divided by the number of columns. Data is allocated in equal-sized units, called stripe units, that are interleaved between the columns. Each stripe unit is a set of contiguous blocks on a disk. The stripe unit size can be in units of sectors, kilobytes, megabytes, or gigabytes. The default stripe unit size is 128 sectors (64K), which provides adequate performance for most general purpose volumes. Performance of an individual volume may be improved by matching the stripe unit size to the I/O characteristics of the application using the volume.
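The following Python sketch illustrates how a striped plex interleaves stripe units across columns, assuming the default 64K stripe unit mentioned above; the names and units are illustrative only.

def stripe_map(volume_offset, num_columns, stripe_unit=64 * 1024):
    """Map a byte offset in the volume to (column, offset_within_column)."""
    unit_index = volume_offset // stripe_unit       # which stripe unit overall
    within_unit = volume_offset % stripe_unit
    column = unit_index % num_columns               # units rotate over columns
    unit_in_column = unit_index // num_columns      # how deep in that column
    return column, unit_in_column * stripe_unit + within_unit

# With 3 columns, consecutive 64K units land on columns 0, 1, 2, 0, 1, 2, ...
print([stripe_map(i * 64 * 1024, 3)[0] for i in range(6)])  # [0, 1, 2, 0, 1, 2]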

Mirrored Layout
By adding a mirror to a concatenated or striped volume, you create a mirrored layout. A mirrored volume layout consists of more than one plex, each a duplicate of the information contained in the volume. Each plex in a mirrored layout contains an identical copy of the volume data. In the event of a physical disk failure, the plex on the failed disk becomes unavailable, but the system can continue to operate using the unaffected mirrors.


Although a volume can have a single plex, at least two plexes are required to provide redundancy of data. Each of these plexes must contain disk space from different disks to achieve redundancy. Volume Manager uses true mirrors, which means that all copies of the data are the same at all times. When a write occurs to a volume, all plexes must receive the write before the write is considered complete. Each plex in a mirrored configuration can have a different layout. For example, one plex can be concatenated and the other plex can be striped. You should distribute mirrors across separate hardware to prevent the loss of more than one copy of the data in the case of a single point of failure.

RAID-5 Layout
A RAID-5 volume layout has the same attributes as a striped plex, but includes one additional column of data that is used for parity. Parity provides redundancy.


Parity is a calculated value used to reconstruct data after a failure. While data is being written to a RAID-5 volume, parity is calculated by performing an exclusive OR (XOR) procedure on the data. The resulting parity is then written to the volume. If a portion of a RAID-5 volume fails, the data that was on that portion of the failed volume can be recreated from the remaining data and parity information. RAID-5 volumes keep a copy of the data and the calculated parity in a plex that is striped across multiple disks. Parity is spread equally across the disks. Given a 5-column RAID-5 volume where each column is 1GB in size, the RAID-5 volume size is 4GB: one column of space is devoted to parity, and the remaining four 1GB columns are used for data. The default stripe unit size for a RAID-5 volume is 32 sectors (16K). Each column must be the same length but may be made from multiple subdisks of variable length. Subdisks used in different columns must not be located on the same physical disk. RAID-5 requires a minimum of three disks for data and parity. When implemented as recommended, an additional disk is required for the RAID-5 log. RAID-5 volumes cannot be mirrored.
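The XOR relationship described above can be sketched in a few lines of Python; the block contents are invented for the example, and this is not how VxVM itself is implemented.

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9]), bytes([10, 11, 12])]
parity = xor_blocks(data)                      # written alongside the data

# If one column is lost, XOR of the survivors and the parity recreates it.
lost = data[2]
recovered = xor_blocks([data[0], data[1], data[3], parity])
assert recovered == lost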

Layered Volume Layout


A layered volume is a virtual Volume Manager object that nests volumes within volumes to create more complex volume structures that mirror data at a more granular level. With this new method of mirroring, data is mirrored at the column or subdisk level. Loss of a disk results in the loss of a copy of a column or subdisk within a plex. Further disk losses may occur without affecting the complete volume. Only the data contents of the column or subdisk affected by the loss of the disk needs to be recovered.


Stripe-Mirror (RAID-10)

This example illustrates a layered volume layout called a stripe-mirror layout. In this layout, VxVM creates underlying volumes that mirror each subdisk. Each of these underlying volumes is used as a subvolume to create a top-level volume that contains a striped plex of the data. If two drives fail, the volume survives 4 out of 6 (2/3) of the possible failure combinations. In other words, the use of layered volumes cuts the risk of losing the volume to a double disk failure in half compared with a mirror-stripe layout. If a disk fails in a stripe-mirror layout, only the failing subdisk must be detached, and only that portion of the volume loses redundancy. When the disk is replaced, only a portion of the volume needs to be recovered, which takes less time.
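The "4 out of 6" figure can be checked by enumerating the two-disk failure combinations for a four-disk stripe-mirror (two columns, each a mirrored pair), as in this illustrative Python sketch.

from itertools import combinations

mirrored_pairs = [("A", "B"), ("C", "D")]
disks = [d for pair in mirrored_pairs for d in pair]

def survives(failed):
    # The volume survives unless both members of some mirrored pair fail.
    return not any(set(pair) <= set(failed) for pair in mirrored_pairs)

two_disk_failures = list(combinations(disks, 2))
surviving = [f for f in two_disk_failures if survives(f)]
print(len(surviving), "of", len(two_disk_failures))   # 4 of 6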

Mirror-Stripe (RAID-01)

This layout mirrors data across striped plexes. In the example, the plexes are mirrors of each other and each plex is striped across the same number of subdisks. In general, however, the striped plexes can be made up of different numbers of subdisks, each striped plex can have a different number of columns and a different stripe unit size, and one plex could also be concatenated.

When you create a volume that is less than one gigabyte in size, a nonlayered mirrored volume is created by default. Nonlayered, mirrored layouts are recommended if you are using less than 1GB of space, or using a single drive for each copy of the data.

How Do Layered Volumes Work?

In a regular mirrored volume, subdisks originate from the disk media. In a layered volume, the subdisks originate from underlying volumes. These subdisks are also called subvolumes. Subvolumes and subdisks are equivalent objects in terms of constructing a volume. In a layered volume, only the top-level volume is accessible as a device for use by applications. Layered volumes tolerate disk failure better than non-layered volumes and provide improved data redundancy. If a disk in a layered volume fails, a smaller portion of the redundancy is lost, and recovery and resynchronization time is usually quicker than it would be for a nonlayered volume that spans multiple drives.

Recovery of a single subdisk failure requires resynchronization of:
Stripe-mirror volume: only the lower-level plex that contains the subdisk, not the top-level plex.
Mirror-stripe volume: the entire plex (the full volume contents) that contains the subdisk.
For example, at 10 MB per second, the time it will take to resynchronize the mirror is:
Stripe-mirror volume: 75 seconds (both subvolumes can be synchronized at the same time).
Mirror-stripe volume: 150 seconds.

Layered volumes consist of more VxVM objects than nonlayered volumes. Therefore, layered volumes may fill up the disk group configuration database sooner than nonlayered volumes. When the configuration database is full, you cannot create more volumes in the disk group.

Volume Management: Hot Relocation


Your system can be protected from the impact of disk failure through a process called hot relocation. Hot relocation is a feature of VxVM, and automatically detects disk failures and restores redundancy to failed VxVM objects by moving subdisks from failed disks to other disks. When hot relocation is enabled, the system administrator is notified by email about disk failures. This topic describes the hot relocation process.

Disk Failures
Disk failures can be classified into two general categories:

Permanent disk failure

When a disk is corrupted and no longer usable, the disk must be logically and physically removed, and then replaced with a new disk. With permanent disk failure, data on the disk is lost.


Temporary disk failure

When communication to a disk is interrupted, but the disk is not damaged, the disk can be logically removed, then reattached as the replacement disk. With temporary (or intermittent) disk failure, data still exists on the disk.

What Is Hot Relocation?

Hot relocation is a feature of VxVM that enables a system to automatically react to I/O failures on redundant VxVM objects and restore redundancy and access to those objects. VxVM detects I/O failures on objects and relocates the affected subdisks. The subdisks are relocated to disks designated as spare disks or to free space within the disk group. VxVM then reconstructs the objects that existed before the failure and makes them redundant and accessible again.

Partial Disk Failure

A partial disk failure is a failure that affects only some subdisks on a disk. When a partial disk failure occurs, redundant data on the failed portion of the disk is relocated. Existing volumes on the unaffected portions of the disk remain accessible. With partial disk failure, the disk is not removed from VxVM control. Before removing a failing disk for replacement, you must evacuate any remaining volumes on the disk.

How Does Hot Relocation Work?


The hot relocation feature is enabled by default. No system administrator action is needed to start hot relocation when a failure occurs.


The vxrelocd service, or daemon, starts during system startup and monitors VxVM for failures involving disks, plexes, or RAID-5 subdisks. When a failure occurs, vxrelocd triggers a hot relocation attempt and notifies the system administrator, through email, of failures and any relocation and recovery actions. A successful hot relocation process involves:

1. Failure detection
Detecting the failure of a disk, plex, or RAID-5 subdisk.

2. Notification
Notifying the system administrator and other designated users and identifying the affected Volume Manager objects.

3. Relocation
Determining which subdisks can be relocated, finding space for those subdisks in the disk group, and relocating the subdisks. The system administrator is notified of the success or failure of these actions. Hot relocation does not guarantee the same layout of data or the same performance after relocation.

4. Recovery
Initiating recovery procedures, if necessary, to restore the volumes and data. Again, the system administrator is notified of the recovery attempt.
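Step 3 can be pictured with a simplified, hypothetical Python sketch of the target-selection logic: designated spare disks are considered first, then free space elsewhere in the disk group. It does not reflect vxrelocd's actual internals.

def choose_relocation_target(subdisk_size, spare_disks, free_space_by_disk):
    """Pick a disk with enough room for the failed subdisk, spares first."""
    for disk in spare_disks:                       # spares are preferred
        if free_space_by_disk.get(disk, 0) >= subdisk_size:
            return disk
    for disk, free in free_space_by_disk.items():  # then any disk with room
        if disk not in spare_disks and free >= subdisk_size:
            return disk
    return None                                    # relocation not possible

target = choose_relocation_target(
    subdisk_size=512,
    spare_disks=["disk05"],
    free_space_by_disk={"disk03": 1024, "disk05": 2048})
print(target)   # disk05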

Fault Resilient Clustering Concepts


A fault resilient cluster features at least one machine that is configured to assume responsibility for a failed server. When one machine in the pair fails, its services are moved to the second server. This is called failover. Failover is defined as the migration of services from one server to another. In a fault resilient cluster, a significant outage of your primary server will have little impact on your users. Software can be added to hardware clustering solutions to provide 99.99% data availability. This accounts for only 53 minutes of downtime per year. In most instances, only seconds or minutes are lost. This topic describes the general characteristics of fault resilient HA clusters.
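The 53-minute figure follows directly from the arithmetic, as this quick Python check shows.

# 99.99% availability over one year leaves roughly 53 minutes of downtime.
availability = 0.9999
minutes_per_year = 365 * 24 * 60
downtime_minutes = (1 - availability) * minutes_per_year
print(round(downtime_minutes, 1))   # about 52.6 minutes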

Fault Resilient Failover


It is important to view a cluster as a collection of servers that are configured to fail over to one another in the event of a fault or failure, from an application and service viewpoint. The servers are seen as the service or application providers, or as a delivery system. It does not matter which server is providing the service to the user, as long as the service is running. In the event of a failure, there may be a temporary loss of access by the user, but the system should be able to recover quickly. With few exceptions, nearly every environment can benefit from fault resilient clusters. If system failures cost your company any money at all, in either hard costs or customer perception, a fault resilient HA solution is most likely the appropriate solution.

Characteristics of a Fault Resilient Pair or Cluster

A fault resilient HA system features:

Redundant components in separate servers
You require at least two servers that are similarly configured.

Multiple independent copies of the operating system
Each server in the pair or cluster should be running the same version of the operating system. They should each feature unshared, independent disks that contain not only the operating system, but also any software required for the failover process.

No single point of failure (SPOF)
A SPOF in any server in a fault resilient pair or cluster can cause the failure of the whole pair or cluster and should be eliminated.

Commercially available failover management software (FMS)
A well-tested and robust FMS, such as VERITAS Cluster Server, supports all common networks, databases, and applications, and features many advantages over other options for monitoring and managing failovers in a fault resilient pair or cluster.

Support for planned maintenance
Fault resilient solutions support planned maintenance of OS software, applications, or hardware. When a system is brought offline for maintenance, other systems can immediately take over services to ensure that the failover is completely transparent to users.

Minimal effects of failover on users
The effect of a fault or failure should be almost completely transparent to your users. The most intrusive effect that a failover can have on a user in a fault resilient system is a simple reboot of the client machine. In most cases, even this much of an intrusion is not acceptable. After a failover, the user should not have to perform any actions to return to work once the services have been restored by another server in the cluster.

Very quick failover times
Ideally, the failover time in a fault resilient system will be less than 2 minutes. You should always have the backup server running and have as many system processes active as possible to enable minimal failover time. The takeover server should never require a reboot in the event of a failover. If this happens, the failover time can increase to almost an hour in some cases. It is a good idea to create a failover time expectation for your users.

Minimal hands-on interaction
Ideally, the failover process should never require any sort of human intervention.

Data integrity
To guarantee data integrity, the servers in a fault resilient cluster must share the same storage disks. After a failover, the user must see the same consistent data that was available to the original server. These shared disks are critical and should feature some sort of mirrored RAID protection.

Communication networks
Each server in a cluster must continuously monitor the state of the other servers in the cluster. This is accomplished through a pair of heartbeat networks that run independent of one another. Another network is required to communicate with the clients or users; this is called the public, or service, network. It is not a strict requirement, but the servers in a fault resilient pair or cluster should also maintain communication with system administrators. This can be accomplished by a separate administrative network.


Fault Resilient System Components


Servers

To simplify configuration and administration, all the servers in a fault resilient pair or cluster are completely identical. This means that they have the same processor type and identical memory, and they are running the same version of the operating system with identical patches. Many system vendors manufacture models that have subtle differences. It is important that you avoid any incompatibility issues by using identical servers. If you do utilize different system models, you should use combinations that are proven to be compatible and are well-tested in cluster environments.

Networks

A fault resilient pair or cluster has three separate levels of network communication:

1. Public network

The public network is the means by which the server pair or cluster communicates with the end users. In many systems, the network is the least available component. You can determine methods to increase availability by breaking the public network down into three basic components:

User access devices
These devices include client terminals, PCs, and workstations. For user access devices, there are no special redundant components that you can use to improve availability. When a user access device fails, the failure affects a single user. Often a user can still access the system by using another device or accessing a shared pool of devices.

Local LAN segments
Local LAN devices are generally connected to a backbone network with routers or bridges. It is preferable to configure parallel access points to the network, especially if you have a large number of users who must have access to applications at all times. You can implement redundant networks that enable you to switch the flow of data from one network to another in the event of a loss of network connectivity.


LAN interface components

The servers link to the network through a network interface card (NIC). You should implement some sort of redundancy at the NIC level to ensure that the servers can connect to the network even if there is a fault or failure in the NIC. At each cluster node, you should allow for at least two parallel, independent networking access points. If message traffic is heavy, you may need additional access points to support message traffic during system failover.

2. Heartbeat networks

Heartbeat networks are the channels through which the servers in a pair or cluster communicate with and monitor each other. When heartbeats stop arriving from a server, the other servers treat that server as having failed or lost connectivity.

3. Administrative network

An administrative network is not a required component of a fault resilient cluster. However, it is a good idea to have a redundant network that is able to be accessed solely by the administrator. This network enables the administrator to monitor the status of the servers or system resources, even if other public or private networks in your system fail.

Disks

Private disks
The private disks are unshared, independent disks that contain not only the operating system, but also any software required for the failover process.

Public disks
Public disks are the shared storage disks that are accessed by the end user. After a failover, the user should see the same consistent data that was available to the original server. Public disks are critical and should feature some sort of mirrored RAID protection.

Stages of Failover
There are three basic stages of failover: Discovery

First, a hardware or software fault triggers the failover process. This fault can affect part of one system, an entire system, or a group of systems. Next, the system recognizes that there has been a downgrade in status. Some subsystems, such as RAIDs, may have built-in automatic recovery capabilities. If not, then the failover process begins.

Notification


In this stage, the system is made aware of the failure. In fault tolerant systems, subassemblies may be configured to notify their parent assemblies that they have failed; a driver must be written for this notification to take place. In a cluster, once the loss of a resource has been detected, all systems are made aware of the loss in order to compensate. This notification must occur even if the network shared by the servers and users fails. Therefore, a separate private network must be available for inter-server communication, and systems must have redundant communication methods available. It is important to note that some servers may continuously monitor each other's ability to communicate. If one server is unable to communicate with the others, the others will assume that the server's resources and services are offline. The servers will notify each other and fail over that server's services to other servers in the configuration automatically.
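A minimal Python sketch of the peer-monitoring idea described above: if no heartbeat arrives within a timeout, the silent node is declared failed and its services are failed over. The timeout value and names are illustrative, not VCS defaults.

import time

HEARTBEAT_TIMEOUT = 15.0   # seconds without a heartbeat before declaring failure
last_heartbeat = {"nodeB": time.time()}

def record_heartbeat(node):
    last_heartbeat[node] = time.time()

def check_peers(now=None):
    now = now or time.time()
    failed = [n for n, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]
    for node in failed:
        print(f"{node} missed heartbeats; failing over its service groups")
    return failed

record_heartbeat("nodeB")
check_peers(now=time.time() + 20)   # simulate 20 quiet seconds -> failover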

Recovery

Once the cluster has responded to the loss of a resource, operators can repair the resource. The cluster should then be able to restore operations to the state before the failure in a way that is virtually transparent to client processes.

Data Access Models for Fault Resilient Clusters


Clustered servers must cooperate so as not to interfere with each other's access to file system metadata or user data. There are two basic cluster data access models:
Shared nothing
Shared data

Shared Nothing Clusters


In a shared nothing model, each storage device is connected to exactly one node in the cluster. Storage device ownership may pass from server to server, but a server must relinquish ownership before another can claim a device. In the shared nothing cluster model, applications running on different servers cannot access the same file systems concurrently. Shared nothing clusters enhance the availability of an application. If an application or the server on which it is executing fails, a failover server takes control of the application's storage devices and restarts the application service. Shared nothing clusters also enable read-only applications to scale beyond the capacity of a single server. Prior to the Internet explosion, read-only applications were of limited utility. Currently, however, most commercial web servers are heavily loaded with read-only data. Multiple instances of a read-only web application can run on shared nothing clustered servers, each accessing its own copy of served web pages. As long as access is read-only, there is no need to synchronize copies of the web pages. The storage in a shared nothing cluster is not dual-ported. This storage is often mirrored or uses fault-tolerant hardware arrays with redundant controllers. This cluster configuration is relevant only for an application which features a shared-nothing parallel database architecture. Clusters providing highly available data services, such as Oracle Parallel Server, require physical connections from all nodes to all storage devices, and cannot be configured in a shared-nothing manner.

Shared Data Clusters

Shared data clusters enhance application availability and, in addition, enable any partitionable application to scale beyond the capacity of a single server. Shared data clusters provide read-write access to a single copy of data to multiple application instances executing on different servers. Since all applications access the same copy, all applications have instant access to all data updates. There are two different access modes in a shared data cluster:

Shared parallel access
In this shared data model, storage devices can be accessed by more than one server at the same time. In the simplest variation of this model, servers share access to storage devices on which they create private, logical volumes.

Shared disk clusters feature a common I/O bus for disk access. Because all nodes can write to or cache data from the centralized disks at the same time, a synchronization mechanism must be used to preserve the coherence of the system. Some sort of lock manager serves this purpose in a shared disk cluster configuration. A sophisticated shared data model, such as VERITAS SANPoint Foundation Suite HA, supports concurrent access to file system data by all servers in a cluster.
Shared exclusive access
This model features storage that is dual-ported. However, rather than concurrent storage access, each individual node has exclusive access to the shared storage at any point in time. In the event of a failure, the faulty server's services fail over to the other node, which then accesses the same storage as the original server.

Asymmetric 1 to 1 Configurations
This topic describes fault-resilient, asymmetric, 1 to 1 cluster configurations.

Overview of Asymmetric Failover


In an asymmetric failover configuration, one primary server performs the critical processes, while the secondary server is either idle or running a low-priority application. If the primary server fails, the secondary takes ownership of the shared storage and starts the application. The failover process can also be initiated manually; for example, manual failover would be used if you wanted to perform maintenance or updates on the primary server. Once the fault on the primary server is repaired, the application can be failed back to the primary server, either manually or automatically if the now-active secondary server fails.


In this example, a file server application is failed over from the master server to the backup server. Notice that the IP address used by the client systems moves as well. This is extremely important; otherwise, every client would have to be reconfigured after each failover.


Advantages of Asymmetric Pairs
Asymmetric pairs:
- Provide very high data availability.
- Are relatively easy to configure.

Disadvantages of Asymmetric Pairs
Asymmetric pairs:
- Are expensive.
- Involve hardware that is used solely for monitoring purposes.
- Make it difficult to get budget approval for idle hardware.

Capacity Considerations for Asymmetric Clusters


In asymmetric configurations, all of the systems might not have equivalent capacities. The node to which applications are failed over may be a smaller system. Suppose an asymmetric cluster contains three nodes: Node1, Node2, and Node3. Node2 and Node3 have considerably smaller capacities than Node1.


Each node has multiple applications running when all of the nodes are functioning properly. If Node1 fails, App1 fails over to Node2, and App2 and App3 to Node3. App4 and App5 on Node1 are discarded. All local applications on Node3 will also be discarded to make room for App2 and App3.

Symmetric 1 to 1 Configurations
This topic describes fault-resilient, symmetric, 1 to 1 configurations.

Overview of Symmetric Failover


Symmetric failover enables both hosts to run production applications. The hosts then monitor each other through the dedicated heartbeat networks.


In the event of a service failure, the other server would take over and run both applications.


The IP address moves to the host that is running the service. When a failover occurs, the service is failed over to the alternate node, and that node is configured with the new IP address as well as its old address. This way, client-side applications do not require reconfiguration to locate the recovered instance of the application. Of course, any TCP connections that were open with the old instance of the service will be terminated by the failover, and new TCP connections will need to be established. In many cases, the restoration of the TCP connection is transparent to the user.
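As a rough illustration of what happens under the covers, the takeover node simply plumbs the service address as a logical interface alias. The following Solaris commands are a minimal sketch only; the interface hme0 and the address 192.168.33.10 are assumed values, and in a VCS cluster the IP agent performs the equivalent steps automatically:

# ifconfig hme0:1 plumb                                      -- create the logical interface
# ifconfig hme0:1 192.168.33.10 netmask 255.255.255.0 up     -- bring up the service address on the takeover node
# ifconfig hme0:1 unplumb                                    -- remove it when the service is taken offline

Clients keep using the same address; because the new host has no state for the old TCP connections, those connections are reset, while new connections succeed transparently.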

Capacity Considerations for Symmetric Clusters


In a symmetric configuration, all of the systems should have equivalent hardware, such as memory, CPU, and I/O capacities. They can be used simultaneously under normal operating conditions. The administrator must ensure that sufficient memory, CPU, and I/O capacity are available on the surviving servers in the event that an application is failed over. In an example of a symmetric cluster, suppose the cluster contains three nodes: Node1, Node2, and Node3.


Each node has multiple applications running when all of the systems are functioning properly. If Node1 fails, App4 is transferred to Node2, App3 to Node3, and App6 is discarded. Node2 must have enough available capacity during normal operations to accommodate App4 in the event of a Node1 failure. Similarly, Node3 must have enough available capacity for App3.

Note that in symmetric failover, the hosts are generally configured with more processing and I/O power than is needed to run their individual applications. The effect of running both sets of applications on one host must be considered: if both are running at capacity and one fails, the performance of the remaining one will be poor.

On the surface, it would appear that the symmetrical configuration is far more beneficial in terms of hardware utilization, and many organizations dislike the concept of a valuable system sitting idle. There is a flaw in this line of reasoning, however. In asymmetrical failover, the takeover server needs only as much processor power as its peer, so on failover, performance remains the same. In symmetrical failover, the takeover server needs sufficient processor power not only to run its existing application, but also to run the application it takes over. If a single application needs one processor to run properly, an asymmetric configuration would need two single-processor systems; to run identical applications on each server without a performance loss after failover, a symmetrical configuration would require two dual-processor systems.

N to 1 Clustering
This topic describes a traditional N to 1 networked cluster configuration.

N to 1 Cluster Scalability
One important consideration in clustering is scalability. Most HA packages can scale to eight or more nodes. It is important to note that attaching more than two hosts to a single SCSI storage device becomes problematic, as specialized cabling must be used. In most cases, scaling beyond four hosts is not practical, as it severely limits the actual number of SCSI disks that can be placed on the bus.

Example of a 4 to 1 Cluster

This example illustrates the inherent complexities of a 4 to 1 cluster. Each of the four primary servers is connected to a set of two disks. All the disks are connected to a fifth server that acts as the backup server. This could be an asymmetric or a symmetric cluster; the major difference is in the functionality of the backup server:
- In a 4 to 1 asymmetric configuration, the fifth server simply acts as the standby server. The four primary servers act independently of one another. In the event of a single server failure, its services are failed over to the standby server.
- In a 4 to 1 symmetric configuration, the fifth server acts as the standby server and also runs applications.

N to 1 Clustering on the SCSI Bus


An initiator, as the name implies, initiates commands. SCSI host bus adapters (HBAs) are the initiators; SCSI drives are targets. On a SCSI bus, N to 1 data sharing requires multi-initiator attachment. Multi-initiator attachment requires the capability to change the SCSI target ID of the HBA, since only one HBA on the bus can have the highest-priority ID (ID 7). It also requires special support in the driver to release control of the bus to another initiator.

This diagram shows how a cluster of systems might share a group of disks. Notice that each of the HBAs on the bus must have a high-priority, but different, SCSI target ID. Special cables must be used to attach more than two hosts to the bus. Disadvantages of this configuration include:
- The potential for duplicate IDs
- Complicated termination issues that can result in the loss of data
- Compatibility requirements between controllers (for example, you must have differential SCSI devices if you have a differential controller)

Dual hosted SCSI


Dual hosted SCSI has existed for a number of years and functions well in small cluster configurations. The primary limitation of dual hosted SCSI is scalability. Typically, two (to a maximum of four) systems can be connected to a single drive array. Large storage vendors, such as EMC, provide high-end arrays with multiple SCSI connections in order to overcome scalability issues. In most cases, the nodes are connected to a simple array in a configuration illustrated in this diagram.

A typical SCSI bus has one SCSI initiator for the controller or HBA, and one or more SCSI targets for the drives. To configure a dual hosted SCSI configuration, one SCSI initiator ID must be set to a value different from its peer. The SCSI target IDs must be chosen so that they do not conflict with the ID of any installed drive or with an initiator ID.

Setting the SCSI Initiator ID
The method of setting SCSI initiator IDs is dependent on the system manufacturer. For example, Sun Microsystems provides two methods to set SCSI initiator IDs:
- Changing the scsi-initiator-id value. This affects all SCSI controllers in the system, including the internal controller for the system disk and CD-ROM. Be careful when choosing a new controller ID not to conflict with the boot disk, floppy drive, or CD-ROM.
- Editing the SCSI driver control file. This file is in the /kernel/drv area and sets the SCSI initiator ID on a per-controller basis.
NT and Intel systems are typically set on a per-controller basis with a utility package provided by the SCSI controller manufacturer. You should refer to your system documentation for details.
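As a concrete illustration of the first (system-wide) method on a Sun system, the OpenBoot scsi-initiator-id variable can be changed either from the running operating system or from the OpenBoot PROM. This is a minimal sketch only; the chosen ID of 5 is an assumption, and you should verify it does not collide with any internal device before using it:

# eeprom scsi-initiator-id=5      -- from Solaris; takes effect at the next reboot
ok setenv scsi-initiator-id 5     -- equivalent command at the OpenBoot ok prompt
ok reset-all

Per-controller changes (the second method) are typically scripted in the OpenBoot nvramrc or made in the driver control file, and are best taken from your hardware documentation.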


Common Problems in Dual Hosted SCSI
The most common problems encountered when attempting to configure dual hosted SCSI are:

Duplicate SCSI target IDs
The most common problem encountered when configuring shared SCSI storage is duplicate SCSI target IDs. A duplicate SCSI target ID will, in many cases, exhibit different symptoms depending on whether there are duplicate controller IDs, or a controller ID conflicting with a disk drive.

Duplicate initiator IDs
This is a very serious problem that is more difficult to identify than duplicate SCSI target IDs. In a normal communication sequence, a target can only respond to a command from an initiator. If an initiator sees a command from another initiator, the command is ignored.

The problem may manifest itself during simultaneous commands from both initiators. A controller could issue a command, see a response from a drive, and assume all is well, when the response was actually to a command issued by the peer system; the original command may not have executed successfully. Carefully examine the systems attached to shared SCSI and make certain that the controller IDs are different.
Configuring Dual Hosted SCSI: Example
The following is an example of how to set up a typical dual hosted SCSI configuration:
1. Attach the storage to one system.
2. Terminate the SCSI bus at the array.
3. Power up the host system and array.
4. Verify that all drives can be seen by the operating system, using available commands such as the format command.
5. Identify the SCSI drive IDs that are used in the array, and the internal SCSI drives if they are present.
6. Identify the SCSI controller ID.
7. Identify a suitable ID for the controller on the second system. This ID must not conflict with any drive in the array or with the peer controller. If you plan to set all controllers to a new ID, ensure that the controller ID chosen on the second system does not conflict with internal SCSI devices.
8. Set the new SCSI controller ID on the second system.
9. Power down both systems and the external array.
10. Disconnect the SCSI terminator and connect the array to the second system.
11. Power up the array and both systems.
Depending on the hardware platform, you may be able to check for array connectivity before the OS is brought up. Boot console messages such as "unexpected SCSI reset" are a normal occurrence during the boot sequence of a system connected to a shared array. Most SCSI adapters perform a bus reset during initialization; the error message is generated when a host sees a reset that was initiated by the peer.
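As a quick sanity check after step 11, you might confirm from each host that the shared drives are visible and carry the expected labels. These Solaris commands are illustrative only; the device name c1t2d0 is a hypothetical example and will differ on your systems:

# format < /dev/null              -- list every disk this host can see, then exit
# prtvtoc /dev/rdsk/c1t2d0s2      -- print the partition table of one shared disk

Running the same commands on both hosts and comparing the output is a simple way to confirm that both initiators really are attached to the same bus.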

N to 1 SAN Clustering
This topic describes the implementation of an N to 1 clustering design in a Storage Area Network (SAN) environment. SANs are specialized high-speed networks that enable fast, reliable access among computers and independent storage resources. In a SAN, all networked servers share storage devices as peer resources. In other words, they are not the exclusive property of any one server. You can use a SAN to connect servers to storage, servers to each other, and storage to storage through hubs, switches, and routers.


Defining SAN
SANs are defined as specialized, high-speed networks that are specifically dedicated to storage. SANs provide fast, reliable access among systems and storage resources. The Storage Networking Industry Association (SNIA) defines a SAN as:

"A network whose primary purpose is the transfer of data between computer systems and storage elements and among storage elements. Abbreviated SAN. A SAN consists of a communication infrastructure, which provides physical connections, and a management layer, which organizes the connections, storage elements, and computer systems so that data transfer is secure and robust." Fibre Channel
Although the definition of a SAN does not specifically mention Fibre Channel technology, the Fibre Channel protocol was the foundation for the development of SAN technology. With the emergence in the mid-1990s of Fibre Channel-based networking devices, such as Fibre Channel switches, companies began to create networked environments for storage in which servers and storage were connected in an any-to-any fashion, supported by a highly reliable, high-performance fabric network. Fibre Channel, for the first time, enabled companies to virtualize storage and provide high-speed access to information from any storage device to any server.

SAN Benefits
Attaching more than two hosts to a traditional, single SCSI storage device becomes problematic. SANs enable you to connect a large number of hosts to a nearly unlimited amount of storage. This allows much larger clusters to be constructed relatively easily. A SAN carries only I/O traffic between servers and storage devices; it does not carry general-purpose traffic such as email or other end-user applications. Therefore, it avoids the compromises inherent in using a single network for all applications. With this shared capacity, organizations can acquire, deploy, and use storage devices more cost-effectively. Ultimately, on a SAN, any data at any network location is accessible, often through multiple paths, by any nodes, applications, or users on the network. Storage on a SAN is shared, resulting in centralized management, better utilization of disk and tape resources, and enhanced enterprise-wide data management and protection.


SANs are designed to replace today's point-to-point access methods with a new any-to-any architecture. In the traditional model, if disks are logically shared, this sharing occurs at LAN speeds, such as 100 megabits/second, or is limited to the small number of nodes which can be directly attached to a given disk array. Through the addition of a high-speed switch, clients can access any disk from any node on the SAN at channel speeds, such as 100MB/sec. This allows a much larger number of nodes much faster access to a much larger centralized data store.


Redundancy is easily added to a SAN through the incorporation of a second switch or redundant switching components to support highly available data access. Additional nodes and disk arrays can be easily added to these configurations with minimal disruption by plugging new components into the switch, providing a much simpler and more scalable growth path than traditional architectures. Finally, any node in the SAN may potentially back up any other node. One or two dedicated nodes can now back up a much greater number of nodes, thereby significantly reducing the hardware costs associated with cluster configurations.


Failover Granularity in Clusters


This topic describes the concepts and requirements of application-level failover, as opposed to server-level failover.

First Generation Failover Granularity


A significant limitation of first generation failover management systems is failover granularity. Failover granularity refers to what must fail over in the event of a failure. First generation systems had a failover granularity equal to an entire server: in the event of the failure of any HA application on a system, all applications would fail over to a second system. This severely limited the scalability of any server. For example, running multiple production Oracle instances on a single system is problematic, because the failure of any one instance causes an outage of all the instances on the system while all applications are migrated to another server.

Second Generation Failover Granularity


One of the distinguishing features of second generation HA systems is the concept of resource groups, or service groups. Particularly on large servers, it is rare that the entire server is dedicated to a single application service. Configuring multiple domains on an enterprise server partially alleviates the problem; however, multiple applications may still run within each domain. Failures that affect a single application service, such as a software failure or hang, do not necessarily affect other application services that may reside on the same physical host or domain. If they do, then downtime may be unnecessarily incurred for the other application services.

Application Services
An application service is the service the end user perceives when accessing a particular network address. An application service is typically composed of multiple resources, some hardware and some software based, all cooperating together to produce a single service.


Example of an Application Service
For example, a database service may be composed of one or more logical network IP addresses, RDBMS software, an underlying file system, a logical volume manager, and a set of physical disks that are being managed by VERITAS Volume Manager. If this database service needed to be migrated to another node for recovery purposes, all of its resources must migrate together to re-create the service on another node. A single large node may host any number of application services, each providing a discrete service to networked clients who may or may not know that they physically reside on a single node.

Application Service Management
Application services can be proactively managed to maintain service availability through an intelligent availability management tool. An application service can be made highly available if it is possible to test the application service to ensure that it is providing the expected service to networked clients, and if you can automatically start and stop the application service. If multiple application services are running on a single node, then they must be monitored and managed independently. Independent management allows an application service to be automatically recovered, or manually idled for administrative or maintenance reasons, without necessarily impacting any of the other applications running on the node. This is particularly important on larger servers, which may easily be running eight or more applications concurrently. Of course, if the entire server crashes, as opposed to just a software failure or hang, then all the application services on that node must be recovered elsewhere.

At the most basic level, the fault management process includes monitoring an application service and, when a failure is detected, restarting that application service automatically. This could mean restarting it locally or moving it to another node and then restarting it, as determined by the type of failure incurred. In the case of a local restart in response to a fault, the entire application service does not necessarily need to be restarted; perhaps just a single resource within that application service needs to be restarted to restore the service.

Load Balancing
Given that application services can be independently manipulated, a failed node's workload can be load balanced across the remaining cluster nodes, and potentially failed over successive times without manual intervention. In this example, a three node cluster is operating normally while running four applications.

The second node fails. On recovery, the application load of the failed server is balanced across the other two nodes.


If another server fails, all of the applications would failover to the remaining server.

Application Requirements for Failover


Nearly all applications can be placed under cluster control, as long as basic guidelines are met:

The application must have a defined procedure for startup. This means that the failure management software can determine the exact command used to start the application, as well as all other outside requirements the application may have, such as mounted file systems, IP addresses, and so on. For example, an Oracle database agent needs the Oracle user, instance ID, Oracle home directory, and the pfile. The developer must also know implicitly what disk groups, volumes, and file systems must be present.

The application must have a defined procedure for stopping. This means that an individual instance of an application must be capable of being stopped without affecting other instances. For example, with a Web server, killing all HTTPD processes is unacceptable since it would stop other Web servers as well.

The application must have a defined procedure for monitoring the overall health of an individual instance. Using the Web server as an example, simply checking the process table for the existence of "httpd" is unacceptable, as any Web server would cause the monitor to return an online value. Checking whether the pid contained in the pid file is actually in the process table is a better solution. For more robust monitoring, an application can be monitored from closer to the user perspective. For example, an HTTPD server can be monitored by connecting to the correct IP address and port and testing whether the Web server responds to HTTP commands.

In a database environment, the monitoring application can connect to the database server, perform SQL commands, and verify read and write access to the database. It is important that the data written for subsequent read-back is changed each time, to prevent caching from hiding underlying problems. In both cases, end-to-end monitoring is a far more robust check of application health: the closer a test comes to exactly what a user does, the better the test is at discovering problems. This does come at a price, however. End-to-end monitoring increases system load and may increase system response time. From a design perspective, the level of monitoring implemented should be a careful balance between assuring that the application is up and minimizing monitor overhead.
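To make the pid-file check described above concrete, the following is a minimal sketch of the kind of monitor script a custom agent might run. The path /var/run/httpd.pid and the process name httpd are assumptions for illustration; a real VCS agent would implement this logic in its monitor entry point and map the result to the online/offline status its framework expects:

#!/bin/sh
# Minimal monitor sketch: report whether the daemon named in the pid file is alive.
# /var/run/httpd.pid and the process name httpd are assumed values for illustration.
PIDFILE=/var/run/httpd.pid
[ -r "$PIDFILE" ] || exit 1                        # no pid file -- treat as offline
PID=`cat "$PIDFILE"`
if ps -p "$PID" 2>/dev/null | grep httpd >/dev/null 2>&1; then
    exit 0                                         # process is running -- online
else
    exit 1                                         # process not found -- offline
fi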
The application must be capable of storing all required data on shared disks. This may require specific setup options or even soft links. For example, the VERITAS NetBackup product is designed to install in the /usr/openv directory only. This requires either linking /usr/openv to a file system mounted from the shared storage device, or actually mounting a file system from the shared device on /usr/openv. Similarly, the application must store its data to disk rather than maintaining it only in memory, so that the takeover system is capable of accessing all required information.

The application must be capable of being restarted to a known state. This is the most important application requirement. On a switchover, the application is brought down under controlled conditions and started on another node. The application must close out all tasks, store data properly on shared disk, and exit; the peer system can then start up from a clean state. A problem arises when one server crashes and another must take over. The application must be written in such a way that data is not held only in memory, but is regularly written to disk.

A commercial database such as Oracle is the perfect example of a well written, crash tolerant application. On any given client SQL request, the client is responsible for holding the request until it receives an acknowledgement from the server. When the server receives a request, it is placed in a special log file, or "redo" file, and this data is confirmed as written to stable disk storage before the client is acknowledged. At a later time, Oracle de-stages the data from the redo log to the actual table space. After a server crash, Oracle can recover to the last known committed state by mounting the data tables and applying the redo logs; this in effect brings the database to the exact point in time of the crash. The client resubmits any outstanding requests not acknowledged by the server; all others are contained in the redo logs. One key factor to note is the cooperation between the client application and the server. This must be factored in when assessing the overall cluster compatibility of an application.

The application must be capable of running on all servers designated as potential hosts. This means there are no license issues, host name dependencies, or other such problems. Prior to attempting to bring an application under cluster control, it is highly advisable to test run the application on all systems in the proposed cluster that may be configured to host it.


Example: Configuring Veritas Cluster on Solaris 2.8 with VxFS and Volume Manager
Setup:
1. Two E450s, each with 2 GB RAM and 2 x 480 MHz CPUs.
2. Two A1000 RAID boxes, each with 4 x 18.1 GB HDDs, and Raid Manager (rm6) 6.22.
3. One onboard network card (hme0), one additional card (hme1), and a gigabit Ethernet card (ge0) per server.
4. A cross cable and a fiber cable for the heartbeat links.

Procedure:
1. Load Solaris 2.8 on both machines along with the latest patches.
2. Load the rm6 utility for configuring the A1000s. There are 4 hard disks in each array; configure them as RAID 0, so that each logical volume appears as a 72 GB hard disk in Volume Manager.
3. In the cluster, the two A1000 boxes will be used as shared disks.
4. Before connecting the A1000s to both machines, do not forget to change the scsi-initiator-id of one machine (preferably the machine that will be your secondary server in the cluster). Put one differential card in slot no. 1 and another in slot no. 5 (from the top), and do this on both servers; otherwise the server will continuously give the error AUTO SENSE RESET FAILED.
5. After doing this, connect each E450 to both arrays and verify the hard disks using the format command.
6. Load the required array patches.
7. Load VERITAS Volume Manager 3.1. Volume Manager 3.1 has built-in VxFS 3.3.3. The following packages are used for VxFS: (a) VRTSvxfs, (b) VRTSqio, (c) VRTSqlog. Remember to add the VRTSvxfs package first and then packages (b) and (c); otherwise the remaining two packages will give errors while installing. After loading Volume Manager 3.1, load the required Volume Manager and VxFS patches. Install the Volume Manager and VxFS licenses using the vxlicense -c command.
8. On both servers, include both internal hard disks in the rootdg disk group and mirror the internal disks through Volume Manager.
9. Assume the hostname of the primary server is dotsoft1 and that of the secondary server is dotsoft2. On dotsoft1, create an additional disk group called bsnldg and include the array hard disks (configured as RAID 0) in that disk group. Mirror the hard disks included in the bsnldg disk group from Volume Manager.
10. Check the major and minor numbers in the /dev/dsk directory on both the primary and secondary servers after confirming that the array volumes are detected and mounted on both servers. Also check the values of vxio and vxspec in /etc/name_to_major on both servers. If the values differ, change them, preferably on the secondary server. Ensure the values are the same; otherwise you will face problems later during clustering.
11. Load the gigabit Ethernet driver from the CD that comes with the Solaris 8 pack.
12. Public IP addresses: dotsoft1 hme0 - 192.168.33.6, dotsoft2 hme0 - 192.168.33.7.
13. DO NOT PLUMB THE PRIVATE CARDS (hme1 and ge0), as this is taken care of by the Veritas Cluster software. Also remove the entries for all mount points (of the shared array) from the /etc/vfstab file.
14. Before proceeding further, write down the resources you want to put under VCS control, which will fail over to the other live system in the cluster, e.g. NIC, disk partition, IP, mount points, etc.
15. VCS uses two components, LLT and GAB, to share data over the private networks among systems. LLT provides fast kernel-to-kernel communication and monitors network connections; it is configured using the llttab file, which describes the systems in the cluster and the private network links among them. GAB (Group membership and Atomic Broadcast) provides the global message order required to maintain a synchronized state among the systems and monitors disk communications, such as that required by the VCS heartbeat utility; the GAB driver is configured by creating the gabtab file.
16. Mount the VCS CD and run the command:
# ./InstallVCS

Enter the following information:
Please enter unique cluster name: bsnlcluster
Please enter unique cluster id: 2 (this has to be unique if multiple cluster setups exist in the organization)
Enter the systems on which you want to install VCS: dotsoft1 dotsoft2 (the names should be separated by spaces)
After this, the installer starts installing the software on both machines. The process discovers information about the network cards; it will discover all the network cards in the system and prompt for the device files to use for the private links. For example, for the second hme card enter /dev/hme1; for qfe cards, /dev/qfe1 or /dev/qfe2. For gigabit Ethernet cards, select "other"; when you select the "other" option it prompts you to enter the actual device file, so for the ge card enter /dev/ge0. In our case the hme1 and ge0 cards were used for the private links, so the device files are /dev/hme1 and /dev/ge0.
17. The same information will be asked for the other server, so enter the same device files for the other server.
18. Reboot the servers.
19. To verify whether the installation is successful, check the following files:
a. /etc/llthosts - should contain an entry for both servers.
b. /etc/llttab - cluster node ID along with private link information.
c. /etc/gabtab - should contain the gabconfig command.
d. /etc/VRTSvcs/conf/config/main.cf - the main configuration file for the Veritas cluster. Entries are made in this file either from the command line or through the GUI.
The ./InstallVCS script creates a user 'admin' with the password 'password', required for Veritas cluster administration through the GUI (the Veritas Cluster Manager).
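For reference, on a two-node cluster like this one, the files listed in step 19 typically look something like the following. These contents are illustrative only; the actual files are generated by InstallVCS and depend on your node names, cluster ID, and private link devices:

# cat /etc/llthosts
0 dotsoft1
1 dotsoft2

# cat /etc/llttab
set-node dotsoft1
set-cluster 2
link hme1 /dev/hme:1 - ether - -
link ge0 /dev/ge:0 - ether - -

# cat /etc/gabtab
/sbin/gabconfig -c -n2

The -n2 argument tells GAB to seed the cluster only when two nodes are communicating, which prevents a lone node from starting the cluster on its own.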

20. Before starting the configuration, set the PATH variable as follows:
PATH=$PATH:/sbin:/usr/sbin:/opt/VRTSvcs/bin ; export PATH
21. Check the private link status using the command:
# lltstat -nvv | more
22. To verify that GAB is operating, use the command:
# /sbin/gabconfig -a
This command should return GAB port membership information.
23. To verify that the cluster is operating, use the following command:
# hastatus -summary
24. To install the Veritas Cluster Manager GUI, add the VRTScscm package from the cluster CD.
25. The core VCS processes, which should be running, are:
a. had - the VCS engine that maintains configuration information and administers failover.
b. hashadow - the process that monitors and restarts the VCS engine.
c. halink - the process that monitors the links between the systems in the cluster.
26. The following are the three components of a Veritas cluster:
a. Resources - hardware or software entities such as hard disks, NICs, IP addresses, applications, and databases.
b. Resource types - resources are classified into types, and multiple resources can be of one type. For example, two volumes can both be of type Volume.
c. Service groups - the most important component of VCS. A service group is composed of related resources; when a service group is brought online, all the resources within it are automatically brought online. A failover service group is a service group which can be online on only one system at a time; for example, file systems are configured in failover service groups.

A parallel service group can be fully or partially online on both servers at the same time; for example, Oracle Parallel Server (OPS) is configured as a parallel service group.
27. In a Veritas cluster, dependencies between the resources have to be created. The dependencies between the resources specify the order in which the resources within a service group are brought online and taken offline. For example, if a service group called abc is created which has a disk group and volumes as resources, then when the service group is brought online, the disk group is brought online first and then the volumes. So a dependency between the disk group and the volume has to be created, and in the same way a dependency between the volume and the mount point. Since the disk group comes up first, it is called the child and the volume is called the parent; likewise, between the volume and the mount point, the volume is the child and the mount point is the parent. The same holds true between a NIC and an IP address.
28. Before starting the VCS GUI, the cluster configuration has to be made read/write. Use the following commands:
# haconf -makerw -- set the configuration to read/write mode
# hauser -add username -- add another user
# haconf -dump -makero -- dump the configuration and reset it to read-only
# xhost +
# hagui &
This opens a console. Log in as admin or as the user you just created.


29. To create the dependencies:
a. Create a service group called bsnl. Include the DiskGroup resource in it by selecting the Add Resource tab. The disk group "bsnldg" has already been created using Veritas Volume Manager. Click on the properties of the DiskGroup resource and enter its attributes, such as the disk group name.
b. Create the Volume resources by selecting Add Resource. Include all 10 volumes that were created using Veritas Volume Manager; in the properties tab of each Volume resource, enter the volume name. In the same way, create 10 mount points after creating Mount resources, specifying for each the mount point name, the physical device to which the mount point corresponds, and the volume to which the mount point corresponds, from the properties tab.
c. Create the dependency between the disk group and the volumes. The volume will be the parent and the disk group will be the child. While creating the dependency, a link has to be created between the volume and the disk group (by dragging the mouse).
For the admin user, the password will be password. Also enter the cluster name. The screen looks as follows:

As shown, Exch_NIC represents the NIC resource and Exch_IP represents the IP resource. Exch_NIC is the child, whereas Exch_IP is the parent. In the same way, Exch_DiskRes is the child and Exch_MountX is the parent. To create a link between two resources (the two blue boxes), drag the mouse from one object to the other; it will ask you to confirm that Exch_DiskRes is the child and Exch_MountX is the parent. In the same way, create links (dependencies) between all the objects. In the diagram, VCSNT5 and VCSNT6 are the two systems in the cluster. Once these dependencies are created, start the service group on the primary server; all the volumes in the shared array will then be mounted. On switching the group to the secondary server, all the volumes will be mounted there (a command-line sketch of these operations follows).
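The same group, resources, and dependencies can also be created from the command line instead of the GUI. This is a minimal sketch under stated assumptions: the resource names (bsnl_dg, bsnl_vol01, bsnl_mnt01), volume name vol01, and mount point /data01 are examples only, and only one volume/mount pair is shown out of the ten used in this setup:

# haconf -makerw                                    -- open the configuration for changes
# hagrp -add bsnl
# hagrp -modify bsnl SystemList dotsoft1 0 dotsoft2 1
# hares -add bsnl_dg DiskGroup bsnl
# hares -modify bsnl_dg DiskGroup bsnldg
# hares -add bsnl_vol01 Volume bsnl
# hares -modify bsnl_vol01 Volume vol01
# hares -modify bsnl_vol01 DiskGroup bsnldg
# hares -add bsnl_mnt01 Mount bsnl
# hares -modify bsnl_mnt01 MountPoint /data01
# hares -modify bsnl_mnt01 BlockDevice /dev/vx/dsk/bsnldg/vol01
# hares -modify bsnl_mnt01 FSType vxfs
# hares -link bsnl_vol01 bsnl_dg                    -- volume (parent) depends on disk group (child)
# hares -link bsnl_mnt01 bsnl_vol01                 -- mount point (parent) depends on volume (child)
# haconf -dump -makero                              -- save and close the configuration
# hagrp -online bsnl -sys dotsoft1                  -- bring the group up on the primary
# hagrp -switch bsnl -to dotsoft2                   -- switch it to the secondary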

To check which commands are being executed in the background click the "command center" icon.


Configuring Membership Heartbeat Regions on Disk
Group membership heartbeat regions can also be set up on the shared disks/array for use as an additional path for VCS heartbeating. With these regions configured, VCS has multiple heartbeat paths available in addition to the network connections, so if one private link fails, VCS still has another network connection and a heartbeat disk region that continue heartbeating. For this, two regions of 128 blocks each need to be configured. These regions cannot be configured on VxVM volumes, only on block ranges of the underlying physical device. So if the shared disk is under Volume Manager control, the following steps need to be followed:
a. Identify the shared disk to be used; say the disk name is c3t5d0.
b. Unmount all file systems on the disk.
c. Remove all the volumes (hence, create this region before creating volumes).
d. Remove the disk from Volume Manager control.
e. Give the following command:
# hahbsetup c3t5d0
It will give you the following output:
The hadiskhb command is used to set up a disk for combined use by VERITAS Volume Manager and VERITAS Cluster Server for disk communication.
WARNING: This utility will destroy all data on c3t5d0
Have all disk groups and file systems on disk c3t5d0 been either unmounted or deported? y
There are currently slices in use on disk /dev/dsk/c3t5d0s2
Destroy existing data and reinitialize disk? y
1520 blocks are available for VxCS disk communication and service group heartbeat regions on device /dev/dsk/c3t5d0s7
This disk can now be configured into a Volume Manager disk group. Using vxdiskadm, allow it to be configured into the disk group as a replacement disk. Do not select reinitialization of the disk.
After running vxdiskadm, consult the output of prtvtoc to confirm the existence of slice 7. Reinitializing the disk under VxVM will delete slice 7. If this happens, deport the disk group and rerun hahbsetup.

f. On running the format command for the c3t5d0 disk, you will observe that 2 MB of space has been created in the s7 slice.
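The hahbsetup output suggests confirming the new slice with prtvtoc, both now and again after the disk is re-added to Volume Manager in step g below. A minimal check, using the c3t5d0 example device from this procedure, might look like this:

# prtvtoc /dev/rdsk/c3t5d0s2 | grep "^ *7"    -- slice 7 should appear, roughly 2 MB in size

If slice 7 is missing after the disk is re-added, the disk was reinitialized; deport the disk group and rerun hahbsetup as the utility warns.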

g. Re-add the disk under Volume Manager control using the "remove the disk after replacement" option from vxdiskadm (PLEASE DO NOT REINITIALIZE THE DISK).

After the dependencies for the resources are created, try to bring the group online using the following procedure:
1) Select the group you want to online/offline/switch in "Service Groups".
2) Right-click on the group. (If the group is offline, you will get tabs for bringing it online; if the group is already online, tabs for offline and switch-to are available.)
3) Test the online and offline operations on the local system as well as the remote system. If everything goes through up to this step, you are ready to switch over the group.
4) Right-click on the group in "Service Groups", click switch-to, and click on the remote system name in the cluster. (This takes the group offline on the local system and brings it online on the remote system; before a switch-to, the group should be online on the local system.)
5) Check, with the help of the ifconfig -a and df -k commands on both systems, whether the mount points and IP address have been transferred from the local to the remote system.
6) If all the mount points and the IP address configured in VCS are switched over and come up online successfully on the remote system, you can go ahead and directly switch off the system on which the group is currently online.

ORACLE AGENT FOR ORACLE DATABASE
The Oracle agent monitors the Oracle service and the SQLnet listener process.
1. The Oracle agent works in three modes:
a. ONLINE: uses the svrmgrl command to open the database.
b. OFFLINE: uses the svrmgrl command to close the database (shutdown immediate).
c. MONITOR: scans the process table for ora_pmon, ora_smon, and ora_lgwr.
2. The SQLnet listener resource does the following:
a. ONLINE: uses lsnrctl start to start the listener process.
b. OFFLINE: uses lsnrctl stop to stop the listener process.
c. MONITOR: scans the process table for tnslsnr $LISTENER.
Requirements for the Oracle agent: when the Oracle server application ($ORACLE_HOME) is installed on a shared disk, each cluster system must have the same mount point directory for the shared file system.
To install the Oracle agent:

# cd /cdrom/cdrom0
# pkgadd -d .
Now start the Cluster Manager GUI and import the OracleTypes.cf file into the VCS engine using the following method:
a. Start the Cluster Manager GUI.
b. Click on the File menu and select Import Files.
c. In the Import Files dialog box, select the file /etc/VRTSvcs/conf/sample_Oracle/OracleTypes.cf.
d. Save the configuration using the File > Save option.
This makes Oracle available as a resource type in the cluster. (By default, disk groups, mount points, volumes, NIC cards, IP addresses, and disks are available as resource types; since Oracle is not, this method has to be used to make the Oracle resource type available to the cluster.) After installing the Oracle agent, when you open the Cluster Manager GUI, "Oracle" will be present as a resource type. When you create a new resource of type Oracle, it will ask for the following information:
a. Sid
b. Owner
c. $ORACLE_HOME path
d. Pfile - the name of the startup profile, for example $ORACLE_HOME/dbs/initSID.ora
Similarly, a resource of type Sqlnet will be available. Add a resource of type Sqlnet and enter the following information:
a. Owner
b. $ORACLE_HOME - path to the Oracle binaries
c. Name of the listener (the default is LISTENER)
d. $TNS_ADMIN - path to the directory in which the listener configuration file (listener.ora) resides
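These resources can also be added from the command line instead of the GUI. This is a sketch only: the resource names and paths are examples, and the attribute names shown (Sid, Owner, Home, Pfile, TnsAdmin, Listener) correspond to the fields listed above but should be verified against the OracleTypes.cf shipped with your agent version:

# haconf -makerw
# hares -add bsnl_ora Oracle bsnl
# hares -modify bsnl_ora Sid PROD
# hares -modify bsnl_ora Owner oracle
# hares -modify bsnl_ora Home /u01/app/oracle/product/8.1.7
# hares -modify bsnl_ora Pfile /u01/app/oracle/product/8.1.7/dbs/initPROD.ora
# hares -add bsnl_lsnr Sqlnet bsnl
# hares -modify bsnl_lsnr Owner oracle
# hares -modify bsnl_lsnr Home /u01/app/oracle/product/8.1.7
# hares -modify bsnl_lsnr TnsAdmin /u01/app/oracle/product/8.1.7/network/admin
# hares -modify bsnl_lsnr Listener LISTENER
# haconf -dump -makero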

Now create the dependencies between these two resources (Oracle and Sqlnet). Also assign a demo IP that will float from one system to the other in case of a system failover. The IP will be the parent and the public NIC card will be the child; this demo IP in turn acts as a child of the Oracle agent resource, which is its parent. Likewise, Oracle is the child and Sqlnet is the parent. So the final dependency looks this way (from left to right, i.e. from child to parent):

(The Oracle agent is also the parent of the demo IP, which in turn is the parent of the NIC.)

diskgroup -> volumes -> mount points -> Oracle agent -> Sqlnet
                                             ^
                                             |
                                          demo IP
                                             ^
                                             |
                                      NIC card (hme0:1)

So if the system fails, the Sqlnet service stops first (the parent goes offline first), then Oracle shuts down, the mount points are unmounted, the volumes go offline, and the disk group automatically deports; at the same time the demo IP goes offline. Since the child comes online first, on the other system the demo IP comes up, the disk group is automatically imported, the volumes then come online, the mount points are mounted, the database comes up, and finally the listener service starts successfully. Now you can switch off one machine and check whether the Oracle database comes up on the other system in the cluster.

Important Commands
1. hastart - start the VCS engine.
2. hagrp -display - display the service groups.
3. hastatus -summary - summary of cluster information.
You can also check the cluster you have configured through the GUI from the main.cf file present in /etc/VRTSvcs/conf/config.
