High Availability IT Services

1st Edition

By Terry Critchley

Auerbach Publications

537 pages | 157 B/W Illus.

Hardback: 9781482255904
pub: 2014-12-17
eBook (VitalSource): 9780429174520
pub: 2014-12-17


This book starts with the basic premise that a service comprises the 3Ps: products, processes, and people. Moreover, these entities and their sub-entities interlink to support the services that end users require to run and support a business. This widens the scope of any availability design far beyond hardware and software. It also increases the potential for service failure for reasons beyond just hardware and software: the concept of logical outages.

High Availability IT Services details the considerations for designing and running highly available "services" and not just the systems infrastructure that supports those services. Providing an overview of virtualization and cloud computing, it supplies a detailed look at availability, redundancy, fault tolerance, and security. It also stresses the importance of human factors.

The book starts by providing an availability primer and detailing the reasons why you need to be concerned with high availability. Next, it outlines the theory of reliability and availability and the elements of actual practice in the high availability (HA) area, including Service Level Agreements (SLAs) and Change Management.

Examining what the major hardware and software vendors have to offer in the HA world, the book considers the ubiquitous world of clouds and virtualization as well as the availability considerations they present.

The book examines high availability concepts and architectures such as reliability, availability, and serviceability (RAS); clusters; grids; and redundant arrays of independent disks (RAID) storage. It also covers the role of security in providing high availability, cluster offerings, emergent Linux clusters, online transaction processing (OLTP), and relational databases.

Table of Contents


Preamble: A View from 30,000 Feet

Do You Know. . .?

Availability in Perspective

Murphy’s Law of Availability

Availability Drivers in Flux: What Percentage of Business Is Critical?

Historical View of Availability: The First 7 × 24 Requirements?

Historical Availability Scenarios

Planar Technology

Power-On Self-Test

Other Diagnostics

Component Repair

In-Flight Diagnostics


Reliability and Availability

Introduction to Reliability, Availability, and Serviceability

RAS Moves Beyond Hardware

Availability: An Overview

Some Definitions

Quantitative Availability

Availability: 7 R’s (SNIA)

Availability and Change

Change All around Us

Software: Effect of Change

Operations: Effect of Change

Monitoring and Change

Automation: The Solution?

Data Center Automation

Network Change/Configuration Automation

Automation Vendors

Types of Availability

Binary Availability

Duke of York Availability

Hierarchy of Failures

Hierarchy Example

State Parameters

Types of Nonavailability (Outages)

Logical Outage Examples


Planning for Availability and Recovery

Why Bother?

What Is a Business Continuity Plan?

What Is a BIA?

What Is DR?

Relationships: BC, BIA, and DR

Recovery Logistics

Business Continuity

Downtime: Who or What Is to Blame?

Elements of Failure: Interaction of the Wares


DR/BC Source Documents

Reliability: Background and Basics


IT Structure—Schematic

IT Structure—Hardware Overview

Service Level Agreements

Service Level Agreements: The Dawn of Realism

What Is an SLA?

Why Is an SLA Important?

Service Life Cycle

Concept of User Service

Elements of Service Management


Scope of Service Management

User Support

Operations Support

Systems Management

Service Management Hierarchy

The Effective Service

Services versus Systems

Availability Concepts

First Dip in the Water

Availability Parameters


What Is High Availability?

IDC and Availability

Availability Classification

Availability: Outage Analogy

A Recovery Analogy

Availability: Redundancy

Availability: Fault Tolerance

Sample List of Availability Requirements

System Architecture

Availability: Single Node

Dynamic Reconfiguration/Hot Repair of System Components

Disaster Backup and Recovery

System Administration Facilities

HA Costs Money So Why Bother?

Cost Impact Analysis

HA: Cost versus Benefit

Penalty for Nonavailability

Organizations: Attitude to HA

Aberdeen Group Study: February 2012

Outage Loss Factors (Percentage of Loss)

Software Failure Costs

Assessing the Cost of HA

Performance and Availability

HA Design: Top 10 Mistakes

The Development of HA


Systems and Subsystems Development

Production Clusters

Availability Architectures

RAS Features

Hot-Plug Hardware






Fault Tolerance

Outline of Server Domain Architecture


Domain/LPAR Structure

Outline of Cluster Architecture

Cluster Configurations: Commercial Cluster

Cluster Components



Commercial LB

Commercial Performance

Commercial HA

HPC Clusters

Generic HPC Cluster

HPC Cluster: Oscar Configuration

HPC Cluster: Availability

HPC Cluster: Applications

HA in Scientific Computing

Topics in HPC Reliability: Summary

Errors in Cluster HA Design

Outline of Grid Computing

Grid Availability

Commercial Grid Computing

Outline of RAID Architecture

Origins of RAID

RAID Architecture and Levels



Hardware versus Software RAID

RAID Striping: Fundamental to RAID

RAID Configurations

RAID Components



RAID Level 0

RAID Level 1

RAID Level 3

RAID Level 5

RAID Level 6

RAID Level 10

RAID 0 + 1 Schematic

RAID 10 Schematic

RAID Level 30

RAID Level 50

RAID Level 51

RAID Level 60

RAID Level 100

Less Relevant RAID

RAID Level 2

RAID Level 4

RAID Level 7

Standard RAID Storage Efficiency


SSD Longevity

Hybrid RAID: SSD and HDD

SSD References

Post-RAID Environment

Big Data: The Issue

Data Loss Overview

Big Data: Solutions?


Erasure Codes

RAID Successor Qualifications

EC Overview

EC Recovery Scope

Self-Healing Storage



High Availability: Theory

Some Math

Guide to Reliability Graphs

Probability Density Function

Cumulative Distribution Function

Availability Probabilities

Lusser’s Law

Availability Concepts

Hardware Reliability: The Bathtub Curve

Software Reliability: The Bathtub Curve

Simple Math of Availability



Mean Time between Failures

Mean Time to Repair

Online Availability Tool

Availability Equation I: Time Factors in an Outage

Availability Equation II

Effect of Redundant Blocks on Availability

Parallel (Redundant) Components

Two Parallel Blocks: Example

Combinations of Series and Parallel Blocks

Complex Systems

System Failure Combinations

Complex Systems Solution Methods

Real-Life Example: Cisco Network Configuration

Configuration A

Configuration B

Summary of Block Considerations

Sample Availability Calculations versus Costs

Calculation 1: Server Is 99% Available

Calculation 2: Server Is 99.99% Available

Availability: MTBFs and Failure Rate

Availability Factors

Planned versus Unplanned Outages

Planned Downtime: Planned Downtime Breakdown

Unplanned Downtime

Security: The New Downtime

Disasters: Breakdown of Causes

Power: Downtime Causes

Power Issues Addenda

So What?

External Electromagnetic Radiation Addendum

Power: Recovery Timescales for Uninterruptible Power Supply

Causes of Data Loss

Pandemics? Disaster Waiting to Happen?

Disasters: Learning the Hard Way

Other Downtime Gotchas

Downtime Gotchas: Survey Paper

Downtime Reduction Initiatives

Low Impact Outages

Availability: A Lesson in Design

Availability: Humor in an Outage—Part I

Availability: Humor in an Outage—Part II

So What?

Application Nonavailability

Traditional Outage Reasons

Modern Outage Reasons


High Availability: Practice

Central Site

Service Domain Concept

Sample Domain Architecture

Planning for Availability—Starting Point

The HA Design Spectrum

Availability by Systems Design/Modification

Availability by Engineering Design

Self-Healing Hardware and Software

Self-Healing and Other Items

Availability by Application Design: Poor Application Design

Conventional Programs

Web Applications

Availability by Configuration




Operating System


Availability by Outside Consultancy

Availability by Vendor Support

Availability by Proactive Monitoring

Availability by Technical Support Excellence

Availability by Operations Excellence

First Class Runbook

Software Level Issues

System Time

Performance and Capacity

Data Center Efficiency

Availability by Retrospective Analysis

Availability by Application Monitoring

Availability by Automation

Availability by Reactive Recovery

Availability by Partnerships

Availability by Change Management

Availability by Performance/Capacity Management

Availability by Monitoring

Availability by Cleanliness

Availability by Anticipation

Predictive Maintenance

Availability by Teamwork

Availability by Organization

Availability by Eternal Vigilance

Availability by Location

A Word on Documentation

Network Reliability/Availability

Protocols and Redundancy

Network Types

Network Outages

Network Design for Availability

Network Security

File Transfer Reliability

Network DR

Software Reliability

Software Quality

Software: Output Verification

Example 1

Example 2

Example 3

Software Reliability: Problem Flow

Software Testing Steps

Software Documentation

Software Testing Model

Software Reliability—Models

The Software Scenario

SRE Models

Model Entities

SRE Models: Shape Characterization

SRE Models: Time-Based versus Defect-Based

Software Reliability Growth Model

Software Reliability Model: Defect Count

Software Reliability: Standard IEEE 1633–2008

Software Reliability: Hardening

Software Reliability: Installation

Software Reliability: Version Control

Software: Penetration Testing

Software: Fault Tolerance

Software Error Classification



Reliability Properties of Software

ACID Properties

Two-Phase Commit

Software Reliability: Current Status

Software Reliability: Assessment Questions

Software Universe and Summary

Subsystem Reliability

Hardware Outside the Server

Disk Subsystem Reliability

Disk Subsystem RAS

Tape Reliability/RAS

Availability: Other Peripherals

Attention to Detail

Liveware Reliability


Be Prepared for Big Brother!

High Availability: SLAs, Management, and Methods


Preliminary Activities

Pre-Production Activities

BC Plan

BC: Best Practice

Management Disciplines

Service Level Agreements

SLA Introduction

SLA: Availability and QoS

Elements of SLAs

Types of SLA

Potential Business Benefits of SLAs

Potential IT Benefits of SLA

IT Service Delivery

SLA: Structure and Samples

SLA: How Do We Quantify Availability?

SLA: Reporting of Availability

Reneging on SLAs

HA Management: The Project

Start-Up and Design Phase

The Management Flow

The Management Framework

Project Definition Workshop

Outline of the PDW

PDW Method Overview

Project Initiation Document

PID Structure and Purpose

Multistage PDW

Delphi Techniques and Intensive Planning

Delphi Technique

Delphi: The Steps

Intensive Planning

FMEA Process

FMEA: An Analogy

FMEA: The Steps

FMECA = FMEA + Criticality

Risk Evaluation and Priority: Risk Evaluation Methods

Component Failure Impact Analysis

CFIA Development—A Walkthrough and Risk Analysis

CFIA Table: Schematic

Quantitative CFIA

CFIA: Other Factors

Management of Operations Phase

Failure Reporting and Corrective Action System


FRACAS: Steps for Handling Failures

HA Operations: Supporting Disciplines

War Room

War Room Location


Change/Configuration Management

Change Management and Control: Best Practice

Change Operations

Patch Management

Performance Management



Security Management

Security: Threats or Posturing?

Security: Best Practice

Problem Determination

Problems: Short Term

Problems: After the Event

Event Management

Fault Management

Faults and What to Do about Them

System Failure: The Response Stages

HA Plan B: What’s That?

Plan B: Example I

Plan B: Example II

What? IT Problem Recovery without IT?

Faults and What Not to Do

Outages: Areas for Inaction

Problem Management

Managing Problems

Problems: Best Practice

Help Desk Architecture and Implementation

Escalation Management

Resource Management

Service Monitors

Availability Measurement

Monitor Layers

System Resource Monitors

Synthetic Workload: Generic Requirements

Availability Monitors

General EUE Tools

Availability Benchmarks

Availability: Related Monitors

Disaster Recovery

The Viewpoint Approach to Documentation



High Availability: Vendor Products

IBM Availability and Reliability

IBM Hardware



IBM Series x

IBM Clusters

z Series Parallel Sysplex

Sysplex Structure and Purpose

Parallel Sysplex Schematic

IBM: High Availability Services

IBM Future Series/System

Oracle Sun HA

Sun HA

Hardware Range

Super Cluster

Oracle Sun M5-32

Oracle HA Clusters

Oracle RAC 12c

Hewlett-Packard HA

HP Hardware and Software




Servers: Integrity Servers

HP NonStop Integrity Servers

NonStop Architecture and Stack

NonStop Stack Functions

Stratus Fault Tolerance

Automated Uptime Layer

ActiveService Architecture

Other Clusters

Veritas Clusters (Symantec)

Supported Platforms

Databases, Applications, and Replicators

Linux Clusters


Oracle Clusterware

SUSE Linux Clustering

Red Hat Linux Clustering

Linux in the Clouds

Linux HPC HA


Carrier Grade Linux

VMware Clusters

The Web and HA

Service Availability Software

Continuity Software

Continuity Software: Services


High Availability: Transaction Processing and Databases

Transaction Processing Systems

Some TP Systems: OLTP Availability Requirements

TP Systems with Databases

The X/Open Distributed Transaction Processing Model: XA and XA+ Concepts


Relational Database Systems

Some Database History


SQL Server and HA

Microsoft SQL Server 2014 Community Technology Preview 1

SQL Server HA Basics

SQL Server AlwaysOn Solutions

Failover Cluster Instances

Availability Groups

Database Mirroring

Log Shipping


Oracle Database and HA


Oracle Databases

Oracle 11g (R2.1) HA

Oracle 12c

Oracle MAA

Oracle High Availability Playing Field


MySQL: HA Features

MySQL: HA Services and Support

IBM DB2 Database and HA

DB2 for Windows, UNIX, and Linux

DB2 HA Feature

High Availability DR

DB2 Replication: SQL and Q Replication

DB2 for i

DB2 10 for z/OS

DB2 pureScale

InfoSphere Replication Server for z/OS

DB2 Cross Platform Development

IBM Informix Database and HA

Introduction (Informix 11.70)

Availability Features

Fault Tolerance

Informix MACH 11 Clusters

Connection Manager

Informix 12.1

Ingres Database and HA

Ingres RDBMS

Ingres High Availability Option

Sybase Database and HA

Sybase High Availability Option


Use of SAP ASE

Vendor Availability

ASE Cluster Requirements

Business Continuity with SAP Sybase


NonStopSQL Database



High Availability: The Cloud and Virtualization


What Is Cloud Computing?

Cloud Characteristics

Functions of the Cloud

Cloud Service Models

Cloud Deployment Models

Resource Management in the Cloud

SLAs and the Cloud

Cloud Availability and Security

Cloud Availability

Cloud Outages: A Review

Aberdeen: Cloud Storage Outages

Cloud Security


What Is Virtualization?

Full Virtualization


Security Risks in Virtual Environments

Vendors and Virtualization



VMware vSphere, ESX, and ESXi

Microsoft Hyper-V

HP Integrity Virtual Machines

Linux KVM

Solaris Zones


Virtualization and HA

Virtualization Information Sources


Disaster Recovery Overview

DR Background

A DR Lesson from Space

Disasters Are Rare . . . Aren’t They?

Key Message: Be Prepared

DR Invocation Reasons: Forrester Survey

DR Testing: Kaseya Survey

DR: A Point to B Point



Backup Modes

Cold (Offline)

Warm (Online)

Hot (Online)

Backup Types

Full Backup

Incremental Backup

Multilevel Incremental Backup

Differential Backup

Synthetic Backup

Progressive Backup

Data Deduplication

Data Replication

Replication Agents

Asynchronous Replication

Synchronous Replication

Heterogeneous Replication

Other Types of Backup

DR Recovery Time Objective: WAN Optimization

Backup Product Assessments

Virtualization Review

Gartner Quadrant Analysis

Backup/Archive: Tape or Disk?

Bit Rot

Tape Costs

DR Concepts and Considerations

The DR Scenario

Who Is Involved?

DR Objectives

Recovery Factors

Tiers of DR Availability

DR and Data Tiering

A Key Factor

The DR Planning Process

DR: The Steps Involved

In-House DR

DR Requirements in Operations





DR Cost Considerations

The Backup Site

Third-Party DR (Outsourcing)

DR and the Cloud

HA/DR Options Described

Disaster Recovery Templates



Appendix 1

Reliability and Availability: Terminology


Appendix 2

Availability: MTBF/MTTF/MTTR Discussion

Interpretation of MTTR

Interpretation of MTTF

Interpretation of MTBF

MTTF and MTBF—The Difference

MTTR: Ramp-Up Time

Serial Blocks and Availability—NB

Typical MTBF Figures

Gathering MTTF/MTBF Figures

Outage Records and MTTx Figures

MTTF and MTTR Interpretation

MTTF versus Lifetime

Some MTxx Theory


Final Word on MTxx

Forrester/Zenoss MTxx Definitions


Appendix 3

Your HA/DR Route Map and Kitbag

Road to HA/DR

The Stages

A Short DR Case Study

HA and DR: Total Cost of Ownership

TCO Factors

Cloud TCO

TCO Summary

Risk Assessment and Management

Who Are the Risk Stakeholders?

Where Are the Risks?

How Is Risk Managed?

Availability: Project Risk Management

Availability: Deliverables Risk Management

Deliverables Risk Management Plan: Specific Risk Areas

The IT Role in All This


Appendix 4

Availability: Math and Other Topics

Lesson 1: Multiplication, Summation, and Integration Symbols

Mathematical Distributions

Lesson 2: General Theory of Reliability and Availability

Reliability Distributions

Lesson 3: Parallel Components (Blocks)

Availability: m-from-n Components

m-from-n Examples

m-from-n Theory

m-from-n Redundant Blocks

Active and Standby Redundancy


Summary of Redundancy Systems

Types of Redundancy

Real m-from-n Example Math of m-from-n Configurations

Standby Redundancy

An Example of These Equations

Online Tool for Parallel Components: Typical Calculation

NB: Realistic IT Redundancy

Overall Availability Graphs

Try This Availability Test

Lesson 4: Cluster Speedup Formulae

Amdahl’s Law

Gunther’s Law

Gustafson’s Law

Amdahl versus Gunther

Speedup: Sun-Ni Law

Lesson 5: Some RAID and EC Math

RAID Configurations

Erasure Codes

Lesson 6: Math of Monitoring

Ping: Useful Aside

Ping Sequence Sample

Lesson 7: Software Reliability/Availability


Software Reliability Theory

The Failure/Defect Density Models

Lesson 8: Additional RAS Features

Upmarket RAS Features


I/O Subsystem

Memory Availability

Fault Detection and Isolation

Clocks and Service Processor


Predictive Failure Analysis

Lesson 9: Triple Modular Redundancy

Lesson 10: Cyber Crime, Security, and Availability

The Issue

The Solution

Security Analytics

Zero Trust Security Model

Security Information Event Management

Security Management Flow

SIEM Best Practices

Security: Denial of Service

Security: Insider Threats

Security: Mobile Devices (BYOD)

BYOD Security Steps

Security: WiFi in the Enterprise

Security: The Database

Distributed DoS

Security: DNS Servers

Cost of Cyber Crime

Cost of Cyber Crime Prevention versus Risk

Security Literature


Appendix 5

Availability: Organizations and References

Reliability/Availability Organizations

Reliability Information Analysis Center

Uptime Institute

IEEE Reliability Society

Storage Networking Industry Association

Availability Digest

Service Availability Forum

Carnegie Mellon Software Engineering Institute

ROC Project—Software Resilience

Business Continuity Today

Disaster Recovery Institute

Business Continuity Institute

Information Availability Institute

International Working Group on Cloud Computing Resiliency

TMMi Foundation

Center for Software Reliability


Security Organizations

Security? I Can’t Be Bothered

Cloud Security Alliance

CSO Online


Cyber Security and Information Systems IAC

Center for International Security and Cooperation

Other Reliability/Security Resources

Books, Articles, and Web sites

Major Reliability/Availability Information Sources

Other Information Sources

Appendix 6

Service Management: Where Next?

Information Technology Infrastructure Library

ITIL Availability Management

Service Architectures


Availability Architectures: HA Documentation

Clouds and Architectures

Appendix 7


About the Author

Dr. Terry Critchley is a retired IT consultant living near Manchester in the United Kingdom. He studied physics at Manchester University (using some of Rutherford's original equipment!), gaining an Honours degree in physics and, 5 years later, a PhD in nuclear physics. He then joined IBM as a Systems Engineer and spent 24 years there in a variety of accounts and specializations, and later spent 3 years at Oracle. Terry joined his last company, Sun Microsystems, in 1996 and left in 2001, after planning and running the Sun European Y2000 education, and then spent a year at a major UK bank.

In 1993 he initiated and coauthored a book on Open Systems for the British Computer Society (Open Systems: The Reality) and has recently written this book, High Availability IT Services. He is also mining swathes of his old material for his next book, Service Performance and Management.

Subject Categories

BISAC Subject Codes/Headings:
BUSINESS & ECONOMICS / Production & Operations Management
COMPUTERS / Information Technology
COMPUTERS / Networking / General