How to set up highly available services
Almost every company in the world relies on some application today. If such an application fails, the consequences can vary: from a few unpleasant error messages that don't affect your work to a complete disaster followed by thousands of phone calls from angry customers. Depending on the possible effects of an application failure, we should design the system to reach the highest availability possible at a reasonable price.
In the last few weeks we've run into this problem. We have our application/service running on an application server accessible from the internet, but we would like to make it more available. So I've searched the internet, and now I'll try to share my knowledge and experience with you.
A little bit of theory
Let's start with a little theoretical background. High availability systems ensure that services are available during a contractual measurement period. If a user cannot access the system, it is said to be unavailable, and periods when a system is unavailable are called downtime.
Availability is usually expressed as a percentage of uptime in a given year, although service level agreements often refer to monthly downtime or availability instead.
| Availability % | Downtime per year | Downtime per month* | Downtime per week |
|----------------|-------------------|---------------------|-------------------|
| 90% ("one nine") | 36.5 days | 72 hours | 16.8 hours |
| 95% | 18.25 days | 36 hours | 8.4 hours |
| 97% | 10.96 days | 21.6 hours | 5.04 hours |
| 98% | 7.30 days | 14.4 hours | 3.36 hours |
| 99% ("two nines") | 3.65 days | 7.20 hours | 1.68 hours |
| 99.5% | 1.83 days | 3.60 hours | 50.4 minutes |
| 99.8% | 17.52 hours | 86.4 minutes | 20.16 minutes |
| 99.9% ("three nines") | 8.76 hours | 43.2 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.6 minutes | 5.04 minutes |
| 99.99% ("four nines") | 52.56 minutes | 4.32 minutes | 1.01 minutes |
| 99.999% ("five nines") | 5.26 minutes | 25.9 seconds | 6.05 seconds |
| 99.9999% ("six nines") | 31.5 seconds | 2.59 seconds | 0.605 seconds |
| 99.99999% ("seven nines") | 3.15 seconds | 0.259 seconds | 0.0605 seconds |

\* Assuming a 30-day month.
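These downtime figures follow from a simple calculation: the allowed downtime is the length of the period multiplied by (1 − availability). A quick sketch, assuming a 365-day year and a 30-day month as the table does:

```python
# Compute allowed downtime for a given availability percentage.
# Assumes a 365-day year and a 30-day month, matching the table above.

def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Seconds of allowed downtime in a period of `period_days` days."""
    return period_days * 24 * 3600 * (1 - availability_pct / 100)

# 99.9% ("three nines") over a year: about 8.76 hours
print(downtime_seconds(99.9, 365) / 3600)
# 99.999% ("five nines") over a month: about 25.9 seconds
print(downtime_seconds(99.999, 30))
```

Note how every extra "nine" cuts the allowed downtime by a factor of ten, which is why each additional nine tends to cost disproportionately more to achieve.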
Our system architecture
Our application has a simple structure. The following figure shows its design:
This simple web application consists of an application server, where the logic resides, and a database server used as data storage. The application server is connected to the internet so that users can access its services via their browsers. We use Tomcat as the application server and PostgreSQL as the database server. For simplicity I've omitted Apache, which serves requests for static content and forwards the remaining requests to the application server.
Even in such a simple setup there are a lot of components that could fail and prevent users from using our services. We'll skip problems with the client computer, which is out of our reach; there is no way we can handle those troubles.
The first thing we should consider is our internet connection: which internet provider we are using and what availability it guarantees. In most cases server machines are placed with a server hosting company. If we choose this company wisely, there is a high probability that we won't face connection problems in the future and can rely on the connection we already have. However, it is a good idea to think about having a backup server placed with a different hosting company in case any problem occurs.
Our responsibility basically starts with the server itself and all the applications running on it: in this case, the application server and the database.
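For completeness, the omitted Apache layer typically looks something like the following mod_proxy sketch. The hostnames, paths, and ports here are illustrative assumptions, not our actual configuration:

```apache
# Illustrative Apache httpd virtual host (hostnames/ports are assumptions).
# Static content is served directly from disk; application requests
# are forwarded to the Tomcat HTTP connector.
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/static

    # Forward application requests to Tomcat
    ProxyPass        /app http://localhost:8080/app
    ProxyPassReverse /app http://localhost:8080/app
</VirtualHost>
```

Splitting static and dynamic traffic this way keeps cheap requests off the application server, which matters once we start thinking about capacity during failover.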
High availability (HA) is often achieved by harnessing redundant computers in groups or clusters. High-availability clustering can detect hardware/software faults and immediately restart the application on another system without any administrative intervention. This process is known as failover.
HA clusters usually use a heartbeat private network connection which is used to monitor the health and status of each node in the cluster. One subtle but serious condition all clustering software must be able to handle is split-brain, which occurs when all of the private links go down simultaneously, but the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Having duplicate instances of services may cause data corruption on the shared storage.
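To make the heartbeat idea concrete, here is a minimal sketch (purely illustrative, not the software we use) of how a node might decide that its peer has failed based on the last heartbeat it received:

```python
# Minimal heartbeat failure detector (illustrative sketch, not production code).
# A node records the timestamp of each heartbeat received from its peer;
# if no heartbeat arrives within `timeout` seconds, the peer is presumed dead.

class HeartbeatMonitor:
    def __init__(self, timeout: float = 5.0):
        self.timeout = timeout
        self.last_seen = None  # timestamp of the last heartbeat received

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now`."""
        self.last_seen = now

    def peer_alive(self, now: float) -> bool:
        """True if a heartbeat arrived within the timeout window."""
        return self.last_seen is not None and (now - self.last_seen) <= self.timeout

monitor = HeartbeatMonitor(timeout=5.0)
monitor.beat(now=100.0)
print(monitor.peer_alive(now=103.0))  # heartbeat 3 s ago: peer looks alive
print(monitor.peer_alive(now=110.0))  # 10 s of silence: candidate for failover
```

Note that a lost heartbeat alone cannot distinguish a dead peer from a broken link, which is exactly the split-brain problem: real clustering software adds quorum or fencing on top of this simple timeout logic.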
The most common size for an HA cluster is a two-node cluster, since that is the minimum required to provide redundancy, but many clusters consist of many more, sometimes dozens of nodes. Such configurations can sometimes be categorized into one of the following models:
- Active/active — Traffic intended for the failed node is either passed on to an existing node or load balanced across the remaining nodes. This is usually only possible when the nodes utilize a homogeneous software configuration.
- Active/passive — Provides a fully redundant instance of each node, which is only brought online when its associated primary node fails. This configuration typically requires the most extra hardware.
- N+1 — Provides a single extra node that is brought online to take over the role of the node that has failed. In the case of heterogeneous software configuration on each primary node, the extra node must be universally capable of assuming any of the roles of the primary nodes it is responsible for. This normally refers to clusters which have multiple services running simultaneously; in the single service case, this degenerates to active/passive.
- N+M — In cases where a single cluster is managing many services, having only one dedicated failover node may not offer sufficient redundancy. In such cases, more than one (M) standby servers are included and available. The number of standby servers is a tradeoff between cost and reliability requirements.
- N-to-1 — Allows the failover standby node to become the active one temporarily, until the original node can be restored or brought back online, at which point the services or instances must be failed-back to it in order to restore high availability.
- N-to-N — A combination of active/active and N+M clusters, N to N clusters redistribute the services, instances or connections from the failed node among the remaining active nodes, thus eliminating (as with active/active) the need for a ‘standby’ node, but introducing a need for extra capacity on all active nodes.
HA clusters usually utilize all available techniques to make the individual systems and shared infrastructure as reliable as possible. These include:
- Disk mirroring so that failure of internal disks does not result in system crashes.
- Redundant network connections so that single cable, switch, or network interface failures do not result in network outages.
- Redundant storage area network or SAN data connections so that single cable, switch, or interface failures do not lead to loss of connectivity to the storage (this would violate the share-nothing architecture).
- Redundant electrical power inputs on different circuits, usually both or all protected by uninterruptible power supply units, and redundant power supply units, so that single power feed, cable, UPS, or power supply failures do not lead to loss of power to the system.
These features help minimize the chances that the clustering failover between systems will be required. In such a failover, the service provided is unavailable for at least a little while, so measures to avoid failover are preferred.
As we want to improve our availability, we've decided to add a second (passive) node to be used in case of primary node failure; that is, an active/passive configuration. This second node will have the same software configuration as the primary one: the application server and the database will be installed there too. RAID will also be used for the disks in both nodes.
In the next posts I'll try to explain how we handle failover and data synchronization between the nodes: which software we use, why, and what our experiences are.
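One common way to implement such an active/passive pair is a floating (virtual) IP address that moves to the standby node on failure. As a hedged illustration only (keepalived is one tool for this, not necessarily the one we chose; the interface name, router id, and addresses below are assumptions), the primary node's configuration could look like:

```
# Illustrative keepalived configuration for the primary (active) node.
# The standby node would use "state BACKUP" and a lower priority.
vrrp_instance VI_1 {
    state MASTER
    interface eth0            # NIC carrying the VRRP advertisements (assumed)
    virtual_router_id 51
    priority 100              # the standby would use e.g. 90
    advert_int 1              # advertise every second
    virtual_ipaddress {
        192.0.2.10            # the floating service IP (example address)
    }
}
```

With a setup like this, clients always connect to the floating IP, so a failover does not require any DNS or client-side changes.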