Mad Engineering: Uptime

The length of time that has passed since a computer was booted is a variable that system administrators call "uptime." They want it to be as high as possible, because every time the computer is shut down or rebooted, it is unavailable for the duration. In addition, longer uptimes are more difficult to achieve, so having a large one is a bragging point. Uptimes are measured in time units, starting with seconds and typically running to weeks. (A few people have managed uptimes of years.) Of course, the difficulty is that the administrator must avoid every condition that would end the uptime, from minor ones like crashes that require rebooting to major ones like power loss. (Power loss shuts down the computer unless a backup system takes over quickly.)
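As a small illustration of uptime as a number, here is a sketch that reads the seconds since boot and prints them in larger units. It assumes a Linux machine, where the first field of `/proc/uptime` is the number of seconds since boot; the function names are made up for this example.

```python
from datetime import timedelta


def format_uptime(seconds: float) -> str:
    """Render seconds since boot as 'D days, HH:MM:SS'."""
    delta = timedelta(seconds=int(seconds))
    hours, remainder = divmod(delta.seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{delta.days} days, {hours:02d}:{minutes:02d}:{secs:02d}"


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Assumption: Linux, where the first field of this file is seconds since boot."""
    with open(path) as f:
        return float(f.read().split()[0])


if __name__ == "__main__":
    print(format_uptime(read_uptime_seconds()))
```

On other operating systems the number has to come from somewhere else, but the formatting step stays the same.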
Clients of a system are also fond of referring to uptime as availability, meaning how often the computer is available to do work. The two are not quite the same: if the network is down, the computer can be on, and thus accruing uptime, but not available, because the client can't reach it. Availability is measured in "nines," meaning the number of nines in the percentage of time the computer is providing services. One nine is easy: it means the computer is available 90% of the time. Wikipedia provides a useful chart showing what this means for administrators. More nines means less time down.
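To make the difference concrete, availability can be estimated from the client's side as the fraction of checks that actually got through, whether or not the machine itself was up. The following is only a sketch: the probe results are assumed to be a list of true/false values gathered some other way, and the function names are hypothetical.

```python
import math


def availability(probe_results):
    """Fraction of probes in which the service was reachable (True)."""
    results = list(probe_results)
    if not results:
        raise ValueError("need at least one probe result")
    return sum(1 for ok in results if ok) / len(results)


def nines(avail):
    """Availability expressed as a count of nines: 0.9 -> 1, 0.99 -> 2, and so on."""
    if not 0 <= avail <= 1:
        raise ValueError("availability must be between 0 and 1")
    if avail == 1:
        return math.inf  # no failed probes observed
    return -math.log10(1 - avail)


if __name__ == "__main__":
    # Nine successful probes and one failure: one nine.
    observed = availability([True] * 9 + [False])
    print(observed, round(nines(observed), 3))
```

A machine that stays powered on through a network outage would keep its uptime but still lose probes here, which is the distinction the client cares about.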
Of course, most clients demand five nines, which would mean the system is available for use 99.999% of the time. Clients often don't realize how intricate this has to be: it is more availability than is realistically achievable with one computer, so it requires elaborate backup systems and the ability to switch machines at a moment's notice. Five-nines availability means the administrator gets to take the machine down for only about five minutes a year, covering both planned outages like operating system upgrades and unplanned ones like equipment failure. They had better have backup power, storage, and even duplicate machines ready to go in an instant. If a client is sold a five-nines contract, the system administrator had better be given notice and a large hardware budget!
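The five-minute figure falls straight out of the arithmetic. Here is a minimal sketch of the downtime allowed per year for a given number of nines, assuming a 365-day year (leap years ignored); the function name is invented for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_budget_minutes(nines: int) -> float:
    """Minutes of downtime per year allowed at the given number of nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per year")
```

Under that assumption, five nines works out to roughly 5.26 minutes a year, three nines to about 8.76 hours, and one nine to 36.5 days.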
This is another motivator for virtual, or cloud, computing. Rather than a physical machine run by the administrator, clients are given a simulated computer on the administrator's larger, more powerful one. For the client's purposes, this acts just like a physical computer, but everything is centralized on more expensive, more reliable parts that all the clients share. The simulated computer can also be transferred to an identical machine without this being apparent to anyone using it. This allows for easy maintenance: the virtual machines can be moved to another host so that the physical computer can be shut down and repaired or updated, and no one but the administrator is the wiser. The simulated computers lose no uptime, even though they were transferred to a new environment. The expensive hardware is shared, extra capacity prevents problems, and there's a quick solution to all but the most catastrophic of problems. (With the right equipment, this setup may survive even power failures, but deliberate sabotage, like a rival vandalizing the machine it's running on, could still kill the availability.)
Now, all of this is very expensive, so if you're a hosting company and your client is demanding five nines (or worse, six nines), you had better charge a huge amount of money, or else the arrangement will prove unprofitable, either because the fees paid by the client don't cover your hardware expenses, or because there's an unavoidable outage and you get sued for breaching your contract.
Equipment that increases availability includes:
* RAID 1: An arrangement of hard drives in mirrored pairs, so that the computer survives the loss of up to half its drives, provided no pair loses both. (Failed drives should of course be replaced as soon as possible, and for best results they must be a hot-swappable type that can be replaced while the machine is still running. Yes, you can buy that, but it's more expensive.)
* Uninterruptible power supply: Gives the computer 20 or so minutes of power in the event of a power failure. Those 20 minutes should be used to transfer virtual machines to another site that is not experiencing a power failure, or at least to shut down the machine gracefully.
* "Heartbeat" software: Detects whether a machine is down and, if it is, redirects all references to it to another system. The backup system should be an exact clone of the original machine.
* Virtualization: The client's computer is actually simulated by a larger, more powerful one. The larger machine can simulate many smaller ones, each of which looks from the outside like an individual computer. These virtual machines can be passed to another server without the outside world noticing, but in case there's a problem, this is best done only during non-peak hours like the weekend, or in an emergency. The virtual machines also never have hardware problems of their own, as they have no hardware, and their memory is provided by the host computer.
* Regular maintenance: On certain weekends, transfer the workload from one computer to the other, shut down the first one, take it completely apart, clean out all the dust, upgrade components if necessary, and replace anything that has failed. Then put it back together and test all the subsystems. This ensures that both machines remain clean, fresh, and available. On the next cycle, the other machine is the one taken apart.
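The "heartbeat" item in the list above can be sketched in a few lines. This is not any particular product's behavior, just the general idea: the primary machine reports in regularly, and if it stays silent past a timeout, traffic is pointed at the backup. The class name, the host names, and the five-second timeout are all assumptions made up for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks heartbeats from a primary host and decides which host is active."""

    def __init__(self, primary, backup, timeout_seconds=5.0, clock=time.monotonic):
        self.primary = primary
        self.backup = backup
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_beat = clock()
        self.active = primary

    def beat(self):
        """Record that a heartbeat just arrived from the primary."""
        self._last_beat = self._clock()

    def check(self):
        """Fail over to the backup if the primary has been silent too long."""
        silent_for = self._clock() - self._last_beat
        if self.active == self.primary and silent_for > self.timeout_seconds:
            self.active = self.backup
        return self.active


if __name__ == "__main__":
    monitor = HeartbeatMonitor("primary.example", "backup.example", timeout_seconds=1.0)
    time.sleep(1.5)  # the primary never reports in
    print(monitor.check())
```

Note that this sketch never switches back on its own once it has failed over. That is a deliberate simplification: returning to the original machine is left as a manual step for the administrator.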
A business study showed that three nines, 99.9% availability (just under nine hours of downtime a year), is good enough for most companies, and this costs far, far less.


Friday, June 25, 2010
