Thursday, December 22, 2005

Reliability and Liability

37signals posted on 6 Dec 2005 about why reliability wasn't critical for web service companies. Two weeks later several web service companies, SixApart, Del.icio.us and even the granddaddy, SFdc, had outages. Bloglines' service has been spotty. The frustration that these failures have caused is evident from the moaning across the blogosphere. The SFdc failure has caused businesses to lose money.

Reliability matters in operations. Even for web service companies.

But debates on reliability are clouded by misconception and misunderstanding. We need to understand why reliability numbers can be misleading and how reliability is achieved.

Reliability as a Number
Reliability is often specified by a wonderful percentage: 98...99...99.9...99.98 and so on. But what does the number say, and what doesn't it say? Reliability in its simplest form is how many hours within 100 the service is available. Therefore 99% means that for 1 hour in every 100 the system is unavailable. What the number does not tell you is how that hour is spread out across the 100.
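
To make the arithmetic concrete, here is a minimal sketch that converts an availability percentage into the downtime it implies. The function name `downtime_hours` and the 8,760-hour year are illustrative assumptions, not something taken from any vendor's service agreement.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (an assumption)


def downtime_hours(availability_percent, period_hours):
    """Hours of unavailability implied by an availability percentage."""
    return period_hours * (100.0 - availability_percent) / 100.0


# The percentages quoted above, expressed as downtime.
for pct in (98.0, 99.0, 99.9, 99.98):
    per_100 = downtime_hours(pct, 100)
    per_year = downtime_hours(pct, HOURS_PER_YEAR)
    print(f"{pct:6.2f}%  {per_100 * 60:6.1f} min per 100 h  {per_year:7.2f} h per year")
```

The output says how much downtime a figure allows, and nothing about when or how that downtime arrives.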

What reliability numbers do not tell you is how that 1 hour in 100 came about: the quality of the reliability. Consider two scenarios. In the first, every 50 hours the system is taken down for 30 minutes for preventive maintenance. The time is scheduled and announced well in advance. In the second, that 1 hour is randomly spread across the 100 hours by failures and random firefighting. Users receive no warning of the outages and they can come at any time.

The first scenario is a lot less frustrating to users than the second, and yet they have the same reliability number. Reliability specified as a number alone is next to useless and can be grossly misleading. Before you can use a reliability number you have to understand how it occurred. You need to understand the quality of the operations overall.
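
The two scenarios can be put side by side in a small sketch. The outage durations in the second log are made-up values chosen only so that they add up to one hour; the `summarize` helper and its field names are hypothetical.

```python
WINDOW_MINUTES = 100 * 60  # the 100-hour window, in minutes

# Each outage is (duration_in_minutes, announced_in_advance).
scheduled = [(30, True), (30, True)]  # maintenance every 50 hours
# Hypothetical unplanned failures that also total 60 minutes.
unplanned = [(5, False), (20, False), (10, False),
             (15, False), (3, False), (7, False)]


def summarize(outages, window_minutes):
    """Return the headline availability plus what the headline hides."""
    down = sum(duration for duration, _ in outages)
    return {
        "availability_percent": 100 * (window_minutes - down) / window_minutes,
        "outages": len(outages),
        "unannounced": sum(1 for _, announced in outages if not announced),
    }


for name, log in (("scheduled", scheduled), ("unplanned", unplanned)):
    print(name, summarize(log, WINDOW_MINUTES))
```

Both logs give the same availability figure. Only the count of unannounced outages separates them, and that count is exactly what a bare percentage leaves out.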

Achieving Reliability
There are two methods to achieving reliability: the brute force method and the smart method.

Using the brute force method to achieve reliability is expensive. Each extra step in reliability is more expensive than the last. Each step increases the complexity of the overall system, which of course increases the risk of something going wrong. To make matters worse, not only does the risk of failure go up, the risk of a spiral into catastrophe also increases. Not a nice combination.

The brute force method is often used because it is easy to understand. But for many web service companies it is overkill and too expensive.

The smart method is rooted in engineering risk analysis: identifying the types of failure, the probability of failure and the consequence of failure. The various failures are ranked by risk: a combination of the probability of failure and the consequence of failure. These failures can then be dealt with from the riskiest to the least risky.
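
As a rough illustration of the ranking step, the sketch below scores each failure as probability multiplied by consequence, which is one common way to combine the two and is an assumption here rather than the only option. The failure modes, probabilities and costs are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    probability: float  # assumed chance of occurring in a year, 0..1
    consequence: float  # assumed cost if it occurs, in dollars

    @property
    def risk(self) -> float:
        # Risk taken as expected loss: probability times consequence.
        return self.probability * self.consequence


def rank_by_risk(failure_modes):
    """Order failure modes from riskiest to least risky."""
    return sorted(failure_modes, key=lambda f: f.risk, reverse=True)


# Hypothetical entries; real figures come from your own operations.
modes = [
    FailureMode("disk failure", 0.20, 5_000),
    FailureMode("database corruption", 0.02, 200_000),
    FailureMode("network outage", 0.10, 20_000),
]

for mode in rank_by_risk(modes):
    print(f"{mode.name:20s} risk = {mode.risk:8.0f}")
```

With these invented numbers the most likely failure, the disk, ranks last, because its consequence is small. That is the kind of result the ranking exists to surface.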

Risk analysis reduces the cost of reliability by giving the users an objective method for identifying where they will get the biggest bang for their buck. But it is a continual process. It is not something that you do once, place into a drawer and forget about. The risk analysis must be done continuously as the risks and likelihood change as the environment, technology and business evolve.

There is going to be a lot of resistance to using risk analysis techniques in web service companies, if nothing else simply because it challenges the current way of doing business. But for the web service companies to survive they are going to have to embrace and internalise engineering risk management.

Risk management techniques were developed to address the liability issues that engineering firms faced. They were hard lessons, but liability forced the engineering companies to develop better operations. Perhaps it is time for the web service companies to be liable for the quality of their service.


1 comment:

Marc Steffen said...

Hi, thanks for posting this