[GLLUG] Why RAID 5 stops working in 2009

charles at bityard.net
Wed Mar 25 17:48:13 EDT 2009


On Wed, March 25, 2009 3:36 pm, Peter Smith wrote:
> I mentioned this article/idea at the post-meeting activities last week.
> The original post is at http://blogs.zdnet.com/storage/?p=162 from July
> 18, 2007. A nice commentary on/against it is at
> http://dansdata.blogsome.com/2008/10/23/death-of-raid-predicted-film-at-11/
>
> The basic premise is thus: The unrecoverable read error (URE) rate for
> SATA drives is generally documented as one error per 10^14 bits read,
> or about 12 TB. When we hit 2 TB drives, there's a problem. With a disk
> failure in a 7-drive RAID 5, you're left with six 2 TB drives to
> rebuild the replaced 'dead' drive. That's about 12 TB of reads. So, um,
> that means you'll have a URE while recovering, and the recovery will
> shut down and tell you to restore from backups.
>
> So you go to RAID 6, which becomes the new RAID 5. :) For a while,
> anyway, and it's REQUIRED just to stay safe from a single disk failure.
>
> That's the premise. I found the commentary AFTER the meeting, and it
> resolved the one problem *I* had with the article (that a failure rate
> of one in 10^14 over a sample of 10^14 bits isn't a 100% chance of
> failure), which I hadn't gotten around to quantifying with math.
>
> But, I thought I'd post it anyhow, since I offered to do so. :) Discuss
> among yourselves.

If I understand the article correctly, the author relies on the
assumption that as disk capacities rise, the failure rate per GB (or
however you care to measure capacity) will stay the same. Growing
capacities will eventually catch up with that fixed error rate at some
unspecified point in 2009, whereupon the earth explodes and takes all of
known civilization with it, without the help of even a single Vogon.
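
For what it's worth, Peter's "isn't 100%" aside is easy to put a number
on. A quick sketch in Python, using the article's own figures and
assuming read errors are independent (which real drives don't strictly
honor):

    # Probability of hitting at least one unrecoverable read error (URE)
    # while rebuilding a degraded 7-drive RAID 5 of 2 TB disks, given the
    # spec-sheet rate of one URE per 1e14 bits read.
    p_ure_per_bit = 1e-14                # vendor-quoted URE rate for SATA
    drive_bytes = 2e12                   # one 2 TB drive
    surviving = 6                        # 7-drive array minus the dead disk
    bits_to_read = surviving * drive_bytes * 8   # 9.6e13 bits

    p_rebuild_fails = 1 - (1 - p_ure_per_bit) ** bits_to_read
    print("rebuild hits a URE with probability ~%.0f%%"
          % (100 * p_rebuild_fails))    # ~62%, not a certainty

So roughly three rebuilds in five hit a URE under the article's
assumptions--bad odds, but not the guaranteed failure the headline
suggests.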

Even if you grant for the moment that measuring disk failures in terms
of disk size is a good idea, you also need to take into consideration
that the failure rate, measured this way, tends to fall in inverse
proportion to disk capacity. Hard disk manufacturers are keenly
interested in keeping the average failure rate of their products low--or
at least keeping it from rising--on a year-by-year basis.
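
To put a number on that (the 1e-15 figure below is what enterprise-class
drives were already being quoted at; it's my illustration, not the
article's, and it carries the same independence assumption as above):

    # Same rebuild, but with the URE rate improving alongside capacity.
    bits_to_read = 6 * 2e12 * 8          # six surviving 2 TB drives

    for rate in (1e-14, 1e-15):
        p_fail = 1 - (1 - rate) ** bits_to_read
        print("URE rate %.0e/bit -> rebuild failure ~%.0f%%"
              % (rate, 100 * p_fail))
    # 1e-14 -> ~62%, 1e-15 -> ~9%

A tenfold improvement in the quoted URE rate takes the scary rebuild
scenario from "more likely than not" down to about one in eleven.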

Thus, I would bet that if you found in 1999 that 5% of all new disks
failed within one year (for example), roughly the same percentage would
hold in 2009. If the failure rate kept rising, the company's expenses
would increase, the accountants would get all tense, and shareholders
would start pulling out. That's a pretty strong incentive for management
to keep the failure rate steady, even if it means beating down the
marketing department every now and again.

And honestly, when you start making RAID 5 arrays out of 6 or 7 disks,
you're already on shaky ground as far as redundancy is concerned. That has
always been the case.
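
To illustrate, reusing the made-up 5% annual failure rate from above and
again assuming independent failures (correlated drive batches and shared
power or enclosures routinely violate this):

    # Chance that at least one drive in an array dies within a year,
    # given a 5% annual failure rate per drive. RAID 5 only survives
    # the first such failure.
    afr = 0.05                           # hypothetical per-drive rate

    for n_drives in (3, 5, 7):
        p_any = 1 - (1 - afr) ** n_drives
        print("%d drives: ~%.0f%% chance of some failure per year"
              % (n_drives, 100 * p_any))
    # 3 -> ~14%, 5 -> ~23%, 7 -> ~30%

Every spindle you add widens the window in which you're one failure away
from a dead array.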

Charles
-- 
http://bityard.net


