VMware Cloud Community
skroesen
Contributor

RAID Group with RAID 6

We just purchased a DAE of 1TB drives for our CX3-20. I am debating with myself about how to set up the RAID groups on these new drives, considering the amount of storage and number of spindles. We will be using this storage for our file server as well as a repository for email via EMC EmailXtender (all running on ESX), so IO should not be extremely high.

I don't see much out there for best practices with RAID 6 on EMC. Looking for info on the number of disks, i.e. go 8+2 or go 13+2. From what I have read, going with a higher number of disks in a RAID group with RAID 6 is OK because you are still protected during the rebuild in the event you lose a second disk. What are the best practices with RAID 6? Wondering if I should just skip RAID 6 and go with (2) 6+1 RAID 5 sets.

16 Replies
bobross
Hot Shot

Well, if you have to use a static array like the CX, I'd stick with RAID-5, as you mentioned: 2 groups of 7 and a spare. The problem with large RAID sets is the time it takes to rebuild a failed disk; if you use a 13+2 RAID-6, for example, you must read 13 TB of data to rebuild 1 TB, and that takes forever. Well, not forever :) but certainly many, many hours. I've seen a 1TB rebuild take 40 hours in my shop. After the rebuild finally finished, we pulled the bad disk, sent it back, and our vendor said "there is nothing wrong with this disk." Great, thanks a lot. We wasted 40 hours for nothing?
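To put rough numbers on that (the rates below are my own ballpark assumptions, not EMC figures), rebuild time is essentially drive capacity divided by whatever effective rebuild rate the array can sustain alongside production I/O:

# Back-of-the-envelope rebuild time; illustrative rates, not vendor specs.
def rebuild_hours(drive_tb, effective_mb_per_s):
    """Hours to reconstruct one drive at a sustained effective rebuild rate."""
    seconds = (drive_tb * 1e12) / (effective_mb_per_s * 1e6)
    return seconds / 3600

print(round(rebuild_hours(1, 80), 1))   # ~3.5 h on an idle array at ~80 MB/s
print(round(rebuild_hours(1, 7), 1))    # ~39.7 h once throttled to ~7 MB/s under load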

Plus, there is another, little-known and rarely discussed problem with large RAID sets: data integrity. The larger the RAID set, the more data must be read during a rebuild, and thus the higher the probability of running into an unrecoverable read error on disk. If you are using cheap, deep SATA, this is especially problematic; chances are 50-50 that if you read that 13 TB, you will have at least one unrecoverable read error and the rebuild will fail.
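The arithmetic behind that 50-50 figure, assuming the commonly quoted desktop-SATA spec of one unrecoverable error per 10^14 bits read (an assumption; check the drive's datasheet):

import math

# Probability of at least one unrecoverable read error (URE) while reading
# tb_read terabytes, using a Poisson approximation of the per-bit error rate.
def p_ure(tb_read, ure_per_bit=1e-14):   # 1e-14 is a typical desktop-SATA spec
    bits_read = tb_read * 1e12 * 8
    return 1 - math.exp(-bits_read * ure_per_bit)

print(round(p_ure(13), 2))   # ~0.65 for the 13 TB read: roughly the coin flip above
print(round(p_ure(6), 2))    # ~0.38 for the 6 TB read of a 6+1 RAID-5 rebuild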

I am told that SCSI has addressed this with the DIF (Data Integrity Field), but the CX unfortunately does not use it.

This leads me to a question back to you; my shop is also considering two choices, namely: 1) buy more disk for our old array (probably the same age as your CX3, i.e. 3 years) or 2) buy a new small array. Choice 2 is not cheap, but choice 1 is also not cheap because our array vendor is charging us out the wazoo for maintenance/service for its 4th year.

How did you determine what to do, outside of just the math? Many thanks in advance.

skroesen
Contributor

Thanks for your reply, helpful information. I am leaning toward the 6+1 with a HS. What you mention is a real problem when you get into these huge SATA disks... you have to be really careful not to dig your own hole and jump in when setting up this storage!

Our CX is about 2 years old now... maybe 2.5. We get a pretty substantial discount on maintenance since we are a government account, but have no idea what they are going to hit us with after the first three years run out. We looked at purchasing another AX4 rather than upgrading the current box, and could not justify the cost this time around, keeping in mind the performance of the CX compared to the AX we were looking at. In actuality, our CX is relatively underutilized, but we like to have that extra performance on the FC RAID sets. We do have an AX4 in one of our remote offices that is almost a year old, and we are looking to purchase another that will be used as a MirrorView target with the CX as the source. We have been happy with that unit, and the pricing is pretty darn attractive with SAS storage when compared to the CX.

williambishop
Expert

I'd have to agree with you. I distrust RAID 5, and ever since RAID 6 became available, I use it on every array that supports it. If recovery takes too long, then you're on the right track: go for a lower number of disks in the volume set. It's worth the extra protection, even if you reduce usable space to do it.

--"Non Temetis Messor."
jose_maria_gonz
Virtuoso

Hi Skroesen,

I have attached below the EMC baseline rebuild rates for RAID 5.

Note that for RAID 6 the figure is about 10% higher than for RAID 5.

I hope it helps.

If you find this or any other answer useful please consider awarding points by marking the answer helpful or correct.

-


El Blog de Virtualizacion en Español

http://josemariagonzalez.es

-


bobross
Hot Shot

Helps? Sorry, not in the real world. This chart gives best-case laboratory behavior. No I/O load? Sure, right, my shop doesn't do any real I/O, ever.

Even under sterile lab conditions, it would still take over 13 hours to rebuild 1 TB from 8+1. Plus, Murphy (and probability) says that the failure will be in the RG that is the most busy over time. This is why I plan on 40-hour rebuilds.

jose_maria_gonz
Virtuoso

Hi Bobross,

During a RAID rebuild you can check the I/O load on your CX3-20 with Navisphere Analyzer, if you bought the software ;)

Rgds,

J-

If you find this or any other answer useful please consider awarding points by marking the answer helpful or correct.

-


El Blog de Virtualizacion en Español

http://josemariagonzalez.es

-


bobross
Hot Shot

Of course. But I can check my CX today w/o Navi to see how slowly it runs :(

One look at my switch port stats tells the tale.

RParker
Immortal

A) that's why a SAN has standby drives: the RAID will rebuild itself and hopefully notify you to replace the failed drive at some point.

B) who cares how long it takes, as long as it finishes

C) you can adjust the speed at which an array gets rebuilt

D) with newer drives the time it takes is relative; rebuilding a 3-disk array takes less time than a 15-disk array, but it's certainly not an intolerable amount of time with the NEW SAS technology.

E) the RAID rebuild requires no downtime, so there is virtually no impact on the VMs' performance (based upon point C)

F) RAID 6's benefit outweighs RAID 5's: it tolerates an additional disk failure (2 drives can fail instead of just 1)

So RAID 6 is the future, and it's not any slower than RAID 5, and for the longer term it's better.

bobross said:

Plus, there is another, little-known and rarely discussed problem with large RAID sets: data integrity. The larger the RAID set, the more data must be read during a rebuild, and thus the higher the probability of running into an unrecoverable read error on disk. If you are using cheap, deep SATA, this is especially problematic; chances are 50-50 that if you read that 13 TB, you will have at least one unrecoverable read error and the rebuild will fail.

Not true. ANY good RAID or SAN will retry the rebuild, and won't progress to the next copy until the last is verified. So that is a weak argument. Drives are reliable these days, even more so than past drives. I remember drives failing pretty much left and right years ago, but now it's a seldom occurrence, and IF a drive fails, it's usually because the drive is older.... The law of averages catches up.. MTBF

RParker
Immortal

bobross said:

Even under sterile lab conditions, it would still take over 13 hours to rebuild 1 TB from 8+1

Precisely the reason why NetApp uses RAID 4: only 1 parity drive in the array (or 2 with double parity). And you are talking about disaster recovery.. IF something happens. I prefer to remain optimistic (and proactive) to keep this sort of thing from happening in the FIRST place: firmware updates, scheduled reboots, software updates, keeping good backups, etc...

Yours sounds like an armageddon situation where you put everything on the SAN, just leave it there, and cross your fingers hoping nothing happens. I prefer to stay active and not wait until it happens. If it happens, OK.. but we are prepared.

So it takes 13 hours.. that's only an issue.. IF you sit there and watch it.. The RAID is still functioning during this time, you aren't waiting on a rebuild to continue using the array.. so 13 hours or 13 days.. if the array is still up, why is the time an issue?

bobross
Hot Shot

RParker said:

Not true. ANY good RAID or SAN will retry the rebuild

Absolutely true. DIF exists for a reason (many reasons, actually, and this is one of them). The rebuild can be restarted, sure, but the situation does not change. If you have enough data, you will hit an unrecoverable read error at some point, statistically speaking. Restarting the rebuild does nothing but postpone the inevitable.

Who cares how long it takes? I do, and many others do as well. Sure, the array is up, but you now have a Hobson's choice; finish the rebuild quickly, by jacking up the rebuild priority and thus penalizing your production apps, or finish the rebuild very slowly, and save your production apps. There is no free lunch in rebuild.

RAID 6? Please. If you have the money for RAID 6, good for you; might as well use RAID 10 and get it over with, since that way you can lose up to half the disks (as long as no mirror pair loses both) and still live.
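For what it's worth, the mirror-pair counting behind that claim (my own sketch, nothing CX-specific):

# Drive-failure tolerance of an N-disk RAID 10 (N/2 mirror pairs) vs. RAID 6.
def raid10_tolerance(n_disks):
    guaranteed = 1              # a second failure could hit the first disk's mirror partner
    best_case = n_disks // 2    # survivable if no mirror pair loses both members
    return guaranteed, best_case

def raid6_tolerance():
    return 2, 2                 # any two failures are survivable, a third never is

print(raid10_tolerance(14))    # (1, 7) for a 14-disk RAID 10
print(raid6_tolerance())       # (2, 2)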

RAID 6 is not the future...it is a variant of the past...self-healing is the future (and is actually here, according to one vendor who has called on us)

STPatrick
Contributor

Long rebuild times are an important matter. Keep in mind that a rebuild stresses the disks to their limits. If you have a large RAID 5 array with, for example, 13 or more disks, the rebuild can take at least 12 hours and up to several days (with large, slow SATA disks). And if even one drive can't stand the stress during the rebuild process (and SATA disks are good candidates for that issue), the whole array is history and you can say hello to your disaster recovery plan :) Because of this scenario I prefer RAID 6, at least with SATA disks. The I/O of RAID 5 is not much better than RAID 6, and the difference in comparison with other RAID levels like 10 is minimal. So if RAID 5 matches your I/O requirements, RAID 6 does too. I have never had a read failure on a disk during a rebuild, not in a SAN array and not in any other array elsewhere... so I can't say anything about that issue.
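To put rough numbers on the RAID 5 vs RAID 6 I/O point, using the textbook write penalties (4 back-end I/Os per random write for RAID 5, 6 for RAID 6, 2 for RAID 10; these are rules of thumb, not CX measurements, and the spindle count and IOPS below are only illustrative assumptions):

# Host IOPS a group can deliver given spindle count, per-spindle IOPS,
# the read fraction of the workload, and the RAID write penalty.
def host_iops(spindles, iops_per_spindle, read_fraction, write_penalty):
    backend = spindles * iops_per_spindle
    return backend / (read_fraction + (1 - read_fraction) * write_penalty)

# 14 SATA spindles at ~80 IOPS each, 80% reads (file server / mail archive):
for level, penalty in [("RAID 5", 4), ("RAID 6", 6), ("RAID 10", 2)]:
    print(level, round(host_iops(14, 80, 0.8, penalty)))
# Prints roughly: RAID 5 700, RAID 6 560, RAID 10 933. The gap shrinks as the
# workload gets more read-heavy, which is the point being made above.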

RParker
Immortal

bobross said:

Who cares how long it takes? I do, and many others do as well. Sure, the array is up, but you now have a Hobson's choice; finish the rebuild quickly, by jacking up the rebuild priority and thus penalizing your production apps, or finish the rebuild very slowly, and save your production apps. There is no free lunch in rebuild.

OK, let's come at this from a different angle. I already pointed out that it's how you configure your RAID: build fast and lose performance, or build slow and maintain performance but the rebuild takes longer. So if performance is key, then fine, keep it slow.

The RAID fails, disk 1 goes bad. It rebuilds. Hour 1: ARRAY is still running... Hour 13: ARRAY is STILL running. So my question is WHERE is the impact? If this happens on a weekend and you are sleeping you won't even know; it's completely transparent... so AGAIN I ask WHY does it matter how long it takes? You come in on Monday and see that the ARRAY rebuilt itself onto the spare and the bad drive is marked failed.

You still haven't made a good argument for WHY you watch and are concerned with the rebuild rate. I understand not wanting to affect performance, but if you set the rebuild to slow and your ARRAY NEVER fails, then during the time of the rebuild, what difference does it make?!?!?

NONE!

That's my point: what have you lost? Nothing. What does it do to your VMs? NOTHING. Where do you intervene during this time? You DON'T. So why should the speed make a difference? It's not making you wait for something to get done, you aren't waiting for the ARRAY to come back up (it's been running the whole time). So tell me, if it takes 100 hours, SO WHAT?!?!? The ARRAY isn't impacted, it's got 1 bad drive, big whoopee.. if the VMs are running, no data is lost, no servers have crashed, nothing is wrong, everyone is happy AND performance doesn't suffer, what is the impact?

Answer: NONE! 1 hour or 100 hours, the ARRAY still goes on, never gets interrupted. This is only a problem IF the ARRAY becomes corrupted.

That's my point. You are making time an issue where there is no issue. A RAID is designed to rebuild at low priority, and yes you CAN increase this priority, which will (or may) adversely affect the RAID performance, but YOUR time IS NOT affected. Your VMs are NOT affected, a drive / array rebuilds, and chances are with properly set up SANs and standby disks you won't even know UNTIL you get a notification that a drive failed.

OK, now what? NOTHING! ALL the RAIDs are still running; they didn't break because of a SINGLE disk.

And let me get this straight: you would RATHER take a chance with ONE drive than keep the sanity of RAID 6 with 2 drives (1 more drive of failure buffer), and you complain about the time it takes to rebuild an ARRAY?

Umm.. that's rather obtuse thinking, you aren't making any sense.. so you sacrifice a little more space.. But you WANT to keep your space AND have fast rebuilds, and ALL I am saying is you can't have both; if you give yourself an extra parity drive in an ARRAY, then when a drive fails you can still have another drive fail and keep the ARRAY going. But you are contradicting yourself... because you say you are worried about one drive?

If you have the right hardware, you shouldn't have to worry at all.. and for my peace of mind, on important servers and RAID ARRAYs, RAID 6 IS the future no matter what you say.. It's the best of ALL worlds, so if you have any confusion about RAID 6 and the future of ARRAYs, you are reading the wrong tech forums; it's the wave of the future with RAID. More and more people are using RAID 6. Why? Simple: more integrity and minimal SPACE impact.. and you want to make the leap ALL the way to RAID 10.. interesting.. since they aren't even in the same ballpark. At least RAID 6 is somewhat closer to RAID 5.

Why do you think they came up with it, because someone has too much time on their hands? I don't think so.. it fits a need. Bottom line.

RParker
Immortal

STPatrick said:

Keep in mind that a rebuild stresses the disks to their limits

Not really. It's reading from the entire ARRAY, but each individual disk's contribution is low, plus it's READing the content... and sure, it still has to serve the ARRAY all the while. But if you can look inside your RAID controller and see individual drive performance, I can assure you it's not as high as you think. That little blinking light isn't a speedometer; it just means the drive is being accessed, whether at 1 KB/s or 160 MB/s.. you can't simply assume the drive light means the disk is pegged.

STPatrick said:

And if even one drive can't stand the stress during the rebuild process (and SATA disks are good candidates for that issue), the whole array is history and you can say hello to your disaster recovery plan

When has this happened? First of all, tell me: when was the last time you had a disk failure that caused an ARRAY to fail? I am not talking about cheap, off-the-shelf, Circuit City discount drives. MOST SAN vendors hand-pick drives and only use high-quality hardware, so we aren't talking about an array of OLD SATA drives sitting in your desktop computer; this is high-quality stuff. So it can handle the stress.. and the RAID controllers are NOT going to bombard the drive (which is why rebuilds take a little while), to lessen the impact. IF your array fails, and IF there is a problem with the drive, and IF you can't rebuild the disk. . . . in the meantime your ARRAY is still up and running, is it not? YES . . . . .

STPatrick said:

The I/O of RAID 5 is not much better than RAID 6, and the difference in comparison with other RAID levels like 10 is minimal

Which makes RAID 6 even more attractive . . . . . . .

STPatrick said:

So if RAID 5 matches your I/O requirements, RAID 6 does too. I have never had a read failure on a disk during a rebuild, not in a SAN array and not in any other array elsewhere... so I can't say anything about that issue.

Exactly, so this whole failed-drive scenario is a moot point.. If it hasn't happened yet.. it's not a likely scenario either.. which is PRECISELY why I continue to harp on using HCL equipment and high-grade components and machines that are built for this purpose. I don't know of anyone in any data center that I have talked to, or read about, whose RAID completely fell apart because of a single disk and couldn't rebuild the drive....

4-hour response times and 24/7 tech support are designed to keep the drives intact.

nabsltd
Enthusiast

RAID-4 has the same rebuild characteristics as RAID-5...read all the non-failed drives and write the reconstructed (XOR) block to the hot spare.

A poor RAID-5 implementation would read all the non-failed drives and then write back to all the drives, but then a poor RAID-4 implementation would do something similar.

The only difference between RAID-4 and RAID-5 is dedicated parity vs. striped parity.
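For anyone who wants to see what the rebuild actually computes: single-parity RAID (4 or 5) keeps the XOR of the data blocks in every stripe, so a lost block is recovered by XORing whatever survived. A toy sketch with byte strings (illustrative only, not a real on-disk layout):

from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe of a 4+1 group: four data blocks plus their parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# The drive holding data[2] dies; rebuild its block from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
assert rebuilt == data[2]   # the missing block comes back exactly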

bobross
Hot Shot

If you want to see the future...here it is...US Patent application 11/095,322 (Lubbers et al.): grid-based RAID. It's a very interesting paper. It makes for ultra-fast rebuilds and better protection than the way RAID-6 is currently implemented.

And if you don't believe corruption is real, read this:

I see I have touched a nerve with this topic. So I'll just stick with my advice to the original poster - I'd use RAID-5 for that stated purpose. If the I/O is low, RAID-5 should be fine.

STPatrick
Contributor

How much stress the disks see during a rebuild depends on the RAID level, doesn't it? RAID 5, because of its layout, doesn't stress a single disk as much as a disk in a RAID 1 or RAID 10. So that's an argument.

I know what a disk's LED means; for sure it's not indicating what's really happening with the drive :)

Altogether I have had exactly 3 disk failures during a rebuild... The last was about 1.5 years ago, and it was a RAID 5 array in an MSA.

I have had a really bad experience with SATA disks in an FSC Fibercat SP... 2 failed disks within 1 month, and the Fibercat was half a year old.

But for my personal taste, I couldn't sleep well if I slowed the rebuild rate for a RAID 5 array down to a minimum so as not to impact my environment, my VMs or whatever else, and the rebuild then took 1 week :-).

For sure, in virtualized and consolidated environments with VMware and SANs, you have to slow down the rebuild rate by default because of the higher baseline I/O load; if not, the impact on performance would be massive.

So RAID 6 is fine if it matches the I/O requirement. Because of the double parity you can set the rebuild rate to the minimum for minimal impact on performance, without the danger that a second disk failure crashes your array. You can say what you want, but safe is safe :) So in this case it really doesn't matter if the rebuild takes 1 week or so...

And in relation to the opening post, using RAID 6 would decrease the risk, and the solution would benefit from the larger available capacities and lower price of SATA, so I would prefer RAID 6.

And for sure, for SANs, 24/7 support with fast reaction times and fast delivery of spare parts is a must.
