VMware Cloud Community
Osm3um
Enthusiast

How safe are SANs?

I just read a post regarding a SAN corrupting some VMs. It appears to have been at least partially caused by a firmware issue.

I am curious, however:

How often has this sort of thing happened?

On a SAN without a firmware problem, is this rare?

etc.

I know it is a wide-open question, but this sort of story scares me big time.

FYI, I am running my VMs on an EMC AX150i with two ESX boxes, using both RAID 5 and RAID 1 volumes. The AX150i has dual storage processors and dual power supplies.

Thanks,

Bob

9 Replies
Rumple
Virtuoso

SANs are no more prone to issues than running a RAID 5 volume off a single controller card.

A SAN is typically designed to be completely redundant, if you pay for those options. As such, it is less likely to experience a catastrophic failure than direct-attached SCSI.

I've lost direct-attached RAID controller cards that started writing corrupt data to the array, and I've also lost 2 drives at exactly the same time in a RAID 5 configuration.

I have 3 SANs in my current environment (an old EMC FC4700, a CX500, and a CX700). I have yet to lose a LUN or even experience any downtime, even though drives in the SANs are replaced at least once every 3 months (a SAN fails a drive long before it would actually fail in a normal environment), and I've replaced the SPs (storage processors) on all 3 SANs multiple times, also without downtime.

Overall, I'd put my career on the line for a SAN long before I'd trust it to a direct-attached storage system.

Dave_Mishchenko
Immortal

Hi Bob, if you ask 10 people you'll probably get 10 different answers, but in general SANs are sufficiently safe that many of the users here are running them and have no qualms about recommending them. The company I'm at has been using IBM SANs (the DS / FAStT series) for 7 years and we've never had a firmware problem. We did have one "SAN burp" that knocked all the servers offline once, but other than that my experience has been very good. Firmware upgrades are approached cautiously, of course, since a problem would have serious consequences, but I have no hesitation about using them.

A few months ago I had a VM go corrupt on an ESX 2.5 host with local storage after a power failure, so the problems aren't limited to SANs. And while I don't know the details of the thread you were reading, no disk technology removes the need for some sort of backup of your VMs.

espsgroup
Contributor

So, you were probably just reading my issue. I'm about to update it as I've made some progress, but I thought I would chime in.

I've been a heavy SAN admin in my last 4 positions, and I can safely say that if a SAN is implemented correctly in your environment, it is very safe.

The issues start arising when you have one of the following.

1. Old firmware or software revisions that aren't kept up to date with the rest of your equipment.

2. Non-interoperable vendor equipment, which is becoming less of an issue.

3. A configuration that isn't 100% tested and at least blessed by someone who knows the environment well.

4. User error (which is surprisingly easy when you are looking at hundreds of LUNs, device paths, partition sizes, and volumes; you start to lose track). :-)

SANs are not trivial to set up, especially in a redundant fashion, and there are many components. All pieces of the stack must be configured EXACTLY to work with all of the other components:

Partitions, Multipathing Software/Configuration, OS, HBA Driver, HBA, Cabling to Server, Switch Port Configuration, LUN Masking, Switch Fabric Zoning, Cabling to SAN unit, SAN Port Configuration, SAN configuration

There are lots of places for things to not quite line up right, and it really takes a diligent SAN admin to work all of the kinks out. You have to really know your entire technology environment, and that is not so easy when it is heterogeneous (Windows, Linux, AIX, VMware). :-)
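
To make the "line up EXACTLY" point concrete, here is a minimal, purely illustrative sketch (plain Python, with made-up host names and WWPNs, not any vendor's API): every HBA's WWPN has to show up both in the fabric zoning and in the array's LUN masking, or that host simply won't see its storage.

```python
# Purely illustrative cross-check of three SAN configuration layers
# (hypothetical names and WWPNs, not a real vendor API).

# Host -> list of HBA WWPNs reported by that server
host_hbas = {
    "esx01": ["50:06:01:60:10:60:08:A1", "50:06:01:61:10:60:08:A1"],
    "esx02": ["50:06:01:60:10:60:08:B2", "50:06:01:61:10:60:08:B2"],
}

# WWPNs that actually appear in the switch fabric zoning
zoned_wwpns = {
    "50:06:01:60:10:60:08:A1",
    "50:06:01:61:10:60:08:A1",
    "50:06:01:60:10:60:08:B2",
    # esx02's second HBA was never zoned -- a classic "why can't it see the LUN?"
}

# WWPNs registered in the array's storage group / LUN masking
masked_wwpns = set(w for hbas in host_hbas.values() for w in hbas)

def check_host(host, wwpns):
    """Return a list of mismatches between the layers for one host."""
    problems = []
    for wwpn in wwpns:
        if wwpn not in zoned_wwpns:
            problems.append(f"{host}: {wwpn} is not zoned on the fabric")
        if wwpn not in masked_wwpns:
            problems.append(f"{host}: {wwpn} is not in the array's LUN masking")
    return problems

for host, wwpns in host_hbas.items():
    for problem in check_host(host, wwpns):
        print(problem)
# Prints: esx02: 50:06:01:61:10:60:08:B2 is not zoned on the fabric
```

The real environment obviously has multipathing, drivers, and firmware layered on top of this, but that is the shape of the bookkeeping you are doing by hand.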

The bottom line is: test the crap out of your configuration, 100%. Make sure everything works.

In my case I haven't been quite as thorough as I should have been with failover tests. I'm a one-man system administration and desktop support team for 50 users and a complete Linux, Windows, and VMware environment with a 10 TB SAN. I'm also the network admin and phone guy.

I'm stretched a little thin and things slip through the cracks. 😛

I'm learning though. :-)

I would say SANs definitely have their place if you have the resources to implement them correctly. That doesn't mean expensive resources; it just means you have to know what you are doing.

Thanks,

Jeff

Osm3um
Enthusiast

I figured as much.

It is interesting, however, that as virtualization moves forward (I have had our servers virtualized for at least 3 years already), all of our eggs end up in one basket: the SAN.

The other thing I have noticed is that a lot of servers are tied closely together. That is, if I lose one box I may as well lose them all, since the SQL is linked to the IIS, which is linked to SharePoint, which has ties to a file share. So why not put them all in one basket?

Thanks for your reply... I am now going to double-check my backup processes!

Bob

Osm3um
Enthusiast

"So, you were probably just reading my issue. I'm about to update it as I've made some progress, but I thought I would chime in"

Yep, I was referencing your original post. This question has been bothering me for a while as I centralize my VMs more and more. As a matter of fact, I spent this weekend rewiring our AC/UPSs for more redundancy.

Interestingly enough, I am a one-person show as well: 60 users, 14 servers (VMs and physical), ESX VI3 with one iSCSI SAN, help desk, two sites, Exchange, firewall, etc.

We only have a single SAN.

Bob

RParker
Immortal

"The other thing I have noticed is a lot of servers are tied closer together. That is, if I lose one box I may as well lose them all as the SQL is linked to the IIS which is linked to Sharepoint which has ties to a file share. So why not put them all in one basket"

Not true. You are assuming that a SAN operates the same way as a disk drive in a host. It's VERY different.

A SAN is a big box with lots of drives. EMC (or NetApp), for instance, has an algorithm that selects drives randomly rather than in sequence like a normal RAID. You also have WAY more spindles in a SAN than in a host: 6 drives is typical for a host versus 24 or more in a SAN.

So what happens when one of those drives fails in a SAN? For one thing, it has double parity (2 drives hold the parity info, not just 1). So it's redundant on disk, it's redundant on parity, and since it's a huge array of disks, you have multiple drives on standby for this purpose.
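
As a rough worked example of the parity point (my own generic numbers, not the AX150i's or any specific array's layout), here is the capacity trade-off between single and double parity:

```python
# Rough arithmetic comparing single-parity and double-parity RAID groups
# (illustrative numbers -- 14 x 500 GB drives -- not any specific array's layout).

drive_tb = 0.5      # 500 GB drives
group_size = 14     # drives in one RAID group

# RAID 5: one drive's worth of parity, survives exactly 1 failed drive
raid5_usable = (group_size - 1) * drive_tb

# Double parity (RAID 6 / RAID-DP style): two drives' worth of parity,
# survives any 2 simultaneous failures -- the case that kills a RAID 5 group
raid6_usable = (group_size - 2) * drive_tb

print(f"RAID 5       : {raid5_usable:.1f} TB usable, tolerates 1 failed drive")
print(f"double parity: {raid6_usable:.1f} TB usable, tolerates 2 failed drives")
```

The second parity drive costs one drive's worth of capacity but covers exactly the two-drives-at-once failure that wipes out a RAID 5 group.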

So really what you are dealing with is a refrigerator holding many types of items rather than a stand-alone cooler. A host is an island, and the SAN is the planet. You have a much larger pool of resources to draw from than a single machine.

The SAN is also managed differently, because it's a self-sustaining device for the SOLE purpose of disk management. It ONLY monitors disks, and nothing else.

They hand-pick the drives to be the best of the best, and like Rumple said, they have 99.99% uptime, and they fail the drives LONG before there would be a problem elsewhere.
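
For what a figure like 99.99% uptime actually allows, the arithmetic is simple (this is just the generic availability calculation, not any vendor's SLA):

```python
# Allowed downtime per year for a few availability levels (generic math).
minutes_per_year = 365 * 24 * 60   # 525,600 minutes

for availability in (0.999, 0.9999, 0.99999):
    downtime_min = (1 - availability) * minutes_per_year
    print(f"{availability:.5f} -> {downtime_min:.1f} minutes of downtime per year")
# 0.99900 -> 525.6 minutes of downtime per year
# 0.99990 -> 52.6 minutes of downtime per year
# 0.99999 -> 5.3 minutes of downtime per year
# (525.6 minutes is roughly 8.8 hours; 52.6 minutes is under an hour a year.)
```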

So to say that putting your "eggs in one basket" applies to the SAN isn't really even CLOSE to the truth. The drives are ONLY common in that they live in the same enclosure, but they are in no way, shape, or form related to each other.

There are many shelves, and each shelf is 14 drives. If you have 5 shelves and you create a LARGE group (called an aggregate) of 24 disks (for ultimate performance), the drives are pulled from ALL 5 shelves, not just one. And it's done with the best possible redundancy, and the drives are constantly monitored for health. You will *NEVER* notice there is a problem unless you lose power to the SAN. Otherwise, it just sits and does its job.
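
A toy sketch of the "pulled from ALL 5 shelves" idea (my own simplification using round-robin placement, not the array's real allocation algorithm):

```python
# Toy round-robin placement of a 24-disk group across 5 shelves of 14 drives
# (my own simplification, not the array's real allocation algorithm).
from collections import Counter

num_shelves = 5
disks_in_group = 24

# Place disk 0 on shelf 0, disk 1 on shelf 1, ... wrapping around the shelves
placement = [disk % num_shelves for disk in range(disks_in_group)]
per_shelf = Counter(placement)

for shelf in range(num_shelves):
    print(f"shelf {shelf}: {per_shelf[shelf]} of the group's disks")
# shelves 0-3 get 5 disks each and shelf 4 gets 4 -- the group is spread
# across every shelf instead of filling one shelf at a time.
```

Spreading the group across every shelf spreads both the I/O load and the failure exposure, instead of concentrating a whole RAID group behind a single shelf.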

We have a SAN (104TB) that we split among production and development.

We have had it for 5 years. Our SAN has been down 4 times.

2 power failures, 1 move (a planned power outage), and 1 upgrade. The 2 power failures were the result of building problems that caused our power to glitch.

Notice that in 5 years the SAN has never, ever gone down for any other reason. We have had maybe a dozen drives fail, one recently, but it was immediately replaced with a spare, and the only reason I even knew it was replaced was because I happened to check the log and noticed that the drive had failed; none of my LUNs or volumes were affected.

Also, SANs are way, way faster than any drive array in any computer. Ours gets hammered 24/7, so we can't afford to lose any drives or data, not to mention the snapshot capability (essentially a point-in-time copy of each volume, which in our case is taken 3 times a day). There are numerous reasons you want to go with a SAN; my favorite is sharing. All ESX hosts, anywhere, can have access to the same resource via Fibre Channel. I love it. Much easier to manage.

A SAN isn't an egg in a basket; it's a managed universe of highly available resources. To belittle it by saying it's a big hard drive would be like saying an Indy car is a faster version of a Kia. They are far different, even though they have SOME similarity.

RParker
Immortal

Other than my other post, I only have 1 thing to add.

SANs are mad expensive. Ours cost roughly a million dollars; after upgrades, added capacity, and maintenance, maybe closer to $1.5 million. No, that is not a misprint.

SANs are expensive, at least the GOOD ones; that's why most people don't have them. But if a company can afford it, it's probably one of the best technological investments a company will EVER make.

RParker
Immortal

I find this funny, true, but funny.

We can say the same about Microsoft Windows as well.

The issues start arising when you have one of the following.

1. Old firmware or software revisions that aren't kept up to date with the rest of your equipment.

- People don't keep up with maintenance.

2. Non-interoperable vendor equipment, which is becoming less of an issue.

- People don't read the HCL (hardware compatibility list).

3. A configuration that isn't 100% tested and at least blessed by someone who knows the environment well.

- BETA! People don't bother to test first to see if everything works right, or is configured right.

4. User error (which is surprisingly easy when you are looking at hundreds of LUNs, device paths, partition sizes, and volumes; you start to lose track).

- This could be to blame for 80% of the world's problems, but don't get me started. . . . USER ERROR is the #1 reason for many things. If we could get past that, we would ALL be better off.

Osm3um
Enthusiast

I see what you mean, but in our situation we have one SAN: redundant power, a hot spare drive, RAID 5/RAID 1, redundant processors, redundant iSCSI switches, etc. So we are redundant, but not nearly to the same extent as those with multiple SANs.

Thanks for the explanation; I found it quite enjoyable to read. I, however, can only dream of the stuff you deal with!

I did notice the speed difference between local RAID and the SAN and was SHOCKED by the increase, and we have a slower unit, the EMC AX150i.

Bob
