VMware Cloud Community
MPritchard
Contributor

Replacing a failing hard disk in a PERC 6/i RAID on a Dell PowerEdge 2950

Hi folks, my first post, so please be kind. LOL

I have a Dell PE2950 server running ESX 3, with a PERC 6/i controller connected to 5 hard drives. One of the drives is reporting imminent failure but has not yet actually failed, and we have a replacement drive from Dell.

Given that the drive has not yet failed and left the RAID degraded, can somebody tell me what effect hot-swapping it would have on the RAID and on the VMware system running the production servers (5 in total)? Would it be best practice to shut down the virtual servers before starting the hot swap? Has anybody experienced this, and what was your outcome?

Thanks for any help in advance

36 Replies
Jackobli
Virtuoso

Assuming that this is RAID 5 without a hot spare...

My thoughts:

  • your backups are up to date and you have tested your restore capability

  • you know exactly which drive has failed or is about to fail

  • you know exactly where that drive is located in the server

  • you have double-checked the position of the soon-to-fail drive

If I had the option, I would move the guests to another host and then replace the disk.

If there is no such option, stopping the guests would shorten the RAID 5 rebuild somewhat. The rebuild duration also depends on the rebuild rate set on your PERC 6 (there is a rough command-line sketch at the end of this post).

Rebuilding a RAID volume always puts stress on the disks, and the chance that another disk fails during this period is higher than usual.

But most problems occur due to human error (e.g. replacing the wrong disk).
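If OpenManage Server Administrator happens to be installed on the ESX 3 service console, a rough sketch of those checks from the command line could look like this. Controller 0 and the 0:0:4 slot ID are only placeholders; substitute whatever your own omreport output shows:

    # list the physical disks and find the one flagged with a predictive/SMART failure
    omreport storage pdisk controller=0

    # blink the LED on that slot so you are certain you will pull the right disk
    omconfig storage pdisk action=blink controller=0 pdisk=0:0:4

    # check the controller's current rebuild rate, and raise it if you want a faster rebuild
    # (rate=60 is just an example value)
    omreport storage controller controller=0
    omconfig storage controller action=setrebuildrate controller=0 rate=60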

EnsignA
Hot Shot

Since the drive has not actually failed yet, if you pull it while the server is running you will lose your array. If you have Server Administrator loaded on the server, which I presume you might since you got a predictive failure, open it for that server and take the drive offline first. Alternatively, you can move your VMs off, then shut down the server, replace the drive in question, and power it back on. When it comes back up you may be prompted to choose whether to keep the disk's or the controller's information about the array; select the controller.
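On ESX 3 with OMSA on the service console, offlining the flagged disk can also be done from the command line; a minimal sketch, assuming controller 0 and slot 0:0:4 (substitute your own IDs):

    # mark the predicted-failure disk offline so the controller stops using it
    omconfig storage pdisk action=offline controller=0 pdisk=0:0:4

    # confirm the disk now shows as Offline before physically pulling it
    omreport storage pdisk controller=0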

nick_couchman
Immortal

This is not necessarily true, depending on the RAID level. I don't think the OP specified the RAID level, but in a RAID 5, 6, or 10, you can pull a drive out of the array without losing the array. Once the drive is pulled, the array will go into degraded mode, at which time you can put the replacement drive in and the RAID will rebuild. If you're running a RAID 0 or JBOD, pulling the drive is going to result in loss of data on that drive (and possibly the entire array) one way or another, so you'll have to have those backups ready.
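If you want to confirm which RAID level the virtual disk is actually using, or watch it drop to Degraded and then rebuild after the swap, the OMSA CLI can show that as well; again a sketch assuming OMSA on the ESX 3 service console and controller 0:

    # show each virtual disk's layout (RAID level) and current state (Ready, Degraded, ...)
    omreport storage vdisk controller=0

    # list the physical disks that belong to virtual disk 0
    omreport storage pdisk controller=0 vdisk=0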

EnsignA
Hot Shot

I beg to differ. If a Dell array controller says you have a predictive failure and has not failed the drive, you can lose the array if you pull the drive. The proper way to do it is to offline the drive first, or else shut down the server, replace the drive, and power it back up.

nick_couchman
Immortal

Okay...my experience is different - I've pulled a couple of drives out of arrays that had "predictive failures" and not had any issues. If the docs say otherwise, though, best to follow Dell's suggestions. My bad...

kjb007
Immortal

This is news to me as well. What would be the difference between receiving a predictive failure warning and the drive then dying on its own, versus, as in this case, it being pulled out of the system? Are you saying that if you have a predictive failure and the drive then just dies, you can lose your array?

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
EnsignA
Hot Shot

I am simply stating Dell's best practice as told to me by their technical support. I have been working with Dell servers for quite a few years, and they mention this to me every time I call in to get a drive sent out to replace a predictive failure. I assume that if they feel it is important enough to tell me, they have had instances of bad things happening. I offered this information as the BEST way to deal with this particular problem. Your mileage may vary... better safe than sorry... don't take any wooden nickels... etc.

nick_couchman
Immortal

Fair enough - it sounds suspicious to me, like Dell trying to cover themselves in case something bad happens while you're replacing a drive. Personally, I think it's unreasonable to require a shutdown of a system to replace a drive, even in a predictive failure situation, but I can see it being a "best practice" to shut down the server if possible - this ensures that no writes are taking place at the time the drive is removed.

s1xth
VMware Employee

This is a very interesting thread. I have a bunch of 2950s in RAID 5 on ESXi, and I was wondering myself what the proper procedure would be for replacing a 'predictive failure' drive. I haven't (thankfully) had to find out yet, but from my past experience working with RAID arrays, the machine can stay up: pull the drive, put a new one in, and the array should stay up and rebuild, provided you pull the correct drive.

With ESXi I am assuming it is the same process. I have worked on HP DL380 G5s and G4s and DL360 G5s with the smaller 2.5in SAS drives that had a very high failure rate (for whatever reason), and I just swapped the drives out (granted, those servers were all Windows boxes, not ESX boxes).

Anyone care to comment on the proper way to swap a drive with a predictive failure on it?

I am also assuming that the poster is referring to the ESXi predictive failure alarms on the health status page.

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
EnsignA
Hot Shot

Not to belabor the point, but the best way is to offline the drive using Server Administrator, which can be done while the server is online. If that is not installed, then you either pull the drive and hope for the best, or shut down and replace. If I could move the VMs off via VMotion and didn't have SA installed, I would shut down and replace to be safe. Just saying...

s1xth
VMware Employee

Yeah, I agree with that process. I think the question here is: with ESXi, since we CAN'T install Server Administrator and only have the health status page on the host, what would the process be for swapping a drive?

I guess my process would be: 1. identify the drive that is failing, whether predictive or already faulted; 2. back up the guests, or move them with VMotion if you have it (in my case I don't); 3. shut the host down, remove the bad drive, put the new drive in, power up, and cross your fingers that the RAID rebuilds itself and the server comes up. Should work... SHOULD. (There is a rough sketch of the guest-shutdown part below.)

The more I think about it, I probably would not do a hot swap on this server, just to be safe about it. If it were an HA server with HA guests then maybe I would, but only after getting a solid backup.
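For the guest-shutdown part on a host without VMotion or vCenter, something like the following might work from the ESXi Tech Support Mode console, assuming vim-cmd is available on that build; the <vmid> values are whatever the first command reports, and this is only a sketch, not a tested recipe:

    # list the registered VMs and note their IDs (first column)
    vim-cmd vmsvc/getallvms

    # shut each guest down gracefully (needs VMware Tools running in the guest)
    vim-cmd vmsvc/power.shutdown <vmid>

    # check that a guest is really powered off before moving on to the next one
    vim-cmd vmsvc/power.getstate <vmid>

Once everything is off, shut the host itself down from the VI Client, swap the disk, and power it back up.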

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
Jackobli
Virtuoso

In rare cases, pulling a running drive could perhaps cause damage to the backplane.

A failed drive should already be spun down, so the risk there is low.

That is why manufacturers of cheap enclosures/backplanes state that only spun-down (cold) drives may be removed.

s1xth
VMware Employee

Cheap backplanes could definitely have issues with pulling a running drive, but these PERC 6 / LSI Logic backplanes have been pretty good for me. (The HP backplanes, on the other hand, have been horrible for me; I believe they are LSI as well, but some models use something else, I can't remember off the top of my head, or maybe HP just has horrible QC, which is why I am running Dell boxes instead)...

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
nick_couchman
Immortal

Yeah, I've had good experience with the Dell backplanes, too - they seem to be pretty good quality and I haven't had any issues with damaging one, damaging a drive, etc.

nick_couchman
Immortal

Agreed...

RParker
Immortal

and has not failed the drive you can lose the array if you pull a drive

Incorrect. RAID 101. Enterprise hot-swap RAID controllers allow you to remove a drive from a RAID WHILE the drives are still running. A PERC 6 controller also lets you run RAID 6, which means two parity drives instead of just one, so you can actually pull two drives at once and NOTHING will happen to the RAID. I have done this on Dell servers many times with no issues; that's exactly what it is designed to do. When you pull out the drive and plug in the new one, it will automatically rebuild over a few hours, but the RAID does NOT stop functioning.
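If OMSA is available on the host, the automatic rebuild described here can be watched from the command line; a small sketch, again assuming controller 0:

    # the replacement disk should show a Rebuilding state with a progress figure,
    # and move to an online/ready state when the rebuild finishes
    omreport storage pdisk controller=0

    # the virtual disk itself should go from Degraded back to Ready at the same point
    omreport storage vdisk controller=0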

RParker
Immortal

am simply stating Dell's best practice as told to me by their technical support. I have been working with Dell servers for quite a few years

Well, I don't know who you are talking to at Dell, but my experience is just the opposite. They have no problem supporting hot-swapping drives in production; that is exactly what this feature is for: NO downtime. I have been doing this with Dell servers for at least 13 years, and not once did Dell say it wasn't a good idea. In fact, their first question is usually "Do you have physical access to the server, so we can pull the drive?"

RParker
Immortal

but the best way is to offline the drive using Server Administrator

Wow, you have had some really unique experiences with Dell, because I haven't done this either. Server Administrator is a tool for monitoring the server, and you CAN offline a drive with it, but that's only necessary for diagnosing issues; it's not a prerequisite for replacing a drive. I reconfigure the RAID using this tool, but that's about it.

RParker
Immortal

In rare cases, pulling a running drive could perhaps cause damages to the backplane. A failed drive should be spun down, so the risk is low.

I have to disagree with this theory as well. Journaling file systems, parity, and battery-backed cache prevent this. You can pull a drive (spinning or not) during write operations and the controller will simply shift the data to the drives that hold the parity. It has been designed this way for years. In fact, only in recent years have there been tools that let you offline the disk first; early RAID controllers didn't have that option, and the tools that could do it were a separate purchase.

The hardware also has a latch that signals the chassis that the drive is about to be removed; it's part of the close-and-lock mechanism, so you can't yank the drive without unlatching it first. It's the same as right-clicking your CD-ROM drive and clicking "eject" versus simply hitting the eject button: there is NO difference to the hardware. The idea that it's better to power down the drive first is only a perception; it's not technically true. There are safeguards built into the RAID controllers to prevent this. That's the difference between enterprise-level hardware and some cheap off-the-shelf RAID product, and why RAID controllers cost 1000 bucks: they are built with ALL of this in mind.
