VMware Cloud Community
Eudmin
Enthusiast

Corrupt VMFS after EqualLogic firmware update

Last week I updated my EqualLogic firmware from 5.0.2 to 5.0.5 during a planned outage period.  Our electricians were doing breaker maintenance for a day and wanted everything powered off.  I suspended my virtual machines instead of shutting them down because I figured it would be faster to bring the guests in the VI back up this way.  I then shut down my ESX hosts, updated the EqualLogic firmware, restarted the arrays to confirm the update had taken, and then shut them down.  I'm running vSphere 4.0, ESX 4.0.0 U2, by the way.  I have 2 NICs on each server talking to my SAN with multipath IO enabled in ESX by adding the NICs to the iSCSI adapter (esxcli swiscsi nic add -n vmk1 -d vmhba32).  It worked fine until the firmware update.
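
For reference, the binding looked roughly like this on each host (vmk2 here is just a stand-in for whatever your second iSCSI vmkernel port is called):

    # bind both iSCSI vmkernel ports to the software iSCSI adapter
    esxcli swiscsi nic add -n vmk1 -d vmhba32
    esxcli swiscsi nic add -n vmk2 -d vmhba32
    # confirm both uplinks show up as bound
    esxcli swiscsi nic list -d vmhba32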

After the firmware update I restarted the arrays, then started up my ESX hosts and hit the power button on the suspended VMs.  All seemed to come up fine.  It worked for around 2 days, but when I came in this morning everything in the VI was down: all of my Windows VMs had bluescreened and all of my Linux VMs had kernel panics.  Rebooting the VMs gives NTFS errors.  The vmkwarning and vmkernel log files are giving errors about corrupted VMFS file systems and lock problems.  I have an open ticket with VMware and was on the phone with them most of the day, but the upshot is that the VMFS file systems are a total loss.
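
For anyone curious, this is roughly what I was digging through on the hosts (standard ESX 4.0 service console log locations):

    # search the kernel logs for VMFS and lock errors
    grep -i vmfs /var/log/vmkwarning /var/log/vmkernel
    grep -i lock /var/log/vmkwarning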

I'm resigned to rebuilding and restoring what I can, but what could have happened?  They don't know.  Is anyone else running this firmware successfully?

idle-jam
Immortal

Mine looks good. I would suggest getting a call with Dell as well.

Eudmin
Enthusiast

I should have mentioned that.  We were all conferenced in together.  Dell got diags from the two PS5000e boxes and said it didn't look like an EqualLogic problem.  VMware asked me to get Dell in on the case because they didn't see 5.0.5 specifically on their hardware compatibility list, but Dell said they haven't seen many people with problems.

I don't have evidence, but it really feels to me like the firmware update did something to the way that multiple ESX servers use MPIO to write to the shared VMFS volumes.

Or maybe the Dell PowerConnect 5448 switches that I use to connect to the SAN went haywire somewhere in the last couple of days.  Sheesh.  Leaves me with lots of work.

Eudmin
Enthusiast

Now I've created fresh volumes, installed new copies of ESXi 4.1 U1, configured them, and connected them to the shared volumes.  When I started writing to them I immediately got metadata corruption.  It's got to be the firmware, right?

eeg3
Commander

Sorry to hear about your issue, but thanks for sharing. I'll be sure to wait a bit longer to update my firmware now.

Hopefully VMware or Dell can help you track down the cause.

Blog: http://blog.eeg3.net
AndreTheGiant
Immortal

I've not yet upgraded (have bad memories about the 5.0.0 "recommended" upgrade) so I cannot exclude firmware issues.

I suggest calling Dell again, because this looks like a storage issue.

If you are using EMM, try disabling it and going back to round robin or MRU.
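
For example, something like this per device on ESX 4.x (the naa name below is only a placeholder for your EQL volume's device ID):

    # check which path selection policy the device is using now
    esxcli nmp device list -d naa.<eql-volume-id>
    # set it back to round robin (use VMW_PSP_MRU instead for MRU)
    esxcli nmp device setpolicy -d naa.<eql-volume-id> --psp VMW_PSP_RR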

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
Eudmin
Enthusiast

After lots of time on the phone with Dell and VMware tech support, we think we've figured out the issue.  It wasn't a VMware problem or a firmware problem per se.  It was likely triggered by my outage period, but what happened was that a single disk had failed in the EqualLogic array without the controller knowing about it.  For some reason this one disk was corrupted and serving bad reads and writes without triggering a RAID failure and being removed from the array.

We had to configure volumes bound to each member of the storage group and then hammer them with writes, verifying what we read back against what we wrote, to see which disk was being read whenever the read didn't match the write.  Once I pulled that disk the corruption stopped progressing.  I didn't get the data back, but new volumes I create no longer get corrupted immediately after formatting.
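
Stripped of the Dell tooling, the basic idea was a write-then-verify pass like this, repeated against a volume bound to each member (volume name and sizes are just placeholders; the real test wrote far more data):

    # write a known pattern to a test volume on one member, read it back, and compare checksums
    dd if=/dev/urandom of=/tmp/pattern.bin bs=1M count=512
    md5sum /tmp/pattern.bin
    cp /tmp/pattern.bin /vmfs/volumes/eql-test-vol/pattern.bin
    md5sum /vmfs/volumes/eql-test-vol/pattern.bin   # a mismatch means the data was corrupted on the way through the array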

This is a very insidious failure mode and not one that I had even considered.  I'd always assumed that if a disk failed it would be removed from the RAID set and if I had a spare the RAID would rebuild and all would be fine.  With a failure like this you get no disk failure notification.  Just lots and lots of corruption until eventually everything you were storing in that nice RAID 6 is toast, while the RAID controller hums happily along.  It totally bypasses the protection that you think RAID is giving you from data loss.

I'll be doing disaster recovery differently after this experience, and I'd advise people not to rely on their RAID level to protect them from disk failures.  It's entirely possible to get massive data loss by losing just one disk, even if you're running, for example, RAID 6 and think you'd need to lose three disks before you start losing data.  I've just seen it happen.

Eudmin
Enthusiast

Posted the answer in the comments above. Just marking the thread as answered for completeness.

idle-jam
Immortal

Thanks for sharing your findings. Very helpful!

lawson23
Enthusiast

Eudmin,

First sorry to hear about your situation!

I do have a question though, being an EQ (PS4000e) owner myself and currently on 5.0.4, and I see this issue was fixed in 5.0.5.  We are moving to 5.0.7 on Sunday.  My question is: did the drive itself show amber on the unit and the OS just didn't pick up on the failure?  Or was there absolutely no warning at all regarding this failed drive?

Eudmin
Enthusiast

I saw the problem in 5.0.5 when I upgraded from an earlier version, maybe 5.0.2 now that I think about it.  I updated to 5.0.7 as soon as it was available, after being urged to do so by the EQL tech I had dealt with on the problem as well as our Technical Account Manager at Dell.  It's been stable since that upgrade.  Actually, it almost immediately failed a drive, which I had to replace, but I'd rather that happen than the opposite.

To answer your question, the drive never showed amber.  After a restart of the arrays I had a failure event which said that a drive had failed, but it didn't say which drive.  Then when I looked at the drive status, all were green in the GUI and all of the lights were green.  So some part of the OS knew that something was wrong, but it wouldn't take the drive out of the array.  Eventually that failure event cleared on its own, and then everything looked healthy from the GUI even though the data was still getting corrupted.

We identified which drive it was by running a test with an executable that Dell sent called "mtio", which writes and reads 10 GB of data at a time to a couple of volumes.  Then there is a command on the super secret tech service console on the EQL which I don't remember, something like lba_analyze, that analyzes the output of the mtio test and tells you which disk held the data where the read didn't match the write.

lawson23
Enthusiast

OK, thanks for the information.

I have applied 5.0.7 but am waiting till Sunday to do the reboots on the controllers.

dwilliam62
Enthusiast

Hello,

All Dell/EQL customers should be running 6.0.6-H2 or greater in order to prevent corruption in a specific region of a VMFS Datastore, known as the Storage Heartbeat.

Additionally, configuring your ESX servers according to the Dell/EQL best practices document is strongly suggested.

A copy of that document is available here:

EqualLogic Best Practices ESX

Don
