
This Question is Not Answered

6 Replies Last post: Apr 14, 2008 9:47 AM by TylerApache

Big Problems...Whole system down

Apr 14, 2008 7:09 AM

TylerApache (Novice, 18 posts since Mar 20, 2006)
Two Dell PE2950, 32GB RAM, ESX 3.0.2

EMC NS40 SAN presenting iSCSI

Last Friday we experienced a power failure that caused our SAN to reboot. When I began checking machines that morning, things seemed fine, but shortly after I began to notice some oddities. Suddenly one of my VMs wouldn't reboot. Another went BSOD. Before I knew it, all of my VMs were corrupt. Most would not make it to a login prompt. Others would BSOD at login. Still others wouldn't boot at all.

I spent the next 26 hrs on the phone with EMC and VMware trying to figure out what was wrong. VMware says everything is OK. EMC says the storage is fine. Ugh!! As a last-ditch effort to save the VMs, we cloned one of the BSOD VMs to local storage and it booted perfectly. Ah HA!! I blew away one of the LUNs on the NS40, recreated it and re-presented it to the hosts. When I cloned machines over to it, they booted fine. We only lost two machines due to drastic Windows repair efforts. Happy ending, right...not so fast.

The next morning, after letting the machines settle overnight, I came in to begin the process of restoring order. It didn't take long to realize that the machines were beginning to exhibit the same corruption as before. The only machines that have come back uncorrupted and have stayed that way are the ones that I moved to the local storage. Now EMC is saying that they are seeing reservation errors on the SAN. Apparently this happens when two hosts attempt to access the LUN at the same time. I'm waiting on a return call from VMware at the moment. Help!!

Re: Big Problems...Whole system down Apr 14, 2008 7:50 AM
wondab (Hot Shot, 92 posts since Nov 30, 2007)

What was the first thing you did after the EMC rebooted? Did you unpresent and re-present the LUNs and rescan?

There is a good thread explaining reservation conflicts due to the simultaneous altering of metadata.

http://communities.vmware.com/thread/80561
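
While waiting on support, it may help to see which LUNs the conflicts pile up on. Below is a rough Python sketch that tallies reservation-conflict messages per device from a copy of a host's vmkernel log. The message wording ("reservation conflict") and the `vmhbaA:T:L` device token format are assumptions here, not confirmed log output, and `count_conflicts` is a hypothetical helper name, so adjust the patterns to whatever your logs actually show.

```python
import re
from collections import Counter

# Assumption: conflict messages contain the phrase "reservation conflict"
# (matched case-insensitively); the real wording may differ on your build.
CONFLICT = re.compile(r"reservation conflict", re.IGNORECASE)

# Assumption: devices are named with vmhbaA:T:L tokens, e.g. vmhba40:0:1.
DEVICE = re.compile(r"vmhba\d+:\d+:\d+")


def count_conflicts(lines):
    """Return a Counter of conflict lines keyed by device token.

    Lines that mention a conflict but carry no recognizable device
    token are counted under the key "unknown".
    """
    counts = Counter()
    for line in lines:
        if CONFLICT.search(line):
            match = DEVICE.search(line)
            counts[match.group(0) if match else "unknown"] += 1
    return counts
```

To use it, copy the log off the host, open the copy in Python, pass the file object to `count_conflicts`, and print the result of `.most_common()`. If one LUN dominates on both hosts, that is worth mentioning on the support call.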


Re: Big Problems...Whole system down Apr 14, 2008 7:59 AM
IB_IT (Expert, 426 posts since May 31, 2007)
We had a similar issue with a power outage. Our hosts were brought up and online before the SAN had a chance to power on, although we did not experience the same thing you are...seems your issue is a bit more drastic. In our case all the VMs were inaccessible or orphaned. It sounds like your hosts are still a bit confused as to who is accessing what VM...I wonder if you could try to "remove from inventory" everything from your cluster and then add it back into inventory...your hosts included. If you have an open case with VMware, see what they think about that.
Re: Big Problems...Whole system down Apr 14, 2008 8:00 AM
in response to: wondab
TylerApache (Novice, 18 posts since Mar 20, 2006)
I did not unpresent/represent and rescan.
Re: Big Problems...Whole system down Apr 14, 2008 8:05 AM
in response to: IB_IT
TylerApache (Novice, 18 posts since Mar 20, 2006)
In the queue at the moment...will ask when I get them online.
Re: Big Problems...Whole system down Apr 14, 2008 8:08 AM
in response to: TylerApache
IB_IT (Expert, 426 posts since May 31, 2007)
Having dealt with VMware so frequently on different issues, I've found they often suggest blowing the machines away from VirtualCenter and re-registering them. Having reread your original problem, though, I would tend to agree with wondab that you may need to unpresent/re-present the LUNs.
Re: Big Problems...Whole system down Apr 14, 2008 9:47 AM
in response to: IB_IT
TylerApache (Novice, 18 posts since Mar 20, 2006)
Looks like we may have configuration update issues. When we went into the CHAP access settings to change the config, we got an "unable to update settings" message. The engineer is puzzled.