Two Dell PE2950, 32GB RAM, ESX 3.0.2
EMC NS40 SAN presenting iSCSI
Last Friday we experienced a power failure that caused our SAN to reboot. When I began checking machines that morning, things seemed fine, but shortly afterward I began to notice some oddities. Suddenly one of my VMs wouldn't reboot. Another went BSOD. Before I knew it, all of my VMs were corrupt. Most would not make it to a login prompt. Others would BSOD at login. Still others wouldn't boot at all.
I spent the next 26 hours on the phone with EMC and VMware trying to figure out what was wrong. VMware says everything is OK. EMC says the storage is fine. Ugh!! As a last-ditch effort to save the VMs, we cloned one of the BSOD VMs to local storage and it booted perfectly. Ah HA!! I blew away one of the LUNs on the NS40, recreated it, and re-presented it to the hosts. When I cloned machines over to it, they booted fine. We only lost two machines, due to drastic Windows repair efforts. Happy ending, right? Not so fast.
The next morning, after letting the machines settle overnight, I came in to begin the process of restoring order. It didn't take long to realize that the machines were beginning to exhibit the same corruption as before. The only machines that have come back uncorrupted and stayed that way are the ones I moved to local storage. Now EMC is saying that they are seeing reservation errors on the SAN. Apparently this happens when two hosts attempt to access the same LUN at the same time. I'm waiting on a return call from VMware at the moment. Help!!