slickshoes
Enthusiast

All VMs on ESX box blue screened, Windows now corrupted, VMs won't start

Here is my setup. I have a Dell PowerEdge 2950 III with a quad-core 3 GHz processor and 24 GB of RAM. I am running ESX 3.5 U3. The server is connected via iSCSI to an Enhance Tech Ultrastor RS8 IP-4. All hardware is on VMware's HCL.

I discovered I could not reach some of the Windows VMs on the host. I could ping them but could not RDP in or access their files. I could not log in to vCenter because the VI Client was timing out, and I could not connect directly to the host either. I went into the CLI and tried to reboot the host, but it hung. I took the first screenshot below. I went onsite, hard rebooted the server, and rebooted the SAN from its GUI. When everything came back up I tried to start my VMs but got various Windows corruption errors: it wanted to run chkdsk, said registries were corrupt, DLLs were missing, and so on. In the end, I lost all 9 Windows VMs.

When I try to create new VMs on the SAN, Windows setup won't even run. After the text portion of setup, when it reboots to go into the GUI part, I get the message below.

I have a hunch what the issue is, but want to hear what the community thinks.

marcelo_soares
Champion

If all your VMs bluescreened, it's most probably a HW issue - I would point my finger, without analyzing too much, to the storage access.

Take a look at your /var/log/vmkernel for StorageMonitor errors, or paste the section from the time of the failure here. That can help us pinpoint your problem better.
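
A minimal sketch of that kind of log scan in Python, in case it is useful. The keyword list is an assumption chosen for illustration (it is not an official list of ESX messages), so adjust it to whatever actually appears in your log. It is written for Python 3, so it is meant to be run against a copy of the log on a workstation rather than on the ESX service console itself.

```python
from typing import Iterable, List, Tuple

# Assumed keywords, for illustration only; not an official list of ESX messages.
KEYWORDS = ("StorageMonitor", "iscsi", "SCSI", "timeout", "abort")


def find_storage_lines(lines: Iterable[str], keywords=KEYWORDS) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs whose text contains any keyword, case-insensitively."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(k in text for k in lowered):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path: str = "/var/log/vmkernel") -> List[Tuple[int, str]]:
    """Scan a log file on disk; errors="replace" keeps odd bytes from aborting the scan."""
    with open(path, "r", errors="replace") as handle:
        return find_storage_lines(handle)
```

Calling `scan_file("vmkernel-copy.log")` (a hypothetical file name) returns the matching lines with their line numbers, which makes it easier to paste just the relevant section around the time of the failure.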

Marcelo Soares

VMWare Certified Professional 310

Technical Support Engineer

Linux Server Senior Administrator

bulletprooffool
Champion

If all the VMs on the Dell server bluescreened, you probably have a faulty server.

I'd suggest trying to start these on another server. Also, I'd run the diagnostics on the Dell.

My guess would be a dead processor or bad memory on the Dell box, as the VMs would not even get to the blue screen if your ESX host could not reach the storage, and connection issues would kill the VM rather than bluescreen it.

One day I will virtualise myself . . .
slickshoes
Enthusiast

Marcelo, unfortunately I rebuilt the ESX server since this issue. I worked with VMware support and they never asked for this and actually suggested I rebuild it to see if it works after the rebuild.

Bulletprooffool, I worked with Dell support yesterday and gave them a full DSET report. They said the server is fine. I'm running some hardware/firmware updates today, but otherwise the server seems OK. Remember, ESX stayed up; it's the VMs that crashed. I think if the server had a hardware issue, VMware would PSOD.

Thanks for the ideas!

azn2kew
Champion

Well, that's why VMware HA is in place: when you have multiple ESX hosts in a cluster with an N+1 design, it protects you from hardware failures. But if all the VMs blue screened, then I'm sure it's a connection issue with the datastore. Are you using an iSCSI HBA or software iSCSI? Have you verified whether other hosts and virtual machines are still running, or is everything in a single basket, meaning a single ESX host and a single iSCSI target? Do you have multiple iSCSI targets and dual NICs for redundancy? Have you done basic verification that the storage is up, the networking is good, and your hardware components are rock solid? If this is a production server, I wouldn't rely on the same hardware. I would replace it, then stress test the original for 48 hours with a memory test to catch any memory problems, update the firmware, BIOS, backplane, and NIC drivers to the latest versions, and try it out with demo VMs. Once it checks out, bring it back into the production cluster. I'm having good results with all Dell PE 2950, 6950 and R905 servers, no problems!

Why didn't VMware Support ask you for some basic diagnostics instead of just telling you to reload ESX? So what's the outcome with the reloaded ESX: are all the VMs running now?

If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!

Regards,

Stefan Nguyen

VMware vExpert 2009

iGeek Systems Inc.

VMware, Citrix, Microsoft Consultant

marcelo_soares
Champion

This is also what I think. That's why I pointed to storage. The fact that your VMs did not power on after the problem also shows that something went wrong with disk access.

Luckily you were able to get back from that situation. I hope all the info we provided here helps.

Marcelo Soares
slickshoes
Enthusiast

After rebuilding ESX I have the same result. It is a small setup, with a single server and a single SAN. I do have multiple NICs and multiple targets set up. All networking is fine.

I did run 3 tests where I tried to create a new Windows VM from the same ISO:

1) New VM on local storage. No problem.

2) New VM on a NEW LUN created from unused space and newly presented to the server. No problem.

3) New VM on the previously used LUN. Windows setup ran through copying the files, rebooted to the GUI part of the install, and see below.

jandie
Enthusiast

Anything unusual in /var/log/vmkernel or /var/log/vmkwarning? What does the log on the storage end say? Maybe that LUN is corrupted. I had a similar (not identical) error way back, and it turned out that the LUN was corrupted and was missing data that had been written to it. In that case, though, the installation of the guest OS also took a long time (way longer than usual). Another problem was caused by the connection to the storage itself: the connection kept getting interrupted/dropped.
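
One rough way to check for "data written but not there afterwards" is a write-then-read-back test. The Python 3 sketch below is only an illustration under assumptions: the function name and the probe file path are hypothetical, and it would have to be run from a machine whose disk actually lives on the suspect LUN (for example a test guest). Note that the read-back may be served from the operating system's cache, so a pass does not prove the LUN is healthy; a mismatch, however, is a strong hint.

```python
import hashlib
import os


def write_and_verify(path: str, size_mb: int = 16, block_size: int = 1024 * 1024) -> bool:
    """Write size_mb random blocks of block_size bytes to path, read them back,
    and compare SHA-256 digests of what was written and what was read."""
    written = hashlib.sha256()
    with open(path, "wb") as handle:
        for _ in range(size_mb):
            block = os.urandom(block_size)
            written.update(block)
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push buffered data to the device

    read_back = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object (end of file).
        for block in iter(lambda: handle.read(block_size), b""):
            read_back.update(block)

    return written.hexdigest() == read_back.hexdigest()
```

For example, `write_and_verify("probe.bin")` (a hypothetical file on the disk under test) returns `True` when the digests match. Running it several times, ideally with a reboot or remount in between, reduces the chance that the cache hides a problem.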

jandie

Chris0937
Contributor

Hi,

If you are facing a corrupted OS, you might want to take a look at the following link.

http://www.virtualizationteam.com/virtualization-vmware/vmware-vi3-virtualization-vmware/vmware-esx-...

It will teach you how to recover the VMFS partition table.

Also, you might want to call Enhance Tech for the latest firmware for the RS8-IP4.

Chris0937
Contributor

Hi

This is Chris from Enhance-tech,

Please email to us: tech@enhance-tech.com

or call us 1(562)7773488#2 ,

Thank you,

Chris

VCP4 Engineer | Enhance Technology, Inc.

Website: http://www.enhance-tech.com

Email: chris@enhance-tech.com

Skype: support.enhance

World Headquarter: USA

Branch: Taiwan | Germany | China
