Here is my setup. I have a Dell PowerEdge 2950 III with a quad-core 3 GHz processor and 24 GB of RAM, running ESX 3.5 U3. The server is connected via iSCSI to an Enhance Tech Ultrastor RS8 IP-4. All hardware is on VMware's HCL.
I discovered I could not reach some of the Windows VMs on the host. I could ping them but not RDP or access their files. I could not log in to vCenter because the VI Client was timing out, and I could not connect directly to the host either. I went into the CLI and tried to reboot it, but it hung; I took the first screenshot below. I went onsite, hard rebooted the server, and rebooted the SAN from its GUI. When everything came back up I tried to fire up my VMs but hit various Windows corruption issues: it wanted to run chkdsk, said registries were corrupt, DLLs were missing, and so on. In the end, I lost all 9 Windows VMs.
When I try to create new VMs on the SAN, Windows Setup won't even run. After the text portion of Setup, when it reboots into the GUI phase, I get the message below.
I have a hunch what the issue is, but want to hear what the community thinks.
If all your VMs bluescreened, it's most probably a hardware issue. Without analyzing too much, I would point my finger at storage access.
Take a look at /var/log/vmkernel for StorageMonitor errors, or paste the section from the time of the failure here; that can help us pinpoint your problem.
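For example, a quick filter like the one below will pull out the storage-related entries. The two sample log lines are fabricated approximations of the ESX 3.x vmkernel format, included only so the filter has something to match; on the real host, point the grep at /var/log/vmkernel instead.

```shell
# Sample lines only -- rough approximations of ESX 3.x vmkernel
# entries, fabricated here for illustration. On the real host,
# grep /var/log/vmkernel instead of this sample file.
cat > /tmp/vmkernel.sample <<'EOF'
Feb 10 03:12:01 vmkernel: 12:04:33.023 cpu2:1034 StorageMonitor: 196: vmhba32:0:0:0 status = 2/0 0x0 0x0 0x0
Feb 10 03:12:05 vmkernel: 12:04:37.101 cpu0:1024 SCSI: 109: command failed: abort issued
EOF
# Surface StorageMonitor entries plus SCSI aborts/timeouts/resets
# around the time of the failure.
grep -iE 'storagemonitor|scsi.*(fail|abort|timeout|reset)' /tmp/vmkernel.sample
```

If that grep turns up bursts of aborts or resets right before the VMs went unreachable, that points squarely at the storage path.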
Marcelo Soares
VMWare Certified Professional 310
Technical Support Engineer
Linux Server Senior Administrator
If all the VMs on the Dell server bluescreened, you probably have a faulty server.
I'd suggest trying to start them on another server. I'd also run the diagnostics on the Dell.
My guess would be a dead processor or memory in the Dell box: the VMs would not even get to a blue screen if your ESX could not reach the storage, and connection issues would kill a VM rather than bluescreen it.
Marcelo, unfortunately I rebuilt the ESX server after this issue. I worked with VMware support; they never asked for those logs and actually suggested I rebuild to see if it worked afterward.
Bulletprooffool, I worked with Dell support yesterday and gave them a full DSET report. They said the server is fine. I'm running some hardware/firmware updates today, but otherwise the server seems OK. Remember, ESX stayed up; it's the VMs that crashed. I think if the server had a hardware issue, VMware would PSOD.
Thanks for the ideas!
Well, that's why VMware HA exists: with multiple ESX hosts in a cluster and an N+1 design, you're protected from hardware failures. But if all the VMs blue screened, I'm fairly sure it's a connection issue with the datastore. Are you using an iSCSI HBA or software iSCSI? Are other hosts and virtual machines running fine, or is everything in a single basket, one ESX host and one iSCSI target? Do you have multiple iSCSI targets and dual NICs for redundancy? Have you done basic verification that the storage is up, the networking is good, and your hardware components are rock solid?
If this is a production server, I wouldn't keep depending on the same hardware. Get a replacement, stress test it for 48 hours with a memory test to catch any leakage, update the firmware, BIOS, backplane, and NIC drivers to the latest, and try it out with demo VMs. Once it's certified good, bring it back into the production cluster. I'm having good results with Dell PE 2950, 6950 and R905, no problems!
Why did VMware Support skip basic diagnostics and just have you reload ESX? And what's the outcome with the reloaded ESX; are all the VMs running now?
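On the basic connectivity check: from the ESX service console, vmkping against each target portal is the direct test, but a generic sketch like the one below (plain bash, using its built-in /dev/tcp feature) can confirm whether anything is answering on the iSCSI port, TCP 3260. The portal IP in the comment is hypothetical; the localhost probe at the end is only there to illustrate the output format.

```shell
# check_portal: report whether a TCP port answers, using bash's
# /dev/tcp pseudo-device. Generic sketch, not ESX-specific; on the
# ESX host itself, vmkping against the portal is the better test.
check_portal() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port reachable"
  else
    echo "$host:$port UNREACHABLE"
  fi
}

# Real use (hypothetical portal IP):
#   check_portal 192.168.1.50 3260
# Illustration against a local port that should have nothing listening:
check_portal 127.0.0.1 9
```

Run it against every portal on the array; if a supposedly redundant path comes back UNREACHABLE, the redundancy isn't actually doing its job.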
If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!
Regards,
Stefan Nguyen
VMware vExpert 2009
iGeek Systems Inc.
VMware, Citrix, Microsoft Consultant
This is also what I think, which is why I pointed to storage. The fact that your VMs did not power on after the problem also suggests that something went wrong with disk access.
Luckily you could come back from that situation. I hope the info we provided here helps.
Marcelo Soares
VMWare Certified Professional 310
Technical Support Engineer
Linux Server Senior Administrator
After rebuilding ESX I have the same result. It is a small setup with a single server and a single SAN. I do have multiple NICs and multiple targets set up, and all networking is fine.
I ran 3 tests where I tried to create a new Windows VM from the same ISO:
1) New VM on local storage: no problem.
2) New VM on a NEW LUN created from unused space and newly presented to the server: no problem.
3) New VM on a previously used LUN: Windows Setup ran through copying the files, rebooted into the GUI part of the install, and produced the error below.
Anything unusual in /var/log/vmkernel or /var/log/vmkwarning? What does the log on the storage end say? Maybe that LUN is corrupted. I had a similar error (not the same one) a while back, and it turned out the LUN was corrupted and was missing data that had been written to it. In that case, installation of the guest OS also took far longer than usual. Another problem we had was with the connection to the storage itself: it kept getting interrupted/dropped.
jandie
Hi,
If you are facing this kind of OS corruption, you might want to take a look at the following link.
It will show you how to recover the VMFS partition table.
Also, you might want to call Enhance Tech for the latest firmware for the RS8-IP4.
Hi
This is Chris from Enhance-tech,
Please email us at tech@enhance-tech.com
or call us at 1(562)7773488 #2.
Thank you,
Chris
VCP4 Engineer | Enhance Technology, Inc.
Website: http://www.enhance-tech.com
Email: chris@enhance-tech.com
Skype: support.enhance
World Headquarters: USA
Branch: Taiwan | Germany | China