VMware Cloud Community
phsteele
Contributor

ESXi hang and troubleshooting

I'm new to the VMware forums so I'm not sure if this is the appropriate place to post this sort of question.

We've been using ESXi to set up test servers for a couple of years, and have just set up our first production server. The approach we took isn't the way we would have liked to deploy a production ESXi server, but unfortunately our budget was limited. We purchased a Dell R600 with dual 6-core CPUs, 32 GB of RAM, and 900 GB of disk space using SAS disks in a RAID 5 config (direct attached, not a SAN). The server boots the free version of ESXi 4.1 (Build 348481) from a 1 GB SD card. We've created 6 Windows 2008 R2 VMs of various sizes.

The server has been working under light load for the last few months. This morning all 6 VMs froze. vSphere Client could still talk to the server and the server console behaved normally. We tried to do a VM reset on a couple of the VMs but the action did not work. We did manage to issue a VM power off for another couple of VMs, and that appeared to work, but they would not restart. In the end we had to reboot the server (from vSphere Client) and everything came back up.

We are somewhat concerned that the server would crash in this fashion, taking down all the VMs. It's not clear if it was a hardware problem or some sort of ESXi software issue. There were no obvious errors. Is there any way to troubleshoot this sort of problem if it happens again?

Thanks!

8 Replies
DSTAVERT
Immortal

Sorry to hear about your problem.

I don't know of anything specific, and it becomes difficult without evidence. ESXi writes its logs to a RAM disk unless set up to do otherwise, so a reboot wipes out the logs. I would change the log destination to a datastore or use vilogger from the vMA appliance. Use the vSphere Client to change the log destination: Configuration -> Software -> Advanced Settings, and look for Syslog. If you have a syslog server, you can send the logs to that.
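
For anyone without a syslog server to hand, here is a minimal sketch of a receiver that writes incoming messages to a file so they survive a host reboot. It assumes the host sends plain syslog over UDP (port 514 is the conventional syslog port); the names `SyslogToFileHandler` and `make_syslog_server` and the log file name are made up for illustration, and this is a stand-in for a proper syslog daemon rather than a replacement for one.

```python
import socketserver
from pathlib import Path


class SyslogToFileHandler(socketserver.BaseRequestHandler):
    """Append each received syslog datagram to a log file on disk."""

    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        message = data.decode("utf-8", errors="replace").rstrip()
        sender_ip = self.client_address[0]
        # Prefix each line with the sender so several hosts can share one file.
        with open(self.server.log_path, "a", encoding="utf-8") as log_file:
            log_file.write(f"{sender_ip} {message}\n")


def make_syslog_server(log_path, host="0.0.0.0", port=514):
    """Create a UDP server that persists incoming syslog messages to log_path."""
    server = socketserver.UDPServer((host, port), SyslogToFileHandler)
    server.log_path = Path(log_path)  # read by the handler above
    return server
```

To run it, call `make_syslog_server("esxi-host.log").serve_forever()` on a machine the ESXi host can reach, then point the host's remote syslog setting at that machine. Binding to port 514 usually requires administrator rights.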

-- David -- VMware Communities Moderator
vmroyale
Immortal

Hello.

How many drives make up that 900GB of disk space?  Do you have the default Wednesday scheduled defrag task turned off in Windows 2008?  Given that this happened on a Wednesday, I might expect this to be involved.

Good Luck!

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
phsteele
Contributor

Thanks for the tip on where to put the event logs. They're now going to the local datastore. I'm still learning ESXi and have to admit I have lots more to learn. Every bit helps.

Thanks!

phsteele
Contributor

I had read that defrag was still recommended even in a VM setting. I'll turn it off, although the defrag actually occurs at 1:00am and the crash occurred just past 9:30am.

The datastore consists of four 300GB disks set up as RAID 5. Would more disks be a better option? There's only room for 6 disks in the R600, but I'm certainly willing to add 2 more if it makes sense.
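
For reference, the capacity side of that question is simple arithmetic: RAID 5 gives up one disk's worth of space to parity. A small sketch (the function name `raid5_usable_gb` is just for illustration, and it ignores formatting overhead and says nothing about performance):

```python
def raid5_usable_gb(disk_count, disk_size_gb):
    """Usable capacity of a RAID 5 set: one disk's worth of space holds parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


print(raid5_usable_gb(4, 300))  # 900, matching the datastore described above
print(raid5_usable_gb(6, 300))  # 1500 if two more 300GB disks were added
```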

Thanks!

DSTAVERT
Immortal

Have a look through some of the webcasts at http://communities.vmware.com/docs/DOC-14673 and also at http://blogs.vmware.com/esxi/2011/04/become-a-true-esxi-expert-with-the-new-free-vmware-elearning-co...

-- David -- VMware Communities Moderator
vmroyale
Immortal

Regardless of the value/need of defrag, if ALL 6 of those VMs run it at the same time against 4 disks on the backend, then the result is likely not going to be a good one. You might try modifying the defrag schedule to minimize the possibility of them all running simultaneously, if you still want defrag to run.
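
One way to work out a staggered schedule is sketched below. The VM names, the 01:00 first slot (taken from the default mentioned earlier in the thread) and the 45-minute gap are assumptions; the gap only spaces out the start times and does not guarantee that one defrag finishes before the next begins. The resulting times would still have to be entered in each guest's scheduled task by hand.

```python
from datetime import datetime, timedelta


def stagger_start_times(vm_names, first_start="01:00", gap_minutes=45):
    """Return {vm_name: "HH:MM"}, each VM starting gap_minutes after the previous one."""
    start = datetime.strptime(first_start, "%H:%M")
    schedule = {}
    for index, name in enumerate(vm_names):
        slot = start + timedelta(minutes=index * gap_minutes)
        schedule[name] = slot.strftime("%H:%M")
    return schedule


# Hypothetical VM names; the thread only says there are six Windows 2008 R2 VMs.
vms = ["vm1", "vm2", "vm3", "vm4", "vm5", "vm6"]
for name, start_time in stagger_start_times(vms).items():
    print(name, start_time)  # vm1 01:00, vm2 01:45, ... vm6 04:45
```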

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
phsteele
Contributor

For the moment I've turned off defrag. Two of the systems are SharePoint servers and two more are Apache servers, so there's not going to be a lot of disk activity. I suspect the defrag wasn't the cause of the problem though...

phsteele
Contributor

Thanks. Resources like these are definitely something we need to go over, especially after looking at the ESXi event logs. There's lots of information, and it's not entirely clear what's normal and what might need investigating. Some recurring errors look suspicious though:

error 'App'] Failed to read header on stream TCP(local=127.0.0.1:53468, peer=127.0.0.1:0): N7Vmacore15SystemExceptionE(Connection reset by peer)

info 'Vmomi'] Throw vmodl.fault.RequestCanceled

Lots of others. We'll have to do some research...
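
As a starting point for that research, here is a rough sketch of how recurring messages could be counted once the logs are on a datastore. It assumes the log has been copied off the host as a plain text file; the helper names are made up, and the only normalization it does is to collapse IP:port pairs (such as `127.0.0.1:53468` above) so that the same error from different local ports is counted as one message.

```python
import re
from collections import Counter

# IP:port pairs change between otherwise identical messages, so mask them.
_ADDRESS = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+")


def normalize(line):
    """Collapse IP:port pairs so repeated messages compare equal."""
    return _ADDRESS.sub("<addr>", line.strip())


def count_recurring(lines, minimum=2):
    """Return (message, count) pairs seen at least `minimum` times, most frequent first."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    return [(message, n) for message, n in counts.most_common() if n >= minimum]
```

Usage would look something like `count_recurring(open("hostd.log", encoding="utf-8", errors="replace"))`, where `hostd.log` is a placeholder for whichever log file was copied from the datastore.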
