rjf7r
Enthusiast

how to diagnose a failure: hypervisor ran fine for months, then VMs slowed to a crawl, then hypervisor became non-responsive

We have a very small installation running ESXi-5.1.0-799733-standard with a vSphere Essentials license on a single physical machine.  We run 10 very different servers that support our software development needs.

We haven't had any problems with the hypervisor for perhaps over a year.  However, this morning we first noticed that some of the VMs had stopped responding, and about half an hour later the hypervisor itself (accessed through vSphere Client) became unresponsive, as did the hardware console.

We power-cycled the hypervisor hardware and everything came up fine with no error conditions or messages.  All the VMs are working fine now.

What happened?  How do I find out what happened?  How would I copy the system logs to another system where I can study them?

The "Events" tab in the vSphere Client displays only a limited span of the most recent events.  Is there a way to enlarge this log?  Is there a file behind this display that retains more history?

Thanks for any advice,

Bob

greco827
Expert

Check the host's vmkernel.log, which is in /var/log.

Feel free to share some of the log entries if you want.
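
Once the log has been copied to a workstation, it can help to narrow it to the period just before the hang. The following is only a rough Python sketch, not anything supplied by VMware: it assumes every line of interest starts with a UTC timestamp of the form YYYY-MM-DDTHH:MM:SS.mmmZ, and the function name and the example window values are made up for illustration.

```python
from datetime import datetime

# Assumed layout of the timestamp that starts each log line (UTC, trailing "Z").
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def lines_in_window(lines, start, end):
    """Return the log lines whose leading timestamp lies between start and end.

    start and end are ISO strings such as "2015-01-20T09:00:00" (hypothetical
    values). Lines without a parseable leading timestamp are skipped.
    """
    lo = datetime.fromisoformat(start)
    hi = datetime.fromisoformat(end)
    kept = []
    for line in lines:
        token = line.split(" ", 1)[0]
        try:
            stamp = datetime.strptime(token, TS_FORMAT)
        except ValueError:
            continue  # continuation line or unexpected format
        if lo <= stamp <= hi:
            kept.append(line.rstrip("\n"))
    return kept
```

To use it, open the copied vmkernel.log, pass the file object as `lines`, and choose bounds that bracket the time the VMs stopped responding, keeping in mind that the sketch compares against the timestamps exactly as they appear in the log.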

If you find this or any other answer useful please mark the answer as correct or helpful https://communities.vmware.com/people/greco827/blog
unsichtbare
Expert

Try pressing [ALT]+[F12] on the console of the host.

If there are a bunch of messages in reverse text (black text on a white background), then you are likely looking at a storage issue. Post some of the messages and it should be possible to diagnose the issue more precisely.

+The Invisible Admin+ If you find me useful, follow my blog: http://johnborhek.com/
rjf7r
Enthusiast

I've attached vmkernel.log, perhaps that indicates something.

I just increased the number of buffers in the buffer cache, since it appears that there are sometimes no buffers available.

[ALT]+[F12] on the console does list a bunch of messages in reverse text--I have no idea what they mean or how to show them to you.

greco827
Expert

You need to identify the datastore associated with this NAA ID, naa.5000c5004f6cf510, and make sure it is properly mounted.  Check under Configuration --> Storage Adapters --> Devices.  If it is there but greyed out, you should check with your storage admin to identify the LUN and make sure it is still being made available to the WWNs of that particular host.  If it is not, but should be, have him add it back to the proper storage group (or the equivalent, depending on your array type).  If it has been destroyed on the array, or you simply don't need it, you need to get rid of it so that the host stops looking for it and trying to connect to it.  More than likely, this is what is eating up your buffer cache.
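
To see which device the log complains about most, one option is to count how often each NAA identifier appears in a copy of vmkernel.log. This is a minimal Python sketch, assuming identifiers look like "naa." followed by hex digits (as naa.5000c5004f6cf510 does); the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed shape of a device identifier: "naa." followed by hex digits.
NAA_ID = re.compile(r"naa\.[0-9a-fA-F]+")


def count_device_ids(lines):
    """Count how many times each NAA identifier appears in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(NAA_ID.findall(line))
    return counts
```

Passing an open log file as `lines` and then calling `most_common(5)` on the result lists the five identifiers mentioned most often, which gives a starting point for the check under Storage Adapters.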

greco827
Expert

rjf7r, any update on what you found with the datastore with the naaID naa.5000c5004f6cf510 that was constantly flagging in the vmkernel logs?

rjf7r
Enthusiast

naaID naa.5000c5004f6cf510 is our local SATA boot disk, which is used essentially for two things:  booting the hypervisor, and hypervisor swap space.


It is more than half free.


Bob

greco827
Expert

Interesting.  Since it is a local disk I would suspect a possible driver issue.  What kind of driver and driver version are you using?

rjf7r
Enthusiast

Just the drivers that came with ESXi 5.1.  How would I determine the driver identity and version?

greco827
Expert

The driver in use varies with the HBA being used.  Some use Emulex, some use QLogic, etc., and within those families there are multiple driver types.

Check this article to help determine what you have, and then we can troubleshoot more effectively. 


VMware KB: Determining Network/Storage firmware and driver version in ESXi/ESX 4.x, ESXi 5.x and ESX...

rjf7r
Enthusiast

Our storage drivers appear (according to "esxcli storage core adapter list") to be identified as

8086:1d02 15d9:0637 and 8086:1d6b 15d9:0637.

Using the VMware Compatibility Guide: I/O Device Search, I don't get an exact match.
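
Each of the two ID pairs above reads as vendor:device followed by subvendor:subdevice, and as far as I know the I/O Device Search can filter on those four values separately (VID, DID, SVID, SSID). Here is a small Python sketch that splits a pair into those fields; the type and function names are made up for illustration.

```python
from typing import NamedTuple


class PciId(NamedTuple):
    vid: str   # vendor ID
    did: str   # device ID
    svid: str  # subsystem vendor ID
    ssid: str  # subsystem device ID


def parse_pci_ids(text):
    """Split a string like 'VID:DID SVID:SSID' into its four hex fields.

    Raises ValueError if the text does not contain exactly two
    colon-separated pairs.
    """
    main, sub = text.split()
    vid, did = main.split(":")
    svid, ssid = sub.split(":")
    return PciId(vid.lower(), did.lower(), svid.lower(), ssid.lower())
```

For "8086:1d02 15d9:0637" this gives VID 8086, DID 1d02, SVID 15d9 and SSID 0637. For reference, 8086 is Intel's PCI vendor ID and 15d9 is Supermicro's.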

It appears that our VIB dates back to system installation, 2013-04-23.

Not sure if this means anything.

Bob

greco827
Expert

What is the output of the commands "esxcfg-scsidevs -a" and "vmkload_mod -s HBADriver | grep Version"?

unsichtbare
Expert

Did you install ESXi with a vendor-supplied ISO (from Dell or HP) or with an ISO from VMware?

rjf7r
Enthusiast

We used the VMware ISO, since SuperMicro does not have a vendor-specific ISO.

(Since this was two years ago, is there any way to check from the running system?  I know that we did download the HP flavor since we were doing some testing on a separate piece of HP hardware before assembling our current system.)

rjf7r
Enthusiast

The output of esxcfg-scsidevs -a is attached (sorry that it's a photo of the console).


Running vmkload_mod -s HBADriver | grep Version for each of the three drivers yields:


ahci version 3.0-13vmw Build 799733

rste version 2.0.2.0088

iscsi_vmk version built on Aug 1, 2012

unsichtbare
Expert

You can see the image you used on the ESXi Status page of the vSphere C# Client.

rjf7r
Enthusiast

So I guess it's:

ESXi-5.1.0-799733-standard

Does that mean it was installed from a VMware-distributed iso?

greco827
Expert

OK, so that is a general availability release.  Have you since done any updates?  What is the current build version of your ESX server?

rjf7r
Enthusiast

How do I find the build version?  I assume that this is something different from the "Image Profile".

We've done Update 3, "VMware-VMvisor-Installer-5.1.0.update03-2323236.x86_64".

Bob

greco827
Expert

In the VI Client, the build number is shown at the top of the page when a host is selected.  Via the CLI, you can find it by running the command vmware -v.

BuildVersion.jpg
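
The build number and the image profile name are separate pieces of information, so it can be worth comparing them. Below is a rough Python sketch, assuming vmware -v prints a line of the form "VMware ESXi 5.1.0 build-NNNNNN" and that the profile name follows the pattern seen earlier in this thread (ESXi-5.1.0-799733-standard); both function names are made up for illustration.

```python
import re


def build_from_vmware_v(line):
    """Extract the build number from a 'vmware -v' style line, or None."""
    match = re.search(r"build-(\d+)", line)
    return int(match.group(1)) if match else None


def build_from_image_profile(name):
    """Extract the number embedded in a profile name like ESXi-5.1.0-799733-standard.

    Returns None if the name does not follow that assumed pattern.
    """
    match = re.match(r"ESXi-\d+\.\d+\.\d+-(\d+)-", name)
    return int(match.group(1)) if match else None
```

If the two numbers differ, either the host has been patched since the image named in the profile was built, or the profile name uses some other numbering. In both cases the value reported by vmware -v is the one that describes the running system.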
