My ESX Server 3.0 host had a problem two days ago.
My users were not able to connect to some of my virtual machines. When I checked VirtualCenter, one of the ESX servers in the HA and DRS cluster was showing a status of "not responding", and all of its virtual machines were showing as disconnected, although the virtual machines themselves were still online. I was unable to do anything, so in the end I restarted the ESX server, and after that everything was fine. My suspicion is that there may be an issue with the local storage.
How can I get the vmkcore dump files, and where exactly are they located? I can't find any mount point for this partition. I have pasted some of my logs below; please take a look and tell me why this occurred. Some people I asked said it may be due to an I/O timeout, but I'm not clear on that point.
Logs:
I have been looking at the log files of ESX06 to find out why it failed last night. In one of the vmkernel logs, I found that the host started having problems around 15:29 yesterday. The errors indicate that the problems are storage related. Here is the log entry where the errors begin:
Sep 20 15:29:40 gbswiesx06 vmkernel: 64:21:29:02.131 cpu0:1032)SCSI: 3753: AsyncIO timeout (5000); aborting cmd w/ sn 9464635, handle 6de48/0x3d2027a0
Basically the vmkernel is timing out talking to the disks. Then we get line after line of the following error, until the server is rebooted at approx 19:08.
Sep 20 15:37:52 gbswiesx06 vmkernel: 64:21:37:14.644 cpu6:1399)<4>lpfc0:0754:FPe:SCSI timeout Data: x3f07bce4 x98 x14 x6ba
Can I please ask that somebody on the storage team takes a look at the SAN / Switch & Error logs and report back with findings.
I will also look into why gbsvgpald01 blue screened, although given the above I am led to believe it was a side effect of the VM not being able to see its VMware files (vmdk, vmx, etc.).
Also, the following line
: 4052: Reclaimed timed out heartbeat
indicates that the host was still responding through the heartbeat function (HA only kicks in if a heartbeat fails). So HA knew that the host was online, but, by design, it did not know about the problems the host was having accessing storage, which is why the VM's weren't migrated.
I can confirm HA does work because when the host was rebooted, I watched the VM's get transferred onto other hosts.
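To check when the timeouts began and how many there were, without reading the log line by line, a small script can tally the two message types quoted above. This is only a rough sketch: the function name, the pattern names and the regular expressions are my own, derived from the two sample lines in this thread, and other HBA drivers or ESX builds may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived only from the two vmkernel lines quoted in this thread
# (assumption: other drivers or ESX versions may format them differently).
PATTERNS = {
    "asyncio_timeout": re.compile(r"AsyncIO timeout"),
    "hba_scsi_timeout": re.compile(r"lpfc\d+:\d+:\w+:SCSI timeout"),
}

# Syslog prefix, e.g. "Sep 20 15:29:40 gbswiesx06 vmkernel: ..."
PREFIX = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+vmkernel:")

# The two log lines from the posts above, used as sample input.
SAMPLE = [
    "Sep 20 15:29:40 gbswiesx06 vmkernel: 64:21:29:02.131 cpu0:1032)SCSI: 3753: "
    "AsyncIO timeout (5000); aborting cmd w/ sn 9464635, handle 6de48/0x3d2027a0",
    "Sep 20 15:37:52 gbswiesx06 vmkernel: 64:21:37:14.644 cpu6:1399)<4>lpfc0:0754:"
    "FPe:SCSI timeout Data: x3f07bce4 x98 x14 x6ba",
]


def summarize_timeouts(lines):
    """Count storage timeout messages and record when each type first appears."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        prefix = PREFIX.match(line)
        if not prefix:
            continue  # not a vmkernel syslog line
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                # Keep only the timestamp of the first matching line.
                first_seen.setdefault(name, prefix.group(1))
    return counts, first_seen


if __name__ == "__main__":
    counts, first_seen = summarize_timeouts(SAMPLE)
    for name in counts:
        print(f"{name}: {counts[name]} line(s), first at {first_seen[name]}")
```

To run it against a real log, pass an open file instead of SAMPLE, for example `summarize_timeouts(open(path))`, where `path` points to a copy of the vmkernel log. I would run it on a workstation against a copied log rather than on the service console, since the Python shipped with the ESX 3.0 console is most likely too old for this syntax.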
Have you tried opening an SR with VMware?
Problems like this can have many causes.
Hi,
After a failure you should always first run the "vm-support" script on the command line or export the diagnostic data from VC. All logs and dumps are then collected in a single file, which VMware Support will ask for.
Before restarting the whole host, you should try restarting the management services, which won't affect any running VM's:
service mgmt-vmware restart
service vmware-vpxa restart
The vmware-vpxa service communicates directly with VC. The mgmt-vmware service communicates with the VI Client and with the vmware-vpxa service.
In most cases this will make your server manageable again.
Michael
"that won't affect any running VM's"
FYI: With ESX 3.0.1 this is ONLY true if you have the corresponding patch installed!