I have three nodes in my cluster. One of them suddenly shows as "not responding."
When I restart the vmware-vpxa service, the host comes back online briefly and then goes right back offline. The vpxa.log is very busy, holding only about 5 minutes of data before it wraps (not sure if this is normal), but I don't see any glaringly obvious errors in it.
My /etc/hosts, /etc/sysconfig/network, and /etc/vmware/esx.conf files all look correct. However, when I run ./ft_gethostbyname from the /opt/vmware/aam/bin directory, it returns ft_gethostbyname(INIT) FAILED!
I'm sure this is part of the problem, but I don't know how to fix it. I'm stuck at the moment because six of my production servers are still running on this ESX node, despite the errors.
Any help greatly appreciated.
The vpxa agent suffered some kind of error and backtraced. This more than likely caused it to disconnect from VC.
Does that time match the disconnection? Check for the string "Backtrace" in all the vpxa logs on all servers. Does its logged time match the time the servers disconnect?
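A quick way to do that sweep from the service console (a sketch; the log path is the usual ESX 3.x default and may differ on your build):

```shell
# Sweep every vpxa log for backtraces and print a couple of lines of
# context, so the timestamps can be lined up against the times the
# hosts disconnected in VC.
LOGDIR=/var/log/vmware/vpx    # assumed default vpxa log location
for f in "$LOGDIR"/vpxa*.log; do
    [ -f "$f" ] || continue   # skip cleanly if no logs are present
    echo "== $f =="
    grep -n -B2 -A2 "Backtrace" "$f"
done
```

Since the logs wrap quickly, run this soon after a disconnect or the evidence may already be gone.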
From the vpxa log:
Increment master gen. no (67): VpxaAlarm::CheckAlarmStatus
Unhandled exception: Not initialized
Some errors in the VC logs around the same time:
UpdateRuntimeData: Invalid managed entity status 93353016
UpdateRuntimeData: Invalid managed entity status 93346240
.......
-- FINISH task-internal-2 -- -- ScheduledTaskManager
VpxLro::LroMain took 773676173 ms
Also a DB error (not sure if it's related):
2008-08-28 13:57:42.725 'App' 5796 error VpxdMoHost::CollectRemote database error while flushing stats data: "ODBC error: () - " is returned when executing SQL statement "{ call load_stats_proc(?, ?, ?, ?, ?, ?) }"
That ScheduledTaskManager task took 773676173 ms! That can't be right.
Both the backtrace on the ESX host and the errors on the VC seem to indicate a problem checking alarm status.
First check the rest of the vpxa logs to see if the backtrace corresponds to the disconnect times.
Also, attach/paste the hostd log from the same time. They're stored in /var/log/vmware/ and start with hostd.
You may want to consider opening a Support Request with VMware support.
vmkping doesn't miss a beat on either server.
I checked the vpxa files but they do not go back any further than the one I posted. They've been overwritten.
The hostd logs do seem to show a problem.
At 13:45:10 my junior tech connected to VirtualCenter via the web interface and hit the RESET button on the MAIL02 virtual server, which was running on ESX2. Then the log loops (full log attached as esx2.hostd.log). The server eventually appeared in my VirtualCenter as (not responding).
Hw info file: /etc/vmware/hostd/hwInfo.xml
Task Created : haTask-80-vim.VirtualMachine.reset-23032
Event 92 : Mail02 on esx2.netsource-one.local in ha-datacenter is reset
State Transition (VM_STATE_ON -> VM_STATE_RESETTING)
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : open successful (17) size = 18269395968, hd = 0. Type 3
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : closed.
Disconnect check in progress: /vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02.vmx
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : open successful (17) size = 18269395968, hd = 0. Type 3
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail02/Mail02-flat.vmdk" : closed.
Event 93 : Mail02 on esx2.netsource-one.local in ha-datacenter is powered on
State Transition (VM_STATE_RESETTING -> VM_STATE_ON)
Task Completed : haTask-80-vim.VirtualMachine.reset-23032
Received a duplicate transition from foundry: 1
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23036
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23036
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23040
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23040
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23042
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23042
Ticket issued for mks connections to user: vpxuser
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23044
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-23044
etc....
Then, in an attempt to recreate the problem, I did the same thing to the MAIL03 virtual server, which was hosted on ESX1. The same problem occurred (full log attached as esx1.hostd.log):
Task Created : haTask-64-vim.VirtualMachine.reset-59360
Event 192 : Mail03 on esx1.netsource-one.local in ha-datacenter is reset
State Transition (VM_STATE_ON -> VM_STATE_RESETTING)
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : open successful (17) size = 13860645888, hd = 0. Type 3
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : closed.
Disconnect check in progress: /vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03.vmx
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : open successful (17) size = 13860645888, hd = 0. Type 3
DISKLIB-VMFS : "/vmfs/volumes/482aeccf-346c6aa7-db28-001d092b6b6f/Mail03/Mail03-flat.vmdk" : closed.
Event 193 : Mail03 on esx1.netsource-one.local in ha-datacenter is powered on
State Transition (VM_STATE_RESETTING -> VM_STATE_ON)
Task Completed : haTask-64-vim.VirtualMachine.reset-59360
Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59363
Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59363
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59364
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59364
Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59365
Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59365
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59366
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59366
Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59367
Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59367
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59368
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59368
Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59369
Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59369
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59370
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59370
Task Created : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59371
Task Completed : haTask-ha-root-pool-vim.ResourcePool.updateConfig-59371
Task Created : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59372
State Transition (VM_STATE_ON -> VM_STATE_RECONFIGURING)
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Task Completed : haTask-pool0-vim.ResourcePool.updateChildResourceConfiguration-59372
.....looping from there.
To bring them both back, I had to restart the vpxa service on the host, which brought it online in VC long enough to initiate a maintenance mode command; that VMotioned the live VMs off to another node. Then I rebooted the host and it came back online.
Ok, time for an update from me...
After getting our support team to analyse the VC logs etc., I noticed that the person who built the host had forgotten to reboot it after changing the SC memory to 800MB (so it was still running at 200MB).
Corrected this and haven't had an instance of the 'not responding' issue in the last 5 hours.
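For anyone else chasing this: you can confirm the memory the service console is actually running with (as opposed to the configured value) from inside the SC itself. If the configured value was changed but the host was never rebooted, the two won't match. A minimal check:

```shell
# From the ESX service console: report the SC's current total memory.
# If this still shows the old allocation after you've configured 800MB,
# the host still needs a reboot for the change to take effect.
free -m | awk 'NR==2 {print "SC memory (MB):", $2}'
```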
Aside from this, the support guy did notice the logs were overflowing and wrapping very quickly because of one particular VM with lots of entries like the following:
State Transition (VM_STATE_RECONFIGURING -> VM_STATE_ON)
Similar to yours perhaps?
Jon
Thanks for the input, Jon. I have definitely rebooted since changing the SC to 800MB. In fact, prior to my reload of ESX, I had not even altered the SC memory and was still having similar problems. I suspect I'm having a problem with my VirtualCenter, but I'm not sure at this point. It might also be reloaded soon if we can't find anything else.
The hostd logs are full of RECONFIGURING events, which is not normal.
(VM_STATE_ON -> VM_STATE_RECONFIGURING)
(VM_STATE_RECONFIGURING -> VM_STATE_ON)
Does the problem go away when you disable DRS and remove all resource pools?
This appears to be a known issue. You should open a support request with vmware support.
The problem DOES go away with the DRS removed.
My VMware environment is NFR (and non-production, used for training/testing). I don't believe I can open a case with VMware.
What appears to be happening is that all these events (which shouldn't be happening) are consuming all the resources of the vpxa agent.
The vpxa agent therefore cannot heartbeat the VC server, so the host shows as "not responding" in VC.
I believe VMware is working on a patch, so keep an eye out for new patches/releases. In the meantime you may want to keep DRS disabled.
Hmm, that's a big leap from a random problem to it being a known issue with a potential patch on the way. Disabling DRS isn't an option for me, and I'm looking to go into production with U2 the week after next... am I doing the right thing, I wonder?
Jon.
Well, since these are training machines, could you quickly reload everything in a couple of hours instead of going through this long troubleshooting procedure? All the possibilities have been tried without a fix, so why not test with fresh rebuilds?
1. Download the latest packages and builds for both VC 2.5 U2 and ESX 3.5 U2.
2. Back up your VirtualCenter database.
3. Create a new virtual machine, install the VC server components on it, and point them at the existing database location/credentials.
4. Migrate all VMs to other ESX hosts, reinstall a fresh ESX 3.5 U2, and use Update Manager to patch it.
5. Configure all TCP/IP networking and LUN pieces.
6. Migrate the VMs back to the ESX 3.5 U2 host, then do the same for the remaining ESX hosts.
7. Create a new cluster, enable HA, DRS and VMotion, and drag and drop the ESX hosts into it.
8. Check to see if you still have any problems.
If you want to continue troubleshooting:
1. Make sure static DNS entries are in place.
2. Make sure your root partition is not full.
3. Check your database to make sure it isn't full.
4. Restart the vpxa and hostd processes.
5. Completely remove the VPX agent and re-add your host.
6. Disconnect and remove all your ESX hosts and create a brand new cluster.
7. Add each ESX host to the new cluster one by one.
8. Verify you are using the latest patches for your ESX hosts.
9. Check all the logs for any specific issues and tackle it from there.
10. Time to call VMware Support, since this is mind-boggling. I'm surprised that reloading ESX 3.5 U2 from scratch didn't work; there is definitely an issue on your end. Check your networking configuration.
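The first few checks above can be run quickly from the service console. A rough, read-only sketch (the service names are the usual ESX 3.x ones; the agent restarts are shown commented out so the script itself changes nothing):

```shell
# 1. Static name resolution: the host and VC names should resolve,
#    ideally with static entries in /etc/hosts as a fallback to DNS.
grep -v '^#' /etc/hosts

# 2. Root partition not full: anything near 100% use needs cleanup.
df -h /

# 4. Restarting the management agents (run these on the ESX host
#    itself; commented out here so this check stays read-only):
# service mgmt-vmware restart      # hostd
# service vmware-vpxa restart      # vpxa (the VC agent)
```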
Regards,
Stefan Nguyen
iGeek Systems Inc.
VMware, Citrix, Microsoft Consultant
AppK
I have to agree with Jon. Don't get me wrong... I GREATLY appreciate your help! But if this is indeed a known issue, can you offer an article that states this, so I can track it closely and know when a fix is ready? I'm concerned I will come across this with any new installs for my clients.
Azn2kew
All ESX nodes were already rebuilt with no change. I'm already rebuilding the VC currently and will post the results shortly.
I'm not aware of the full details of the issue or how it's triggered; i.e., it could be something specific in the config that's causing it.
If you delete the cluster and then recreate it, does the problem recur?
If not, does it recur if you remove and then reinstall VC, wiping the database?
I think it's a good idea to do the full reinstall of VC and ESX with a fresh database to see if it recurs.
I don't think there is an article on this, or at least none that I have found.
The reason I think it's a known issue is that a colleague of mine has come across this same sort of behavior and is investigating it with VMware support.
Naturally it's possible that I'm incorrect and that there is something specific about your cluster or hosts that is causing this problem.
Within hostd.log 1, the ESX host process is trying to send the updated VM resource pool information back to the calling process, which was a web session outside the managed VC process, and it of course dies with a broken-pipe message, since the web session does not have VC behind it. So I am not sure this is a practical way to reproduce the fault; that said, this behavior is a bug that should be corrected. I don't think it is the cause, though, more an interruption of the normal process flow.
What does strike me as significant is the ODBC call failure. This is important because the ESX host needs to send the collected vmdb stats back to the VC database at regular intervals.
I would have a closer look at the VC database's health and performance.
Well....
I have completed the reinstallation and reconfiguration of VirtualCenter, and it seems to have done the trick. I can now shut down/restart a virtual machine from the console AND from the web interface.
Time will tell if it holds true.
Thanks to everyone for the input on this problem. I'll update this thread if a problem resurfaces.