I lost two hosts in my ESXi 5.0 enterprise cluster a few days ago, and found the solution to a related issue in this thread here: https://communities.vmware.com/thread/464597
However, another host in that same cluster now has the same problem. That makes two blades and one physical server, all in the same cluster and all on ESXi 5.0 with the standard VMware build, that have gone down within a few days.
vmkernel.log on each of the offending hosts is full of this:
---
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 12, /vmfs/devices/char/vob/VM type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 13, /vmfs/devices/char/vob/External type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 14, /vmfs/devices/char/vob/iScsi type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 15, /vmfs/devices/char/vob/Migrate type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 16, /vmfs/devices/char/vob/PageReti type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 17, /vmfs/devices/char/vob/Visorfs type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 18, /vmfs/devices/char/vob/Hardware type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 19, /vmfs/devices/char/vob/Vfat type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 3232: Unimplemented operation on 0x4100233874b0/SOCKET_UNIX_SERVER
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 20, /var/run/vmware/vobd-user-ctx.s type SOCKET_UNIX_SERVER: Not implemented
Once that happens, the host cannot be reconnected to the cluster until I make the vpxa.cfg edits from the thread I linked above. Obviously this is not an acceptable solution, because the host stays down until I am around to make the edits, restart the services, and reconnect the host.
Why is this happening, and how can I further troubleshoot it?
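As a first troubleshooting step, it may help to tally the crossdup warnings so the same summary can be compared across the three hosts. The following is a minimal sketch in Python, assuming a copy of vmkernel.log has been pulled off the host and that every warning follows the line format shown in the excerpt above. The function name summarize_crossdup and the regular expression are my own illustration, not anything shipped by VMware.

```python
import re
from collections import Counter

# Pattern written against the excerpt in this post, e.g.:
# 2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 12, /vmfs/devices/char/vob/VM type CHAR: Busy
# Assumption: other builds may format these lines differently.
CROSSDUP = re.compile(
    r"^(?P<ts>\S+) cpu\d+:(?P<world>\d+)\)WARNING: UserObj: \d+: "
    r"Failed to crossdup fd (?P<fd>\d+), (?P<path>\S+) "
    r"type (?P<type>\S+): (?P<reason>.+)$"
)


def summarize_crossdup(lines):
    """Return (timestamp of first match, Counter keyed by (world ID, path, reason))."""
    counts = Counter()
    first_seen = None
    for line in lines:
        match = CROSSDUP.match(line.strip())
        if match is None:
            continue  # not a crossdup warning; ignore
        if first_seen is None:
            first_seen = match.group("ts")
        key = (match.group("world"), match.group("path"), match.group("reason"))
        counts[key] += 1
    return first_seen, counts


# Two lines copied from the excerpt above, used only as a smoke test.
SAMPLE = [
    "2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 12, /vmfs/devices/char/vob/VM type CHAR: Busy",
    "2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 20, /var/run/vmware/vobd-user-ctx.s type SOCKET_UNIX_SERVER: Not implemented",
]

first_seen, counts = summarize_crossdup(SAMPLE)
for (world, path, reason), n in counts.most_common():
    print(world, path, reason, n)
```

For a real log, pass the open file object instead of SAMPLE. The first-match timestamp gives a rough starting point for lining the failure up with other events on the host, and the world ID (3200 in the excerpt) shows whether a single process is responsible for all of the failures.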
Marco's questions from another thread:
- Have you changed anything on the cluster?
- Added HA/DRS, created more VMs, etc?
- What is your VM growth rate per month?
- Have you changed log/statistics settings for vCenter?
- Can you check whether you have a lot of snapshots in the environment? (Over SSH, run: find /vmfs/volumes/ -name "*delta*")
Answers:
- No, nothing has changed on the cluster.
- No changes to HA/DRS, and no new VMs in the last month.
- Not sure off the top of my head, but my cluster is running at about 50% of its total potential.
- I have not changed logs/stats (or anything else).
- There are actually no snapshots at all on the environment.
Look, this seems like you are crossing some limit on the ESX server side. If you are only seeing this now, it means something changed, and you need to think about what the cause may be. It is very difficult to find out without more accurate data or access to the problematic hosts. I would suggest checking the configuration maximums document http://www.vmware.com/pdf/vsphere5/r50/vsphere-50-configuration-maximums.pdf, mainly for:
- Max number of LUNs and paths presented
- Max number of VMs per host/cluster
Beyond this, only a live look at the hosts, or a review of the log bundle from one of the affected hosts, could help. The problem really seems to be a consequence of something bigger.
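The limit check suggested above can be sketched as a small Python helper that flags any metric sitting close to its documented maximum. This is only an illustration: the function name over_limit, the metric names, and the 90% warning threshold are my own choices. The limit values are what I believe the vSphere 5.0 configuration maximums document lists per host, so confirm them against the PDF before relying on them. The observed numbers are made-up placeholders, not data from the affected hosts.

```python
def over_limit(observed, limits, warn_ratio=0.9):
    """Return (name, observed, limit) for each metric at or above warn_ratio of its limit."""
    flagged = []
    for name, limit in limits.items():
        value = observed.get(name)
        if value is None:
            continue  # metric not collected; skip rather than guess
        if value >= warn_ratio * limit:
            flagged.append((name, value, limit))
    return flagged


# Assumption: per-host values as I recall them from the 5.0 maximums PDF.
# Verify each one against the document linked above.
ESXI_50_LIMITS = {
    "luns_per_host": 256,
    "paths_per_host": 1024,
    "vms_per_host": 512,
}

# Placeholder numbers for illustration only; replace with counts from a real host.
observed = {"luns_per_host": 250, "paths_per_host": 400}

for name, value, limit in over_limit(observed, ESXI_50_LIMITS):
    print(name, value, "of", limit)
```

Running the same check against each of the three affected hosts, and against a healthy host in the same cluster, should show quickly whether any of them is near a documented maximum.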
