VMware Cloud Community
Seventh77
Enthusiast

Multiple host disconnects with "failed to crossdup fd xxx" errors in vmkernel.log

I lost two hosts in my ESXi 5.0 enterprise cluster a few days ago, and found the solution to a related issue in this thread here: https://communities.vmware.com/thread/464597

However, another host in that same cluster now has the same problem. That makes two blades and one physical server, all in the same cluster and all running the standard VMware build of ESXi 5.0, that have gone down within a few days.

vmkernel.log on each of the offending hosts is full of this:

---

2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 12, /vmfs/devices/char/vob/VM type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 13, /vmfs/devices/char/vob/External type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 14, /vmfs/devices/char/vob/iScsi type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 15, /vmfs/devices/char/vob/Migrate type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 16, /vmfs/devices/char/vob/PageReti type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 17, /vmfs/devices/char/vob/Visorfs type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 18, /vmfs/devices/char/vob/Hardware type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 19, /vmfs/devices/char/vob/Vfat type CHAR: Busy
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 3232: Unimplemented operation on 0x4100233874b0/SOCKET_UNIX_SERVER
2013-12-02T15:05:29.561Z cpu14:3200)WARNING: UserObj: 675: Failed to crossdup fd 20, /var/run/vmware/vobd-user-ctx.s type SOCKET_UNIX_SERVER: Not implemented
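For anyone comparing these warnings across several hosts, here is a minimal Python sketch that pulls the "Failed to crossdup" lines out of a copied vmkernel.log and tallies them. The regular expression is modeled only on the lines quoted above, and the function names (parse_crossdup, summarize) are made up for illustration. It assumes the number after "cpuNN:" is the world ID, and it is meant to run on a workstation against a copy of the log, not as anything VMware ships.

```python
import re
from collections import Counter

# Pattern modeled on the warning lines quoted in this post (an assumption);
# other vmkernel.log message formats are simply ignored.
CROSSDUP_RE = re.compile(
    r"^(?P<ts>\S+)\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)WARNING: UserObj: \d+: "
    r"Failed to crossdup fd (?P<fd>\d+), (?P<path>\S+) type (?P<type>\S+): "
    r"(?P<reason>.+)$"
)


def parse_crossdup(lines):
    """Return one dict per 'Failed to crossdup' warning found in lines."""
    events = []
    for line in lines:
        match = CROSSDUP_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def summarize(events):
    """Count warnings per (world, reason) pair, e.g. ('3200', 'Busy')."""
    return Counter((e["world"], e["reason"]) for e in events)


if __name__ == "__main__":
    # Hypothetical local copy of the host's log file.
    with open("vmkernel.log") as handle:
        for key, count in summarize(parse_crossdup(handle)).most_common():
            print(key, count)
```

Grouping by world and reason makes it easier to see whether one process is responsible for all of the failures and whether the counts grow over time.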

Once that happens, the host cannot be reconnected to the cluster until I make the vpxa.cfg edits from the thread I linked above. Obviously this is not an acceptable solution, because the host stays down until I am around to make the edits, restart the services and reconnect the host.

Why is this happening, and how can I further troubleshoot it?

3 Replies
Seventh77
Enthusiast

Marco's questions from another thread:

- Have you changed anything on the cluster?

- Added HA/DRS, created more VMs, etc?

- What is your VM growth rate per month?

- Have you changed log/statistics settings for vCenter?

- Can you check whether you have a lot of snapshots in the environment? (over SSH, run "find /vmfs/volumes/ -name '*delta*'")
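For reference, the snapshot check in the last question can also be sketched in Python, which is handy if you want to collect the results instead of reading them off the console. This is only a rough equivalent of the find command: find_delta_files is a made-up name, and whether a usable Python interpreter is available on the host is an assumption.

```python
import fnmatch
import os


def find_delta_files(root, pattern="*delta*"):
    """Return sorted paths under root whose base name matches pattern.

    Like `find ROOT -name PATTERN`, both files and directories are checked.
    """
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Path taken from the find command in the question above.
    for path in find_delta_files("/vmfs/volumes/"):
        print(path)
```

An empty result would match the "no snapshots" answer, at least for the datastores visible to that host.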

Answers:

- No, nothing has changed on the cluster.

- No changes to HA/DRS, and no new VMs in the last month.

- Not sure off the top of my head, but my cluster is running at about 50% of its total capacity.

- I have not changed logs/stats (or anything else).

- There are actually no snapshots at all in the environment.

marcelo_soares
Champion

Look, this seems like you are hitting some limit on the ESX server side. If you are only seeing this now, it means something changed, and you need to think about what the cause may be. It is very difficult to find out without more accurate data or access to the problematic hosts. I would suggest checking the configuration maximums document http://www.vmware.com/pdf/vsphere5/r50/vsphere-50-configuration-maximums.pdf, mainly for:

- Max number of LUNs and paths presented

- Max number of VMs per host/cluster

Besides this, only a live look at the hosts, or a check of the log bundle from one of the affected hosts, could help. The problem really seems to be a consequence of something bigger.

Marcelo Soares
aakalan
Enthusiast
