VMware Cloud Community
fabio_brizzolla
Enthusiast

NFS APD persists even after update applied

Good morning guys,

We upgraded our ESXi hypervisor from 5.1 to 5.5 U1 last weekend. I read about the NFS APD bug, downloaded the referenced patch, and applied it last night:

~ # esxcli software vib install -d "/vmfs/volumes/4f27d555-e55efb08-0da4-d4ae52723fbc/ESXi550-201407001.zip"

Installation Result

   Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.

   Reboot Required: true

   VIBs Installed: VMware_bootbank_esx-base_5.5.0-1.28.1892794, VMware_bootbank_lsi-mr3_0.255.03.01-2vmw.550.1.16.1746018, VMware_bootbank_lsi-msgpt3_00.255.03.03-1vmw.550.1.15.1623387, VMware_bootbank_misc-drivers_5.5.0-1.28.1892794, VMware_bootbank_mtip32xx-native_3.3.4-1vmw.550.1.15.1623387, VMware_bootbank_net-e1000e_1.1.2-4vmw.550.1.15.1623387, VMware_bootbank_net-igb_5.0.5.1.1-1vmw.550.1.15.1623387, VMware_bootbank_net-tg3_3.123c.v55.5-1vmw.550.1.28.1892794, VMware_bootbank_rste_2.0.2.0088-4vmw.550.1.15.1623387, VMware_bootbank_sata-ahci_3.0-18vmw.550.1.15.1623387, VMware_bootbank_scsi-megaraid-sas_5.34-9vmw.550.1.28.1892794, VMware_bootbank_scsi-mpt2sas_14.00.00.00-3vmw.550.1.15.1623387, VMware_locker_tools-light_5.5.0-1.28.1892794

   VIBs Removed: VMware_bootbank_esx-base_5.5.0-0.0.1331820, VMware_bootbank_lsi-mr3_0.255.03.01-1vmw.550.0.0.1331820, VMware_bootbank_lsi-msgpt3_00.255.03.03-1vmw.550.0.0.1331820, VMware_bootbank_misc-drivers_5.5.0-0.0.1331820, VMware_bootbank_mtip32xx-native_3.3.4-1vmw.550.0.0.1331820, VMware_bootbank_net-e1000e_1.1.2-4vmw.550.0.0.1331820, VMware_bootbank_net-igb_2.1.11.1-4vmw.550.0.0.1331820, VMware_bootbank_net-tg3_3.123c.v55.5-1vmw.550.0.0.1331820, VMware_bootbank_rste_2.0.2.0088-4vmw.550.0.0.1331820, VMware_bootbank_sata-ahci_3.0-17vmw.550.0.0.1331820, VMware_bootbank_scsi-megaraid-sas_5.34-9vmw.550.0.0.1331820, VMware_bootbank_scsi-mpt2sas_14.00.00.00-3vmw.550.0.0.1331820, VMware_locker_tools-light_5.5.0-0.0.1331820

After rebooting the host and remounting our NFS datastore (used to back up VMs with ghettoVCB.sh), the APD issues persist, as I confirmed in /var/log/vobd.log this morning:

2014-07-10T11:00:58.757Z: [APDCorrelator] 46073957136us: [vob.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-10T11:00:58.757Z: [APDCorrelator] 46073957584us: [esx.problem.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-10T11:02:45.898Z: No correlator for vob.vmfs.nfs.server.disconnect

2014-07-10T11:02:45.898Z: [vmfsCorrelator] 46181098984us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.100.83 /mnt/HD/HD_a2/VMBACKUP a643d5cd-6c9ea269-0000-000000000000 D-Link

2014-07-10T11:03:18.758Z: [APDCorrelator] 46213958621us: [vob.storage.apd.timeout] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

2014-07-10T11:03:18.758Z: [APDCorrelator] 46213959037us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.

2014-07-10T11:07:40.806Z: [APDCorrelator] 46476006085us: [vob.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-10T11:07:40.806Z: No correlator for vob.vmfs.nfs.server.restored

2014-07-10T11:07:40.806Z: [APDCorrelator] 46476006585us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-10T11:07:40.806Z: [vmfsCorrelator] 46476006474us: [esx.problem.vmfs.nfs.server.restored] 192.168.100.83 /mnt/HD/HD_a2/VMBACKUP a643d5cd-6c9ea269-0000-000000000000 D-Link
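For reference, this is how I'm pulling the APD/NFS events out of the log and reading the host's APD handling settings (read-only checks; the 140 seconds in the timeout message matches the Misc.APDTimeout default):

# show recent APD / NFS server events from vobd.log
~ # grep -iE 'apd|nfs.server' /var/log/vobd.log | tail -n 20
# read (not change) the host's APD handling settings
~ # esxcli system settings advanced list -o /Misc/APDHandlingEnable
~ # esxcli system settings advanced list -o /Misc/APDTimeout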

So, does anyone have a clue about this?

Thanks in advance.

9 Replies
fabio_brizzolla
Enthusiast

Guys, please....

As I'm writing this message I'm watching my SSH session, and this just appeared in /var/log/vobd.log:

2014-07-12T11:49:22.325Z: [APDCorrelator] 221777525452us: [vob.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T11:49:22.325Z: [APDCorrelator] 221777525882us: [esx.problem.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T11:49:28.827Z: [APDCorrelator] 221784027308us: [vob.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T11:49:28.827Z: [APDCorrelator] 221784027603us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

One little thing about that... the entry in vobd.log happened at the same time as this:

2014-07-12 11:36:07 -- info: Initiate backup for VM26_REP

2014-07-12 11:36:07 -- info: Creating Snapshot "ghettoVCB-snapshot-2014-07-12" for VM26_REP

Destination disk format: VMFS thin-provisioned

Cloning disk '/vmfs/volumes/datastore1 (1)/VM26_REP/VM26_REP.vmdk'...

Clone: 99% done.

2014-07-12 11:49:05 -- info: Removing snapshot from VM26_REP ...

rm: can't remove '/vmfs/volumes/a643d5cd-6c9ea269/VM26_REP/VM26_REP-2014-07-03_14-46-29/VM26_REP-flat.vmdk': No such file or directory

2014-07-12 11:49:25 -- info: Slept 1 seconds to work around NFS I/O error

2014-07-12 11:49:25 -- info: Backup Duration: 13.30 Minutes

2014-07-12 11:49:25 -- info: Successfully completed backup for VM26_REP!

Edit #1: The APD is happening randomly, not specifically during snapshot operations. I just caught an APD in the middle of a cloning operation:

2014-07-12 12:06:15 -- info: Initiate backup for VM11_Automatix

2014-07-12 12:06:15 -- info: Creating Snapshot "ghettoVCB-snapshot-2014-07-12" for VM11_Automatix

Destination disk format: VMFS thin-provisioned

Cloning disk '/vmfs/volumes/datastore1 (1)/VM11_Automatix/VM11_Automatix.vmdk'...

Clone: 23% done.

2014-07-12T12:29:34.032Z: [APDCorrelator] 224189232456us: [vob.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:29:34.032Z: [APDCorrelator] 224189232867us: [esx.problem.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:29:40.856Z: [APDCorrelator] 224196056810us: [vob.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T12:29:40.856Z: [APDCorrelator] 224196057189us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T12:31:22.163Z: [APDCorrelator] 224297363767us: [vob.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:31:22.163Z: [APDCorrelator] 224297364228us: [esx.problem.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:31:28.853Z: [APDCorrelator] 224304053611us: [vob.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T12:31:28.853Z: [APDCorrelator] 224304053991us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T12:32:10.225Z: [APDCorrelator] 224345426115us: [vob.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:32:10.226Z: [APDCorrelator] 224345426551us: [esx.problem.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:32:16.869Z: [APDCorrelator] 224352070095us: [vob.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T12:32:16.870Z: [APDCorrelator] 224352070491us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T12:34:34.393Z: [APDCorrelator] 224489593639us: [vob.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:34:34.393Z: [APDCorrelator] 224489594068us: [esx.problem.storage.apd.start] Device or filesystem with identifier [a643d5cd-6c9ea269] has entered the All Paths Down state.

2014-07-12T12:34:40.861Z: [APDCorrelator] 224496061281us: [vob.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

2014-07-12T12:34:40.861Z: [APDCorrelator] 224496061654us: [esx.clear.storage.apd.exit] Device or filesystem with identifier [a643d5cd-6c9ea269] has exited the All Paths Down state.

According to KB 2077360, this should be fixed in esx-base 5.5.0-1.18.1881737. I'm running esx-base 5.5.0-1.28.1892794!
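For anyone who wants to double-check the same thing, these are the commands I used to confirm which build and esx-base VIB are actually loaded after the reboot:

# show ESXi version, build and update level
~ # vmware -vl
# show the installed esx-base VIB (name, version, install date)
~ # esxcli software vib get -n esx-base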


Please, any help will be appreciated.


Regards.

fabio_brizzolla
Enthusiast

For God's sake... no one?!

APD.png

SG1234
Enthusiast

Are you sure it is an ESXi OS-related thing? Is it possible that one of the NICs in the NFS port group is acting up? Try disconnecting one NIC at a time from the vSwitch and see if the error repeats.
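If it helps, something along these lines from the ESXi shell would let you pull one uplink at a time while you watch vobd.log (just an example; substitute your actual vmnic and vSwitch names):

# list the physical NICs on the host
~ # esxcli network nic list
# temporarily remove one uplink from the vSwitch carrying NFS traffic
~ # esxcli network vswitch standard uplink remove --uplink-name=vmnic1 --vswitch-name=vSwitch0
# run the backup / watch /var/log/vobd.log, then add the uplink back
~ # esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0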

~Sai Garimella

fabio_brizzolla
Enthusiast

SG1234 wrote:

Are you sure it is an ESXi OS-related thing? Is it possible that one of the NICs in the NFS port group is acting up? Try disconnecting one NIC at a time from the vSwitch and see if the error repeats.

~Sai Garimella

Hi Sai Garimella,

I'm pretty sure this is an ESXi-related thing because I have this identical infrastructure replicated at another site running ESXi 5.0.0-623860, with the same little NAS (D-Link DNS-320L) for VM backups, and there's no trace of APDs in vobd.log.

The moment we upgraded to ESXi 5.5.0 (and patched it to fix the APD bug), we started getting APDs here... and not there. Same backup script. Same NAS equipment. Same servers. Same cabling infrastructure. Same everything.

The APDs are intermittent, causing the backup to take A LOT MORE time than expected (it jumped from 12 hours to almost 72 hours).

I'm OK with the fact that VMware won't give a shit about this issue... sadly.

JPM300
Commander
(Accepted Solution)

Hey fabio_brizzolla,

The first thing I found is that the D-Link DNS-320 is not on VMware's Hardware Compatibility List, although some of the other models are; unless I missed something, I couldn't find it, which means the setup could show unexpected behavior.

With that said, your other setups that are identical are all running fine on 5.0 Update x, and it's just the 5.5 installation that you are having problems with?

fabio_brizzolla
Enthusiast

JPM300 wrote:

Hey fabio_brizzolla,

The first thing I found is that the D-Link DNS-320 is not on VMware's Hardware Compatibility List, although some of the other models are; unless I missed something, I couldn't find it, which means the setup could show unexpected behavior.

With that said, your other setups that are identical are all running fine on 5.0 Update x, and it's just the 5.5 installation that you are having problems with?

Yes, they're running fine on 5.0. We'll consider demoting these D-Link DNS-320Ls if nothing can be done (maybe there's an advanced/hidden configuration to avoid the APDs). But that'll suck.
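If it does come down to tuning, these are the NFS heartbeat settings I'd start by reading on the host (read-only for now; I'm not suggesting specific values):

# current NFS heartbeat settings on the host
~ # esxcli system settings advanced list -o /NFS/HeartbeatFrequency
~ # esxcli system settings advanced list -o /NFS/HeartbeatTimeout
~ # esxcli system settings advanced list -o /NFS/HeartbeatMaxFailures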

Thanks for your insight.

JPM300
Commander

No problem. I'm not saying it won't work or can't; I'm just saying that since it's not on the HCL it could be problematic or show unexpected or unexplainable behavior. With that said, you can continue troubleshooting it, seeing as you have it working on 5.0, but if you eventually hit a wall and call VMware support it will just be best-effort.

How do you have NFS set up on that host? Are there 2 physical NICs on the vSwitch/port group/network, or just 1?

The first thing I would do is simplify the setup: create a new vSwitch with 1 physical NIC in it and the NFS port group, hook only that one connection to the D-Link NAS, and try to make it fail to see if you still get the APD error messages. If you do, try different NICs, cables, or a spare NAS if you have one, etc. If you are still getting the APD after that, I would call it a day and chalk it up to incompatibility. Remember, with NFS VMware is still using its native storage connections, and maybe the NAS is trying to do something the 5.5 VMkernel doesn't like or expect to see.
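Roughly something like this from the ESXi shell would stand up that isolated test path (just a sketch; the vmnic, vmk number, IP, and names are placeholders, and only the NAS IP and export are taken from your logs):

# new vSwitch with a single uplink and a dedicated NFS port group
~ # esxcli network vswitch standard add --vswitch-name=vSwitchNFS-test
~ # esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitchNFS-test
~ # esxcli network vswitch standard portgroup add --portgroup-name=NFS-test --vswitch-name=vSwitchNFS-test
# VMkernel interface for the test (pick a free IP on the NAS subnet)
~ # esxcli network ip interface add --interface-name=vmk9 --portgroup-name=NFS-test
~ # esxcli network ip interface ipv4 set --interface-name=vmk9 --ipv4=192.168.100.90 --netmask=255.255.255.0 --type=static
# remount the D-Link export over the isolated path
~ # esxcli storage nfs add --host=192.168.100.83 --share=/mnt/HD/HD_a2/VMBACKUP --volume-name=VMBACKUP-test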

I hope this has been helpful.

fabio_brizzolla
Enthusiast

How do you have NFS set up on that host? Are there 2 physical NICs on the vSwitch/port group/network, or just 1?

It's a very simple setup: our PowerEdge servers are connected directly to the same PowerConnect 6200 switches where the DNS-320L sits. Both pNICs are bound to vSwitch0 as Active/Active.
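For completeness, this is what I run to double-check the teaming/failover configuration on that vSwitch (read-only; it just reports the policy and active uplinks):

# show load-balancing / failover policy and uplinks on vSwitch0
~ # esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0
~ # esxcli network vswitch standard list --vswitch-name=vSwitch0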

The first thing I would do is simplify the setup: create a new vSwitch with 1 physical NIC in it and the NFS port group, hook only that one connection to the D-Link NAS, and try to make it fail to see if you still get the APD error messages. If you do, try different NICs, cables, or a spare NAS if you have one, etc. If you are still getting the APD after that, I would call it a day and chalk it up to incompatibility. Remember, with NFS VMware is still using its native storage connections, and maybe the NAS is trying to do something the 5.5 VMkernel doesn't like or expect to see. I hope this has been helpful.

We already tried connecting the NAS directly to the server pNIC in a separate vSwitch... same thing. Thanks anyway, JP 😉

JPM300
Commander

Figured I would mention this as I just came across it the other day. It may apply to your case:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=207639...
