VMware Cloud Community
michalhoppe
Contributor

Stateless caching to USB problems with ESXi 5.5U1 on Cisco UCS

Hi everyone,

I am setting up a vSphere 5.5 environment in the lab that is to use stateless caching to USB.  I am under the gun to get this into production yesterday.

The hardware used is a Cisco UCS B200-M3 blade with B200M3.2.2.2.0.04282014643 firmware and a Cisco-provided 4GB USB disk.  There are 4 Fibre Channel LUNs presented, but these are meant to be data only.

I have been able to validate that the USB disk works by manually installing ESXi to it, and booting from it.

The short version:

We have the vCenter/Auto Deploy/DNS/DHCP/TFTP infrastructure validated and working.  Auto Deploy rules are working.  Applying the Host Profile with "Enable stateless caching to a USB disk on the host" fails with an error that the cache does not meet specification and that the host needs to be rebooted.  Rebooting once or many times does not resolve the error.  The USB disk is blank; the host will not boot from it.  I tried switching the Host Profile setting to "Enable stateless caching on the host" with the first argument being "usb", selected to overwrite any existing VMFS volumes, and selected to ignore any SSD devices.  Same thing happens.

The long version:

I have spent a fair bit of time troubleshooting this, and believe that I found the root cause:  the USB is being claimed for passthrough when it should be left alone for ESXi to mount it and use it.

Here's the story:

/var/log # lsusb

Bus 02 Device 04: ID 0624:0402 Avocent Corp.

Bus 02 Device 03: ID 13fe:3100 Kingston Technology Company Inc. 2/4 GB stick

Bus 02 Device 02: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub

Bus 02 Device 01: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 01 Device 02: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub

Bus 01 Device 01: ID 1d6b:0002 Linux Foundation 2.0 root hub

* note: the USB stick is a Cisco part number made by UNIGEN, not Kingston

/var/log # dmesg | grep 13fe

2014-09-05T18:07:23.336Z cpu8:33604)<6>usb 2-1.3: New USB device found, idVendor=13fe, idProduct=3100

2014-09-05T18:07:26.950Z cpu6:33693)<6>usb 2-1.3: Vendor: 0x13fe, Product: 0x3100, Revision: 0x0100

/var/log # dmesg | grep 2-1.3

2014-09-05T18:07:23.216Z cpu8:33604)<6>usb 2-1.3: new high speed USB device number 3 using ehci_hcd

2014-09-05T18:07:23.336Z cpu8:33604)<6>usb 2-1.3: New USB device found, idVendor=13fe, idProduct=3100

2014-09-05T18:07:23.336Z cpu8:33604)<6>usb 2-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=3

2014-09-05T18:07:23.336Z cpu8:33604)<6>usb 2-1.3: Product: PSE4000S3

2014-09-05T18:07:23.336Z cpu8:33604)<6>usb 2-1.3: Manufacturer: UNIGEN

2014-09-05T18:07:23.336Z cpu8:33604)<6>usb 2-1.3: SerialNumber: 40E11B0086921B5A

2014-09-05T18:07:23.336Z cpu8:33604)<6>usb 2-1.3: usbfs: registered usb0203

2014-09-05T18:07:26.950Z cpu6:33693)<6>usb 2-1.3: Vendor: 0x13fe, Product: 0x3100, Revision: 0x0100

2014-09-05T18:07:26.950Z cpu6:33693)<6>usb 2-1.3: Interface Subclass: 0x06, Protocol: 0x50

2014-09-05T18:07:27.237Z cpu6:33693)<6>usb-storage 2-1.3:1.0: interface is claimed by usb-storage

2014-09-05T18:07:27.237Z cpu6:33693)<6>usb 2-1.3: device is not available for passthrough

2014-09-05T18:08:42.445Z cpu18:33546)<6>usb-storage 2-1.3:1.0: unclaiming vmhba32

2014-09-05T18:08:42.445Z cpu18:33546)<6>usb 2-1.3: device is available for passthrough

It appears that the USB device is being unclaimed and made available for passthrough to VMs.  This is not the desired behaviour.
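For anyone retracing this step, the claim/unclaim sequence can be pulled out of dmesg with a small filter. This is only a sketch in plain POSIX shell: the function name last_passthrough_state is made up for illustration, and it assumes the kernel messages keep the "usb <port>: device is ..." wording seen in the log above.

```shell
# Hypothetical helper (not part of ESXi): print the most recent passthrough
# state that dmesg reported for one USB port, e.g. 2-1.3.
# Assumes messages look like "usb 2-1.3: device is available for passthrough".
last_passthrough_state() {
    port="$1"
    # Fixed-string match so the dot in the port name is taken literally.
    grep -F "usb ${port}: device is" | tail -n 1 | sed 's/.*device is //'
}

# Usage on the host (reads dmesg output from stdin):
#   dmesg | last_passthrough_state 2-1.3
```

If the last reported state is "available for passthrough", the device was released after usb-storage had initially claimed it, which matches what the log above shows.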

To find out what my USB device is called, I ran:


esxcli storage core device list | less  (output shortened for clarity)

mpx.vmhba32:C0:T0:L0

  Display Name: Local USB Direct-Access (mpx.vmhba32:C0:T0:L0)

  Vendor: UNIGEN

  Model: PSE4000S3

Based on my understanding, there is a way to prevent a device from being made available for passthrough by marking it perennially reserved.  This can be done in the host profile, which I've done:

[Attached screenshot: pernially_reserved.png]

Also, I turned off the USB arbitrator service and tried to reapply the Host Profile to no avail:

/etc/init.d/usbarbitrator stop

So I kept on digging for the reason why ESXi is not writing the cache to USB.

Looking at syslog.log, here's what I found.  (results redacted and shortened)

2014-09-05T17:47:46Z 2014-09-05 17:47:46,938 Host Profiles[40280]: INFO: Now caching to disk...^@ <-- this is good!

2014-09-05T17:47:55Z HostProfileManager: [2014-09-05 17:47:55,380 root     INFO] Scanning mpx.vmhba32:C0:T0:L0 for any installs ...^@

2014-09-05T17:47:55Z HostProfileManager: [2014-09-05 17:47:55,715 root     INFO] gpt

487 255 63 7831552

1 2048 7829503 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0

^@

2014-09-05T17:47:55Z HostProfileManager: [2014-09-05 17:47:55,942 root     INFO] ^@

2014-09-05T17:48:01Z HostProfileManager: [2014-09-05 17:48:01,945 root     INFO]   Found nothing on mpx.vmhba32:C0:T0:L0.^@ <-- this is good!

2014-09-05T17:48:02Z HostProfileManager: [2014-09-05 17:48:02,279 root     INFO] gpt

487 255 63 7831552

1 2048 7829503 EBD0A0A2B9E5443387C068B6B72699C7 linuxNative 0

^@

2014-09-05T17:48:02Z HostProfileManager: [2014-09-05 17:48:02,280 root     INFO] Fresh install.  Using GPT^@

2014-09-05T17:48:02Z HostProfileManager: [2014-09-05 17:48:02,280 root     INFO]   Using the standard, minimum partition layout.^@

2014-09-05T17:48:02Z HostProfileManager: [2014-09-05 17:48:02,280 root     INFO] Checking USB device...^@

2014-09-05T17:48:02Z HostProfileManager: [2014-09-05 17:48:02,807 root     INFO] gpt

0 0 0 0

1 64 8191 C12A7328F81F11D2BA4B00A0C93EC93B 128

5 8224 520191 EBD0A0A2B9E5443387C068B6B72699C7 0

6 520224 1032191 EBD0A0A2B9E5443387C068B6B72699C7 0

7 1032224 1257471 9D27538040AD11DBBF97000C2911D1B8 0

8 1257504 1843199 EBD0A0A2B9E5443387C068B6B72699C7 0

^@ <-- this is good!

2014-09-05T17:48:02Z HostProfileManager: [2014-09-05 17:48:02,807 root     INFO] Preparing Visor volumes on disk /vmfs/devices/disks/mpx.vmhba32:C0:T0:L0...^@ <-- this is good!

2014-09-05T17:48:03Z HostProfileManager: [2014-09-05 17:48:03,112 root     INFO] stderr: create fs deviceName:'/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0:8', fsShortName:'vfat', fsName:'(null)'

deviceFullPath:/dev/disks/mpx.vmhba32:C0:T0:L0:8 deviceFile:mpx.vmhba32:C0:T0:L0:8

Checking if remote hosts are using this device as a valid file system. This may take a few seconds...

Creating vfat file system on "mpx.vmhba32:C0:T0:L0:8" with blockSize 1048576 and volume label "none".

Filesystem was created but mount failed on device "mpx.vmhba32:C0:T0:L0:8".: Not found. ^@  <-- ERROR!

Going back to try and find out what happened to my USB storage, I found:

esxcli storage core device list | less

   mpx.vmhba32:C0:T0:L0

   Display Name: Local USB Direct-Access (mpx.vmhba32:C0:T0:L0)

   Has Settable Display Name: false

   Size: 0

   Device Type: Direct-Access

   Multipath Plugin: NMP

   Devfs Path:

   Vendor: UNIGEN

   Model: PSE4000S3

   Revision: PMAP

   SCSI Level: 2

   Is Pseudo: false

   Status: dead timeout

   Is RDM Capable: false

   Is Local: true

   Is Removable: true

   Is SSD: false

   Is Offline: false

   Is Perennially Reserved: false

   Queue Full Sample Size: 0

   Queue Full Threshold: 0

   Thin Provisioning Status: unknown

   Attached Filters:

   VAAI Status: unsupported

   Other UIDs: vml.0000000000766d68626133323a303a30

   Is Local SAS Device: false

   Is Boot USB Device: false

   No of outstanding IOs with competing worlds: 32

The USB disk is in "dead timeout" status, and the "perennially reserved" setting from the Host Profile had no effect ("Is Perennially Reserved" is still false).
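To spot this state quickly on other hosts, the device list can be filtered for dead devices. This is a rough sketch, not an official tool: the function name dead_devices is invented here, and it assumes the output layout shown above, where the device identifier is the only line without spaces and the state appears on a "Status:" line.

```shell
# Hypothetical helper: read `esxcli storage core device list` output on stdin
# and print "<device> <status>" for every device whose Status starts with "dead".
# Assumes the device identifier line contains no spaces (as in the output above).
dead_devices() {
    awk '
        { line = $0; sub(/^[ \t]+/, "", line) }           # trim leading indent
        line != "" && line !~ / / { dev = line; next }    # device identifier line
        line ~ /^Status: dead/ {
            sub(/^Status: /, "", line)
            print dev " " line
        }
    '
}

# Usage on the host:
#   esxcli storage core device list | dead_devices
```

Fed the listing above, this would report mpx.vmhba32:C0:T0:L0 with status "dead timeout".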

However, I was able to prove that the USB device worked fine, at least for a while:

esxcli storage core device stats get | less

   mpx.vmhba32:C0:T0:L0

   Device: mpx.vmhba32:C0:T0:L0

   Successful Commands: 471

   Blocks Read: 7455

   Blocks Written: 0

   Read Operations: 309

   Write Operations: 0

   Reserve Operations: 0

   Reservation Conflicts: 0

   Failed Commands: 85

   Failed Blocks Read: 0

   Failed Blocks Written: 0

   Failed Read Operations: 0

   Failed Write Operations: 0

   Failed Reserve Operations: 0

Looking at the VMkernel log, I see this:

vmkernel.log | less

2014-09-05T19:03:50.242Z cpu9:39820 opID=9efa2c3c)World: 14296: VC opID hostd-8b72 maps to vmkernel opID 9efa2c3c

2014-09-05T19:03:55.446Z cpu9:39820 opID=252dfa49)World: 14296: VC opID 58F84D94-00000680-7b-4e maps to vmkernel opID 252dfa49

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)ScsiPath: 5151: DeletePath : adapter=vmhba32, channel=0, target=0, lun=0

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)ScsiDevice: 3612: Can't unregister device mpx.vmhba32:C0:T0:L0 because it is in use.  OpenCount:1 InternalOpenCount:0 RefCount:2 FilterCount:0

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)ScsiDevice: 3623: Device mpx.vmhba32:C0:T0:L0 was in use by worldId 0

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)WARNING: NMP: nmpUnclaimPath:1502: NMP device "mpx.vmhba32:C0:T0:L0" quiesce state change failed: Busy

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)WARNING: ScsiPath: 3708: Path vmhba32:C0:T0:L0 is being removed

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)WARNING: ScsiPath: 3914: Failed to issue command 0x0 (cmdSN 0x0) on path vmhba32:C0:T0:L0: No connection

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)ScsiPath: 4874: Path vmhba32:C0:T0:L0 could not be unclaimed from plugin, status Busy. Continue path unclaiming

2014-09-05T19:03:55.475Z cpu9:33047 opID=252dfa49)WARNING: ScsiScan: 1758: Could not delete path vmhba32:C0:T0:L0


I confirmed that mpx.vmhba32:C0:T0:L0 is truly unavailable by trying to read from it.

/dev/disks # ls -l ./mpx*

-rw-------    1 root     root     4009754624 Sep  5 19:42 ./mpx.vmhba32:C0:T0:L0

-rw-------    1 root     root       4161536 Sep  5 19:42 ./mpx.vmhba32:C0:T0:L0:1

-rw-------    1 root     root     262127616 Sep  5 19:42 ./mpx.vmhba32:C0:T0:L0:5

-rw-------    1 root     root     262127616 Sep  5 19:42 ./mpx.vmhba32:C0:T0:L0:6

-rw-------    1 root     root     115326976 Sep  5 19:42 ./mpx.vmhba32:C0:T0:L0:7

-rw-------    1 root     root     299876352 Sep  5 19:42 ./mpx.vmhba32:C0:T0:L0:8

/dev/disks # cat  ./mpx.vmhba32\:C0\:T0\:L0\:1

cat: read error: Input/output error

(Yes, I know a successful read would just print a bunch of garbage, but it should not return an error.)

So it appears that the USB stick is recognized, partially configured (partitions are written), but fails at some point before ESXi is able to mount it to write the cache to it.

I was hoping to reset the USB bus by disabling and re-enabling the ESXi kernel USB and USB-storage modules, but that didn't seem to work - it was a bit of a long shot.

esxcli system module set --enabled=false  --module=usb

esxcli system module set --enabled=true  --module=usb


esxcli system module set --enabled=false  --module=usb-storage

esxcli system module set --enabled=true  --module=usb-storage

Has anyone else seen this behaviour??  Any help would be greatly appreciated.

Thanks,

Michal

6 Replies
michalhoppe
Contributor

After much troubleshooting, it turned out that I needed to change the USB BIOS policy on this service profile and set "Legacy USB Support" to disabled.

ChadEastwood
Contributor

Michal,

Your troubleshooting steps and analysis saved me a bunch of time, thank you.  Did you find this value to be the only thing about your service profile that needed to change?

JKOB
Contributor

Hi Michalhoppe,

What did you actually change in the configuration?

I've got the same problem with my UCS environment using vSphere 6.

I'm not able to solve it.

What do your boot and BIOS policies look like?

Thanks in advance

Jean

916California
Contributor

Michal - Thank you for sharing the thorough troubleshooting. I'm currently on ESXi 6.x u3 latest patch and have Legacy USB Disabled. Not sure if you can advise anything else; I have an active ticket pending with VMware, but still no solution.

stein201110141
Contributor

We are experiencing similar problems with UCS B200M3 blades with SD cards. You gave us a lot of ideas to try; excellent initial description! For our part, we can read from the partitions with no I/O errors, but we get the same error message about the created partition not being mounted. We also have M4 and M5 blades that do not behave like this and, for some as yet undetermined reason, the M3s work some of the time ...

Assyrian
Contributor

stein201110141 Did you have an OS previously installed on the SD cards? If so, you will need to use the partedUtil command from the ESXi host. The time to do this is BEFORE remediating the host. Once the host has booted for the first time, run one of the following commands (the values vary depending on SD card size):

16GB= dd if=/dev/zero of="/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0" bs=512 count=34 seek=29360094 conv=notrunc

32GB= dd if=/dev/zero of="/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0" bs=512 count=34 seek=62325725 conv=notrunc

After you run the command then try remediating the host.

The root cause in my case was that the UCS B200 M3 blade has the FX3 RAID controller for the SD cards, and it does not perform a full format of the partition tables on the disk, so the partedUtil command performs this. Here is the VMware KB: VMware Knowledge Base
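For cards of other sizes, the seek values above look like "total 512-byte sectors minus 34", i.e. the dd commands zero the final 34 sectors of the device, where the backup GPT is stored. That reading is an assumption on my part, not something stated in the KB. Under that assumption, the value can be derived from the device size in bytes (for example, from ls -l in /dev/disks). The function name gpt_backup_seek below is made up for illustration.

```shell
# Hypothetical helper: given a device size in bytes, print the dd "seek" value
# (in 512-byte sectors) that points at the final 34 sectors of the device.
# Assumption: the commands above target that trailing region (backup GPT).
gpt_backup_seek() {
    size_bytes="$1"
    echo $(( ${size_bytes} / 512 - 34 ))
}

# Example: a 15032385536-byte card gives 29360094, the 16GB value above.
#   gpt_backup_seek 15032385536
# Example: the 4009754624-byte USB stick from the original post gives 7831518.
#   gpt_backup_seek 4009754624
```

Since dd overwrites the device directly, double-check the size against your own device before using a computed value.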

If you have any questions don't hesitate to message me directly or if you want to chat on the phone. Hope this helps.
