VMware Cloud Community
Najtsob
Enthusiast

VMkernel log & SCSI Sense Codes

Hi,

Should there be errors and SCSI sense codes logged if your VM data gets corrupted, or is it possible for the content of a block to be corrupted (e.g. by a bit flip) in such a way that it looks OK at the SCSI level, but the data itself is not valid?

From time to time a random VM just dies or starts reporting various FS errors, as if it were a physical server with a bad HDD. This happens only with VMs on the FC array; those on the iSCSI one don't have a problem. The array and SAN fabric are supposedly OK, so I am looking at the server and VMware side for anything that would indicate what is going on.

best regards

1 Solution

Accepted Solutions
daphnissov
Immortal

I did some digging, and there are a couple of things you have to fix first.

  1. The firmware on that array (it looks like a P6300) is old and not supported on that version of ESXi. This alone can cause issues, even when just changing protocols. You must get to at least 11200000 as shown in the HCL.

[screenshot: HCL entry showing the minimum supported firmware]

  2. As also posted, your version of the qlnativefc driver is *extremely* old. I've had bad experiences with instability caused by outdated and buggy versions of this driver. The latest is qlnativefc 1.1.77.1-1, and if you check the release notes, the change log of fixes between your version and the target version is a huge list.

You may also want to boot the latest SPP that has firmware for G7s to see if it includes a later version of the HBA firmware.

But these top two things you must fix before troubleshooting further. It's always dangerous to assume what *should* work when you deviate from the supported compatibility list, so true up first, then continue. It's likely that with these couple of fixes your problem will disappear; you'd be surprised how many times I've found that to be the case over the years.
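If it helps, a quick way to confirm the gap is to compare the version string reported by `vmkload_mod -s qlnativefc` against the target using a version-aware sort. The version numbers below are the ones from this thread; the comparison itself is plain shell and the script is only a sketch:

```shell
#!/bin/sh
# Installed version taken from the vmkload_mod output posted in this thread;
# target version from the qlnativefc release notes mentioned above.
installed="1.1.29.0"
target="1.1.77.1"

# sort -V orders version strings field-by-field numerically; if the installed
# version sorts first (and differs from the target), it is the older one.
oldest=$(printf '%s\n%s\n' "$installed" "$target" | sort -V | head -n1)

if [ "$oldest" = "$installed" ] && [ "$installed" != "$target" ]; then
    echo "qlnativefc $installed is older than $target - update needed"
else
    echo "qlnativefc $installed is at or beyond $target"
fi
# prints: qlnativefc 1.1.29.0 is older than 1.1.77.1 - update needed
```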


26 Replies
daphnissov
Immortal

What you're describing sounds like either bits being dropped in transmission due to receiver issues, or data corruption. Your caveat that the "san fabric is supposedly OK" is not very reassuring, so you may want to look at drops on the transceiver side. Usually data corruption over FC doesn't produce SCSI sense codes, because those would have to come from the array side, and since the array is block storage it has no knowledge of what the data means, only that it's data. Also check the firmware on your director and on the array itself, as bugs there can sometimes lead to similar symptoms.
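One place to look on the host side for dropped bits is the HBA's own error counters. For qlnativefc they can be dumped with /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a; the counter names in the sample below are illustrative stand-ins (real field names vary by driver build), but the filtering idea is the same:

```shell
#!/bin/sh
# Stand-in for a qlnativefc keyval dump; these counter names are hypothetical
# examples, not guaranteed driver output.
cat > /tmp/hba_stats.txt <<'EOF'
Link Down Count = 0
Loss of Sync Count = 3
Invalid CRC Count = 17
EOF

# Flag any error counter with a non-zero value.
grep -E 'Count = [1-9][0-9]*$' /tmp/hba_stats.txt
# prints the "Loss of Sync" and "Invalid CRC" lines only
```

Rising CRC or loss-of-sync counters between two dumps taken a day apart would point at the physical path rather than the array.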

Najtsob
Enthusiast

If bits are being dropped, shouldn't this be detected and logged on the server as CRC errors?

The problem is that the array is managed by others, and they say it is OK and that's it. I'm looking at the server side to either find an error on our end or find something that proves to them that the array is not OK after all.

daphnissov
Immortal

It depends. It sounds like you're in for a long haul of troubleshooting, so first identify which VMs had a problem and when, then start looking through the host logs to see what was captured around that time. Also look at vmware.log for each affected VM. This will help you start narrowing things down.
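To correlate, grep the host's /var/log/vmkernel.log (and the vmware.log of the affected VM) for the minutes around each failure. A minimal sketch of the approach, using a stand-in log file with made-up generic entries rather than real vmkernel output:

```shell
#!/bin/sh
# Stand-in log file; the entries are invented for illustration only.
cat > /tmp/sample.log <<'EOF'
2017-11-12T00:24:59Z unrelated entry
2017-11-12T00:25:03Z storage-related error entry
2017-11-12T00:31:10Z another unrelated entry
EOF

# Pull every entry from the minute in which the VM failed to start.
# On a live host the file would be /var/log/vmkernel.log instead.
grep '^2017-11-12T00:25' /tmp/sample.log
# prints the 00:25 entry only
```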

Najtsob
Enthusiast

I have a list of VMs that got corrupted.

The VMkernel logs on all hosts don't contain any errors related to storage. Is there any other log I should look at?

I will go and check vmware.log for troubled VMs.

What do you think about this KB (VMware Knowledge Base)?

I am only concerned about the following statement: "Verbose logging consumes logging partition space at an accelerated rate and may exhaust the partition of its free space."

So how much log data will be generated per hour/day?

Br

daphnissov
Immortal

"So how much log data will be generated per hour/day?"

There's no way to know that in advance. I haven't personally followed that procedure, so you may want to consult your I/O vendor about the risk and how long it can safely be left enabled.
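If you do decide to try it, one hedge is to measure the growth rate empirically before committing: sample the log directory's size twice and extrapolate. The directory and interval below are placeholders; on ESXi the logs typically live under /var/log or /scratch/log:

```shell
#!/bin/sh
# Estimate log growth per hour by sampling a directory's size twice.
# LOGDIR and INTERVAL are placeholder defaults for illustration.
LOGDIR="${1:-/tmp/logdemo}"
INTERVAL="${2:-2}"     # seconds between samples; use minutes in practice

mkdir -p "$LOGDIR"
before=$(du -sk "$LOGDIR" | cut -f1)
sleep "$INTERVAL"
after=$(du -sk "$LOGDIR" | cut -f1)

# Scale the observed growth up to one hour (3600 s).
growth=$(( (after - before) * 3600 / INTERVAL ))
echo "approx ${growth} KiB/hour"
```

Run it once with verbose logging off and once with it on, and the difference tells you how long the logging partition will last.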

Najtsob
Enthusiast

For example, this VM is booted by a scheduled task to do some work and then shut down. It worked the previous day, but the next day it failed to start.

2017-11-12T00:25:01.404Z| vmx| W110: ObtainHardwareID unexpected failure: 22.

2017-11-12T00:25:01.405Z| vmx| W110: Hostinfo_MachineID ObtainHardwareID failure (Invalid argument); providing default.

2017-11-12T00:25:01.405Z| vmx| I120: [0x12FA000-0x2256F7C): /bin/vmx

2017-11-12T00:25:01.425Z| vmx| I120: changing directory to /vmfs/volumes/59e4e0e6-095a5b8e-aa9f-e839350ea7be/[name]/.

2017-11-12T00:25:01.425Z| vmx| I120: Config file: /vmfs/volumes/59e4e0e6-095a5b8e-aa9f-e839350ea7be/[name]/[name].vmx

2017-11-12T00:25:01.425Z| vmx| I120: Vix: [158981706 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0

2017-11-12T00:25:01.425Z| vmx| I120: Vix: [158981706 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=2, newAppState=1878, success=1 additionalError=0

2017-11-12T00:25:01.601Z| vmx| W110: PowerOn

2017-11-12T00:25:02.526Z| vcpu-1| I120: CPU reset: soft (mode 2)

2017-11-12T00:25:02.912Z| vcpu-1| I120: CPU reset: soft (mode 2)

2017-11-12T00:25:07.047Z| vcpu-1| I120: CPU reset: soft (mode 2)

Some weird time jump in the past? (see the log timestamps)

2017-11-12T00:25:18.555Z| vmx| I120: TOOLS installed legacy version 9354, available legacy version 9354

2017-11-12T00:25:18.555Z| vmx| I120: TOOLS manifest update status is 3

2017-11-12T00:25:18.555Z| vmx| I120: TOOLS can be autoupgraded.

2017-11-12T00:25:18.555Z| vmx| I120: TOOLS Setting autoupgrade-checked TRUE.

2017-11-12T00:25:18.555Z| vmx| I120: RPT: Disabled. Skipped.

2017-11-12T00:26:02.047Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox-dnd timed out.

2017-11-12T23:55:01.155Z| vmx| I120: Tools: sending 'OS_Halt' (state = 1) state change request

2017-11-12T23:55:01.164Z| vmx| I120: Vix: [158981706 vmxCommands.c:529]: VMAutomation_InitiatePowerOff. Tried to soft halt. Success = 1

2017-11-12T23:55:01.319Z| vcpu-1| I120: TOOLS state change 1 returned status 1

2017-11-12T23:55:01.646Z| vcpu-1| I120: TOOLS autoupgrade protocol version 0

2017-11-12T23:55:01.659Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)

2017-11-12T23:55:15.800Z| vcpu-0| I120: APIC THERMLVT write: 0x10000

2017-11-12T23:55:15.800Z| vcpu-1| I120: APIC THERMLVT write: 0x10000

2017-11-12T23:55:15.800Z| vcpu-0| I120: PIIX4: PM Soft Off.  Good-bye.

2017-11-12T23:55:15.800Z| vcpu-0| I120: Chipset: Issuing power-off request...

2017-11-12T23:55:15.800Z| vcpu-0| A115: ConfigDB: Setting softPowerOff = "TRUE"

2017-11-12T23:55:15.836Z| vmx| I120: Stopping VCPU threads...

2017-11-12T23:55:15.836Z| vcpu-0| I120: VMMon_WaitForExit: vcpu-0: worldID=158981709

2017-11-12T23:55:15.836Z| vcpu-1| I120: VMMon_WaitForExit: vcpu-1: worldID=158981732

2017-11-12T23:55:15.837Z| svga| I120: SVGA thread is exiting

2017-11-12T23:55:16.137Z| vmx| I120:

2017-11-12T23:55:16.138Z| vmx| I120+ OvhdMem: Final (Power Off) Overheads

daphnissov
Immortal

Please provide more information about this VM (hardware version, OS, etc.) and your environment. What version of vSphere (vCenter and ESXi)? What hardware (make and model)? What storage?

Najtsob
Enthusiast

HW version 8, Ubuntu Linux 64-bit; VMware Tools (not open-vm-tools) are used.

Hosts are ESXi 5.5U2.

Servers are ProLiant G7s, and the storage is also something from HP. The servers and the storage are directly connected, without FC switches in between.

daphnissov
Immortal

What version of Ubuntu?

Najtsob
Enthusiast

14.04.5 LTS

It's nothing special, just a file server.

daphnissov
Immortal

Based on your responses, a few things I would point out and suggest:

  1. Ubuntu 14.04 is not supported on 5.5. The first version to claim support is 6.0U2, for both 32-bit and 64-bit systems. This is significant because there are known issues with odd OS behavior when you stray from the HCL.
  2. At 5.5U2, there are lots of patches to be applied. If you're going to stay on the 5.5 release, you need to be fully patched.
  3. Look to update your HBA and NIC firmware on those G7s. I've observed some wonky behavior in the past on some hardware combinations.
  4. Bring your ESXi drivers for I/O devices as up-to-date as HP's recipes allow. Similar types of oddities happen when they're multiple versions old.

Najtsob
Enthusiast

I agree with you, but the issues only occur with VMs on FC storage.

Even a VM that has failed in the past will run smoothly when moved to iSCSI storage, and then start failing again when moved back to FC.

It's either something wrong with the FC storage path (array, HBAs, drivers, cables, ...) or something super weird is happening and the VM behaves differently on different storage.

daphnissov
Immortal

I see. What HP storage are you using, and what microcode is it running? What FC HBAs are in use? Can you provide their firmware and driver versions? Refer to this KB for help determining those.

Najtsob
Enthusiast

~ # esxcfg-scsidevs -a

vmhba2  qlnativefc        link-up   fc.50014380186aa649:50014380186aa648    (0:6:0.0) QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA

vmhba3  qlnativefc        link-up   fc.50014380186aa64b:50014380186aa64a    (0:6:0.1) QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA

~ # vmkload_mod -s qlnativefc | grep Version

Version: 1.1.29.0-1OEM.550.0.0.1331820

~ # vmkchdev -l |grep vmhba2

0000:06:00.0 1077:2532 103c:3263 vmkernel vmhba2

~ # vmkchdev -l |grep vmhba3

0000:06:00.1 1077:2532 103c:3263 vmkernel vmhba3

~ # /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a

Listing all system keys:

Key Value Instance:  QLNATIVEFC/qlogic

Listing keys:

Name:   0

Type:   string

value:

QLogic PCI to Fibre Channel Host Adapter for HPAJ764A:

        FC Firmware version 7.03.00 (90d5), Driver version 1.1.29.0

I have checked this on the VMware Compatibility Guide site.

The driver version isn't listed there, but I guess that's because it's an OEM version, so I need to check with HP for compatibility.
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io&productid=12687&deviceCa...

daphnissov
Immortal

And what about your storage array and its microcode version?

Najtsob
Enthusiast

I don't have access, so I need to wait to get the data 😕

Najtsob
Enthusiast

Microcode should be:

HSV340

2048

CR2094lesp-10100000

daphnissov
Immortal

The HSV340 is an FC controller, not the type of array. There are several models that used that controller. Can you find out exactly which array you have? Is it a P6300 EVA? P6000 series?

Najtsob
Enthusiast

It should be P6000.
