Hi,
Should there be errors and SCSI sense codes if your VM data gets corrupted, or is it possible that the content of a block itself is corrupted (e.g. a bit flip) in such a way that it looks OK at the SCSI level, but the data itself is not valid?
From time to time a random VM just dies or starts reporting various FS errors, as if a physical server had a bad HDD. This happens only with VMs on the FC array; those on the iSCSI array don't have a problem. The array and SAN fabric are supposedly OK, so I am looking at the server and VMware side for anything that would indicate what is going on.
best regards
I did some digging, and there are a couple of things you have to fix first.
2. As also posted, your version of the qlnativefc driver is *extremely* old. I've had bad experiences with this driver and the instability issues that arise from outdated and buggy versions. The latest is qlnativefc 1.1.77.1-1, and if you check the release notes, which show a change log of fixes between your version and the target version, it's a huge list.
You may also want to boot the latest SPP that has firmware for G7s to see if it has a later version of the HBA firmware.
These are the top two things you must fix before trying to troubleshoot further. It's always dangerous to make assumptions about what *should* work when you deviate from the supported compatibility list, so true up first, then continue. It's likely that with these couple of fixes your problem may disappear. You'd be surprised how many times I've found that to be the case over the years.
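To make the driver remediation concrete, here is a minimal sketch of updating an OEM qlnativefc VIB on an ESXi 5.5 host. The bundle path and filename are hypothetical (download the matching HP/QLogic offline bundle first), and the commands only print what they would run unless you explicitly set RUN=1:

```shell
# Hypothetical location of the downloaded offline bundle on a datastore.
bundle=/vmfs/volumes/datastore1/qlnativefc-1.1.77.1-offline-bundle.zip

run() {
    # Echo instead of executing so the sketch is safe to paste as-is.
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "DRY-RUN: $*"; fi
}

run esxcli software vib update -d "$bundle"   # stage the newer driver VIB
run reboot                                    # a driver change needs a reboot to take effect
```

Put the host in maintenance mode before doing this for real, and confirm the exact bundle with HP, since this is their OEM build of the driver.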
What you're describing sounds like either bits being dropped in transmission due to receiver issues, or data corruption. Your remark that the "san fabric is supposedly OK" is not very reassuring, so you may want to look at drops on the transceiver side. Usually data corruption over FC doesn't raise SCSI sense codes, because those would have to come from the array side, and the array, since it's block storage, has no knowledge of what the data means, only that it's data. Also check the firmware on your director and on the array itself, as bugs there can sometimes lead to similar symptoms.
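On the ESXi side, the per-adapter FC link error counters are a good first place to look; on 5.5 they are reported by `esxcli storage san fc stats get -A vmhba2`. The sample output below is fabricated so the filtering step can be shown offline, and the counter names are assumptions to verify against your host:

```shell
# Fabricated sample of per-adapter FC counters (real source would be
# `esxcli storage san fc stats get -A vmhba2` on the host itself).
stats='LinkFailureCount: 0
LossOfSyncCount: 3
LossOfSignalCount: 0
InvalidTxWordCount: 120
InvalidCrcCount: 7'

# Keep only the counters that point at a dirty link; a climbing
# InvalidCrcCount is the classic signature of frames being mangled
# in flight (bad SFP, cable, or backplane connector).
errors=$(printf '%s\n' "$stats" | grep -E 'LinkFailure|LossOfSync|InvalidCrc')
printf '%s\n' "$errors"
```

Sample the counters twice, a few hours apart: absolute values matter less than whether the CRC and loss-of-sync counts are growing.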
If bits are being dropped, should this be detected/logged on the server as CRC errors?
The problem is that the array is managed by others, and they say it is OK, and that's it. I'm looking on the server side to either find an error on our end or find something to prove to them that the array is not OK after all.
It depends. It sounds like you're in for a long haul of troubleshooting, so you first need to identify which VMs had a problem and when, then start looking into the host logs to see what they captured around that time. Also look at the vmware.log for each affected VM to see what it captured. This will help you start narrowing things down.
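A sketch of that log correlation step: pull the storage-related vmkernel entries from a window around the time a VM failed. The log excerpt below is simulated (on a real host the file is /var/log/vmkernel.log), and the device name and timestamps are made up for illustration:

```shell
# Simulated vmkernel.log excerpt; the device name and timestamps are invented.
log=$(mktemp)
cat > "$log" <<'EOF'
2017-11-12T00:24:59.100Z cpu4:33290)ScsiDeviceIO: Cmd 0x28 to dev "naa.600508b4000d0001" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0
2017-11-12T00:25:01.200Z cpu2:33101)World: vm 158981709: Starting world vmx
EOF

# Ten-minute window around the failure, then keep only SCSI noise;
# a sense key of 0x3 (medium error) in there would be a smoking gun.
hits=$(grep -E '2017-11-12T00:2[0-9]' "$log" | grep -Ei 'scsi|sense|naa\.')
printf '%s\n' "$hits"
rm -f "$log"
```

Running the same window against every host's vmkernel.log for each corrupted VM should quickly show whether the failures cluster on particular paths or devices.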
I have a list of VMs that got corrupted.
The vmkernel log on all hosts doesn't contain any errors related to storage. Is there any other log that I should look at?
I will go and check vmware.log for troubled VMs.
What do you think about this KB (VMware Knowledge Base)?
I am only concerned about the following statement: Verbose logging consumes logging partition space at an accelerated rate and may exhaust the partition of its free space.
So how much log data will be generated per hour/day?
Br
So how much log data will be generated per hour/day?
There's no way to know that. I haven't personally followed that procedure, so you may want to consult with your I/O vendor to see what the risk is and for how long it can be safely enabled.
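If you do enable verbose logging, a simple guard against the KB's warning is to baseline the log partition and watch its usage. A minimal sketch, assuming logs live under /scratch/log (the usual ESXi location) and an arbitrary 90% alert threshold:

```shell
# Directory holding the logs; /scratch/log is the usual ESXi location.
logdir=${LOGDIR:-/scratch/log}

# Percent used on the filesystem holding the logs ("42%" -> "42").
used=$(df -P "$logdir" 2>/dev/null | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
used=${used:-0}   # directory absent (e.g. running off-host) -> treat as 0

if [ "$used" -ge 90 ]; then
    echo "WARNING: log partition ${used}% full - back off verbose logging"
else
    echo "log partition ${used}% full - headroom OK"
fi
```

Run it a few times over the first day of verbose logging; the delta between samples gives you the growth rate per hour that you asked about, for your workload rather than a generic estimate.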
For example, this VM is booted by a scheduled task to do some work and is then shut down. It worked the previous day, but the next day it failed to start.
2017-11-12T00:25:01.404Z| vmx| W110: ObtainHardwareID unexpected failure: 22.
2017-11-12T00:25:01.405Z| vmx| W110: Hostinfo_MachineID ObtainHardwareID failure (Invalid argument); providing default.
2017-11-12T00:25:01.405Z| vmx| I120: [0x12FA000-0x2256F7C): /bin/vmx
2017-11-12T00:25:01.425Z| vmx| I120: changing directory to /vmfs/volumes/59e4e0e6-095a5b8e-aa9f-e839350ea7be/[name]/.
2017-11-12T00:25:01.425Z| vmx| I120: Config file: /vmfs/volumes/59e4e0e6-095a5b8e-aa9f-e839350ea7be/[name]/[name].vmx
2017-11-12T00:25:01.425Z| vmx| I120: Vix: [158981706 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=1, newAppState=1873, success=1 additionalError=0
2017-11-12T00:25:01.425Z| vmx| I120: Vix: [158981706 mainDispatch.c:3964]: VMAutomation_ReportPowerOpFinished: statevar=2, newAppState=1878, success=1 additionalError=0
2017-11-12T00:25:01.601Z| vmx| W110: PowerOn
2017-11-12T00:25:02.526Z| vcpu-1| I120: CPU reset: soft (mode 2)
2017-11-12T00:25:02.912Z| vcpu-1| I120: CPU reset: soft (mode 2)
2017-11-12T00:25:07.047Z| vcpu-1| I120: CPU reset: soft (mode 2)
Some weird time jump in the past? (log timestamps)
2017-11-12T00:25:18.555Z| vmx| I120: TOOLS installed legacy version 9354, available legacy version 9354
2017-11-12T00:25:18.555Z| vmx| I120: TOOLS manifest update status is 3
2017-11-12T00:25:18.555Z| vmx| I120: TOOLS can be autoupgraded.
2017-11-12T00:25:18.555Z| vmx| I120: TOOLS Setting autoupgrade-checked TRUE.
2017-11-12T00:25:18.555Z| vmx| I120: RPT: Disabled. Skipped.
2017-11-12T00:26:02.047Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox-dnd timed out.
2017-11-12T23:55:01.155Z| vmx| I120: Tools: sending 'OS_Halt' (state = 1) state change request
2017-11-12T23:55:01.164Z| vmx| I120: Vix: [158981706 vmxCommands.c:529]: VMAutomation_InitiatePowerOff. Tried to soft halt. Success = 1
2017-11-12T23:55:01.319Z| vcpu-1| I120: TOOLS state change 1 returned status 1
2017-11-12T23:55:01.646Z| vcpu-1| I120: TOOLS autoupgrade protocol version 0
2017-11-12T23:55:01.659Z| vcpu-1| I120: GuestRpc: Reinitializing Channel 0(toolbox)
2017-11-12T23:55:15.800Z| vcpu-0| I120: APIC THERMLVT write: 0x10000
2017-11-12T23:55:15.800Z| vcpu-1| I120: APIC THERMLVT write: 0x10000
2017-11-12T23:55:15.800Z| vcpu-0| I120: PIIX4: PM Soft Off. Good-bye.
2017-11-12T23:55:15.800Z| vcpu-0| I120: Chipset: Issuing power-off request...
2017-11-12T23:55:15.800Z| vcpu-0| A115: ConfigDB: Setting softPowerOff = "TRUE"
2017-11-12T23:55:15.836Z| vmx| I120: Stopping VCPU threads...
2017-11-12T23:55:15.836Z| vcpu-0| I120: VMMon_WaitForExit: vcpu-0: worldID=158981709
2017-11-12T23:55:15.836Z| vcpu-1| I120: VMMon_WaitForExit: vcpu-1: worldID=158981732
2017-11-12T23:55:15.837Z| svga| I120: SVGA thread is exiting
2017-11-12T23:55:16.137Z| vmx| I120:
2017-11-12T23:55:16.138Z| vmx| I120+ OvhdMem: Final (Power Off) Overheads
Please provide more information about this VM (hardware version, OS, etc.) and your environment. What version of vSphere (vCenter and ESXi)? What hardware (make and model)? What storage?
HW version 8, Ubuntu Linux 64-bit; VMware Tools (not open-vm-tools) are used.
Hosts are ESXi 5.5 U2.
Servers are ProLiant G7 and the storage is also something from HP. The servers and the storage are directly connected, without FC switches in between.
What version of Ubuntu?
14.04.5 LTS
It's nothing special, just file server.
Based on your responses, a few things I would point out and suggest:
I agree with you, but the issue only occurs with VMs on FC storage.
Even a VM that has failed in the past will run smoothly when moved to iSCSI storage, and then start failing again when moved back to FC.
It's either something wrong with the FC storage (array, HBAs, drivers, cables, ...) or something super weird is happening and the VM acts differently on different storage.
I see. What is this HP storage that you're using and the microcode in use? What are your FC HBAs in use? Can you provide their firmware versions and driver version in use? Refer to this KB for help determining those.
~ # esxcfg-scsidevs -a
vmhba2 qlnativefc link-up fc.50014380186aa649:50014380186aa648 (0:6:0.0) QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA
vmhba3 qlnativefc link-up fc.50014380186aa64b:50014380186aa64a (0:6:0.1) QLogic Corp ISP2532-based 8Gb Fibre Channel to PCI Express HBA
~ # vmkload_mod -s qlnativefc | grep Version
Version: 1.1.29.0-1OEM.550.0.0.1331820
~ # vmkchdev -l |grep vmhba2
0000:06:00.0 1077:2532 103c:3263 vmkernel vmhba2
~ # vmkchdev -l |grep vmhba3
0000:06:00.1 1077:2532 103c:3263 vmkernel vmhba3
~ # /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a
Listing all system keys:
Key Value Instance: QLNATIVEFC/qlogic
Listing keys:
Name: 0
Type: string
value:
QLogic PCI to Fibre Channel Host Adapter for HPAJ764A:
FC Firmware version 7.03.00 (90d5), Driver version 1.1.29.0
I have checked this on the VMware Compatibility Guide site.
The driver version isn't listed there, but I guess that is because it's an OEM version, so I need to check with HP for compatibility.
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io&productid=12687&deviceCa...
And what about your storage array and its microcode version?
I don't have access, so I need to wait to get the data 😕
Microcode should be:
HSV340 | 2048 | CR2094lesp-10100000 |
The HSV340 is a FC controller, not the type of array. There are several models that used that controller. Can you find out exactly what array you have? Is it a P6300 EVA? P6000 series?
It should be P6000.