Hi,
We have an ESXi setup (3 ESXi 5.1 servers) with Openfiler as the shared storage.
For the last couple of days, the DAVG value on the servers has been too high (more than 400) and performance has degraded. The Openfiler machine is set up on a Lenovo ThinkCentre with a single 4 TB SATA HDD configured as one volume. Around 15 VMs are running on this storage.
When we checked the vmkernel logs, we found the following:
" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2014-07-08T05:41:59.783Z cpu0:141643)ScsiDeviceIO: 2331: Cmd(0x4124007b2340) 0x85, CmdSN 0xcc6 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-08T05:41:59.783Z cpu0:4459)ScsiDeviceIO: 2331: Cmd(0x4124007b2340) 0x4d, CmdSN 0xcc7 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2014-07-08T05:41:59.783Z cpu0:4459)ScsiDeviceIO: 2331: Cmd(0x4124007b2340) 0x1a, CmdSN 0xcc8 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
2014-07-08T05:48:37.008Z cpu2:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 541 msecs). # R: 1, # W: 1 bytesXfer: 5 sectors
2014-07-08T05:48:40.476Z cpu3:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 210 msecs). # R: 1, # W: 1 bytesXfer: 0 sectors
2014-07-08T05:48:41.384Z cpu0:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 304 msecs). # R: 1, # W: 1 bytesXfer: 5 sectors
2014-07-08T05:48:50.677Z cpu3:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 312 msecs). # R: 1, # W: 1 bytesXfer: 0 sectors
2014-07-08T06:11:59.802Z cpu2:4098)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124007b0c40, 5102) to dev "naa.6d4ae5209ba76b001810002c23fa5645" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE
2014-07-08T06:11:59.802Z cpu2:4098)ScsiDeviceIO: 2331: Cmd(0x4124007b0c40) 0x85, CmdSN 0x1494 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
We suspect two possible reasons behind this performance degradation: network I/O or storage I/O.
We tried to ping the storage from a host and vice versa, and the response time was below 1 ms. We have attached disk I/O results measured from a Windows VM hosted on the Openfiler storage.
Regards,
Nithin
For the last couple of days, the DAVG value on the servers has been too high (more than 400) and performance has degraded. The Openfiler machine is set up on a Lenovo ThinkCentre with a single 4 TB SATA HDD configured as one volume. Around 15 VMs are running on this storage.
A single consumer-grade 4 TB SATA 7200 rpm drive (or a RAID1 pair) to run 15 VMs on? It's not surprising you're running into storage performance problems. A disk like this can only handle about 100 random IOPS, which is most likely nowhere near enough to satisfy the performance needs of your 15 VMs. It also sounds like you don't have storage controller caching on this box, which makes things worse.
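To see why a single spindle can't keep up, here's a quick back-of-the-envelope sketch (the per-VM average here is an assumption for illustration, not a measurement; plug in your own numbers from esxtop or the PowerCLI snippet below):

```python
# Rough capacity check: can one 7200 rpm SATA drive serve 15 VMs?
# All numbers are illustrative assumptions, not measurements.

DISK_RANDOM_IOPS = 100   # typical ceiling for a single 7200 rpm SATA drive
VM_COUNT = 15
AVG_IOPS_PER_VM = 15     # assumed modest per-VM average (reads + writes)

demand = VM_COUNT * AVG_IOPS_PER_VM      # aggregate steady-state demand
utilization = demand / DISK_RANDOM_IOPS  # > 1.0 means the disk is saturated

print(f"aggregate demand: {demand} IOPS")
print(f"disk utilization: {utilization:.0%}")
```

Even with a modest 15 IOPS per VM, the aggregate demand (225 IOPS) is more than double what the disk can deliver, and queueing delay (your DAVG) grows without bound once a disk runs at saturation.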
We tried to ping storage from host and vice versa but the response was below 1ms. Attached the result of disk I/O calculated from a Windows VM hosted in Openfiler.
You are merely testing network latency here, which says nothing about contention on the physical disks. Based on what you've told us, I'm 98% sure the problem is insufficient performance from the disk subsystem, not the network infrastructure.
This is only a dev environment, which is why we went with Openfiler. Is there any workaround for this other than replacing the storage?
Nothing wrong with Openfiler by itself, but like every storage system it needs an appropriate physical disk subsystem underneath.
I'm afraid not really. You can check which VMs consume the most IOPS in your environment with (r)esxtop or PowerCLI and see what they're doing.
Here's a PowerCLI sample snippet that reports realtime (last 20 minutes) values:
Get-VM | Sort-Object |
    Select-Object @{N="Name"; E={$_.Name}},
        @{N="AvgWriteIOPS";  E={[math]::Round((Get-Stat $_ -Stat "datastore.numberWriteAveraged.average" -Realtime | Select-Object -ExpandProperty Value | Measure-Object -Average).Average, 1)}},
        @{N="PeakWriteIOPS"; E={[math]::Round((Get-Stat $_ -Stat "datastore.numberWriteAveraged.average" -Realtime | Select-Object -ExpandProperty Value | Measure-Object -Maximum).Maximum, 1)}},
        @{N="AvgReadIOPS";   E={[math]::Round((Get-Stat $_ -Stat "datastore.numberReadAveraged.average" -Realtime | Select-Object -ExpandProperty Value | Measure-Object -Average).Average, 1)}},
        @{N="PeakReadIOPS";  E={[math]::Round((Get-Stat $_ -Stat "datastore.numberReadAveraged.average" -Realtime | Select-Object -ExpandProperty Value | Measure-Object -Maximum).Maximum, 1)}} |
    Format-Table -AutoSize
Name AvgWriteIOPS PeakWriteIOPS AvgReadIOPS PeakReadIOPS
---- ------------ ------------- ----------- ------------
VM1 14,5 48 0,1 21
VM2 18,6 51 2,1 289
VM3 14,6 81 0 4
You could enable Storage I/O Control on the datastore, which should help distribute IOPS a little more fairly between competing VMs, but it probably won't help much at this level and doesn't solve the fundamental problem.
Thanks. The script is helpful.