VMware Cloud Community
pingnithin
Enthusiast

High DAVG in ESXi

Hi,

We have an ESXi setup (three ESXi 5.1 hosts) with Openfiler as the shared storage.

For the last couple of days, the DAVG values on the hosts have been very high (more than 400) and performance has degraded. The Openfiler machine is a Lenovo ThinkCentre with a 4 TB SATA HDD configured as a single volume. Around 15 VMs are running on this storage.

When we checked the vmkernel logs, we found the following:

" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2014-07-08T05:41:59.783Z cpu0:141643)ScsiDeviceIO: 2331: Cmd(0x4124007b2340) 0x85, CmdSN 0xcc6 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2014-07-08T05:41:59.783Z cpu0:4459)ScsiDeviceIO: 2331: Cmd(0x4124007b2340) 0x4d, CmdSN 0xcc7 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2014-07-08T05:41:59.783Z cpu0:4459)ScsiDeviceIO: 2331: Cmd(0x4124007b2340) 0x1a, CmdSN 0xcc8 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2014-07-08T05:48:37.008Z cpu2:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 541 msecs). # R: 1, # W: 1 bytesXfer: 5 sectors

2014-07-08T05:48:40.476Z cpu3:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 210 msecs). # R: 1, # W: 1 bytesXfer: 0 sectors

2014-07-08T05:48:41.384Z cpu0:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 304 msecs). # R: 1, # W: 1 bytesXfer: 5 sectors

2014-07-08T05:48:50.677Z cpu3:627340)FS3Misc: 1465: Long VMFS rsv time on 'Openfiler-DB1' (held for 312 msecs). # R: 1, # W: 1 bytesXfer: 0 sectors

2014-07-08T06:11:59.802Z cpu2:4098)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x85 (0x4124007b0c40, 5102) to dev "naa.6d4ae5209ba76b001810002c23fa5645" on path "vmhba1:C2:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2014-07-08T06:11:59.802Z cpu2:4098)ScsiDeviceIO: 2331: Cmd(0x4124007b0c40) 0x85, CmdSN 0x1494 from world 5102 to dev "naa.6d4ae5209ba76b001810002c23fa5645" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

We suspect one of two causes for this performance degradation: network I/O or storage I/O.

We pinged the storage from the hosts and vice versa, and the response time was below 1 ms. Attached is the disk I/O result measured from a Windows VM hosted on the Openfiler storage.

Regards,

Nithin

Nithin Radhakrishnan www.systemadminguide.in
4 Replies
MKguy
Virtuoso

For the last couple of days, the DAVG values on the hosts have been very high (more than 400) and performance has degraded. The Openfiler machine is a Lenovo ThinkCentre with a 4 TB SATA HDD configured as a single volume. Around 15 VMs are running on this storage.

A single consumer-grade 4 TB SATA 7200 rpm drive (or a RAID 1 pair of them) to run 15 VMs on? It's not surprising you're running into storage performance problems. A disk like this can only handle roughly 100 random IOPS, which is almost certainly not enough to satisfy the performance needs of your 15 VMs. It also sounds like you don't have storage controller caching on this box, which makes things worse.
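As a rough back-of-the-envelope illustration (the ~100 IOPS figure is a typical value for a 7200 rpm SATA drive, not a measurement of your disk), the per-VM budget looks like this:

# Rough per-VM IOPS budget on a single 7200 rpm SATA spindle (assumed figures, not measured)
$diskRandomIops = 100      # typical random IOPS for one 7200 rpm SATA drive
$vmCount        = 15
[math]::Round($diskRandomIops / $vmCount, 1)   # ~6.7 random IOPS per VM before the disk saturates

Once the VMs collectively demand more than the spindle can deliver, requests queue up on the storage side and DAVG climbs into the hundreds of milliseconds, which is exactly what you're seeing.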

We pinged the storage from the hosts and vice versa, and the response time was below 1 ms. Attached is the disk I/O result measured from a Windows VM hosted on the Openfiler storage.

You are merely testing network latency here, which says absolutely nothing about contention on the physical disks. Based on what you've told us, I'm 98% sure the problem lies with insufficient performance from the disk subsystem and not with the network infrastructure.
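If you want to verify this from the host side instead of with ping, a minimal PowerCLI sketch along these lines (the host name is a placeholder; it assumes an existing Connect-VIServer session) pulls the realtime device latency counter, which is roughly what esxtop reports as DAVG:

# Realtime physical device latency (ms) per storage device on one host (placeholder host name)
Get-Stat -Entity (Get-VMHost "esx01.example.local") -Stat "disk.deviceLatency.average" -Realtime |
    Group-Object Instance |
    Select @{N="Device"; E={$_.Name}},
           @{N="AvgDeviceLatencyMs"; E={[math]::Round(($_.Group | Measure-Object Value -Average).Average, 1)}},
           @{N="MaxDeviceLatencyMs"; E={($_.Group | Measure-Object Value -Maximum).Maximum}} |
    Format-Table -AutoSize

If the latency on the naa.* device backing the Openfiler datastore is consistently in the hundreds of milliseconds while the network round-trip stays under 1 ms, that confirms the bottleneck is the disks, not the network.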

-- http://alpacapowered.wordpress.com
pingnithin
Enthusiast

This is only a dev environment, which is why we went with Openfiler. Is there any workaround for this other than replacing the storage?

Nithin Radhakrishnan www.systemadminguide.in
MKguy
Virtuoso

There's nothing wrong with Openfiler itself, but like every storage system it needs an appropriate physical disk subsystem underneath it.

I'm afraid not, really. You can look at which VMs consume the most IOPS in your environment with (r)esxtop or PowerCLI and check what they're doing.

Here's a sample PowerCLI snippet that reports realtime (20-second interval) values:

# Average and peak read/write IOPS per VM, from realtime datastore statistics
# Assumes an existing Connect-VIServer session; realtime stats are only returned for powered-on VMs
Get-VM | Sort Name |
    Select @{N="Name"; E={$_.Name}},
        @{N="AvgWriteIOPS";  E={[math]::round((Get-Stat $_ -stat "datastore.numberWriteAveraged.average" -RealTime | Select -Expand Value | measure -average).Average, 1)}},
        @{N="PeakWriteIOPS"; E={[math]::round((Get-Stat $_ -stat "datastore.numberWriteAveraged.average" -RealTime | Select -Expand Value | measure -max).Maximum, 1)}},
        @{N="AvgReadIOPS";   E={[math]::round((Get-Stat $_ -stat "datastore.numberReadAveraged.average" -RealTime | Select -Expand Value | measure -average).Average, 1)}},
        @{N="PeakReadIOPS";  E={[math]::round((Get-Stat $_ -stat "datastore.numberReadAveraged.average" -RealTime | Select -Expand Value | measure -max).Maximum, 1)}} |
    Format-Table -AutoSize

Name  AvgWriteIOPS  PeakWriteIOPS  AvgReadIOPS  PeakReadIOPS
----  ------------  -------------  -----------  ------------
VM1           14,5             48          0,1            21
VM2           18,6             51          2,1           289
VM3           14,6             81            0             4

You could enable Storage I/O Control on the datastore, which should help distribute IOPS a little more fairly between competing VMs, but it probably won't help much at this level and doesn't solve the fundamental problem.
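If you do want to try it anyway, a minimal PowerCLI sketch (using the 'Openfiler-DB1' datastore name from your vmkernel log; adjust to your actual datastore) would be:

# Enable Storage I/O Control on the shared datastore (name taken from the log above)
Get-Datastore "Openfiler-DB1" | Set-Datastore -StorageIOControlEnabled $true

But again, with a single SATA spindle behind the datastore there is very little I/O to distribute in the first place.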

-- http://alpacapowered.wordpress.com
pingnithin
Enthusiast

Thanks. The script is helpful.

Nithin Radhakrishnan www.systemadminguide.in