In most cases such issues are related to the storage rather than the network.
To see whether the network throughput is as expected you should use a tool which transfers data without storage access, i.e. something like iperf, or NetIO.
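For example, a basic iperf run between two VMs would look roughly like this (the IP address is a placeholder; adjust to your environment):

```shell
# On the receiving VM, start iperf in server mode:
iperf -s

# On the sending VM, run for 30 seconds with 4 parallel streams
# (-c client mode, -t duration in seconds, -P parallel streams):
iperf -c 192.168.1.10 -t 30 -P 4
```

On a healthy dedicated gigabit link you'd expect the summary line to report somewhere around 940 Mbits/sec.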
I have tested the VMs by moving them to local datastores on each of the ESX hosts, thereby bypassing the iSCSI infrastructure. When I did that, I did not see any noticeable change.
I have tested with iperf, and I appear to be getting about 950 Mbps. This number goes up when the VMs are on the same host, usually to somewhere around 4.5 to 5 Gbps.
As André has pointed out, this is usually caused by poor underlying storage performance, and your test results would seem to confirm this. Your local storage has different performance characteristics than the shared storage, so you should be looking at the shared storage to determine the cause. What looks immediately suspect to me are the 7.2K SATA drives, which are horrible for virtual machines because they cannot handle random writes well at all.
What would be the best way to test this? Wouldn't shutting down 3/4 of the VM environment have helped determine this? Also, wouldn't moving VMs to different storage devices show different results? So far, no matter what I change, I still get roughly the same results. Note: to supplement the 7.2K drives, I have SSDs set up as cache.
Thanks for your reply.
What type/model of storage array do you use?
Does it use enterprise SSDs, and NL-SAS disks, or consumer SSD, and SATA disks?
SSDs are certainly fast, but it really depends on the storage controller and/or how the storage software uses them.
I'm pretty sure that it can take a wide variety of drives, including SAS, SATA, and SSD. That said, I wouldn't call it an enterprise array. Currently I have 10x 2TB WD Gold drives running in RAID 6. I have 2 SSDs that are set up as read cache -- one for each.
I'm running 21 VMs:
4x - Windows Server
8x - Windows 7
6x - Linux Ubuntu
3x - Virtual Appliances
With that said, do you feel that I'm running at expected performance? If so, how can I tell? My array metrics seem quite low for the usage (metrics are for one day).
CPU = 20% with a few small spikes up to 60%
Network 1 = Management
2 = Highest Received 30 MB/s / Highest Sent 23 MB/s -- Multipath
3 = Highest Received 34 MB/s / Highest Sent 24 MB/s -- Multipath
4 = HA
Disk = Averages less than 10% but occasionally spikes to 59% (highest)
iSCSI IOPS = Average 100, Highest spike was small, but hit 1769
Queue Depth = Average 0, highest spike was 5.
Although RAID-6 with mechanical hard drives is terrible for write performance (double parity means roughly 6 back-end I/Os per front-end random write), with 10 of them (the SSD cache makes no difference here because it's read cache only) I'd still expect to be getting way more than that. Provide some more details about your Synology setup. You're using iSCSI? What type? What version of DSM? How are your hosts connected to this storage? What's the networking topology in place here? I'd also grab I/O Analyzer and deploy it to run as a testbed.
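To put rough numbers on the RAID-6 write ceiling (the figures here are my own assumptions, not from this thread: about 80 random IOPS per 7.2K SATA spindle and the usual RAID-6 penalty of 6 back-end I/Os per random write):

```python
# Back-of-the-envelope random-write ceiling for a RAID-6 group of spinning disks.
# Assumed figures (not from this thread): ~80 random IOPS per 7.2K SATA spindle,
# and the standard RAID-6 penalty of 6 back-end I/Os per front-end random write
# (read data + read both parities, then write data + write both parities).
def raid6_write_iops(drives: int, iops_per_drive: float = 80.0) -> float:
    WRITE_PENALTY = 6
    return drives * iops_per_drive / WRITE_PENALTY

print(round(raid6_write_iops(10)))  # -> 133, in the same ballpark as the
                                    # ~100 average iSCSI IOPS reported above
```

So even before caching, a 10-spindle RAID-6 group of 7.2K drives only supports low hundreds of random write IOPS, which is consistent with the numbers you're seeing.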
I really can't tell you what you can expect, but IMO you should at least see better read performance with the SSDs as read cache, unless you have lots of cache misses.
Anyway, did you configure your environment according to the vendor's recommendation (see Knowledge Base | Synology Inc)?
Especially the point about disabling DelayedAck may be important.
For how to disable this on a production system see e.g. https://kb.vmware.com/kb/1002598
The vendor's recommendations are almost identical to what I have; the only difference is that mine also employs multipath. I haven't tried disabling DelayedAck, and it sounds like it might be worth a shot. I'll have to schedule an outage to test it.
We are using file-based iSCSI, as this is what was recommended by Synology at the time of installation. DSM version 6.2.1-23824 Update 2. Everything is connected via a single gigabit switch (soon to be upgraded to 2). The switch is a 48-port HP 1920. There are multiple VLANs set up, with 2 VLANs dedicated to our iSCSI connections (VLAN 2, VLAN 3).
We have 2 hosts that are identically configured. Each has 4 physical adapters, 2 of which are dedicated to iSCSI; the other 2 are for guest connections. The iSCSI ports are set up so that no tagging is done on the host, but rather at the switch. Each host is set up to use multipath across the 2 physical connections.
On the array side there are 4 NICs: 2 are dedicated to iSCSI, one to management, and the other to the HA heartbeat.
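For reference, on the ESXi side a multipath setup like this is typically done with software iSCSI port binding; a rough sketch (the vmhba and vmk names below are placeholders, check yours first):

```shell
# Find the software iSCSI adapter name:
esxcli iscsi adapter list

# Bind each iSCSI-dedicated vmkernel port to the software iSCSI adapter
# (adapter/nic names are placeholders for your environment):
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk1
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk2

# Verify that both portals (and therefore both paths) show up:
esxcli iscsi networkportal list
```

If both vmkernel ports are bound correctly, each LUN should show two paths in the host's storage view.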
I will get back to you later with my I/O tests. However, below I have attached the results of CrystalDiskMark, which was run on 4 of our servers today.
(all values in MB/s; two runs per server)

          Seq Q32T1  Seq Q32T1  4K Q32T1  4K Q32T1  Seq     Seq    4K    4K
          Read       Write      Read      Write     Read    Write  Read  Write
Server1   195        96         108       17        107     63     8     3
Server2   195        140        110       22        107     62     8     3
Server3   180        88         100       11        103     53     7     3
Server4   65         37         7         3         58      34     6     2
Server1   194        103        107       18        107     67     7     3
Server2   192        100        104       3         107     96     8     2
Server3   189        116        103       8         103     55     7     2
Server4   60         45         4         3         57      34     5     2
Average   158.75     90.625     80.375    10.625    93.625  58     7     2.5
Note: there is something going on with Server4. I will delve into fixing that after I have resolved the issue at hand.
Again, thanks for all your help.
Here are the results of my I/O tests.
Test Performed                 Workload Spec            IOPS      Read IOPS  Write IOPS  MBPS    Read MBPS  Write MBPS
Max Write IOPS (10 min)        0.5k_0%Read_0%Random     10623.29  0          10623.29    5.19    0          5.19
Max Write Throughput (10 min)  0.5k_0%Read_0%Random     13220.30  0          13220.30    6.46    0          6.46
Max Throughput (10 min)        512k_100%Read_0%Random   354.26    354.26     0           177.10  177.13     0
Max IOPS (10 min)              0.5k_100%Read_0%Random   39135.70  39135.70   0           19.11   19.11      0
One thing that seemed odd to me was the MBPS. It looks very low -- lower than some of the CIFS tests I had performed.
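The low MB/s figures on the 0.5k tests follow directly from the block size: throughput is just IOPS times I/O size, so tens of thousands of 512-byte I/Os still amount to only a few MiB/s. A quick check reproducing the numbers above (assuming the tool reports MBPS in MiB/s):

```python
# Throughput = IOPS * block size. At 512-byte I/Os, even very high IOPS
# figures translate to only a few MiB/s. (Assumes MBPS is reported in MiB/s.)
def mibps(iops: float, block_bytes: int) -> float:
    return round(iops * block_bytes / 2**20, 2)

print(mibps(10623.29, 512))       # -> 5.19, matches the Max Write IOPS row
print(mibps(39135.70, 512))       # -> 19.11, matches the Max IOPS row
print(mibps(354.26, 512 * 1024))  # -> 177.13, matches the 512k read test
```

So the small-block MBPS numbers are expected; it's the large-block write throughput that is the real concern.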
Wow, those write throughput statistics are horrible. No wonder you're noticing such bad performance. There is such a huge disparity in read vs write numbers because the read figures are being boosted by your SSD cache fronting your disk group. Here are some of the things I'd check and test:
- I'd want to actually *see* how you have your virtual switches set up with regard to iSCSI and their associated vmkernel ports.
- Your iSCSI portal on Synology is connected over L2 (from the vmkernel adapters), correct?
- Do some network tests and look at response latencies from vmkernel to the iSCSI portal service. Check all adapters/uplinks. What does that look like?
- Check network stats for these vmkernels/uplinks. Are there dropped packets?
- The file-based LUN on Synology has, in my experience (I own 2 units in my lab), provided the worst performance in exchange for the most flexibility on the VMware side. Whether that trade-off is worth it is something you have to decide for yourself. But provision a block LUN (multiple LUNs on RAID) and run some tests with I/O Analyzer to compare. Also compare against an NFS v3 export backed by the same Synology. What do the numbers look like when stacked against each other?
- What does the CPU and memory utilization on the NAS look like? A file-based iSCSI LUN takes the most system resources.
- Do you have any active snapshots on this LUN within Synology?
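A couple of the network checks above can be sketched from the ESXi shell (the vmkernel interface, uplink name, and portal IP below are placeholders):

```shell
# Latency from a specific vmkernel port to the iSCSI portal
# (-I selects the vmkernel interface, -c the packet count):
vmkping -I vmk1 -c 20 10.0.2.10

# Verify jumbo frames end-to-end: 8972-byte payload + headers = 9000-byte MTU;
# -d sets the don't-fragment bit so an MTU mismatch fails loudly:
vmkping -I vmk1 -d -s 8972 10.0.2.10

# Per-uplink NIC statistics; look for non-zero dropped Rx/Tx counters:
esxcli network nic stats get -n vmnic2
```

Run the same checks for each bound vmkernel/uplink pair so both multipath legs get exercised.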
FYI, I'm leaving the country on vacation tomorrow and won't return for more than 2 weeks. I won't be able to respond during that time. Good luck.
2. I'm not quite sure what you mean by portal. However, our iSCSI traffic is dedicated to 2 of the Synology ports. These are in turn connected to a switch, which manages all the tagging and untagging of iSCSI traffic. Likewise, we have 2 ports on each host that are dedicated to iSCSI traffic; they are set up for MPIO and have no VLAN configured on them. All management of the NAS goes over a different port that is purely dedicated to this function. Port 4 is dedicated to HA.
3. vmkping shows 0.134 to 0.135 ms. Jumbo frames are also working over the connection. I'm not sure what other testing I should do. All adapters appear to be up and running with zero dropped packets.
4. We have seen zero dropped packets on all interfaces.
5. I talked to Synology, and they said block-level was removed from DSM 6.2. As far as NFS goes, the reason I didn't want to use it was that I wouldn't be able to leverage both NICs on my ESX machines. Our belief was that LACP was only available if you purchased vCenter Enterprise Plus or higher.
6. CPU averages under 20 percent, with spikes seen as high as 50%. Memory is at 29%.
7. We do have snapshots, but none that are currently running.
- What version of ESXi?
- What type of hardware?
- Is this not the ESXi software iSCSI initiator you have here?
- When you say "vmkping shows .135 to .134ms" you do mean less than one millisecond, correct, and not one hundred thirty-five milliseconds?
- "I talked to Synology, and they said block level was removed from DSM 6.2" <== I didn't know this. I'm still on 6.1 myself.
- As far as using NFS, yes, I understand that, but as it stands right now with your current performance numbers (on writes) you're nowhere near saturating a 1 GbE uplink. I would still recommend you try it on a single host as an experiment to compare the results. You don't have to delete your iSCSI configuration as long as the NFS export is on one of the same networks you have. By bypassing the iSCSI stack and running performance tests you can eliminate a complex variable in the equation.
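Mounting a test NFS export on one host is non-destructive and easy to back out. Assuming hypothetical names and addresses, it would look roughly like:

```shell
# On the Synology, first create an NFS-exported shared folder (e.g.
# /volume1/nfstest) and permit the host's storage-network IP. Then on one
# ESXi host (host IP, export path, and datastore name are placeholders):
esxcli storage nfs add --host 10.0.2.10 --share /volume1/nfstest --volume-name nfs-test

# Confirm the datastore mounted, run the same I/O Analyzer workloads
# against it, and remove it afterwards:
esxcli storage nfs list
esxcli storage nfs remove --volume-name nfs-test
```

Because the existing iSCSI datastores stay untouched, you can compare the two sets of I/O Analyzer results side by side and then simply unmount the NFS export.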
- "We do have snapshots, but none that are currently running." <==What does "currently running" mean here? What I meant was does this iSCSI LUN on the Synology side have an open or active snapshot against it?