We recently moved from our previous product, which did HotAdd backups, to a newer product called Rubrik, which does NBD backups (no negative comments on this, please). We have dedicated 10GbE to our Cisco Nexus core, and backend storage is a Pure Storage all-flash array over 8Gb FC. We are seeing decent speeds for NBD, although nowhere near what we had before; it still is not hitting even gigabit speeds. On our Cisco UCS, which has four blades sharing dual dedicated 10GbE uplinks for the management network, it's even worse. It's like running at 100Mbps!
I tested with an HP server (ProLiant DL585 G7) with only gigabit links and it was only around 1MB/sec. The UCS is around the same. When I upgraded the HP server to 10GbE, it went to about 125MB/sec. I have confirmed that everything is configured correctly at the storage and switching layers. Any thoughts on this? Why does ESXi seem to limit per-VM bandwidth on the management NIC to roughly 10% of the link speed? vMotion is getting around 300MB/sec; I feel that should be faster as well.
ESXi 6.0 Build 50505093
vCenter 6.0 Build 4541947
I am running only one VM on a particular blade and doing a backup. The host shows 10Gbps, and the uplinks are 10Gbps all the way to the Rubrik appliance. I even pinned the blade to its own dedicated set of uplinks. Still only getting 1Gbps max. I'm still wondering if the problem is in the UCS itself (older blade) and ESXi somehow detects this.
I have never seen anything that would throttle NBD traffic differently from other traffic. vSphere does have the ability to set network reservations and limits when using vCenter and a VDS, but by default everything is wide open.
There are many things that can cause this behavior due to configuration. For example, if any point in the data path uses the E1000 vNIC, that driver limits you to 1Gbps; switch to vmxnet3. RSS and ring buffer settings will also impact network throughput, as will TCP offload settings and others. Then there is the storage behind the NBD transfer: whether the reads and writes are sequential or random makes a difference, as does the array's ability to handle the whole load it is responsible for, not just the backup job. Even with separate arrays, each needs to be fast enough. Finally, TCP/IP overhead reduces the actual data rate written to the array.
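One quick way to check for a stray E1000 adapter is to look for the `ethernetN.virtualDev` keys in a VM's .vmx file. A minimal sketch of what to look for; the sample config text below is made up for illustration, so substitute the real .vmx contents from your datastore:

```python
import re

# Made-up sample of the relevant .vmx lines; on a real host you'd read
# the VM's actual .vmx file from the datastore instead.
sample_vmx = """\
ethernet0.virtualDev = "e1000"
ethernet1.virtualDev = "vmxnet3"
"""

def vnic_types(vmx_text):
    """Return {adapter: driver} from .vmx text. 'e1000' entries are the
    ones holding you near 1Gbps; you want 'vmxnet3' everywhere."""
    pattern = re.compile(r'(ethernet\d+)\.virtualDev\s*=\s*"([^"]+)"')
    return dict(pattern.findall(vmx_text))

print(vnic_types(sample_vmx))  # {'ethernet0': 'e1000', 'ethernet1': 'vmxnet3'}
```

The same information is visible in the vSphere client under the VM's network adapter settings; the .vmx just makes it easy to sweep many VMs at once.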
Based on what you are saying, I suspect an E1000 vNIC somewhere in the data path, but when talking about going faster than 1Gbps there is a lot more tuning to do, from the guest OS all the way through to the storage array. A good place to start is the performance guides for the guest OS, ESXi, Rubrik, and the storage array. It is rare that the default configurations are best when trying to achieve over 1Gbps network speeds.
There are some network tests out there that use a ramdisk to rule storage out as the bottleneck. No, these are not very accurate for backup speeds, because the kind of reads and writes is critical and varies, but at least you can confirm the network path can perform over 1Gbps to start.
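The ramdisk idea above can be approximated with a small script that pushes memory-resident data over a TCP socket, so disks never enter the picture. A rough sketch, not any particular vendor's tool: in practice you would run the receiver half on one host and the sender half on the other; here both sides run on loopback just to show the flow, and the transfer size is an arbitrary choice.

```python
import socket, threading, time

CHUNK = b"\0" * (1 << 20)   # 1 MiB generated in memory, so disk speed is irrelevant
TOTAL_MIB = 64              # total amount to push; raise this for a longer run

result = {}

def server(ready):
    # Receiver: counts bytes until the sender closes the connection.
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))              # port 0 = let the OS pick a free port
    srv.listen(1)
    result["port"] = srv.getsockname()[1]
    ready.set()
    conn, _ = srv.accept()
    received = 0
    while True:
        data = conn.recv(1 << 20)
        if not data:
            break
        received += len(data)
    conn.close()
    srv.close()
    result["mib"] = received // (1 << 20)

ready = threading.Event()
t = threading.Thread(target=server, args=(ready,))
t.start()
ready.wait()

# Sender: blasts in-memory buffers at the receiver and times the transfer.
cli = socket.socket()
cli.connect(("127.0.0.1", result["port"]))
start = time.time()
for _ in range(TOTAL_MIB):
    cli.sendall(CHUNK)
cli.close()
t.join()
elapsed = time.time() - start
print(f"pushed {TOTAL_MIB} MiB in {elapsed:.2f}s ({TOTAL_MIB / elapsed:.0f} MiB/s)")
```

If this shows well over 1Gbps between the ESXi management network and the backup target but the backup job still crawls, the bottleneck is more likely the vNIC type, offload settings, or the read pattern on the source storage than the wire itself.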