Hi,
For several days we have been trying to find a way to improve performance when moving VM images from one (free) ESXi host to another.
Our situation is as follows:
We have 4 different ESXi servers running; all hardware should be supported according to VMware's Compatibility Guide. They are connected via a gigabit network, all VMs are stopped, and nothing else is running on the network. All VMs are stored locally on SATA disks.
Now we wanted to move a bunch of VM images from one ESXi host to the others - but regardless of the technique we used, we achieved a maximum speed of about 10 MByte/s (at which rate transferring everything would take several days).
We tried the following:
- Enabled SSH and used SCP to copy the files directly between the servers (trying several different ciphers)
- Used Veeam FastSCP to copy the files directly
- Logged in via SSH on both sides and used a combination of tar & netcat to push the raw files over the network
- Installed ProFTPD and copied the files via FTP
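For reference, the SCP and tar & netcat attempts looked roughly like the following. Hostnames, port number, and datastore paths are placeholders, the cipher is just one example of a cheaper algorithm, and the exact nc flags differ between netcat variants:

```shell
# SCP with an explicitly chosen (cheaper) cipher:
scp -c aes128-ctr /vmfs/volumes/datastore1/myvm/*.vmdk \
    root@target-host:/vmfs/volumes/datastore1/myvm/

# tar & netcat - on the receiving host, listen and unpack the stream:
nc -l -p 9000 | tar -xf - -C /vmfs/volumes/datastore1/

# ...and on the sending host, stream the VM directory uncompressed:
tar -cf - -C /vmfs/volumes/datastore1/ myvm | nc target-host 9000
```

The tar & netcat variant avoids SSH encryption overhead entirely, which is why we expected it to be the fastest of the bunch.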
With all these methods, the write speed never exceeded ~10 MByte/s (often it was even slower). So we suspected a networking issue and tried some local transfers.
But even a local "dd" transferring data from one local disk to the other was just as slow as the network transfers... We tried this on all the different ESXi hardware we have, with RAID enabled, disabled, etc.
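If anyone wants to reproduce the local test: a sequential write with dd, flushed to disk so the cache doesn't inflate the number. The path is a placeholder, and note that this is the GNU dd form - ESXi's busybox dd may not support conv=fdatasync:

```shell
# Write 64 MB of zeros sequentially; conv=fdatasync forces the data
# to disk before dd reports the throughput figure.
dd if=/dev/zero of=/tmp/ddtest.bin bs=1M count=64 conv=fdatasync
```

Without the fdatasync the reported rate mostly measures the page cache rather than the disk.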
After some Googling, we found that this is apparently an issue with VMFS, which doesn't really support the use of "standard" filesystem tools. So we tried to find a way to use ESXi-internal tools to copy/clone the VM images.
We set up an NFS server "between" 2 ESXi servers, i.e. both have access to the NFS share. Copying from ESXi to NFS is incredibly fast (via the vSphere Client, running at full gigabit speed) - but copying from NFS to ESXi is slow again (~10 MByte/s as before). The final test we did was to use the vmkfstools command-line utility to clone from the NFS share to the local VMFS disk. This seems to be a little faster (hard to measure due to missing statistics; it appears to be about ~20 MByte/s), but it is far from a fast solution.
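For anyone who wants to reproduce the last test, the vmkfstools clone looked roughly like this; the datastore names and the thin-provisioning option are placeholders/assumptions:

```shell
# Clone a VMDK from the NFS datastore to the local VMFS datastore.
# -i = import/clone, -d thin = thin-provision the destination disk.
vmkfstools -i /vmfs/volumes/nfs_share/myvm/myvm.vmdk \
           -d thin \
           /vmfs/volumes/local_vmfs/myvm/myvm.vmdk
```

Since vmkfstools runs inside ESXi, it at least avoids the VMFS-vs-standard-tools problem described above.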
What is the expected way to move a VM image to an ESXi server? Is there a way to improve the transfer speed?
Try using iSCSI instead of local disks, or NFS. It's safer and faster than local disk, and it will also be accessible to all ESXi hosts.
You can also use Veeam Backup & Replication instead of those tools for copying and migrating virtual machines.
Welcome to the Community,
do your hosts have a disk/RAID controller with write cache enabled? Usually you need either battery-backed or flash-backed cache in order to turn on write caching. ESXi itself doesn't do any write caching and fully relies on the hardware.
André
@Davoud: IIRC the backup mechanism of Veeam is not available for the free ESXi version (due to missing API support).
Switching to iSCSI or NFS might be a solution but would increase the overall complexity of the system (e.g. the need for one or more additional NAS/SAN devices). Nevertheless, as the proprietary VMFS is only readable from ESX(i), we intend to switch to NFS to simplify backups.
@a.p.: We have already eliminated the RAID controller as the source of the problem. We tried the scenarios above on systems with and without a RAID controller, and the behavior was always the same - the problem exists only when working with .vmdk files on the ESXi host. Within any VM (regardless of OS type), the disk performance is good!
As a.p. says: do you have a RAID controller with read/write cache? That makes a big difference! ESXi does not do any disk caching on its own and relies fully on the controller. That means locally attached disks are basically not cached at all (apart from the quite small in-drive cache of a few MB), and there is not much you can do about it. If the disks are attached to a RAID controller, activate the controller cache (for both reading and writing). And to make things even more complicated, some controllers do not allow caching of write operations if no battery backup unit is attached...
"...the problem exists only when working with .vmdk on the ESXi host. Within any VM (regardless of OS type), the disk performance is good..."
That's what I'm talking about: ESXi does NOT do disk caching, while every modern OS (including the one in your VM) does. That's why disk performance in the VM is good and disk performance in ESXi is poor (despite using the very same disk as the VM). And it is one more sign that you probably either have no controller cache at all, or it is not active...
See above: we tried this on hosts with and without RAID controllers - and the disk performance within any VM is very good! ...
Sorry, I should have put the update to my post under yours, so once more for clarity:
ESXi does not do disk caching.
The OS (even the one in the VM) does disk caching.
So disk performance in the VM can be good, and yet the disk performance of ESXi (using the very same local disk as the VM) can be poor.
Does this also have implications for systems without RAID?
Yes, it does. As I said, this problem (poor local disk I/O performance observed in ESXi, but not in the VM) can only be eliminated by a "mature" RAID controller with a big on-board cache, usually 512 MB - 2 GB (or network storage of course, but we are talking about local disks now).
It does not matter whether the local disk is attached to a chipset controller or to a RAID controller without cache; performance is equally bad. Actually, it can be even worse on a RAID controller (without cache) if you are using RAID 5/6, where parity calculations must be done...
Ok thanks, that explains a lot. Maybe we can evaluate this with a different RAID controller... but it's very likely that we will use the NFS datastore, as it has other pros compared to the local disk solution (especially avoiding VMFS...).
"...So disk-performance in VM can be good, and yet disk performance of ESXi (using the very same local disk as VM) can be poor..."
Now, the question is HOW poor?
I'm copying 200 GB via the vSphere Client interface between two datastores within one chassis (from a 2*SSD RAID1 to a 2*HDD RAID1 attached to an Adaptec 4505 without battery, so no caching enabled), and it takes some 10 hours. Do you think that's OK POOR or really POOR POOR? (The Knowledge Base says the slow copying was inherent to 5.0 and 5.1, but we've got 5.5 and are trying to be, like, proud of it.)
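For a rough sense of scale: 200 GB in 10 hours works out to only about 5.7 MByte/s, i.e. even below the ~10 MByte/s reported earlier in this thread. A one-liner to check the arithmetic (the 200 GB and 10 h figures are from my copy above):

```shell
# 200 GB in 10 hours (36000 s): 200 * 1024 MB / 36000 s
awk 'BEGIN { printf "%.1f MByte/s\n", 200 * 1024 / 36000 }'
# → 5.7 MByte/s
```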
Do we need to be worried about this kind of performance or must we live with it?
Thanks!
