Fast Virtualized Hadoop and Spark on All-Flash Disks - Best Practices for Optimizing Virtualized Big Data Applications on vSphere 6.5

Version 2

    Note: This paper has been updated for vSphere 6.5. The version for vSphere 6.0 is also attached below.


    This paper describes best practices for optimizing Big Data applications running on VMware vSphere. It documents hardware, software, and vSphere configuration parameters, as well as tuning parameters for the operating system, Hadoop, and Spark. The Hewlett Packard Enterprise ProLiant DL380 Gen9 servers used in the tests featured fast Intel processors with a large number of cores, large memory (512 GiB), and all-flash disks. Test results are shown for two MapReduce and three Spark applications running on three vSphere configurations (1, 2, and 4 VMs per host) as well as directly on the hardware. Among the virtualized clusters, the fastest configuration was 4 VMs per host, owing to NUMA locality and the best disk utilization. The 4 VMs per host configuration was faster than bare metal in all tests except a large (10 TB) TeraSort test, where the bare-metal advantage of larger memory outweighed the disadvantage of NUMA misses.