VMware Cloud Community
Phel
Enthusiast

ESXi 7.0 - VMDK/Datastore throughput/tps issues

Coming fresh off of this forum post: https://forums.plex.tv/t/docker-container-keeps-loosing-connection-access-db-constantly-busy/590693/...

I want to see what potential issues there are when going from 6.7u3 to 7.0 with VMDKs on Ubuntu 20.04. When running disk latency/throughput commands against an NVMe Samsung 970 Pro, there is a drastic bandwidth difference between newly created VMDKs and existing ones (all created within ESXi 7.0).

All of the tests below were run from inside the guest VM.

For instance:

@apollo:~$ sudo fdisk /dev/sdd

Command (m for help): n

Select (default p):

Using default response p.

Partition number (1-4, default 1):

First sector (2048-134217727, default 2048):

Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-134217727, default 134217727):

Created a new partition 1 of type 'Linux' and of size 64 GiB.

Command (m for help): w

The partition table has been altered.

Calling ioctl() to re-read partition table.

Syncing disks.

andrew@apollo:~$ sudo mkfs.ext4 /dev/sdd

mke2fs 1.45.5 (07-Jan-2020)

Found a dos partition table in /dev/sdd

Proceed anyway? (y,N) y

Creating filesystem with 16777216 4k blocks and 4194304 inodes

Filesystem UUID: 500830b1-a52c-4aba-ad7f-c4b012d7278d

Superblock backups stored on blocks:

        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,

        4096000, 7962624, 11239424

Allocating group tables: done

Writing inode tables: done

Creating journal (131072 blocks): done

Writing superblocks and filesystem accounting information: done
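Side note: mkfs.ext4 was run against the whole device (/dev/sdd) rather than the partition that had just been created, which is why mke2fs warns about the existing DOS partition table. Assuming the intent was to use the new partition, the equivalent command would be:

andrew@apollo:~$ sudo mkfs.ext4 /dev/sdd1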

andrew@apollo:~$ sudo dd if=/dev/sdd of=/dev/null bs=1M

65536+0 records in

65536+0 records out

68719476736 bytes (69 GB, 64 GiB) copied, 19.0623 s, 3.6 GB/s

andrew@apollo:~$ sudo dd if=/dev/sda of=/dev/null bs=1M

24576+0 records in

24576+0 records out

25769803776 bytes (26 GB, 24 GiB) copied, 29.965 s, 860 MB/s


We can see there is a huge difference between the in-use drive and an empty disk, even when running against a nearly empty disk (<14 GB of data). In addition, it took about 3 minutes to write a 12 GB test file from `/dev/urandom` to /mnt/plex (sdb1):

@apollo:/mnt/plex$ sudo dd if=/dev/urandom of=/mnt/plex/testfile.fake bs=1M count=105MB

[sudo] password for :

11117+0 records in

11116+0 records out

11655970816 bytes (12 GB, 11 GiB) copied, 183.181 s, 63.6 MB/s
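Note that both dd runs above go through the Linux page cache, and the write test is also partly limited by how fast /dev/urandom can produce data, so the numbers mix several effects. A rough sketch of cache-independent runs (the count values below are just examples, not from the original tests):

# Sequential read, bypassing the guest page cache (O_DIRECT)
sudo dd if=/dev/sdd of=/dev/null bs=1M count=8192 iflag=direct

# Sequential write of zeros, so /dev/urandom is no longer a factor;
# data is flushed to disk before the rate is reported
sudo dd if=/dev/zero of=/mnt/plex/testfile.fake bs=1M count=8192 oflag=direct conv=fsync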

63.6 MB/s seems like a ridiculously low write speed. And here is hdparm -Tt run against two different VMDKs located on the same NVMe SSD:

@apollo:~$ sudo hdparm -Tt /dev/sdb

/dev/sdb:

Timing cached reads:   19262 MB in  2.00 seconds = 9651.45 MB/sec

SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Timing buffered disk reads: 2706 MB in  3.00 seconds = 901.96 MB/sec

andrew@apollo:~$ sudo hdparm -Tt /dev/sdb

/dev/sdb:

Timing cached reads:   20236 MB in  2.00 seconds = 10141.30 MB/sec

SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Timing buffered disk reads: 2738 MB in  3.00 seconds = 912.12 MB/sec

andrew@apollo:~$ sudo hdparm -Tt /dev/sdc1

/dev/sdc1:

Timing cached reads:   19366 MB in  2.00 seconds = 9703.55 MB/sec

SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 00 00 00 00 20 00 00 c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Timing buffered disk reads: 9146 MB in  3.00 seconds = 3047.93 MB/sec

andrew@apollo:~$ sudo hdparm -Tt /dev/sda1

All of this is on Ubuntu 20.04, so I am wondering whether there is some issue with ESXi 7.0 around throughput, VMDK creation, TPS, etc.

Is there a better way to debug this at the ESXi level, and what can I do?
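A rough starting point on the ESXi side (a sketch only, nothing below is specific to this setup) would be esxtop's disk views and the storage device list:

# On the ESXi host via SSH / ESXi Shell
esxtop   # press 'd' (disk adapter), 'u' (disk device), 'v' (per-VM disk)
         # and watch the DAVG/KAVG/GAVG latency columns while re-running the guest tests

# Confirm how the NVMe device is presented to the host
esxcli storage core device list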

To confirm, this has all been done on:

t10.NVMe____Samsung_SSD_970_PRO_1TB_________________EA43B39156382500, partition 1

CPU: AMD Ryzen 7 1700 Eight-Core Processor

64 GB DDR4 ECC memory

X470 Motherboard

The guest VM's settings are shown in the attached screenshot (pastedImage_16.png).

I was going to try to downgrade back to ESXi 6.7u3, but it seems like the upgrade to 7.0 wipes out the bootbank's 6.7 config:

[root@esxi2:~] tail -2 /*bootbank/boot.cfg

==> /altbootbank/boot.cfg <==

build=7.0.0-1.0.15843807

updated=5

==> /bootbank/boot.cfg <==

build=7.0.0-1.0.15843807

updated=4

So my question at this point is: should I load ESXi 6.7u3 back onto a USB stick and boot from that for the time being? I wanted to test k8s/Kubernetes on vSphere/ESXi, but I haven't been able to figure out how to do that on a single ESXi host, and I would rather not endure random slowdown, throughput, or TPS issues on VMDKs like the above.

2 Replies
continuum
Immortal

Just a suggestion .....

A complaint about performance problems can be easily waved away when you are using snapshots on unsupported hardware.

So at least get the snapshots out of the equation. Of course, snapshots do not perform as well as regular VMDKs.
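One way to confirm and clear the snapshot chain from the ESXi shell (a sketch; the VM ID comes from the first command, nothing here is specific to this setup):

vim-cmd vmsvc/getallvms                   # find the VM's numeric ID
vim-cmd vmsvc/get.snapshotinfo <vmid>     # list any snapshots on that VM
vim-cmd vmsvc/snapshot.removeall <vmid>   # consolidate/remove all snapshots (after taking a backup)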


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

Phel
Enthusiast

I can easily remove the snapshot, but it was made as a precaution after most of the commands above had been run (to be specific, it was taken when installing Plex directly on the system rather than in a Docker container). There was no performance difference across Docker container > snapshot > on-system, so while I am not saying the snapshot couldn't cause some performance issue, it doesn't seem to be the cause in my mind.
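To double-check which of the tested disks actually sit on a snapshot chain, one option (paths below are placeholders) is to look at the VM's directory on the datastore: a disk running on a snapshot has delta files next to its base vmdk.

ls -lh /vmfs/volumes/<datastore>/<vm>/ | grep -i vmdk
# base disk:       <vm>.vmdk and <vm>-flat.vmdk
# snapshot delta:  <vm>-000001.vmdk plus <vm>-000001-sesparse.vmdk (or -delta.vmdk on older VMFS)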
