VMware Performance Community
nimo1983
Contributor

VMmark 3.1.1 test failing when running with 4 or more tiles (works fine with 3 tiles)

Hello Team,

I am running VMmark 3.1.1 against a 3-host setup.

  • Num of hosts: 3
  • ESXi version: VMware ESXi, 6.7.0, 18828794
  • Processor Type: Intel(R) Xeon(R) Gold 6240Y CPU @ 2.60GHz
  • Logical Processors: 72 (per host)

The prime client is running with a single NIC. The client VMs are running on a separate cluster, and the VMmark VMs are running on another cluster.

Tile 0 tested fine and was then cloned to 4 more. When I run the test with 4 tiles, it fails with the error below:

20211209-16:07:35 Info Could not complete Setup for the following 6 Wklds:
DS3DB Tile3 failed setup
DS3DB Tile0 failed setup
AuctionLB Tile3 failed setup : Review Client3_restore.txt for details
AuctionLB Tile1 failed setup : Review Client1_restore.txt for details
AuctionLB Tile0 failed setup : Review Client0_restore.txt for details
AuctionLB Tile2 failed setup : Review Client2_restore.txt for details

Connecting to the ElasticLB VM shows the following error:

Couldn't start mongod on ElasticLB : about to fork child process, waiting until server is ready for connections.
forked process: 5346
ERROR: child process failed, exited with error number 100

Note: The test runs fine with 3 tiles.

Attachments: Results folder including the VMMark properties files.

Looking forward to expert input to get past this and complete the testing.

 

 

12 Replies
RebeccaG
Expert

Hi Nimo,

Thank you so much for uploading your results files; with them I was able to find the problem.


Three main things you will need to fix:

  1. The hosts files for some of the VMmark VMs contain the IP addresses of the wrong tile VMs. E.g. ElasticLB1 should have hosts file entries for tile1. These were also incorrect for all tile 3 VMs.

The VMs’ hosts files are automatically populated by the VMmark harness when creating additional tiles. The incorrect entries suggest that you may have cloned the tiles manually or altered the hosts files.

To fix this, delete tiles 1, 2 and 3. Recreate them following “Creating Additional VMmark Tiles (Tiles 1 through n)” in the VMmark User’s Guide.

 

  2. You are using static IPs on the 10.61.187 subnet. It’s recommended that VMmark VMs use a private network and that the prime client be set up with two virtual NICs, one on the private network and one on the public network for administrative access.

Most users have trouble getting enough static IPs allocated for all the VMmark VMs, as 19 IPs are required per tile.

If you decide to keep operating on the 10.61.187 subnet, when you recreate tiles 1, 2 and 3, in VMmark3.properties, set your ProvisioningIPstaticStart = 10.61.187.126. Tile 1 will be provisioned in the correct range as the VMmark harness accounts for tile 0.

 

3. There is also an IP address conflict between the Deploy VM (Deploy/DeployVMinfo = DeployVM0:10.61.187.190) and DS3DB2 which has the same IP. You need to set DeployVMinfo to a static IP that is not in the range you specified for ProvisioningIPstaticStart. E.g. use 10.61.187.125. This issue would be resolved if you configure VMmark using an internal network instead of an external one.
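
The address planning in points 2 and 3 can be sanity-checked with a short script. The sketch below is not part of VMmark. It assumes, for illustration only, that each tile takes one contiguous block of 19 addresses laid out back to back from the first provisioned address, which may not match how the harness actually allocates them. The function names `tile_ranges` and `conflicting_tile` are hypothetical.

```python
import ipaddress

IPS_PER_TILE = 19  # from the post above: 19 IPs are required per tile


def tile_ranges(start_ip, num_tiles, ips_per_tile=IPS_PER_TILE):
    """Return {tile_index: (first_ip, last_ip)} for contiguous blocks.

    Assumption: tiles are laid out back to back starting at start_ip.
    """
    start = ipaddress.IPv4Address(start_ip)
    ranges = {}
    for tile in range(num_tiles):
        first = start + tile * ips_per_tile
        last = first + ips_per_tile - 1
        ranges[tile] = (first, last)
    return ranges


def conflicting_tile(ip, ranges):
    """Return the tile whose block contains ip, or None if there is no overlap."""
    addr = ipaddress.IPv4Address(ip)
    for tile, (first, last) in ranges.items():
        if first <= addr <= last:
            return tile
    return None


if __name__ == "__main__":
    ranges = tile_ranges("10.61.187.126", num_tiles=4)
    for tile, (first, last) in ranges.items():
        print(f"tile {tile}: {first} - {last}")
    # Candidate Deploy VM addresses taken from the post above.
    for deploy_ip in ("10.61.187.190", "10.61.187.125"):
        print(deploy_ip, "->", conflicting_tile(deploy_ip, ranges))
```

Under these assumptions, a Deploy VM address is safe only when `conflicting_tile` returns `None` for it.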

Thank you and please post here if you have any additional questions.

nimo1983
Contributor

Thanks Rebecca.

I have started deployment using private IPs, and the prime client is now assigned 2 IPs.

I will keep you posted on the progress once tile 0 has progressed.

In the meantime, I have one more question: how can I run the job with workload operations only and no infrastructure operations?

Regards,

NiMo

 

nimo1983
Contributor

Hello Rebecca and all,

We have re-created the environment with private IPs. A turbo run succeeded just fine, but the 3-hour run fails.

20211214-15:47:48 Info Guest Info failures for the following machines: 4 machines : DS3DB0 DS3DB3 DS3DB1 DS3DB2 Error Messages DS3DB0= :DS3DB3= :DS3DB1= :DS3DB2= :

Connectivity from the prime client to DS3DB* is all working well. If I trigger the run with 3 tiles, it runs further but then fails with another error:

20211214-13:31:43 Info Service Requested Stop: ExceptionFound: Unable to Update DeployVM0 to use new CPUS:CPUS: 8
20211214-13:31:43 Info VMmark3: Service Requested Stop: ExceptionFound: Unable to Update DeployVM0 to use new CPUS:CPUS: 8 : 3630.41(secs)

Note: I have attached the logs for the guest info failure run.

Please provide input, as this is holding up a project rollout.

Regards,

NiMo

RebeccaG
Expert

To run without infrastructure operations, see "Running a Subset of the Workloads (a Partial Tile)" in the VMmark User's Guide. You'll uncomment the variable "InfrastructureList" in VMmark3.properties and remove all infrastructure workloads. However, this will result in a benchmark run that does not generate a score (Score_4_Tile_Test.txt), as it is not a compliant configuration.

For the guestinfo failure, it looks like the guestinfo collection was unable to complete before timing out. (These files are in your results folder in guestinfofiles/DS3DBN-Guestinfo.txt.)

In your VMmark3.properties file, try changing MiscInfoTimeOut = 600 or even larger and rerunning.
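
Before rerunning, it can help to confirm that the edited value is the one actually present in the file. The following is a minimal sketch, assuming the file uses simple `key = value` lines (as the settings quoted in this thread suggest) and `#` for comments; both are assumptions, and the helper names are hypothetical rather than anything shipped with VMmark.

```python
def read_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def timeout_at_least(props, minimum, key="MiscInfoTimeOut"):
    """True if key is present and its integer value is >= minimum."""
    value = props.get(key)
    return value is not None and value.isdigit() and int(value) >= minimum


if __name__ == "__main__":
    # Illustrative fragment only, not a full VMmark3.properties file.
    sample = "# guestinfo collection timeout\nMiscInfoTimeOut = 600\n"
    props = read_properties(sample)
    print(props["MiscInfoTimeOut"], timeout_at_least(props, 600))
```

To check a real file, pass its contents, for example `read_properties(open("VMmark3.properties").read())`, adjusting the path to wherever the file lives on the prime client.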

Please upload the run with the error "Unable to Update DeployVM0..." when you get a chance.

Thank you, Rebecca

nimo1983
Contributor

Thanks Rebecca.

I will try the MiscInfoTimeOut value.

Here is the results folder for the latest run, with the failure for DeployVM0. It fails on either CPU or RAM. This time the error was: Unable to HotAddMemoryDeployVM0 to use new memory settings. (Results folder attached.)

Looking forward to hearing from you soon so as to get a valid complete 3 hour run.

Regards,

NiMo

 

RebeccaG
Expert

I apologize for the delay in response. In looking at the three tile result, the Deploy error occurred 1 hour into the 3 hour run.

At the beginning of the run, a deploy operation completed correctly on all of your hosts. The second time Deploy tried to run on esxi04-ehc-ehcdc.com, the error occurred: "ERROR  DeployVM0 Iteration 3 Unable to Create VM : Msg An error occurred while communicating with the remote host." (as reported in Deploy-t0.csv)

The real issue is that the Deploy VM was unable to be created, and it looks like the hot add error occurred because the VM wasn't created.

Try rerunning 3 tiles and see if the error recurs. It might not recur. If it does, you need to know whether the error occurs only on host esxi04, and when during the run it occurs. You might also want to check network connection between the vCenter Server and host esxi04, as it returned "an error occurred while communicating with the remote host". 

Also, did the increased MiscInfoTimeOut help?

nimo1983
Contributor

Thanks Rebecca, and Happy New Year.

The MiscInfoTimeOut change did help; however, subsequent attempts failed. So I tried running it directly, without the STAX client.

It gets stuck at collecting info for ds3.

/root/ds3/:

---------------------------------------------------------------------------------------------------------------
/ds3/:

==============================================================================

I tried running the same from the prime client directly (ssh ds3db0 perl "/root/VMmark3/tools/VMmark3-GuestInfo.pl /root/DS3DB0-GuestInfo.txt") rather than through the STAX UI. Run directly, it succeeds after 16 minutes, but with STAX it fails even after 30 minutes.

Is there anything specific we should look at?

jamesz08
VMware Employee

The DS3DB0-Guestinfo.txt file you attached is not complete.  Is that the one you manually ran?  This also is cut off at the /ds3 directory.  Possibly it timed out?

Can you check that there is free space on the DS3DB VM? If not, can you post the result of 'df -h' command from the DS3DB cli?

If there is free space, can you manually run this command from the DS3DB CLI:

find /ds3/ -name '*' | xargs md5sum 2>/dev/null > ds3-md5sum.txt
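
If it would help to see how long the checksum pass takes, the same idea can be sketched in Python. This is an illustrative equivalent, not a VMmark tool: it hashes only regular files (the shell command also passes directories to md5sum and discards the resulting errors), it copes with file names containing spaces, and the function names are hypothetical.

```python
import hashlib
import os
import sys
import time


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Walk root and return (results, elapsed_seconds).

    results is a list of (digest, path) for every regular file that could
    be read; unreadable files are skipped, much like 2>/dev/null above.
    """
    results = []
    started = time.monotonic()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue
            try:
                results.append((md5_of_file(path), path))
            except OSError:
                continue
    return results, time.monotonic() - started


if __name__ == "__main__":
    results, elapsed = checksum_tree("/ds3")
    for digest, path in results:
        print(f"{digest}  {path}")
    print(f"{len(results)} files in {elapsed:.1f}s", file=sys.stderr)
```

Comparing the elapsed time between a 1-tile and a multi-tile setup gives a rough indication of whether reads from /ds3 slow down as tiles are added.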

Post the results here.

 

 

nimo1983
Contributor

Thanks James.

The output I attached is from the failed run for DS3DB0, and it gets stuck at ds3 every time.

Here is the df output.

[root@DS3DB0 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   14G  5.4G  8.5G  39% /
devtmpfs                  24G     0   24G   0% /dev
tmpfs                     24G     0   24G   0% /dev/shm
tmpfs                     24G  8.8M   24G   1% /run
tmpfs                     24G     0   24G   0% /sys/fs/cgroup
/dev/sdc1                 99G   89G  4.5G  96% /ds3
/dev/sdb1                246G  196G   38G  84% /var/lib/mysql
/dev/sda1                497M  349M  149M  71% /boot
tmpfs                    4.8G     0  4.8G   0% /run/user/0
[root@DS3DB0 ~]#
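
For anyone repeating this check across several DS3DB VMs, the output above can be screened with a small script. This is a rough sketch, not a VMmark tool: the 90% threshold is an arbitrary assumption, the sample keeps only a few lines from the output above, and the function names are hypothetical. It simply flags mount points at or above the threshold and says nothing about whether a full disk is the cause of the timeout.

```python
SAMPLE = """\
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 14G 5.4G 8.5G 39% /
/dev/sdc1 99G 89G 4.5G 96% /ds3
/dev/sdb1 246G 196G 38G 84% /var/lib/mysql
/dev/sda1 497M 349M 149M 71% /boot
"""


def parse_df(output):
    """Parse `df -h` text into a list of dicts, one per filesystem line."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 5)
        if len(parts) != 6:
            continue
        filesystem, size, used, avail, use_pct, mount = parts
        pct = use_pct.rstrip("%")
        if not pct.isdigit():
            continue  # e.g. "-" for pseudo filesystems
        rows.append({"filesystem": filesystem, "size": size, "used": used,
                     "avail": avail, "use_pct": int(pct), "mount": mount})
    return rows


def nearly_full(rows, threshold=90):
    """Return mount points whose Use% is at or above threshold (assumed 90)."""
    return [row["mount"] for row in rows if row["use_pct"] >= threshold]


if __name__ == "__main__":
    print(nearly_full(parse_df(SAMPLE)))
```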

It runs fine with 1 tile, but when we add 2 or 3 tiles to the run, it throws a timeout. This always happens with the DS3DB VMs when the number of tiles is increased.

Attaching results for 2 recent runs: one with 1 tile (guestinfo working) and one with 2 tiles (guestinfo failing for DS3DB0).

Any pointers before the weekend?

Regards,

Niyaz

jamesz08
VMware Employee

In the passing 1-tile run, it takes about 10 minutes to complete the guestinfo for ds3db0.

In the failing run, ds3db1 finished after 29 minutes.

Either your storage is too slow to support more than 1 tile or one or more of your hosts are having some issues accessing the storage.  Check vmkernel.log on your hosts for possible storage errors.

 

nimo1983
Contributor

Thanks James. We had issues with storage, which are now sorted. We have now spread out the workload VMs.

The run is still failing. Attached is the log. Is there anything obvious in this one?

Service Requested Stop: ExceptionFound: SVMotion TargetDatastore Exception : Review Log and Configuration : :Target and Source Datastores are the same

jamesz08
VMware Employee

I am seeing errors in the vmotion, svmotion, xvmotion and deploy workloads.

The deploy has a specific error message:

ERROR DeployVM0 Iteration 0 Unable to Create VM : Msg An error occurred while communicating with the remote host.

The other workloads just have a general error.  I suspect there is an intermittent networking error with 1 or more of your hosts on the vmkernel nics.  As a first troubleshooting step I would verify the networking details on the vmkernel configurations.  As this appears to be intermittent make sure there aren't any IP conflicts with other systems on the network.
