nimo1983
Contributor

Failure with 4 tiles

Hello experts,

Happy New Year.

Starting a new thread with a new setup.

VMmark is set up as per the guidance.

Setup:

4 hosts 

 
  • Hypervisor: VMware ESXi, 6.7.0, 18828794
  • Processor Type: Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz
  • Logical Processors: 56

The clients are running on a separate cluster, and the prime client is running with 2 IPs. Connectivity to all VMs is in place. I triggered a 4-tile run and it reported as completed (although the VMmark run tiles stage alone ran for more than 3h 44 mins) and eventually threw the message below.

Note: I also noticed that CPU utilization was lower than expected in this run. For this number of tiles it should have been 188, but it stayed below 90 GHz the whole time.

Could not Run the following 12 Wklds: [
'DS3WebA Tile3 failed to start/complete process',
'DS3WebA Tile2 failed to start/complete process',
'DS3WebA Tile1 failed to start/complete process',
'DS3WebC Tile1 failed to start/complete process',
'DS3WebB Tile1 failed to start/complete process',
'DS3WebC Tile0 failed to start/complete process',
'DS3WebB Tile0 failed to start/complete process',
'DS3WebC Tile2 failed to start/complete process',
'DS3WebC Tile3 failed to start/complete process',
'DS3WebB Tile3 failed to start/complete process',
'DS3WebB Tile2 failed to start/complete process',
'deploy Failed Run Phase']
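To see at a glance which workloads and tiles are affected, the message can be grouped by workload name. The following is only a rough sketch: the function name `parse_failed_workloads` is made up for illustration and is not part of VMmark, and it assumes the bracketed part of the message is a Python-style list of strings, as it appears above.

```python
import ast
import re
from collections import defaultdict

def parse_failed_workloads(message):
    """Return {workload: sorted tile numbers} from a 'Could not Run' message."""
    # Assumption: everything from the first '[' onward is a list literal of strings.
    entries = ast.literal_eval(message[message.index("["):])
    tiles_by_workload = defaultdict(list)
    for entry in entries:
        # Entries such as 'deploy Failed Run Phase' carry no tile and are skipped.
        match = re.match(r"(\w+) Tile(\d+) failed", entry)
        if match:
            tiles_by_workload[match.group(1)].append(int(match.group(2)))
    return {name: sorted(tiles) for name, tiles in tiles_by_workload.items()}

# The message from the 4-tile run, with the line wrapping undone.
message = (
    "Could not Run the following 12 Wklds: ["
    "'DS3WebA Tile3 failed to start/complete process', "
    "'DS3WebA Tile2 failed to start/complete process', "
    "'DS3WebA Tile1 failed to start/complete process', "
    "'DS3WebC Tile1 failed to start/complete process', "
    "'DS3WebB Tile1 failed to start/complete process', "
    "'DS3WebC Tile0 failed to start/complete process', "
    "'DS3WebB Tile0 failed to start/complete process', "
    "'DS3WebC Tile2 failed to start/complete process', "
    "'DS3WebC Tile3 failed to start/complete process', "
    "'DS3WebB Tile3 failed to start/complete process', "
    "'DS3WebB Tile2 failed to start/complete process', "
    "'deploy Failed Run Phase']"
)

for workload, tiles in parse_failed_workloads(message).items():
    print(workload, tiles)
```

Grouped this way, every failed entry is a DS3Web workload: DS3WebB and DS3WebC failed on all four tiles, and DS3WebA on tiles 1 to 3.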

The results folder is attached herewith (the attachment is split due to the size restriction).

Looking for inputs.

Regards,

Niyaz

5 Replies
jamesz08
VMware Employee

There are several issues I see.

Major issue:

On sf-esx02, sf-esx03, and sf-esx04, the vmnic0 MTU is the default of 1500, while sf-esx01 and the 2 client hosts have the MTU set to 9000. This is likely causing network issues, and I suspect it is the cause of many of the failures.

Please correct this so they all have identical MTU.
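A quick way to confirm the fix is to collect the MTU from each host and group the hosts by value. The snippet below is a minimal sketch of that comparison only. It does not query the hosts, `find_mtu_mismatches` is a made-up helper name, and the two client host names are placeholders, since their real names are not given here.

```python
from collections import defaultdict

def find_mtu_mismatches(mtu_by_host):
    """Return {} if every host has the same MTU, else {mtu: [hosts]}."""
    hosts_by_mtu = defaultdict(list)
    for host, mtu in mtu_by_host.items():
        hosts_by_mtu[mtu].append(host)
    if len(hosts_by_mtu) <= 1:
        return {}
    return {mtu: sorted(hosts) for mtu, hosts in hosts_by_mtu.items()}

# Values as observed in the attached results; the client host names are hypothetical.
observed = {
    "sf-esx01": 9000,
    "sf-esx02": 1500,
    "sf-esx03": 1500,
    "sf-esx04": 1500,
    "client-host-1": 9000,
    "client-host-2": 9000,
}

for mtu, hosts in find_mtu_mismatches(observed).items():
    print(f"MTU {mtu}: {', '.join(hosts)}")
```

Once all hosts are set to the same value, the function returns an empty dict and nothing is printed.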

 

Performance issue:

sf-esxi01 and sf-esxi04 each seem to have a bad DIMM. This shouldn't cause any functional problems, since the bad DIMMs are not being used (these 2 systems show lower total memory), but it may impact performance.

 

Minor config problem:

Standby VMs have timesync disabled.  This will not cause a problem, but as a best practice all VMs should have it enabled.  This is a requirement if you plan to publish any results.

 

nimo1983
Contributor

Thanks James.

MTU is set on all hosts and time is synced for standby* VMs.

I ran again, this time with 3 hosts and 4 tiles, and it still fails with the same error.

Could not Run the following 8 Wklds: [
'DS3WebA Tile0 failed to start/complete process',
'DS3WebA Tile3 failed to start/complete process',
'DS3WebC Tile0 failed to start/complete process',
'DS3WebB Tile0 failed to start/complete process',
'DS3WebC Tile1 failed to start/complete process',
'DS3WebB Tile3 failed to start/complete process',
'DS3WebC Tile3 failed to start/complete process',
'deploy Failed Run Phase']

Checking the wrf files, I see a large number of failures:

Error during browse for reviews returned by web server: The request timed out
Thread 15: Error in Login for User newuser48069702, failure 1, retrying
Error during browse for reviews returned by web server: The request timed out
Thread 18: Error in Login for User newuser166974341, failure 1, retrying
Error during browse for reviews returned by web server: The request timed out
Thread 21: Error in Login for User user23566746, failure 1, retrying
Error during browse for reviews returned by web server: The request timed out
Thread 17: Error in Login for User user133715143, failure 1, retrying
Error during browse for reviews returned by web server: The request timed out
Thread 13: Error in Login for User user152089543, failure 1, retrying
Error during browse for reviews returned by web server: The request timed out
Thread 4: Error in Login for User newuser185687334, failure 1, retrying
Thread 0: exiting
Thread 11: error in parsing response from browse reviews request
Thread 11: Error in Login for User user51878161, failure 2, retrying

Unhandled Exception:
System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.NetworkStream'.
at System.Net.WebConnection.BeginRead (System.Net.HttpWebRequest request, System.Byte[] buffer, System.Int32 offset, System.Int32 size, System.AsyncCallback cb, System.Object state) [0x0002e] in <59be416de143456b88b9988284f43350>:0
at System.Net.WebConnectionStream.BeginRead (System.Byte[] buffer, System.Int32 offset, System.Int32 size, System.AsyncCallback cb, System.Object state) [0x001b2] in <59be416de143456b88b9988284f43350>:0
at System.Net.WebConnectionStream.Read (System.Byte[] buffer, System.Int32 offset, System.Int32 size) [0x00007] in <59be416de143456b88b9988284f43350>:0
at System.IO.StreamReader.ReadBuffer (System.Char[] userBuffer, System.Int32 userOffset, System.Int32 desiredChars, System.Boolean& readToUserBuffer) [0x0003c] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.IO.StreamReader.Read (System.Char[] buffer, System.Int32 index, System.Int32 count) [0x0009d] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at ds2xdriver.ds2Interface.ds2browsereview (System.String browse_review_type_in, System.String get_review_category_in, System.String get_review_actor_in, System.String get_review_title_in, System.Int32 batch_size_in, System.Int32 customerid_out, System.Int32& rows_returned, System.Int32[]& prod_id_out, System.String[]& title_out, System.String[]& actor_out, System.Int32[]& review_id_out, System.String[]& review_date_out, System.Int32[]& review_stars_out, System.Int32[]& review_customerid_out, System.String[]& review_summary_out, System.String[]& review_text_out, System.Int32[]& review_helpfulness_sum_out, System.Double& rt) [0x00139] in <d329ccf9df0d44c493533f20c2907579>:0
at ds2xdriver.User.Emulate () [0x00abb] in <d329ccf9df0d44c493533f20c2907579>:0
at System.Threading.ThreadHelper.ThreadStart_Context (System.Object state) [0x00017] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x0008d] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x00000] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state) [0x00031] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ThreadHelper.ThreadStart () [0x0000b] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
[ERROR] FATAL UNHANDLED EXCEPTION: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.NetworkStream'.
at System.Net.WebConnection.BeginRead (System.Net.HttpWebRequest request, System.Byte[] buffer, System.Int32 offset, System.Int32 size, System.AsyncCallback cb, System.Object state) [0x0002e] in <59be416de143456b88b9988284f43350>:0
at System.Net.WebConnectionStream.BeginRead (System.Byte[] buffer, System.Int32 offset, System.Int32 size, System.AsyncCallback cb, System.Object state) [0x001b2] in <59be416de143456b88b9988284f43350>:0
at System.Net.WebConnectionStream.Read (System.Byte[] buffer, System.Int32 offset, System.Int32 size) [0x00007] in <59be416de143456b88b9988284f43350>:0
at System.IO.StreamReader.ReadBuffer (System.Char[] userBuffer, System.Int32 userOffset, System.Int32 desiredChars, System.Boolean& readToUserBuffer) [0x0003c] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.IO.StreamReader.Read (System.Char[] buffer, System.Int32 index, System.Int32 count) [0x0009d] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at ds2xdriver.ds2Interface.ds2browsereview (System.String browse_review_type_in, System.String get_review_category_in, System.String get_review_actor_in, System.String get_review_title_in, System.Int32 batch_size_in, System.Int32 customerid_out, System.Int32& rows_returned, System.Int32[]& prod_id_out, System.String[]& title_out, System.String[]& actor_out, System.Int32[]& review_id_out, System.String[]& review_date_out, System.Int32[]& review_stars_out, System.Int32[]& review_customerid_out, System.String[]& review_summary_out, System.String[]& review_text_out, System.Int32[]& review_helpfulness_sum_out, System.Double& rt) [0x00139] in <d329ccf9df0d44c493533f20c2907579>:0
at ds2xdriver.User.Emulate () [0x00abb] in <d329ccf9df0d44c493533f20c2907579>:0
at System.Threading.ThreadHelper.ThreadStart_Context (System.Object state) [0x00017] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x0008d] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state, System.Boolean preserveSyncCtx) [0x00000] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, System.Object state) [0x00031] in <dca3b561b8ad4f9fb10141d81b39ff45>:0
at System.Threading.ThreadHelper.ThreadStart () [0x0000b] in <dca3b561b8ad4f9fb10141d81b39ff45>:0

DVDstore_End_Iteration_Number 0
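To get a feel for which kind of failure dominates in a wrf file, the error lines can be tallied by type. This is only a sketch: the three patterns are taken from the lines pasted above, the labels and the `tally_errors` name are my own, and a real wrf file may contain other messages that this does not classify.

```python
import re
from collections import Counter

# Patterns copied from the log excerpt; the labels are illustrative only.
PATTERNS = {
    "request_timeout": re.compile(r"The request timed out"),
    "login_retry": re.compile(r"Error in Login for User"),
    "parse_error": re.compile(r"error in parsing response"),
}

def tally_errors(lines):
    """Count log lines per error type; unmatched lines are ignored."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                break  # classify each line at most once
    return counts

# A few lines from the excerpt above.
sample_lines = [
    "Error during browse for reviews returned by web server: The request timed out",
    "Thread 15: Error in Login for User newuser48069702, failure 1, retrying",
    "Thread 0: exiting",
    "Thread 11: error in parsing response from browse reviews request",
    "Thread 11: Error in Login for User user51878161, failure 2, retrying",
]

for label, count in tally_errors(sample_lines).items():
    print(label, count)
```

To run it against a whole file, pass the open file object instead of `sample_lines`, since iterating over a file yields its lines.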

Any tip on how to troubleshoot this issue further and get a successful run? Or could this be a problem with the environment? Your help is greatly appreciated.

Regards,

Niyaz

jamesz08
VMware Employee

Can you run 1 tile?  Possibly your storage is being overwhelmed.  

nimo1983
Contributor

Thanks James.

Ran it with 1 tile and it did work.

I then redeployed tiles 1 & 2 and reran. This hit a similar issue, again with a MiscInfo timeout (I have updated another thread about a similar MiscInfo timeout on a different setup).

It gets stuck at collecting info for ds3.

/root/ds3/:

---------------------------------------------------------------------------------------------------------------
/ds3/:

==============================================================================

In both cases, I used the database backup and restore script. Is there any issue with restoring the DB from a snapshot?

Regards,

Niyaz

 

jamesz08
VMware Employee

Snapshots should work; however, they are not a compliant configuration when it comes to publishing results.

I suggest collecting esxtop data during your runs and looking at the storage latency. It seems that your storage may be too slow to accommodate multiple tiles.
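If you capture esxtop in batch mode (`esxtop -b`, which writes CSV), a small script can pull out the peak value of each latency counter. The following is a rough sketch rather than a finished tool. It assumes the latency columns contain the text `MilliSec/Command` in their header, which you should check against your own capture. The sample rows and the adapter name `vmhba0` are synthetic, and `max_latency_ms` is a made-up helper name.

```python
import csv
import io

def max_latency_ms(csv_text, needle="MilliSec/Command"):
    """Return {column name: max value} for columns whose header contains needle."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    peaks = {}
    for idx, name in enumerate(header):
        if needle not in name:
            continue
        # Skip short rows and empty cells rather than failing on them.
        values = [float(row[idx]) for row in data if idx < len(row) and row[idx]]
        if values:
            peaks[name] = max(values)
    return peaks

# Synthetic two-sample capture, for illustration only (not real measurements).
sample = "\n".join([
    r'"(PDH-CSV 4.0) (UTC)(0)",'
    r'"\\sf-esx01\Physical Disk Adapter(vmhba0)\Average Guest MilliSec/Command",'
    r'"\\sf-esx01\Physical Cpu(_Total)\% Util Time"',
    '"01/01/2022 00:00:00","2.5","40.0"',
    '"01/01/2022 00:00:05","31.0","42.0"',
])

for column, peak in max_latency_ms(sample).items():
    print(f"{peak:8.1f} ms  {column}")
```

Comparing the peaks from the 1-tile run that worked with those from a multi-tile run should show whether latency climbs as tiles are added.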
