JimPhreak
Contributor

vsanTraceReader process?

Can anyone explain what this process is and why it is consuming so much CPU?  It's running like this on two of the three nodes in my vSAN cluster, which have no VMs running on the vSAN datastore at the moment.

[attached screenshot: vsanTraceReader.jpg]

8 Replies
zdickinson
Expert

This is from the Essential vSAN Guide:  Link

Maybe you can run the command mentioned at the end of the paragraph to view the logs.  Thank you, Zach.

JimPhreak
Contributor

Maybe I'm entering the command wrong but this is what I'm getting when I try to enter the command:

[root@localhost:/vmfs/volumes/b9e102de-807c1230/esxi01/vsantraces] zcat vsanlog.gz | /usr/lib/vmware/vsan/bin/vsanTraceReader.py > vsanlog.txt

zcat: vsanlog.gz: No such file or directory

No wallclock entry detected. There might be a mismatch between the reader version and the tracefile.

zdickinson
Expert

zcat vsanlog.gz | /usr/lib/vmware/vsan/bin/vsanTraceReader.py > vsanlog.txt

I think you have the file name incorrect.  When I do an ls of /vsantraces, I get something like:

vsantraces--2016-05-01T03h04m32s385.gz

vsantraces--2016-05-01T09h52m55s956.gz

vsantraces--2016-05-01T17h58m30s717.gz

vsantraces--2016-05-02T01h33m18s481.gz

vsantraces--2016-05-02T10h11m50s871.gz

vsantraces--2016-05-02T19h33m12s753.gz

vsantraces--2016-05-03T05h14m30s680.gz

vsantraces--2016-05-03T14h10m49s522.gz

vsantraces.index

vsantracesUrgent--2016-03-14T22h30m35s502.gz

vsantracesUrgent--2016-03-28T15h34m12s728.gz

vsantracesUrgent--2016-04-12T12h11m17s552.gz

vsantracesUrgent--2016-04-28T19h09m36s639.gz

vsantracesUrgent.index

So my command was:  zcat vsantraces--2016-05-03T14h10m49s522.gz | /usr/lib/vmware/vsan/bin/vsanTraceReader.py > test.txt

It took a while, and the file is ~366 MB.  Here is a sample:

2016-05-03T18:15:57.762856 [197240107] [cpu26] [] DOMTraceVsanServerDispatchStats:2190: {'index': 2, 'avgQdepth': 1, 'maxExecutionTimeUs': 44, 'maxLatencyUs': 68, 'numActivations': 313, 'avgLatencyUs': 14, 'numOpInstantiations': 108, 'avgExecutionTimeUs': 16, 'role': 'DOM_ROLE_OWNER', 'maxQdepth': 12}

2016-05-03T18:15:57.762856 [197240108] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'write', 'index': 2, 'maxExecutionTimeUs': 44, 'maxLatencyUs': 68, 'numActivations': 207, 'avgLatencyUs': 28, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 11}

2016-05-03T18:15:57.762857 [197240109] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'read', 'index': 2, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 24, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}

2016-05-03T18:15:57.762857 [197240110] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'gwe', 'index': 2, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 0, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}

2016-05-03T18:15:57.762857 [197240111] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'other obj op', 'index': 2, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 63, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}

2016-05-03T18:15:57.762857 [197240112] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'cmmds', 'index': 2, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 0, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}

2016-05-03T18:15:57.762858 [197240113] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'resync', 'index': 2, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 0, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}

2016-05-03T18:15:57.762858 [197240114] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'other base op', 'index': 2, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 19, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}

2016-05-03T18:15:57.762914 [197240115] [cpu26] [] DOMTraceVsanServerDispatchStats:2190: {'index': 1, 'avgQdepth': 1, 'maxExecutionTimeUs': 26, 'maxLatencyUs': 23, 'numActivations': 241, 'avgLatencyUs': 4, 'numOpInstantiations': 72, 'avgExecutionTimeUs': 5, 'role': 'DOM_ROLE_OWNER', 'maxQdepth': 6}

2016-05-03T18:15:57.762914 [197240116] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'write', 'index': 1, 'maxExecutionTimeUs': 26, 'maxLatencyUs': 23, 'numActivations': 114, 'avgLatencyUs': 7, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 4}

2016-05-03T18:15:57.762914 [197240117] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'read', 'index': 1, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 18, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}

2016-05-03T18:15:57.762914 [197240118] [cpu26] [] DOMTraceDomServerDispatchStats:2147: {'statsType': 'gwe', 'index': 1, 'maxExecutionTimeUs': 0, 'maxLatencyUs': 0, 'numActivations': 0, 'avgLatencyUs': 0, 'role': 'DOM_ROLE_OWNER', 'avgExecutionTimeUs': 0}
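Since the decoded file is that large, a small script can help summarize it instead of reading it by hand. Here's a sketch that parses lines like the ones above into Python dicts; the line layout (timestamp, [sequence], [cpu], [], Event:line:, then a Python-literal dict) is inferred from this sample output, not from any official format spec, so adjust the regex if your traces differ:

```python
# Sketch: parse decoded vSAN trace lines into structured records.
# The layout is guessed from the sample output in this thread.
import ast
import re
from collections import Counter

TRACE_RE = re.compile(
    r"^(?P<ts>\S+)\s+\[(?P<seq>\d+)\]\s+\[(?P<cpu>\w+)\]\s+\[[^\]]*\]\s+"
    r"(?P<event>\w+):(?P<src>\d+):\s+(?P<payload>\{.*\})\s*$"
)

def parse_trace_line(line):
    """Return a dict for one decoded trace line, or None if it doesn't match."""
    m = TRACE_RE.match(line)
    if not m:
        return None
    rec = m.groupdict()
    # The payload prints as a Python dict literal, so literal_eval can load it.
    rec["payload"] = ast.literal_eval(rec["payload"])
    return rec

def event_counts(lines):
    """Count how many trace lines each event type contributes."""
    return Counter(r["event"] for r in map(parse_trace_line, lines) if r)

# One of the lines from the sample above, abbreviated:
sample = ("2016-05-03T18:15:57.762856 [197240108] [cpu26] [] "
          "DOMTraceDomServerDispatchStats:2147: {'statsType': 'write', "
          "'index': 2, 'maxLatencyUs': 68, 'role': 'DOM_ROLE_OWNER'}")
rec = parse_trace_line(sample)
print(rec["event"], rec["payload"]["statsType"], rec["payload"]["maxLatencyUs"])
```

Running `event_counts()` over the whole decoded file would show which event types dominate the 366 MB.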

Thank you, Zach.

JimPhreak
Contributor

Got it.  Attached is the latest gz file.

zdickinson
Expert

Nothing jumps out at me, other than that it seems to be logging every read/write.  That might be normal.  It doesn't look like you can set a logging level for vSAN traces.  There are these two commands:

esxcli vsan trace get

esxcli vsan trace set --reset

I don't think they help, though; they only control where the traces are stored, how often to roll them, etc...

I think you're at the end of my help.  Thank you, Zach.

JimPhreak
Contributor

Is there anywhere else you could recommend I pose this question?  I only have a VMUG Advantage license, so I don't get support with it.

zdickinson
Expert

Sorry, not really.  Hope Duncan or Cormac comment 🙂

IVOXY
Enthusiast

Believe it or not, running "esxcli vsan trace set --reset true" fixed my issue. I too was seeing most of my CORE UTIL % eaten up on my 3 NUC lab hosts; after running that command and re-checking esxtop, vsanTraceReader disappeared from the top CPU consumers. Consumed CPU in the H5 client went from a 61-76% average down to ~16% on each of my hosts.
