VMware Performance Community
Notti
Contributor

workload hangs at "mailserverclient"

Hi all,

I am running a workload only with mailserver, standby, database and fileserver within two tiles.

During several benchmarking attempts I always ran into the following problem:

The benchmark always "hangs" (runs forever) at the job element "mailserverclient" for both tiles.

The output file mailserverclient_0.stdout is empty on both clients.

mailserver_functions.xml says for mailserverclient:

<sequence>
<call function="'debugSTAXUtilLogAndMsg'">
'Info: Request: run mailserver %s' % (cmd)
</call>
<script>
pnMailserverclient = "Tile %u: mailserverclient" % tilenumber
</script>
<process name='pnMailserverclient'>
<location>'%s ' % gCLIENTS[tilenumber]</location>
<command mode="'shell'">cmd</command>
<parms>'-f %s%svmmark.sim -r -x' % (ClientWorkdir, gFsep)</parms>
<workdir>ClientWorkdir</workdir>
<stdout>'C:\mailserverclient_%u.stdout' % tilenumber</stdout>
<stderr mode="'stdout'"/>
<returnstdout/>
</process>
<call-with-map function="'CheckProcess'">
<call-map-arg name="'processId'">"%s" % pnMailserverclient</call-map-arg>
</call-with-map>
<if expr="RC != 0">
<script> runErrors += 1</script>
</if>
</sequence>

What does this part of the script do here? Can I run this stuff manually?

EDIT: I was able to start the LoadSim -f %s vmmark.sim -r -x manually from both clients with domain admin rights.

Could it be an authorization problem between client and mailserver? LoadSim only runs with domain admin rights (Exchange 2003 admin)...

Is there a way to grant these rights automatically, or to add new users on all relevant clients/VMs?

BUT it still does not work with the solution described above. Can you please give some advice on further investigation, or more information for detailed research?

I do not see any more options within the test environment to identify the issue.

Thank you very much for the help and best regards from Germany!!

Notti

29 Replies
psmith2006
VMware Employee

If you've carefully reviewed all the steps for setting up the client and the mailserver in the VMmark Benchmarking Guide, then try following the steps in the troubleshooting section: see "Testing a Single Failing Client-Workload Pair" and, if needed, "Recovering Data After Certain Harness Hangs". You can zip up the Results_<datestamp> directory and mail it to vmmark-info for assistance.

Notti
Contributor

Hello,

Thanks for your reply. I did set up the Mailservers as described in the VMMark Benchmarking Guide.

Testing a single failing client-workload pair produced the same result as running a full tile benchmark: it always hangs at "mailserverclient" and "mailserverstats".

It seems to me that the script is not running correctly at some point for mailserverstats and mailserverclient, but I am not sure about it.

Following "Recovering Data After Certain Harness Hangs" produced the attached results files. Can you please have a look at them? This project is critical for our future ESX 3.0 environment.

Thanks and best regards - Notti

psmith2006
VMware Employee

There's no info in the STAX User log. Could you restart STAF on your clients and restart the STAXMonitor on your prime client, then:

Select New Submit

Select Log Options tab

Enable all the logging options.

Then try this again and send the results.

So far it looks like there is some kind of permissions problem.

Notti
Contributor

I am currently running the test again.

Via RDP I see LoadSim opening on the prime client, and there is a LoadSim error: "Could not access the Active Directory. LoadSim failed to start." The same error is visible on the other client (client1) too.

Are there any requirements regarding authorization of administrator accounts on both domains? Do I have to run the STAXMonitor as a Domain Administrator via RDP from the prime client?

In my opinion this still does not guarantee that the benchmark test runs correctly in the other domain / tile, correct? Do the domains have to trust each other?

Additionally I experienced time drifts of system clocks on some VMs and the client systems. Is there any recommendation from VMWare to handle this issue?

Should I use the DCs for time sync, the ESX-Server (over VMWare-Tools, which does not seem to work) or an external NTP-Server?

Sorry for the confusion... I am currently struggling in this project.

I am very glad that you are here to help me! Thank you very much and best regards - Notti

quikah
Contributor

Sounds like your loadsim clients are having network issues. Is your test system on an isolated network? Have you set the DNS of your clients to the mailserver? You shouldn't have to do any trusts to get it working.

On your client system run the following commands (reference: http://technet2.microsoft.com/windowsserver/en/library/B6879C0B-CFF7-438D-A7F3-0715456DCEFB1033.mspx):

nslookup

set q=srv

_ldap._tcp.dc._msdcs.<domain>

Replace <domain> with the name of your domain. You should get some information about the DC of your test system. If you don't then there is something wrong with the DNS.
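If you want to repeat this check for several domains, a small helper can generate the lines to feed to nslookup. This is only a sketch: it assumes the standard Active Directory SRV name for domain controllers (with leading underscores on the ldap and tcp labels), and "tile0.local" is a made-up domain name, not one from this thread.

```python
def dc_srv_query(domain: str) -> str:
    """Return the SRV record name that locates domain controllers for `domain`."""
    domain = domain.strip().rstrip(".")
    if not domain:
        raise ValueError("domain must not be empty")
    # Standard AD locator record; the underscores are part of the name.
    return f"_ldap._tcp.dc._msdcs.{domain}"


def nslookup_script(domain: str) -> str:
    """Return the lines to type into an interactive nslookup session."""
    return "\n".join(["set q=srv", dc_srv_query(domain), "exit"]) + "\n"


if __name__ == "__main__":
    # "tile0.local" is a hypothetical domain used only for illustration.
    print(nslookup_script("tile0.local"), end="")
```

The printed lines can be pasted into nslookup or piped to its standard input on each client.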

I have found that time sync via tools has always worked for me. The clients sync their time via the AD. Is there a particular set of VMs that always shows the drift?

psmith2006
VMware Employee

As mentioned previously, you want to have the guests sync to the host clock, which can be done through the VMware Tools or by adding tools.syncTime = true to the .vmx files. See the documentation for details.
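To check a batch of .vmx files for this setting, something like the following could work. It is a rough sketch, not part of the VMmark harness: it assumes the usual key = "value" line layout of .vmx files, treats keys and the true/false value as case-insensitive (an assumption), and the function names are invented for this example.

```python
import re

# One setting per line: key = "value" (quotes optional), '#' starts a comment.
_VMX_LINE = re.compile(r'^\s*([^#\s=]+)\s*=\s*"?([^"]*)"?\s*$')


def parse_vmx(text: str) -> dict:
    """Parse .vmx-style text into a dict with lower-cased keys."""
    settings = {}
    for line in text.splitlines():
        match = _VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2).strip()
    return settings


def time_sync_enabled(text: str) -> bool:
    """True if tools.syncTime is present and set to true."""
    return parse_vmx(text).get("tools.synctime", "").lower() == "true"
```

Read each .vmx file's contents and pass the text to time_sync_enabled(); any file that returns False still needs the setting.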

Also be sure to update any Windows clients or VMs with the Microsoft patch KB933360.

http://www.vmware.com/pdf/vmware_timekeeping.pdf

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1318&slice...

Notti
Contributor

Yes I did install the Test environment on an isolated network and I did set up DNS of the clients to the mailserver.

I tried the DNS-Test and after setting up a reverse lookup zone it worked fine for both domains.

At first nslookup says "DNS request timed out. Timeout was 2 seconds" BUT after that it does display information about the domain.

I still experience the same problem with LoadSim, which says that it could not access the Active Directory... It must be some kind of permission problem.

The time sync problem seems to be okay for now. I changed some registry values on the client systems and then it worked fine.

jamesz08
VMware Employee

Are you logged into the client system as the Domain Administrator or the local one? Try logging in as the Domain Administrator and see if you can run loadsim then. Were you able to initialize the mailstore on both clients before running the test?

Try removing the client system from the domain and rejoining it.

Notti
Contributor

I was finally able to run the test with a full STAX User log...

To make it easier to understand: I had to acknowledge the LoadSim error message on both clients. After that VMmark was able to finish the test run.

Please find all created files attached to this reply! More help would be great!

Notti
Contributor

Hi jamesz08,

I am logged in on both clients with the local account. I am able to run LoadSim manually without any problems (logged on with the domain account or within a runas cmd-box) and initialized the mailstores on both clients before the test.

All steps seem to work manually, but if I run it with the script / STAXMon it runs forever and nothing happens...

Additionally, if I run STAXMon as a domain admin on the prime client, STAXMon is not even able to stop or start services on the mailserver VMs.

Using local accounts seems to be the right approach, but there is still a permission problem of some kind with LoadSim.

jamesz08
VMware Employee

You appear to have a couple of different problems, which are only partly related.

1. Make sure that the directory permissions of your vclient directories allow the local Administrator read/write access. This may be the cause of your loadsim problems if the local Administrator doesn't have write access. This can happen if you set up these directories when you were logged in as the Domain Admin.

2. You have CLEANUPFLAG=0 in your VMMARK.config. Please make sure that it is set to 1 or commented out when starting a new test. You should only set it to 0 if you are attempting to retrieve results from a failed run by setting the RESCUE=1 flag. That may or may not help with your loadsim problems.

3. There are 2 components to STAF: the agent, STAFProc, which runs on each client, and STAXmon, which runs on the Prime Client only. STAXmon sends jobs to the STAF agents to complete. Your log file indicates that STAXmon was unable to connect to the STAF agent on client1 (Timed out connecting to endpoint). This can happen if there is a network problem between the 2 systems (either the system cannot be reached or the name used, client1, cannot be resolved) or if STAFProc is not running on the system. From the client0 command line issue the following command:

staf tcp://client1 ping ping

You should get the response PONG.

If you get an error message make sure that client0 can resolve the name client1 into the correct IP address. Have you created a HOSTS file listing each system and IP? Each system should be able to resolve all the others by name (just to keep everything simple, each tile really only needs entries for its client and the prime client). Also make sure that the STAF agent is actually running.

Are you starting STAF using the "Scheduled Tasks" feature on the clients? Make sure that you have unchecked the box which terminates the process after 72 hours (Properties > Settings).
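The HOSTS check described above can be sketched in a few lines. This is only an illustration: the parsing assumes the usual HOSTS layout (IP address, then one or more names, '#' for comments), and the addresses and the function names are made up for the example.

```python
def parse_hosts(text: str) -> dict:
    """Map each lower-cased host name in HOSTS-format text to its IP address."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) < 2:
            continue  # blank line, comment, or an IP without a name
        ip, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name.lower(), ip)  # the first entry for a name wins
    return table


def missing_hosts(text: str, required: list) -> list:
    """Return the required names that have no entry in the HOSTS text."""
    table = parse_hosts(text)
    return [name for name in required if name.lower() not in table]


if __name__ == "__main__":
    # Hypothetical addresses; substitute the real ones from your test network.
    sample = "192.168.1.10 client0\n192.168.1.11 client1  # tile 1 client\n"
    print(missing_hosts(sample, ["client0", "client1", "mailserver1"]))
```

Run it against the contents of the HOSTS file on each system, passing the names that system needs to resolve. An empty list means every name has an entry, though it does not prove the IP addresses are correct.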

Notti
Contributor

Thanks for your reply,

As you correctly suspected, the vclient directories did not have full read/write access for the local administrators.

Additionally I have changed the CLEANUPFLAG to 1 and started a new workload only with the mailserver VMs.

I can run a full workload with standby, FileServer and Database so there should be no problem with STAF / STAX.

I've checked it on every client and the scheduled task is configured correctly. My HOSTS files are also correct, because every system can reach every other one via IP or hostname.

The workload described above seems to run fine, but unfortunately Score_N_Tile_test.out is missing from the Results_xxx folder. Any idea why?

When I try to extract an HTML-formatted table of results and ratios, the prime client reports "Could not resolve start-time. Time on clients not synchronized." Maybe that is because of the missing .out file?

All VMs are syncing against the ESX-Server and the physical clients are syncing via AD to their mailservers.

jamesz08
VMware Employee

The scoring script will only work when all the workloads have completed successfully. You cannot generate a score from a partial tile.

Notti
Contributor

LoadSim returns the same error as described before. To me it looks like the script is trying to run LoadSim as the local user that is also running STAXMon.

This user does not have the rights to access the AD on both systems, and there is no user account with rights in both domains within the test scenario.

Which user did you use to run STAXMon? Running LoadSim requires a Domain Admin, that's for sure...

jamesz08
VMware Employee

Check the membership of the Administrators group on your local system. It should include "Domain Admins" for the mailserver domain as a member. That usually gets added when you join the domain. Each client should be set up this way.

I run all my tests using the local administrator without problems.

Notti
Contributor

Good morning everybody,

LoadSim works now under the following circumstances:

I am logged on to the clients with their respective Domain Admin accounts. On the prime client, STAXMon is started via RunAs... with the local user account of client0.

So running this test seems to work now. Thanks to everybody for helping me out with these problems.

The time sync problem seems to be okay for now. Everything is running smoothly.

Additionally, I would like to generate a report with only MailServer, Database, Standby and FileServer.

As jamesz08 said before, is it really not possible to generate a report without JavaServer and WebServer?

For financial and practical reasons, the scope of this project does not include running these VMs on a tile for our test.

I am fully aware that such a report would not represent a full benchmark test as recommended by VMware, but I would still like to produce the report for this project.

Do you have any idea what I need to do to generate these reports without the Java and Web servers? Any help would be appreciated.

psmith2006
VMware Employee

Glad to hear you were able to get your mailserver working. Do note that any results that are generated without all six workloads will not be comparable to published VMmark results and can't be represented as VMmark results. The VMmark harness supports running partial tiles for debugging purposes. The source for the script that post-processes the results files for scoring is of course included in the harness.

Notti
Contributor

Yeah, it works well now. Thanks to you guys! I am fully aware of the fact that the results will not be comparable to other servers or to published VMmark benchmarks.

I would still like to extract a workload score from Mailserver, Database Server, Fileserver and Standby Server.

So do I understand correctly that the script for post-processing result files is able to handle just a partial tile? Score_N_Tile_test.out is not created after a benchmark test.

If it is only possible to create result files with a full tile, do you think there is a way to customize the script so that it can work with only a partial tile?

When running tilescore2html.sh the result is always: Could not resolve start-time. Time on clients not synchronized. How do I get this up and running? Is the missing .out file the reason the script fails?

My clients and VMs seem to be synced within an appropriate timeframe.

Thanks again!!!

psmith2006
VMware Employee

The script is failing because it's expecting data for all the 5 workloads. Anything is possible - it's just a matter of software.
