I can't seem to run the fileserver workload with any other workload without it failing on startup for a single tile. I'm running one tile and all other workloads work fine for any duration of execution. However, when you add fileserver into the mix, it fails on startup leaving the tbench_srv processes running on the client. Any help would be greatly appreciated...
Any errors come back from the STAX Monitor?
Just the following:
20070812-20:36:41 Info Info: Process FileServer on fileserver0 = shell /home/fileserver/dbench/src/dbench -c /home/fileserver/dbench/src/client_plain.txt -p 1066 -l 1000 45 client0
20070812-20:36:44 Info Process: Tile 0: FileServer failed to start/complete. Returned: RC = 1, STAFResult = None
However, I can run it independently with no issues.
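For anyone wanting to scan a longer harness log for this kind of failure, here is a minimal Python sketch. It assumes each log entry sits on one unwrapped line and follows exactly the format of the failure message quoted above; the function name `find_failures` and the pattern are illustrative and not part of VMmark or STAF.

```python
import re

# Assumed line format, taken from the failure message quoted in this thread:
# 20070812-20:36:44 Info Process: Tile 0: FileServer failed to start/complete. Returned: RC = 1, STAFResult = None
FAIL_RE = re.compile(
    r'^(?P<stamp>\d{8}-\d{2}:\d{2}:\d{2})\s+Info\s+Process:\s+'
    r'Tile (?P<tile>\d+): (?P<workload>\w+) failed to start/complete\.\s+'
    r'Returned: RC = (?P<rc>-?\d+), STAFResult = (?P<result>.*)$'
)


def find_failures(log_text):
    """Return one dict per 'failed to start/complete' line in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = FAIL_RE.match(line.strip())
        if match:
            failures.append({
                "timestamp": match.group("stamp"),
                "tile": int(match.group("tile")),
                "workload": match.group("workload"),
                "rc": int(match.group("rc")),
                "staf_result": match.group("result"),
            })
    return failures


SAMPLE = (
    "20070812-20:36:44 Info Process: Tile 0: FileServer failed to "
    "start/complete. Returned: RC = 1, STAFResult = None"
)

if __name__ == "__main__":
    for failure in find_failures(SAMPLE):
        print(failure)
```

This only reports which workload failed and with what return code; it does not explain the cause of RC = 1.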
Might need to wait for the experts. I've not seen that error, and it's not in the troubleshooting info either. I'd maybe go back and check the fundamentals: networking and host files between your client and VMs, and double-check your vmmark.config and staf.cfg files...
Since you can run the fileserver as a standalone workload via the harness, I'll have to assume that the network configuration and harness setup are probably OK. I would try adding workloads alongside the fileserver until I find one that causes the error. Try adding them in the following order (one at a time): standby, java, mail, database, and finally web. (You could likely combine standby and java.) Try a short (~15 minute) run after adding each workload. I have a few other questions:
\- What type of system are you running on? Is it maxed out on CPU with a full tile?
\- Are you running on a private network to the client? If not, SPECweb generates a fair bit of traffic and might be choking the network before fileserver starts.
\- Have you tried tuning the workload delay parameters to start fileserver up first?
I'll try the suggestions. Actually, I have been doing much of that but in "larger quantities". I know the webserver and fileserver "don't like each other" and I know that everything excluding webserver doesn't make fileserver happy either. However, I'll try and be more methodical in the testing as you suggested and report back.
As for your questions:
- I'm running all the Tile 1 workload VMs on a Dell 2950 and nothing else, so resources should not be an issue. The client is all alone on a 2950 as well.
- The network between the two is private gigabit.
- I have not tried tuning the workload delays so fileserver gets a "head start", but will do so.
Thanks
Makes me think something is duplicated in the staf.cfg - perhaps Machinenickname is incorrect?
Two more questions:
\- How much memory is on the ESX box?
\- Are you running the client natively or in a VM?
Yes, I would be interested to know the resource specification of the host too. Ironically, I got the same error today on webserver0. I know the whole tile is configured correctly, as it's been used successfully to run VMmark on new hosts.
Out of interest, I ran a single tile on an old DL380G4 with not enough RAM, and I got this error. So I think it's a response time / timeout thing, as that VM is so sluggish it takes 10 minutes for the logon box to appear!
Checked all the STAF.cfg(s) and all looks well. No duplications for MACHINENICKNAME.
32 GB on both the workload ESX host and the client box. And the "alone" comment from my previous post refers to running the client on bare metal with no other services offered or running on the box. You're not allowed to run clients inside a VM for a compliant run, right?
I'd be surprised if there are any physical resource constraints causing this issue. But I do tend to get surprised more often than I'd like...;-)
Problem solved. It was an order-of-execution issue. Once I changed the fileserver delay parameter to allow it to start first and reordered the other workloads appropriately, everything works fine!!! Now the question is, does changing the order of execution "invalidate" the results? I would not think so, but I'd like to be sure. Thanks for the suggestions.
Make sure that you have set the registry parameters on the client
as indicated on page 55 of the benchmarking guide. This will make
sure that the workload processes can get the port numbers they need.
Changing the /DELAYTIME values is OK.
What can cause the RC=1 error? I've tried the suggested workaround. For example, the following causes the error to occur on the fileserver:
MailServer/DELAYTIME="250"
JavaServer/DELAYTIME="540"
Standby/DELAYTIME="170"
WebServer/DELAYTIME="5"
Database/DELAYTIME="360"
FileServer/DELAYTIME="180"
and the following causes it on the webserver:
MailServer/DELAYTIME="250"
JavaServer/DELAYTIME="540"
Standby/DELAYTIME="170"
WebServer/DELAYTIME="180"
Database/DELAYTIME="360"
FileServer/DELAYTIME="5"
In my case I feel it's just that the host is too slow (a 2-year-old single core with slightly overcommitted RAM; a non-compliant test, I know!) because the exact same tiles work fine on our over-specced 585s.
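To compare settings like the two lists above, a small Python sketch can sort the workloads by their DELAYTIME value and show the gap between consecutive starts. It assumes the `Name/DELAYTIME="N"` line format shown in this thread and that a smaller value means an earlier start (as implied by the earlier "start first" fix); the helper names `start_order` and `start_gaps` are made up for illustration.

```python
import re

# Matches lines such as: FileServer/DELAYTIME="180"
DELAY_RE = re.compile(r'^\s*(\w+)/DELAYTIME\s*=\s*"(\d+)"\s*$')


def start_order(config_text):
    """Return (workload, delay) pairs sorted by ascending delay.

    Assumption: a smaller DELAYTIME means the workload starts earlier.
    """
    delays = []
    for line in config_text.splitlines():
        match = DELAY_RE.match(line)
        if match:
            delays.append((match.group(1), int(match.group(2))))
    return sorted(delays, key=lambda pair: pair[1])


def start_gaps(order):
    """Return (previous, current, gap) for each pair of consecutive starts."""
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(order, order[1:])]


# The first set of values posted above (error seen on the fileserver).
FIRST_CONFIG = '''
MailServer/DELAYTIME="250"
JavaServer/DELAYTIME="540"
Standby/DELAYTIME="170"
WebServer/DELAYTIME="5"
Database/DELAYTIME="360"
FileServer/DELAYTIME="180"
'''

if __name__ == "__main__":
    order = start_order(FIRST_CONFIG)
    for name, delay in order:
        print(f"{delay:>4}  {name}")
    for prev, cur, gap in start_gaps(order):
        print(f"{prev} -> {cur}: gap {gap}")
```

For the first list this gives the order WebServer, Standby, FileServer, MailServer, Database, JavaServer. In both posted lists the workload that fails is the one set to 180, which comes 10 after Standby at 170; whether that spacing is related to the RC=1 error is not established in this thread.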