Re: Java Performance on VMware ESX

haroldr · ‎05-26-2009

As a performance engineer at VMware, I have done a lot of testing with Java applications on VMware ESX. I have uniformly found the performance and scaling of Java applications to be excellent, with no special tuning required. As a start at demonstrating this, I recently published the results of some SPECjvm2008 experiments in VROOM!, the VMware Performance Engineering team's blog. While SPECjvm2008 focuses only on core Java performance, and can't be used to demonstrate multi-VM scaling, the results do show that there is nothing inherent in Java itself that would lead to poor performance when virtualized.

On a few occasions, I have worked with customers who were experiencing performance issues with their Java deployments on ESX. In all of these cases, the root cause turned out to be a configuration issue in their environment, and not really related to Java itself. In most cases, the issues were related to memory overcommitment. One of my colleagues has written an excellent document, Java in Virtual Machines on VMware ESX: Best Practices (http://www.vmware.com/resources/techresources/1087), which provides guidance on avoiding these and other common issues when deploying Java applications on ESX.

I would like to use this thread for a discussion of questions or issues related to Java performance on VMware ESX 3.5 or vSphere 4.0. Post your comments, questions, or experiences, and I'll do my best to respond.

Hal

tcutts · ‎06-17-2009

I'm glad you've been having success... I hope some of it will rub off here. I've got a couple of different VMs on ESX 3.5 which are trying to run Java applications of one sort or another, and we've been seeing serious performance issues; the application just seems to stop.

Guest OS: Debian GNU/Linux 5.0 (Linux kernel 2.6.29)

Guest RAM: 4G or 8G, depending on the application

Physical server: HP BL680c, 4x quad-core Xeon, 64GB RAM.

Storage: HP EVA8100

Application 1: Lucene web search engine. The indexer process just seems to grind to a halt. The machine is not doing any significant I/O, or using any CPU. Running strace on the Java indexer process reveals a lot of activity, almost all of which is the futex() system call, and the occasional sendto():

futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245763, 884153}, NULL) = 0
gettimeofday({1245245763, 884203}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245763, 884239046}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49963954}) = -1 ETIMEDOUT (Connection timed out)
futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245763, 935999}, NULL) = 0
gettimeofday({1245245763, 936041}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245763, 936076493}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49964507}) = -1 ETIMEDOUT (Connection timed out)
futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245763, 987932}, NULL) = 0
gettimeofday({1245245763, 987973}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245763, 988008086}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49964914}) = -1 ETIMEDOUT (Connection timed out)
futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245764, 39930}, NULL) = 0
gettimeofday({1245245764, 39971}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245764, 40006447}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49964553}) = -1 ETIMEDOUT (Connection timed out)
futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245764, 90476}, NULL) = 0
gettimeofday({1245245764, 90522}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245764, 90557966}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49964034}) = -1 ETIMEDOUT (Connection timed out)
futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245764, 140834}, NULL) = 0
gettimeofday({1245245764, 140876}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245764, 140911970}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49964030}) = -1 ETIMEDOUT (Connection timed out)
futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245764, 191503}, NULL) = 0
gettimeofday({1245245764, 191543}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245764, 191579147}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49963853}) = -1 ETIMEDOUT (Connection timed out)
futex(0x41695e58, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245764, 242319}, NULL) = 0
gettimeofday({1245245764, 242360}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245764, 242395507}) = 0
futex(0x7f63fdb30114, FUTEX_WAIT_PRIVATE, 1, {0, 49964493} <unfinished ...>
<... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
futex(0x41c40be8, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {88335, 123965121}) = 0
futex(0x7f63fd9e6e64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fd9e6e60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
clock_gettime(CLOCK_MONOTONIC, {88335, 124311258}) = 0
clock_gettime(CLOCK_MONOTONIC, {88335, 124348135}) = 0
gettimeofday({1245245764, 256967}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245764, 257003626}) = 0
futex(0x7f63fc95c304, FUTEX_WAIT_PRIVATE, 1, {0, 599963374} <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fc9c5518, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f63fc7c8f54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fc7c8f50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fd9e6e64, FUTEX_WAIT_PRIVATE, 7219, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fd9e4e08, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f63fd3a08e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fd3a08e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fc7c8f54, FUTEX_WAIT_PRIVATE, 7225, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fde9e6a8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f63fcf091e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fcf091e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fd3a08e4, FUTEX_WAIT_PRIVATE, 7225, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fcf089e8, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245764, 257517}, NULL) = 0
futex(0x7f63fc9c7a34, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fc9c7a30, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fdc10294, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fdc10290, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
<... restart_syscall resumed> ) = 0
futex(0x7f63fc9c8a98, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1245245764, 257689}, NULL) = 0
gettimeofday({1245245764, 257725}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1245245764, 257756521}) = 0
futex(0x7f63fc9c7a34, FUTEX_WAIT_PRIVATE, 9, {0, 599968479} <unfinished ...>
<... futex resumed> ) = 1
futex(0x7f63fcf091e4, FUTEX_WAIT_PRIVATE, 7379, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fdc0fbd8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f63fde63c64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fde63c60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fdc10294, FUTEX_WAIT_PRIVATE, 7383, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fc7c8168, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f63fd2fab44, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fd2fab40, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fde63c64, FUTEX_WAIT_PRIVATE, 7289, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fcf3e728, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f63fcee6ea4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fcee6ea0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fd2fab44, FUTEX_WAIT_PRIVATE, 7283, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fc960888, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f63fdbd83e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7f63fdbd83e0, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7f63fcee6ea4, FUTEX_WAIT_PRIVATE, 7297, NULL <unfinished ...>
<... futex resumed> ) = 0
futex(0x7f63fd8de328, FUTEX_WAKE_PRIVATE, 1) = 0
sendto(8, "\0\361\0&\20\376\200\0\0\0\0\0\0\2PV\377\376\236a\263\0\0\233-\1\2\243\254\355\0\5s"..., 1055, 0, {sa_family=AF_INET6, sin6_port=htons(59659), inet_pton(AF_INET6, "fe80::250:56ff:fe9e:61b3", &sin6_addr), sin6_flowinfo=32768, sin6_scope_id=if_nametoindex("lo")}, 28 <unfinished ...>
<... recvfrom resumed> "\0\361\0\"\20\376\200\0\0\0\0\0\0\2PV\377\376\236a\263\0\0\351\v\1\2\243\254\355\0\5s"..., 65535, 0, {sa_family=AF_INET6, sin6_port=htons(59659), inet_pton(AF_INET6, "fe80::250:56ff:fe9e:61b3", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=if_nametoindex("eth0")}, ) = 773

Any ideas why the application doesn't seem to be doing anything useful?

Application 2: A tomcat server

A tomcat server talking to a remote mysql database has extremely poor responsiveness. Running strace() on that process reveals the same behaviour as above. The machine doesn't appear to be doing anything at all, and yet the java process doesn't respond to the user, or do anything other than repeatedly call futex().

Any ideas? I'm at my wits' end here, and the users are getting very grumpy...

Regards,

Tim

haroldr · ‎06-24-2009

I notice that you are using a guest OS (Debian GNU/Linux 5.0) that is not supported on ESX 3.5, but is supported on ESX 4.0. I don't know what issues this OS might have on ESX 3.5, but the guest OS install guide (http://www.vmware.com/pdf/GuestOS_guide.pdf) lists a couple of known issues with this OS regarding timekeeping behavior, and points to a couple of knowledgebase articles with workarounds. I don't know whether these are valid for ESX 3.5, but since your problem seems to be related to timekeeping, I would suggest giving them a try. Alternatively, you could try running the application on a different OS to see whether the behaviour changes.

If the workarounds or a different OS do not help, the following information might help me to better understand what might be happening.

Which JVM are you using?
What are the startup parameters (e.g. heap size and any tuning) that you are using for your applications?
How many VMs are running on this ESX host?
What is the total amount of memory assigned to powered-on VMs?
Have you run the same applications in a non-virtual environment?
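
Several of the questions above (JVM version, startup parameters, heap size) can be answered from inside the JVM itself. The following is only a minimal sketch and was not part of the original exchange; the class name JvmInfo is made up for illustration, and it assumes a JVM that provides the standard java.lang.management API (Java 5 or later).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;
import java.util.List;

class JvmInfo {
    /** Builds a short report of the running JVM's identity and settings. */
    static String describe() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        Runtime rt = Runtime.getRuntime();
        StringBuilder sb = new StringBuilder();
        sb.append("JVM: ").append(runtime.getVmName())
          .append(' ').append(runtime.getVmVersion()).append('\n');
        sb.append("java.version: ").append(System.getProperty("java.version")).append('\n');
        // Options passed to the JVM at startup, e.g. -Xmx or -XX flags.
        List<String> jvmArgs = runtime.getInputArguments();
        sb.append("JVM arguments: ").append(jvmArgs).append('\n');
        // Upper bound the heap may grow to, whether set explicitly or by default.
        sb.append("Max heap (MB): ").append(rt.maxMemory() / (1024 * 1024)).append('\n');
        sb.append("CPUs visible to the JVM: ").append(rt.availableProcessors()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Run inside the affected VM, the report shows the heap limit and CPU count the JVM actually picked when no tuning flags are given, which is useful to compare against the physical machine.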

Let me know how things work out,

Hal

tcutts · ‎06-25-2009

Thanks for your response, Harold.

tcutts · ‎07-06-2009

> I notice that you are using a guest OS (Debian GNU/Linux 5.0) that is not supported on ESX 3.5, but is supported on ESX 4.0. I don't know what issues this OS might have on ESX 3.5, but the guest OS install guide (http://www.vmware.com/pdf/GuestOS_guide.pdf) lists a couple of known issues with this OS regarding timekeeping behavior, and points to a couple of knowledgebase articles with workarounds. I don't know whether these are valid for ESX 3.5, but since your problem seems to be related to timekeeping, I would suggest giving them a try.

We have already followed those.

> Alternatively, you could try running the application on a different OS to see whether the behaviour changes.

We have tried the application under Ubuntu 8.04 LTS, a fully supported guest OS. The symptoms are the same.

> If the workarounds or a different OS do not help, the following information might help me to better understand what might be happening.
> Which JVM are you using?

Problems have been seen with several versions of Sun's JVM, including 1.6.0_14, which is the one currently being used.

> What are the startup parameters (e.g. heap size and any tuning) that you are using for your applications?

No heap or tuning parameters; the users just seem to be accepting defaults:

/software/MIG/sun/jdk1.6.0_14//bin/java -Djava.util.logging.config.file=/software/MIG/apache-tomcat-6.0.20-dev/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.endorsed.dirs=/software/MIG/apache-tomcat-6.0.20-dev/endorsed -classpath :/software/MIG/apache-tomcat-6.0.20-dev/bin/bootstrap.jar -Dcatalina.base=/software/MIG/apache-tomcat-6.0.20-dev -Dcatalina.home=/software/MIG/apache-tomcat-6.0.20-dev -Djava.io.tmpdir=/software/MIG/apache-tomcat-6.0.20-dev/temp org.apache.catalina.startup.Bootstrap start

> How many VMs are running on this ESX host?

Currently 57, but usually more like 30, and the problem persists then. The machine is not physically heavily loaded: CPU utilisation is about 25%, RAM about 56%.

> What is the total amount of memory assigned to powered-on VMs?

4GB

> Have you run the same applications in a non-virtual environment?

Yes, using the identical software stack. The application runs fine in the non-virtual environment.

Now that I know we have the same problem in the supported Ubuntu guest as in Debian, should I file this as a formal support request?

tcutts · ‎07-07-2009

Another tomcat application has just been reported to me with the same issues. The application just isn't responding sensibly to web requests at all, but when I log into the VM and run top, I see the tomcat server using about 30% CPU, and when I strace it, it's just in some sort of busy loop to do with futexes, again:

>[pid 2927] futex(0x7f3f500cadc4, FUTEX_WAIT_PRIVATE, 1, {0, 49932168}) = -1 ETIMEDOUT (Connection timed out)

>[pid 2927] futex(0x7f3f573aab28, FUTEX_WAKE_PRIVATE, 1) = 0

>[pid 2927] clock_gettime(CLOCK_MONOTONIC, {94449, 630230700}) = 0

>[pid 2927] gettimeofday({1246979223, 755143}, NULL) = 0

>[pid 2927] clock_gettime(CLOCK_MONOTONIC, {94449, 630337417}) = 0

>[pid 2927] clock_gettime(CLOCK_MONOTONIC, {94449, 630381277}) = 0

>[pid 2927] gettimeofday({1246979223, 755288}, NULL) = 0

>[pid 2927] clock_gettime(CLOCK_REALTIME, {1246979223, 755336136}) = 0

repeated endlessly. No other system calls indicative of any I/O to the outside world. Again, when the same software stack is run on a physical server, the tomcat server returns results in under a second.
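
To put numbers on "repeated endlessly", a captured trace can be tallied by system call name. The class below is a minimal sketch under stated assumptions, not a tool anyone in this thread used: the name StraceTally is invented, and the pattern only handles the line shapes shown in the traces above (an optional ">" marker, an optional "[pid N]" prefix, then the call name before the opening parenthesis). Continuation lines such as "<... futex resumed>" are skipped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StraceTally {
    // Optional ">" marker and "[pid N]" prefix, then the syscall name before "(".
    private static final Pattern CALL =
            Pattern.compile("^>?\\s*(?:\\[pid\\s+\\d+\\]\\s*)?(\\w+)\\(");

    /** Counts how often each system call name starts a line of strace output. */
    static Map<String, Integer> tally(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String line : lines) {
            Matcher m = CALL.matcher(line.trim());
            if (m.find()) { // "<... resumed>" lines do not match and are ignored
                String name = m.group(1);
                Integer old = counts.get(name);
                counts.put(name, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Reads a saved trace from standard input and prints the per-call counts.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = new ArrayList<String>();
        for (String l; (l = in.readLine()) != null; ) {
            lines.add(l);
        }
        System.out.println(tally(lines));
    }
}
```

A saved trace can be piped into it on standard input. Note that strace has its own summary mode (strace -c), which also reports time per call, so this sketch is mainly useful for traces that were already captured as plain text.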

As with the other problem applications, this problem occurs only after the java application has been running for an unpredictable length of time; initially it all works fine.

I think I'm going to formally raise a support ticket.

Regards,

Tim

tommyodom · ‎07-19-2009

For what it is worth, I too have been seeing these problems trying to run Java on Ubuntu 8.04 under ESXi 3.5. I'm not sure about your configuration, but if I set the VM to a single CPU the problems go away, whereas they are very reproducible with 2 or 4 CPUs configured on the VM. I have seen the issue during the startup of Glassfish and while running Maven. I decided to investigate the issue further this weekend, and while running strace on the process I did see a loop identical to the one you posted. I am submitting a support request for this issue, so hopefully we can get to the bottom of it; I would really like to run my application servers with more than 1 CPU allocated.

tommyodom · ‎07-22-2009

Well, I hope you have better luck with support than I did. All support would tell me is that they can't guarantee every application and that I should run it in single CPU mode, since multiple CPUs don't work for my Java processes. They didn't seem particularly interested in the fact that the same application on the same stack works fine on a physical computer with two cores but not on a virtual machine with 2 vCPUs.

The only thing I can think of is that CPU timing under a VM differs enough from CPU timing on a physical machine to expose a bug in the Linux kernel, glibc, or the JVM.

haroldr · ‎07-22-2009

I am going to try to reproduce your problems with Ubuntu myself. If I cannot, I will contact you in a private message to your community account so that I can get some additional information. In either case, I will try to find someone internally to look into this further.

In the meantime, to rule out memory-overcommit issues, which can cause problems with Java applications, can you try re-running your 2 vCPU case, but with a memory reservation for the VM equal to its full memory size? Java in a VM is particularly sensitive to having its heap ballooned or swapped.

Thanks,

Hal

tommyodom · ‎07-22-2009

Hi Hal,

Thank you for your response. I appreciate you taking the time to look into this even though it may not directly be a VMware problem.

I modified the VM to set the memory reservation to the full OS memory size but I still ran into the problem. But, I did some further investigation and wanted to share what I've found so far.

While the Java process was taking 100% of the CPU, I attached GDB to the process and I ran a "thread apply all bt" and I noticed that all of the threads were in a pthread_cond_wait kernel vsyscall with the exception of two threads which had the following stack traces:

#0 0xb78c097e in PSPromotionManager::copy_to_survivor_space () from /usr/lib/jvm/java-1.5.0-sun-1.5.0.18/jre/lib/i386/server/libjvm.so

#1 0xb768e23c in instanceKlass::oop_copy_contents () from /usr/lib/jvm/java-1.5.0-sun-1.5.0.18/jre/lib/i386/server/libjvm.so

#2 0xb78c0721 in PSPromotionManager::drain_stacks () from /usr/lib/jvm/java-1.5.0-sun-1.5.0.18/jre/lib/i386/server/libjvm.so

#3 0xb78c2a4e in StealTask::do_it () from /usr/lib/jvm/java-1.5.0-sun-1.5.0.18/jre/lib/i386/server/libjvm.so

#4 0xb76521bf in GCTaskThread::run () from /usr/lib/jvm/java-1.5.0-sun-1.5.0.18/jre/lib/i386/server/libjvm.so
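
For the Java-level threads, a similar overview can be obtained without gdb by asking the JVM for its own thread states. This is a minimal sketch, assuming it can be run inside the affected JVM; the class name ThreadStates is invented for illustration. It does not replace the gdb backtrace above: HotSpot's GC worker threads (the GCTaskThread frames) are native VM threads and do not appear in Thread.getAllStackTraces().

```java
import java.util.EnumMap;
import java.util.Map;

class ThreadStates {
    /** Counts the live Java threads of this JVM by Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts =
                new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            Thread.State state = t.getState(); // read once; the state can change
            Integer old = counts.get(state);
            counts.put(state, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countByState());
        // Show the top frame of every Java thread that is runnable right now.
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            StackTraceElement[] frames = e.getValue();
            if (e.getKey().getState() == Thread.State.RUNNABLE && frames.length > 0) {
                System.out.println(e.getKey().getName() + " at " + frames[0]);
            }
        }
    }
}
```

If nearly every Java thread is WAITING or TIMED_WAITING while the process still burns CPU, that points toward VM-internal threads such as the collector, which is consistent with the backtrace shown here.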

So, I did a little bit of reading on the GC options and discovered that on a single CPU system the parallel GC is not used, which is probably why my single CPU system didn't have any problems. In my experimentation I also found that setting -XX:ParallelGCThreads=1 had no apparent effect until I also set -XX:+UseParallelOldGC, which tells the JVM to use parallel GC for both young and old generations.

When using either -XX:+UseSerialGC or the combination of -XX:ParallelGCThreads=1 and -XX:+UseParallelOldGC, I no longer hit the 100% CPU usage problem in my initial testing. I am going to run with -XX:+UseSerialGC and multiple vCPUs to see if everything appears stable. I am also going to leave the memory reservation set, since it sounds like that's a good idea for Java.
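
When experimenting with collector flags like these, it helps to confirm which collectors the JVM actually selected. The sketch below uses the standard java.lang.management API; the class name ActiveCollectors is made up. The exact name strings are an assumption about HotSpot builds: the serial collector typically reports "Copy" and "MarkSweepCompact", and the parallel collector "PS Scavenge" and "PS MarkSweep", but other JVMs and versions may differ.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class ActiveCollectors {
    /** Returns the names of the garbage collectors this JVM is running with. */
    static List<String> names() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Collectors: " + names());
        // The default collector choice depends partly on how many CPUs the JVM sees.
        System.out.println("CPUs visible to the JVM: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```

Running it once with 1 vCPU and once with 2 or 4 vCPUs, with and without the flags, shows whether the collector in use really changed between the working and failing configurations.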

tommyodom · ‎07-22-2009

Well, I tried those settings on my Glassfish installation but still no luck. I guess whatever the underlying problem is, it's more than just the GC that hits it. It does still seem like most of the threads are stuck in pthread waits, but with Glassfish it is a bit more difficult to tell what's what because of how many threads it starts up and how many are typically idle waiting on network connections.

jjmurray · ‎07-22-2009

Hi,

One possible cause to rule out here is the circumstance shown in which describes a Futex problem with another flavor of Linux on physical systems.

This may be contributing to the issue - let's rule it out or in.

Is the application doing a large amount of GC in any case, i.e. more than once a second? That may account for the high number of calls to gettimeofday(), which is an expensive call.
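
How expensive a clock read is inside a given guest can be measured directly. The following is a rough sketch, not a rigorous benchmark, and the class name ClockCost is invented. It assumes that System.currentTimeMillis() ends up reading the system wall clock (on Linux HotSpot builds of this period that is believed to be gettimeofday(), but treat that mapping as an assumption), and it reports an average cost per call.

```java
class ClockCost {
    /** Returns the average cost in nanoseconds of one System.currentTimeMillis() call. */
    static double nanosPerCall(int calls) {
        long sink = 0; // accumulate results so the JIT cannot discard the calls
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.currentTimeMillis();
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {
            System.out.println(); // keeps 'sink' observably used
        }
        return (double) elapsed / calls;
    }

    public static void main(String[] args) {
        nanosPerCall(1000000); // warm-up pass so the loop gets JIT-compiled
        System.out.printf("currentTimeMillis: %.1f ns per call%n", nanosPerCall(5000000));
    }
}
```

Comparing the figure from the VM with the figure from the physical server running the same stack gives a first indication of whether clock reads are unusually costly under virtualization.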

tommyodom · ‎07-22-2009

Unfortunately that bug doesn't really describe what the fixes were, other than new versions of X11 and imsettings, neither of which is installed on this system. It does pose an interesting question, though, as to whether some other application in Ubuntu is interfering with Java's futex calls even though Java is using the private mutexes. I'll see if I can reduce the number of processes a bit to see if that makes any difference.

tcutts · ‎08-04-2009

> For what it is worth, I too have been seeing these problems trying to run Java on Ubuntu 8.04 under ESXi 3.5. I'm not sure about your configuration, but if I set the VM to single CPU then the problems go away but it seems to be very reproducible with 2 or 4 CPUs configured on the VM. I have seen the issue during the startup of Glassfish and while running Maven. I decided to investigate the issue further this weekend and while running strace on the process I did see a loop identical to the one you posted. I am submitting a support request for this issue so hopefully we can get to the bottom of it because I would really like to run my application servers with more than 1 CPU allocated.

We see this problem on single CPU virtual machines; I basically don't bother with multiple CPU virtual machines. I can imagine, if it's a virtual SMP box, that timing issues are likely to be worse than on a single CPU. I strongly suspect this to be a timing issue, but haven't yet got much evidence to back that up. It just smells that way!
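
One way to gather evidence for or against a timing problem is to compare what a timed sleep was asked to do with what the JVM's two clocks say actually happened. This is a minimal sketch under stated assumptions: the class name TimedWaitProbe is invented, and the 50 ms interval is chosen only because it is of the same order as the futex timeouts in the traces earlier in this thread.

```java
class TimedWaitProbe {
    /**
     * Sleeps for the requested time and returns the elapsed milliseconds as seen by
     * the monotonic clock (index 0) and by the wall clock (index 1).
     */
    static long[] measure(long sleepMillis) {
        long wallStart = System.currentTimeMillis();
        long monoStart = System.nanoTime();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        long monoElapsedMs = (System.nanoTime() - monoStart) / 1000000L;
        long wallElapsedMs = System.currentTimeMillis() - wallStart;
        return new long[] { monoElapsedMs, wallElapsedMs };
    }

    public static void main(String[] args) {
        long requested = 50;
        for (int i = 0; i < 20; i++) {
            long[] r = measure(requested);
            System.out.println("requested=" + requested + "ms monotonic=" + r[0]
                    + "ms wall=" + r[1] + "ms");
        }
    }
}
```

On a healthy system both figures should stay close to the requested value. Large overshoots, or the two clocks disagreeing with each other, would be concrete evidence to attach to a support request; consistently normal figures would suggest looking elsewhere.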

Tim

tcutts · ‎08-04-2009

> I am going to try to reproduce your problems with Ubuntu myself. If I cannot, I will contact you in a private message to your community account so that I can get some additional information. In either case, I will try to find someone internally to look into this further.
> In the meantime, to rule out memory-overcommit issues, which can cause problems with Java applications, can you try re-running your 2 vCPU case, but set a memory reservation for the VM equal to its full memory size. Java in a VM is particularly sensitive to having its heap ballooned or swapped.

In our single CPU case, there is no memory ballooning, and no swapping going on. The machine appears to be completely unloaded, but the application just doesn't respond properly.

Tim

tcutts · ‎08-04-2009

> Hi,
> One possible cause to rule out here is the circumstance shown in which describes a Futex problem with another flavor of Linux on physical systems.
>
> This may be contributing to the issue - let's rule it out or in.

That seems to be a gtk/x11 issue, which is not relevant in my case - these are tomcat JSP servers.

> Is the application doing a large amount of GC in any case - i.e more than once a second. That may account for the high number of calls to gettimeofday() which is an expensive call.

Is there any way I can find that out from a running tomcat process? Please bear in mind that I'm not a Java expert, but a mere sysadmin. I suspect jstat is what I want to use, but the documentation is less than clear, and I don't really know how to interpret the output.

Tim

tommyodom · ‎08-04-2009

There are a few java command line options you can use to turn on garbage collection information but I don't know whether or not they would have performance impacts in your situation. If you are running the Sun JVM, the three flags that may be of use to you are:

-verbose:gc prints information at every collection

-XX:+PrintGCDetails prints additional information about the collections

-XX:+PrintGCTimeStamps will additionally print a time stamp at the start of each collection

Any of those three flags should output information to tell you how frequently garbage collection is running.
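
If restarting with extra flags is inconvenient, the collection rate can also be sampled from inside the JVM through the standard management beans. The class below is a sketch with an invented name (GcRate); it only measures the JVM it runs in, so for a Tomcat process it would have to be called from code deployed in that Tomcat. The 10 second sampling window is an arbitrary choice.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcRate {
    /** Total number of collections so far, summed over all collectors of this JVM. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector reports no count
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Collections per second, given two samples and the time between them. */
    static double perSecond(long before, long after, long elapsedMillis) {
        return (after - before) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long windowMillis = 10000;
        long before = totalCollections();
        Thread.sleep(windowMillis);
        long after = totalCollections();
        System.out.println("GCs per second: " + perSecond(before, after, windowMillis));
    }
}
```

For an already running process, jstat from the JDK reports similar counters from outside: jstat -gcutil <pid> 1000 prints a line every second, and the YGC and FGC columns are the cumulative young and full collection counts, so the difference between successive lines gives the rate.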

tcutts · ‎08-07-2009

> Hi,
> One possible cause to rule out here is the circumstance shown in which describes a Futex problem with another flavor of Linux on physical systems.
> This may be contributing to the issue - let's rule it out or in.
> Is the application doing a large amount of GC in any case - i.e more than once a second. That may account for the high number of calls to gettimeofday() which is an expensive call.

I've checked, and the application is not doing a great deal of GC. Even less, when I used -Xmx and -Xms to increase the memory it's using. The problem persisted.

Most of the application performs fairly well; the parts that don't are those which interact with the back end database (on a physical server). When we performed packet capture of the communication between the virtualised app and the Oracle server, and compared it with the same thing on the non-virtualised app, we did notice that the virtualised app used three times as many packets as the non-virtualised app. Total data size was about the same, so it looks like smaller packets.

So I'm starting to wonder whether this might possibly be something to do with the virtualised network and packet sizes? But if that were the case, why are we only seeing this with Java applications? What controls this packet size? There seem to be several layers, any of which could be at fault: Tomcat -> JDBC -> Oracle client libraries -> Linux kernel -> ESX and probably other layers I haven't thought of.

Certainly we haven't seen performance issues with other Java applications where the database is internal to the VM, such as our Confluence server. Only with Tomcat servers contacting external Oracle instances, and with a web indexing application which uses Java to spider the site's pages. Perhaps it's a general problem with network connections which are initiated by the JVM in the virtualised environment? I will do some more investigation, since if that's the case, I can come up with a simpler test case which will make it easier for you to reproduce.
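
At the Java layer, the knobs that most directly influence how data is cut into packets are the socket options, in particular TCP_NODELAY (which disables Nagle's algorithm, so each small write can go out as its own segment) and the socket buffer sizes. The sketch below, with the invented class name SocketOptionsProbe, simply prints the defaults a plain Java socket gets on a throwaway loopback connection, so that the VM and the physical server can be compared. It assumes Java 7 or later for try-with-resources, and it says nothing about what the Oracle JDBC driver sets on its own sockets, which this thread does not establish.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

class SocketOptionsProbe {
    /** Opens a throwaway loopback connection and reports the client socket's defaults. */
    static String describe() {
        // Port 0 lets the OS pick a free port; the backlog completes the connect
        // without an explicit accept() call.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            return "tcpNoDelay=" + client.getTcpNoDelay()
                    + " sendBuffer=" + client.getSendBufferSize()
                    + " receiveBuffer=" + client.getReceiveBufferSize();
        } catch (IOException e) {
            throw new IllegalStateException("loopback connection failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If these defaults are identical on both machines, the difference in packet counts is more likely to come from below the JVM, for example the guest kernel's network settings or the virtual NIC, than from Java itself. That remains a hypothesis to test rather than a conclusion.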

SlaytanicLemmy · ‎08-19-2009

Did you find any more information on the possible JVM-Network issue that you were investigating? We have many tomcat-based applications running in JVMs and recently, there has been a reported slowdown. They do not push the CPU at all, but they do use back-end JDBC Oracle/MS-SQL DBs. If this is an issue, I want to be able to escalate internally and within VMware.

We do not have the luxury of being able to move the apps to a physical environment. We are using JVM jdk1.5.0_14-b03 (64-bit) on RHEL 4.8 (64-bit) 2vCPU.

Any input would greatly enhance additional troubleshooting that we would perform to validate.

otarroux · ‎08-24-2009

Hi, I have a very similar issue with an in-house developed Java application, running on a RHEL 5.3 guest on VMware 3.5 U4 on a 16 core host.

The application is consuming a lot of system time, and is slow compared to a similar physical box.

Typically: CPU USR around 15%, CPU SYS around 50%.

There is no I/O wait, no swapping, no ballooning, and plenty of available RAM.

Java spends most of its system time in "futex" and "poll" calls:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 49.99  163.781588       26907      6087      1240 futex
 41.07  134.545105       17489      7693           poll
  7.00   22.928262         532     43077           recvfrom
  1.36    4.469311      372443        12           accept
  0.48    1.566032          53     29513           sendto

We have checked that the appropriate time settings are applied:

- notsc divider=10 in the kernel parameters

- ntp time sync instead of vmware tools time sync

We have tested with 1 vCPU and 4 vCPUs: % CPU sys is better with 4 vCPUs, but still much higher than usual.

Since the physical box behaves much better, I would not suspect the application to be badly written.

So I don't know if the problem comes from Java, from VMware, or from RHEL.

Would a formal ticket with VMware help?