We’ve got a strange issue going on and I was wondering if anyone had seen anything similar.
We recently added 6 IBM x3850 M2 servers, each with 4 Intel X7460 6-core procs and 128GB memory. All the systems are configured the same. The HBAs and NICs in the systems are in the same slots. These are identical systems all ordered at the same time. Same BIOS, firmware, etc.
After putting these systems into production we started receiving calls about performance issues from some of the application owners. While running tests it was determined that VMs running on some of the new systems were running at about half the expected speed.
Upon further investigation it appears that 3 of the 6 servers exhibit this behavior in memory and CPU benchmarks under any load condition (from 1 to 60 VMs), while the other 3 servers perform as expected under the same load. We attempted to reload the systems, but the ones exhibiting the issue continue to do so.
Anybody seen anything like this? We have active cases with both VMware and IBM, but neither has been able to help isolate the issue as of yet.
All Hosts are running ESX 3.5 Update 4 and have all patches that have been released as of last week.
All 6 Hosts mentioned are in a single Cluster and HA and DRS are enabled. All VMs are on SAN storage.
We have been able to work around the issue. It was resolved by disconnecting the power cables from the servers that were not performing and letting them completely discharge, then reconnecting the power and booting the systems back up. After this process the systems are now performing as expected.
As of yet we have been unable to reproduce the issue. We attempted disconnecting one power supply while the system is running and also at cold boot, but all systems are now performing as expected under any power scenario.
We are continuing to work with the vendor, and will post a follow up if a root cause is ever determined.
We have exactly the same problem with 8 brand new IBM x3850 M2s. They are Machine/Model 7233-2RG fitted with 4 Xeon E7420 and 96GB RAM. We saw your post when searching for a solution and in desperation unplugged both PSUs of one of the x3850 M2s and left it overnight. We've come back in this morning, re-run the same benchmarks and found that the CPU is effectively performing twice as fast. We are completely confused as to how this has worked. Has anyone else seen anything like this and did you ever determine a root cause? We've logged a call with IBM and I'll post any results we get from that.
The root cause is that the system is throttled due to power constraints. IBM has provided that much information but does not yet have a true resolution as to why the system does not return to full performance when both power supplies are available (it actually appears that the system goes into limp mode when the secondary power supply is reconnected). We still have an open ticket and are still looking for a resolution. I have been in contact with a couple of other people having this issue, and some have been able to reproduce it via the following steps:
1. Disconnect both power supplies from the power grid
2. Leave it unpowered for 5 minutes
3. Reconnect both power supplies and power up the server
4. Log in on the console and run the following command:
while true; do time dd if=/dev/zero of=/dev/shm/test count=100 bs=1M; done
5. Observe the results
6. Remove power cord from PS1
7. Observe the results (nothing should be different at this point)
8. Plug the power cord back into PS1
9. Observe the results (now the observed times should be doubled)
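For anyone who would rather compare numbers than eyeball the output of time, here is a rough sketch of the same dd test wrapped in two helper functions. The names TARGET, time_dd_ms and is_throttled are made up for illustration, the 1.8x cutoff is just an assumed stand-in for "roughly doubled", and it assumes GNU date (for %N) is available on the console.

```shell
#!/bin/sh
# Sketch only: TARGET, time_dd_ms and is_throttled are illustrative names,
# not part of any IBM or VMware tooling.
TARGET="${TARGET:-/dev/shm/test}"

# Time one dd write of N MiB (default 100, as in step 4) and print elapsed ms.
# Assumes GNU date, which supports %N (nanoseconds).
time_dd_ms() {
    count="${1:-100}"
    start=$(date +%s%N)
    dd if=/dev/zero of="$TARGET" count="$count" bs=1M 2>/dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Succeed if CURRENT_MS is at least 1.8x BASELINE_MS.
# The 1.8 factor is an assumed cutoff for "roughly doubled".
is_throttled() {
    baseline_ms="$1"
    current_ms="$2"
    [ $(( current_ms * 10 )) -ge $(( baseline_ms * 18 )) ]
}
```

Capture a baseline at step 5 with base=$(time_dd_ms), take another sample at step 9 with now=$(time_dd_ms), then run is_throttled "$base" "$now" && echo "looks throttled". A single sample is noisy, so it is worth taking a few at each step before drawing conclusions.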
Last I heard from one of the others having the issue, IBM has acknowledged and been able to reproduce this issue in house, and is working on a BIOS fix. I will continue to update this thread as more information comes out.
Thanks for the response. It appears our problem is identical and we can reproduce it using exactly the same procedure. I can confirm that the performance does not decrease until the second PSU is plugged back in. IBM have also confirmed to us that they have been able to reproduce it. I'll post more here as soon as we hear anything more from IBM.
Do you have a case # for your IBM issue? I'm getting some weird recommendations from IBM that the issue may be caused by the CPU thermal grease not being up to snuff, and that once the grease is reapplied the issue should go away. For your reference our case number is 20TZ6RL.
Fantastic, thanks for the update, I've been away so haven't looked at this for a while. We have a machine that we've removed from the cluster to test this so I'll try and get this tested and report back.
I'm facing a similar issue in one of my deployments. Even upgrading to the latest BMC update that IBM provides on its site for the x3850 M2 has not given us positive results.
The issue is specifically with I/O-intensive servers like Citrix XenApp 4.5, where CPU usage hits 100% once there are 15-20 users.