We are a cloud service provider with multiple clusters, both 5.1 and 5.5.
Every month we pay a lot for licensing and support to VMware (through the VSPP program).
We have a mix of all kinds of OSes. We also don't have access to all of our VMs, because we rent them to customers, so they are responsible for their own VMs.
On the first of January we had a PSOD on one of our ESXi hosts. VMware support concluded that it was due to the use of an E1000 NIC (the VM named in the PSOD doesn't even have an E1000, only vmxnet3).
Yesterday we had three PSODs on different servers within a matter of hours. This resulted in the restart of 59 servers, several of them terminal servers, so you can understand the mess...
VMware again concluded it was due to the use of E1000 NICs.
But then it got really funny: they said there is a patch, but they are just not releasing it, because it needs to go through "extensive testing" before they will release it to individual customers.
I tried to get more information (because we found out that we have been using e1000e in two VMs, in combination with Windows 8.1, since mid-December; one of those VMs was involved in all four PSODs).
But apart from "yes, we see it happening more often on Windows 8.x and 2012," they won't tell me anything more.
They also can't explain what exactly is happening or why. They say disabling RSS (e1000e only) solves the issue. So it is related to VMs with multiple vCPUs, then?
How come such a huge issue (I have seen cases going back to 2012) isn't patched immediately?
Why don't they give a hotfix to customers? I'm sitting on a ticking time bomb. I can't force my customers not to use E1000, and I can't force FreeBSD to support vmxnet3.
I can't change appliances that I get from suppliers and don't even have root access to.
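In the meantime we at least want to know our exposure. Something like this pyVmomi sketch can list every VM that still has an E1000 or e1000e vNIC (a minimal sketch; the vCenter hostname and credentials are placeholders, and you would want proper certificate validation outside a lab):

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import atexit
import ssl

# Placeholder vCenter details -- replace with your own.
ctx = ssl._create_unverified_context()  # lab use only; validate certificates in production
si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="secret", sslContext=ctx)
atexit.register(Disconnect, si)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    if vm.config is None:           # skip orphaned/inaccessible VMs
        continue
    for dev in vm.config.hardware.device:
        if isinstance(dev, (vim.vm.device.VirtualE1000,
                            vim.vm.device.VirtualE1000e)):
            print(vm.name, type(dev).__name__)

view.DestroyView()

At least that tells us which customers to warn.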
Today, VMware is an amateur club in my eyes.
Hans
I have also seen this issue, and the despair on the net about the lack of a solution from VMware; I rebuilt whole farms because of this.
I've seen people switching NICs because of this issue, even switching OSes for that matter. That is absurd!
HardForum and some other blogs/forums have posts about this issue, and there are some case numbers there that can possibly help.
While the workaround suits some people, I can imagine that when you use appliances you are f***ed...
When you use their product and pay for support, and you have 59 servers rebooting because of a bug they allegedly have a patch for, I can totally relate to your anger.
This is really bad practice by VMware.
I hope they will help you soon; word is that there is a patch coming in Q1 2014.
In my experience of obtaining hot patches, they do require extensive testing and are only certified for a specific environment, which is why they can't be made available to the general public. When VMware gives you a hot patch, you can't patch your environment from that point on without the hot patch being removed (as a later build number supersedes it). So at this point you need to decide between a stable environment for your customers (with the hot patch) or being on the latest general-release build.
On one occasion in the past I worked closely with VMware for ~3 months to get the hot patch for our environment included in the general release patches so that we could continue "normal" operations. I think you need to push this up the support escalation path to agree on a solution and timeline.
Cheers,
Jon
Do the logs support what they say? Consider changing to vmxnet3 or paravirtualized NICs and see. Keep the VMware people posted, and save the ticket number so you can reference it if the same issue comes up again.
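For the VMs you do control, the swap itself can be scripted. Here is a rough pyVmomi sketch (the VM name and vCenter details are placeholders; note that the guest sees a brand-new adapter, so it needs the vmxnet3 driver from VMware Tools and may need its IP settings redone):

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import atexit
import ssl

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="secret", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

# Look up the VM by name (placeholder name).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "customer-vm-01")
view.DestroyView()

# Find the existing E1000/e1000e NIC on the VM.
old_nic = next(d for d in vm.config.hardware.device
               if isinstance(d, (vim.vm.device.VirtualE1000,
                                 vim.vm.device.VirtualE1000e)))

# Replace it with a vmxnet3 NIC on the same network backing.
new_nic = vim.vm.device.VirtualVmxnet3()
new_nic.backing = old_nic.backing

spec = vim.vm.ConfigSpec(deviceChange=[
    vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.remove,
        device=old_nic),
    vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
        device=new_nic),
])
vm.ReconfigVM_Task(spec=spec)  # returns a task; monitor it before moving on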
Hi HdeJongh,
I was bitten by this as well. However, I was able to quickly remediate my environments, so I only saw one PSOD from this. It sounds like you don't have the luxury of changing your customers' vNIC types, so here's a crazy idea that might just work...
You might consider containerizing the problem VMs by running them on nested ESXi hosts. That way when the GOS triggers the PSoD, it affects the virtual ESXi instance, not the underlying bare metal hypervisor. This is purely speculation and YMMV. Keep us posted either way... and keep your head up. I'm confident that VMware is working diligently on this effort.
Yes, the idea of creating a "bad" cluster is what we are going to do. But there are still some very important FreeBSD VMs which we can't even change...
Same issue here; I had about six PSODs in a very short time! I had a long dispute with VMware Support; they told us to switch to vmxnet3 as well. But that is not a workaround with more than a thousand VMs controlled by the customers! They won't understand this, and they aren't really helpful in finding a real workaround or solution.
Finding the "bad" VMs and isolating them was an approach we tried as well, but we weren't successful with it, because each time another VM triggered the issue, and sometimes we could not determine the responsible VM at all.
Our workaround was a downgrade to 5.1 build 1157734, even though VMware Support didn't recommend this step. Since then, we have had no more PSODs...
It seems that even the patches released today didn't include a fix for this issue.
I'm really disappointed by this company and their behaviour!
I have two different cases open; it's interesting to see what the two engineers are doing. One of them got the message, I guess, as I was contacted yesterday by somebody who is going to look into getting a patch...
The other one has proposed that she will check every day whether the patch has been released yet.
They also told us to go to 5.1, but only because the 5.1 patch will be released sooner.
5.1 is not an option: I would have to downgrade all VMs and all VMware Tools... (same problem as with a vNIC change; I don't own the VMs).
Actually it seems like a fix for this is in fact included in the patches released today (only for 5.1 though):
VMware KB: VMware ESXi 5.1, Patch ESXi510-201401201-UG: Updates ESXi 5.1 esx-base vib
PR1042045: ESXi host experiences a purple diagnostic screen with errors for E1000PollRxRing and E1000DevRx when the rxRing buffer fills up and the max Rx ring is set to more than 2. The next Rx packet received that is handled by the second ring is NULL, causing a processing error. The purple diagnostic screen or backtrace contains entries similar to:
@BlueScreen: #PF Exception 14 in world 63406:vmast.63405 IP 0x41801cd9c266 addr 0x0
PTEs:0x8442d5027;0x383f35027;0x0;
Code start: 0x41801cc00000 VMK uptime: 1:08:27:56.829
0x41229eb9b590:[0x41801cd9c266]E1000PollRxRing@vmkernel#nover+0xdb9 stack: 0x410015264580
0x41229eb9b600:[0x41801cd9fc73]E1000DevRx@vmkernel#nover+0x18a stack: 0x41229eb9b630
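If you want to verify what your hosts are actually running once a patch lands, a small pyVmomi loop (same placeholder credentials and caveats as the sketches above) prints each host's version and build for comparison against the build number in the release notes:

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import atexit
import ssl

ctx = ssl._create_unverified_context()  # lab use only
si = SmartConnect(host="vcenter.example.com", user="administrator",
                  pwd="secret", sslContext=ctx)
atexit.register(Disconnect, si)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    about = host.config.product   # vim.AboutInfo: holds version and build
    print(host.name, about.version, "build-" + about.build)
view.DestroyView()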
Great information management letting your customers know that their significant issue is fixed, VMware.
Hi,
The fix is also included in the new ESXi 5.1 Update 2.
Frank
So it's true: there is a fix. It's just that VMware thinks this problem isn't big enough, so they want to wait for U1, which is scheduled for the end of Q1.
Hi All,
ESXi 5.1 Update 2 was already released today, with the fix included.
ESXi 5.1 Update 2 Release Notes
Thanks,
Avinash
I think they have gone a little crazy. In a world where other virtualization products are quickly catching up with VMware, one of the best arguments you could make for VMware was stability. This issue has put a really big dent in that. What makes it worse is that nobody at VMware really seems to care - not support, sales, or the sales engineers. They just want to blame the old Intel driver and tell you to replace it, instead of taking ownership of the problem and providing a real solution. Perhaps they are afraid to admit their mistake.

The funny thing is that E1000 is the default in vCenter for many operating systems, and if you are a legacy shop you likely have a lot of it from the 3.x days. I can't speak to whether or not it is a bad driver, but if it is then why did they roll it out to begin with? I cannot think of a single time in the past 5 years that it has caused me a problem. Telling their customers to replace tens, hundreds, or thousands of E1000s is not practical and they don't even offer a suggestion on how to do it. They will, however, offer to sell you PSO hours to help with an upgrade to the faulty product, knowing there is an issue.

I was lucky enough to pick up on the PSOD issue through this community. Otherwise, I would have upgraded to 5.5 and a couple PSODs might have been enough to get management more interested in a competitive product. It is taking VMware way too long to get a patch out. I am not sure why they can't release it for 5.5 since they were able to include a fix in the update for 5.1. It feels like 6.0 might be out before I am able to implement 5.5 at the rate they are moving. While I am frustrated at VMware, I am grateful to all of you who have posted the issue on the community. It has saved guys like me from a lot of stress.
I completely agree with you. I think I have raised more than 150 VMware support calls so far in my career but never got a solution to any problem. I respect this community more than VMware support; it has helped me a lot. VMware support people just take logs and couldn't give me anything I should really consider; they just try to manipulate the situation and blame storage, servers, network, etc. And if you get connected to Bangalore support, it is very slow too.
Really...
So I just downloaded vCloud Usage Meter 3.2, which I MUST run from VMware for the VSPP program.
This is a VMware vApp.
Guess what...
E1000...
So tell me, brilliant VMware, how do I change that NIC?
Guess what NIC vShield Manager 5.5 uses? E1000! VMware engineers keep telling me I need to upgrade and get away from E1000; not sure why, since they use it in their own products.
Is this problem only showing up in 5.1 and 5.5?
Did it show up after the previous round of patches (Oct 2013 timeframe)?
I am still at 5.0 (SSO avoidance) and just finished patching a cluster with the October releases. All other clusters are older, and I see there was a patch release set on 1-23-2014.
I see the KB says 5.x, but it seems the posters on this thread only mention 5.1 and 5.5.
Thanks
I had it on a 5.5 cluster with the latest patches.
Just dug into the details of last week's patch release for 5.0: it is (supposedly) fixed.
Thanks
I got to this post about changing the NICs: https://communities.vmware.com/thread/426140?start=15&tstart=0
Folks there also complain about the PSODs.
Only the update for version 5.1 was posted; still no 5.5...
