VMware Cloud Community
buzurk
Contributor

PSOD In VMware 5.5

Hi

One of our hosts is hitting the attached PSOD at random; it can stay up for a day or for a few days before crashing.

The host is a Dell R720 with 225 GB RAM and no local storage.

ESXi 5.5 runs off an SD card.

The NAS is attached via iSCSI over 10 Gb.

I followed this guide, but the issue persists:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203282...

The host runs two Windows Server 2012 R2 Core Veeam proxies.

We also use Veeam to replicate to this host.

4 Replies
Michelle50
Contributor

Hi

Welcome to the communities.

Could you please enable SSH and collect the log files for further analysis?

http://jackstromberg.com/2012/10/vmware-host-sensor-data-is-not-updated-the-query-service-is-not-ava...

If I have lost confidence in myself, I have the universe against me.
ch1ta
Hot Shot

Are you using the E1000E vNIC by any chance? I'm asking because Veeam has recently reported similar issues in its weekly forum digests:

October 15th:

POSSIBLE DATA CORRUPTION ISSUE, MUST READ! Sorry for starting this way, but this is a big deal, and I wanted to make sure you catch this when scanning through the weekend spam. Apparently, VMware issued a support KB article on this issue a few weeks ago, but it totally flew under the radar (I have not seen a single tweet or blog about this). In short, any data flowing through the VM network stack may get corrupted - including file copies, remote clients interactions with databases, any client-server or multi-tiered apps.

The scariest part is that the scope of this issue is very significant. In fact, we might as well be facing the biggest data corruption issue in the history of virtualization. The issue may occur on any Windows Server 2012 VM with the default (E1000E) vNIC adaptor running on ESXi 5.0 and 5.1, which makes it probably around 20% of all VMs in the world. The easiest workaround is to change the vNIC type to VMXNET3 or E1000 (you should be able to apply this change in bulk with a PowerCLI script), or disable TCP Segmentation Offload in the guest operating system. Keep in mind that changing the vNIC type may result in a change of DHCP address, because the OS will see it as a new network adapter, so this may affect some applications. As such, disabling TCP Segmentation Offload may sometimes be a better choice; however, this increases VM CPU usage.

Specifically to backups, even if some of your backup infrastructure components are running in a Windows Server 2012 VM, you should be safe if you are using Veeam Backup & Replication 6.5 or later. This was the version when we added inline network traffic verification to work around some unrelated data corruption issues involving faulty network equipment that we have observed in support. I had a big story about this in a weekly digest over one year ago. However, unfortunately your actual production data may already be corrupted, and unless you still have backups going all the way back to your vSphere 5.x or Windows Server 2012 upgrade times, this might be one of those cases of unrecoverable data loss... and worst of all, without running a compare against a copy of data that is known to be "good", it is impossible to say which specific parts of data are corrupted...

As per the VMware support KB, the investigation is still ongoing, so I would not yet jump to the conclusion that this is a VMware bug. For example, we did see one mysterious data corruption issue during weeks of automated stress testing of our Windows Server 2012 support. We call it the "10 bad bits mystery" internally, and it was affecting network transfers on both physical and virtual hardware. Unfortunately, the issue was impossible to reproduce reliably, so our investigation with Microsoft went nowhere (and we already had the problem covered with our network traffic verification anyway). But, if anyone from VMware R&D or support is reading this, feel free to reach out to me to discuss the data corruption pattern, as well as factors facilitating the issue surfacing – as this could be the same issue.

November 2nd:

The E1000E vNIC saga continues... besides the confirmed data corruption issue involving Windows Server 2012 and vSphere 5.0/5.1, the stress testing of our upcoming B&R update code revealed an additional critical issue in vSphere 5.5. We've probably seen more PSODs (Purple Screen Of Death) in the past few weeks than in all previous years collectively. After long troubleshooting, we've managed to isolate the issue to E1000E vNIC VM configurations. Apparently, those will reliably crash ESXi 5.5 hosts when under a heavy network I/O load (for example, backup or restore activities). We've opened a support case with VMware; however, as the patch most likely will not be ready before our vSphere 5.5 support release, we will be advising our customers against using E1000E vNIC adapters for now. The problem here is that it is the default adapter type, which is going to make it a problem for users who do not read manuals.

By the way, this PSOD issue does not seem specific to vSphere 5.5, as it looks like at least one customer reported exactly the same error on vSphere 5.1, and according to the poster, VMware has confirmed the issue. In our QC lab, we have not seen crashes before upgrading to vSphere 5.5, but we saw plenty of those on the same hardware after upgrading, so this probably means that the bug is much more likely to appear on vSphere 5.5. Anyway, this story is a perfect example of why we like to take time testing a new platform before shipping support for it. And huge respect to our QC team for being able to find issues like that, issues that slip past a VMware QC team several times its size... this would not be possible without all the load sims and automated testing tools they have developed in house over all these years to ensure B&R performs reliably under heavy load.
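
Before changing adapter types as the digests suggest, it helps to know which VMs actually use E1000E. Below is a minimal sketch, not an official tool: it assumes you have the text of a VM's .vmx file at hand and that the adapter type is recorded there as a line such as `ethernet0.virtualDev = "e1000e"`. The function name `find_adapters` and the sample text are made up for illustration. It only reports matches; changing the adapter type is better done through the vSphere client or PowerCLI.

```python
import re

# Matches .vmx lines such as: ethernet0.virtualDev = "e1000e"
_VDEV = re.compile(
    r'^\s*(ethernet\d+)\.virtualDev\s*=\s*"([^"]*)"\s*$',
    re.IGNORECASE,
)


def find_adapters(vmx_text, adapter_type="e1000e"):
    """Return the names of the virtual NICs in vmx_text whose type equals adapter_type.

    find_adapters is a hypothetical helper written for this thread, not a VMware API.
    The comparison is case-insensitive and exact, so "e1000" does not match "e1000e".
    """
    hits = []
    for line in vmx_text.splitlines():
        match = _VDEV.match(line)
        if match and match.group(2).lower() == adapter_type.lower():
            hits.append(match.group(1))
    return hits


# Made-up sample content, standing in for a real .vmx file.
sample = '''
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000e"
ethernet1.present = "TRUE"
ethernet1.virtualDev = "vmxnet3"
'''

print(find_adapters(sample))  # prints ['ethernet0']
```

Running it against the .vmx text of each Veeam proxy VM would show whether either one still uses the default E1000E adapter.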

Cheers.

zXi_Gamer
Virtuoso

The PSOD you posted resembles the ongoing issue with the e1000 driver.

E1000 virtual nic on Win 2003 R2 causing PSOD on ESXi 5.1 (PF Exception 14)

According to VMware, it is a known issue and will be addressed soon.

Re: Pink screened vSphere 5.1 (twice) by installing Notepad++ on Server 2012 R2
