VMware Cloud Community
LucasAlbers
Expert

Handling memory errors for enterprise deployments

I ran across a study of memory errors in large-scale Google server deployments:

It got me thinking about how VMware vSphere corrects memory errors.

Red Hat has a kernel module that reports additional information on memory errors and, assuming hardware support, can do background memory scrubbing.

To be specific, Linux has mainline support for this:

EDAC (Error Detection and Correction) is a set of Linux kernel modules for handling hardware-related errors.

Its major focus has been ECC memory error handling; however, it also detects and reports PCI bus parity errors.

Error Detection and Correction (EDAC) Support
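
As an illustration of what EDAC exposes on the Linux side, the sketch below reads the per-controller corrected and uncorrected error counters from sysfs. It assumes the EDAC driver for the platform is loaded and that the standard `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` files are present; the function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} for each EDAC memory controller.

    Returns an empty dict when the EDAC sysfs tree is missing, for example
    when no EDAC driver is loaded for this chipset.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are unreadable or malformed.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A rising corrected count on one controller is the sort of signal that could be polled from a monitoring job before a DIMM starts producing uncorrectable errors.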

Does VMware ESX do any additional memory checking?

The only option that I found for additional memory checking is:

Mem.TestOnAlloc = 1

"Check the new allocated page for Memory Errors"

What are the performance implications of this switch?

How effective would it be at correcting errors?
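
To make the two questions above concrete, here is a generic test-on-allocate sketch. How ESX actually implements Mem.TestOnAlloc is not stated anywhere in this post, so the patterns, page size, and function names below are all assumptions chosen for illustration, not VMware's method.

```python
PAGE_SIZE = 4096                      # assumed page size for the example
PATTERNS = (0x00, 0xFF, 0xAA, 0x55)   # assumed test patterns


def page_passes_pattern_test(page):
    """Fill the buffer with each pattern and read it back.

    Returns False as soon as a read-back differs from what was written.
    """
    for value in PATTERNS:
        expected = bytes([value]) * len(page)
        page[:] = expected
        if bytes(page) != expected:
            return False
    return True


def alloc_tested_page(size=PAGE_SIZE):
    """Allocate a buffer, pattern-test it, and hand it back zeroed."""
    page = bytearray(size)
    if not page_passes_pattern_test(page):
        raise MemoryError("pattern test failed on newly allocated page")
    page[:] = bytes(size)
    return page
```

A check of this kind writes and reads every byte of a page several times before the page is used, which is where the allocation-time cost would come from. It can only detect a fault that shows up during the test; it does not correct anything, and it says nothing about errors that occur after the page is in use.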

What methods do you use to mitigate memory errors on your large-scale deployments?

I follow the practice of 72-hour memory burn-ins before deployment, and enable the additional memory checking option in the BIOS.

Overview of additional hardware memory correction, such as Chipkill, for correcting more than a single-bit error:

For example, on Dell servers (and, I assume, any enterprise vendor) you can enable additional error correction:

Advanced ECC: this mode joins two controllers in lockstep, creating a 128-bit data path....allows SDDC (single device data correction) across x4 and x8 DIMMs.
