VMware Cloud Community
gdmersh
Enthusiast

Mapping an IPMI sensor number to a DIMM module

Hi

I am trying to detect correctable memory errors in the DIMM modules of my servers, which run ESXi 6.5.

I ran the following esxcli command to detect the errors:

-----------------------

esxcli hardware ipmi sel list | grep -B5 -A 3 -i -E "memory|correctable"

Record:390
   Record Id: 390
   When: 2019-02-28T01:08:16
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Memory Correctable ECC
   Sensor Number: 83
   Raw:
   Formatted-Raw:
--
Record:393
   Record Id: 393
   When: 2019-04-25T06:29:14
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Memory Correctable ECC
   Sensor Number: 83
   Raw:
   Formatted-Raw:

-------------------------

It shows two events for sensor number 83. How can I use this information to find out which memory module (the actual slot number) they occurred in?

Basically, how can I map the sensor number from the command output above to DIMM slot information, e.g. DIMMA1?

Thank you

Dee


Accepted Solutions
e_espinel
Virtuoso

Hello.
A standard server has a hardware management interface that is generically known as IPMI. Depending on the vendor, it goes by different names: IMM, BMC, XClarity, iLO, and others.
The management interface has its own labeled network port. By default it is configured to obtain an IP address from a DHCP service; it can also be given a fixed IP address from the server's UEFI (BIOS) setup.

If you have access to your server's IPMI interface, you can find more details about the reported memory event there.
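
As a rough illustration of looking this up from the management interface, the sketch below filters the output of `ipmitool sdr elist` (which lists each sensor's name next to its hexadecimal sensor number) for one sensor number. It rests on several assumptions: that `ipmitool` is available on a machine that can reach the BMC, that the number printed by esxcli is decimal (pass it with a `0x` prefix if it turns out to be hex), and that the platform uses a per-DIMM sensor at all. Some platforms log every memory ECC event against one shared sensor, in which case the name will not identify a slot. The function name `sensor_name_for` is made up for this example.

```shell
#!/bin/sh
# Print the SDR sensor name for a given sensor number.
# stdin: output of `ipmitool sdr elist`
# $1:    sensor number, decimal by default (use a 0x prefix for hex)
sensor_name_for() {
    hex="$(printf '%02Xh' "$1")"    # 83 -> 53h, 0x83 -> 83h
    awk -F'|' -v want="$hex" '
        {
            id = $2
            gsub(/[ \t]/, "", id)               # second column holds the number, e.g. 53h
            if (toupper(id) == toupper(want)) {
                name = $1
                sub(/[ \t]+$/, "", name)        # trim the column padding
                print name
            }
        }
    '
}
```

A possible invocation, with the BMC address and user as placeholders: `ipmitool -I lanplus -H <bmc-address> -U <user> sdr elist | sensor_name_for 83`.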

What make/model of server do you have?
If it is an IBM or Lenovo server, you can collect a lot of hardware data online using the DSA tool.

Memory Correctable ECC events are not considered serious errors, but a count is kept (PFA). When that count exceeds the limit defined by the manufacturer, it is recommended to plan a replacement of the module.

 

Enrique Espinel
Senior Technical Support on IBM, Lenovo, Veeam Backup and VMware vSphere.
VSP-SV, VTSP-SV, VTSP-HCI, VTSP
Please mark my comment as Correct Answer or assign Kudos if my answer was helpful to you, Thank you.

gdmersh
Enthusiast

Hi e_espinel,

Thank you for the response.

I have Dell PowerEdge and HP servers.

| Re: Memory Correctable ECC events are not considered serious errors, but a count is kept (PFA) that when exceeding the limit defined by the manufacturer it is recommended to plan the change.

Yes, that is exactly what I am trying to monitor: how many times a correctable error was reported. To do that, I run the command:

esxcli hardware ipmi sel list 

Record:390
   Record Id: 390
   When: 2019-02-28T01:08:16
   Event Type: 111 (Unknown)
   SEL Type: 2 (System Event)
   Message: Assert + Memory Correctable ECC
   Sensor Number: 83
   Raw:
   Formatted-Raw:

There were more events like this....
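
For the counting part, a minimal sketch along these lines could tally the correctable ECC records per sensor number. It assumes the record layout shown above (a `Record:` line starts each entry, and `Message:` comes before `Sensor Number:`); the function name `count_ecc_by_sensor` is made up for this example, and it has not been checked against other ESXi versions.

```shell
#!/bin/sh
# Count "Memory Correctable ECC" SEL records per sensor number.
# stdin: output of `esxcli hardware ipmi sel list`
count_ecc_by_sensor() {
    awk '
        /^Record:/ { is_ecc = 0 }                            # a new record starts
        /Message:.*Memory Correctable ECC/ { is_ecc = 1 }    # remember the match
        /Sensor Number:/ && is_ecc {
            count[$NF]++                                     # last field is the number
            is_ecc = 0
        }
        END {
            for (s in count) printf "sensor %s: %d event(s)\n", s, count[s]
        }
    ' | sort
}
```

It would be used as `esxcli hardware ipmi sel list | count_ecc_by_sensor`, printing one line per sensor that logged at least one correctable ECC event.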

This tells me that a correctable ECC memory event happened at the given date and time, but not which memory module it happened in; it only says Sensor Number: 83. Is there any command or CLI tool that can tell me which memory module this sensor number belongs to? I have multiple DIMM modules in my server.

Thank you so much 🙂

 
