linux
vlarkanov, 2018-05-28 10:20:59

How do I know which memory modules are failing (what motherboard slots they are in)?

Hello!
In syslog, I saw a lot of memory error messages like this:


May 28 06:26:52 ru-tul-dc01-mon02 kernel: [317191.270724] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 28 06:26:52 ru-tul-dc01-mon02 kernel: [317191.270726] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: cc1d730000010092
May 28 06:26:52 ru-tul-dc01-mon02 kernel: [317191.271141] EDAC sbridge MC0: TSC 0
May 28 06:26:52 ru-tul-dc01-mon02 kernel: [317191.271142] EDAC sbridge MC0: ADDR 7f80bf80
May 28 06:26:52 ru-tul-dc01-mon02 kernel: [317191.271143] EDAC sbridge MC0: MISC 4078f886
May 28 06:26:52 ru-tul-dc01-mon02 kernel: [317191.271145] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1527478012 SOCKET 0 APIC 0
May 28 06:26:52 ru-tul-dc01-mon02 kernel: [317191.271425] EDAC MC0: 30156 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x7f80b offset:0xf80 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:1)
May 28 06:26:44 ru-tul-dc01-mon02 kernel: [317183.270202] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 28 06:26:44 ru-tul-dc01-mon02 kernel: [317183.270203] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 9: 8c000050000800c1
May 28 06:26:44 ru-tul-dc01-mon02 kernel: [317183.270204] EDAC sbridge MC0: TSC 0
May 28 06:26:44 ru-tul-dc01-mon02 kernel: [317183.270206] EDAC sbridge MC0: MISC 90000000000208c
May 28 06:26:44 ru-tul-dc01-mon02 kernel: [317183.270207] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1527478004 SOCKET 0 APIC 0
May 28 06:26:44 ru-tul-dc01-mon02 kernel: [317183.270217] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x794717 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:0)

So the modules at CPU0 channel:0 slot:0 and CPU0 channel:2 slot:0 are having problems.
The system has 6 x 8 GB modules installed. The slots on the board are labeled A1, A2, and so on.
How can I tell which modules are faulty (i.e. which board slots they are installed in)?
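
To double-check which channel/slot pairs the corrected errors (CE) point at, the `EDAC MC0: ... CE ...` summary lines can be tallied with a short script. This is only a minimal sketch that assumes the log format shown above; the name `tally_ce` and the regular expression are illustrative and not part of any EDAC tool, and the leading number on each line is taken to be the error count reported for that event.

```python
import re
from collections import Counter

# Matches the per-event summary line, e.g.
# "EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_... (channel:0 slot:0 ..."
# The "EDAC sbridge MC0: ..." detail lines do not match and are skipped.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)

def tally_ce(lines):
    """Sum the reported CE counts per (mc, channel, slot, label)."""
    totals = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m:
            key = (int(m["mc"]), int(m["channel"]), int(m["slot"]), m["label"])
            totals[key] += int(m["count"])
    return totals

# Two summary lines copied (shortened) from the syslog excerpt above.
sample = [
    "kernel: [317191.271425] EDAC MC0: 30156 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x7f80b offset:0xf80)",
    "kernel: [317183.270217] EDAC MC0: 1 CE memory scrubbing error on "
    "CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x794717 offset:0x0)",
    "kernel: [317183.270204] EDAC sbridge MC0: TSC 0",
]
totals = tally_ce(sample)
for (mc, channel, slot, label), count in sorted(totals.items()):
    print(f"MC{mc} channel:{channel} slot:{slot} {label}: {count} CE")
```

On a real system the lines would come from the syslog file or `dmesg` output rather than from a hard-coded list.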
UPD: I tried to look through dmidecode:

# dmidecode -t memory | grep 'Locator: P'
Locator: P1-DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Locator: P1-DIMMA2
Bank Locator: P0_Node0_Channel0_Dimm1
Locator: P1-DIMMA3
Bank Locator: P0_Node0_Channel0_Dimm2
Locator: P1-DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Locator: P1-DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1
Locator: P1-DIMMB3
Bank Locator: P0_Node0_Channel1_Dimm2
Locator: P1-DIMMC1
Bank Locator: P0_Node0_Channel2_Dimm0
Locator: P1-DIMMC2
Bank Locator: P0_Node0_Channel2_Dimm1
Locator: P1-DIMMC3
Bank Locator: P0_Node0_Channel2_Dimm2
Locator: P1-DIMMD1
Bank Locator: P0_Node0_Channel3_Dimm0
Locator: P1-DIMMD2
Bank Locator: P0_Node0_Channel3_Dimm1
Locator: P1-DIMMD3
Bank Locator: P0_Node0_Channel3_Dimm2
Locator: P2-DIMME1
Bank Locator: P1_Node1_Channel0_Dimm0
Locator: P2-DIMME2
Bank Locator: P1_Node1_Channel0_Dimm1
Locator: P2-DIMME3
Bank Locator: P1_Node1_Channel0_Dimm2
Locator: P2-DIMMF1
Bank Locator: P1_Node1_Channel1_Dimm0
Locator: P2-DIMMF2
Bank Locator: P1_Node1_Channel1_Dimm1
Locator: P2-DIMMF3
Bank Locator: P1_Node1_Channel1_Dimm2
Locator: P2-DIMMG1
Bank Locator: P1_Node1_Channel2_Dimm0
Locator: P2-DIMMG2
Bank Locator: P1_Node1_Channel2_Dimm1
Locator: P2-DIMMG3
Bank Locator: P1_Node1_Channel2_Dimm2
Locator: P2-DIMMH1
Bank Locator: P1_Node1_Channel3_Dimm0
Locator: P2-DIMMH2
Bank Locator: P1_Node1_Channel3_Dimm1
Locator: P2-DIMMH3
Bank Locator: P1_Node1_Channel3_Dimm2

Am I right in thinking that P0_Node0_Channel0_Dimm0 corresponds to CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0), and P0_Node0_Channel2_Dimm0 to CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0)? If so, the modules I am looking for are in DIMMA1 and DIMMC1.
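
The lookup described above can be sketched in a few lines of Python. Note that the mapping itself is only the assumption from the question (SrcID N corresponds to P{N}_Node{N}, with channel and DIMM numbers carried over unchanged); neither the kernel nor the BIOS guarantees it, and the function names here are made up for illustration.

```python
import re

EDAC_LABEL = re.compile(r"CPU_SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)")

def edac_label_to_bank_locator(label):
    """Guess the dmidecode 'Bank Locator' for an sbridge EDAC label.

    ASSUMPTION: SrcID N maps to P{N}_Node{N}, and the channel and DIMM
    numbers are identical on both sides. This is unverified.
    """
    m = EDAC_LABEL.fullmatch(label)
    if m is None:
        raise ValueError(f"unrecognized EDAC label: {label!r}")
    src, _ha, chan, dimm = map(int, m.groups())
    return f"P{src}_Node{src}_Channel{chan}_Dimm{dimm}"

def parse_dmidecode(text):
    """Map each 'Bank Locator' value to the 'Locator' line preceding it."""
    mapping, locator = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Bank Locator:"):
            if locator is not None:
                mapping[line.split(":", 1)[1].strip()] = locator
        elif line.startswith("Locator:"):
            locator = line.split(":", 1)[1].strip()
    return mapping

# A few lines taken from the dmidecode output above.
sample = """\
Locator: P1-DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Locator: P1-DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Locator: P1-DIMMC1
Bank Locator: P0_Node0_Channel2_Dimm0
"""
slots = parse_dmidecode(sample)
for label in ("CPU_SrcID#0_Ha#0_Chan#0_DIMM#0", "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0"):
    print(label, "->", slots[edac_label_to_bank_locator(label)])
```

Under that assumption the two labels from the log resolve to P1-DIMMA1 and P1-DIMMC1; whether the silkscreen on the board agrees still has to be checked against the motherboard manual.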


2 answer(s)
Melkij, 2018-05-28
@melkij

Look up the specifications for the memory controller and the motherboard layout.
If the kernel's own identification is to be believed, the first modules on the first and third channels of the first socket are the ones failing.

Andrey2508, 2018-06-01
@Andrey2508

As PrAw wrote, pull out one memory module first and run the test. Then pull out the second one and run the test again, and so on.
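
To see whether errors still accumulate after each module is pulled, one option is to compare the per-DIMM corrected-error counters that EDAC exposes in sysfs before and after a test run. The following is a rough sketch, assuming a kernel that provides the `dimm*/dimm_ce_count` and `dimm_label` files under `/sys/devices/system/edac/mc/`; the helper name `read_dimm_ce_counts` is hypothetical.

```python
from pathlib import Path

def read_dimm_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {(controller, dimm, label): corrected_error_count}.

    ASSUMPTION: the running kernel exposes mc*/dimm*/dimm_ce_count
    (and optionally dimm_label) under `root`; entries without a
    counter file are skipped.
    """
    counts = {}
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        ce_file = dimm / "dimm_ce_count"
        if not ce_file.is_file():
            continue
        label_file = dimm / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        counts[(dimm.parent.name, dimm.name, label)] = int(ce_file.read_text())
    return counts

if __name__ == "__main__":
    for (mc, dimm, label), count in read_dimm_ce_counts().items():
        print(f"{mc}/{dimm} {label}: {count} corrected errors")
```

Running it once before and once after each test gives two snapshots; a counter that keeps growing points at a module that is still installed.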
