Fault Manager

From pressy's brainbackup

Memory Died

I got this mail from the fault manager:

root@mprsx01:/root # mail
From noaccess@mprsx01.local Thu Jul 18 20:25:55 2013
Date: Thu, 18 Jul 2013 20:25:55 +0200 (CEST)
From: No Access User <noaccess@mprsx01.local>
Message-Id: <201307181825.r6IIPtLk001862@mprsx01.local>
Subject: Fault Management Event: mprsx01:GMCA-8000-YN
To: root@mprsx01.local
Content-Length: 809

SUNW-MSG-ID: GMCA-8000-YN, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jul 18 20:25:55 CEST 2013
PLATFORM: Sun-Fire-X4440, CSN: 1012QADF009, HOSTNAME: mprsx01
SOURCE: eft, REV: 1.16
EVENT-ID: 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
DESC: The number of correctable errors associated with this memory module has exceeded acceptable levels.
AUTO-RESPONSE: Pages of memory associated with this memory module may have been removed from service, up to a limit which has now been reached.
IMPACT: Total system memory capacity has been reduced (where supported).
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/GMCA-8000-YN for the latest service procedures and policies regarding this diagnosis.


? d
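
The EVENT-ID in that mail is the handle that fmdump and 'fmadm repair' take further down, so it can be handy to pull it out of the message text. A minimal sketch, assuming the mail body arrives on stdin; the function name fm_mail_event_id is made up for this note and is not part of any Solaris tool:

```shell
# Hypothetical helper (not a Solaris command): print the value of the
# EVENT-ID line from a fault-management mail read on stdin.
fm_mail_event_id() {
    # Strip the "EVENT-ID:" label and following spaces, print only matching lines.
    sed -n 's/^EVENT-ID: *//p'
}
```

Fed the mail above, this prints 3ce64096-de0f-eda4-aaac-b6d4e5d53eda.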

So let's see:


root@mprsx01:/root # fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jul 18 20:25:55 3ce64096-de0f-eda4-aaac-b6d4e5d53eda  GMCA-8000-YN   Major

Problem Status    : solved
Diag Engine       : eft / 1.16
System
    Manufacturer  : unknown
    Name          : unknown
    Part_Number   : unknown
    Serial_Number : unknown

System Component
    Manufacturer  : Sun-Microsystems
    Name          : Sun-Fire-X4440
    Part_Number   : 000-0000-00
    Serial_Number : 1012QADF009
    Host_ID       : 008f8772

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.memory.generic-x86.dimm_ce
   Certainty   : 100%
   Affects     : /motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
   Status      : faulted but still in service

Description : The number of correctable errors associated with this memory
              module has exceeded acceptable levels.

Response    : Pages of memory associated with this memory module may have been
              removed from service, up to a limit which has now been reached.

Impact      : Total system memory capacity has been reduced (where supported).

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Please refer to the associated reference document at
              http://support.oracle.com/msg/GMCA-8000-YN for the latest service
              procedures and policies regarding this diagnosis.

root@mprsx01:/root # fmdump
TIME                 UUID                                 SUNW-MSG-ID EVENT
Jul 18 20:25:55.2255 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN Diagnosed
root@mprsx01:/root # fmdump -v -u 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
TIME                 UUID                                 SUNW-MSG-ID EVENT
Jul 18 20:25:55.2255 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN Diagnosed
  100%  fault.memory.generic-x86.dimm_ce

        Problem in: hc://:chassis-mfg=Sun-Microsystems:chassis-name=Sun-Fire-X4440:chassis-part=unknown:chassis-serial=1012QADF009/motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
           Affects: hc://:chassis-mfg=Sun-Microsystems:chassis-name=Sun-Fire-X4440:chassis-part=unknown:chassis-serial=1012QADF009/motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
               FRU: -
          Location: CPU 0 D2
root@mprsx01:/root #
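
The line that matters for the hardware swap is Location, since FRU is empty here. A small sketch for picking it out of the 'fmdump -v' output, assuming the output is piped in on stdin and indented with spaces as shown above; fm_location is a hypothetical name, not an fmdump feature:

```shell
# Hypothetical helper: print the value of every "Location:" line found in
# 'fmdump -v' output read on stdin (assumes space indentation as shown above).
fm_location() {
    sed -n 's/^ *Location: *//p'
}

# Possible use on the affected host:
#   fmdump -v -u 3ce64096-de0f-eda4-aaac-b6d4e5d53eda | fm_location
```

For the output above this gives CPU 0 D2.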

So I shut down the server, replaced the DIMM at CPU 0 D2, and then recorded the repair:

root@mprsx01:/root # fmadm repair 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
fmadm: recorded repair to 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
root@mprsx01:/root # fmadm reset eft
fmadm: eft module has been reset
root@mprsx01:/root # fmadm faulty
root@mprsx01:/root #
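
An empty 'fmadm faulty' is the sign that nothing is flagged any more. A rough sketch of that check as a shell function, e.g. for a cron job; fm_is_clean is a hypothetical name, and the assumption is that any non-blank line of output means an open fault:

```shell
# Hypothetical helper: succeed (exit 0) only if stdin contains no non-blank
# lines, which is how an all-clear 'fmadm faulty' looks in the session above.
fm_is_clean() {
    awk 'NF { dirty = 1 } END { exit dirty + 0 }'
}

# Possible use on the host:
#   fmadm faulty | fm_is_clean && echo "no faults listed"
```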

BTW: no, there was no error LED lit on the mainboard and no error in the ILOM. Perhaps those only show up when the DIMM dies completely.