Fault Manager
From pressy's brainbackup
Memory Died
I got a mail:
root@mprsx01:/root # mail
From noaccess@mprsx01.local Thu Jul 18 20:25:55 2013
Date: Thu, 18 Jul 2013 20:25:55 +0200 (CEST)
From: No Access User <noaccess@mprsx01.local>
Message-Id: <201307181825.r6IIPtLk001862@mprsx01.local>
Subject: Fault Management Event: mprsx01:GMCA-8000-YN
To: root@mprsx01.local
Content-Length: 809

SUNW-MSG-ID: GMCA-8000-YN, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jul 18 20:25:55 CEST 2013
PLATFORM: Sun-Fire-X4440, CSN: 1012QADF009, HOSTNAME: mprsx01
SOURCE: eft, REV: 1.16
EVENT-ID: 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
DESC: The number of correctable errors associated with this memory module has exceeded acceptable levels.
AUTO-RESPONSE: Pages of memory associated with this memory module may have been removed from service, up to a limit which has now been reached.
IMPACT: Total system memory capacity has been reduced (where supported).
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/GMCA-8000-YN for the latest service procedures and policies regarding this diagnosis.

? d
so let's see:
root@mprsx01:/root # fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 18 20:25:55 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN   Major

Problem Status    : solved
Diag Engine       : eft / 1.16
System
    Manufacturer  : unknown
    Name          : unknown
    Part_Number   : unknown
    Serial_Number : unknown

System Component
    Manufacturer  : Sun-Microsystems
    Name          : Sun-Fire-X4440
    Part_Number   : 000-0000-00
    Serial_Number : 1012QADF009
    Host_ID       : 008f8772

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.memory.generic-x86.dimm_ce
   Certainty   : 100%
   Affects     : /motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
   Status      : faulted but still in service

Description : The number of correctable errors associated with this memory module has exceeded acceptable levels.
Response    : Pages of memory associated with this memory module may have been removed from service, up to a limit which has now been reached.
Impact      : Total system memory capacity has been reduced (where supported).
Action      : Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/GMCA-8000-YN for the latest service procedures and policies regarding this diagnosis.
root@mprsx01:/root # fmdump
TIME                 UUID                                 SUNW-MSG-ID  EVENT
Jul 18 20:25:55.2255 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN Diagnosed
root@mprsx01:/root # fmdump -v -u 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
TIME                 UUID                                 SUNW-MSG-ID  EVENT
Jul 18 20:25:55.2255 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN Diagnosed
  100%  fault.memory.generic-x86.dimm_ce

        Problem in: hc://:chassis-mfg=Sun-Microsystems:chassis-name=Sun-Fire-X4440:chassis-part=unknown:chassis-serial=1012QADF009/motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
           Affects: hc://:chassis-mfg=Sun-Microsystems:chassis-name=Sun-Fire-X4440:chassis-part=unknown:chassis-serial=1012QADF009/motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
               FRU: -
          Location: CPU 0 D2
root@mprsx01:/root #
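The interesting bit in all that output is the Location line, because it names the slot to pull. A minimal sketch for fishing it out, assuming fmdump -v prints Location on a line of its own as it did here; the function name fault_location is made up for this note and is not part of the fault manager tools:

```shell
#!/bin/sh
# Hypothetical helper (not part of fmadm/fmdump): read `fmdump -v` output
# on stdin and print only the value of every "Location:" line.
# Assumption: each Location field sits on its own line, possibly indented.
fault_location() {
    # Strip leading whitespace plus the "Location:" label, print what is left.
    sed -n 's/^[[:space:]]*Location:[[:space:]]*//p'
}
```

On this box it would be called as fmdump -v -u 3ce64096-de0f-eda4-aaac-b6d4e5d53eda | fault_location. If a diagnosis lists several suspects, each Location comes out on its own line.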
so... I shut down the server, replaced the DIMM in CPU 0 / D2, and then told the fault manager about the repair:
root@mprsx01:/root # fmadm repair 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
fmadm: recorded repair to 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
root@mprsx01:/root # fmadm reset eft
fmadm: eft module has been reset
root@mprsx01:/root # fmadm faulty
root@mprsx01:/root #
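The last fmadm faulty prints nothing, which is how I know the fault is gone. If that check should go into a script, something like the following could do it. This is only a sketch: faults_cleared is a made-up name, and it looks purely at whether the output is empty, it does not rely on any exit code of fmadm:

```shell
#!/bin/sh
# Hypothetical helper (not part of fmadm): read `fmadm faulty` output on
# stdin and succeed only if there is nothing in it besides blank lines.
faults_cleared() {
    # grep -q succeeds when it finds a non-whitespace character;
    # invert that, so empty or blank-only input means "cleared".
    ! grep -q '[^[:space:]]'
}
```

Usage would be along the lines of fmadm faulty | faults_cleared && echo "no open faults".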
btw: no, there was no error LED lit on the mainboard and no error in the ILOM... perhaps those only show up if the DIMM dies completely