Fault Manager
From pressy's brainbackup
Memory Died
I got a mail:
root@mprsx01:/root # mail
From noaccess@mprsx01.local Thu Jul 18 20:25:55 2013
Date: Thu, 18 Jul 2013 20:25:55 +0200 (CEST)
From: No Access User <noaccess@mprsx01.local>
Message-Id: <201307181825.r6IIPtLk001862@mprsx01.local>
Subject: Fault Management Event: mprsx01:GMCA-8000-YN
To: root@mprsx01.local
Content-Length: 809

SUNW-MSG-ID: GMCA-8000-YN, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Jul 18 20:25:55 CEST 2013
PLATFORM: Sun-Fire-X4440, CSN: 1012QADF009, HOSTNAME: mprsx01
SOURCE: eft, REV: 1.16
EVENT-ID: 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
DESC: The number of correctable errors associated with this memory module has exceeded acceptable levels.
AUTO-RESPONSE: Pages of memory associated with this memory module may have been removed from service, up to a limit which has now been reached.
IMPACT: Total system memory capacity has been reduced (where supported).
REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Please refer to the associated reference document at http://support.oracle.com/msg/GMCA-8000-YN for the latest service procedures and policies regarding this diagnosis.

? d
So let's see what the fault manager has to say:
root@mprsx01:/root # fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jul 18 20:25:55 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN Major
Problem Status : solved
Diag Engine : eft / 1.16
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
System Component
Manufacturer : Sun-Microsystems
Name : Sun-Fire-X4440
Part_Number : 000-0000-00
Serial_Number : 1012QADF009
Host_ID : 008f8772
----------------------------------------
Suspect 1 of 1 :
Fault class : fault.memory.generic-x86.dimm_ce
Certainty : 100%
Affects : /motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
Status : faulted but still in service
Description : The number of correctable errors associated with this memory
module has exceeded acceptable levels.
Response : Pages of memory associated with this memory module may have been
removed from service, up to a limit which has now been reached.
Impact : Total system memory capacity has been reduced (where supported).
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Please refer to the associated reference document at
http://support.oracle.com/msg/GMCA-8000-YN for the latest service
procedures and policies regarding this diagnosis.
root@mprsx01:/root # fmdump
TIME UUID SUNW-MSG-ID EVENT
Jul 18 20:25:55.2255 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN Diagnosed
root@mprsx01:/root # fmdump -v -u 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
TIME UUID SUNW-MSG-ID EVENT
Jul 18 20:25:55.2255 3ce64096-de0f-eda4-aaac-b6d4e5d53eda GMCA-8000-YN Diagnosed
100% fault.memory.generic-x86.dimm_ce
Problem in: hc://:chassis-mfg=Sun-Microsystems:chassis-name=Sun-Fire-X4440:chassis-part=unknown:chassis-serial=1012QADF009/motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
Affects: hc://:chassis-mfg=Sun-Microsystems:chassis-name=Sun-Fire-X4440:chassis-part=unknown:chassis-serial=1012QADF009/motherboard=0/chip=0/memory-controller=0/dram-channel=0/chip-select=2
FRU: -
Location: CPU 0 D2
root@mprsx01:/root #
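The slot to pull can also be fished out of the verbose output with a one-liner. A minimal sketch, assuming the `fmdump -v` layout shown above (one `Location:` line per suspect); the helper name `fm_location` is made up for illustration and is not part of the fault manager tools:

```shell
#!/bin/sh
# fm_location (hypothetical helper): read `fmdump -v -u <uuid>` output on
# stdin and print whatever follows "Location:" on each matching line.
fm_location() {
    sed -n 's/^[[:space:]]*Location:[[:space:]]*//p'
}

# Usage on the affected host (needs fmdump, so left commented out here):
# fmdump -v -u 3ce64096-de0f-eda4-aaac-b6d4e5d53eda | fm_location
```

If the layout differs on another release or platform, the pattern will simply print nothing, so check the raw output first.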
So I shut down the server, replaced the DIMM at CPU 0 / D2, and told the fault manager about the repair:
root@mprsx01:/root # fmadm repair 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
fmadm: recorded repair to 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
root@mprsx01:/root # fmadm reset eft
fmadm: eft module has been reset
root@mprsx01:/root # fmadm faulty
root@mprsx01:/root #
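For the next time, the same three commands can be wrapped in a small function. This is only a sketch of the sequence used above, not an official procedure: the function name `fm_mark_repaired` and the `RUN` variable are my own inventions, and the module defaults to `eft` only because that was the diagnosis engine in this case.

```shell
#!/bin/sh
# fm_mark_repaired (hypothetical helper): record a repair for the given
# event UUID, reset the diagnosis module, then list remaining faults.
# RUN is an assumed dry-run knob: RUN=echo prints the commands instead
# of executing them.
fm_mark_repaired() {
    uuid=$1
    module=${2:-eft}   # assumption: eft, as in the fault shown above
    ${RUN:-} fmadm repair "$uuid" || return 1
    ${RUN:-} fmadm reset "$module" || return 1
    ${RUN:-} fmadm faulty          # empty output means nothing is open
}

# Dry run, safe on any machine:
# RUN=echo fm_mark_repaired 3ce64096-de0f-eda4-aaac-b6d4e5d53eda
```

Running it with `RUN=echo` first shows exactly what would be executed before anything touches the fault manager.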
BTW: no, there was no error LED lit on the mainboard and no error logged in the ILOM. Perhaps those only show up if the DIMM dies completely.