How this Started

I run two Seagate ST3250620AS drives, which hold my root file system plus a Linux software RAID0 array for my /home directory. These drives are from Seagate’s 7200.10 series, which was among the first desktop drive lines to switch to perpendicular recording some years ago. This was a time when Seagate had a 5-year warranty for OEM drives and an immaculate reputation.

On Friday I started hearing one of my hard drives clicking. A quick look at the logs suggested that sdb was dying to some degree:

Jun 11 14:21:38 core kernel: ata3.00: exception Emask 0x10 SAct 0x1 SErr 0x810000 action 0xe frozen
Jun 11 14:21:38 core kernel: ata3.00: irq_stat 0x08400000, interface fatal error, PHY RDY changed
Jun 11 14:21:38 core kernel: ata3: SError: { PHYRdyChg LinkSeq }
Jun 11 14:21:38 core kernel: ata3.00: failed command: READ FPDMA QUEUED
Jun 11 14:21:38 core kernel: ata3.00: cmd 60/60:00:7d:8d:25/00:00:10:00:00/40 tag 0 ncq 49152 in
Jun 11 14:21:38 core kernel: res 40/00:00:7d:8d:25/00:00:10:00:00/40 Emask 0x10 (ATA bus error)
Jun 11 14:21:38 core kernel: ata3.00: status: { DRDY }
Jun 11 14:21:38 core kernel: ata3: hard resetting link
Jun 11 14:21:41 core kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 11 14:21:41 core kernel: ata3.00: configured for UDMA/133
Jun 11 14:21:41 core kernel: ata3: EH complete
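If the log is long, counting these events per port makes it easier to see whether one link is misbehaving. The following is only a rough sketch: the function name `summarize_ata_events` and the regular expression are my own assumptions, written against the syslog line format shown above, not something from any standard tool.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Jun 11 14:21:38 core kernel: ata3.00: failed command: READ FPDMA QUEUED
# The pattern is an assumption based on the excerpt above.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"(?P<port>ata\d+)(?:\.\d+)?:\s+(?P<msg>.*)$"
)


def summarize_ata_events(lines):
    """Count failed commands and hard link resets per ATA port."""
    failed = Counter()
    resets = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an ataN line (e.g. the "res ..." continuation)
        port, msg = match.group("port"), match.group("msg")
        if msg.startswith("failed command:"):
            failed[port] += 1
        elif msg.startswith("hard resetting link"):
            resets[port] += 1
    return failed, resets


sample = [
    "Jun 11 14:21:38 core kernel: ata3.00: failed command: READ FPDMA QUEUED",
    "Jun 11 14:21:38 core kernel: res 40/00:00:7d:8d:25/00:00:10:00:00/40 Emask 0x10 (ATA bus error)",
    "Jun 11 14:21:38 core kernel: ata3: hard resetting link",
    "Jun 11 14:21:41 core kernel: ata3: EH complete",
]
failed, resets = summarize_ata_events(sample)
print(dict(failed), dict(resets))  # {'ata3': 1} {'ata3': 1}
```

To run it against a real log, pass an open file object instead of `sample`; the path of the kernel log varies by distribution.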

I noticed the clicking when I took the side of my case off to look at something else, and figured maybe I had bumped the cable. I touched the cable and it seemed happy. I wrote it off as a bad cable and replaced it later that day when I had a chance to power down the machine. I noticed that one of the contacts was recessed a bit more than the others, so I swapped the cable and checked my other cables. Two of those were bad as well, so I threw them out and visually inspected the replacements.

Fast forward a few hours and it was acting up again. This time I dug deeper with smartctl and ran some tests. The first drive in the array passed without problems, but the other had some serious issues. My attempts at running the S.M.A.R.T. long test from Linux failed, so I downloaded Seagate’s SeaTools CD and booted from that. Running the test from the CD found 2 bad sectors (on top of 7 that were already remapped) and gave me the option to repair them; so far so good. See my smartctl data below. Also note that this drive is almost 4 years old but reports a power-on lifetime of only 4718 hours… I think that’s an oops on Seagate’s part, as this drive has been on 24/7 since then.
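To put that lifetime number in perspective, here is the back-of-the-envelope arithmetic as a small sketch. It rounds "almost 4 years" up to exactly 4 years of continuous uptime, which is an assumption, and the helper names are my own.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def expected_power_on_hours(years):
    """Hours a drive would log if powered on continuously for `years`."""
    return int(years * DAYS_PER_YEAR * HOURS_PER_DAY)


def hours_to_days(hours):
    """Split an hour count into (whole days, leftover hours)."""
    return divmod(hours, HOURS_PER_DAY)


reported = 4718                          # lifetime hours reported by the drive
expected = expected_power_on_hours(4)    # assumption: 4 full years, 24/7
days, rem = hours_to_days(reported)

print(f"{expected} expected vs {reported} reported ({days} days + {rem} hours)")
# 35040 expected vs 4718 reported (196 days + 14 hours)
```

So the drive claims a bit over six months of power-on time where roughly four years would be expected.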

Full Smartctl Dump for Those Interested

$ sudo smartctl -a /dev/sdb

smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3250620AS
Serial Number:    5QE0DYWW
Firmware Version: 3.AAC
User Capacity:    250,059,350,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 13 10:00:31 2010 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:          ( 430) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      (  92) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   086   006    Pre-fail  Always       -       34962761
  3 Spin_Up_Time            0x0003   092   089   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       323
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       7
  7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       341869071
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4720
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1031
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       119
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   050   045    Old_age   Always       -       44 (Lifetime Min/Max 41/44)
194 Temperature_Celsius     0x0022   044   050   000    Old_age   Always       -       44 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   077   053   000    Old_age   Always       -       14538
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 119 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 119 occurred at disk power-on lifetime: 4715 hours (196 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 55 1a 5e e0  Error: UNC at LBA = 0x005e1a55 = 6167125

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 78 13 5e e0 00      01:06:52.649  READ VERIFY SECTOR(S) EXT
  42 00 00 78 0b 5e e0 00      01:06:52.631  READ VERIFY SECTOR(S) EXT
  42 00 00 78 03 5e e0 00      01:06:52.618  READ VERIFY SECTOR(S) EXT
  42 00 00 78 fb 5d e0 00      01:06:52.600  READ VERIFY SECTOR(S) EXT
  42 00 00 78 f3 5d e0 00      01:06:52.587  READ VERIFY SECTOR(S) EXT

Error 118 occurred at disk power-on lifetime: 4715 hours (196 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 01 77 bb 1b e0  Error: UNC at LBA = 0x001bbb77 = 1817463

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 00 b8 1b e0 00      00:57:43.347  READ VERIFY SECTOR(S) EXT
  42 00 00 00 b0 1b e0 00      00:57:43.334  READ VERIFY SECTOR(S) EXT
  42 00 00 00 a8 1b e0 00      00:57:43.317  READ VERIFY SECTOR(S) EXT
  42 00 00 00 a0 1b e0 00      00:57:43.304  READ VERIFY SECTOR(S) EXT
  42 00 00 00 98 1b e0 00      00:57:43.287  READ VERIFY SECTOR(S) EXT

Error 117 occurred at disk power-on lifetime: 4711 hours (196 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 55 1a 5e ee  Error: UNC at LBA = 0x0e5e1a55 = 241048149

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 50 1a 5e ee 00      05:47:25.395  READ DMA
  27 00 00 00 00 00 e0 00      05:47:23.485  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:47:23.427  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:47:23.426  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      05:47:23.426  READ NATIVE MAX ADDRESS EXT

Error 116 occurred at disk power-on lifetime: 4711 hours (196 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 55 1a 5e ee  Error: UNC at LBA = 0x0e5e1a55 = 241048149

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 50 1a 5e ee 00      05:47:19.397  READ DMA
  27 00 00 00 00 00 e0 00      05:47:23.485  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:47:23.427  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:47:23.426  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      05:47:23.426  READ NATIVE MAX ADDRESS EXT

Error 115 occurred at disk power-on lifetime: 4711 hours (196 days + 7 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 55 1a 5e ee  Error: UNC at LBA = 0x0e5e1a55 = 241048149

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 50 1a 5e ee 00      05:47:19.397  READ DMA
  27 00 00 00 00 00 e0 00      05:47:19.396  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      05:47:19.338  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      05:47:19.338  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      05:47:17.436  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4718         -
# 2  Short offline       Completed without error       00%      4716         -
# 3  Short offline       Completed: read failure       90%      4716         241048149
# 4  Short offline       Completed: read failure       90%      4714         241048149
# 5  Short offline       Completed: read failure       90%      4712         241048149
# 6  Short offline       Completed: read failure       90%      4712         241048149
# 7  Short offline       Completed: read failure       90%      4710         169589623
# 8  Extended offline    Completed: read failure       90%      4706         169589623
# 9  Extended offline    Completed without error       00%      4400         -
#10  Short offline       Completed without error       00%      4397         -
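
The "UNC at LBA" values in the error log above can be reproduced from the register bytes smartctl prints. As a minimal sketch, assuming the usual 28-bit layout (SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low nibble of DH bits 24-27), the decoding looks like this. The function names are hypothetical helpers, not part of smartmontools.

```python
REGISTER_NAMES = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]


def parse_register_row(row):
    """Turn a row like '40 51 01 55 1a 5e e0' into a name -> int mapping."""
    values = [int(token, 16) for token in row.split()[:len(REGISTER_NAMES)]]
    return dict(zip(REGISTER_NAMES, values))


def lba28_from_registers(regs):
    """Assemble a 28-bit LBA from the SN, CL, CH and DH registers."""
    return (
        ((regs["DH"] & 0x0F) << 24)  # low nibble of Device/Head: bits 24-27
        | (regs["CH"] << 16)         # Cylinder High: bits 16-23
        | (regs["CL"] << 8)          # Cylinder Low: bits 8-15
        | regs["SN"]                 # Sector Number: bits 0-7
    )


# Register rows copied from errors 119, 118 and 117 in the dump above.
for row in ("40 51 01 55 1a 5e e0",
            "40 51 01 77 bb 1b e0",
            "40 51 00 55 1a 5e ee"):
    lba = lba28_from_registers(parse_register_row(row))
    print(f"{row} -> LBA 0x{lba:08x} = {lba}")
```

The third row decodes to 241048149, the same LBA that the short self-tests in the self-test log kept failing on.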
