How this Started
I run two Seagate ST3250620AS as my root file system with a Linux software RAID0 setup for my /home
directory. These drives are from Seagate’s 7200.10 series which were the first drives to switch to perpendicular recording some years ago. This was a time when Seagate had a 5 year warranty for OEM drives and an immaculate reputation.
Starting on Friday, I heard my hard drive clicking. Some quick investigation by looking at logs revealed that sdb was dying to some degree:
Jun 11 14:21:38 core kernel: ata3.00: exception Emask 0x10 SAct 0x1 SErr 0x810000 action 0xe frozen
Jun 11 14:21:38 core kernel: ata3.00: irq_stat 0x08400000, interface fatal error, PHY RDY changed
Jun 11 14:21:38 core kernel: ata3: SError: { PHYRdyChg LinkSeq }
Jun 11 14:21:38 core kernel: ata3.00: failed command: READ FPDMA QUEUED
Jun 11 14:21:38 core kernel: ata3.00: cmd 60/60:00:7d:8d:25/00:00:10:00:00/40 tag 0 ncq 49152 in
Jun 11 14:21:38 core kernel: res 40/00:00:7d:8d:25/00:00:10:00:00/40 Emask 0x10 (ATA bus error)
Jun 11 14:21:38 core kernel: ata3.00: status: { DRDY }
Jun 11 14:21:38 core kernel: ata3: hard resetting link
Jun 11 14:21:41 core kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
Jun 11 14:21:41 core kernel: ata3.00: configured for UDMA/133
Jun 11 14:21:41 core kernel: ata3: EH complete
I noticed the clicking when I took the side of my case off to look at something else, and figured maybe I bumped the cable. I touched the cable and it seemed happy. I wrote it off as a bad cable and replaced the cable later that day when I had a chance to power down the machine. I noticed that one of the contacts was recessed a bit more then the others, so I swapped it and looked at the others. Two others were bad, so I just threw them out and visually inspected the replacements.
Fast forward a few hours and it’s acting up again. This time I dig deeper with smartctl
and run some tests, the first drive in the array passes without problems, but the other has some serious issues. I downloaded Seagate’s Seatools CD and booted off of that since my attempts at running the S.M.A.R.T. long test from Linux failed. Running it from the CD found 2 bad sectors (on top of 7 that were already remapped) and give me the option to repair them, and so far so good. See my smartctl
data below. Also note this drive is almost 4 years old but reports a lifetime of only 4718… I think that’s an oops on Seagate’s part as this drive has been on 24/7 since then.
Full Smartctl Dump for Those Interested
$ sudo smartctl -a /dev/sdb
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10 family
Device Model: ST3250620AS
Serial Number: 5QE0DYWW
Firmware Version: 3.AAC
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sun Jun 13 10:00:31 2010 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 92) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 111 086 006 Pre-fail Always - 34962761
3 Spin_Up_Time 0x0003 092 089 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 323
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 7
7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 341869071
9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 4720
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 099 099 020 Old_age Always - 1031
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 119
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 056 050 045 Old_age Always - 44 (Lifetime Min/Max 41/44)
194 Temperature_Celsius 0x0022 044 050 000 Old_age Always - 44 (0 14 0 0)
195 Hardware_ECC_Recovered 0x001a 077 053 000 Old_age Always - 14538
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 119 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 119 occurred at disk power-on lifetime: 4715 hours (196 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 55 1a 5e e0 Error: UNC at LBA = 0x005e1a55 = 6167125
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 78 13 5e e0 00 01:06:52.649 READ VERIFY SECTOR(S) EXT
42 00 00 78 0b 5e e0 00 01:06:52.631 READ VERIFY SECTOR(S) EXT
42 00 00 78 03 5e e0 00 01:06:52.618 READ VERIFY SECTOR(S) EXT
42 00 00 78 fb 5d e0 00 01:06:52.600 READ VERIFY SECTOR(S) EXT
42 00 00 78 f3 5d e0 00 01:06:52.587 READ VERIFY SECTOR(S) EXT
Error 118 occurred at disk power-on lifetime: 4715 hours (196 days + 11 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 77 bb 1b e0 Error: UNC at LBA = 0x001bbb77 = 1817463
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
42 00 00 00 b8 1b e0 00 00:57:43.347 READ VERIFY SECTOR(S) EXT
42 00 00 00 b0 1b e0 00 00:57:43.334 READ VERIFY SECTOR(S) EXT
42 00 00 00 a8 1b e0 00 00:57:43.317 READ VERIFY SECTOR(S) EXT
42 00 00 00 a0 1b e0 00 00:57:43.304 READ VERIFY SECTOR(S) EXT
42 00 00 00 98 1b e0 00 00:57:43.287 READ VERIFY SECTOR(S) EXT
Error 117 occurred at disk power-on lifetime: 4711 hours (196 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 55 1a 5e ee Error: UNC at LBA = 0x0e5e1a55 = 241048149
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 1a 5e ee 00 05:47:25.395 READ DMA
27 00 00 00 00 00 e0 00 05:47:23.485 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:47:23.427 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:47:23.426 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 05:47:23.426 READ NATIVE MAX ADDRESS EXT
Error 116 occurred at disk power-on lifetime: 4711 hours (196 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 55 1a 5e ee Error: UNC at LBA = 0x0e5e1a55 = 241048149
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 1a 5e ee 00 05:47:19.397 READ DMA
27 00 00 00 00 00 e0 00 05:47:23.485 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:47:23.427 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:47:23.426 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 05:47:23.426 READ NATIVE MAX ADDRESS EXT
Error 115 occurred at disk power-on lifetime: 4711 hours (196 days + 7 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 55 1a 5e ee Error: UNC at LBA = 0x0e5e1a55 = 241048149
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 50 1a 5e ee 00 05:47:19.397 READ DMA
27 00 00 00 00 00 e0 00 05:47:19.396 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 00 05:47:19.338 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 00 05:47:19.338 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 05:47:17.436 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 4718 -
# 2 Short offline Completed without error 00% 4716 -
# 3 Short offline Completed: read failure 90% 4716 241048149
# 4 Short offline Completed: read failure 90% 4714 241048149
# 5 Short offline Completed: read failure 90% 4712 241048149
# 6 Short offline Completed: read failure 90% 4712 241048149
# 7 Short offline Completed: read failure 90% 4710 169589623
# 8 Extended offline Completed: read failure 90% 4706 169589623
# 9 Extended offline Completed without error 00% 4400 -
#10 Short offline Completed without error 00% 4397 -
Comments