(re)scanning the hard drive surface on linux |
Message boards : Number crunching : (re)scanning the hard drive surface on linux
Author | Message |
---|---|
TJM Project administrator Project developer Project scientist Send message Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
I guess there are some Linux power users browsing the forums from time to time, so I'll try asking here :-) One of the server's hard drives suddenly got a few blocks marked as damaged today. I'm not sure it's really an HDD surface problem: I've seen errors like this once before, and they were caused by a faulty PSU. Since the server's PSU isn't very good, I suspect it may be happening again. Also, the syslog says that the drive went offline for a moment just before these errors popped up. Before replacing the drive I'd like to verify whether it's really damaged. What's the best tool to rescan the surface (including the sectors marked as damaged)? Is there any tool for Linux that I can trust, or should I just download and run the drive manufacturer's tools? I've already backed up everything and scanned the drive with badblocks (read-only scan). The drive seems to be in good condition: no weird noises during scans or file operations, no spin-up problems or anything else suspicious, just these few blocks marked as bad. M4 Project homepage M4 Project wiki |
quel Send message Joined: 19 May 09 Posts: 34 Credit: 32,923,471 RAC: 0 |
Well, some bad sectors over the life of a drive are normal. In some cases the sector remapping is automatic; in other cases it isn't, as you noted when the drive went offline. If you already did a full badblocks scan, there isn't anything new to learn from the manufacturer's tools. Make sure you run a forced full fsck after the badblocks scan if you haven't already. If you haven't yet, install the smartmontools package and run smartctl --all /dev/sda. If SMART doesn't warn about imminent drive death and the reallocated sector count isn't near its failure threshold, then you're probably fine. |
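As a rough illustration of the check described above, here is a minimal Python sketch that scans attribute rows in the layout smartctl prints and reports the sector-related counters with a non-zero raw value. The function names and the choice of watched attributes are assumptions made for this example, not part of smartmontools; the sample rows use the same values as the output posted further down the thread.

```python
# Attributes that most directly hint at surface damage (an illustrative choice).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(text):
    """Return {attribute_name: raw_value} from smartctl-style attribute rows."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer, skip the row
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is greater than zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))
```

A non-zero pending or uncorrectable count is only a hint; the self-tests remain the more direct check.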
TJM Project administrator Project developer Project scientist Send message Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
SMART didn't show any errors until I ran the self-tests. Both the short and extended tests (smartctl -t short / smartctl -t long) stopped after a few seconds with the same error:

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed: read failure       90%     16934         429195333
    # 2  Short offline       Completed: read failure       90%     16934         429195333

I thought it might be a 'soft' error caused by a power failure, but it looks like this time it's a damaged surface. |
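To see where on the disk the failing sector sits, the LBA from the self-test log can be converted to a byte offset. This is a minimal sketch, assuming 512-byte logical sectors (the thread does not state the sector size); the helper names are hypothetical.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def first_error_lbas(log_text):
    """Collect LBA_of_first_error from failed rows of a self-test log."""
    lbas = []
    for line in log_text.splitlines():
        fields = line.split()
        # Result rows start with '#'; failed rows end with a numeric LBA.
        if fields and fields[0].startswith("#") and "failure" in line and fields[-1].isdigit():
            lbas.append(int(fields[-1]))
    return lbas


def byte_offset(lba, sector_size=SECTOR_SIZE):
    """Byte offset of the start of a sector, counted from the start of the device."""
    return lba * sector_size


SAMPLE_LOG = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     16934         429195333
# 2  Short offline       Completed: read failure       90%     16934         429195333
"""

if __name__ == "__main__":
    for lba in sorted(set(first_error_lbas(SAMPLE_LOG))):
        print(lba, byte_offset(lba))
```

Both tests stop at the same LBA, which points at one specific unreadable sector rather than a transient glitch.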
quel Send message Joined: 19 May 09 Posts: 34 Credit: 32,923,471 RAC: 0 |
Yes, if it fails even the short test then it's time to get a new drive. |
quel Send message Joined: 19 May 09 Posts: 34 Credit: 32,923,471 RAC: 0 |
Also, bigger drives seem to get a lot less testing at the factory. I wrote this up recently: http://insomnia.quelrod.net/docs/new_drive_testing.txt You'd be amazed at how many 1.0, 1.5, 2.0 TB drives from any vendor actually don't even pass that simple test new out of the retail box (not OEM.) There are quite a few very untested 750G drives out there. A full rw test on a 1.5TB drive can take a good 10 hours. (I'm a *nix admin by day and have a few hundred HDs currently spinning.) |
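As a sanity check on the ten-hour figure, the average throughput it implies can be worked out. This sketch assumes that a "full rw test" means one write pass plus one read pass over the whole drive and that 1.5 TB means 1.5 × 10^12 bytes; neither is stated in the post.

```python
def implied_throughput_mb_s(capacity_bytes, hours, passes=2):
    """Average MB/s needed to cover the drive `passes` times in `hours`."""
    return capacity_bytes * passes / (hours * 3600) / 1e6


if __name__ == "__main__":
    # 1.5 TB drive, about 10 hours, write pass plus read pass (assumed)
    print(round(implied_throughput_mb_s(1.5e12, 10), 1))
```

Under those assumptions the drive has to sustain a little over 80 MB/s on average for the whole run.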
TJM Project administrator Project developer Project scientist Send message Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |

    Ravager:/tmp# smartctl -a /dev/sda
    smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/

    === START OF INFORMATION SECTION ===
    Model Family:     Seagate Barracuda 7200.10 family
    Device Model:     ST3320620AS
    Serial Number:    [EDITED]
    Firmware Version: 3.AAC
    User Capacity:    320,072,933,376 bytes
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   7
    ATA Standard is:  Exact ATA specification draft version not indicated
    Local Time is:    Mon Nov 16 20:50:25 2009 CET
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    General SMART Values:
    Offline data collection status:  (0x82) Offline data collection activity
                                            was completed without error.
                                            Auto Offline Data Collection: Enabled.
    Self-test execution status:      ( 121) The previous self-test completed having
                                            the read element of the test failed.
    Total time to complete Offline
    data collection:                 ( 430) seconds.
    Offline data collection
    capabilities:                    (0x5b) SMART execute Offline immediate.
                                            Auto Offline data collection on/off support.
                                            Suspend Offline collection upon new command.
                                            Offline surface scan supported.
                                            Self-test supported.
                                            No Conveyance Self-test supported.
                                            Selective Self-test supported.
    SMART capabilities:            (0x0003) Saves SMART data before entering
                                            power-saving mode.
                                            Supports SMART auto save timer.
    Error logging capability:        (0x01) Error logging supported.
                                            General Purpose Logging supported.
    Short self-test routine
    recommended polling time:        (   1) minutes.
    Extended self-test routine
    recommended polling time:        ( 115) minutes.

    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   111   086   006    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0003   091   090   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       608
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   079   051   030    Pre-fail  Always       -       90130145
      9 Power_On_Hours          0x0032   081   081   000    Old_age   Always       -       16937
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       716
    187 Reported_Uncorrect      0x0032   039   039   000    Old_age   Always       -       61
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   058   047   045    Old_age   Always       -       42 (Lifetime Min/Max 41/42)
    194 Temperature_Celsius     0x0022   042   053   000    Old_age   Always       -       42 (0 16 0 0)
    195 Hardware_ECC_Recovered  0x001a   101   060   000    Old_age   Always       -       1773457
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
    202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

    SMART Error Log Version: 1
    ATA Error Count: 62 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 62 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 08 3f 00 95 e0 00      00:21:35.564  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:33.663  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:33.662  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:31.761  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:31.760  READ DMA EXT

    Error 61 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:33.663  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:33.662  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:31.761  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:31.760  READ DMA EXT

    Error 60 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:22.191  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:22.190  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:31.761  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:31.760  READ DMA EXT

    Error 59 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:22.191  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:22.190  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:20.289  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:20.288  READ DMA EXT

    Error 58 occurred at disk power-on lifetime: 16930 hours (705 days + 10 hours)
      When the command that caused the error occurred, the device was active or idle.

      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 00 45 00 95 e0  Error: UNC at LBA = 0x00950045 = 9764933

      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      25 00 08 3f 00 95 e0 00      00:21:24.092  READ DMA EXT
      ec 00 00 45 00 95 a0 00      00:21:22.191  IDENTIFY DEVICE
      25 00 08 3f 00 95 e0 00      00:21:22.190  READ DMA EXT
      25 00 08 07 c1 97 e0 00      00:21:20.289  READ DMA EXT
      ca 00 20 e7 54 00 e0 00      00:21:20.288  WRITE DMA

    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Selective offline   Completed: read failure       90%     16934         429195333
    # 2  Selective offline   Completed: read failure       90%     16934         429195333
    # 3  Selective offline   Completed: read failure       90%     16934         429195333
    # 4  Short offline       Completed: read failure       90%     16934         429195333
    # 5  Short offline       Completed: read failure       90%     16934         429195333

    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0  9865000  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.

At least this one worked for almost 2 years before failing. It's quite interesting that SMART shows 700+ power cycles. AFAIR this drive has been in the server since the day I bought it (maybe I tested it in another machine for 1-2 days at most before installing it), so I see no reason why the power cycle count should be so high (unless there was a hardware problem, maybe PSU-related, that I didn't notice). I expected 50-60 at most.
I hope the guy who sold me the drive doesn't read this, because I think it's still under warranty, and he might not like the fact that it worked almost 24/7/365 in a quite heavily loaded server |-) |
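The error-log entries above can be cross-checked by hand. The sketch below reassembles the LBA that smartctl prints from the register dump and redoes the hours-to-days conversion; it assumes the usual 28-bit layout (low nibble of DH, then CH, CL, SN) and that bit 6 of the error register is the UNC flag. The function names are hypothetical.

```python
UNC_BIT = 0x40  # assumption: bit 6 of the ATA error register marks an uncorrectable error


def parse_registers(row):
    """Split an 'ER ST SC SN CL CH DH' hex row into a dict of integers."""
    names = ("ER", "ST", "SC", "SN", "CL", "CH", "DH")
    return dict(zip(names, (int(value, 16) for value in row.split())))


def lba28(regs):
    """Rebuild the 28-bit LBA from the DH low nibble, CH, CL and SN registers."""
    return ((regs["DH"] & 0x0F) << 24) | (regs["CH"] << 16) | (regs["CL"] << 8) | regs["SN"]


def hours_to_days(hours):
    """Convert power-on hours to (days, remaining hours)."""
    return divmod(hours, 24)


if __name__ == "__main__":
    regs = parse_registers("40 51 00 45 00 95 e0")
    print(hex(lba28(regs)), lba28(regs), bool(regs["ER"] & UNC_BIT))
    print(hours_to_days(16930))
```

The result matches the log: the registers decode to LBA 0x00950045 (9764933), and 16930 hours is 705 days plus 10 hours.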
quel Send message Joined: 19 May 09 Posts: 34 Credit: 32,923,471 RAC: 0 |
Heh. Well, I've RMAed many Seagate drives that came with a 5 year warranty. They make the process quite easy. You just need the model and serial number to check the warranty status. No need for receipts or any other fuss. Well, read the MTBF ratings the manufacturers put on the drives and then ponder how to reconcile those numbers with reality ;) |
doublechaz Send message Joined: 5 Mar 09 Posts: 27 Credit: 1,517,764 RAC: 0 |
I was having trouble with SATA drives dropping out of my array for a while. It turned out to be the PSU. I knew it wasn't the drives because I had spares I could put in, and they always tested perfectly outside that server. So, if you suspect the PSU, I would say get a new one in there with plenty of headroom. Are you running some non-zero RAID level on the server in question? I hope. ;) |
TJM Project administrator Project developer Project scientist Send message Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
Nope, no RAID here of any type. Just multiple single HDDs with the database tables spread between them; that way performance is better than with a cheap RAID (each large, frequently accessed table is on its own physical drive, and each drive also keeps a small number of less used tables). For realtime backup I use a replication slave with just one large drive. |
TJM Project administrator Project developer Project scientist Send message Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
It took a few hours longer than I expected to fix everything. The database had one file completely damaged; I replaced it with a copy from backup, but every time I started the db server, the table was marked as read-only. I was quite surprised when I noticed that a simple `DROP TABLE` fixed the problem, so I could recreate the table structure (the data was not important). Everything is up and running, but there won't be new work until tomorrow. |