Hard disks
CaptSmile, 2013-11-26 16:21:02

What causes intermittent freezes in Ubuntu server 12.10?

I have a file server running Ubuntu. A cron job backs it up daily via rsync to an external USB hard drive.
At irregular intervals I find the server hard-hung in the morning: it is unreachable over the network and some kind of log is displayed on the console. I set up SMART monitoring, and the System Information page in Webmin now shows this message:
60iltirqgnlm.jpg

Drive temperatures sda: 33℃ (282 errors!), sdb: 28℃

Output of the command # smartctl -l error /dev/sda:

smartctl 5.43 2012-06-30 r3573 [i686-linux-3.5.0-43-generic] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 282 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 282 occurred at disk power-on lifetime: 32122 hours (1338 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 e8 a0 74 34 e0  Error: ICRC, ABRT 232 sectors at LBA = 0x003474a0 = 3437728

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 e8 a0 74 34 e0 00  48d+00:38:50.875  READ DMA
  c8 00 88 40 cc 34 e0 00  48d+00:38:10.188  READ DMA
  c8 00 90 50 e2 34 e0 00  48d+00:38:10.188  READ DMA
  c8 00 b8 c0 eb 34 e0 00  48d+00:38:10.188  READ DMA
  c8 00 90 a0 9d 34 e0 00  48d+00:38:10.188  READ DMA

Error 281 occurred at disk power-on lifetime: 31972 hours (1332 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 60 8c 34 e3  Error: ICRC, ABRT at LBA = 0x03348c60 = 53775456

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 60 8c 34 e3 00  41d+18:35:49.250  READ DMA
  c8 00 00 60 8b 34 e3 00  41d+18:35:08.125  READ DMA
  c8 00 00 60 8a 34 e3 00  41d+18:35:08.125  READ DMA
  c8 00 00 60 89 34 e3 00  41d+18:35:08.125  READ DMA
  c8 00 00 60 88 34 e3 00  41d+18:35:08.125  READ DMA

Error 280 occurred at disk power-on lifetime: 30960 hours (1290 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 80 00 76 23 e0  Error: ICRC, ABRT 128 sectors at LBA = 0x00237600 = 2323968

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 80 00 76 23 e0 00   5d+06:11:29.500  READ DMA
  c8 00 10 68 8f d6 e0 00   5d+06:10:48.500  READ DMA
  c8 00 08 58 8f d6 e0 00   5d+06:10:48.500  READ DMA
  c8 00 88 88 0e 04 e1 00   5d+06:10:48.438  READ DMA
  c8 00 08 98 57 d3 e0 00   5d+06:10:48.438  READ DMA

Error 279 occurred at disk power-on lifetime: 30939 hours (1289 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 20 50 9d e3  Error: ICRC, ABRT at LBA = 0x039d5020 = 60641312

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 20 50 9d e3 00   4d+09:20:04.750  READ DMA
  c8 00 00 20 4f 9d e3 00   4d+09:20:04.750  READ DMA
  c8 00 00 20 4e 9d e3 00   4d+09:20:04.750  READ DMA
  c8 00 00 20 4d 9d e3 00   4d+09:20:04.750  READ DMA
  c8 00 00 20 4c 9d e3 00   4d+09:20:04.750  READ DMA

Error 278 occurred at disk power-on lifetime: 30938 hours (1289 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 51 00 00 8c c6 e4  Error: ICRC, ABRT at LBA = 0x04c68c00 = 80120832

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 00 8c c6 e4 00   4d+07:56:42.250  READ DMA
  c8 00 00 00 8b c6 e4 00   4d+07:56:42.250  READ DMA
  c8 00 00 00 8a c6 e4 00   4d+07:56:42.250  READ DMA
  c8 00 00 00 89 c6 e4 00   4d+07:56:42.250  READ DMA
  c8 00 00 00 88 c6 e4 00   4d+07:56:42.250  READ DMA
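
The register rows in the error log above can be decoded by hand. Below is a minimal Python sketch of that decoding. It assumes the traditional ATA bit names (some are marked obsolete in later ATA revisions) and the 28-bit LBA addressing used by READ DMA; the helper names set_bits and decode_error_row are made up for illustration.

```python
# Bit names for the ATA Error (ER) and Status (ST) registers, bit 7 first.
# Assumption: traditional names; "obs" marks a bit with no current meaning.
ER_BITS = ["ICRC", "UNC", "MC", "IDNF", "MCR", "ABRT", "NM", "obs"]
ST_BITS = ["BSY", "DRDY", "DF", "DSC", "DRQ", "CORR", "IDX", "ERR"]


def set_bits(value, names):
    """Return the names of the bits that are set in an 8-bit register value."""
    return [name for i, name in enumerate(names) if value & (0x80 >> i)]


def decode_error_row(row):
    """Decode one 'ER ST SC SN CL CH DH' row from the smartctl error log."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in row.split()[:7])
    # 28-bit LBA: the low nibble of DH holds bits 27..24, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "error": set_bits(er, ER_BITS),
        "status": set_bits(st, ST_BITS),
        "sector_count": sc,  # raw SC register value, not interpreted further
        "lba": lba,
    }


if __name__ == "__main__":
    # Rows copied from errors 282 and 281 in the log above.
    for row in ("84 51 e8 a0 74 34 e0", "84 51 00 60 8c 34 e3"):
        print(row, "->", decode_error_row(row))
```

For the row from error 282 this yields ICRC and ABRT in the error register, a sector count of 232 and LBA 3437728, matching what smartctl printed. ICRC is the interface CRC error bit, which the drive sets when an Ultra DMA transfer between drive and host fails its CRC check.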


So the real question is: will the hard drive die soon, and can that be prevented?


4 answer(s)
sonik_spb, 2013-11-26
@sonik_spb

Nobody can say exactly when the drive will fail. If the data matters, replace it as soon as possible.

CaptSmile, 2013-11-26
@CaptSmile

spoiler
smartctl 5.43 2012-06-30 r3573 [i686-linux-3.5.0-43-generic] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint P80 SD
Device Model:     SAMSUNG HD080HJ
Serial Number:    S08EJ1GLB17924
Firmware Version: ZH100-47
User Capacity:    80.026.361.856 bytes [80,0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Tue Nov 26 15:29:33 2013 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (1825) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       4032
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1881
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       32260
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1071
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   142   094   000    Old_age   Always       -       32
194 Temperature_Celsius     0x0022   142   094   000    Old_age   Always       -       32
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       42114621
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 Data_Address_Mark_Errs  0x0032   253   253   000    Old_age   Always       -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     32158         -
# 2  Extended offline    Completed without error       00%     32131         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
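
The attribute table in the full report can also be checked mechanically. Below is a rough Python sketch, assuming the ten-column attribute layout that smartctl printed here. The grouping into "media" and "link" attributes and the names parse_attributes and failing are illustrative choices, not part of smartmontools.

```python
import re

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# when_failed, raw value. Assumes the raw value starts with an integer.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Illustrative grouping (an assumption, not an official classification).
MEDIA = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")
LINK = ("UDMA_CRC_Error_Count",)


def parse_attributes(text):
    """Map attribute name -> normalized value, worst, threshold and raw value."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            _, name, value, worst, thresh, raw = m.groups()
            attrs[name] = {"value": int(value), "worst": int(worst),
                           "thresh": int(thresh), "raw": int(raw)}
    return attrs


def failing(attrs):
    """Names whose normalized value is at or below a non-zero threshold."""
    return [n for n, a in attrs.items()
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]


# Rows copied from the report above.
SAMPLE = """\
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 253 253 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 253 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
"""

if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    for name in MEDIA + LINK:
        print(name, "raw =", attrs[name]["raw"])
    print("at or below threshold:", failing(attrs))
```

On the rows from this report, the three media counters have a raw value of 0, UDMA_CRC_Error_Count has a raw value of 1, and no attribute is at or below its threshold.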

CaptSmile, 2013-11-26
@CaptSmile

Spoilers don't display for me on Toster.
Chrome version 33.0.1712.4 dev-m

sonik_spb, 2013-11-26
@sonik_spb

Post the full SMART output; the answer to your question should be in there =)
