hard drive - Why do different manufacturers have different S.M.A.R.T value?

Sunday, September 23, 2018

First of all, I think everyone knows that hard drives fail far more often than manufacturers would like to admit. Google published a study indicating that certain raw S.M.A.R.T. attributes reported by a hard drive can correlate strongly with the future failure of that drive:

"We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever."

Seagate seems to be trying to obscure this information by claiming that only their software can accurately determine the status of their drives. Incidentally, their software will not show you the raw values for the S.M.A.R.T. attributes. Western Digital has made no such claim to my knowledge, but their status reporting tool does not appear to report raw values either.

I've been using HD Tune and smartctl from smartmontools to gather the raw values for each attribute. I've found that I am indeed comparing apples to oranges when it comes to certain attributes. For example, most Seagate drives report many millions of read errors, while Western Digital drives show 0 read errors 99% of the time. Seagate drives also report many millions of seek errors, while Western Digital drives always seem to report 0.

Q: How do I normalize this data? Is Seagate producing millions of errors while Western Digital is producing none? Wikipedia's article on S.M.A.R.T. says that manufacturers report this data in different ways.

Here is my hypothesis:

I think I found a way to normalize (is that the right term?) the data.

Seagate drives have an additional attribute that Western Digital drives do not have (Hardware ECC Recovered). When you subtract the Raw Read Error Rate count from the Hardware ECC Recovered count, you will usually end up with 0. That difference seems to be equivalent to Western Digital's reported read error count. If so, Western Digital only reports read errors that it cannot correct, while Seagate counts all read errors and separately tells you how many of them it was able to fix.

I had a Seagate drive where the Read error count was less than the ECC Recovered count and I noticed that many of my files were becoming corrupt. This is how I came up with my hypothesis. The millions of seek errors that Seagate produces are still a mystery to me.

Please confirm or correct my hypothesis if you have additional information.

Here is the SMART status of my Western Digital drive so you can see what I'm talking about:

james@ubuntu:~$ sudo smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1001FALS-00E3A0
Serial Number:    WD-WCATR0258512
Firmware Version: 05.01D05
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Jun 10 19:52:28 2010 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   175   021    Pre-fail  Always       -       4033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       270
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1468
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       262
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       223
194 Temperature_Celsius     0x0022   105   102   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
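
To compare drives side by side, the attribute table above can be pulled into a dictionary. What follows is only a minimal Python sketch, assuming the ten-column layout that this version of smartctl prints; `parse_smart_table` is a hypothetical helper name, not part of smartmontools.

```python
import re

# Matches one row of the attribute table as printed by `smartctl -a` above:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
# Only the leading digits of the raw column are captured.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_smart_table(text):
    """Return {attribute_id: {...}} for every attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank line, or other smartctl output
        attr_id, name, _flag, value, worst, thresh, _type, _upd, _failed, raw = m.groups()
        attrs[int(attr_id)] = {
            "name": name,
            "value": int(value),
            "worst": int(worst),
            "thresh": int(thresh),
            "raw": int(raw),
        }
    return attrs

# Three rows copied from the Western Digital output above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   105   102   000    Old_age   Always       -       42
"""

table = parse_smart_table(sample)
print(table[194]["raw"])
```

Keeping the normalized VALUE and THRESH next to the raw value makes it easier to see which of the two a given comparison is actually using.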

Edit: Here is the Seagate drive I mentioned that was causing data corruption. This data is from HD Tune.

HD Tune: ST3250623A Health
ID                               Current  Worst    Threshold Data      Status
(01) Raw Read Error Rate         45       38       6        77882492   Ok
(03) Spin Up Time                99       98       0        0          Ok
(04) Start/Stop Count            100      100      20       640        Ok
(05) Reallocated Sector Count    100      100      36       0          Ok
(07) Seek Error Rate             85       60       30       359872048  Ok
(09) Power On Hours Count        94       94       0        6028       Ok
(0A) Spin Retry Count            100      100      97       0          Ok
(0C) Power Cycle Count           100      100      20       689        Ok
(C2) Temperature                 25       55       0        25         Ok
(C3) Hardware ECC Recovered      50       47       0        201555081  Ok
(C5) Current Pending Sector      100      100      0        0          Ok
(C6) Offline Uncorrectable       100      100      0        0          Ok
(C7) Ultra DMA CRC Error Count   200      199      0        1          Ok
(C8) Write Error Rate            100      253      0        0          Ok
(CA) TA Counter Increased        100      253      0        0          Ok
Power On Time         : 6028
Health Status         : Ok

The fact that Hardware ECC Recovered is larger than the Raw Read Error Rate is counterintuitive, in my opinion.

This is what I've found to be a "normal" Seagate drive, where Hardware ECC Recovered matches the Raw Read Error Rate:

HD Tune: ST380011A Health
ID                               Current  Worst    Threshold Data      Status
(01) Raw Read Error Rate         62       46       6        79986164   Ok
(03) Spin Up Time                98       98       0        0          Ok
(04) Start/Stop Count            100      100      20       6          Ok
(05) Reallocated Sector Count    100      100      36       0          Ok
(07) Seek Error Rate             83       60       30       210309663  Ok
(09) Power On Hours Count        93       93       0        6516       Ok
(0A) Spin Retry Count            100      100      97       0          Ok
(0C) Power Cycle Count           99       99       20       1325       Ok
(C2) Temperature                 25       52       0        25         Ok
(C3) Hardware ECC Recovered      62       46       0        79986164   Ok
(C5) Current Pending Sector      100      100      0        0          Ok
(C6) Offline Uncorrectable       100      100      0        0          Ok
(C7) Ultra DMA CRC Error Count   200      188      0        18         Ok
(C8) Write Error Rate            100      253      0        0          Ok
(CA) TA Counter Increased        100      253      0        0          Ok
Power On Time         : 6516
Health Status         : Ok
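
The hypothesis can be checked against the two HD Tune reports above by subtracting the Raw Read Error Rate data from the Hardware ECC Recovered data. This is only a small sketch: the function name is made up, and the subtraction is my hypothesis, not a documented Seagate formula.

```python
def ecc_minus_read_errors(raw_read_error_rate, hardware_ecc_recovered):
    """Difference proposed in the hypothesis; 0 means the two counters agree."""
    return hardware_ecc_recovered - raw_read_error_rate

# Data column values copied from the two HD Tune reports above.
drives = {
    "ST3250623A": {"raw_read": 77882492, "ecc_recovered": 201555081},
    "ST380011A": {"raw_read": 79986164, "ecc_recovered": 79986164},
}

diffs = {
    model: ecc_minus_read_errors(d["raw_read"], d["ecc_recovered"])
    for model, d in drives.items()
}
for model, diff in diffs.items():
    print(model, diff)
```

For these two drives the difference is 123672589 for the ST3250623A (the one with corrupted files) and 0 for the ST380011A. Whether a nonzero difference actually means anything is exactly what the hypothesis leaves open.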

EDIT:

I want to clarify that I know Google's study concluded that S.M.A.R.T. alone is of limited use for predicting failure. I know that everyone should back up their data. However, I am in the business of fixing other people's computers, and most people have neither backups nor RAID. It is not cost-effective for corporations to troubleshoot hard drives, so they just run them in a RAID until they die. In my line of work I find it useful to check the SMART status of a hard drive; it takes about 30 seconds. If I am lucky enough that a bad drive shows a hint of failure, such as scan errors or reallocated sectors, I know to get the drive the heck out of there. If no such hint exists, I'll probably spend many hours troubleshooting slowness and data corruption before I finally find that the hard drive is bad.

I'm just trying to fine tune this process.

Answer

It does appear that different manufacturers use SMART values for sometimes radically different things, as you can see here:

My hard disk(s) in ReadyNAS is reporting high SMART Raw Read Error Rate, Seek Error Rate, and Hardware ECC Recovered. What should I do?

Seagate uses these SMART fields for internal counts, so this is a known issue with Seagate disks. Look for abnormal counts in other fields, especially Reallocated Sector Ct and ATA Error Count.

So when it comes to your actual question ...

"If I am lucky enough for a bad drive to show a hint of failure such as scan errors or reallocated sectors, I know to get the drive the heck out of there. If no such hint exists, I'll probably spend many hours troubleshooting slowness and data corruption until I finally find that the hard drive is bad."

I'd say a good rule of thumb is that you can only expect SMART values to be comparable within the same drive manufacturer, and maybe only within the same drive model!

So when you're looking at diagnosing those SMART counts, keep that in mind... one manufacturer's "read error retry count" may mean something totally different than another manufacturer's. Sad but true. :(
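
As a rough sketch of that rule of thumb, a quick triage can ignore the vendor-specific rate counters and flag only raw sector counts of the kind the quoted advice points to. The attribute IDs 5, 197 and 198 are taken from the smartctl output in the question. The function name, the choice of these three IDs, and the "any nonzero raw value is suspect" cutoff are assumptions for illustration, not a vendor recommendation.

```python
# Attribute IDs taken from the smartctl output in the question. Treating any
# nonzero raw value as suspect is an assumption made for this sketch.
SUSPECT_IDS = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

def suspect_attributes(raw_by_id):
    """Return names of watched attributes whose raw value is nonzero.

    raw_by_id maps SMART attribute ID -> raw value for a single drive.
    Vendor-specific counters such as IDs 1 and 7 are deliberately ignored.
    """
    return [
        name
        for attr_id, name in sorted(SUSPECT_IDS.items())
        if raw_by_id.get(attr_id, 0) > 0
    ]

# Raw values from the Seagate ST3250623A report in the question.
seagate = {1: 77882492, 5: 0, 7: 359872048, 197: 0, 198: 0}
print(suspect_attributes(seagate))  # the huge values for IDs 1 and 7 are not flagged

# Hypothetical drive with eight reallocated sectors, for contrast.
print(suspect_attributes({5: 8, 197: 0, 198: 0}))
```

Note that the ST3250623A comes back clean here even though it was corrupting files, which fits the study's point that many failed drives show no SMART error signals at all.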
