Strange SMART data from SSD, why and when to change the disk?


linux

Yoh, 2017-03-20 01:06:54

Hello.
There are two computers, each with two Samsung 850 EVO (500 GB) SSDs combined into a software RAID-1 array.
The first was built at the beginning of 2015 and still works fine: the Wear_Leveling_Count parameter decreases in step on both disks. About 89 TB has been written so far, and Wear_Leveling_Count is 29 on both disks.
The second was built in mid-2016, and one of its disks behaves oddly. About 50 TB has been written to each disk (judging by the Total_LBAs_Written parameter), but Wear_Leveling_Count is 72% on one disk (quite normal) and 41% on the other (not normal).
I contacted Samsung support and got a boilerplate answer from someone who had not even looked at the SMART data I sent (as I wrote above, that same SMART data shows that the same amount of data has been written to both disks):

The difference in the values in the parameter depends on how the RAID array is built. For example, in RAID1, the Wear_Leveling_Count parameter can be reduced in cases where the drives need resynchronization: SSD 2 was inactive and RAID will have to rewrite data from SSD 1 to SSD 2, after it is activated in the system.

The firmware is the same on all four drives, and so are the model numbers.
RAID-1 is there for reliability (disks fail not only from wear-out but also from unpredictable controller failures, as I know from personal experience), so there is no need to tell me that it makes no sense with SSDs.
What could cause such a discrepancy in Wear_Leveling_Count when the same amount of data has been written? How do I know when it is time to replace the problem drive: when Wear_Leveling_Count approaches 0, or should I still go by the write endurance the manufacturer claims (about 150 TB)? Maybe someone has run into this; these drive models are popular.

[~]# smartctl -A /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.36.3.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 7637
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 17
177 Wear_Leveling_Count 0x0013 072 072 000 Pre-fail Always - 580
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 067 064 000 Old_age Always - 33
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 4
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 108741171263

[~]# smartctl -A /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-327.36.3.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 098 098 000 Old_age Always - 7637
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 17
177 Wear_Leveling_Count 0x0013 041 041 000 Pre-fail Always - 1249
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 070 062 000 Old_age Always - 30
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 3
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 109374845811
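
The roughly 50 TB per disk quoted in the question can be checked from the Total_LBAs_Written raw values above. A minimal sketch, assuming each LBA counted by this attribute is 512 bytes (an assumption; the smartctl output does not state the unit). The helper name lbas_to_bytes is hypothetical.

```python
LBA_BYTES = 512  # assumption: Total_LBAs_Written counts 512-byte sectors


def lbas_to_bytes(lbas, lba_bytes=LBA_BYTES):
    """Convert a Total_LBAs_Written raw value to bytes."""
    return lbas * lba_bytes


# Raw values of attribute 241 from the two smartctl outputs above.
written = {"sda": 108741171263, "sdb": 109374845811}

for name, lbas in written.items():
    b = lbas_to_bytes(lbas)
    # Print both decimal TB and binary TiB, since tools differ in which they report.
    print(f"{name}: {b / 1e12:.1f} TB ({b / 2**40:.1f} TiB)")

diff = lbas_to_bytes(written["sdb"] - written["sda"])
print(f"difference: {diff / 1e9:.0f} GB")
```

Under that assumption both disks come out near 56 TB (about 50.6 and 50.9 TiB), roughly 0.3 TB apart, so the host-side write volume is practically identical on the two disks.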


3 answer(s)


Artem @Jump, 2017-03-20
@Yoh

Wear_Leveling_Count is not the most reliable parameter: every vendor calculates it differently, and it often either resets to zero or shows outright nonsense.
Look at Total_LBAs_Written instead. Judging by it, about 50 TB has been written to each disk, and the difference between them is about 300 GB.
In principle, a discrepancy like this can arise if TRIM is not reaching one of the disks.
That said, on a server it is better not to rely on TRIM and to leave generous over-provisioning instead.
"How do you know when it's time to change a problem drive?"
Go by Total_LBAs_Written, and watch for signs of trouble in Reallocated_Sector_Ct, Used_Rsvd_Blk_Cnt_Tot, and Erase_Fail_Count_Total.
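
The check described above could be scripted roughly as follows. This is only a sketch: it parses text in the `smartctl -A` layout shown in the question (the sample is a subset of the /dev/sdb output), assumes 512-byte LBAs, and takes the "about 150 TB" figure from the question as the rated endurance. The names parse_attributes and assess are hypothetical.

```python
LBA_BYTES = 512        # assumption: unit of Total_LBAs_Written
RATED_BYTES = 150e12   # the "about 150 TB" endurance figure from the question

# Failure counters named in the answer; any non-zero raw value is a warning sign.
WATCHED = ("Reallocated_Sector_Ct", "Used_Rsvd_Blk_Cnt_Tot", "Erase_Fail_Count_Total")

SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
177 Wear_Leveling_Count 0x0013 041 041 000 Pre-fail Always - 1249
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 109374845811
"""


def parse_attributes(text):
    """Map attribute name -> (normalized value, raw value) from `smartctl -A` text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10 columns;
        # rows whose value or raw column is not a plain integer are skipped.
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[3].isdigit() and fields[9].isdigit()):
            attrs[fields[1]] = (int(fields[3]), int(fields[9]))
    return attrs


def assess(attrs):
    """Return (watched counters with non-zero raw values, fraction of rated writes used)."""
    problems = [n for n in WATCHED if n in attrs and attrs[n][1] > 0]
    written = attrs["Total_LBAs_Written"][1] * LBA_BYTES
    return problems, written / RATED_BYTES


problems, used = assess(parse_attributes(SAMPLE))
print("non-zero failure counters:", problems or "none")
print(f"rated writes used: {used:.0%}")
```

On a live system the text would come from running `smartctl -A` on the device instead of the embedded sample. For the disk in question all watched counters are still zero.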


Konstantin Stepanov, 2017-03-20
@koronabora

1) The flash memory on the two disks differs in quality. The cells on one disk wear out faster, so that disk considers itself more worn.
2) A firmware glitch, in which case pay no attention to it.