Why doesn't mdadm go into a degraded state when there is a bad block on one of the disks?
There is a RAID 5 array of 6 disks built with mdadm.
Everything worked fine for a while, but when I tried to retrieve the files, the checksums of the copies differed.
A surface check of the disks showed that /dev/sda had gone bad.
smartctl -a /dev/sda
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 253 253 021 Pre-fail Always - 950
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 116
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 001 001 000 Old_age Always - 73628
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 114
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 57
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 7888761
194 Temperature_Celsius 0x0022 113 094 000 Old_age Always - 37
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 47
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 1
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 155 080 000 Old_age Offline - 12120
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md127 : active raid5 sda3[0] sdf3[5] sde3[4] sdd3[3] sdc3[2] sdb3[1]
9743319040 blocks super 1.2 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
md1 : active raid10 sda2[0] sdf2[5] sde2[4] sdd2[3] sdc2[2] sdb2[1]
1566720 blocks super 1.2 512K chunks 2 near-copies [6/6] [UUUUUU]
md0 : active raid1 sda1[0] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
4190208 blocks super 1.2 [6/6] [UUUUUU]
/dev/md127:
Version : 1.2
Creation Time : Mon Mar 16 21:27:21 2020
Raid Level : raid5
Array Size : 9743319040 (9291.95 GiB 9977.16 GB)
Used Dev Size : 1948663808 (1858.39 GiB 1995.43 GB)
Raid Devices : 6
Total Devices : 6
Persistence : Superblock is persistent
Update Time : Wed May 26 09:57:02 2021
State : clean
Active Devices : 6
Working Devices : 6
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 64K
Consistency Policy : unknown
Name : 33ea55f9:RAID-5-0 (local to host 33ea55f9)
UUID : 04d214c4:ee331e6a:74ca0a04:5e846481
Events : 148
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
1 8 19 1 active sync /dev/sdb3
2 8 35 2 active sync /dev/sdc3
3 8 51 3 active sync /dev/sdd3
4 8 67 4 active sync /dev/sde3
5 8 83 5 active sync /dev/sdf3
SMART on the disk and mdadm are different things. mdadm will only eject a disk from the array when it actually runs into an I/O error on it; such errors are written to dmesg. You can also see the mismatch counter in /sys/block/mdX/md/mismatch_cnt.
To check the status of the array, you can run a check (a read-only scan for errors):
echo check > /sys/block/mdX/md/sync_action
To fix errors, you can run:
echo repair > /sys/block/mdX/md/sync_action
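A minimal sketch of that workflow, assuming the data array is md127 as in the output above (adjust the device names to your system):

# Look for read/write errors the kernel has reported for the array members
dmesg | grep -iE 'md127|ata|sda'

# Show the mismatch counter left by the last check/repair pass
cat /sys/block/md127/md/mismatch_cnt

# Start a read-only consistency scan and watch its progress
echo check > /sys/block/md127/md/sync_action
cat /proc/mdstat

# If mismatches were found, rewrite inconsistent stripes from parity
echo repair > /sys/block/md127/md/sync_action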
You have 47 Current_Pending_Sector, not Reallocated_Sector_Ct.
This means there was an unsuccessful attempt to read or write these sectors. By itself, this condition is not yet considered a failure. If another attempt fails, the HDD will try to remap the sector: the remap-attempt counter (Reallocated_Event_Count) will go up, and, if the remapping succeeds, so will the count of remapped sectors (Reallocated_Sector_Ct).
If the retry succeeds instead, the sector is simply cleared from the pending list.
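A hedged example of how you might push those pending sectors to resolve and then re-check SMART; md127 and /dev/sda are taken from the output above, adjust for your setup:

# Have md rewrite every stripe from parity, which forces the drive to
# either remap or clear the pending sectors on /dev/sda
echo repair > /sys/block/md127/md/sync_action

# When the pass finishes, re-read the relevant SMART attributes
smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'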