Sinot, 2017-08-14 16:01:28
linux

How do I make mdadm start a rebuild when a disk is hot-pulled?

Greetings.
I have a Dell R720 server with a PERC H310 Mini RAID controller, which works as a plain non-RAID SAS controller.
All of this runs Debian 9 with software RAID (two disks in RAID1 for the system, and four disks in RAID5 plus one hot spare for data).
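For context, arrays with this layout are typically created with something along these lines (just a sketch for reference, not the exact commands used on this machine; the device and partition names are the ones from the output below):

mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal /dev/sda1 /dev/sdb1
mdadm --create /dev/md1 --level=5 --raid-devices=4 --spare-devices=1 --bitmap=internal \
    /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1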

RAID Configuration
/dev/md0:
        Version : 1.2
  Creation Time : Thu Aug 10 14:56:12 2017
     Raid Level : raid1
     Array Size : 585928704 (558.79 GiB 599.99 GB)
  Used Dev Size : 585928704 (558.79 GiB 599.99 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Aug 14 15:04:26 2017
          State : clean 
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : vt-sstor0:0  (local to host vt-sstor0)
           UUID : 4f4cef6f:642f6e9d:89d6711d:c18078ff
         Events : 1269

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

/dev/md1:
        Version : 1.2
  Creation Time : Thu Aug 10 14:56:38 2017
     Raid Level : raid5
     Array Size : 1757786112 (1676.36 GiB 1799.97 GB)
  Used Dev Size : 585928704 (558.79 GiB 599.99 GB)
   Raid Devices : 4
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Aug 14 14:15:41 2017
          State : clean 
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : vt-sstor0:1  (local to host vt-sstor0)
           UUID : 865573b2:d9d8dacc:7d29767f:4e0cb9ac
         Events : 1740

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1

       4       8       97        -      spare   /dev/sdg1

As an experiment, I hot-pulled one of the RAID5 member disks (/dev/sde), hoping it would be replaced by the hot-spare disk (/dev/sdg). But the array simply marked the disk as removed, and that was it.
If, with the same configuration, a disk is explicitly marked as failed (mdadm /dev/md1 -f /dev/sde1), the rebuild onto the hot-spare disk does start.
As I understand it, hot-pulling a disk from a RAID array is not a normal situation, and the hot-spare disk should take its place.
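For comparison, this is the manual-failure test that does trigger the rebuild here (a quick sketch; same array and disk names as above, watching progress through /proc/mdstat):

mdadm /dev/md1 -f /dev/sde1     # mark the member as failed by hand
cat /proc/mdstat                # the spare (sdg1) is picked up and recovery starts
watch -n1 cat /proc/mdstat      # optionally keep watching the rebuild progress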
mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Thu Aug 10 14:56:38 2017
     Raid Level : raid5
     Array Size : 1757786112 (1676.36 GiB 1799.97 GB)
  Used Dev Size : 585928704 (558.79 GiB 599.99 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Aug 14 15:42:14 2017
          State : clean, degraded 
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

           Name : vt-sstor0:1  (local to host vt-sstor0)
           UUID : 865573b2:d9d8dacc:7d29767f:4e0cb9ac
         Events : 1741

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       -       0        0        2      removed
       3       8       81        3      active sync   /dev/sdf1

       4       8       97        -      spare   /dev/sdg1
dmesg
[20819.208432] sd 0:0:4:0: [sde] Synchronizing SCSI cache
[20819.208722] sd 0:0:4:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[20819.211163] md: super_written gets error=-5
[20819.211168] md/raid:md1: Disk failure on sde1, disabling device.
md/raid:md1: Operation continuing on 3 devices.
[20819.598026] RAID conf printout:
[20819.598029] --- level:5 rd:4 wd:3
[20819.598031] disk 0, o:1, dev:sdc1
[20819.598033] disk 1, o:1, dev:sdd1
[20819.598035] disk 2, o:0, dev:sde1
[20819.598036] disk 3, o:1, dev:sdf1
[20819.598038] RAID conf printout:
[20819.598039] --- level:5 rd:4 wd:3
[20819.598040] disk 0, o:1, dev:sdc1
[20819.598041] disk 1, o:1, dev:sdd1
[20819.598043] disk 3, o:1, dev:sdf1
[20819.598316] md: unbind<sde1>
[20819.618026] md: export_rdev(sde1)

So the actual question: how do I force mdadm to behave this way, i.e. start rebuilding onto the spare when a disk disappears?
I suspect the PERC H310 Mini controller is too smart and somehow tells the system that the disk was removed rather than failed. And what happens if a disk's electronics burn out, or something similar: will it also just be marked as removed?
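One way to check that suspicion (standard tools only; I have not confirmed it changes anything) is to repeat the pull while watching what the kernel and udev actually report, alongside the array state:

udevadm monitor --kernel --udev    # prints the remove/add events as the disk is pulled
dmesg -w                           # follow the kernel log, like the excerpt above
watch -n1 cat /proc/mdstat         # see whether md reacts and whether the spare kicks in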
Thank you.

1 answer

nucleon, 2018-03-02

The correct removal procedure is:
mdadm --manage /dev/md1 --fail /dev/sde1
mdadm --manage /dev/md1 --remove /dev/sde1
If that doesn't help, try removing the spare disk in the same way
and then re-adding it, but as a regular disk (see the sketch below).
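Roughly like this (a sketch only; the device names are taken from your output and may have changed after the hot-plug):

mdadm --manage /dev/md1 --remove /dev/sdg1   # drop the spare from the array
mdadm --manage /dev/md1 --add /dev/sdg1      # add it back; the degraded array should start rebuilding onto it
cat /proc/mdstat                             # watch the recovery progress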
