RAID
ander, 2022-02-06 22:20:23

Why do disks keep dropping out of the RAID array on a PERC H710P Mini controller?

There is a Dell R720 server with a built-in PERC H710P Mini RAID controller. I recently created a RAID 1 from two Seagate Exos X18 16TB SATA drives. After several hours of operation, one of the disks went into the failed state. After I rebooted the server, the disk became ready and the rebuild began. A few days later the story repeated itself, only this time two disks dropped out. After a reboot, the disks again switched to the ready state and the rebuild began. The server is powered by a UPS.

Full story (under the spoiler)
There is a Dell R720 server with a built-in PERC H710P Mini (Embedded) RAID controller. A RAID 10 array of four 4 TB SAS disks has been running on it without problems for 3 years, and a RAID 1 of two 8 TB SATA disks for half a year. Two months ago I created a RAID 1 from two Seagate Exos X18 16TB SATA disks.
After about 1 TB of data had been copied, one disk in the 16 TB array went into the failed state. I pulled the disk from the server and took it to a service center. The service center replied "the disk is working" and returned it. I put the disk back in the server and built the RAID 1 again; after several hours of operation the same disk dropped to failed. Is it a faulty disk? Or the controller? I swapped the positions of the two disks of this array in the server. After two days of operation (1.5 TB of data) both hard disks of this array went into the failed state. I rebooted the server, both disks went to the Ready state, and the rebuild of the virtual disk started. The server ran on a UPS the whole time, and there were no power outages.
The first time, the "problem" disk was in bay 1. After the swap, the other disk, which now sits in bay 1, is in the rebuild state, and its partner disk is in the "not applicable" state, which makes me wonder: could the controller be the culprit?

The server is running Windows Server 2012 R2.
Controller firmware versions:
Firmware Version 21.0.2-0001
Driver Version 6.600.21.08

Controller logs:
2022-02-06T20:12:16-0600 PDR4
Disk 3 in Backplane 1 of Integrated RAID Controller 1 returned to a ready state.

2022-02-06T20:12:15-0600 PDR4
Disk 1 in Backplane 1 of Integrated RAID Controller 1 returned to a ready state.

2022-02-06T20:10:09-0600 PDR1017
Drive 3 in disk drive bay 1 is operating normally.

2022-02-06T20:10:07-0600 PDR1017
Drive 1 in disk drive bay 1 is operating normally.

2022-02-06T20:09:32-0600 SYS1003
System CPU Resetting.

2022-02-06T20:09:23-0600 SEL9901
OEM software event.

2022-02-06T20:09:22-0600 OSE0003
An OS graceful shut-down occurred.

2022-02-06T20:07:31-0600 VME0007
Virtual Console session created.

2022-02-06T20:07:31-0600 VME0001
Virtual Console session started.

2022-02-06T20:07:31-0600 USR0030
Successfully logged in using root, from ip and Virtual Console.

2022-02-06T20:05:35-0600 USR0030
Successfully logged in using root, from ip and GUI.

2022-02-06T19:57:21-0600 USR0030
Successfully logged in using root, from ip and GUI.

2022-02-06T16:12:24-0600 USR0032
The session for root from ip using GUI is logged off.

2022-02-06T15:38:50-0600 USR0030
Successfully logged in using root, from ip and GUI.

2022-02-05T23:39:48-0600 PDR1001
Fault detected on drive 3 in disk drive bay 1.

2022-02-05T23:39:44-0600 CTL38
The Patrol Read operation completed for Integrated RAID Controller 1.

2022-02-05T23:39:44-0600 VDR31
Controller cache is preserved for missing or offline Virtual Disk 2 on Integrated RAID Controller 1.

2022-02-05T23:39:44-0600 PDR60
Error occurred on Disk 3 in Backplane 1 of Integrated RAID Controller 1 : (Error 2).

2022-02-05T23:39:44-0600 VDR7
Virtual Disk 2 on Integrated RAID Controller 1 has failed.

2022-02-05T23:39:43-0600 PDR3
Disk 3 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly.

2022-02-05T03:00:01-0600 CTL37
A Patrol Read operation started for Integrated RAID Controller 1.

2022-02-04T23:53:04-0600 PDR1001
Fault detected on drive 1 in disk drive bay 1.

2022-02-04T23:52:58-0600 PDR60
Error occurred on Disk 1 in Backplane 1 of Integrated RAID Controller 1 : (Error 2).

2022-02-04T23:52:57-0600 VDR8
Virtual Disk 2 on Integrated RAID Controller 1 is degraded either because the physical disk drive in the drive group is removed or the physical disk drive added in a redundant virtual drive has failed.

2022-02-04T23:52:57-0600 PDR3
Disk 1 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly.
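
Since the log is exported newest first, the order of the failures is easier to see once it is sorted into a timeline. Below is a rough Python sketch for that. It assumes the two-line entry format shown above (timestamp and message ID, then the message text); treating PDR3, PDR60 and PDR1001 as "fault" events is an assumption read off the message texts, not taken from Dell documentation, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Excerpt copied from the controller log above (newest first, as exported).
LOG = """\
2022-02-05T23:39:48-0600 PDR1001
Fault detected on drive 3 in disk drive bay 1.

2022-02-05T23:39:43-0600 PDR3
Disk 3 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly.

2022-02-05T03:00:01-0600 CTL37
A Patrol Read operation started for Integrated RAID Controller 1.

2022-02-04T23:53:04-0600 PDR1001
Fault detected on drive 1 in disk drive bay 1.

2022-02-04T23:52:57-0600 PDR3
Disk 1 in Backplane 1 of Integrated RAID Controller 1 is not functioning correctly.
"""

HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{4}) (\w+)$")
DISK = re.compile(r"\b(?:Disk|drive) (\d+)\b", re.IGNORECASE)
# Assumption: these message IDs are the ones that signal a physical-disk fault.
FAULT_IDS = {"PDR3", "PDR60", "PDR1001"}


def parse_log(text):
    """Return (timestamp, message_id, message) tuples, oldest first."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        m = HEADER.match(lines[i])
        if m and i + 1 < len(lines):
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S%z")
            entries.append((ts, m.group(2), lines[i + 1]))
            i += 2  # skip the message line we just consumed
        else:
            i += 1
    return sorted(entries, key=lambda e: e[0])


def first_fault_per_disk(entries):
    """Map disk number -> timestamp of its earliest fault event."""
    first = {}
    for ts, msg_id, message in entries:
        m = DISK.search(message)
        if msg_id in FAULT_IDS and m:
            first.setdefault(int(m.group(1)), ts)
    return first


if __name__ == "__main__":
    for disk, ts in sorted(first_fault_per_disk(parse_log(LOG)).items()):
        print(f"disk {disk}: first fault at {ts.isoformat()}")
```

On this excerpt the two disks fail almost exactly a day apart, and the second failure falls between the start (CTL37) and the logged completion (CTL38) of a Patrol Read run.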


What should I do? Throw the disks away? Buy another controller?
Another used server will arrive soon; I'll make a backup and try updating the RAID controller, though there is no guarantee it will help...

2 answer(s)
Zettabyte, 2022-02-07
@Zettabyte

Firmware Version 21.0.2-0001

I would start here. This firmware version dates from May 2012.
The controller clearly handles drives beyond the "standard" 2 TB boundary, but 10 years ago nobody was talking about 16 TB drives.
Try updating to the latest version.
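
For context, the 2 TB boundary comes from 32-bit sector addressing combined with 512-byte sectors. A quick arithmetic sketch in Python; the 16 TB figure is the nominal decimal capacity (an assumption, the exact sector count is in the drive's datasheet), and the function names are only illustrative.

```python
SECTOR = 512  # bytes per logical sector


def max_bytes_32bit_lba(sector_bytes=SECTOR):
    """Largest capacity addressable with 32-bit sector numbers."""
    return 2**32 * sector_bytes


def sectors_needed(capacity_bytes, sector_bytes=SECTOR):
    """Number of logical sectors a drive of this capacity exposes."""
    return capacity_bytes // sector_bytes


limit = max_bytes_32bit_lba()
needed = sectors_needed(16 * 10**12)  # nominal 16 TB, assumed decimal

print(limit)           # 2199023255552 bytes, roughly 2.2 TB
print(needed)          # 31250000000 sectors
print(needed > 2**32)  # True: far beyond 32-bit addressing
```

This only shows that a 16 TB drive needs wide sector addressing. It says nothing about whether this particular firmware handles it correctly.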
The second point is subtler: sector size.
Your controller definitely supports drives with 512-byte sectors. Whether it supports 4K sectors is still an open question.
I have come across discussions of this on storage-related sites more than once, but never needed to dig deep into the topic. Our specialty, after all, is dead hardware.
If I remember correctly, 512e drives went up to 10 TB, but it would be worth confirming this from the datasheets. So if you get the chance to test this idea with other disks, try it (ideally with, for example, 12 TB drives).
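
For reference, the 512 versus 4K distinction is usually described with three labels: 512n, 512e and 4Kn. Here is a tiny Python sketch of how a drive's reported logical and physical sector sizes map to those labels. The function name is made up for illustration, and where you read the two sizes from (datasheet or a SMART tool) is left open.

```python
def sector_format(logical_bytes, physical_bytes):
    """Classify a drive by its logical and physical sector sizes."""
    sizes = (logical_bytes, physical_bytes)
    if sizes == (512, 512):
        return "512n"  # native 512-byte sectors
    if sizes == (512, 4096):
        return "512e"  # 4K physical sectors with 512-byte emulation
    if sizes == (4096, 4096):
        return "4Kn"   # native 4K sectors, no emulation
    return "unknown"


print(sector_format(512, 4096))   # 512e
print(sector_format(4096, 4096))  # 4Kn
```

A 512e drive still presents 512-byte sectors to the controller. A 4Kn drive does not, and that is the case where controller support becomes the question.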
The service center replied "the disk is working"

You can check the disks yourself with R.tester: https://rlab.ru/tools/rtester.html
and do so on the server. If anything nasty turns up, the details surrounding that moment can give food for thought.
The sector size can be viewed in the same place as the health of other disks, if necessary. R.tester, for example, can show SMART for SAS hard drives.

Cirick, 2022-02-07
@Cirick

I had the same thing with Seagate Exos drives, only SAS ones. They dropped out of the RAID about once a week. The problem was in the firmware of the disks themselves. Since I updated the firmware, 1.5 years have passed without a single dropout.
A friend of mine ran into the same problem, and updating the disk firmware helped him too.
