Server equipment
lossyy, 2020-11-12 17:54:20

How to determine the cause of HDD failure?

Hello!
Server: HP ML350 G6
Installed: 2 SSDs and 3 HDDs (600 GB, 3.5", 15,000 rpm).
HP P410 RAID controller with battery-backed cache.

The day before yesterday it turned out that the disk in slot 2 had Failed. During the rebuild I got a delayed write error. (I read that in some cases this is cured by turning off the disk's own cache and not using write-back, but write-back was not in use during the rebuild, only the cache on the RAID controller.)
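If I understood correctly, those cache settings can be checked and changed from hpacucli roughly like this; slot=0 and ld 1 are just assumptions for my setup, check "ctrl all show" first:

    # show controller and cache status
    hpacucli ctrl all show status
    hpacucli ctrl slot=0 show detail
    # turn off the physical disks' own write cache
    hpacucli ctrl slot=0 modify dwc=disable
    # turn off the controller cache (array accelerator) for logical drive 1
    hpacucli ctrl slot=0 ld 1 modify arrayaccelerator=disable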

Today we connected to iLO to look at the status of the disks and check for any system errors. At first disk 5 showed everything OK, and then it showed Degraded.
I went into the command line to query the arrays and found out through hpacucli that the disk is in Predictive Failure state.
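Roughly the commands I used for this (slot=0 is my controller slot, and the 2I:1:5 bay address is only an example, yours will differ):

    hpacucli ctrl all show config
    hpacucli ctrl slot=0 pd all show status
    # per-disk detail, including status and error counters
    hpacucli ctrl slot=0 pd 2I:1:5 show detail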

What are my thoughts:
1. The disks are simply dying of old age (although they were unpacked from sealed blister packs only a year ago when they were installed in the server, their year of manufacture is still 2013), or it's a factory defect (I don't really believe in a defect, since disk 2 was in the mirrored RAID holding the Proxmox system and disk 5 was used for backups, with about 40 GB of virtual machine data copied to it every day).

2. An employee of the company moved the server while it was running (turned it around to face himself) while iLO was connected and while photographing the insides of the server at my request. (I lean towards this option, but I could be wrong. The counter-argument is that laptop hard drives keep working while we carry them around, but then again those spin at 5,400 rpm rather than 15,000, and bad sectors do tend to show up after they have been moved.) We had a conversation and agreed that the server is no longer to be moved while powered on.

3. A hardware issue: the HP P410 controller is old, and the EF0600FARNA disks are intended for G8 servers while the caddies are for a G6, yet the server still worked for a year without problems. I read that the EF0600FARNA drives have several firmware revisions, and one of them fixes a failure related to bad-block reallocation. Once we solve the disk problem by replacing the drives with new ones, I will try flashing them through an HBA on another server (see the firmware check commands just below this list).
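If it helps anyone, the current firmware revision of the disks can apparently be read without reflashing, either through hpacucli or through smartmontools behind the controller; the slot number, bay address, cciss index and /dev/sg0 device node here are assumptions for my layout and depend on the driver:

    # firmware revision is listed in the per-disk detail
    hpacucli ctrl slot=0 pd 2I:1:5 show detail
    # or via smartctl for the N-th disk behind the Smart Array (here N=4)
    smartctl -i -d cciss,4 /dev/sg0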

What do you think caused the disks to start failing - has their time simply come? It is strange that they are failing so unevenly.


1 answer(s)
lossyy, 2020-11-13
@lossyy

Thanks to my friend Vlad, who figured out that it was overheating. According to the drive statistics, they heated up to a maximum of 56 °C, which is already critical for these drives. Their layout is as follows:
1 - HDD 600
2 - HDD 600
3 - SSD
4 - SSD
5 - HDD 600
6 - Empty
So disk 2 failed and disk 5 is on its way out - they are the ones closest to the hot SSDs.
We will arrange additional airflow for the server. Thank you all for your attention to the problem.
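In case it is useful to someone, the per-disk temperatures can be pulled from hpacucli as well; slot=0 is an assumption for my controller, and the exact field names may vary with firmware if the controller reports them at all:

    # current and maximum temperature appear in the per-disk detail
    hpacucli ctrl slot=0 pd all show detail | grep -i temperature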
