Xen 4.1: filesystem crash on the VM storage causes errors when accessing the root fs. How can this be explained?
The situation is as follows: there is an office server running Xen (Debian). It suffered three power failures in a row. After the first failure, the RAID5 array holding the virtual machines degraded, but not completely: it stayed in read-write mode and began to resynchronize slowly. After the second failure, the filesystem on that array (ext3, data=ordered) got corrupted. After the third power failure (during the RAID resync), the filesystem stopped mounting.
Eventually we stabilized the power supply, resynchronized the RAID5, and ran fsck both on this filesystem and on the root (just in case). The root had no errors; on the VM filesystem, fsck fixed everything it found. Then we rebooted.
After the reboot, strange things began to happen. If the virtual machines are not started, the system works like clockwork. But once the virtual machines are started, the system starts throwing disk errors (on the VM filesystem), and after 3-4 errors there, errors start pouring in on the root filesystem as well.
For example, we do:
# apt-cache search linux-image
- it crashes (an error inside the kernel) and apt-cache terminates.
The same happens with most binaries: apt-*, xm, dd, aptitude... The strange part is that after a reboot, as long as the virtual machines are not started, everything runs like clockwork again.
There are two RAID5 arrays in the system (on partitions sda1+sdb1+sdc1 and sda2+sdb2+sdc2, across three SATA drives).
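For reference, a healthy md array shows all `U`s in its member map in /proc/mdstat, while an underscore marks a missing member. A minimal sketch of that check, using a made-up snapshot (the array names and sizes below are fabricated; on the live box you would simply read the real /proc/mdstat):

```shell
# Fabricated example output; on a real system this comes from: cat /proc/mdstat
mdstat='md0 : active raid5 sda1[0] sdb1[1] sdc1[2]
      976512 blocks level 5, 64k chunk [3/3] [UUU]
md1 : active raid5 sda2[0] sdb2[1] sdc2[2](F)
      1953024 blocks level 5, 64k chunk [3/2] [UU_]'

# Print the names of arrays whose member map contains "_" (degraded arrays).
echo "$mdstat" | awk '/^md/ {name=$1} /\[[U_]+\]$/ && /_\]/ {print name}'
# prints: md1
```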
System: Linux lynx 3.2.0-4-amd64 #1 SMP Mon Jul 23 02:45:17 UTC 2012 x86_64 GNU/Linux
What could be the catch? What can cause errors on a different filesystem, and how can this be fixed?
I found my own old question and will write the answer myself. The only cause of this behavior was that the earlier power outages triggered an electrical fault in one of the HDDs. This drove the SATA controller crazy, and it started behaving erratically. We replaced the drive, and we no longer use those controllers; we now buy new Adaptec ones, which don't have this problem.
I've seen similar issues caused by bad RAM. Can you run memtest on the server?
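A quick sketch of the two usual ways to do that on a Debian box (package names are the stock Debian ones; 512M is an arbitrary test size; the dry-run wrapper is only there so the sketch is safe to paste, set DRY_RUN=0 to actually execute):

```shell
# Dry-run wrapper: with DRY_RUN=1 (the default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

# Offline test (most thorough): memtest86+ adds a GRUB menu entry to boot into.
run apt-get install memtest86+
# Online test without rebooting: memtester locks and exercises a chunk of RAM.
run apt-get install memtester
run memtester 512M 1   # test 512 MB for one pass; needs root
```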
What are the exact error messages (the dmesg output is the most interesting)?
I second the request for the error log.
Do I understand correctly that the whole system lives on two RAID arrays: the root on one, the virtual machines on the other? Perhaps the power outages damaged the part of the disk(s) where the virtual machines are stored, so that when it is accessed the hard disk controller goes crazy and starts returning errors across the board, including on the healthy part of the disk(s).
Do you have a spare hard drive on hand after an accident like this? If not, that's very rash. Buy one urgently, replace the "combat" disks one at a time, and scan each of them with badblocks.
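The replace-one-at-a-time procedure can be sketched roughly as below. This is a sketch under assumptions, not a recipe: the device names (/dev/md1, /dev/sdc2, /dev/sdc) are examples only, and the dry-run wrapper just prints the commands unless you set DRY_RUN=0.

```shell
# Dry-run wrapper: with DRY_RUN=1 (the default) commands are only printed.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

run mdadm /dev/md1 --fail /dev/sdc2      # mark the suspect partition as failed
run mdadm /dev/md1 --remove /dev/sdc2    # detach it from the array
# ...power down, swap the physical disk, recreate the partitions, then:
run badblocks -sv /dev/sdc               # read-only surface scan of the new disk
run mdadm /dev/md1 --add /dev/sdc2       # re-add it; md resyncs automatically
run cat /proc/mdstat                     # check the resync progress
```

Doing this one disk at a time matters: with RAID5 the array survives only a single missing member, so a second disk must not be pulled until the previous resync has finished.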