linux
homecreate, 2015-04-09 07:22:09

How to make a QEMU/KVM VM cluster safe for VMs?

Hello everyone,
Let's say we have a cluster on which we plan to run virtual machines using qemu-kvm. Naturally, this requires some kind of shared storage to hold the images. If a host goes down (or dies), the cluster software detects this and restarts all of that host's virtual machines elsewhere. The question is this: if the host died, then the data did not have time to be written to the images correctly, right? And when we restart the virtual machine, we will get a corrupted filesystem, right? Worse still, if STONITH is implemented correctly, then when only a network interface fails (say, the cleaner unplugged a cable) while the SAN link remains operational, the node will be brutally shot in the head.
How can this situation be avoided?
Thanks in advance.


3 answer(s)
dyasny, 2015-07-04
@homecreate

Each situation has to be considered separately.
1. The host is working, but the control network has gone down and the host cannot be reached. In this case either STONITH fires, which for a virtual machine is no different from a hard reset of the hardware, or nothing happens until the admin restores the network by hand (this depends on the settings). A typical failover cluster essentially reduces every failure to "the hardware died, restart the services on another host", and a virtual machine that cannot survive a reset without serious losses is a badly built one.
2. The host crashed and the virtual machine was restarted on another host. In principle, it suffered no more than it would have on the same crashed hardware without a cluster, and it got an automatic restart on top. In short, a net win, but HA is still not FT.
3. The storage failed: it ran out of space, the fabric failed, and it does not matter whether the fault is on the host side, the storage side, or in the switches. This covers any problem that produces an error when reading or writing the virtual disk (EIO, ENOSPC in kernel terms). In this case qemu-kvm immediately pauses the VM so that it does not generate more I/O and further failures. In-flight I/O is therefore frozen rather than lost. We fix the storage, unpause the VM, and it continues as if nothing had happened.
By the way, #3 is the main reason to use NFS hard mounts for virtual machines: problems with disk access surface at the hypervisor immediately instead of being absorbed by a buffer.
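The pause-on-error behaviour from point 3 is normally controlled per disk. As a minimal sketch, assuming it is configured through the `werror` and `rerror` sub-options of QEMU's `-drive` option, the command line could be assembled like this. The image path, memory size, and the helper names `drive_option` and `qemu_command` are hypothetical and only serve the illustration.

```python
# Sketch: build a qemu-kvm command line whose disk stops (pauses) the
# guest on read/write errors instead of passing the error to the guest.
# Paths, sizes, and helper names are illustrative assumptions.

WERROR_ACTIONS = {"ignore", "report", "stop", "enospc"}
RERROR_ACTIONS = {"ignore", "report", "stop"}


def drive_option(image_path, werror="stop", rerror="stop", fmt="qcow2"):
    """Return the value for QEMU's -drive option as a single string."""
    if werror not in WERROR_ACTIONS:
        raise ValueError(f"unsupported werror action: {werror}")
    if rerror not in RERROR_ACTIONS:
        raise ValueError(f"unsupported rerror action: {rerror}")
    # A literal comma inside a -drive value has to be doubled.
    escaped = image_path.replace(",", ",,")
    return (
        f"file={escaped},format={fmt},if=virtio,"
        f"werror={werror},rerror={rerror}"
    )


def qemu_command(image_path, memory_mb=2048):
    """Return the argv list for a KVM guest using the drive above."""
    return [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", str(memory_mb),
        "-drive", drive_option(image_path),
    ]


if __name__ == "__main__":
    print(" ".join(qemu_command("/mnt/shared/vm1.qcow2")))
```

Once the storage is fixed, a guest paused this way can be continued with the `cont` monitor command, or with `virsh resume` when the VM is managed by libvirt.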

Armenian Radio, 2015-04-09
@gbg

The "host died" scenario for a virtualization server is an unlikely event (about as likely as a forced shutdown of a regular server), so keep the cleaner out of the server room and there will be no problems.

Puma Thailand, 2015-04-09
@opium

What else is a cluster for?
In general, in the cloud it is assumed that restarting an instance is not a problem: if the filesystem got damaged, you run a check and carry on; if it does not come back up, you launch a new instance and restore onto it from a script or a backup of the old one.
