How can I monitor, from inside a Linux VM, whether the hypervisor allocates CPU to it without interruptions?

linux

pi314, 2014-10-22 18:07:05

The situation is as follows: Java software runs under Debian, which runs in a virtual machine (presumably under Xen). The software logic is sensitive to timeouts on the order of one or two seconds (hardware control via sockets, a large number of connections with watchdogs). On the test bench, everything works fine. In production, at one client, the system periodically acts up: many (but not all!) connections drop for no apparent reason. Dropping and restoring connections is, in principle, handled correctly by the software, but I really want to get to the bottom of this, because for the whole system to operate normally the phenomenon must be eliminated completely or at least reduced to a controlled minimum.
To this end, everything possible has been monitored for a long time, from pings between components to the load on the switches, PoE continuity, and so on. According to the results, the network can in principle already be ruled out as the cause, and suspicion falls on the hypervisor. It is the only component to which we have no reasonable access (it is operated by the client, who lets no one near it under any circumstances).
The working hypothesis is that the hypervisor ~~is shortchanging the tigers on their meat~~ temporarily stops allocating CPU to our virtual machine (bursts from other VMs?), which leads to watchdog timeouts, and our software, on "waking up", starts restoring connections that have not actually dropped. The hypothesis is, of course, a bold one, but it is the only way we have so far been able to reproduce the situation on the test bench. Naturally, requests to the local admins end with the answer: "No, we don't know anything, everything is fine on our side."
Hence the actual question: has anyone come across tools that can be used in Linux to log, from inside the guest itself, the moments when the hypervisor fails to give the system CPU time? Of course, I could write one myself... However, if anyone has come across such things, I would be extremely grateful for advice or at least a kick in the right direction.
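For illustration, a self-written monitor of the kind mentioned above could be as simple as a loop that sleeps for a short, fixed interval and logs every wake-up that arrives much later than requested. This is only a minimal Python sketch under stated assumptions: the name `watch`, the 50 ms interval and the 0.5 s threshold are hypothetical, and it assumes the guest's monotonic clock keeps advancing while the vCPU is descheduled. A late wake-up can also be caused by load inside the guest, so this shows that a stall happened, not who caused it.

```python
import time


def watch(interval=0.05, threshold=0.5, iterations=None,
          clock=time.monotonic, sleep=time.sleep, report=print):
    """Sleep in short steps and record every wake-up that is late.

    interval   -- requested sleep per step, in seconds (assumed value)
    threshold  -- lateness in seconds that counts as a stall (assumed value)
    iterations -- number of steps to run, or None to run until interrupted
    clock, sleep, report -- injectable so the loop can be tested offline

    Returns the list of observed lateness values, in seconds.
    """
    stalls = []
    previous = clock()
    count = 0
    while iterations is None or count < iterations:
        sleep(interval)
        now = clock()
        # Time that passed beyond the sleep we asked for.
        late = (now - previous) - interval
        if late > threshold:
            stalls.append(late)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            report(f"{stamp} stall: woke up {late:.3f}s late")
        previous = now
        count += 1
    return stalls
```

Calling `watch()` with no arguments loops until interrupted and prints one line per detected stall, which can then be matched against the timestamps of the dropped connections.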

5 answer(s)

pi314, 2014-12-05
@pi314

Problem solved, hypothesis confirmed. To answer the question directly: the easiest way to monitor is with:
top shows the same value (the rightmost one in the %Cpu(s) line, "0.0 st", where "st" stands for steal time), but it is more tedious to extract it from there.
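The steal value that top displays comes from the aggregate `cpu` line of `/proc/stat`, so it can also be logged by a small script. Below is a minimal Python sketch, assuming the hypervisor reports steal time to the guest at all; the function names, the 1-second interval and the 5% threshold are illustrative assumptions, not part of any existing tool.

```python
import time


def parse_cpu_line(stat_text):
    """Return (steal_ticks, total_ticks) from the aggregate 'cpu' line.

    Field order after the 'cpu' label: user, nice, system, idle, iowait,
    irq, softirq, steal, guest, guest_nice.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            steal = values[7] if len(values) > 7 else 0
            # guest and guest_nice are already counted inside user and
            # nice, so only the first eight counters are summed.
            total = sum(values[:8])
            return steal, total
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Share of CPU time stolen between two (steal, total) samples."""
    steal_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    return 100.0 * steal_delta / total_delta if total_delta > 0 else 0.0


def read_proc_stat(path="/proc/stat"):
    with open(path) as f:
        return f.read()


def monitor(interval=1.0, threshold=5.0):
    """Log a line whenever steal time reaches the threshold (in percent)."""
    before = parse_cpu_line(read_proc_stat())
    while True:
        time.sleep(interval)
        after = parse_cpu_line(read_proc_stat())
        pct = steal_percent(before, after)
        if pct >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} steal {pct:.1f}%", flush=True)
        before = after
```

A short sampling interval matters here: a stall of a second or two is averaged away if the counters are compared only once a minute.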
Many thanks to everyone for the advice and suggestions!

Vladimir, 2014-10-22
@rostel

Possibly by monitoring the timer:
wiki.xen.org/wiki/Xen_power_management#HPET_as_bro...
but it is disabled by default.
Nothing can be measured without a reference standard.

Puma Thailand, 2014-10-23
@opium

Just tell the client that the problem is in their virtualization and that they and their specialists have to solve it on their side; that is completely normal. What you are trying to do is ineffective nonsense: every problem has its own tool.
The most common cause is backups: when a backup runs at the virtualization level, all the virtual machines lag heavily. For websites that are loaded only during the day, with the backup running at night, that does not matter, but for your system it is fatal. The second common cause is a greedy neighbor: a neighboring virtual machine has eaten all the resources, and your system has run into timeouts.

Ergil Osin, 2014-10-23
@Ernillew

> presumably under XEN
To check which hypervisor it actually is, use the lscpu command.
If it is Xen, you will see the line
Hypervisor vendor: Xen
If it is KVM, then
Hypervisor vendor: KVM
If it is VirtualBox (it does happen in production, I have seen it myself), then, alas, you will see nothing about the hypervisor. Once you have determined who is shortchanging you on the meat, things will get easier.
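If this check has to run on many machines or be written to a log, the lscpu output can be parsed by a script. A minimal Python sketch, assuming lscpu is installed and prints the "Hypervisor vendor" line as shown above; the function names are hypothetical.

```python
import os
import subprocess


def hypervisor_vendor(lscpu_output):
    """Return the value of the 'Hypervisor vendor' line, or None if absent."""
    for line in lscpu_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Hypervisor vendor":
            return value.strip()
    return None


def detect_hypervisor():
    """Run lscpu with a fixed locale so the field names are not translated."""
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(["lscpu"], capture_output=True, text=True,
                            check=True, env=env)
    return hypervisor_vendor(result.stdout)
```

`detect_hypervisor()` returns `None` when lscpu prints no such line, which matches the VirtualBox case described above.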