Abnormal jbd2 activity when running out of space on ext4 partition?

linux

igortiunov, 2016-01-21 23:22:50

Hello, friends.
The problem in short:
On RHEL 6, a process that writes many small records to text (log) files fills up the partition. At that moment, IOPS spike on the disk that ran out of space, and jbd2 is the top process in iotop.
Problem description:
Many virtual machines (VMware ESXi) run Apache Tomcat for development. At some point, developers turn on debug logging on Tomcat to hunt for their bugs and then, whether in sadness and hopelessness or in joy over the bug they found, leave the server to its own devices.
Eventually the virtual machine's ext4-formatted partition runs out of space (filled with Apache text logs). That would be fine, except that at this very moment the Write Rate (KBps) counter on the machine's virtual disk jumps and stays pegged at about 20 MBps (about 640 KBps in normal operation). In the guest OS, iotop shows the jbd2 process at the top with an IO value of 100%.
Question:
I want to understand the mechanism behind this behavior. What happens to the journal in this scenario: why does the load on the full disk not drop, but instead increase tenfold?
Guest OS: RHEL 6.x (3 <= x <= 6)

2 answer(s)

igortiunov, 2016-01-31
@igortiunov

While studying the documentation, I found that when ENOSPC (No space left on device) is reached, the ext4 file system driver changes its behavior as follows:
1. It disables delayed block allocation (www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-...).
This is most likely the reason for the increase in write activity (given the current workload: many small writes).
2. It forces a journal commit in the hope that some blocks will be freed.
This is most likely the reason for the jbd2 activity in iotop.
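
One rough way to check the second point is to sample the jbd2 transaction counter and see how fast it grows while the partition is full. The sketch below assumes the kernel exposes /proc/fs/jbd2/<device>/info and that the first line of that file begins with the number of transactions; the function names are hypothetical, made up for this illustration.

```shell
# Print the transaction count from a jbd2 info file (first field of line 1).
# Assumption: the file looks like /proc/fs/jbd2/<device>/info, whose first
# line begins with the number of transactions so far.
jbd2_tx_count() {
    awk 'NR == 1 { print $1; exit }' "$1"
}

# Sample the counter twice and print transactions per second (integer).
# $1 = info file, $2 = interval in seconds (default 5, must be > 0).
jbd2_commit_rate() {
    info=$1
    interval=${2:-5}
    before=$(jbd2_tx_count "$info")
    sleep "$interval"
    after=$(jbd2_tx_count "$info")
    echo $(( (after - before) / interval ))
}
```

For example, `jbd2_commit_rate /proc/fs/jbd2/sda1-8/info 5`, where `sda1-8` is a placeholder for whatever directory name appears under /proc/fs/jbd2 on the machine. Comparing the rate before and after the partition fills up should show whether commits really become more frequent.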

Ilya, 2016-01-28
@mirspo

Most likely it is a loop (deadlock): a critical process that cannot be stopped tries to write its data, but there is no space, so it goes into a loop, and perhaps it triggers something else along the way, for example a flush. For an exact answer you would have to dig into the ext4 code. You can try to find the culprit with tracing:
mount -t debugfs none /sys/kernel/debug
echo 1 >/sys/kernel/debug/tracing/events/ext4/ext4_journal_start/enable
echo 1 >/sys/kernel/debug/tracing/events/jbd2/jbd2_run_stats/enable
cat /sys/kernel/debug/tracing/trace_pipe
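
The raw trace_pipe stream can be hard to read under load, so it may help to count events per task. This is a minimal sketch, assuming each event line starts with a TASK-PID field and header lines start with `#`; task names containing spaces would be miscounted. The function name is hypothetical.

```shell
# Count trace events per task from ftrace text output read on stdin.
# Assumption: each event line starts with a "TASK-PID" field, and header
# lines start with "#". Prints "COUNT TASK-PID", busiest task first.
count_events_by_task() {
    awk '!/^#/ && NF { n[$1]++ } END { for (t in n) print n[t], t }' | sort -rn
}
```

For example, `head -n 10000 /sys/kernel/debug/tracing/trace_pipe | count_events_by_task` summarizes the first 10000 events; the task at the top of the list is the most likely candidate for the culprit.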