Performance sag mariadb, linux. Why is the Load Average growing to 30-40?

Y

Yuriyant2019-07-30 07:32:51

linux

Yuriyant, 2019-07-30 07:32:51

Hello.
There is a server with an 8-core CPU, RAM 32GB, 2 SSD RAID 1 (mdadm, lvm), Centos 7.
Database: 10.3.16-MariaDB. The project on Bitrix is running on the server.
The average load on the database is from 1 to 4 thousand requests per second. With such a load, the server worked for ~ 4 months. and the load average of the server was about 0.7-2. mysqld load ranged from 50% to 200%.
But something went wrong , performance began to degrade dramatically.
Now the base daemon loads the system by top by 400-600%, and the load average reaches 40. And such a load average can be both at 400% and at 600% of the load given by mysqld.
According to observations, there are moments and performance does not return for a long time. With equal database metrics for count. requests, the load may return to normal for some time up to load average: 1-2.
Restarting the server or mysqld has no effect.
Spent a few days looking for a relationship between database and operating system metrics in order to find a relationship in the deviations, but so far could not.
Theoretically, it may not even be a database, hardware, or something else.
Initially, I thought that something might be wrong with the disks, but trim is enabled, according to atop, the disk system is loaded by 3-7%.
SSD model

Model Family:     Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model:     INTEL SSDSC2KB240G8

Monitoring data for disk latency

Read: sda  	min: 0.13 ms,	max: 30.04 ms, avg: 0.66 ms
Read: sdb 	min: 0.01 ms,	max: 27.37 ms, avg: 0.90 ms
Write: sda	min: 0.09 ms,	max: 8.33 ms,	avg: 1.34 ms
Write: sdb	min: 0.09 ms,	max: 10.48 ms,  avg: 1.41 ms

Base settings

[mysqld]
key-buffer-size = 32M
long_query_time=1
performance_schema=on
read_rnd_buffer_size=128K
sort_buffer_size=1M
read_buffer_size=128K
join_buffer_size=512K
max_allowed_packet=64M
sql_mode=
transaction-isolation=READ-COMMITTED
innodb_io_capacity = 3000
innodb_io_capacity_max = 6000
innodb_read_io_threads = 64
innodb_write_io_threads = 64
innodb_thread_concurrency=10
innodb_lru_scan_depth=100
open_files_limit = 65535
table_open_cache=5000
query_cache_min_res_unit=3000
thread_cache_size=500
innodb_flush_method=O_DIRECT
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit=2
innodb_log_buffer_size=12M
query_cache_type=1
query_cache_size=128M
query_cache_limit=1M
query_cache_min_res_unit=3500
tmp_table_size=1024M
max_heap_table_size=1024M
innodb_buffer_pool_size=15G
innodb_log_file_size=1G
innodb_log_files_in_group=3
innodb_buffer_pool_instances=15

System limits ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127750
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

SHOW ENGINE INNODB STATUS

=====================================
2019-07-30 06:57:50 0x7f9e40f63700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 1 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 1621 srv_active, 0 srv_shutdown, 1 srv_idle
srv_master_thread log flush and writes: 1622
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 320574
--Thread 140317647853312 has waited at btr0sea.cc line 1476 for 0.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x55fd09116700 created in file btr0sea.cc line 243
a writer (thread id 140317647853312) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffff
Last time write locked in file btr0sea.cc line 1557
--Thread 140317668058880 has waited at btr0sea.cc line 934 for 0.00 seconds the semaphore:
S-lock on RW-latch at 0x55fd09116700 created in file btr0sea.cc line 243
a writer (thread id 140317647853312) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffff
Last time write locked in file btr0sea.cc line 1557
--Thread 140317670516480 has waited at btr0sea.cc line 934 for 0.00 seconds the semaphore:
S-lock on RW-latch at 0x55fd09116700 created in file btr0sea.cc line 243
a writer (thread id 140317647853312) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffff
Last time write locked in file btr0sea.cc line 1557
--Thread 140317639866112 has waited at btr0sea.cc line 934 for 0.00 seconds the semaphore:
S-lock on RW-latch at 0x55fd09116700 created in file btr0sea.cc line 243
a writer (thread id 140317647853312) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 1, lock_word: ffffffff
Last time write locked in file btr0sea.cc line 1557
OS WAIT ARRAY INFO: signal count 10122942
RW-shared spins 0, rounds 40759553, OS waits 237147
RW-excl spins 0, rounds 994262, OS waits 71533
RW-sx spins 220, rounds 1828, OS waits 40
Spin rounds per wait: 40759553.00 RW-shared, 994262.00 RW-excl, 8.31 RW-sx
------------
TRANSACTIONS
------------

.......
ROW OPERATIONS
--------------
10 queries inside InnoDB, 1 queries in queue
10 read views open inside InnoDB
Process ID=9541, Main thread ID=140317842663168, state: sleeping
Number of rows inserted 17731, updated 22665, deleted 12861, read 5275053684
2.00 inserts/s, 19.98 updates/s, 0.00 deletes/s, 4063602.40 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

mysqltuner. He recommends disabling the cache, but that doesn't help. Bitrix CMS has a lot of subrequests in requests that are not particularly affected by the application and works faster with the cache.

-------- Recommendations ---------------------------------------------------------------------------
General recommendations:
    Control warning line(s) into /var/log/mariadb/mysql_error.log file
    Control error line(s) into /var/log/mariadb/mysql_error.log file
    MySQL was started within the last 24 hours - recommendations may be inaccurate
    Increasing the query_cache size over 128M may reduce performance
    Adjust your join queries to always utilize indexes
    Consider installing Sys schema from https://github.com/mysql/mysql-sys
Variables to adjust:
    query_cache_size (=0)
    query_cache_type (=0)
    query_cache_size (> 128M) [see warning above]
    join_buffer_size (> 512.0K, or always use indexes with JOINs)

System metrics for 12 hours.

Statistics on mariadb cache usage for 12 hours. I tried and disable it - the result was getting a little worse. The cache on this project still gives a performance boost.

Statistics by query types

Innidb metrics

According to the log of slow queries, of course, I did an analysis, one of the long queries is about 80% of slow queries, but disabling it does not affect the performance and load on the server.
And this is what my atop looks like for the last few days.

Unfortunately, advanced monitoring was configured after the performance sag and there is no data on the server at its best time.
But there is a schedule for the CPU.

Update 30.07.2019 16:00
Transferred the database to another server, the situation has not changed much.
Problems with iron are canceled, apparently requests affect the load, but I haven’t figured out why at different points in time in different ways.
Most likely, requests began to affect innodb metrics:
RW Locks S OS Waits, RW Locks X OS Waits, RW Locks S Spin Rounds, RW Locks X Spin Rounds.
I would be grateful for any help or advice in finding the reasons for this performance failure.

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

Y

Yuriyant, 2019-08-01
@yuriyant

Thanks everyone for the help and advice!
Two manipulations were made at once to help solve the problem. Which of them solved the problem, I can’t say for sure.
Bitrix has a table with catalog discount coupons, we had 300 thousand coupons in it. Since it is not possible to change the request due to its systemic nature and updating Bitrix will overwhelm everyone, tests were carried out with the data. Removing 200 thousand old coupons from the table made it possible to speed up a slow query from ~1.3 sec. up to 0.01 sec. There are 60 thousand coupons left in the table.
We will finalize the automatic coupon cleaning system and write an appeal to Bitrix with a request to optimize the request. Although the slow request, according to the monitoring data, worked out ~ 3.7 times per second, but this did not prevent it from choking the processor.
The second event is the transition from mariadb (10.1, 10.3 - both versions were tested for several days) to percona 8. The assumption fell on a database problem with our data set and queries, there was a thought, but suddenly I know in percon and eliminated the error.
I still believe that the key factor here was Vitaly Karasik pointed out to a slow request.
Apparently, on our data, this request caused locks and high CPU load due to an increase in the values of the RW Locks S OS Waits, RW Locks X OS Waits, RW Locks S Spin Rounds, RW Locks X Spin Rounds metrics. There were more coupons and at some point in time their number. became critical for the request.
Floating LA, independent of the number of queries to the database, turned out to be the influence of calling cron jobs on uploading catalogs to the Yandex Market. I will bind the display of the tasks of Biriks agents (task crons) on the charts in order to more clearly build connections between the load and the application in the future.
UPDATE
08/09/2019 The database was returned to mariadb 10.3. It turned out one more request that increases the load on the CPU. We have about 1 million sessions in the database. So Bitrix in the Sale\Fuser::getIdByUserId($ID) method for getting one record does not set a limit and also sorts the entire selection. If FALSE is passed, then in our case it sorted about 1 million records + added 1 more record there each time. An appeal to Bitrix has been sent, they have accepted it and, apparently, an update will be released soon.
Who needs a solution now - change the method right in the Sale\Fuser::getIdByUserId core in this way.

$res = FuserTable::getList(array(
                       'filter' => array(
                               'USER_ID' => $userId
                       ),
                       'select' => array(
                               'ID'
                       ),
                       'limit' => 1, // Добавить лимит
                       // 'order' => array('ID' => "DESC") // Убрать эту сортировку
               ));

M

Mike, 2019-07-30
@skvarovski

a lot of interesting things, but I did not see the type of table engines.
maybe somewhere there is a MyISAM table that is blocked, and in the first line key-buffer-size = 32M is written incorrectly, it just affects the speed of MyISAM tables, with this configuration you can set 1G
key_buffer_size = 1G
myisam_sort_buffer_size = 128M
myisam_max_sort_file_size = 10G
myisam_max_extra_sort_file_size = 10G