How to estimate the critical load on the disk subsystem?
I work with a service that constantly reads from and writes to disk. It is built on Elasticsearch and is currently running in test mode, so it has not yet seen anything close to production load.
The question of how to monitor the load on the disk subsystem has come up.
Collecting the load figures is easy; interpreting them is hard. I have little experience with disks, RAID, and storage, so I am not quite sure which way to go.
The setup: RAID 10 on a PERC 6/i Integrated controller, SEAGATE drives, CentOS 7, XFS.
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3600057SS
Revision: ES66
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5007ece257f
Serial number: 6SL9L37D
Device type: disk
Transport protocol: SAS
Local Time is: Thu Sep 3 15:39:39 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
Average access time per disk 3.4 ms read, 3.9 ms write
Average latency 2 ms
Take the data for the last 10 minutes, for example:
summary: 19 io/s, read 640 sectors (0kB/s), write 284120 sectors (236kB/s) in 600 seconds
Performance Data: tps=19io/s; read=546b/s; write=242449b/s;
The information is taken from here: /sys/block//stat
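For context, here is a minimal sketch of how such a summary can be computed from two samples of those counters. The device name is a placeholder, and note that the kernel's sector counters are always in 512-byte units regardless of the logical block size:

```python
#!/usr/bin/env python3
# Derive io/s and B/s figures like the summary above from two samples
# of /sys/block/<dev>/stat taken 10 minutes apart.
import time

DEV = "sda"       # placeholder device name; adjust to your system
INTERVAL = 600    # 10 minutes, as in the example above
SECTOR = 512      # the kernel counts sectors in 512-byte units

def stat(dev):
    with open(f"/sys/block/{dev}/stat") as f:
        return [int(x) for x in f.read().split()]

a = stat(DEV)
time.sleep(INTERVAL)
b = stat(DEV)

reads, writes = b[0] - a[0], b[4] - a[4]     # completed I/Os
rd_sec, wr_sec = b[2] - a[2], b[6] - a[6]    # sectors transferred

print(f"{(reads + writes) / INTERVAL:.0f} io/s, "
      f"read {rd_sec} sectors ({rd_sec * SECTOR // INTERVAL} B/s), "
      f"write {wr_sec} sectors ({wr_sec * SECTOR // INTERVAL} B/s)")
```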
It turns out that since I have RAID 10, one read IOP is one real IOP, while one write IOP actually turns into two IOPs, because each write goes to both mirrors.
When should I start worrying that the disk load is growing? Judging by the graphs, the load has never exceeded 40 IOPS. How many sectors per second is normal, and how many is already bad?
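For a rough sense of scale, here is a back-of-the-envelope sketch of the array's theoretical random-IOPS ceiling, using the per-disk access times quoted above (the 2 ms average latency matches the rotational latency of a 15000 rpm drive). The number of disks is an assumption, since it is not stated:

```python
# Rough IOPS ceiling from the per-disk figures quoted above.
N_DISKS = 4            # assumption: the array size is not stated
SEEK_READ_MS = 3.4     # average read access time per disk
SEEK_WRITE_MS = 3.9    # average write access time per disk
ROT_LATENCY_MS = 2.0   # average rotational latency (15000 rpm)

per_disk_read = 1000 / (SEEK_READ_MS + ROT_LATENCY_MS)    # ~185 IOPS
per_disk_write = 1000 / (SEEK_WRITE_MS + ROT_LATENCY_MS)  # ~169 IOPS

# RAID 10: any disk can serve a read; every write lands on two disks.
print(f"read ceiling  ~{per_disk_read * N_DISKS:.0f} IOPS")
print(f"write ceiling ~{per_disk_write * N_DISKS / 2:.0f} IOPS")
```

Against a ceiling like that, the 40 IOPS observed so far is nowhere near saturation.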
I would not want to find out after the fact, when the disks choke and everything hangs waiting for I/O to complete.
What iowait values should be alarming?
I check the disks for SMART errors; is that enough to tell when a disk is starting to fail and it is time to replace it?
I would like to understand disk subsystem monitoring and tuning, and I would be grateful for any links, explanations, and literature.
With iowait, not everything is so simple; it is worth finding a good explanation of how it actually works before relying on it.
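If you want to watch iowait yourself, a minimal sketch that computes the iowait percentage over a one-second window from the standard counters in /proc/stat:

```python
#!/usr/bin/env python3
# Compute the iowait percentage over a 1 s window from /proc/stat.
import time

def cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

a = cpu_times()
time.sleep(1)
b = cpu_times()

delta = [after - before for before, after in zip(a, b)]
# iowait is the 5th field (index 4); divide by total ticks elapsed
print(f"iowait: {100 * delta[4] / sum(delta):.1f}%")
```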
If you have monitoring, I would focus on the await/svctm values from iostat. Look up the random-access time the disk manufacturer quotes (usually 3-5 ms) and treat values around that as acceptable.
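As an illustration, await is just the average time per completed request, derived from the same kernel counters iostat reads. A sketch of computing it by hand (the device name is a placeholder):

```python
#!/usr/bin/env python3
# Compute await (average ms per completed I/O) over a 1 s window
# from /sys/block/<dev>/stat -- the same counters iostat uses.
import time

DEV = "sda"  # placeholder device name; adjust to your system

def stat(dev):
    with open(f"/sys/block/{dev}/stat") as f:
        return [int(x) for x in f.read().split()]

a = stat(DEV)
time.sleep(1)
b = stat(DEV)

ios = (b[0] - a[0]) + (b[4] - a[4])      # reads + writes completed
ticks = (b[3] - a[3]) + (b[7] - a[7])    # ms spent on reads + writes
print(f"await: {ticks / ios:.1f} ms" if ios else "no I/O completed")
```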
As for the amount of data read, it is generally impossible to say what volume is normal, especially under a mixed load. Here you should probably pay attention to utilization, but be careful with that metric as well.
In Linux, the most obvious disk utilization indicator is the time the device has spent with I/O in flight (the io_ticks counter). If you sample it every second, the difference between consecutive readings shows how many milliseconds out of that second the disk was busy; divided by 1000, the value goes from 0 to 1.
It lives in /sys/block/sdX/stat (the meaning of all these numbers is described in the kernel source Documentation).
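A minimal sketch of measuring utilization this way (again, the device name is a placeholder):

```python
#!/usr/bin/env python3
# Estimate disk utilization from the io_ticks counter in
# /sys/block/<dev>/stat: milliseconds the device had I/O in flight.
import time

DEV = "sda"  # placeholder device name; adjust to your system

def io_ticks_ms(dev):
    # io_ticks is the 10th field (index 9) of the stat file
    with open(f"/sys/block/{dev}/stat") as f:
        return int(f.read().split()[9])

prev = io_ticks_ms(DEV)
while True:
    time.sleep(1)
    cur = io_ticks_ms(DEV)
    # ms busy during a 1 s window, scaled to a 0..1 fraction
    print(f"{DEV} utilization: {(cur - prev) / 1000:.2f}")
    prev = cur
```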
For everyday use, if there are only a few block devices, plain atop is enough (give it 11-12 seconds to settle), and it will show disk utilization.
If there are a lot of block devices and they do not fit into the atop output, I wrote myself a simple top for block devices: https://github.com/amarao/blktop
If you need to collect these metrics automatically, the usual monitoring applications (munin or ganglia, for example) have modules that gather this information.
The best explanation I have seen on this topic is "How to properly measure disk performance" by amarao. Invite him; he can answer your questions if he wants :)