How to periodically monitor memory errors (EDAC, ECC) in Linux (is there a comprehensive solution for monitoring server health)?


linux

gremlintv2, 2019-04-29 23:38:17

Hello, I came across another monitoring task:
I need to catch memory errors with a script (or a service) once an hour and report them to an alert channel (mail, messengers, etc.).
What solutions exist for this?
I found this article, but for some reason its script complains about a missing integer value. (Maybe that is expected.)
In general, I am looking for a comprehensive solution for monitoring server hardware, with metrics sent to Prometheus and monitoring (plus partial alerting) through Grafana, but so far almost everything is self-written:

Temperature (node_exporter)
HDD/SSD (smartmontools + script)
NVMe (nvme-cli + script)
RAM (next in the queue)
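
For the RAM item, here is a minimal sketch of a self-written collector in the same spirit as the others. It assumes the kernel EDAC driver is loaded and exposes per-controller `ce_count` / `ue_count` files under `/sys/devices/system/edac/mc/`. The metric name `hw_edac_errors_total` and its labels are hypothetical, chosen only for illustration. Non-numeric or missing counter files are skipped, which may be the kind of case that makes other scripts complain about a missing integer. It is also worth checking whether your node_exporter version already exposes EDAC counters through its own `edac` collector before adding a custom one.

```python
#!/usr/bin/env python3
"""Read EDAC error counters from sysfs and print them in Prometheus text format."""
import glob
import os
import sys

EDAC_ROOT = "/sys/devices/system/edac/mc"  # usual EDAC sysfs location


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce": int, "ue": int}} for every mc* directory."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                with open(os.path.join(mc_dir, name)) as fh:
                    entry[key] = int(fh.read().strip())
            except (OSError, ValueError):
                # Missing or non-numeric file: skip it instead of crashing.
                continue
        if entry:
            counts[os.path.basename(mc_dir)] = entry
    return counts


def to_prometheus(counts):
    """Format the counters as Prometheus text exposition lines."""
    lines = [
        # Metric name and labels are hypothetical, not a standard.
        "# HELP hw_edac_errors_total Memory errors reported by EDAC.",
        "# TYPE hw_edac_errors_total counter",
    ]
    for mc, entry in sorted(counts.items()):
        for kind, value in sorted(entry.items()):
            lines.append(
                f'hw_edac_errors_total{{controller="{mc}",type="{kind}"}} {value}'
            )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sys.stdout.write(to_prometheus(read_edac_counts()))
```

The output could be redirected into a `.prom` file in the directory that node_exporter's textfile collector watches, the same way the SMART and NVMe scripts are usually wired in.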

Thanks
UPD: I found a script that checks EDAC and system sensors (I have not tested it yet), and another one that uses mcelog.


2 answer(s)


VoidVolker, 2019-04-30
@gremlintv2

Zabbix


zersh, 2019-05-09
@zersh

mcelog works well; send its report via cron.
The second option, if the server has IPMI/BMC: collect the server's health information from it, for example via ipmitool or SNMP.
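
Whichever source is used, an hourly cron job only needs to report counters that grew since the previous run. A minimal sketch of that comparison follows, assuming the current values are already collected into a flat dict such as `{"mc0/ce_count": 3}`. The function names and the state-file idea are illustrative assumptions, not part of mcelog or ipmitool.

```python
#!/usr/bin/env python3
"""Report error counters that increased since the previous check."""
import json
import os


def new_errors(current, previous):
    """Return counters that grew since the previous run, as {name: increase}."""
    increases = {}
    for name, value in current.items():
        old = previous.get(name, 0)
        # A value lower than before suggests a reboot or driver reload reset
        # the counter, so the whole current value is treated as new.
        delta = value - old if value >= old else value
        if delta > 0:
            increases[name] = delta
    return increases


def check(current, state_path):
    """Compare against the saved state, save the new state, return increases."""
    try:
        with open(state_path) as fh:
            previous = json.load(fh)
    except (OSError, ValueError):
        previous = {}  # first run, or the state file is unreadable
    with open(state_path, "w") as fh:
        json.dump(current, fh)
    return new_errors(current, previous)
```

A wrapper script could call `check(current, "/var/tmp/edac_state.json")` (a hypothetical path) and print one line per increased counter. Cron normally mails any output of a job to its owner, so an empty result produces no mail. Note that on the very first run every non-zero counter is reported, since there is no earlier state to compare against.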