linux
gremlintv2, 2019-04-29 23:38:17

How to periodically monitor memory errors (EDAC, ECC) in Linux (is there a comprehensive solution for monitoring server health)?

Hello, I came across another monitoring task:
Once per hour, a script (or a service) needs to catch memory errors and report them to an alert channel (mail, messengers, etc.).
What solutions are there for this?
I found this article, but for some reason the script in it complains about a missing integer value. (Maybe that is how it is supposed to be.)
In general, I am looking for a comprehensive solution for monitoring server hardware, with the data sent to Prometheus and monitoring (plus partial alerting) done through Grafana, but so far almost everything is self-written:

  • Temperature (node_exporter)
  • HDD/SSD (smartmontools + script)
  • NVMe (nvme-cli + script)
  • RAM (next in the queue)

Thanks
UPD: found a script that checks EDAC and system sensors (I have not tested it yet), and another one that uses mcelog.
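
For reference, the hourly check described above could look roughly like the sketch below. It assumes the kernel EDAC driver is loaded and exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/`. The function names and the state file path `/var/tmp/edac_state.json` are made up for illustration. It only detects growth in the counters; sending the alert is left out.

```python
import glob
import json
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce": int, "ue": int}, ...} from the EDAC sysfs tree."""
    counts = {}
    for mc_dir in sorted(glob.glob(os.path.join(root, "mc[0-9]*"))):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                with open(os.path.join(mc_dir, fname)) as f:
                    entry[key] = int(f.read().strip())
            except (OSError, ValueError):
                entry[key] = 0  # missing or non-integer counter is treated as 0
        counts[os.path.basename(mc_dir)] = entry
    return counts


def new_errors(previous, current):
    """Return only the counters that grew since the previous run."""
    grown = {}
    for mc, entry in current.items():
        old = previous.get(mc, {})
        diff = {k: v - old.get(k, 0) for k, v in entry.items() if v > old.get(k, 0)}
        if diff:
            grown[mc] = diff
    return grown


def check(state_file="/var/tmp/edac_state.json",
          root="/sys/devices/system/edac/mc"):
    """Compare current counters with the saved state, then save the new state."""
    try:
        with open(state_file) as f:
            previous = json.load(f)
    except (OSError, ValueError):
        previous = {}  # first run or unreadable state file
    current = read_edac_counts(root)
    with open(state_file, "w") as f:
        json.dump(current, f)
    return new_errors(previous, current)
```

Calling `check()` from an hourly cron job and alerting whenever the returned dict is non-empty would cover the "once per hour" requirement. On the very first run every non-zero counter is reported, because there is no saved state yet.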


2 answer(s)
VoidVolker, 2019-04-30
@gremlintv2

Zabbix

zersh, 2019-05-09
@zersh

mcelog works well; send its report via cron.
A second option, if the server has IPMI/BMC: collect the server's information and health status from it, for example via ipmitool or SNMP.
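
As a rough illustration of the ipmitool route, the sketch below runs `ipmitool sel elist` and keeps only the System Event Log rows that look memory-related. The assumed row layout (`id | date | time | sensor | event | state`, separated by `|`) and the sample rows in the usage note are assumptions; the actual sensor and event names vary between BMC vendors. The function names are hypothetical.

```python
import subprocess


def memory_events(sel_text):
    """Pick SEL rows whose sensor starts with "Memory" or whose event mentions ECC."""
    hits = []
    for line in sel_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not an event row
        sensor, event = fields[3].lower(), fields[4].lower()
        if sensor.startswith("memory") or "ecc" in event:
            hits.append(line.strip())
    return hits


def read_sel():
    """Run `ipmitool sel elist`; usually needs root or access to the IPMI device."""
    result = subprocess.run(
        ["ipmitool", "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A cron job could call `memory_events(read_sel())` and mail the result if it is non-empty. Note that this returns every matching row in the log each time, so separating new events from old ones would need extra state (for example, remembering the last seen event id).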
