How do you monitor a large number of servers?

Maxim, 2013-08-10 23:40:06

Helpdesk

Good evening, everyone.
I have long wondered how the specialists on Habr monitor a large number of servers (> 50). Please share your experience.
We use the WhatsUp and PRTG monitoring systems, self-written scripts, SNMP, and so on. All of this is certainly convenient, but it misses many hardware-related problems, for example a failed power supply or a failed drive in a hardware RAID array. These can of course be tied into monitoring with scripts (in principle, that is how it works now), but it is too clumsy, because we run different operating systems on different hardware.
How would you centralize all this?

9 answers

tocha4, 2013-08-10
@tocha4

I settled on Zabbix. It has a fairly user-friendly interface, many built-in triggers, and the ability to create your own, and it can be hooked up to almost any hardware. It is also free, which is very appealing given that much functionality.

rozhik, 2013-08-11
@rozhik

Zabbix. It does an excellent job of monitoring HTTP/HTTPS, SNMP, and the execution time of database queries (or anything else you have the imagination to write a script for).
Great records. Good alerts and reports.
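
As an illustration of the "write a script for it" approach, here is a minimal Python sketch that times a single database query and prints the duration as a bare number. It is not taken from the answer above: the in-memory SQLite database, the table `t`, and the function name `time_query` are all assumptions standing in for a real server and query.

```python
import sqlite3
import time


def time_query(conn, sql):
    """Run one query and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    conn.execute(sql).fetchall()  # fetch every row so the full query cost is measured
    return time.perf_counter() - start


if __name__ == "__main__":
    # Assumption: an in-memory SQLite database stands in for the real server.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
    # Print a bare number so a monitoring agent can store it as a numeric item.
    print(f"{time_query(conn, 'SELECT COUNT(*) FROM t'):.6f}")
```

With Zabbix, a script like this would typically be hooked in through the agent's `UserParameter` directive, for example `UserParameter=db.query.time,python3 /path/to/query_time.py`, where both the key and the path are hypothetical.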

stavinsky, 2013-08-11
@stavinsky

I advise you to look at the Nagios distribution called CheckMK. It is developed by a German team that has already rewritten almost everything themselves. The interface is excellent and clear, and it is easy to set up. It has its own passive agent with preinstalled checks for many services. It is optimized for high load (all the NoSQL, caching, and so on comes bundled and already works).

ergil, 2013-08-11
@ergil

I second the two earlier posters: Zabbix.

track, 2013-08-11
@track

Doesn't a power supply failure generate an event that PRTG can catch?

joneleth, 2013-08-11
@joneleth

Zabbix, but you will still have to write scripts for specific things. The bonus is that everything ends up in one interface.
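
A sketch of what one such "specific thing" script might look like, in Python. It makes a loud assumption: it reads Linux software RAID status from `/proc/mdstat` as a stand-in, because hardware RAID controllers need vendor-specific tools that are not covered here. The function name `degraded_arrays` is hypothetical.

```python
import re

# Matches the member-status field in /proc/mdstat, e.g. "[2/2] [UU]" or "[2/1] [U_]".
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
HEADER_RE = re.compile(r"^(md\w+)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that report missing members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)  # remember which array the next status line belongs to
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as f:
            text = f.read()
    except OSError:
        text = ""  # no md driver on this host, so there is nothing to report
    # Print the number of degraded arrays; 0 means every array is healthy.
    print(len(degraded_arrays(text)))
```

Printing a single number keeps the script easy to attach to a trigger such as "value greater than 0", whichever monitoring system collects it.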

Vladimir, 2013-08-11
@merdoc

The Dude.
Admittedly, we mainly monitor routers and switches.

Acidmind, 2013-08-13
@Acidmind

We use Nagios. Scripts are easy to write; I write them in Ruby for our needs. We monitor MS SQL, DB2, backups... So far I haven't come across a task that couldn't be solved.
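
To show roughly why such scripts are easy to write, here is a minimal sketch of a Nagios-style check. It is written in Python rather than the Ruby mentioned above, and the disk-usage check, the thresholds, and the names `classify`, `check_disk`, and `main` are illustrative assumptions. The parts that follow the Nagios plugin convention are the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the one-line output with performance data after the `|`.

```python
import shutil
import sys

# Exit codes that Nagios-compatible plugins use to report state.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn_pct, crit_pct):
    """Map a measured percentage onto a plugin state."""
    if used_pct >= crit_pct:
        return CRITICAL, "CRITICAL"
    if used_pct >= warn_pct:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path, warn_pct=80.0, crit_pct=90.0):
    """Return (exit_code, status_line) for disk usage at `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    used_pct = 100.0 * usage.used / usage.total
    code, label = classify(used_pct, warn_pct, crit_pct)
    # Text after "|" is performance data in the form label=value;warn;crit
    perfdata = f"used_pct={used_pct:.1f}%;{warn_pct};{crit_pct}"
    return code, f"DISK {label} - {used_pct:.1f}% used on {path} | {perfdata}"


def main():
    code, line = check_disk("/")
    print(line)
    return code

# When installed as a plugin, finish the file with: sys.exit(main())
```

Keeping the threshold logic in `classify` separate from the measurement makes it reusable for other checks, such as backup age or query latency.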

Dmitry, 2013-08-16
@CyberFlow

I use Ganglia to monitor quantitative metrics. For event monitoring, use Shinken or Icinga, and add centralized log collection in Logstash + Elasticsearch on top of that. It looks complicated, but for systems of over 50 machines that need monitoring of hardware, network nodes, and so on, there is no easy way, in my opinion.
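
For the centralized log collection part, here is a minimal Python sketch of shipping one event as a JSON line. It assumes a Logstash `tcp` input configured with the `json_lines` codec on the receiving side. Apart from `@timestamp`, the field names are illustrative, and the collector address and sample message are made up.

```python
import json
import socket
from datetime import datetime, timezone


def format_event(host, message, level="info"):
    """Serialize one log event as a single newline-terminated JSON line."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "host": host,
        "level": level,
        "message": message,
    }
    return json.dumps(event) + "\n"


def send_event(line, address):
    """Send one JSON line over TCP to a collector at (host, port)."""
    with socket.create_connection(address, timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))


if __name__ == "__main__":
    # Print instead of sending, because the collector address is site-specific.
    # The message text is an invented example.
    print(format_event(socket.gethostname(), "power supply 2 failed", "critical"), end="")
```

In a real setup, `send_event(line, ("logs.example.internal", 5000))` would deliver the event, with the hostname and port replaced by whatever the Logstash input actually listens on.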