Is there a strategy to deal with RabbitMQ state reset?

Sergey Tikhonov, 2018-12-19 11:22:48

RabbitMQ

Hello!
In several teams that used RabbitMQ, we followed the rule "RabbitMQ is as reliable as a database". This let us queue Celery tasks without worrying much about RabbitMQ losing something or otherwise violating delivery guarantees. In general (given sensible Celery settings) this approach pays off, but there are caveats.

Three times our admins ran into trouble with a distributed RabbitMQ cluster, ending in crashes of the "it's easier to redeploy RabbitMQ from scratch than to make sense of its Erlang dumps" kind.
With the spread of Amazon/Kubernetes/Docker/whatever, a stateful RabbitMQ suddenly became very prone to administrative errors like "oh, it was storing data? well, it got moved".
A couple of times some messages were lost for unknown reasons. It's hard to say whether our code or the containerization is to blame, but the fact is that something important did not arrive where it should have.

So the question is: does anyone have experience dealing with this kind of behavior? I'm interested in a general approach to ensuring delivery guarantees when using RabbitMQ in an untrusted environment.
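
For context, the question does not spell out what "sensible Celery settings" means. A minimal sketch of what is commonly meant, assuming Celery 4+ with the default py-amqp transport, might look like the following. The file name and the chosen values are assumptions, not the asker's actual configuration, and late acknowledgement is only safe if the tasks are idempotent.

```python
# celeryconfig.py (hypothetical file name): a sketch, not the asker's config.
# Setting names are Celery 4+ lowercase settings; the values are assumptions.

# Acknowledge a task only after it has finished, so a worker crash
# mid-task leads to redelivery instead of a silently dropped task.
task_acks_late = True

# With late acks, requeue the task if the worker process dies abruptly.
# Tasks must be idempotent, because they may run more than once.
task_reject_on_worker_lost = True

# Reserve one message at a time per worker process, so fewer unacknowledged
# messages sit in a worker that may disappear.
worker_prefetch_multiplier = 1

# Ask the broker to confirm each publish (py-amqp transport option).
broker_transport_options = {"confirm_publish": True}

# Retry publishing if the connection to the broker fails.
task_publish_retry = True
```

These settings trade some throughput for at-least-once behavior; they do not protect against a broker whose storage is wiped.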

2 answer(s)

Sergey, 2018-12-19
@tumbler

1) The RabbitMQ developers take any data loss on the RabbitMQ side seriously, so if at some point you are sure that RabbitMQ itself is losing data, feel free to file a bug (with steps to reproduce).
2) If the admins don't know how to set up a stateful application in a container environment, or there are a lot of manual operations, that is more of an operations problem: learn it and, for example, use templates/charts/etc. to avoid surprises. RabbitMQ in a container also has to be configured properly so that you don't get degradation and crash dumps.
3) On the RabbitMQ side there is queue mirroring for replicating data, which lets you survive the sudden loss of nodes (though recovery can cost a lot of CPU).
4) I also recommend giving every message an identifier and logging each one sent and received, so that you can assess the problem. For more reliability you can also enable logging at the RabbitMQ level (if resources allow). At our previous job we had our own RabbitMQ plugin that received a copy of every incoming and outgoing message, extracted the necessary metadata from it, and sent it to Graylog.
5) And of course you need to publish and consume with confirmations, but I assume you are already doing that.
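
The idea in point 4 can be sketched roughly as below. This is only an illustration using the standard library: the names `stamp`, `DeliveryAudit`, and the `mq.audit` logger are hypothetical, and a single in-memory object stands in for what would really be two separate processes shipping their logs to a central store (such as the Graylog mentioned above) and comparing ids there.

```python
import json
import logging
import uuid

log = logging.getLogger("mq.audit")  # hypothetical logger name


def stamp(payload):
    """Wrap a payload in an envelope that carries a unique message id."""
    return {"message_id": str(uuid.uuid4()), "payload": payload}


class DeliveryAudit:
    """Records ids of published and consumed messages and reports the gap."""

    def __init__(self):
        self.sent = set()
        self.received = set()

    def on_publish(self, envelope):
        """Log the id on the producer side and return the wire body."""
        self.sent.add(envelope["message_id"])
        log.info("published %s", envelope["message_id"])
        return json.dumps(envelope).encode("utf-8")

    def on_consume(self, body):
        """Log the id on the consumer side and return the original payload."""
        envelope = json.loads(body)
        self.received.add(envelope["message_id"])
        log.info("consumed %s", envelope["message_id"])
        return envelope["payload"]

    def missing(self):
        """Ids that were published but never seen by a consumer."""
        return self.sent - self.received
```

In use, the producer would call `on_publish(stamp(...))` right before handing the body to the broker, and the consumer would call `on_consume(body)` before acknowledging. Anything left in `missing()` after a reasonable delay is a candidate for a lost message and gives you concrete ids to put in a bug report.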

Sergey Tikhonov, 2018-12-24
@tumbler

And finally, a timely article from Megafon: https://habr.com/post/434016/