N
N
NoN2013-11-14 10:10:33
linux
NoN, 2013-11-14 10:10:33

Unexpected restart of EJabberd - how is this possible?

Hello Linux, Erlang and/or EJabberd experts.

I ran into a very strange (and even unbelievable) situation when using EJabberd. The inexplicability of what happened was very interesting, but his own investigation did not lead to anything. Hope the community can help.

Initial data : VDS under Xen, on which Ubuntu 12.04 has been running for half a year, and standard packages: isolated (without S2S) EJabberd and RabbitMQ. During all this time, both applications rebooted only as planned during updates or reboots of the entire server, otherwise they worked stably. Jabber-bot is locally connected to ejabberd.

Recently, while reviewing the logs, I found that the bot was losing connection with the server for 40 seconds. I was surprised that it was not the localhost network that was falling off - I looked more. And then the interesting began.

Based on the logs and the process table, it turned out that the process with Erlang VM (beam) was rebooted. The creation time of the beam process and its internal uptime (ejabberdctl stats uptimeseconds) of course corresponded to the time the bot was turned off.

Questions : why did the vaunted bulletproof elrang fall (?). And, more interestingly, who restarted it?

The catch is that in the standard 12.04 package it is launched not through upstart (which can monitor and respawn), but through the old init.d (SysV), which is a simple set of scripts and does not control running processes, "died so died" .

Ejabberd is really not monitored by anyone. "Killed by nine" (kill -9) does not rise.

Only ejabberd rebooted. Neither RabbitMQ, nor epmd, nor any other daemons were "affected".

This happened during the execution of cron.daily (standard set). Did it cause you to fall and/or get up? Possibly, but I don't know what the connection is. Nothing like this happened in the previous six months.

Collected all the logs in one timeline, arranged chronologically.

syslog.log: Cron starts running:

Oct 25 16:18:01 myhostname CRON[18678]: (root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))

ejabberd.log: EJD starts disconnecting all clients for no apparent reason:

=INFO REPORT==== 2013-10-25 16:22:57 ===
I(<0.16451.0>:ejabberd_c2s:1502) : ({socket_state,tls,{tlssock,#Port<0.8176>,#Port<0.8178>},<0.16450.0>}) Close session for [email protected]/3977817831382700501260091

=INFO REPORT==== 2013-10-25 16:22:57 ===
I(<0.14970.0>:ejabberd_c2s:1502) : ({socket_state,tls,{tlssock,#Port<0.7808>,#Port<0.7810>},<0.14969.0>}) Close session for [email protected]/11507889201382475822937194

=INFO REPORT==== 2013-10-25 16:22:58 ===
I(<0.14965.0>:ejabberd_c2s:1502) : ({socket_state,gen_tcp,#Port<0.7805>,<0.14964.0>}) Close session for [email protected]/30615165031382475819899328

ejabberd.log: Some clients try to return:

=INFO REPORT==== 2013-10-25 16:23:00 ===
I(<0.900.0>:ejabberd_listener:281) : (#Port<0.8261>) Accepted connection {{127,0,0,1},33716} -> {{127,0,0,1},5222}

=INFO REPORT==== 2013-10-25 16:23:00 ===
I(<0.16531.0>:ejabberd_c2s:785) : ({socket_state,gen_tcp,#Port<0.8261>,<0.16530.0>}) Accepted authentication for bot by []

=INFO REPORT==== 2013-10-25 16:23:01 ===
I(<0.16531.0>:ejabberd_c2s:938) : ({socket_state,gen_tcp,#Port<0.8261>,<0.16530.0>}) Opened session for [email protected]/90172789113827037819693

ejabberd.log: But the server immediately shuts down:

=INFO REPORT==== 2013-10-25 16:23:02 ===
I(<0.41.0>:ejabberd_app:89) : ejabberd 2.1.10 is stopped in the node [email protected]

erlang.log, ejabberd.log: Server reads config, rises:

=PROGRESS REPORT==== 25-Oct-2013::16:23:29 ===
          supervisor: {local,ejabberd_sup}
             started: [{pid,<0.247.0>},
                       {name,ejabberd_hooks},
                       {mfargs,{ejabberd_hooks,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,brutal_kill},
                       {child_type,worker}]
=WARNING REPORT==== 2013-10-25 16:23:29 ===
W(<0.41.0>:ejabberd_config:260) : The option 'modules' is disallowed, and will not be accepted

=INFO REPORT==== 2013-10-25 16:23:31 ===
I(<0.41.0>:ejabberd_app:72) : ejabberd 2.1.10 is started in the node [email protected]

ejabberd.log: Clients returned:

=INFO REPORT==== 2013-10-25 16:23:32 ===
I(<0.892.0>:ejabberd_listener:281) : (#Port<0.3285>) Accepted connection {{127,0,0,1},33730} -> {{127,0,0,1},5222}

=INFO REPORT==== 2013-10-25 16:23:32 ===
I(<0.895.0>:ejabberd_c2s:785) : ({socket_state,gen_tcp,#Port<0.3285>,<0.894.0>}) Accepted authentication for bot by []

=INFO REPORT==== 2013-10-25 16:23:32 ===
I(<0.895.0>:ejabberd_c2s:938) : ({socket_state,gen_tcp,#Port<0.3285>,<0.894.0>}) Opened session for [email protected]/21300376551382703812415798

syslog.log: However, cron is still running.

Oct 25 16:23:38 myhostname rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="454" x-info="http://www.rsyslog.com"] rsyslogd was HUPed

Hypotheses, arguments and questions:

1. It was just a very rare Heisenbug. But who then restarted the demon?

2. Cron restarted the process for some purpose. But this has never happened before. Yes, and too clumsy for a network server.

3. Ejabberd decided to restart for some reason. Critical error, scheduled reboot (I looked in the documentation and code about this - I did not find it). It seems that because the reboot was soft, the clients first disconnected, then the process ended. But why did the VM reboot itself then? Usually EJD restarts inside the VM (ejabberdctl restart).

4. Erlang VM decided to restart for some reason. If this is a critical error, then it is unlikely that EJD would have managed to smoothly disconnect all clients. If a scheduled reboot is ... we are talking about Erlang, and not about Win95.

5. It's so Xen is naughty. I did not work with him, I am not aware of his capabilities and features. But it's strange if it shoots and restarts the processes.

6. I overlooked the EJD respawn monitor. Why doesn't it rise after kill -9 then? Well, anything can happen, open my eyes.

The demon restarted by evil spirits is now working, you can look at something to test hypotheses.

Answer the question

In order to leave comments, you need to log in

1 answer(s)
I
Ivan Osipov, 2013-11-16
@Osipov

It looks like the Erlang/OTP supervisor worked. See: http://www.erlang.org/doc/design_principles/sup_princ.html (obvious I think).
The question is why did it work. I can assume that the memory is over or corrupted.
Crash dumps not saved? (they are usually written to the directory where the logs are kept).
In general, of course, there is little information: erlang.log should be looked at in full + ejabberd.log a few minutes before the client shutdown began.

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question