Node.js - what is the reason and how to deal with "Request Interrupted" and "Idle connection" errors if they occur in a simple request?

K

keht2015-02-05 14:22:25

MongoDB

keht, 2015-02-05 14:22:25

There is a REST backend for a mobile game built on node.js and mongodb (express, mongoose). The application is deployed on Heroku (2 dyno (1x)), the database is on Mongolab (M1 Cluster (1.7 GB). Used to process ratings, gifts, saves, user profiles, etc.
The client communicates with the server via POST requests (inside client uses libcurl), passing and receiving data in the form of JSON. The nature of the load on the server is 50/50 read / write, no caching. There are an average of 2 requests to the database per backend request.
After the project was launched, the load was 4500rpm at its peak, response time 30-40ms (data obtained in NewRelic) As the database grew, the response time began to increase, and rpm - to fall.At the moment we have ~1000rpm at the peak and 100ms.The time in the database has increased from 3-5ms to 15-20ms .
And everything would be fine, but from the moment of start to the current time, there are more and more errors in the metric. At first there were none at all, then they began to slip 1-2. And now:
i.imgur.com/8n6YbpO.png
"H18 - Request Interrupted" - https://devcenter.heroku.com/articles/error-codes#...
"H15 - Idle connection" https://devcenter. heroku.com/articles/error-codes#
... despite the decrease in load, the number of errors is growing.
In the logs (Logentries) there is only H15, they look like this:
heroku router - - at=error code=H15 desc="Idle connection" method=POST path="/v1/userString/set" host=xxxxxxxxxx.com request_id=c339d810-c020-4df1-81d0-e9016a25c8eb fwd="213.44. 130.41/i16-les01-ntr-213-44-130-41.sfr.lns.abo.bbox.fr" dyno=web.2 connect=0ms service=55001ms status=503 bytes=0
H18 for some reason not in the logs generally.
service=55001ms is a strange time. Inside the client, a timeout of 30s is set for all requests. Setting a timeout on the server (set 30 seconds for tests) does not affect anything. I tried to write to the log using the 30s timer from the beginning of the request, but not a single such case was included in the log - all requests completed faster.
I checked Slow Queries in Mongolab and didn’t see anything wrong there - in two days there were only ~160 requests with an average response of 264ms,
99.9% of H15 in the logs are on method=POST path="/v1/userString/set" (rpm of this request is ~120 out of 1000).
Nothing interesting inside. First middleware: express.compress, express.json, express.urlencoded, then my code: pastebin.com/5xqs2y3q The exact same request handling is used in a dozen other places in the project and does not cause any problems. The only difference is that there is a rather large amount of input and output data here: 3-5kb versus 0.1-0.5kb in other requests.
How it works:
Client sends JSON with player id and data (saved):
{ "UserId" : "f1234567890", "Key" : "progress", "Value" : " ---- [ SOME USER DATA HERE - 3-5kb of text ] ---- ", "Version" : 123, "AccessToken" : "
The server pulls the corresponding entry from the database, checks it (the decision is made by the "Version" field, a banal comparison of the int). If everything is in order, then "Value" is written to the database and the response is sent to the client:
{"result":{"code":0,"message":{"status":"updated"}}}
And if not, then " Value" from the database is sent back to the client to resolve the conflict.
{"result":{"code":0,"message":{"status":"need merge","Value":" ---- [ SOME DATA HERE - 3-5kb of text ] ---- }}}
My questions are:
1. Why is the response time of such a simple request 200-400ms (when the db responds in 3-5ms)
2. Why do H15 errors occur?There is no option in my code,
3. How to find out who is to blame for H18 - the client or the server, if for some reason this error is not in the logs. By the way, why?
4. Why does the number of errors continue to increase when the rpm decreases?
5. What, in fact, is the problem and how can I fix it.

Reply

Answer the question

In order to leave comments, you need to log in

1 answer(s)

K

keht, 2017-04-07
@keht

I myself will answer my own questions after a couple of years :)
1 - The problem is in the parsing of the request on the side of the node. The request contains a JSON body, which is automatically parsed by the node, and this is suddenly insanely expensive even with a small amount of data (~10kb).
2, 3 - The problem is exclusively on the client. Because all clients are mobile devices, the situation with the request hanging / falling off is the norm. H15/H18 are easily reproducible by simulating a bad/missing connection on the device.
4 - The reason is the amount of data. The further in the game the player has reached, the fatter his requests are. And the more data, the longer the request is processed (point 1) and the more chances that something will go wrong.
5 - To speed up the request, we refuse the JSON body in the request, we pass the parameters in a different way. Nothing can be done about H15/18 on the server side, so we do some magic on the client so that it can handle failed requests.