Nginx
Stanislav Gordienko, 2019-04-02 08:30:16

Why did Nginx stop sending requests to the upstream server after a huge load?

Hello,
The project uses Nginx as a load balancer. Here is a simplified configuration:

upstream ima {
    server serverA:3000;
    server serverA:3000 backup;
    server serverA:3000 backup;
    server serverA:3000 backup;
}

server {
    server_name localhost;
    gzip on;

    sendfile on;

    gzip_http_version 1.0;
    gzip_proxied      any;
    gzip_min_length   500;
    gzip_disable      "MSIE [1-6]\.";
    gzip_types        text/plain text/xml text/css
                      text/comma-separated-values
                      text/javascript
                      application/x-javascript
                      application/json
                      application/atom+xml;

    proxy_connect_timeout       10;
    proxy_send_timeout          12;
    proxy_read_timeout          14;

    send_timeout                600;
    client_body_timeout         600;
    client_header_timeout       600;
    keepalive_timeout           600;

    client_max_body_size 50M;
    client_body_buffer_size 20M;

    access_log /home/nginx-access.log;
    error_log /home/nginx-error.log warn;

    location /checksum {
        # Note: log_format is only valid in the http context; nginx will not
        # start with it inside a location block. It is shown here only to keep
        # the example compact.
        log_format upstream_logging '$remote_addr - $remote_user [$time_local] '
                                    '"$request" $status $body_bytes_sent '
                                    '"$http_referer" "$http_user_agent" "$gzip_ratio" '
                                    '"$upstream_connect_time" "$upstream_header_time" "$upstream_response_time" "$request_time"';

        access_log /home/upstreams.log upstream_logging;

        proxy_pass http://ima;
        proxy_redirect     off;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Host $server_name;
        proxy_next_upstream error timeout;
        proxy_intercept_errors off;
    }
  }

Our serverA is an ELB that fronts 16 service instances in Kubernetes. The upstream lists the same serverA five times. This was set up before I joined, as a workaround: if the first attempt fails or the service Pod is unavailable, Nginx switches to the next serverA backup and retries the request. This hack works and has been reliable. But over the weekend we ran a load test, sending a large volume of data and many concurrent requests, and on Monday we noticed a lot of 499 errors in the Nginx logs, e.g.:

"GET /result/final/id-goes-here HTTP/1.1" 499 0 "-" "Our Client Name" "-" "-" "-" "-" "64.587"

When we then called the service manually, it timed out, returning the "Could not get any response" page in Postman. Interestingly, only Nginx failed to return a result: a request sent directly to the upstream server (the ELB) succeeded. Our assumption is that under heavy load Nginx marked all the (identical) servers in the upstream as down, so it returned nothing, hung on the request, and eventually timed out. We were left with more questions than answers.
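If our assumption is right, the behavior would come down to Nginx's passive health checks: by default a peer is marked unavailable after a single failed attempt (max_fails=1) and stays out of rotation for fail_timeout (10s by default), and since all our upstream entries point to the same serverA, one burst of timeouts could take every peer out at once. A sketch of the tunables we are considering (values are illustrative, not our production settings):

```nginx
upstream ima {
    # Require several failures within a shorter window before a peer
    # is marked down (defaults are max_fails=1, fail_timeout=10s).
    server serverA:3000 max_fails=3 fail_timeout=5s;
    server serverA:3000 max_fails=3 fail_timeout=5s backup;
}

server {
    location /checksum {
        proxy_pass http://ima;
        proxy_next_upstream error timeout;
        # Bound how many peers one request may be retried across,
        # and the total time spent retrying (available since 1.7.5).
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;
    }
}
```

If all peers really were marked down, I would also expect "no live upstreams" entries in the error log, so that is worth checking.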
Has anyone run into something similar in practice? How did you solve it? Perhaps there are similar cases described somewhere?
Thank you.
