Amazon Web Services
EgorkaOle, 2021-01-30 23:21:21

How to diagnose strange network problems on AWS servers?

Hello everyone. I'm not an expert in AWS or networking, but I'm seeing a strange situation.
We run M4 and M5 EC2 instances on Windows with 10 Gbps network bandwidth, several network interfaces, and several IPs.
The servers run several network-intensive applications that connect to various other servers and services via WebSockets and other protocols. Total network usage is usually around 30-50 Mbps.
Sometimes one of the programs starts to experience serious network lag: it receives messages with a delay or does not receive them at all. For example:
There are 10 apps, each using a little bandwidth. One abruptly starts using more, so the whole system reaches about 100 Mbps (usually 30-50), and 1-2 other apps start to suffer. Restarting the first app or closing some of its connections helps, and the network and CPU load go back down.
I also noticed that something similar happens if one of the apps starts using more CPU, for example 20-30% (the system as a whole is at roughly 50% in this case). The network traffic of the other apps suffers as well, and restarting that app can help.
In general, I don't understand why using 100 Mbps out of 10 Gbps (about 1% of the link) can create such problems.
How can this happen, and where should I look for the problem?


3 answer(s)
Wexter, 2021-01-31
@Wexter

What does the network have to do with it? This is obviously some kind of bottleneck in the software itself that shows up under load.

Eugene, 2021-02-01
@yellowmew

With about 99.9% probability, AWS has absolutely nothing to do with it.
1. Look at the network stack on the problem servers (at the very least, monitor it; do you check your monitoring at all?). For example, a fairly common problem is TCP connections stuck in CLOSE_WAIT that the application never closes. Tune the network settings for your application.
2. Monitor the application itself and see what changes at the moments when it stalls. Try an APM tool (New Relic, Datadog, and others) if the application allows it.
In general, with data to analyze it would be possible to form a hypothesis; without it, this is guesswork.
Set up monitoring if you haven't already, and then actually look at it.
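As a rough illustration of point 1, the sketch below counts TCP connections per process and state from the text output of `netstat -ano` on Windows. The function names `count_tcp_states` and `snapshot` are made up for this example, and the parser assumes the usual five-column TCP row layout (protocol, local address, remote address, state, PID) with English state names.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per (pid, state) in `netstat -ano` text.

    Assumes Windows-style rows: TCP  local  remote  STATE  PID.
    Rows that do not match (headers, UDP rows) are skipped.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0].upper() == "TCP":
            state, pid = parts[3], parts[4]
            counts[(pid, state)] += 1
    return counts


def snapshot() -> Counter:
    """Run `netstat -ano` once and return the per-process state counts."""
    result = subprocess.run(
        ["netstat", "-ano"], capture_output=True, text=True, check=True
    )
    return count_tcp_states(result.stdout)
```

Calling `snapshot()` every few seconds and logging the result would show whether one PID accumulates CLOSE_WAIT (or any other state) around the time the other apps start lagging.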

Roman Mirilaczvili, 2021-01-31
@2ord

There is no evidence that the problems are caused by the network itself. Debug the applications to find out why they are not delivering data, and look for places where they block.
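One simple way to start looking for such stalls is to record the gaps between consecutive messages inside each application and compare their timestamps with CPU and traffic spikes. The sketch below is only an illustration: `GapDetector`, `threshold_s`, and `on_message` are hypothetical names, and the caller is assumed to pass in a monotonic timestamp in seconds (for example from `time.monotonic()`) each time a message arrives.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GapDetector:
    """Record silences between messages that exceed a threshold."""

    threshold_s: float
    last_seen: Optional[float] = None
    # Each entry is (time of the last message before the gap, gap length).
    gaps: List[Tuple[float, float]] = field(default_factory=list)

    def on_message(self, now: float) -> None:
        """Call once per received message with the current timestamp."""
        if self.last_seen is not None:
            gap = now - self.last_seen
            if gap > self.threshold_s:
                self.gaps.append((self.last_seen, gap))
        self.last_seen = now
```

If the recorded gaps line up with the moments another app's CPU usage jumps, that points toward contention on the host or inside the applications rather than the link itself.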
