Why is there more traffic going through one gateway than another when load balancing?


linux

Evgeny Ferapontov, 2014-06-27 13:56:37

Excerpt from the script:

# Multi-wan routing settings
# ByFly routing settings

ip route add 192.168.111.0/30 dev eth1 src 192.168.111.2 table byfly
ip route add default via 192.168.111.1 table byfly

# BN routing settings

ip route add 192.168.110.0/30 dev eth2 src 192.168.110.2 table bn
ip route add default via 192.168.110.1 table bn

# Creating rules

ip rule add from 192.168.111.2 table byfly prio 1000
ip rule add from 192.168.110.2 table bn prio 1000

# Here's round robin to ensure that system started with at least one gateway

ip route add default scope global nexthop via 192.168.111.1 dev eth1 weight 1 nexthop via 192.168.110.1 dev eth2 weight 1

# GWPING call

nohup /usr/sbin/gwping &

The rule priorities are equal, the route weights are equal, and the metrics of the created routes are equal. GWPING only moves everything to a single gateway when a link goes down and restores the original setup once it comes back. Yet eth1 always carries at least twice as much traffic as eth2. Where should I dig?
P.S. I tried swapping the two nexthops in the round-robin route; the result is the same.


2 answer(s)

throughtheether, 2014-06-27
@e1ferapontov

I should say up front that I am no great expert on the Linux networking subsystem. I am working from the following assumptions:
1) You appear to have implemented an equal-cost multi-path (ECMP) scheme.
2) I assume the Linux networking subsystem was written by sensible people, so the outgoing route is chosen per flow, where a flow is identified by the source and destination IP addresses, the IP protocol number, and the source and destination ports.

Why is there more traffic going through one gateway than another when load balancing?

In short, because no traffic balancing is actually taking place. As I understand it, balancing means tracking one parameter (the 'load', be it link utilization, number of connections, or whatever) and adjusting another (outgoing route, interface, etc.) through feedback so that the load evens out. Balancing in this sense is what dedicated load balancers such as F5, haproxy and others do.
In your case, the traffic is most likely being split (load sharing) according to which flow each packet belongs to. Higher utilization of one link then suggests a high-intensity flow (an elephant flow): a large number of packets share the same hash and are sent over the same link. There may also be subtleties in how traffic generated by the host itself is split. And, of course, software bugs are always possible.
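To illustrate why per-flow splitting can leave one link much busier than the other, here is a toy model in shell. It is only a sketch: the `cksum`-based hash, the flow-identifier format and the byte counts are all made up for illustration and are not what the kernel actually does. The point is only that every packet of a given flow lands on the same link, so a single heavy flow dominates whichever link it hashes to.

```shell
#!/bin/sh
# Toy model of per-flow load sharing. NOT the kernel's real hash function.

# pick_link FLOW NLINKS: map a flow identifier to a link index in 0..NLINKS-1.
pick_link() {
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    echo $((crc % $2))
}

# share_load: read "FLOW BYTES" lines on stdin and print the total bytes
# assigned to each of two links as "LINK0_BYTES LINK1_BYTES".
share_load() {
    link0=0
    link1=0
    while read -r flow bytes; do
        if [ "$(pick_link "$flow" 2)" -eq 0 ]; then
            link0=$((link0 + bytes))
        else
            link1=$((link1 + bytes))
        fi
    done
    echo "$link0 $link1"
}

# Demo with hypothetical flows: one elephant flow and three small ones.
share_load <<'EOF'
tcp:10.0.0.5:40001>198.51.100.7:443 900000000
tcp:10.0.0.5:40002>198.51.100.8:443 20000
udp:10.0.0.6:53124>198.51.100.9:53 300
tcp:10.0.0.7:40100>198.51.100.10:80 45000
EOF
```

Whichever link the first flow hashes to ends up carrying almost all of the bytes, even though both links have the same weight.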

Where to dig?

To check this hypothesis, you can capture the traffic (at a point before it is split across the links) and examine it in Wireshark (Statistics -> Conversations -> TCP and UDP tabs, the two rightmost columns, in bps). If the hypothesis is correct, you will find a pair of sockets using a significant share of the link's bandwidth.
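If you would rather do the same check from the command line, something like the following sketch may help. It assumes you have already exported one line per packet in the form "source address, destination address, frame length" (for example with `tshark -r capture.pcap -T fields -e ip.src -e ip.dst -e frame.len`, where `capture.pcap` is a placeholder file name). The `top_flows` helper is hypothetical, and it groups by address pair only, which is coarser than the per-socket view in Wireshark.

```shell
#!/bin/sh
# top_flows [N]: read "SRC DST BYTES" lines on stdin and print the N
# heaviest address pairs (default 5) as "TOTAL_BYTES SHARE% SRC DST".
top_flows() {
    awk '{ bytes[$1 " " $2] += $3; total += $3 }
         END {
             for (f in bytes)
                 printf "%d %.1f%% %s\n", bytes[f], 100 * bytes[f] / total, f
         }' |
        sort -rn | head -n "${1:-5}"
}

# Example with placeholder file names:
#   tshark -r capture.pcap -T fields -e ip.src -e ip.dst -e frame.len | top_flows 10
```

A single pair holding a large share of the bytes would support the elephant-flow explanation.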
I am also assuming that you showed your actual settings and that you have exactly two default routes. If there happens to be one more with the same weight (the same metric), then even in the ideal case the split will most likely be 1:1:2. This is due to peculiarities of the ECMP implementation.
TL;DR: This is not balancing, it is load sharing. In the general case it is extremely hard to get traffic to split exactly in half.
UPD:

The machine is currently on a test bench. There is no traffic here at all, apart from two pings running for testing.

Please give the exact commands you are running. State whether you run them on the router with the two links or on a separate machine, and whether the ping targets actually reply. What are the exact traffic figures on the links (which interface, how much incoming, how much outgoing), and how do you measure them (an average over 1, 5 or 10 minutes)?
Even without this data, though, I can say that your test is not quite valid. Per-flow hashing is optimized for UDP and TCP traffic. I therefore recommend using hping to generate UDP traffic with randomized source and destination addresses and source and destination ports, from a machine behind the eth0 interface (with its default gateway set correctly). If the traffic is then split roughly evenly, everything is working properly.
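For the per-link figures asked about above, one simple approach is to sample the kernel's interface byte counters over a fixed window. This is a minimal sketch that assumes the counters under `/sys/class/net/<iface>/statistics/` are available; the helper names and the 60-second window are my own choices, not anything from your script.

```shell
#!/bin/sh
# tx_bytes IFACE: print the cumulative transmitted-bytes counter from sysfs.
tx_bytes() {
    cat "/sys/class/net/$1/statistics/tx_bytes"
}

# avg_bps START_BYTES END_BYTES SECONDS: average rate in bits per second
# (integer arithmetic, so the result is rounded down).
avg_bps() {
    echo $(( ($2 - $1) * 8 / $3 ))
}

# sample_tx IFACE SECONDS: average transmit rate of IFACE over SECONDS seconds.
sample_tx() {
    start=$(tx_bytes "$1")
    sleep "$2"
    end=$(tx_bytes "$1")
    avg_bps "$start" "$end" "$2"
}

# Example (interface names from the question; 60 s is an arbitrary window).
# The two samples run one after the other, so they cover different windows:
#   echo "eth1: $(sample_tx eth1 60) bit/s"
#   echo "eth2: $(sample_tx eth2 60) bit/s"
```

Reading `rx_bytes` from the same directory gives the incoming side in the same way.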

tgz, 2014-06-27
@tgz

Show the output of:
ip a
ip ru li
ip r