Network problem on multiple identical servers. What could be the problem?

linux

hellname, 2015-03-20 12:00:28

Hello. I've run into a problem whose cause I haven't been able to find for several weeks. First, the architecture:
There are 4 front-end servers (Linux, Ubuntu 12.04, kernel 3.2.0-76), each with 2x Xeon E5, 16 GB RAM and one gigabit port connected to a VLAN. They run a PHP web application that stores some of its data in memcached, which lives on a separate dedicated server (1x Xeon E3, 16 GB RAM, 1 Gbps VLAN port). The application is clustered and the load on every front end is always the same, so I can't work out what is causing the problem. The problem itself:
On 3 of the 4 servers the error "Server memcached_server (tcp 11211) failed with: Connection timed out (110)" appears, and user sessions are dropped as a result (this can be an authorization session or a cookie-storage session). The most interesting part is that the servers are identical in hardware, settings, code, services and package versions.

net.core.somaxconn      = 65000 
net.core.netdev_max_backlog   = 3000 
net.core.rmem_default      = 524288 
net.core.wmem_default      = 393216 
net.core.rmem_max      = 16777216 
net.core.wmem_max      = 16777216 
net.unix.max_dgram_qlen      = 256 
net.ipv4.ip_forward            = 0 
net.ipv4.tcp_rmem            = 4096 87380 16777216 
net.ipv4.tcp_wmem            = 4096 65536 16777216 
net.ipv4.tcp_congestion_control         = htcp 
net.ipv4.tcp_mtu_probing         = 1 
net.ipv4.tcp_timestamps            = 1 
net.ipv4.tcp_sack            = 1 
net.ipv4.tcp_fack            = 1 
net.ipv4.tcp_dsack            = 1 
net.ipv4.tcp_syncookies            = 1 
net.ipv4.tcp_max_syn_backlog         = 16384 
net.ipv4.tcp_synack_retries         = 5 
net.ipv4.tcp_abort_on_overflow         = 0 
net.ipv4.icmp_echo_ignore_broadcasts      = 1 
net.ipv4.icmp_ignore_bogus_error_responses   = 0 
net.ipv4.neigh.default.gc_thresh1      = 512 
net.ipv4.neigh.default.gc_thresh2      = 1024 
net.ipv4.neigh.default.gc_thresh3      = 2048 
net.ipv4.conf.all.rp_filter         = 1 
net.ipv4.conf.default.rp_filter         = 1 
net.ipv4.conf.all.accept_redirects      = 0 
net.ipv4.conf.default.accept_redirects      = 0 
net.ipv4.conf.all.send_redirects      = 1 
net.ipv4.conf.default.send_redirects      = 1 
net.ipv4.conf.all.accept_source_route      = 0 
net.ipv4.conf.default.accept_source_route   = 0 
net.ipv4.conf.all.proxy_arp         = 1 
net.ipv4.conf.default.proxy_arp       = 1 
net.netfilter.nf_conntrack_max         = 1048576 
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait=1 
net.ipv4.tcp_fin_timeout=10 
net.ipv4.ip_local_port_range="16384 65534" 
net.ipv4.tcp_tw_reuse=1 
net.netfilter.nf_conntrack_tcp_timeout_established=600 
net.ipv4.tcp_slow_start_after_idle=0 
echo "131072" > /sys/module/nf_conntrack/parameters/hashsize 
ifconfig eth1 txqueuelen 10000
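
Since the front ends are described as identical in settings, one way to verify that claim for the kernel parameters listed above is to dump them on each host and diff the dumps. The following is a minimal sketch, not a definitive tool: the function names and the dump file names in the usage note are hypothetical, and it only normalizes whitespace around `=` so that `key = value` and `key=value` compare equal.

```shell
#!/bin/sh
# Normalize "key = value" lines from a sysctl dump: drop whitespace around
# the first "=", collapse runs of blanks/tabs, trim line ends, drop empty
# lines, and sort so that line order does not matter.
normalize_sysctl() {
    sed -e 's/[[:space:]]*=[[:space:]]*/=/' \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' -e 's/ $//' \
        -e '/^$/d' "$1" | sort
}

# Print the lines that differ between two dumps.
# Exit status 0 means the normalized dumps are identical.
compare_sysctl() {
    a=$(mktemp) && b=$(mktemp) || return 2
    normalize_sysctl "$1" > "$a"
    normalize_sysctl "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}
```

Usage, assuming a dump was taken on each host with `sysctl -a 2>/dev/null > front1.sysctl` (and likewise for the healthy server): `compare_sysctl front1.sysctl front4.sysctl`. Some keys, such as `kernel.random.uuid`, change on every read, so a few differing lines are expected noise.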

cat /etc/rc.local 
#!/bin/sh -e 
# 
# rc.local 
# 
# This script is executed at the end of each multiuser runlevel. 
# Make sure that the script will "exit 0" on success or any other 
# value on error. 
# 
# In order to enable or disable this script just change the execution 
# bits. 
# 
# By default this script does nothing. 

# RX/TX Backlog phy 
ETHTOOL="/sbin/ethtool" 

$ETHTOOL -G eth0 rx 4096 tx 4096 
$ETHTOOL -G eth1 rx 4096 tx 4096 
$ETHTOOL -G eth2 rx 4096 tx 4096 
$ETHTOOL -G eth3 rx 4096 tx 4096

ethtool -i eth1 
driver: igb 
version: 3.2.10-k 
firmware-version: 1.5-9 
bus-info: 0000:02:00.1 
supports-statistics: yes 
supports-test: yes 
supports-eeprom-access: yes 
supports-register-dump: yes 




ethtool eth1 
Settings for eth1: 
   Supported ports: [ TP ] 
   Supported link modes:   10baseT/Half 10baseT/Full 
                           100baseT/Half 100baseT/Full 
                           1000baseT/Full 
   Supported pause frame use: No 
   Supports auto-negotiation: Yes 
   Advertised link modes:  10baseT/Half 10baseT/Full 
                           100baseT/Half 100baseT/Full 
                           1000baseT/Full 
   Advertised pause frame use: No 
   Advertised auto-negotiation: Yes 
   Speed: 1000Mb/s 
   Duplex: Full 
   Port: Twisted Pair 
   PHYAD: 1 
   Transceiver: internal 
   Auto-negotiation: on 
   MDI-X: Unknown 
   Supports Wake-on: pumbg 
   Wake-on: g 
   Current message level: 0x00000003 (3) 
                drv probe 
   Link detected: yes 


ethtool -S eth1 | grep err 
     rx_crc_errors: 0 
     rx_missed_errors: 1 
     tx_aborted_errors: 0 
     tx_carrier_errors: 0 
     tx_window_errors: 0 
     tx_deferred_ok: 0 
     rx_long_length_errors: 0 
     rx_short_length_errors: 0 
     rx_align_errors: 0 
     rx_errors: 0 
     tx_errors: 0 
     rx_length_errors: 0 
     rx_over_errors: 0 
     rx_frame_errors: 0 
     rx_fifo_errors: 1 
     tx_fifo_errors: 0 
     tx_heartbeat_errors: 0 
     rx_queue_0_csum_err: 0 
     rx_queue_1_csum_err: 0 
     rx_queue_2_csum_err: 0 
     rx_queue_3_csum_err: 0 
     rx_queue_4_csum_err: 0 
     rx_queue_5_csum_err: 0 
     rx_queue_6_csum_err: 0 
     rx_queue_7_csum_err: 0
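
The `grep err` filter above also matches names such as `tx_deferred_ok` (because "deferred" contains "err") and prints every zero counter. A small filter that keeps only nonzero counters makes it easier to compare the four servers. This is a sketch under the assumption that the input is in the `name: value` form shown above; the function name is hypothetical.

```shell
#!/bin/sh
# Read "name: value" counter lines on stdin and print only the counters
# whose value is a nonzero integer, as "name value".
nonzero_counters() {
    awk -F':' 'NF >= 2 {
        name = $1; value = $2
        gsub(/[ \t]/, "", name)      # strip indentation from the name
        gsub(/[ \t\r]/, "", value)   # strip padding from the value
        if (value ~ /^[0-9]+$/ && value + 0 > 0) print name, value
    }'
}
```

Usage: `ethtool -S eth1 | nonzero_counters`. For the listing above this would leave only `rx_missed_errors` and `rx_fifo_errors`, each at 1. Running it twice a few minutes apart on each server shows whether any counter is actually growing.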

memcached stats:

STAT listen_disabled_num 0 
STAT curr_connections 59
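
`listen_disabled_num` counts how many times memcached stopped accepting new connections because it reached its connection limit, so a value of 0 suggests the limit was not hit. To watch these values over time rather than from a single snapshot, a helper like the one below can pull one value out of the `stats` output. It is a sketch only: the function name is hypothetical, and the host name in the usage note is taken from the error message and may differ from the real address.

```shell
#!/bin/sh
# Print the value of a single "STAT <name> <value>" line from memcached
# "stats" output read on stdin. The protocol ends lines with CRLF, so a
# trailing carriage return is stripped from the value.
memcached_stat() {
    awk -v key="$1" '$1 == "STAT" && $2 == key {
        sub(/\r$/, "", $3)
        print $3
    }'
}
```

Usage, assuming `nc` is installed on the front end: `printf 'stats\r\nquit\r\n' | nc memcached_server 11211 | memcached_stat curr_connections`. Logging `curr_connections` and `listen_disabled_num` once a minute and lining the log up with the timestamps of the timeout errors would show whether the errors coincide with connection spikes.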

All servers communicate with each other over a gigabit VLAN, and the traffic inside it does not exceed 50 Mbps. The error occurs on only 3 of the servers; the fourth has had no errors (at least this week). The front ends run nginx + php5-fpm with the memcache module.
I would appreciate any advice or help.
Thank you.

1 answer(s)

ShamblerR, 2015-03-20
@ShamblerR

What makes you think that "Server memcached_server (tcp 11211) failed with: Connection timed out (110)" has anything to do with the network at all?
I can give you a dozen examples where it fails exactly like this because of server settings or code.
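
As one illustration of a server-side setting that can produce connect timeouts without any fault in the network: the question's configuration tunes `nf_conntrack_max`, so connection tracking is loaded, and when the conntrack table fills up the kernel drops new packets. Whether that is happening here is an assumption that the thread does not confirm. The sketch below only computes how full the table is; the function name is hypothetical.

```shell
#!/bin/sh
# Print conntrack table usage as a whole-number percentage.
# $1 = current number of tracked connections, $2 = configured maximum.
conntrack_usage_pct() {
    [ "$2" -gt 0 ] || return 1
    echo $(( $1 * 100 / $2 ))
}

# On a live host (these files exist only while nf_conntrack is loaded):
#   conntrack_usage_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                       "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
# The kernel also logs a "table full, dropping packet" message when the
# limit is reached:
#   dmesg | grep -i 'table full'
```

Comparing this figure on a failing front end, on the healthy one, and on the memcached server at the moment the errors appear would either support or rule out this particular cause.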