Description
Hello,
I will show here a simple scenario that reproduces the problem that I'm having with traefik. I have a traefik instance, configured with 1 backend and proxying all requests to it. The backend is just a uWSGI instance serving a 1024 bytes static file.
First I will present the results and what I could observe and then I will show how to reproduce the problem.
Both systems are running on the same host and traefik is proxying content using 127.0.0.1
. I'm using apache ab to generate the load.
Here is the current traefik config file:
graceTimeOut = 600
MaxIdleConnsPerHost = 100000
logLevel = "DEBUG"
defaultEntryPoints = ["http"]
[entryPoints]
[entryPoints.http]
address = ":90"
[web]
address = ":8082"
[file]
[backends]
[backends.backend1]
[backends.backend1.servers.server1]
url = "http://127.0.0.1:9993"
weight = 0
[backends.backend1.LoadBalancer]
method = "drr"
[backends.backend1.maxconn]
amount = 2048
extractorfunc = "request.host"
[frontends]
[frontends.frontend1]
backend = "backend1"
passHostHeader = true
[frontends.frontend1.routes.test_1]
rule = "Host:myhost"
I'm running this version of traefik, downloaded from github.
Version: v1.0.2
Codename: reblochon
Go version: go1.6.2
Built: 2016-08-02_05:35:02PM
OS/Arch: linux/amd64
And when I run apache ab, even with a low load I see sporadic failing requests:
The command: ab -c 2 -n 10 -H "Host: myhost" http://localhost:90/static/data
The result from apache ab:
Document Path: /static/data
Document Length: 1024 bytes
Concurrency Level: 2
Time taken for tests: 0.016 seconds
Complete requests: 10
Failed requests: 1
(Connect: 0, Receive: 0, Length: 1, Exceptions: 0)
Non-2xx responses: 1
Total transferred: 10849 bytes
HTML transferred: 9237 bytes
Requests per second: 630.28 [#/sec] (mean)
Time per request: 3.173 [ms] (mean)
Time per request: 1.587 [ms] (mean, across all concurrent requests)
Transfer rate: 667.76 [Kbytes/sec] received
And traefik logs shows:
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.703455ms
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 2.42295ms
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.628134ms
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 2.788108ms
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.647956ms
WARN[2016-08-12T18:01:14-03:00] Error forwarding to http://127.0.0.1:9993, err: net/http: HTTP/1 transport connection broken: readLoopPeekFailLocked: EOF
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.542621ms
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.591373ms
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.308187ms
DEBU[2016-08-12T18:01:14-03:00] Round trip: http://127.0.0.1:9993, code: 200, duration: 1.281541ms
Interesting to note is that only 9 requests are shown in uWSGI logs, that is, the failed request didn't even make to the backend server.
Important to note is that I can run this same ab
command directly to uWSGI instance and no requests fails. Even with higher loads, for example:
concurrency: 64, total requests: 2048
Document Path: /static/data
Document Length: 1024 bytes
Concurrency Level: 64
Time taken for tests: 0.463 seconds
Complete requests: 2048
Failed requests: 0
Total transferred: 2275328 bytes
HTML transferred: 2097152 bytes
Requests per second: 4419.50 [#/sec] (mean)
Time per request: 14.481 [ms] (mean)
Time per request: 0.226 [ms] (mean, across all concurrent requests)
Transfer rate: 4794.98 [Kbytes/sec] received
0 failed requests when running directly to uWSGI server.
Do you have any insights of what could be happening here? At the same time that I think traefik has a problem I think that uWSGI might be also part of the problem, because if I switch uWSGI with nginx, all requests works as expected. But how to explain the fact that uWSGI responds to all requests when accessed directly (not proxied by traefik)?
Thanks a LOT for any insights you may have on this problem.
How to reproduce
Running uWSGI
You need to install uwsgi. You can do this wih pip install uwsgi
. Create a static/
folder and create the 1024 bytes file inside it:
mkdir static
dd if=/dev/zero bs=1 count=1024 of=static/data
Run uwsgi with:
uwsgi --buffer-size 32768 --http-socket 127.0.0.1:9993 --processes 4 --http-timeout 60 --static-map2=/static=./
Running Traefik
Just run traefik with the provided config.
Running apache ab
ab -c 2 -n 10 -H "Host: myhost" http://localhost:90/static/data
Activity
andersbetner commentedon Aug 15, 2016
This is just my guess, but it seems like uwsgi closes a connection that traefik thinks should be active (connection: keep-alive).
There is an issue related to this in golang golang/go#15446 (resolved in golang 1.7 due out any day now...)
That issue only mentions POST requests, but the symptom you get is the same.
What happens if you disable keep-alive in uwsgi? Do you get the same result?
daltonmatos commentedon Aug 15, 2016
Hello @andersbetner, thanks for the insight. Actually my current uwsgi setup dos not add the
Connection: keep-alive
header. So my first results were gathered with keep-alive turned off.I jusst ran a quick test and added the option
--add-header "Connection: Keep-Alive"
to my uwsgi instance. The header is being added corectly:I re-ran the
ab
test and still see failing requests, same behavior.I also tried to compile traefik changing the
FROM golang:1.6.3
line toFROM golang:1.7
(inside build.Dockerfile), but it did not work. =DIndeed it will be a good test to dun traefik with Go 1.7.
Thanks!!
daltonmatos commentedon Sep 12, 2016
Hello all,
I did the following test:
Ran the same ab command line (2 concurent connections with 10 total requests), but this time I captured all the traffic with tcpdump. In this run I had 1 failed request.
Here is the tcpdump saved file: https://s3-us-west-2.amazonaws.com/daltonmatos-public/ab-lo.tcpdump
Note: In this test traefik is running on port
90
and uwsgi on port9000
.I opened the saved tcpdump file in wireshark and applied these filters:
tcp.dstport == 90 && tcp.connection.syn
Returns 10 results, that is, apache ab opened 10 connections to traefik.tcp.dstport == 9000 && tcp.connection.syn
Returns 9 results, that is, traefik is only connecting 9 times to uwsgi.http && http.request.uri == "/static/collect/ok" && tcp.dstport == 9000
Returns 9 results, thats is, only 9 http requests were sent to uwsgi.This shows me that traefik is missing one connection to my backend, in this case, uwsgi.
Any insight about what could cause this?
@emilevauge
errm commentedon Sep 12, 2016
@daltonmatos it looks to me like the issue linked to by @andersbetner is to blame. could you try your tests with
containous/traefik:pr-641
as that image should have been built with go 1.7...daltonmatos commentedon Sep 13, 2016
Ok @errm, I will test it again and report back. Thank you very much!
daltonmatos commentedon Sep 13, 2016
Hello @errm, I just ran the tests again. The error still occurs, but the error message is now much more understandable. Look:
This is the log of a fisrt run of 10 requests, concurrency of 2.
With this error I can now search about how uwsgi behaves with idle connections.
On a second run (same parameters) there was new errors.
Both runs resulted in 1 request failed, which means that traefik seems to be correctly retrying the request when this "EOF" error occurs.
About the idle connections, does it make sense to traefik detect this closed connection and automatically reconnect with the backend?
I will now search about uwsgi and idle connections.
Thanks a lot for the help!
Also, are there any plans to officially release traefik built with golang 1.7? Thanks!
daltonmatos commentedon Sep 13, 2016
Hey @errm and @andersbetner,
@andersbetner suggested earlier that keep alive could be the problem. At that time I "validated" this by only adding the
Connection: keep-alive
header in the responses but this is wrong since uwsgi is not really working with keep alive turned on, after all this is just a header like any other.I now found out the
--http-keepalive
option of uwsgi (I found it searching about idle connections being closed) and now it all works!I ran a test with 128 concurrent requests and a total of 10k requests, no errors!
At the same time that I thank you all for the help I also apologize for making you loose some time on this problem. Although the problem could have been solved before, the "idle connection" hint was the key to finally solve it.
Thank you very much!
Even so, I still hope to see traefik officially built on golang 1.7! ;-)
andersbetner commentedon Sep 13, 2016
@daltonmatos
No worries, I'm glad you solved your problem.
daltonmatos commentedon Sep 27, 2016
Just closing, since the problem is now solved.
Thanks all!
nathan-osman commentedon Dec 3, 2016
@daltonmatos I am having the exact same issue but uWSGI doesn't seem to have a
--http-keepalive
flag:What version of uWSGI were you using?
nathan-osman commentedon Dec 3, 2016
Seems like the
http
plugin needs to be loaded in order for the--http-keepalive
flag to work:However, even with that set, I'm still getting the same
http: server closed idle connection
error. Strangely enough, runningab
with the keepalive (-k
) flag produces the following output:The
Keep-Alive requests: 0
is leading me to believe that I'm still missing a setting or flag somewhere.Attempting the requests with cURL (using the
-v
flag) reveals the following:jpic commentedon Mar 2, 2017
@nathan-osman you need to add the http plugin, ie.
--plugin python27,http --http-keepalive