Skip to content

when etcd node have many http long polling connections, it may cause etcd to respond slowly to http requests. #14185

Closed
@hansedong

Description

@hansedong

What happened?

I have 3 etcd nodes, and clients operate etcd through http.

The operations of the client are only http requests, and the request types include GET, PUT, and long polling for Watch scenarios. Actually, the client is APISIX, and APISIX internally uses lua to operate etcd through HTTP protocol.

However, when there are many connections on the etcd node, operating the etcd node through http will be particularly slow.

Show the number of connections on 3 nodes respectively:

[root@knode10-132-15-158 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
263
[root@knode10-132-15-159 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
7
[root@knode10-132-15-160 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
98

Then, on each etcd node, query the same key by curl:

1、Request the etcd server of 10-132-15-158 node, which has the most 263 connections, the result is that the http request takes the longest (7 seconds).

[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.158:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM -k

{"header":{"cluster_id":"5357725075044320405","member_id":"16420066870069275712","revision":"2695209","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m7.132s
user	0m0.061s
sys	0m0.063s

2、Request the etcd server of 10-132-15-159 node, which has the 9 connections, the result is that the http request takes the longest (0.097 seconds).

[root@knode10-132-15-159 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.159:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"1021713008895288175","revision":"2695484","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m0.097s
user	0m0.053s
sys	0m0.037s

3、Request the etcd server of 10-132-15-160 node, which has the 90 connections, the result is that the http request takes the longest (0.092 seconds).

[root@knode10-132-15-160 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.160:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"2577531103840814189","revision":"2695532","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}
real	0m0.092s
user	0m0.053s
sys	0m0.034s

It can be explained above that the etcd node with the largest number of connections also takes longer to respond to http requests.

Now I'll move the connections from knode10-132-15-158 to knode10-132-15-160 and try again(actually, just let APISIX connect to node knode10-132-15-160 ).

[root@knode10-132-15-158 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
97
[root@knode10-132-15-160 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
260

Request the etcd server of node 10-132-15-158, the response cost 0.159s

[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.158:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"16420066870069275712","revision":"2697372","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m0.159s
user	0m0.067s
sys	0m0.085s

Request the etcd server of node 10-132-15-160, the response cost 10.335s

[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.160:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem   --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"2577531103840814189","revision":"2697372","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}

real	0m10.335s
user	0m0.067s
sys	0m0.076s

I would like to ask, what is the source of this problem? what should be done to fix this problem?

What did you expect to happen?

none

How can we reproduce it (as minimally and precisely as possible)?

I'm not sure about simulating hundreds of http long polling connections with some tool.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version

etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version

etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

ETCD_ARGS="--cert-file=/etc/etcd/etcdSSL/servers/10.132.15.158/etcdserver-10.132.15.158.pem
--key-file=/etc/etcd/etcdSSL/servers/10.132.15.158/etcdserver-10.132.15.158-key.pem
--peer-cert-file=/etc/etcd/etcdSSL/peers/10.132.15.158/etcdpeer-10.132.15.158.pem
--peer-key-file=/etc/etcd/etcdSSL/peers/10.132.15.158/etcdpeer-10.132.15.158-key.pem
--trusted-ca-file=/etc/etcd/etcdSSL/ca.pem
--peer-trusted-ca-file=/etc/etcd/etcdSSL/ca.pem
--name etcd1
--enable-grpc-gateway
--data-dir /DATA1/etcd
--client-cert-auth
--peer-client-cert-auth
--experimental-initial-corrupt-check
--experimental-corrupt-check-time 6h
--initial-advertise-peer-urls https://10.132.15.158:2380
--listen-peer-urls https://10.132.15.158:2380
--listen-client-urls https://10.132.15.158:2379,https://127.0.0.1:2379,https://10.132.15.158:4001
--advertise-client-urls https://10.132.15.158:2379,https://10.132.15.158:4001
--initial-cluster-token etcd-apisix-product
--initial-cluster etcd1=https://10.132.15.158:2380,etcd2=https://10.132.15.159:2380,etcd3=https://10.132.15.160:2380
--initial-cluster-state new
--heartbeat-interval 2000
--election-timeout 10000
--snapshot-count 10000
--auto-compaction-mode periodic
--auto-compaction-retention 30m
--max-request-bytes 33554432
--quota-backend-bytes 34359738368
--logger zap
--log-level debug
--backend-bbolt-freelist-type=map"

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table

+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+
|        ID        | STATUS  | NAME  |         PEER ADDRS         |                     CLIENT ADDRS                      | IS LEARNER |
+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+
|  e2dda917d99636f | started | etcd2 | https://10.132.15.159:2380 | https://10.132.15.159:2379,https://10.132.15.159:4001 |      false |
| 23c53ae1f1f3c46d | started | etcd3 | https://10.132.15.160:2380 | https://10.132.15.160:2379,https://10.132.15.160:4001 |      false |
| e3dfcbd232a3ac40 | started | etcd1 | https://10.132.15.158:2380 | https://10.132.15.158:2379,https://10.132.15.158:4001 |      false |
+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+

$ etcdctl --endpoints=<member list> endpoint status -w table

+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.132.15.158:2379 | e3dfcbd232a3ac40 |   3.5.4 |   18 MB |     false |      false |         7 |    2698464 |            2698464 |        |
| https://10.132.15.159:2379 |  e2dda917d99636f |   3.5.4 |   18 MB |     false |      false |         7 |    2698464 |            2698464 |        |
| https://10.132.15.160:2379 | 23c53ae1f1f3c46d |   3.5.4 |   18 MB |      true |      false |         7 |    2698464 |            2698464 |        |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

No response

Activity

ahrtr

ahrtr commented on Jun 30, 2022

@ahrtr
Member

This issue should have already been resolved by 14169, which isn't merged yet.

Can you try to reproduce this issue using above PR?

hansedong

hansedong commented on Jul 1, 2022

@hansedong
Author

@ahrtr

In my usage scenario, I recompile and deploy your repo, and after about 1 hour of testing, this problem is confirmed to be resolved. thank you very much for your help.

I would like to ask, will the 3.5.5 version include this change? When will version 3.5.5 be released?

ahrtr

ahrtr commented on Jul 1, 2022

@ahrtr
Member

Thanks for the feedback.

Yes, the PR will be cherry picked to 3.5.

We are still working on 3.5.5, but the exact date is not clear yet so far. FYI #14138

ahrtr

ahrtr commented on Jul 13, 2022

@ahrtr
Member

Resolved by #14169 and #14219

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @hansedong@ahrtr

        Issue actions

          when etcd node have many http long polling connections, it may cause etcd to respond slowly to http requests. · Issue #14185 · etcd-io/etcd