Description
What happened?
I have 3 etcd nodes, and clients operate etcd through http.
The operations of the client are only http requests, and the request types include GET, PUT, and long polling for Watch scenarios. Actually, the client is APISIX, and APISIX internally uses lua
to operate etcd through HTTP protocol.
However, when there are many connections on the etcd node, operating the etcd node through http will be particularly slow.
Show the number of connections on 3 nodes respectively:
[root@knode10-132-15-158 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
263
[root@knode10-132-15-159 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
7
[root@knode10-132-15-160 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
98
Then, on each etcd node, query the same key by curl:
1、Request the etcd server of 10-132-15-158 node, which has the most 263 connections, the result is that the http request takes the longest (7 seconds).
[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.158:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM -k
{"header":{"cluster_id":"5357725075044320405","member_id":"16420066870069275712","revision":"2695209","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}
real 0m7.132s
user 0m0.061s
sys 0m0.063s
2、Request the etcd server of 10-132-15-159 node, which has the 9 connections, the result is that the http request takes the longest (0.097 seconds).
[root@knode10-132-15-159 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.159:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"1021713008895288175","revision":"2695484","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}
real 0m0.097s
user 0m0.053s
sys 0m0.037s
3、Request the etcd server of 10-132-15-160 node, which has the 90 connections, the result is that the http request takes the longest (0.092 seconds).
[root@knode10-132-15-160 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.160:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"2577531103840814189","revision":"2695532","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}
real 0m0.092s
user 0m0.053s
sys 0m0.034s
It can be explained above that the etcd node with the largest number of connections also takes longer to respond to http requests.
Now I'll move the connections from knode10-132-15-158 to knode10-132-15-160 and try again(actually, just let APISIX connect to node knode10-132-15-160 ).
[root@knode10-132-15-158 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
97
[root@knode10-132-15-160 operation]# netstat -apn |grep ESTAB |grep "2379"|wc -l
260
Request the etcd server of node 10-132-15-158, the response cost 0.159s
[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.158:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"16420066870069275712","revision":"2697372","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}
real 0m0.159s
user 0m0.067s
sys 0m0.085s
Request the etcd server of node 10-132-15-160, the response cost 10.335s
[root@knode10-132-15-158 operation]# time curl -X POST -d '{"key": "L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA=="}' https://10.132.15.160:2379/v3/kv/range --cacert /etc/etcd/etcdSSL/ca.pem --cert /etc/etcd/etcdSSL/clientssl/apisix/apisix.pem --key /etc/etcd/etcdSSL/clientssl/apisix/apisix-key.pem --key-type PEM --cert-type PEM
{"header":{"cluster_id":"5357725075044320405","member_id":"2577531103840814189","revision":"2697372","raft_term":"7"},"kvs":[{"key":"L2FwaXNpeC91cHN0cmVhbXMvbXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MA==","create_revision":"2318448","mod_revision":"2694680","version":"142","value":"eyJ0eXBlIjoicm91bmRyb2JpbiIsImxhYmVscyI6eyJlbnYtdHlwZSI6InByb2R1Y3QiLCJhcHBjb2RlIjoiYmFpem95b2F1dGgyYWRhcHRlci1kYXRhIiwicGxhdGZvcm0iOiJhb3MiLCJ2ZXJzaW9uIjoiMzk0MTYyIn0sImlkIjoibXMtYW9zLWJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YS1wcm9kdWN0LWdwLTM5NDE2Mi04MCIsInVwZGF0ZV90aW1lIjoxNjU2NTgyMDUyLCJjcmVhdGVfdGltZSI6MTY1NjMyNjQyOCwibmFtZSI6Im1zLWFvcy1iYWl6b3lvYXV0aDJhZGFwdGVyLWRhdGEtcHJvZHVjdC1ncC0zOTQxNjItODAiLCJoYXNoX29uIjoidmFycyIsInBhc3NfaG9zdCI6InBhc3MiLCJkZXNjIjoidXBzdHJlYW0gZm9yIGFwcGNvZGU6IGJhaXpveW9hdXRoMmFkYXB0ZXItZGF0YSBhbmQgdmVyc2lvbjogMzk0MTYyIiwibm9kZXMiOnsiMTAuMTMxLjg2LjEwMDo4MCI6MTAsIjEwLjEzMS44Ny4yMjg6ODAiOjEwfSwic2NoZW1lIjoiaHR0cCJ9"}],"count":"1"}
real 0m10.335s
user 0m0.067s
sys 0m0.076s
I would like to ask, what is the source of this problem? what should be done to fix this problem?
What did you expect to happen?
none
How can we reproduce it (as minimally and precisely as possible)?
I'm not sure about simulating hundreds of http long polling connections with some tool.
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5
Etcd configuration (command line flags or environment variables)
ETCD_ARGS="--cert-file=/etc/etcd/etcdSSL/servers/10.132.15.158/etcdserver-10.132.15.158.pem
--key-file=/etc/etcd/etcdSSL/servers/10.132.15.158/etcdserver-10.132.15.158-key.pem
--peer-cert-file=/etc/etcd/etcdSSL/peers/10.132.15.158/etcdpeer-10.132.15.158.pem
--peer-key-file=/etc/etcd/etcdSSL/peers/10.132.15.158/etcdpeer-10.132.15.158-key.pem
--trusted-ca-file=/etc/etcd/etcdSSL/ca.pem
--peer-trusted-ca-file=/etc/etcd/etcdSSL/ca.pem
--name etcd1
--enable-grpc-gateway
--data-dir /DATA1/etcd
--client-cert-auth
--peer-client-cert-auth
--experimental-initial-corrupt-check
--experimental-corrupt-check-time 6h
--initial-advertise-peer-urls https://10.132.15.158:2380
--listen-peer-urls https://10.132.15.158:2380
--listen-client-urls https://10.132.15.158:2379,https://127.0.0.1:2379,https://10.132.15.158:4001
--advertise-client-urls https://10.132.15.158:2379,https://10.132.15.158:4001
--initial-cluster-token etcd-apisix-product
--initial-cluster etcd1=https://10.132.15.158:2380,etcd2=https://10.132.15.159:2380,etcd3=https://10.132.15.160:2380
--initial-cluster-state new
--heartbeat-interval 2000
--election-timeout 10000
--snapshot-count 10000
--auto-compaction-mode periodic
--auto-compaction-retention 30m
--max-request-bytes 33554432
--quota-backend-bytes 34359738368
--logger zap
--log-level debug
--backend-bbolt-freelist-type=map"
Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+
| e2dda917d99636f | started | etcd2 | https://10.132.15.159:2380 | https://10.132.15.159:2379,https://10.132.15.159:4001 | false |
| 23c53ae1f1f3c46d | started | etcd3 | https://10.132.15.160:2380 | https://10.132.15.160:2379,https://10.132.15.160:4001 | false |
| e3dfcbd232a3ac40 | started | etcd1 | https://10.132.15.158:2380 | https://10.132.15.158:2379,https://10.132.15.158:4001 | false |
+------------------+---------+-------+----------------------------+-------------------------------------------------------+------------+
$ etcdctl --endpoints=<member list> endpoint status -w table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.132.15.158:2379 | e3dfcbd232a3ac40 | 3.5.4 | 18 MB | false | false | 7 | 2698464 | 2698464 | |
| https://10.132.15.159:2379 | e2dda917d99636f | 3.5.4 | 18 MB | false | false | 7 | 2698464 | 2698464 | |
| https://10.132.15.160:2379 | 23c53ae1f1f3c46d | 3.5.4 | 18 MB | true | false | 7 | 2698464 | 2698464 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Relevant log output
No response
Activity
ahrtr commentedon Jun 30, 2022
This issue should have already been resolved by 14169, which isn't merged yet.
Can you try to reproduce this issue using above PR?
hansedong commentedon Jul 1, 2022
@ahrtr
In my usage scenario, I recompile and deploy your repo, and after about 1 hour of testing, this problem is confirmed to be resolved. thank you very much for your help.
I would like to ask, will the 3.5.5 version include this change? When will version 3.5.5 be released?
ahrtr commentedon Jul 1, 2022
Thanks for the feedback.
Yes, the PR will be cherry picked to 3.5.
We are still working on 3.5.5, but the exact date is not clear yet so far. FYI #14138
MaxConcurrentStreams
for http2 #14169ahrtr commentedon Jul 13, 2022
Resolved by #14169 and #14219