Description
What version of gRPC and what language are you using?
1.68.0, 1.68.1
python (client-side)
C++ (server-side)
What operating system (Linux, Windows,...) and version?
$ uname -a
Linux toaster-8f489 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
What runtime / compiler are you using (e.g. python version or version of gcc)
Using python 3.12 x86-64 docker image
python 3.12.7
What did you do?
In gRPC 1.68.x and later, something changed when grpc.keepalive_time_ms
is enabled, causing keepalive pings to fail when there is no traffic between the client and server.
When the gRPC client is idle for at least the same amount as the ping timeout after sending initial traffic (e.g. unary call), the keep alive ping times out and can mark the channel unavailable.
I was looking for FD Shutdown
issues and stumbled on #22088 among others; the suggested work-around of "aligning versions" (by downgrading to grpcio 1.67.x) made me think the issues are could be related.
The following is a reproducible test case:
- Run the
greeter_server
from the hello world example in the C++ QuickStart. I used a version compiled against grpc 1.67.1 and 1.68.1 for this example.
./greeter_server
- Run a modified version of the python greeter_client.py:
python3 -m pip install -U 'grpcio>=1.68.0'
python3 ./greeter_client.py target-host:50051
The notable changes here are passing the channel options.
grpc.keepalive_time_ms
: 2 secondsgrpc.http2.ping_timeout_ms
: 10 seconds
I set the keep alive time to 2 seconds and the ping timeout to 10 seconds to reduce the amount of waiting. If I kept the default value of 1 minute for grpc.http2.ping_timeout_ms
and also made my idle time the same, we see the same issue.
What did you expect to see?
The client should work continuously without interruption.
This was the case when run against grpcio 1.67.1 and older.
This is also the case when running the C++ version of the greeter_client code in both v1.67.x and v1.68.x.
What did you see instead?
On the client side, the channel would be marked UNAVAILABLE
. Enabling verbose/trace logs indicated this was due to a ping timeout. See the attached logs of 1.67-run.log
and 1.68-run.log
for comparison. These were obtained by running:
GRPC_TRACE=health_check_client,http,http2_ping,http_keepalive GRPC_VERBOSITY=debug python3 ./greeter_client.py 192.168.60.151:50051
With these logs enabled, this caught my attention:
I0000 00:00:1734041240.934819 1624 writing.cc:758] CLIENT[0x7f35a0001500]: Set ping timeout timer of 10000ms for ping id 13964267164106753470
I0000 00:00:1734041240.934834 1624 chttp2_transport.cc:2822] ipv4:192.168.60.151:50051: Start BDP ping err=OK
I0000 00:00:1734041240.934846 1624 chttp2_transport.cc:3013] ipv4:192.168.60.151:50051: Keepalive ping cancelled. Resetting timer.
2024-12-12 22:07:20,935 Greeter client received: Hello you; sleeping 10 seconds
...
I0000 00:00:1734041250.936770 1624 chttp2_transport.cc:1796] ipv4:192.168.60.151:50051: Ping timeout. Closing transport.
I0000 00:00:1734041250.936848 1624 chttp2_transport.cc:923] W:0x7f35a0001500 CLIENT [ipv4:192.168.60.151:50051] state IDLE -> WRITING [GOAWAY_SENT]
I0000 00:00:1734041250.936942 1624 chttp2_transport.cc:923] W:0x7f35a0001500 CLIENT [ipv4:192.168.60.151:50051] state WRITING -> WRITING+MORE [RST_
STREAM]
I0000 00:00:1734041250.936961 1624 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:ping timeout {grpc_status:14, created_time:"2024-
12-12T22:07:30.93686731+00:00"}
...
I0000 00:00:1734041250.937497 1637 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:goaway sent {created_time:"2024-12-12T22:07:30.93
7479352+00:00"}
I0000 00:00:1734041250.937572 1637 chttp2_transport.cc:923] W:0x7f35a0001500 CLIENT [ipv4:192.168.60.151:50051] state WRITING -> IDLE [finish writi
ng]
I0000 00:00:1734041250.937605 1637 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:Delayed close due to in-progress write {created_t
ime:"2024-12-12T22:07:30.937111861+00:00", children:[UNKNOWN:ping timeout {grpc_status:14, created_time:"2024-12-12T22:07:30.93686731+00:00"}, UNKNOWN
:goaway sent {grpc_status:14, created_time:"2024-12-12T22:07:30.937479352+00:00"}]}
I0000 00:00:1734041250.937706 1637 chttp2_transport.cc:3033] transport 0x7f35a0001500 set connectivity_state=4; status=OK; reason=close_transport
2024-12-12 22:07:30,937 gRPC error <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "ping timeout"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"ping timeout", grpc_status:14, created_time:"2024-12-12T22:07:30.937145
783+00:00"}"
>
I0000 00:00:1734041250.937870 1637 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:FD Shutdown {created_time:"2024-12-12T22:07:30.93
7736449+00:00", children:[UNAVAILABLE:endpoint shutdown]}
If the client attempted to send another unary call during this period when the ping timeout occurred, there is a chance that the channel is in an unavailable state (especially if there are no other child endpoints) and raise the error.
Anything else we should know about your project / environment?
Other observations:
- If there is regular traffic on the channel between the keep-alive/ping timeout values, this problem doesn't seem to happen.
- If there are multiple backing children on the target endpoint (e.g. target hostname resolves to N IPs), the amount of ping timeouts is also linear. But depending on the load-balancing policy and/or retry behavior, the problem can be masked by using an available channel.
- Toggling the value of
grpc.keepalive_permit_without_calls
with either0
or1
doesn't seem to affect the results
Work-arounds:
- Downgrade to grpc 1.67.x on the client side
- Disable keepalive
Activity
[-]"Failed to connect to remote host: FD Shutdown" caused by keep alive ping timeouts[/-][+]keep alive ping timeouts / "Failed to connect to remote host: FD Shutdown"[/+]grpcio
versions on pre-3.13 versions of Python hatchet-dev/hatchet-python#2948 remaining items