Skip to content

keep alive ping timeouts / "Failed to connect to remote host: FD Shutdown" #38282

Open
@ginoledesma

Description

@ginoledesma

What version of gRPC and what language are you using?

1.68.0, 1.68.1
python (client-side)
C++ (server-side)

What operating system (Linux, Windows,...) and version?

$ uname -a
Linux toaster-8f489 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.4 LTS
Release:	22.04
Codename:	jammy

What runtime / compiler are you using (e.g. python version or version of gcc)

Using python 3.12 x86-64 docker image

python 3.12.7

What did you do?

In gRPC 1.68.x and later, something changed when grpc.keepalive_time_ms is enabled, causing keepalive pings to fail when there is no traffic between the client and server.

When the gRPC client is idle for at least the same amount as the ping timeout after sending initial traffic (e.g. unary call), the keep alive ping times out and can mark the channel unavailable.

I was looking for FD Shutdown issues and stumbled on #22088 among others; the suggested work-around of "aligning versions" (by downgrading to grpcio 1.67.x) made me think the issues are could be related.

The following is a reproducible test case:

  1. Run the greeter_server from the hello world example in the C++ QuickStart. I used a version compiled against grpc 1.67.1 and 1.68.1 for this example.
./greeter_server
  1. Run a modified version of the python greeter_client.py:
python3 -m pip install -U 'grpcio>=1.68.0'
python3 ./greeter_client.py target-host:50051

The notable changes here are passing the channel options.

  • grpc.keepalive_time_ms: 2 seconds
  • grpc.http2.ping_timeout_ms: 10 seconds

I set the keep alive time to 2 seconds and the ping timeout to 10 seconds to reduce the amount of waiting. If I kept the default value of 1 minute for grpc.http2.ping_timeout_ms and also made my idle time the same, we see the same issue.

What did you expect to see?

The client should work continuously without interruption.

This was the case when run against grpcio 1.67.1 and older.
This is also the case when running the C++ version of the greeter_client code in both v1.67.x and v1.68.x.

What did you see instead?

On the client side, the channel would be marked UNAVAILABLE. Enabling verbose/trace logs indicated this was due to a ping timeout. See the attached logs of 1.67-run.log and 1.68-run.log for comparison. These were obtained by running:

GRPC_TRACE=health_check_client,http,http2_ping,http_keepalive GRPC_VERBOSITY=debug python3 ./greeter_client.py 192.168.60.151:50051

With these logs enabled, this caught my attention:

I0000 00:00:1734041240.934819    1624 writing.cc:758] CLIENT[0x7f35a0001500]: Set ping timeout timer of 10000ms for ping id 13964267164106753470
I0000 00:00:1734041240.934834    1624 chttp2_transport.cc:2822] ipv4:192.168.60.151:50051: Start BDP ping err=OK
I0000 00:00:1734041240.934846    1624 chttp2_transport.cc:3013] ipv4:192.168.60.151:50051: Keepalive ping cancelled. Resetting timer.
2024-12-12 22:07:20,935 Greeter client received: Hello you; sleeping 10 seconds
...
I0000 00:00:1734041250.936770    1624 chttp2_transport.cc:1796] ipv4:192.168.60.151:50051: Ping timeout. Closing transport.
I0000 00:00:1734041250.936848    1624 chttp2_transport.cc:923] W:0x7f35a0001500 CLIENT [ipv4:192.168.60.151:50051] state IDLE -> WRITING [GOAWAY_SENT]
I0000 00:00:1734041250.936942    1624 chttp2_transport.cc:923] W:0x7f35a0001500 CLIENT [ipv4:192.168.60.151:50051] state WRITING -> WRITING+MORE [RST_
STREAM]
I0000 00:00:1734041250.936961    1624 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:ping timeout {grpc_status:14, created_time:"2024-
12-12T22:07:30.93686731+00:00"}
...
I0000 00:00:1734041250.937497    1637 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:goaway sent {created_time:"2024-12-12T22:07:30.93
7479352+00:00"}
I0000 00:00:1734041250.937572    1637 chttp2_transport.cc:923] W:0x7f35a0001500 CLIENT [ipv4:192.168.60.151:50051] state WRITING -> IDLE [finish writi
ng]
I0000 00:00:1734041250.937605    1637 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:Delayed close due to in-progress write {created_t
ime:"2024-12-12T22:07:30.937111861+00:00", children:[UNKNOWN:ping timeout {grpc_status:14, created_time:"2024-12-12T22:07:30.93686731+00:00"}, UNKNOWN
:goaway sent {grpc_status:14, created_time:"2024-12-12T22:07:30.937479352+00:00"}]}
I0000 00:00:1734041250.937706    1637 chttp2_transport.cc:3033] transport 0x7f35a0001500 set connectivity_state=4; status=OK; reason=close_transport
2024-12-12 22:07:30,937 gRPC error <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "ping timeout"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"ping timeout", grpc_status:14, created_time:"2024-12-12T22:07:30.937145
783+00:00"}"
>
I0000 00:00:1734041250.937870    1637 chttp2_transport.cc:1678] 0x7f35a0001500 CANCEL PINGS: UNKNOWN:FD Shutdown {created_time:"2024-12-12T22:07:30.93
7736449+00:00", children:[UNAVAILABLE:endpoint shutdown]}

If the client attempted to send another unary call during this period when the ping timeout occurred, there is a chance that the channel is in an unavailable state (especially if there are no other child endpoints) and raise the error.

Anything else we should know about your project / environment?

Other observations:

  • If there is regular traffic on the channel between the keep-alive/ping timeout values, this problem doesn't seem to happen.
  • If there are multiple backing children on the target endpoint (e.g. target hostname resolves to N IPs), the amount of ping timeouts is also linear. But depending on the load-balancing policy and/or retry behavior, the problem can be masked by using an available channel.
  • Toggling the value of grpc.keepalive_permit_without_calls with either 0 or 1 doesn't seem to affect the results

1.68-run.log
1.67-run.log

Work-arounds:

  1. Downgrade to grpc 1.67.x on the client side
  2. Disable keepalive

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

    Development

    No branches or pull requests

      Participants

      @sergiitk@ginoledesma@gnossen@sourabhsinghs@sreenithi

      Issue actions

        keep alive ping timeouts / "Failed to connect to remote host: FD Shutdown" · Issue #38282 · grpc/grpc