
Unavailable error when MaxConnectionAge and MaxConnectionAgeGrace are enabled on the server #2767

Closed
wjywbs opened this issue Apr 12, 2019 · 15 comments · Fixed by #2919

@wjywbs
Contributor

wjywbs commented Apr 12, 2019

What version of gRPC are you using?

v1.18.0/v1.19.1

What version of Go are you using (go version)?

1.11

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I enabled MaxConnectionAge on a C++ server and called it from a Go client. When the connection age was reached, the RPCs in the Go client failed with an Unavailable error and the "transport is closing" message.

What did you expect to see?

No Unavailable errors.

What did you see instead?

I tested with both the age and grace time set to one minute, and the Go client reported lots of errors. Each RPC takes only a few seconds to complete, well within the one-minute grace period.

grpc-go v1.18.0/v1.19.1
rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: <nil>
rpc error: code = Unavailable desc = the connection is draining
rpc error: code = Unavailable desc = transport is closing

Reference: grpc/grpc#12295 (comment)

@menghanl
Contributor

Can you turn on logging for the client and server?

For the Go client: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info

For the C++ server, try: GRPC_VERBOSITY=DEBUG GRPC_TRACE=server_channel
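
For reference, the environment variables only need to be set on the client process; alternatively (as the test case later in this thread does), verbosity can be turned up from Go code. A minimal sketch, assuming the standard grpclog package:

```go
package main

import (
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Roughly equivalent to GRPC_GO_LOG_SEVERITY_LEVEL=info and
	// GRPC_GO_LOG_VERBOSITY_LEVEL=99: send info/warning/error logs to
	// stderr with a high verbosity so transport-level events show up.
	// Call this before creating any ClientConn.
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stderr, os.Stderr, os.Stderr, 99))

	// ... dial and issue RPCs as usual ...
}
```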

@dfawley
Member

dfawley commented May 2, 2019

ping @wjywbs - could you gather the requested debugging info, and do you have a minimal reproducible test case?

@wjywbs
Contributor Author

wjywbs commented May 2, 2019

Sure. I'll try to create a minimized demo.

@easwars
Contributor

easwars commented May 2, 2019

I tried with a Go server and Go client (which makes one connection to the server and repeatedly makes RPCs on it).
When the MaxConnectionAge and MaxConnectionAgeGrace on the server were much greater than the time taken to complete the RPCs, I saw the following in the client logs:
I0502 14:33:32.238317 58768 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc0007e5360, TRANSIENT_FAILURE
I0502 14:33:32.238395 58768 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc0007e5360, CONNECTING
I0502 14:33:32.242678 58768 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc0007e5360, READY
and the client kept running without any issues.

But when the MaxConnectionAge and MaxConnectionAgeGrace on the server were less than the time taken to complete the RPC, I saw the following on the client:
I0502 15:24:30.825472 83849 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000475c00, TRANSIENT_FAILURE
I0502 15:24:30.825538 83849 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000475c00, CONNECTING
I0502 15:24:30.829788 83849 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000475c00, READY
F0502 15:24:30.835994 83849 fortuneteller_client.go:83] Error received while calling callTellFortune: rpc error: code = Unavailable desc = transport is closing
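
For context, the server-side settings being exercised here look roughly like the following. A minimal sketch of a Go server; the 60s/60s values mirror the original report and are not from easwars' exact test:

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Close each client connection (via GOAWAY) once it reaches
	// MaxConnectionAge, allowing MaxConnectionAgeGrace for in-flight
	// RPCs to finish before the connection is torn down.
	s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      60 * time.Second,
		MaxConnectionAgeGrace: 60 * time.Second,
	}))

	ln, err := net.Listen("tcp", "127.0.0.1:8000")
	if err != nil {
		log.Fatal(err)
	}
	// Register services on s before serving in a real program.
	log.Fatal(s.Serve(ln))
}
```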

@easwars
Contributor

easwars commented May 2, 2019

I think this is what is happening:

Since MaxConnectionAge and MaxConnectionAgeGrace have passed on the server, it force-closes the connection.

And in the client, in http2Client.reader(), the call to t.framer.fr.ReadFrame returns EOF, and since this is not of type http2.StreamError, we call Close() on the transport, which closes all active streams with ErrConnClosing.

At this point, I don't know enough to say whether this is a bug or is WAI :-)
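
For readers unfamiliar with the transport code, the flow described above has roughly the following shape. This is a simplified, self-contained illustration, not the actual grpc-go source; readFrame and closeAllStreams are hypothetical stand-ins:

```go
package main

import (
	"errors"
	"fmt"
	"io"

	"golang.org/x/net/http2"
)

var errConnClosing = errors.New("transport is closing")

// readLoop mimics the shape of http2Client.reader(): a StreamError only
// resets the one affected stream, while any other error (such as io.EOF
// after the server force-closes the connection) tears down the whole
// transport, failing every active stream with errConnClosing, which callers
// see as codes.Unavailable.
func readLoop(readFrame func() (http2.Frame, error), closeAllStreams func(error)) {
	for {
		frame, err := readFrame()
		if err != nil {
			if _, ok := err.(http2.StreamError); ok {
				continue // only the affected stream is reset
			}
			closeAllStreams(errConnClosing)
			return
		}
		_ = frame // the real transport dispatches the frame here
	}
}

func main() {
	// Simulate the server closing the connection: the first read returns EOF.
	readLoop(
		func() (http2.Frame, error) { return nil, io.EOF },
		func(err error) { fmt.Println("all active RPCs fail with:", err) },
	)
}
```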

@dfawley
Member

dfawley commented May 2, 2019

If MaxConnectionAgeGrace passes, then this would be expected. From the issue linked, I understood that the grace period was not being reached and errors were still appearing. @wjywbs - can you confirm this?

@wjywbs
Contributor Author

wjywbs commented May 2, 2019

Yes, I set both the age and grace to 60s and each RPC finishes within a few seconds, but the Unavailable errors still appear many times.

@dfawley
Member

dfawley commented May 23, 2019

@wjywbs do you have a minimal reproducible test case we can use to replicate the problem you are seeing?

@hanjm

hanjm commented May 29, 2019

Hi @dfawley @wjywbs, I recently set MaxConnectionAge on a long-lived gRPC stream server and then saw a lot of "transport is closing" log messages, so I found this issue and wrote a test case to replicate the problem. You may need to use -count 100, because sometimes the test passes and sometimes it fails.

// 1. run case 1
// go test -run "TestKeepaliveMaxAge/case1-Fatal-Unavailable-transport is closing" -count 100
// result: test fail
//
package testcase1

import (
	"context"
	"google.golang.org/grpc"
	"google.golang.org/grpc/examples/route_guide/routeguide"
	"google.golang.org/grpc/grpclog"
	"google.golang.org/grpc/keepalive"
	"io"
	"log"
	"net"
	"net/http"
	"net/http/pprof"
	"os"
	"testing"
	"time"
)

type routeGuidService struct {
}

func (*routeGuidService) GetFeature(ctx context.Context, p *routeguide.Point) (*routeguide.Feature, error) {
	return &routeguide.Feature{}, nil
}

func (*routeGuidService) ListFeatures(p *routeguide.Rectangle, s routeguide.RouteGuide_ListFeaturesServer) error {
	return nil
}

func (*routeGuidService) RecordRoute(s routeguide.RouteGuide_RecordRouteServer) error {
	return nil
}

func (*routeGuidService) RouteChat(s routeguide.RouteGuide_RouteChatServer) error {
	go func() {
		err := s.Send(&routeguide.RouteNote{Message: "server send"})
		if err != nil {
			log.Printf("failed to send:%s", err)
		}
	}()
	for {
		_, err := s.Recv()
		if err != nil {
			if err == io.EOF {
				return nil
			}
			log.Printf("failed to recv:%s", err)
			return err
		}
	}
}

// go test -run "TestKeepaliveMaxAge/case1-Fatal-Unavailable-transport is closing" -count 100
func TestKeepaliveMaxAge(t *testing.T) {
	grpclog.SetLoggerV2(grpclog.NewLoggerV2WithVerbosity(os.Stderr, os.Stderr, os.Stderr, 10))
	keepAliveMaxConnectionAge := time.Second * 5

	// 1. start the server
	s := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge: keepAliveMaxConnectionAge,
		}),
	)
	routeguide.RegisterRouteGuideServer(s, &routeGuidService{})
	const addr = "127.0.0.1:8000"
	const pprofAddr = "127.0.0.1:8001"
	go func() {
		_ = pprof.Index
		err := http.ListenAndServe(pprofAddr, nil)
		if err != nil && err != http.ErrServerClosed {
			return
		}
	}()
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		t.Fatal(err)
		return
	}
	go func() {
		err := s.Serve(ln)
		if err != nil {
			t.Error(err)
			return
		}
	}()
	defer func() {
		s.GracefulStop()
	}()
	// wait for the server to start up
	time.Sleep(time.Millisecond * 100)
	// 2. start client
	cc, err := grpc.DialContext(context.Background(), addr, grpc.WithInsecure())
	if err != nil {
		t.Fatal(err)
		return
	}
	// go test -run "TestKeepaliveMaxAge/case1-Fatal-Unavailable-transport is closing" -count 100
	t.Run("case1-Fatal-Unavailable-transport is closing", func(t *testing.T) {
		timer := time.NewTimer(keepAliveMaxConnectionAge + time.Second)
		for {
			_, err := routeguide.NewRouteGuideClient(cc).GetFeature(context.Background(), &routeguide.Point{})
			if err != nil {
				// unexpected: the RPC fails here and the test Fatals.
				// With grpc.FailFast(false) (i.e. WaitForReady) it succeeds.
				t.Fatal(err)
				return
			}
			select {
			case <-timer.C:
				return
			default:
			}
		}
	})
}

@hanjm

hanjm commented Jun 2, 2019

Update: if I use grpc.FailFast(false) or grpc.WithDefaultCallOptions(grpc.FailFast(false)), it works fine.
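
A minimal sketch of the two workarounds mentioned above (grpc.FailFast(false) is the older spelling of grpc.WaitForReady(true)); the address is a placeholder:

```go
package main

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/examples/route_guide/routeguide"
)

func main() {
	ctx := context.Background()

	// Option A: make every call on the connection wait for a ready
	// transport instead of failing fast while the old one is draining.
	cc, err := grpc.DialContext(ctx, "127.0.0.1:8000",
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(grpc.WaitForReady(true))) // grpc.FailFast(false) on older releases
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Close()

	// Option B: opt in per call.
	if _, err := routeguide.NewRouteGuideClient(cc).GetFeature(
		ctx, &routeguide.Point{}, grpc.WaitForReady(true)); err != nil {
		log.Printf("GetFeature: %v", err)
	}
}
```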

@dfawley
Member

dfawley commented Jun 20, 2019

We suspect this may be fixed by #2857. Can you try with this change and let us know if it resolves your problems?

@wjywbs
Contributor Author

wjywbs commented Jul 11, 2019

I tested v1.22.0, and there were still many "rpc error: code = Unavailable desc = the connection is draining" errors. With grpc.WaitForReady(true), there were fewer errors, and the descriptions were "the connection is draining" and "transport is closing".

@dfawley
Member

dfawley commented Jul 11, 2019

It seems this might be a problem with how we handle the race between choosing a transport and the transport getting the GOAWAY from the server. I'll take a look next week.

@dfawley
Member

dfawley commented Jul 22, 2019

@wjywbs please try your workloads with the fix in #2919 and let us know if you experience any further problems. It's still technically possible that RPCs can fail, but it should be extremely unlikely now. Thanks!

@wjywbs
Contributor Author

wjywbs commented Jul 26, 2019

Thanks for your fix @dfawley. I tested it by sending RPCs for 16 minutes to servers with a 60s max connection age and grace. There were 4 occurrences of "rpc error: code = Unavailable desc = the connection is draining", which is far fewer than before. But it also means I need to check for this error and retry, since some requests can't simply be dropped.

I didn't test grpc.WaitForReady(true) this time.
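
A minimal sketch of the kind of client-side check-and-retry described above, assuming the status and codes packages; callRPC is a hypothetical stand-in for the actual unary call:

```go
package main

import (
	"context"
	"errors"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithRetry retries an RPC a few times when it fails with Unavailable
// (e.g. "the connection is draining" while the server rotates connections).
func callWithRetry(ctx context.Context, callRPC func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = callRPC(ctx); status.Code(err) != codes.Unavailable {
			return err // success, or a non-retryable error
		}
		// Back off briefly before retrying.
		select {
		case <-time.After(time.Duration(attempt+1) * 100 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

func main() {
	_ = callWithRetry(context.Background(), func(context.Context) error {
		return errors.New("placeholder RPC")
	})
}
```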

@lock lock bot locked as resolved and limited conversation to collaborators Jan 24, 2020