Description
While load-testing WebClient acting as a proxy, we can reproduce an issue where the connection pool holds connections whose HttpObjectEncoder is in an invalid state.
When the client gets such a connection from the pool and uses it to make a request to the destination server, the request immediately fails because the DefaultHttpRequest cannot be encoded by HttpObjectEncoder: the encoder's state is ST_CONTENT_NON_CHUNK (or another non-initial state) while it expects ST_INIT. In effect, the state machine inside the encoder is broken.
The issue is mainly observed when the client proxies heavy upstream data (i.e., uploads content to the destination server).
Disabling the connection pool works around the problem.
Observed exception:
```
io.netty.handler.codec.EncoderException: java.lang.IllegalStateException: unexpected message type: DefaultHttpRequest
	at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738) ~[netty-transport-4.1.16.Final.jar:4.1.16.Final]
	...
```
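For context, the error comes from a state check inside Netty's HttpObjectEncoder. Below is a simplified, illustrative sketch of that state machine (abbreviated from the netty-codec 4.1 class; not the actual source):

```java
import java.util.List;

import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.MessageToMessageEncoder;
import io.netty.handler.codec.http.HttpMessage;
import io.netty.util.internal.StringUtil;

// Abbreviated sketch of io.netty.handler.codec.http.HttpObjectEncoder.
// A pooled connection whose encoder was left in an ST_CONTENT_* state
// rejects the next HttpMessage, producing the exception shown above.
abstract class HttpObjectEncoderSketch extends MessageToMessageEncoder<Object> {

    private static final int ST_INIT = 0;
    private static final int ST_CONTENT_NON_CHUNK = 1;
    private static final int ST_CONTENT_CHUNK = 2;

    private int state = ST_INIT;

    @Override
    protected void encode(ChannelHandlerContext ctx, Object msg, List<Object> out)
            throws Exception {
        if (msg instanceof HttpMessage) {
            if (state != ST_INIT) {
                // The check that produces the reported IllegalStateException.
                throw new IllegalStateException(
                        "unexpected message type: " + StringUtil.simpleClassName(msg));
            }
            // ... encode the initial line and headers, then move to
            // ST_CONTENT_NON_CHUNK or ST_CONTENT_CHUNK ...
        }
        // ... content handling returns `state` to ST_INIT on LastHttpContent ...
    }
}
```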
Activity
lnxmad commented on Oct 5, 2017
I can confirm this issue. It is hard to recreate, but it often happens when the WebClient receives a 500 status from the server. @alexmaddriver How did you disable the connection pool?
violetagg commented on Oct 5, 2017
Hi,
Is it possible to test with the latest available Netty, netty-4.1.16.Final?
Thanks,
Violeta
violetagg commented on Oct 5, 2017
Please ignore that; I saw in the stack trace that you are using it already.
rstoyanchev commented on Oct 6, 2017
@alexmaddriver could you show a snippet that's representative of how the WebClient is used in this case? In particular, I presume you're using one of the bodyToXxx methods to consume the response content? And the version of the spring-webflux module is 5.0.0.RELEASE?
lhotari commented on Oct 6, 2017
There might be a timing issue with the cleanup action that gets subscribed by reactor.ipc.netty.ReactorNetty#registerForClose, called from reactor.ipc.netty.ReactorNetty#addHandlerAfterReactorCodecs. Perhaps what happens is that the action to remove the handler gets called after a new request flow has already started to use the existing connection. When the expected handler is missing from the Netty pipeline, one would get the exception that was reported.
When I was working on #176, I noticed a similar issue when I first had an invalid fix for the memory leak I had found. Coincidentally, that fix might also fix this issue, since it removes the possibility that the cleanup action gets signalled out of order. The reason this could help is that the onClose publisher used for the onInactive processor in reactor.ipc.netty.channel.ChannelOperations now subscribes to a processor in reactor.ipc.netty.channel.ServerContextHandler that is disposed and unsubscribed from the upstream closeFuture before the channel gets released back to the pool. I did this to fix a memory leak that was caused by active subscribers to the Channel's closeFuture being left hanging around and leaking memory. The same change might have other benefits.
Could you @alexmaddriver please try out #176? It might not be the most optimal fix for this issue, or even fix it at all, but it would be interesting to see if it changes anything. :)
alexmaddriver commented on Oct 6, 2017
Hello,
We use the latest spring-webflux 5.0.0.RELEASE.
@lnxmad: We disable the connection pool when creating the WebClient as follows:
```java
WebClient.builder()
        .clientConnector(new ReactorClientHttpConnector(options -> options.disablePool()))
        .build();
```
@rstoyanchev: we do not use bodyToXxx; instead we call targetResponse.body(toDataBuffers()) directly, because of a recent bug where the bodyToXxx methods used to close connections as a side effect. Since that was fixed in 0.7.0.RELEASE, we can now certainly switch to bodyToFlux.
The code snippet:
```java
protected Mono<ResponseEntity<Flux<DataBuffer>>> proxyPostRequest(ServerHttpRequest request) {
    return client
            .post()
            .uri(destinationUriGenerator.apply(request.getURI()))
            .headers(headers -> filterRequestHeaders(request, headers))
            .headers(this::setUserAgent)
            .body(request.getBody(), DataBuffer.class)
            .exchange()
            .map(toResponseEntity());
}

private Function<ClientResponse, ResponseEntity<Flux<DataBuffer>>> toResponseEntity() {
    return targetResponse -> new ResponseEntity<>(
            targetResponse.body(toDataBuffers()),
            targetResponse.statusCode()
    );
}
```
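For reference, the switch to bodyToFlux mentioned above would only change the mapping function; a minimal sketch of the variant (same hypothetical surrounding class and fields as the snippet above):

```java
// Variant of toResponseEntity() using bodyToFlux(DataBuffer.class) instead of
// body(toDataBuffers()); equivalent once the connection-closing side effect
// was fixed in 0.7.0.RELEASE / Spring Framework 5.0 GA.
private Function<ClientResponse, ResponseEntity<Flux<DataBuffer>>> toResponseEntity() {
    return targetResponse -> new ResponseEntity<>(
            targetResponse.bodyToFlux(DataBuffer.class),
            targetResponse.statusCode()
    );
}
```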
alexmaddriver commented on Oct 6, 2017
@lhotari, thank you for the explanation. Sure, we can give it a try and will come back with results.
In fact, we put quite some effort into switching from Vert.x to the WebFlux client to unify our product stack, and we are now very close to the RTM version, so the disable-connection-pool workaround is not good for us. Because of that, we are willing to put in the effort to contribute to fixing the problem, or at least to fully understand when the issue happens :)
rstoyanchev commented on Oct 6, 2017
@alexmaddriver with regards to bodyToXxx and the side effect of closing, indeed that was fixed in the 5.0 GA release. The close() method was removed in favor of using bodyToMono(Void.class) if you are not interested in the response content, along with this improvement.
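For illustration, a minimal sketch of that pattern (the WebClient instance and URI here are hypothetical; bodyToMono(Void.class) drains the response so the connection can be reused):

```java
// Sketch: consume and discard a response body that is not needed,
// replacing the removed close() method. Client and URI are hypothetical.
WebClient client = WebClient.create("http://example.org");
Mono<Void> drained = client.get()
        .uri("/status")
        .exchange()
        .flatMap(response -> response.bodyToMono(Void.class));
```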
violetagg commented on Oct 19, 2017
Hi,
As we have had several fixes recently that might be relevant, is it possible to test with the latest 0.7.1.BUILD-SNAPSHOT version?
Thanks,
Violeta
alexmaddriver commented on Oct 19, 2017
Just tried the latest build and the problem is still there. It is easily reproduced after a few HTTP errors, e.g. 404. After that, MessageToMessageEncoder gets into an invalid state.
alexmaddriver commented on Oct 19, 2017
We use a reactor-netty + WebFlux client based proxy server between clients and a content server.
Also, we wrote a client-emulation tool able to generate arbitrary load using both HTTP and HTTPS connectors, with any content size, including huge packages uploaded/downloaded in parallel. Such an emulator is very handy for discovering performance, concurrency, and memory-leak types of issues. If you want, our QEs can share more details so you can set up such testing in-house.
violetagg commented on Oct 19, 2017
@alexmaddriver I tried to use the snippet above to produce a reproducible example, but I'm still not able to. If you have something, can you share it with us? And yes, we are interested in the tool that you are developing; if you can share more details, that would be great.
36 remaining items
lhotari commented on May 1, 2018
@violetagg The java.lang.IllegalStateException: unexpected message type: DefaultHttpRequest error also appears with the test app provided in spring-cloud/spring-cloud-gateway#228. I tried the test app with the recent snapshot versions. The issue remains, but it seems to take longer to reproduce when using reactor-netty 0.7.7.BUILD-SNAPSHOT. I made a small change to the test app in my fork to see if consuming the body would change the results; it didn't. When client connection pooling is disabled, the issue cannot be reproduced with the spring-cloud-gw-issue-228 test app (`java -Dspring.cloud.gateway.httpclient.pool.type=DISABLED -jar target/gateway-0.0.1-SNAPSHOT.jar`).
Do you believe that fixing #323 and #324 will fix the issue reported in spring-cloud/spring-cloud-gateway#228?
violetagg commented on May 1, 2018
@lhotari Yes, I believe so.
violetagg commented on May 1, 2018
@lhotari can you try the fix in #343? It is for #323.
lhotari commented on May 1, 2018
I tested the fix in #343 and I wasn't able to reproduce the java.lang.IllegalStateException: unexpected message type: DefaultHttpRequest problem with the spring-cloud-gw-issue-228 test app. Great work @violetagg!
madgnome commented on May 15, 2018
I'm seeing this issue with the 0.7.7.RELEASE version (using Spring Boot 2.0.2).
Before seeing this exception, there's this one in the logs:
and then:
violetagg commented on May 15, 2018
@madgnome There might be different causes that lead to this IllegalStateException; any chance you could provide a reproducible scenario?
Thanks,
Violeta
madgnome commented on May 16, 2018
I tried to reproduce locally, but no luck. Disabling the pool fixed the issue in prod.
@violetagg Do you have any tips for reproducing this type of issue?
madgnome commented on May 16, 2018
I've turned on more logging, but not all of it, because it's too verbose unfortunately.
Successful log sequences:
Last log sequences before the first exception:
Shit starts to hit the fan here:
Log when the SSL exception happens, just before getting into the bad state:
Bad state log statements:
Hopefully that helps.
madgnome commented on May 17, 2018
I've managed to reproduce with the HttpClient logging handler logs enabled:
Last successful log sequences:
⚠️ They're in reverse order (older logs are at the end)
Shit starts to hit the fan from here:
Log when the SSL exception happens:
⚠️ They're in reverse order (older logs are at the end)
It looks like the SSL connection gets closed but the channel doesn't. Looking at previous logs following an SslCloseCompletionEvent, we can see the channel being unregistered.
madgnome commented on May 17, 2018
I've created a separate issue to track this: #361
anonymousXrandom commented on Jun 28, 2018
We had to disable the connection pool too. When POST requests contain a large body, it's easy to reproduce this issue. It's frustrating: without the pool, the number of requests per second is very limited by the number of available ephemeral ports.
violetagg commented on Jun 28, 2018
@anonymousXrandom What version of Reactor Netty do you use? We have fixes for this issue.
Please create a new issue and describe your scenario there.