
akka.http.impl.engine.client.OutgoingConnectionBlueprint$UnexpectedConnectionClosureException: The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests #3481

Open
MartinMajor opened this issue Sep 25, 2020 · 18 comments
Labels
1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:client Issues related to the HTTP Client t:core Issues related to the akka-http-core module

Comments

@MartinMajor

Hello Hakkers,

Our stack:

  • Scala 2.12
  • Akka 2.6.8
  • Akka-http 10.2.0

After upgrading from akka-http 10.1.11 we started to randomly get akka.http.impl.engine.client.OutgoingConnectionBlueprint$UnexpectedConnectionClosureException: The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests exceptions. It happens on average once per 5000 requests. It is unlikely that it has anything to do with the target server, because the exception is usually returned about 1 ms after making the HTTP request, so there is not even enough time to ping the target server.

According to the logs it usually happens when one request ends and another follows immediately after it.

I've read this issue, so I tried to enable DEBUG logging for the connection pool; this is one occurrence:

2020-09-25T13:24:15.077Z DEBUG app/Pool(shared->https://domain:443) Dispatching request [POST /url Strict(415 bytes)] to pool
2020-09-25T13:24:15.077Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]Received response
2020-09-25T13:24:15.090Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]Before event [onResponseReceived] In state [WaitingForResponse] for [344 ms]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]onResponseReceived in WaitingForResponse with false
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponseDispatch)]After event [onResponseReceived] State change [WaitingForResponse] -> [WaitingForResponseDispatch]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponseDispatch)]Before event [onResponseDispatchable] In state [WaitingForResponseDispatch] for [0 ms]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponseEntitySubscription)]After event [onResponseDispatchable] State change [WaitingForResponseDispatch] -> [WaitingForResponseEntitySubscription]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponseEntitySubscription)]Before event [onResponseEntitySubscribed] In state [WaitingForResponseEntitySubscription] for [0 ms]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForEndOfResponseEntity)]After event [onResponseEntitySubscribed] State change [WaitingForResponseEntitySubscription] -> [WaitingForEndOfResponseEntity]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForEndOfResponseEntity)]Before event [onResponseEntityCompleted] In state [WaitingForEndOfResponseEntity] for [0 ms]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (Idle)]After event [onResponseEntityCompleted] State change [WaitingForEndOfResponseEntity] -> [Idle]
2020-09-25T13:24:15.091Z DEBUG app/Pool(shared->https://domain:443) [2 (Idle)]Slot became idle... Trying to pull
2020-09-25T13:24:15.094Z DEBUG app/Pool(shared->https://domain:443) [2 (Idle)]Dispatching request [POST /url Strict(415 bytes)]
2020-09-25T13:24:15.105Z DEBUG app/Pool(shared->https://domain:443) [2 (Idle)]Before event [onNewRequest] In state [Idle] for [16 ms]
2020-09-25T13:24:15.151Z DEBUG app/Pool(shared->https://domain:443) [2 (PushingRequestToConnection)]After event [onNewRequest] State change [Idle] -> [PushingRequestToConnection]
2020-09-25T13:24:15.153Z DEBUG app/Pool(shared->https://domain:443) [2 (PushingRequestToConnection)]Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]
2020-09-25T13:24:15.153Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]
2020-09-25T13:24:15.259Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]Connection failed
2020-09-25T13:24:15.260Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]Before event [onConnectionFailed] In state [WaitingForResponse] for [166 ms]
2020-09-25T13:24:15.261Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]Ongoing request [POST /url Strict(415 bytes)] is failed because of [connection failure]: [The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests]
2020-09-25T13:24:15.261Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponseDispatch)]After event [onConnectionFailed] State change [WaitingForResponse] -> [WaitingForResponseDispatch]
2020-09-25T13:24:15.261Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponseDispatch)]Before event [onResponseDispatchable] In state [WaitingForResponseDispatch] for [0 ms]
2020-09-25T13:24:15.261Z DEBUG app/Pool(shared->https://domain:443) [2 (Failed)]After event [onResponseDispatchable] State change [WaitingForResponseDispatch] -> [Failed]
2020-09-25T13:24:15.261Z DEBUG app/Pool(shared->https://domain:443) [2 (Failed)]State change from [WaitingForResponseDispatch] to [Failed(akka.http.impl.engine.client.OutgoingConnectionBlueprint$UnexpectedConnectionClosureException: The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests)]. Closing the existing connection.
2020-09-25T13:24:15.261Z DEBUG app/Pool(shared->https://domain:443) [2 (Unconnected)]Slot became idle... Trying to pull

The code that makes the call can be simplified to this:

  import akka.actor.typed.ActorSystem
  import akka.actor.typed.scaladsl.adapter._
  import akka.http.scaladsl.Http
  import akka.http.scaladsl.model.{ContentTypes, HttpMethod, HttpRequest, HttpResponse}
  import akka.http.scaladsl.model.headers.{Accept, RawHeader}
  import akka.http.scaladsl.settings.ConnectionPoolSettings
  import scala.collection.immutable
  import scala.concurrent.Future

  def doRequest(method: HttpMethod, url: String, body: String, headers: immutable.Seq[RawHeader])(implicit system: ActorSystem[_]): Future[HttpResponse] = {
    val request =
      HttpRequest(method, url)
        .withEntity(ContentTypes.`application/json`, body)
        .withHeaders(headers)
        .addHeader(Accept(ContentTypes.`application/json`.mediaType))

    val settings = ConnectionPoolSettings(system.toClassic)

    Http()(system.toClassic).singleRequest(request, settings = settings)
  }

and the settings are:

akka {
  http {
    host-connection-pool {
        max-connections = 16
        max-open-requests = 64
    }
  }
}

Thank you for any help!

@jrudolph jrudolph added the 0 - new Ticket is unclear on it's purpose or if it is valid or not label Sep 28, 2020
@jrudolph
Contributor

Thanks for the report, @MartinMajor.

It is unlikely that it has anything to do with the target server, because the exception is usually returned about 1 ms after making the HTTP request, so there is not even enough time to ping the target server.

According to this line:

2020-09-25T13:24:15.260Z DEBUG app/Pool(shared->https://domain:443) [2 (WaitingForResponse)]Before event [onConnectionFailed] In state [WaitingForResponse] for [166 ms]

it failed only after 166ms so that would allow a round-trip to the server I guess. Where do you get that 1ms measurement from?

The most likely explanation would be that the server closed the connection without setting Connection: close first on a response. I.e. you get a race condition between sending a new request and noticing that the server has closed the connection. This is the classical problem of HTTP/1.1 pipelining and there's little you can do about it aside from always running POST requests on a new connection.

@MartinMajor
Author

it failed only after 166ms so that would allow a round-trip to the server I guess. Where do you get that 1ms measurement from?

The 0–1 ms measurement was the most common timing when we had the debug level disabled; it was measured between calling .singleRequest() and the Future completing.

We are seeing this behavior both when calling external services and when we call other internal services using akka-http as a server (routed through nginx proxy).

Your explanation makes sense but the weird thing is that we didn't see this before upgrading to akka-http 10.2.0.

aside from always running POST requests on a new connection.

Are POST requests somehow special? Because we see the same problems with GETs as well.

Is there some way I can confirm that this is the case? According to the documentation, client-side pipelining is not currently supported. Can I force using a new connection for each request?

@jrudolph
Contributor

it failed only after 166ms so that would allow a round-trip to the server I guess. Where do you get that 1ms measurement from?

The 0–1 ms measurement was the most common timing when we had the debug level disabled; it was measured between calling .singleRequest() and the Future completing.

Logging is quite extensive so it might change timing a bit but it's not completely clear how it should mess up things to this degree.

Could you try capturing a trace with tcpdump / wireshark for this issue? With that we could pinpoint the issue pretty clearly between server and client.

We are seeing this behavior both when calling external services and when we call other internal services using akka-http as a server (routed through nginx proxy).

If you have access to the proxy maybe you could check if there are any settings that would limit the life-time of persistent connections to the backend. Common options might be a maximum number of requests to run per connection (though, in that case a Connection: close response header would be expected) or a maximum lifetime per connection.

Your explanation makes sense but the weird thing is that we didn't see this before upgrading to akka-http 10.2.0.

It could be a timing issue or a coincidence or a bug. But it's hard to say without more information.

Are POST requests somehow special? Because we see the same problems with GETs as well.

POST requests cannot be easily retried. GET requests will just be retried by default if they fail (because it's such a common scenario that a server closes a connection after a while).
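As a side note, that retry behavior is governed by the pool's max-retries setting; a sketch of the relevant configuration (the value shown is the default, not a recommendation):

```
akka.http.host-connection-pool {
  # Idempotent requests (e.g. GET) are transparently retried up to this many
  # times when the connection fails; non-idempotent requests (e.g. POST) are not.
  max-retries = 5
}
```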

Is there some way I can confirm that this is the case? According to the documentation, client-side pipelining is not currently supported. Can I force using a new connection for each request?

It's not about pipelining but about persistent connections which are really important for throughput. Using a new connection for each request means giving up these throughput benefits.

https://doc.akka.io/docs/akka-http/current/client-side/connection-level.html#opening-http-connections shows how to run requests through a connection manually. You could replace Source.queue with Source.single if you intend to run only a single request through a connection. If at all, I would recommend that only for requests that cannot be retried. (But of course it would be better if we could get to the bottom of this issue.)
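A minimal sketch of that connection-level approach, assuming one request per connection (the host name is a placeholder; the classic ActorSystem in implicit scope provides the materializer):

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

// Sketch: run one request over a dedicated, non-pooled connection.
// The connection is opened for this request only, so the stale keep-alive
// race cannot occur, at the cost of a TCP/TLS handshake per request.
def runOnFreshConnection(request: HttpRequest)(implicit system: ActorSystem): Future[HttpResponse] =
  Source.single(request)
    .via(Http().outgoingConnectionHttps("domain"))
    .runWith(Sink.head)
```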

@MartinMajor
Author

Hi, thank you for the response!

These are two examples of requests that failed much faster:

2020-09-28T13:51:04.469Z DEBUG app/Pool(shared->https://domain:443) Dispatching request [POST /url Strict(433 bytes)] to pool
2020-09-28T13:51:04.481Z DEBUG app/Pool(shared->https://domain:443) [0 (Idle)]Dispatching request [POST /url Strict(433 bytes)]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (Idle)]Before event [onNewRequest] In state [Idle] for [23 ms]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (PushingRequestToConnection)]After event [onNewRequest] State change [Idle] -> [PushingRequestToConnection]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (PushingRequestToConnection)]Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (WaitingForResponse)]After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (WaitingForResponse)]Connection failed
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (WaitingForResponse)]Before event [onConnectionFailed] In state [WaitingForResponse] for [0 ms]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (WaitingForResponse)]Ongoing request [POST /url Strict(433 bytes)] is failed because of [connection failure]: [The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (WaitingForResponseDispatch)]After event [onConnectionFailed] State change [WaitingForResponse] -> [WaitingForResponseDispatch]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (WaitingForResponseDispatch)]Before event [onResponseDispatchable] In state [WaitingForResponseDispatch] for [0 ms]
2020-09-28T13:51:04.486Z DEBUG app/Pool(shared->https://domain:443) [0 (Failed)]After event [onResponseDispatchable] State change [WaitingForResponseDispatch] -> [Failed]
2020-09-28T13:51:04.487Z DEBUG app/Pool(shared->https://domain:443) [0 (Failed)]State change from [WaitingForResponseDispatch] to [Failed(akka.http.impl.engine.client.OutgoingConnectionBlueprint$UnexpectedConnectionClosureException: The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests)]. Closing the existing connection.
2020-09-28T13:51:04.487Z DEBUG app/Pool(shared->https://domain:443) [0 (Unconnected)]Slot became idle... Trying to pull
2020-09-28T15:25:50.394Z DEBUG app/Pool(shared->https://domain:443) [1 (Idle)]After event [onResponseEntityCompleted] State change [WaitingForEndOfResponseEntity] -> [Idle]
2020-09-28T15:25:50.394Z DEBUG app/Pool(shared->https://domain:443) [1 (Idle)]Slot became idle... Trying to pull
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) Dispatching request [POST /url Strict(346 bytes)] to pool
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (Idle)]Dispatching request [POST /url Strict(346 bytes)]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (Idle)]Before event [onNewRequest] In state [Idle] for [2 ms]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (PushingRequestToConnection)]After event [onNewRequest] State change [Idle] -> [PushingRequestToConnection]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (PushingRequestToConnection)]Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (WaitingForResponse)]After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (WaitingForResponse)]Connection failed
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (WaitingForResponse)]Before event [onConnectionFailed] In state [WaitingForResponse] for [0 ms]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (WaitingForResponse)]Ongoing request [POST /url Strict(346 bytes)] is failed because of [connection failure]: [The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (WaitingForResponseDispatch)]After event [onConnectionFailed] State change [WaitingForResponse] -> [WaitingForResponseDispatch]
2020-09-28T15:25:50.396Z DEBUG app/Pool(shared->https://domain:443) [1 (WaitingForResponseDispatch)]Before event [onResponseDispatchable] In state [WaitingForResponseDispatch] for [0 ms]
2020-09-28T15:25:50.402Z DEBUG app/Pool(shared->https://domain:443) [1 (Failed)]After event [onResponseDispatchable] State change [WaitingForResponseDispatch] -> [Failed]
2020-09-28T15:25:50.403Z DEBUG app/Pool(shared->https://domain:443) [1 (Failed)]State change from [WaitingForResponseDispatch] to [Failed(akka.http.impl.engine.client.OutgoingConnectionBlueprint$UnexpectedConnectionClosureException: The http server closed the connection unexpectedly before delivering responses for 1 outstanding requests)]. Closing the existing connection.
2020-09-28T15:25:50.403Z DEBUG app/Pool(shared->https://domain:443) [1 (Unconnected)]Slot became idle... Trying to pull

Could you try capturing a trace with tcpdump / wireshark for this issue? With that we could pinpoint the issue pretty clearly between server and client.

I tried to run the application from localhost with Wireshark, but the issue doesn't happen that often, so I didn't catch one. I might try to deploy Wireshark to production for a while, but it will take some time to test that properly because I don't want to break production.

If you have access to the proxy maybe you could check if there are any settings that would limit the life-time of persistent connections to the backend. Common options might be a maximum number of requests to run per connection (though, in that case a Connection: close response header would be expected) or a maximum lifetime per connection.

I haven't found anything suspicious (which proves nothing; I might have just overlooked it). I will test that some more.

POST requests cannot be easily retried. GET requests will just be retried by default if they fail (because it's such a common scenario that a server closes a connection after a while).

Isn't it possible to retry even a non-GET request when it is detected that the connection was broken before the request was sent?

@MartinMajor
Author

Hi, I was able to reproduce the issue while recording the traffic, and I found a segment that I think is weird.

Note: this was the first time I used Wireshark and I'm a TCP novice, so I might be completely wrong.

Wireshark screenshot

  • I have filtered only the relevant TCP stream.
  • 391: The server sent a FIN packet.
  • 393: The client responded with an ACK, so akka-http must have known that this connection was already closing.
  • 394: The client tried to send another request on the closing connection (this seems weird to me).
  • 395: The client sent a FIN.
  • 397: The server ACKed the request, so the connection was closed.
  • 398: The server ACKed the FIN, so the connection was closed.

@jrudolph
Contributor

jrudolph commented Oct 5, 2020

Thanks for sharing that @MartinMajor.

In this case, it looks like the client could have known about the connection closure, but it is a race condition in any case.

One problem is that there are several layers involved in transporting a connection close signal to the pool. E.g. the ACKs are directly sent by the kernel without the application seeing anything about that. The application sees that the peer has closed the connection only when it reads from the connection the next time. But even then the information needs to trickle up to the pool first before the pool knows a connection has been closed.

For that reason, HTTP defines that a server should set a Connection: close header before closing the connection. Can you check whether the server set that header in the last response it sent?

@jrudolph
Contributor

jrudolph commented Oct 5, 2020

I just checked that at least in the latest version things work as expected. So, with a response Connection: close header no request would be sent to the connection again. Without that header, it depends on the timing.

@MartinMajor
Author

Thanks for the explanation! I've checked the Wireshark logs, and the server indeed sent Connection: keep-alive and then closed the connection, which is weird. I'll check the nginx configuration to see if I can find out why this is happening.

Do you have any idea why this started to happen after the upgrade from akka-http 10.1.11 to 10.2.0? I agree that the server's behavior is wrong, but it would be nice if akka-http could handle it anyway 😉

@jrudolph
Contributor

jrudolph commented Oct 5, 2020

Do you have any idea why this started to happen after upgrade from akka-http 10.1.11 -> 10.2.0 ?

Might be just a timing issue.

I agree that the server's behavior is wrong, but it would be nice if akka-http could handle it anyway 😉

There's no way to handle all cases because it's a timing issue and we cannot guess what the server is going to do.

@MartinMajor
Author

I've dug a little bit and I've been able to trace the problem to its roots:

We are using nginx as a load balancer in front of our backend services. When a service starts or stops, it is registered in Consul, and then, using consul-template, we generate a new nginx configuration and gracefully reload nginx. When this happens, nginx keeps serving in-flight requests but uses the new configuration for new requests. Unfortunately, there is not much it can do with keep-alive connections that aren't currently handling any request, so it closes them. As a result, the last HTTP response contains Connection: keep-alive, but a FIN packet follows. If you are unlucky and the client has just sent a request, it won't be processed. The problem is that the client cannot be sure whether the server started processing that request. According to RFC 2616, the client should repeat the request if it is idempotent. If it is not (POST, PUT, etc.), it is up to the user to decide what to do. For me, this is proof that the HTTP keep-alive protocol is broken.
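For reference, the nginx side of this is normally controlled by directives along these lines (illustrative values; note that a graceful reload closes idle keep-alive connections regardless of these settings):

```
http {
  keepalive_timeout  65s;    # close idle client keep-alive connections after this long
  keepalive_requests 1000;   # close a client connection after this many requests
}
```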

Kubernetes dealt with a similar problem and solved it by using dynamic nginx configuration (via Lua) that doesn't require a reload. We'll probably use that solution too.

For me, akka-http behaves correctly according to the RFC. But since the RFC allows terminating a TCP connection while an HTTP request is being established, akka-http should throw a specific, well-named exception in this case so that users can handle it. Currently it is a generic exception that is, moreover, defined in a private class, so it cannot be matched on.
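Until a public exception type exists, one workaround is to detect the failure structurally; the helper below is a hypothetical sketch that matches on the runtime class name and message, since the exception class lives in a private akka.http.impl package:

```scala
// Hypothetical helper: UnexpectedConnectionClosureException is not public API,
// so match on the runtime class name / message instead of the exception type.
def isUnexpectedConnectionClosure(t: Throwable): Boolean =
  t.getClass.getName.endsWith("UnexpectedConnectionClosureException") ||
    Option(t.getMessage).exists(_.contains("closed the connection unexpectedly"))
```

A caller could then use this in a Future.recoverWith to retry requests it knows are idempotent.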

@jrudolph
Contributor

Thanks for digging into it, @MartinMajor. That sounds like a very reasonable description.

For me, this is proof that HTTP keep-alive protocol is broken.

Indeed. HTTP/2 has a slightly improved mechanism where you can at least give a hint at which request was the last one that has been fully processed.

Currently it is generic exception that is moreover in private class thus it cannot be handled.

Yes, it would be good to move the exception so it can be handled more easily. I'm not sure, though, what more concrete information you see as missing.

@jrudolph
Contributor

Yes, it would be good to move the exception so it can be handled more easily. I'm not sure, though, what more concrete information you see as missing.

This is being tracked in #768, but we couldn't yet reach a consensus on what the public exception hierarchy should look like.

@MartinMajor
Author

I'm not sure, though, what more concrete information you see as missing.

When I think about it, the exception itself is fine. I would just add scaladoc explaining the TCP & HTTP details more thoroughly.

This is being tracked in #768, but we couldn't yet reach a consensus on what the public exception hierarchy should look like.

I understand that it might not be an easy decision, but it should be done somehow, because according to the RFC the client code is responsible for handling these situations, so it needs to be able to handle that exception.

Thanks a lot @jrudolph for all the information!

@jrudolph jrudolph modified the milestone: 10.2.5 Jun 14, 2021
@jrudolph jrudolph added 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted t:client Issues related to the HTTP Client t:core Issues related to the akka-http-core module and removed 0 - new Ticket is unclear on it's purpose or if it is valid or not labels Jun 14, 2021
@jrudolph
Contributor

We added a new pool keep-alive-timeout setting that can be used to make the pool close the connection before the backend does.

We haven't yet improved the error message. One problem is that there are multiple ways the error message could play out depending on the exact timing. I'm leaving this ticket open for improvements in this regard.
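For reference, a sketch of the keep-alive-timeout setting mentioned above (the value is illustrative; it should be set below the server's or proxy's own idle timeout):

```
akka.http.host-connection-pool {
  # Proactively close pooled connections that have been idle this long, so
  # the client, not the server, decides when a keep-alive connection dies.
  keep-alive-timeout = 30 seconds
}
```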

@jrudolph
Contributor

I also filed a new ticket for a potential alternative solution: #3834.

@ergunbaris

Was this problem introduced with version 10.1.12, where there were changes around HTTP/2 handling? Can you please verify this?

@MartinMajor
Author

We are using HTTP/1.1, so if those changes were for HTTP/2 only, they shouldn't be related. We started noticing this problem after the upgrade, but it might just be a coincidence.

I can't easily test it now with the old version, sorry :(

@saideliveroo2003

@MartinMajor what was the fix for this issue?
