
HTTP client pool stops processing requests with "Exceeded configured max-open-requests value" #3662

Open
vilunov opened this issue Nov 25, 2020 · 57 comments
Labels
bug t:client Issues related to the HTTP Client t:client-new-pool t:core Issues related to the akka-http-core module

Comments

@vilunov

vilunov commented Nov 25, 2020

We use sttp v2 with the single-request API (no streams). Some configs:

akka.http.host-connection-pool {
  max-connections = 64
  max-open-requests = 128 // Must be a power of 2
}

akka.http.client {
  idle-timeout = 5 seconds
  connecting-timeout = 1 second  // this is ignored when used with sttp, but we introduced a workaround
}

At some point in time this happens:

[Screenshot: Screenshot_20201125_111803]

Here the requests to typing stopped working at 17:16, shortly after the server started at 17:11. Such freeze can happen at any point of the instance's life though, even days after the start.

Here are some hypotheses that we considered:

  • The result stream is somehow not consumed and clogs the request pool -- stream consumption is handled by the sttp adapter.
  • The freeze happens because the pool is overloaded -- we introduced timeouts, and even with them the pools still freeze until the instance is restarted.
  • Threads in the thread pool are being blocked -- in the heap dump, most of the threads are either RUNNING or WAITING for new work.

Error log:

akka.stream.BufferOverflowException: Exceeded configured max-open-requests value of [128]. This means that the request queue of this pool (HostConnectionPoolSetup(typing,80,ConnectionPoolSetup(ConnectionPoolSettings(64,0,2,128,1,1 minute,100 milliseconds,15 seconds,30 seconds,ClientConnectionSettings(Some(User-Agent: gateway),3 seconds,300 milliseconds,512,None,WebSocketSettings(<function0>,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$236/564086167@db159b5),List(),ParserSettings(2048,16,64,64,8192,64,Some(33554432),8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,HashMap(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, If-None-Match -> 0, User-Agent -> 32, Content-MD5 -> 0, Date -> 0, If-Match -> 0),false,true,akka.util.ConstantFun$$$Lambda$216/512238546@25c032d,akka.util.ConstantFun$$$Lambda$216/512238546@25c032d,akka.util.ConstantFun$$$Lambda$217/203048480@5a78e22),100 milliseconds,None,Http2ClientSettingsImpl(256,65536,10000000,512000,1024,false,None),TCPTransport),1 second,List()),akka.http.scaladsl.HttpConnectionContext$@20049e4b,akka.event.MarkerLoggingAdapter@116df1d))) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.
	at akka.http.impl.engine.client.PoolInterface$Logic.<init>(PoolInterface.scala:106)
	at akka.http.impl.engine.client.PoolInterface$PoolInterfaceStage.createLogicAndMaterializedValue(PoolInterface.scala:94)
	at akka.stream.impl.GraphStageIsland.materializeAtomic(PhasedFusingActorMaterializer.scala:700)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:499)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:449)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:441)
	at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:703)
	at akka.http.impl.engine.client.PoolInterface$.apply(PoolInterface.scala:80)
	at akka.http.impl.engine.client.PoolMasterActor.akka$http$impl$engine$client$PoolMasterActor$$startPoolInterface(PoolMasterActor.scala:131)
	at akka.http.impl.engine.client.PoolMasterActor$$anonfun$receive$1.applyOrElse(PoolMasterActor.scala:162)
	at akka.actor.Actor.aroundReceive(Actor.scala:534)
	at akka.actor.Actor.aroundReceive$(Actor.scala:532)
	at akka.http.impl.engine.client.PoolMasterActor.aroundReceive(PoolMasterActor.scala:111)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:573)
	at akka.actor.ActorCell.invoke(ActorCell.scala:543)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:269)
	at akka.dispatch.Mailbox.run(Mailbox.scala:230)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:242)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
@raboof
Contributor

raboof commented Nov 25, 2020

I don't have much experience with sttp. A minimal project that shows the problem in isolation and can be used to further diagnose the problem might be a good start?

@vilunov
Author

vilunov commented Nov 25, 2020

We cannot provide a minimal project at all. The same project runs in our staging environment without issue; this only happens in production under heavy load. sttp should not be a problem here, as these errors come from akka.

@jrudolph
Contributor

The freeze happens due to the pool's being overloaded -- we introduced timeouts and even with them the pools still freeze until the instance is restarted.

I think this is the most likely case. Since timeouts currently don't purge requests from the pool, you can end up in a situation where all requests will be handled too late and eventually time out (especially since the queueing time is part of the timeout).
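The point that timeouts don't purge requests can be illustrated without Akka: a caller-side timeout only abandons the Future, it doesn't remove the work from the queue. A minimal sketch in plain Scala (withTimeout and the never-completing Promise standing in for a queued pool request are illustrative names, not Akka HTTP API):

```scala
import java.util.concurrent.{Executors, TimeUnit, TimeoutException}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration.FiniteDuration

object TimeoutDemo {
  // Daemon thread so the JVM can still exit.
  private val scheduler = Executors.newSingleThreadScheduledExecutor { r =>
    val t = new Thread(r); t.setDaemon(true); t
  }

  // Fail the *returned* Future after `timeout`; the underlying work is
  // untouched and keeps occupying whatever slot or queue entry it holds.
  def withTimeout[A](f: Future[A], timeout: FiniteDuration)(implicit ec: ExecutionContext): Future[A] = {
    val p = Promise[A]()
    f.onComplete(p.tryComplete)
    scheduler.schedule(
      new Runnable { def run(): Unit = p.tryFailure(new TimeoutException(s"gave up after $timeout")) },
      timeout.toMillis, TimeUnit.MILLISECONDS)
    p.future
  }
}
```

The caller sees a TimeoutException, but nothing here dequeues the original request, which is why timed-out requests still count against max-open-requests until the pool itself gives up on them.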

@jrudolph
Contributor

Dealing with server overload is a hard problem and there's no good solution to it currently. After all, the goal must be to shave off as much work from the server as possible when you detect the problem (however, where should those requests then go?).

Here are some hints at what you could try to do:

  • provide more capacity at the server
  • keep queues as small as possible (max-open-requests even smaller than 128) and then put a buffer before the pool that you manage actively. Add a circuit breaker in front of it that rejects requests directly when overload is detected.
  • To avoid having to restart the app completely, when you detect these kinds of situations, you can shutdown the pool manually. You can get a handle to the pool (HostConnectionPool) by materializing Http().cachedHostConnectionPoolHttps for the host in question and call HostConnectionPool.shutdown. There's also Http().shutdownAllConnectionPools() (though, I think that might leave some requests hanging).
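The second hint (a small pool queue plus an actively managed buffer that rejects on overload) can be sketched in plain Scala. BoundedRequestBuffer is an illustrative name, not an Akka HTTP class; a real setup would pair something like it with akka.pattern.CircuitBreaker in front:

```scala
import scala.collection.mutable

// Illustrative sketch: a bounded buffer in front of the pool that rejects
// new work outright once a high-water mark is hit, so callers fail fast
// instead of queueing until max-open-requests overflows.
final class BoundedRequestBuffer[A](maxDepth: Int) {
  private val queue = mutable.Queue.empty[A]

  /** Returns true if accepted, false if rejected due to overload. */
  def offer(request: A): Boolean = synchronized {
    if (queue.size >= maxDepth) false
    else { queue.enqueue(request); true }
  }

  /** Hand the next buffered request to the pool when it has capacity. */
  def poll(): Option[A] = synchronized {
    if (queue.isEmpty) None else Some(queue.dequeue())
  }

  def depth: Int = synchronized(queue.size)
}
```

The key design point is that rejection happens at the buffer, under your control, rather than deep inside the pool as a BufferOverflowException.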

If you can provide debug logging for the pool (filter for class akka.http.impl.engine.client.PoolId), we can look into what other causes could lead to the behavior. Which version of akka-http do you use?

@vilunov
Author

vilunov commented Nov 25, 2020

I'm afraid we don't have PoolId yet -- we used 10.1.2 at the time of creating the memory dump. Right now we use 10.2.0 (downgraded from 10.2.1 due to a bug with S3) and observe the same problem.

However, I see objects of class akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1$Slot and see that most of them have _state = WaitingForResponse. I assume this means that no answer has been received yet.

@jrudolph
Contributor

However, I see objects of class akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1$Slot and see that most of them have _state = WaitingForResponse. I assume this means that no answer has been received yet.

Yes

@vilunov
Author

vilunov commented Nov 26, 2020

We followed your advice on keeping queues as small as possible and used the following configs:

akka.http.host-connection-pool {
  max-connections = 256
  max-open-requests = 256 // Must be a power of 2
}

Still, shortly after startup we encountered the same problem with one of our instances: all requests to a single hostname failed, without a single successful request completion for 30 minutes. We will retrieve and analyze the heap dump a bit later; what should we look for besides PoolId objects?

@jrudolph
Contributor

DEBUG log information might be more interesting than looking through the dumps, because it will also give a timeline of what's going on.

@vilunov
Author

vilunov commented Nov 26, 2020

Can you specify which exact loggers we should enable debug-level logging for? We cannot afford to store all debug logs, sadly.

@jrudolph
Contributor

akka.http.impl.engine.client.PoolId might be enough but otherwise, everything under akka.http would help.
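Assuming an SLF4J/Logback backend (and that akka.loglevel itself permits DEBUG, since Akka filters events before they reach the logger), a fragment along these lines would scope the verbose output to just that logger; the appender name and pattern are illustrative:

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder><pattern>%d %-5level %logger{36} - %msg%n</pattern></encoder>
  </appender>

  <!-- DEBUG only for the pool logger; everything else stays at INFO -->
  <logger name="akka.http.impl.engine.client.PoolId" level="DEBUG"/>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```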

@vilunov
Author

vilunov commented Nov 30, 2020

We cannot afford to enable debug logs for these loggers, as it increases our log volume tenfold; we simply do not have the resources to store that many logs at our scale.

Following the advice, we disabled the queue by setting both max-connections and max-open-requests to 256. This did not help; we still have stuck client pools.

All slots in the akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1 have a _state of WaitingForResponse with waitingForEndOfRequestEntity = false. They also have a disconnectAt set to a time earlier than the current time (moment of the dump creation), currentTimeout = null and currentTimeoutId = -1. Why are they not terminated by the timeout?

@jrudolph
Contributor

jrudolph commented Dec 3, 2020

All slots in the akka.http.impl.engine.client.pool.NewHostConnectionPool$HostConnectionPoolStage$$anon$1 have a _state of WaitingForResponse with waitingForEndOfRequestEntity = false. They also have a disconnectAt set to a time earlier than the current time (moment of the dump creation), currentTimeout = null and currentTimeoutId = -1. Why are they not terminated by the timeout?

That means that all requests are waiting for responses from the server. The disconnectAt timeout is only applied between requests. Since we don't enforce any response timeouts currently, the connections will only be reclaimed when the idle timeout kicks in. Are you saying that not even the idle-timeout kicks in?

@vilunov
Author

vilunov commented Dec 3, 2020

Nope, we have idle-timeout set to 5 seconds and it kicks in for some requests, judging by the logs, but disconnectAt has a value of the current time minus ~30 minutes, i.e. these requests have been waiting for responses for around half an hour. Metrics indicate the same -- no requests are completed for a long time.

@jrudolph
Contributor

jrudolph commented Dec 3, 2020

Hard to say what the reason could be for the idle-timeout not being triggered. I think that's the first time I've heard about the idle-timeout not working. This is just a wild guess, but if you use Kamon, you could try running without it. Hard to diagnose without more info.

@vilunov
Author

vilunov commented Dec 3, 2020

No, we don't use Kamon. By metrics I mean our own non-invasive metrics around request invocations.

@spangaer

Looks like we get to join the dance (Akka HTTP 10.1.x).
I don't have any more information to bring to the table other than that @vilunov isn't the only one suffering from this.

In our case we use rather big pools.

  client {
    connecting-timeout = 15s
    idle-timeout = 50s
  }

  host-connection-pool {
    idle-timeout = 500s
    max-connections = 32767
    min-connections = 0
    max-open-requests = 131072
  }

They don't time out, and just get stuck for hours. I've seen it auto-resolve in some cases, but the current occurrence has been stuck for about 24h now.

@spangaer

spangaer commented Dec 14, 2021

Another thing I can add is that I have code doing Akka HTTP to Akka HTTP calls that doesn't seem to suffer, even though these calls have an ALB sitting in between.

@SkyTrix

SkyTrix commented Dec 14, 2021

Could this somehow be related to this failure? #3916

@jrudolph
Contributor

If this happens with high CPU load in the pool, then it might be related to the issue fixed in 67ac11c. I'd recommend updating to 10.2.x for big pools (> 500-1000 connections).

@spangaer

Does that commit fix some slot release escapes somehow?

(I've looked at the commit, but it doesn't seem to be apparent from there to me)

Yes, high CPU load would have been involved here.

@jrudolph
Contributor

Does that commit fix some slot release escapes somehow?

(I've looked at the commit, but it doesn't seem to be apparent from there to me)

Yes, high CPU load would have been involved here.

That commit fixes that the pool management logic itself spins and contributes to CPU load up to a degree where little useful work can be done by the pool. If you are saying it isn't making any progress at all, it might still be something different, but even in that case, it would be good to try on 10.2 and see if it shows the same issue.

@spangaer

Well, the state is 32k slots in WaitingForResponse, with seemingly all SlotConnections claiming to be connected, while nothing is moving anymore (since last evening) and the server-side and ALB timeouts have long expired.

From what you're saying, it seems like I should expect a single CPU cooking all this time? Which is not the case.

It seems to involve some kind of connectivity issue whose errors escape the slot release process.

@spangaer

While studying this code (on v10.1.x, but still valid):
https://github.com/akka/akka-http/blob/v10.2.7/akka-http-core/src/main/scala/akka/http/impl/engine/client/pool/NewHostConnectionPool.scala#L289

Isn't that a race condition?

If lines 271 and 288 are reached at the same time, two invocations of updateState() will be running side by side and the outcome will be rather unpredictable.

currentTimeoutId should be volatile, because it is read from a different thread than the one clearing/changing it, but even that won't cut it, I fear. It seems like it either needs checks and sets on both sides, in the right order (or some locking), or must happen inside the actor "event loop".

Or am I missing something?
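One way to make the stale-timeout check tolerant of such a race, independent of what the actual pool code does, is an atomic generation counter: the timer callback acts only if its id is still current. A generic sketch (TimeoutGuard is an illustrative name, not pool code); this narrows but does not close the window, since the check and the action together are still not atomic, so a complete fix would still run the callback inside the stage's single-threaded event loop:

```scala
import java.util.concurrent.atomic.AtomicLong

final class TimeoutGuard {
  private val currentTimeoutId = new AtomicLong(0L)

  /** Arm a new timeout; any previously armed id becomes stale. */
  def arm(): Long = currentTimeoutId.incrementAndGet()

  /** Called from the timer thread: run `action` only if `id` is still current. */
  def fire(id: Long)(action: => Unit): Boolean =
    if (currentTimeoutId.get() == id) { action; true }
    else false // stale timeout: the state changed in between, so ignore it
}
```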

@spangaer

Further study: if a materialized connectionFlow: Flow[HttpRequest, HttpResponse, Future[Http.OutgoingConnection]] of HostConnectionPoolStage, once connected, never produces an error, the slot will get stuck in WaitingForResponse.

At that point, no timeout code is active in the connection pool.

@spangaer

This fix seems to acknowledge such a possibility.
https://github.com/akka/akka-http/blame/v10.1.14/akka-http-core/src/main/resources/reference.conf#L342

Any chance that heavy CPU load could make the 100 ms fall short?
And if so, what is the risk of raising that value to, say, 20 seconds?

@spangaer

By now I have as much as a smoking gun. I'm able to reproduce it by creating load on a local system and getting some hanging slots.

(was trying to validate if those settings did something for me)

With debug logging enabled they spit out this

[435 (WaitingForResponse)] Connection cancelled 
[534 (WaitingForResponse)] Connection cancelled 
[449 (WaitingForResponse)] Connection cancelled

I found these by first looking at one hanging slot and finding that message, then looking for the message, which occurs exactly as many times as the number of hanging slots (81 times).

When looking at the code, it seems to tell me that an attempt is made to send an HttpRequest to an HTTP stream, but that one can no longer process it, so it responds with a cancellation instead. Which is logged and nothing more. I'm guessing the hope is that onUpstreamFailure(ex: Throwable) is invoked too, but that doesn't seem to happen. Which makes me think that the cancellation signal should be processed as some kind of failure instead of just being logged.

@spangaer

By now I've figured that this is sort of the end of the line, but more is going on before it ever gets there. I'm working on making that visible.

@spangaer

Ok, I went through seven hells to exert the right amount of stress to reproduce the problem while at the same time not losing the debug logging, but I think I'm there.

Note: for entries that have the same timestamp, you cannot trust the order, as they retroactively get sorted and there's no extra counter to ensure ordering.

Time	message	thread_name
Dec 15, 2021 @ 21:42:29.298	[698 (Unconnected)] Dispatching request [GET /id/current/prpl Empty]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.298	[698 (Unconnected)] Before event [onNewRequest] In state [Unconnected] for [105770 ms]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.299	[698 (Connecting)] After event [onNewRequest] State change [Unconnected] -> [Connecting]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.299	[698 (Unconnected)] Establishing connection	listofnumbers-akka.actor.default-dispatcher-33
Dec 15, 2021 @ 21:42:31.501	[698 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]	listofnumbers-akka.actor.default-dispatcher-44
Dec 15, 2021 @ 21:42:31.501	[698 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [2201 ms]	listofnumbers-akka.actor.default-dispatcher-44
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Connection attempt succeeded	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Before event [onConnectionAttemptSucceeded] In state [Connecting] for [7402 ms]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (WaitingForResponse)] After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Slot connection was established	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (PushingRequestToConnection)] After event [onConnectionAttemptSucceeded] State change [Connecting] -> [PushingRequestToConnection]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (PushingRequestToConnection)] Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:43:03.183	[698 (WaitingForResponse)] Connection cancelled	listofnumbers-akka.actor.default-dispatcher-49
Dec 15, 2021 @ 21:43:48.193	[698 (WaitingForResponse)] After event [onNewConnectionEmbargo] State change [WaitingForResponse] -> [WaitingForResponse]	listofnumbers-akka.actor.default-dispatcher-47
Dec 15, 2021 @ 21:43:48.193	[698 (WaitingForResponse)] Before event [onNewConnectionEmbargo] In state [WaitingForResponse] for [69286 ms]	listofnumbers-akka.actor.default-dispatcher-47

I'm still in the process of analyzing what that sequence of events means, but for blocking slots it fairly consistently looks like this, so I thought I could share it already.

[449 (WaitingForResponse)] Connection cancelled is retained as the telltale of a blocked slot.

@jrudolph
Contributor

Thanks for this analysis, @spangaer. I'm going to have a quick look if anything stands out. Have you already tried with Akka HTTP 10.2.7? In any case, a fix will only appear in 10.2.7 as 10.1.x is now in deep maintenance mode.

@jrudolph jrudolph added the 0 - new Ticket is unclear on it's purpose or if it is valid or not label Dec 16, 2021
@jrudolph jrudolph added 3 - in progress Someone is working on this ticket and removed 1 - triaged Tickets that are safe to pick up for contributing in terms of likeliness of being accepted labels Dec 16, 2021
@jrudolph jrudolph self-assigned this Dec 16, 2021
@jrudolph
Contributor

Ok, I think I found a problem when the connection is closed while only part of the response headers have been received.

@jrudolph
Contributor

[Attached: dot graph "out"]

@jrudolph
Contributor

Hmm, that's a potential problem, but not one that should happen in the pool; the stages are closed once the responses are pulled (which they are all the time for the pool).

@spangaer

I don't expect that those idle-timeouts help. They only apply when the pool connection actually is idle, not while it is waiting for a response.

I meant a different issue: "Connection reset by peer", which happens if an idle connection is closed by the other side. I've seen that go sour when a new request goes in at that moment, which then fails for no good reason. But it is easily solved by closing the client side before you expect the server side to do the same. It now looks to be addressed by the keep-alive-timeout.

Neat graph btw, how did you generate that? (I haven't figured out how to interpret it though.)

@spangaer

Yes, I expect the TCP connection to be gone but the Akka HTTP counterpart still being kept alive for some reason.

Hmm. Doesn't the cancellation signal that that structure is being cleaned up? I thought the delay that was introduced is meant to give a Throwable a chance to come out the back door, but by the time the cancellation arrives all hope is lost, so you can start cleaning up?

@jrudolph
Contributor

I dug a bit deeper but I cannot reproduce anything with the actual pool involved either.

If you want to play around, you could try if changing this makes any difference:

override def onDownstreamFinish(): Unit =
  withSlot { slot =>
    slot.debug("Connection cancelled")
    slot.onConnectionFailed(new RuntimeException("connection cancelled"))
    responseIn.cancel()
  }

You could also try to run akka.stream.testkit.scaladsl.StreamTestKit.assertNoChildren and look at / send the output.

@jrudolph
Contributor

You could also try to run akka.stream.testkit.scaladsl.StreamTestKit.assertNoChildren and look at / send the output.

This is what creates those dot graphs. The output will be rather huge for your usecase, though.

@spangaer

I guess that change would just clean up after cancellation? Which would probably unlock the slot, right? I'll probably give that a shot at some point. What beats me at this moment is the following (I've ordered the log statements as I expect them to have occurred):

Time	message	thread_name

Dec 15, 2021 @ 21:42:29.298	[698 (Unconnected)] Dispatching request [GET /id/current/prpl Empty]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.298	[698 (Unconnected)] Before event [onNewRequest] In state [Unconnected] for [105770 ms]	listofnumbers-akka.actor.default-dispatcher-27
Dec 15, 2021 @ 21:42:29.299	[698 (Unconnected)] Establishing connection	listofnumbers-akka.actor.default-dispatcher-33
Dec 15, 2021 @ 21:42:29.299	[698 (Connecting)] After event [onNewRequest] State change [Unconnected] -> [Connecting]	listofnumbers-akka.actor.default-dispatcher-27

Dec 15, 2021 @ 21:42:31.501	[698 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [2201 ms]	listofnumbers-akka.actor.default-dispatcher-44
Dec 15, 2021 @ 21:42:31.501	[698 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]	listofnumbers-akka.actor.default-dispatcher-44

Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Connection attempt succeeded	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Before event [onConnectionAttemptSucceeded] In state [Connecting] for [7402 ms]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (Connecting)] Slot connection was established	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (PushingRequestToConnection)] After event [onConnectionAttemptSucceeded] State change [Connecting] -> [PushingRequestToConnection]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (PushingRequestToConnection)] Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]	listofnumbers-akka.actor.default-dispatcher-37
Dec 15, 2021 @ 21:42:38.903	[698 (WaitingForResponse)] After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]	listofnumbers-akka.actor.default-dispatcher-37

Dec 15, 2021 @ 21:43:03.183	[698 (WaitingForResponse)] Connection cancelled	listofnumbers-akka.actor.default-dispatcher-49

Dec 15, 2021 @ 21:43:48.193	[698 (WaitingForResponse)] Before event [onNewConnectionEmbargo] In state [WaitingForResponse] for [69286 ms]	listofnumbers-akka.actor.default-dispatcher-47
Dec 15, 2021 @ 21:43:48.193	[698 (WaitingForResponse)] After event [onNewConnectionEmbargo] State change [WaitingForResponse] -> [WaitingForResponse]	listofnumbers-akka.actor.default-dispatcher-47

Where are those onNewConnectionEmbargo coming from? The only path I see there is over Unconnected.

Next to that, all 85 "Connection cancelled" statements arrive within a time span of 600 ms, which is preceded by a 23 second gap in all logging, included in the 25 second gap here.

@spangaer

And it's not one big GC pause; I have other records that do show activity.
It does do a dozen or so small young G1 GCs, using about 500 ms of CPU time over that span.

@spangaer

If you want to play around, you could try if changing this makes any difference:

Still need to get to this. Just a sanity check: we can expect this to solve it, so it's more a matter of seeing that cancellations occur but slots get unblocked anyway. Correct?

@jrudolph
Contributor

Where are those onNewConnectionEmbargo coming from? The only path I see there is over Unconnected.

That means that a connection attempt failed for any connection on that pool. In that case, we send all slots a notice to hold back creating new connections if needed until the embargo is over. This doesn't affect connected slots.

Next to that, all 85 "Connection cancelled" statements arrive within a time span of 600 ms, which is preceded by a 23 second gap in all logging, included in the 25 second gap here.

Ultimately, the cancellation issue is likely caused by a downstream event that closes many connections. So, maybe something else is already amiss before that event.

If you want to play around, you could try if changing this makes any difference:

Still need to get to this. Just a sanity check: we can expect this to solve it, so it's more a matter of seeing that cancellations occur but slots get unblocked anyway. Correct?

This could help; it will propagate the cancellation back to the remaining alive part of the HTTP connection implementation and also free the slot. The exact behavior is hard to say, since I didn't manage to reproduce the conditions leading to it.

@spangaer

OK, this time around my biggest challenge was getting that patch built and keeping my test rig working. I actually reverted some of my previous settings: symmetric idle-timeout = 50s, and removed the cancellation delay. It looks like I accidentally set the pool idle timeout to 29s, but it doesn't really make a difference here; the pool is more than active enough for that. (Given that I built from 10.1.15, I was aiming for keep-alive-timeout.)

I patched with a variant of the suggestion

override def onDownstreamFinish(): Unit =
  withSlot { slot =>
    slot.debug("Connection cancelled")
    slot.onConnectionFailed(new IllegalStateException("http stream connection cancelled"))
    responseIn.cancel()
  }

And, as you can tell by these logs, I put the hammer down on my machine.

It looks like it works as intended. The interesting bits are around 16:50:30.712 (remember ordering is off for matching timestamps)

I think it would have fallen on its feet. The slot does seem to unlock and continue from there. My machine was so broken down that it became dysfunctional after that: the profiler disconnected and failed to reconnect, and even accessing my web services stopped working eventually.

I will probably take this patch and put it on an R&D deployment to see if it does sensible things there. It does seem to make sense to take it further.

slot 2342


Time	message
Dec 23, 2021 @ 16:46:04.348	[2342 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [172562 ms]
Dec 23, 2021 @ 16:46:04.348	[2342 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:46:05.585	[2342 (Embargoed)] Slot timeout after 140 milliseconds
Dec 23, 2021 @ 16:46:05.585	[2342 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:46:05.585	[2342 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [1259 ms]
Dec 23, 2021 @ 16:46:05.585	[2342 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:46:09.275	[2342 (Embargoed)] After event [onNewConnectionEmbargo] State change [OutOfEmbargo] -> [Embargoed]
Dec 23, 2021 @ 16:46:09.275	[2342 (OutOfEmbargo)] Before event [onNewConnectionEmbargo] In state [OutOfEmbargo] for [3507 ms]
Dec 23, 2021 @ 16:46:09.636	[2342 (Embargoed)] Slot timeout after 103 milliseconds
Dec 23, 2021 @ 16:46:09.636	[2342 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:46:09.636	[2342 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [542 ms]
Dec 23, 2021 @ 16:46:09.636	[2342 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:50:19.154	[2342 (OutOfEmbargo)] Before event [onNewRequest] In state [OutOfEmbargo] for [249517 ms]
Dec 23, 2021 @ 16:50:19.154	[2342 (Connecting)] After event [onNewRequest] State change [OutOfEmbargo] -> [Connecting]
Dec 23, 2021 @ 16:50:19.154	[2342 (OutOfEmbargo)] Establishing connection
Dec 23, 2021 @ 16:50:19.154	[2342 (OutOfEmbargo)] Dispatching request [GET /session/current/info Empty]
Dec 23, 2021 @ 16:50:22.024	[2342 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:22.024	[2342 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [2856 ms]
Dec 23, 2021 @ 16:50:28.996	[2342 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:28.996	[2342 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [6969 ms]
Dec 23, 2021 @ 16:50:29.278	[2342 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:29.278	[2342 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [248 ms]
Dec 23, 2021 @ 16:50:29.772	[2342 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [158 ms]
Dec 23, 2021 @ 16:50:29.772	[2342 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:30.157	[2342 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [156 ms]
Dec 23, 2021 @ 16:50:30.157	[2342 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:30.499	[2342 (PushingRequestToConnection)] Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]
Dec 23, 2021 @ 16:50:30.499	[2342 (WaitingForResponse)] After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]
Dec 23, 2021 @ 16:50:30.499	[2342 (Connecting)] Before event [onConnectionAttemptSucceeded] In state [Connecting] for [226 ms]
Dec 23, 2021 @ 16:50:30.499	[2342 (PushingRequestToConnection)] After event [onConnectionAttemptSucceeded] State change [Connecting] -> [PushingRequestToConnection]
Dec 23, 2021 @ 16:50:30.499	[2342 (Connecting)] Connection attempt succeeded
Dec 23, 2021 @ 16:50:30.499	[2342 (Connecting)] Slot connection was established
Dec 23, 2021 @ 16:50:30.526	[2342 (WaitingForResponse)] Before event [onNewConnectionEmbargo] In state [WaitingForResponse] for [8 ms]
Dec 23, 2021 @ 16:50:30.526	[2342 (WaitingForResponse)] After event [onNewConnectionEmbargo] State change [WaitingForResponse] -> [WaitingForResponse]
Dec 23, 2021 @ 16:50:30.712	[2342 (WaitingForResponse)] Ongoing request [GET /session/current/info Empty] is failed because of [connection failure]: [http stream connection cancelled]
Dec 23, 2021 @ 16:50:30.712	[2342 (Failed)] State change from [WaitingForResponse] to [Failed(java.lang.IllegalStateException: http stream connection cancelled)]. Closing the existing connection.
Dec 23, 2021 @ 16:50:30.712	[2342 (WaitingForResponse)] Connection cancelled
Dec 23, 2021 @ 16:50:30.712	[2342 (Failed)] After event [onConnectionFailed] State change [WaitingForResponse] -> [Failed]
Dec 23, 2021 @ 16:50:30.712	[2342 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:50:30.712	[2342 (WaitingForResponse)] Before event [onConnectionFailed] In state [WaitingForResponse] for [166 ms]
Dec 23, 2021 @ 16:50:30.712	[2342 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [0 ms]
Dec 23, 2021 @ 16:50:42.845	[2342 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:50:42.845	[2342 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:50:42.845	[2342 (Embargoed)] Slot timeout after 140 milliseconds
Dec 23, 2021 @ 16:50:42.845	[2342 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [12900 ms]
Dec 23, 2021 @ 16:51:01.194	[2342 (OutOfEmbargo)] Before event [onNewConnectionEmbargo] In state [OutOfEmbargo] for [18346 ms]
Dec 23, 2021 @ 16:51:01.194	[2342 (Embargoed)] After event [onNewConnectionEmbargo] State change [OutOfEmbargo] -> [Embargoed]
Dec 23, 2021 @ 16:51:01.865	[2342 (Embargoed)] After event [onNewConnectionEmbargo] State change [Embargoed] -> [Embargoed]
Dec 23, 2021 @ 16:51:01.865	[2342 (Embargoed)] Before event [onNewConnectionEmbargo] In state [Embargoed] for [241 ms]
Dec 23, 2021 @ 16:51:07.999	[2342 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:51:07.999	[2342 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:51:07.999	[2342 (Embargoed)] Slot timeout after 355 milliseconds
Dec 23, 2021 @ 16:51:07.999	[2342 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [6565 ms]

slot 2303

Time	message
Dec 23, 2021 @ 16:46:04.348	[2303 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:46:04.348	[2303 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [172561 ms]
Dec 23, 2021 @ 16:46:05.079	[2303 (Embargoed)] Slot timeout after 101 milliseconds
Dec 23, 2021 @ 16:46:05.081	[2303 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:46:05.081	[2303 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [754 ms]
Dec 23, 2021 @ 16:46:05.081	[2303 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:46:09.275	[2303 (Embargoed)] After event [onNewConnectionEmbargo] State change [OutOfEmbargo] -> [Embargoed]
Dec 23, 2021 @ 16:46:09.275	[2303 (OutOfEmbargo)] Before event [onNewConnectionEmbargo] In state [OutOfEmbargo] for [4013 ms]
Dec 23, 2021 @ 16:46:10.283	[2303 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:46:10.283	[2303 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [1190 ms]
Dec 23, 2021 @ 16:46:10.283	[2303 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:46:10.283	[2303 (Embargoed)] Slot timeout after 166 milliseconds
Dec 23, 2021 @ 16:50:19.135	[2303 (OutOfEmbargo)] Before event [onNewRequest] In state [OutOfEmbargo] for [248851 ms]
Dec 23, 2021 @ 16:50:19.135	[2303 (OutOfEmbargo)] Establishing connection
Dec 23, 2021 @ 16:50:19.135	[2303 (Connecting)] After event [onNewRequest] State change [OutOfEmbargo] -> [Connecting]
Dec 23, 2021 @ 16:50:19.135	[2303 (OutOfEmbargo)] Dispatching request [GET /session/current/info Empty]
Dec 23, 2021 @ 16:50:22.023	[2303 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:22.023	[2303 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [2875 ms]
Dec 23, 2021 @ 16:50:28.996	[2303 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [6969 ms]
Dec 23, 2021 @ 16:50:28.996	[2303 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:29.277	[2303 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [248 ms]
Dec 23, 2021 @ 16:50:29.277	[2303 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:29.771	[2303 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:29.771	[2303 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [158 ms]
Dec 23, 2021 @ 16:50:30.156	[2303 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:30.156	[2303 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [156 ms]
Dec 23, 2021 @ 16:50:30.499	[2303 (Connecting)] Connection attempt succeeded
Dec 23, 2021 @ 16:50:30.499	[2303 (PushingRequestToConnection)] After event [onConnectionAttemptSucceeded] State change [Connecting] -> [PushingRequestToConnection]
Dec 23, 2021 @ 16:50:30.499	[2303 (PushingRequestToConnection)] Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]
Dec 23, 2021 @ 16:50:30.499	[2303 (Connecting)] Before event [onConnectionAttemptSucceeded] In state [Connecting] for [226 ms]
Dec 23, 2021 @ 16:50:30.499	[2303 (Connecting)] Slot connection was established
Dec 23, 2021 @ 16:50:30.499	[2303 (WaitingForResponse)] After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]
Dec 23, 2021 @ 16:50:30.525	[2303 (WaitingForResponse)] After event [onNewConnectionEmbargo] State change [WaitingForResponse] -> [WaitingForResponse]
Dec 23, 2021 @ 16:50:30.525	[2303 (WaitingForResponse)] Before event [onNewConnectionEmbargo] In state [WaitingForResponse] for [8 ms]
Dec 23, 2021 @ 16:50:30.712	[2303 (WaitingForResponse)] Connection cancelled
Dec 23, 2021 @ 16:50:30.712	[2303 (Failed)] State change from [WaitingForResponse] to [Failed(java.lang.IllegalStateException: http stream connection cancelled)]. Closing the existing connection.
Dec 23, 2021 @ 16:50:30.712	[2303 (WaitingForResponse)] Before event [onConnectionFailed] In state [WaitingForResponse] for [166 ms]
Dec 23, 2021 @ 16:50:30.712	[2303 (WaitingForResponse)] Ongoing request [GET /session/current/info Empty] is failed because of [connection failure]: [http stream connection cancelled]
Dec 23, 2021 @ 16:50:30.712	[2303 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [0 ms]
Dec 23, 2021 @ 16:50:30.712	[2303 (Failed)] After event [onConnectionFailed] State change [WaitingForResponse] -> [Failed]
Dec 23, 2021 @ 16:50:30.712	[2303 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:50:42.787	[2303 (Embargoed)] Slot timeout after 107 milliseconds
Dec 23, 2021 @ 16:50:42.787	[2303 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:50:42.787	[2303 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [12842 ms]
Dec 23, 2021 @ 16:50:42.787	[2303 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:50:57.498	[2303 (OutOfEmbargo)] Dispatching request [GET /id/current/prpl Empty]
Dec 23, 2021 @ 16:50:57.498	[2303 (Connecting)] After event [onNewRequest] State change [OutOfEmbargo] -> [Connecting]
Dec 23, 2021 @ 16:50:57.498	[2303 (OutOfEmbargo)] Before event [onNewRequest] In state [OutOfEmbargo] for [14709 ms]
Dec 23, 2021 @ 16:50:57.498	[2303 (OutOfEmbargo)] Establishing connection
Dec 23, 2021 @ 16:51:01.194	[2303 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:51:01.194	[2303 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [3694 ms]
Dec 23, 2021 @ 16:51:01.864	[2303 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:51:01.864	[2303 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [241 ms]
Dec 23, 2021 @ 16:51:02.294	[2303 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:51:02.294	[2303 (Connecting)] Ongoing request [GET /id/current/prpl Empty] is failed because of [connection attempt failed]: [Tcp command [Connect(localhost:13000,None,List(ReceiveBufferSize(131072), SendBufferSize(131072)),Some(15 seconds),true)] failed because of java.net.ConnectException: Connection refused: no further information]
Dec 23, 2021 @ 16:51:02.294	[2303 (Connecting)] Before event [onConnectionAttemptFailed] In state [Connecting] for [489 ms]
Dec 23, 2021 @ 16:51:02.294	[2303 (Failed)] State change from [Connecting] to [Failed(akka.stream.StreamTcpException: Tcp command [Connect(localhost:13000,None,List(ReceiveBufferSize(131072), SendBufferSize(131072)),Some(15 seconds),true)] failed because of java.net.ConnectException: Connection refused: no further information)]. Closing the existing connection.
Dec 23, 2021 @ 16:51:02.294	[2303 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [0 ms]
Dec 23, 2021 @ 16:51:02.294	[2303 (Failed)] After event [onConnectionAttemptFailed] State change [Connecting] -> [Failed]
Dec 23, 2021 @ 16:51:02.294	[2303 (Connecting)] Connection attempt failed with Tcp command [Connect(localhost:13000,None,List(ReceiveBufferSize(131072), SendBufferSize(131072)),Some(15 seconds),true)] failed because of java.net.ConnectException: Connection refused: no further information
Dec 23, 2021 @ 16:51:13.523	[2303 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [11599 ms]
Dec 23, 2021 @ 16:51:13.523	[2303 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:51:13.523	[2303 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:51:13.523	[2303 (Embargoed)] Slot timeout after 248 milliseconds

slot 2270

Time	message
Dec 23, 2021 @ 16:46:04.348	[2270 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:46:04.348	[2270 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [172561 ms]
Dec 23, 2021 @ 16:46:05.075	[2270 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [751 ms]
Dec 23, 2021 @ 16:46:05.075	[2270 (Embargoed)] Slot timeout after 109 milliseconds
Dec 23, 2021 @ 16:46:05.075	[2270 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:46:05.075	[2270 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:46:09.275	[2270 (Embargoed)] After event [onNewConnectionEmbargo] State change [OutOfEmbargo] -> [Embargoed]
Dec 23, 2021 @ 16:46:09.275	[2270 (OutOfEmbargo)] Before event [onNewConnectionEmbargo] In state [OutOfEmbargo] for [4017 ms]
Dec 23, 2021 @ 16:46:10.453	[2270 (Embargoed)] Slot timeout after 190 milliseconds
Dec 23, 2021 @ 16:46:10.453	[2270 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:46:10.453	[2270 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:46:10.453	[2270 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [1359 ms]
Dec 23, 2021 @ 16:50:19.117	[2270 (OutOfEmbargo)] Dispatching request [GET /id/current/prpl Empty]
Dec 23, 2021 @ 16:50:19.117	[2270 (OutOfEmbargo)] Before event [onNewRequest] In state [OutOfEmbargo] for [248664 ms]
Dec 23, 2021 @ 16:50:19.117	[2270 (OutOfEmbargo)] Establishing connection
Dec 23, 2021 @ 16:50:19.119	[2270 (Connecting)] After event [onNewRequest] State change [OutOfEmbargo] -> [Connecting]
Dec 23, 2021 @ 16:50:22.023	[2270 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:22.023	[2270 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [2892 ms]
Dec 23, 2021 @ 16:50:28.995	[2270 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [6969 ms]
Dec 23, 2021 @ 16:50:28.995	[2270 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:29.277	[2270 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [248 ms]
Dec 23, 2021 @ 16:50:29.277	[2270 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:29.771	[2270 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [158 ms]
Dec 23, 2021 @ 16:50:29.771	[2270 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:30.156	[2270 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [156 ms]
Dec 23, 2021 @ 16:50:30.156	[2270 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:50:30.499	[2270 (PushingRequestToConnection)] After event [onConnectionAttemptSucceeded] State change [Connecting] -> [PushingRequestToConnection]
Dec 23, 2021 @ 16:50:30.499	[2270 (Connecting)] Connection attempt succeeded
Dec 23, 2021 @ 16:50:30.499	[2270 (Connecting)] Before event [onConnectionAttemptSucceeded] In state [Connecting] for [226 ms]
Dec 23, 2021 @ 16:50:30.499	[2270 (Connecting)] Slot connection was established
Dec 23, 2021 @ 16:50:30.499	[2270 (PushingRequestToConnection)] Before event [onRequestDispatched] In state [PushingRequestToConnection] for [0 ms]
Dec 23, 2021 @ 16:50:30.499	[2270 (WaitingForResponse)] After event [onRequestDispatched] State change [PushingRequestToConnection] -> [WaitingForResponse]
Dec 23, 2021 @ 16:50:30.525	[2270 (WaitingForResponse)] After event [onNewConnectionEmbargo] State change [WaitingForResponse] -> [WaitingForResponse]
Dec 23, 2021 @ 16:50:30.525	[2270 (WaitingForResponse)] Before event [onNewConnectionEmbargo] In state [WaitingForResponse] for [8 ms]
Dec 23, 2021 @ 16:50:30.711	[2270 (WaitingForResponse)] Connection cancelled
Dec 23, 2021 @ 16:50:30.711	[2270 (WaitingForResponse)] Before event [onConnectionFailed] In state [WaitingForResponse] for [166 ms]
Dec 23, 2021 @ 16:50:30.712	[2270 (WaitingForResponse)] Ongoing request [GET /id/current/prpl Empty] is failed because of [connection failure]: [http stream connection cancelled]
Dec 23, 2021 @ 16:50:30.712	[2270 (Failed)] State change from [WaitingForResponse] to [Failed(java.lang.IllegalStateException: http stream connection cancelled)]. Closing the existing connection.
Dec 23, 2021 @ 16:50:30.712	[2270 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:50:30.712	[2270 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [0 ms]
Dec 23, 2021 @ 16:50:30.712	[2270 (Failed)] After event [onConnectionFailed] State change [WaitingForResponse] -> [Failed]
Dec 23, 2021 @ 16:50:42.965	[2270 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:50:42.965	[2270 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:50:42.965	[2270 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [13020 ms]
Dec 23, 2021 @ 16:50:42.965	[2270 (Embargoed)] Slot timeout after 190 milliseconds
Dec 23, 2021 @ 16:50:57.475	[2270 (OutOfEmbargo)] Dispatching request [GET /id/current/prpl Empty]
Dec 23, 2021 @ 16:50:57.475	[2270 (OutOfEmbargo)] Before event [onNewRequest] In state [OutOfEmbargo] for [14509 ms]
Dec 23, 2021 @ 16:50:57.475	[2270 (Connecting)] After event [onNewRequest] State change [OutOfEmbargo] -> [Connecting]
Dec 23, 2021 @ 16:50:57.475	[2270 (OutOfEmbargo)] Establishing connection
Dec 23, 2021 @ 16:51:01.194	[2270 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:51:01.194	[2270 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [3717 ms]
Dec 23, 2021 @ 16:51:01.864	[2270 (Connecting)] After event [onNewConnectionEmbargo] State change [Connecting] -> [Connecting]
Dec 23, 2021 @ 16:51:01.864	[2270 (Connecting)] Before event [onNewConnectionEmbargo] In state [Connecting] for [241 ms]
Dec 23, 2021 @ 16:51:02.288	[2270 (Failed)] After event [onConnectionAttemptFailed] State change [Connecting] -> [Failed]
Dec 23, 2021 @ 16:51:02.288	[2270 (Failed)] State change from [Connecting] to [Failed(akka.stream.StreamTcpException: Tcp command [Connect(localhost:13000,None,List(ReceiveBufferSize(131072), SendBufferSize(131072)),Some(15 seconds),true)] failed because of java.net.ConnectException: Connection refused: no further information)]. Closing the existing connection.
Dec 23, 2021 @ 16:51:02.288	[2270 (Connecting)] Before event [onConnectionAttemptFailed] In state [Connecting] for [463 ms]
Dec 23, 2021 @ 16:51:02.288	[2270 (Connecting)] Ongoing request [GET /id/current/prpl Empty] is failed because of [connection attempt failed]: [Tcp command [Connect(localhost:13000,None,List(ReceiveBufferSize(131072), SendBufferSize(131072)),Some(15 seconds),true)] failed because of java.net.ConnectException: Connection refused: no further information]
Dec 23, 2021 @ 16:51:02.288	[2270 (Embargoed)] After event [onNewConnectionEmbargo] State change [Unconnected] -> [Embargoed]
Dec 23, 2021 @ 16:51:02.288	[2270 (Unconnected)] Before event [onNewConnectionEmbargo] In state [Unconnected] for [0 ms]
Dec 23, 2021 @ 16:51:02.288	[2270 (Connecting)] Connection attempt failed with Tcp command [Connect(localhost:13000,None,List(ReceiveBufferSize(131072), SendBufferSize(131072)),Some(15 seconds),true)] failed because of java.net.ConnectException: Connection refused: no further information
Dec 23, 2021 @ 16:51:13.275	[2270 (OutOfEmbargo)] After event [onTimeout] State change [Embargoed] -> [OutOfEmbargo]
Dec 23, 2021 @ 16:51:13.275	[2270 (Embargoed)] Before event [onTimeout] In state [Embargoed] for [11377 ms]
Dec 23, 2021 @ 16:51:13.275	[2270 (OutOfEmbargo)] Slot became idle... Trying to pull
Dec 23, 2021 @ 16:51:13.275	[2270 (Embargoed)] Slot timeout after 214 milliseconds

@jrudolph
Contributor

Thanks for the update. Maybe we can get a similar patch in even if we don't know how we got into this situation.

jrudolph added a commit to jrudolph/akka-http that referenced this issue Jan 10, 2022
Refs akka#3662

In the issue, it was observed that a connection might only cancel a pool
connection but would not fail it (which was the implicit assumption of the
existing code). In that case, a pool might become more and more unresponsive,
with slots hanging after connections were canceled but not failed.

The exact condition could not be reproduced, but let's be proactive here
and fail the slot if we see unexpected cancellation.
@spangaer

I was about to post the patch commit, but I saw you took it up and refined it, so I've cherry-picked the commits from your branch and published a local build to deploy.

I'll report on the effects as we observe them.

@spangaer

spangaer commented Feb 7, 2022

So, we've been running the patch in our dev cluster for a while and it doesn't seem to cause issues so far, while the connection problem did indeed occur. The trigger seems to be CPU starvation, but the exact cause is unknown. All connection failures did seem to point to one specific place. 🤷‍♂️
Anyhow, we intend to keep the patch and hope it makes it to the mainline.

@jrudolph
Contributor

jrudolph commented Feb 8, 2022

Great, thanks again @spangaer. The patch has been merged to main, so it will definitely be in 10.2.8. Should we ever do another 10.1.x release, it will also be included.

@AvaPL

AvaPL commented May 27, 2022

It seems that the issue is still present in 10.2.9. We've been running 3 Alpakka Elasticsearch flows, which internally use Http().singleRequest, for several days without any issues, and after some time we ended up with the log below. I tried to reproduce it locally, but with no luck. It appears only under heavy load, after quite a long time. Locally it runs at most 3 requests in parallel, as expected. No other part of the application uses the same URL as the flows.

Restarting stream due to failure [1]: java.lang.RuntimeException: Request failed for POST https://confidential:1234/_bulk
java.lang.RuntimeException: Request failed for POST https://confidential:1234/_bulk
	at akka.stream.alpakka.elasticsearch.impl.ElasticsearchSimpleFlowStage$StageLogic$$anonfun$onPush$4.applyOrElse(ElasticsearchSimpleFlowStage.scala:102)
	at akka.stream.alpakka.elasticsearch.impl.ElasticsearchSimpleFlowStage$StageLogic$$anonfun$onPush$4.applyOrElse(ElasticsearchSimpleFlowStage.scala:100)
	at scala.concurrent.impl.Promise$Transformation.run$$original(Promise.scala:490)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:141)
	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:63)
	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:100)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:94)
	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:100)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: akka.stream.BufferOverflowException: Exceeded configured max-open-requests value of [32]. This means that the request queue of this pool (HostConnectionPoolSetup(confidential,1234,ConnectionPoolSetup(ConnectionPoolSettings(4,0,5,32,1,Duration.Inf,100 milliseconds,2 minutes,30 seconds,Duration.Inf,ClientConnectionSettings(Some(User-Agent: akka-http/10.2.9),60 seconds,60 seconds,512,None,WebSocketSettings(<function0>,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$1984/0x0000000100b4f040@6c9a4044,false),List(),ParserSettings(2048,16,64,64,8192,64,Some(9223372036854775807),8388608,256,10485760,5,Strict,RFC6265,true,Set(),Full,Error,Error,Error,HashMap(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, If-None-Match -> 0, User-Agent -> 32, Content-MD5 -> 0, Date -> 0, If-Match -> 0),false,false,true,akka.util.ConstantFun$$$Lambda$1242/0x00000001006ef040@5f58a09c,akka.util.ConstantFun$$$Lambda$1242/0x00000001006ef040@5f58a09c,akka.util.ConstantFun$$$Lambda$1243/0x00000001006ef840@7501e445),100 milliseconds,None,Http2ClientSettingsImpl(256,65536,10000000,512000,1024,false,0 seconds,0 seconds,0,3 seconds,100 milliseconds,2 minutes,None),TCPTransport),1 second,List()),akka.http.scaladsl.HttpsConnectionContext@69d149f2,akka.event.MarkerLoggingAdapter@4b65fdf))) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See https://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.
	at akka.http.impl.engine.client.PoolInterface$Logic.<init>(PoolInterface.scala:106)
	at akka.http.impl.engine.client.PoolInterface$PoolInterfaceStage.createLogicAndMaterializedValue(PoolInterface.scala:94)
	at akka.stream.impl.GraphStageIsland.originalMaterializeAtomic(PhasedFusingActorMaterializer.scala:705)
	at akka.stream.impl.GraphStageIsland.materializeAtomic(PhasedFusingActorMaterializer.scala:73)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:503)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:452)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:444)
	at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:746)
	at akka.http.impl.engine.client.PoolInterface$.apply(PoolInterface.scala:80)
	at akka.http.impl.engine.client.PoolMasterActor.akka$http$impl$engine$client$PoolMasterActor$$startPoolInterface(PoolMasterActor.scala:131)
	at akka.http.impl.engine.client.PoolMasterActor$$anonfun$receive$1.applyOrElse(PoolMasterActor.scala:162)
	at akka.actor.Actor.aroundReceive(Actor.scala:537)
	at akka.actor.Actor.aroundReceive$(Actor.scala:535)
	at akka.http.impl.engine.client.PoolMasterActor.aroundReceive(PoolMasterActor.scala:111)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
	at akka.actor.ActorCell.invoke$$original(ActorCell.scala:548)
	at akka.actor.ActorCell.invoke(ActorCell.scala:61)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
	at akka.dispatch.Mailbox.run$$original(Mailbox.scala:231)
	at akka.dispatch.Mailbox.run(Mailbox.scala:32)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
	... 5 more

@spangaer

Seems like in a flow context, if possible, cachedHostConnectionPool would be preferred, so backpressure has a chance to take effect, as I'd suspect the consuming code in this case.
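Pulling requests through a pool-backed flow, as suggested here, could look roughly like the sketch below: the pool flow only demands new requests when a slot is free, so the stream backpressures instead of queueing unboundedly and hitting max-open-requests. Host, port, and paths are placeholders; this is an illustration, not production code:

```scala
import scala.util.{Failure, Success}
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.scaladsl.{Sink, Source}

object PooledFlowExample extends App {
  implicit val system: ActorSystem = ActorSystem("pool-example")
  import system.dispatcher

  // Host-level pool flow: each request is paired with a correlation value
  // (here, an Int id) because responses may arrive out of order.
  val poolFlow = Http().cachedHostConnectionPool[Int]("example.com", 80)

  Source(1 to 10)
    .map(i => HttpRequest(uri = s"/item/$i") -> i)
    .via(poolFlow) // pulls upstream only when a pool slot is available
    .runWith(Sink.foreach {
      case (Success(response), id) =>
        response.discardEntityBytes() // always consume or discard the entity
        println(s"$id -> ${response.status}")
      case (Failure(ex), id) =>
        println(s"$id failed: $ex")
    })
    .onComplete(_ => system.terminate())
}
```

Note that the entity of every response must be consumed or discarded, otherwise the slot stays busy and the pool degrades in a way that looks similar to this issue.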

@AvaPL

AvaPL commented May 27, 2022

Http().singleRequest uses a cached pool per host, so it should not matter that much in this case. Also, when the code is run locally, backpressure is visible: we do not send more requests than the number of parallel flows.

@spangaer

spangaer commented May 28, 2022 via email

@AvaPL

AvaPL commented May 30, 2022

I opened another issue, #4127, as it seems the problem is indeed caused by the restarts, but the pool still isn't cleaned up.

@jrudolph
Contributor

jrudolph commented Jun 7, 2022

See also the conclusion to #4127 in #4133: if a connection cannot be established and connection attempts are backed off, the pool in 10.2.9 and earlier might have less buffer available than expected.
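Given that backed-off connection attempts can eat into the request buffer on 10.2.9 and earlier, one mitigation is to leave more headroom between the connection count and max-open-requests, and to tune the backoff window. The values below are illustrative, not recommendations:

```hocon
akka.http.host-connection-pool {
  max-connections = 4
  # Must be a power of 2; keep it well above max-connections so that
  # embargoed (backed-off) slots don't consume the whole request buffer.
  max-open-requests = 64
  # How long failed connection attempts are backed off before a slot
  # tries to connect again (doubled per consecutive failure, capped).
  base-connection-backoff = 100ms
  max-connection-backoff = 2min
}
```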

@johanandren removed the "3 - in progress" (Someone is working on this ticket) label Mar 16, 2023