HTTP client pool stops processing requests with "Exceeded configured max-open-requests value" #3662
I don't have much experience with sttp. A minimal project that shows the problem in isolation and can be used to further diagnose the problem might be a good start? |
We cannot provide a minimal project at all. The same project runs in our staging environment; this only happens in production under heavy load. sttp should not be a problem here, as these errors come from akka. |
I think this is the most likely case. Since timeouts currently don't purge requests from the pool, you can end up in a situation where all requests will be handled too late and eventually time out (especially since the queueing time is part of the timeout). |
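As a rough illustration (not from the original discussion), one way to keep queueing time from eating into the request timeout is to bound the number of in-flight requests on the caller side. This is only a sketch: the host, request shape, and parallelism of 32 are made-up values.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.scaladsl.{Sink, Source}

object BoundedClient extends App {
  implicit val system: ActorSystem = ActorSystem("bounded-client")

  // Hypothetical workload; in practice this would be the real request stream.
  val requests = Source(1 to 1000).map(i => HttpRequest(uri = s"https://example.com/item/$i"))

  // mapAsync caps how many requests this stream has in flight at once,
  // so far fewer requests sit in the pool queue accumulating queueing time.
  requests
    .mapAsync(parallelism = 32)(request => Http().singleRequest(request))
    .map { response =>
      response.discardEntityBytes() // always consume or discard the entity
      response.status
    }
    .runWith(Sink.foreach(println))
}
```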
Dealing with server overload is a hard problem and there's no good solution to it currently. After all, the goal must be to shave off as much work from the server as possible when you detect the problem (however, where should those requests then go?). Here are some hints at what you could try to do:
If you can provide debug logging for the pool (filter for class |
I'm afraid we don't have However, I see objects of class |
Yes |
We followed your advice to keep queues as small as possible and used the following config:

```
akka.http.host-connection-pool {
  max-connections = 256
  max-open-requests = 256 // Must be a power of 2
}
```

Still, shortly after startup we encountered the same problem with one of our instances: all requests to a single hostname failed, without a single successful completion, for 30 minutes. We will retrieve and analyze the heap dump a bit later; what should we look for besides |
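Not part of the comment above, but for reference, the same two limits can also be set programmatically per request; a sketch assuming akka-http's ConnectionPoolSettings API and a made-up target URI:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.http.scaladsl.settings.ConnectionPoolSettings

object PoolSettingsExample extends App {
  implicit val system: ActorSystem = ActorSystem("pool-settings-example")
  import system.dispatcher

  // Same limits as the HOCON block above, applied to this request's pool.
  val poolSettings = ConnectionPoolSettings(system)
    .withMaxConnections(256)
    .withMaxOpenRequests(256) // must be a power of 2

  Http()
    .singleRequest(HttpRequest(uri = "https://example.com/health"), settings = poolSettings)
    .foreach(_.discardEntityBytes())
}
```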
DEBUG log information might be more interesting than looking through the dumps because it will also give a timeline of what's going on. |
Can you specify which exact loggers we should enable debug-level logging for? We cannot afford to store all debug logs, sadly. |
|
We cannot afford to enable debug logs for these loggers as it increases our log volume tenfold; we simply do not have the resources to store that many logs at our scale. Following the advice, we disabled the queue by setting both All |
That means that all requests are waiting for responses from the server. The |
Nope, we have |
Hard to say what could be the reason for the idle-timeout not being triggered; I think that's the first time I've heard about the idle-timeout not working. This is just a wild guess, but if you use Kamon, you could try running without it. It's hard to diagnose without more info. |
No, we don't use Kamon. By metrics I mean our own non-invasive metrics around request invocations. |
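For reference (not part of the exchange above), the idle-timeout settings under discussion live under akka.http; a minimal sketch with arbitrary example values, since the reporter's actual values aren't shown here:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

object IdleTimeoutExample extends App {
  // Example values only. host-connection-pool.client.* overrides akka.http.client.*
  // for connections that belong to a pool.
  val config = ConfigFactory.parseString(
    """
    akka.http.host-connection-pool.idle-timeout = 30 s
    akka.http.host-connection-pool.client.idle-timeout = 60 s
    """
  ).withFallback(ConfigFactory.load())

  implicit val system: ActorSystem = ActorSystem("idle-timeout-example", config)
}
```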
Looks like we get to join the dance (Akka HTTP 10.1.x). In our case we use rather big pools.
They don't time out and just get stuck for hours. I've seen it auto-resolve in some cases, but the current occurrence has been stuck for about 24h now. |
Another thing I can add is that I have code doing Akka HTTP to Akka HTTP calls that doesn't seem to suffer, while these calls have an ALB sitting in between. |
Could this somehow be related to this failure? #3916 |
If this happens with high CPU load in the pool, then it might be related to the issue fixed in 67ac11c. I'd recommend to update to 10.2.x for big pools (> 500-1000 connections). |
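Not from the comment itself, but for completeness, the upgrade is a dependency bump; a sketch of the sbt coordinates, using the 10.2.7 version mentioned later in the thread (the Akka version is an example, since 10.2.x requires Akka 2.6):

```scala
// build.sbt
libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-http"   % "10.2.7",
  "com.typesafe.akka" %% "akka-stream" % "2.6.18"
)
```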
Does that commit fix some slot release escapes somehow? (I've looked at the commit, but it doesn't seem apparent from there to me.) Yes, high CPU load would have been involved here. |
That commit fixes that the pool management logic itself spins and contributes to CPU load up to a degree where little useful work can be done by the pool. If you are saying it isn't making any progress at all, it might still be something different, but even in that case, it would be good to try on 10.2 and see if it shows the same issue. |
Well, the state is 32k slots in WaitingForResponse, with seemingly all SlotConnections claiming to be connected while nothing is moving anymore (since last evening) and server-side and ALB timeouts long expired. From what you're saying it seems like I should expect a single CPU cooking all this time? Which is not the case. It seems to involve some kind of connectivity issue where the errors escape the slot release process. |
While studying this code (v10.1.x, but still valid): isn't that a race condition? If lines 271 and 288 are reached at the same time, two invocations of updateState() will be running side by side and the outcome will be rather unpredictable.
Or am I missing something? |
Further study: If a materialized At that point, no timeout code is active in the connection pool. |
This fix seems to acknowledge such a possibility. Any chance that heavy CPU load could make the 100 ms fall short? |
By now I have something of a smoking gun. I'm able to reproduce by creating load on a local system and getting some hanging slots (I was trying to validate whether those settings did something for me). With debug logging enabled they spit out this
I found these by first looking at one hanging slot and finding that message, then looking for the message, which occurs exactly as many times as there are hanging slots (81 times). Looking at the code, it seems to tell me that an attempt is made to send an HttpRequest to an HTTP stream, but that stream can no longer process it, so it responds with a cancellation instead, which is logged and nothing more. I'm guessing the hope is that onUpstreamFailure(ex: Throwable) is invoked too, but that doesn't seem to happen. Which makes me think that the cancellation signal should be processed as some kind of failure instead of just being logged. |
By now I've figured out that thing is sort of the end of the line, but more is going on before it ever gets there. I'm working on making that visible. |
Ok, I went through seven hells to exert the right amount of stress to reproduce the problem and at the same time not lose the debug logging, but I think I'm there. Note: for things that have the same timestamp, you cannot trust the order, as it gets sorted retroactively and there's no extra counter to ensure ordering.
I'm still in the process of analyzing what that sequence of events means, but for blocking slots it fairly consistently looks like this, so I thought I could share it already.
|
Thanks for this analysis, @spangaer. I'm going to have a quick look if anything stands out. Have you already tried with Akka HTTP 10.2.7? In any case, a fix will only appear in 10.2.7 as 10.1.x is now in deep maintenance mode. |
Ok, I think I found a problem when the connection is closed while only part of the response headers have been received. |
Hmm, that's a potential problem, but not one that should happen in the pool; the stages are closed once the responses are pulled (which they are all the time for the pool). |
I meant for a different issue, "Connection reset by peer", which happens if an idle connection is closed by the other side. I've seen that go sour at the time a new request goes in, which then fails for no good reason, but it is easily solved by closing the client side before you expect the server side to do the same. It now looks to be addressed by the
Neat graph btw, how did you generate that? (haven't figured out how to interpret it though) |
Hmm. Doesn't the cancellation signal that that structure is being cleaned up? I thought the delay that was introduced is meant to give a chance for a Throwable to come out the back door, but by the time the cancellation arrives all hope is lost, so you can start thinking of cleaning up? |
I dug a bit deeper but I cannot reproduce anything with the actual pool involved either. If you want to play around, you could try whether changing this makes any difference:
You could also try to run |
This is what creates those |
I guess that change would just clean up after cancellation? Which would probably unlock the slot, right? I'll probably give that a shot at some point. What beats me at this moment is (I've ordered the log statements as I expect them to be occurring):
Where are those
Next to that, all 85 "Connection cancellation" statements arrive in a time span of 600 ms, which is preceded by a 23-second gap in all logging, included in the 25-second gap here. |
And it's not one big GC pause; I have other records that do show activity. |
Still need to get to this. Just a sanity check: we can expect "this to solve it", so it's more a matter of seeing that cancellations occur, but slots get unblocked anyway. Correct? |
That means that a connection attempt failed for any connection on that pool. In that case, we send all slots a notice to hold back creating new connections if needed until the embargo is over. This doesn't affect connected slots.
Ultimately, the cancellation issue is likely caused by a downstream event that closes many connections. So, maybe something else is already amiss before that event.
This could help, it will propagate the cancellation back to the remaining alive part of the HTTP connection implementation and also frees the slot. The exact behavior is hard to say since I didn't manage to reproduce the conditions leading to the behavior. |
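For context (an addition, not part of the reply above), the embargo behaviour described there is driven by the pool's connection backoff settings; a sketch with example values:

```scala
import com.typesafe.config.ConfigFactory

object ConnectionBackoffExample extends App {
  // Example values; these control how long slots hold back from opening new
  // connections after a connection attempt on the pool has failed (the
  // "embargo"). Already-connected slots are not affected.
  val config = ConfigFactory.parseString(
    """
    akka.http.host-connection-pool.base-connection-backoff = 100 ms
    akka.http.host-connection-pool.max-connection-backoff = 2 min
    """
  ).withFallback(ConfigFactory.load())

  println(config.getDuration("akka.http.host-connection-pool.max-connection-backoff"))
}
```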
OK, this time around my biggest challenge was getting that patch built and keeping my test rig working. I actually reverted some of my previous settings (symmetric). I patched with a variant of the suggestion:

```scala
override def onDownstreamFinish(): Unit =
  withSlot { slot =>
    slot.debug("Connection cancelled")
    slot.onConnectionFailed(new IllegalStateException("http stream connection cancelled"))
    responseIn.cancel()
  }
```

And as you can tell by these logs, I put the hammer down on my machine. It looks like it works as intended. The interesting bits are around
I think it would have fallen on its feet. The slot does seem to unlock and continue from there. My machine was so broken down that it became dysfunctional after that: the profiler disconnected and failed to reconnect, and even accessing my web services stopped working eventually. I will probably take this patch, put it on an RND deployment, and see if it does sensible things there. It does seem like it makes sense to take it further.

slot 2342
slot 2303
slot 2270
|
Thanks for the update. Maybe we can get a similar patch in even if we don't know how we got into this situation. |
Refs akka#3662. In the issue, it was observed that a connection might only cancel a pool connection but would not fail it (which was the implicit assumption of the existing code). In this case a pool might get more and more unresponsive, with slots hanging after connections are cancelled but not failed. The exact condition could not be reproduced, but let's be proactive here and fail the slot if we see an unexpected cancellation.
I was about to post the patch commit, but I saw you took it up and refined it so I've picked the commits from your branch and published a local build to deploy. I'll report on the effects as we observe them. |
So, we've been running the patch in our dev cluster for a bit and it doesn't seem to cause issues so far, while the connection problem did indeed occur. The trigger seems to be CPU starvation, but the exact cause is unknown. All connection failures did seem to point to one specific place. 🤷♂️ |
Great, thanks again @spangaer. The patch has been merged to |
It seems that the issue is still here in 10.2.9. We've been running 3 Alpakka Elasticsearch flows, which internally use
|
Seems like in a flow context, if possible, cachedConnectionPool would be preferred, so there's a chance of backpressure to take effect. |
|
Yes, it makes a pool; it would even be the same pool. But you'd only avoid that error if you ensure that the sum of outstanding requests does not exceed that setting on a system level. The retry makes me suspect that is at risk.
Come to think of it, given the above, I'm wondering whether you're not better served with a single-connection HTTP flow. It will be more aggressive towards ES though (as it would succeed in retrying sooner).
|
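To make the pooled-flow suggestion concrete, here's a rough sketch (mine, with a hypothetical Elasticsearch host and a plain Int as correlation value) of using the cached host connection pool as a flow, so the stream only pulls new requests when the pool has capacity:

```scala
import scala.util.{Failure, Success}

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpRequest
import akka.stream.scaladsl.{Sink, Source}

object PooledFlowExample extends App {
  implicit val system: ActorSystem = ActorSystem("pooled-flow-example")

  // Flow[(HttpRequest, Int), (Try[HttpResponse], Int), HostConnectionPool]:
  // requests are only pulled when a slot is free, so backpressure applies
  // instead of the "max-open-requests exceeded" failure.
  val poolFlow = Http().cachedHostConnectionPool[Int]("elasticsearch.example.com", 9200)

  Source(1 to 100)
    .map(i => HttpRequest(uri = s"/docs/$i") -> i) // the Int travels along as a correlation id
    .via(poolFlow)
    .runWith(Sink.foreach {
      case (Success(response), id) =>
        response.discardEntityBytes()
        println(s"request $id: ${response.status}")
      case (Failure(cause), id) =>
        println(s"request $id failed: $cause")
    })
}
```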
I opened another issue #4127, as it seems that indeed the issue is caused by the restarts, but it still doesn't clean up the pool. |
We use sttp v2 with the single-request API (no streams). Some configs:
At some point in time this happens:
Here the requests to typing stopped working at 17:16, shortly after the server started at 17:11. Such a freeze can happen at any point of the instance's life though, even days after the start. Here are some hypotheses that we considered:
Error log: