Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry storm on fetchOlder in message list #945

Open
gnprice opened this issue Sep 17, 2024 · 0 comments · May be fixed by #1050
Open

Retry storm on fetchOlder in message list #945

gnprice opened this issue Sep 17, 2024 · 0 comments · May be fixed by #1050
Labels
a-api Implementing specific parts of the Zulip server API a-msglist The message-list screen, except what's label:a-content a-sync Event queue; retry; local echo; races

Comments

@gnprice
Copy link
Member

gnprice commented Sep 17, 2024

@alexmv reported the app on his device was sending a rapid series of getMessages requests to the server — around 6 such requests a second.

Because the server rate-limits requests (by default to 200 per minute for each user, total across all types of requests), the user-facing effect was that the requests all failed, with the symptom that scrolling up further in the message list didn't work. The requests also put load on the server, though mitigated by the rate-limiting.

We should fix this.

Diagnosis

From looking at the code, what's happening is:

  • There's no overt retry logic for that fetch.

  • But: the thing that triggers that fetch is a listener for the scroll state. It can get called frequently by Flutter's scrolling logic; it's supposed to be lightweight to call repeatedly.

    So naturally if we already have one of these fetches in flight, that listener promptly returns. Otherwise if you're near the top of the data we have, and we haven't heard from the server that that's the beginning of history, we make a new /api/get-messages request to fetch more.

    And when the fetches are already failing, that amounts to a retry with no backoff.

Implementation

To fix the issue, we'll add backoff to MessageListView.fetchOlder, which is the method that invokes the API request.

When fetchOlder gets an error from the server, it should start a backoff timer (using our BackoffMachine so the intervals escalate) and ignore any further calls until the backoff expires, just like it ignores them when fetchingOlder is true.

We won't add any new retry logic. The fetchOlder call site, in _MessageListState._handleScrollMetrics, already provides plenty of retrying — all the user has to do is touch the screen to scroll slightly up or down, and fetchOlder will get called again. As Alex saw, it can also get called even with no interaction when nothing else seems to be happening.

Out of scope

A further bonus refinement would be for us to understand 429 responses, and wait for the time that that response says:

But to really do that right probably belongs at a different layer, in ApiConnection, so that the information is shared across all the different requests we make on behalf of a given account.

And conversely I think if we fix the rapid retries, that will largely solve the problem. I suspect the reason we ended up with 429s here in the first place was likely because of too-rapid retries after some other source of error. Even if not, exponential backoff should let us get out of the situation.

@gnprice gnprice added a-msglist The message-list screen, except what's label:a-content a-api Implementing specific parts of the Zulip server API labels Sep 17, 2024
@gnprice gnprice added this to the Beta 4: Fall 2024 milestone Sep 17, 2024
@gnprice gnprice added the a-sync Event queue; retry; local echo; races label Oct 22, 2024
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate helper is to wait for the
BackoffMachine asynchronously without blocking fetchOlder.  This is
needed because we have to reset the `_skipFetchingOlder` flag once the
network backoff finishes.

This approach is different from how BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate helper is to wait for the
BackoffMachine asynchronously without blocking fetchOlder (so that we
can reach the finally block and return).

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate helper is to wait for the
BackoffMachine asynchronously without blocking fetchOlder (so that we
can reach the finally block and return).

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
@PIG208 PIG208 linked a pull request Nov 6, 2024 that will close this issue
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate helper is to wait for the
BackoffMachine without blocking fetchOlder so that we can reach
the finally block and return.

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate helper is to wait for the
BackoffMachine without blocking fetchOlder so that we can reach
the finally block and return.

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate class is to wait for the
BackoffMachine without blocking fetchOlder so that we can reach
the finally block and return.

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate class is to wait for the
BackoffMachine without blocking fetchOlder so that we can reach
the finally block and return.

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 6, 2024
The main purpose of having a separate class is to wait for the
BackoffMachine without blocking fetchOlder so that we can reach
the finally block and return.

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 7, 2024
The main purpose of having a separate class is to wait for the
BackoffMachine without blocking fetchOlder so that we can reach
the finally block and return.

This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
PIG208 added a commit to PIG208/zulip-flutter that referenced this issue Nov 7, 2024
This approach is different from how a BackoffMachine is typically used,
because the message list doesn't send and retry requests in a loop; its
caller retries rapidly on scroll changes, and we want to ignore the
excessive requests.

The test drops irrelevant requests with `connection.takeRequests`
without checking, as we are only interested in verifying that no request
was sent.

Fixes: zulip#945

Signed-off-by: Zixuan James Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a-api Implementing specific parts of the Zulip server API a-msglist The message-list screen, except what's label:a-content a-sync Event queue; retry; local echo; races
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

1 participant