pageserver: LSN lease edge cases around restarts/migrations #8817

Open
Tracked by #7497
jcsp opened this issue Aug 23, 2024 · 1 comment · May be fixed by #9055
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments

@jcsp
Contributor

jcsp commented Aug 23, 2024

Some cases come up when we do pageserver restarts/migrations while LSN leases are in play, and the pageserver's gc_cutoff has advanced past the lease.

  1. Validation of lease LSN vs. gc_cutoff: this makes sense for the first request but not for renewals. We cannot tell the two apart on the server side, so: page_service lease requests should not do the validation (these are used for renewals), but HTTP API requests should do the validation (these are only used for the initial grant of a lease).
  2. getpage requests are validated against latest_gc_cutoff, with a special case for requests at leases. However, after a pageserver restart/migration, there is a period where the compute may send getpage requests before it has sent its lease requests. To narrow this window, we should make compute_ctl send a lease renewal as soon as it sees a /configure API call that updates pageserver_connstr.
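
To make point 1 concrete, here is a minimal sketch of the proposed split, not the pageserver's actual code. The names `Lsn`, `LeaseRequestSource`, `LeaseError` and `check_lease_request` are hypothetical, and treating a lease exactly at the cutoff as valid is an assumption of the sketch.

```rust
/// Log sequence number. A plain u64 wrapper stands in for the real LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Which API a lease request arrived on (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LeaseRequestSource {
    /// HTTP API: only used for the initial grant of a lease.
    HttpApi,
    /// page_service: used for renewals.
    PageService,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    /// The requested LSN is already behind the GC cutoff.
    BelowGcCutoff { requested: Lsn, gc_cutoff: Lsn },
}

/// Validate against gc_cutoff only for initial grants (HTTP).
/// Renewals over page_service skip the check, because after a
/// restart/migration the cutoff may already have passed the lease.
pub fn check_lease_request(
    source: LeaseRequestSource,
    requested: Lsn,
    gc_cutoff: Lsn,
) -> Result<(), LeaseError> {
    match source {
        // Assumption: an LSN equal to the cutoff is still acceptable.
        LeaseRequestSource::HttpApi if requested < gc_cutoff => {
            Err(LeaseError::BelowGcCutoff { requested, gc_cutoff })
        }
        LeaseRequestSource::HttpApi | LeaseRequestSource::PageService => Ok(()),
    }
}
```

With this shape, the same LSN can be rejected on the HTTP path and accepted on the page_service path, which is exactly the asymmetry the proposal relies on.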

On point 2, the suggested change doesn't eliminate bad requests entirely, but it limits them to:

  • A very short period after migration/restart
  • Only happens when the GC cutoff has advanced past the lease location: most uses of static endpoints on tenants with ~24hr PITR will never hit this case.
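
The remaining window can be illustrated with a toy model, assuming leases live only in pageserver memory and are therefore gone after a restart until the compute renews them. `TimelineState` and its methods are hypothetical names for this sketch, not the real pageserver types.

```rust
use std::collections::BTreeSet;

/// Toy model of the state a pageserver consults when validating getpage requests.
pub struct TimelineState {
    pub gc_cutoff: u64,
    /// LSNs with an active lease. Assumed to be in-memory only.
    pub leases: BTreeSet<u64>,
}

impl TimelineState {
    /// State right after a restart/migration: the cutoff is known, no leases yet.
    pub fn after_restart(gc_cutoff: u64) -> Self {
        TimelineState { gc_cutoff, leases: BTreeSet::new() }
    }

    /// Record a lease renewal sent by the compute.
    pub fn renew_lease(&mut self, lsn: u64) {
        self.leases.insert(lsn);
    }

    /// A getpage request passes if it is at or above the cutoff,
    /// or if it is made exactly at a leased LSN (the special case).
    pub fn getpage_allowed(&self, request_lsn: u64) -> bool {
        request_lsn >= self.gc_cutoff || self.leases.contains(&request_lsn)
    }
}
```

In this model a static endpoint pinned at LSN 100 against a cutoff of 200 is refused until `renew_lease(100)` has been called, so sending the renewal right after /configure shortens the time spent in the refused state. An endpoint whose LSN is still above the cutoff is never affected.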

We must somehow document that this is a legitimate case where we might see "client requested an LSN that has been GC'd", so that we don't get too worried if we ever see it.

@ololobus
Member

I do not fully understand the motivation of validation removal:

  1. In 1. it's proposed that we do not do validation in libpq protocol, but only do it in HTTP.
  2. Yet, in 2. it's mentioned that if the pageserver is restarted, for a short period, we do not have any lease, so once compute tries to renew it, it will be kinda 'acquire', not 'renew', right?

(the race explanation part looks valid, though)

A bit of context. From the compute's point of view, it's valuable to know that a lease renewal failed permanently, so that retrying doesn't make sense. Then we can stop retrying and set an internal error state; in theory, we can teach cplane to notice that and shut down the compute. So I'd still prefer to have some permanent error in both APIs. Discussed that with @yliang412 a bit.
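
A rough sketch of what that could look like on the compute side, assuming the pageserver distinguishes permanent from transient failures. The types `LeaseRenewalError` and `RenewalState` and the function `renew_with_retries` are hypothetical, not existing compute_ctl code.

```rust
/// How a single renewal attempt can fail (hypothetical classification).
#[derive(Debug, PartialEq, Eq)]
pub enum LeaseRenewalError {
    /// Retrying cannot help, e.g. the LSN has been garbage collected.
    Permanent(String),
    /// Worth retrying, e.g. the pageserver is still starting up.
    Transient(String),
}

/// Outcome the compute records after a renewal round.
#[derive(Debug, PartialEq, Eq)]
pub enum RenewalState {
    /// The lease was renewed.
    Active,
    /// Only transient errors were seen; try again in the next round.
    RetryLater,
    /// Internal error state that a control plane could observe.
    Failed(String),
}

/// Call `attempt` up to `max_attempts` times, stopping at once on a permanent error.
pub fn renew_with_retries<F>(mut attempt: F, max_attempts: u32) -> RenewalState
where
    F: FnMut() -> Result<(), LeaseRenewalError>,
{
    for _ in 0..max_attempts {
        match attempt() {
            Ok(()) => return RenewalState::Active,
            // Permanent: give up immediately and keep the reason.
            Err(LeaseRenewalError::Permanent(reason)) => return RenewalState::Failed(reason),
            // Transient: fall through to the next attempt.
            Err(LeaseRenewalError::Transient(_)) => continue,
        }
    }
    RenewalState::RetryLater
}
```

The point of the split is that a permanent error costs exactly one attempt, while transient errors never put the compute into the failed state by themselves.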
