You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Some cases come up when we do pageserver restarts/migrations while LSN leases are in play, and the pageserver's gc_cutoff has advanced past the lease.
Validation of lease LSN vs. gc_cutoff: this makes sense for first request but not renewals. We cannot know the difference from the server side, so: page_service lease requests should not do the validation (these are used for renewals), but http API requests should do the validation (only used for initial grant of lease).
getpage requests are validated against latest_gc_cutoff, with a special case for requests at leases. However, after pageserver restart/migration, there is a period where the compute may send getpage requests before it has sent its lease requests. To narrow this window, we should make compute_ctl send a lease renewal as soon as it sees a /configure API calls that updates pageserver_connstr.
On point 2, the suggested change doesn't eliminate the case of bad requests, but it limits to:
A very short period after migration/restart
Only happens when the GC cutoff has advanced past the lease location: most uses of static endpoints on tenants with ~24hr PITR will never hit this case.
We must somehow document that this is a legitimate case where we might see "client requested an LSN that has been GC'd", so that we don't get too worried if we ever see this
The text was updated successfully, but these errors were encountered:
I do not fully understand the motivation of validation removal:
In 1. it's proposed that we do not do validation in libpq protocol, but only do it in HTTP.
Yet, in 2. it's mentioned that if the pageserver is restarted, for a short period, we do not have any lease, so once compute tries to renew it, it will be kinda 'acquire', not 'renew', right?
(the race explanation part looks valid, though)
A bit of a context. From the compute point of view, it's valuable to know that lease renewal failed permanently, and it doesn't make much sense to retry. Then we can stop retrying, set an internal error state; and in theory, we can teach cplane to notice that and shut down the compute. So I'd still prefer to have some permanent error in both APIs. Discussed that with @yliang412 a bit
Some cases come up when we do pageserver restarts/migrations while LSN leases are in play, and the pageserver's gc_cutoff has advanced past the lease.
On point 2, the suggested change doesn't eliminate the case of bad requests, but it limits to:
We must somehow document that this is a legitimate case where we might see "client requested an LSN that has been GC'd", so that we don't get too worried if we ever see this
The text was updated successfully, but these errors were encountered: