
Looks like 1-1-restore requires disabling tgc and repair afterwards + other things #4234

VAveryanov8 opened this issue Jan 30, 2025

VAveryanov8 commented Jan 30, 2025

Views

In the regular restore procedure we drop views before the data is uploaded to the nodes and re-create them afterwards. This is necessary for data consistency: the snapshot operation in Scylla is not atomic, so a backup can contain different numbers of rows in the base and view tables. The way to fix this during restore is to re-create the views from the base tables and ignore the view data during the backup.
I think there is no difference for 1-1-restore in that regard and we need to do the same, otherwise we can end up with a data mismatch between the base table and the view table.

P.S. SM doesn't include views and secondary indexes in the backup (https://manager.docs.scylladb.com/stable/backup/#selecting-tables-and-nodes-to-back-up)
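For illustration, the drop/re-create step could look like the following for a hypothetical base table and view (the names ks.users and ks.users_by_email are made up; in practice SM would drive this from the schema saved in the backup):

```bash
# Before loading sstables: drop the view so it cannot diverge from the base table.
cqlsh -e "DROP MATERIALIZED VIEW IF EXISTS ks.users_by_email;"

# After all base-table data is loaded: re-create the view from the saved schema;
# Scylla then rebuilds it from the restored base table.
cqlsh -e "
  CREATE MATERIALIZED VIEW ks.users_by_email AS
    SELECT * FROM ks.users
    WHERE email IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (email, id);"
```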

TGC and Repair

A backup can be taken on 2025.01.01 but restored only on 2025.01.11, and the new cluster may delete all tombstones because their age is greater than gc_grace_period (more than 10 days have passed since the backup date).
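A sketch of what disabling (and later re-enabling) tombstone GC could look like, assuming a hypothetical table ks.users; tombstone_gc is the per-table Scylla option and 'timeout' is its default mode:

```bash
# Keep all tombstones until tgc is explicitly re-enabled, so nothing is purged
# before the restored data has been repaired.
cqlsh -e "ALTER TABLE ks.users WITH tombstone_gc = {'mode': 'disabled'};"

# ... restore + repair ...

# Back to the default time-based tombstone GC once the table is repaired.
cqlsh -e "ALTER TABLE ks.users WITH tombstone_gc = {'mode': 'timeout'};"
```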

Versioned files

The backup can contain versioned files which need to be handled separately: the versioned suffix has to be removed from the file names.
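A minimal sketch of how that could be done after download, assuming the versioned copies carry an extra `.sm_<timestamp>UTC` suffix appended to the regular sstable file name (the path and the exact suffix format here are illustrative):

```bash
# Rename e.g. me-...-big-Data.db.sm_20250130120000UTC back to me-...-big-Data.db
for f in /var/lib/scylla/data/ks/users-*/upload/*.sm_*UTC; do
  [ -e "$f" ] || continue
  mv "$f" "${f%.sm_*UTC}"
done
```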

VAveryanov8 commented

Relates to #4202

Michal-Leszczynski commented

The difference regarding repair for the 1-1-restore is that if we do:

  • disable tgc
  • drop the views
  • download all the data
  • nodetool refresh all the data (see the sketch below)
  • rebuild the views

We get a fully operational cluster that the client can start using.
It would still require a repair before re-enabling tgc, but this can happen later as a background task.
This means that the time needed to restore the data and meet the SLA wouldn't be extended by the repair time.
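For reference, the refresh step from the list above boils down to loading the downloaded sstables from each table's upload directory; placeholder keyspace/table names, and in practice SM runs this on every node for every restored table:

```bash
# sstables are downloaded into the table's upload directory, e.g.
#   /var/lib/scylla/data/ks/users-<table-uuid>/upload/
# and then loaded into the live table:
nodetool refresh ks users
```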

In the case of the regular restore, if the client wanted to use the cluster before the repair finished, they would need to change the query consistency level to ALL in their application (this restore uses --primary-replica-only), which means that:

  • client needs to adjust their application
  • they can expect degraded performance

So it's less appealing.
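For illustration only: until the repair finishes, such a cluster returns complete results only when read at consistency level ALL, which the application would have to set in its driver; the same thing from cqlsh with a placeholder query:

```bash
# CL=ALL makes every read include the single replica that actually holds the
# restored data (--primary-replica-only loads each row onto one replica only).
cqlsh -e "CONSISTENCY ALL; SELECT * FROM ks.users LIMIT 10;"
```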

If we go in the direction of allowing users to use the cluster before the repair, it shouldn't be a direct part of the restore task (like with the regular restore), but rather a follow-up task. We discussed two ways to accomplish that:

  1. 1-1-restore will schedule a separate, special repair task on success. This task would repair the restored tables, but also re-enable their tgc (the info about tgc can be taken from the SM DB or passed as a hidden repair param). The problem with this solution is that it doesn't ensure that tgc will be re-enabled in the future. The repair task might be stopped by the user or simply fail for some reason. Then, if the user forgets about it, they can end up running a cluster with disabled tgc, which would result in amplified disk usage over time and degraded performance.

  2. Instead of disabling tgc, we can set it to the repair mode. This way, the situation where the user didn't switch it back is less scary: they still need to run repairs, and those will take care of the tombstones (see the sketch below).

We can even combine those two approaches, which seems like the best solution if we want to exclude the repair time from the restore time.
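A sketch of option 2, again with a hypothetical table name: in the repair mode Scylla only purges tombstones covering data that has been repaired since the tombstone was written, so an unrepaired restored table keeps its tombstones without the operator having to remember to flip anything back.

```bash
cqlsh -e "ALTER TABLE ks.users WITH tombstone_gc = {'mode': 'repair'};"
```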
