
Looks like 1-1-restore requires disabling tgc and repair afterwards + other things #4234

VAveryanov8 opened this issue Jan 30, 2025

VAveryanov8 commented Jan 30, 2025

Views

In the regular restore procedure we drop views before the data is uploaded to the nodes and re-create them afterwards. This is necessary for data consistency: the snapshot operation in Scylla is not atomic, so a backup can contain different numbers of rows in the base and view tables. The way to fix this during restore is to re-create the views from the base tables and ignore the view data during the backup.
I think there is no difference for 1-1-restore in that regard and we need to do the same, otherwise we can end up with a data mismatch between the base table and the view table.

P.S. SM doesn't include views and secondary indexes in the backup (https://manager.docs.scylladb.com/stable/backup/#selecting-tables-and-nodes-to-back-up)
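For illustration, the drop/re-create step could look like the following for a hypothetical base table and view (the names ks.users and ks.users_by_email are made up; in practice SM would drive this from the schema saved in the backup):

```bash
# Before loading sstables: drop the view so it cannot diverge from the base table.
cqlsh -e "DROP MATERIALIZED VIEW IF EXISTS ks.users_by_email;"

# After all base-table data is loaded: re-create the view from the saved schema;
# Scylla then rebuilds it from the restored base table.
cqlsh -e "
  CREATE MATERIALIZED VIEW ks.users_by_email AS
    SELECT * FROM ks.users
    WHERE email IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (email, id);"
```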

TGC and Repair

A backup can be taken on 2025.01.01 but restored only on 2025.01.11, and the new cluster may delete all tombstones because their age is greater than gc_grace_period (more than 10 days have passed since the backup date).
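A sketch of what disabling (and later re-enabling) tombstone GC could look like, assuming a hypothetical table ks.users; tombstone_gc is the per-table Scylla option and 'timeout' is its default mode:

```bash
# Keep all tombstones until tgc is explicitly re-enabled, so nothing is purged
# before the restored data has been repaired.
cqlsh -e "ALTER TABLE ks.users WITH tombstone_gc = {'mode': 'disabled'};"

# ... restore + repair ...

# Back to the default time-based tombstone GC once the table is repaired.
cqlsh -e "ALTER TABLE ks.users WITH tombstone_gc = {'mode': 'timeout'};"
```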

Versioned files

The backup can contain versioned files which need to be handled separately: the versioned suffix has to be removed from the file names.
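A minimal sketch of how that could be done after download, assuming the versioned copies carry an extra `.sm_<timestamp>UTC` suffix appended to the regular sstable file name (the path and the exact suffix format here are illustrative):

```bash
# Rename e.g. me-...-big-Data.db.sm_20250130120000UTC back to me-...-big-Data.db
for f in /var/lib/scylla/data/ks/users-*/upload/*.sm_*UTC; do
  [ -e "$f" ] || continue
  mv "$f" "${f%.sm_*UTC}"
done
```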

VAveryanov8 commented

Relates to #4202

Michal-Leszczynski commented

The difference regarding repair for the 1-1-restore is that if we do:

  • disable tgc
  • drop the views
  • download all the data
  • nodetool refresh all the data (see the sketch below)
  • rebuild the views

We get a fully operational cluster that the client can start using.
It would still require a repair before re-enabling tgc, but this can happen later as a background task.
This means that the time needed to restore the data and meet the SLA wouldn't be extended by the repair time.
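For reference, the refresh step from the list above boils down to loading the downloaded sstables from each table's upload directory; placeholder keyspace/table names, and in practice SM runs this on every node for every restored table:

```bash
# sstables are downloaded into the table's upload directory, e.g.
#   /var/lib/scylla/data/ks/users-<table-uuid>/upload/
# and then loaded into the live table:
nodetool refresh ks users
```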

In the case of the regular restore, if the client wanted to use the cluster before the repair finished, they would need to change the query consistency level to ALL in their application (this restore uses --primary-replica-only), which means that:

  • client needs to adjust their application
  • they can expect degraded performance

So it's less appealing.
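For illustration only: until the repair finishes, such a cluster returns complete results only when read at consistency level ALL, which the application would have to set in its driver; the same thing from cqlsh with a placeholder query:

```bash
# CL=ALL makes every read include the single replica that actually holds the
# restored data (--primary-replica-only loads each row onto one replica only).
cqlsh -e "CONSISTENCY ALL; SELECT * FROM ks.users LIMIT 10;"
```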

If we go in the direction of allowing users to use the cluster before the repair, it shouldn't be a direct part of the restore task (like with the regular restore), but rather a follow-up task. We discussed two ways to accomplish that:

  1. 1-1-restore will schedule a separate, special repair task on success. This task would repair the restored tables, but also re-enable their tgc (the info about tgc can be taken from the SM DB or passed as a hidden repair param). The problem with this solution is that it doesn't ensure that tgc will be re-enabled in the future. The repair task might be stopped by the user or simply fail for some reason. Then, if the user forgets about it, they can end up running a cluster with disabled tgc, which would result in amplified disk usage over time and degraded performance.

  2. Instead of disabling tgc, we can set it to the repair mode. This way, the situation where the user didn't switch it back is less scary: they still need to run repairs, and those will take care of the tombstones (see the sketch below).

We can even combine those two approaches, which seems like the best solution if we want to exclude the repair time from the restore time.
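A sketch of option 2, again with a hypothetical table name: in the repair mode Scylla only purges tombstones covering data that has been repaired since the tombstone was written, so an unrepaired restored table keeps its tombstones without the operator having to remember to flip anything back.

```bash
cqlsh -e "ALTER TABLE ks.users WITH tombstone_gc = {'mode': 'repair'};"
```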
