As long as we have the database and Gitaly backups, we can restore deleted GitLab projects.
It is strongly suggested to read through this entire document before proceeding.
There is an alternate runbook for a different method of restoring namespaces, but the one in this document should be preferred.
There are two sources of data that we will be restoring: the project metadata (issues, merge requests, members, etc.), which is stored in the main database (Postgres), and the repositories (main and wiki) which are stored on a Gitaly shard.
Container images and CI artifacts are not restored by this process.
If a project is deleted in GitLab, it is entirely removed from the database, which means we also lose the metadata needed to locate its data on the file servers. Recovering the metadata and project data is a multi-step process:
- Restore a full database backup and perform point-in-time recovery (PITR)
- Extract metadata necessary to recover from git/wiki data from file servers
- Export the project from the database backup and import into GitLab
We run a delayed archive replica of our production database, with `recovery_min_apply_delay = '8h'` in its `recovery.conf`. It is therefore at least 8 hours behind the production database at all times. If the request for restoration comes in quickly enough, we can skip the creation of a PITR instance and use this delayed replica instead.
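To see how far behind the delayed replica currently is, you can check its replay lag directly. A minimal sketch, run on the delayed replica itself (assuming the standard Omnibus `gitlab-psql` wrapper is available there):

```shell
# Difference between now and the commit time of the last replayed transaction
sudo gitlab-psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"
```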
Note that this procedure has not been used, and is speculative. Inform the engineer on-call before continuing. This will likely set off alerts due to lagging replication, which will need to be silenced for the duration of this procedure.
- `ssh` to the delayed replica.
- In a `gitlab-psql` shell: `SELECT pg_xlog_replay_pause();` (you can verify this took effect; see the sketch after these steps)
- `systemctl stop chef-client.service`

Later, when you have extracted the project export:

- In a `gitlab-psql` shell: `SELECT pg_xlog_replay_resume();`
- `systemctl start chef-client.service`
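If you want to confirm that replay is actually paused (and, later, resumed), Postgres exposes a status function for this. A quick sketch, assuming the same pre-10 `pg_xlog_*` naming used above:

```shell
# Returns 't' while replay is paused, 'f' once it has been resumed
sudo gitlab-psql -c "SELECT pg_is_xlog_replay_paused();"
```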
Continue at "Export project from database backup and import into GitLab".
If the request arrived promptly and you were able to follow the special procedure above, skip this section.
In order to restore from a database backup, we leverage the backup restore pipeline in the "gitlab-restore" project. It can be configured to start a new GCE instance and restore a backup to an exact point in time for later recovery (example MR). Currently, Postgres backups are created by WAL-E. To restore from such backups, either WAL-G or WAL-E can be used. The default is WAL-G, as it restores 3-4 times faster than WAL-E. Use the `WAL_E_OR_WAL_G` CI variable to switch to WAL-E if needed (see below).
- Push a commit similar to the example MR above. Note that you don't need to create an MR, although you can if you like.
- You can start the process in CI/CD Pipelines of the "gitlab-restore" project (or trigger it via the pipelines API; see the sketch after this list).
- Select your branch, and configure the variables as detailed in the steps below.
- To perform PITR for the production database, set the CI/CD variable `ENVIRONMENT` to `gprd`. The default value is `gstg`, meaning that the staging database will be restored.
- To ensure that your instance won't get destroyed at the end, set the CI/CD variable `NO_CLEANUP` to `1`.
- In CI/CD Pipelines, when starting a new pipeline, you can choose any Git branch. But if you use something other than `master`, there is a high chance that the production DB copy won't fit the disk. So, use the `GCE_DATADISK_SIZE` CI/CD variable to provision an instance with a large enough disk. As of January 2020, we need at least `6000` (6000 GiB). Check the `GCE_DATADISK_SIZE` value that is currently used in the backup verification schedules (see CI/CD Schedules).
- By default, an instance of type `n1-standard-16` will be used. Such instances have "good enough" IO throughput and IOPS quotas (Google throttles disk IO based on the disk size and the number of vCPUs). In case of urgency, to get the best performance possible on GCP, consider using `n1-highcpu-32` by specifying the CI/CD variable `GCE_INSTANCE_TYPE` in the CI/CD pipeline launch interface. It is highly recommended to check the current resource consumption (total vCPUs, RAM, disk space, IP addresses, and the number of instances and disks in general) in the GCP quotas interface of the "gitlab-restore" project.
- It is recommended (although not required) to specify the instance name using the CI/CD variable `INSTANCE_NAME`. Custom names help distinguish your GCE instance from those provisioned automatically or by someone else. An excellent example of a custom name: `nik-gprd-infrastructure-issue-1234` (meaning: requested by `nik`, for environment `gprd`, for the `infrastructure` issue `1234`). If a custom name is not set, your instance gets a name like `restore-postgres-gprd-XXXXX`, where `XXXXX` is the CI/CD pipeline ID.
- As mentioned above, WAL-G is used by default to restore from a backup. This is controlled by the CI variable `WAL_E_OR_WAL_G` (default value: `wal-g`). If WAL-E is needed, set `WAL_E_OR_WAL_G` to `wal-e`, but expect restoration to take much longer. For the "restore basebackup" phase on `n1-standard-16`, the expected speed of filling the PGDATA directory is 0.5 TiB per hour with WAL-E and 2 TiB per hour with WAL-G.
- To monitor the process, SSH to the instance. The main items to check: `df -hT /var/opt/gitlab` to ensure that the disk is not full (if it hits 100%, this won't be noticeable in the CI/CD interfaces, unfortunately), and `sudo journalctl -f` to see basebackup fetching and, later, WAL fetching/replaying happening.
- Finally, especially if you have made multiple attempts to provision an instance via the CI/CD Pipelines interface, check VM Instances in the GCP console to ensure that there are no stalled instances related to your work. If there are, delete them manually. If you suspect that your attempts are failing because of WAL-G issues, try WAL-E (see above).
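If you prefer to start the pipeline from the command line rather than the web UI, the standard pipeline-creation API can be used. This is a hedged sketch only: the API host, `PROJECT_ID`, `GITLAB_TOKEN`, and the branch name are placeholders, and the variable values mirror the recommendations above:

```shell
# Create a PITR restore pipeline on a custom branch of the "gitlab-restore" project.
# Host, PROJECT_ID, GITLAB_TOKEN, and branch name are placeholders.
curl --request POST \
  --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{
    "ref": "my-restore-branch",
    "variables": [
      { "key": "ENVIRONMENT",       "value": "gprd" },
      { "key": "NO_CLEANUP",        "value": "1" },
      { "key": "GCE_DATADISK_SIZE", "value": "6000" },
      { "key": "INSTANCE_NAME",     "value": "nik-gprd-infrastructure-issue-1234" }
    ]
  }' \
  "https://gitlab.example.com/api/v4/projects/${PROJECT_ID}/pipeline"
```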
The instance will progress through a series of operations:
- The basebackup will be downloaded
- The Postgres server process will be started, and will begin progressing past the basebackup by recovering from WAL segments downloaded from GCS.
- Initially, Postgres will be in crash recovery mode and will not accept connections.
- At some point, Postgres will accept connections, and you can check its recovery point by running `SELECT pg_last_xact_replay_timestamp();` in a `gitlab-psql` shell.
- Check back every 30 minutes or so until the recovery point you wanted has been reached (see the polling sketch below). You don't need to do anything to stop further recovery; your branch has configured it to pause at this point.
After the process completes, an instance with a full GitLab installation and a production copy of the database is available for the next steps.
Note that the startup script will never actually exit, because the branch configuration causes Postgres to pause recovery once the target point is reached: the script loops forever, waiting for a recovery point equal to its start time.
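A small polling sketch for the "check back every 30 minutes" step above; nothing here is required, it just saves manual checking. It assumes `gitlab-psql` works on the restore instance once Postgres accepts connections, and the target timestamp is a placeholder:

```shell
# Poll the replay position until the desired recovery point has been reached.
TARGET="2020-01-15 10:00:00+00"   # placeholder: the point in time you are recovering to
while true; do
  echo "last replayed transaction: $(sudo gitlab-psql -t -c 'SELECT pg_last_xact_replay_timestamp();')"
  # Break once the last replayed transaction is at or past the target
  sudo gitlab-psql -t -c "SELECT pg_last_xact_replay_timestamp() >= '${TARGET}'::timestamptz;" | grep -q t && break
  sleep 1800
done
```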
Here, we use the restored database instance, which has a GitLab installation, to export the project through the standard import/export mechanism. We want to avoid starting a full GitLab instance (to perform the export through the UI) because it sits on a full-sized production database. Instead, we use a Rails console to trigger the export.
- Start Redis: `gitlab-ctl start redis` (Redis is not really going to be used, but it's a required dependency)
- Start a Rails console: `gitlab-rails console`
Modify the literals in the following console example and run it. This retrieves a project by its ID, which we obtain by searching for it by namespace ID and name. We also retrieve an admin user; use your own account for auditability. The ProjectTreeSaver needs to run "as a user", so we use an admin user to ensure that we have permissions.
```
irb(main):024:0> Namespace.find_by_path('some-ns')
=> #<Group id:1234 @myns>
irb(main):027:0> Project.where(namespace_id: 1234, path: 'some-project')
=> #<ActiveRecord::Relation [#<Project id:5678 myns/some-project>]>
irb(main):028:0> proj = Project.find(5678)
=> #<Project id:5678 myns/some-project>
irb(main):028:0> proj.repository_storage
... note down this output ...
irb(main):028:0> proj.disk_path
... note down this output ...
irb(main):023:0> admin_user = User.find_by_username('an-admin')
=> #<User id:1234 @an-admin>
pts = Gitlab::ImportExport::ProjectTreeSaver.new(project: proj, current_user: admin_user, shared: proj.import_export_shared)
... some output that includes the path to a project.json file. Note this down.
pts.save
```
We now have the Gitaly shard and the on-disk path where the project was stored, and, if the final command succeeded, a project metadata export in JSON.
It's possible that the save command failed with a "storage not found" error. If this is the case, edit `/etc/gitlab/gitlab.rb` and add a dummy entry to `git_data_dirs` for the shard, then run `gitlab-ctl reconfigure` and restart the console session. We are only interested in the project metadata for now, but the `Project#repository_storage` must exist in the config.
An example of the `git_data_dirs` config entry in `gitlab.rb`:

```ruby
git_data_dirs({
  "default" => {
    "path" => "/mnt/nfs-01/git-data"
  },
  "nfs-file40" => {
    "path" => "/mnt/nfs-01/git-data"
  }
})
```
You can safely duplicate the path from the `default` git_data_dir; it doesn't matter that it won't contain the repository.
Make the project.json accessible to your `gcloud ssh` user:

```shell
mv /path/to/project.json /tmp/
chmod 644 /tmp/project.json
```
Download the file, replacing the project and instance name in this example as appropriate:

```shell
gcloud --project gitlab-restore compute scp restore-postgres-gprd-88895:/tmp/project.json ./
```
Download a stub exported project.
On your local machine, replace the project.json inside the stub archive with the real one:

```shell
mkdir repack
tar -xf test_project_export.tar.gz -C repack
cd repack
cp ../project.json ./
tar -czf ../repacked.tar.gz ./
```
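Before uploading, you can sanity-check that the archive now contains the real project.json (a quick sketch, run from the same `repack` directory):

```shell
# Confirm the repacked archive lists project.json at the top level
tar -tzf ../repacked.tar.gz | grep 'project.json'
```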
Log into gitlab.com using your admin account, and navigate to the namespace in which we are restoring the project. Create a new project, using the "from GitLab export" option. Name it after the deleted project, and upload `repacked.tar.gz`.
A project can also be imported on the command line.
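For reference, a hedged sketch of the command-line route using the import/export Rake task that ships with recent GitLab versions; the username, namespace, project path, and archive path are placeholders, and it must run on a Rails node of the target instance:

```shell
# Import the repacked export as a new project, owned by the given (admin) user
sudo gitlab-rake "gitlab:import_export:import[an-admin,some-ns,restored-project,/tmp/repacked.tar.gz]"
```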
This will create a new project with the same metadata as the deleted one. It will have a stub repo and wiki. The persistent object itself is new: it has a new `project_id`, and the repos are not necessarily stored on the same Gitaly shard and will have new `disk_path`s.
Browse the restored project's member list. If your admin account is listed as a maintainer, leave the project.
Start a production console, and locate the new project object using the same method as above. Note down its `repository_storage` and `disk_path`. These point us to the stub repo (and wiki repo) that we'll now replace with a backup.
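If you'd rather not do this interactively, the same lookup can be done in one shot with `gitlab-rails runner`; a sketch, with the project path as a placeholder:

```shell
# Print the storage shard and disk path of the freshly imported (stub) project
sudo gitlab-rails runner '
  proj = Project.find_by_full_path("some-ns/some-project")   # placeholder path
  puts "repository_storage: #{proj.repository_storage}"
  puts "disk_path:          #{proj.disk_path}"
'
```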
The first step is to check if the repositories still exist at the old location. They likely do not, but it's possible that unlike project metadata they have not (yet) been removed.
Using the `repository_storage` and `disk_path` obtained from the DB backup (i.e. for the old, deleted project metadata), ssh into the relevant Gitaly shard and navigate to `/var/opt/gitlab/git-data/repositories`. Check whether `<disk_path>.git` and `<disk_path>.wiki.git` exist. If so, create a snapshot of this disk in the GCE console.
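A minimal sketch of that check, followed by a command-line way of taking the snapshot; `<disk_path>`, the disk name, zone, and GCP project are placeholders:

```shell
# On the Gitaly shard that hosted the deleted project:
cd /var/opt/gitlab/git-data/repositories
ls -ld "<disk_path>.git" "<disk_path>.wiki.git"

# If both exist, snapshot the shard's data disk before touching anything
# (this can also be done in the GCE console, as described above)
gcloud --project <gcp-project> compute disks snapshot <gitaly-data-disk> \
  --zone <zone> \
  --snapshot-names pitr-infrastructure-issue-1234
```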
If these directories do not exist, browse GCE snapshots for the last known good snapshot of the Gitaly persistent disk on which the repository used to be stored.
Either way, you now have a snapshot with which to follow the next steps.
Run all commands on the server in a root shell.
- Create a new disk from the snapshot. Give it a relevant name and description, and ensure it's placed in the same zone as the Gitaly shard referenced by the new project metadata (i.e. that obtained from the production console).
- In the GCE console, edit the Gitaly shard on which the new, stub repositories are stored. Attach the disk you just created with the custom device name "pitr". (A command-line equivalent of these two steps is sketched after this list.)
- GCP snapshots are not guaranteed to be consistent. Check the filesystem: `fsck.ext4 /dev/disk/by-id/google-pitr`. If this fails, you don't necessarily have to stop, provided you are later able to mount it: the user is already missing their repository, and if we are lucky the part of the filesystem containing it is not corrupted. Later, we ask the customer to check the repository, including running `git fsck`. Unfortunately it's possible that the repository would already have failed this check, and we can't know.
- Mount the disk: `mkdir /mnt/pitr; mount -o ro /dev/disk/by-id/google-pitr /mnt/pitr`
- Navigate to the parent of `/var/opt/gitlab/git-data/repositories/<disk_path>.git`.
- `mv <new_hash>.git{,moved-by-your-name}`, and similarly for the wiki repository. Reload the project page and you should see an error. This double-checks that you have moved the correct repositories (the stubs). You can `rm -rf` these directories now. `<new_hash>` refers to the final component of the new `disk_path`.
- `cp -a /mnt/pitr/git-data/repositories/<old_disk_path>.git ./<new_hash>.git`, and similarly for the wiki repo. If you've followed the steps up to now, your CWD is something like `/var/opt/gitlab/git-data/repositories/@hashed/ab/cd`.
- Reload the project page. You should see the restored repository and wiki.
- `umount /mnt/pitr`
- In the GCE console, edit the Gitaly instance, removing the pitr disk.
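For the first two steps above, a command-line equivalent of the console actions; all names, the zone, and the GCP project are placeholders. The important parts are `--source-snapshot` and `--device-name pitr`, which is what makes the disk appear as `/dev/disk/by-id/google-pitr`:

```shell
# Create a disk from the snapshot, in the same zone as the target Gitaly shard...
gcloud --project <gcp-project> compute disks create pitr-restore-issue-1234 \
  --source-snapshot pitr-infrastructure-issue-1234 \
  --zone <zone>

# ...and attach it (read-only) to the Gitaly shard holding the stub repositories
gcloud --project <gcp-project> compute instances attach-disk <gitaly-instance> \
  --disk pitr-restore-issue-1234 \
  --device-name pitr \
  --mode ro \
  --zone <zone>
```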
Once the customer confirms everything is restored as expected, you can delete any disks and Postgres PITR instances created by this process.
It might be worth asking the customer to check their repository with `git fsck`. If the filesystem-level fsck we ran on the Gitaly shard succeeded, then the result of `git fsck` doesn't matter that much: the repository might already have been corrupted, and that's not necessarily our fault. However, if both `fsck`s failed, we can't know whether the corruption predated the snapshot or not.