Built in support for warm DR standby #33
Comments
A couple things that are blocking continuous restores to a warm standby:
We can clean up the backups from the restore scripts, but ideally the server-side scripts would be retooled to keep only a few of the most recent backups here. Alternatively, it might be nice to be able to pass an option to the
I would like that as well. DR is a major use case.
If you go down that path, watch out for disk space. On my primary GHE, / is the standard 75GB and /data/repositories is a much larger volume (say 500GB). The standby GHE has the same config. If the 500GB volume is sufficiently full, the standby VM won't have enough space to run ghe-backup. Please add a location that is writable by the admin user for storing those temp backup files, on a volume that can be extended.
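A rough pre-flight check along these lines could catch the disk-space problem before a backup starts. This is only a sketch: the function names are made up for illustration, the backup volume path is a placeholder, and comparing used space on the data volume with free space on the backup volume is a coarse heuristic rather than anything backup-utils does itself.

```shell
#!/bin/sh
# Illustrative helpers only; none of these names exist in backup-utils.

# Print the KiB in use on the filesystem that holds the given path.
kb_used() {
    df -Pk "$1" | awk 'NR == 2 { print $3 }'
}

# Print the KiB available on the filesystem that holds the given path.
kb_free() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the directory's filesystem has at least NEEDED_KB KiB free.
has_room() {
    needed_kb="$1"
    dir="$2"
    [ "$(kb_free "$dir")" -ge "$needed_kb" ]
}

# Example (paths are placeholders, adjust to the actual volumes):
#   has_room "$(kb_used /data/repositories)" /path/to/backup/volume \
#       || echo "not enough space for a full backup" >&2
```

The first snapshot is a full copy, so checking against the used size of the data volume is the conservative case; later incremental runs should need less.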
Few questions:
Is this issue still relevant, now that GHE2 has become available?
I would still like to have a better story around recovery time from backup in a separate datacenter, including the ability to continuously restore each backup to an instance in standby mode. The pieces are there to do this today but right now our documentation and testing is limited to restoring cold with a new instance. Needs testing, ironing out any remaining issues, and documentation. I'd also like to get something in place for @quocvu's suggestion in #33 (comment) of shipping backup-utils on the GHE appliance itself, being able to use it as the backup host, and having an out-of-the-box configuration that lets the backup host act as the standby instance. I think that can be split out from this issue, though.
Setting up a warm standby VM is currently fairly straightforward once basic backups are in place, but it could benefit from built-in support in github/backup-utils. The basic idea is to configure a new VM (possibly in another DC) and leave it in maintenance mode. Then modify the scheduled backup run so that each `ghe-backup` is followed by a `ghe-restore` to the standby, instead of running just `ghe-backup`.

The Git backup and restore portions are fully incremental, so this should be efficient enough to schedule as often as every hour, bringing the RPO down to an acceptable level. The RTO could be as low as minutes with a warm standby VM, or could be ~1 hour if people prefer to opt for a cold / provision-VM-at-time-of-recovery setup. Both options should be available, and the choice will come down to how much cash / operational work people want to take on vs. optimizing the RPO/RTO. No changes to the current 11.10.343 release are necessary for this.
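For the scheduled run, a small wrapper might be safer than a bare one-liner once the interval gets as short as an hour. The sketch below is an assumption about how that could look, not part of backup-utils: `sync_standby`, the environment variable names, and the lock path are all hypothetical, and the standby hostname is only an example. It restores only after a successful backup and skips a run if the previous one is still going.

```shell
#!/bin/sh
# Hypothetical wrapper for a scheduled backup-then-restore run.
# Only ghe-backup and ghe-restore come from backup-utils; the rest is illustrative.

STANDBY_HOST="${STANDBY_HOST:-github-standby.example.com}"  # example hostname
BACKUP_CMD="${BACKUP_CMD:-ghe-backup}"
RESTORE_CMD="${RESTORE_CMD:-ghe-restore}"
LOCK_DIR="${LOCK_DIR:-/tmp/ghe-standby-sync.lock}"           # assumed lock location

sync_standby() {
    # mkdir is atomic, so it doubles as a simple lock against overlapping runs.
    if ! mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "previous run still in progress, skipping" >&2
        return 2
    fi

    rc=0
    if $BACKUP_CMD; then
        # Restore to the standby only when the backup succeeded.
        $RESTORE_CMD "$STANDBY_HOST" || rc=1
    else
        rc=1
    fi

    rmdir "$LOCK_DIR"
    return "$rc"
}
```

A cron entry would then source this file and call `sync_standby` once an hour. The lock matters because a slow restore that overlaps the next backup could otherwise leave the standby in a mixed state.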
I ran through the basic process of setting up a warm standby VM yesterday and wanted to document it. Let's assume the primary GHE instance is at "github.example.com". The process for setting up a standby is:

1. Run `ghe-backup` against the primary to get a first successful snapshot.
2. Set up a new GHE VM at `<standby-ip>`.
3. Add a "github-standby.example.com" DNS entry pointing to `<standby-ip>`. This should be set with a low TTL (like 5 minutes). The main "github.example.com" DNS entry should also have a low TTL.
4. Run `ghe-standby github-standby.example.com` (WIP version here) from the backup site. This is essentially `ghe-maintenance -s && ghe-import-settings && sudo enterprise-configure` on the remote side. It puts the standby in maintenance mode and loads in settings from the last snapshot. The standby VM will stay in maintenance mode until it's activated. `ghe-import-ssh-host-keys` could also be run here, but that changes the host key signature and will cause excessive SSH warnings and prompts. We can find a way to fit this into the `ghe-standby` script sanely.
5. Run `ghe-restore github-standby.example.com`.
6. Change the scheduled backup run to `ghe-backup && ghe-restore github-standby.example.com`.

At this point, we have backups being taken and loaded into the standby on a regular basis. The process for recovery / failing over is:
1. Run `ghe-activate github-standby.example.com` (WIP script going here). This just takes the standby instance out of maintenance mode via `ghe-maintenance -u`.
2. Point the "github.example.com" DNS entry at `<standby-ip>`.
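Since the activation step is described as just running `ghe-maintenance -u` on the standby, it could be sketched as a thin SSH wrapper. This is not the WIP `ghe-activate` script itself; the function name, the `SSH_CMD` and `GHE_SSH_PORT` variables, and the `admin` user and port 122 defaults are assumptions (older releases may use a different administrative SSH port), so adjust them to the appliance in use.

```shell
#!/bin/sh
# Hypothetical stand-in for the activation step; not the actual ghe-activate script.

ghe_activate() {
    host="${1:-}"
    if [ -z "$host" ]; then
        echo "usage: ghe_activate <standby-host>" >&2
        return 64
    fi

    # Assumed: administrative SSH access as "admin" on port 122.
    # SSH_CMD is overridable so the command line can be inspected without connecting.
    ${SSH_CMD:-ssh} -p "${GHE_SSH_PORT:-122}" "admin@$host" -- ghe-maintenance -u
}
```

Calling `ghe_activate github-standby.example.com` would take the standby out of maintenance mode; the DNS change still has to happen separately, and the low TTLs set up earlier decide how quickly clients follow it.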