
Built in support for warm DR standby #33

Open
rtomayko opened this issue Aug 2, 2014 · 8 comments
Comments

@rtomayko (Contributor) commented Aug 2, 2014

Setting up a warm standby VM is currently fairly straightforward once basic backups are in place, but it could benefit from built-in support in github/backup-utils. The basic idea is to configure a new VM (possibly in another DC), leave it in maintenance mode, and then modify the scheduled backup run to use the following instead of just ghe-backup:

ghe-backup && ghe-restore <standby-ip>

The Git backup and restore portions are fully incremental, so this should be efficient enough to schedule as regularly as every hour, bringing the RPO down to an acceptable level. The RTO could be as low as minutes with a warm standby VM, or ~1 hour for people who prefer a cold / provision-VM-at-time-of-recovery setup. Both options should be available; the choice comes down to how much cost and operational work people want to take on versus how aggressively they want to optimize RPO/RTO. No changes to the current 11.10.343 release are necessary for this.
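The scheduled run above can be sketched as a small wrapper. This is a hypothetical illustration, not part of backup-utils: only `ghe-backup` and `ghe-restore` are the real commands from this thread, and the guard ensures a failed backup never triggers a restore to the standby.

```shell
#!/bin/sh
# Hypothetical wrapper for the hourly DR sync run described above.
# Only restore to the standby if the backup itself succeeded.
dr_sync() {
  standby_ip="$1"
  if ghe-backup; then
    ghe-restore "$standby_ip"
  else
    echo "ghe-backup failed; skipping restore to $standby_ip" >&2
    return 1
  fi
}
```

A cron entry would then call something like `dr_sync <standby-ip>` once an hour instead of plain `ghe-backup`.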


I ran through the basic process of setting up a warm standby VM yesterday and wanted to document it. Let's assume the primary GHE instance is at "github.example.com". The steps for setting up a standby are:

  1. Run ghe-backup against the primary to get a first successful snapshot.
  2. Boot a new 11.10.320 VM to act as the standby and record the <standby-ip>.
  3. Create a DNS entry: "github-standby.example.com" pointed to <standby-ip>. This should be set with a low TTL (like 5 minutes). The main "github.example.com" DNS entry should also have a low TTL.
  4. Upload license and 11.10.342 GHP to the standby VM via http://github-standby.example.com/setup.
  5. Add the backup site's SSH key to the authorized keys via the management console at http://github-standby.example.com/setup/settings.
  6. Run ghe-standby github-standby.example.com (WIP version here) from the backup site. This is essentially ghe-maintenance -s && ghe-import-settings && sudo enterprise-configure on the remote side. It puts the standby in maintenance mode and loads in settings from the last snapshot. The standby VM stays in maintenance mode until it's activated.
  7. We should also run ghe-import-ssh-host-keys here, but that changes the host key signature and will cause excessive SSH warnings and prompts. We can find a way to fit this into the ghe-standby script sanely.
  8. Perform an initial restore of the latest snapshot with ghe-restore github-standby.example.com.
  9. Schedule the backup run as ghe-backup && ghe-restore github-standby.example.com.
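Step 6 above can be sketched as a small function. The remote commands (ghe-maintenance -s, ghe-import-settings, sudo enterprise-configure) are the ones named in the step; the `admin` SSH user and the port are assumptions about the appliance's administrative SSH setup.

```shell
#!/bin/sh
# Sketch of the ghe-standby step (hypothetical; ssh user/port are assumed).
ghe_standby() {
  host="$1"
  # Put the standby into maintenance mode so it stays inactive until failover.
  ssh -p 122 "admin@$host" -- "ghe-maintenance -s" &&
    # Load settings from the last snapshot, then apply them.
    ssh -p 122 "admin@$host" -- "ghe-import-settings" &&
    ssh -p 122 "admin@$host" -- "sudo enterprise-configure"
}
```

Run as `ghe_standby github-standby.example.com` from the backup site; the three commands chain with `&&` so a failure stops the sequence early.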

At this point, we have backups being taken and loaded into the standby on a regular basis. The process for recovery / failing over is:

  1. Put the primary instance in maintenance mode (if it's still up and available).
  2. Take the standby instance out of maintenance mode. I have a WIP ghe-activate github-standby.example.com script going here. This just takes the standby instance out of maintenance mode via ghe-maintenance -u.
  3. Check that https://github-standby.example.com is up and working.
  4. Point github.example.com DNS to <standby-ip>.
  5. Point github-standby.example.com DNS to the old primary if it should take over as the standby host. If it's borked, start at the beginning and set up a new standby VM.
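The scriptable part of the failover steps above can be sketched as follows. The ghe-maintenance calls are from the steps themselves; the ssh user/port and the bare HTTPS probe are assumptions, and the DNS changes (steps 4-5) remain manual.

```shell
#!/bin/sh
# Hypothetical failover helper for the recovery steps above.
ghe_failover() {
  primary="$1"; standby="$2"
  # 1. Put the primary in maintenance mode, if it's still reachable.
  ssh -p 122 "admin@$primary" -- "ghe-maintenance -s" || true
  # 2. Take the standby out of maintenance mode (what ghe-activate does).
  ssh -p 122 "admin@$standby" -- "ghe-maintenance -u"
  # 3. Check the standby is answering over HTTPS before flipping DNS.
  curl -sfI "https://$standby/" >/dev/null
}
```

After `ghe_failover github.example.com github-standby.example.com` succeeds, repoint the github.example.com DNS entry to the standby's IP by hand.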
@rtomayko (Contributor, Author) commented Aug 2, 2014

A couple things that are blocking continuous restores to a warm standby:

  • ghe-import-redis backs up the current redis.rdb file to /data/redis/redis.rdb.<timestamp>.bak but never cleans those copies up. We'll exhaust disk space if restores happen every hour.
  • ghe-import-es-indices has a similar problem. The current set of ES indices is backed up to /home/admin/elasticsearch-indices.<timestamp> before the new indices are put in place. This will fill up disk pretty quickly.

We can clean up the backups from the restore scripts, but ideally the server-side scripts would be retooled to keep only a few of the most recent backups. Alternatively, it might be nice to pass an option to the ghe-import-* scripts telling them to skip these backups altogether. They're useful when restoring to an existing VM, but a warm standby will never have had data we'd want to keep around, and these operations take time.
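The retention idea above can be sketched as a generic pruning helper. This is an illustration, not an existing backup-utils script; the directory, glob pattern, and retention count are parameters so the same helper covers both the redis.rdb.<timestamp>.bak files and the elasticsearch-indices.<timestamp> copies.

```shell
#!/bin/sh
# Hypothetical helper: keep only the N most recent files matching a pattern.
# Assumes filenames without whitespace (true for the timestamped names here).
prune_old_backups() {
  dir="$1"; pattern="$2"; keep="$3"
  # List matches newest-first, skip the first $keep, delete the rest.
  ls -1t "$dir"/$pattern 2>/dev/null | tail -n +"$((keep + 1))" | while read -r f; do
    rm -f -- "$f"
  done
}
```

For example, `prune_old_backups /data/redis 'redis.rdb.*.bak' 3` would keep only the three newest redis backups after each restore.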

@quocvu commented Sep 10, 2014

I would like that as well; DR is a major use case.
Ideally, I don't want a 3rd VM to schedule this. The warm standby VM should run ghe-backup and ghe-restore against itself.

@rtomayko (Contributor, Author) commented Sep 10, 2014

Ideally, I don't want a 3rd VM to schedule this. The warm standby VM should run ghe-backup and ghe-restore against itself.

👍 That's definitely the approach I think we'll be taking here in a future release.

@quocvu commented Sep 10, 2014

If you go down that path, watch out for disk space. In my primary GHE, / is the standard 75GB and /data/repositories is a much larger volume (say 500GB). The standby GHE has the same config.

If the 500GB volume is sufficiently used, the standby VM won't have enough space to perform ghe-backup. Please add a location writable by the admin user for those temporary backup files, and make sure that volume can be extended.
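One way to guard against this failure mode is a preflight check before each sync run. This is a sketch under assumptions (the path and threshold are illustrative), not part of backup-utils:

```shell
#!/bin/sh
# Hypothetical preflight: refuse to start a backup/restore staging run
# unless the target volume has at least the requested free space (in KB).
require_free_space() {
  path="$1"; need_kb="$2"
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "only ${avail_kb}KB free on $path, need ${need_kb}KB" >&2
    return 1
  fi
}
```

A sync wrapper could then run something like `require_free_space /data 10485760` (10 GiB, an arbitrary threshold) before kicking off ghe-backup.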

@quocvu commented Sep 10, 2014

A few questions:

  1. I can delete those /home/admin/elasticsearch-indices.<timestamp> directories after the ghe-restore, but how can I remove /data/redis/redis.rdb.<timestamp>.bak since they belong to the root user? This alone is a blocker to using this feature.
  2. In step 2, why do we need ghe-activate? Can we just use ghe-maintenance directly?
  3. Why is ghe-import-settings needed before the backup has even taken place? It seems like something we would do after the restore. What bothers me is that the hostname of the standby VM is set to github.example.com while DNS still points to the primary. That's inconsistent on the network, and I suspect it would render the standby unusable.
  4. Where can I find the ghe-import-settings and sudo enterprise-configure scripts?
@xeago (Contributor) commented Dec 14, 2014

Is this issue still relevant, now that GHE2 has become available?

@rtomayko (Contributor, Author) commented Dec 14, 2014

I would still like to have a better story around recovery time from a backup in a separate datacenter, including the ability to continuously restore each backup to an instance in standby mode. The pieces are there to do this today, but right now our documentation and testing are limited to restoring cold to a new instance. This needs testing, ironing out of any remaining issues, and documentation.

I'd also like to get something in place for @quocvu's suggestion in #33 (comment) of shipping backup-utils on the GHE appliance itself, being able to use it as the backup host, and having an out-of-the-box configuration that lets the backup host act as the standby instance. I think that can be split out from this issue, though.
