
Built in support for warm DR standby #33

Open
rtomayko opened this issue Aug 2, 2014 · 8 comments
Comments

@rtomayko (Contributor) commented Aug 2, 2014

Setting up a warm standby VM is currently fairly straightforward once basic backups are in place, but it could benefit from built-in support in github/backup-utils. The basic idea is to configure a new VM (possibly in another DC), leave it in maintenance mode, and then modify the scheduled backup run to use the following instead of just ghe-backup:

ghe-backup && ghe-restore <standby-ip>

The Git backup and restore portions are fully incremental, so this should be efficient enough to schedule as regularly as every hour, bringing the RPO down to an acceptable level. The RTO could be as low as minutes with a warm standby VM, or ~1 hour for people who prefer a cold / provision-VM-at-time-of-recovery setup. Both options should be available; the choice comes down to how much cost and operational work people want to take on versus how aggressively they want to optimize RPO/RTO. No changes to the current 11.10.343 release are necessary for this.
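The scheduled run above can be sketched as a small wrapper. This is a hypothetical illustration, not part of backup-utils: only `ghe-backup` and `ghe-restore` are the real commands from this thread, and the guard ensures a failed backup never triggers a restore to the standby.

```shell
#!/bin/sh
# Hypothetical wrapper for the hourly DR sync run described above.
# Only restore to the standby if the backup itself succeeded.
dr_sync() {
  standby_ip="$1"
  if ghe-backup; then
    ghe-restore "$standby_ip"
  else
    echo "ghe-backup failed; skipping restore to $standby_ip" >&2
    return 1
  fi
}
```

A cron entry would then call something like `dr_sync <standby-ip>` once an hour instead of plain `ghe-backup`.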


I ran through the basic process of setting up a warm standby VM yesterday and wanted to document it. Let's assume the primary GHE instance is at "github.example.com". The steps for setting up a standby are:

  1. Run ghe-backup against the primary to get a first successful snapshot.
  2. Boot a new 11.10.320 VM to act as the standby and record the <standby-ip>.
  3. Create a DNS entry: "github-standby.example.com" pointed to <standby-ip>. This should be set with a low TTL (like 5 minutes). The main "github.example.com" DNS entry should also have a low TTL.
  4. Upload license and 11.10.342 GHP to the standby VM via http://github-standby.example.com/setup.
  5. Add the backup site's SSH key to the authorized keys via the management console at http://github-standby.example.com/setup/settings.
  6. Run ghe-standby github-standby.example.com (WIP version here) from the backup site. This is essentially ghe-maintenance -s && ghe-import-settings && sudo enterprise-configure on the remote side. It puts the standby in maintenance mode and loads in settings from the last snapshot. The standby VM stays in maintenance mode until it's activated.
  7. We should also run ghe-import-ssh-host-keys here, but that changes the host key signature and will cause excessive SSH warnings and prompts. We can find a way to fit this into the ghe-standby script sanely.
  8. Perform an initial restore of the latest snapshot with ghe-restore github-standby.example.com.
  9. Schedule the backup run as ghe-backup && ghe-restore github-standby.example.com.
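Step 6 above can be sketched as a small function. The remote commands (ghe-maintenance -s, ghe-import-settings, sudo enterprise-configure) are the ones named in the step; the `admin` SSH user and the port are assumptions about the appliance's administrative SSH setup.

```shell
#!/bin/sh
# Sketch of the ghe-standby step (hypothetical; ssh user/port are assumed).
ghe_standby() {
  host="$1"
  # Put the standby into maintenance mode so it stays inactive until failover.
  ssh -p 122 "admin@$host" -- "ghe-maintenance -s" &&
    # Load settings from the last snapshot, then apply them.
    ssh -p 122 "admin@$host" -- "ghe-import-settings" &&
    ssh -p 122 "admin@$host" -- "sudo enterprise-configure"
}
```

Run as `ghe_standby github-standby.example.com` from the backup site; the three commands chain with `&&` so a failure stops the sequence early.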

At this point, we have backups being taken and loaded into the standby on a regular basis. The process for recovery / failing over is:

  1. Put the primary instance in maintenance mode (if it's still up and available).
  2. Take the standby instance out of maintenance mode. I have a WIP ghe-activate github-standby.example.com script going here. This just takes the standby instance out of maintenance mode via ghe-maintenance -u.
  3. Check that https://github-standby.example.com is up and working.
  4. Point github.example.com DNS to <standby-ip>.
  5. Point github-standby.example.com DNS to the old primary if it should take over as the standby host. If it's borked, start at the beginning and set up a new standby VM.
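The scriptable part of the failover steps above can be sketched as follows. The ghe-maintenance calls are from the steps themselves; the ssh user/port and the bare HTTPS probe are assumptions, and the DNS changes (steps 4-5) remain manual.

```shell
#!/bin/sh
# Hypothetical failover helper for the recovery steps above.
ghe_failover() {
  primary="$1"; standby="$2"
  # 1. Put the primary in maintenance mode, if it's still reachable.
  ssh -p 122 "admin@$primary" -- "ghe-maintenance -s" || true
  # 2. Take the standby out of maintenance mode (what ghe-activate does).
  ssh -p 122 "admin@$standby" -- "ghe-maintenance -u"
  # 3. Check the standby is answering over HTTPS before flipping DNS.
  curl -sfI "https://$standby/" >/dev/null
}
```

After `ghe_failover github.example.com github-standby.example.com` succeeds, repoint the github.example.com DNS entry to the standby's IP by hand.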
@rtomayko (Contributor, Author) commented Aug 2, 2014

A couple things that are blocking continuous restores to a warm standby:

  • ghe-import-redis backs up the current redis.rdb file to /data/redis/redis.rdb.<timestamp>.bak but never cleans those copies up. We'll exhaust disk space if restores happen every hour.
  • ghe-import-es-indices has a similar problem. The current set of ES indices is backed up to /home/admin/elasticsearch-indices.<timestamp> before the new indices are put in place. This will fill up disk pretty quickly.

We can clean up the backups from the restore scripts, but ideally the server-side scripts would be retooled to keep only a few of the most recent backups. Alternatively, it might be nice to pass an option to the ghe-import-* scripts telling them to skip these backups altogether. They're useful when restoring to an existing VM, but a warm standby will never have had data we'd want to keep around, and these operations take time.
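The retention idea above can be sketched as a generic pruning helper. This is an illustration, not an existing backup-utils script; the directory, glob pattern, and retention count are parameters so the same helper covers both the redis.rdb.<timestamp>.bak files and the elasticsearch-indices.<timestamp> copies.

```shell
#!/bin/sh
# Hypothetical helper: keep only the N most recent files matching a pattern.
# Assumes filenames without whitespace (true for the timestamped names here).
prune_old_backups() {
  dir="$1"; pattern="$2"; keep="$3"
  # List matches newest-first, skip the first $keep, delete the rest.
  ls -1t "$dir"/$pattern 2>/dev/null | tail -n +"$((keep + 1))" | while read -r f; do
    rm -f -- "$f"
  done
}
```

For example, `prune_old_backups /data/redis 'redis.rdb.*.bak' 3` would keep only the three newest redis backups after each restore.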

@quocvu commented Sep 10, 2014

I would like that as well; DR is a major use case.
Ideally, I don't want a 3rd VM to schedule this. The warm standby VM should run ghe-backup and ghe-restore against itself.

@rtomayko (Contributor, Author) commented Sep 10, 2014

Ideally, I don't want a 3rd VM to schedule this. The warm standby VM should run ghe-backup and ghe-restore against itself.

👍 That's definitely the approach I think we'll be taking here in a future release.

@quocvu commented Sep 10, 2014

If you go down that path, watch out for disk space. In my primary GHE, / is the standard 75GB and /data/repositories is a much larger volume (say 500GB). The standby GHE has the same config.

If the 500GB volume is sufficiently used, the standby VM won't have enough space to perform ghe-backup. Please add a location writable by the admin user for those temporary backup files, and make sure that volume can be extended.
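One way to guard against this failure mode is a preflight check before each sync run. This is a sketch under assumptions (the path and threshold are illustrative), not part of backup-utils:

```shell
#!/bin/sh
# Hypothetical preflight: refuse to start a backup/restore staging run
# unless the target volume has at least the requested free space (in KB).
require_free_space() {
  path="$1"; need_kb="$2"
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "only ${avail_kb}KB free on $path, need ${need_kb}KB" >&2
    return 1
  fi
}
```

A sync wrapper could then run something like `require_free_space /data 10485760` (10 GiB, an arbitrary threshold) before kicking off ghe-backup.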

@quocvu commented Sep 10, 2014

A few questions:

  1. I can delete those /home/admin/elasticsearch-indices.<timestamp> directories after the ghe-restore, but how can I remove /data/redis/redis.rdb.<timestamp>.bak since they belong to the root user? This alone is a blocker to using this feature.
  2. In step 2, why do we need ghe-activate? Can we just use ghe-maintenance directly?
  3. Why is ghe-import-settings needed before the backup has even taken place? It seems like something we would do after the restore. What bothers me is that the hostname of the standby VM is set to github.example.com while DNS still points to the primary. That's inconsistent on the network, and I suspect it would render the standby unusable.
  4. Where can I find the ghe-import-settings and sudo enterprise-configure scripts?
@xeago (Contributor) commented Dec 14, 2014

Is this issue still relevant, now that GHE2 has become available?

@rtomayko (Contributor, Author) commented Dec 14, 2014

I would still like to have a better story around recovery time from a backup in a separate datacenter, including the ability to continuously restore each backup to an instance in standby mode. The pieces are there to do this today, but right now our documentation and testing are limited to restoring cold to a new instance. This needs testing, ironing out of any remaining issues, and documentation.

I'd also like to get something in place for @quocvu's suggestion in #33 (comment) of shipping backup-utils on the GHE appliance itself, being able to use it as the backup host, and having an out-of-the-box configuration that lets the backup host act as the standby instance. I think that can be split out from this issue, though.
