Disaster Recovery Guide
This document describes:
- The steps to perform in case of a disaster (the DR Failover scenario), i.e., to bring up the application and services in the “Failover Region”.
- The steps to return to the original state (the DR Fallback scenario), i.e., to run the application and services in the “Primary Region” once that region is operational again.
DR — Failover Scenario
Steps:
Stop PROD application
- Set the desired count of all PROD ECS services to 0 through Terraform.
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan
$ terraform apply env-PROD.plan
Promote DR Replica to standalone
- From the PROD/DR AWS Console, monitor the replication lag and, once the lag is close to zero, promote the cross-region read replica.
AWS Region: “Failover Region”
RDS → demo-dr-db → Actions → Promote
- Clear replicate_source_db in the env-DR Terraform scripts (set it to an empty value) so that the Terraform state reflects the new standalone status of the DB.
# Provision DB
module "demo-db" {
  ...
  replicate_source_db = ""
}
NOTE: There appears to be no way to promote the read replica to standalone through Terraform at present, hence the two steps above.
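The lag check and the console promotion above can also be scripted. The following is a minimal sketch using the AWS CLI, not necessarily the procedure used here. The function names are illustrative, the region and DB identifier are passed in by the caller, and the 5-minute lookback window is an assumption.

```shell
# Sketch only. In both functions: $1 = failover region, $2 = replica identifier.

# Print the average ReplicaLag (in seconds) over the last 5 minutes.
# The AWS CLI accepts Unix epoch seconds for timestamp parameters.
replica_lag() {
  now=$(date -u +%s)
  aws cloudwatch get-metric-statistics \
    --region "$1" \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=$2" \
    --start-time "$((now - 300))" \
    --end-time "$now" \
    --period 300 \
    --statistics Average \
    --query 'Datapoints[].Average' \
    --output text
}

# Promote the read replica, then wait for it to report "available".
# The waiter may return early if the status has not yet left "available",
# so confirm the final state in the console as well.
promote_replica() {
  aws rds promote-read-replica --region "$1" --db-instance-identifier "$2" &&
    aws rds wait db-instance-available --region "$1" --db-instance-identifier "$2"
}

# Example (DR_REGION is a placeholder for the "Failover Region"):
#   replica_lag "$DR_REGION" demo-dr-db
#   promote_replica "$DR_REGION" demo-dr-db
```

The Terraform change to replicate_source_db is still needed afterwards so that the state matches the promoted instance.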
Check DR application version
- Ensure the current application version (image tag) in the task definition is the same as in PROD. Deploy the services if needed.
Start DR application services
- Set the desired count of every ECS service in the env-DR Terraform scripts to the value used in PROD (it must be greater than 0).
- Run terraform plan and apply.
$ terraform plan -var-file="dr.tfvars" -out env-DR.plan
$ terraform apply env-DR.plan
- Monitor the application logs in CloudWatch for all services.
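In addition to watching the logs, the running task counts can be compared with the desired counts from the command line. This is a hedged sketch using the AWS CLI. The function names are illustrative, and the cluster and service names are not given in this guide, so they are passed in as arguments.

```shell
# Sketch only. $1 = region, $2 = ECS cluster, remaining args = service names
# (at most 10 per call). Cluster and service names are hypothetical here;
# take the real ones from the Terraform scripts.

# Print service name, desired count and running count, one service per line.
ecs_service_counts() {
  region=$1; cluster=$2; shift 2
  aws ecs describe-services \
    --region "$region" --cluster "$cluster" --services "$@" \
    --query 'services[].[serviceName,desiredCount,runningCount]' \
    --output text
}

# Block until the services reach a steady state.
# The CLI waiter polls every 15 seconds and gives up after 40 attempts.
wait_services_stable() {
  region=$1; cluster=$2; shift 2
  aws ecs wait services-stable \
    --region "$region" --cluster "$cluster" --services "$@"
}

# Example (all names are placeholders):
#   ecs_service_counts "$DR_REGION" my-cluster service-a service-b
#   wait_services_stable "$DR_REGION" my-cluster service-a service-b
```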
Update CloudFront origin to point to DR Front End ALB
- From the PROD/DR AWS Console,
CloudFront → Click PROD Distribution ID → Origins and Origin Groups → Select Origin & Edit → Update “Origin Domain Name” to DR ALB from drop down → Yes, Edit
DR — Fallback Scenario
Steps:
Rename the existing prod db
NOTE: In case of an actual disaster, this step may or may not be needed, depending on how AWS recovers the resources.
- Rename demo-prod-db to demo-prod-old-db in the SG region from the console or the CLI (takes about a minute).
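For the CLI route, a rename is a modify-db-instance call with a new identifier. The sketch below is one way to do it. The function name is illustrative and the region is passed in by the caller, since this guide does not spell out the region code.

```shell
# Sketch only. $1 = region, $2 = current identifier, $3 = new identifier.
# --apply-immediately avoids waiting for the next maintenance window.
rename_db_instance() {
  aws rds modify-db-instance \
    --region "$1" \
    --db-instance-identifier "$2" \
    --new-db-instance-identifier "$3" \
    --apply-immediately
}

# Example (PRIMARY_REGION is a placeholder for the SG region code):
#   rename_db_instance "$PRIMARY_REGION" demo-prod-db demo-prod-old-db
```

The instance endpoint changes with the identifier, so anything still pointing at the old endpoint stops resolving once the rename completes.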
Create a cross region read replica of the DR Database
- From the AWS console, create a read replica of demo-dr-db (named demo-prod-db, in the SG region) using the options shown in the screenshot below.
Promote replica as standalone DB
- Once the replica is created, promote it (demo-prod-db) to a standalone database.
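The two console steps above (create the cross-region replica, then promote it) can be sketched with the AWS CLI as follows. This is an illustration under assumptions, not a record of the options used in the console: the function names are made up, the source DB ARN and both regions are supplied by the caller, and instance class, subnet group and similar settings are left at their defaults.

```shell
# Sketch only; every argument is a placeholder supplied by the caller.

# Create a cross-region read replica and wait until it is available.
#   $1 = target (primary) region   $2 = source (failover) region
#   $3 = ARN of the source DB      $4 = identifier for the new replica
# An encrypted source additionally needs --kms-key-id for a key in the
# target region.
create_cross_region_replica() {
  aws rds create-db-instance-read-replica \
    --region "$1" \
    --source-region "$2" \
    --source-db-instance-identifier "$3" \
    --db-instance-identifier "$4" &&
    aws rds wait db-instance-available \
      --region "$1" --db-instance-identifier "$4"
}

# Promote the replica to a standalone instance.
#   $1 = region   $2 = replica identifier
promote_to_standalone() {
  aws rds promote-read-replica --region "$1" --db-instance-identifier "$2"
}

# Example (the three variables are hypothetical):
#   create_cross_region_replica "$PRIMARY_REGION" "$DR_REGION" "$DR_DB_ARN" demo-prod-db
#   promote_to_standalone "$PRIMARY_REGION" demo-prod-db
```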
Synchronise PROD terraform state
- Run the PROD terraform plan to see whether any changes are detected in the DB, then run terraform apply to apply those changes if needed.
$ terraform init
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan
$ terraform apply env-PROD.plan
NOTE1: Minor changes, such as tags and max_allocated_storage, were detected.
NOTE2: In case of an actual disaster, all other resources in the SG region will also be provisioned.
Start PROD application services
- Set the desired count of all PROD ECS services back to the original count through Terraform.
$ terraform init
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan
$ terraform apply env-PROD.plan
NOTE: This Terraform plan should set the desired_count of the ECS services and start the ECS tasks.
- Check the production logs of all services to confirm that the DB is accessible.
Update CloudFront origin to point back to PROD
- From the PROD/DR AWS Console,
CloudFront → Click PROD Distribution ID → Origins and Origin Groups → Select Origin & Edit → Update “Origin Domain Name” to PROD ALB from drop down → Yes, Edit
- Test the production URL and ensure the application is accessible.
Delete the DR Standalone & re-create a read replica from the current prod db
- From the AWS console, delete the demo-dr-db standalone instance.
- Run the DR Terraform scripts to create a read replica of demo-prod-db in the “Failover Region”.
$ terraform init
$ terraform plan -var-file="dr.tfvars" -out env-DR.plan
$ terraform apply env-DR.plan
NOTE1: This Terraform plan should create a read replica and update the DR secret with the new DB replica identifier.
NOTE2: Creating the replica takes about 20–25 minutes or more, depending on the size of the DB.
CloudWatch Alarms
- Check the RDS alarms configured for demo-prod-db and update them as necessary.
- Check that all other CloudWatch alarms are in the OK state as expected.
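Both alarm checks can be run from the AWS CLI. The sketch below is one possible approach. The function names are illustrative, and which RDS metrics carry alarms is an assumption to verify against the actual alarm setup.

```shell
# Sketch only.

# Print the names of all metric alarms in a given state.
#   $1 = region   $2 = state (OK, ALARM or INSUFFICIENT_DATA)
alarms_in_state() {
  aws cloudwatch describe-alarms \
    --region "$1" \
    --state-value "$2" \
    --query 'MetricAlarms[].AlarmName' \
    --output text
}

# Print name and state of the alarms attached to one RDS metric of a DB.
#   $1 = region   $2 = DB identifier   $3 = metric name (e.g. CPUUtilization)
rds_alarms_for_metric() {
  aws cloudwatch describe-alarms-for-metric \
    --region "$1" \
    --namespace AWS/RDS \
    --metric-name "$3" \
    --dimensions "Name=DBInstanceIdentifier,Value=$2" \
    --query 'MetricAlarms[].[AlarmName,StateValue]' \
    --output text
}

# Example (PRIMARY_REGION is a placeholder):
#   alarms_in_state "$PRIMARY_REGION" ALARM
#   rds_alarms_for_metric "$PRIMARY_REGION" demo-prod-db CPUUtilization
```

An empty result from the first call with ALARM, and again with INSUFFICIENT_DATA, indicates that no alarm is outside the OK state.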