Disaster Recovery Guide

Swati Sannidhi

May 1, 2021

This document describes:

The steps to perform in case of a disaster (the DR Failover scenario), i.e., to bring up the application and services in the “Failover Region”.
The steps to return to the original state (the DR Fallback scenario), i.e., to run the application and services in the “Primary Region” once that region is operational again.

DR — Failover Scenario

Steps:

Stop PROD application

Set the desired count of the PROD ECS services to 0 through Terraform.

$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan  
$ terraform apply env-PROD.plan
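After the apply, it can help to confirm that the tasks have actually drained before touching the database. The following is a minimal sketch using the AWS CLI, not part of the original procedure; the cluster and service names in the usage line are hypothetical placeholders.

```shell
# Print "<service> <desiredCount> <runningCount>" for each given service.
# Usage: ecs_counts <cluster> <service>...  (describe-services accepts at most 10 services per call)
ecs_counts() {
  cluster="$1"
  shift
  aws ecs describe-services \
    --cluster "$cluster" \
    --services "$@" \
    --query 'services[].[serviceName,desiredCount,runningCount]' \
    --output text
}

# Read the lines printed by ecs_counts on stdin.
# Succeeds only when every service reports 0 desired and 0 running tasks.
all_stopped() {
  awk 'BEGIN { bad = 0 } $2 != 0 || $3 != 0 { bad = 1 } END { exit bad }'
}
```

Example: `ecs_counts demo-prod-cluster demo-api demo-web | all_stopped && echo "PROD is stopped"`.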

Promote DR Replica to standalone

From the PROD/DR AWS Console, monitor the replication lag (the ReplicaLag metric of the replica) and, once the lag has dropped close to zero, promote the cross-region read replica.

AWS Region: “Failover Region”

RDS → demo-dr-db → Actions → Promote
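The same check and promotion can be done from the AWS CLI. This is a sketch under assumptions: `DR_REGION` is a variable you set to your failover region, and the 5-second threshold in the usage line is an arbitrary illustration, not a value from this guide.

```shell
# Highest ReplicaLag (in seconds) reported for the replica over the last 5 minutes.
# Prints "None" when CloudWatch has no datapoints for that window.
# Usage: replica_lag <region> <db-instance-id>
replica_lag() {
  now=$(date -u +%s)
  aws cloudwatch get-metric-statistics \
    --region "$1" \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=$2" \
    --start-time "$((now - 300))" \
    --end-time "$now" \
    --period 60 \
    --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' \
    --output text
}

# Succeeds when the lag is a number at or below the threshold (both in seconds).
# Usage: lag_ok <lag> <threshold>
lag_ok() {
  awk -v lag="$1" -v max="$2" \
    'BEGIN { exit !(lag ~ /^[0-9.]+$/ && lag + 0 <= max + 0) }'
}

# Promote the read replica to a standalone instance (the instance reboots).
# Usage: promote_replica <region> <db-instance-id>
promote_replica() {
  aws rds promote-read-replica --region "$1" --db-instance-identifier "$2"
}
```

Example: `lag_ok "$(replica_lag "$DR_REGION" demo-dr-db)" 5 && promote_replica "$DR_REGION" demo-dr-db`.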

Set replicate_source_db in the env-DR Terraform scripts to null (an empty value), so that the Terraform state reflects the DB's new standalone status.

# Provision DB
module "demo-db" {
  ...
  replicate_source_db = ""
}

NOTE: There seemed to be no way to promote the read replica to standalone through Terraform at the time of writing, hence the two steps above are needed.

Check DR application version

Ensure the current application version (image tag) in the DR task definitions is the same as in PROD. Deploy the services if needed.
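One way to compare versions without clicking through both consoles is to read the image from the task definition each service currently uses. The sketch below assumes images are referenced by tag (not by digest) and that each task has a single container; the region variables, cluster names and service names are hypothetical.

```shell
# Print the container image(s) of the task definition a service currently uses.
# Usage: service_images <region> <cluster> <service>
service_images() {
  td=$(aws ecs describe-services --region "$1" --cluster "$2" --services "$3" \
    --query 'services[0].taskDefinition' --output text)
  aws ecs describe-task-definition --region "$1" --task-definition "$td" \
    --query 'taskDefinition.containerDefinitions[].image' --output text
}

# Print only the tag of an image reference, i.e. the part after the last ":".
# Comparing tags instead of full references avoids false mismatches when the
# registry host differs between regions.
image_tag() {
  printf '%s\n' "${1##*:}"
}
```

Example: compare `image_tag "$(service_images "$PROD_REGION" demo-prod-cluster demo-api)"` with `image_tag "$(service_images "$DR_REGION" demo-dr-cluster demo-api)"` and redeploy the DR service if they differ.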

Start DR application services

Set the desired count to a value greater than 0 (ideally the same as the PROD setting) for all ECS services in the env-DR Terraform scripts.
Run terraform plan and apply.

$ terraform plan -var-file="dr.tfvars" -out env-DR.plan  
$ terraform apply env-DR.plan

Monitor the application logs in CloudWatch for all services.
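For a quick scan across services, recent error lines can be pulled with the AWS CLI instead of opening each log group in the console. This sketch assumes the applications write the word ERROR in their log lines; the log group name in the usage line is hypothetical.

```shell
# Print log messages containing "ERROR" from the last N minutes of a log group.
# Usage: recent_errors <region> <log-group-name> <minutes>
recent_errors() {
  # filter-log-events expects the start time in milliseconds since the epoch.
  start_ms=$(( ($(date -u +%s) - $3 * 60) * 1000 ))
  aws logs filter-log-events \
    --region "$1" \
    --log-group-name "$2" \
    --start-time "$start_ms" \
    --filter-pattern "ERROR" \
    --query 'events[].message' \
    --output text
}
```

Example: `recent_errors "$DR_REGION" /ecs/demo-api 10`. Empty output means no matching lines in the last 10 minutes, which is not by itself proof that the service is healthy.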

Update CloudFront origin to point to DR Front End ALB

From the PROD/DR AWS Console,

CloudFront → Click PROD Distribution ID → Origins and Origin Groups → Select Origin & Edit → Update “Origin Domain Name” to DR ALB from drop down → Yes, Edit
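The origin switch can also be scripted, which is useful when the console is slow to load during an incident. The sketch below assumes `jq` is installed; the distribution ID, origin ID and ALB domain name in the usage line are hypothetical. It fetches the current config, rewrites the DomainName of one origin, and sends the full config back with the ETag that CloudFront requires for updates.

```shell
# Read the output of `aws cloudfront get-distribution-config` on stdin and
# print its DistributionConfig with the DomainName of one origin replaced.
# Usage: with_origin_domain <origin-id> <new-domain-name>
with_origin_domain() {
  jq --arg id "$1" --arg domain "$2" \
    '.DistributionConfig
     | (.Origins.Items[] | select(.Id == $id) | .DomainName) = $domain'
}

# Point one origin of a distribution at a new domain name.
# Usage: switch_origin <distribution-id> <origin-id> <new-domain-name>
switch_origin() {
  current=$(aws cloudfront get-distribution-config --id "$1")
  etag=$(printf '%s' "$current" | jq -r '.ETag')
  tmp=$(mktemp)
  printf '%s' "$current" | with_origin_domain "$2" "$3" > "$tmp"
  aws cloudfront update-distribution --id "$1" --if-match "$etag" \
    --distribution-config "file://$tmp"
  rm -f "$tmp"
}
```

Example: `switch_origin E123EXAMPLE frontend-alb demo-dr-alb-123456.elb.amazonaws.com`. The same function, called with the PROD ALB domain name, covers the switch back during fallback.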

DR — Fallback Scenario

Steps:

Rename the existing prod db

NOTE: In case of an actual disaster, this step may or may not be needed, depending on how AWS recovers the resources.

Rename demo-prod-db to demo-prod-old-db in the SG region (the “Primary Region”) from the console or the CLI (takes about a minute).
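For the CLI route, a rename is a modify-db-instance call with a new identifier. This is a minimal sketch; `PROD_REGION` is a variable you set to your primary region.

```shell
# Rename an RDS instance. The instance is briefly unavailable while the
# new identifier takes effect.
# Usage: rename_db <region> <current-id> <new-id>
rename_db() {
  aws rds modify-db-instance \
    --region "$1" \
    --db-instance-identifier "$2" \
    --new-db-instance-identifier "$3" \
    --apply-immediately
}
```

Example: `rename_db "$PROD_REGION" demo-prod-db demo-prod-old-db`.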

Create a cross region read replica of the DR Database

From the AWS Console, create a read replica of demo-dr-db (named demo-prod-db, in the Primary Region) with the options shown in the screenshot below.
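As an alternative to the console, the replica can be created with the AWS CLI. This sketch only sets the required options and leaves everything else (instance class, subnet group, security groups) to the defaults, which may not match your setup. For a cross-region replica the source must be given as an ARN, and an encrypted source additionally needs `--kms-key-id` for a key in the target region. The region variables and the ARN are placeholders you supply.

```shell
# Create a cross-region read replica and wait until it is available.
# Usage: create_cross_region_replica <target-region> <source-region> <source-db-arn> <new-id>
create_cross_region_replica() {
  aws rds create-db-instance-read-replica \
    --region "$1" \
    --source-region "$2" \
    --source-db-instance-identifier "$3" \
    --db-instance-identifier "$4"
  # The waiter polls for a limited time and can time out on large databases;
  # rerun it if that happens.
  aws rds wait db-instance-available \
    --region "$1" \
    --db-instance-identifier "$4"
}
```

Example: `create_cross_region_replica "$PROD_REGION" "$DR_REGION" "$DR_DB_ARN" demo-prod-db`, where `DR_DB_ARN` holds the ARN of demo-dr-db.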

Promote replica as standalone DB

Once the replica is created, promote it (demo-prod-db) to a standalone database.

Synchronise PROD terraform state

Run the PROD terraform plan to see whether any changes are detected in the DB. Run terraform apply to apply those changes if needed.

$ terraform init
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan 
$ terraform apply env-PROD.plan

NOTE1: In our run, only minor changes such as tags and max_allocated_storage were detected.

NOTE2: In case of an actual disaster, this apply will also provision all other resources in the SG region.

Start PROD application services

Set the desired count of the PROD ECS services back to the original count through Terraform.

$ terraform init
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan 
$ terraform apply env-PROD.plan

NOTE: This terraform plan should set the desired_count of the ECS services and start the ECS tasks.

Check the production logs of all services to ensure the DB is accessible.

Update CloudFront origin to point back to PROD

From the PROD/DR AWS Console,

CloudFront → Click PROD Distribution ID → Origins and Origin Groups → Select Origin & Edit → Update “Origin Domain Name” to PROD ALB from drop down → Yes, Edit

Test the production URL and ensure the application is accessible.

Delete the DR Standalone & re-create a read replica from the current prod db

From the AWS Console, delete the demo-dr-db standalone instance.
Run the DR Terraform scripts to create a read replica of demo-prod-db in the “Failover Region”.

$ terraform init 
$ terraform plan -var-file="dr.tfvars" -out env-DR.plan 
$ terraform apply env-DR.plan

NOTE1: This terraform plan should create a read replica and update the DR secret with the new DB replica identifier.

NOTE2: This replica creation takes about 20–25 minutes, or more depending on the size of the DB.

CloudWatch Alarms

Check and update the RDS alarms configured for demo-prod-db as necessary.
Check that all other CloudWatch alarms are in the OK state as expected.
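The alarm check can be done in one call per region. This is a sketch, with `PROD_REGION` as a variable you set; it lists metric alarms only, so composite alarms, if you use them, need a separate look.

```shell
# Print "<alarm-name> <state>" for every metric alarm that is not in the OK state
# (that is, ALARM or INSUFFICIENT_DATA). Empty output means all alarms are OK.
# Usage: alarms_not_ok <region>
alarms_not_ok() {
  aws cloudwatch describe-alarms \
    --region "$1" \
    --query "MetricAlarms[?StateValue!='OK'].[AlarmName,StateValue]" \
    --output text
}
```

Example: `alarms_not_ok "$PROD_REGION"`. Alarms on a freshly created DB instance can sit in INSUFFICIENT_DATA for a few minutes until the first datapoints arrive.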