Disaster Recovery Guide

This document describes:

  1. Failover scenario: the steps to perform in case of a disaster, i.e., bringing up the application and services in the “Failover Region”.
  2. Fallback scenario: the steps to return to the original state, i.e., running the application and services in the “Primary Region” once that region is operational again.

DR — Failover Scenario

Steps:

Stop PROD application

  • Set the desired count of all PROD ECS services to 0 through Terraform.
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan  
$ terraform apply env-PROD.plan

Promote DR Replica to standalone

  • From the PROD/DR AWS Console, monitor the replication lag and, once the lag is minimal (ideally close to zero), promote the cross-region read replica.

AWS Region: “Failover Region”

RDS → demo-dr-db → Actions → Promote

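  • Clear replicate_source_db in the env-DR Terraform scripts (set it to an empty value), so that the Terraform state reflects the DB's new standalone status.
# Provision DB
module "demo-db" {
  ...
  # Empty value: the instance is no longer a replica of the PROD DB
  replicate_source_db = ""
}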

NOTE: There seems to be no way to promote the read replica to standalone through Terraform right now, hence the two steps above are needed.
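The console promotion above can also be scripted. The following is a minimal sketch using the AWS CLI, not part of the original procedure. The region default is a placeholder for the “Failover Region” and the function names are made up for illustration. It reads the replica's ReplicaLag metric from CloudWatch and then promotes demo-dr-db.

```shell
# Assumptions: the AWS CLI is installed and configured; DR_REGION is a
# placeholder for the "Failover Region"; function names are illustrative.
DR_REGION="${DR_REGION:-us-west-2}"
DR_DB_ID="${DR_DB_ID:-demo-dr-db}"

# Print the most recent 1-minute average of ReplicaLag (in seconds).
replica_lag() {
  now="$(date +%s)"
  aws cloudwatch get-metric-statistics \
    --region "$DR_REGION" \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=$DR_DB_ID" \
    --start-time "$((now - 600))" \
    --end-time "$now" \
    --period 60 \
    --statistics Average \
    --query 'sort_by(Datapoints, &Timestamp)[-1].Average' \
    --output text
}

# Promote the replica, then poll until the instance reports "available".
promote_replica() {
  aws rds promote-read-replica \
    --region "$DR_REGION" \
    --db-instance-identifier "$DR_DB_ID" || return 1
  aws rds wait db-instance-available \
    --region "$DR_REGION" \
    --db-instance-identifier "$DR_DB_ID"
}
```

Run replica_lag a few times until the value is acceptably low, then call promote_replica. The Terraform change to replicate_source_db is still required afterwards so that the state matches the promoted instance.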

Check DR application version

  • Ensure the current application version (image tag) in the DR task definition is the same as in PROD. Deploy the services if needed.
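One way to compare the image tags without clicking through the console is sketched below. It assumes that the task definition family has the same name in both regions and that images are referenced as repository:tag; the region defaults and function names are placeholders, not values from this setup.

```shell
# Assumptions: same task definition family name in both regions, images in
# "repository:tag" form, and placeholder region values.
PROD_REGION="${PROD_REGION:-ap-southeast-1}"
DR_REGION="${DR_REGION:-us-west-2}"

# Print the image tag of every container in the latest ACTIVE revision.
# $1 = region, $2 = task definition family
task_image_tags() {
  aws ecs describe-task-definition --region "$1" --task-definition "$2" \
    --query 'taskDefinition.containerDefinitions[].image' --output text |
    tr '\t' '\n' | sed 's/.*://'
}

# Succeed only if PROD and DR use the same tag(s) for the given family.
same_version() {
  [ "$(task_image_tags "$PROD_REGION" "$1")" = "$(task_image_tags "$DR_REGION" "$1")" ]
}
```

For example, same_version my-service-family (a hypothetical family name) returns success when the tags match. In an actual disaster the PROD API may be unreachable; in that case compare the output of task_image_tags for the DR region against the last known release tag instead.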

Start DR application services

  • For every ECS service in the env-DR Terraform scripts, set the desired count to a value greater than 0 (ideally the same value as in PROD).
  • Run terraform plan and apply.
$ terraform plan -var-file="dr.tfvars" -out env-DR.plan  
$ terraform apply env-DR.plan
  • Monitor the application logs in CloudWatch for all services.
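After the apply, it can help to confirm that the running task count has caught up with the desired count before moving on. The sketch below is one possible check with the AWS CLI; the cluster name, service names, log group name and region default are all hypothetical.

```shell
# Assumption: DR_REGION is a placeholder for the "Failover Region".
DR_REGION="${DR_REGION:-us-west-2}"

# Print "<service> <desiredCount> <runningCount>" for each service.
# $1 = cluster name, remaining arguments = service names (up to 10 per call)
service_counts() {
  cluster="$1"; shift
  aws ecs describe-services --region "$DR_REGION" \
    --cluster "$cluster" --services "$@" \
    --query 'services[].[serviceName,desiredCount,runningCount]' --output text
}

# Succeed only if every service has a non-zero desired count and all of its
# tasks are running.
services_healthy() {
  service_counts "$@" |
    awk '$2 == 0 || $2 != $3 { bad = 1 } END { exit (bad || NR == 0) }'
}

# Follow a CloudWatch log group (the "logs tail" command needs AWS CLI v2).
# $1 = log group name
tail_logs() {
  aws logs tail "$1" --region "$DR_REGION" --since 10m --follow
}
```

For example, services_healthy demo-cluster web api succeeds once both (hypothetical) services are fully up, and tail_logs /ecs/web streams the recent log events of one service.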

Update CloudFront origin to point to DR Front End ALB

  • From the PROD/DR AWS Console,

CloudFront → Click PROD Distribution ID → Origins and Origin Groups → Select Origin & Edit → Update “Origin Domain Name” to DR ALB from drop down → Yes, Edit
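The console change above can be verified from the command line. The following is a read-only sketch with the AWS CLI; the distribution ID and ALB DNS name passed in are whatever applies to your environment, and the function names are illustrative.

```shell
# Print "<origin id> <domain name>" for each origin of a distribution.
# $1 = CloudFront distribution ID
list_origins() {
  aws cloudfront get-distribution-config --id "$1" \
    --query 'DistributionConfig.Origins.Items[].[Id,DomainName]' --output text
}

# Succeed if any origin of the distribution uses the expected domain name.
# $1 = distribution ID, $2 = expected ALB DNS name
origin_points_to() {
  list_origins "$1" |
    awk -v dns="$2" '$2 == dns { found = 1 } END { exit !found }'
}

# Block until the configuration change has been deployed to the edge locations.
# $1 = distribution ID
wait_deployed() {
  aws cloudfront wait distribution-deployed --id "$1"
}
```

The same check applies during fallback, with the PROD ALB DNS name as the expected value.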

DR — Fallback Scenario

Steps:

Rename the existing prod db

NOTE: In case of an actual disaster, this step may or may not be needed, depending on how AWS recovers the resources.

  • Rename demo-prod-db to demo-prod-old-db in the SG region from the console or the CLI (takes about a minute).
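For the CLI route, a rename is a modify-db-instance call with a new identifier. A minimal sketch, assuming the region default below is replaced with the actual SG region code:

```shell
# Assumption: PROD_REGION is a placeholder for the "SG region".
PROD_REGION="${PROD_REGION:-ap-southeast-1}"

# Rename an RDS instance immediately instead of at the next maintenance window.
# $1 = current identifier, $2 = new identifier
rename_db() {
  aws rds modify-db-instance \
    --region "$PROD_REGION" \
    --db-instance-identifier "$1" \
    --new-db-instance-identifier "$2" \
    --apply-immediately
}

# Example: rename_db demo-prod-db demo-prod-old-db
```

Note that the instance endpoint changes with the identifier, so nothing should still be connecting to the old name when this runs.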

Create a cross region read replica of the DR Database

  • From the AWS console, create a cross-region read replica of demo-dr-db in the SG region, using demo-prod-db as the replica identifier.

Promote replica as standalone DB

  • Once the replica is created, promote it (demo-prod-db) to a standalone database.
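The last two steps (create the replica, then promote it) could be scripted roughly as follows. This is a sketch under assumptions: the region defaults are placeholders, the function names are illustrative, and the replica inherits its settings from the source unless options are added. For an encrypted source DB, create-db-instance-read-replica additionally needs --kms-key-id and --source-region.

```shell
# Assumptions: placeholder regions; demo-dr-db is the promoted DR instance.
PROD_REGION="${PROD_REGION:-ap-southeast-1}"
DR_REGION="${DR_REGION:-us-west-2}"

# Create demo-prod-db in the primary region as a replica of demo-dr-db and
# wait until it is available. Cross-region sources must be given as an ARN.
create_prod_replica() {
  account_id="$(aws sts get-caller-identity --query Account --output text)" || return 1
  aws rds create-db-instance-read-replica \
    --region "$PROD_REGION" \
    --db-instance-identifier demo-prod-db \
    --source-db-instance-identifier "arn:aws:rds:$DR_REGION:$account_id:db:demo-dr-db" || return 1
  aws rds wait db-instance-available \
    --region "$PROD_REGION" --db-instance-identifier demo-prod-db
}

# Promote demo-prod-db to a standalone instance and wait for it to settle.
promote_prod_db() {
  aws rds promote-read-replica \
    --region "$PROD_REGION" --db-instance-identifier demo-prod-db || return 1
  aws rds wait db-instance-available \
    --region "$PROD_REGION" --db-instance-identifier demo-prod-db
}
```

Stop writes to demo-dr-db (or accept losing them) before promote_prod_db, since replication ends at promotion.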

Synchronise PROD terraform state

  • Run the PROD terraform plan to see whether any changes are detected in the DB. Run terraform apply to apply those changes if needed.
$ terraform init
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan
$ terraform apply env-PROD.plan

NOTE1: Minor changes, such as tags and max_allocated_storage, were detected.

NOTE2: In case of an actual disaster, all other resources in the SG region will be provisioned as well.

Start PROD application services

  • Set the desired count of all PROD ECS services back to the original count through Terraform.
$ terraform init
$ terraform plan -var-file="prod.tfvars" -out env-PROD.plan
$ terraform apply env-PROD.plan

NOTE: This terraform plan should set the desired_count of the ECS services and start the ECS tasks.

  • Check the production logs of all services to ensure the DB is accessible.

Update CloudFront origin to point back to PROD

  • From the PROD/DR AWS Console,

CloudFront → Click PROD Distribution ID → Origins and Origin Groups → Select Origin & Edit → Update “Origin Domain Name” to PROD ALB from drop down → Yes, Edit

  • Test the production URL and ensure the application is accessible.

Delete the DR Standalone & re-create a read replica from the current prod db

  • From the AWS console, delete the demo-dr-db standalone instance.
  • Run the DR Terraform scripts to create a read replica of demo-prod-db in the “Failover Region”.
$ terraform init 
$ terraform plan -var-file="dr.tfvars" -out env-DR.plan
$ terraform apply env-DR.plan

NOTE1: This terraform plan should create a read replica and update the DR secret with the new DB replica identifier.

NOTE2: The replica creation takes about 20–25 minutes, or more depending on the size of the DB.
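The deletion of the demo-dr-db standalone instance can also be done from the CLI before running the Terraform commands above. The sketch below takes a final snapshot first as a safety net, which is an addition to the original procedure; the snapshot name and region default are placeholders, and the call fails if deletion protection is enabled on the instance.

```shell
# Assumption: DR_REGION is a placeholder for the "Failover Region".
DR_REGION="${DR_REGION:-us-west-2}"

# Delete the DR standalone instance after taking a final snapshot, then wait
# until the instance is gone so that the identifier can be reused.
# $1 = final snapshot identifier (hypothetical name chosen by the operator)
delete_dr_standalone() {
  aws rds delete-db-instance \
    --region "$DR_REGION" \
    --db-instance-identifier demo-dr-db \
    --final-db-snapshot-identifier "$1" || return 1
  aws rds wait db-instance-deleted \
    --region "$DR_REGION" --db-instance-identifier demo-dr-db
}

# Example: delete_dr_standalone demo-dr-db-final-snapshot
```

Waiting for the deletion to finish matters here because the Terraform run re-creates an instance with the same identifier.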

CloudWatch Alarms

  • Check the RDS alarms configured for demo-prod-db and update them as necessary.
  • Check that all other CloudWatch alarms are in the OK state as expected.
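Both checks can be approximated from the command line. This is a sketch, assuming the RDS alarms are plain metric alarms keyed on the DBInstanceIdentifier dimension; the function names are illustrative and the region is passed in by the caller.

```shell
# List the names of all metric alarms currently in the ALARM state.
# $1 = region
alarms_in_alarm() {
  aws cloudwatch describe-alarms --region "$1" --state-value ALARM \
    --query 'MetricAlarms[].AlarmName' --output text
}

# List "<alarm name> <state>" for alarms whose dimensions reference a DB.
# $1 = region, $2 = DB instance identifier
rds_alarms() {
  aws cloudwatch describe-alarms --region "$1" \
    --query "MetricAlarms[?Dimensions[?Name=='DBInstanceIdentifier' && Value=='$2']].[AlarmName,StateValue]" \
    --output text
}

# Example: rds_alarms ap-southeast-1 demo-prod-db
```

An empty result from rds_alarms for demo-prod-db suggests the alarms still point at another identifier and need updating.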