4 REAL SCENARIOS, 4 REAL PLAYBOOKS
Scenario: region-wide outage (power, fiber cut, cloud-provider meltdown). The whole region is offline. How do we recover?
Scenario: a scan's auto-fix gets approved and applied, then cascades into a bigger problem. How does TITAN itself clean up its own mess? `deploy-titan.sh --rollback=<scan-id>` reverts the applied fix; `deploy-titan.sh --kill-all` halts every agent, freezes the environment, and opens a support ticket with TITAN AI engineering on-call (24/7).

RTO / RPO / COST TRADEOFFS
Active-active: both regions run live traffic simultaneously. A region failure is invisible to users because the other region is already serving them.
Active-passive: the primary serves live traffic; the standby (warm or cold) sits idle until the primary fails. Much cheaper than active-active.
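The two topologies fail over very differently. A minimal sketch (not TITAN code; the region names and the `route()` helper are illustrative) of who serves traffic when a region dies:

```python
# Sketch (not TITAN code): how a region failure plays out in each topology.
# Region names and this helper are illustrative only.

def route(healthy: dict[str, bool], topology: str) -> list[str]:
    """Return the regions that can serve traffic right now."""
    if topology == "active-active":
        # Both regions already serve traffic; survivors just keep serving.
        return [region for region, ok in healthy.items() if ok]
    # Active-passive: primary serves unless it is down, then the standby
    # must be promoted before it takes traffic.
    for region in ("primary", "standby"):
        if healthy.get(region):
            return [region]
    return []

# Active-active: losing one region is invisible to users.
print(route({"east": True, "west": False}, "active-active"))        # ['east']
# Active-passive: the standby serves only after the primary fails.
print(route({"primary": False, "standby": True}, "active-passive"))  # ['standby']
```

The cost difference falls out of the same picture: active-active pays for two full serving stacks all the time; active-passive pays for one plus an idle (warm or cold) standby.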
T-0 TO T+10 MIN · WHAT DR-GUARD DOES
Trigger: 3 failed health probes within a 90s window
- 90s · Probe DR site: is it healthy?
- 15s · Traffic Manager / Route 53 weight swap
- 30-60s · Read-replica → primary promotion
- 60-300s · DR compute warm → full size
- 60s · App endpoint /health returns 200
- 30s · PagerDuty + Slack + email alerts
- 5s · HIPAA / SOC 2 audit log exported (automatic)

DR-GUARD USES AI TO PICK THE RIGHT RECOVERY PATH PER RESOURCE
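The documented trigger is 3 failed health probes inside a 90-second window. One simple way to implement that (a sketch, not DR-GUARD internals; whether a successful probe resets the streak is an assumption here) is a sliding window of failure timestamps:

```python
# Sketch (not DR-GUARD internals): fire failover after 3 failed probes
# inside a 90s window. Resetting on a successful probe is an assumption.
from collections import deque

class FailoverTrigger:
    def __init__(self, threshold: int = 3, window_s: float = 90.0):
        self.threshold = threshold
        self.window_s = window_s
        self.failures: deque[float] = deque()  # timestamps of recent failures

    def record(self, t: float, probe_ok: bool) -> bool:
        """Record one probe result; return True when failover should fire."""
        if probe_ok:
            self.failures.clear()  # a healthy probe resets the streak
            return False
        self.failures.append(t)
        # Drop failures that have aged out of the window.
        while self.failures and t - self.failures[0] > self.window_s:
            self.failures.popleft()
        return len(self.failures) >= self.threshold

trig = FailoverTrigger()
probes = [(0, False), (30, False), (60, True), (70, False), (100, False), (130, False)]
fired = [trig.record(t, ok) for t, ok in probes]
print(fired)  # only the last probe completes 3 failures within 90s
```

The success at t=60 wipes the first two failures, so the trigger fires only at t=130, when three uninterrupted failures (70, 100, 130) fit inside one 90s window.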
A dead VM, a dead AKS cluster, and a dead Cosmos DB all need different recovery playbooks. DR-GUARD asks the LLM (Claude in normal mode, local Llama in AIRLOCK) to generate a resource-type-specific plan every time. No one-size-fits-all.
| RESOURCE | AZURE | AWS | GCP |
|---|---|---|---|
| VM / Compute | Azure Backup → VHD snapshot to DR region → redeploy via Bicep | AMI cross-region copy + EBS snapshot repl → relaunch from AMI | Persistent Disk snapshot + image export → recreate |
| AKS / EKS / GKE | Velero backup → restore to DR AKS → reapply Helm | Velero + ECR cross-region → EKS clone via eksctl | Velero + Artifact Registry repl → recreate GKE |
| SQL Database | Azure SQL geo-rep → forced failover of failover-group | RDS read-replica promotion in DR region | Cloud SQL cross-region replica promote |
| NoSQL / Document | Cosmos DB multi-region writes or forced failover | DynamoDB Global Tables (multi-region by design) | Firestore multi-region or Spanner (global) |
| Object Storage | GRS / RA-GRS → client-side endpoint failover | S3 Cross-Region Replication + multi-region access | Multi-region bucket or dual-region turbo repl |
| Secrets / KMS | Key Vault geo-repl + backup blobs → restore to DR KV | Secrets Mgr cross-region repl; multi-region KMS keys | Secret Mgr policies; Cloud KMS multi-region keyrings |
| Networking / DNS | Traffic Manager priority → Front Door dual-region | Route 53 health-check → CloudFront dual-origin | Cloud DNS + Cloud Load Balancing failover |
| App Service / Fn | App Service slots + Front Door backend failover | Lambda cross-region + API Gateway endpoint swap | Cloud Run + Cloud LB multi-region backends |
| Data Warehouse | Synapse geo-backup + RA-GRS underlying storage | Redshift snapshot copy to DR + cluster restore | BigQuery multi-region datasets (built-in) |
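The matrix above is effectively a lookup keyed by (resource type, cloud). A sketch of that dispatch, using a few rows from the table as plain-text summaries (keys and the fallback string are illustrative, not DR-GUARD's schema):

```python
# Sketch: the recovery matrix as a (resource, cloud) -> playbook lookup.
# Entries are summaries of rows in the table above; keys are illustrative.
PLAYBOOKS = {
    ("vm", "azure"): "Azure Backup -> VHD snapshot to DR region -> redeploy via Bicep",
    ("vm", "aws"): "AMI cross-region copy + EBS snapshot replication -> relaunch from AMI",
    ("sql", "aws"): "RDS read-replica promotion in DR region",
    ("nosql", "gcp"): "Firestore multi-region or Spanner (global)",
}

def playbook_for(resource: str, cloud: str) -> str:
    try:
        return PLAYBOOKS[(resource, cloud)]
    except KeyError:
        # Combinations not in the static matrix fall through to the
        # LLM-generated plan path described below.
        return "generate-plan-via-llm"

print(playbook_for("sql", "aws"))
print(playbook_for("dwh", "gcp"))  # not in the static matrix
```

The interesting design point is the fallback branch: with hundreds of cloud services, the static table can only ever cover the common cases, which is exactly why uncovered resources get a generated plan instead.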
DR-GUARD does not hardcode recovery for every resource (there are hundreds of cloud services). Instead, when a failure is detected, it inventories the resources that need restoring and asks the LLM (Claude API, or local Llama 3 in AIRLOCK mode) to generate a resource-specific recovery plan. The plan includes exact CLI commands, rollback steps, pre-flight checks, an estimated RTO, and a compliance mapping. Human approval is required before execution (auto-approval is available for gold-tier playbooks the client pre-signs).
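The plan fields listed above suggest a natural schema. A sketch of that shape (field names are assumptions; DR-GUARD's actual plan format isn't published here, and the CLI strings are truncated placeholders):

```python
# Sketch of the plan shape described above. Field names are assumptions;
# the command strings are truncated placeholders, not runnable commands.
from dataclasses import dataclass, field

@dataclass
class RecoveryPlan:
    resource_id: str
    resource_type: str
    commands: list[str]           # exact CLI commands to execute
    rollback: list[str]           # how to undo if a step goes wrong
    preflight_checks: list[str]   # must all pass before execution
    estimated_rto_min: int
    compliance: list[str] = field(default_factory=list)  # e.g. HIPAA, SOC 2
    approved: bool = False        # human or pre-signed gold-tier approval

    def ready(self) -> bool:
        """Executable only once approved, with rollback and pre-flight defined."""
        return self.approved and bool(self.rollback) and bool(self.preflight_checks)

plan = RecoveryPlan(
    resource_id="sqldb-prod-01",
    resource_type="SQL Database",
    commands=["az sql failover-group set-primary ..."],
    rollback=["az sql failover-group set-primary ..."],
    preflight_checks=["replica lag < 5s"],
    estimated_rto_min=2,
    compliance=["HIPAA", "SOC 2"],
)
print(plan.ready())  # False: no approval yet
plan.approved = True
print(plan.ready())  # True
```

Gating `ready()` on approval mirrors the text: a generated plan is inert until a human signs off, unless the client has pre-signed that playbook tier.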
DR-GUARD IS A PREMIUM ADD-ON