TITAN DR-GUARD

What about ACTIVE-ACTIVE DR setup?

Both regions run live traffic simultaneously. A region failure is invisible to users because the other region is already serving them.

✓ SOLUTION — DR-GUARD ACTIVE-ACTIVE MODE

Traffic Manager / Route 53 / Cloud DNS with geo-routing or weighted 50/50 to both regions
Database: multi-master replication — Azure Cosmos DB (multi-region writes), AWS DynamoDB Global Tables, GCP Cloud Spanner
Storage: cross-region replication with eventual consistency (<10s typical lag)
DR-GUARD polls both region health every 30s. On failure detection: automatically drops the sick region's weight to 0 within 60 seconds
Users in that region see a one-off TLS reconnect; zero data loss with CosmosDB/Spanner strong-consistency tier
RTO: < 60 sec. RPO: 0 (synchronous) or < 10 sec (async).

What about ACTIVE-STANDBY DR setup?

Primary serves live traffic. Standby (warm or cold) is idle until primary fails. Much cheaper than active-active.

✓ SOLUTION — DR-GUARD ACTIVE-STANDBY MODE

DNS weighted 100/0 toward primary; secondary pre-warmed with Azure SQL geo-replica, RDS read-replica, or Cloud SQL read-replica
DR-GUARD detects primary down (3 consecutive failed health checks across 90s)
Pre-flight check: confirms secondary region is itself healthy before failing into it (never fail into a broken site)
Flip DNS weights: primary=0, secondary=100 (propagation ~30-60s)
Promote SQL read-replica to primary (1-5 min depending on DB size)
Scale secondary compute from minimal to full (Azure App Service / ECS / GKE auto-scale)
Notify oncall + stakeholders via PagerDuty + email + Slack
RTO: 1-5 min (gold tier), 5-30 min (silver/bronze). RPO: < 5 min.

If an ENTIRE REGION goes down, how do we bring it back up?

Region-wide outage (power, fiber cut, cloud provider meltdown). The whole region is offline. How do we recover?

✓ SOLUTION — DR-GUARD REGION-DOWN PLAYBOOK

Minute 0-1: DR-GUARD health probes fail 3 times (90s). Region declared down.
Minute 1: Validates DR site health (active standby already running, or pilot-light warmed up)
Minute 1-2: DNS failover — Traffic Manager flips to secondary region
Minute 2-5: Database promotion (if needed) — secondary read-replica becomes primary
Minute 5: Compute scales up in DR region from warm to full
Minute 5-10: Secondary serves ALL traffic. Incident declared. Oncall paged.
Primary recovers (hours/days later): DR-GUARD re-syncs data primary←secondary, runs consistency check, then flips DNS back to primary in a planned window. No rush — secondary is fine.
Full audit log of every step (timestamps, commands, approvals) for post-incident review + HIPAA/SOC 2 evidence.

If TITAN AI's OWN AGENT breaks something, how do we recover?

A scan's auto-fix gets approved and applied, but it cascades into a bigger problem. How does TITAN itself clean up its own mess?

✓ SOLUTION — AGENT-INCIDENT RECOVERY (4-LAYER SAFETY NET)

Layer 1 - Pre-change snapshot: CONDUCTOR captured the exact state BEFORE the fix ran. Just restore the snapshot.
Layer 2 - Paired rollback command: Every AI fix ships with its rollback. One-click reverse: deploy-titan.sh --rollback=<scan-id>
Layer 3 - DR-GUARD failover: If rollback takes too long or doesn't work, DR-GUARD fails the whole region over to DR. Users never notice.
Layer 4 - Kill switch: deploy-titan.sh --kill-all halts every agent, freezes the environment, opens a support ticket with TITAN AI engineering on-call (24/7).
Written warranty: TITAN AI LLC indemnifies clients for any environment damage caused by our agent's auto-approved actions under the standard MSA. Nothing auto-approves without your opt-in flag.
Full forensics: every agent action logged with timestamp, command, result, and approval signature. Exportable to SIEM / forensic review in seconds.

ACTIVE-ACTIVE

RTO

< 60 SEC

RPO

DR COST

100% OF PRIMARY

Both regions live, load-balanced
Multi-master DB (Cosmos / DynamoDB Global / Spanner)
Invisible failover — users never know
Best for: critical transaction workloads, payment rails, healthcare EHR APIs

HIGHEST COST · BEST AVAILABILITY

ACTIVE-STANDBY (WARM)

RTO

1-5 MIN

RPO

< 5 MIN

DR COST

20-40%

Secondary region has minimal compute + read-replica DB
DNS flip + DB promotion on failover
Sweet spot for most production workloads
Best for: SaaS apps, internal tools, banking portals

BALANCED COST · SOLID RTO

PILOT LIGHT

RTO

15-30 MIN

RPO

< 15 MIN

DR COST

5-10%

Only critical services run in DR (DB replica + IAM + secrets)
App tier scaled-down or dormant; scales up on failover
Cheap, slower RTO
Best for: non-critical apps, dev/test DR, legacy systems

LOW COST · MODERATE RTO

BACKUP-AND-RESTORE

RTO

4-24 HRS

RPO

1-24 HRS

DR COST

< 5%

Geo-redundant backups only — no standby compute
On DR: provision new region from IaC, restore data, re-run pipelines
Cheapest, slowest
Best for: compliance-only DR (HIPAA minimum), batch systems

CHEAPEST · SLOWEST RECOVERY

T+0

DETECT

3 failed health probes, 90s window

90s

T+2m

VALIDATE DR

Probe DR site — is it healthy?

15s

T+3m

DNS FLIP

Traffic Manager / Route 53 weight swap

30-60s

T+4m

DB PROMOTE

Read-replica → primary

60-300s

T+5m

SCALE UP

DR compute warm → full size

60s

T+6m

VERIFY

App endpoint /health returns 200

30s

T+7m

NOTIFY

PagerDuty + Slack + email

T+8m

EVIDENCE

HIPAA / SOC 2 audit log exported

auto

A dead VM, a dead AKS cluster, and a dead Cosmos DB all need different recovery playbooks. DR-GUARD asks the LLM (Claude in normal mode, local Llama in AIRLOCK) to generate a resource-type-specific plan every time. No one-size-fits-all.

RESOURCE	AZURE	AWS	GCP
VM / Compute	Azure Backup → VHD snapshot to DR region → redeploy via Bicep	AMI cross-region copy + EBS snapshot repl → relaunch from AMI	Persistent Disk snapshot + image export → recreate
AKS / EKS / GKE	Velero backup → restore to DR AKS → reapply Helm	Velero + ECR cross-region → EKS clone via eksctl	Velero + Artifact Registry repl → recreate GKE
SQL Database	Azure SQL geo-rep → forced failover of failover-group	RDS read-replica promotion in DR region	Cloud SQL cross-region replica promote
NoSQL / Document	Cosmos DB multi-region writes or forced failover	DynamoDB Global Tables (multi-region by design)	Firestore multi-region or Spanner (global)
Object Storage	GRS / RA-GRS → client-side endpoint failover	S3 Cross-Region Replication + multi-region access	Multi-region bucket or dual-region turbo repl
Secrets / KMS	Key Vault geo-repl + backup blobs → restore to DR KV	Secrets Mgr cross-region repl; multi-region KMS keys	Secret Mgr policies; Cloud KMS multi-region keyrings
Networking / DNS	Traffic Manager priority → Front Door dual-region	Route 53 health-check → CloudFront dual-origin	Cloud DNS + Cloud Load Balancing failover
App Service / Fn	App Service slots + Front Door backend failover	Lambda cross-region + API Gateway endpoint swap	Cloud Run + Cloud LB multi-region backends
Data Warehouse	Synapse geo-backup + RA-GRS underlying storage	Redshift snapshot copy to DR + cluster restore	BigQuery multi-region datasets (built-in)

🤖 LLM-DRIVEN RECOVERY PLANNING

DR-GUARD does not hardcode recovery for every resource (there are hundreds of services). Instead, when a failure is detected, it inventories the resources that need to be restored and asks the LLM (Claude API, or local Llama 3 in AIRLOCK mode) to generate a resource-specific recovery plan. The plan includes exact CLI commands, rollback, pre-flight checks, estimated RTO, and compliance mapping. Human approval required before execution (can be auto-approved for gold-tier playbooks the client pre-signs).

Works for any Azure / AWS / GCP resource type — including services we haven't explicitly modeled
In AIRLOCK mode, plans are generated by on-prem Llama 3 — zero internet required
Every LLM-generated plan sanity-checked against destructive-command blocklist before showing the operator
Audit-logged with exact prompt + response for post-incident review

PREMIUM ADD-ON

TITAN DR-GUARD

$75,000/yr

Stacks on any package. Per-environment pricing. Enterprise DR (regulated + multi-cloud) on request.

✓ All 4 topologies supported (active-active / active-standby / pilot-light / backup-restore)
✓ Multi-cloud: Azure Traffic Manager + AWS Route 53 + GCP Cloud DNS
✓ 24/7 monitoring · 30-second probe interval
✓ Quarterly DR drills + evidence package (HIPAA / SOC 2 / NIST / PCI)
✓ Agent-incident recovery layer (kill switch, warranty-backed)
✓ AIRLOCK-compatible (offline mode, signed DR playbooks)
✓ Failover SLA: gold tier RTO < 60 sec, silver < 5 min
✓ DR ops engineer on-call via TITAN AI 24/7 tier

ADD DR-GUARD → DISCUSS YOUR DR NEEDS

YOUR QUESTIONS — ANSWERED

What about ACTIVE-ACTIVE DR setup?

What about ACTIVE-STANDBY DR setup?

If an ENTIRE REGION goes down, how do we bring it back up?

If TITAN AI's OWN AGENT breaks something, how do we recover?

4 DR TOPOLOGIES — PICK YOUR TIER

ACTIVE-ACTIVE

ACTIVE-STANDBY (WARM)

PILOT LIGHT

BACKUP-AND-RESTORE

REGION-DOWN FAILOVER — THE TIMELINE

DETECT

VALIDATE DR

DNS FLIP

DB PROMOTE

SCALE UP

VERIFY

NOTIFY

EVIDENCE

EVERY RESOURCE TYPE — ITS OWN STRATEGY

PRICING