Understand Multi-AZ, Route 53 failover, backup-and-restore, pilot-light, warm-standby, and service-quota decisions for SAA-C03 resilience scenarios.
This objective is where AWS tests whether you understand the difference between normal production resilience and true disaster recovery. SAA-C03 often gives several technically valid answers here. The best one is the design that matches the required RTO, RPO, and failure scope without unnecessary cost or complexity.
The current exam guide points to AWS global infrastructure, DR strategies, failover strategies, distributed design patterns, immutable infrastructure, load balancing, proxies such as RDS Proxy, service quotas, throttling, storage durability options, and workload visibility.
Ask three questions in order:

1. What failure scope must the design survive: a single instance, an Availability Zone, or an entire Region?
2. What recovery time objective (RTO) does the business require?
3. What recovery point objective (RPO) applies, meaning how much data loss is acceptable?

Once you answer those, the right architecture usually narrows quickly.
| Requirement | Strongest first fit | Why |
|---|---|---|
| Survive one instance failure | Auto Scaling, health checks, and stateless placement | Replaces unhealthy compute automatically |
| Survive one Availability Zone failure | Multi-AZ design with load balancing and managed data services | Classic HA requirement inside one Region |
| Survive one Region failure | Cross-Region failover or active-active pattern | AZ design alone does not cover a regional outage |
| Recover older data state after corruption or deletion | Backup, snapshot, PITR, or versioned recovery design | Availability controls do not replace recovery controls |
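The Multi-AZ row above is typically implemented as an Auto Scaling group whose subnets span at least two Availability Zones behind a load balancer. A minimal sketch, assuming placeholder subnet, launch template, and target group identifiers (none of these values come from the exam guide):

```yaml
Resources:
  AppAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: '2'
      MaxSize: '6'
      DesiredCapacity: '2'
      # Subnets in at least two Availability Zones, so one AZ failure
      # still leaves healthy capacity behind the load balancer
      VPCZoneIdentifier:
        - subnet-0aaa1111   # placeholder subnet in AZ a
        - subnet-0bbb2222   # placeholder subnet in AZ b
      LaunchTemplate:
        LaunchTemplateId: lt-0123456789abcdef0   # placeholder launch template
        Version: '1'
      # ELB health checks let the group replace instances that the load
      # balancer marks unhealthy, not only ones that fail EC2 status checks
      HealthCheckType: ELB
      HealthCheckGracePeriod: 120
```

The combination of stateless instances, ELB health checks, and multi-AZ subnet placement is what makes "replace, don't repair" work for the first two rows of the table.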
| Requirement | Strongest first fit | Why |
|---|---|---|
| Survive one-instance or one-AZ failure inside a Region | Multi-AZ placement and load balancing | Strong default for production resilience |
| Lowest-cost DR with slower recovery | Backup and restore | Minimal standing infrastructure; restore time drives the longer RTO |
| Faster recovery with core systems already ready | Pilot light or warm standby | Improves RTO compared with backup only |
| Near-continuous service across Regions | Active-active multi-Region | Highest complexity and cost, strongest continuity |
```mermaid
flowchart LR
  B["Backup and restore"] --> P["Pilot light"]
  P --> W["Warm standby"]
  W --> A["Active-active"]
```
As you move right, recovery usually gets faster and operational cost usually increases. SAA-C03 often asks you to choose the smallest pattern that still satisfies the business requirement.
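In the pilot-light pattern, the data tier stays live in the recovery Region while compute stays scaled down or off. One common building block is a cross-Region read replica that can be promoted during failover. A hedged sketch deployed in the recovery Region, with a placeholder source ARN and instance class:

```yaml
Resources:
  PilotLightReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      # Cross-Region replica: the source is referenced by full ARN
      # because it lives in the primary Region (placeholder ARN)
      SourceDBInstanceIdentifier: arn:aws:rds:us-east-1:111122223333:db:app-primary
      # Sized smaller than production; scaled up only if promoted
      DBInstanceClass: db.t3.medium
```

During a regional failover, the replica is promoted to a standalone primary and the surrounding compute is launched from templates, which is what moves RTO ahead of plain backup-and-restore.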
This is the kind of failover configuration SAA-C03 expects you to read quickly:
```yaml
Resources:
  AppPrimaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: app.example.com.
      Type: CNAME
      SetIdentifier: primary
      Failover: PRIMARY
      TTL: '60'
      ResourceRecords:
        - primary.example.net
      HealthCheckId: abc12345-1111-2222-3333-444455556666

  AppSecondaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: app.example.com.
      Type: CNAME
      SetIdentifier: secondary
      Failover: SECONDARY
      TTL: '60'
      ResourceRecords:
        - secondary.example.net
```
What to notice:

- Both record sets share the same `Name` and use the `Failover` routing policy, distinguished only by `SetIdentifier`.
- Only the PRIMARY record carries a `HealthCheckId`; when that health check fails, Route 53 starts answering with the SECONDARY record.
- The low TTL of 60 seconds limits how long resolvers cache the primary answer, which shortens the effective failover time.
| Requirement | Strongest first fit | Why |
|---|---|---|
| Fast database failover within one Region | Multi-AZ database deployment | High availability answer inside a Region |
| Offload relational reads | Read replica | Read scale is not the same as HA failover |
| Recover earlier data state after user error | Backup, snapshot, or point-in-time recovery | Recovery objective is different from live failover |
| Durable object storage with deletion recovery | S3 versioning plus backup/retention design | Durability alone does not handle accidental deletion |
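The database and storage rows above map to small template settings: `MultiAZ: true` provides the in-Region failover, `BackupRetentionPeriod` enables automated backups and point-in-time recovery, and `VersioningConfiguration` covers accidental object deletion. A hedged sketch with placeholder identifiers and a placeholder Secrets Manager secret:

```yaml
Resources:
  AppDatabase:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: mysql
      DBInstanceClass: db.t3.medium
      AllocatedStorage: '100'
      MasterUsername: admin
      # Placeholder dynamic reference to a Secrets Manager secret
      MasterUserPassword: '{{resolve:secretsmanager:app-db-secret:SecretString:password}}'
      MultiAZ: true               # synchronous standby in another AZ
      BackupRetentionPeriod: 7    # automated backups enable PITR

  AppBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled           # deleted or overwritten objects stay recoverable
```

Notice that the availability control (`MultiAZ`) and the recovery controls (`BackupRetentionPeriod`, versioning) are separate settings, which mirrors the table's distinction between failover and recovery.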
The exam does not stop at topology. It also checks whether the environment can recover automatically and whether the standby design can actually scale under stress.
| Concern | Strongest first check | Why |
|---|---|---|
| Immutable replacement of unhealthy servers | Launch templates, Auto Scaling, and infrastructure as code | Rebuilding is usually stronger than repairing pets |
| Standby environment cannot scale during disaster | Service quotas and throttling limits | The DR pattern fails if quotas stay sized for normal traffic |
| Users report intermittent failures and failover timing is unclear | Health checks, CloudWatch metrics, and tracing visibility | Observability supports resilience decisions |
| Legacy application opens too many database connections during failover | RDS Proxy | Helps connection handling without rewriting the whole app |
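The RDS Proxy row can be expressed as a proxy plus its default target group. A hedged sketch, assuming placeholder role, secret, subnet, and database identifiers:

```yaml
Resources:
  AppDbProxy:
    Type: AWS::RDS::DBProxy
    Properties:
      DBProxyName: app-proxy
      EngineFamily: MYSQL
      # The proxy authenticates to the database via a Secrets Manager secret
      Auth:
        - AuthScheme: SECRETS
          SecretArn: arn:aws:secretsmanager:us-east-1:111122223333:secret:app-db-secret  # placeholder
          IAMAuth: DISABLED
      RoleArn: arn:aws:iam::111122223333:role/app-proxy-role  # placeholder
      VpcSubnetIds:
        - subnet-0aaa1111   # placeholders
        - subnet-0bbb2222
      RequireTLS: true

  AppDbProxyTargets:
    Type: AWS::RDS::DBProxyTargetGroup
    Properties:
      DBProxyName: !Ref AppDbProxy
      TargetGroupName: default        # RDS Proxy supports only the default group
      DBInstanceIdentifiers:
        - app-primary                 # placeholder DB instance
      ConnectionPoolConfigurationInfo:
        MaxConnectionsPercent: 90     # cap pooled connections so failover
                                      # storms do not exhaust the database
```

On the exam, the signal for RDS Proxy is connection exhaustion from many short-lived or bursty clients during failover, not raw query throughput.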
| Symptom | Strongest first check | Why |
|---|---|---|
| App servers recover, but the database still becomes the outage point | Single point of failure in the stateful tier | HA must include the data layer, not just stateless compute |
| DR test works at low scale but not during real failover | Quotas, warm capacity, and automation readiness | Standby patterns are only as strong as their activation path |
| The team says the system is highly available because it has read replicas | Replica versus failover role | Read scale does not automatically provide synchronous HA |
| The design can fail over, but nobody can prove when or why | Health checks, metrics, and tracing | Visibility is part of resilience, not a separate concern |
Move next into 3. High-Performing Architectures to study the storage, compute, database, network, and ingestion layers that drive workload speed and scale.