
Auto Recovery

Auto-recovery is the process of restoring a system to a known-good state after a failure, without human intervention. A Kubernetes pod starts crash-looping at 3 AM because a downstream dependency timed out. The orchestrator detects the unhealthy pod, kills it, starts a new instance, verifies the health check passes, and reroutes traffic. The on-call engineer wakes up to a resolved alert and a Slack summary, not a page demanding immediate action.

Auto-recovery handles the symptoms — it restarts the process, replaces the instance, or fails over to a backup. It does not investigate why the failure happened. That is the domain of auto-remediation, which identifies root causes and applies targeted fixes. Recovery is the fire sprinkler system; remediation is the fire investigator. Most organizations need both, but they solve different problems and operate at different levels of the stack.

TL;DR

Auto-recovery automatically restores systems to working state after failures. It covers four phases: detect the problem, diagnose severity, execute recovery (restart, failover, scale), and validate the fix worked. The pattern is built into every major cloud platform and orchestrator. Auto-recovery handles symptoms; pair it with auto-remediation to address root causes. Governance of recovery playbooks prevents runaway automation.

What Auto-Recovery Does (and Does Not Do)

Auto-recovery has a narrow, well-defined scope. It detects that something is broken, and it restores service. It does not ask why the thing broke, whether it will break again, or whether the underlying architecture needs to change. This narrowness is a feature, not a limitation — it makes recovery fast, predictable, and safe to automate.

AUTO-RECOVERY VS. AUTO-REMEDIATION

| Dimension | Auto-Recovery | Auto-Remediation |
|---|---|---|
| Goal | Restore state | Fix root cause |
| Approach | Reactive | Proactive / intelligent |
| Typical level | Infrastructure | Application / config |
| Example | Restart crashed pod | Patch config drift |

Think of it through a concrete analogy. When a web server process crashes, auto-recovery restarts it. When an EC2 instance fails a health check, the Auto Scaling Group terminates it and launches a replacement. When a database primary goes down, the replica promotes itself. In every case, the system returns to a working state — but the underlying cause (memory leak, hardware failure, network partition) remains unaddressed until someone or something investigates.

The Four Phases

Auto-recovery follows a consistent lifecycle across platforms and implementations. Each phase has a clear input, a clear output, and a handoff to the next.

AUTO-RECOVERY LIFECYCLE

1. Detect — health check returns 503; metrics exceed threshold
2. Diagnose — identify failing component; assess severity level
3. Recover — restart, failover, or scale; execute playbook
4. Validate — health check returns 200; metrics normalize

(Feedback loop — continuous monitoring resumes.)

Phase 1: Detect. A liveness probe sends an HTTP request to the application every 10 seconds. The probe returns a 503. After three consecutive failures, the orchestrator marks the pod as unhealthy. Detection also happens through metric thresholds (CPU above 95% for 5 minutes), log pattern matching (repeated OOM errors), or synthetic transactions (a test purchase flow that times out).
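The consecutive-failure threshold described above can be sketched as a small counter. The `FailureDetector` class below is a hypothetical illustration of the logic, not any platform's actual API; it mirrors the behavior of a Kubernetes liveness probe's `failureThreshold`:

```python
class FailureDetector:
    """Marks a target unhealthy after N consecutive failed probes.

    A single success resets the counter, so an isolated blip never
    triggers recovery -- only a sustained failure does.
    """

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; returns True when the target is unhealthy."""
        self.consecutive_failures = 0 if probe_ok else self.consecutive_failures + 1
        return self.consecutive_failures >= self.failure_threshold
```

With a 10-second probe interval and a threshold of 3, this yields the roughly 30-second detection window discussed later under MTTD.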

Phase 2: Diagnose. The system determines what failed and how severe it is. Is this a single pod in a ten-replica deployment (minor) or the only instance behind the load balancer (critical)? Is the failure isolated to one availability zone or affecting the entire region? Diagnosis determines which recovery playbook to execute — a single-pod restart requires different action than a zone-level failover.

Phase 3: Recover. The system executes the appropriate recovery action. For a crashed pod, Kubernetes creates a new instance using the same image and configuration. For an unhealthy EC2 instance in an Auto Scaling Group, the group terminates the instance and launches a replacement behind the load balancer. For a database primary failure, the replica promotes itself and the application reconnects. The recovery action is selected from a pre-defined playbook — there is no improvisation at this stage.

Phase 4: Validate. After recovery, the system confirms the fix worked. The health check probe returns 200. Response latency drops back to normal ranges. Error rates fall below threshold. If validation fails, the system escalates — either retrying recovery with a different approach or alerting a human. Validation prevents the recovery system from declaring success while the application is still broken.

Recovery Patterns

Four patterns handle the majority of automated recovery scenarios. Each applies to a different failure mode.

Circuit breaker. When a downstream dependency starts failing, the circuit breaker stops sending requests to it. Instead of every request timing out (degrading the caller), the breaker returns a fallback response immediately. After a configured interval, it sends a single test request. If the dependency is back, the breaker closes and normal traffic resumes. If not, it stays open. This prevents a single failing service from cascading into a full outage — a pattern that has prevented countless 3 AM pages since Netflix popularized it with Hystrix.
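The open/half-open/closed behavior can be condensed into a few dozen lines. This is a minimal sketch, not Hystrix's implementation; the class name, parameters, and error handling are illustrative assumptions:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `max_failures` consecutive
    failures, then short-circuits to a fallback until `reset_timeout`
    seconds pass, at which point one test request is allowed (half-open)."""

    def __init__(self, max_failures: int = 3, reset_timeout: float = 30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, fallback):
        # While open, return the fallback until the timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and allow one test request.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures or self.opened_at is not None:
                self.opened_at = time.monotonic()  # open (or re-open) the breaker
            return fallback()
        # Success: reset to the closed state.
        self.failures = 0
        self.opened_at = None
        return result
```

A production breaker would also track failure *rate* rather than a raw count and expose its state as a metric, but the state machine is the same.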

Retry with exponential backoff. Transient failures — a momentary network blip, a connection pool exhaustion that clears itself — often resolve on their own. The system retries the failed operation after 1 second, then 2, then 4, then 8, with random jitter added to each interval. The jitter prevents the thundering herd problem: without it, 500 clients that all failed at the same time would all retry at the same time, overloading the recovering service.
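The 1-2-4-8 schedule with jitter can be written as a small retry wrapper. This sketch uses the "full jitter" variant (each delay drawn uniformly from zero up to the exponential cap); the function name and parameters are illustrative assumptions:

```python
import random
import time


def retry_with_backoff(op, max_attempts=5, base=1.0, cap=30.0,
                       sleep=time.sleep,
                       transient=(ConnectionError, TimeoutError)):
    """Retry `op` on transient errors with exponential backoff and full jitter.

    Only the listed transient exception types are retried; anything else
    propagates immediately, since retrying a non-transient failure just
    delays the inevitable.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except transient:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            # Full jitter: uniform in [0, min(cap, base * 2**attempt)],
            # so simultaneous failures do not retry simultaneously.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Injecting `sleep` makes the schedule testable without real waiting, and is also where a shared rate limiter would hook in.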

Bulkhead isolation. Critical resources — thread pools, connection pools, memory allocations — are partitioned so that one runaway consumer cannot exhaust them for everyone. If the reporting service starts consuming all available database connections, the bulkhead ensures that the checkout service retains its own dedicated pool. The failure is contained to the reporting service rather than taking down the entire platform.
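A bulkhead reduces to a bounded slot count per consumer. The sketch below uses a semaphore and rejects excess calls outright rather than queuing them (queuing would let the runaway consumer starve everyone anyway); the class name and pool sizes are illustrative assumptions:

```python
import threading


class Bulkhead:
    """Caps concurrent use of a resource for one consumer. Excess calls
    fail fast instead of queuing, so a runaway tenant is contained."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.Semaphore(max_concurrent)

    def run(self, fn):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            return fn()
        finally:
            self._slots.release()


# One bulkhead per service keeps the noisy neighbor contained:
checkout_pool = Bulkhead(max_concurrent=20)
reporting_pool = Bulkhead(max_concurrent=5)
```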

Failover. When a primary component fails entirely, traffic shifts to a standby. Database failover promotes a read replica to primary. DNS failover routes traffic to a different region. Load balancer failover removes unhealthy targets and distributes traffic to healthy ones. The key requirement is that the standby must be genuinely ready — warm, synchronized, and tested through regular failover drills.
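At its core, failover is priority-ordered target selection over health state. This is a deliberately simplified sketch (real failover also handles replication lag and split-brain protection); the function name and target labels are illustrative assumptions:

```python
def pick_target(targets, health):
    """Route to the primary when healthy, otherwise the first healthy standby.

    `targets` is ordered by priority (primary first); `health` maps
    target -> bool from the most recent health checks.
    """
    for t in targets:
        if health.get(t, False):
            return t
    # No healthy target left: automation must stop and escalate,
    # not pick a standby that is known to be down.
    raise RuntimeError("no healthy target: escalate to a human")
```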

Organizations that implement automated recovery processes reduce mean time to recovery (MTTR) by 60-70% compared to manual incident response, with the largest gains in off-hours incidents.

— Google, Site Reliability Engineering

Auto-Recovery in Practice

Two worked examples show how recovery patterns combine in real deployments.

Kubernetes deployment with liveness probes. A payment service runs as a Deployment with three replicas, each configured with a liveness probe that hits /healthz every 15 seconds. One replica's JVM enters a GC death spiral — it is technically running but not responding within the 3-second probe timeout. After three consecutive failures (45 seconds), Kubernetes kills the pod and starts a new one. The Deployment's readiness gate prevents the new pod from receiving traffic until it passes its readiness probe — confirming the application has fully initialized and connected to its database. A PodDisruptionBudget ensures that during this recovery, at least two replicas remain available, so the recovery itself does not cause an outage.

AWS Auto Scaling Group replacing an unhealthy EC2 instance. An application runs behind an Application Load Balancer (ALB) on three EC2 instances in an Auto Scaling Group (ASG). The ALB health check sends an HTTP request to each instance every 30 seconds. One instance's EBS volume develops latency issues, causing the application to respond slowly and eventually time out on health checks. After two consecutive failures, the ALB stops routing traffic to the unhealthy instance. The ASG detects the unhealthy instance, terminates it, and launches a replacement from the same launch template. The new instance boots, runs its user data script to configure the application, registers with the ALB, passes its health check, and starts receiving traffic. Total recovery time: approximately 3-5 minutes, without human intervention.

What Goes Wrong

Auto-recovery is not risk-free. Poorly configured recovery automation creates its own category of incidents.

RECOVERY ANTI-PATTERNS

- Recovery loop — pod restarts 50x in an hour; the cause is not transient, so it fails again immediately. Fix: CrashLoopBackOff + max retries.
- Cascading failover — Region A fails; traffic shifts to Region B, which cannot handle the doubled load. Fix: capacity planning + load shedding.
- Thundering herd — all instances restart at once; simultaneous cold starts overwhelm the database. Fix: rolling restart + jitter.

Recovery loops. A pod crashes because of a configuration error — a missing environment variable, a bad database connection string. Auto-recovery restarts it. It crashes again immediately. Kubernetes enters CrashLoopBackOff, increasing the delay between restarts, but the pod will never succeed because the underlying cause is not transient. Without a maximum retry limit and an escalation policy, the system burns CPU cycles restarting a pod that can never run. The mitigation: configure a maximum restart count, and after that limit is reached, alert a human and stop retrying.
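The backoff half of the mitigation is simple to state. The sketch below models a CrashLoopBackOff-style schedule — delay doubles per restart up to a cap (the real kubelet caps at five minutes); the function name and constants are illustrative assumptions:

```python
def restart_delay(restart_count: int, base: float = 10.0,
                  cap: float = 300.0) -> float:
    """Delay before the next restart attempt: exponential, capped.

    The cap stops the delay from growing unboundedly; a separate max
    restart count (not modeled here) stops the loop entirely and pages
    a human, since a config error never becomes transient.
    """
    return min(cap, base * 2 ** restart_count)
```

The resulting schedule — 10s, 20s, 40s, 80s, 160s, then 300s forever — slows the burn but does not end it; only the max-retry-then-escalate policy does that.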

Cascading failovers. Region A fails. Auto-recovery shifts all traffic to Region B. But Region B was provisioned for 50% of total capacity — it was the backup, not a full mirror. The doubled load overwhelms Region B, and now both regions are down. The mitigation: capacity planning that ensures every failover target can handle the full load, combined with load shedding that drops low-priority traffic rather than accepting all of it.

Thundering herd. A widespread issue causes 200 application instances to crash simultaneously. Auto-recovery starts all 200 at once. Every instance's startup routine queries the configuration database, floods the cache layer with cold reads, and establishes connection pools. The downstream systems, already stressed from the original failure, collapse under the startup load. The mitigation: rolling restarts with staggered intervals, connection pool warm-up, and circuit breakers that limit the rate of simultaneous recovery actions.

Recovery that masks problems. Auto-recovery restarts a service that crashed due to a memory leak. The service works fine for another 6 hours, then crashes again. The recovery system restarts it again. This cycle repeats for weeks, with recovery masking a genuine bug that never gets investigated because the pager never fires long enough for someone to notice the pattern. The mitigation: track recovery frequency per service and alert when a service is being recovered more often than its baseline.
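Tracking recovery frequency needs only a sliding window of recovery timestamps per service. The class below is a hypothetical sketch of that monitor (names and thresholds are assumptions, not a real tool's API):

```python
from collections import deque


class RecoveryRateMonitor:
    """Flags a service that is auto-recovered more often than its baseline,
    so repeated recoveries surface a hidden bug instead of masking it."""

    def __init__(self, window_seconds: float = 86400, baseline: int = 3):
        self.window = window_seconds
        self.baseline = baseline
        self.events = deque()  # timestamps of recoveries inside the window

    def record(self, timestamp: float) -> bool:
        """Record one recovery; returns True when the rate exceeds baseline."""
        self.events.append(timestamp)
        # Drop recoveries that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events) > self.baseline
```

A `True` return is the signal to open a ticket and investigate — the one thing the recovery loop itself will never do.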

Measuring Recovery Effectiveness

Four metrics tell you whether auto-recovery is working as intended.

Mean Time to Detection (MTTD) measures how quickly the system identifies a failure. A well-configured health check with 10-second intervals and a 3-failure threshold gives an MTTD of approximately 30 seconds. Synthetic monitoring that runs critical paths every minute adds another detection layer. Target: under 60 seconds for critical services.

Mean Time to Recovery (MTTR) measures the elapsed time from detection to validated restoration. For a Kubernetes pod restart, this is typically 30-90 seconds. For an EC2 instance replacement in an ASG, 3-5 minutes. For a database failover, 30 seconds to 2 minutes depending on the technology. If MTTR is consistently long, the bottleneck is usually in the validation phase — slow health checks or readiness probes that take too long to pass.

Recovery success rate is the percentage of automated recovery attempts that resolve the issue without human intervention. A healthy system should sustain above 85%. Below that, either the recovery playbooks are too simplistic for the failure modes they encounter, or the underlying infrastructure has chronic issues that recovery cannot solve.

False positive rate measures how often recovery triggers unnecessarily — killing a healthy pod because a health check endpoint was temporarily slow, or replacing an instance that was actually fine. High false-positive rates erode trust in the recovery system and, worse, can cause outages: unnecessarily killing pods during a traffic spike reduces capacity exactly when it is needed most. Target: below 5%.
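The last two metrics are straightforward ratios over a log of recovery attempts. This is an illustrative calculation, assuming a hypothetical record format with `resolved` and `false_positive` flags per attempt:

```python
def recovery_metrics(attempts):
    """Compute success rate and false-positive rate over recovery attempts.

    `attempts` is a list of dicts with boolean fields:
      resolved       -- recovery restored service without a human
      false_positive -- recovery fired on a target that was actually healthy
    Returns (success_rate, false_positive_rate) as fractions of all attempts.
    """
    n = len(attempts)
    success = sum(a["resolved"] for a in attempts) / n
    false_pos = sum(a["false_positive"] for a in attempts) / n
    return success, false_pos
```

Against the targets above, a result like (0.75, 0.25) would fail both thresholds — too many unresolved attempts and far too many unnecessary kills.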

Downtime costs enterprises an average of $5,600 per minute. Automated recovery that reduces MTTR from 30 minutes to 3 minutes for a critical service saves roughly $150,000 per incident.

— Gartner, The Cost of Downtime

How Dawiso Supports Recovery Operations

Recovery playbooks need context to work correctly. When a data pipeline fails at 3 AM, the recovery automation needs to know: which services depend on this pipeline? What is the downstream impact of an outage? Who owns the affected data assets? Is this a Tier 1 regulatory pipeline that requires immediate escalation, or a Tier 3 internal report that can wait until morning?

Dawiso's data catalog provides this context. Every data asset has an owner, a classification tier, and a record of which downstream processes depend on it. Data lineage shows the blast radius — if the "Customer" pipeline is down, lineage reveals that "Order Summary," "Churn Prediction," and "Regulatory Report" are all affected, enabling the recovery system to prioritize accordingly.

Through the Model Context Protocol (MCP), recovery automation tools can query Dawiso programmatically. Before executing a recovery action, the tool checks: what is the classification tier of this asset? Who owns it? What is the blast radius? This metadata-driven approach prevents the blind automation that causes recovery systems to treat a non-critical batch job with the same urgency as a regulatory pipeline — or worse, to recover a downstream system before the upstream source that feeds it.

Auto-recovery keeps systems running. Governed metadata ensures the recovery automation knows what it is recovering, why it matters, and who should be notified when it happens.
