Recent AWS and Azure outages prove one thing: resilience isn’t optional. Here’s how to achieve it.

Hi there!


Were you affected by the recent cloud outages? In this issue of Mission Infrastructure, we focus on how to keep customer‑facing services up when a region (or an entire cloud provider) wobbles. We also feature some of the best picks from our blog, news on the biggest events of the year, and our latest product updates.

 

Read on — and stay resilient 💪

Designing a Truly Resilient Cloud Strategy

On October 20, a major AWS outage in the us-east-1 region occurred due to a Domain Name System (DNS) failure affecting the DynamoDB service. This problem reverberated across the internet, disrupting numerous services that rely on AWS infrastructure. Just nine days later, a global incident struck Microsoft Azure, triggered by what the company believes was “an inadvertent configuration change.” Both of these events highlight the deep interconnectivity of the major cloud providers’ systems and the fragility of a digital ecosystem that relies on these providers never making a mistake.

 

Faced with such uncertainty, how do you ensure resilience? You need to design with disruption in mind.

 

Build a fortress

 

Most cloud outages hurt because workloads assume the control plane and data plane remain functional at all times. This assumption is wrong. You need external monitoring and failover capabilities that function independently of your cloud provider's foundational services. Your first step is to plan a failover path around the blast radius of each likely failure point:

  1. Tier your services. To prevent the compromise of systems that control everything else, ensure that only Tier‑0 (the most privileged administrative roles and resources) gets set up cross‑region or cross‑cloud. Everything else should degrade gracefully.
  2. Eliminate hidden single‑region dependencies. Shared queues, auth, feature flags, or state backends often live only in one region. You should mirror them or document the tradeoffs.
  3. Adopt a multi‑region‑first policy. Use active‑active or active‑passive across two regions with health‑checked DNS; pre‑warm capacity and replicate data to meet your target RPO/RTO (recovery point and recovery time objectives).
  4. Maintain control‑plane independence. Treat CI/CD, secrets, and IaC state as first‑class dependencies; host them for cross‑region reachability. A note on Azure: Plan degraded paths if Front Door/Traffic Manager or similar control‑plane services are impaired.
  5. Identify when to go multicloud. Use multiple clouds to reduce correlated risk for Tier‑0 APIs or meet regulatory/commercial needs. But remember, your ops costs will increase if you don’t adopt good multicloud management (policy, drift, inventory, and workflows in one place).

Remember that multicloud isn’t “double everything.” It’s selecting a minimum viable subset that you standardize and enforce. Here’s a guide to choosing, governing, and operating a multicloud strategy efficiently.
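
One way to make step 2 concrete is to audit a service inventory for dependencies that live in only one region. The sketch below is illustrative only: the `Service` class, the inventory format, and the service and region names are all hypothetical, and a real audit would need data from your own infrastructure catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    # Hypothetical inventory record; adapt the fields to your own catalog.
    name: str
    tier: int                 # 0 = most critical
    regions: set[str]         # regions where the service is deployed
    depends_on: list[str] = field(default_factory=list)


def single_region_risks(services: list[Service]) -> list[tuple[str, str]]:
    """Return sorted (service, dependency) pairs where a multi-region
    service depends, directly or transitively, on a single-region one."""
    by_name = {s.name: s for s in services}
    risks = []
    for svc in services:
        if len(svc.regions) < 2:
            continue  # only audit services that claim to be multi-region
        seen, stack = set(), list(svc.depends_on)
        while stack:
            name = stack.pop()
            if name in seen or name not in by_name:
                continue
            seen.add(name)
            dep = by_name[name]
            if len(dep.regions) < 2:
                risks.append((svc.name, dep.name))
            stack.extend(dep.depends_on)  # follow transitive dependencies
    return sorted(risks)


# Example inventory (all names are made up for illustration).
inventory = [
    Service("checkout-api", 0, {"us-east-1", "us-west-2"}, ["auth", "orders-queue"]),
    Service("auth", 0, {"us-east-1", "us-west-2"}, ["feature-flags"]),
    Service("orders-queue", 1, {"us-east-1"}),
    Service("feature-flags", 1, {"us-east-1"}),
]

for service, dependency in single_region_risks(inventory):
    print(f"{service} depends on single-region {dependency}")
```

Each pair it reports is a candidate to mirror or to document as an accepted tradeoff. Following transitive dependencies matters here because the risky component is often two hops away from the service you think of as multi-region.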

 

Start with quick wins

  • DNS failover health checks. In AWS, create Route 53 health checks for Tier‑0 endpoints; set TTL ≤30s, and test cutover monthly.
  • Front Door probes. Azure’s Front Door health probes identify situations where an origin is unavailable or unhealthy. Configure health probes and validate priority failover behavior every month.
  • Secondary region reality check. Can your app boot cold in Region B without relying on Region A’s control plane? Prove it with a "+region" chaos day.
  • Traffic Manager priority routing. In Azure, use priority profiles for critical DNS names; document TTLs, and rehearse cutovers.
  • State & secrets redundancy. Replicate your state backend and secrets store across regions; publish the RPO.
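
To reason about the DNS failover and priority routing items above before a rehearsal, it can help to model the decision logic locally. The following is a minimal sketch, not any provider's actual algorithm: the `Endpoint` class, the consecutive-failure rule, and the cutover estimate are simplifying assumptions, and the numbers used are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    # Hypothetical model of one routing target.
    name: str
    priority: int              # lower number = preferred
    recent_checks: list[bool]  # probe results, newest last; True = success


def is_healthy(ep: Endpoint, failure_threshold: int = 3) -> bool:
    """Treat an endpoint as unhealthy only after `failure_threshold`
    consecutive failed probes (an assumed rule, to avoid flapping)."""
    tail = ep.recent_checks[-failure_threshold:]
    return not (len(tail) == failure_threshold and not any(tail))


def pick_endpoint(endpoints: list[Endpoint],
                  failure_threshold: int = 3) -> Optional[Endpoint]:
    """Return the healthy endpoint with the best priority, or None."""
    healthy = [e for e in endpoints if is_healthy(e, failure_threshold)]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.priority)


def worst_case_cutover_seconds(check_interval_s: int,
                               failure_threshold: int,
                               dns_ttl_s: int) -> int:
    """Rough upper bound: time to detect the failure plus one full TTL
    during which resolvers may still serve the old record. Ignores probe
    timeouts and resolvers that do not honor the TTL."""
    return check_interval_s * failure_threshold + dns_ttl_s


primary = Endpoint("region-a", 1, [True, False, False, False])
secondary = Endpoint("region-b", 2, [True, True, True])

chosen = pick_endpoint([primary, secondary])
print(chosen.name if chosen else "no healthy endpoint")
print(worst_case_cutover_seconds(30, 3, 30), "seconds")
```

With a 30-second probe interval, three failures required, and a 30-second TTL, the estimate is 120 seconds, which is a useful number to compare against your RTO and against what you actually observe in the monthly cutover test.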

New Articles — Top Picks

Atlantis with Terragrunt – Automate Terraform Workflows

When you write Terraform configurations at scale for large cloud infrastructures, it can be tedious to keep things like provider versions up to date across all Terraform configurations. However, an entire ecosystem of tools has grown around Terraform to help you run it at scale. In this article, Mattias Fjellström explains how to run Atlantis with Terragrunt.

 

--> Read the article

GitOps at Scale: Strategies for Enterprise Adoption & Growth

GitOps is easily configured for simple projects in small environments, but adding more repositories, deployment targets, and team members makes it harder to operate successfully. In this guide, James Walker explores these challenges, along with best practices for achieving GitOps success at scale.

 

--> Read the article

DevOps ROI: Why it Matters & How to Measure

The benefits of an effective DevOps implementation might not be apparent immediately, so you need a clear plan for measuring your DevOps return on investment (ROI). In this guide, James Walker explores the concept of DevOps ROI and shares techniques and best practices for measuring ROI using specific DevOps metrics.

 

--> Read the article

Top 12 Policy as Code (PaC) Tools in 2025

Policy as code (PaC) applies declarative definitions, version control, testing, and automation to organizational rules and guardrails. Policies are written in code-like languages, stored in Git, and enforced automatically in pipelines or runtime environments. In this article, Mariusz Michalowski explores some of the most widely used policy-as-code tools and platforms.

 

--> Read the article

Guide to Ansible Facts and Fact Gathering

For Ansible to perform tasks such as conditionally installing packages based on the OS, dynamically configuring services with different names across distributions, or templating configuration files with system-specific paths and values, it needs accurate fact gathering. In this article, Divine Odazie discusses Ansible fact gathering, how to use those facts, and when to use them.

--> Read the article

Accelerate Your Spacelift Journey: Introducing the Spacelift Accelerator for Rapid PoC Delivery

Spacelift can transform your infrastructure management, but platform engineers, DevOps teams, and SREs often grapple with how to quickly establish a well-architected Spacelift foundation that showcases the platform’s capabilities while following best practices from the outset. In this article, guest author Maciej Socha, DevOps Engineer at Semantive, describes how leveraging the Spacelift Core Config Accelerator that Semantive devised can reduce setup time from four weeks to just three to five days.

 

--> Read the article

Read more blog posts

Events 📌

KubeCon + CloudNativeCon North America 2025

November 10 – 13, 2025

Atlanta, Georgia

 

Join us at KubeCon + CloudNativeCon North America 2025 and discover how you can bring order to your infrastructure.

 

Here’s where you’ll find us:

  • Booth #541 every day
  • OpenTofu Day (Monday, November 10) — listen to Christian Mesh, OpenTofu Technical Lead, give practical insights into performance pitfalls and profiling techniques to optimize your IaC.
  • IaCConf Connect Atlanta (Monday after hours) — kick back with other IaC fans, learn, and shape the future of infrastructure as code.

Read the KubeCon + CloudNativeCon North America 2025 Guide for all the details.

AWS re:Invent 2025

December 1 – 5, 2025

Las Vegas, Nevada

 

Will you be in Vegas for the cloud event of the year? Join us at AWS re:Invent 2025 and see why Spacelift is trusted by so many platform engineering and DevOps leaders.

 

Here’s where we’ll be:

  • Booth #1533 every day
  • The Infrastructure Grand Prix (Tuesday after hours) — Join us and our friends at Cortex.io, Teleport, Incident.io, Glean, and Carahsoft at F1 Arcade Las Vegas for the Speed & Control Experience! Enjoy F1 sim racing, networking, and refreshments.
  • Spacelift + Datadog (Wednesday after hours) — We’re pairing up with Datadog to talk visibility, control, and scaling your infrastructure, all in a relaxed, invite-only setting.

Register and find out more here.

Missed Last Month’s Edition?

If you didn’t get a chance to read our previous issue, don’t worry—we’ve got you covered. Each edition of Mission Infrastructure is packed with insights, industry trends, and practical guides you won’t want to miss.

Catch up on Mission Infrastructure

Product Updates 🚀

Introducing Cross-Space Module Sharing to Spacelift

Spacelift is bringing cross-space module sharing to the platform. Once you create a module, it becomes available automatically in its origin space and all descendants by default. Now, you can select your desired availability across spaces, which enables increased module portability and makes scaling and working across teams even easier. Available on all paid plans.

--> Learn more

Introducing the Upgraded Navigation Experience in Spacelift

The navigation menu in the Spacelift platform has been overhauled. Now, instead of including every menu item on one list, we have collapsed them into groups that are easier to browse.

--> Learn more

    Get Started with Spacelift

    Ready to start your journey to IaC orchestration? Schedule a demo or explore on your own with a Spacelift free trial.

    Book a demo
    Start for free

    Spacelift, Inc., 541 Jefferson Ave STE 100, Redwood City, California 94063