Technology
February 8, 2023

From Fighting Fires to Availability Assurance

Written by
Fawad Khaliq

In November 2020, I woke up at 5 a.m. to the blaring sound of the Amazon pager. As a service owner at Amazon EKS, I knew this couldn’t be good, and every page I received over the next two hours was a harbinger of impending catastrophe. When I logged into the systems, I saw that some of our customers were unable to reach our service. The cause was a software defect, and a completely avoidable one.

And what’s frustrating is that these failures are not one-time events. I have experienced hundreds of incidents in which our customers (DevOps, SREs, Platform Engineers, …) relive avoidable failures over and over, with no effective way to learn from each other’s errors and stop them from happening again.

I kept thinking there’s got to be a better way. I noticed that the aviation and security industries have solved the problem of learning from each other: they collectively learn from failures and implement mechanisms to ensure those failures never happen again. I thought it was time we did the same for the software industry, so I left Amazon and founded Chkk to avoid the repetition of previously encountered failures and eliminate the operational pain of the incidents they cause.

How?

I believe reactive approaches to infrastructure availability will not be sufficient in the coming years, and will be complemented by a new discipline, availability assurance, which prevents downtime instead of fighting fires. We need a way to 1/ collectively learn and codify knowledge from developer communities and past incidents in a structured, authoritative, and interoperable catalog for all types of infrastructure, 2/ prevent the introduction of new Availability Risks in all infrastructure deployment tools, and 3/ remediate Availability Risks latent in your infrastructure before they become incidents.
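
To make these three steps a little more concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: the `AvailabilityRisk` type, the `EXAMPLE-…` identifiers, the two catalog entries, and the dict shape used for a deployment are all hypothetical, and none of it describes how Chkk or any real catalog is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# A deployment is modeled as a plain dict. This shape is an assumption
# made for illustration, not a real tool's schema.
Deployment = Dict[str, Any]


@dataclass(frozen=True)
class AvailabilityRisk:
    """One catalog entry: a known failure pattern learned from a past incident."""
    risk_id: str                            # hypothetical identifier scheme
    summary: str
    matches: Callable[[Deployment], bool]   # True when a deployment exhibits the risk
    remediation: str


# Step 1: codify knowledge in a structured catalog (entries here are made up).
CATALOG: List[AvailabilityRisk] = [
    AvailabilityRisk(
        risk_id="EXAMPLE-0001",
        summary="A single replica leaves no headroom during node drains",
        matches=lambda d: int(d.get("replicas", 1)) < 2,
        remediation="Run at least two replicas",
    ),
    AvailabilityRisk(
        risk_id="EXAMPLE-0002",
        summary="Image pinned to the mutable 'latest' tag",
        matches=lambda d: str(d.get("image", "")).endswith(":latest"),
        remediation="Pin the image to an immutable version tag",
    ),
]


def find_risks(deployment: Deployment) -> List[AvailabilityRisk]:
    """Return every catalog entry that the deployment matches."""
    return [risk for risk in CATALOG if risk.matches(deployment)]


def gate_change(planned: Deployment) -> bool:
    """Step 2 (prevent): allow a planned change only if it matches no known risk."""
    return not find_risks(planned)


def scan_running(running: List[Deployment]) -> Dict[str, List[str]]:
    """Step 3 (remediate): map each risky running deployment to suggested fixes."""
    report: Dict[str, List[str]] = {}
    for deployment in running:
        risks = find_risks(deployment)
        if risks:
            report[str(deployment["name"])] = [r.remediation for r in risks]
    return report
```

In this sketch, `gate_change` would sit in a deployment pipeline to block a risky change before it ships, while `scan_running` would be run against what is already deployed. The point is that both reuse the same shared catalog, so a lesson learned once is applied everywhere.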

Why now?

Cloud providers, SaaS vendors, and open-source communities provide the building blocks of modern applications. Most incidents occur at the interconnects of these building blocks, and the responsibility for infrastructure availability has shifted from vendors to the DevOps, SRE, and Infrastructure/Platform teams inside each organization. These are lean teams of hyper-specialized developers chartered to ensure the availability, performance, and scaling of their respective infrastructure blocks. Talent is scarce, and 80% of their time is spent on operations and maintenance. A new way of solving this problem is required.

Mental models have already shifted. Developers go to community forums to learn and share knowledge about incidents, remediations and optimizations. Collaboration for risk resolution is already happening through standardized schemas/taxonomies in other verticals (CVE, SLSA, …).

Building blocks are now available. IaC deployments have been standardized through a handful of popular tools (Terraform, CDK, GitOps, …). Enforcement abstractions are available at all layers (eBPF, Admission Controllers, Policy Engines, …). And adoption of tools for availability risk discovery, detection, remediation, and prevention is now attainable. While “there is no compression algorithm for experience”, proven mechanisms and hyper-specialized tools are available to catalyze a culture of availability assurance, learning, and collaboration across infrastructure teams.

Where to go from here?

Imagine if all DevOps, SREs, and platform engineers could learn from each other and avoid repeating the same failures. Wouldn’t that be a great world to live in? To achieve this, we must build a network of infrastructure developers, cloud providers, software vendors, and open-source communities who collectively learn from each other to propagate a culture of availability assurance. I believe that’s how everyone will operate infrastructure. I’m super excited about how Collective Learning will transform how we think about availability. Read Ali’s blog for details. I invite you to join us on this exciting journey and sign up for early access.
