October 25, 2023

Launching Chkk Operational Safety Platform

Written by
Awais Nemat

Today, my team and I are excited to publicly launch the Chkk Operational Safety Platform. We want to thank our early customers and design partners, who have worked closely with us since February, when we opened our waitlist to anyone interested in proactively addressing infrastructure errors, disruptions, and failures.

I'm humbled that our customers love Chkk, which is already used by enterprises across various industry verticals. Thank you for helping us validate our thesis and develop our product.

Our core thesis

While working at AWS, we observed a recurring pattern: different enterprises, at different points in time, experienced the same errors, failures, and disruptions due to the same root causes. Every company reactively responded to the same set of issues that other companies had already dealt with. There was no easy way for any of them to find out, a priori, about known Operational Risks lurking in their infrastructure that can trigger incidents leading to downtime. We realized that there was an opportunity to help our future customers.

Our thesis was: 

  1. Customers care about availability and want to proactively prevent errors rather than wait until after the impact, which wastes time and effort, and risks reputation and credibility.
  2. If an Operational Risk has already materialized into a disruption somewhere in the world, it is highly likely that it will materialize over and over again in many enterprises, and cause operational pain and loss.
  3. Customers want to learn and not repeat a mistake that has already caused others harm, but there is no simple, automated, and trusted way for them to learn from each other and avoid known risks.

That's where Chkk comes in.

We took inspiration from cybersecurity, where security vulnerabilities are reported publicly, and came up with this simple idea: If there's any error, failure, or disruption that has happened anywhere in the world, we will learn about it. We’ll convert it into a Risk Signature, similar to a virus signature, and then we will stream it to all our customers, where it will be scanned in their environments. That way, our customers can proactively detect, identify, and remediate Operational Risks before they cause disruptions, much like antivirus software detects and removes viruses before they start causing harm.

With Chkk, our customers learn about Operational Risks from an authoritative source and proactively prevent these incidents from happening altogether.

How the Chkk Operational Safety Platform works

Our first product is a SaaS offering designed for organizations running mission-critical applications on Kubernetes infrastructure. We help them reduce Operational Risks, prevent errors and disruptions, and operate Kubernetes safely and efficiently. Not only do we identify and prioritize risks, we also provide Preverified Upgrade Plans to our customers, so they can cut weeks of preparation down to days, and safely remediate these risks without worrying about the intricate interdependencies involved in fixing these issues.

There are three distinct modules in the Chkk Operational Safety Platform.

Upgrade Copilot is especially valuable for Platform, DevOps, and SRE Engineers responsible for planning and executing infrastructure upgrades. We provide Preverified Upgrade Plans containing a detailed sequence of steps that need to be executed for remediation. We can then optionally run the prescribed steps on a digital twin of the customer's infrastructure to validate that the plan works as expected. This significantly reduces the time and effort required for planning these upgrades and also derisks the execution of this critical task for our customers.

Artifact Register maintains an inventory of all components, container images, repositories, and tools across multiple clusters and clouds. It gives our customers visibility into what exists where, reducing the need for the manual, error-prone tracking in spreadsheets and scripts that they rely on today.
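To make the idea of a cross-cluster inventory concrete, here is a minimal Python sketch. It is an illustration only, not how Artifact Register is implemented: the `Artifact` record, the `build_inventory` and `version_drift` helpers, and the cluster names and versions are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """One observed component in one cluster (hypothetical record shape)."""
    cluster: str
    name: str
    version: str


def build_inventory(artifacts):
    """Map each component name to {version: sorted list of clusters running it}."""
    grouped = defaultdict(lambda: defaultdict(set))
    for artifact in artifacts:
        grouped[artifact.name][artifact.version].add(artifact.cluster)
    return {
        name: {version: sorted(clusters) for version, clusters in versions.items()}
        for name, versions in grouped.items()
    }


def version_drift(inventory):
    """Return the components that run at more than one version across the fleet."""
    return sorted(name for name, versions in inventory.items() if len(versions) > 1)


# Illustrative data only; these clusters and versions are made up.
observed = [
    Artifact("prod-us", "cert-manager", "1.12.0"),
    Artifact("prod-eu", "cert-manager", "1.13.2"),
    Artifact("prod-us", "coredns", "1.10.1"),
    Artifact("prod-eu", "coredns", "1.10.1"),
]
inventory = build_inventory(observed)
drifted = version_drift(inventory)
```

With this sample data, `drifted` contains only `cert-manager`, because it runs at two different versions across the two clusters. That is the kind of question a spreadsheet answers slowly and an inventory answers immediately.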

Risk Ledger is similar to security risk ledgers, but tailored specifically towards identifying contextualized Operational Risks within Kubernetes infrastructures. It enables our customers to become proactive in addressing potential failures before they happen.

All modules of Chkk seamlessly integrate with existing workflows and tools (IaC, packaging, deployment, monitoring, ticketing, and alerting) and simplify existing operational processes.

Powered by Collective Learning

Many of our customers ask us: how do you learn about all the issues and Operational Risks? How do you make sure that your remediations and Upgrade Plans are safe to execute? How do you manage these intractable problems? What’s the magic?

The magic is our Collective Learning Technology.

At the heart of Collective Learning is the Risk Signature Database, or RSig DB. Think of it as a CVE database for Availability Risks, along with a Knowledge Graph that captures all the relationships across different artifacts – issues, release notes, and any and all breaking changes.

On the backend, our technology continuously sources and populates this RSig DB and Knowledge Graph from multiple sources. First and foremost, we mine the internet for publicly available information – incidents, reports, tickets, issues, and discussions on internet forums. We scour everything where we can find a signal. Our research team then validates these candidates and converts them into programmatic signatures that can later be scanned and contextualized against a customer’s infrastructure.
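As a rough illustration of what a "programmatic signature" could look like, the Python sketch below models a signature as an affected component plus a version range, and scans it against an inventory. This is an assumption made for the example, not the actual RSig DB format: the `RiskSignature` fields, the `RSIG-0001` identifier, the `scan` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskSignature:
    """Hypothetical signature: a known risk tied to a component version range."""
    sig_id: str
    component: str
    min_version: tuple  # inclusive lower bound, e.g. (1, 12, 0)
    max_version: tuple  # exclusive upper bound, e.g. (1, 13, 0)
    summary: str


def parse_version(text):
    """Turn '1.12.3' or 'v1.12.3' into (1, 12, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def scan(signatures, inventory):
    """Return (cluster, sig_id) for every inventory entry a signature matches.

    `inventory` is an iterable of (cluster, component, version string) tuples.
    """
    findings = []
    for cluster, component, version in inventory:
        parsed = parse_version(version)
        for sig in signatures:
            in_range = sig.min_version <= parsed < sig.max_version
            if sig.component == component and in_range:
                findings.append((cluster, sig.sig_id))
    return findings


# Illustrative data only; the signature and clusters are made up.
signatures = [
    RiskSignature("RSIG-0001", "cert-manager", (1, 12, 0), (1, 13, 0),
                  "hypothetical known issue in the 1.12 line"),
]
inventory = [
    ("prod-us", "cert-manager", "v1.12.3"),
    ("prod-eu", "cert-manager", "1.13.2"),
    ("prod-us", "coredns", "1.10.1"),
]
findings = scan(signatures, inventory)
```

Here only `prod-us` is flagged, since its cert-manager version falls inside the affected range. A real signature would need far more context than a version range, such as configuration, cloud provider, and interactions between add-ons, which is the "contextualized" part described above.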

We also ingest release notes, breaking changes, and bug report feeds from Kubernetes add-on vendors and open-source projects into our RSig DB and the Knowledge Graph. And of course, we also learn from our users. We continuously add these learnings to our RSig DB and Knowledge Graph, which become more valuable for our customers over time.

All Chkk modules use the RSig DB and Knowledge Graph to identify and prioritize risks, and locate them with pinpoint accuracy within a Kubernetes fleet. We also use them to create and preverify the Upgrade Plans that our customers use to remediate these issues.

We’ve taken time and care to build and fine-tune the technology to prioritize and address the right risks. Our customers appreciate that we offer concise, actionable plans to resolve the most critical risks, rather than burdening them with an exhaustive list of unnecessary ones.

A bright future ahead

In order to build a future powered by Collective Learning, Chkk has raised $5.2 million in seed funding from angels and VCs led by Sequoia Capital. We are grateful that Sequoia believes in our mission and is joining us in democratizing the wisdom of operating software at scale.

We have built the Chkk Operational Safety Platform for our customers running mission-critical apps on Kubernetes infrastructure. It helps Platform, DevOps, and SRE teams proactively manage and remediate risks, execute safe upgrades, eliminate wasted effort, and accomplish more with fewer resources. 

The Chkk Operational Safety Platform is available today – it installs in minutes and integrates into your existing tools and workflows. Please sign up to get started.
