Certificate Outages & How to Prevent Them

An expired certificate is all it takes to bring down a critical service. From Microsoft Teams to Spotify, some of the world's largest platforms have suffered outages caused by a single forgotten certificate. Understanding why these incidents happen is the first step toward making sure they never happen to you.

Overview

Certificate outages are not a theoretical risk. They happen to the largest, most well-resourced technology companies in the world, and they happen with alarming regularity. Here are three examples that made headlines.

In February 2020, Microsoft Teams went down for several hours because an authentication certificate had expired. Millions of users were unable to sign in, collaborate, or access their files during a period when remote work was becoming critical. The root cause was straightforward: a certificate renewal was missed.

In 2017, Equifax suffered one of the most consequential data breaches in history. While the breach itself was caused by an unpatched vulnerability, investigators later revealed that an expired certificate on a network monitoring tool had left the company blind to the attack for 76 days. With the certificate expired, the inspection device could not examine encrypted traffic, so it failed to detect the exfiltration of personal data belonging to roughly 147 million people.

In 2020, Spotify experienced a global outage that lasted about an hour, traced back to an expired TLS certificate. Users could not stream music, and the brand suffered reputational damage that far exceeded the cost of a simple renewal.

These incidents share a common pattern: a critical certificate expired, no one noticed in time, and the resulting outage had consequences far beyond what a simple renewal would have cost. This chapter explores why these outages happen and, more importantly, how to prevent them.

Key Steps

The Certificate Expires

Every certificate has a "Not After" date. When that date passes, the certificate is no longer valid. This is by design: limited validity periods reduce the window of exposure if a private key is compromised. But it also means that every certificate is a ticking clock that requires timely renewal.
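As a minimal sketch of what "ticking clock" means in practice, the snippet below computes how many days remain before a given "Not After" value. It assumes the date is in the text form that Python's standard `ssl` module reports for `notAfter`; the function name `days_until_expiry` and the sample dates are illustrative only.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days left before `not_after`; a negative value means expired.

    `not_after` is expected in the form "Jan 31 00:00:00 2025 GMT",
    and `now` must be a timezone-aware datetime.
    """
    # ssl.cert_time_to_seconds converts the certificate time string
    # into seconds since the Unix epoch (UTC).
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).total_seconds() / 86400


# Illustrative values, not taken from a real certificate.
reference = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(days_until_expiry("Jan 31 00:00:00 2025 GMT", reference))
print(days_until_expiry("Dec 22 00:00:00 2024 GMT", reference))
```

A negative result is the condition that triggers everything described in the next steps.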

Connections Start Failing

When a client (browser, API consumer, device) encounters an expired certificate, it refuses to establish a secure connection. The TLS handshake fails, and the client displays an error or silently drops the connection. If the certificate secures a load balancer, proxy, or API gateway, the impact cascades to every service behind it.

The Scramble Begins

Operations teams are alerted (often by end users, not by monitoring). The first challenge is diagnosis: certificate errors can look like network failures, DNS issues, or application bugs. Once the expired certificate is identified, the team must locate it, generate or obtain a replacement, and deploy it to every affected system.
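Because the diagnosis step is where time is usually lost, a small probe that separates certificate failures from DNS and network failures can help. The following is a sketch using only Python's standard library; the category names and the `classify_failure` and `probe` helpers are hypothetical, and a real environment would likely need more cases.

```python
import socket
import ssl

# OpenSSL's verification error code for "certificate has expired".
X509_V_ERR_CERT_HAS_EXPIRED = 10


def classify_failure(exc: BaseException) -> str:
    """Map a connection exception to a coarse failure category."""
    if isinstance(exc, ssl.SSLCertVerificationError):
        if getattr(exc, "verify_code", None) == X509_V_ERR_CERT_HAS_EXPIRED:
            return "certificate-expired"
        return "certificate-invalid"
    if isinstance(exc, socket.gaierror):
        return "dns"  # the hostname could not be resolved
    if isinstance(exc, (ConnectionError, TimeoutError, socket.timeout)):
        return "network"  # refused, reset, or timed out
    return "unknown"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake and report what went wrong, if anything."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return "ok"
    except OSError as exc:
        return classify_failure(exc)
```

Calling `probe("service.example.com")` against a suspect endpoint (the hostname here is a placeholder) returns one of the category strings, which gives the on-call engineer a first hint about where to look.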

Service Is Restored

The new certificate is deployed and services recover. Depending on the complexity of the environment, this can take anywhere from minutes to hours. The post-mortem reveals what everyone already suspected: the renewal was missed because no one was tracking it, or the alert was sent but went to the wrong person, or the certificate was not in any inventory at all.

Continuous Monitoring & Alerting

Deploy monitoring that continuously checks every certificate's expiration date and sends escalating alerts as the deadline approaches. Alerts should go to the certificate owner, their manager, and a central operations team. Multi-channel notifications (email, Slack, PagerDuty) reduce the chance that an alert goes unnoticed. Monitoring should cover not just the certificates you know about, but the entire network, which requires regular discovery scans.
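Escalation can be expressed as a simple lookup from days remaining to a severity and a recipient list. The sketch below is one possible shape; the 30/14/7-day thresholds, the severity names, and the recipient roles are assumptions chosen for illustration, not recommended values.

```python
from typing import List, Optional, Tuple

# Hypothetical escalation policy: (days remaining threshold, severity, recipients).
# Ordered from most urgent to least urgent so the first match wins.
ESCALATION_TIERS = [
    (7, "critical", ["owner", "manager", "operations"]),
    (14, "warning", ["owner", "manager"]),
    (30, "notice", ["owner"]),
]


def alert_for(days_remaining: float) -> Optional[Tuple[str, List[str]]]:
    """Return (severity, recipients) for a certificate, or None if no alert is due."""
    for threshold, severity, recipients in ESCALATION_TIERS:
        if days_remaining <= threshold:
            return severity, recipients
    return None


for days in (45, 20, 10, 3, -1):
    print(days, alert_for(days))
```

Widening the audience as the deadline approaches is what keeps a single missed email from turning into an outage.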

Automation with ACME and CLM

The most reliable way to prevent an expiration outage is to remove humans from the renewal process entirely. Protocols like ACME (Automatic Certificate Management Environment, RFC 8555) enable fully automated certificate issuance and renewal. A certificate lifecycle management (CLM) platform orchestrates this automation at scale, handling the entire certificate lifecycle from request to deployment to renewal without manual intervention.
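Automated renewal ultimately rests on a rule that decides when a certificate is due. The sketch below shows one such rule: renew once less than a given fraction of the validity period remains. The one-third default is an assumption made for this example, and the function does not talk to an ACME server; it only illustrates the scheduling decision that a renewal job would make.

```python
from datetime import datetime, timedelta, timezone


def should_renew(
    not_before: datetime,
    not_after: datetime,
    now: datetime,
    remaining_fraction: float = 1 / 3,  # assumed policy, adjust to your own
) -> bool:
    """Return True when the remaining validity is at or below the chosen fraction."""
    lifetime = not_after - not_before
    remaining = not_after - now
    return remaining <= lifetime * remaining_fraction


# Illustrative 90-day certificate.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
print(should_renew(issued, expires, issued + timedelta(days=30)))
print(should_renew(issued, expires, issued + timedelta(days=61)))
```

Tying the trigger to a fraction of the lifetime rather than a fixed number of days means the same rule still works if validity periods change.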

Ownership Mapping

Every certificate must have an assigned owner: a team or individual who is accountable for its renewal and maintenance. Ownership should be mandatory at the time of issuance and updated when staff changes occur. When ownership is clear, alerts reach the right person, and accountability eliminates the "I thought someone else was handling it" failure mode.
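One way to make ownership enforceable is to treat it as a required field in the inventory and to report any record that lacks it. The following sketch assumes a very small inventory model; the `CertificateRecord` fields and the `find_unowned` helper are hypothetical names, not part of any particular product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CertificateRecord:
    common_name: str
    owner_team: Optional[str]     # team accountable for renewal
    owner_contact: Optional[str]  # address that alerts are routed to


def _is_blank(value: Optional[str]) -> bool:
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or not value.strip()


def find_unowned(inventory: List[CertificateRecord]) -> List[CertificateRecord]:
    """Return every record that is missing a team or a contact."""
    return [
        record
        for record in inventory
        if _is_blank(record.owner_team) or _is_blank(record.owner_contact)
    ]


inventory = [
    CertificateRecord("api.example.com", "payments", "payments-oncall@example.com"),
    CertificateRecord("legacy.example.com", None, None),
]
for record in find_unowned(inventory):
    print("No owner assigned:", record.common_name)
```

Running a check like this at issuance time, and again whenever teams are reorganized, keeps alerts from being routed to nobody.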

Incident Runbooks

Even with the best prevention, organizations should have a documented runbook for certificate incidents. The runbook should specify how to identify a certificate outage, where to find the affected certificate, how to issue an emergency replacement, and how to deploy it. A well-rehearsed runbook can reduce mean time to recovery (MTTR) from hours to minutes.

How Evertrust Helps Prevent Certificate Outages

Complete visibility — Evertrust CLM discovers every certificate across your infrastructure, including the ones hidden in cloud environments, CDNs, and legacy systems that no spreadsheet has ever tracked.

Smart alerting — Configurable, escalating alerts ensure that expiring certificates are flagged well in advance. Notifications go to certificate owners with automatic escalation if action is not taken, so nothing slips through the cracks.

Automated renewal — Integrate with ACME, SCEP, EST, and native connectors to automate certificate renewal end to end. Certificates are renewed and deployed before expiration, with zero manual intervention required.

Expiration dashboards — Real-time dashboards show every certificate approaching expiration, organized by owner, environment, and criticality. Your operations team always knows exactly where risk exists.
