Post-Mortem on 19/20 March 2025 Outage

Jannis Fedoruk-Betschki

Today, from 16:56 UTC to 17:12 UTC, Magic Pages experienced a complete service outage that affected all customer websites. Here's what happened and how it was resolved.

What happened?

I was setting up a new backup location for all customer websites and needed to add new credentials to the storage server cluster that provides storage for all Ghost sites hosted in Magic Pages' Kubernetes infrastructure. Since all of the infrastructure is managed with Ansible as "Infrastructure as Code," I had to add these new credentials to a vault and redeploy the infrastructure.

👉
Ansible is an orchestration tool. It's kind of like a recipe: you define what you want your infrastructure to look like, and Ansible then "cooks" it.
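
In practice, that workflow looks roughly like this (a sketch only; the vault path and playbook name are illustrative, not the actual ones used at Magic Pages):

```shell
# Edit the encrypted vault and add the new storage credentials
# (the file path is illustrative)
ansible-vault edit group_vars/storage/vault.yml

# Redeploy the infrastructure with the updated secrets
# (the playbook name is illustrative)
ansible-playbook site.yml --ask-vault-pass
```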

What I failed to consider was that I had changed ISPs at my home office roughly two months ago, which gave me a new static IP address that no longer matched the one Ansible had on record. Because of how the security protocols were designed in Ansible, the deployment created a firewall configuration in which only the old IP address could access the server. I had effectively locked myself out of the system.
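
The failure mode boils down to an allowlist pinned to a single stale address. A minimal sketch (the IP addresses are illustrative, taken from documentation ranges):

```shell
# Illustrative allowlist check: only the IP Ansible had on record passes.
ALLOWED_ADMIN_IP="203.0.113.10"   # old home-office IP, still in the config

is_allowed() {
  if [ "$1" = "$ALLOWED_ADMIN_IP" ]; then
    echo "allowed"
  else
    echo "blocked"
  fi
}

is_allowed "203.0.113.10"   # old IP: allowed
is_allowed "198.51.100.7"   # new IP after the ISP change: blocked
```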

Simultaneously, the Ansible script was misconfigured and inadvertently deleted access for the entire Kubernetes cluster. This created a cascading effect:

  1. I couldn't access the system to fix it due to the firewall rules.
  2. The Kubernetes cluster lost its access permissions, so all Ghost sites went offline because they could no longer reach their files.
  3. The backend server that would normally handle such firewall adjustments "on the fly" was also locked out.
⚠️
Update, 20 March 2025, 03:55 UTC: this issue occurred again for a subset of websites (around 70) on 20 March 2025, around 00:10 UTC. This roughly coincides with the scheduled start of the storage server's first complete backup to the new location, for which all data needed to be copied, a job I underestimated. The backup temporarily impacted the connection to one of the Kubernetes nodes, which therefore could not serve sites. The issue was resolved around 03:05 UTC by removing the node and deploying a new one.

How did I fix it?

To resolve this issue, I had to:

  1. Use my hosting provider's rescue system to gain emergency access to the physical server.
  2. Temporarily disable the firewall to regain access.
  3. Connect to the server using normal credentials.
  4. Fix the Ansible configuration and firewall rules with proper IPs for both my workstation and the Kubernetes cluster.
  5. Redeploy the corrected configuration.
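
In command form, the recovery roughly amounted to the following (a hedged sketch; the firewall tooling and playbook name are assumptions, not the actual setup):

```shell
# From the provider's rescue environment, chrooted into the server's
# root filesystem: temporarily disable the host firewall so normal SSH
# access works again (assuming ufw; the actual tooling may differ).
ufw disable

# Then, over a normal SSH session: fix the recorded IPs in the Ansible
# configuration and redeploy it (the playbook name is illustrative).
ansible-playbook site.yml --ask-vault-pass
```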

After implementing these steps, all Magic Pages websites were back online within 16 minutes of the initial outage.

What could be improved?

This incident highlighted one critical area for improvement: IP address management. The Ansible configuration needs a more dynamic way to handle IP changes, especially for administrative access. The separation between Kubernetes access (adjusted "on the fly", as those IPs change often) and administrative access must be rethought and improved.

Future Prevention

I am implementing several improvements to prevent similar issues:

  1. Creating a dedicated VPN for administrative access that isn't dependent on static IPs.
  2. Drawing up a new process that ensures Ansible knows about all Kubernetes IP addresses that must retain access after a deployment.
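
With a VPN in place, firewall rules can allow a stable tunnel subnet instead of an ever-changing public IP. A WireGuard-style config fragment as a sketch (all keys, addresses, and ports are placeholders, not the actual setup):

```ini
# /etc/wireguard/wg0.conf on the server (all values are placeholders)
[Interface]
Address = 10.8.0.1/24
ListenPort = 51820
PrivateKey = <server-private-key>

[Peer]
# Admin workstation: identified by its key, not by its public IP,
# so an ISP change no longer locks the administrator out.
PublicKey = <admin-public-key>
AllowedIPs = 10.8.0.2/32
```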

This outage, while brief, serves as a good reminder of how even routine infrastructure changes can have unexpected consequences when security configurations aren't properly managed.

About Jannis Fedoruk-Betschki

I'm the founder of Magic Pages, providing managed Ghost hosting that makes it easy to focus on your content instead of technical details.
