Post-Mortem on 18 July 2024 outage

Today, on 18 July 2024, from approximately 15:15 GMT to 15:50 GMT, Magic Pages had an outage that affected customer websites. I want to quickly share what happened.

What happened?

At Magic Pages I have done quite a lot of work in the last few months to make sure that all systems are triple-replicated, so an outage of a single server – or even two – isn't an issue.

At 15:15 GMT I received an email from an upstream server provider that one of the servers that manage some customer sites had crashed. Nothing major. To be fair, this happens once or twice a month. Usually, the two backups kick in and nobody notices.

This is also what I expected today. Yet, after a few minutes, I noticed that things did not go back to normal. To understand why, I need to dig a little bit into the networking that's going on behind the scenes.

The three "management servers" Magic Pages uses all have their own IP address. However, to the outside world they expose only one – a so-called "failover IP". The idea is simple: when the server currently receiving traffic crashes, the failover IP quickly redirects traffic to another one.
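To make the idea concrete, here is a minimal sketch in Python. It is only an illustration: the class names `Server` and `FailoverIP` and the server names are made up for this example, and in reality the switch happens at the network level at the provider, not in application code.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str            # hypothetical server name
    healthy: bool = True


class FailoverIP:
    """Toy model: one public IP that points at exactly one server at a time."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.active = self.servers[0]

    def route(self):
        # Only switch when the currently active server is down.
        if not self.active.healthy:
            replacement = next((s for s in self.servers if s.healthy), None)
            if replacement is None:
                raise RuntimeError("no healthy server left")
            self.active = replacement
        return self.active.name
```

With three servers `a`, `b` and `c`, `route()` keeps returning `a` until `a` is marked unhealthy, and then moves on to `b`. Note that this toy model assumes the public IP itself is always reachable.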

This mechanism failed today (yes, I see the irony of a "failover IP" failing…).

What did I do to fix this?

I called the on-call staff at the data center – and, unfortunately, they did not have good news for me. One of the network switches had failed, which made certain IP ranges inaccessible.

So…there was nothing I personally could do, really. When the failover IP itself is unreachable, there is nothing to switch over to.

The staff at the data center were quite quick to replace the faulty network switch, though. They hooked it up and restarted the system within 15 minutes. Things then came back bit by bit.

What will I do to avoid that in the future?

Well, to be quite frank with you…there isn't too much I can do in a situation like this. Hardware will always fail at some point. Even the best backup will take a couple of minutes to kick in.

Now, tech-savvy readers might ask a valid question: why not use a so-called "load balancer" instead of a static "failover IP"? The difference is simple: instead of routing traffic to a single server and switching over in case of an outage, a load balancer automatically routes traffic to ALL servers.

And hey, I am currently working on switching from failover IPs to load balancers. But here's the thing: if an IP is unreachable, it's unreachable. It doesn't matter whether it's set up to route traffic to a single server or to three. Having a load balancer instead of a failover IP would not have changed the situation here, in my eyes.
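A small sketch of that argument, again in Python and again only a simplified model: the function names, the `ip_reachable` flag and the server names are assumptions made up for illustration, not how the actual network is configured.

```python
def failover_target(ip_reachable, health):
    """Failover IP: traffic goes to the first healthy server, if the IP is reachable.

    `health` maps a (hypothetical) server name to True/False.
    """
    if not ip_reachable:
        return None  # nothing behind the IP can be reached
    return next((name for name, ok in health.items() if ok), None)


def load_balancer_targets(ip_reachable, health):
    """Load balancer: traffic is spread over all healthy servers, if the IP is reachable."""
    if not ip_reachable:
        return []  # same dead end: the entry point itself is gone
    return [name for name, ok in health.items() if ok]
```

With one crashed server and a reachable IP, the failover IP picks one healthy server and the load balancer uses both remaining ones. With an unreachable IP, both functions return nothing, no matter how many servers are healthy.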

Update, 19 July 2024

Around 00:24 CEST on 19 July 2024, the same issue appeared again, this time leading to a longer outage because of a slow response from the data center. It was only resolved at 05:50 CEST.

Around 00:40 CEST I called the data center again to find out what the issue was. The on-call technicians confirmed that the same problem had appeared again. Unfortunately, this time the resolution took a lot longer, given that they were looking for a sustainable long-term solution.

Since I had not received an update from them within an hour, I decided to move the websites to a new infrastructure setup in a different data center.

Around 05:30 CEST this migration was finished for all sites, leaving none of them in the affected data center.
