Post-Mortem on the 15 January 2025 Outage

Today, from approximately 09:55 CET to 11:36 CET, Magic Pages experienced a database outage that affected customer websites. Here's what happened and how we resolved it.

What happened?

Magic Pages runs on a high-availability MySQL database cluster with three nodes. This setup ensures that even if one node fails, the service continues operating normally. Earlier today, one of the three nodes went down, and as designed, the other two nodes maintained operations without interruption.
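
To give a rough idea of what such a setup looks like, here is a minimal sketch of a health check, assuming the cluster is a MySQL InnoDB Cluster managed through MySQL Shell's AdminAPI; the hostnames and account are placeholders, not the actual production values:

    # Run inside MySQL Shell's Python mode (mysqlsh --py); the `shell` and
    # `dba` objects are provided by the shell, not by a plain Python interpreter.
    shell.connect("clusteradmin@mysql-node-0:3306")

    cluster = dba.get_cluster()

    # With three nodes, a healthy cluster reports something like:
    # "statusText": "Cluster is ONLINE and can tolerate up to ONE failure."
    print(cluster.status())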

The problem began when I attempted to reconnect the third node. The database server reported ambiguous errors about missing cluster metadata. This type of error typically indicates a cluster configuration issue, so I initially attempted to scale the cluster down to a single node to rebuild the metadata. However, this led to similar errors on the second node.
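
A rejoin attempt of this kind looks roughly like the following, assuming the same MySQL Shell setup as above; the failed node's address is a placeholder:

    # Still inside mysqlsh --py, connected to one of the healthy nodes.
    cluster = dba.get_cluster()

    try:
        # Ask the cluster to take the recovered node back in.
        cluster.rejoin_instance("clusteradmin@mysql-node-2:3306")
    except Exception as err:
        # In this incident, the call did not complete cleanly but failed with
        # ambiguous errors about missing cluster metadata.
        print("Rejoin failed:", err)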

After investigation, I discovered the root cause: the Kubernetes infrastructure had an issue with its Persistent Volume Claims (PVCs). While the PVCs appeared to exist, the storage behind them could not provision new volumes for the database nodes that needed to be rebuilt.
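
As a sketch of how this kind of problem shows up, the official Kubernetes Python client can list the claims and their state; the namespace below is a placeholder:

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    core = client.CoreV1Api()

    for pvc in core.list_namespaced_persistent_volume_claim("mysql").items:
        # A claim that exists but is stuck in "Pending", or is bound to a volume
        # the storage backend can no longer serve, points at a provisioning problem.
        print(pvc.metadata.name, pvc.status.phase, pvc.spec.volume_name)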

How did we fix it?

The resolution involved several steps. First, I had to restart the storage server to resolve the PVC provisioning issues. This only took a few seconds.

Once the storage was operational, it took approximately 45 minutes for the MySQL nodes to restore and sync with each other, which was necessary for rebuilding the cluster metadata.
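
For readers curious what "syncing" means here: MySQL Router normally sits in front of a Group Replication group, and the recovery can be watched by polling each member's state. The snippet below is a sketch under that assumption, with placeholder connection details:

    import time
    import mysql.connector

    conn = mysql.connector.connect(host="mysql-node-0", user="clusteradmin",
                                   password="placeholder")
    cur = conn.cursor()

    while True:
        cur.execute(
            "SELECT MEMBER_HOST, MEMBER_STATE "
            "FROM performance_schema.replication_group_members"
        )
        members = cur.fetchall()
        # Recovering nodes report RECOVERING while they replay the transactions
        # they missed; the cluster is healed once every member is ONLINE.
        if members and all(state == "ONLINE" for _, state in members):
            break
        time.sleep(30)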

Finally, I needed to clean up the corrupted metadata and redeploy the MySQL Router, which is responsible for directing traffic to the appropriate database nodes.
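
To make this step more concrete, here is a rough sketch of what such a cleanup can look like with MySQL Shell's AdminAPI; this is illustrative only, and the cluster and deployment names are placeholders:

    # Inside mysqlsh --py, connected to one of the healthy nodes.
    # Drop the corrupted AdminAPI metadata schema...
    dba.drop_metadata_schema({"force": True})

    # ...and rebuild it by adopting the Group Replication group that is
    # already running, so the new metadata matches the live topology.
    cluster = dba.create_cluster("magicpages", {"adoptFromGR": True})
    print(cluster.status())

    # MySQL Router caches this metadata, so it gets redeployed afterwards to
    # pick up the fresh topology, e.g.:
    #   kubectl rollout restart deployment/mysql-router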

What worked well?

  1. Data Integrity: Despite the outage, no data was lost thanks to the high-availability setup.
  2. Partial Redundancy: The initial node failure was handled gracefully by the cluster.

What could be improved?

  1. Metadata Management: The cluster metadata proved to be a single point of failure, despite the redundant setup.
  2. Storage Monitoring: Better monitoring of PVC health could help prevent similar issues.
  3. Recovery Process: While effective, the 45-minute recovery time suggests room for optimization.
  4. Status Page: Magic Pages' status page did not immediately reflect the outage, even though it is set up to detect when customer websites fail.

Future Prevention

I am implementing several improvements:

  1. Enhanced monitoring of storage system health
  2. Improved failover procedures for the MySQL Router
  3. Fixes for the status page, so that it shows reliable information