Today, from approximately 09:55 CET to 11:36 CET, Magic Pages experienced a database outage that affected customer websites. Here's what happened and how we resolved it.
What happened?
Magic Pages runs on a high-availability MySQL database cluster with three nodes. This setup ensures that even if one node fails, the service continues operating normally. Earlier today, one of the three nodes went down, and as designed, the other two nodes maintained operations without interruption.
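Why a three-node cluster survives one failure but not two can be shown with a little arithmetic. The sketch below assumes a majority-based quorum, which is common in replicated database clusters. The post does not say which mechanism the Magic Pages cluster uses, and the function names are purely illustrative.

```python
def tolerated_failures(total_nodes: int) -> int:
    """Number of nodes that can fail while a strict majority remains."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    return (total_nodes - 1) // 2


def has_quorum(total_nodes: int, healthy_nodes: int) -> bool:
    """True when the healthy nodes form a strict majority of the cluster."""
    return healthy_nodes > total_nodes // 2


if __name__ == "__main__":
    # Assumption: 3 configured nodes, as described in the post.
    for healthy in (3, 2, 1):
        print(f"{healthy} of 3 healthy -> quorum: {has_quorum(3, healthy)}")
```

Under this assumption, two healthy nodes out of three still form a majority, while a single remaining node does not.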
The problem began when I attempted to reconnect the third node. The database server reported ambiguous errors about missing cluster metadata. Errors like this typically point to cluster configuration issues, so I first tried scaling down to a single node to rebuild the metadata. However, this led to similar errors on the second node.
After investigation, I discovered the root cause: the Kubernetes infrastructure had an issue with its Persistent Volume Claims (PVCs). While the PVCs appeared to exist, new volumes could not be provisioned for the database nodes that needed to be rebuilt.
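A basic check for this class of problem is to look at the phase each PVC reports. The sketch below parses the JSON that `kubectl get pvc --all-namespaces -o json` produces and lists every claim that is not `Bound`. The namespace and claim names in the sample are hypothetical. A check like this only catches claims stuck in a non-`Bound` phase, so it may not have caught the exact failure described here.

```python
import json


def unbound_pvcs(pvc_list_json: str) -> list:
    """Return (namespace, name, phase) for every PVC whose phase is not 'Bound'."""
    data = json.loads(pvc_list_json)
    problems = []
    for item in data.get("items", []):
        meta = item.get("metadata", {})
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            problems.append(
                (meta.get("namespace", "default"), meta.get("name", "<unnamed>"), phase)
            )
    return problems


# Hypothetical sample in the shape of `kubectl get pvc -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"namespace": "db", "name": "data-mysql-0"},
         "status": {"phase": "Bound"}},
        {"metadata": {"namespace": "db", "name": "data-mysql-2"},
         "status": {"phase": "Pending"}},
    ]
})

if __name__ == "__main__":
    for namespace, name, phase in unbound_pvcs(SAMPLE):
        print(f"{namespace}/{name} is {phase}")
```

In practice, the JSON would come from running the `kubectl` command on a schedule and alerting whenever the returned list is not empty.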
How did we fix it?
The resolution involved several steps. First, I had to restart the storage server to resolve the PVC provisioning issues. This took only a few seconds.
Once the storage was operational, it took approximately 45 minutes for the MySQL nodes to restore and sync with each other, which was necessary for rebuilding the cluster metadata.
Finally, I needed to clean up the corrupted metadata and redeploy the MySQL Router, which is responsible for directing traffic to the appropriate database nodes.
What worked well?
- Data Integrity: Despite the outage, no data was lost thanks to the high-availability setup.
- Partial Redundancy: The initial node failure was handled gracefully by the cluster.
What could be improved?
- Metadata Management: The cluster metadata proved to be a single point of failure, despite the redundant setup.
- Storage Monitoring: Better monitoring of PVC health could help prevent similar issues.
- Recovery Process: While effective, the 45-minute recovery time suggests room for optimization.
- Status Page: Magic Pages' status page did not immediately detect the issue, even though it is set up to detect when customer websites fail.
Future Prevention
I am implementing several improvements:
- Enhanced monitoring of storage system health
- Improved failover procedures for the MySQL Router
- Investigation and repair of the status page issues, so that it shows reliable information
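One way a status page can decide that a site is down is to probe it regularly and flag an outage after several consecutive failures. The sketch below shows that idea only. It is not how the Magic Pages status page works, and the threshold, the function names, and the rule that a 5xx response or a timeout counts as a failure are all assumptions.

```python
from typing import Optional, Sequence


def is_failure(status_code: Optional[int]) -> bool:
    """Treat timeouts/connection errors (None) and 5xx responses as failures."""
    return status_code is None or status_code >= 500


def outage_detected(recent_status_codes: Sequence[Optional[int]], threshold: int = 3) -> bool:
    """True when the most recent `threshold` probes all failed."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    if len(recent_status_codes) < threshold:
        return False
    return all(is_failure(code) for code in recent_status_codes[-threshold:])


if __name__ == "__main__":
    # Hypothetical probe history: one success followed by three failures.
    history = [200, 503, None, 502]
    print("outage" if outage_detected(history) else "ok")
```

A lower threshold detects outages sooner but produces more false alarms from single failed probes.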