Today, from 16:56 UTC to 17:12 UTC, Magic Pages experienced a complete service outage that affected all customer websites. Here's what happened and how it was resolved.
What happened?
I was setting up a new backup location for all customer websites and needed to add new credentials to the storage server cluster that provides storage for all Ghost sites hosted in Magic Pages' Kubernetes infrastructure. Since all of the infrastructure is managed by Ansible as "Infrastructure as Code," I had to add these new credentials to a vault and redeploy the infrastructure.
What I failed to consider was that I had changed ISPs at my home office approximately 2 months ago, resulting in a new static IP address that didn't match what Ansible had recorded. Because of how the security protocols were designed in Ansible, the deployment created a firewall configuration that allowed only the old IP address to access the server. I effectively locked myself out of the system.
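A pre-flight check along these lines could catch a stale address before a deployment runs. This is a minimal sketch in Python, not the actual Magic Pages tooling: the function names, the allowlist format, and the example addresses (taken from the reserved documentation ranges) are all assumptions, and how the current public IP is discovered is left to the caller.

```python
import ipaddress
from typing import Iterable


def is_allowed(current_ip: str, allowlist: Iterable[str]) -> bool:
    """Return True if current_ip falls inside any allowlisted address or CIDR."""
    addr = ipaddress.ip_address(current_ip)
    # strict=False accepts entries such as "203.0.113.10/24" with host bits set.
    return any(
        addr in ipaddress.ip_network(entry, strict=False) for entry in allowlist
    )


def preflight(current_ip: str, allowlist: Iterable[str]) -> None:
    """Abort before deploying if the admin's current IP would be locked out."""
    if not is_allowed(current_ip, allowlist):
        raise RuntimeError(
            f"{current_ip} is not in the firewall allowlist; "
            "deploying now would lock this workstation out."
        )


# Hypothetical values: the address recorded in the inventory vs. the new ISP's.
recorded_allowlist = ["203.0.113.10/32"]
new_home_ip = "198.51.100.7"
would_lock_out = not is_allowed(new_home_ip, recorded_allowlist)
```

Running `preflight(new_home_ip, recorded_allowlist)` as the first step of a deployment would stop with an error instead of applying the firewall rules.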
At the same time, a misconfiguration in the Ansible script inadvertently removed access for the entire Kubernetes cluster. This created a cascading effect:
- I couldn't access the system to fix it due to the firewall rules.
- The Kubernetes cluster lost its access permissions, so all Ghost sites went offline because they could no longer reach their files.
- The backend server that would normally handle such firewall adjustments "on the fly" was also locked out.
How did I fix it?
To resolve this issue, I had to:
- Use my hosting provider's rescue system to gain emergency access to the physical server.
- Temporarily disable the firewall to regain access.
- Connect to the server using normal credentials.
- Fix the Ansible configuration and firewall rules with proper IPs for both my workstation and the Kubernetes cluster.
- Redeploy the corrected configuration.
After implementing these steps, all Magic Pages websites were back online within 16 minutes of the initial outage.
What could be improved?
This incident highlighted one critical area for improvement, which can be summarised as IP address management. The Ansible configuration needs a more dynamic way to handle IP changes, especially for administrator access. The separation between Kubernetes access ("on the fly", as IPs change often) and administrative access must be rethought and improved.
Future Prevention
I am implementing several improvements to prevent similar issues:
- Creating a dedicated VPN for administrative access that isn't dependent on static IPs.
- Drawing up a new process that makes sure Ansible knows about all Kubernetes IP addresses that must retain access after a deployment.
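To illustrate the second point, here is a rough Python sketch of how an allowlist could be assembled from an administrative VPN range plus the current Kubernetes node IPs, and then compared against what is already deployed. Everything in it is hypothetical: the function names, the VPN range, and the node addresses are placeholders, and the real process may look quite different.

```python
import ipaddress
from typing import Iterable, List


def build_allowlist(admin_networks: Iterable[str], node_ips: Iterable[str]) -> List[str]:
    """Merge admin networks and node IPs into a sorted, de-duplicated CIDR list."""
    networks = {
        ipaddress.ip_network(entry, strict=False)
        for entry in [*admin_networks, *node_ips]
    }
    # Sort by IP version first so IPv4 and IPv6 networks are never compared directly.
    return [str(n) for n in sorted(networks, key=lambda n: (n.version, n))]


def dropped_entries(deployed: Iterable[str], proposed: Iterable[str]) -> List[str]:
    """Return deployed entries that the proposed allowlist no longer covers."""
    proposed_nets = [ipaddress.ip_network(p, strict=False) for p in proposed]
    dropped = []
    for entry in deployed:
        net = ipaddress.ip_network(entry, strict=False)
        covered = any(
            net.version == p.version and net.subnet_of(p) for p in proposed_nets
        )
        if not covered:
            dropped.append(entry)
    return dropped


# Hypothetical inputs: a VPN subnet for admin access and three cluster nodes.
proposed = build_allowlist(["10.8.0.0/24"], ["192.0.2.11", "192.0.2.10", "192.0.2.12"])
deployed = ["203.0.113.10/32", "192.0.2.10/32", "192.0.2.11/32"]
would_lose_access = dropped_entries(deployed, proposed)
```

A non-empty `would_lose_access` list is a signal to stop and review before deploying: either the entry is intentionally retired (like an old home IP), or something that still needs access is about to be cut off.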
This outage, while brief, serves as a good reminder of how even routine infrastructure changes can have unexpected consequences when security configurations aren't properly managed.

About Jannis Fedoruk-Betschki
I'm the founder of Magic Pages, providing managed Ghost hosting that makes it easy to focus on your content instead of technical details.