DevOps

Lessons learned from a server outage

At just gone midnight this morning when I was nicely tucked up in bed, Uptime Robot sent me an email telling me that this very website, was not reachable.

I was fast asleep, and my phone was on do not disturb, so that situation stayed as it was until I woke up.

After waking up and seeing the message, I assumed that the docker container running the UI bit of this website had stopped running. Easy, I thought, I’ll just start up the container again, or I’ll just trigger another deployment and let my awesome GitLab CI pipeline do it’s thing.

Except this wasn’t the case. After realising, in a mild panic, that I could not even SSH onto the server that hosts this site, I got into contact with the hosting company (OneProvider) for some support.

I sent off a support message and sat back in my chair and reflected for a minute. Had I been a fool for rejecting the cloud? If this website was being hosted in a cloud service somewhere, would this have happened? Maybe I was stupid to dare to run my own server.

But as I gathered my thoughts, I calmed down and told myself I was being unreasonable. Over the last few years, I’ve seen cloud based solutions in various formats, also fail. One of the worst I experienced was with Amazon Redshift, when Amazon changed an obscure setup requirement meaning that you essentially needed some other Amazon service in order to be able to use Redshift. I’ve also been using a paid BitBucket service for a while with a client, which has suffered from around 5 outages of varying severity in the last 12 months. In a weird co-incidence, one of the worst outages was today. In comparison, my self hosted GitlLab instance has never gone down outside of running updates in the year and a half that I’ve had it.

Cloud based or on my own server, if an application went down I would still go through the same support process:

  • Do some investigation and try to fix
  • Send off a support request if I determined the problem was to do with the service or hosting provider

Check your SLA

An SLA or a Service Level Agreement basically outlines a service provider’s responsibilities for you. OneProvider’s SLA states that they will aim to resolve network issues within an hour, and hardware issues within two hours for their Paris data centre. Incidentally, other data centres don’t have any agreed time because of their location – like the Cairo datacenter. If they miss these SLAs, they have self imposed compensation penalties.

Two hours had long gone, so whatever the problem was, I was going to get some money back.

Getting back online

I have two main services running off of this box:

I could live with the link archive going offline for a few hours, but I really didn’t want my website going down. It has content that I’ve been putting on here for years, and believe it or not, it makes me a little beer money each month though some carefully selected and relevant affiliate links.

Here’s where docker helped. I got the site back online pretty quickly simply by starting the containers up on one of my other servers. Luckily the nginx instance on that webserver was already configured to handle any requests for edspencer.me.uk, so once the containers were started, I just needed to update the A record to point at the other server (short TTL FTW).

Within about 20 minutes, I got another email from Uptime Robot telling me that my website was back online, and that it had been down for 9 hours (yikes).

Check your backups

I use a WordPress plugin (updraft) to automate backups of this website. It works great, but the only problem is that I had set this up to take a backup weekly. Frustratingly, the last backup cycle had happened before I had penned two of my most recent posts.

I started to panic again a little. What if the hard drive in my server had died and I had lost that data forever? I was a fool for not reviewing my backup policy more. What if I lost all of the data in my link archive? I was angry at myself.

Back from the dead

At about 2pm, I got an email response from OneProvider. This was about 4 hours after I created the support ticket.

The email stated that a router in the Paris data centre that this server lives in, had died, and that they were replacing it and it would be done in 30 minutes.

This raises some questions about OneProvider’s ability to provide hosting.

  • Was the problem  highlighted by their monitoring, or did they need me to point it out?
  • Why did it take nearly 14 hours from the problem occurring to it being resolved?

I’ll be keeping my eyes on their service for the next two months.

Sure enough, the server was back and available, so I switched the DNS records back over to to point to this server.

Broken backups

Now it was time to sort out all of those backups. I logged into all of my servers and did a quick audit of my backups. It turns out there were numerous problems.

Firstly, my GitLab instance was being backup up, but the backup files were not being shipped off of the server. Casting my memory back, I can blame this on my Raspberry Pi that corrupted itself a few months ago that was previously responsible for pulling backups into my home network. I’ve now setup Rsync to pull the backups onto my RAIDed NAS.

Secondly, as previously mentioned, my WordPress backups were happening too infrequently. This has now been changed to a daily backup, and Updraft is shunting the backup files into Dropbox.

Thirdly – my link archive backups. These weren’t even running! The database is now backed up daily using Postgres’ awesome pg_dump feature in a cron job. These backups are then also pulled using Rsync down to my NAS.

It didn’t take long to audit and fix the backups – it’s something I should have done a while ago.

Lessons Learned

  • Build some resilience in. Deploy your applications to multiple servers as part of your CI pipeline. You could even have a load balancer do the legwork of directing your traffic to servers that it knows are responding correctly.
  • Those servers should be in different data centres, possibly with different hosting providers.
  • Make sure your backups are working. You should check these regularly or even setup an alerting system for when they fail.
  • Make sure your backups are sensible. Should they be running more frequently?
  • Pull your backups onto different servers so that you can get at them if the server goes offline