Who answers the page at 3am?
When you provide hosting, as Culture Foundry does, you need to provide on call support. Below, I’ll outline the components of Culture Foundry’s on call system.
The first and most important piece is to define support levels. Here at CF, we have two levels of support at two different prices. At this time, the professional service offering includes a response time of two hours during business hours and six hours at other times (we of course strive to act as quickly as possible, but those times are the promises we make). The basic service offering includes a response time of one business day. We also define what triggers a page, which can depend on the client. The actual response time durations matter less than letting clients choose the level of responsiveness they are willing to pay for, writing down those commitments, and sticking to them.
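Written-down commitments like these are simple enough to express in code. Here is a minimal sketch, not our actual tooling, that computes a response deadline per plan. The business hours (Monday to Friday, 09:00 to 17:00), the reading of "one business day" as "by the end of the next business day", and all function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumed business hours; the real definition is whatever the contract says.
BUSINESS_START_HOUR = 9
BUSINESS_END_HOUR = 17


def is_business_time(moment: datetime) -> bool:
    """True on Monday-Friday between the assumed opening and closing hours."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Saturday is 5
    return is_weekday and BUSINESS_START_HOUR <= moment.hour < BUSINESS_END_HOUR


def end_of_next_business_day(moment: datetime) -> datetime:
    """Closing time of the first weekday strictly after `moment`'s date."""
    day = moment + timedelta(days=1)
    while day.weekday() >= 5:  # skip Saturday and Sunday
        day += timedelta(days=1)
    return day.replace(hour=BUSINESS_END_HOUR, minute=0, second=0, microsecond=0)


def response_deadline(plan: str, reported_at: datetime) -> datetime:
    """Latest time by which we have promised to respond."""
    if plan == "professional":
        # Two hours during business hours, six hours at other times.
        hours = 2 if is_business_time(reported_at) else 6
        return reported_at + timedelta(hours=hours)
    if plan == "basic":
        # Assumed interpretation of "one business day".
        return end_of_next_business_day(reported_at)
    raise ValueError(f"unknown plan: {plan}")
```

For example, `response_deadline("professional", datetime(2024, 1, 3, 3, 0))` covers the 3am page from the title: it falls outside business hours, so the six hour promise applies.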
After informing our clients about the options and letting them choose the right plan for them, the next step is to determine who is on call. That is, who is tied to their phone, how frequently, and for how long. Before we rolled this program out last year, on call duty rotated between two senior engineers. This was, to put it mildly, not sustainable. After analysis, we decided to have everyone on call. That includes the founders, the engagement managers, the senior engineers, and the junior engineers. Every full time employee is put on call for a week at a time (Wednesday to Wednesday). This works for two reasons:
- we document our systems so that everyone knows where to go. Like all documentation, this is in flux and is never perfect, but at least there is one place for everyone to go if a certain client has issues.
- we work with a third party to provide a first tier of support. This party has access to our systems and, while not having deep knowledge of applications, can restart web servers and accomplish other documented tasks. A non-technical CF employee who is on call owns the client interaction and communication, is responsible for logging issues, and knows who on the CF team to escalate to if needed, while the third party takes technical actions as directed by the CF team, by experience, or by the documentation.
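The week-at-a-time, Wednesday to Wednesday rotation described above can be sketched in a few lines. This is an illustration only, since a scheduling tool handles this in practice; the staff list, the `rotation_epoch` parameter, and the function names are hypothetical, and the handover time of day is ignored.

```python
from datetime import date, timedelta

WEDNESDAY = 2  # date.weekday() numbering: Monday is 0


def shift_start(day: date) -> date:
    """The most recent Wednesday on or before `day`."""
    return day - timedelta(days=(day.weekday() - WEDNESDAY) % 7)


def on_call_for(day: date, staff: list[str], rotation_epoch: date) -> str:
    """Who is on call on `day`, if staff[0] takes the shift containing `rotation_epoch`."""
    weeks_elapsed = (shift_start(day) - shift_start(rotation_epoch)).days // 7
    # Wrap around so everyone takes a turn before anyone repeats.
    return staff[weeks_elapsed % len(staff)]


# Hypothetical roster mixing roles, as in the post.
staff = ["founder", "engagement manager", "senior engineer", "junior engineer"]
print(on_call_for(date(2024, 1, 10), staff, rotation_epoch=date(2024, 1, 3)))
```

With four people in the roster, each person is on call one week in four, which is the "spread the pain" effect in miniature.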
Everyone participating in the on call rotation “spreads the pain” so that if a client is having regular issues, it’s clear to everyone that CF needs to invest to fix the root cause. We also have a #support Slack channel, and when that channel name lights up, everyone knows something has gone wrong (more on how we use Slack here). This is in contrast to the previous system, where the brunt of the pain was borne by a few people.
Another key component is to automate whatever you can. We’re certainly not at a Google level of automation, but we aren’t at a Google level of complexity either. The main services we use are uptimerobot (to manage uptime checks) and OpsGenie (to manage the on call rotation). uptimerobot is an affordable, robust service that I’ve used on other jobs as well as this one. A couple of things to be aware of:
- Sometimes the internet isn’t perfect and we get false positives from uptimerobot. Avoid monitoring too frequently.
- Monitor the correct URL. We have one application that very occasionally hangs. We were monitoring part of it, but uptimerobot was actually seeing cached HTML from a load balancer. After changing the check to point to a deeper URL that actually hung when the app did, we could be confident that we would be paged if the app went down.
- uptimerobot can monitor anything accessible from the internet. If you need to monitor an internal application that isn’t accessible, you’ll need to use a different tool. For apps running on AWS, we’ve had success with custom CloudWatch events.
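To make the first two points concrete, here is a rough sketch of a home-grown check. It is not how uptimerobot works internally, and the URL, the threshold of three failures, and the names are all assumptions. It probes a deep URL that only the application itself can answer, and it pages only after several consecutive failures, so a single network blip does not wake anyone up.

```python
import urllib.request


def check_once(url: str, timeout_seconds: float = 10.0) -> bool:
    """True if the URL answers with a 2xx status before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers URLError, HTTPError (4xx/5xx), refused connections and timeouts.
        return False


class FailureTracker:
    """Fires an alert only after N consecutive failed checks."""

    def __init__(self, failures_before_alert: int = 3):
        self.failures_before_alert = failures_before_alert
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one result; return True exactly when the alert should fire."""
        if check_passed:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire once when the threshold is reached, not on every later failure.
        return self.consecutive_failures == self.failures_before_alert


if __name__ == "__main__":
    # Hypothetical deep URL that exercises the app rather than a cached page.
    tracker = FailureTracker(failures_before_alert=3)
    if tracker.record(check_once("https://example.com/app/deep-check")):
        print("page the on call person")
```

The trade-off is detection time: with a check every five minutes and a threshold of three, an outage is confirmed roughly ten to fifteen minutes after it starts.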
OpsGenie is less affordable than uptimerobot, but offers a lot of value in managing on call rotations. It lets users trade shifts, and alerts can be fired through a number of integrations. Tips for OpsGenie include:
- Add the generated Google Calendar for the rotation to your own Google Calendar. We live in gcal, so it’s way better to have the on call calendar as just another calendar rather than have to log in to OpsGenie.
- Make sure everyone knows how to modify their notification settings. Some people may want text messages, while others may want phone calls or app notifications.
- Let people add the phone numbers OpsGenie uses to their phone’s contacts or favorites list so they can get a call even if in Do Not Disturb mode.
- We have fallback users so that if the person on call or the third party we’ve contracted with is unavailable to respond, it will fall through to a senior member of the staff after a certain period of time.
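The fallback idea in the last tip amounts to an ordered chain with a waiting period at each step. The sketch below is not OpsGenie's API; the responder names, the order of the first two steps, and the wait times are assumptions chosen only to show the logic.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    responder: str
    wait_minutes: int  # how long this responder has before the next step is notified


def who_to_notify(steps: list[EscalationStep], minutes_since_alert: int) -> list[str]:
    """Everyone who should have been notified for a still-unacknowledged alert."""
    notified = []
    notify_at = 0  # minutes after the alert at which the current step is notified
    for step in steps:
        if minutes_since_alert >= notify_at:
            notified.append(step.responder)
        notify_at += step.wait_minutes
    return notified


# Hypothetical chain: first tier, then the employee on call, then a senior fallback.
chain = [
    EscalationStep("third party first tier", wait_minutes=15),
    EscalationStep("employee on call", wait_minutes=15),
    EscalationStep("senior staff fallback", wait_minutes=0),
]
print(who_to_notify(chain, minutes_since_alert=20))
```

Once someone acknowledges the alert, escalation stops; the sketch only models the unacknowledged case.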
Finally, after an on call event happens, two follow up actions are generally needed. The first is to fix the issue that is causing the downtime, whether that means engineering time, a customer support escalation if there’s a third party involved, or some other remedy. This fix often goes through our normal development process, though depending on the issue it may be treated urgently. The second is to try to prevent it from happening again. This means noting the issue in the log and determining any changes to code or process that need to happen. We’re not perfect about this, but the goal is to prevent issues from turning into toil.
The paging system we’ve set up at Culture Foundry helps our clients and communicates clear expectations to employees who are responsible for providing that support.