I recently finished reading “Site Reliability Engineering: How Google Runs Production Systems”. This book, also known as the SRE book, is full of thought-provoking and useful advice, even for teams running systems much smaller than Google’s infrastructure (which is pretty much everyone on planet Earth). They’ve pushed the bounds of many aspects of computing, and it is fascinating to see where things break in that journey and how systems evolve. (I’ve also read that this book is now out of date with respect to Google’s internal systems. Don’t treat it as a bible, but rather as a jumping off point for your efforts to build reliable systems.) The entire contents of this book are available online, for free. I find it much easier to read a book made out of dead trees, though; there’s less distraction.
The book covers a lot of territory, and each chapter is by a different author or set of authors. The voice of the book is consistent, however. Some of the pieces have been previously published, and there is some overlap between the chapters; the book was put together by four editors and those many authors, which may account for the duplicate content. The book is over five hundred pages including appendices, so I’m not going to mention everything it covers. But certain passages resonated with me, and I wanted to share those.
For one, the SRE book defines “toil”: the ongoing manual labor required to run systems. Anyone who has ever run a production system will recognize the concept, and appreciate how appropriate the word is. In the book, the authors discuss just how much toil to allow as well as the ongoing cost of toil. Toil was a major reason for the creation of the site reliability engineer position at Google. Site reliability engineering combines the job of a sysadmin operating at scale with that of a software engineer. As you can imagine, the ideal SRE is a bit of a unicorn.
The authors also enumerate multiple types of testing. At Google scale, there are many different kinds of automated testing, and each handles different aspects of the system. One that was especially mind-bending to me was canary testing; I’d heard of the concept before, but never seen the mathematical underpinnings. When you are canary testing, you roll out a change to a limited subset of users and see how the change affects their experience.
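The book goes much deeper into the statistics than I will here, but the core mechanic is simple: deterministically bucket a small fraction of users onto the new version, then compare their error rate against the baseline before rolling out further. Here’s a minimal Python sketch of that idea; the names, fraction, and tolerance are my own illustration, not Google’s tooling.

```python
import hashlib

CANARY_FRACTION = 0.05  # send ~5% of users to the canary (illustrative value)

def in_canary(user_id: str) -> bool:
    """Deterministically bucket a user by hashing their ID, so the same
    user always lands on the same side of the experiment."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_FRACTION * 100

def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   tolerance: float = 0.01) -> bool:
    """Continue the rollout only if the canary's error rate is within
    a chosen tolerance of the baseline's."""
    return (error_rate(canary_errors, canary_requests)
            <= error_rate(baseline_errors, baseline_requests) + tolerance)
```

The real math in the book is about sizing the canary: small enough that a bad release hurts few users, large enough that the signal is statistically meaningful.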
I’ve never read a formal treatment of the CAP theorem, but am vaguely aware of how it limits distributed systems. The chapters on distributed consensus (and all the dead ends that have been explored around the topic) were eye-opening. The explanations were clear and understandable to someone without much theoretical knowledge. Key takeaway for me: avoid NIH syndrome with such distributed systems unless you have read and fully understood this material and its references. Finally, the book walks through how to build a distributed scheduling system to replace cron.
Data integrity at scale is another emergent aspect of large scale systems that I’d never considered. Those of us who can do a backup and restore of smaller RDBMSes can appreciate how important data integrity is, but working with systems where you’re worried about disk-level corruption and that hold exabytes of data (making round-trip restore tests and other best practices prohibitive) requires a whole new approach. This chapter discusses the various strategies Google uses, including soft deletes and early detection.
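The soft delete idea is worth internalizing even at small scale: a delete only marks data as gone, and the real purge happens after a grace window, so accidental or malicious deletions can be reversed. This toy Python store is my own sketch of the concept, not anything from the book’s actual systems.

```python
import time

PURGE_DELAY_SECS = 30 * 24 * 3600  # keep soft-deleted data ~30 days (illustrative)

class SoftDeleteStore:
    """Toy key-value store where delete() only hides data; a separate
    purge pass removes it for real once the grace window has passed."""

    def __init__(self):
        self._data = {}       # key -> value
        self._deleted = {}    # key -> timestamp of the soft delete

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._deleted[key] = time.time()   # soft delete: the bytes stay put

    def restore(self, key):
        self._deleted.pop(key, None)       # undelete within the window

    def get(self, key):
        if key in self._deleted:
            return None                    # hidden from normal reads
        return self._data.get(key)

    def purge_expired(self, now=None):
        now = time.time() if now is None else now
        for key, ts in list(self._deleted.items()):
            if now - ts > PURGE_DELAY_SECS:
                self._data.pop(key, None)  # hard delete after the window
                del self._deleted[key]
```

The design choice here is that recovery from a bad delete is a cheap metadata change, while permanent loss requires both the delete and the passage of the grace period.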
Launching a new service within Google is complicated enough that there are now launch specialists. This chapter has something for everyone, covering how this role evolved over time, the nuts and bolts of a launch checklist, and techniques for reliable launches. Not everyone launches projects at Google scale, but everyone wants a smooth launch, and load testing, feature flags, and staged rollouts are all techniques available to everyone.
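To make the feature flag and staged rollout idea concrete, here’s a minimal in-process flag store I sketched up; it is my own illustration of the pattern, not the book’s (or Google’s) implementation. It supports ramping a feature from 1% to 100% of users, plus an instant kill switch for when a launch goes sideways.

```python
import hashlib

class FeatureFlags:
    """Toy feature-flag store supporting percentage-based staged rollouts
    and a kill switch that disables a feature without a redeploy."""

    def __init__(self):
        self._rollout_pct = {}   # feature -> percentage of users enabled (0-100)
        self._killed = set()     # features switched off everywhere

    def set_rollout(self, feature: str, pct: int):
        self._rollout_pct[feature] = pct   # e.g. ramp 1 -> 10 -> 50 -> 100

    def kill(self, feature: str):
        self._killed.add(feature)          # emergency off-switch

    def enabled(self, feature: str, user_id: str) -> bool:
        if feature in self._killed:
            return False
        pct = self._rollout_pct.get(feature, 0)
        # Hash feature+user so each user's experience is stable, and the
        # set of enabled users only grows as the percentage increases.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < pct
```

Because the bucketing is deterministic, raising the percentage never flips a user who already had the feature back to the old behavior.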
There is also a great chapter on managing incidents that anyone with operations responsibilities should read. I enjoyed the clear explanations and justifications for why every incident should have an owner, so that problems don’t spiral out of control, and why every resolved incident should have a postmortem, so that the organization can learn.
Reading the SRE book was a bit like visiting Ireland. You speak the language, but you have a different set of references, and some words have different meanings. I saw echoes of my software delivery experiences, but they were reflected back at me in ways I’d never seen. That shift in perspective alone was worth the time it took to grind through the book; the knowledge I gained was just icing on the cake.