I recently finished reading “Site Reliability Engineering: How Google Runs Production Systems”. This book, also known as the SRE book, is full of thought-provoking and useful advice, even for teams running systems much smaller than Google’s infrastructure (which is pretty much everyone on planet Earth). They’ve pushed the bounds of many aspects of computing, and it is fascinating to see where things break in that journey and how systems evolve. (I’ve also read that this book is now out of date with respect to Google’s internal systems. Don’t treat it as a bible, but rather as a jumping off point for your efforts to build reliable systems.) The entire contents of this book are available online, for free. I find it much easier to read a book made out of dead trees, though; there’s less distraction.
The book covers a lot of territory, and each chapter is by a different author or set of authors. The voice of the book is consistent, however. Some of the pieces have been previously published, and there is some overlap between the chapters; the book was put together by four editors and those many authors, which may account for the duplicate content. The book is over five hundred pages including appendices, so I’m not going to mention everything it covers. But certain passages resonated with me, and I wanted to share those.
For one, the SRE book defines “toil”: the ongoing manual labor required to run systems. Anyone who has ever run a production system will recognize the concept, and appreciate how appropriate the word is. In the book, the authors discuss just how much toil to allow as well as the ongoing cost of toil. Toil was a major reason for the creation of the site reliability engineer position at Google. Site reliability engineering combines the job of a sysadmin operating at scale with that of a software engineer. As you can imagine, the ideal SRE is a bit of a unicorn.
The authors also enumerate multiple types of testing. At Google scale, there are many different kinds of automated testing, and each handles different aspects of the system. One that was especially mind-bending to me was canary testing; I’d heard of the concept before, but never seen the mathematical underpinnings. When you are canary testing, you roll out a change to a limited subset of users and see how the change affects their experience.
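The book goes much deeper into the statistics than I will here, but the core mechanic is simple: deterministically bucket a small fraction of users onto the new version, then compare their error rate against the baseline before rolling out further. Here’s a minimal Python sketch of that idea; the names, fraction, and tolerance are my own illustration, not Google’s tooling.

```python
import hashlib

CANARY_FRACTION = 0.05  # send ~5% of users to the canary (illustrative value)

def in_canary(user_id: str) -> bool:
    """Deterministically bucket a user by hashing their ID, so the same
    user always lands on the same side of the experiment."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_FRACTION * 100

def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   tolerance: float = 0.01) -> bool:
    """Continue the rollout only if the canary's error rate is within
    a chosen tolerance of the baseline's."""
    return (error_rate(canary_errors, canary_requests)
            <= error_rate(baseline_errors, baseline_requests) + tolerance)
```

The real math in the book is about sizing the canary: small enough that a bad release hurts few users, large enough that the signal is statistically meaningful.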
I’ve never read a formal treatment of the CAP theorem, but am vaguely aware of how it limits distributed systems. The chapters on distributed consensus (and all the dead ends that have been explored around the topic) were eye-opening. The explanations were clear and understandable to someone without much theoretical knowledge. Key takeaway for me: avoid NIH syndrome with such distributed systems unless you have read and fully understood this material and its references. Finally, the book walks through how to build a distributed scheduling system to replace cron.
Data integrity at scale is another emergent aspect of large scale systems that I’d never considered. Those of us who can do a backup and restore of smaller RDBMSes can appreciate how important data integrity is, but working with systems where you’re worried about disk-level corruption and that hold exabytes of data (making round-trip restore tests and other best practices prohibitive) requires a whole new approach. This chapter discusses the various strategies Google uses, including soft deletes and early detection.
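The soft delete idea is worth internalizing even at small scale: a delete only marks data as gone, and the real purge happens after a grace window, so accidental or malicious deletions can be reversed. This toy Python store is my own sketch of the concept, not anything from the book’s actual systems.

```python
import time

PURGE_DELAY_SECS = 30 * 24 * 3600  # keep soft-deleted data ~30 days (illustrative)

class SoftDeleteStore:
    """Toy key-value store where delete() only hides data; a separate
    purge pass removes it for real once the grace window has passed."""

    def __init__(self):
        self._data = {}       # key -> value
        self._deleted = {}    # key -> timestamp of the soft delete

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._deleted[key] = time.time()   # soft delete: the bytes stay put

    def restore(self, key):
        self._deleted.pop(key, None)       # undelete within the window

    def get(self, key):
        if key in self._deleted:
            return None                    # hidden from normal reads
        return self._data.get(key)

    def purge_expired(self, now=None):
        now = time.time() if now is None else now
        for key, ts in list(self._deleted.items()):
            if now - ts > PURGE_DELAY_SECS:
                self._data.pop(key, None)  # hard delete after the window
                del self._deleted[key]
```

The design choice here is that recovery from a bad delete is a cheap metadata change, while permanent loss requires both the delete and the passage of the grace period.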
Launching a new service within Google is complicated enough that there are now launch specialists. This chapter has something for everyone, covering how this role evolved over time, the nuts and bolts of a launch checklist, and techniques for reliable launches. Not everyone launches projects at Google scale, but everyone wants a smooth launch, and load testing, feature flags, and staged rollouts are all techniques available to everyone.
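To make the feature flag and staged rollout idea concrete, here’s a minimal in-process flag store I sketched up; it is my own illustration of the pattern, not the book’s (or Google’s) implementation. It supports ramping a feature from 1% to 100% of users, plus an instant kill switch for when a launch goes sideways.

```python
import hashlib

class FeatureFlags:
    """Toy feature-flag store supporting percentage-based staged rollouts
    and a kill switch that disables a feature without a redeploy."""

    def __init__(self):
        self._rollout_pct = {}   # feature -> percentage of users enabled (0-100)
        self._killed = set()     # features switched off everywhere

    def set_rollout(self, feature: str, pct: int):
        self._rollout_pct[feature] = pct   # e.g. ramp 1 -> 10 -> 50 -> 100

    def kill(self, feature: str):
        self._killed.add(feature)          # emergency off-switch

    def enabled(self, feature: str, user_id: str) -> bool:
        if feature in self._killed:
            return False
        pct = self._rollout_pct.get(feature, 0)
        # Hash feature+user so each user's experience is stable, and the
        # set of enabled users only grows as the percentage increases.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < pct
```

Because the bucketing is deterministic, raising the percentage never flips a user who already had the feature back to the old behavior.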
There is also a great chapter on managing incidents that anyone with operations responsibilities should read. I enjoyed the clear explanations and justifications for why every incident should have an owner, so that problems don’t spiral out of control, and why every resolved incident should have a postmortem, so that the organization can learn.
Reading the SRE book was a bit like visiting Ireland. You speak the language, but you have a different set of references, and some words have different meanings. I saw echoes of my software delivery experiences, but they were reflected back at me in ways I’d never seen. That shift in perspective alone was worth the time it took to grind through the book; the knowledge I gained was just icing on the cake.