https://sre.google/sre-book/table-of-contents/

Table of Contents

* Foreword
* Preface
* Part I - Introduction
  * Chapter 1 - Introduction
  * Chapter 2 - The Production Environment at Google, from the Viewpoint of an SRE
* Part II - Principles
  * Chapter 3 - Embracing Risk
  * Chapter 4 - Service Level Objectives
  * Chapter 5 - Eliminating Toil
  * Chapter 6 - Monitoring Distributed Systems
  * Chapter 7 - The Evolution of Automation at Google
  * Chapter 8 - Release Engineering
  * Chapter 9 - Simplicity
* Part III - Practices
  * Chapter 10 - Practical Alerting
  * Chapter 11 - Being On-Call
  * Chapter 12 - Effective Troubleshooting
  * Chapter 13 - Emergency Response
  * Chapter 14 - Managing Incidents
  * Chapter 15 - Postmortem Culture: Learning from Failure
  * Chapter 16 - Tracking Outages
  * Chapter 17 - Testing for Reliability
  * Chapter 18 - Software Engineering in SRE
  * Chapter 19 - Load Balancing at the Frontend
  * Chapter 20 - Load Balancing in the Datacenter
  * Chapter 21 - Handling Overload
  * Chapter 22 - Addressing Cascading Failures
  * Chapter 23 - Managing Critical State: Distributed Consensus for Reliability
  * Chapter 24 - Distributed Periodic Scheduling with Cron
  * Chapter 25 - Data Processing Pipelines
  * Chapter 26 - Data Integrity: What You Read Is What You Wrote
  * Chapter 27 - Reliable Product Launches at Scale
* Part IV - Management
  * Chapter 28 - Accelerating SREs to On-Call and Beyond
  * Chapter 29 - Dealing with Interrupts
  * Chapter 30 - Embedding an SRE to Recover from Operational Overload
  * Chapter 31 - Communication and Collaboration in SRE
  * Chapter 32 - The Evolving SRE Engagement Model
* Part V - Conclusions
  * Chapter 33 - Lessons Learned from Other Industries
  * Chapter 34 - Conclusion
* Appendix A - Availability Table
* Appendix B - A Collection of Best Practices for Production Services
* Appendix C - Example Incident State Document
* Appendix D - Example Postmortem
* Appendix E - Launch Coordination Checklist
* Appendix F - Example Production Meeting Minutes
* Bibliography

Copyright (c) 2017 Google, Inc. Published by O'Reilly Media, Inc. Licensed under CC BY-NC-ND 4.0