The Pragmatic Programmer for Machine Learning
Engineering Analytics and Data Science Solutions

Marco Scutari, Mauro Malvestio

2023-04-22

Preface

Pitching new ideas by prefacing them with quotes like "Data scientist: the sexiest job of the 21st century" (Harvard Business Review 2012) or "Data is the new oil" (The Economist 2017) has become such a cliché that any audience (in business and academia alike) will collectively roll their eyes in exasperation. And for good reason. Likewise, we do not believe radiologists or lorry drivers will be replaced by artificial intelligence and put out of a job in the foreseeable future, and we are not alone in realising the limits of machine learning (The Economist 2020).

Even so, it is difficult to overstate the impact that machine learning is having on many aspects of our lives. It has taken the pre-existing trends of using data and analytics (under the banner of "data mining", "big data" and similar buzzwords) to inform business decisions and drive scientific discovery, and made them ubiquitous. Machine learning has combined the mathematical rigour of information theory and statistics, the computational aspects of computer science and the goal-driven flexibility of optimisation theory, redefining how we work with data.

The flip side of trying to distil parts of so many different disciplines has been the clash between their respective cultures, which has been well summarised by Leo Breiman in "Statistical Modeling: The Two Cultures" (Breiman 2001b). On top of that, there is a tension between machine learning practice in industry and academia: the latter strongly values producing novel models and theoretical results, while the former is driven by the need to produce practical results that have business value. With so many different perspectives, it is a wonder that a rough consensus on what machine learning actually is has emerged!
(Personally, our red line is conflating deep learning with machine learning. There is life beyond deep neural networks!)

In this melting pot of ideas, we feel that software engineering has played a remarkably small role compared to other disciplines. Machine learning, after all, is "a technique that allows computer systems to improve with experience and data" (Goodfellow, Bengio, and Courville 2016). Therefore, there is a presumption that one will interact with a computer system, which in turn requires engineering a piece of software that tells the computer system what it is supposed to do.

The quality of this engineering is crucial in both academia and industry. In academia, software quality issues are one of the underlying causes of the "reproducibility crisis" (Nature 2016; Tatman, VanderPlas, and Dane 2018). In industry, poor engineering leads to lower practical and computational performance (Kang et al. 2021), to a quick accumulation of technical debt (Sculley et al. 2015) and sometimes to catastrophic failures with costs in the millions (Seven 2014; The Register 2020; VPNOverview 2022; Sherman 2022).

There is, of course, a sizeable body of accumulated wisdom on how to architect and write software in foundational books like The Pragmatic Programmer (Thomas and Hunt 2019) and A Philosophy of Software Design (Ousterhout 2018). However, these books are written with business software in mind, and we find that they either do not capture, or touch only tangentially on, key practices that go a long way towards successfully implementing and deploying machine learning models. These practices include: the analysis of algorithms; matching data and algorithms with appropriate hardware; embracing data as part of the software; testing and documenting algorithms and their implementations; modularising and building pipelines; and, last but not least, naming variables.
In our experience in academia and industry, both engineering software and teaching software engineering to students and new staff, these topics are often not given the importance they deserve. We hope to convince the readers of this book that the viability of any software that analyses data, whether you call it machine learning, data science or business analytics, depends crucially on putting careful thought into these engineering practices.

We do not aim to be prescriptive: the individual practices that we discuss will be more or less relevant in different settings, and can be implemented with a variety of software tools. On the contrary, we want our readers to think about what we wrote in the context of their own experience and to figure out which parts apply and which do not!

The book starts with a brief introduction to machine learning and software engineering, to set out how we view them and how we think that they should interact in practical applications. The remainder is structured in four parts, from foundational to practical:

1. Foundations of Scientific Computing: covering key topics that are foundational for the planning, analysis and design of machine learning software, such as the trade-offs of using different hardware configurations; the characteristics of different data types and of suitable data structures; and the analysis of algorithms to determine their computational complexity.

2. Best Practices for Machine Learning and Data Science: revisiting best practices in software engineering from the point of view of a machine learning engineer, from writing, troubleshooting and deploying code to production (that is, serving models) to writing technical documentation.

3. Tools and Technologies: discussing broad classes of tools that shape how we think about what is feasible to do with machine learning pipelines, with examples from the state of the art and the trade-offs they make.

4.
A Case Study: putting the recommendations in the previous chapters into practice by discussing and prototyping a machine learning pipeline for natural language understanding from the work of Lipizzi et al. (2022).

All the material in this book, including the book itself, is available online at https://ppml.dev and will be updated to fix assorted typos and code problems as they become known to us.

Finally, we would like to thank all the people who supported us and made this book possible. First of all, our families, who put up with our long working hours. The colleagues who gave us feedback on early drafts of the book: Vincenzo Manzoni, Fabio Stella and Ron Kenett. And, last but not least, our editor Randi Cohen, who bore with us through the many delays this book suffered during the Covid pandemic.

References

Breiman, L. 2001b. "Statistical Modeling: The Two Cultures." Statistical Science 16 (3): 199-231.

Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.

Harvard Business Review. 2012. Data Scientist: The Sexiest Job of the 21st Century. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.

Kang, S., R. Jin, X. Deng, and R. S. Kenett. 2021. "Challenges of Modeling and Analysis in Cybermanufacturing: A Review from a Machine Learning and Computation Perspective." Journal of Intelligent Manufacturing, online first.

Lipizzi, C., H. Behrooz, M. Dressman, A. G. Vishwakumar, and K. Batra. 2022. "Acquisition Research: Creating Synergy for Informed Change." In Proceedings of the 19th Annual Acquisition Research Symposium, 242-55.

Nature. 2016. "Reality Check on Reproducibility." Nature 533: 437.

Ousterhout, J. 2018. A Philosophy of Software Design. Yaknyam Press.

Sculley, D., G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. 2015. "Hidden Technical Debt in Machine Learning Systems."
In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), 2:2503-11.

Seven, D. 2014. Knightmare: A DevOps Cautionary Tale. https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/.

Sherman, E. 2022. What Zillow's Failed Algorithm Means for the Future of Data Science. https://fortune.com/education/business/articles/2022/02/01/what-zillows-failed-algorithm-means-for-the-future-of-data-science/.

Tatman, R., J. VanderPlas, and S. Dane. 2018. "A Practical Taxonomy of Reproducibility for Machine Learning Research." In Proceedings of the 2nd Reproducibility in Machine Learning Workshop at ICML 2018.

The Economist. 2017. The World's Most Valuable Resource Is No Longer Oil, but Data. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data.

The Economist. 2020. An Understanding of AI's Limitations Is Starting to Sink In. https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in.

The Register. 2020. Twilio: Someone Waltzed into Our Unsecured AWS S3 Silo, Added Dodgy Code to Our JavaScript SDK for Customers. https://www.theregister.com/2020/07/21/twilio_javascript_sdk_code_injection.

Thomas, D., and A. Hunt. 2019. The Pragmatic Programmer: Your Journey to Mastery. Anniversary edition. Addison-Wesley.

VPNOverview. 2022. Fintech App Switch Leaks Users' Transactions, Personal IDs. https://vpnoverview.com/news/fintech-app-switch-leaks-users-transactions-personal-ids.