https://ppml.dev/

  * Preface
  * 1 What Is This Book About?
      + 1.1 Machine Learning
      + 1.2 Data Science
      + 1.3 Software Engineering
      + 1.4 How Do They Go Together?
  * I Foundations of Scientific Computing
  * 2 Hardware Architectures
      + 2.1 Types of Hardware
          o 2.1.1 Compute
          o 2.1.2 Memory
          o 2.1.3 Connections
      + 2.2 Making Hardware Live Up to Expectations
      + 2.3 Local and Remote Hardware
      + 2.4 Choosing the Right Hardware for the Job
  * 3 Variable Types and Data Structures
      + 3.1 Variable Types
          o 3.1.1 Integers
          o 3.1.2 Floating Point
          o 3.1.3 Strings
      + 3.2 Data Structures
          o 3.2.1 Vectors and Lists
          o 3.2.2 Representing Data with Data Frames
          o 3.2.3 Dense and Sparse Matrices
      + 3.3 Choosing the Right Variable Types for the Job
      + 3.4 Choosing the Right Data Structures for the Job
  * 4 Analysis of Algorithms
      + 4.1 Writing Pseudocode
      + 4.2 Computational Complexity and Big-\(O\) Notation
      + 4.3 Big-\(O\) Notation and Benchmarking
      + 4.4 Algorithm Analysis for Machine Learning
      + 4.5 Some Examples of Algorithm Analysis
          o 4.5.1 Estimating Linear Regression Models
          o 4.5.2 Sparse Matrices Representation
          o 4.5.3 Uniform Simulations of Directed Acyclic Graphs
      + 4.6 Big-\(O\) Notation and Real-World Performance
  * II Best Practices for Machine Learning Pipelines
  * 5 Designing and Structuring Pipelines
      + 5.1 Data as Code
      + 5.2 Technical Debt
          o 5.2.1 At the Data Level
          o 5.2.2 At the Model Level
          o 5.2.3 At the Architecture (Design) Level
          o 5.2.4 At the Code Level
      + 5.3 Machine Learning Pipeline
          o 5.3.1 Project Scoping
          o 5.3.2 Producing a Baseline Implementation
          o 5.3.3 Data Ingestion and Preparation
          o 5.3.4 Model Training, Evaluation and Validation
          o 5.3.5 Deployment, Serving and Inference
          o 5.3.6 Monitoring, Logging and Reporting
  * 6 Writing Machine Learning Code
      + 6.1 Choosing Languages and Libraries
      + 6.2 Naming Things
      + 6.3 Coding Styles and Coding Standards
      + 6.4 Filesystem Structure
      + 6.5 Effective Versioning
      + 6.6 Code Review
      + 6.7 Refactoring
      + 6.8 Reworking Academic Code: An Example
  * 7 Packaging and Deploying Pipelines
      + 7.1 Model Packaging
          o 7.1.1 Standalone Packaging
          o 7.1.2 Programming Language Package Managers
          o 7.1.3 Virtual Machines
          o 7.1.4 Containers
      + 7.2 Model Deployment: Strategies
      + 7.3 Model Deployment: Infrastructure
      + 7.4 Model Deployment: Monitoring and Logging
      + 7.5 What Can Possibly Go Wrong?
      + 7.6 Rolling Back
  * 8 Documenting Pipelines
      + 8.1 Comments
      + 8.2 Documenting Public Interfaces
      + 8.3 Documenting Architecture and Design
      + 8.4 Documenting Algorithms and Business Cases
      + 8.5 Illustrating Practical Use Cases
  * 9 Troubleshooting and Testing Pipelines
      + 9.1 Data Are the Problem
          o 9.1.1 Large Data
          o 9.1.2 Heterogeneous Data
          o 9.1.3 Dynamic Data
      + 9.2 Models Are the Problem
          o 9.2.1 Large Models
          o 9.2.2 Black-Box Models
          o 9.2.3 Costly Models
          o 9.2.4 Many Models
      + 9.3 Common Signs That Something Is Up
      + 9.4 Tests Are the Solution
          o 9.4.1 What Do We Want to Achieve?
          o 9.4.2 What Should We Test?
          o 9.4.3 Offline and Online Data
          o 9.4.4 Testing Local and Testing Global
          o 9.4.5 Conceptual and Implementation Errors
          o 9.4.6 Code Coverage and Test Prioritisation
  * III Tools and Technologies
  * 10 Tools for Developing Pipelines
      + 10.1 Data Exploration and Experiment Tracking
      + 10.2 Code Development
          o 10.2.1 Code Editors and IDEs
          o 10.2.2 Notebooks
          o 10.2.3 Accessing Data and Documentation
      + 10.3 Build, Test and Documentation Tools
  * 11 Tools to Manage Pipelines in Production
      + 11.1 Infrastructure Management
      + 11.2 Machine Learning Software Management
      + 11.3 Dashboards, Visualisation and Reporting
  * IV A Case Study
  * 12 Recommending Recommendations: A Recommender System Using
    Natural Language Understanding
      + 12.1 The Domain Problem
      + 12.2 The Machine Learning Model
      + 12.3 The Infrastructure
      + 12.4 The Architecture of the Pipeline
          o 12.4.1 Data Ingestion and Data Preparation
          o 12.4.2 Data Tracking and Versioning
          o 12.4.3 Training and Experiment Tracking
          o 12.4.4 Model Packaging
          o 12.4.5 Deployment and Inference
  * References

The Pragmatic Programmer for Machine Learning

Engineering Analytics and Data Science Solutions

Marco Scutari, Mauro Malvestio

2023-04-22

Preface 

Pitching new ideas by prefacing them with quotes like "Data
scientist: the sexiest job of the 21st century" (Harvard Business
Review 2012) or "Data is the new oil" (The Economist 2017) has become
such a cliche that any audience (in business and academia alike) will
collectively roll their eyes in exasperation. And for good reason.
Likewise, we do not believe that radiologists or lorry drivers will
be replaced by artificial intelligence and put out of a job in the
foreseeable future, and we are not alone in realising the limits of
machine learning (The Economist 2020).

Even so, it is difficult to overstate the impact that machine
learning is having on many aspects of our lives. It has taken the
pre-existing trends of using data and analytics (under the banner of
"data mining", "big data" and similar buzzwords) to inform business
decisions and drive scientific discovery, and made them ubiquitous.
Machine learning has combined the mathematical rigour of information
theory and statistics, the computational aspects of computer science
and the goal-driven flexibility of optimisation theory, redefining
how we work with data.

The flip side of trying to distil parts of so many different
disciplines has been the clash between their respective cultures,
which has been well summarised by Leo Breiman in "Statistical
Modeling: The Two Cultures" (Breiman 2001b). On top of that, there is
a tension between machine learning practice in industry and in
academia: the latter strongly values producing novel models and
theoretical results, while the former is driven by the need to
produce practical results that have business value. With so many
different perspectives, it is a wonder that a rough consensus on what
machine learning is has actually evolved! (Personally, our red line
is conflating deep learning with machine learning. There is life
beyond deep neural networks!)

In this melting pot of ideas, we feel that software engineering has
played a remarkably small role compared to other disciplines. Machine
learning, after all, is "a technique that allows computer systems to
improve with experience and data" (Goodfellow, Bengio, and Courville
2016). Therefore, there is a presumption that one will interact with
a computer system, which in turn happens by engineering a piece of
software that communicates to the computer system what it is supposed
to do. The quality of this engineering is crucial in both academia
and industry. In academia, software quality issues are one of the
underlying causes of the "reproducibility crisis" (Nature 2016;
Tatman, VanderPlas, and Dane 2018). In industry, poor engineering
leads to lower practical and computational performance (Kang et al.
2021), to a quick accumulation of technical debt (Sculley et al. 2015)
and sometimes to catastrophic failures with costs in the millions
(Seven 2014; The Register 2020; VPNOverview 2022; Sherman 2022).

There is, of course, a sizeable body of accumulated wisdom on how to
architect and write software in foundational books like The Pragmatic
Programmer (Thomas and Hunt 2019) and A Philosophy of Software Design
(Ousterhout 2018). However, these books are written with business
software in mind, and we find that they either do not capture or touch
only tangentially on key practices that go a long way towards
successfully implementing and deploying machine learning models:
analysis of algorithms; matching data and algorithms with appropriate
hardware; embracing data as part of the software; testing and
documenting algorithms and their implementations; modularising and
building pipelines; and, last but not least, naming variables.

From our experience in academia and in industry, engineering software
and teaching software engineering to students and new staff alike,
these topics are often not given the importance they deserve. We hope
to convince the readers of this book that the viability of any
software that analyses data, whether you call it machine learning,
data science or business analytics, depends crucially on putting
careful thought into these engineering practices.

We do not aim to be prescriptive: the individual practices that we
discuss will be more or less relevant in different settings, and can
be implemented with a variety of software tools. Rather, we want our
readers to think about what we wrote in the context of their own
experience and to figure out which parts apply and which do not!

The book starts with a brief introduction to machine learning and
software engineering, to set out how we view them and how we think
that they should interact in practical applications. The remainder is
structured in four parts, from foundational to practical:

 1. Foundations of Scientific Computing: covering key topics that are
    foundational for the planning, analysis and design of machine
    learning software, such as: the trade-offs of using different
    hardware configurations; the characteristics of different data
    types and of suitable data structures; and the analysis of
    algorithms to determine their computational complexity.
 2. Best Practices for Machine Learning Pipelines: revisiting
    best practices in software engineering from the point of view of
    a machine learning engineer, from writing, troubleshooting and
    deploying code to production (that is, serving models) to writing
    technical documentation.
 3. Tools and Technologies: discussing broad classes of tools that
    shape how we think about what is feasible to do with machine
    learning pipelines, with examples from the state of the art and
    the trade-offs they make.
 4. A Case Study: putting the recommendations from the previous
    chapters into practice by discussing and prototyping a machine
    learning pipeline for natural language understanding based on the
    work of Lipizzi et al. (2022).

All the material in this book, including the book itself, is
available online at

    https://ppml.dev

and will be updated to fix assorted typos and code problems as they
become known to us.

Finally, we would like to thank all the people who supported us and
made this book possible. First of all, our families, who put up with
our long working hours. The colleagues who gave us feedback on early
drafts of the book: Vincenzo Manzoni, Fabio Stella and Ron Kenett.
And, last but not least, our editor Randi Cohen, who bore with us
through the many delays this book suffered during the Covid pandemic.

References

Breiman, L. 2001b. "Statistical Modeling: The Two Cultures."
Statistical Science 16 (3): 199-231.

Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep Learning. MIT
Press.

Harvard Business Review. 2012. Data Scientist: The Sexiest Job of the
21st Century.
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century.

Kang, S., R. Jin, X. Deng, and R. S. Kenett. 2021. "Challenges of
Modeling and Analysis in Cybermanufacturing: A Review from a Machine
Learning and Computation Perspective." Journal of Intelligent
Manufacturing. Online first.

Lipizzi, C., H. Behrooz, M. Dressman, A. G. Vishwakumar, and K.
Batra. 2022. "Acquisition Research: Creating Synergy for Informed
Change." In Proceedings of the 19th Annual Acquisition Research
Symposium, 242-55.

Nature. 2016. "Reality Check on Reproducibility." Nature 533: 437.

Ousterhout, J. 2018. A Philosophy of Software Design. Yaknyam Press.

Sculley, D., G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner,
V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. 2015. "Hidden
Technical Debt in Machine Learning Systems." In Proceedings of the
28th International Conference on Neural Information Processing
Systems (NIPS), 2:2503-11.

Seven, D. 2014. Knightmare: A DevOps Cautionary Tale.
https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/.

Sherman, E. 2022. What Zillow's Failed Algorithm Means for the Future
of Data Science.
https://fortune.com/education/business/articles/2022/02/01/what-zillows-failed-algorithm-means-for-the-future-of-data-science/.

Tatman, R., J. VanderPlas, and S. Dane. 2018. "A Practical Taxonomy
of Reproducibility for Machine Learning Research." In Proceedings of
the 2nd Reproducibility in Machine Learning Workshop at ICML 2018.

The Economist. 2017. The World's Most Valuable Resource Is No Longer
Oil, but Data.
https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data.

The Economist. 2020. An Understanding of AI's Limitations Is Starting
to Sink In.
https://www.economist.com/technology-quarterly/2020/06/11/an-understanding-of-ais-limitations-is-starting-to-sink-in.

The Register. 2020. Twilio: Someone Waltzed into Our Unsecured AWS S3
Silo, Added Dodgy Code to Our JavaScript SDK for Customers.
https://www.theregister.com/2020/07/21/twilio_javascript_sdk_code_injection.

Thomas, D., and A. Hunt. 2019. The Pragmatic Programmer: Your Journey
to Mastery. 20th Anniversary Edition. Addison-Wesley.

VPNOverview. 2022. Fintech App Switch Leaks Users' Transactions,
Personal IDs.
https://vpnoverview.com/news/fintech-app-switch-leaks-users-transactions-personal-ids.