https://blog.mozilla.ai/taming-randomness-in-ml-models-with-hypothesis-testing-and-marimo/

Taming randomness in ML models with hypothesis testing and marimo

Davide Eynard - Oct 16, 2024 - 5 min read

The behavior of ML models is often affected by randomness at different levels, from the initialization of model parameters to the split of a dataset into training and evaluation sets. As a consequence, the predictions a model makes (including the answers an LLM gives to your questions) are potentially different every time you run it. This makes evaluating model performance more complex than it might initially appear.

While academic papers almost always take the variability of model predictions into consideration in their evaluations, this is far less common in industry publications such as blog posts or company announcements. Verifying whether a new ML model is better than its predecessors from a table of published evaluation metrics can therefore be a non-trivial task, one that depends not on the metrics alone but on other parameters too, such as the number of experiments and the variability of their results.

[Figure: A table from OpenAI's paper "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering", comparing average model performance over multiple runs.]

This post is an introduction to hypothesis testing and how it can be used to compare results from different experiments (for instance, evaluations of different ML models) and draw conclusions that are supported by statistical evidence. You will learn this by running a marimo app in your browser, which will introduce you to the basics of statistical testing through different examples based on the simple concept of dice throwing. Using the marimo notebook will be the digital version of throwing more dice than you ever thought possible!

Determinism vs randomness

One of my favorite statistics books is Darrell Huff's "How to Lie with Statistics". Despite being published in 1954, it is still highly relevant today in demonstrating how data can be misinterpreted to support false conclusions - sometimes in good faith, sometimes not. For example, in Chapter 4 ("Much Ado about Practically Nothing"), Huff introduces the concept of sampling errors and how their presence makes some differences in data negligible in practice. He describes how the manufacturers of Old Gold cigarettes purposely ignored these errors so they could advertise their brand as the one with the lowest amounts of tar and nicotine. Now, if that reminds you of some LLM leaderboard, or of the latest paper claiming a minor improvement over the state of the art, you will understand why I find this 70-year-old book still so valuable!

The thing is, when the information you deal with is deterministic, you can easily compare different elements. If you represent different text documents as vectors (using any kind of representation, from bag-of-words to modern embeddings), you can determine which document is most similar to any other by calculating the distance between their respective vectors.
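To make the deterministic case concrete, here is a minimal sketch (my own illustration, not from the notebook; the toy documents, the vocabulary, and the helper name are invented for the example) that compares two bag-of-words vectors with cosine distance using numpy:

```python
import numpy as np

# Toy bag-of-words counts over a made-up vocabulary:
# ["dice", "model", "random", "throw"]
doc_a = np.array([3.0, 1.0, 0.0, 2.0])
doc_b = np.array([2.0, 0.0, 1.0, 2.0])

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite ones."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Deterministic: every run prints exactly the same number.
print(f"distance(doc_a, doc_b) = {cosine_distance(doc_a, doc_b):.4f}")
```

Run it as many times as you like and the answer never changes: there is nothing to average over and no variance to reason about.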
Things become more complex when information is affected by randomness, and that happens more often than one would expect in the tech, AI, and industrial worlds. Examples include determining which of two machines works best by measuring the number of defective items each produces; assessing which teaching approach is best by measuring students' test scores; and, of course, finding which ML model performs best across a set of experiments.

To compare phenomena whose outcomes are stochastic, or that can be described by a random distribution, we use hypothesis testing. If you want to know more about the topic, the Wikipedia article on statistical hypothesis testing provides a good general introduction. To find out what this means in practice, let us work on a simple example: dice throwing.
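As a preview of what the notebook walks you through interactively, here is a minimal sketch (again my own illustration, assuming numpy and scipy are installed; the bias of the loaded die and the seed are made up) that simulates 600 throws of a fair die and of a loaded one, then applies a chi-square goodness-of-fit test to each:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_throws = 600

# A fair die: every face has probability 1/6.
fair = rng.integers(1, 7, size=n_throws)

# A loaded die (hypothetical bias): face 6 comes up half of the time.
loaded = rng.choice(np.arange(1, 7), size=n_throws,
                    p=[0.1, 0.1, 0.1, 0.1, 0.1, 0.5])

for name, throws in [("fair", fair), ("loaded", loaded)]:
    counts = np.bincount(throws, minlength=7)[1:]  # counts for faces 1..6
    # Chi-square goodness-of-fit test against the uniform expectation
    # (scipy assumes equal expected frequencies when f_exp is omitted).
    result = stats.chisquare(counts)
    print(f"{name}: counts={counts.tolist()}, p-value={result.pvalue:.4f}")
```

With the fair die the p-value is typically large, so we cannot reject the null hypothesis that the die is uniform; with the loaded one it is vanishingly small, giving us statistical evidence that something is off. The notebook builds exactly this kind of intuition, one experiment at a time.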
Dice throwing with marimo

Years after passing my stats 101 exam with a report on statistics and cryptography (the paper itself being encrypted... in Italian!), I taught the same course to CS students. Whenever they were confronted with the complexity of some of its topics, I found that they always appreciated concrete (and possibly simple) examples. Dice throwing soon became one of my favorites: being very intuitive and easily explainable, it could be applied to different problems to provide an intuition of what was going on. I still keep the first version of my notes online, and from time to time I point people to it.

Finding a good example is only half of the work needed to properly explain a topic. The other half is how you present it, and I think Seeing Theory was a great inspiration for me. If you are interested in learning more about basic probability and statistics, I strongly suggest you check out its visually stunning interactive demos.

Looking for a tool that would allow me to build something similar, I found marimo, a reactive, open-source Python notebook that is "reproducible, git-friendly, and deployable as scripts or apps". One of the features I like the most is that you can run it entirely in your browser, in the form of WASM notebooks. This allows me to easily share all of its code, and you can run it without having to install anything. Add to this very helpful documentation and a great community, and you will understand why I had to try it!

The outcome is a marimo notebook that works as a practical introduction to hypothesis testing. It starts with something as easy as simulating a single die throw and, guiding you through more and more advanced experiments (and a lot of virtual dice!), it points you to the right questions to ask yourself whenever you are presented with a comparison between different ML models. So, go to https://marimo.io/p/@mzai/on-hypothesis-testing, follow the instructions, and start playing with the notebook in your browser. The code is available, but you can toggle it off from the three-dots menu in the top right corner. Have fun, and please feel free to reach out on the Mozilla AI Discord or at my_first_name at mozilla.ai for questions, feedback, and requests.

[Figure: Your first two experiments: throw a single die, then repeat many times and count all the different outcomes. A "Throw die" button shows a single outcome; a "Throw die 600 times" button shows a histogram of the outcomes.]

[Figure: "Dice comparison": the distributions of a fair and an unfair die are compared across thousands of throws, with a table of their means and standard deviations, a slider for the number of throws, and side-by-side histograms.]

[Figure: "Approximating the normal distribution": an empirical test comparing the distributions of the means for the fair and the unfair die; with enough experiments, both sampling distributions resemble bell-shaped, normal curves.]

Acknowledgements

Many thanks to Akshay Agrawal and Myles Scolnick for building marimo and providing help and super early feedback, and to Vicki Boykis and Oleg Lavrovsky for testing the notebook and offering great suggestions on how to improve it.