[HN Gopher] Who needs MLflow when you have SQLite?
___________________________________________________________________
Who needs MLflow when you have SQLite?
Author : edublancas
Score : 202 points
Date : 2022-11-16 14:55 UTC (8 hours ago)
(HTM) web link (ploomber.io)
(TXT) w3m dump (ploomber.io)
| praveenhm wrote:
| What are the alternatives to MLflow other than SQLite? Things
| like Kubeflow, Metaflow?
| crucialfelix wrote:
| Weights and Balances https://wandb.ai/site
| pcerdam wrote:
| Weights and *Biases :)
| kuba_dmp wrote:
| neptune.ai https://neptune.ai/
| the83 wrote:
| Comet: https://www.comet.com/site/
| mmq wrote:
| https://github.com/polyaxon
| isoprophlex wrote:
| Yeah, MLFlow is a shitshow. The docs seem designed to confuse,
| the API makes Pandas look good and the internal data model is
| badly designed and exposed, as the article says.
|
| But, hordes of architects and managers who almost have a clue
| have been conditioned to want and expect mlflow. And it's baked
| into databricks too, so for most purposes you'll be stuck with
| it.
|
| Props to the author for daring to challenge the status quo.
| idomi wrote:
| How many data scientists that use Databricks for modeling do
| you know?
| isoprophlex wrote:
| It's ubiquitous. I've consulted for a 100 person company that
| built a data product on top of some IoT data. Everything was
| in databricks, literally everything. (Not endorsing that,
| just an observation)
|
| Talking to a 2000+ person org now that is standardizing data
| science across the org using... you guessed it
| idomi wrote:
| Pretty interesting. I think this is part of the notion of
| releasing half-baked products: some of the stuff in there is
| really cool, just enough to get you in, but it doesn't scale
| and is usually complex to deploy/use.
| dachryn wrote:
| It's forced upon many of them that are in finance, banking,
| insurance, ...
|
| Mainly because those tend to run on Microsoft Azure, which
| has no decent analytics offering, and are pushing Databricks
| extremely hard. The CTO or whatever just pushes databricks.
| On paper it checks all the boxes. Mlops, notebooks,
| experiment management. It just does all of those things very
| badly, but the exec doesn't care. They only care about the
| microsoft credits. Just to avoid using Jupyter so the
| compliance teams stay happy as well, because Microsoft sales
| people scared them away from open source.
| akdor1154 wrote:
| What would you go with instead for collaborative notebooks?
|
| I ask because normally I tend pretty strongly towards the
| "NO just let the DSes/analysts work how they want to",
| which in this case would be running Jupyter locally.
| However DBr's notebooks seem genuinely useful.
|
| Is your issue "but I don't need Spark" or "i wanna code in
| a python project, not a notebook?", or something else?
|
| Imo if DBr cut their wedding to Spark and provided a
| Python-only nb environment they'd have a killer offering on
| their hands.
| nerdponx wrote:
| My team very nearly had this happen to us.
|
| We pushed back on it very, very, very hard, and finally
| convinced "IT" to not turn off our big Linux server running
| JupyterHub. We actually ended up using Databricks (PySpark,
| Delta Lake, hosted MLFlow) quite a bit for various
| purposes, and were happy to have it available.
|
| But the thought of forcing us into it as our _only_
| computing platform was a spine-chilling nightmare.
| Something that only a person who has no idea what data
| analysts and data scientists actually do all day would
| decide to do.
| chaps wrote:
| "the API makes Pandas look good"
|
| It sparks joy in my heart whenever I see shade cast against
| pandas.
| lordgroff wrote:
| Every time I open up pandas I jealously remember the
| expressive beauty of R for these tasks. But because we're all
| "serious" of course we must use Python for production lest we
| not be serious.
| laichzeit0 wrote:
| To be fair, taking R to production is a goddamn nightmare.
| chaxor wrote:
| R is a trash of a language. It doesn't have any sense of
| coherency to it at all. They keep trying to fix the
| underlying problems by duct-taping paradigms onto it over
| and over (S3, S4, R6, etc). There's never a clear sense
| of the best way to do anything, but plenty of options to
| do a thing in a very hacky 'script-kiddy' way. Looking
| out at the community of different projects it becomes
| clear that everyone is pretty lost as to what design
| principles should be used for certain tasks, so every
| repo has its own way of doing things (I know personal
| style occurs in other languages, but commonalities are
| much less recognizable in R projects). It's tragic that
| such a large community uses it.
| jmt_ wrote:
| Trash language is a bit harsh. I'm not sure I would try
| to put an R project into production or build a huge
| project with it but, at the very least, R/R Studio was
| the best scientific calculator I've ever used. Was
| particularly great during college
| lordgroff wrote:
| Yep, this is a mark of someone that's never used R but
| has heard a lot of incredibly ill informed criticism
| around it.
|
| One look at dplyr code over pandas would of course
| disabuse anyone of the notion that R is trash and the
| tragedy is Python will in the current state never have
| anything like that. That's the advantage of the language
| being influenced by Lisp vs not.
| tomrod wrote:
| I've heavily used R several times.
|
| I agree that it is a trash language and that, aside from the
| many frontier academic ideas available in it and some solidly
| prescriptive plotting preferences, it should be thrown into
| the trash bin.
|
| Python, Julia when it gets its druthers for TTFP, Octave,
| Fortran, C, and eventually Rust. These are the tools I've
| found in use over and over and over again across
| business, government, and non-profits.
|
| Everywhere R is used by the org I have seen major gaps in
| capacity to deliver specifically because R doesn't scale
| well.
| nerdponx wrote:
| Try to separate the language from its standard library.
| Neither one is "trash".
|
| I agree that the standard library is what you might call
| "a chaotic disorganized mess".
| tomrod wrote:
| I'm not emotionally invested in tools so am happy to
| identify the user experience and operational experience
| as "trash."
|
| "Trash", despite its connotations of lacking value, is
| really just a chaotic disorganized mess of something made
| by artifice with dubious reclaim/reuse/recycle value.
| Being a subjective assessment, it is natural that one
| person's trash is a treasure to another.
| nerdponx wrote:
| I take issue with your implication that I'm emotionally
| invested in something when I shouldn't be. You are free
| to dislike R and not use it, but to claim that it's
| "trash" is to wrongly disavow its usefulness for the many
| people that do find it useful, and to cast aspersions on
| the judgement of all those people.
| whatever1 wrote:
| I have never seen a worse documented library. Initially I
| thought that they were lazy, now I realize that it cannot be
| documented because it is a total mess of a library held
| together with tape.
|
| Close second is the plotly library.
| nerdponx wrote:
| The Pandas documentation has improved quite a bit. Last I
| checked, the only part of the reference docs with a big gap
| was the description of "extension arrays" and accessors.
|
| The _user guide_ material absolutely needs work, and the
| examples in the reference docs tend to be a little
| contrived. But I absolutely have seen worse-documented
| libraries, such as Gunicorn and Pydantic.
| claytonjy wrote:
| I'm surprised to see Pydantic in here; I've used Pandas
| and Pydantic both quite a lot, and have found the
| Pydantic docs to be quite good! Also a much smaller
| library with a saner API, and thus easier to document
| well.
| __mharrison__ wrote:
| Genuinely curious what you have against the Pandas
| documentation. It has some of the best docstrings I've
| seen.
|
| (I also wrote a Pandas book or two... So there's that)
| chaps wrote:
| Docstrings are one thing, but functionality discovery,
| picking up from scratch, troubleshooting, etc are... not
| fun, nor easy with the documentation. If you know it well
| already and use it a lot it's easier to forgive its
documentation faults since you can wave off the problems
| as "that's just learning something new".
|
| But for a lot of people who use it infrequently its
| documentation is a frustrating mess. Simple problems turn
| into significant time sinks of trying to find which page
| of the documentation to look at.
|
| A lot of issues are made worse by shit-awful interop
between libraries that claim to fully support dataframes,
| but often fail in non-obvious ways... meaning back to the
| documentation mines.
|
I'd argue that the existence of a market for a single author
to write two books about it is itself indicative of
documentation problems.
| 333luke wrote:
| What makes the documentation so bad in your opinion? I'm
| not arguing but curious since I use pandas all day at my
| job and can't think of any times the docs weren't clear to
| me. (Plotly I have had some annoying times with!)
| bobertlo wrote:
| I think the R docs are the intended reference material for
| pandas ;)
| dekhn wrote:
| What bothers me the most is the egregious data types for any
| argument. If it's a string, do this. If it's a list, do that.
| If it's a dictionary of lists, do this other thing.
|
| No, I want you to force me to provide my data in the right
| way and raise a noisy exception if I don't.
| nerdponx wrote:
| Series and DataFrame have "alternate constructors" for this
| purpose, and the loc/iloc accessors give you a bit more
| control.
|
| I agree that the magic type auto-detection is a bit too
| magical and sloppy, but you have to realize that data
| analysts and scientists have historically been incredibly
| sloppy programmers who _wanted_ as much magic as possible.
It's only in recent years that researchers have begun to
| value some amount of discipline in their research code.
| mostdataisnice wrote:
| Where does the article say that?
| isoprophlex wrote:
| About exposing the data inside MLFlow
|
| > I found the query feature extremely limiting (if my
| experiments are stored in a SQL table, why not allow me to
| query them with SQL).
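The article's quoted complaint is concrete: if runs already live in a SQL table, plain SQL should work. A minimal sketch with the stdlib sqlite3 module; the `experiments` table and its columns are hypothetical, not the article's actual schema:

```python
# Ad-hoc SQL over an experiments table; schema is illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (run_id TEXT, lr REAL, accuracy REAL)")
conn.executemany(
    "INSERT INTO experiments VALUES (?, ?, ?)",
    [("a", 0.1, 0.91), ("b", 0.01, 0.94), ("c", 0.001, 0.89)],
)

# Arbitrary query, no tracking-server API needed:
best = conn.execute(
    "SELECT run_id, accuracy FROM experiments ORDER BY accuracy DESC LIMIT 1"
).fetchone()
print(best)  # ('b', 0.94)
```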
| guangyeu wrote:
| As noted in an earlier comment, I think there is a false
| equivalence between end-to-end MLOps platforms like MLflow and
| tools for experiment tracking. The project looks like a solid
| tracking solution for individual data scientists, but it is not
| designed for collaboration among teams or organizations.
|
| > There were a few things I didn't like: it seemed too much to
| have to start a web server to look at my experiments, and I found
| the query feature extremely limiting (if my experiments are
| stored in a SQL table, why not allow me to query them with SQL).
|
| While a relational database (like sqlite) can store
| hyperparameters and metrics, it cannot scale for the many aspects
| of experiment tracking for a team/organization, from visual
| inspection of model performance results to sharing models to
| lineage tracking from experimentation to production. As noted in
| the article, you need a GUI on top of a SQL database to make
| meaningful model experimentation. The MLflow web service allows
| you to scale across your teams/organizations with interactive
| visualizations, built-in search & ranking, shareable snapshots,
| etc. You can run it across a variety of production-grade
relational DBs so users can query the data directly through the
| SQL database or through a UI that makes it easier to search for
| those not interested in using SQL.
|
| > I also found comparing the experiments limited. I rarely have a
| project where a single (or a couple of) metric(s) is enough to
| evaluate a model. It's mostly a combination of metrics and
| evaluation plots that I need to look at to assess a model.
| Furthermore, the numbers/plots themselves have no value in
| isolation; I need to benchmark them against a base model, and
| doing model comparisons at this level was pretty slow from the
| GUI.
|
| The MLflow UI allows you to compare thousands of models from the
| same page in tabular or graphical format. It renders the
| performance-related artifacts associated with a model, including
| feature importance graphs, ROC & precision-recall curves, and any
| additional information that can be expressed in image, CSV, HTML,
| or PDF format.
|
| > If you look at the script's source code, you'll see that there
| are no extra imports or calls to log the experiments, it's a
| vanilla Python script.
|
| MLflow already provides low-code solutions for MLOps, including
| autologging. After running a single line of code -
| mlflow.autolog() - every model you train across the most
| prominent ML frameworks, including but not limited to scikit-
| learn, XGBoost, TensorFlow & Keras, PySpark, LightGBM, and
| statsmodels is automatically tracked with MLflow, including all
| relevant hyperparameters, performance metrics, model files,
| software dependencies, etc. All of this information is made
| immediately available in the MLflow UI.
|
| Addendum: As noted, there is a false equivalence between an end-
| to-end MLOps lifecycle platform like MLflow and tools for
| experiment tracking. To succeed with end-to-end MLOps,
| teams/organizations also need projects to package code for
| reproducibility on any platform across many different package
| versions, deploy models in multiple environments, and a registry
| to store and manage these models - all of which is provided by
| MLflow.
|
| It is battle-tested with hundreds of developers and thousands of
| organizations using widely-adopted open source standards. I
| encourage you to chime in on the MLflow GitHub on any issues and
| PRs, too!
| czumar wrote:
| +1. I'd also like to note that it's very easy to get started
| with MLflow; our quickstart walks you through the process of
| installing the library, logging runs, and viewing the UI:
| https://mlflow.org/docs/latest/quickstart.html.
|
| We'd love to work with the author to make MLflow Tracking an
| even better experiment tracking tool and immediately benefit
| thousands of organizations and users on the platform. MLflow is
| the largest open source MLOps platform with over 500 external
| contributors actively developing the project and a maintainer
| group dedicated to making sure your contributions &
| improvements are merged quickly.
| bfung wrote:
| How about a side-by-side comparison?
|
| Far too often, these "X is bad, use my homebrew Y instead"
| articles skip the comparison to X, which doesn't help
| illustrate "why Y instead".
|
| You know... <cheeky>For science.</cheeky>
| benjaminwootton wrote:
| The elephant in the room with data is that we don't need a lot of
| the fancy and powerful technology. SQL against a relational
| database gets us extraordinarily far. Add some Python scripts
| where we need some imperative logic and glue code, and a sprinkle
| of CI/CD if we really want to professionalise the work of data
| scientists. I think this covers the vast majority of situations.
|
| Despite being around it for some time, I'm not sure big data or
| machine learning needed to be a thing for the vast majority of
| businesses.
| bob1029 wrote:
| > SQL against a relational database gets us extraordinarily
| far.
|
| I think it gets us all the way once you consider the ability to
| expose domain-specific functions to SQL that are serviced by
| your application code.
|
| I've always been of the mindset that you can do anything with
| SQL if you are clever enough.
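The idea of exposing domain-specific functions to SQL, serviced by application code, can be sketched with sqlite3's `create_function`; the `f1_score` function and `runs` table are toy examples, not any library's API:

```python
# Register an application-defined function that SQL can call.
import sqlite3

def f1(precision, recall):
    # Harmonic mean of precision and recall, rounded for display.
    return round(2 * precision * recall / (precision + recall), 4)

conn = sqlite3.connect(":memory:")
conn.create_function("f1_score", 2, f1)

conn.execute("CREATE TABLE runs (name TEXT, p REAL, r REAL)")
conn.execute("INSERT INTO runs VALUES ('baseline', 0.8, 0.6)")

row = conn.execute("SELECT name, f1_score(p, r) FROM runs").fetchone()
print(row)  # ('baseline', 0.6857)
```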
| citizenpaul wrote:
| Unless your income depends on carrying out the exact demands
| of some money guy whose most common phrase while using a
| computer is "it won't let me", and they want "big data".
|
| Then you just suck it up and build one of the totally
| unnecessary big data systems that have been excreted all over
| the business world these days. I don't think the problem is
| that devs are over-engineering.
|
| I wonder what it's called; it makes me think of the tragedy
| of the commons, but that's probably not quite right.
| morelisp wrote:
| Maybe like 20 years ago you were right but today there's a
| generation that's _been working for 10 years_ on systems
built like that. They don't know any better, and in most
| cases nobody is around to teach them otherwise.
| tomrod wrote:
Hierarchies and bureaucracies, by Jean Tirole. I know because
this was the phenomenon I wanted to study in grad school, only
to find he scooped me (on this and several items) by several
decades.
|
| Edit: Tirole, Jean. "Hierarchies and bureaucracies: On the
| role of collusion in organizations." JL Econ. & Org. 2
| (1986): 181.
| chasil wrote:
| The article mentions this workflow:
|
| "Let's now execute the script multiple times, one per set of
| parameters, and store the results in the experiments.db SQLite
| database... After finishing executing the experiments, we can
| initialize our database (experiments.db) and explore the
| results."
|
| Be warned that issuing queries while DML is in process can
| result in SQLITE_BUSY, and the default behavior is to abort the
| transaction, resulting in lost data.
|
| Setting WAL mode for greater concurrency between a writer and
| reader(s) can lead to corruption if the IPC structures are not
| visible:
|
| "To accelerate searching the WAL, SQLite creates a WAL index in
| shared memory. This improves the performance of read
| transactions, but the use of shared memory requires that all
| readers must be on the same machine [and OS instance]."
|
| If the database will not be entirely left alone during DML,
| then the busy handler must be addressed.
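The mitigations described above can be sketched in Python: switch the database to WAL mode (readers and the writer must share one machine/OS instance) and install a busy timeout so concurrent queries wait for the lock instead of failing immediately with SQLITE_BUSY. The file path here is just a scratch location:

```python
# Enable WAL and a busy handler on a file-backed SQLite database.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "experiments.db")

# `timeout` installs a busy handler that retries for up to 30 s.
conn = sqlite3.connect(path, timeout=30.0)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # 'wal'
```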
| habibur wrote:
| None of these are a problem for the workload discussed.
|
When I am working with sqlite I am most likely accessing it
| from a single machine.
|
| And in this case of ML, most likely from 1 process and by
| running multiple times in serial.
| isoprophlex wrote:
| Yeah and even if you do need to do proper big-dataset-ML... a
| SQL box and maybe something like a blob storage for large
| artifacts (S3, Azure storage account, whatever) is all you need
| as well. But if your boss bought The MLOps Experience, you
| gotta do what the cool kids are doing!
| navbaker wrote:
| I work in an environment where there are multiple tech teams
| developing models for multiple use cases on VMs and GPU clusters
| spread across our corporate intranet. Once you move beyond a
| single dev working on a model on their laptop, you absolutely
| need something that can handle not just metrics tracking, but
| making the model binaries available and providing a means to
| ensure reproducibility by the rest of the team. That's what
| MLFlow is providing for us. The API is a mess, but at least we
| didn't have to code up some bespoke in-house framework, we just
| put some engineers on task to play around with it for a few hours
| and figure out the nuances of basic interactions and deployed it.
| edublancas wrote:
| Agree. Once you have a team, you need to have a service they
| can all interact with. This release is a first step, we want to
| get the user experience right for an individual and then think
| of how to expand that to teams. Ultimately, the two things
| we're the most excited about are 1) you don't need to add any
| extra code (and it works with all libraries, not a pre-defined
| set) 2) SQL as the query language
| spicyramen_ wrote:
| cdong wrote:
| I don't get why a lot of people are calling mlflow a shitshow
when it has done so much to get data scientists out of
recording experiments via CSV. I can log models and parameters
and use the UI to track different runs. After comparisons, I
can use the registry to register different stages. If you have
other model diagnostic charts you can log them as artifacts as
well. I think mlflow v2 has auto logging included, so why all
the fuss?
| nerdponx wrote:
People tend to forget that first movers rarely have the best
design. MLFlow (and DVC) brought us out of the
| dark ages. Now we can build better tools, with the benefit of
| hindsight.
|
| Claiming that something is "broken" or "trash" when you mean "I
| don't like it" is a good way to make yourself feel big and
| smart, but it's not actually constructive.
| cameronfraser wrote:
| There are those who create and those who complain on the
| internet about tools they've used one time
| isoprophlex wrote:
| Okay that's coming across as a pretty snide remark aimed at
| me, I'll bite.
|
| Yes, I can understand why you comment that. I don't like
| blind slagging of free software either.
|
| But there are ALSO those whose day job it is, and has been
| for the last 2 years, to use a badly designed overcomplex
| horrorshow of a tool that could be replaced easily by
| something better ... if it wasn't for the lock-in effects and
| strong marketing.
|
| So I'm ventilating my frustration and at the same time
| expressing my gratitude to the person who made something
| fresh, that shows us things can be better.
|
| I can't build the replacement to MLFlow myself, but I can
| cheer people on who do, and let them know their efforts are
| sorely needed.
| phr0k wrote:
| guangyeu wrote:
| Could you provide context on why SQLite would replace MLflow?
From the standpoint of model tracking (record and query
experiments), projects (package code for reproducibility on any
platform), deployment of models in multiple environments, a
registry for storing and managing models, and now recipes (to
simplify model creation and deployment), MLflow helps with the
MLOps life cycle.
| [deleted]
| edublancas wrote:
| Fair point. MLflow has a lot of features to cover the end-to-
| end dev cycle. This SQLite tracker only covers the experiment
| tracking part.
|
| We have another project to cover the orchestration/pipelines
| aspect: https://github.com/ploomber/ploomber and we have plans
| to work on the rest of features. For now, we're focusing on
| those two.
| mostdataisnice wrote:
| SQLite is literally a backend for MLflow, so the argument being
| made really is that you should just use SQL when you can, which
| is kind of adjacent to any criticisms of MLflow
| edublancas wrote:
| Is querying the underlying SQL database officially supported in
| MLflow? Last time I used it, it wasn't documented. I took a
| look at the database and it wasn't end-user friendly.
| mostdataisnice wrote:
| As someone replied above, it's because SQL is just 1 backend
| and it's weird to expose an API that only works on 1 backend.
| Once you have many devs working together, you need a remote
| server. If you have a remote abstracted backend, it needs to
| have a unified API surface so the same client can talk to any
| backend. You might argue "This interface _should_ be SQL ",
| and to that I would say there are many file stores (like your
| local file system) that are not easy to control with SQL.
| afrnz wrote:
| You can also use mlflow locally with SQLite (https://www.mlflow.org/docs/latest/tracking.html#scenario-2-...).
| Even though I haven't tried querying the db directly ...
| frgtpsswrdlame wrote:
| Wow this looks perfect for what I need right now - just a bit of
| lightweight tracking.
| nerdponx wrote:
| DVC also fills the "lightweight tracking" niche, although it
| relies on automatically creating Git branches as its technique
| for tracking experiments. I personally find that distasteful,
| so I don't use it specifically for experiment tracking, but the
| feature is there.
|
| The company behind DVC is also building a handful of other
| related tools, e.g. https://iterative.ai/blog/iterative-studio-
| model-registry
| wxnx wrote:
| Hm, in what way do you find that DVC requires creating new
| branches for experiment tracking?
|
| I find the following workflow works well, for example:
|
| 1. Define steps depending on a `config.yml`.
|
| 2. Run an initial experiment (with an initial config) and
| commit the results.
|
| 3. Update config (preserving the alternate config and using
| symlinks from `config.yml` to various new configs if
| necessary), re-run, and commit.
|
| 4. Results are then all preserved in your git history.
| shcheklein wrote:
It doesn't require creating a branch when you iterate; it
requires creating a branch or commit if you want to share it
with the team - see it on GitHub or in Studio. But even those
lightweight iterations (https://dvc.org/doc/command-reference/exp/run)
can be shared as well via a Git server - they just won't be
visible via the UI in GH/Studio at the moment.
|
Happy to provide more details on how it's done. It's actually
a quite interesting technical thing - a custom Git namespace:
https://iterative.ai/blog/experiment-refs
| edublancas wrote:
| If you need help, you can open an issue on GitHub
| (https://github.com/ploomber/ploomber-engine) or join our
| Slack! (https://ploomber.io/community/)
| geminicoolaf wrote:
| What about BentoML?
| LeanderK wrote:
| I think MLflow is a good idea (very) badly executed. I would like
| to have a library that combines:
|
| - simple logging of (simple) metrics during and after training
|
| - simple logging of all arguments the model was created with
|
| - simple logging of a textual representation of the model
|
| - simple logging of general architecture details (number of
| parameters, regularisation hyperparameters, learning rate, number
| of epochs etc.)
|
| - and of course checkpoints
|
| - simple archiving of the model (and relevant data)
|
| and all that without much (coding) overhead and only using a
| shared filesystem (!) And with an easy notebook integration.
MLflow just has way too many unnecessary features and is
| unreliable and complicated. When it doesn't work it's so
| frustrating, it's also quite often super slow. But I always end
| up creating something like MLflow when working on an architecture
| for a long time.
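The filesystem-only tracker sketched in the wishlist above could start as little more than one JSON file per run on a shared directory, no server involved. All names here (`log_run`, `run.json`) are hypothetical:

```python
# Minimal shared-filesystem experiment logger: one directory per run,
# parameters and metrics in a JSON file next to any checkpoints.
import json
import os
import tempfile
import time

def log_run(base_dir, params, metrics):
    run_id = time.strftime("%Y%m%d-%H%M%S")
    run_dir = os.path.join(base_dir, run_id)
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "run.json"), "w") as f:
        json.dump({"params": params, "metrics": metrics}, f, indent=2)
    return run_dir  # checkpoints/artifacts can be copied alongside run.json

base = tempfile.mkdtemp()
run_dir = log_run(base, {"lr": 0.01, "epochs": 10}, {"val_loss": 0.31})
print(os.path.exists(os.path.join(run_dir, "run.json")))  # True
```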
|
EDIT: having written this... I feel like trying to write my own
| simple library after finishing the paper. A few ideas have
| already accumulated in my notes that would make my life easier.
|
| EDIT2: I actually remember trying to use SQLite to manage my
| models! But the server I worked on was locked down and going
through the process to get somebody to install SQLite for me
was just not worth it. It also wasn't available on the cluster
for big
| experiments, where it would be even more work to get it, so I
| gave up on the idea of trying SQLite.
| Fiahil wrote:
| > I think MLFlow is a good idea (very) badly executed.
|
Oh yes, I'm glad to see others with a similar opinion.
| pletnes wrote:
| Sqlite is in python's stdlib, so how can this be an issue? Was
| there no local filesystem whatsoever?
| tekknolagi wrote:
| sqlite bindings are in the stdlib but not the library itself.
| imachine1980_ wrote:
I'm asking from ignorance: what's the difference, in this
context, of not having the library itself?
| funklute wrote:
| Using the bindings is only possible if the library itself
| is already installed (since the bindings directly make
| use of the library, under the hood).
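The bindings-versus-library distinction is visible from Python itself: the stdlib `sqlite3` module reports the version of the underlying C library it was built against:

```python
# The stdlib module is a set of bindings; `sqlite_version` reports
# the underlying C library those bindings link against.
import sqlite3

print(sqlite3.sqlite_version)  # e.g. '3.39.4', the C library's version
```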
| edublancas wrote:
| I'm happy to collaborate with you, let's build the best
| experiment tracker out there! Feel free to ping me at
| eduardo@ploomber.io
| smehta73 wrote:
Have you used Comet? It basically does everything you are
asking for and is a lot more user-friendly than MLFlow
| nerdponx wrote:
| Isn't Comet a proprietary SaaS? I like MLFlow because I can
| run it on my own computer if I want to.
| tomrod wrote:
| Check out flyte and union.ml. No personal affiliation, just
| good projects in the vein of
| airflow/prefect/mlflow/kubeflow
| YetAnotherNick wrote:
I really like guild.ai. The best thing is that their
developers assumed people are lazy: it automatically makes
flags for global variables and tracks them.
___________________________________________________________________
(page generated 2022-11-16 23:00 UTC)