[HN Gopher] Launch HN: DAGWorks - ML platform for data science t...
___________________________________________________________________
Launch HN: DAGWorks - ML platform for data science teams
Hey HN! We're Stefan and Elijah, co-founders of DAGWorks
(https://www.dagworks.io). We're on a mission to eliminate the
insane inefficiency of building and maintaining ML pipelines in
production. DAGWorks is based on Hamilton, an open-source project
that we created and recently forked
(https://github.com/dagworks-inc/hamilton). Hamilton is a set of
high-level conventions for
Python functions that can be automatically converted into working
ETL pipelines. To that, we're adding a closed-source offering that
goes a step further, plugging these functions into a wide array of
production ML stacks. ML pipelines consist of computational steps
(code + data) that produce a working statistical model that a
business can use. A typical pipeline might be (1) pull raw data
(Extract), (2) transform that data into inputs for the model
(Transform), (3) define a statistical model (Transform), (4) use
that statistical model to predict on another data set (Transform)
and (5) push that data for downstream use (Load). Instead of
"pipeline" you might hear people call this "workflow", "ETL"
(Extract-Transform-Load), and so on. Maintaining these in
production is insanely inefficient because you need both data
scientists and software engineers to do it. Data scientists know
the models and data, but most can't write the code needed to get
things working in production infrastructure--for example, a lot of
mid-size companies out there use Snowflake to store data,
Pandas/Spark to transform it, and something like Databricks' MLflow
to handle model serving. Engineers can handle the latter, but
mostly aren't experts in the ML stuff. It's a classic impedance
mismatch, with all the horror stories you'd expect--e.g. when data
scientists make a change, engineers (or data scientists who aren't
engineers) have to manually propagate the change in production.
We've talked to teams who are spending as much as 50% of their time
doing this. That's not just expensive, it's gruntwork--those
engineers should be working on something else! Basically,
maintaining ML pipelines over time sucks for most teams. One way
out is to hire people who combine both skills, i.e. data scientists
who can also write production code. But these are rare and
expensive, and in our experience they usually are only expert at
one side of the equation and not as good at the other. The other
way is to build your own platform to automatically integrate models
+ data into your production stack. That way the data scientists can
maintain their own work without needing to hand things off to
engineers. However, most companies can't afford to make this
investment, and even for the ones that can, such in-house layers
tend to end up in spaghetti code and tech debt hell, because
they're not the company's core product. Elijah and I have been
building data and ML tooling for the last 7 years, most recently at
Stitch Fix, where we built a ML platform that served over 100 data
scientists from various modeling disciplines (some of our blog
posts, like [1], hit the front page of HN - thanks!). We saw first
hand the issues teams encountered with ML pipelines. Most
companies running ML in production need a ratio of 1:1 or 2:1 data
scientists to engineers. At bigger companies like Stitch Fix, the
ratio is more like 10:1--way more efficient--because they can
afford to build the kind of platform described above. With
DAGWorks, we want to bring the power of an intuitive ML Pipeline
platform to all data science teams, so a ratio of 1:1 is no longer
required. A junior data scientist should be able to easily and
safely write production code without deep knowledge of underlying
infrastructure. We decided to build our startup around Hamilton,
in large part due to the reception that it got here [2] - thanks
HN! We came up with Hamilton while we were at Stitch Fix (note: if
you start an open-source project at an employer, we recommend
forking it right away when you start a company. We only just did
that and left behind ~900 stars...). We are betting on it being our
abstraction layer to enable our vision of how to go about building
and maintaining ML pipelines, given what we learned at Stitch Fix.
We believe a solution has to have an open source component to be
successful (we invite you to check out the code). As for the name
DAGWorks: we named the company after Directed Acyclic
Graphs because we think the DAG representation, which Hamilton also
provides, is key. A quick primer on Hamilton. With Hamilton we use
a new paradigm in Python (well not quite "new" as pytest fixtures
use this approach) for defining model pipelines. Users write
declarative functions instead of writing procedural code. For
example, rather than writing the following pandas code:

    df['col_c'] = df['col_a'] + df['col_b']

You would write:

    def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
        """Creating column c from summing column a and column b."""
        return col_a + col_b

Then if you wanted to create a new column that used `col_c` you would
write:

    def col_d(col_c: pd.Series) -> pd.Series:
        # logic
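To make the wiring concrete, here is a minimal, self-contained sketch of the idea: a resolver that recovers the DAG purely from function and parameter names and computes any requested output. This is an illustration only, not Hamilton's actual implementation, and plain floats stand in for pd.Series so it runs anywhere:

```python
import inspect

# Toy illustration of the paradigm: the dependency graph is recovered
# purely from function names and parameter names.

def col_c(col_a: float, col_b: float) -> float:
    """Column c is the sum of columns a and b."""
    return col_a + col_b

def col_d(col_c: float) -> float:
    """Column d doubles column c."""
    return col_c * 2

def execute(funcs, inputs, output):
    """Resolve `output` by recursively computing dependencies by name."""
    if output in inputs:
        return inputs[output]
    fn = funcs[output]
    kwargs = {p: execute(funcs, inputs, p)
              for p in inspect.signature(fn).parameters}
    return fn(**kwargs)

funcs = {f.__name__: f for f in (col_c, col_d)}
print(execute(funcs, {"col_a": 1.0, "col_b": 2.0}, "col_d"))  # prints 6.0
```

Hamilton's real driver builds the whole graph up front and adds type checking, visualization, and so on, but the name-based wiring shown here is the core convention.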
These functions then define a "dataflow" or a directed acyclic
graph (DAG), i.e. we can create a "graph" with nodes: col_a, col_b,
col_c, and col_d, and connect them with edges to know the order in
which to call the functions to compute any result. Since you're
forced to write functions, everything becomes unit testable and
documentation friendly, with the ability to display lineage. You
can kind of think of Hamilton as "DBT for python functions", if you
know what DBT is. Have we piqued your interest? Want to go play
with Hamilton? We created https://www.tryhamilton.dev/ leveraging
pyodide (note it can take a while to load) so you can play around
with the basics without leaving your browser - it even works on
mobile! What we think is cool about Hamilton is that you don't
need to specify an "explicit pipeline declaration step", because
it's all encoded in the function and parameter names! Moreover,
everything is encapsulated in functions. So from a framework
perspective, if we wanted to (for example) log timing information,
or introspect inputs/outputs, delegate the function to Dask or Ray,
we can inject that at a framework level, without having to pollute
user code. Additionally, we can expose "decorators" (e.g.
@tag(...)) that can specify extra metadata to annotate the DAG
with, or for use at run time. This is where our DAGWorks Platform
fits in, providing off-the-shelf closed source extras in this way.
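As a sketch of why that layering works, here is a toy version of the mechanism, with simplified stand-ins for decorators like @tag rather than Hamilton's real implementation: metadata attaches to the function object, and the framework wraps nodes with cross-cutting concerns like timing without touching user code.

```python
import functools
import time

# Hypothetical sketch, not Hamilton's real internals: metadata lives
# on the function, and cross-cutting concerns (here, timing) are
# wrapped around nodes at the framework level.

def tag(**metadata):
    """Attach metadata to a function for the framework to read later."""
    def decorator(fn):
        fn.__tags__ = metadata
        return fn
    return decorator

def timed(fn):
    """Framework-level wrapper: user code stays unpolluted."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    return wrapper

@tag(owner="data-science", pii="false")
def col_c(col_a: float, col_b: float) -> float:
    return col_a + col_b

node = timed(col_c)  # the framework, not the user, applies the wrapper
assert node(1.0, 2.0) == 3.0
assert col_c.__tags__ == {"owner": "data-science", "pii": "false"}
```

The user's function stays a plain, unit-testable function; timing, lineage capture, or delegation to Dask/Ray can all be injected the same way.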
Now, for those of you thinking there's a lot of competition in this
space, or what we're proposing sounds very similar to existing
solutions, here are some thoughts to help distinguish Hamilton from
other approaches/technology:

(1) Hamilton's core design principle is helping people write more
maintainable code; at a nuts-and-bolts level, what Hamilton replaces
is the procedural code one would otherwise write.

(2) Hamilton runs anywhere that Python runs: a notebook, a Python
script, within Airflow, within your Python web service, PySpark, etc.
E.g. people use Hamilton for executing code in batch tasks and online
web services.

(3) Hamilton doesn't replace a macro orchestration system like
Airflow, Prefect, Dagster, Metaflow, ZenML, etc. It runs within/uses
them. Hamilton helps you not only model the micro - e.g. feature
engineering - but can also help you model the macro - e.g. model
pipelines. That said, given how big machines are these days, model
pipelines can commonly run on a single machine - Hamilton is perfect
for this.

(4) Hamilton doesn't replace things like Dask, Ray, or Spark -- it can
run on them, or delegate to them.

(5) Hamilton isn't just for building dataframes, though it's quite
good for that; you can model any Python object creation with it.
Hamilton is data type agnostic.

Our closed
source offering is currently in private beta, but we'd love to
include you in it (see next paragraph). Hamilton is free to use
(BSD-3 license) and we're investing in it heavily. We're still
working through pricing options for the closed source platform; we
think we'll follow the lead of others in the space, like Weights &
Biases and Hex.tech, in how they price. For those interested,
here's a video walkthrough of Hamilton, which includes a teaser of
what we're building on the closed source side -
https://www.loom.com/share/5d30a96b3261490d91713a18ab27d3b7.
Lastly, (1) we'd love feedback on Hamilton
(https://github.com/dagworks-inc/hamilton) and on any of the above,
and what we could do better. To stress the importance of your
feedback, we're going all-in on Hamilton. If Hamilton fails,
DAGWorks fails. Given that Hamilton is a bit of a "swiss army
knife" in terms of what you can do with it, we need help prioritizing
features. E.g. we just released experimental PySpark UDF map
support -- is that useful? Or perhaps you have streaming feature
engineering needs where we could add better support? Or you want a
feature to auto generate unit test stubs? Or maybe you are doing a
lot of time-series forecasting and want more power features in
Hamilton to help you manage inputs to your model? We'd love to hear
from you! (2) For those interested in the closed source DAGWorks
Platform, you can sign up for early access via www.dagworks.io
(leave your email, or schedule a call with me) - we apologize for
not having a self-serve way to onboard just yet. (3) If there's
something this post hasn't answered, do ask, we'll try to give you
an answer! We look forward to any and all of your comments! [1]
https://news.ycombinator.com/item?id=29417998 [2]
https://news.ycombinator.com/item?id=29158021
Author : krawczstef
Score : 110 points
Date : 2023-03-07 16:04 UTC (6 hours ago)
| nerdponx wrote:
| Data scientist here, stuck in the Dark Ages of "deploying" my
| models by writing bespoke Python apps that run on some kind of
| cloud container host like ECS. Dump the outputs to blob storage
| and slurp them back into the data warehouse nightly using
| Airflow. Lots of manual fussing around.
|
| What the heck are all these ML and data platforms, how do they
| benefit me, and how do I evaluate the gazillion options that seem
| to be out there?
|
| For example, I recently came across DStack (https://dstack.ai/)
| and have had an open browser tab sitting around waiting for me to
| figure out WTF it even does. DAGWorks seems like it does
| something similar. Is that true? Are these tools even comparable?
| How would I choose one or the other? Is there overlap with
| MLFlow?
| cheptsov wrote:
| Hey, the founder of dstack here. To put it shortly: dstack
| allows you to define ML workflows as code and run them either
| locally or remotely (e.g. in a configured cloud). ML workflows
| here mean anything that you may want to do when you're
| developing a model - prepping data, training or finetuning a
| model, etc. The value - basically it automates running your
| workflows, without being dependant on any particular vendor. At
| the same time, you don't have to rewrite your Python scripts to
| use a particular API (because dstack is using YAML).
|
| We aim to build the easiest tool to run ML workflows -
| without making you use the UI of any vendor, or wrestle with
| Kubernetes, custom Docker images, etc.
|
| MLFlow doesn't do what dstack does (automatic infrastructure
| provisioning) - unless you use Databricks.
| krawczstef wrote:
| Nice! We had a similar abstraction at Stitch Fix.
| sampo wrote:
| > bespoke Python apps that run on some kind of cloud container
| host like ECS.
|
| For this, you could use for example https://flyte.org/, if you
| have platform engineers that could set it up for you. But I
| don't think their website makes a good job of explaining what
| it is and what it does.
| krawczstef wrote:
| yep, some of these ML platform tools only make sense once you
| reach a certain scale/set of problems. Flyte, for instance, was
| created because Lyft had a very heterogeneous set of tools one
| could build an ML pipeline with. So they made it really easy to
| integrate, and importantly serialize, data between systems, e.g.
| sql -> python -> spark -> python. But if all your data fits in
| memory, or you only use one system, you might not be a great fit
| for Flyte.
|
| What I try to do, is understand who created the platform, and
| understand the environment of that company. Then that will
| give you a better idea of whether it makes sense for you or
| not.
| elijahbenizzy wrote:
| TL;DR -- ML platforms solve for a million different problems,
| and most people don't have all (or any) of them. Hamilton is a
| pretty simple way of organizing code, so it can plug in with a
| bunch of different approaches (and that's why we think it's
| general purpose).
|
| > Data scientist here, stuck in the Dark Ages of "deploying" my
| models by writing bespoke Python apps that run on some kind of
| cloud container host like ECS. Dump the outputs to blob storage
| and slurp them back into the data warehouse nightly using
| Airflow. Lots of manual fussing around.
|
| Oh, man, been there! So, first, I want to say there are a _lot_
| of ML/data platforms -- largely because there are so many
| problems to solve, and they're not one-size-fits-all solutions.
|
| > What the heck are all these ML and data platforms, how do
| they benefit me, and how do I evaluate the gazillion options
| that seem to be out there?
|
| You probably don't need all of them, or all that many. As in
| everything, it depends on your pain-points. Given that you have
| a lot of manual fussing around, you probably want something to
| reduce it. We've found airflow to be painful, but a lot of dev
| teams/DS have airflow already integrated into their platform,
| so we wanted to build something that allows data scientists to
| plug into it. So for people who don't like airflow, the idea is
| that you could express your dataflow in Hamilton and DAGWorks
| can ship it to airflow (or any other orchestration system).
|
| > For example, I recently came across DStack
| (https://dstack.ai/) and have had an open browser tab sitting
| around waiting for me to figure out WTF it even does. DAGWorks
| seems like it does something similar. Is that true? Are these
| tools even comparable? How would I choose one or the other? Is
| there overlap with MLFlow?
|
| DStack is definitely a different approach, similar space.
| Hamilton is organized around python functions and dstack is
| more of a high-level workflow spec (reminds me of something we
| had at my old company). So Hamilton can model the "micro" of
| your workflow, whereas DStack models the "macro" -- managing
| artifacts. DStack could easily run Hamilton functions. What
| nothing we've found out there does (except perhaps kedro) is
| model the "micro" -- e.g. the specific fine-grained
| dependencies, so you can take a look at your code and figure out
| how exactly it works.
|
| Re: MLFlow -- DAGWorks + MLFlow are pretty natural connectors.
| Hamilton functions can produce a model that DAGWorks would be
| able to save to mlflow. DAGWorks is more on the data transform
| side, and doesn't explicitly say how to represent a model.
| joshhart wrote:
| Congrats Stefan, from someone working at a competitor - always
| good to see more tools for production ML.
| elijahbenizzy wrote:
| Thanks! Appreciate it. I'm finding this space is massive and
| there are still more problems managing ML code than there are
| good solutions for it :) So lots of room for everyone.
| krawczstef wrote:
| Thanks Josh! I actually think we're pretty complementary, less
| competitive. For example, it'd be very conceivable for users to
| use Hamilton, DAGWorks, and Databricks together!
| marsupialtail_2 wrote:
| would love to collaborate on an integration with pyquokka
| (https://github.com/marsupialtail/quokka) once I put out a stable
| release end of this month :-)
| krawczstef wrote:
| Would love! Yeah I think how we did the PySpark Map UDF support
| should enable us to do something similar with Quokka.
| ropeladder wrote:
| Congrats on the launch, guys! Hamilton was the first MLOps
| library that really seemed to fit the challenges we face, because
| it offered a more granular way to structure our code. Really
| excited to see what other tools are on the way.
| elijahbenizzy wrote:
| Thank you! Recognize your username -- two of our newer
| decorators are your design/suggestion :)
|
| https://hamilton.readthedocs.io/en/latest/reference/api-refe...
| elijahbenizzy wrote:
| [dead]
| data_ders wrote:
| yo congrats again on the launch! Anders from dbt Labs here with a
| "tough" question for you. Apologies for 1) my response being
| half-baked, and 2) if I haven't done my homework about Hamilton's
| features.
|
| coincidentally, my PR to the dbt viewpoint was closed by the docs
| team as "closed, won't do" [1]
|
| I really like the convention of data plane (where you describe
| how the data should be transformed) and the control plane (i.e.
| the configuration of the DAG, do this before this). In this
| paradigm, I believe that the control plane should be as simple as
| possible, and even perhaps limited in what can be done with the
| goal of pushing the user to take data transformation as
| tantamount. Maybe this is why I fell in love with dbt in the
| first place: it does exactly this.
|
| "spicy" take: allowing users to write imperative code (e.g. using
| loops) that dynamically generates DAGs is never a good idea. I
| say this as someone who personally used to pester framework PMs
| for this exact feature before. While things like task groups
| (formerly subDAGs) [2] appear initially to be the right answer, I
| always ended up regretting them. They're a
| scheduling/orchestration solution to a data transformation
| problem.
|
| Can y'all speak to how Hamilton views the data and control plane,
| and how its design philosophy encourages users to use the right
| tool for the job?
|
| p.s. thanks for humoring my pedantry and merging this! [3]
|
| [1]: https://github.com/dbt-labs/docs.getdbt.com/pull/2390 [2]:
| http://apache-airflow-docs.s3-website.eu-central-1.amazonaws...
| [3]: https://github.com/DAGWorks-Inc/hamilton/pull/105
| krawczstef wrote:
| Great questions!
|
| > "spicy" take: allowing users to write imperative code (e.g.
| using loops) that dynamically generates DAGs is never a good
| idea.
|
| Can you give some examples of when it was a bad idea? Otherwise
| to clarify, with Hamilton, there is no dynamism at runtime.
| When the DAG is generated, it's generated and it's fixed. The
| operator we have for doing this `@parameterize` requires
| everything to be known at DAG construction time. It's really
| just short hand for manually writing out all the functions. So
| I don't think it's quite the same story - it is more a "power
| user feature" - but when used, it makes code DRY-er, at the
| cost of some code readability.
|
| > While things like task groups (formerly subDAGs) [2] appear
| initially to be right answer, I always ended up regretting
| them. They're a scheduling/orchestration solution to a data
| transformation problem
|
| Yep. Hamilton has a concept of `subdag` too. It's really
| shorthand for "chaining" Hamilton drivers. We take the latter
| approach (I believe) since it is there to help you more easily
| reuse parts of your DAG with different parameterizations. Since
| Hamilton isn't concerned with materialization boundaries, we
| don't have to make a decision here, so how it impacts
| scheduling/orchestration can be punted to a later time :)
|
| > Can y'all speak to how Hamilton views the data and control
| plane,
|
| Hamilton is just a library. There is no DB that needs to be run
| to use Hamilton. The only state required is your code. So at
| the simplest micro-level, Hamilton sits within a task, e.g.
| creating features, and replaces the python script you'd run
| there. So at this level, I'd argue data plane vs control plane
| doesn't really apply, unless you view code as the control plane
| and where it runs as the data plane... At the macro-level, e.g.
| a model pipeline pulling data, transforming it, fitting a
| model, etc., you can logically describe the dataflow with
| Hamilton, without breaking it up into computational tasks a
| priori. I'd say Hamilton here tries to be agnostic and provide
| the hooks you need to help you coordinate your control plane
| and facilitate operation on your data plane. Note: this is
| where we see DAGWorks coming in to provide more
| functionality. E.g. with Hamilton you don't need to decide
| whether everything runs in a single task, say on airflow, or in
| multiple. It's up to you to make that decision. The beauty of
| this is that, conceptually, changing what is in a task is
| really just boilerplate, given all the information you have
| already encoded into your Hamilton DAG.
|
| > and how it's design philosophy encourages users to use the
| right tool for the job?
|
| With Hamilton, we believe python UDFs are the ultimate user
| interface. By using Hamilton we force you to chunk logic and
| integrations into functions. We also provide ways to
| "decorate" function logic which gives the ability to inject
| logic around the running of said functions. So we're really
| quite agnostic to the tool, but want to provide the hooks to be
| able to easily and cleanly add, adjust, or remove them. For
| example, to switch between Ray and Dask, our philosophy is that
| ideally you can write code that is agnostic to knowing about
| the implementation. Then at runtime add those concerns in. As
| another example, the ability to switch/change say observability
| vendors, should not force a large refactor on your code base.
| We have an extensible `@check_output` decorator that should
| constrain how much you "leak" from the underlying tools. In
| short: (1) write functions that don't leak implementation
| details; they should be limited to just expressing logic;
| (2) the Hamilton framework should have the
| hooks required for you to plug in "tool" concerns. Does that
| make sense? Happy to elaborate more.
| jdonaldson wrote:
| Can this be set up to yield data from individual functions
| instead of simply returning it?
| elijahbenizzy wrote:
| Thanks for the question! Clarification -- by "yield" do you
| mean "return a single function's result" or "use a function
| that's a generator"?
|
| In the former case, yes, it's pretty easy:
|     driver.execute(final_vars=["my_func_1", "my_func_2"])
|
| will just give out the specific ones you want. In the latter
| case, it's a little trickier but doable -- we were just going
| over this with a user recently actually!
| https://github.com/DAGWorks-Inc/hamilton/issues/90
| sampo wrote:
| > Most companies running ML in production need a ratio of 1:1 or
| 1:2 data scientists to engineers. At bigger companies like Stitch
| Fix, the ratio is more like 1:10 -- way more efficient
|
| Did you write these wrong way round, maybe? Or are you saying a
| ratio of 1 data scientist to 10 engineers is efficient?
| dang wrote:
| Good catch--thanks! Fixed above.
| tartakovsky wrote:
| Amazing.
| krawczstef wrote:
| ah yep -- good catch -- d'oh -- and it's too late to update.
|
| But yes, the data scientists vs engineers should be flipped.
| cbb330 wrote:
| How can I convince someone to try this, if they are comparing
| this solution with dbt? Can only pick one :)
| elijahbenizzy wrote:
| First, worth jumping on tryhamilton.dev -- we made it to make
| it easy to get started.
|
| And, as always, it depends on your use case! If it's all
| analytics and DWH manipulations -- that's not Hamilton's
| expertise. If you want to do any DS work (model training,
| feature engineering, fine-grained lineage, etc...) and need to
| work in python, Hamilton is a great choice!
|
| Otherwise, Hamilton is a library - we do have an example of
| integrating with dbt:
|
| https://towardsdatascience.com/hamilton-dbt-in-5-minutes-62e...
| ZeroCool2u wrote:
| Any thoughts on how DAGWorks compares to something like Domino
| Datalab[1]?
|
| 1:
| https://docs.dominodatalab.com/en/latest/user_guide/bc1c6d/s...
| elijahbenizzy wrote:
| Yeah! So I'm not an expert with domino, but looking at it, it
| serves a slightly different purpose.
|
| DAGWorks is an opinionated way to write code that allows you to
| abstract it away from the infrastructure, whereas Domino is
| more about making it easy to deal with infrastructure, manage
| datasets, etc... It also has a large notebook/development focus.
| Nothing in it is built to make it natural for the code to live for
| a while and keep it well-maintained, which is the problem DAGWorks
| is trying to solve.
|
| The idea is that we can allow you to plug into whatever
| infrastructure you want and not have to think about it too much
| when you want to switch, although we're still building pieces
| of that out.
| ericcolson wrote:
| love the transparency this brings. Any 3rd party tools with plans
| to integrate with it? (e.g. analytics layer companies?)
| elijahbenizzy wrote:
| Thanks! Lots of plans in the works. Two directions we're
| moving:
|
| 1. Integrate with orchestration systems (e.g. run Hamilton
| pipelines on different orchestration platforms). You could
| imagine compiling to airflow pipelines, running hamilton on
| metaflow, compiling to a vertex pipeline etc...
|
| 2. Adapters to load data from/save data to external providers.
| e.g. logging data quality to whylogs, loading data up from
| snowflake, saving a model to mlflow...
|
| We're actively working on building this out though -- have some
| partners who are helping build out use-cases!
| rubenfiszel wrote:
| Ola, I'm the founder of Windmill [1], which is an OSS
| orchestration platform for scripts, including Python, and I'm
| pretty sure one could compile a Hamilton DAG to a Windmill
| OpenFlow very simply (we use a mounted folder in `./shared`
| to share heavy data across steps). Right now, one can do ETL
| using polars on Windmill, but I'd love to have an even more
| structured code framework to recommend.
|
| We should chat!
|
| [1]: https://github.com/windmill-labs/windmill
| krawczstef wrote:
| Nice! Yep that's part of our vision to "compile onto", i.e.
| generate code for, frameworks such as yours.
|
| Happy to sketch something out - want to join our slack to
| chat https://join.slack.com/t/hamilton-
| opensource/shared_invite/z...?
| aldanor wrote:
| > With Hamilton we use a new paradigm in Python (well not quite
| "new" as pytest fixtures use this approach) for defining model
| pipelines. Users write declarative functions instead of writing
| procedural code. For example, rather than writing the following
| pandas code
|
| > These functions then define a "dataflow" or a directed acyclic
| graph (DAG), i.e. we can create a "graph" with nodes: col_a,
| col_b, col_c, and col_d, and connect them with edges to know the
| order in which to call the functions to compute any result.
|
| This 'new paradigm' already exists in Polars. Within the scope of
| a local machine, you can write declarative expressions which can
| then be used pretty much anywhere for querying instead of the
| usual arrays and series (arguments to
| filter/apply/groupby/agg/select etc), allowing it to build an
| execution graph for each query, optimise it and parallelise it,
| and try to only run through the data once if possible without
| cloning. E.g. the example above can be written simply as:
|
|     col_c = (pl.col('a') + pl.col('b')).alias('c')
|
| It is obviously restricted to what is supported in polars, but a
| surprising amount of the typical data munging can be done with
| incredible efficiency, both cpu and ram wise.
| nerdponx wrote:
| As far as I know, Polars inherited this idea from (Py)Spark,
| where it was intended more or less as a port of SQL. And it's
| not so different from how ORMs usually look and feel.
|
| I think this design is a local maximum for languages that don't
| have first-class symbols and/or macros like R and Julia. I like
| to see convergence in this space.
|
| It's also interesting because this style of API is portable
| more or less unchanged to just about any programming language,
| from C# to Idris.
| krawczstef wrote:
| > It's also interesting because this style of API is portable
| more or less unchanged to just about any programming
| language, from C# to Idris.
|
| Yep I think a declarative syntax is quite portable and can be
| reimplemented easily in other languages.
|
| On the portable note, where by portable we mean swapping
| dataframe implementations, it's even conceivable to write
| "agnostic" logic with Hamilton and then at runtime inject the
| right "objects" that then do the right thing at runtime. E.g.
| the following is polars specific:
|
|     col_c = (pl.col('a') + pl.col('b')).alias('c')
|
| I think with Hamilton you could be more agnostic and enable
| it to run on both Pandas and Polars -- with TYPE here a
| placeholder to indicate something more generic...
|
|     def col_c(a: TYPE, b: TYPE) -> TYPE:
|         return a + b
|
| So at runtime you'd instantiate in your Driver some directive
| to say whether you're operating on pandas or with polars (or
| at least that's what I imagine in my head) and the framework
| would take care of the rest...
| elijahbenizzy wrote:
| Yeah! So we actually have an integration with polars. See
| https://github.com/DAGWorks-
| Inc/hamilton/blob/5c8e564d19ff23....
|
| To be clear, the specific paradigm we're referring to is this
| way of writing transforms as functions where the parameter name
| is the upstream dependency -- not the notion of delayed
| execution.
|
| I think there are two different concepts here though:
|
| 1. How the transforms are executed
|
| 2. How the transforms are organized
|
| Hamilton cares about (2) and delegates to Polars/pandas for
| (1). The problem we're trying to solve is the code getting
| messy and transforms being poorly documented/hard to own --
| Hamilton isn't going to solve the problem of optimizing compute
| as tooling like polars, pandas, and pyspark can handle that
| quite well.
| krawczstef wrote:
| Yep, we'd love more feedback on how to make the declarative
| syntax with Polars more natural with Hamilton so you can get
| the benefits of unit testing, documentation, visualization,
| swapping out implementations easily, etc.
| [deleted]
| sandGorgon wrote:
| genuine question - do Polars and Duckdb overlap in the problem
| space ?
| krawczstef wrote:
| I'll let someone with more polars & duckdb experience
| weigh in.
|
| But in short yes. Especially if you take the perspective
| they're both trying to help you do operations over tabular
| data, where the result is also something tabular.
|
| Duckdb is "A Modern Modular and Extensible Database System"
| (https://www.semanticscholar.org/paper/DuckDB-A-Modern-
| Modula...). So it has a bit more to it than polars, as it has
| a lot of extensibility, for example, you can give it a pandas
| dataframe and it'll operate over it, and in some cases,
| faster than pandas itself.
|
| But otherwise, at a high level, yes, you could probably replace
| one with the other in most instances, though not for
| everything.
| sidlls wrote:
| In my experience building the pipeline and related infrastructure
| is not trivial, but it's also a relatively tiny problem compared
| to, well, everything else. That is, acquiring and moving data
| around, managing the data over the lifetime of a model's use, and
| serving adjacent needs (e.g. post-deployment analytics). How does
| DAGWorks help with all the rest of this stuff?
| krawczstef wrote:
| To give you another take on Elijah's answer:
|
| > acquiring and moving data around,
|
| Yep -- with Hamilton we provide the ability to cleanly
| separate out the bits of logic that are likely to change, and
| to update them independently. For example, you'd write "data
| loader functions/modules" that are implementations for, say,
| reading from a DB, a flat file, or some vendor. If they
| output a standardized data structure, then the rest of your
| workflow is coupled not to the implementation, but to the
| common structure that Hamilton forces you to define. That way
| you can be pretty surgical with changes and understand their
| impacts.
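|
| As a rough sketch of that separation (function, column, and
| parameter names below are hypothetical, not a real module):

```python
import pandas as pd

# Loader: the only function that knows where the data lives. Swap the
# body (pd.read_sql, an API call, ...) without touching downstream code.
def raw_sales(sales_path: str) -> pd.DataFrame:
    return pd.read_csv(sales_path)

# Transform: in Hamilton's convention, the parameter name "raw_sales"
# declares a dependency on the function above. It's coupled only to the
# standardized columns, not to how the data was loaded.
def sales_per_city(raw_sales: pd.DataFrame) -> pd.Series:
    return raw_sales.groupby("city")["sales"].sum()
```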
|
| Regarding assessing impacts, Hamilton provides the ability to
| "visualize" and query for lineage as defined by your Hamilton
| functions. We think that with Hamilton we can make the "hey
| what does this impact?" question really easy to answer, so that
| when you do need to make changes you'll have more confidence in
| doing them.
|
| > managing the data over the lifetime of a model's use,
|
| Hamilton isn't opinionated about where data is stored. But if
| you define the flow of computation with Hamilton and version
| it with a version control system like git, then the only
| additional thing to track is the configuration your Hamilton
| code was run with. Associate those two with the produced
| materialized data/artifact (i.e. git SHA + config +
| materialized artifact) and you have a good base for asking
| and answering queries about what data was used when and
| where. Rather than bringing in 3rd-party systems, we think
| there's a lot you can leverage with Hamilton here.
|
| For example, we have users looking at Hamilton to help answer
| governance concerns with the models they produce.
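|
| A minimal sketch of that bookkeeping (the helper and field
| names are made up; Hamilton doesn't mandate this shape):

```python
import hashlib

def lineage_record(git_sha: str, config: dict, artifact: bytes) -> dict:
    """Tie a materialized artifact back to the exact code version and
    configuration that produced it."""
    return {
        "git_sha": git_sha,      # e.g. the output of `git rev-parse HEAD`
        "config": config,        # the config the Hamilton code ran with
        "artifact_md5": hashlib.md5(artifact).hexdigest(),
    }
```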
|
| > and serving adjacent needs (e.g. post-deployment analytics).
|
| If it's offline, then you can model and run that with
| Hamilton. The idea is to provide integrations with whatever
| MLOps system you use here, to make it easy to swap out.
|
| For online, e.g. a web service, you could model the dataflow
| with Hamilton, and then build your own custom "compilation"
| step to take the Hamilton DAG and project it onto a topology.
| During the projection, you could insert whatever monitoring
| concerns you'd want. So, just to say, this part isn't
| straightforward right now, but there is a path to addressing
| it.
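|
| To illustrate the "insert monitoring during projection" idea
| (a toy sketch with a made-up node, not DAGWorks' actual
| mechanism):

```python
import functools
import time

def with_monitoring(fn):
    """Wrap a node function with timing; a projection step could apply
    this to every node before deploying it behind a web service."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.perf_counter() - start:.6f}s")
        return result
    return wrapper

@with_monitoring
def predicted_score(features: dict) -> float:
    # hypothetical model node
    return 0.5 * features["x"] + 1.0
```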
| vtuulos wrote:
| Congrats on the launch, Stefan and Elijah! :)
|
| Like Stefan mentioned in the OP, Hamilton works well with
| tools like Metaflow, which can help with many of the other
| concerns you mentioned. How you define your data
| transformations for ML is an open question that Hamilton
| addresses neatly.
|
| See here for an example of Metaflow+Hamilton in action:
| https://outerbounds.com/blog/developing-scalable-feature-eng...
| elijahbenizzy wrote:
| Thanks! Yeah -- love the example we built with the metaflow
| team. I even think Savin spoke right before me at PyData NYC
| in the same room!
| elijahbenizzy wrote:
| Thanks for your question! Good points -- those are not easy
| problems! I'd be really curious about what the community
| thinks, but here are my opinions:
|
| While Hamilton/DAGWorks is mainly for expressing the
| pipeline/abstracting away the infrastructure, DAGWorks can help
| make the model lifecycle easy as well:
|
| - Hamilton pipelines can be run anywhere, including in an
| online setting
|
| - Breaking pipelines into functions can make them modular and
| easy to annotate for gathering post-hoc analysis
|
| - Hamilton pipelines specify the data movement in code --
| providing a source of truth
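|
| As a toy illustration of "the pipeline as a source of truth"
| -- a hand-rolled resolver standing in for Hamilton's actual
| driver, with made-up node names:

```python
import inspect

# In Hamilton's convention, parameter names wire up the DAG:
# "doubled" depends on "raw", and "final" depends on "doubled".
def raw(x: int) -> int:
    return x

def doubled(raw: int) -> int:
    return raw * 2

def final(doubled: int) -> int:
    return doubled + 1

def execute(target, funcs, inputs):
    """Tiny recursive resolver: look each parameter up as either a
    provided input or another function's output."""
    if target in inputs:
        return inputs[target]
    fn = funcs[target]
    kwargs = {name: execute(name, funcs, inputs)
              for name in inspect.signature(fn).parameters}
    return fn(**kwargs)

funcs = {f.__name__: f for f in (raw, doubled, final)}
```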
|
| That said, I think the ecosystem for doing this is _much_
| cleaner/easier to manage than it used to be -- the MLOps
| stack is far more sophisticated. Scalable/reliable compute
| (spark, Modin), easy storage (snowflake, new feature store
| technology), and more model-experiment-as-a-service type
| systems (mlflow, model-db, etc...) have made these less of a
| difficult problem than in the past. As these permeate the
| industry and we develop more standards, I think the problem
| gets pushed up a level -- rather than figuring out exactly
| _how_ to solve these, the difficult part is looping together
| a bunch of systems that all do it fairly well but (a) require
| significant expertise to manage and (b) often result in code
| that's super coupled to the systems themselves (making it
| hard to test). DAGWorks wants to decouple these from the
| pipeline code, enabling you to choose which systems you
| delegate to without having to worry about it.
|
| Furthermore, we think that smaller pipelines are actually
| super underserved in the ML/data science community -- e.g.
| pipelines that don't have a lot of the "moving data around"
| problems and can be run on a single machine. I've seen these
| suffer from getting too complex/being difficult to manage,
| and we think Hamilton can solve this out of the box.
|
| Thoughts?
___________________________________________________________________
(page generated 2023-03-07 23:00 UTC)