[HN Gopher] Hamilton: A Microframework for Dataframe Generation
___________________________________________________________________
Hamilton: A Microframework for Dataframe Generation
Author : gammarator
Score : 44 points
Date : 2021-12-02 16:11 UTC (6 hours ago)
(HTM) web link (multithreaded.stitchfix.com)
(TXT) w3m dump (multithreaded.stitchfix.com)
| __mharrison__ wrote:
| I would love to see more real-life examples. My take is that most
| people don't know about the .assign method [0] or severely
| underutilize it.
|
| 0 - https://www.metasnake.com/blog/pydata-assign.html
| pvitz wrote:
| I don't understand the advantage of the assign-method in the
| article you have linked to. If I would like to alter data in a
| column of a large dataframe, why should I use the assign-method
| (and copy data) instead of the mentioned loc-method and just
| overwrite the data (without a copy)? I am assuming, of course,
| that I will not need the original dataframe anymore.
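The trade-off raised in this question can be made concrete with a small sketch. The frame and the `price` column are made up for illustration. `assign` returns a new DataFrame and leaves the original untouched, while `loc` writes into the existing one. How much data `assign` physically copies depends on the pandas version and on whether Copy-on-Write is enabled, so treat this as a demonstration of semantics, not of memory cost.

```python
import pandas as pd

# Hypothetical data, used only to contrast the two styles.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# assign returns a new DataFrame; df keeps its original values.
discounted = df.assign(price=df["price"] * 0.9)

# loc overwrites the column in the frame it is called on.
# The copy here exists only so both styles can be compared side by side.
in_place = df.copy()
in_place.loc[:, "price"] = in_place["price"] * 0.9
```

If the original frame really is never needed again, both styles produce the same values; the difference is whether earlier code that still holds `df` sees the change.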
| lmeyerov wrote:
| Yep, we regularly use assign & pipe to avoid the 80% case I
| think the article is about:
|
| ```
|
| df2 = df.assign(new_col_1=f(df))
|
| ```
|
| or
|
| ```
|
| def add_new_col_1(df): return df.assign(new_col_1=...)
|
| df2 = df.pipe(add_new_col_1)
|
| ```
|
| We do use reified compute DAGs in some places, but by that
| point, we're using dask anyways, and that kind of code ends
| up more annoying than if we could avoid it. Speaking as someone
| who has done years of FRP/streams/etc., if users can stick
| with direct control flow or things that look like it
| (async/await), help them do it :)
|
| RE: Immutability, it's a convention that eliminates some
| classes of bugs, so quite nice when a team does it. Code
| written by senior / reviewed teams is nice b/c these things
| add up to either a pleasant experience or perpetual paranoia
| for the day-to-day:
|
| * A lot of our code is in notebook envs, where the ability to
| move cells up/down matters, so we try to only do single
| assignments to avoid non-reproducibility bugs
|
| * Similar in production, but more about when someone comes
| back and wants to edit or log. Having to undo the name reuse
| is annoying, and may lead to bugs when you don't notice it
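The single-assignment convention described in the two bullets above might look like the following sketch. The column names and helper functions are hypothetical. Each step gets its own name, so any intermediate frame can be inspected, logged, or re-run without undoing name reuse.

```python
import pandas as pd

def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Returns a new frame with one extra column; the input is not modified.
    return df.assign(total=df["qty"] * df["unit_price"])

def add_is_large(df: pd.DataFrame) -> pd.DataFrame:
    # Depends on the "total" column produced by add_total.
    return df.assign(is_large=df["total"] > 100)

orders = pd.DataFrame({"qty": [1, 5], "unit_price": [20, 30]})

# One name per step: no variable is ever rebound to a different frame.
orders_with_total = orders.pipe(add_total)
orders_flagged = orders_with_total.pipe(add_is_large)
```

Because `orders` never changes, re-running either step in a different order in a notebook gives the same result or fails loudly on a missing column, instead of silently compounding an earlier mutation.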
| spratzt wrote:
| Looks great and I shall definitely try it.
|
| I've always thought that Stitchfix were the most interesting
| ecommerce retail operation. It's a shame that none of their data
| science roles are available in London.
| wxnx wrote:
| This is awesome to see!
|
| I had a very similar idea recently (representing "big"
| preprocessing pipelines in Python as DAGs), but for research-
| scale projects. Funnily enough, I was also motivated by a time
| series project that has resulted in several thousand lines of
| gnarly preprocessing code, and growing - mostly in Pandas.
|
| I'm not sure Hamilton would totally work for our use case, but
| I'll be following closely either way. Thank you for open sourcing
| this!
| elijahbenizzy wrote:
| Author here -- that's awesome! Great minds think alike. Always
| interested in use-cases (that both fit and don't) -- feel free
| to open up an issue with something representative/what you want
| and I'm happy to contribute my opinion.
| akdor1154 wrote:
| Very interesting. In general I'm defaulting to "ahh not another
| syntax/grammar to learn!", but I can absolutely appreciate that a
| code base written with this would be far better than the messes
| of opaque pandas wrangling it would otherwise involve.
|
| I really appreciate API design that consciously encourages clean
| code, and on that basis this looks fantastic.
| jcmontx wrote:
| Thanks I'll stick to Verstappen
| imcoconut wrote:
| This looks amazing, and very timely.
|
| Any chance you have a mirror on gitlab (or github)?
| zxexz wrote:
| This[0] was linked in the article.
|
| [0] https://github.com/stitchfix/hamilton
| lr1970 wrote:
| An earlier thread 23 days ago [0]
|
| [0] https://news.ycombinator.com/item?id=29158021
| feoren wrote:
| It always amazes me how readily people assume that their problem
| is unique and special. This article can be summarized as:
| standard software engineering problem gets solved with standard
| software engineering practices. The problem they describe of a
| giant function that gets touched every time something changes is
| called a "God Function" or a "God Class" and it's one of the most
| common problems in software engineering. Their solution is called
| "Dependency Inversion" and it was designed to solve literally
| that exact problem in the exact way they're doing it.
|
| That doesn't mean it's easy! Anyone who applies Software
| Engineering principles to make their code better deserves
| congratulations. So congratulations, Stitchfix! But you could
| have found this solution a lot quicker if you didn't fall into
| that age old trap of thinking your data-science code was
| _different_, and _excluded_ from standard software engineering.
| Everyone does this, and I don't know why. They read all these
| blogs and books on software engineering best practices, but then
| they don't apply them to anything. You can't apply it to your SQL
| because it's just _different_. Your database organization is
| _different_. Your I/O code is _different_. Your API surface is
| _different_. Your UI is _different_. Your feature engineering is
| _different_.
|
| No! It's software. Stop thinking that you need a unique solution
| to your unique snowflake problem. Apply well-known and good
| software engineering practices to everything you ever do.
|
| They even say in the article "It hasn't grown convoluted out of
| ... bad software engineering practices". _Of course it has!_ What
| do you think software engineering practices are supposed to apply
| to!?
| krawczstef wrote:
| Author here. Yep, definitely don't think conceptually what
| we've done is unique, but the implementation/use is. The
| "trick" here is that the end user doesn't need to know about
| these great software engineering best practices, they happen
| naturally.
|
| Re: "god class" or "function" -- that's not accurate. It's more
| like a "god script". This script was written by different
| authors, with different styles, all modifying/replacing/adjusting
| the code in this script.
|
| Now, yes, could that team have avoided some of the problems had
| they just thought about it, kind of -- but as the post points
| out, that would only get you so far. The paradigm created ticks
| all the boxes and scales very well with the team and code base
| size.
|
| Anyway, I would love for you to install hamilton (pip install
| sf-hamilton) and try it, and give a first hand perspective :)
| We're always after feedback. Cheers!
| mtVessel wrote:
| In my experience, many data engineers and data scientists know
| nothing about software engineering, unless they've had a
| previous life as a coder. Most data engineers who claim they
| know python really only know pandas, and many couldn't
| implement fizzbuzz without a library.
|
| I'm a data guy, but I started out as a developer, and
| eventually chose to specialize in data systems. I'm
| continually amazed to find that this path is the exception.
| noobhacker wrote:
| Could you elaborate on how this Hamilton framework is a case of
| Dependency Inversion (which in my understanding is about
| removing dependency on low level class and using dependency on
| high level class instead)?
|
| What are the low- and high-level classes in this application?
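For readers unfamiliar with the term, here is a generic sketch of Dependency Inversion as defined in the question above. It is not taken from Hamilton, and every name in it (`SpendSource`, `InMemorySpend`, `average_spend`) is hypothetical. The high-level function depends on an abstraction, and the low-level detail implements that abstraction, so the detail can be swapped without touching the high-level code.

```python
from typing import Protocol

class SpendSource(Protocol):
    """Abstraction that the high-level code depends on."""
    def load_spend(self) -> list[float]: ...

class InMemorySpend:
    """Low-level detail; a database- or file-backed source could replace it."""
    def __init__(self, values: list[float]) -> None:
        self._values = list(values)

    def load_spend(self) -> list[float]:
        return list(self._values)

def average_spend(source: SpendSource) -> float:
    # High-level logic: knows only the SpendSource interface,
    # not where the numbers actually come from.
    values = source.load_spend()
    return sum(values) / len(values)

result = average_spend(InMemorySpend([10.0, 20.0, 30.0]))
```

Whether and how this maps onto Hamilton is exactly what the question asks; the sketch only pins down the general pattern.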
| dash2 wrote:
| Nice. I've really felt the pain point here. In R I use Drake to
| run a big pipeline on my laptop. In theory that provides the same
| "only run the computations you need" DAG, but because outputs are
| defined as whole data frames, changing or adding a single column
| often forces recomputation of everything. Column level isolation
| like this is the smart way to go.
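Column-level isolation of the kind described above can be sketched in plain Python. This is not Hamilton's or Drake's API; `FUNCTIONS`, `compute`, and the column names are assumptions made for illustration. Each function is named after the column it produces, its parameter names declare the columns it needs, and a small resolver computes only what the requested column depends on.

```python
import inspect

def spend_per_signup(spend, signups):
    # Needs two input columns.
    return [s / n for s, n in zip(spend, signups)]

def spend_zero_mean(spend):
    # Needs only the "spend" column.
    mean = sum(spend) / len(spend)
    return [s - mean for s in spend]

# Registry: output column name -> function that produces it.
FUNCTIONS = {f.__name__: f for f in (spend_per_signup, spend_zero_mean)}

def compute(target, inputs, cache, log):
    """Resolve one column, recursing through its declared dependencies."""
    if target in cache:
        return cache[target]
    if target in inputs:
        return inputs[target]
    func = FUNCTIONS[target]
    deps = inspect.signature(func).parameters
    args = {name: compute(name, inputs, cache, log) for name in deps}
    log.append(target)  # record which derived columns were actually computed
    cache[target] = func(**args)
    return cache[target]

inputs = {"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 5]}
computed = []
result = compute("spend_zero_mean", inputs, {}, computed)
```

Requesting `spend_zero_mean` never touches `spend_per_signup`, so adding or changing one column does not force recomputation of unrelated ones.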
___________________________________________________________________
(page generated 2021-12-02 23:02 UTC)