[HN Gopher] Hamilton: A Microframework for Dataframe Generation
       ___________________________________________________________________
        
       Hamilton: A Microframework for Dataframe Generation
        
       Author : gammarator
       Score  : 44 points
       Date   : 2021-12-02 16:11 UTC (6 hours ago)
        
 (HTM) web link (multithreaded.stitchfix.com)
 (TXT) w3m dump (multithreaded.stitchfix.com)
        
       | __mharrison__ wrote:
       | I would love to see more real-life examples. My take is that most
       | people don't know about the .assign method [0] or severely
       | underutilize it.
       | 
       | 0 - https://www.metasnake.com/blog/pydata-assign.html
        
         | pvitz wrote:
         | I don't understand the advantage of the assign-method in the
         | article you have linked to. If I would like to alter data in a
         | column of a large dataframe, why should I use the assign-method
         | (and copy data) instead of the mentioned loc-method and just
         | overwrite the data (without a copy)? I am assuming, of course,
         | that I will not need the original dataframe anymore.
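A minimal sketch of the two approaches being compared here, assuming pandas is installed. The `price` column and the 0.9 factor are made up for illustration. `assign` hands back a new DataFrame and leaves the original as it was, while `.loc` overwrites the column of the frame it is called on.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# assign: returns a new DataFrame, `df` itself is not modified.
discounted = df.assign(price=df["price"] * 0.9)

# .loc: overwrites the column of the frame it is called on.
# The copy exists only so this demo can compare both results.
in_place = df.copy()
in_place.loc[:, "price"] = in_place["price"] * 0.9
```

Which one is preferable depends on whether the original frame is still needed and on how much memory the extra frame costs.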
        
           | lmeyerov wrote:
           | Yep, we regularly use assign & pipe to avoid the 80% case I
           | think the article is about:
           | 
           | ```
           | df2 = df.assign(new_col_1=f(df))
           | ```
           | 
           | or
           | 
           | ```
           | def add_new_col_1(df):
           |     return df.assign(new_col_1=...)
           | 
           | df2 = df.pipe(add_new_col_1)
           | ```
           | 
           | We do use reified compute DAGs in some places, but by that
           | point, we're using dask anyways, and that kind of code ends
           | up more annoying than if we could avoid. Speaking as someone
           | who has done years of FRP/streams/etc., if users can stick
           | with direct control flow or things that look like it
           | (async/await), help them do it :)
           | 
           | RE: Immutability, it's a convention that eliminates some
           | classes of bugs, so quite nice when a team does it. Code
           | written by senior / reviewed teams is nice b/c these things
           | add up to either a pleasant experience or perpetual paranoia
           | for the day-to-day:
           | 
           | * A lot of our code is in notebook envs, where the ability to
           | move cells up/down matters, so we try to only do single
           | assignments to avoid non-reproducibility bugs
           | 
           | * Similar in production, but more about when someone comes
           | back and wants to edit or log. Having to undo the name reuse
           | is annoying, and may lead to bugs when you don't notice it
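The assign & pipe pattern with single assignment can be sketched as a runnable example, assuming pandas is installed. The column names and the helpers `add_total` and `add_is_large` are hypothetical and do not come from the comment above.

```python
import pandas as pd


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    # Returns a new frame with one extra column; the input is not modified.
    return df.assign(total=df["qty"] * df["unit_price"])


def add_is_large(df: pd.DataFrame) -> pd.DataFrame:
    # Depends on the `total` column added by the previous step.
    return df.assign(is_large=df["total"] > 50)


orders = pd.DataFrame({"qty": [1, 5, 10], "unit_price": [9, 12, 3]})

# Each name is bound exactly once, so `orders` still means the raw input
# no matter which notebook cell is re-run or moved.
enriched = orders.pipe(add_total).pipe(add_is_large)
```

Because no step rebinds `orders`, reordering or re-running steps cannot silently change what the name refers to.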
        
       | spratzt wrote:
       | Looks great and I shall definitely try it.
       | 
       | I've always thought that Stitchfix were the most interesting
       | ecommerce retail operation. It's a shame that none of their data
       | science roles are available in London.
        
       | wxnx wrote:
       | This is awesome to see!
       | 
       | I had a very similar idea recently (representing "big"
       | preprocessing pipelines in Python as DAGs), but for research-
       | scale projects. Funnily enough, I was also motivated by a time
       | series project that has resulted in several thousand lines of
       | gnarly preprocessing code, and growing - mostly in Pandas.
       | 
       | I'm not sure Hamilton would totally work for our use case, but
       | I'll be following closely either way. Thank you for open sourcing
       | this!
        
         | elijahbenizzy wrote:
         | Author here -- that's awesome! Great minds think alike. Always
         | interested in use-cases (that both fit and don't) -- feel free
         | to open up an issue with something representative/what you want
         | and I'm happy to contribute my opinion.
        
       | akdor1154 wrote:
       | Very interesting. In general I'm defaulting to "ahh not another
       | syntax/grammar to learn!", but I can absolutely appreciate that a
       | code base written with this would be far better than the messes
       | of opaque pandas wrangling it would otherwise involve.
       | 
       | I really appreciate API design that consciously encourages clean
       | code, and on that basis this looks fantastic.
        
       | jcmontx wrote:
       | Thanks I'll stick to Verstappen
        
       | imcoconut wrote:
       | This looks amazing, and very timely.
       | 
       | Any chance you have a mirror on gitlab (or github)?
        
         | zxexz wrote:
         | This[0] was linked in the article.
         | 
         | [0] https://github.com/stitchfix/hamilton
        
       | lr1970 wrote:
       | An earlier thread 23 days ago [0]
       | 
       | [0] https://news.ycombinator.com/item?id=29158021
        
       | feoren wrote:
       | It always amazes me how readily people assume that their problem
       | is unique and special. This article can be summarized as:
       | standard software engineering problem gets solved with standard
       | software engineering practices. The problem they describe of a
       | giant function that gets touched every time something changes is
       | called a "God Function" or a "God Class" and it's one of the most
       | common problems in software engineering. Their solution is called
       | "Dependency Inversion" and it was designed to solve literally
       | that exact problem in the exact way they're doing it.
       | 
       | That doesn't mean it's easy! Anyone who applies Software
       | Engineering principles to make their code better deserves
       | congratulations. So congratulations, Stitchfix! But you could
       | have found this solution a lot quicker if you didn't fall into
       | that age old trap of thinking your data-science code was
       | _different_, and _excluded_ from standard software engineering.
       | Everyone does this, and I don't know why. They read all these
       | blogs and books on software engineering best practices, but then
       | they don't apply them to anything. You can't apply it to your SQL
       | because it's just _different_. Your database organization is
       | _different_. Your I/O code is _different_. Your API surface is
       | _different_. Your UI is _different_. Your feature engineering is
       | _different_.
       | 
       | No! It's software. Stop thinking that you need a unique solution
       | to your unique snowflake problem. Apply well-known and good
       | software engineering practices to everything you ever do.
       | 
       | They even say in the article "It hasn't grown convoluted out of
       | ... bad software engineering practices". _Of course it has!_ What
       | do you think software engineering practices are supposed to apply
       | to!?
        
         | [deleted]
        
         | krawczstef wrote:
         | Author here. Yep, definitely don't think conceptually what
         | we've done is unique, but the implementation/use is. The
         | "trick" here, is that the end user doesn't need to know about
         | these great software engineering best practices, they happen
         | naturally.
         | 
         | Re: "god class" or "function" -- that's not accurate. It's more
         | like a "god script". This script was written by different
         | authors, with different styles, all
         | modifying/replacing/adjusting the code in this script.
         | 
         | Now, yes, could that team have avoided some of the problems had
         | they just thought about it, kind of -- but as the post points
         | out, that would only get you so far. The paradigm created ticks
         | all the boxes and scales very well with the team and code base
         | size.
         | 
         | Anyway, I would love for you to install hamilton (pip install
         | sf-hamilton) and try it, and give a first hand perspective :)
         | We're always after feedback. Cheers!
        
         | mtVessel wrote:
         | In my experience, many data engineers and data scientists know
         | nothing about software engineering, unless they've had a
         | previous life as a coder. Most data engineers who claim they
         | know python really only know pandas, and many couldn't
         | implement fizzbuzz without a library.
         | 
         | I'm a data guy, but I started out as a developer, and
         | eventually chose to specialize in data systems. I'm
         | continually amazed to find that this path is the exception.
        
         | noobhacker wrote:
         | Could you elaborate on how this Hamilton framework is a case of
         | Dependency Inversion (which in my understanding is about
         | removing dependency on low level class and using dependency on
         | high level class instead)?
         | 
         | What are the low- and high-level classes in this application?
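For readers unfamiliar with the term, here is a generic sketch of Dependency Inversion in Python. It is not a description of how Hamilton is implemented, and every name in it (`ColumnSource`, `InMemorySource`, `mean_of`) is hypothetical. The high-level function depends only on an abstract interface, and the low-level detail (where the data lives) is supplied from outside.

```python
from typing import Dict, List, Protocol


class ColumnSource(Protocol):
    """Abstraction that the high-level code depends on."""

    def load(self, name: str) -> List[float]: ...


class InMemorySource:
    """Low-level detail: one possible implementation of ColumnSource."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self._data = data

    def load(self, name: str) -> List[float]:
        return list(self._data[name])


def mean_of(column: str, source: ColumnSource) -> float:
    # High-level logic: knows nothing about where the values come from.
    values = source.load(column)
    return sum(values) / len(values)


average = mean_of("spend", InMemorySource({"spend": [10.0, 20.0, 30.0]}))
```

Swapping `InMemorySource` for a database-backed or file-backed class would not require touching `mean_of`.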
        
       | dash2 wrote:
       | Nice. I've really felt the pain point here. In R I use Drake to
       | run a big pipeline on my laptop. In theory that provides the same
       | "only run the computations you need" DAG, but because outputs are
       | defined as whole data frames, changing or adding a single column
       | often forces recomputation of everything. Column level isolation
       | like this is the smart way to go.
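The "only run the computations you need" idea at column level can be sketched with the standard library alone. This is an illustration of the concept, not Hamilton's API: it assumes that each function name is an output column and each parameter name is a column it depends on, and it uses plain lists in place of pandas Series. All function and column names are made up.

```python
import inspect
from typing import Any, Callable, Dict, List


def spend_per_signup(spend: List[float], signups: List[int]) -> List[float]:
    return [s / n for s, n in zip(spend, signups)]


def spend_zero_mean(spend: List[float]) -> List[float]:
    mean = sum(spend) / len(spend)
    return [s - mean for s in spend]


# Registry: output column name -> function that produces it.
COLUMN_FUNCTIONS: Dict[str, Callable[..., Any]] = {
    fn.__name__: fn for fn in (spend_per_signup, spend_zero_mean)
}


def compute(target: str, cache: Dict[str, Any]) -> Any:
    """Compute `target`, recursing only into the columns it depends on."""
    if target in cache:  # raw input, or a column that was already computed
        return cache[target]
    fn = COLUMN_FUNCTIONS[target]
    # Parameter names double as the names of upstream columns.
    args = {name: compute(name, cache) for name in inspect.signature(fn).parameters}
    cache[target] = fn(**args)
    return cache[target]


cache: Dict[str, Any] = {"spend": [10.0, 20.0, 30.0], "signups": [1, 2, 5]}
result = compute("spend_per_signup", cache)
```

Requesting `spend_per_signup` never touches `spend_zero_mean`, so adding or changing an unrelated column does not force recomputation of the requested one.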
        
       ___________________________________________________________________
       (page generated 2021-12-02 23:02 UTC)