[HN Gopher] Koheesio: Nike's Python-based framework to build adv...
       ___________________________________________________________________
        
       Koheesio: Nike's Python-based framework to build advanced data-
       pipelines
        
       Author : betacar
       Score  : 195 points
       Date   : 2024-06-04 05:07 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jiggunjer wrote:
       | Another snakemake?
        
       | esafak wrote:
       | > Koheesio is not in competition with other libraries.
       | 
       | Yes, it is, because nobody wants to run multiple orchestrators,
       | and the "What sets Koheesio apart from other libraries?" section
       | does little to help users decide why they should pick yours.
       | 
       | Workflow orchestration is a mature category, as evidenced by the
       | length of this list:
       | https://github.com/meirwah/awesome-workflow-engines
       | 
       | I would expect someone who's seriously writing a new orchestrator
       | in 2024 to cite the alternatives, their shortcomings, and how you
       | intend to address them. Bonus points if you make a neat little
       | table.
       | 
       | The fact that you're leading with Python does not inspire
       | confidence. Pretty much all workflow orchestrators use Python for
       | their glue, and that's hardly the interesting part.
       | 
       | What were they using at Nike before this?
        
         | zaptheimpaler wrote:
         | This is the kind of attitude that makes people, companies,
         | researchers hesitant to publish code online. It's free code for
         | everyone to see, they don't owe you anything. It's not
         | necessarily a "product" for your consumption, it's just a repo.
        
           | esafak wrote:
           | I don't think so; this is a corporate project, not an
           | enthusiast's. They probably had to internally justify
           | building it over using an existing solution, so they could
           | simply share their rationale. I am not trying to rain on
           | their parade.
        
             | fforflo wrote:
             | It's a library written by some devs who thought it might be
             | useful to others, too, and/or are proud enough to share
             | their work. It's not that they'll publish it in an SEC
             | filing.
             | 
             | For every n-th solution in the market, n-1 existing ones
             | could have been used, but they weren't for (many times)
             | good reasons.
             | 
             | And looking at their Makefile and pyproject.toml I can see
             | that they knew what they wanted.
        
               | klohto wrote:
               | yes, Nike is a corp where everyone sings kumbaya and the
               | engineers have unlimited time to work on and publish
               | libraries they are proud of, they don't have to deliver
               | specific values at all.
        
               | blitzar wrote:
               | the kumbaya and coding is only allowed in the 5 minute
               | lunchbreak at the nike sweatshops
        
           | muspimerol wrote:
           | I think this is a fair criticism. You chose to share this on
           | hackernews, which invites feedback (including constructive
           | criticism). I don't see the problem ¯\_(ツ)_/¯ Do you
           | expect people to only voice praise for open source projects?
        
             | mb5 wrote:
             | Why do you think the poster is connected with the repo? I
             | couldn't see any link.
        
               | muspimerol wrote:
               | I don't. Regardless of who posted it, discussion and
               | constructive criticism are warranted in the comments.
        
               | yunohn wrote:
               | /s
               | 
               | Like it or hate it, it's how social media works. Everyone
               | posts other people's things, and then more people get
               | together to rip it to shreds.
        
           | koliber wrote:
           | It's a fair set of questions, posed fairly directly, without
           | any sugarcoating. If you read it as a ruthless critique, it
           | stings. If you read it as constructive feedback, it can bring
           | the author a lot of value.
           | 
           | If the author can clearly show the value proposition of their
           | library, it will get more adoption, and the community and the
           | author will get value.
           | 
           | If the author realizes they coded something that is already a
           | well-solved problem, or a poorly-constructed alternative, the
           | author and Nike could gain by throwing this away and going
           | with a better alternative.
           | 
           | Personally, I wasted too many hours of my life creating a
           | solution that did not solve any problems. Had I done some
           | thinking and/or market research ahead of time, it would have
           | saved me tons of time to work on more worthwhile endeavors.
           | 
           | Good feedback is worth its weight in gold, even if it hurts a
           | bit to hear it.
        
         | bt1a wrote:
         | I'll agree that the pages about Koheesio hosted on nike's
         | website are strangely empty of persuasive specifics (seems like
         | generative AI used by someone who was in charge of writing it
         | up lol..), but maybe the authors believe that someone who's
         | seriously considering a data pipeline library would want to
         | look at the actual code (which by the way, is structured quite
         | well). Python doesn't "inspire confidence"? Think of the target
         | audience. The framework claims to take typing and tests quite
         | seriously, so perhaps your confidence is gained elsewhere. I
         | didn't go through the codebase because I'm not interested, but
         | your dismissive attitude doesn't radiate good faith
        
         | vincnetas wrote:
         | Regarding the comments saying that the parent comment lacks good
         | faith, and that the code is free so take it as is: consider also
         | that the parent has every right to express his concerns and
         | opinions. If you don't like criticism or inconvenient questions,
         | just skip or ignore them. For example, the parent's insights
         | were interesting to me: good questions from which I can learn
         | how to evaluate things critically.
        
         | boffinAudio wrote:
         | >nobody wants to run multiple orchestrators
         | 
         | Straw man argument.
         | 
         | >I would expect someone who's seriously writing a new
         | orchestrator in 2024 to cite the alternatives, their
         | shortcomings, and how you intend to address them. Bonus points
         | if you make a neat little table.
         | 
         | Did you even try to read the docs before you launched this
         | critical diatribe?
         | 
         | From the docs
         | (https://engineering.nike.com/koheesio/latest/tutorials/onboa...):
         |
         |     Advantages of Koheesio
         |
         |     Using Koheesio instead of raw Spark has several advantages:
         |
         |     - Modularity: Each step in the pipeline (reading,
         |       transformation, writing) is encapsulated in its own class,
         |       making the code easier to understand and maintain.
         |     - Reusability: Steps can be reused across different tasks,
         |       reducing code duplication.
         |     - Testability: Each step can be tested independently, making
         |       it easier to write unit tests.
         |     - Flexibility: The behavior of a task can be customized using
         |       a Context class.
         |     - Consistency: Koheesio enforces a consistent structure for
         |       data processing tasks, making it easier for new developers
         |       to understand the codebase.
         |     - Error Handling: Koheesio provides a consistent way to handle
         |       errors and exceptions in data processing tasks.
         |     - Logging: Koheesio provides a consistent way to log
         |       information and errors in data processing tasks.
         |
         |     In contrast, using the plain PySpark API for transformations
         |     can lead to more verbose and less structured code, which can
         |     be harder to understand, maintain, and test. It also doesn't
         |     provide the same level of error handling, logging, and
         |     flexibility as the Koheesio Transform class.
         | 
         | It took me less than 15 seconds to find the solution to the
         | problem you propose. How long did it take you to formulate your
         | critique? Do you perhaps just have a prejudice against Nike
         | (corporate haze), or is it an investment in a 'competing
         | orchestrator' that is clouding your judgement?
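
A minimal sketch of the "each step is encapsulated in its own class" idea quoted from the docs above, written in plain Python so it runs without Spark. The names `Step`, `ReadRows`, `FilterRows`, `CollectRows` and `run_pipeline` are hypothetical and are not Koheesio's API; rows are plain dicts standing in for DataFrames.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]  # assumption: a record is just a dict in this sketch


class Step:
    """Hypothetical base class: one self-contained unit of pipeline work."""

    def run(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


@dataclass
class ReadRows(Step):
    source: List[Row]

    def run(self, rows: List[Row]) -> List[Row]:
        # A reader ignores its input and emits a copy of the source records.
        return list(self.source)


@dataclass
class FilterRows(Step):
    predicate: Callable[[Row], bool]

    def run(self, rows: List[Row]) -> List[Row]:
        # A transformation keeps only the rows matching the predicate.
        return [row for row in rows if self.predicate(row)]


@dataclass
class CollectRows(Step):
    sink: List[Row]

    def run(self, rows: List[Row]) -> List[Row]:
        # A writer appends to an in-memory sink and passes the rows through.
        self.sink.extend(rows)
        return rows


def run_pipeline(steps: Iterable[Step]) -> List[Row]:
    """Feed the output of each step into the next one, in order."""
    rows: List[Row] = []
    for step in steps:
        rows = step.run(rows)
    return rows
```

Because every step exposes the same `run` method, a single step such as `FilterRows` can be unit-tested on a handful of dicts without constructing the rest of the pipeline, which is the testability and reusability argument the docs make.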
        
           | esafak wrote:
           | Every modern workflow orchestrator does those things you
           | quote, and more. You make it sound like they're innovations
           | when they're table stakes. Why wouldn't you just use Flyte or
           | Kubeflow?
           | 
           | Also, the fact that they say that the alternative is "raw
           | Spark" tells me either that they're confused, or not very
           | good at explaining. Spark is used to execute tasks in a
           | pipeline, not to orchestrate it.
        
             | thenaturalist wrote:
             | While I generally tend to agree with your basic criticism,
             | I think you need to keep in mind our perspectives might be
             | biased due to limited data.
             | 
             | Flyte went OSS what, 4 years ago? I'm not super familiar
             | with it, but it could be that a) it was too unpolished at
             | the time or b) requiring K8s was a non-starter for some
             | teams/orgs. Same for Kubeflow.
             | 
             | We also don't know for how long Koheesio existed within
             | Nike.
             | 
             | In short, there's a lot we don't know and there's a good
             | chance the internal reasoning for investing in this made
             | sense under certain circumstances in the past.
        
               | esafak wrote:
               | This project is two weeks old!
        
         | serial_dev wrote:
         | > Workflow orchestration is a mature category, as evidenced by
         | the length of this list (...)
         | 
         | Or could it be evidence that the existing tools all have
         | their flaws and any reasonably sized organization will hit
         | these flaws pretty early, so many orgs come to the conclusion
         | that the best approach is to just roll their own that fits
         | their case relatively well?
        
           | thenaturalist wrote:
           | Are you at all familiar with the data integration / ETL
           | framework space?
           | 
           | If so, I think you would recognize how unreasonable the
           | presumption of your comment is. It certainly strikes me as
           | such.
           | 
           | Most if not all the big OSS frameworks and commercial
           | offerings originated in one big corp and subsequently moved
           | into the Apache direction (Airflow) or spun out into their
           | own companies.
        
           | whimsicalism wrote:
           | no offense, but you clearly don't know much about
           | orchestrators
        
         | gonzo41 wrote:
         | Every task orchestration tool kinda has a crappy security
         | model. Sure, there are a ton of them, but when you start putting
         | them all over the place it's just hectic to get right. That is,
         | I feel, the space where someone could make gains in a big way.
        
         | whimsicalism wrote:
         | yeah they need to explain why this is better than flyte
        
         | rockostrich wrote:
         | > Workflow orchestration is a mature category
         | 
         | As far as I can tell from the docs, this is not an orchestrator
         | at all. It is a Python-wrapper for certain runtimes like
         | PySpark. I don't see anything in the docs that mentions DAGs,
         | dependency definitions, scheduling, or deployment.
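
To illustrate the distinction drawn above, here is a small sketch of the one piece an orchestrator adds that a step wrapper does not: resolving a run order from declared dependencies. It uses the standard library's `graphlib` (Python 3.9+); the task names and the `execution_order` helper are hypothetical and come from no particular tool.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())
```

A real orchestrator layers scheduling, retries and deployment on top of this ordering step; the sketch covers only the dependency resolution.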
        
       | hipadev23 wrote:
       | Had Nike as a client for a period of time, interacted with quite
       | a few people across their data org. There is absolutely no
       | software you want authored by them.
        
         | annexrichmond wrote:
         | This is just flat out rude.
        
           | sam_lowry_ wrote:
           | It's a big organization, but I can understand the feeling,
           | because I had the same attitude towards Microsoft, Oracle,
           | Salesforce and many others.
        
             | alex_lav wrote:
             | It's a big organization, which means there will be areas
             | full of great people and areas full of less-than great
             | people. To discount all work from a group that large is
             | just silly.
        
         | teekert wrote:
         | I worked at a large Healthtech company before. I get the
         | sentiment. But, in the forest of software based on poor
         | decisions, there were definitely a couple of valuable gems made
         | by very knowledgeable people. Generally those were the people
         | with an "opensource mindset" (as opposed to the "my ultra-
         | crappy code that is just some FOSS glued together is super
         | valuable IP and you need to read 100s of QMS docs before you
         | can lay your eyes on it"-people). Don't measure the whole org
         | by the same yardstick.
        
       | benterix wrote:
       | A probably better explanation of what it is and why you might
       | want to use it (or not) can be found here:
       | 
       | https://engineering.nike.com/koheesio/latest/tutorials/onboa...
        
       | serial_dev wrote:
       | I used to work a little with ETLs, Spark, Storm, etc and I
       | honestly don't understand the value proposition of this library.
       | I'm no data engineer expert by any means (it was like 2 years
       | working on data eng stuff about 30% of the time 5+ years ago),
       | but I expected that at least I'd get what this is useful for.
        
         | ubercore wrote:
         | This, at a glance, seems pretty simplistic. A neat project, but
         | not something I would have expected on HN front page.
        
           | whoiscroberts wrote:
           | After decades of working on overly abstracted clever
           | applications the only place I see elegance is in simplicity.
           | I'd like to see more libraries like this on the front page.
        
         | tiew9Vii wrote:
         | From their docs: > Koheesio is a Python library that simplifies
         | the development of data engineering pipelines. It provides a
         | structured way
         | 
         | I think this pretty much sums it up, "a structured way".
         | 
         | It looks to be a thin wrapper around Spark to provide a
         | consistent way to structure ETL jobs. They've implemented a
         | mini DSL defining jobs as a data structure on top of Spark.
         | 
         | I've seen several companies build stuff similar to this
         | internally, defining jobs as a data structure. It all amounts
         | to each company having their own internal conventions, their
         | own view of what is easier for their devs and creating a
         | framework for it. Nike have just decided to make theirs public.
         | 
         | You can do all this simply with simple Spark scripts.
         | Personally I'd use simple Spark scripts. Big companies with
         | lots of people love making these tools because companies love
         | conventions, their conventions, their style guides. They deal
         | with staff churn and on-boarding frequently, so they believe
         | these kinds of things make that easier.
         | 
         | Probably makes sense in Nike as a way of organizing their ETL
         | jobs but that looks to be all it brings. A way to
         | structure/define your simple spark jobs the way Nike devs
         | believe it should be done.
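
The "jobs as a data structure" pattern described above can be sketched roughly as follows, in plain Python rather than Spark. Everything here is an assumption for illustration: the `TRANSFORMS` registry, the `drop_nulls` and `rename` operations, the layout of the `job` dict and the `run_job` interpreter are all hypothetical and do not reflect how Koheesio or any specific in-house framework does it.

```python
def drop_nulls(rows, column):
    """Keep only rows where `column` is present and not None."""
    return [row for row in rows if row.get(column) is not None]


def rename(rows, old, new):
    """Rename key `old` to `new` in every row."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


# Hypothetical registry mapping operation names to implementations.
TRANSFORMS = {"drop_nulls": drop_nulls, "rename": rename}

# The job itself is plain data, so it could equally live in YAML or JSON.
job = {
    "name": "clean_orders",
    "steps": [
        {"op": "drop_nulls", "args": {"column": "price"}},
        {"op": "rename", "args": {"old": "price", "new": "unit_price"}},
    ],
}


def run_job(job, rows):
    """Interpret the job description by applying each step in order."""
    for step in job["steps"]:
        rows = TRANSFORMS[step["op"]](rows, **step["args"])
    return rows
```

The trade-off raised in the comment shows up directly: the job author can only use operations that already exist in the registry, so anything new has to be added by whoever maintains the framework.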
        
         | anentropic wrote:
         | it looks like a layer of sugar over PySpark
         | 
         | seems to be Spark-only AFAICT?
        
       | adrianbr wrote:
       | That's really cool, have you already seen the dlt library? That
       | one's made for very easy-to-use EL in Python. It's similarly
       | modular and built by senior data engineers for the data team, and
       | the sources are generators which you could probably use too.
       | 
       | How is koheesio different to dlt? Where could they complement
       | each other?
        
       | steveBK123 wrote:
       | If I had to guess, a tool like this might be useful in a shop
       | with a lot of inexperienced devs. It's a thin wrapper to make
       | sure everyone walks the same well worn path the same way. You
       | have 2-3 devs work on the tooling, and a much larger team doing
       | rote ETL.
       | 
       | I worked at a shop that did this and the trade-off is TTM, as
       | your 2 person tools team is constantly needing to unblock ETL
       | team with new features as they encounter new requirements in the
       | wild.
       | 
       | If your ETL team is 20+ people and the tools team doesn't have a
       | head start, tools team will quickly fall behind an insurmountable
       | backlog as your ETL team spins its wheels. But you might save
       | some money if you choose the right KPI..
        
         | haddr wrote:
         | I think this is the case: when you run your pipelines at scale
         | you want to standardize and simplify some repeatable aspects to
         | lower the cost of managing them. You may also want to be
         | orthogonal to orchestrator engines (or triggering engines) and
         | avoid getting too opinionated and inflexible in the future. So
         | this framework is exploring some sweet spot between raw spark
         | pipelines and low code etl engines.
        
           | steveBK123 wrote:
           | yeah though a lot of these fall for a variant of the
           | "universal standard" conceit joked about in xkcd. All these
           | low-code solutions suck, so we'll build our own in-house that
           | surely won't have the same pitfalls..
        
         | datavirtue wrote:
         | I built a data processing framework at GE that let junior devs
         | write whatever code they needed to transform a particular
         | input. It provided an interface that they had to satisfy (for
         | data lineage metrics) but otherwise scaled their code without
         | them having to understand the distributed architecture or
         | anything about the platform. Exceptions flowed up to the
         | platform and became part of the data lineage metrics.
         | 
         | I walked into 20 years of adhoc code that had zero data
         | lineage, recoverability, or scalability that was breaking
         | daily. There were contractors with over a decade of tenure
         | whose job it was to troubleshoot and fix their own brittle
         | processes (and make new ones) daily.
         | 
         | I got laid off (737 MAX plus pandemic) as I was rolling it out
         | and they went back to the old way.
         | 
         | Subsequently, a new startup (Pantomath) emerged with former GE
         | engineers (and other former colleagues of mine) from my former
         | department to address that problem domain.
         | 
         | Based on my experience trying to socialize this type of
         | solution, sales are going to be a bitch.
        
           | steveBK123 wrote:
           | A framework with composable building blocks, allowing devs to
           | unblock themselves by adding the functionality they need is a
           | good solution.
        
       | alessmar wrote:
       | A few weeks ago, I chose to write my data pipelines using Apache
       | Beam. It seems that Koheesio shares some features with this
       | project, but I believe Apache Beam is superior due to its ability
       | to run on various runners, support multiple programming
       | languages, and integrate with numerous data sources and
       | destinations.
        
         | esafak wrote:
         | If I wanted an abstraction layer for writing pipelines today
         | I'd look into ZenML:
         | https://www.zenml.io/vs/zenml-vs-orchestrators
        
       | tpoacher wrote:
       | Oh so like luigi? Great!
        
       | newfocogi wrote:
       | I'd like to understand what data engineering inside Nike is
       | actually like. I'm curious because I have relevant experience on
       | my LinkedIn profile, and I get reached out to almost weekly from
       | third party recruiters trying to fill really low paying contract
       | data engineering and ML jobs with Nike. These roles seem to be
       | targeting people with professional experience in the US but pay
       | roughly a 3rd of what I would consider the going rate. There's
       | another top level comment here that this tool might make sense
       | "in a shop with a lot of inexperienced devs", which would confirm
       | my anecdata. Maybe the roles are actually scams, who knows
       | :shrug:
        
         | steveBK123 wrote:
         | My experience with these kinds of tools (and I've built some
         | myself) tells me that you're better off hiring people who know
         | what they are doing or having enough experienced people PLUS
         | culture to train up juniors.
         | 
         | The idea that you'll just build a tool that makes hiring 10x as
         | many inexperienced devs work is dubious. Just one more new DSL
         | bro. Certainly we have cracked the code that no one else has.
         | 
         | The problem with these types of orgs/tools is that by its
         | nature your DSL constrains the juniors/inexperienced devs to
         | what is currently possible. There is not a lot of learning
         | unless you rotate them through the tools team periodically,
         | which no one does. It's also awful for the devs who are
         | building in experience in something they can't use anywhere
         | else.
         | 
         | I was in one shop where the "tools team" guy's epiphany was that
         | he would meta-recruit by poaching the best data engineers out of
         | the ETL team, lol. Very explicit "good team / bad team" vibes.
        
         | hipadev23 wrote:
         | Nike's data engineering is very bad. It's hundreds of temporary
         | contractors, mostly offshore, all with 6-18 month tenures, and
         | everyone reinvents their own square wheel. Thousands upon
         | thousands of abandoned Confluence pages of documentation. The
         | most convoluted SQL and data architecture you'll ever find.
         | Getting answers to simple questions like "How many shoes did we
         | sell in-store vs ecommerce last week?" is a nearly impossible
         | task.
        
           | notyourwork wrote:
           | > Getting answers to simple questions like "How many shoes
           | did we sell in-store vs ecommerce last week?" is a nearly
           | impossible task.
           | 
           | I find this type of thing scary as an outsider looking in.
           | How a company so large has such immature engineering
           | continues to astonish me.
        
             | shepardrtc wrote:
             | > I find this type of thing scary as an outsider looking
             | in. How a company so large has such immature engineering
             | continues to astonish me.
             | 
             | It's management that doesn't want to risk their positions
             | by doing the very difficult business of either starting
             | over or properly simplifying their stack. It's not easy,
             | it's not quick, but if they can't even answer that basic
             | question then they need to do the work.
        
               | crowcroft wrote:
               | To defend management a _little_ bit, these massive
               | companies have existed through many eras of technology
               | with many different managers. They work with many
               | external companies in many different ways. They have an
               | exceptionally complex, but functioning tech stack, that
               | allows all of these many dependencies to function
               | together. Lastly, they are successful as they are!
               | 
               | It's not usually an issue of immaturity, it's just really
               | hard. To make things worse, often people don't really
               | want to do the work because literally any other data
               | engineering job would probably be more enjoyable.
               | 
               | Simplifying the tech stack would probably require
               | simplifying their business operations, which probably
               | means less revenue.
               | 
               | Starting over is often literally not possible because
               | there are so many interconnected systems that aren't all
               | necessarily owned by the company trying to make the
               | decision...
        
           | rootusrootus wrote:
           | > How many shoes did we sell in-store vs ecommerce last week?
           | 
           | That is perhaps not a great example. My brother is a business
           | analyst at Nike (has been for 15 years or more). I just asked
           | him how hard it would be to answer that question and he said
           | it would be pretty easy. Granted, this is the kind of data he
           | works with routinely, so it may be more difficult for other
           | teams that do not.
        
             | MisterTea wrote:
             | Is he the person running the query or is he reading a blood
             | spattered report pulled from the dead hands of some data
             | engineer who perished battling the system to retrieve said
             | data?
        
         | alex_lav wrote:
         | Speaking only from my own experience, my contract was ~2x what
         | competitors were paying in the US. This was similar amongst the
         | contractors I worked with, depending on seniority.
        
       | djaouen wrote:
       | So this is like Broadway (Elixir), but for Python?
        
       | waffletower wrote:
       | Many data engineering problems are impeded by strong typing,
       | particularly type transduction applications (translating between
       | a database type system and a transport such as Avro, for
       | example). While in many cases that is somebody else's problem --
       | it is solved in a library -- when it isn't, the strengths and
       | facility of a dynamic language can save you considerable code
       | complexity and maintenance. Type control is often central to
       | reporting as well, and it is, again, more awkward in a strong
       | typing context. I would tend to argue that insistence upon type
       | frameworks such as pydantic in a data engineering framework is
       | naive and imposed by academic rather than industry experience.
       | There is a reason that python is chosen for data processing
       | applications, and it certainly isn't typing.
        
         | waffletower wrote:
         | Giving this some more thought: I do know that Nike has a
         | revolving door for developers. It seems that a framework like
         | Koheesio allows Nike to essentially hire for scala from a pool
         | of candidates that only have python experience. Once hired, as
         | they use pyspark and koheesio daily, they don't even know they
         | are scala developers. Much easier to hire/fire python
         | developers these days.
        
         | IshKebab wrote:
         | Python is chosen for data processing because it's one of the
         | most popular languages in the world and it has a passable REPL
         | (which is more than most languages) so you can use it for
         | experimentation.
         | 
         | In the readme they describe it as "robust" and having a high
         | level of type safety, so I'm guessing they're just leaning more
         | towards the "learn about bugs before they hit production" end
         | of the spectrum than you.
         | 
         | Then again I don't do any of this data engineering stuff so
         | maybe it doesn't matter too much if it doesn't work reliably?
        
         | sixdimensional wrote:
         | You're right, but this problem doesn't go away when you use
         | Python - because you still have to interact with the type
         | systems of the platforms you are integrating. It's a much more
         | difficult and nuanced problem than I think most people realize.
        
       | yevpats wrote:
       | Check out CloudQuery - Arrow powered ELT framework (Author here
       | :) )
        
         | Jgrubb wrote:
         | Neato, and I personally appreciate your finops.sql example
         | query :)
        
       ___________________________________________________________________
       (page generated 2024-06-04 23:01 UTC)