[HN Gopher] Koheesio: Nike's Python-based framework to build adv...
___________________________________________________________________
Koheesio: Nike's Python-based framework to build advanced data-
pipelines
Author : betacar
Score : 195 points
Date : 2024-06-04 05:07 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jiggunjer wrote:
| Another snakemake?
| esafak wrote:
| > Koheesio is not in competition with other libraries.
|
| Yes, it is, because nobody wants to run multiple orchestrators,
| and the "What sets Koheesio apart from other libraries?" section
| does little to help users decide why they should pick yours.
|
| Workflow orchestration is a mature category, as evidenced by the
| length of this list:
| https://github.com/meirwah/awesome-workflow-engines
|
| I would expect someone who's seriously writing a new orchestrator
| in 2024 to cite the alternatives, their shortcomings, and how you
| intend to address them. Bonus points if you make a neat little
| table.
|
| The fact that you're leading with Python does not inspire
| confidence. Pretty much all workflow orchestrators use Python for
| their glue, and that's hardly the interesting part.
|
| What were they using at Nike before this?
| zaptheimpaler wrote:
| This is the kind of attitude that makes people, companies,
| researchers hesitant to publish code online. It's free code for
| everyone to see, they don't owe you anything. It's not
| necessarily a "product" for your consumption, it's just a repo.
| esafak wrote:
| I don't think so; this is a corporate project, not an
| enthusiast's. They probably had to internally justify
| building it over using an existing solution, so they could
| simply share their rationale. I am not trying to rain on
| their parade.
| fforflo wrote:
| It's a library written by some devs who thought it might be
| useful to others, too, and/or are proud enough to share
| their work. It's not that they'll publish it in an SEC
| filing.
|
| For every n-th solution in the market, n-1 existing ones
| could have been used, but they weren't for (many times)
| good reasons.
|
| And looking at their Makefile and pyproject.toml I can see
| that they knew what they wanted.
| klohto wrote:
| yes, Nike is a corp where everyone sings kumbaya and the
| engineers have unlimited time to work on and publish
| libraries they are proud of, they don't have to deliver
| specific values at all.
| blitzar wrote:
| the kumbaya and coding is only allowed in the 5 minute
| lunchbreak at the nike sweatshops
| muspimerol wrote:
| I think this is a fair criticism. You chose to share this on
| hackernews, which invites feedback (including constructive
| criticism). I don't see the problem ¯\_(ツ)_/¯ Do you
| expect people to only voice praise for open source projects?
| mb5 wrote:
| Why do you think the poster is connected with the repo? I
| couldn't see any link.
| muspimerol wrote:
| I don't. Regardless who posted it, discussion and
| constructive criticism is warranted in the comments.
| yunohn wrote:
| /s
|
| Like it or hate it, it's how social media works. Everyone
| posts other people's things, and then more people get
| together to rip it to shreds.
| koliber wrote:
| It's a fair set of questions, posed fairly directly, without
| any sugarcoating. If you read it as a ruthless critique, it
| stings. If you read it as constructive feedback, it can bring
| the author a lot of value.
|
| If the author can clearly show the value proposition of their
| library, it will get more adoption, and the community and the
| author will get value.
|
| If the author realizes they coded something that is already a
| well-solved problem, or a poorly-constructed alternative, the
| author and Nike could gain by throwing this away and going
| with a better alternative.
|
| Personally, I wasted too many hours of my life creating a
| solution that did not solve any problems. Had I done some
| thinking and/or market research ahead of time, it would have
| saved me tons of time to work on more worthwhile endeavors.
|
| Good feedback is worth its weight in gold, even if it hurts a
| bit to hear it.
| bt1a wrote:
| I'll agree that the pages about Koheesio hosted on nike's
| website are strangely empty of persuasive specifics (seems like
| generative AI used by someone who was in charge of writing it
| up lol..), but maybe the authors believe that someone who's
| seriously considering a data pipeline library would want to
| look at the actual code (which by the way, is structured quite
| well). Python doesn't "inspire confidence"? Think of the target
| audience. The framework claims to take typing and tests quite
| seriously, so perhaps your confidence is gained elsewhere. I
| didn't go through the codebase because I'm not interested, but
| your dismissive attitude doesn't radiate good faith
| vincnetas wrote:
| To the comments saying that the parent comment lacks good faith,
| and that the code is free so take it as is: consider that the
| parent has every right to express his concerns and opinions.
| If you don't like criticism or inconvenient questions, just
| skip or ignore them. For example, the parent's insights were
| interesting to me. Good questions that I can learn from about
| how to critically evaluate things.
| boffinAudio wrote:
| >nobody wants to run multiple orchestrators
|
| Straw man argument.
|
| >I would expect someone who's seriously writing a new
| orchestrator in 2024 to cite the alternatives, their
| shortcomings, and how you intend to address them. Bonus points
| if you make a neat little table.
|
| Did you even try to read the docs before you launched this
| critical diatribe?
|
| From the docs
| (https://engineering.nike.com/koheesio/latest/tutorials/onboa...):
|
| Advantages of Koheesio
|
| Using Koheesio instead of raw Spark has several advantages:
|
| * Modularity: Each step in the pipeline (reading, transformation,
|   writing) is encapsulated in its own class, making the code
|   easier to understand and maintain.
| * Reusability: Steps can be reused across different tasks,
|   reducing code duplication.
| * Testability: Each step can be tested independently, making it
|   easier to write unit tests.
| * Flexibility: The behavior of a task can be customized using a
|   Context class.
| * Consistency: Koheesio enforces a consistent structure for data
|   processing tasks, making it easier for new developers to
|   understand the codebase.
| * Error Handling: Koheesio provides a consistent way to handle
|   errors and exceptions in data processing tasks.
| * Logging: Koheesio provides a consistent way to log information
|   and errors in data processing tasks.
|
| In contrast, using the plain PySpark API for transformations can
| lead to more verbose and less structured code, which can be
| harder to understand, maintain, and test. It also doesn't provide
| the same level of error handling, logging, and flexibility as the
| Koheesio Transform class.
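
The "each step is its own class" idea in the quoted docs can be sketched in plain Python. This is only an illustration of the pattern, not Koheesio's API: `Step`, `ReadOrders`, `DropEmptyOrders`, `CollectWriter` and `run_pipeline` are hypothetical names, and lists of dicts stand in for Spark DataFrames.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Row = Dict[str, object]


class Step(ABC):
    """Hypothetical base class: one unit of work in a pipeline."""

    @abstractmethod
    def run(self, rows: List[Row]) -> List[Row]:
        ...


class ReadOrders(Step):
    """Reader step: ignores its input and produces rows (hard-coded here)."""

    def run(self, rows):
        return [
            {"order_id": 1, "channel": "store", "qty": 2},
            {"order_id": 2, "channel": "web", "qty": 0},
            {"order_id": 3, "channel": "web", "qty": 5},
        ]


class DropEmptyOrders(Step):
    """Transformation step: keeps only rows with a positive quantity."""

    def run(self, rows):
        return [r for r in rows if r["qty"] > 0]


class CollectWriter(Step):
    """Writer step: stores rows in memory instead of writing to a real sink."""

    def __init__(self):
        self.written = []

    def run(self, rows):
        self.written.extend(rows)
        return rows


def run_pipeline(steps: List[Step]) -> List[Row]:
    """Feed the output of each step into the next one."""
    rows: List[Row] = []
    for step in steps:
        rows = step.run(rows)
    return rows


writer = CollectWriter()
out = run_pipeline([ReadOrders(), DropEmptyOrders(), writer])
```

Because every step is a small class with a single `run` method, a transformation such as `DropEmptyOrders` can be unit-tested on a hand-written list without starting a cluster, which is roughly the testability claim in the list above.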
|
| It took me less than 15 seconds to find the solution to the
| problem you propose. How long did it take you to formulate your
| critique? Do you perhaps just have a prejudice against Nike
| (corporate haze), or is it an investment in a 'competing
| orchestrator' that is clouding your judgement?
| esafak wrote:
| Every modern workflow orchestrator does those things you
| quote, and more. You make it sound like they're innovations
| when they're table stakes. Why wouldn't you just use Flyte or
| Kubeflow?
|
| Also, the fact that they say that the alternative is "raw
| Spark" tells me either that they're confused, or not very
| good at explaining. Spark is used to execute tasks in a
| pipeline, not to orchestrate it.
| thenaturalist wrote:
| While I generally tend to agree with your basic criticism,
| I think you need to keep in mind our perspectives might be
| biased due to limited data.
|
| Flyte went OSS what, 4 years ago? I'm not super familiar
| with it, but it could have been that a) it was too unpolished
| at the time or b) requiring K8s was a non-starter for
| some teams/orgs. Same for Kubeflow.
|
| We also don't know for how long Koheesio existed within
| Nike.
|
| In short, there's a lot we don't know and there's a good
| chance the internal reasoning for investing in this made
| sense under certain circumstances in the past.
| esafak wrote:
| This project is two weeks old!
| serial_dev wrote:
| > Workflow orchestration is a mature category, as evidenced by
| the length of this list (...)
|
| Or could it be evidence that the existing tools all have
| their flaws and any reasonably sized organization will hit
| these flaws pretty early, so many orgs come to the conclusion
| that the best approach is to just roll their own that fits
| their case relatively well?
| thenaturalist wrote:
| Are you at all familiar with the data integration / ETL
| framework space?
|
| If so, I think you would recognize how unreasonable the
| presumption of your comment is. It certainly strikes me as
| such.
|
| Most if not all the big OSS frameworks and commercial
| offerings originated in one big corp and subsequently moved
| into the Apache direction (Airflow) or spun out into their
| own companies.
| whimsicalism wrote:
| no offense, but you clearly don't know much about
| orchestrators
| gonzo41 wrote:
| Every task orchestration tool kinda has a crappy security
| model. Sure, there are a ton of them, but when you start putting
| them all over the place it's just hectic to get right. That is,
| I feel, the space where someone could make gains in a big way.
| whimsicalism wrote:
| yeah they need to explain why this is better than flyte
| rockostrich wrote:
| > Workflow orchestration is a mature category
|
| As far as I can tell from the docs, this is not an orchestrator
| at all. It is a Python-wrapper for certain runtimes like
| PySpark. I don't see anything in the docs that mentions DAGs,
| dependency definitions, scheduling, or deployment.
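
For readers unsure what "DAGs" and "dependency definitions" add on top of a runtime wrapper, here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). The task names are made up for illustration; a real orchestrator would also handle scheduling, retries and deployment, none of which is shown.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "aggregate": {"join"},
    "publish_report": {"aggregate"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    # TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The relative order of the two `extract_*` tasks is not fixed, since neither depends on the other; an orchestrator is free to run them in parallel.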
| hipadev23 wrote:
| Had Nike as a client for a period of time, interacted with quite
| a few people across their data org. There is absolutely no
| software you want authored by them.
| annexrichmond wrote:
| This is just flat out rude.
| sam_lowry_ wrote:
| It's a big organization, but I can understand the feeling,
| because I had the same attitude towards Microsoft, Oracle,
| Salesforce and many others.
| alex_lav wrote:
| It's a big organization, which means there will be areas
| full of great people and areas full of less-than great
| people. To discount all work from a group that large is
| just silly.
| teekert wrote:
| I worked at a large Healthtech company before. I get the
| sentiment. But, in the forest of software based on poor
| decisions, there were definitely a couple of valuable gems made
| by very knowledgeable people. Generally those were the people
| with an "opensource mindset" (as opposed to the "my ultra-
| crappy code that is just some FOSS glued together is super
| valuable IP and you need to read 100s of QMS docs before you
| can lay your eyes on it"-people). Don't measure the whole org
| by the same yardstick.
| benterix wrote:
| A probably better explanation of what it is and why you might
| want to use it (or not) can be found here:
|
| https://engineering.nike.com/koheesio/latest/tutorials/onboa...
| serial_dev wrote:
| I used to work a little with ETLs, Spark, Storm, etc and I
| honestly don't understand the value proposition of this library.
| I'm no data engineering expert by any means (it was like 2 years
| working on data eng stuff about 30% of the time 5+ years ago),
| but I expected that at least I'd get what this is useful for.
| ubercore wrote:
| This, at a glance, seems pretty simplistic. A neat project, but
| not something I would have expected on HN front page.
| whoiscroberts wrote:
| After decades of working on overly abstracted clever
| applications the only place I see elegance is in simplicity.
| I'd like to see more libraries like this on the front page.
| tiew9Vii wrote:
| From their docs: > Koheesio is a Python library that simplifies
| the development of data engineering pipelines. It provides a
| structured way
|
| I think this pretty much sums it up, "a structured way".
|
| It looks to be a thin wrapper around Spark to provide a
| consistent way to structure ETL jobs. They've implemented a
| mini dsl defining jobs as a datastructure on top of Spark.
|
| I've seen several companies build stuff similar to this
| internally, defining jobs as a data structure. It all amounts
| to each company having their own internal conventions, their
| own view of what is easier for their devs and creating a
| framework for it. Nike have just decided to make theirs public.
|
| You can do all this simply with simple spark scripts.
| Personally I'd use simple spark scripts. Big companies with
| lots of people love making these tools as companies love
| conventions, their conventions, their style guides, and deal with
| staff churn/on-boarding frequently, so they believe these kinds
| of things make that easier.
|
| Probably makes sense in Nike as a way of organizing their ETL
| jobs but that looks to be all it brings. A way to
| structure/define your simple spark jobs the way Nike devs
| believe it should be done.
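
A rough sketch of what "defining jobs as a data structure" can look like, for anyone who has not seen one of these in-house frameworks. Everything here is hypothetical (the `operation` registry, the `op` key, the two operations); it runs on plain lists of dicts rather than Spark and is not taken from Koheesio.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Registry of named operations; each takes rows plus keyword options.
OPERATIONS: Dict[str, Callable[..., List[Row]]] = {}


def operation(name: str):
    """Decorator that registers a function under a name usable in job specs."""
    def register(func):
        OPERATIONS[name] = func
        return func
    return register


@operation("filter_equals")
def filter_equals(rows, column, value):
    return [r for r in rows if r[column] == value]


@operation("rename")
def rename(rows, old, new):
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]


def run_job(job: List[dict], rows: List[Row]) -> List[Row]:
    """Interpret a job spec: apply each listed operation in order."""
    for step in job:
        options = {k: v for k, v in step.items() if k != "op"}
        rows = OPERATIONS[step["op"]](rows, **options)
    return rows


# The job itself is just data, so it could equally live in YAML or JSON.
job = [
    {"op": "filter_equals", "column": "channel", "value": "web"},
    {"op": "rename", "old": "qty", "new": "quantity"},
]
rows = [
    {"channel": "web", "qty": 5},
    {"channel": "store", "qty": 2},
]
result = run_job(job, rows)
print(result)  # [{'channel': 'web', 'quantity': 5}]
```

The trade-off the comment describes is visible even at this size: the spec is uniform and easy to review, but anything not already in `OPERATIONS` needs someone to extend the framework first.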
| anentropic wrote:
| it looks like a layer of sugar over PySpark
|
| seems to be Spark-only AFAICT?
| adrianbr wrote:
| That's really cool, did you already see the dlt library? That
| one's made for very easy-to-use EL in Python. It's similarly
| modular and built by senior data engineers for the data team, and
| the sources are generators which you could probably use too.
|
| How is koheesio different to dlt? Where could they complement
| each other?
| steveBK123 wrote:
| If I had to guess, a tool like this might be useful in a shop
| with a lot of inexperienced devs. It's a thin wrapper to make
| sure everyone walks the same well worn path the same way. You
| have 2-3 devs work on the tooling, and a much larger team doing
| rote ETL.
|
| I worked at a shop that did this and the trade-off is TTM, as
| your 2 person tools team is constantly needing to unblock ETL
| team with new features as they encounter new requirements in the
| wild.
|
| If your ETL team is 20+ people and the tools team doesn't have a
| head start, tools team will quickly fall behind an insurmountable
| backlog as your ETL team spins its wheels. But you might save
| some money if you choose the right KPI..
| haddr wrote:
| I think this is the case: when you run your pipelines at scale
| you want to standardize and simplify some repeatable aspects to
| lower the cost of managing them. You may also want to be
| orthogonal to orchestrator engines (or triggering engines) and
| avoid getting too opinionated and inflexible in the future. So
| this framework is exploring some sweet spot between raw spark
| pipelines and low code etl engines.
| steveBK123 wrote:
| yeah though a lot of these fall for a variant of the
| "universal standard" conceit joked about in xkcd. All these
| low-code solutions suck, so we'll build our own in-house that
| surely won't have the same pitfalls..
| datavirtue wrote:
| I built a data processing framework at GE that let junior devs
| write whatever code they needed to transform a particular
| input. It provided an interface that they had to satisfy (for
| data lineage metrics) but otherwise scaled their code without
| them having to understand the distributed architecture or
| anything about the platform. Exceptions flowed up to the
| platform and became part of the data lineage metrics.
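
The shape described above (user code satisfies a small interface, the platform catches exceptions and turns them into lineage data) might look something like the following. This is a guess at the general pattern, not the GE framework: `Platform`, `LineageRecord` and `to_celsius` are invented names, and the distributed execution is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LineageRecord:
    """Hypothetical lineage entry for one input handled by the platform."""
    input_id: str
    status: str                 # "ok" or "failed"
    rows_out: int = 0
    error: Optional[str] = None


@dataclass
class Platform:
    """Runs user-supplied transforms and records what happened to each input."""
    lineage: List[LineageRecord] = field(default_factory=list)

    def process(self, inputs: Dict[str, list],
                transform: Callable[[list], list]) -> Dict[str, list]:
        outputs = {}
        for input_id, payload in inputs.items():
            try:
                result = transform(payload)
            except Exception as exc:  # user errors become lineage data, not crashes
                self.lineage.append(
                    LineageRecord(input_id, "failed",
                                  error=f"{type(exc).__name__}: {exc}")
                )
                continue
            outputs[input_id] = result
            self.lineage.append(LineageRecord(input_id, "ok", rows_out=len(result)))
        return outputs


def to_celsius(readings):
    """Example user transform; raises TypeError on non-numeric input."""
    return [(f - 32) * 5 / 9 for f in readings]


platform = Platform()
outputs = platform.process(
    {"sensor_a": [32, 212], "sensor_b": [50, "n/a"]}, to_celsius
)
for record in platform.lineage:
    print(record)
```

The author of `to_celsius` never writes error handling or bookkeeping; the bad `sensor_b` input shows up as a failed lineage record while `sensor_a` is still processed.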
|
| I walked into 20 years of adhoc code that had zero data
| lineage, recoverability, or scalability that was breaking
| daily. There were contractors with over a decade of tenure
| whose job it was to troubleshoot and fix their own brittle
| processes (and make new ones) daily.
|
| I got laid off (737 MAX plus pandemic) as I was rolling it out
| and they went back to the old way.
|
| Subsequently, a new startup (Pantomath) emerged with former GE
| engineers (and other former colleagues of mine) from my former
| department to address that problem domain.
|
| Based on my experience trying to socialize this type of
| solution, sales are going to be a bitch.
| steveBK123 wrote:
| A framework with composable building blocks, allowing devs to
| unblock themselves by adding the functionality they need is a
| good solution.
| alessmar wrote:
| A few weeks ago, I chose to write my data pipelines using Apache
| Beam. It seems that Koheesio shares some features with this
| project, but I believe Apache Beam is superior due to its ability
| to run on various runners, support multiple programming
| languages, and integrate with numerous data sources and
| destinations.
| esafak wrote:
| If I wanted an abstraction layer for writing pipelines today
| I'd look into ZenML:
| https://www.zenml.io/vs/zenml-vs-orchestrators
| tpoacher wrote:
| Oh so like luigi? Great!
| newfocogi wrote:
| I'd like to understand what data engineering inside Nike is
| actually like. I'm curious because I have relevant experience on
| my LinkedIn profile, and I get reached out to almost weekly from
| third party recruiters trying to fill really low paying contract
| data engineering and ML jobs with Nike. These roles seem to be
| targeting people with professional experience in the US but pay
| roughly a 3rd of what I would consider the going rate. There's
| another top level comment here that this tool might make sense
| "in a shop with a lot of inexperienced devs", which would confirm
| my anecdata. Maybe the roles are actually scams, who knows
| :shrug:
| steveBK123 wrote:
| My experience with these kinds of tools (and I've built some
| myself) tells me that you're better off hiring people who know
| what they are doing or having enough experienced people PLUS
| culture to train up juniors.
|
| The idea that you'll just build a tool that makes hiring 10x as
| many inexperienced devs work is dubious. Just one more new DSL
| bro. Certainly we have cracked the code that no one else has.
|
| The problem with these types of orgs/tools is that by its
| nature your DSL constrains the juniors/inexperienced devs to
| what is currently possible. There is not a lot of learning
| unless you rotate them through the tools team periodically,
| which no one does. It's also awful for the devs who are
| building experience in something they can't use anywhere
| else.
|
| I was in one shop where the "tools team" guy's epiphany was he
| would meta-recruit by poaching the best data engineers out of
| the ETL team, lol. Very explicit "good team / bad team" vibes.
| hipadev23 wrote:
| Nike's data engineering is very bad. It's hundreds of temporary
| contractors, mostly offshore, all with 6-18 month tenures, and
| everyone reinvents their own square wheel. Thousands upon
| thousands of abandoned confluence pages of documentation. The
| most convoluted SQL and data architecture you'll ever find.
| Getting answers to simple questions like "How many shoes did we
| sell in-store vs ecommerce last week?" is a nearly impossible
| task.
| notyourwork wrote:
| > Getting answers to simple questions like "How many shoes
| did we sell in-store vs ecommerce last week?" is a nearly
| impossible task.
|
| I find this type of thing scary as an outsider looking in.
| How a company so large has such immature engineering
| continues to astonish me.
| shepardrtc wrote:
| > I find this type of thing scary as an outsider looking
| in. How a company so large has such immature engineering
| continues to astonish me.
|
| It's management that doesn't want to risk their positions
| by doing the very difficult business of either starting
| over or properly simplifying their stack. It's not easy,
| it's not quick, but if they can't even answer that basic
| question then they need to do the work.
| crowcroft wrote:
| To defend management a _little_ bit, these massive
| companies have existed through many eras of technology
| with many different managers. They work with many
| external companies in many different ways. They have an
| exceptionally complex, but functioning tech stack, that
| allows all of these many dependencies to function
| together. Lastly, they are successful as they are!
|
| It's not usually an issue of immaturity, it's just really
| hard. To make things worse, often people don't really
| want to do the work because literally any other data
| engineering job would probably be more enjoyable.
|
| Simplifying the tech stack would probably require
| simplifying their business operations, which probably
| means less revenue.
|
| Starting over is often literally not possible because
| there are so many interconnected systems that aren't all
| necessarily owned by the company trying to make the
| decision...
| rootusrootus wrote:
| > How many shoes did we sell in-store vs ecommerce last week?
|
| That is perhaps not a great example. My brother is a business
| analyst at Nike (has been for 15 years or more). I just asked
| him how hard it would be to answer that question and he said
| it would be pretty easy. Granted, this is the kind of data he
| works with routinely, so it may be more difficult for other
| teams that do not.
| MisterTea wrote:
| Is he the person running the query or is he reading a blood
| spattered report pulled from the dead hands of some data
| engineer who perished battling the system to retrieve said
| data?
| alex_lav wrote:
| Speaking only from my own experience, my contract was ~2x what
| competitors were paying in the US. This was similar amongst the
| contractors I worked with, depending on seniority.
| djaouen wrote:
| So this is like Broadway (Elixir), but for Python?
| waffletower wrote:
| Many data engineering problems are impeded by strong typing,
| particularly type transduction applications (translating between
| a database type system and a transport such as Avro, for
| example). While in many cases that is somebody else's problem --
| it is solved in a library -- when it isn't the strengths and
| facility of a dynamic language can save you considerable code
| complexity and maintenance. Type control is often central to
| reporting as well, and it is, again, more awkward in a strong
| typing context. I would tend to argue that insistence upon type
| frameworks such as pydantic in a data engineering framework is
| naive and imposed by academic rather than industry experience.
| There is a reason that python is chosen for data processing
| applications, and it certainly isn't typing.
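
As a concrete picture of the "type transduction" mentioned above, here is a small sketch that turns database column descriptions into an Avro record schema at runtime. The `SQL_TO_AVRO` table is an assumed, deliberately incomplete mapping and `avro_schema` is a hypothetical helper; only the Avro primitive type names and the `["null", T]` union convention come from the Avro spec.

```python
import json

# Assumed mapping from a few SQL type names to Avro primitive types.
SQL_TO_AVRO = {
    "INTEGER": "int",
    "BIGINT": "long",
    "REAL": "float",
    "DOUBLE": "double",
    "BOOLEAN": "boolean",
    "VARCHAR": "string",
    "TEXT": "string",
    "BLOB": "bytes",
}


def avro_schema(table, columns):
    """Build an Avro record schema (as a dict) from (name, sql_type, nullable) tuples."""
    fields = []
    for name, sql_type, nullable in columns:
        base = sql_type.split("(")[0].strip().upper()   # VARCHAR(64) -> VARCHAR
        avro_type = SQL_TO_AVRO[base]                    # KeyError for unmapped types
        if nullable:
            # Nullable columns become a union with "null" listed first.
            fields.append({"name": name, "type": ["null", avro_type], "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table, "fields": fields}


schema = avro_schema(
    "orders",
    [("id", "BIGINT", False), ("sku", "varchar(64)", False), ("discount", "DOUBLE", True)],
)
print(json.dumps(schema, indent=2))
```

Since the schema is assembled as ordinary dicts from whatever the database reports, no per-table classes are needed, which is the kind of flexibility the comment attributes to a dynamic language.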
| waffletower wrote:
| Giving this some more thought: I do know that Nike has a
| revolving door for developers. It seems that a framework like
| Koheesio allows Nike to essentially hire for scala from a pool
| of candidates that only have python experience. Once hired, as
| they use pyspark and koheesio daily, they don't even know they
| are scala developers. Much easier to hire/fire python
| developers these days.
| IshKebab wrote:
| Python is chosen for data processing because it's one of the
| most popular languages in the world and it has a passable REPL
| (which is more than most languages) so you can use it for
| experimentation.
|
| From the readme they describe it as "robust" and having a high
| level of type safety, so I'm guessing they're just leaning more
| towards the "learn about bugs before they hit production"
| end of the spectrum than you.
|
| Then again I don't do any of this data engineering stuff so
| maybe it doesn't matter too much if it doesn't work reliably?
| sixdimensional wrote:
| You're right, but this problem doesn't go away when you use
| Python - because you still have to interact with the type
| systems of the platforms you are integrating. It's a much more
| difficult and nuanced problem than I think most people realize.
| yevpats wrote:
| Check out CloudQuery - Arrow powered ELT framework (Author here
| :) )
| Jgrubb wrote:
| Neato, and I personally appreciate your finops.sql example
| query :)
___________________________________________________________________
(page generated 2024-06-04 23:01 UTC)