[HN Gopher] Show HN: I built an open-source data pipeline tool i...
___________________________________________________________________
Show HN: I built an open-source data pipeline tool in Go
Every data pipeline job I had to tackle required quite a few
components to set up:
- One tool to ingest data
- Another one to transform it
- If you wanted to run Python, an orchestrator
- If you needed to check the data, a data quality tool

Besides being hard and time-consuming to set up, this stack is
also pretty high-maintenance. I had to do a lot of infra work, and
while these were billable hours for me, I didn't enjoy the work at
all. For some parts of it there were nice solutions like dbt, but
in the end, for an end-to-end workflow, it didn't work. That's why
I decided to build an end-to-end solution that could take care of
data ingestion, transformation, and Python stuff. Initially it was
just for our own usage, but in the end we thought this could be a
useful tool for everyone.

At its core, Bruin is a data framework that consists of a CLI
application written in Golang, and a VS Code extension that
supports it with a local UI. Bruin supports quite a few things:
- Data ingestion using ingestr
  (https://github.com/bruin-data/ingestr)
- Data transformation in SQL & Python, similar to dbt
- Python environment management using uv
- Built-in data quality checks
- Secrets management
- Query validation & SQL parsing
- Built-in templates for common scenarios, e.g. Shopify, Notion,
  Gorgias, BigQuery, etc.

This means that you can write end-to-end pipelines within the same
framework and get them running with a single command. You can run
it on your own computer, on GitHub Actions, or in an EC2 instance
somewhere. Using the templates, you can also have ready-to-go
pipelines with modeled data for your data warehouse in seconds.

It includes an open-source VS Code extension as well, which allows
working with the data pipelines locally, in a more visual way. The
resulting changes are all in code, which means everything stays
version-controlled regardless; the extension just adds a nice
layer on top.

Bruin can run SQL, Python, and data ingestion workflows, as well
as quality checks. For the Python stuff we use the awesome (and it
really is awesome!) uv under the hood: we install dependencies in
an isolated environment and install and manage the Python versions
locally, all in a cross-platform way. To manage data uploads to
the data warehouse, it uses dlt under the hood to load the data
into the destination. It also uses Arrow's memory-mapped files to
share the data between processes before uploading it to the
destination.
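
To illustrate the general memory-mapping idea (a rough Python
sketch using pyarrow, purely illustrative and not our actual Go
implementation): one process writes a table to an Arrow IPC file,
and another process maps the same file into memory and reads it
back without copying.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # producer process: dump a table to an Arrow IPC file on disk
    table = pa.table({"id": [1, 2, 3], "amount": [9.5, 3.2, 7.1]})
    with pa.OSFile("exchange.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # consumer process: memory-map the same file and read it back
    # zero-copy, so large tables are not duplicated in RAM
    with pa.memory_map("exchange.arrow", "rb") as source:
        loaded = ipc.open_file(source).read_all()
    print(loaded.num_rows)
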
We went with Golang because of its speed and strong concurrency
primitives, but more importantly, I knew Go better than the other
languages available to me and I enjoy writing Go, so there's also
that.

We had a small pool of beta testers for quite some time, and I am
really excited to launch Bruin CLI to the rest of the world and
get feedback from you all. I know it is not common to build data
tooling in Go, but I believe we found ourselves in a nice spot in
terms of features, speed, and stability.

https://github.com/bruin-data/bruin

I'd love to hear your feedback and learn more about how we can
make data pipelines easier and better to work with. Looking
forward to your thoughts!

Best, Burak
Author : karakanb
Score : 103 points
Date : 2024-12-17 16:40 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| producthunter90 wrote:
| How does it handle scheduling or orchestrating pipeline runs? Do
| you integrate with tools like Airflow, or is there a built-in
| solution for that?
| karakanb wrote:
| Bruin orchestrates the assets within an individual run of a
| single pipeline, which means you can use any external tool to
| schedule the runs and Bruin will take care of the rest. You can
| use GitHub Actions, Airflow, a regular cronjob, or any other form
| of scheduling for that.
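|
| As a rough illustration (Python used only as an example wrapper;
| it assumes the bruin binary is on PATH and that ./my-pipeline is
| a made-up pipeline folder), an external scheduler essentially
| just shells out to the CLI:
|
|     import subprocess
|
|     # whatever triggers this (cron, an Airflow task, a GitHub
|     # Actions step) acts as the scheduler; Bruin orchestrates
|     # the assets within the run
|     subprocess.run(["bruin", "run", "./my-pipeline"], check=True)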
| thruflo wrote:
| It's pretty remarkable what Bruin brings together into a single
| tool / workflow.
|
| If you're doing data analytics in Python it's well worth a look.
| karakanb wrote:
| thanks a lot for the kind words, James!
| kyt wrote:
| Why use this over Meltano?
| ellisv wrote:
| The README would benefit from a comparison to other tools.
|
| I'm not (necessarily) motivated to switch tooling because of
| the language it is written in. I'm motivated to switch tooling
| if it has better ergonomics, performance, or features.
| karakanb wrote:
| good point, thanks. I'll definitely add some more details
| about the comparison between different tools.
|
| I agree with you 100% on the language part. I think it is an
| interesting detail for a data tool to be built in Go, but we
| have a lot more than that; a couple of things we do differently:
|
| - everything is local-first: native Python support, local VS
| Code extension, isolated local environments, etc
|
| - very quick iteration speed: rendered queries, backfills,
| all running locally
|
| - support for data ingestion, transformation, and quality,
| without leaving the framework, while also having the ability
| to extend it with Python
|
| these are some of the improvements we focused on bringing into
| the workflows. I hope this explains our thinking a bit more.
| ellisv wrote:
| My #1 feedback would be to expand on the documentation.
|
| I really want to know how this is going to benefit me
| before I start putting in a lot of effort to switch to
| using it. That means I need to see why it is better than
| ${EXISTING_TOOL}.
|
| I also need to know that it is actually compatible with my
| existing data pipeline. For example, we have many single
| tenant databases that are replicated to a central
| warehouse. During replication, we have to attach source
| information to the records to distinguish them and for
| RBAC. It looks like I can do this with Bruin but the
| documentation doesn't explicitly talk about single tenant
| vs multi-tenant design.
| karakanb wrote:
| I would love to add a dedicated section on this, and would love
| to learn a bit more from you here. Are there any particular
| tools you compare Bruin against in your mind, where you would
| like to understand the difference better?
| karakanb wrote:
| great question! Meltano, if I am not wrong, only does data
| ingestion (Extract & Load), whereas we go further down the
| pipeline: transformation with SQL and Python, ML pipelines, data
| quality, and more.
|
| I guess a more comparable alternative would be Meltano + dbt +
| Great Expectations + Airflow (for Python stuff), whereas Bruin
| does all of them at once. In that sense, Bruin's alternative
| would be a stack rather than a single product.
|
| Does that make sense?
| ellisv wrote:
| Direct link to the documentation:
|
| https://bruin-data.github.io/bruin/
| jmccarthy wrote:
| Burak - one wish I've had recently is for a "py data ecosystem
| compiler", specifically one which allows me to express structures
| and transformations in dbt and Ibis, but not rely on Python at
| runtime. [Go|Rust]+[DuckDB|chDB|DataFusion] for the runtime.
| Bruin seems very close to the mark! Following.
| karakanb wrote:
| hey, thanks for the shoutout!
|
| I love the idea: effectively moving towards using the right
| platform for the right job, and it is very much in line with the
| direction we are taking things.
| Another interesting project in that spirit is sqlframe:
| https://github.com/eakmanrq/sqlframe
| JeffMcCune wrote:
| Congrats on the launch! Since this is Go, have you considered
| using CUE or looked at their flow package? Curious how you see it
| relating or helping with data pipelines.
| karakanb wrote:
| thanks!
|
| I did look into CUE in the very early days of Bruin but ended up
| going with a more YAML-based configuration due to YAML's wider
| support. I am not familiar with their flow package
| specifically, but I'll definitely take a deeper look. From a
| quick look, it seems like it could have replaced some of the
| orchestration code in Bruin to a certain extent.
|
| One of the challenges, maybe specific to the data world, is that
| the userbase is familiar with a certain set of tools and
| patterns, such as SQL and Python, so introducing even a small
| variance into the mix often adds friction; this was one of the
| reasons we didn't go with CUE at the time. I should definitely
| take another look though. thanks!
| tony_francis wrote:
| How does this compare to ray data?
| karakanb wrote:
| I didn't know about Ray Data before, but I just gave it a quick
| look and it seems like a framework for ML workloads specifically?
|
| Bruin effectively operates a layer above individual assets, and
| instead takes a declarative approach to the full pipeline,
| which could contain assets that are using Ray internally. In
| the end, think of Bruin as a full pipeline/orchestrator, which
| would contain one or more assets using various other
| technologies.
|
| I hope this makes sense.
| NortySpock wrote:
| Interesting, I've been looking for a system / tool that
| acknowledges that a dbt transformation pipeline tends to be
| joined-at-the-hip with the data ingestion mode....
|
| As I read through the documentation: do you have a mode in
| ingestr that lets you specify the maximum lateness of a file
| (for late-arriving rows, files, or backfills)? I didn't see it
| in my brief read-through.
|
| https://bruin-data.github.io/bruin/assets/ingestr.html
|
| Reminds me a bit of Benthos / Bento / RedPanda Connect (in a good
| way)
|
| Interested to kick the tires on this (compared to, say, Python
| dlt)
| karakanb wrote:
| great point about the transformation pipeline, that's a very
| strong part of our motivation: it's never "just
| transformation", "just ingestion" or "just python", the value
| lies in being able to mix and match technologies.
|
| as per the lateness: ingestr does the fetching itself, which
| means the moment you run it, it will ingest the data right away,
| so there's no latency there. in terms of loading files from S3,
| as an example, you can already define your own blob pattern,
| which would allow you to ingest only the files that fit your
| lateness criteria. would this fit?
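|
| as a rough illustration of the kind of glob filtering involved
| (this sketch uses dlt's filesystem source directly, which is an
| assumption about the underlying mechanism rather than ingestr's
| exact interface, and the bucket/prefix names are made up):
|
|     from dlt.sources.filesystem import filesystem, read_csv
|
|     # only files matching this prefix + glob are picked up, so
|     # anything outside the pattern is simply skipped
|     files = filesystem(
|         bucket_url="s3://my-bucket/events/2024-12-17/",
|         file_glob="*.csv",
|     )
|     rows = files | read_csv()
|     for row in rows:
|         print(row)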
|
| in addition, we will implement the concept of a "sensor", which
| will allow you to wait until a certain condition is met, e.g. a
| table/file exists, or a certain query returns true, and
| continue the pipeline from there, which could also help your
| use case.
|
| feel free to join our slack community, happy to dig deeper into
| this and see what we can implement there.
| kakoni wrote:
| Is dlt part of bruin-stack?
| karakanb wrote:
| depends on what you mean by that, but we do use dlt through
| ingestr (https://github.com/bruin-data/ingestr), which is used
| inside Bruin CLI.
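|
| to give a sense of what dlt does at that layer, a minimal
| (purely illustrative) load looks roughly like this; Bruin drives
| dlt through ingestr rather than through Python code like this:
|
|     import dlt
|
|     # load a couple of records into a local duckdb file; in a
|     # real pipeline the source and destination come from config
|     pipeline = dlt.pipeline(
|         pipeline_name="demo",
|         destination="duckdb",
|         dataset_name="raw",
|     )
|     pipeline.run([{"id": 1}, {"id": 2}], table_name="events")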
___________________________________________________________________
(page generated 2024-12-17 23:00 UTC)