[HN Gopher] Show HN: I built an open-source data pipeline tool i...
___________________________________________________________________
Show HN: I built an open-source data pipeline tool in Go
Every data pipeline job I had to tackle required quite a few
components to set up:
- One tool to ingest data
- Another one to transform it
- If you wanted to run Python, an orchestrator
- If you needed to check the data, a data quality tool

Besides being hard and time-consuming to set up, this stack is
also pretty high-maintenance. I had to do a lot of infra work, and
while these were billable hours for me, I didn't enjoy the work at
all. For some parts of it there were nice solutions like dbt, but
in the end, for an end-to-end workflow, it didn't work. That's why
I decided to build an end-to-end solution that could take care of
data ingestion, transformation, and Python stuff. Initially it was
just for our own usage, but in the end we thought this could be a
useful tool for everyone.

At its core, Bruin is a data framework that consists of a CLI
application written in Golang, and a VS Code extension that
supports it with a local UI. Bruin supports quite a few things:
- Data ingestion using ingestr
  (https://github.com/bruin-data/ingestr)
- Data transformation in SQL & Python, similar to dbt
- Python environment management using uv
- Built-in data quality checks
- Secrets management
- Query validation & SQL parsing
- Built-in templates for common scenarios, e.g. Shopify, Notion,
  Gorgias, BigQuery, etc.

This means that you can write end-to-end pipelines within the same
framework and get them running with a single command. You can run
it on your own computer, on GitHub Actions, or in an EC2 instance
somewhere. Using the templates, you can also have ready-to-go
pipelines with modeled data for your data warehouse in seconds.

It includes an open-source VS Code extension as well, which allows
working with the data pipelines locally, in a more visual way. The
resulting changes are all in code, which means everything stays
version-controlled regardless; the extension just adds a nice
layer on top.

Bruin can run SQL, Python, and data ingestion workflows, as well
as quality checks. For the Python stuff we use the awesome (and it
really is awesome!) uv under the hood: we install dependencies in
an isolated environment and install and manage the Python versions
locally, all in a cross-platform way. To manage data uploads to
the data warehouse, it uses dlt under the hood to load the data
into the destination. It also uses Arrow's memory-mapped files to
share the data between processes before uploading it to the
destination.
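
To illustrate the general memory-mapping idea (a rough Python
sketch using pyarrow, purely illustrative and not our actual Go
implementation): one process writes a table to an Arrow IPC file,
and another process maps the same file into memory and reads it
back without copying.

    import pyarrow as pa
    import pyarrow.ipc as ipc

    # producer process: dump a table to an Arrow IPC file on disk
    table = pa.table({"id": [1, 2, 3], "amount": [9.5, 3.2, 7.1]})
    with pa.OSFile("exchange.arrow", "wb") as sink:
        with ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # consumer process: memory-map the same file and read it back
    # zero-copy, so large tables are not duplicated in RAM
    with pa.memory_map("exchange.arrow", "rb") as source:
        loaded = ipc.open_file(source).read_all()
    print(loaded.num_rows)
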
We went with Golang because of its speed and strong concurrency
primitives, but more importantly, I knew Go better than the other
languages available to me and I enjoy writing Go, so there's also
that.

We had a small pool of beta testers for quite some time, and I am
really excited to launch Bruin CLI to the rest of the world and
get feedback from you all. I know it is not common to build data
tooling in Go, but I believe we found ourselves in a nice spot in
terms of features, speed, and stability.

https://github.com/bruin-data/bruin

I'd love to hear your feedback and learn more about how we can
make data pipelines easier and better to work with. Looking
forward to your thoughts!

Best, Burak
Author : karakanb
Score : 103 points
Date : 2024-12-17 16:40 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| producthunter90 wrote:
| How does it handle scheduling or orchestrating pipeline runs? Do
| you integrate with tools like Airflow, or is there a built-in
| solution for that?
| karakanb wrote:
| Bruin orchestrates the assets within an individual run of a
| single pipeline, which means you can use any external tool to
| schedule the runs and Bruin will take care of the rest. You can
| use GitHub Actions, Airflow, a regular cronjob, or any other form
| of scheduling for that.
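|
| As a rough illustration (Python used only as an example wrapper;
| it assumes the bruin binary is on PATH and that ./my-pipeline is
| a made-up pipeline folder), an external scheduler essentially
| just shells out to the CLI:
|
|     import subprocess
|
|     # whatever triggers this (cron, an Airflow task, a GitHub
|     # Actions step) acts as the scheduler; Bruin orchestrates
|     # the assets within the run
|     subprocess.run(["bruin", "run", "./my-pipeline"], check=True)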
| thruflo wrote:
| It's pretty remarkable what Bruin brings together into a single
| tool / workflow.
|
| If you're doing data analytics in Python it's well worth a look.
| karakanb wrote:
| thanks a lot for the kind words, James!
| kyt wrote:
| Why use this over Meltano?
| ellisv wrote:
| The README would benefit from a comparison to other tools.
|
| I'm not (necessarily) motivated to switch tooling because of
| the language it is written in. I'm motivated to switch tooling
| if it has better ergonomics, performance, or features.
| karakanb wrote:
| good point, thanks. I'll definitely add some more details
| about the comparison between different tools.
|
| I agree with you 100% on the language part. I think it is an
| interesting detail for a data tool to be built in Go, but we
| have a lot more than that; a couple of things we do differently:
|
| - everything is local-first: native Python support, local VS
| Code extension, isolated local environments, etc
|
| - very quick iteration speed: rendered queries, backfills,
| all running locally
|
| - support for data ingestion, transformation, and quality,
| without leaving the framework, while also having the ability
| to extend it with Python
|
| these are some of the improvements we focused on bringing into
| the workflows. I hope this explains our thinking a bit more.
| ellisv wrote:
| My #1 feedback would be to expand on the documentation.
|
| I really want to know how this is going to benefit me
| before I start putting in a lot of effort to switch to
| using it. That means I need to see why it is better than
| ${EXISTING_TOOL}.
|
| I also need to know that it is actually compatible with my
| existing data pipeline. For example, we have many single
| tenant databases that are replicated to a central
| warehouse. During replication, we have to attach source
| information to the records to distinguish them and for
| RBAC. It looks like I can do this with Bruin but the
| documentation doesn't explicitly talk about single tenant
| vs multi-tenant design.
| karakanb wrote:
| I would love to add a dedicated section on this, and would love
| to learn a bit more from you here. Are there any particular
| tools you compare Bruin against in your mind, where you would
| like to understand the difference better?
| karakanb wrote:
| great question! Meltano, if I am not wrong, only does data
| ingestion (Extract & Load), whereas we go further down the
| pipeline: transformation with SQL and Python, ML pipelines, data
| quality, and more.
|
| I guess a more comparable alternative would be Meltano + dbt +
| Great Expectations + Airflow (for Python stuff), whereas Bruin
| does all of them at once. In that sense, Bruin's alternative
| would be a stack rather than a single product.
|
| Does that make sense?
| ellisv wrote:
| Direct link to the documentation:
|
| https://bruin-data.github.io/bruin/
| jmccarthy wrote:
| Burak - one wish I've had recently is for a "py data ecosystem
| compiler", specifically one which allows me to express structures
| and transformations in dbt and Ibis, but not rely on Python at
| runtime. [Go|Rust]+[DuckDB|chDB|DataFusion] for the runtime.
| Bruin seems very close to the mark! Following.
| karakanb wrote:
| hey, thanks for the shoutout!
|
| I love the idea: effectively moving towards using the right
| platform for the right job, and it is very much in line with the
| direction we are taking things.
| Another interesting project in that spirit is sqlframe:
| https://github.com/eakmanrq/sqlframe
| JeffMcCune wrote:
| Congrats on the launch! Since this is Go, have you considered
| using CUE or looked at their flow package? Curious how you see it
| relating or helping with data pipelines.
| karakanb wrote:
| thanks!
|
| I did look into CUE in the very early days of Bruin but ended up
| going with a more YAML-based configuration due to YAML's wider
| support. I am not familiar with their flow package
| specifically, but I'll definitely take a deeper look. From a
| quick look, it seems like it could have replaced some of the
| orchestration code in Bruin to a certain extent.
|
| One of the challenges, maybe specific to the data world, is that
| the userbase is familiar with a certain set of tools and
| patterns, such as SQL and Python, so introducing even a small
| variance into the mix often adds friction; this was one of the
| reasons we didn't go with CUE at the time. I should definitely
| take another look though. thanks!
| tony_francis wrote:
| How does this compare to ray data?
| karakanb wrote:
| I didn't know about Ray Data before, but I just gave it a quick
| look and it seems like a framework for ML workloads specifically?
|
| Bruin effectively operates a layer above individual assets, and
| instead takes a declarative approach to the full pipeline,
| which could contain assets that are using Ray internally. In
| the end, think of Bruin as a full pipeline/orchestrator, which
| would contain one or more assets using various other
| technologies.
|
| I hope this makes sense.
| NortySpock wrote:
| Interesting, I've been looking for a system / tool that
| acknowledges that a dbt transformation pipeline tends to be
| joined-at-the-hip with the data ingestion mode....
|
| As I read through the documentation: do you have a mode in
| ingestr that lets you specify the maximum lateness of a file
| (for late-arriving rows, files, or backfills)? I didn't see it
| in my brief read-through.
|
| https://bruin-data.github.io/bruin/assets/ingestr.html
|
| Reminds me a bit of Benthos / Bento / RedPanda Connect (in a good
| way)
|
| Interested to kick the tires on this (compared to, say, Python
| dlt)
| karakanb wrote:
| great point about the transformation pipeline, that's a very
| strong part of our motivation: it's never "just
| transformation", "just ingestion" or "just python", the value
| lies in being able to mix and match technologies.
|
| as per the lateness: ingestr does the fetching itself, which
| means the moment you run it, it will ingest the data right away,
| so there's no latency there. in terms of loading files from S3,
| as an example, you can already define your own blob pattern,
| which would allow you to ingest only the files that fit your
| lateness criteria. would this fit?
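|
| as a rough illustration of the kind of glob filtering involved
| (this sketch uses dlt's filesystem source directly, which is an
| assumption about the underlying mechanism rather than ingestr's
| exact interface, and the bucket/prefix names are made up):
|
|     from dlt.sources.filesystem import filesystem, read_csv
|
|     # only files matching this prefix + glob are picked up, so
|     # anything outside the pattern is simply skipped
|     files = filesystem(
|         bucket_url="s3://my-bucket/events/2024-12-17/",
|         file_glob="*.csv",
|     )
|     rows = files | read_csv()
|     for row in rows:
|         print(row)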
|
| in addition, we will implement the concept of a "sensor", which
| will allow you to wait until a certain condition is met, e.g. a
| table/file exists, or a certain query returns true, and
| continue the pipeline from there, which could also help your
| use case.
|
| feel free to join our slack community, happy to dig deeper into
| this and see what we can implement there.
| kakoni wrote:
| Is dlt part of bruin-stack?
| karakanb wrote:
| depends on what you mean by that, but we do use dlt through
| ingestr (https://github.com/bruin-data/ingestr), which is used
| inside Bruin CLI.
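|
| to give a sense of what dlt does at that layer, a minimal
| (purely illustrative) load looks roughly like this; Bruin drives
| dlt through ingestr rather than through Python code like this:
|
|     import dlt
|
|     # load a couple of records into a local duckdb file; in a
|     # real pipeline the source and destination come from config
|     pipeline = dlt.pipeline(
|         pipeline_name="demo",
|         destination="duckdb",
|         dataset_name="raw",
|     )
|     pipeline.run([{"id": 1}, {"id": 2}], table_name="events")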
___________________________________________________________________
(page generated 2024-12-17 23:00 UTC)