[HN Gopher] Bauplan - Git-for-data pipelines on object storage
___________________________________________________________________
Bauplan - Git-for-data pipelines on object storage
Author : barabbababoon
Score : 34 points
Date : 2025-04-16 14:25 UTC (2 hours ago)
(HTM) web link (docs.bauplanlabs.com)
(TXT) w3m dump (docs.bauplanlabs.com)
| jtagliabuetooso wrote:
| Looking to get feedback on a code-first platform for data:
| instead of custom frameworks, GUIs, or notebooks on a cron,
| bauplan runs SQL / Python functions from your IDE, in the
| cloud, backed by your object storage. Everything is versioned
| and composable: time-travel, git-like branches, scriptable
| meta-logic.
|
| Perhaps surprisingly, we decided to co-design the abstractions
| and the runtime, which allowed novel optimizations at the
| intersection of FaaS and data - e.g. rebuilding functions can be
| 15x faster than the corresponding AWS stack
| (https://arxiv.org/pdf/2410.17465). All capabilities are
| available to humans (CLI) and machines (SDK) through simple APIs.
|
| Would love to hear the community's thoughts on moving data
| engineering workflows closer to software abstractions: tables,
| functions, branches, CI/CD etc.
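|
| To make the git-like part concrete, here is a minimal sketch of
| a branch-and-merge flow from the Python SDK (method and argument
| names here are illustrative shorthand, not a spec - check the
| docs for the exact client API):
|
|     import bauplan
|
|     client = bauplan.Client()
|
|     # fork the data catalog into an isolated branch, like git
|     client.create_branch("jacopo.experiment", from_ref="main")
|
|     # ... run pipelines / queries against the branch safely ...
|
|     # fold the branch back once the data looks right
|     client.merge_branch("jacopo.experiment", into_branch="main")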
| dijksterhuis wrote:
| the big question i have is -- where is the code executed? "the
| cloud"? whose cloud? my cloud? your environment on AWS?
|
| the paper briefly mentions "bring your own cloud" in 4.5 but
| the docs page doesn't seem to have any information on doing
| that (or at least none that i can find).
| zenlikethat wrote:
| The code you execute on your data currently runs in a per-
| customer AWS account managed by us. We leave the door open
| for BYOC based on the architecture we've designed, but due to
| lean startup life, that's not an option yet. We'd definitely
| be down to chat about it.
| korijn wrote:
| How does this compare to dbt? Seems like it can do the same?
| laminarflow027 wrote:
| To me it seems like the pythonic version of dbt! Instead of
| YAML, you write Python code. That, and a lot of on-the-fly
| computation to generate an optimized workflow plan.
| barabbababoon wrote:
| Plenty of stuff in common with dbt's philosophy. One big
| thing though: dbt does not run your compute or manage your
| lake. It orchestrates your code and pushes it down to a
| runtime (e.g. 90% of the time Snowflake).
|
| This IS a runtime.
|
| You import bauplan, write your functions and run them
| straight in the cloud - you don't need anything more.
| When you want to make a pipeline you chain the functions
| together, and the system manages the dependencies, the
| containerization, the runtime, and gives you git-like
| abstractions over runs, tables and pipelines.
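|
| As a rough sketch (the decorator and Model names are indicative
| of the pattern, not necessarily the exact SDK spelling, and the
| taxi table is made up):
|
|     import bauplan
|
|     @bauplan.model()
|     @bauplan.python("3.11", pip={"pandas": "2.2.0"})
|     def clean_trips(data=bauplan.Model("taxi_trips")):
|         # step one: read a catalog table and drop bad rows
|         df = data.to_pandas()
|         return df[df["fare_amount"] > 0]
|
|     @bauplan.model(materialize=True)
|     @bauplan.python("3.11", pip={"pandas": "2.2.0"})
|     def daily_revenue(trips=bauplan.Model("clean_trips")):
|         # step two: depends on step one simply by naming it
|         df = trips.to_pandas()
|         daily = df.groupby("pickup_date")["fare_amount"].sum()
|         return daily.reset_index()
|
| Running the project (via the run command in the CLI) then builds
| the DAG from those function signatures and executes it in the
| cloud.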
| zenlikethat wrote:
| Some similarities, but Bauplan offers:
|
| 1. Great Python support. Piping something from a structured
| data catalog into Python is trivial, and so is persisting
| results. With materialization, you never need to recompute
| something in Python twice if you don't want to -- you can
| store it in your data catalog forever.
|
| Also, you can request any Python package you want, and even
| have different Python versions and packages in different
| workflow steps (a sketch follows after point 3).
|
| 2. Catalog integration. Safely make changes and run
| experiments in branches.
|
| 3. Efficient caching and data re-use. We do a ton of tricks
| behind the scenes to avoid recomputing or rescanning things
| that have already been done, and pass data between steps with
| Arrow zero copy tables. This means your DAGs run a lot faster
| because the amount of time spent shuffling bytes around is
| minimal.
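|
| Here is that sketch: two steps, two Python versions, two
| different package sets (names are illustrative and the tables
| are made up, so treat this as the shape, not the exact API):
|
|     import bauplan
|
|     @bauplan.model()
|     @bauplan.python("3.10", pip={"pandas": "1.5.3"})
|     def trip_stats(data=bauplan.Model("clean_trips")):
|         # this step runs on Python 3.10 with an older pandas
|         return data.to_pandas().describe().reset_index()
|
|     @bauplan.model(materialize=True)
|     @bauplan.python("3.12", pip={"duckdb": "1.0.0"})
|     def zone_counts(data=bauplan.Model("clean_trips")):
|         # this step runs on Python 3.12 with duckdb instead;
|         # materializing persists the output to the catalog so
|         # it never has to be recomputed
|         import duckdb
|         # assuming the input arrives as an Arrow table (point 3)
|         trips = data
|         return duckdb.sql(
|             "SELECT pickup_zone, count(*) AS n "
|             "FROM trips GROUP BY 1"
|         ).arrow()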
| anentropic wrote:
| I am very interested in this but have some questions after a
| quick look
|
| It mentions "Serverless pipelines. Run fast, stateless Python
| functions in the cloud." on the home page... but it took me a
| while of clicking around looking for exactly what the
| deployment model is
|
| e.g. is it the cloud provider's own "serverless functions"? or
| is this a platform that maybe runs on k8s and provides its own
| serverless compute resources?
|
| Under examples I found
| https://docs.bauplanlabs.com/en/latest/examples/data_product...
| which shows running a cli command `serverless deploy` to deploy
| an AWS Lambda
|
| for me deploying to a regular Lambda func is a plus, but this
| example raises more questions...
|
| https://docs.bauplanlabs.com/en/latest/commands_cheatsheet.h...
| doesn't show any 'serverless' or 'deploy' command... presumably
| the example is using an external tool i.e. the Serverless
| framework?
|
| which is fine, great even - I can presumably use my existing
| code deployment methodology like CDK or Terraform instead
|
| Just suggesting that the underlying details could be spelled
| out a bit more up front.
|
| In the end I kind of understand it as similar to sqlmesh, but
| with a "BYO compute" approach? So where sqlmesh wants to run on
| a Data Warehouse platform that provides compute, and only
| really supports Iceberg via Trino, bauplan is focused solely on
| Iceberg and defining/providing your own compute resources?
|
| I like it
|
| Last question is re here
| https://docs.bauplanlabs.com/en/latest/tutorial/index.html
|
| > "Need credentials? Fill out this form to get started"
|
| Should I understand therefore that this is only usable with an
| account from bauplanlabs.com ?
|
| What does that provide? There's no pricing mentioned so far -
| what is the model?
| zenlikethat wrote:
| > or is this a platform that maybe runs on k8s and provides
| its own serverless compute resources?
|
| This one, although it's a custom orchestration system, not
| Kubernetes (there are some similarities, but our system is
| really optimized for data workloads).
|
| We manage Iceberg for easy data versioning, take care of data
| caching and Python modules, etc., and you just write some
| Python and SQL and execute it over your data catalog without
| having to worry about Docker and all the infra stuff.
|
| I wrote a bit on what the efficient SQL half takes care of
| for you here: https://www.bauplanlabs.com/blog/blending-
| duckdb-and-iceberg...
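|
| In SDK terms the SQL side can be as small as this (the query
| method and its arguments are my shorthand, and the table name
| is made up - the docs have the exact API):
|
|     import bauplan
|
|     client = bauplan.Client()
|
|     # the query is planned over the Iceberg catalog and runs
|     # remotely; an Arrow table comes back to the client
|     stats = client.query(
|         "SELECT COUNT(*) AS trips, AVG(fare_amount) AS fare "
|         "FROM taxi_trips WHERE fare_amount > 0",
|         ref="main",
|     )
|     print(stats.to_pandas())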
|
| > In the end I kind of understand it as similar to sqlmesh,
| but with a "BYO compute" approach? So where sqlmesh wants to
| run on a Data Warehouse platform that provides compute, and
| only really supports Iceberg via Trino, bauplan is focused
| solely on Iceberg and defining/providing your own compute
| resources?
|
| Philosophically, yes. In practice so far we manage the
| machines in separate AWS accounts _for_ the customers, in a
| sort of hybrid approach, but the idea is not dissimilar.
|
| > Should I understand therefore that this is only usable with
| an account from bauplanlabs.com ?
|
| Yep. We'd help you get started and use our demo team. Send
| jacopo.tagliabue@bauplanlabs.com an email.
|
| RE: pricing. Good question. It's bespoke at the moment (early
| startup stage). Contact your friendly neighborhood Bauplan
| founder to learn more :)
| esafak wrote:
| It is a service, not an open source tool, as far as I can tell.
| Do you intend to stay that way? What is the business model and
| pricing?
|
| I am a bit concerned that you want users to swap out both their
| storage and workflow orchestrator. It's hard enough to convince
| users to drop one.
|
| How does it compare to DuckDB or Polars for medium data?
| zenlikethat wrote:
| Yep, staying service.
|
| RE: workflow orchestrators. You can use the Bauplan SDK to
| query, launch jobs, and get results from within your existing
| platform; we don't want to replace it entirely if that doesn't
| fit for you, just to augment it.
|
| RE: DuckDB and Polars. It literally uses DuckDB under the
| hood, but with two huge upgrades: one, we plug into your data
| catalog for really efficient scanning even on massive data
| lakehouses, before it hits the DuckDB step. Two, we do
| efficient data caching: query results and intermediate scans
| can be reused across runs.
|
| More details here: https://www.bauplanlabs.com/blog/blending-
| duckdb-and-iceberg...
|
| As for Polars, you can easily use Polars itself within your
| Python models by specifying it in a pip decorator. We install
| all the packages you request for your Python models.
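|
| A minimal sketch of what that could look like (again, decorator
| names are approximate and the table is made up):
|
|     import bauplan
|
|     @bauplan.model()
|     @bauplan.python("3.11", pip={"polars": "1.5.0"})
|     def top_zones(trips=bauplan.Model("taxi_trips")):
|         # polars is installed just for this step; the input is
|         # assumed to arrive as an Arrow table
|         import polars as pl
|         df = pl.from_arrow(trips)
|         out = (
|             df.group_by("pickup_zone")
|             .agg(pl.len().alias("n"))
|             .sort("n", descending=True)
|             .head(10)
|         )
|         return out.to_arrow()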
| barabbababoon wrote:
| - Yes, it is a service, and at least the runner will stay like
| that for the time being.
|
| - We are not quite live yet, but the pricing model is based
| on compute capacity and is divided into tiers (e.g. small =
| 50GB of concurrent scan capacity at $1500/month; large can go
| up to a TB), with unlimited queries, jobs and users. The idea
| is to have very clear pricing with no sudden increases due to
| volume.
|
| - You do not have to swap your storage - our runner comes to
| your S3 bucket, and your data never has to live anywhere
| other than your S3.
|
| - You do not have to swap your orchestrator either. Most of
| our clients actually use it with their existing orchestrator:
| you call the platform's APIs, including run, from your
| Airflow/Prefect/Temporal tasks (rough sketch below).
| https://www.prefect.io/blog/prefect-on-the-lakehouse-
| write-a...
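|
| For instance, a Prefect flow kicking off a bauplan run could
| look roughly like this (the run call and its arguments are my
| approximation of the SDK; the Prefect part is standard):
|
|     import bauplan
|     from prefect import flow, task
|
|     @task(retries=2)
|     def run_pipeline(branch: str):
|         client = bauplan.Client()
|         # trigger the bauplan project on the given data branch
|         # (run/project_dir/ref are assumptions, not the spec)
|         return client.run(project_dir="./etl", ref=branch)
|
|     @flow
|     def nightly(branch: str = "main"):
|         run_pipeline(branch)
|
|     if __name__ == "__main__":
|         nightly()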
|
| Does that help?
| rustyconover wrote:
| I'd love to see a 10 minute YouTube video of the capabilities of
| this product.
| mehdmldj wrote:
| Not quite 10 minutes, but here is what you're looking for:
| https://www.youtube.com/watch?v=Di2AkSmitTc
___________________________________________________________________
(page generated 2025-04-16 17:00 UTC)