[HN Gopher] Bauplan - Git-for-data pipelines on object storage
       ___________________________________________________________________
        
       Bauplan - Git-for-data pipelines on object storage
        
       Author : barabbababoon
       Score  : 34 points
       Date   : 2025-04-16 14:25 UTC (2 hours ago)
        
 (HTM) web link (docs.bauplanlabs.com)
 (TXT) w3m dump (docs.bauplanlabs.com)
        
       | jtagliabuetooso wrote:
        | Looking to get feedback on a code-first platform for data:
        | instead of custom frameworks, GUIs, or notebooks on a cron, bauplan
       | runs SQL / Python functions from your IDE, in the cloud, backed
       | by your object storage. Everything is versioned and composable:
       | time-travel, git-like branches, scriptable meta-logic.
       | 
       | Perhaps surprisingly, we decided to co-design the abstractions
       | and the runtime, which allowed novel optimizations at the
       | intersection of FaaS and data - e.g. rebuilding functions can be
       | 15x faster than the corresponding AWS stack
       | (https://arxiv.org/pdf/2410.17465). All capabilities are
       | available to humans (CLI) and machines (SDK) through simple APIs.
       | 
       | Would love to hear the community's thoughts on moving data
       | engineering workflows closer to software abstractions: tables,
       | functions, branches, CI/CD etc.
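        | 
        | For a quick taste of the SDK side, a simplified sketch (table
        | and branch names are made up, and the exact method signatures
        | may differ a bit from the current docs):
        | 
        |     import bauplan
        | 
        |     # the client talks to the platform; credentials come from
        |     # your local bauplan profile
        |     client = bauplan.Client()
        | 
        |     # create an isolated, git-like branch of the data catalog
        |     client.create_branch("jacopo.experiment", from_ref="main")
        | 
        |     # run SQL against the branch; results come back as an
        |     # Arrow table
        |     rows = client.query(
        |         "SELECT passenger_count, COUNT(*) AS trips "
        |         "FROM taxi_rides GROUP BY 1",
        |         ref="jacopo.experiment",
        |     )
        |     print(rows.to_pandas().head())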
        
         | dijksterhuis wrote:
         | the big question i have is -- where is the code executed? "the
         | cloud"? who's cloud? my cloud? your environment on AWS?
         | 
         | the paper briefly mentions "bring your own cloud" in 4.5 but
         | the docs page doesn't seem to have any information on doing
         | that (or at least none that i can find).
        
           | zenlikethat wrote:
           | The code you execute on your data currently runs in a per-
           | customer AWS account managed by us. We leave the door open
           | for BYOC based on the architecture we've designed, but due to
           | lean startup life, that's not an option yet. We'd definitely
           | be down to chat about it
        
         | korijn wrote:
         | How does this compare to dbt? Seems like it can do the same?
        
           | laminarflow027 wrote:
           | To me they seem like the pythonic version of dbt! Instead of
           | yaml, you write Python code. That, and a lot of on-the-fly
           | computations to generate an optimized workflow plan.
        
             | barabbababoon wrote:
             | Plenty of stuff in common with dbt's philosophy. One big
             | thing though, dbt does not run your compute or manage your
              | lake. It orchestrates your code and pushes it down to a
             | runtime (e.g. 90% of the time Snowflake).
             | 
             | This IS a runtime.
             | 
              | You import bauplan, write your functions, and run them
              | straight in the cloud - you don't need anything more.
             | When you want to make a pipeline you chain the functions
             | together, and the system manages the dependencies, the
             | containerization, the runtime, and gives you a git-like
             | abstractions over runs, tables and pipelines.
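              | 
              | A minimal sketch of what chaining looks like (illustrative
              | only: table and column names are made up, and decorator
              | arguments may not match the current SDK exactly):
              | 
              |     import bauplan
              |     import pyarrow.compute as pc
              | 
              |     @bauplan.model()
              |     def clean_trips(trips=bauplan.Model("raw_trips")):
              |         # the parent table arrives as an Arrow table;
              |         # keep only rows with a valid fare
              |         mask = pc.greater(trips["fare_amount"], 0)
              |         return trips.filter(mask)
              | 
              |     @bauplan.model()
              |     def daily_revenue(trips=bauplan.Model("clean_trips")):
              |         # referencing clean_trips makes it a dependency:
              |         # the DAG is inferred from function signatures
              |         return trips.group_by("pickup_date").aggregate(
              |             [("fare_amount", "sum")]
              |         )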
        
           | zenlikethat wrote:
           | Some similarities, but Bauplan offers:
           | 
           | 1. Great Python support. Piping something from a structured
           | data catalog into Python is trivial, and so is persisting
           | results. With materialization, you never need to recompute
           | something in Python twice if you don't want to -- you can
           | store it in your data catalog forever.
           | 
            | Also, you can request any Python package you want, and
            | even have different Python versions and packages in
            | different workflow steps (see the sketch at the end of
            | this comment).
           | 
           | 2. Catalog integration. Safely make changes and run
           | experiments in branches.
           | 
           | 3. Efficient caching and data re-use. We do a ton of tricks
            | behind the scenes to avoid recomputing or rescanning things
           | that have already been done, and pass data between steps with
           | Arrow zero copy tables. This means your DAGs run a lot faster
           | because the amount of time spent shuffling bytes around is
           | minimal.
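            | 
            | On point 1, per-step environments look roughly like this
            | (sketch only; table names are invented and the decorator
            | arguments may differ slightly from the docs):
            | 
            |     import bauplan
            | 
            |     @bauplan.model()
            |     @bauplan.python("3.10", pip={"pandas": "1.5.3"})
            |     def cleaned_events(events=bauplan.Model("raw_events")):
            |         # this step gets its own interpreter and packages
            |         return events.to_pandas().dropna()
            | 
            |     @bauplan.model()
            |     @bauplan.python("3.12", pip={"pandas": "2.2.2"})
            |     def daily_counts(events=bauplan.Model("cleaned_events")):
            |         # a different Python and pandas here, no conflicts
            |         df = events.to_pandas()
            |         return df.groupby("event_date", as_index=False).size()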
        
         | anentropic wrote:
         | I am very interested in this but have some questions after a
         | quick look
         | 
         | It mentions "Serverless pipelines. Run fast, stateless Python
         | functions in the cloud." on the home page... but it took me a
         | while of clicking around looking for exactly what the
         | deployment model is
         | 
         | e.g. is it the cloud provider's own "serverless functions"? or
         | is this a platform that maybe runs on k8s and provides its own
         | serverless compute resources?
         | 
         | Under examples I found
         | https://docs.bauplanlabs.com/en/latest/examples/data_product...
         | which shows running a cli command `serverless deploy` to deploy
         | an AWS Lambda
         | 
          | for me deploying to a regular Lambda func is a plus, but this
         | example raises more questions...
         | 
         | https://docs.bauplanlabs.com/en/latest/commands_cheatsheet.h...
         | doesn't show any 'serverless' or 'deploy' command... presumably
         | the example is using an external tool i.e. the Serverless
         | framework?
         | 
         | which is fine, great even - I can presumably use my existing
         | code deployment methodology like CDK or Terraform instead
         | 
         | Just suggesting that the underlying details could be spelled
         | out a bit more up front.
         | 
         | In the end I kind of understand it as similar to sqlmesh, but
         | with a "BYO compute" approach? So where sqlmesh wants to run on
         | a Data Warehouse platform that provides compute, and only
         | really supports Iceberg via Trino, bauplan is focused solely on
         | Iceberg and defining/providing your own compute resources?
         | 
         | I like it
         | 
         | Last question is re here
         | https://docs.bauplanlabs.com/en/latest/tutorial/index.html
         | 
         | > "Need credentials? Fill out this form to get started"
         | 
         | Should I understand therefore that this is only usable with an
         | account from bauplanlabs.com ?
         | 
         | What does that provide? There's no pricing mentioned so far -
         | what is the model?
        
           | zenlikethat wrote:
           | > or is this a platform that maybe runs on k8s and provides
           | its own serverless compute resources?
           | 
           | This one, although it's a custom orchestration system, not
           | Kubernetes. (there are some similarities but our system is
           | really optimized for data workloads)
           | 
           | We manage Iceberg for easy data versioning, take care of data
           | caching and Python modules, etc., and you just write some
           | Python and SQL and exec it over your data catalog without
            | having to worry about Docker and all the infra stuff.
           | 
           | I wrote a bit on what the efficient SQL half takes care of
           | for you here: https://www.bauplanlabs.com/blog/blending-
           | duckdb-and-iceberg...
           | 
           | > In the end I kind of understand it as similar to sqlmesh,
           | but with a "BYO compute" approach? So where sqlmesh wants to
           | run on a Data Warehouse platform that provides compute, and
           | only really supports Iceberg via Trino, bauplan is focused
           | solely on Iceberg and defining/providing your own compute
           | resources?
           | 
           | Philosophically, yes. In practice so far we manage the
           | machines in separate AWS accounts _for_ the customers, in a
           | sort of hybrid approach, but the idea is not dissimilar.
           | 
           | > Should I understand therefore that this is only usable with
           | an account from bauplanlabs.com ?
           | 
           | Yep. We'd help you get started and use our demo team. Send
           | jacopo.tagliabue@bauplanlabs.com an email
           | 
           | RE: pricing. Good question. Early startup stage bespoke at
           | the moment. Contact your friendly neighborhood Bauplan
           | founder to learn more :)
        
         | esafak wrote:
         | It is a service, not an open source tool, as far as I can tell.
         | Do you intend to stay that way? What is the business model and
         | pricing?
         | 
         | I am a bit concerned that you want users to swap out both their
         | storage and workflow orchestrator. It's hard enough to convince
         | users to drop one.
         | 
         | How does it compare to DuckDB or Polars for medium data?
        
           | zenlikethat wrote:
           | Yep, staying service.
           | 
           | RE: workflow orchestrators. You can use the Bauplan SDK to
           | query, launch jobs and get results from within your existing
            | platform; we don't want to replace it entirely if it doesn't
            | fit for you, just to augment it.
           | 
           | RE: DuckDB and Polars. It literally uses DuckDB under the
           | hood but with two huge upgrades: one, we plug into your data
           | catalog for really efficient scanning even on massive data
           | lake houses, before it hits the DuckDB step. Two, we do
           | efficient data caching. Query results and intermediate scans
           | and stuff can be reused across runs.
           | 
           | More details here: https://www.bauplanlabs.com/blog/blending-
           | duckdb-and-iceberg...
           | 
           | As for Polars, you can use Polars itself within your Python
           | models easily by specifying it in a pip decorator. We install
           | all requested packages within Python modules.
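            | 
            | Something along these lines (sketch; names are made up and
            | the decorator arguments may vary):
            | 
            |     import bauplan
            | 
            |     @bauplan.model()
            |     @bauplan.python("3.11", pip={"polars": "1.5.0"})
            |     def trips_dedup(trips=bauplan.Model("raw_trips")):
            |         # polars is installed for this step only, via the
            |         # pip decorator; Arrow in, Arrow out
            |         import polars as pl
            |         return pl.from_arrow(trips).unique(
            |             subset=["trip_id"]
            |         ).to_arrow()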
        
           | barabbababoon wrote:
            | - Yes, it is a service and at least the runner will stay like
           | that for the time being.
           | 
           | - We are not quite live yet, but the pricing model is based
            | on compute capacity and it is divided into tiers (e.g.
            | small = 50GB for concurrent scans = $1500/month; large can get
            | up to a TB). Infinite queries, infinite jobs, infinite users.
            | The idea is to have very clear pricing with no sudden increases
           | due to volume.
           | 
           | - You do not have to swap your storage - our runner comes to
            | your S3 bucket and your data never ever has to be anywhere
            | other than your S3.
           | 
           | - You do not have to swap your orchestrator either. Most of
           | our clients are actually using it with their existing
            | orchestrator. You call the platform's APIs, including runs,
            | from your Airflow/Prefect/Temporal tasks:
           | https://www.prefect.io/blog/prefect-on-the-lakehouse-
           | write-a...
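            | 
            | For instance, from a Prefect flow it is roughly (sketch;
            | exact SDK call and argument names may differ):
            | 
            |     import bauplan
            |     from prefect import flow, task
            | 
            |     @task
            |     def run_bauplan_pipeline():
            |         client = bauplan.Client()
            |         # kick off the pipeline on a branch and wait for it
            |         state = client.run(project_dir="./my_pipeline",
            |                            ref="main")
            |         return state.job_status
            | 
            |     @flow
            |     def nightly():
            |         run_bauplan_pipeline()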
           | 
           | Does it help?
        
       | rustyconover wrote:
       | I'd love to see a 10 minute YouTube video of the capabilities of
       | this product.
        
         | mehdmldj wrote:
          | Not really 10 minutes, but here is what you're looking for:
         | https://www.youtube.com/watch?v=Di2AkSmitTc
        
       ___________________________________________________________________
       (page generated 2025-04-16 17:00 UTC)