Launch HN: BuildFlow (YC W23) - The FastAPI of data pipelines
Hey HN! We're Caleb and Josh, the founders of BuildFlow
(https://www.buildflow.dev). We provide an open-source framework
for building your entire data pipeline quickly using Python. You
can think of us as an easy alternative to Apache Beam or Google
Cloud Dataflow.

The problem we're trying to solve is simple: building data
pipelines can be a real pain. You often need to deal with complex
frameworks, manage external cloud resources, and wire everything
together into a single deployment (you're probably drowning in
YAML by this point in the dev cycle). This can be a burden on both
data scientists and engineering teams. "Data pipeline" is a broad
term, but we generally mean any kind of processing that happens
outside of the user-facing path: processing file uploads, syncing
data to a data warehouse, or ingesting data from IoT devices.

BuildFlow lets you build a data pipeline by simply attaching a
decorator to a Python function. All you need to do is describe
where your input is coming from and where your output should be
written, and BuildFlow handles the rest. No configuration outside
of the code is required. See our docs for some examples:
https://www.buildflow.dev/docs/intro.
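
Here is a rough sketch of what that looks like. The connector and
parameter names below are illustrative assumptions, not the exact
API; see the docs above for real examples:

    import buildflow

    flow = buildflow.Flow()

    # Hypothetical connector names, for illustration only.
    @flow.processor(
        source=buildflow.PubSubSource(
            subscription="projects/my-project/subscriptions/my-sub"),
        sink=buildflow.BigQuerySink(
            table="my-project.my_dataset.my_table"),
    )
    def process(element):
        # Your logic goes here; BuildFlow handles reading,
        # scaling out, and writing.
        element["processed"] = True
        return element

    flow.run()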
When you attach the decorator to your function, the BuildFlow
runtime creates your referenced cloud resources, spins up replicas
of your processor, and wires up everything needed to efficiently
scale out reads from your source and writes to your sink. This
lets you focus on writing logic as opposed to interacting with
your external dependencies.

BuildFlow aims to hide as much complexity as possible in the
sources / sinks so that your processing logic can remain simple.
The framework provides generic I/O connectors for popular cloud
services and storage systems, in addition to "use case driven" I/O
connectors that chain together multiple I/O steps required by
common use cases. An example "use case driven" source that chains
together GCS Pub/Sub notifications and fetching GCS blobs can be
seen here:
https://www.buildflow.dev/docs/io-connectors/gcs_notificatio...
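
Continuing the sketch above, using such a source might look
roughly like this (again, the names are illustrative assumptions,
not the real API):

    # Hypothetical "use case driven" source: subscribes to GCS
    # upload notifications and fetches each referenced blob.
    @flow.processor(
        source=buildflow.GCSFileStream(bucket="my-bucket"),
        sink=buildflow.BigQuerySink(
            table="my-project.my_dataset.uploads"),
    )
    def on_upload(file):
        # `file` already carries the blob contents; no manual
        # notification handling or fetch is needed.
        return {"name": file.name, "size": len(file.contents)}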
BuildFlow was inspired by our time at Verily (Google Life
Sciences), where we designed an internal platform to help data
scientists build and deploy ML infra / data pipelines using Apache
Beam. Using a complex framework was a burden on our data science
team, because they had to learn a whole new paradigm for writing
their Python code, and our engineering team was left with the
operational load of helping folks learn Apache Beam while also
managing / deploying production pipelines. From this pain,
BuildFlow was born. Our design is based on two observations from
that experience:

(1) The hardest thing to get right is I/O. Efficiently fanning out
I/O to workers, concurrently reading / processing input data,
catching schema mismatches before runtime, and configuring cloud
resources is where most of the pain is. BuildFlow attempts to
abstract away all of these bits.

(2) Most use cases are large scale but not (overly) complex.
Existing frameworks give you scalability and a complicated
programming model that supports every use case under the sun.
BuildFlow provides the same scalability but focuses on common use
cases so that the API can remain lightweight and easy to use.
BuildFlow is open source, but we offer a managed cloud service
that lets you easily deploy your pipelines. We provide a CLI that
deploys your pipeline to a managed Kubernetes cluster, and you can
optionally opt in to letting us manage your resources / Terraform
as well. Ultimately this will feed into our VS Code extension,
which will let users visually build their data pipelines directly
from VS Code (see https://launchflow.com for a preview). The
extension will be free to use and will come packaged with a bunch
of nice-to-haves (code generation, fuzzing, tracing, and arcade
games (yep!), just to name a few in the works). Our managed
offering is still in private beta, but we're hoping to release our
CLI in the next couple of weeks. Pricing for this service is still
being ironed out, but we expect it to be based on usage.
We'd love for you to try BuildFlow and hear your feedback. You can
get started right away by installing the Python package: pip
install buildflow. Check out our docs
(https://buildflow.dev/docs/intro) and GitHub
(https://github.com/launchflow/buildflow) for examples of how to
use the API. This project is very new, so we'd love to gather some
specific feedback from you, the community:

1. How do you feel about a framework managing your cloud
resources? We're considering adding a module that would let
BuildFlow create / manage your Terraform for you (Terraform state
would be dumped to disk).

2. What are some common I/O operations you find yourself
rewriting?

3. What are some operational tasks that require you to leave your
code editor? We'd like to bring as many of those tasks as possible
into BuildFlow and our VS Code extension so you can avoid context
switches.
Author : calebtv
Score : 67 points
Date : 2023-03-15 14:55 UTC (8 hours ago)
| lysecret wrote:
| Congrats! I had something quite similar in mind (I'm also
| working a lot on Python-based streaming ETL). I am unfamiliar
| with Ray. A few questions:
|
| 1. Ray seems to be focused on ML use cases; are you as well, or
| are you a more generic streaming ETL framework? Can you explain
| the reasoning behind choosing Ray?
|
| 2. I see you deeply integrate infrastructure (like BQ and
| Pub/Sub). What is your story on the evolution of this infra?
| What happens if I have deployed infra through your code and I
| want to edit it? How do you deal with the Dev/QA/Prod stage
| divide?
|
| 3. What is your story on deployment of the "glue" code that
| runs your pipeline? Do you also handle multi-stage pipelines?
| calebtv wrote:
| Thanks! These are all great questions; apologies for the wall
| of text.
|
| 1. We're definitely more of a generic streaming framework. But
| I could see ML being one of those use cases as well.
|
| Why Ray? One of our main drivers was how "pythonic" Ray feels,
| and that was a core principle we wanted in our framework. Most
| of my prior experience has been working with Beam, and Beam is
| great, but it is kind of a whole new paradigm you have to
| learn. Another thing I really like about Ray is how easy it is
| to run locally on your machine and get some real processing
| power. You can easily have Ray use all of your cores and
| actually see how things scale without having to deploy to a
| cluster. I could probably go on and on, haha, but those are the
| first two that come to mind.
|
| 2. We really want to support a bunch of frameworks / resources.
| We mainly chose BQ and Pub/Sub because of our prior experience.
| We have some GitHub issues to support other resources across
| multiple clouds, and feel free to file issues if you would like
| to see support for other things! With BuildFlow we deploy the
| resources to a project you own, so you are free to edit them as
| you see fit. BuildFlow won't touch already-created resources
| beyond making sure it can access them. We don't really want to
| bake environment-specific logic into BuildFlow; I think this is
| probably best handled with command line arguments to a
| BuildFlow pipeline. But happy to hear other thoughts here!
|
| 3. I'm not sure I understand what you mean by "glue", so
| apologies if this doesn't answer your question. The BuildFlow
| code gets deployed with your pipeline, so it doesn't need to
| run remotely at all. If you were deploying this to a single VM,
| you could just execute the Python file on the VM and things
| would be running. We don't have great support for multi-stage
| pipelines at the moment. What you can do is chain together
| processors with a Pub/Sub feed, as sketched below. But we do
| really want to support chaining together processors themselves.
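|
| Roughly, chaining through a topic could look like this (the
| connector names here are illustrative assumptions, not the
| exact API):
|
|   # Stage one publishes to an intermediate Pub/Sub topic
|   # that stage two consumes from.
|   @flow.processor(
|       source=buildflow.PubSubSource(subscription="sub-a"),
|       sink=buildflow.PubSubSink(topic="intermediate-topic"))
|   def stage_one(element):
|       element["enriched"] = True
|       return element
|
|   @flow.processor(
|       source=buildflow.PubSubSource(
|           subscription="intermediate-sub"),
|       sink=buildflow.BigQuerySink(table="my_dataset.my_table"))
|   def stage_two(element):
|       return element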
| amath wrote:
| Cool, nice idea. Can you sub in a different backend, like
| Bytewax (https://github.com/bytewax/bytewax), for stateful
| processing?
| calebtv wrote:
| Thanks! Currently you can't; right now your only option is our
| Ray runner. But we have talked about supporting different
| runner options, similar to how Beam can be run on Spark,
| Dataflow, etc. Ultimately it would be nice if folks could
| implement their own runners, but I think we're still a ways out
| on that.
| calebtv wrote:
| I should also mention BuildFlow does support stateful
| processing with the Processor class API:
| https://www.buildflow.dev/docs/processors/overview#processor...
| vosper wrote:
| Would you see BuildFlow as a competitor to Dagster, Flink, or
| Spark Streaming?
|
| I'm about to build a pipeline that needs to pass thousands of
| docs a minute through a variety of enrichments (ML models,
| third-party APIs, etc.) and then dump the final enriched doc in
| ES.
|
| There are so many pipeline products, workflow engines, and
| MLOps solutions that I'm very confused about which technologies
| I should be looking at. Something looks good (Temporal), but
| then I read it's not really for large volumes of streaming
| data. Or I look at Flink, which can handle massive volumes, but
| it doesn't seem as easy to wire up as other options. I think
| Dagster looks nice but can't find any answer (even in their
| Slack) about what kind of volumes it can handle...
| TankeJosh wrote:
| You can think of BuildFlow as a lightweight alternative to
| Flink / Spark Streaming. These streaming frameworks are great
| when you want to react to events in realtime (e.g. you want to
| trigger some processing logic every time a file is uploaded to
| cloud storage). Dagster is more focused on scheduling jobs and
| might be a good fit if you have some batch jobs you want to
| trigger occasionally.
|
| BuildFlow can run a simple Pub/Sub -> light processing ->
| BigQuery pipeline at about 5-7k messages / second on a 4-core
| VM (tested on GCP's n1-standard-4 machines). For your case, you
| might be able to get away with running on a single machine with
| 4-8 cores.
|
| I'd be happy to connect outside of HN if you'd like me to dig
| into your use case more! You can reach me at
| josh@launchflow.com
|
| edit: You can also reach out on our Discord:
| https://discordapp.com/invite/wz7fjHyrCA
| 0xDEF wrote:
| How does this compare with Airflow and Dagster?
| [deleted]
| calebtv wrote:
| Good question. I would say we're more focused on being a data
| pipeline engine as opposed to a workflow orchestrator, so you
| could use something like Airflow or Dagster to trigger your
| BuildFlow pipeline.
| Kalanos wrote:
| By not addressing them and Prefect in your initial post, it's
| a bit of a hit to credibility.
| TankeJosh wrote:
| We chose not to reference them because we are mainly
| focused on streaming use cases, which don't fit well into
| the Prefect & Dagster models.
| brap wrote:
| Congrats!
|
| Just out of curiosity: it seems like the process function you
| define has to run remotely on workers. How does it get
| serialized? Are there limitations on the process function due
| to serialization?
| calebtv wrote:
| Thanks! The process function runs as a Ray Actor
| (https://docs.ray.io/en/latest/ray-core/actors.html), so we
| have the same serialization requirements as Ray
| (https://docs.ray.io/en/latest/ray-core/objects/serialization...)
|
| I think the most common limitation will be ensuring that your
| output is serializable. Typically returning Python dictionaries
| or dataclasses is fine.
|
| But if you have a specific limitation in mind, let me know;
| happy to dive into it!
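|
| For example, a dataclass output serializes cleanly across Ray
| workers (a minimal sketch; the field names are made up):
|
|   from dataclasses import dataclass
|
|   @dataclass
|   class Output:
|       user_id: str
|       score: float
|
|   def process(element):
|       # Plain dicts and dataclasses are easy for Ray to
|       # serialize between workers.
|       return Output(user_id=element["id"],
|                     score=element["value"])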
| calebtv wrote:
| One other thing I should mention that's relevant: we also have
| a class abstraction instead of a decorator:
| https://github.com/launchflow/buildflow/blob/main/buildflow/...
|
| This can help with things like setting up RPC clients. But it
| all boils down to the same runner, whether you're using the
| class or the decorator.
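|
| The shape is roughly this (a sketch with assumed method names;
| see the linked file for the real interface):
|
|   class MyProcessor(buildflow.Processor):
|       def setup(self):
|           # Runs once per replica: a good place to create RPC
|           # clients, load models, or open connections.
|           self.client = make_rpc_client()  # hypothetical helper
|
|       def process(self, element):
|           return self.client.call(element)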
| fhenrywells wrote:
| Do you see this as a direct competitor to Ray's built-in
| workflow abstraction?
| https://docs.ray.io/en/latest/workflows/management.html
|
| Exciting to see more libraries built on Ray in any case!
| calebtv wrote:
| Great question! We actually looked at using the workflow
| abstraction for batch processing in our runner, but ultimately
| didn't because it was still in alpha (we use the Dataset API
| for batch flows).
|
| I think one area where we differ is our focus on streaming
| processing, which I don't think is well supported by the
| workflow abstraction, and also having more resource management
| / use-case-driven I/O.
| Kalanos wrote:
| Will you support notebook execution and Docker containers?
| TankeJosh wrote:
| Notebook execution should already work! The flow.run(...) call
| returns the output collection with this use case in mind. We're
| currently working on a Docker manager module that will let
| users easily dockerize / run their pipeline locally. The system
| (REPL) debugger tool in our VS Code extension will manage all
| of the Docker bits for the user.
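|
| In a notebook, that looks roughly like this (a sketch: only
| flow.run() is confirmed above, the rest is illustrative):
|
|   flow = buildflow.Flow()
|   # ... attach processors to the flow ...
|   output = flow.run()  # returns the output collection
|   output  # inspect the results inline in the notebook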
| Orangeair wrote:
| I think your site could use some copy editing. I was confused
| by the nested schema example [1]; I don't even see NestedScema
| referenced after it's defined. Maybe the float field should
| have used that type? Also noticed an instance of "BigQuer" on
| that page (not a particularly egregious typo, but I was on your
| site for all of thirty seconds).
|
| [1] https://www.buildflow.dev/docs/schema-validation#examples
| TankeJosh wrote:
| Thanks for the catch! We just pushed a fix.
| faizshah wrote:
| Should we think of BuildFlow as an alternative to workflow
| managers like Prefect or Kubeflow, or is it a higher-level
| library for stream processing like Beam?
| calebtv wrote:
| More of a higher-level library like Beam, and I could see it
| being plugged into a Prefect workflow.
| jcnnghm wrote:
| Is there an underlying stream processor (e.g. Flink)? How
| many messages per second can it process?
| calebtv wrote:
| All of our processing is done via Ray
| (https://www.ray.io/). Our early benchmarks show about 5k
| messages per second on a single 4-core VM, but we believe
| we can increase this with some more optimizations.
|
| This benchmark was consuming a Google Cloud Pub/Sub stream
| and outputting to BigQuery.
| faizshah wrote:
| I see. What fault tolerance mechanisms does it provide?
|
| I don't see anything on snapshotting or checkpointing like
| Flink. Is this just for stateless jobs?
| calebtv wrote:
| We don't support any snapshotting or checkpointing directly
| in BuildFlow at the moment, but these are great features we
| should support.
|
| We do have some fault tolerance baked into our I/O
| operations, though. Specifically, for Google Cloud Pub/Sub
| the acks don't happen until the data has been successfully
| processed and written to the sink, so if there is a bug or
| some transient failure, the message will be redelivered
| later depending on your subscriber configuration.
| calebtv wrote:
| I should also mention BuildFlow does support stateful
| processing with the Processor class API:
| https://www.buildflow.dev/docs/processors/overview#processor...
___________________________________________________________________