[HN Gopher] Launch HN: Serra (YC S23) - Open-core, Python-based ...
___________________________________________________________________
Launch HN: Serra (YC S23) - Open-core, Python-based dbt alternative
Hey HN! Alan and Albert here, cofounders of Serra. Serra is end-to-
end dbt--we make building reliable, scalable ELT/ETL easy by
replacing brittle SQL scripts with object-oriented Python. It's
open core: https://github.com/Serra-Technologies/serra, and our
docs are here: https://docs.serra.io/documentation/.

I stumbled
into this idea as a data engineer for Disney+'s subscriptions team.
We were "firefighters for data," ready to debug huge pipelines that
always crashed and burned. The worst part of my job at Disney+ was
the graveyard on-call rotations, where pages from 12am to 5am were
guaranteed, and you'd have to dig through thousands of lines of
someone else's SQL. SQL is long-winded--1000 lines of SQL can often
be summarized with 10 key transforms. We take this SQL and
summarize those transforms with reusable, testable, scalable Spark
objects.

Serra is written in PySpark and modularizes every
component of ETL through Spark objects. Similar to dbt, we apply
software engineering best practices to data, but we aim to do it
not just with transformations, but with data connectors as well. We
accomplish this with a YAML configuration file--the idea is that if we
have a pipeline with the aforementioned 1000-line SQL script that uses
third-party connectors, we can summarize all of it into a 12-block
config file that gives an easy high-level overview and debugging
capabilities--10 blocks for the transforms and 2 for the in-house
connectors. Then, we can add tests and custom alerts to each of
these objects/blocks so that we know where exactly the pipeline
breaks and why.

We are open-source to make it easy to customize
Serra to whatever flavor you like with custom
transformers/connectors. The connectors we support out of the box are
Snowflake, AWS, BigQuery, and Databricks, and we are adding more based
on feedback. The transforms we support include mapping, pivoting,
joining, truncating, imputing, and more.

We're doing our best to
make Serra as easy to use as possible. If you have docker
installed, you can run the docker command below to instantly get set up
with a Serra environment to create modular pipelines. We wrap up
our functionality with a command line tool that lets you create
your ETL pipelines, test them locally with a subset of your data,
and deploy them to the cloud (currently we only support Databricks,
but will soon support others and plan to host our own clusters
too). It also has an experimental "translate" feature which is
still a bit finicky, but the idea is to take your existing SQL
script and get suggestions on how you can chunk up and modularize
your job with our config. It's still just a super early suggestion
feature that is definitely not fleshed out, but we think it's a
cool approach. Here's a quick demo going through retooling a long-
winded SQL script to an easily maintainable, scalable ETL job:
https://www.loom.com/share/acc633c0ec03455e9e8837f5c3db3165?....
(docker command: docker run --mount
type=bind,source="$(pwd)",target=/app -it serraio/serra /bin/bash)
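For illustration, a config of the kind described above might look roughly like this. The block names and keys here are hypothetical, not Serra's actual schema (see the docs for the real format):

```yaml
# Hypothetical 4-block Serra-style job: 2 connector blocks + 2
# transform blocks. Names and options are illustrative only.
job: subscriptions_daily

steps:
  - AmazonS3Reader:          # connector block (extract)
      bucket: raw-events
      path: subscriptions/2023-08-14/
  - MapTransformer:          # transform block
      input_column: state_name
      output_column: state_abbr
      mapping_path: state_abbreviations.csv
  - JoinTransformer:         # transform block
      right: dim_plans
      on: plan_id
      how: left
  - SnowflakeWriter:         # connector block (load)
      database: analytics
      table: subscriptions_daily
```

Each block maps to a Python object (a reader, transformer, or writer), which is what lets tests and alerts attach to individual blocks rather than to one monolithic SQL script.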

We don't see or store any of your data--we're a transit layer that
helps you write ETL jobs that you can send to your warehouse of
choice with your actual data.

Right now we are helping customers
retool their messy data pipelines and plan to monetize by hosting
Serra on the cloud, charging if you run the job on our own
clusters, and per API call on our translate feature (once it's
mature).

We're super excited to launch this to Hacker News. We'd
love to hear what you think. Thanks in advance!
Author : Alanhlwang
Score : 89 points
Date : 2023-08-14 10:13 UTC (12 hours ago)
| sails wrote:
| I like that you improve on the underlying database error
| messages, as they are really unhelpful, and I think this is a
| great place to add value.
|
| I've been keeping track of a few dbt alternatives. dbt has
| opened up the market to this use case, while only partially
| solving the business model and maturity side. Here are the more
| interesting ones:
|
| sdf.com (ex Meta team)
|
| sqlmesh.com (relatively new)
|
| paradime.io (more an IDE)
|
| cloud.google.com/dataform (GCP only)
| Alanhlwang wrote:
| These are great links, we'll take a look
| boxed wrote:
| If the selling point is "replacing brittle SQL scripts with
| object-oriented Python" you should have at least one example of
| this code in the README!
| Alanhlwang wrote:
| Definitely a great point, adding one today
| geonnave wrote:
| For those also wondering what ETL and dbt are:
|
| > ETL (Extract, Transform, Load) is a process that involves
| extracting data from various sources, transforming it to fit
| operational needs, and loading it into a database for analysis.
| dbt (data build tool) is an open-source software tool that
| enables data analysts and engineers to transform and model data
| in the data warehouse, streamlining the transformation part of
| the ETL process.
| zxwrt wrote:
| [flagged]
| dang wrote:
| That was my mistake, not the founders', and I've changed the
| term to source available (edit: now open core) to try to avoid
| misunderstanding. I assure you there was no attempt at fraud!
| zxwrt wrote:
| Thanks! Do not make this mistake again!
| dang wrote:
| I will try! But it is not so easy, because there's no
| consensus on what these terms mean.
| adeelk93 wrote:
| If you're interested in doing the reverse of this (replacing
| pyspark with sql) - Sqlglot can do this:
| https://sqlglot.com/sqlglot/dataframe.html
| sidcool wrote:
| Congrats on launching.
| whoknowswhat11 wrote:
| A quick note that the "open source" license they use requires
| that activation and license-key checks that block feature
| activation in the "open source" software be preserved.
|
| This license was popular w/ folks like Apollo, who arguably
| hijacked nearly 700 contributors' efforts w/ a license like
| this. Because they are using it from the start, at least that
| won't be as bad.
| folkrav wrote:
| > This license was popular w folks like apollo who arguably
| hijacked nearly 700 contributors efforts w a license like this.
| Because they are using it from start at least that won't be as
| bad
|
| Which Apollo are you talking about? Only one I know is Apollo
| GraphQL, and their main server package seems to be MIT, so I
| must be looking at the wrong thing. What's the story?
| SlickStef11 wrote:
| Apollo GraphQL is not MIT. Their Gateway, federation
| libraries, and all versions of router are under the Elastic
| License v2.
|
| https://www.apollographql.com/docs/resources/elastic-
| license...
| folkrav wrote:
| As I mentioned, their main GraphQL server package is[1], so
| that's where the confusion came from. Thanks.
|
| [1] https://github.com/apollographql/apollo-
| server/blob/9817bc47...
| iamjk wrote:
| Finally a competitor to dbt. The world needs this!
| mufty wrote:
| Exciting project, definitely taking a look at this.
| Alanhlwang wrote:
| Thank you we appreciate the support!
| sitkack wrote:
| > easy by replacing brittle SQL scripts with object-oriented
| Python
|
| There is a lot to unpack here. Can you explain this in more
| detail?
| albertstanley wrote:
| Sure, our approach is to define Python classes to handle
| reusable steps for reading, transforming or loading data. For
| example, we have a MapTransformer, CastColumnsTransformer,
| GeoDistanceTransformer.
|
| Each class specifies some configuration needed for the "step"
| and can then be used in the config file to construct a full ETL
| job. You can write unit tests for custom transformers you
| create as we have shown in the tests/ directory.
|
| I have also updated the README in our repo to hopefully provide
| a better explanation of how our config file connects to
| specific Python objects.
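The pattern being described--each pipeline step as a small, unit-testable class--can be sketched in plain Python. This is a minimal stdlib-only sketch where a list of dicts stands in for a Spark DataFrame; Serra's actual base classes and method names may differ:

```python
# Sketch of the "reusable transformer object" pattern. A list of
# dicts stands in for a Spark DataFrame here; Serra's real classes
# operate on PySpark DataFrames and may use different names.

class MapTransformer:
    """Map values in one column to a new column via a lookup dict."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Add output_col to every row, looking up the input value.
        return [
            {**row, self.output_col: self.mapping.get(row[self.input_col])}
            for row in rows
        ]


# Because each step is an object, it can be unit-tested in isolation:
def test_map_transformer():
    t = MapTransformer("state", "abbr", {"California": "CA"})
    out = t.transform([{"state": "California"}])
    assert out == [{"state": "California", "abbr": "CA"}]


test_map_transformer()
print("ok")
```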
| mrwnmonm wrote:
| > Serra is a low-code, object-oriented ETL framework that allows
| developers to write PySpark jobs easily--think end-to-end dbt
| with the benefits of object-oriented Spark.
|
| Could you please explain this as if I am three years old? (also,
| I don't know dbt)
| albertstanley wrote:
| Sure, I'll clarify some of the terms used in that one-liner in
| case it's helpful for anyone else as well.
|
| ETL is the process of extracting, transforming, and loading data
| from a source to a destination in a data pipeline. Spark, an
| engine for large scale data processing, allows us to write code
| that can work with large amounts of data. dbt is a tool you can
| use to break up your SQL scripts into smaller "models" - other
| SQL scripts that can be reused and tested.
|
| We describe ourselves as end-to-end because we also have
| extractors and loaders, whereas dbt focuses on the T (the
| transformation step of ETL). Each of our steps involved in
| extraction, transformation and loading correspond to a specific
| Python object defined in our Python framework. I have also
| updated the README in our repo to hopefully better explain how
| the config file links to user defined readers, writers, and
| transformers.
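Concretely, the config-to-object wiring described here can be sketched like this. The step names, registry, and interfaces below are illustrative assumptions, not Serra's actual API:

```python
# Sketch: building a pipeline by mapping config block names to
# Python classes. Names/interfaces are hypothetical, not Serra's API.

class UppercaseTransformer:
    def __init__(self, column):
        self.column = column

    def run(self, rows):
        return [{**r, self.column: r[self.column].upper()} for r in rows]


class FilterTransformer:
    def __init__(self, column, equals):
        self.column, self.equals = column, equals

    def run(self, rows):
        return [r for r in rows if r[self.column] == self.equals]


# The framework keeps a registry of known step classes...
REGISTRY = {
    "UppercaseTransformer": UppercaseTransformer,
    "FilterTransformer": FilterTransformer,
}

def build_pipeline(config):
    """Turn a list of {step_name: kwargs} blocks into step objects."""
    return [REGISTRY[name](**kwargs)
            for block in config
            for name, kwargs in block.items()]

# ...so a parsed YAML config (here just a Python list) becomes objects:
config = [
    {"UppercaseTransformer": {"column": "plan"}},
    {"FilterTransformer": {"column": "plan", "equals": "PREMIUM"}},
]

rows = [{"plan": "premium"}, {"plan": "basic"}]
for step in build_pipeline(config):
    rows = step.run(rows)
print(rows)  # -> [{'plan': 'PREMIUM'}]
```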
| esafak wrote:
| If it is really a dbt clone it is an ELT tool, not ETL:
|
| https://en.wikipedia.org/wiki/Extract,_load,_transform
|
| https://en.wikipedia.org/wiki/Data_build_tool
|
| It's about (big) data munging.
| Alanhlwang wrote:
| Thanks for these links! We consider ourselves an ELT and ETL
| tool--if you run a Serra job in your own warehouse (i.e.
| Databricks), you can easily specify extracting from AWS,
| loading the parquets into your warehouse, then transforming
| them with our config block approach (ELT).
|
| The same is true for ETL. If you have a spark cluster
| separate from your warehouse, you can define your config file
| to run in the order E T L: you can extract from your data
| source, run the transformations on a separate cluster, then
| load it to your warehouse.
| khaledh wrote:
| The pattern of reading from data sources to a Pandas DataFrame
| first defeats the whole point of using Spark[1]. Maybe it's ok
| for small tables, but you'll probably run out of memory on large
| tables.
|
| [1] https://github.com/Serra-
| Technologies/serra/blob/a7a80c77af5...
| kermatt wrote:
| Moving between Spark and Pandas can cause type-casting issues as
| well. For example, the range of allowable dates in Pandas is much
| smaller than in Spark. We completely abandoned Pandas in favor
| of PySpark for this reason.
|
| It seems unnecessary to use multiple dataframe implementations
| when Spark is already in play.
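For reference, the date-range mismatch is easy to demonstrate: pandas' default datetime64[ns] type only covers roughly the years 1677 to 2262, while Spark's date/timestamp types go far beyond that. A quick check (requires pandas):

```python
# pandas' nanosecond-resolution Timestamp covers only ~1677 to ~2262;
# Spark's DateType/TimestampType can represent dates far outside this,
# so e.g. an open-ended sentinel date like 9999-12-31 can't round-trip
# losslessly through a datetime64[ns] pandas column.
import pandas as pd

print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

assert pd.Timestamp.max.year == 2262
```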
| albertstanley wrote:
| This is a completely valid point, we'll be changing the readers
| to read directly into Spark. Thanks for the comment!
| vladsanchez wrote:
| That's a smell. I thought they basically packaged the ETL
| portion of DBT as open source, not the data connector
| implementation. I'd like it to be connector-agnostic so that you
| can choose the most suitable for your needs.
|
| Good intentions, but perhaps wrong execution. We'll see!
| addisonj wrote:
| Congrats on the launch!
|
| Interesting project in a space that I am pretty certain is going
| to change a lot in the coming years. Here is a bit of random
| feedback and questions.
|
| * Some of your messaging related to python vs yaml is a bit
| confusing, which results in me not being immediately clear on the
| value prop. After digging through docs and code I now understand
| that the yaml is a declarative pipeline calling the underlying
| python code that can include user defined transformations. Nifty!
| As someone who has led data platform teams, I understand that
| this would be a big win for any data platform team to better
| support data eng/scientists. But you don't tell me any of that. I
| would look at trying to give more context to what this is and
| adding more of these use cases and values in your marketing (even
| if they are pretty nascent at this stage)
|
| * From the loom, the play you are doing is clear and makes a lot
| of sense to build a cloud service to easily run these jobs... but
| that makes me wonder if your licensing choice is maybe a bit too
| restrictive? IMHO, the most important thing to do when building
| dev tools is to be very deliberate in your end-to-end user ->
| customer journey and designing your open source and commercial
| strategies to nicely dovetail. For a product like this, I would
| think the faster and bigger I can build a community, the better,
| and that may mean "giving away" a lot of the initial core
| innovation, but with a clear plan on the innovation I can drive
| through integrated services, which would imply as open a
| license as possible. As is, I think you might find it much harder
| to get people to take it seriously, as, unlike other source
| available companies (Elastic, Cockroach, etc) you aren't yet
| proven to be worth the effort to get this approved vs a full open
| source alternative
|
| * On a similar note, what is in the repo right now seems to be a
| relatively thin wrapper around spark. That isn't a criticism.
| Many technologies and communities have started based on a "remix"
| of a lower level tool that offers simplified UX/DX or big
| workflow improvements. What sets those apart though, imho, is to
| drastically lower the barrier to entry to using the underlying
| technology and to be seen as leaders and experts in the space you
| operate. I am guessing you probably have lots of features
| planned, but I would also give a soft suggestion to look as much
| into thinking of learnability as a feature (via features,
| interactive docs, etc) as I would almost anything else, as that
| is really where a lot of the value of a higher level interface
| like this comes in
|
| * My past experience with really large and complex ETL jobs that
| essentially required dropping into spark to represent them has me
| wondering how much actual complexity can be represented by the
| transformers? I would be curious to know what your most complex
| pipeline is? It doesn't seem there is an API limitation why these
| pipelines couldn't get quite a bit larger and represent many sql
| statements, other than big long spark pipelines getting kind of
| ugly, and in some cases, could even remove the need for quite a
| few airflow jobs. I am curious to know if and how you see Serra
| addressing those types of ETL jobs.
|
| Once again, congrats on launching! Happy to give more
| context/thoughts in a thread or reach out to me via in profile
| Alanhlwang wrote:
| This is super insightful, thanks a ton for this gold mine.
|
| On the python vs yaml part--definitely could've made that way
| more clear in the demo. Right now our framework lets you call
| these python objects in your yaml file, but we are working on
| just a python-centric implementation as well for those that do
| not want to interact with yamls.
|
| On the loom and licensing choice--that's a great point. One of
| the main issues we ran into is getting adoption as we
| originally just tried licensing out the framework (mega fail
| ofc)--found out the hard way that no dev wants to buy something
| to try it out. We're definitely flexible on our license and
| will take all this feedback into account.
|
| On the barrier to entry--also super insightful. We're working
| on a local UI offering that will be a 'config' block builder
| that will be free for all installs. We're implementing a DAG
| view similar to Airflow on the transform level. We also want to
| make it super easy to see your code and preview how it changes
| with this local UI (and have a list of all the params you need
| for your spark objects without having to go through the docs).
| We also want to flesh out more features especially on the
| translate side, as well as host on the cloud.
|
| With the complexity issue, that's something I ran into at Disney as
| well! As the product grows we definitely want to flesh out our
| transformers based on the scripts we see. For now, the
| developer can make one-off transformers--we actually have a
| catch all "SQL transformer" for cases where you want to just
| pass in your sql (similar to a dbt model) and run it that way.
| That way it's a failsafe: if you have one specific
| transform that you feel is super hard to break down, you can
| fall back on dbt's way of just modularizing the SQL into a
| transform, and reference it however many times you want as an
| input block later on.
|
| Thanks so much for the congrats, will definitely reach out and
| would love to have further discussions in the thread as well.
| mbellm wrote:
| The selling point was replacing brittle SQL with Python. Sounds
| great, but I'm not seeing where Python comes into play from that
| demo? Was it not shown in the video?
| Alanhlwang wrote:
| Yep we only showed our configuration file which instantiates
| those Python objects. We definitely could've made the demo
| clearer, and if you want to look at the actual objects that our
| config is referencing, definitely check out our repo!
| mbellm wrote:
| Thanks! I'll check it out, cheers
| morkalork wrote:
| The data engineering space feels like it's earning the same
| reputation front end had/has with the endless stream of new and
| shiny frameworks.
| riwsky wrote:
| Were your Disney+ fires using dbt? The comparison in your demo
| doesn't resemble normal dbt usage: it forces the SQL to inline
| the state abbreviation instead of using a dbt seed file, while
| the initial serra version uses one; it initially shows the serra
| code-folded, to make the SQL seem more verbose; and the SQL makes
| no use of CTEs or dbt models, either of which would make the
| transform steps clear.
| Alanhlwang wrote:
| These are all great points, and no, Disney didn't use dbt. We
| wanted the demo to show how you can modularize SQL into
| reusable objects that you can fully customize error logs for,
| while also adding the value of handling all steps of ETL
| in-house. You could definitely write this script modularly using
| dbt, but we feel our main differentiator is the value add of
| connectors that easily integrate with your transforms (e2e), as
| well as taking the software engineering best practices that dbt
| applies a step further by turning each transform and connector
| into objects that you can test, modularize, and customize error
| logs for.
| ssddanbrown wrote:
| Another YC launch advertising itself as open source while using a
| non-OSD-adhering license (ELv2 in this case). I respect your right to
| choose a license that protects your efforts, but calling this
| open source will be misleading to many.
| dang wrote:
| I put the word "open source" in the title, and am happy to
| change it if someone has a better term. (I'm not up on license
| subtleties.)
| ssddanbrown wrote:
| Thanks dang, "source available" is pretty common for licenses
| like the ELv2 used here.
| dang wrote:
| As more and more startups are going open source, source
| available, open core, etc., I need to figure out how to do
| Launch HNs without triggering off-topic controversies
| around the term "open source". My problem is, there's no
| consensus among HN readers about what the term means.
|
| If anyone has a suggestion about how to solve this problem
| in an accurate and neutral way, I'd like to hear it.
| version_five wrote:
| OSI maintains a list of open source licenses which is as
| close to an industry consensus as you'll find. If a
| license is on that list I don't think many would say it's
| not open source.
|
| https://opensource.org/licenses/
|
| That's for software only, if it's an AI model all bets
| are off.
| ayewo wrote:
| I understand your frustration.
|
| IME, HN tends to use the term open source in two senses.
| It can either refer to:
|
| - the license or;
|
| - the business model.
|
| And we know that licenses exist on a spectrum of
| permissive to restrictive.
|
| So when the community is presented with a for-profit
| entity in a Launch/Show HN, they tend to dwell on the 2nd
| sense.
|
| If it's a side project that's on display, then the 1st
| sense kicks in.
|
| Based on this, I'd like to offer the following colloquial
| interpretations for the terms you mentioned.
|
| 1. Open source: _permissive_ (or more correctly, well-
| known) licenses like MIT, Apache, BSD, GPL, LGPL etc that
| do not prohibit commercial derivatives (or prevent cloud
| hyperscalers like AWS from using it).
|
| 2. Open core: our code is split into 2 parts: the open
| source bit (often under a _permissive_ open source
| license in #1) to attract fellow devs and the closed
| source bit. The closed source bit is how we plan to make
| money.
|
| 3. Source available: we plan to make money however we see
| best so as insurance, our code can only be available
| under an obscure license that was designed to be
| _restrictive_.
|
| So, I think what's really happening is that labelling
| something "open source" will cause the community to
| quickly point out that said license is _restrictive_.
| dang wrote:
| Thanks! That's helpful. I've changed the wording to "open
| core" above.
| wildermuthn wrote:
| This comment would be more helpful if you could summarize the
| pitfalls of people relying upon ELv2. My impression of these
| variations is that they are generally used to protect the
| authors from a giant corp using the software to create a cloud
| service of some kind?
| illiliti wrote:
| There is no problem with people relying on ELv2 license. Just
| don't call your project open-source because ELv2 is not an
| open-source license.
| wolframhempel wrote:
| It is open source in the sense that the source is open, you can
| go and look at it. It's even free open source in the sense that
| you can take it and use it in your own commercial project without
| the need to compensate its authors.
|
| The only limit is that the project you're building with it
| can't be a hosted service version of the software itself -
| which is, I assume, what Serra's business model will be.
|
| I don't think that "Open Source" just means Apache 2 and MIT
| licensed stuff - and in fact feel that the license Serra chose
| is one of the most generous OSS licenses that still retains
| just enough rights for the authors to make a living.
| rovr138 wrote:
| https://opensource.org/osd/
|
| > Introduction
|
| > Open source doesn't just mean access to the source code.
| The distribution terms of open-source software must comply
| with the following criteria
|
| There's a definition. This isn't open source.
| wolframhempel wrote:
| And how does the OSI derive its legitimacy as the steward
| for all things open source? As far as I am concerned, it is
| just one body with its own private viewpoint, not a
| universal lawmaker for all open source devs.
|
| In general, while I appreciate the work of the OSI, I
| believe that they are too idealistic in their viewpoint,
| derived from the world of Linux and early OSS.
|
| In my view, if we want to maintain a healthy and growing
| open source ecosystem, we must allow the makers of great
| OSS to be sustainable and monetize their creation. I don't
| believe that that's an inherent conflict with the spirit of
| OSS.
| rovr138 wrote:
| > And how does the OSI derive it's legitimacy as the
| steward for all things open source?
|
| We give it to them.
|
| OSI isn't an entity that's existed since the beginning
| and the original definition doesn't come from them.
|
| I agree that it's not a space that never changes.
|
| Having said that, change must come from the community. It
| can't be just a couple corporate entities defining their
| own license, calling it open source, and going against
| the established definition.
|
| > In my view, if we want to maintain a healthy and
| growing open source ecosystem, we must allow the makers
| of great OSS to be sustainable and monetize their
| creation. I don't believe that that's an inherent
| conflict with the spirit of OSS.
|
| I don't mind monetization don't get me wrong.
|
| Should it change? I'm not the authority on it. I'm just
| saying it's not the current definition.
| Karunamon wrote:
| Usage determines definition, not the OSI.
|
| At the end of the day, Pure Open Source(tm) modulo one very
| narrowly defined prohibited use is good enough for everyone
| except product managers at large public cloud companies and
| people who want to argue about ideological purity. It
| provides all of the same benefits.
| jauco wrote:
| I get what you're saying. The term for that is "source
| available" though.
|
| Open Source has a specific meaning to the people who frequent
| this site. And I get why YC companies are scrutinized more for
| vague and misleading promises.
|
| What if someone said their app was free to use. And then
| somewhere far in the sign up flow, it turns out you are
| required to pay. And then the app developer claims "well,
| you're free to use it, but you do have to pay". It's not that
| that sentence can't mean what the developer says it does. But
| they should take into account what people will think it
| means.
___________________________________________________________________
(page generated 2023-08-14 23:00 UTC)