[HN Gopher] Launch HN: Castled Data (YC W22) - Open-Source Rever...
___________________________________________________________________
Launch HN: Castled Data (YC W22) - Open-Source Reverse ETL
Hi HN, We're Arun, Manish, Abhilash and Franklin from Castled Data
(https://castled.io). Castled is an open source reverse ETL
solution. It helps you to periodically sync the data in your
database/warehouse (Snowflake, BigQuery, Redshift, etc.) into
sales, marketing, or support apps (Salesforce, Hubspot, Intercom
etc.), or custom software, without needing an engineering team.
Here's a demo video:
https://www.loom.com/share/71bf33acbb4a41cab7c96a3460a84e5f. On
average, mid-scale organizations use around 40 SaaS apps. These are
powerful in functionality, but limited by the quality of the
product/customer data which is fed into them. The data getting
synced into these tools is often incomplete, suffers from quality
issues, and requires unreliable and manual imports (e.g. from CSV).
Manish and I were founding engineers at Hevodata, an ETL company,
when it went from 5 customers to around 300 customers. We started
seeing the trend of more and more customers wanting to move the
data out of their cloud data warehouse to feed their business
tools. We built a prototype to solve this for our users, but when
we went deep into their use cases, we found that there were a lot
of unsolved problems in this space. We also realized that
activating warehouse data reliably for operational purposes was
emerging as the next big trend for data-driven companies. We did
some research and came across Census/Hightouch, which were early-
stage Reverse ETL cloud solutions at the time. But from our
previous experience working in the ETL space, we believed that any
data pipeline solution needs to be open source to cover the long
list of connectors that needs to be built. So we set out to build
our open source Reverse ETL solution. With Castled, companies can
create automated data pipelines to periodically sync the output of
a warehouse transformation query or dbt models (in the works) to
their sales, marketing, support and notification tools. We fetch
only the incremental results by default on every pipeline run,
which makes sure that rate limits and other constraints of the
destination APIs are not breached. Our users can also set a time
schedule to define the frequency of the pipeline run. The
technical challenges in building such a tool include: doing CDC
(Change Data Capture) from data warehouses which do not provide a
typical write ahead log; handling rate limits on destination APIs;
handling deduplication of records on destination objects; failure
handling and automatic retries. But the biggest challenge is the
sheer number of destination app integrations that need to be
supported--we are talking about tens of thousands of connectors.
Our major differentiator from Census/Hightouch is that we are open
source. Our users can host Castled in their own private cloud and
start operationalizing their data for free. We've observed that
initially customers are inclined towards buying a cloud solution
for their data integration needs. But once they scale up, they
realize that their cloud vendor is unable to cope with the
increasing number of apps getting used in the organization. They
soon start building in-house data pipeline solutions or look for an
open-source solution to solve their problems. Being open source, we
provide the flexibility for our customers to build their own
connectors rather than waiting for cloud vendors to fulfill their
connector requests. Compared with open-source alternatives (e.g.
Grouparoo), we have built Castled in such a way that our community
can build new connectors in a few hours. One example of this is our
Castled Form Language (CFL), which helps our users auto generate
extremely complex forms on the UI by writing a few Java annotations
on the backend. This removes the need for a UI developer to build a
new connector. Our GitHub repo is here:
https://github.com/castledio/castled. Most users can spin up the
application on their desktop in a few minutes. In case you want a
hosted solution, we also have our cloud platform at
https://castled.io. It is a subscription-based offering that adds
features such as single sign-on, authentication, user management,
notifications, and alerts. You can sign up and try out the product
for free, no credit card
required. This is the first time we are trying to build an open-
source community around a project and we're excited to hear any
thoughts, insights, questions, encouragement and concerns in the
comments below! We will be monitoring the thread over the course of
today to answer any questions. Also feel free to reach out to me by
email at arun@castled.io
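
The post lists rate limits on destination APIs and automatic retries among the technical challenges. A minimal sketch of one common approach, batching records and retrying with exponential backoff, is shown below. This is not Castled's implementation: `RateLimited`, `chunked`, and `send_with_retries` are hypothetical names introduced only for illustration, and `send` stands in for whatever client call pushes a batch to a destination.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error a destination client raises when it is throttled."""


def chunked(records: Sequence[T], size: int) -> List[Sequence[T]]:
    """Split records into batches of at most `size` items."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def send_with_retries(
    send: Callable[[Sequence[T]], None],
    batch: Sequence[T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send one batch, backing off exponentially when throttled.

    Returns True on success and False once all attempts are used up.
    """
    for attempt in range(max_attempts):
        try:
            send(batch)
            return True
        except RateLimited:
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return False
```

Passing `sleep` as a parameter keeps the function easy to test without real delays; a batch that still fails after `max_attempts` would be left for the next pipeline run.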
Author : aruntdharan
Score : 64 points
Date : 2022-01-25 14:40 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| soumyadeb wrote:
| Congrats on your launch. Great to see more innovation in this
| space.
|
| How are you thinking of monetizing?
| aruntdharan wrote:
| Thanks! We also have a subscription-based hosted solution at
| https://castled.io
| manishks wrote:
| Hi everyone, Manish here (one of the founders of Castled). Thanks
| for pouring in so many ideas and suggestions in the comments. We
| have a Discord community; it would be great if you joined and
| helped us build a great open source product -
| https://discord.gg/ERAjcSNerD
| mritchie712 wrote:
| It looks like you're asking users to write connectors in Java.
| Have you given some thought to who your user is? I'd imagine the
| type of person that'd consider using this would be unlikely to
| prefer or even know Java.
| aruntdharan wrote:
| Thanks for the tip. We believe it's still the data engineers
| who will have to write these reverse connectors as well. We
| understand that Python is probably their preferred language.
| We have also seen that a majority of them understand Java as
| well. But we are willing to support multi-language
| functionality in the future, if the community demands it.
| Do you think the majority of data engineers do not understand
| Java?
| lmeyerov wrote:
| "Know" vs "willing to write", esp. OSS for making someone
| else rich
|
| Nowadays, probably something like python then rust/go, just
| for community, and especially aligned on apache arrow. OSS
| async python / HTTP, with arrow dataplane support
| (fast,typed,standardized), is part of our bar for whether we
| consider a data proj as a core dep nowadays. A surprising
| amount of ETL startups are YOLO json for the dataplane, so
| we've intentionally stayed away due to reliability+perf heart
| pains. But maybe you can fake it till you make it that way
| too, and then hire staff to clean it up 2 years later :)
| mritchie712 wrote:
| It's probably 10 to 1 preference of Python over Java for data
| engineers. At least 5 to 1.
| llaolleh wrote:
| I would disagree with this. Most Python developers I know
| can do Java. Not a big deal.
| aruntdharan wrote:
| Noted. We will definitely consider adding support for
| Python. Thanks for the suggestion.
| galdosdi wrote:
| You might consider checking Jython out as an option. It's
| JSR-223 compliant and trivial to drop into a Java app and
| just expose Java objects to the Python scripts to be used
| as is. I've had a pretty great experience with it.
|
| The only downside is it's stuck on supporting Python2.x
| so you may end up wanting to properly integrate CPython
| eventually. Since you're targeting running Python code
| that doesn't exist yet and the language differences
| aren't huge though, I doubt most users would mind (I
| wouldn't). Just an idea to consider, esp for an MVP
|
| (One /upside/ is Jython is a Python2 interpreter fully
| written in Java, so the concurrency and performance may
| be better than CPython2 with its GIL)
| aruntdharan wrote:
| Just checked it out. Looks interesting. Thanks
| coderintherye wrote:
| Since there seem to be differing opinions here, I'll just
| add my experience that having worked with 3 data teams,
| everyone knew Python and used Python and no one knew or used
| Java.
|
| Great product and very excited for this, wish I could invest
| in you and wish this had been around years ago when I was
| trying to convince Fivetran they should create reverse ETL
| functionality.
| cstanley wrote:
| Cool product guys! One question:
|
| "Being open source, we provide the flexibility for our customers
| to build their own connectors rather than waiting for cloud
| vendors to fulfill their connector requests."
|
| Why does it need to be OS? Can't a product just have a devkit
| that enables you to build your own connectors?
| aruntdharan wrote:
| Thanks for the suggestion. Yes, a devkit would work if it's just
| about building new connectors. But we believe that the
| community would want control over the entire project rather
| than just the connectors module. We also wanted to provide a
| usable version of the product for free to the community, which
| you can self-host and maintain yourselves as well.
| awwx wrote:
| fyi small typo in https://oss-docs.castled.io/deploying-
| castled/deploy-on-aws-...: "Login to you AWS web console"
| aruntdharan wrote:
| Fixed :)
| amcaskill wrote:
| Fantastic! Congratulations on the launch.
|
| Is there a way to version control the sync configurations? Any
| thoughts on putting that in the roadmap?
|
| I'd love to be able to put my 'Castled config' in the same repo
| as my dbt project, for example.
| aruntdharan wrote:
| You mean the warehouse/app credentials, when you say sync
| configurations? If so, yes, that seems like a great idea.
| In fact I think your warehouse credentials are already there in
| your dbt repo in a specific format. Castled can directly read
| those credentials from there.
| amcaskill wrote:
| That is not what I meant, but also pretty interesting.
|
| I actually mean the 'definition' of the syncs themselves.
|
| I am picturing JSON or YAML that describes the source fields,
| their mapping to the destination fields, and any other meta
| about the sync: frequency, number of retries, whatever else
| that you could configure in the UI
|
| So when I go and update my dbt model to modify one of the
| tables that I am syncing from, I can make the corresponding
| changes to my Castled settings file, and release it all as
| one atomic update to my data infrastructure.
|
| It might be a small number of people who would want something
| like that, but it's definitely something I would have been
| excited about when I was running a data team.
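
The kind of version-controllable sync definition described in the comment above could look roughly like the sketch below. It is purely illustrative: Castled is not stated to support such a file, and `SyncDefinition` and every field name in it are assumptions made up for this example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class SyncDefinition:
    """Hypothetical, file-friendly description of one sync (not a Castled format)."""

    name: str
    source_query: str
    destination: str
    primary_key: str
    field_mapping: Dict[str, str]  # source column -> destination field
    schedule_minutes: int = 60
    max_retries: int = 3

    def validate(self) -> None:
        """Catch the simplest mistakes a UI would otherwise guard against."""
        if self.primary_key not in self.field_mapping:
            raise ValueError(f"primary key {self.primary_key!r} is not mapped")
        if self.schedule_minutes <= 0:
            raise ValueError("schedule_minutes must be positive")


def to_json(sync: SyncDefinition) -> str:
    """Serialize with sorted keys so diffs in a repo stay stable."""
    sync.validate()
    return json.dumps(asdict(sync), indent=2, sort_keys=True)


def from_json(text: str) -> SyncDefinition:
    """Load a definition from a file's contents and validate it."""
    sync = SyncDefinition(**json.loads(text))
    sync.validate()
    return sync
```

A file produced by `to_json` could sit next to the dbt project, so that a change to a model and the matching change to the field mapping land in the same commit.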
| aruntdharan wrote:
| We can definitely consider that. But I feel it's a lot of
| config and can be error-prone. For instance, source-destination
| field mapping configs might be complex and have
| various issues like data type mismatches, typos in field
| names, etc., and a user interface is better suited to guide
| you through the entire process.
|
| But I see value in exporting the config to a github repo
| after the pipeline is created and thereafter future edits
| can be done via the github repo. Does that make sense?
| amcaskill wrote:
| Yes 100% -- you could also imagine just syncing from the
| UI to a repo, rather than trying to make the config
| human-editable. Toggle into a branch in the UI, make
| edits, and have those committed to the repo by the tool.
|
| Looks awesome, I am rooting for you guys!
| aruntdharan wrote:
| Thanks for the input!
| mason55 wrote:
| > _It might be a small number of people who would want
| something like that, but it's definitely something I would
| have been excited about when I was running a data team._
|
| Yeah this one certainly depends on the target customer. For
| me, any tool that didn't have source control integration
| for configuration would be a non-starter. But it's quite
| possible that the target audience for this tool doesn't
| even understand the term "source control".
| iRomain wrote:
| Please consider supporting OSS destinations as well! They share
| the same values as you and could make for some interesting
| partnerships. I understand you have to have the big SaaS names,
| but that's an opportunity to differentiate from your competitors!
|
| PS: the experience is poor for https://oss-docs.castled.io on mobile,
| I cannot see a menu to switch pages
| frankcastled wrote:
| Thanks for the suggestion. We started out supporting popular
| SaaS tools, as that increases the chances of people trying it
| out. We currently support Kafka as a destination. Also, could
| you give examples of some of the OSS destinations you have in
| mind?
|
| Sorry about the docs. Haven't done much testing on smaller
| screens yet.
| hbarka wrote:
| Why is the term "Reverse ETL" becoming a thing? Committing data
| to an OLTP system has been around since the beginning of SQL. I
| believe one vendor coined this term to a marketing success but
| this meme needs to stop. Besides, ETL has been going out in favor
| of ELT.
| aruntdharan wrote:
| A year ago, when we started to build Castled, this technology
| which syncs data from cloud warehouses to your operational
| tools did not have a name. The term "Reverse ETL" became
| popular sometime around the beginning of 2021. We used this
| term, since we know that the data community knows this
| technology by this name now.
|
| But my personal take is that "Reverse ETL" is still a new
| technology in the sense that it completes the modern data
| stack, which is built around cloud data warehouses.
| davidkell wrote:
| Thanks folks, much needed! Where is the list of destinations
| today?
| frankcastled wrote:
| Currently we have 14 destinations available for use:
| Salesforce, Hubspot, Intercom, Google Ads, Mailchimp, Google
| Sheets, Sendgrid, Marketo, ActiveCampaign, Kafka, Customer.io,
| Google Pub/Sub, Mixpanel, REST API.
| awwx wrote:
| fyi Your purple "Deploy on AWS" link
| (https://docs.castled.io/deploying-castled/deploy-on-aws-ec2) at
| the top of your README yields a 404.
| frankcastled wrote:
| just fixed :) Thanks!
| gregw2 wrote:
| How do you handle CDC from a DW like Redshift? If I have a
| 5-billion-row fact table with an insert or update datetime audit
| column (but no soft delete tracking!) how do you deliver deltas?
| Are you keeping out-of-band hashes of pk values or tuple values?
|
| Do you need to know the primary key of the source table to sync?
| aruntdharan wrote:
| That's a great question! We don't use updated timestamps to
| compute deltas, as that's unreliable and can cause data loss
| depending on your transaction window.
|
| We keep snapshot tables in your data warehouse (in our own
| custom schema, so that you don't have to give Castled write
| access to any of your production schemas). The snapshot tables
| are then used as the baseline to compare against the query
| results every time the pipeline runs. Frankly, we have not really
| seen a use case for transferring 5 billion rows in a Reverse ETL
| pipeline. This is mostly because our destination apps are
| transactional systems and cannot really store that much data.
| For example, a Salesforce destination can store a maximum of 10GB
| of data. Because of this, we store the actual tuple values in the
| snapshot. We have easily scaled our pipelines to compute deltas
| from queries which return up to 100 million records. To optimize
| this further, we are also considering keeping hashes of the
| tuple values instead of the actual values.
|
| Yes, we need to know the primary key of your query results.
| This is required to handle failures and to remove the failed
| records from the snapshot table, so that those can be retried
| on the next pipeline run.
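
The snapshot comparison described in the reply above can be sketched roughly as follows. This is an in-memory illustration under simplifying assumptions, not Castled's code: the real comparison is said to run against snapshot tables inside the warehouse, whereas here the snapshot is a plain dict keyed by primary key, and `compute_delta` and `advance_snapshot` are hypothetical names.

```python
from typing import Any, Dict, Hashable, List, Set, Tuple

Row = Dict[str, Any]


def compute_delta(
    snapshot: Dict[Hashable, Row],
    current_rows: List[Row],
    primary_key: str,
) -> Tuple[List[Row], List[Row], List[Hashable]]:
    """Compare the latest query result with the previous snapshot.

    Returns (inserted rows, updated rows, keys deleted since the snapshot).
    """
    inserted: List[Row] = []
    updated: List[Row] = []
    seen: Set[Hashable] = set()
    for row in current_rows:
        key = row[primary_key]
        seen.add(key)
        previous = snapshot.get(key)
        if previous is None:
            inserted.append(row)   # key was not in the snapshot
        elif previous != row:
            updated.append(row)    # same key, different tuple values
    deleted = [key for key in snapshot if key not in seen]
    return inserted, updated, deleted


def advance_snapshot(
    current_rows: List[Row],
    primary_key: str,
    failed_keys: Set[Hashable],
) -> Dict[Hashable, Row]:
    """Build the next snapshot, leaving out records whose sync failed.

    A record missing from the snapshot shows up as a change on the
    next run, which is what causes it to be retried.
    """
    return {
        row[primary_key]: row
        for row in current_rows
        if row[primary_key] not in failed_keys
    }
```

Storing a hash of each row instead of the full tuple, as the reply mentions considering, would only change the `previous != row` comparison into a comparison of hashes.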
| mjirv wrote:
| Looks great, congrats on launching!
|
| I'm curious how you differentiate yourselves from Airbyte, which
| isn't really designed for reverse ETL but can be used for it. And
| do you ever see Castled supporting regular ETL?
|
| Right now there is a lot of separation in the market between ETL
| and reverse ETL, but it seems like a pain to maintain separate
| tools when you could just do both in one.
| aruntdharan wrote:
| The founding team of Castled comprises mostly founding
| engineers from HevoData, which is an ETL solution. Having built
| both ETL and Reverse ETL solutions from scratch, we have
| realized that the architecture required to support both ETL and
| Reverse ETL needs to be drastically different. Your cloud data
| warehouse is powerful enough to change the entire architecture
| of the product depending on which side of the pipeline the
| warehouse is on. So we believe you need different products to
| support both ETL and Reverse ETL. But agreed, the same tool can
| provide both products.
|
| I will have to check how Airbyte supports both. Regarding
| Castled, regular ETL is on our mid-term roadmap.
| gorkemyurt wrote:
| I think it's still "coming soon"
|
| https://airbyte.com/blog/airbyte-strategy-to-commoditize-
| all...
| ploomber wrote:
| Go Castled! Congrats on the launch!
| aruntdharan wrote:
| Thanks!
___________________________________________________________________
(page generated 2022-01-25 23:01 UTC)