[HN Gopher] Show HN: ScratchDB - Open-Source Snowflake on ClickHouse
___________________________________________________________________
Show HN: ScratchDB - Open-Source Snowflake on ClickHouse
Hello! For the past year I've been working on a fully-managed data
warehouse built on Clickhouse. I built this because I was
frustrated with how much work was required to run an OLAP database
in prod: re-writing my app to do batch inserts, managing clusters,
and needing to look up special CREATE TABLE syntax every time I
made a change. I found pricing for other warehouses confusing (what
is a "credit" exactly?) and worried about getting capacity planning
wrong.

I was previously building accounting software for firms with
millions of transactions. I desperately needed to move from
Postgres to an OLAP database but didn't know where to start. I
eventually built abstractions around Clickhouse: my application
code called an insert() function, but in the background I had to
stand up Kafka for streaming, handle bulk loading, DB drivers, and
Clickhouse configs, and manage schema changes. This was all a big
distraction when all I wanted was to save data and get it back. So
I decided to build a better developer experience around it.

The software is open-source: https://github.com/scratchdata/ScratchDB
and the paid offering is a hosted version: https://www.scratchdb.com/.
It's called "ScratchDB" because the idea is to make it easy to get
started from scratch. It's a massively simpler abstraction on top
of Clickhouse. ScratchDB provides two endpoints [1]: one to insert
data and another to query. When you send any JSON, it automatically
creates tables and columns based on the structure [2]. Because
table creation is automated, you can just start sending data and
the system will just work [3]. It also means you can use Scratch as
any webhook destination without prior setup [4,5]. When you query,
just pass SQL as a query param and it returns JSON. It handles
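
To give a feel for the API, here is a minimal Go sketch. The
endpoint paths, query parameter, and auth header below are
illustrative guesses, not the documented interface - see the
docs [1] for the real request shapes:

      package main

      import (
          "fmt"
          "io"
          "net/http"
          "net/url"
          "strings"
      )

      func main() {
          base := "https://app.scratchdb.com" // illustrative host
          apiKey := "YOUR_API_KEY"

          // Insert: POST arbitrary JSON; tables and columns are
          // created automatically from the structure.
          body := strings.NewReader(`{"user": "alice", "amount": 42}`)
          req, _ := http.NewRequest("POST", base+"/data?table=payments", body)
          req.Header.Set("X-API-KEY", apiKey)
          req.Header.Set("Content-Type", "application/json")
          ins, err := http.DefaultClient.Do(req)
          if err != nil {
              panic(err)
          }
          ins.Body.Close()

          // Query: pass SQL as a query param, get JSON back.
          q := url.QueryEscape(
              "SELECT user, sum(amount) FROM payments GROUP BY user")
          qreq, _ := http.NewRequest("GET", base+"/query?q="+q, nil)
          qreq.Header.Set("X-API-KEY", apiKey)
          resp, err := http.DefaultClient.Do(qreq)
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()
          out, _ := io.ReadAll(resp.Body)
          fmt.Println(string(out))
      }
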
It handles streaming and bulk loading of data. When data is
inserted, I append it to a file on disk, which is then bulk loaded
into Clickhouse. The overall goal is for the platform to
automatically manage shards and replicas.

The whole thing runs on regular servers.
Hetzner has become our cloud of choice, along with Backblaze B2 and
SQS. It is written in Go. From an architecture perspective I try to
keep things simple - I want folks to make economical use of their
servers. So far ScratchDB has ingested about 2 TB of data and
handled about 4,000 requests/second, on about $100 worth of
monthly server costs.
Feel free to download it and play around - if you're interested in
this stuff then I'd love to chat! Really looking for feedback on
what is hard about analytical databases and what would make the
developer experience easier!

[1] https://scratchdb.com/docs
[2] https://scratchdb.com/blog/flatten-json/
[3] https://scratchdb.com/blog/scratchdb-email-signups/
[4] https://scratchdb.com/blog/stripe-data-ingest/
[5] https://scratchdb.com/blog/shopify-data-ingest/
Author : memset
Score : 160 points
Date : 2023-10-27 13:34 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| throwaway295729 wrote:
| Congrats on the release! Can this be used for log data? How long
| is ingested data kept?
| memset wrote:
| You can use it for logs. I have a basic example here:
| https://scratchdb.com/blog/fly-logs-to-clickhouse/
|
| It's just storing data in a database so it persists as long as
| you want. For the hosted version you just pay per GB of
| compressed data, so if you have tons of logs then you can keep
| them, or you can clear out old data to save on disk space.
| esafak wrote:
| TiDB is an HTAP database whose OLAP component (TiFlash) was based on
| Clickhouse: https://news.ycombinator.com/item?id=23584022
|
| If you have analyzed the competition, what are your selling
| points? Benchmarks welcome. Thank you!
| memset wrote:
| I haven't seen TiDB before - thanks for sharing. I honestly
| haven't spent tons of time analyzing other databases
| (there are so many!) but I haven't seen much out there which
| tries to be "Firebase for OLAP" - that is, make it so easy that
| a developer who maybe doesn't have time to learn Clickhouse can
| be effective with it. That's the most important goal of this
| project: lower the barrier for people who need to use a
| columnar data store to be able to do so [without thinking
| too much about it] and get back to solving their engineering
| problems.
| didip wrote:
| You should submit your benchmarks to ClickBench.
| tbragin wrote:
| Disclaimer: I work at ClickHouse.
|
| Thank you! Looks really interesting!
|
| I personally agree that real-time OLAP databases have the potential
| to better serve workloads currently in Postgres or cloud data
| warehouses that need real-time ingest and analytical queries. And
| simplifying developer experience on top of that, so you don't
| have to learn about all the details of a powerful database,
| really speeds up developer velocity.
|
| I'm curious: how do you see your project differing from GraphJSON
| (https://www.graphjson.com/) and Tinybird
| (https://www.tinybird.co/)?
|
| Congratulations again on the launch!
| memset wrote:
| Good to meet you! I hadn't seen graphjson before, I'll check
| that out. I'm also a fan of tinybird, and I think we have
| similar goals of wanting to make it easier for people to adopt
| OLAP.
|
| On the technical side, I've made different design decisions
| when it comes to ingesting and processing data. For example,
| after a table is created, you can post new JSON, with different
| columns, and we ingest it without needing a manual migration
| step. I also have a different treatment for JSON arrays, where
| we can split them into multiple Clickhouse rows (rather than
| using a Clickhouse array).
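|
| Roughly, to illustrate (column names follow the underscore
| flattening convention): posting
|
|       {"order": "o1", "items": [{"sku": "a"}, {"sku": "b"}]}
|
| gets split into two rows, (order='o1', items_sku='a') and
| (order='o1', items_sku='b'), each queryable with plain SQL.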
|
| Philosophically, I think there is a lot of room for open-source
| software _that also has an amazing UI and developer
| experience._ I've been writing OSS for years and basically
| think this is the best way to build successful developer tools.
| tbragin wrote:
| > Philosophically, I think there is a lot of room for open-
| source software that also has an amazing UI and developer
| experience.
|
| +1 to that! Welcome to the ClickHouse community!
| hawk_ wrote:
| I would like to be an OSS dev but how do you pay the bills?
| mike_d wrote:
| > how you see your project differs from GraphJSON
| (https://www.graphjson.com/) and Tinybird
| (https://www.tinybird.co/)?
|
| Not OP, but neither of these appear to be open source?
| ddorian43 wrote:
| Why is your storage 10X that of bigquery? How does your compute
| price compare to bigtable?
|
| Edit: bigtable->bigquery
| memset wrote:
| I don't bill separately for the compute vs storage the way
| bigtable does. The pricing per GB of data is inclusive of
| compute. The goal is for pricing to be modeled similarly to
| DynamoDB - just pay for what you use. The other way I charge is
| by query wall time - so a 30s query will cost you more than a
| 500ms one.
|
| I haven't used bigtable but it seems like the minimum charge is
| on the order of $300 before you have any data. With ScratchDB,
| the minimum charge is $10 for 30 GB.
|
| Additionally, on average, data compresses to about 25% of its
| original size. So if your 1 TB of data only takes up 250 GB, you
| only pay for the 250 GB.
|
| Bigtable isn't OLAP, so you would not use it for the same
| data. This competes more directly with GCP's BigQuery.
|
| Finally, I'm interested in pricing feedback! The goal is to be
| able to sustain the development of this, so I want to do what
| makes sense.
| ddorian43 wrote:
| I wrote that wrong - I meant bigquery, sorry. It's 9x that of
| bigquery.
| memset wrote:
| I haven't calculated the break-even point for bigquery vs
| scratchdb. It would be impossible to do so, as I'd need to
| know the number of rows bigquery will scan in order to make
| an estimate in advance. Also, what is a "slot-hour"?
|
| That is why I chose to price in units of "GB" and "hours"
| for storage and compute - those are things you can more
| easily observe.
|
| It is a good question, though, and perhaps I can do an
| experiment and write a blog post using example data showing
| the differences. I might be surprised at how efficient
| bigquery is!
| wkoszek wrote:
| ScratchDB has saved my business and it's awesome. I think if you
| need a columnar store, you should really try these guys.
| memset wrote:
| Thank you - really excited about what you're building too!
| tiffanyh wrote:
| AGPL-3.0 license, for those wondering.
| CharlesW wrote:
| Can you explain what "open-source Snowflake" means, since you
| don't explain it in this description, in the repo, or on the
| site?
|
| Is your goal explicitly to replicate all Snowflake capabilities?
| https://docs.snowflake.com/en/user-guide/intro-supported-fea...
| memset wrote:
| This is good feedback on writing a clear message - I appreciate
| it.
|
| The goal of this project is to build a superb developer
| experience around analytical databases. I think that is one of
| Snowflake's (many) value propositions. It is also a goal for
| users to be able to have full control of their data, how it is
| processed, and be able to make economical use of their compute.
|
| This project does not have a goal of matching anyone else's
| features. But of course, as we grow, we'll end up building
| things which are important to enterprises.
| skeeter2020 wrote:
| I too was confused by the title. If it doesn't have feature
| parity, nor approach the problem space from the same
| perspective as Snowflake, I don't think it's the open source
| version.
| willseth wrote:
| Calling it "Open-Source Snowflake" implies you are trying to
| be a drop-in replacement for Snowflake for at least a subset
| of features, SQL syntax, etc.
| giovannibonetti wrote:
| Great product! Thanks for sharing it!
|
| Question: I thought Clickhouse already had native support for
| flattening JSON [1], although it was released recently (version
| 22.3.1). Did you start working on it [2] before that? Or is it a
| different take? I'm curious about the pros and cons of each one.
|
| [1] https://clickhouse.com/docs/en/integrations/data-formats/jso...
| [2] https://scratchdb.com/blog/flatten-json/
| memset wrote:
| This is a really good question. My snarky answer is "the way we
| ingest JSON doesn't require 50 pages and n configuration
| settings to explain."
|
| Here's a more factual one:
|
| - We don't use tuples for nested JSON. We just use underscores
| to show parent_child relationships in keys.
|
| - We don't use arrays. Instead, we make it easy to split into
| multiple rows so you can use regular SQL.
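|
| A minimal sketch of the underscore idea in Go (not the actual
| ScratchDB implementation, just the shape of the transformation):
|
|       package main
|
|       import (
|           "encoding/json"
|           "fmt"
|       )
|
|       // flatten joins nested JSON keys with underscores,
|       // e.g. {"some": {"key": "2"}} -> {"some_key": "2"}.
|       func flatten(prefix string, in, out map[string]interface{}) {
|           for k, v := range in {
|               key := k
|               if prefix != "" {
|                   key = prefix + "_" + k
|               }
|               if child, ok := v.(map[string]interface{}); ok {
|                   flatten(key, child, out)
|               } else {
|                   out[key] = v
|               }
|           }
|       }
|
|       func main() {
|           var doc map[string]interface{}
|           raw := `{"user": {"name": "alice", "id": 7}}`
|           json.Unmarshal([]byte(raw), &doc)
|           out := map[string]interface{}{}
|           flatten("", doc, out)
|           fmt.Println(out) // map[user_id:7 user_name:alice]
|       }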
|
| I haven't directly compared all of Clickhouse's different JSON
| treatments to what I've implemented, but my goal was
| to build something that you could run and would "just work the
| way it should."
| antoniojtorres wrote:
| ClickHouse JSON certainly has some sharp edges because they
| do a lot of schema inference to transparently store the
| component parts as columns, which makes it tricky with
| anything that could be irregular.
|
| Big fan of CH on my end. Will follow your project closely.
| zepolen wrote:
| What happens with this?
|
|       {
|         "some_key": "1",
|         "some": { "key": "2" }
|       }
| memset wrote:
| Probably an error! I should create a ticket for that. What
| would we want the behavior to be?
|
| (This hasn't come up with existing customers, but theoretically
| it could and we should handle it.)
| mosen wrote:
| I tested it with the "explode" flattening mode, and it
| resulted in { "some_key": "2" }
| datatrashfire wrote:
| as a huge clickhouse fan, i have to agree json is simply not
| a great experience even with all the changes in the last year
| jed_sanders12 wrote:
| This looks great. I have one question. When you are automatically
| creating tables, how do you choose the primary key order for
| the Clickhouse table?
| memset wrote:
| Great question. Each new table is given a __row_id column which
| ScratchDB populates, and that is the default primary key.
|
| Then, as data is added, I look at the cardinality of each
| column and set an ORDER BY appropriately. The goal
| (implementation in progress, not launched yet) is for ScratchDB
| to do this automatically - we should be able to look at both
| the data in your table and your queries and
| analytically come up with an appropriate sorting key.
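|
| To make that concrete, a brand-new table starts out roughly
| like this (illustrative column names, not the exact generated
| schema):
|
|       CREATE TABLE events (
|           __row_id String,
|           user String,
|           amount Float64
|       ) ENGINE = MergeTree
|       ORDER BY __row_id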
| jed_sanders12 wrote:
| This sounds like a very interesting problem. How costly is it
| to modify order by after data has been loaded into the table?
| It sounds expensive because it would need to reorder the data
| after the ORDER BY changes.
| memset wrote:
| It is not a cheap operation; I basically create a new table
| and then "INSERT INTO ... SELECT ...". Then again, it isn't
| something you typically need to do often.
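|
| Concretely, something like this (names illustrative):
|
|       CREATE TABLE events_new AS events
|       ENGINE = MergeTree ORDER BY user;
|
|       INSERT INTO events_new SELECT * FROM events;
|
| followed by a RENAME TABLE to swap the two.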
| jed_sanders12 wrote:
| Thank you for the answers, and best of luck with this
| product.
| gbrits wrote:
| Congrats on the launch. This looks great. Inferring schemas on
| the fly is awesome for getting started quickly, but are there
| ways to explicitly define a schema if I wanted to? For example,
| I'm thinking of setting column-specific compression.
___________________________________________________________________
(page generated 2023-10-27 23:00 UTC)