[HN Gopher] Show HN: ClickHouse-local - a small tool for serverl...
___________________________________________________________________
Show HN: ClickHouse-local - a small tool for serverless data
analytics
My friend Vitaly Ludvichenko and I ran an experiment to combine the
ClickHouse server and client into a single self-contained program
that runs a database engine and processes data without a server:
https://github.com/ClickHouse/ClickHouse/pull/150 Development has
continued over the past 6 years, and clickhouse-local has become a
Swiss-army knife for data processing. Think "ffmpeg" for datasets,
and more. It resembles textql, octosql, dsq, duckdb, trdsql, q,
datafusion-cli, and spyql, but has better capabilities and
performance. Here is a tutorial:
https://clickhouse.com/blog/extracting-converting-querying-l...
It is "serverless" in the same sense as here:
https://www.sqlite.org/serverless.html and also in the more common
sense: clickhouse-local can be packaged into AWS Lambda and serve
queries on a per-request basis, as here:
https://github.com/ClickHouse/ClickHouse/issues/43589
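
For readers who have not tried it, here is a minimal sketch of what
clickhouse-local usage looks like (the file name and columns in the
first command are made up; the second is modelled on the pipeline
example in the clickhouse-local documentation):

    # Query a file in place:
    clickhouse-local --query "SELECT count(), avg(price)
                              FROM file('trips.csv', CSVWithNames)"

    # Or pipe data in; stdin is exposed as a table named 'table':
    ps aux | tail -n +2 | awk '{ printf("%s\t%s\n", $1, $4) }' \
        | clickhouse-local --structure "user String, mem Float64" \
            --query "SELECT user, round(sum(mem), 2) AS memTotal
                     FROM table GROUP BY user ORDER BY memTotal DESC"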
Author : zX41ZdbW
Score : 108 points
Date : 2023-01-05 19:30 UTC (3 hours ago)
(HTM) web link (clickhouse.com)
(TXT) w3m dump (clickhouse.com)
| notamy wrote:
| Tangential, but I'm curious about the installation command:
| curl https://clickhouse.com/ | sh
|
| It seems like it just detects curl('s user agent?) and serves up
| the script, serving HTML otherwise (ex. `wget -O- ...` would
| print HTML). I wonder what the design decisions here were.
| [deleted]
| goodpoint wrote:
| That's really sketchy.
|
| https://www.idontplaydarts.com/2016/04/detecting-curl-pipe-b...
| afandian wrote:
|     curl -vvv https://clickhouse.com/
|     ...
|     > user-agent: curl/7.79.1
| notamy wrote:
| Yes, that's obvious. The question is more WHY -- why serve
| curl detection from the root domain? Why only detect curl?
| ...
| slt2021 wrote:
| easier to remember download link?
| hk__2 wrote:
| There's nothing in that link that says "I'm a download
| link" and I don't think you would remember that the site
| you have open in your browser also serves as the download
| link if you curl it with the right user-agent.
| zX41ZdbW wrote:
| Yes, it detects curl by the user agent, using "page rules" in
| Cloudflare. Cloudflare is good for traffic-management rules.
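|
| A quick way to see this for yourself (illustrative; the exact
| responses depend on the Cloudflare rule):
|
|     curl -s https://clickhouse.com/ | head -n 1
|     # -> the install script, because the user agent is curl
|
|     curl -s -A "Mozilla/5.0" https://clickhouse.com/ | head -n 1
|     # -> HTML, because the user agent looks like a browser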
|
| I created this distribution channel as an experiment and for my
| own needs - I use it to download and install ClickHouse without a
| package manager, so I don't need to run "apt update" to install
| ClickHouse. It works even if the package manager is broken, and
| it works well on very old servers and rare Linux distributions.
|
| It uses a fresh build from the master branch. I have also created
| a website for my team to monitor the status of the continuous
| integration system: https://aretestsgreenyet.com/
|
| We also have deb, rpm, tgz and Docker as wrappers for the
| release builds.
| notamy wrote:
| > I've created this distribution channel as an experiment and
| for my own need - I use it to download and install ClickHouse
| without the need of package manager. So I don't need to run
| "apt update" to install ClickHouse. It works if the package
| manager is broken, and it works well on very old servers, and
| rare Linux distributions.
|
| Makes perfect sense! Thanks for the explanation :D
| CoolCold wrote:
| Interesting - for me, a broken package manager is an indicator
| that the server needs to be fixed or reprovisioned. I'd be
| surprised to see a DB running on that kind of server - it's
| highly likely missing updates as well
| swyx wrote:
| hmmmm interesting. clickhouse becoming duckdb while duckdb
| becoming clickhouse.
| [deleted]
| maxbond wrote:
| I believe that's Newton's First Law of DBMSs
| RyanHamilton wrote:
| Next they become an email system.
| winrid wrote:
| For those that don't know, this is referring to MotherDuck
| which was recently announced: https://motherduck.com/
| ignoramous wrote:
| nb: The engineers and academics leading the development of DuckDB
| don't work for MotherDuck (yet).
| lazzlazzlazz wrote:
| What's the evidence for superior performance? The post submitter,
| zX41ZdbW, has made uncharitable comments about DuckDB in previous
| threads.
| [deleted]
| zX41ZdbW wrote:
| Here is one example:
| https://colab.research.google.com/github/dcmoura/spyql/blob/...
| Note: DuckDB is not present in this comparison, because it does
| not support JSON processing (only CSV and Parquet). ClickHouse
| supports everything.
|
| Here is another example:
| https://datastudio.google.com/u/0/reporting/6a2c38d4-3a22-41...
| - startup times for serverless engines. Note: DuckDB is not
| present in this comparison, I'm not sure why.
|
| Both comparisons are independent.
|
| I know a few cases where clickhouse-local is worse than DuckDB:
|
| - it does not use the embedded metadata to filter while
| processing Parquet files;
|
| - the syntax for accessing files in s3 is clunkier in
| clickhouse-local (see the sketch after this list);
|
| - finally, there is no Python module or integration with
| dataframes.
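|
| As a rough illustration of the s3 point (bucket, credentials, and
| file names are hypothetical; the DuckDB line is shown only for
| contrast):
|
|     # clickhouse-local: the s3() table function takes the URL,
|     # credentials, and format explicitly
|     clickhouse-local --query "
|         SELECT count()
|         FROM s3('https://mybucket.s3.amazonaws.com/data/*.parquet',
|                 'AWS_KEY_ID', 'AWS_SECRET_KEY', 'Parquet')"
|
|     # DuckDB, for comparison, reads a quoted path:
|     #   SELECT count(*) FROM 's3://mybucket/data/*.parquet'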
|
| I know many cases where clickhouse-local is better than DuckDB.
| Performance is mostly better, because ClickHouse has a more
| advanced query engine. DuckDB is mostly keeping up, and in some
| scenarios is already ahead. Query language, data type support,
| feature completeness, stability, and testing are much better in
| ClickHouse.
|
| I did not make uncharitable comments about DuckDB. We recently
| met with Hannes Muhleisen and the team from DuckDB Labs, and I
| have a very good impression of the technology and the team. I see
| them as our friends. I'm also enthusiastic about every data
| processing technology.
| vgt wrote:
| I'd love to see evidence for some of these claims, like
| stability and testing. Would you also be able to comment on
| correctness?
|
| (co-founder at MotherDuck)
| zX41ZdbW wrote:
| This summer I was preparing ClickBench:
| https://benchmark.clickhouse.com/
|
| When I tried to use DuckDB on the same dataset as
| ClickHouse, it simply did not work due to OOM:
| https://github.com/duckdb/duckdb/issues/3969
|
| I also told them about our experience with various memory
| allocators, and why you should never use glibc's malloc.
|
| This issue has since been fixed.
| vgt wrote:
| Thanks, that sounds like relative immaturity.
|
| I didn't catch you addressing correctness?
|
| FWIW, I have some experience with ClickHouse, as I ran product
| at Firebolt and played a critical role in making them more
| transparent about their foundations and giving credit where it
| is due. That first-hand experience with ClickHouse was jarring,
| considering my previous experience was with BigQuery.
| zX41ZdbW wrote:
| Recently I tried to load another dataset into DuckDB, for my own
| entertainment:
| https://github.com/ClickHouse/ClickHouse/issues/42363
|
| But it did not work (the server became unresponsive after
| consuming all memory).
| vgt wrote:
| From my practical experience with ClickHouse, I saw OOMs
| as well, on load and on query. How confident are you that
| your chosen dataset is neutral?
| zX41ZdbW wrote:
| ClickHouse will throw an exception when there is not enough
| memory and continue to serve other queries. Under certain
| configurations it can OOM as well.
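|
| For example, with an explicit memory limit a heavy query fails
| with a MEMORY_LIMIT_EXCEEDED error instead of taking the process
| down (the limit value, file, and column here are illustrative):
|
|     clickhouse-local --max_memory_usage 1000000000 \
|         --query "SELECT URL, count() FROM file('hits.csv', CSVWithNames)
|                  GROUP BY URL ORDER BY count() DESC"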
|
| > How confident are you that your chosen dataset is
| neutral?
|
| I have no idea if it is "neutral", I picked it randomly.
|
| I test ClickHouse on every interesting dataset, see here:
| https://github.com/ClickHouse/ClickHouse/issues?q=is%3Ais
| sue...
|
| The reason: I love working with data :) If I see a dataset, I
| load it into ClickHouse - this is the first thing I do. This is
| not marketing or promotion of ClickHouse - if it were a directed
| task, it would be uninteresting to me.
| glogla wrote:
| It is interesting to see how the ClickHouse webpage has changed
| lately. If I didn't know there was an open-source version, I
| would not find it.
|
| You have to go to the _Product_ menu to see _Open source
| ClickHouse_, or scroll several screens to find the low-contrast
| _Open source quickstart_ button, right next to the high-contrast
| _Try Cloud now_ button. That button takes you to a tutorial
| section of the documentation, where you have to scroll past an
| intro video on getting started with ClickHouse Cloud, and then
| switch from the _ClickHouse Cloud_ tab to the _Self-Managed_ tab
| to get to how to actually use it.
|
| It's like the non-hosted version is a secret!
|
| I understand everybody wants to make money, and people took money
| from VCs and want results, etc. But I remember a time when the
| webpage was just docs and that was it, and this leaves a bit of a
| sad taste in my mouth.
| zX41ZdbW wrote:
| This is not what we want. I admit this problem. Changes will be
| made.
|
| Just for amusement, here is the 2016 website version:
| https://presentations.clickhouse.com/original_website/
| winrid wrote:
| Will this be available with APIs so it can be used embedded, like
| DuckDB?
|
| All the examples I see use the CLI.
| zX41ZdbW wrote:
| It is not available as a library, only as a CLI.
| winrid wrote:
| Ah, okay. Won't work for my use case. Still cool though!
| [deleted]
| vgt wrote:
| This is, admittedly, the most unusual use of "serverless" I've
| ever heard of
| varunkmohan wrote:
| Any plans to run this directly in the browser, similar to DuckDB?
| Nice to see more options in the space.
| cube2222 wrote:
| Congrats on the Show HN!
|
| It's great to see more tools in this area (querying data from
| various sources in-place) and the Lambda use case is a really
| cool idea!
|
| I've recently done a bunch of benchmarking including ClickHouse
| Local, and the usage was straightforward, with everything working
| as it's supposed to.
|
| Just to comment on the performance avenue though, one area I
| think ClickHouse could still possibly improve on - vs OctoSQL[0]
| at least - is that it seems like the JSON datasource is slower,
| especially if only a small part of the JSON objects is used. If
| only a single field of many is used, OctoSQL lazily parses only
| that field, and skips the others, which yields non-trivial
| performance gains on big JSON files with small queries.
|
| Basically, for a query like `SELECT COUNT(*), AVG(overall) FROM
| books.json` with the Amazon Review Dataset (10GB), OctoSQL is
| twice as fast (3s vs 6s). That's a minor thing though (OctoSQL
| will slow down for more complicated queries, while for ClickHouse
| decoding the input is and remains the bottleneck, with the
| processing itself being ridiculously fast).
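|
| For reference, roughly equivalent invocations for the query above
| (flags and format names may differ between versions):
|
|     octosql "SELECT COUNT(*), AVG(overall) FROM books.json"
|
|     clickhouse-local --query "SELECT count(), avg(overall)
|                               FROM file('books.json', JSONEachRow)"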
|
| Godspeed with the future evolution of ClickHouse!
|
| [0]: https://github.com/cube2222/octosql
| justinsaccount wrote:
| There's newer JSON support in ClickHouse where you can load
| arbitrary JSON objects into a JSON column type:
|
| https://clickhouse.com/blog/getting-data-into-clickhouse-par...
|
| Behind the scenes it flattens things into real columns with
| auto-detected types.
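|
| A sketch of what that looks like (the JSON type was experimental
| at the time, so the setting is required and exact flags may vary;
| the data and column names are made up):
|
|     clickhouse-local --multiquery --query "
|         SET allow_experimental_object_type = 1;
|         CREATE TABLE reviews (doc JSON) ENGINE = Memory;
|         INSERT INTO reviews VALUES
|             ('{\"overall\": 5, \"style\": {\"format\": \"Hardcover\"}}');
|         SELECT doc.overall, doc.style.format FROM reviews;"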
|
| I suppose if you planned on running many queries that would be
| faster, but I bet octosql could also convert it to something
| like parquet for similar gains.
|
| There are also some tricks you can do with JSON filtering if
| strings are involved: if you have a query like "select path from
| http_logs where hostname = 'news.ycom...'", you can read a whole
| chunk of data, do a `memmem` for the string, and if there are no
| matches, not even bother decoding ANY of the JSON.
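|
| The same idea can be approximated on the command line with a
| cheap substring pre-filter before any JSON parsing (the file,
| field, and hostname here are hypothetical):
|
|     grep -F 'example.com' http_logs.json \
|       | clickhouse-local --input-format JSONEachRow \
|             --input_format_skip_unknown_fields 1 \
|             --structure 'hostname String, path String' \
|             --query "SELECT path FROM table WHERE hostname = 'example.com'"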
___________________________________________________________________
(page generated 2023-01-05 23:00 UTC)