[HN Gopher] Show HN: ClickHouse-local - a small tool for serverl...
       ___________________________________________________________________
        
       Show HN: ClickHouse-local - a small tool for serverless data
       analytics
        
        My friend Vitaly Ludvichenko and I ran an experiment to combine
        the ClickHouse server and client into a self-contained program
        that runs a database engine and processes data without a server:
        https://github.com/ClickHouse/ClickHouse/pull/150  Development has
        continued over the past 6 years, and clickhouse-local has become a
        Swiss-army knife for data processing. Think "ffmpeg" for datasets,
        and more.  It resembles textql, octosql, dsq, duckdb, trdsql, q,
        datafusion-cli, and spyql, but with better capabilities and
        performance.  Here is a tutorial:
        https://clickhouse.com/blog/extracting-converting-querying-l...
        "serverless" is meant in the same sense as here:
        https://www.sqlite.org/serverless.html and also in the more common
        sense - clickhouse-local can be packaged into AWS Lambda and serve
        queries on a per-request basis, as here:
        https://github.com/ClickHouse/ClickHouse/issues/43589
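         
        To give a feel for the workflow, here is a rough example of
        querying a file in place (the file and column names are made up
        for illustration):
         
          # count rows and aggregate a column straight from a CSV file,
          # with no server and no ETL step
          clickhouse-local --query "
              SELECT count(), avg(price)
              FROM file('trips.csv', 'CSVWithNames')"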
        
       Author : zX41ZdbW
       Score  : 108 points
       Date   : 2023-01-05 19:30 UTC (3 hours ago)
        
 (HTM) web link (clickhouse.com)
 (TXT) w3m dump (clickhouse.com)
        
       | notamy wrote:
       | Tangential, but I'm curious about the installation command:
       | curl https://clickhouse.com/ | sh
       | 
       | It seems like it just detects curl('s user agent?) and serves up
       | the script, serving HTML otherwise (ex. `wget -O- ...` would
       | print HTML). I wonder what the design decisions here were.
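        | 
        | For example, I'd expect something like this to show the
        | difference (I haven't verified the exact responses):
        | 
        |   # default curl user agent - should return the install script
        |   curl -fsSL https://clickhouse.com/ | head -n 3
        |   # spoof a browser user agent - should return the HTML page
        |   curl -fsSL -A 'Mozilla/5.0' https://clickhouse.com/ | head -n 3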
        
         | [deleted]
        
         | goodpoint wrote:
         | That's really sketchy.
         | 
         | https://www.idontplaydarts.com/2016/04/detecting-curl-pipe-b...
        
         | afandian wrote:
          | curl -vvv https://clickhouse.com/
          | ...
          | > user-agent: curl/7.79.1
        
           | notamy wrote:
           | Yes, that's obvious. The question is more WHY -- why serve
           | curl detection from the root domain? Why only detect curl?
           | ...
        
             | slt2021 wrote:
             | easier to remember download link?
        
               | hk__2 wrote:
               | There's nothing in that link that says "I'm a download
               | link" and I don't think you would remember that the site
               | you have open in your browser also serves as the download
               | link if you curl it with the right user-agent.
        
         | zX41ZdbW wrote:
          | Yes, it detects curl by the user agent, using "page rules" in
          | Cloudflare. Cloudflare is good for traffic management rules.
          | 
          | I've created this distribution channel as an experiment and for
          | my own needs - I use it to download and install ClickHouse
          | without needing a package manager, so I don't have to run
          | "apt update" to install ClickHouse. It works if the package
          | manager is broken, and it works well on very old servers and
          | rare Linux distributions.
         | 
          | It uses a fresh build from the master branch. I have also
         | created a website for my team to monitor the status of the
         | continuous integration system: https://aretestsgreenyet.com/
         | 
         | We also have deb, rpm, tgz and Docker as wrappers for the
         | release builds.
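          | 
          | Roughly, the whole flow looks like this (the script drops a
          | single "clickhouse" binary into the current directory, if I
          | recall the details correctly):
          | 
          |   curl https://clickhouse.com/ | sh
          |   ./clickhouse local --query "SELECT version()"
          |   sudo ./clickhouse install    # optional, system-wide install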
        
           | notamy wrote:
            | > I've created this distribution channel as an experiment
            | and for my own needs - I use it to download and install
            | ClickHouse without needing a package manager, so I don't
            | have to run "apt update" to install ClickHouse. It works if
            | the package manager is broken, and it works well on very old
            | servers and rare Linux distributions.
           | 
           | Makes perfect sense! Thanks for the explanation :D
        
           | CoolCold wrote:
            | Interesting - for me, a broken package manager is an
            | indicator that the server needs to be fixed or reprovisioned.
            | I'd be surprised to see a DB running on that kind of server -
            | it's highly likely missing updates as well.
        
       | swyx wrote:
       | hmmmm interesting. clickhouse becoming duckdb while duckdb
       | becoming clickhouse.
        
         | [deleted]
        
         | maxbond wrote:
         | I believe that's Newton's First Law of DBMSs
        
           | RyanHamilton wrote:
           | Next they become an email system.
        
         | winrid wrote:
         | For those that don't know, this is referring to MotherDuck
         | which was recently announced: https://motherduck.com/
        
           | ignoramous wrote:
           | nb: Engs and academics leading the development of DuckDB
           | don't work for MotherDuck (yet).
        
       | lazzlazzlazz wrote:
       | What's the evidence for superior performance? The post submitter,
       | zX41ZdbW, has made uncharitable comments about DuckDB in previous
       | threads.
        
         | [deleted]
        
         | zX41ZdbW wrote:
         | Here is one example:
         | https://colab.research.google.com/github/dcmoura/spyql/blob/...
         | Note: DuckDB is not present in this comparison, because it does
         | not support JSON processing (only CSV and Parquet). ClickHouse
         | supports everything.
         | 
         | Here is another example:
         | https://datastudio.google.com/u/0/reporting/6a2c38d4-3a22-41...
          | - startup times for serverless engines. Note: DuckDB is not
          | present in this comparison; I'm not sure why.
         | 
         | Both comparisons are independent.
         | 
          | I know a few cases where clickhouse-local is worse than DuckDB:
          | 
          | - it does not use the embedded metadata to filter while
          | processing Parquet files;
          | 
          | - the syntax for accessing files in S3 is clunkier in
          | clickhouse-local (see the sketch after this list);
          | 
          | - finally, there is no Python module and no integration with
          | dataframes.
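          | 
          | For illustration, a rough sketch of the S3 point (bucket name,
          | paths and credentials are made up, and the exact syntax may
          | have changed since I last checked):
          | 
          |   # clickhouse-local: the s3() table function takes the URL,
          |   # credentials and format explicitly
          |   clickhouse-local --query "
          |     SELECT count()
          |     FROM s3('https://my-bucket.s3.amazonaws.com/d/*.parquet',
          |             'KEY', 'SECRET', 'Parquet')"
          | 
          |   # DuckDB: closer to a plain path (after loading httpfs and
          |   # setting the credentials)
          |   duckdb -c "SELECT count(*) FROM 's3://my-bucket/d/*.parquet'"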
         | 
          | I know many cases where clickhouse-local is better than DuckDB.
          | Performance is mostly better, because ClickHouse has a more
          | advanced query engine. DuckDB is mostly keeping up, and in some
          | scenarios is already ahead. Query language, data type support,
          | feature completeness, stability and testing are much better in
          | ClickHouse.
          | 
          | I did not make uncharitable comments about DuckDB. We recently
          | met with Hannes Muhleisen and the team from DuckDB Labs, and I
          | have a very good impression of the technology and the team. I
          | see them as our friends. I'm also enthusiastic about every data
          | processing technology.
        
           | vgt wrote:
           | I'd love to see evidence for some of these claims, like
           | stability and testing. Would you also be able to comment on
           | correctness?
           | 
           | (co-founder at MotherDuck)
        
             | zX41ZdbW wrote:
             | This summer I was preparing the ClickBench:
             | https://benchmark.clickhouse.com/
             | 
             | When I tried to use DuckDB on the same dataset as
             | ClickHouse, it simply did not work due to OOM:
             | https://github.com/duckdb/duckdb/issues/3969
             | 
              | I also told them about our experience with various memory
              | allocators, and why you should never use glibc's malloc.
             | 
             | This issue was fixed.
        
               | vgt wrote:
               | Thanks, sounds like relative immaturity.
               | 
                | I didn't catch you addressing correctness, though?
                | 
                | FWIW I have some experience with Clickhouse: I ran
                | product at Firebolt and played a critical role in making
                | them more transparent about their foundations and giving
                | credit where it is due. That first-hand experience with
                | Clickhouse was jarring, considering my previous
                | experience was with BigQuery.
        
             | zX41ZdbW wrote:
              | Recently, for my own entertainment, I tried to load
              | another dataset into DuckDB:
              | https://github.com/ClickHouse/ClickHouse/issues/42363
             | 
             | But it did not work (the server became unresponsive after
             | consuming all memory).
        
               | vgt wrote:
               | From my practical experience with ClickHouse, I saw OOMs
               | as well, on load and on query. How confident are you that
               | your chosen dataset is neutral?
        
               | zX41ZdbW wrote:
                | ClickHouse will throw an exception when a query runs out
                | of memory and continue to serve other queries. Under
                | certain configurations it can OOM as well.
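                | 
                | To be clear, I mean the per-query memory limit. A rough
                | illustration, with an arbitrary 10 GB cap (exact option
                | spelling from memory):
                | 
                |   clickhouse-local --max_memory_usage=10000000000 --query \
                |     "SELECT count(DISTINCT number) FROM numbers(1000000000)"
                | 
                | If a query exceeds the limit, it fails with a
                | MEMORY_LIMIT_EXCEEDED exception instead of taking down
                | the whole process.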
               | 
               | > How confident are you that your chosen dataset is
               | neutral?
               | 
               | I have no idea if it is "neutral", I picked it randomly.
               | 
               | I test ClickHouse on every interesting dataset, see here:
               | https://github.com/ClickHouse/ClickHouse/issues?q=is%3Ais
               | sue...
               | 
                | The reason: I love working with data :) If I see a
                | dataset, loading it into ClickHouse is the first thing I
                | do. This is not marketing or promotion of ClickHouse -
                | you know, if it were some directed task, it would be
                | uninteresting for me.
        
       | glogla wrote:
        | It is interesting to see how the ClickHouse webpage has changed
        | lately. If I didn't know there is an open source version, I
        | would not find it.
        | 
        | You have to go to the _Product_ menu to see _Open source
        | ClickHouse_, or scroll several screens to find the low-contrast
        | _Open source quickstart_ button, right next to the high-contrast
        | _Try Cloud now_ button. The button takes you to a tutorial
        | section of the documentation, where you have to scroll past an
        | intro video on getting started with ClickHouse Cloud, and then
        | switch from the _ClickHouse Cloud_ tab to the _Self-Managed_ tab
        | to get to how to actually use it.
       | 
       | It's like the non-hosted version is a secret!
       | 
        | I understand everybody wants to make money, and people took money
        | from VCs and they want results, etc. But I remember a time when
        | the webpage was just docs and that was it, and it leaves a bit of
        | a sad taste in my mouth.
        
         | zX41ZdbW wrote:
         | This is not what we want. I admit this problem. Changes will be
         | made.
         | 
         | Just for amusement, here is the 2016 website version:
         | https://presentations.clickhouse.com/original_website/
        
       | winrid wrote:
        | Will this be available with APIs so it can be used embedded,
        | like DuckDB?
       | 
       | All the examples I see are using the CLI.
        
         | zX41ZdbW wrote:
         | It is not available as a library, only CLI.
        
           | winrid wrote:
           | Ah, okay. Won't work for my use case. Still cool though!
        
       | [deleted]
        
       | vgt wrote:
       | This is, admittedly, the most unusual use of "serverless" I've
       | ever heard of
        
       | varunkmohan wrote:
        | Any plans to run this directly in the browser, similar to
        | DuckDB? Nice to see more options in the space.
        
       | cube2222 wrote:
       | Congrats on the Show HN!
       | 
       | It's great to see more tools in this area (querying data from
       | various sources in-place) and the Lambda use case is a really
       | cool idea!
       | 
       | I've recently done a bunch of benchmarking including ClickHouse
       | Local, and the usage was straightforward, with everything working
       | as it's supposed to.
       | 
       | Just to comment on the performance avenue though, one area I
       | think ClickHouse could still possibly improve on - vs OctoSQL[0]
       | at least - is that it seems like the JSON datasource is slower,
       | especially if only a small part of the JSON objects is used. If
       | only a single field of many is used, OctoSQL lazily parses only
       | that field, and skips the others, which yields non-trivial
       | performance gains on big JSON files with small queries.
       | 
       | Basically, for a query like `SELECT COUNT(*), AVG(overall) FROM
       | books.json` with the Amazon Review Dataset (10GB), OctoSQL is
       | twice as fast (3s vs 6s). That's a minor thing though (OctoSQL
       | will slow down for more complicated queries, while for ClickHouse
       | decoding the input is and remains the bottleneck, with the
       | processing itself being ridiculously fast).
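        | 
        | For context, the invocations were roughly these (reconstructed
        | from memory, so the exact flags may be slightly off):
        | 
        |   octosql "SELECT COUNT(*), AVG(overall) FROM books.json"
        |   clickhouse-local --query "SELECT COUNT(*), AVG(overall)
        |     FROM file('books.json', 'JSONEachRow')"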
       | 
       | Godspeed with the future evolution of ClickHouse!
       | 
       | [0]: https://github.com/cube2222/octosql
        
         | justinsaccount wrote:
         | There's newer json support in clickhouse where you can load
         | arbitrary json objects into a JSON column type:
         | 
         | https://clickhouse.com/blog/getting-data-into-clickhouse-par...
         | 
         | Behind the scenes it flattens things into real columns with
         | auto detected types.
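          | 
          | Roughly like this, if I remember the syntax right (file and
          | table names are made up, and the type was still behind an
          | experimental setting last time I checked):
          | 
          |   clickhouse-local --multiquery --query "
          |     SET allow_experimental_object_type = 1;
          |     CREATE TABLE logs (data JSON) ENGINE = Memory;
          |     INSERT INTO logs
          |       SELECT * FROM file('logs.json', 'JSONAsObject');
          |     SELECT data.hostname, data.path FROM logs LIMIT 5"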
         | 
         | I suppose if you planned on running many queries that would be
         | faster, but I bet octosql could also convert it to something
         | like parquet for similar gains.
         | 
          | There are also some tricks you can do with JSON filtering if
          | strings are involved: if you have a query like "select path
          | from http_logs where hostname = 'news.ycom...'", you can read a
          | whole chunk of data, do a `memmem` for the string, and if there
          | are no matches, not even bother decoding ANY of the JSON.
        
       ___________________________________________________________________
       (page generated 2023-01-05 23:00 UTC)