[HN Gopher] ClickHouse gets lazier and faster: Introducing lazy ...
___________________________________________________________________
ClickHouse gets lazier and faster: Introducing lazy materialization
Author : tbragin
Score : 195 points
Date : 2025-04-22 16:03 UTC (6 hours ago)
(HTM) web link (clickhouse.com)
(TXT) w3m dump (clickhouse.com)
| simonw wrote:
| Unrelated to the new materialization option, this caught my eye:
|
| "this query sorts all 150 million values in the helpful_votes
| column (which isn't part of the table's sort key) and returns the
| top 3, in just 70 milliseconds cold (with the OS filesystem cache
| cleared beforehand) and a processing throughput of 2.15 billion
| rows/s"
|
| I clearly need to update my mental model of what might be a slow
| query against modern hardware and software. Looks like that's so
| fast because in a columnar database it only has to load that 150
| million value column. I guess sorting 150 million integers in
| 70ms shouldn't be surprising.
|
| (Also "Peak memory usage: 3.59 MiB" for that? Nice.)
|
| This is a really great article - very clearly explained, good
| diagrams, I learned a bunch from it.
| amluto wrote:
| > I guess sorting 150 million integers in 70ms shouldn't be
| surprising.
|
| I find sorting 150M integers at all to be surprising. The query
| asks for finding the top 3 elements and returning those
| elements, sorted. This can be done trivially by keeping the
| best three found so far and scanning the list. This should
| operate at nearly the speed of memory and use effectively zero
| additional storage. I don't know whether Clickhouse does this
| optimization, but I didn't see it mentioned.
|
| Generically, one can find the kth best of n elements in time
| O(n):
|
| https://en.m.wikipedia.org/wiki/Selection_algorithm
|
| And one can scan again to find the top k, plus some extra if
| the kth best wasn't unique, but that issue is manageable and, I
| think, adds at most a factor of 2 overhead if one is careful
| (collect up to k elements that compare equal to the kth best
| and collect up to k that are better than it). Total complexity
| is O(n) if you don't need the result sorted or O(n + k log k)
| if you do.
|
| If you're not allowed to mutate the input (which probably
| applies to Clickhouse-style massive streaming reads), you can
| collect the top k in a separate data structure, and
| straightforward implementations are O(n log k). I wouldn't be
| surprised if using a fancy heap or taking advantage of the data
| being integers with smallish numbers of bits does better, but I
| haven't tried to find a solution or disprove the existence of
| one.
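|
| A minimal sketch of that "keep the best three found so far"
| scan, in Python for illustration (not ClickHouse's actual
| implementation):
|
|     def top_k_scan(values, k=3):
|         """Single pass: keep the k largest values seen so far."""
|         best = []  # at most k elements, kept sorted descending
|         for v in values:
|             if len(best) < k:
|                 best.append(v)
|                 best.sort(reverse=True)
|             elif v > best[-1]:
|                 # v beats the current k-th best: replace it and re-sort
|                 # the tiny buffer (cheap, since it has only k entries).
|                 best[-1] = v
|                 best.sort(reverse=True)
|         return best
|
|     print(top_k_scan([5, 1, 9, 3, 9, 7, 2]))  # [9, 9, 7]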
| Akronymus wrote:
| > This can be done trivially by keeping the best three found
| so far and scanning the list.
|
| That doesn't seem to guarantee correctness. If you don't
| track all of the unique values, at least, you could be
| throwing away one of the most common values.
|
| The wiki entry seems to be specifically about the smallest,
| rather than largest values.
| recursive wrote:
| What? The algorithm is completely symmetrical with respect
| to smallest or largest, and fully correct and general. I
| don't understand the problem with unique values. Could you
| provide a minimal input demonstrating the issue?
| Akronymus wrote:
| I can't, because I completely misread the wiki article
| before commenting and have now read it more carefully and
| realized I was wrong. Specifically, I went in thinking
| about the top 3 most common values.
| datadrivenangel wrote:
| With an equality that returns true/false, this guarantees
| correctness. If there can be 3 best/biggest/smallest
| values, this technique works.
| senderista wrote:
| The max-heap algorithm alluded to above is correct. You
| fill it with the first k values scanned, then peek at the
| max element for each subsequent value. If the current value
| is smaller than the max element, you evict the max element
| and insert the new element. This streaming top-k algorithm
| is ubiquitous in both leetcode interviews and applications.
| (The standard quickselect top-k algorithm is not useful in
| the streaming context because it requires random access and
| in-place mutation.)
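|
| A compact Python sketch of that bounded-heap idea, oriented
| for the k largest (as in the top-votes query), so the heap
| root is the smallest of the current top k:
|
|     import heapq
|
|     def top_k_stream(values, k):
|         """Streaming top-k largest: O(n log k) time, O(k) space."""
|         heap = []  # min-heap; root = weakest of the current top k
|         for v in values:
|             if len(heap) < k:
|                 heapq.heappush(heap, v)
|             elif v > heap[0]:
|                 # v beats the weakest of the top k: evict and insert.
|                 heapq.heapreplace(heap, v)
|         return sorted(heap, reverse=True)
|
|     print(top_k_stream([5, 1, 9, 3, 9, 7, 2], 3))  # [9, 9, 7]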
| Akronymus wrote:
| My failure was misreading it as most common k rather than
| max k.
| senderista wrote:
| Most common k is super-interesting because it can't be
| solved in one pass in constant space!
|
| https://en.wikipedia.org/wiki/Streaming_algorithm#Frequen
| t_e...
| 3np wrote:
| Why is that interesting? Intuitively a worst-case could
| be a stream of n-1 unique elements out of n with the
| duplicate at the end, so there is no way around O(n)
| space. Any element could be the most common so you must
| keep them all.
| senderista wrote:
| Sure, a similar trivial argument applies to the linear-
| space lower bound for set membership. But these linear
| lower bounds motivate the search for approximate
| techniques with sublinear lower bounds (although bloom
| filters or fingerprint tables are not actually
| sublinear).
| amluto wrote:
| To be fair to quickselect, I can imagine a lazy data
| processing framework having a concept of a lazily sorted
| data column where the actual data has been materialized
| but it's not in sorted order yet. Then someone does
| "LIMIT k" to it, and the framework can go to town with
| quickselect.
|
| As noted a couple times in this thread, there are all
| kinds of tradeoffs here, and I can't imagine quickselect
| being even close to competitive for k that is small
| enough to fit in cache. Quickselect will, in general,
| scan a large input approximately twice. For k = 3, the
| answer fits in general-purpose registers or even in a
| single SIMD register, and a single scan with brute force
| accumulation of the answer will beat quickselect handily
| and will also beat any sort of log-time heap.
|
| (In general, more advanced and asymptotically better
| algorithms often lose to simpler brute force algorithms
| when the parameters in question are smallish.)
| senderista wrote:
| Yeah, obviously I wouldn't bother with a heap for k=3. A
| heap has good compactness but poor locality, so I guess
| it wouldn't perform well out of (some level of) cache.
| eru wrote:
| So quickselect needs multiple passes, and the heap needs
| O(n log k) time to find the top k elements of n elements
| total.
|
| However, you can find the top k elements in O(n) time and
| O(k) space in a single pass.
|
| One simple way: you keep a buffer of up to 2*k elements.
| You scan your stream of n items one by one. Whenever your
| buffer gets full, you pare it back down to k elements
| with your favourite selection algorithm (like
| quickselect).
|
| As a minor optimisation, you only add items to your buffer
| if they improve on the worst element already in it (or
| when you haven't collected k elements yet).
|
| As an empirical question, you can also experiment with
| the size of the buffer. Theoretically any multiple of k
| will do (even 1.1*k or so), but in practice they give you
| different constant factors for space and time.
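|
| A rough Python sketch of that prune-when-full buffer, with
| heapq.nlargest standing in for a real selection algorithm
| such as quickselect:
|
|     import heapq
|
|     def top_k_buffered(stream, k):
|         """Single pass, O(k) extra space: buffer up to 2*k items,
|         pruning back to the k largest whenever the buffer fills."""
|         buf = []
|         for v in stream:
|             buf.append(v)
|             if len(buf) >= 2 * k:
|                 buf = heapq.nlargest(k, buf)  # prune to the k largest
|         return sorted(buf, reverse=True)[:k]
|
|     print(top_k_buffered([5, 1, 9, 3, 9, 7, 2], 3))  # [9, 9, 7]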
| senderista wrote:
| How do you efficiently track the "worst element" without
| something like a max-heap? But yeah, this is a fun
| algorithm. I think I've seen it before but can't place
| it, do you remember where you came across it?
| simonw wrote:
| Maybe they do have that optimization and that explains the
| 3.59 MiB peak memory usage for ~600MB of integers.
| danlark1 wrote:
| I am the author of the optimization for partial sorting and
| selection in ClickHouse. It uses the Floyd-Rivest algorithm;
| we tried a lot of different things at the time, see [1].
|
| Overall, ClickHouse reads blocks of fixed sizes (64k), finds
| the top elements within each block, and then takes the top
| of those tops until it converges.
|
| [1] https://danlark.org/2020/11/11/miniselect-practical-and-
| gene...
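|
| Roughly, the shape of that block-wise approach looks like the
| Python toy below (the real implementation uses Floyd-Rivest-
| based selection rather than heapq, and streams the blocks):
|
|     import heapq
|
|     def block_top_k(rows, k, block_size=65536):
|         """Take the top k of each fixed-size block, then the top k
|         of those per-block winners (the 'top of the top')."""
|         candidates = []
|         for start in range(0, len(rows), block_size):
|             block = rows[start:start + block_size]
|             candidates.extend(heapq.nlargest(k, block))
|         return heapq.nlargest(k, candidates)
|
|     print(block_top_k(list(range(1_000_000)), 3))
|     # [999999, 999998, 999997]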
| baq wrote:
| Slow VMs on overprovisioned cloud hosts which cost as much per
| month as a dedicated box per year have broken a generation of
| engineers.
|
| You could host _so much_ from your MacBook. The average HN
| startup could be hosted on a $200 mini PC in a closet for the
| first couple of years, if not more - and I'm talking expensive
| here, for the extra RAM you want so you don't have to restart
| every hour when you have a memory leak.
| sofixa wrote:
| Raw compute wise, you're almost right (almost because real
| cloud hosts aren't overprovisioned, you get the full
| CPU/memory/disk reserved for you).
|
| But you actually need more than compute. You might need a
| database, cache, message broker, scheduler, email sending,
| and a million other things you can always DIY with FOSS
| software, but that takes time. If you have more money than
| time, get off-the-shelf services that provide those with
| guarantees and maintenance; if not, the DIY route is also
| great for learning.
| baq wrote:
| My point is that all of this can be hosted on a single
| bare-metal box, and a small one at that! We used to do just
| that back in the mid-naughts, and computers have only gotten
| faster. Half of those cloud services are preconfigured FOSS
| derivatives behind the scenes anyway (probably...)
| rfoo wrote:
| > so much from your macbook
|
| At least on cloud I can actually have hundreds of GiBs of
| RAM. If I want this on my Macbook it's even more expensive
| than my cloud bill.
| baq wrote:
| You can, but if you need it you're not searching for a
| product market fit anymore.
| federiconafria wrote:
| Not only that, you have a pile of layers that could be
| advantageous in some situations but are overkill in most.
|
| I've seen Spark clusters being replaced by a single container
| using less than 1 CPU core and a few hundred MB of RAM.
| ramraj07 wrote:
| I don't see how that's the root cause. ClickHouse and
| Snowflake run on your so-called slow VMs on overprovisioned
| cloud hosts and they're efficient as hell. It's all about
| your optimizations.
|
| The real problem is that most engineers don't understand the
| degree of overprovisioning they do for code that's simple
| and doing stupid things, using an inefficient 4th order
| language on top of 5 different useless (imo) abstractions.
| tmoertel wrote:
| This optimization _should_ provide dramatic speed-ups when taking
| random samples from massive data sets, especially when the wanted
| columns can contain large values. That's because the basic SQL
| recipe relies on a LIMIT clause to determine which rows are in
| the sample (see query below), and this new optimization promises
| to defer reading the big columns until the LIMIT clause has
| filtered the data set down to a tiny number of lucky rows.
          SELECT *
          FROM Population
          WHERE weight > 0
          ORDER BY -LN(1.0 - RANDOM()) / weight
          LIMIT 100  -- Sample size.
|
| Can anyone from ClickHouse verify that the lazy-materialization
| optimization speeds up queries like this one? (I want to make
| sure the randomization in the ORDER BY clause doesn't prevent the
| optimization.)
| tschreiber wrote:
| Verified:
|
|     EXPLAIN plan actions = 1
|     SELECT *
|     FROM amazon.amazon_reviews
|     WHERE helpful_votes > 0
|     ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
|     LIMIT 3
|
| Lazily read columns: review_body, review_headline,
| verified_purchase, vine, total_votes, marketplace, star_rating,
| product_category, customer_id, product_title, product_id,
| product_parent, review_date, review_id
|
| Note that there is a setting
| query_plan_max_limit_for_lazy_materialization (default value
| 10) that controls the maximum n for which lazy materialization
| kicks in for LIMIT n.
| tmoertel wrote:
| Awesome! Thanks for checking :-)
| zX41ZdbW wrote:
| I checked, and yes - it works:
| https://pastila.nl/?002a2e01/31807bae7e114ca343577d263be7845...
| tmoertel wrote:
| Thanks! That's a nice 5x improvement. Pretty good for a query
| that offers only modest opportunity, given that the few
| columns it asks for are fairly small (`title` being the
| largest, which isn't that large).
| ethan_smith wrote:
| The optimization should work well for your sampling query since
| the ORDER BY and LIMIT operations would happen before
| materializing the large columns, but the randomization function
| might force early evaluation - worth benchmarking both
| approaches.
| simianwords wrote:
| Maybe I'm too inexperienced in this field but reading the
| mechanism I think this would be an obvious optimisation. Is it
| not?
|
| But credit where it is due, obviously clickhouse is an industry
| leader.
| ahofmann wrote:
| Obvious solutions are often hard to do right. I bet the code
| that was needed to pull this off is either very complex or took
| a long time to write (and test). Or both.
| ryanworl wrote:
| This is a well-known class of optimization and the literature
| term is "late materialization". It is a large set of strategies
| including this one. Late materialization is about as old as
| column stores themselves.
| jurgenkesker wrote:
| I really like Clickhouse. Discovered it recently, and man, it's
| such a breath of fresh air compared to suboptimal solutions I
| used for analytics. It's so fast and the CLI is also a joy to
| work with.
| EvanAnderson wrote:
| Same here. I come from a strong Postgres and Microsoft SQL
| Server background and I was able to get up to speed with it,
| ingesting real data from text files, in an afternoon. I was
| really impressed with the docs as well as the performance of
| the software.
| theLiminator wrote:
| How does it compare to duckdb and/or polars?
| thenaturalist wrote:
| This is very much an active space, so the half-life of
| in-depth analyses is limited, but one of the best write-ups,
| from about 1.5 years ago, is this one:
| https://bicortex.com/duckdb-vs-clickhouse-performance-compar...
| ohnoesjmr wrote:
| Wonder how well this propagates down to subqueries/CTE's
| meta_ai_x wrote:
| can we take the "packing your luggage" analogy and only pack the
| things we actually use in the trip and apply that to clickhouse?
| Onavo wrote:
| Reminder that ClickHouse can optionally be embedded; you don't
| need to reach for Duck just because of hype (it's been buggy as
| hell every time I tried it).
|
| https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-...
| sirfz wrote:
| Chdb is awesome but so is duckdb
| vjerancrnjak wrote:
| It's quite amazing how a db like this shows that all of those
| row-based dbs are doing something wrong; they can't even approach
| these speeds with B-tree index structures. I know they care about
| transactions more than ClickHouse does, but it's just amazing to
| see how fast modern machines are, billions of rows per second.
|
| I'm pretty sure they did not even bother to properly compress the
| dataset; with some tweaking it could probably have been much
| smaller than 30 GB. The speed shows that reading the data is
| slower than decompressing it.
|
| Reminds me of that Cloudflare article where they had a similar
| idea about encryption being free (slower to read than to decrypt)
| and found a bug that, when fixed, made this the actual behavior.
|
| The compute engine (chdb) is a wonder to use.
| apavlo wrote:
| > It's quite amazing how a db like this shows that all of those
| row-based dbs are doing something wrong
|
| They're not "doing something wrong". They are designed
| differently for different target workloads.
|
| Row-based -> OLTP -> "Fetch the entire records from order table
| where user_id = XYZ"
|
| Column-based -> OLAP -> "Compute the total amount of orders
| from the order table grouped by month/year"
| mmsimanga wrote:
| IMHO if ClickHouse had a native Windows release that does not
| need WSL or a Linux virtual machine, it would be more popular
| than DuckDB. I remember MySQL being way more popular than
| PostgreSQL for years, one of the reasons being that MySQL had
| a Windows installer.
| dangoodmanUT wrote:
| God, ClickHouse is such great software; if only it was as
| ergonomic as DuckDB, and management wasn't doing some
| questionable things (deleting references to competitors in GH
| issues, weird legal letters, etc.)
|
| The CH contributors are really stellar, from multiple companies
| (Altinity, Tinybird, Cloudflare, ClickHouse)
| justmarc wrote:
| Clickhouse is a masterpiece of modern engineering with absolute
| attention to performance.
___________________________________________________________________
(page generated 2025-04-22 23:00 UTC)