[HN Gopher] ClickHouse gets lazier and faster: Introducing lazy ...
       ___________________________________________________________________
        
       ClickHouse gets lazier and faster: Introducing lazy materialization
        
       Author : tbragin
       Score  : 195 points
       Date   : 2025-04-22 16:03 UTC (6 hours ago)
        
 (HTM) web link (clickhouse.com)
 (TXT) w3m dump (clickhouse.com)
        
       | simonw wrote:
       | Unrelated to the new materialization option, this caught my eye:
       | 
       | "this query sorts all 150 million values in the helpful_votes
       | column (which isn't part of the table's sort key) and returns the
       | top 3, in just 70 milliseconds cold (with the OS filesystem cache
       | cleared beforehand) and a processing throughput of 2.15 billion
       | rows/s"
       | 
       | I clearly need to update my mental model of what might be a slow
       | query against modern hardware and software. Looks like that's so
       | fast because in a columnar database it only has to load that 150
       | million value column. I guess sorting 150 million integers in
       | 70ms shouldn't be surprising.
       | 
       | (Also "Peak memory usage: 3.59 MiB" for that? Nice.)
       | 
       | This is a really great article - very clearly explained, good
       | diagrams, I learned a bunch from it.
        
         | amluto wrote:
         | > I guess sorting 150 million integers in 70ms shouldn't be
         | surprising.
         | 
         | I find sorting 150M integers at all to be surprising. The query
         | asks for finding the top 3 elements and returning those
         | elements, sorted. This can be done trivially by keeping the
         | best three found so far and scanning the list. This should
         | operate at nearly the speed of memory and use effectively zero
         | additional storage. I don't know whether Clickhouse does this
         | optimization, but I didn't see it mentioned.
         | 
         | Generically, one can find the kth best of n elements in time
         | O(n):
         | 
         | https://en.m.wikipedia.org/wiki/Selection_algorithm
         | 
         | And one can scan again to find the top k, plus some extra if
         | the kth best wasn't unique, but that issue is manageable and, I
         | think, adds at most a factor of 2 overhead if one is careful
         | (collect up to k elements that compare equal to the kth best
         | and collect up to k that are better than it). Total complexity
         | is O(n) if you don't need the result sorted or O(n + k log k)
         | if you do.
         | 
         | If you're not allowed to mutate the input (which probably
         | applies to Clickhouse-style massive streaming reads), you can
         | collect the top k in a separate data structure, and
         | straightforward implementations are O(n log k). I wouldn't be
         | surprised if using a fancy heap or taking advantage of the data
         | being integers with smallish numbers of bits does better, but I
         | haven't tried to find a solution or disprove the existence of
         | one.
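          | 
          | Roughly what I mean, as a toy single-pass sketch in Python
          | (illustrative only, not necessarily what ClickHouse does):
          | 
          |     def top3(values):
          |         # Track the three largest values seen so far in
          |         # plain variables: one pass, effectively zero
          |         # extra storage. Assumes at least three values.
          |         a = b = c = float("-inf")  # invariant: a >= b >= c
          |         for v in values:
          |             if v > a:
          |                 a, b, c = v, a, b
          |             elif v > b:
          |                 b, c = v, b
          |             elif v > c:
          |                 c = v
          |         return [a, b, c]
          | 
          |     top3([5, 1, 9, 7, 3])  # -> [9, 7, 5]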
        
           | Akronymus wrote:
           | > This can be done trivially by keeping the best three found
           | so far and scanning the list.
           | 
            | That doesn't seem to guarantee correctness. If you don't
            | track at least all of the unique values, you could be
            | throwing away one of the most common values.
            | 
            | The wiki entry seems to be specifically about the smallest,
            | rather than the largest, values.
        
             | recursive wrote:
             | What? The algorithm is completely symmetrical with respect
             | to smallest or largest, and fully correct and general. I
             | don't understand the problem with unique values. Could you
             | provide a minimal input demonstrating the issue?
        
               | Akronymus wrote:
                | I can't, because I completely misread the wiki article
                | before commenting; having now read it more carefully, I
                | realize I was wrong. Specifically, I went in thinking
                | about the top 3 most common values.
        
             | datadrivenangel wrote:
             | With an equality that returns true/false, this guarantees
             | correctness. If there can be 3 best/biggest/smallest
             | values, this technique works.
        
             | senderista wrote:
             | The max-heap algorithm alluded to above is correct. You
             | fill it with the first k values scanned, then peek at the
             | max element for each subsequent value. If the current value
             | is smaller than the max element, you evict the max element
             | and insert the new element. This streaming top-k algorithm
             | is ubiquitous in both leetcode interviews and applications.
             | (The standard quickselect top-k algorithm is not useful in
             | the streaming context because it requires random access and
             | in-place mutation.)
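              | 
              | A small Python sketch of that streaming top-k, as an
              | illustration only (heapq is a min-heap, so the max-heap
              | is simulated by negating values):
              | 
              |     import heapq
              | 
              |     def k_smallest(stream, k):
              |         heap = []  # negated values; -heap[0] is the
              |                    # largest of the k smallest so far
              |         for v in stream:
              |             if len(heap) < k:
              |                 heapq.heappush(heap, -v)
              |             elif v < -heap[0]:
              |                 # v beats the current worst keeper:
              |                 # evict it and insert v
              |                 heapq.heapreplace(heap, -v)
              |         return sorted(-x for x in heap)
              | 
              | For the top-k largest (as in the article's query), drop
              | the negation and flip the comparison.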
        
               | Akronymus wrote:
               | My failure was misreading it as most common k rather than
               | max k.
        
               | senderista wrote:
               | Most common k is super-interesting because it can't be
               | solved in one pass in constant space!
               | 
               | https://en.wikipedia.org/wiki/Streaming_algorithm#Frequen
               | t_e...
        
               | 3np wrote:
               | Why is that interesting? Intuitively a worst-case could
               | be a stream of n-1 unique elements out of n with the
               | duplicate at the end, so there is no way around O(n)
               | space. Any element could be the most common so you must
               | keep them all.
        
               | senderista wrote:
               | Sure, a similar trivial argument applies to the linear-
               | space lower bound for set membership. But these linear
               | lower bounds motivate the search for approximate
               | techniques with sublinear lower bounds (although bloom
               | filters or fingerprint tables are not actually
               | sublinear).
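                | 
                | For example, the Misra-Gries summary from the
                | frequent-elements article linked above, sketched in
                | Python as an illustration: one pass, O(k) counters,
                | and any item occurring more than n/k times is
                | guaranteed to survive as a candidate (a second pass
                | is needed for exact counts):
                | 
                |     def misra_gries(stream, k):
                |         counters = {}  # at most k-1 tracked items
                |         for x in stream:
                |             if x in counters:
                |                 counters[x] += 1
                |             elif len(counters) < k - 1:
                |                 counters[x] = 1
                |             else:
                |                 # decrement everything; drop zeros
                |                 for key in list(counters):
                |                     counters[key] -= 1
                |                     if counters[key] == 0:
                |                         del counters[key]
                |         return counters  # counts undershoot by <= n/k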
        
               | amluto wrote:
               | To be fair to quickselect, I can imagine a lazy data
               | processing framework having a concept of a lazily sorted
               | data column where the actual data has been materialized
               | but it's not in sorted order yet. Then someone does
               | "LIMIT k" to it, and the framework can go to town with
               | quickselect.
               | 
               | As noted a couple times in this thread, there are all
               | kinds of tradeoffs here, and I can't imagine quickselect
               | being even close to competitive for k that is small
               | enough to fit in cache. Quickselect will, in general,
               | scan a large input approximately twice. For k = 3, the
               | answer fits in general-purpose registers or even in a
               | single SIMD register, and a single scan with brute force
               | accumulation of the answer will beat quickselect handily
               | and will also beat any sort of log-time heap.
               | 
               | (In general, more advanced and asymptotically better
               | algorithms often lose to simpler brute force algorithms
               | when the parameters in question are smallish.)
        
               | senderista wrote:
               | Yeah, obviously I wouldn't bother with a heap for k=3. A
               | heap has good compactness but poor locality, so I guess
               | it wouldn't perform well out of (some level of) cache.
        
               | eru wrote:
               | So quickselect needs multiple passes, and the heap needs
               | O(n log k) time to find the top k elements of n elements
               | total.
               | 
               | However, you can find the top k elements in O(n) time and
               | O(k) space in a single pass.
               | 
               | One simple way: you keep a buffer of up to 2*k elements.
               | You scan your stream of n items one by one. Whenever your
               | buffer gets full, you pare it back down to k elements
               | with your favourite selection algorithm (like
               | quickselect).
               | 
                | As a minor optimisation, you only add items to your
                | buffer if they improve on the worst element in your
                | buffer (or when you haven't hit k elements in your
                | buffer yet).
               | 
               | As an empirical question, you can also experiment with
               | the size of the buffer. Theoretically any multiple of k
               | will do (even 1.1*k or so), but in practice they give you
               | different constant factors for space and time.
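                | 
                | Something like this, as an untested sketch (a plain
                | sort stands in for the favourite selection algorithm,
                | which would keep each pare-down at O(k)):
                | 
                |     def top_k(stream, k):
                |         buf = []
                |         worst_kept = float("-inf")
                |         for v in stream:
                |             # minor optimisation: skip values that
                |             # cannot improve on the current top k
                |             if len(buf) >= k and v <= worst_kept:
                |                 continue
                |             buf.append(v)
                |             if len(buf) >= 2 * k:
                |                 # pare the buffer back down to k
                |                 buf = sorted(buf, reverse=True)[:k]
                |                 worst_kept = buf[-1]
                |         return sorted(buf, reverse=True)[:k]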
        
               | senderista wrote:
               | How do you efficiently track the "worst element" without
               | something like a max-heap? But yeah, this is a fun
                | algorithm. I think I've seen it before but can't place
                | it; do you remember where you came across it?
        
           | simonw wrote:
           | Maybe they do have that optimization and that explains the
           | 3.59 MiB peak memory usage for ~600MB of integers.
        
           | danlark1 wrote:
            | I am the author of the optimization of partial sorting and
            | selection in ClickHouse. It uses the Floyd-Rivest algorithm,
            | and we tried a lot of different things back at the time;
            | read [1].
            | 
            | Overall, ClickHouse reads blocks of a fixed size (64k rows),
            | finds the top elements of each block, and then takes the top
            | of those tops until it converges.
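            | 
            | In rough Python pseudocode, just to illustrate the shape
            | (not the actual C++; partial_top_k stands in for the
            | Floyd-Rivest-based selection in miniselect):
            | 
            |     def partial_top_k(block, k):
            |         # stand-in for the Floyd-Rivest selection:
            |         # return the k largest values of one block
            |         return sorted(block, reverse=True)[:k]
            | 
            |     def top_k_over_blocks(blocks, k):
            |         # select within each block, then select again
            |         # over the accumulated per-block winners
            |         candidates = []
            |         for block in blocks:
            |             candidates.extend(partial_top_k(block, k))
            |             # keep the candidate set bounded (threshold
            |             # chosen arbitrarily for this sketch)
            |             if len(candidates) > 64 * 1024:
            |                 candidates = partial_top_k(candidates, k)
            |         return partial_top_k(candidates, k)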
           | 
           | [1] https://danlark.org/2020/11/11/miniselect-practical-and-
           | gene...
        
         | baq wrote:
         | Slow VMs on overprovisioned cloud hosts which cost as much per
         | month as a dedicated box per year have broken a generation of
         | engineers.
         | 
          | You could host _so much_ from your MacBook. The average HN
          | startup could be hosted on a $200 mini PC in a closet for the
          | first couple of years, if not more - and I'm talking expensive
          | here, for the extra RAM you want so you don't have to restart
          | every hour when you have a memory leak.
        
           | sofixa wrote:
            | Raw-compute-wise, you're almost right (almost because real
            | cloud hosts aren't overprovisioned; you get the full
            | CPU/memory/disk reserved for you).
            | 
            | But you actually need more than compute. You might need a
            | database, cache, message broker, scheduler, a way to send
            | emails, and a million other things you can always DIY with
            | FOSS software, but that takes time. If you have more money
            | than time, get off-the-shelf services that provide those
            | with guarantees and maintenance; if not, the DIY route is
            | also great for learning.
        
             | baq wrote:
              | My point is all of this can be hosted on a single bare-
              | metal box, a small one at that! We used to do just that
              | back in the mid-naughts, and computers have only gotten
              | faster. Half of those cloud services are preconfigured
              | FOSS derivatives behind the scenes anyway (probably...)
        
           | rfoo wrote:
           | > so much from your macbook
           | 
            | At least in the cloud I can actually have hundreds of GiB of
            | RAM. If I want that on my MacBook, it's even more expensive
            | than my cloud bill.
        
             | baq wrote:
             | You can, but if you need it you're not searching for a
             | product market fit anymore.
        
           | federiconafria wrote:
           | Not only that, you have a pile of layers that could be
            | advantageous in some situations but are overkill in most.
            | 
            | I've seen Spark clusters replaced by a single container
            | using less than 1 CPU core and a few hundred MB of RAM.
        
           | ramraj07 wrote:
            | I don't see how that's the root cause. ClickHouse and
            | Snowflake run on your so-called slow VMs on overprovisioned
            | cloud hosts, and they're efficient as hell. It's all about
            | your optimizations.
            | 
            | The real problem is that most engineers don't understand the
            | degree of overprovisioning they do for code that's simple
            | and doing stupid things, using an inefficient 4th order
            | language on top of 5 different useless (IMO) abstractions.
        
       | tmoertel wrote:
       | This optimization _should_ provide dramatic speed-ups when taking
       | random samples from massive data sets, especially when the wanted
        | columns can contain large values. That's because the basic SQL
       | recipe relies on a LIMIT clause to determine which rows are in
       | the sample (see query below), and this new optimization promises
       | to defer reading the big columns until the LIMIT clause has
       | filtered the data set down to a tiny number of lucky rows.
        | 
        |     SELECT *
        |     FROM Population
        |     WHERE weight > 0
        |     ORDER BY -LN(1.0 - RANDOM()) / weight
        |     LIMIT 100  -- Sample size.
       | 
       | Can anyone from ClickHouse verify that the lazy-materialization
       | optimization speeds up queries like this one? (I want to make
       | sure the randomization in the ORDER BY clause doesn't prevent the
       | optimization.)
        
         | tschreiber wrote:
          | Verified:
          | 
          |     EXPLAIN plan actions = 1
          |     SELECT *
          |     FROM amazon.amazon_reviews
          |     WHERE helpful_votes > 0
          |     ORDER BY -log(1 - (rand() / 4294967296.0)) / helpful_votes
          |     LIMIT 3
         | 
         | Lazily read columns: review_body, review_headline,
         | verified_purchase, vine, total_votes, marketplace, star_rating,
         | product_category, customer_id, product_title, product_id,
         | product_parent, review_date, review_id
         | 
          | Note that there is a setting
          | query_plan_max_limit_for_lazy_materialization (default value
          | 10) that controls the maximum n for which lazy materialization
          | kicks in for LIMIT n.
        
           | tmoertel wrote:
           | Awesome! Thanks for checking :-)
        
         | zX41ZdbW wrote:
         | I checked, and yes - it works:
         | https://pastila.nl/?002a2e01/31807bae7e114ca343577d263be7845...
        
           | tmoertel wrote:
           | Thanks! That's a nice 5x improvement. Pretty good for a query
           | that offers only modest opportunity, given that the few
           | columns it asks for are fairly small (`title` being the
           | largest, which isn't that large).
        
         | ethan_smith wrote:
         | The optimization should work well for your sampling query since
         | the ORDER BY and LIMIT operations would happen before
         | materializing the large columns, but the randomization function
         | might force early evaluation - worth benchmarking both
         | approaches.
        
       | simianwords wrote:
        | Maybe I'm too inexperienced in this field, but reading the
        | mechanism, I think this would be an obvious optimisation. Is it
        | not?
        | 
        | But credit where it is due: obviously ClickHouse is an industry
        | leader.
        
         | ahofmann wrote:
         | Obvious solutions are often hard to do right. I bet the code
         | that was needed to pull this off is either very complex or took
         | a long time to write (and test). Or both.
        
         | ryanworl wrote:
         | This is a well-known class of optimization and the literature
         | term is "late materialization". It is a large set of strategies
         | including this one. Late materialization is about as old as
         | column stores themselves.
        
       | jurgenkesker wrote:
       | I really like Clickhouse. Discovered it recently, and man, it's
       | such a breath of fresh air compared to suboptimal solutions I
       | used for analytics. It's so fast and the CLI is also a joy to
       | work with.
        
         | EvanAnderson wrote:
         | Same here. I come from a strong Postgres and Microsoft SQL
         | Server background and I was able to get up to speed with it,
         | ingesting real data from text files, in an afternoon. I was
         | really impressed with the docs as well as the performance of
         | the software.
        
         | theLiminator wrote:
         | How does it compare to duckdb and/or polars?
        
           | thenaturalist wrote:
            | This is very much an active space, so the half-life of
            | in-depth analyses is limited, but one of the best write-ups,
            | from about 1.5 years ago, is this one:
            | https://bicortex.com/duckdb-
           | vs-clickhouse-performance-compar...
        
       | ohnoesjmr wrote:
        | Wonder how well this propagates down to subqueries/CTEs.
        
       | meta_ai_x wrote:
       | can we take the "packing your luggage" analogy and only pack the
       | things we actually use in the trip and apply that to clickhouse?
        
       | Onavo wrote:
        | Reminder: ClickHouse can optionally be embedded; you don't need
        | to reach for DuckDB just because of the hype (it's buggy as hell
        | every time I've tried it).
       | 
       | https://clickhouse.com/blog/chdb-embedded-clickhouse-rocket-...
        
         | sirfz wrote:
          | chDB is awesome, but so is DuckDB.
        
       | vjerancrnjak wrote:
        | It's quite amazing how a DB like this shows that all of those
        | row-based DBs are doing something wrong; they can't even approach
        | these speeds with B-tree index structures. I know they like
        | transactions more than ClickHouse does, but it's just amazing to
        | see how fast modern machines are: billions of rows per second.
        | 
        | I'm pretty sure they did not even bother to properly compress the
        | dataset; with some tweaking, it could probably have been much
        | smaller than 30 GB. The speed shows that reading the data is
        | slower than decompressing it.
        | 
        | Reminds me of that Cloudflare article where they had a similar
        | idea about encryption being free (reading is slower than
        | decrypting) and found a bug that, when fixed, made this behavior
        | materialize.
       | 
       | The compute engine (chdb) is a wonder to use.
        
         | apavlo wrote:
         | > It's quite amazing how a db like this shows that all of those
         | row-based dbs are doing something wrong
         | 
         | They're not "doing something wrong". They are designed
         | differently for different target workloads.
         | 
         | Row-based -> OLTP -> "Fetch the entire records from order table
         | where user_id = XYZ"
         | 
         | Column-based -> OLAP -> "Compute the total amount of orders
         | from the order table grouped by month/year"
        
       | mmsimanga wrote:
        | IMHO, if ClickHouse had a native Windows release that did not
        | need WSL or a Linux virtual machine, it would be more popular
        | than DuckDB. I remember MySQL being way more popular than
        | PostgreSQL for years, one of the reasons being that MySQL had a
        | Windows installer.
        
       | dangoodmanUT wrote:
        | God, ClickHouse is such great software; if only it were as
        | ergonomic as DuckDB, and management weren't doing some
        | questionable things (deleting references to competitors in GH
        | issues, weird legal letters, etc.).
       | 
       | The CH contributors are really stellar, from multiple companies
       | (Altinity, Tinybird, Cloudflare, ClickHouse)
        
       | justmarc wrote:
       | Clickhouse is a masterpiece of modern engineering with absolute
       | attention to performance.
        
       ___________________________________________________________________
       (page generated 2025-04-22 23:00 UTC)