[HN Gopher] What is ClickHouse how it compares to PostgreSQL and...
       ___________________________________________________________________
        
       What is ClickHouse how it compares to PostgreSQL and TimescaleDB
       for time series
        
       Author : LoriP
       Score  : 202 points
       Date   : 2021-10-21 15:31 UTC (7 hours ago)
        
 (HTM) web link (blog.timescale.com)
 (TXT) w3m dump (blog.timescale.com)
        
       | qaq wrote:
        | ClickHouse shines at scales that timescale has no hope of ever
        | supporting. Hence the choice of workloads in the test. Cloudflare
        | was ingesting 6,000,000 rows per second into a 36-node (dual
        | E5-2630) ClickHouse cluster back in 2018, which was something
        | like 20PB of data per year.
        
         | akulkarni wrote:
         | (TimescaleDB co-founder)
         | 
         | 6,000,000 rows inserted per second is great! And if you need
         | that for your workload, then you probably should choose
         | ClickHouse over TimescaleDB (well at least, for now ;-)
         | 
         | The reason we don't include that in the benchmark is that most
         | developers _don't need_ 6,000,000 rows inserted per second.
         | 
         | And also - that performance doesn't come for free, but requires
         | giving up a lot of things that most developers may need: e.g.,
         | no transactions, immutable tables (can't easily update / delete
         | data), SQL-like but not quite SQL query language, inefficient
         | joins, inefficient for point queries retrieving single rows by
         | their keys, etc. (We go into much more detail in the blog
         | post.)
         | 
         | So it comes down to the fundamental analogy used in the post:
          | Do you need a car (versatility) or a bulldozer (narrow
          | specialization)?
          | 
          | If the answer is that you need to support 6,000,000 rows
          | inserted per second, then by all means, choose the bulldozer.
         | 
         | > ClickHouse shines at scales that timescale has no hope of
         | ever supporting.
         | 
         | I'm not sure if this was a throwaway line, or if it was the
         | result of a detailed analysis of TimescaleDB's architecture,
         | but if you don't mind, I'll share this: with TimescaleDB multi-
         | node [0] we are getting close to that performance, and the
         | product keeps getting better.
         | 
         | [0] https://blog.timescale.com/blog/building-a-distributed-
         | time-...
        
         | ryanbooz wrote:
         | (post author)
         | 
          | Those are great, impressive numbers. We certainly don't claim
          | to be all things to all people, but the benchmark was run using
          | single instances, mostly because that is what most other
          | published benchmarks have done.
         | 
          | With a multi-node TimescaleDB cluster, ingest does literally
          | scale to millions of rows/second, and we have numerous users
          | achieving great numbers. One Fortune 100 company has a 25+ node
          | TimescaleDB cluster backing their network monitoring stack and
          | loves it for their use case.
         | 
         | At some point, when we can, I'm sure we'll start to do more
         | with multi-node benchmarking too to give some input to the
         | conversation.
        
           | akulkarni wrote:
           | Minor correction: It is actually a 40+ node cluster :-)
        
           | qaq wrote:
            | and then it will come down to the spec of the nodes, the
            | actual fields, and so on. Also, batch size obviously plays a
            | big role here, as CH is optimized for very large batch sizes
            | and the benchmark is not really using that kind of batch
            | size. BTW, I am not involved with CH, but any vendor
            | benchmarking their wares will always select params that make
            | their offering look good.
        
             | ryanbooz wrote:
              | Sure, these tests were not using really large batch sizes
              | because of the other benchmarks we were trying to replicate
              | (but with more detail). Honestly, for this single instance
              | setup, we saw improvement in CH when we went from (say) 5k
              | to 10k or 20k batches. But it was a few percentage points
              | at a time, not an order of magnitude. I'm sure things
              | change with a cluster setup too; that just wasn't the
              | focus of this post.
             | 
             | Interestingly, we were just testing a multi-node
             | TimescaleDB cluster the other day and found that 75k
             | rows/batch was the optimal size as nodes increased.
             | 
              | So you're completely correct. I tried to be very clear that
              | we were not intentionally "cooking the books", and there
              | are surely other optimizations we could have made. Most of
              | the suggestions so far, however, require further setup of
              | CH features that haven't been used in other benchmarks, so
              | we tried to over-communicate our strategy and process.
              | 
              | We also fully acknowledged in the post that a siloed
              | "insert", wait, then "query" test is not real world. But
              | it's the way TSBS has been used to date, and other DB
              | engines have come along and used the same methodology.
              | Maybe that process will change in time with other
              | contributions.
             | 
             | BTW, we'll discuss some of this next week during the live-
             | stream and the video will be available after.
        
       | mrwnmonm wrote:
       | Any plans to do this kind of detailed comparison with Druid too?
        
         | ryanbooz wrote:
         | There's a long list of DBs users would like to see. Druid is on
         | the list but probably not happening in the near-term without
         | some community help.
         | 
         | Remember, TSBS is open-source and we've had some great
         | contributions from many teams/databases. :-)
        
       | pradeepchhetri wrote:
        | How can I replicate the results of the benchmarks? I am
        | interested in looking at the CH table schema you used.
        
         | carlotasoto wrote:
         | (Timescale team member here)
         | 
         | We used the Time Series Benchmark Suite for all these tests
         | https://github.com/timescale/tsbs. Also, Ryan (post author)
         | will be giving all the config details in a Twitch stream
         | happening next Wednesday. We'll be uploading the video to
         | Youtube immediately afterwards too >>
         | 
         | twitch.tv/timescaledb youtube.com/timescaledb
        
         | ryanbooz wrote:
         | (Post author)
         | 
          | Howdy! All of the details about our TSBS settings are in the
         | performance section of the docs. Also, we'll be streaming a
         | sample benchmark of the two databases next Wednesday at 10AM
         | ET/4PM CET.
         | 
         | https://blog.timescale.com/blog/what-is-clickhouse-how-does-...
         | 
         | twitch.tv/timescaledb
        
           | pradeepchhetri wrote:
            | A few comments:
            | 
            | - The CH table schema generated by TSBS isn't optimized for
            | the queries. First of all, it doesn't use CODECs
            | (https://altinity.com/blog/2019/7/new-encodings-to-improve-
            | cl...) or many other optimizations CH has.
            | 
            | > We tried multiple batch sizes and found that in most cases
            | there was little difference in overall insert efficiency
            | 
            | This is wrong in the CH world, where batch size matters a
            | lot. I would recommend making it even higher, around 10x the
            | current value.
            | 
            | Humble suggestion: there are many things about CH that are
            | not quite properly interpreted here, and reading through the
            | blog it seems like you're focusing more on areas where CH is
            | lacking. Please don't do that.
        
             | PeterZaitsev wrote:
              | This is what tends to make all vendor benchmarks
              | "benchmarketing" - while many of us fully intend to give a
              | fair shot to other technologies, we tend to know the best
              | practices for our own software better than we know the
              | "competition's".
        
             | ryanbooz wrote:
             | Two quick responses:
             | 
             | - The code that TSBS uses was contributed by Altinity[1].
             | If there is a better setup, please feel free to submit a
             | PR. As stated elsewhere, we did have a former CH engineer
             | review and even updated ClickHouse to the newest version
             | __yesterday__ based on his suggestion to ensure we had the
             | best numbers. (and some queries did improve after
             | upgrading, which are the numbers we presented)
             | 
             | - It seems like you read the article (great job - it was
             | long!!), so I'm sure you understand that we were trying to
             | answer performance and feature questions at a deeper level
             | than almost any benchmark we've seen to date. Many just
             | show a few graphs and walk away. We fully acknowledged that
             | smaller batches are not recommended by CH, but something
             | many (normally OLTP) users would probably have. It matters
             | and nobody (that we know of) has shown those numbers
             | before. And in our test, larger batch sizes do work well,
             | but not to some great magnitude in this one server setup.
             | Did 10k or 20k rows maybe go a little faster for CH?
             | Sometimes yes, sometimes negligible. The illustration was
             | that we literally spent months and hundreds of benchmark
             | cycles trying to understand the nuances.
             | 
             | I think we're pretty clear in the post that CH is a great
             | database for the intended cases, but it has shortcomings
             | just like TimescaleDB does and we tried to faithfully
             | explore each side.
             | 
             | [1]: https://github.com/timescale/tsbs/pull/26
        
         | jonatasdp wrote:
         | You can run tsbs by yourself. Just check the options here:
         | https://github.com/timescale/tsbs/blob/master/docs/clickhous...
         | 
         | The blog post benchmarks used `--use-case=cpu-only` case for
         | data ingestion. You can see the table definition here:
         | https://github.com/timescale/tsbs/blob/1eb7705ff921fd31784c0...
         | coming from here:
         | https://github.com/timescale/tsbs/blob/master/pkg/targets/cl...
        
       | mrwnmonm wrote:
        | If you have thousands of clients writing individual rows to the
        | database, one per request, and thousands of clients making
        | queries (some of them complex, some not), does ClickHouse even
        | get used in this scenario?
        
       | PeterZaitsev wrote:
        | I think it is worth noting that while Clickhouse is often used
        | as a time series store, it is not particularly designed for this
        | use case, but more for storing logs, events and similar data.
        | VictoriaMetrics would be an interesting comparable; it is
        | inspired by the Clickhouse design but optimized specifically for
        | time series storage: https://victoriametrics.com/
        
       | dmw_ng wrote:
        | More war stories: found Timescale easier to set up (maybe just
       | because more familiar), but raw query perf is not something you
       | just magically get for free. Timescale requires a lot of
       | investment in planning. In one project we had simple time range
       | scan queries against a less-than-RAM-sized table taking tens of
       | seconds to complete.
       | 
       | ClickHouse has a bit more ops overhead, but requires very little
       | in the way of pre-planning. Just throw whatever you want at it
       | and it seems to sing by default.
       | 
       | Regarding ops overhead, ClickHouse also has a "local" mode where
       | you can query a huge range of compressed formats without ever
       | performing any kind of ETL step. That means queries can run e.g.
       | directly against S3 storage. For batch logs analysis, IMHO local
       | is a gamechanger. Most bulk logging systems produce massive
       | amounts of S3 objects, and ClickHouse lets you tear through these
       | infrequently (but at high speed) when desired without any
       | bulky/opsey ETL step, and no cluster running idle just waiting
       | for a handful of daily jobs to arrive.
       | 
       | (I love this style of system in general, but the clear
       | engineering work behind ClickHouse completely won me over, I'd
       | apply it anywhere I could)
        
         | akulkarni wrote:
         | (TimescaleDB co-founder)
         | 
         | Thank you for the feedback - it is conversations like this that
         | enable us to understand how we can continue to make TimescaleDB
         | better.
         | 
          | And some of the ideas you are discussing are on our roadmap -
         | if anyone wants to help, we are hiring :-)
         | 
         | https://www.timescale.com/careers
        
       | CodesInChaos wrote:
        | Can somebody recommend a database suitable for an event-sourced
        | application:
       | 
       | * One series of events per user
       | 
       | * Each series grows at about 10 events/minute while the user is
       | active
       | 
       | * Fancy queries are not required, typically a user's event series
       | is consumed in order to update aggregate state for that user
       | 
       | * Either used online, adding events one at a time and needing to
       | immediately update the aggregate state
       | 
        | * Used offline, syncing a batch of hours or days at once. When
        | syncing a large time interval, eventually consistent state
        | updates are acceptable
       | 
        | * It must be possible to delete a user's data, regardless of how old
       | it is (a nightly batch job deleting multiple users at once is
       | fine, if it helps performance)
       | 
       | * Migrating old data should be possible with reasonable
       | performance and without consuming excessive temporary memory
       | 
       | * Compact storage is important (simple zstd compression should
       | suffice, though columnar compression might be slightly better)
       | 
       | * Being able to use a cheaper object store like S3 for old data
       | would be nice
       | 
       | At a glance timescale community appears to meet most
       | requirements. The userid can be used as `segmentby` key, and the
       | data compressed via columnar compression. But it seems to have
       | limitations with migration (sounds like it requires me to
       | manually decompress and recompress chunks, instead of simply
       | transforming one (chunk, segment) piece at a time) and deletion
       | (I need to delete everything with a specific `segmentby` key).
       | 
       | Alternatively there is the DIY approach, of serializing each
       | entry in a compact format, one file per user, and then once data
       | is old enough compress it (e.g. with zstd) and upload it to S3.
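        | 
        | A rough sketch of that DIY tail end (hypothetical event fields;
        | Python's stdlib lzma stands in for zstd, which would need the
        | third-party zstandard package):

```python
import json
import lzma

def pack_user_events(events):
    """Serialize one user's events as JSON lines, then compress.

    lzma is a stand-in for zstd here; with the zstandard package you
    would swap in zstd compression for a faster codec at a similar ratio.
    """
    raw = "\n".join(json.dumps(e, separators=(",", ":")) for e in events)
    return lzma.compress(raw.encode("utf-8"))

def unpack_user_events(blob):
    """Decompress and replay one user's events in their original order."""
    raw = lzma.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in raw.splitlines()]

# ~10 events/minute while a user is active; batch up a day of events,
# compress, then ship the blob to S3.
events = [{"ts": 1634830260 + 6 * i, "kind": "click", "n": i}
          for i in range(100)]
blob = pack_user_events(events)
assert unpack_user_events(blob) == events
assert len(blob) < len(json.dumps(events).encode("utf-8"))
```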
        
         | ryanbooz wrote:
         | Looks like you edited this with some more detail, so I'll
         | answer higher.
         | 
          | Compression in TimescaleDB used to mean that all compressed
          | data was immutable and the table schema couldn't be altered.
          | Since TimescaleDB 2.1 and 2.3, that has changed in a few ways:
          | 
          | - The schema can now have columns added or renamed
          | 
          | - Compressed chunks can now have rows inserted into them
          | (partially compressed; the background job will fully compress
          | them when it runs)
          | 
          | Row deletion is not possible yet, but I've personally been
          | having some internal conversations about ways to do exactly
          | what you're suggesting in the near-term: deleting rows based
          | on a "segmentby" column. I have some testing to do... but my
          | focus has been taken up by a certain 33-minute-read blog post.
         | 
         | Feel free to join our Slack and DM me if you want to talk about
         | it further.
         | 
         | slack.timescale.com
        
         | ryanbooz wrote:
         | (post author and Timescaler)
         | 
         | What do you mean by "migrating old data"? Don't want to make
         | assumptions before answering further.
        
           | CodesInChaos wrote:
           | Updating many/all rows as a rare maintenance task, typically
           | as part of deploying a new version of the application.
           | 
           | I know timescale has native support for the most common cases
           | (adding nullable columns/renaming columns). But sometimes the
            | transformation is more complex. Sometimes an SQL update
            | statement would suffice; sometimes streaming each segment in
            | chronological order through the application, which returns
            | the updated rows, might be required.
        
         | smarx007 wrote:
         | 10 events/minute - I would guess SQLite/DuckDB can fit the bill
         | for many years?
        
           | [deleted]
        
           | CodesInChaos wrote:
           | I don't think performance is the issue, but I'd like to keep
           | the storage small/cheap via compression, since this is a
           | hobby project. (though admittedly it's probably premature
           | optimization)
        
       | andrejserafim wrote:
        | Our anecdata: we store telemetry per thing. After loading a
        | month's worth of data, TimescaleDB as hosted by their cloud ran
        | a difference aggregation in seconds. Clickhouse routinely did it
        | in 20 millis.
        | 
        | Simple avg, etc. were better, but Clickhouse was always an order
        | of magnitude faster than Timescale. We didn't invest a whole
        | bunch into optimization other than trying some indexing
        | strategies in TimescaleDB.
       | 
       | So for our use case the choice is clear.
        
         | claytonjy wrote:
         | Was this for your primary source-of-truth, or more of a
         | downstream data warehouse, or something else?
         | 
         | I'm struggling to imagine a case where these are the two things
         | being considered; Timescale is the obvious choice for a primary
         | database, Clickhouse the obvious choice for a warehouse. I
         | wouldn't let my user-facing app write to Clickhouse, and while
         | I could potentially get away with a read-only Timescale replica
         | for internal-facing reports I would expect to eventually
         | outgrow that and reach for Clickhouse/Snowflake/Redshift.
        
           | dominotw wrote:
            | > Clickhouse the obvious choice for a warehouse
            | 
            | > Clickhouse/Snowflake/Redshift.
            | 
            | but clickhouse is very unlike the other two. when i think of
            | a warehouse i think star schema, data modeling, etc., not
            | something that hates joins.
        
             | claytonjy wrote:
             | Agreed, I wouldn't use Clickhouse for usual warehouse stuff
             | either, mostly because I can't imagine it plays well with
             | dbt which is a non-starter these days.
             | 
             | I'd still argue Clickhouse is closer to Snowflake/Redshift
             | than anything OLTP, and their name is intentionally chosen
             | to evoke warehouse-like scenarios.
        
               | FridgeSeal wrote:
               | What makes you think CH doesn't like joins?
               | 
               | Having used Redshift, Snowflake and CH for similar
               | workloads, I'd much prefer ClickHouse to the other 2.
               | 
               | Snowflake is hideously expensive for the subpar perf it
               | offers in my experience and Redshift is mediocre at best
               | in general.
        
               | hodgesrm wrote:
               | Is your comment on ClickHouse and DBT based on using the
               | DBT ClickHouse plugin? [0] If so I would be very
               | interested in understanding what you or others see as
               | deficiencies.
               | 
               | [0] https://github.com/silentsokolov/dbt-clickhouse
        
           | encoderer wrote:
           | > I wouldn't let my user-facing app write to Clickhouse
           | 
           | I've been thinking of doing exactly that. What are your
           | concerns?
        
             | claytonjy wrote:
             | I suppose it depends what you're going to let your user do,
             | but OLAPs in general and Clickhouse in particular don't do
             | well under row-oriented workloads, as described in the post
             | here. I'm imagining users primarily operating on small
             | numbers of rows and sometimes making updates to or deleting
             | them, a worst-case scenario for Clickhouse but best-case
             | for an OLTP like Postgres.
        
               | encoderer wrote:
               | Ah totally. Thanks for sharing your thoughts! In my case
               | I'm evaluating clickhouse as a source of truth for
               | customer telemetry data. Totally agree about the OLTP
               | limitations.
        
               | willvarfar wrote:
               | (Remember that clickhouse is not reliable. It doesn't
               | pretend to be.
               | 
               | Clickhouse is great for lots of common query workloads,
               | but if losing your data would be a big deal then it makes
               | a lot of sense to have your data in a reliable and backed
               | up place (eg timescale or just s3 files or whatever) too.
               | 
               | Of course lots of times people chuck stuff into
               | clickhouse and it's fine if they lose a bit sometimes.
               | YMMV.)
        
             | dreyfan wrote:
             | https://blog.cloudflare.com/http-analytics-
             | for-6m-requests-p...
             | 
              | has some good thoughts. The main thing you'll likely need
              | is some sort of buffer layer so you can do bulk inserts.
              | Do not write a high volume of single-row inserts into
              | Clickhouse.
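              | 
              | A minimal sketch of such a buffer layer (a hypothetical
              | client-side micro-batcher; the thresholds and the sink
              | callback are placeholders, not ClickHouse API):

```python
import time

class InsertBuffer:
    """Coalesce single-row writes into batches for bulk INSERTs.

    flush_rows / flush_seconds are illustrative thresholds; in practice
    sink() would issue one bulk INSERT into ClickHouse per batch.
    """
    def __init__(self, sink, flush_rows=10000, flush_seconds=5.0):
        self.sink = sink
        self.flush_rows = flush_rows
        self.flush_seconds = flush_seconds
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        age = time.monotonic() - self.last_flush
        if len(self.rows) >= self.flush_rows or age >= self.flush_seconds:
            self.flush()

    def flush(self):
        if self.rows:
            self.sink(self.rows)  # one bulk insert per accumulated batch
            self.rows = []
        self.last_flush = time.monotonic()

# Demo with an in-memory sink standing in for the database connection.
batches = []
buf = InsertBuffer(batches.append, flush_rows=1000)
for i in range(2500):
    buf.add((i, "metric", 0.5))
buf.flush()  # drain the partial batch on shutdown
print([len(b) for b in batches])  # → [1000, 1000, 500]
```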
        
               | encoderer wrote:
               | Thanks for sharing the link! I've heard the bulk insert
               | thing before and to be honest I've always thought that
               | RDBMSs don't love single row inserts either. Seems
               | clickhouse takes that to a new level.
               | 
               | In our case we are using sqs and usually insert 20-100
               | rows into the db at a time so I'm going to benchmark how
               | that does in clickhouse.
        
               | [deleted]
        
               | zepearl wrote:
               | With Clickhouse you can use a "buffer table", which uses
               | just RAM and sits on top of a normal table:
               | https://clickhouse.com/docs/en/engines/table-
               | engines/special...
               | 
               | Rows inserted into the buffer table are then flushed to
               | the normal/base table when one of the limits (defined
               | when the buffer table is created) is reached (limits are
               | max rows, max bytes, max time since the last flush), or
               | when you drop the buffer table.
               | 
                | I'm using it and it works (the performance difference
                | can be huge compared to performing single inserts
                | directly into a real/normal table), but be careful - the
                | flushed rows come with no guarantee about the order in
                | which they are flushed, so a buffer table is a very bad
                | idea if your base table relies on the correct sequence
                | of the rows it receives.
        
         | ants_a wrote:
          | From my experience of benchmarking these databases on
          | scientific data (highly regular timeseries) and looking at the
          | internals of both, these kinds of numbers happen when
          | answering the query requires crunching through many rows but
          | the output has few, i.e. the queries are filtering and/or
          | aggregating a ton of input rows that can't be excluded by
          | indexes or answered from preaggregations.
         | 
          | From what I can tell it comes down to execution engine
          | differences. Timescale, even with compressed tables, uses a
          | row-by-row execution engine architecturally resembling IE6-era
          | JS engines. ClickHouse uses a batched and vectorized execution
          | engine utilizing SIMD. The difference is one to two orders of
          | magnitude of throughput in terms of the raw number of rows per
          | core pushed through the execution engine.
          | 
          | Postgres/Timescale could certainly also implement a similar
          | model of execution, but to call it an undertaking would be an
          | understatement considering the breadth and extensibility of
          | features that the execution engine would need to support. To
          | my knowledge no one is seriously working on this outside of
          | limited-capability hacks like the vops or PG-Strom extensions.
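          | 
          | As a toy illustration (not either engine's actual code), here
          | is the same filtered aggregate expressed row-at-a-time versus
          | over whole column arrays:

```python
import numpy as np

def sum_where_rowwise(ts, vals, lo, hi):
    """Row-at-a-time: pull each row through the operator chain, one
    call per row (the shape of the Postgres executor model)."""
    total = 0.0
    for t, v in zip(ts, vals):
        if lo <= t < hi:
            total += v
    return float(total)

def sum_where_vectorized(ts, vals, lo, hi):
    """Batched/vectorized: apply the filter and the aggregate to whole
    column arrays at once, letting the CPU use SIMD under the hood (the
    shape of the ClickHouse executor model)."""
    mask = (ts >= lo) & (ts < hi)
    return float(vals[mask].sum())

# Same answer from both; the vectorized form does far fewer per-row
# interpreter steps, which is where the throughput gap comes from.
ts = np.arange(100_000, dtype=np.int64)
vals = np.ones(100_000)
assert sum_where_rowwise(ts, vals, 0, 50_000) == 50_000.0
assert sum_where_vectorized(ts, vals, 0, 50_000) == 50_000.0
```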
        
           | ryanbooz wrote:
           | (post author)
           | 
           | You do a great job summarizing some of the benefits of
           | ClickHouse we mentioned in the post, including the vectorized
           | engine!
           | 
           | That said, I'm not sure I'd refer to PostgreSQL/TimescaleDB
           | engine architecture as resembling IE6 JS support. Obviously
           | YMMV, but every release of PG and TimescaleDB bring new
           | advancements to query optimizations for the architecture they
           | are designed for, which was the focus of the post.
           | 
              | I'm personally still impressed, after 20+ years of working
              | with SQL and relational databases, when an optimization
              | engine can use statistics to find the "best" plan among
              | (potentially) thousands in a few ms. Maybe I'm too easily
              | impressed. :-D
        
             | ants_a wrote:
             | The optimization engine is of course great (despite
             | occasionally missing hard), but I am not referring to it. I
             | am referring to the way that PostgreSQL executes query
             | plans, the way rows are pulled up the execution tree, is
             | very similar to first iterations JavaScript engines - a
             | tree based interpreter. Picking out columns from rows and
             | evaluating expressions used to work the same until PG11,
             | where we got a bytecode based interpreter and a JIT for
             | those. But so far rows are still working the same way, and
             | it hurts pretty bad when row lookup is cheap and the rows
             | end up either thrown away or aggregated together with basic
             | math.
        
               | mfreed wrote:
               | With TimescaleDB compression, 1000 rows of uncompressed
               | data are compressed into column segments, moved to
               | external TOAST pages, and then pointers to these column
               | segments are stored in the table's "row" (along with
               | other statistics, including some common aggregates).
               | 
               | So while the query processor might still be "row-by-row",
               | each "row" it processes actually corresponds to a column
               | segment for which parallelization/vectorization is
               | possible. And because these column segments are TOASTed,
                | the row itself is just pointers, and you only need to
               | read in those compressed column segments that you are
               | actually SELECTing.
               | 
                | Anyway, you might have known this already; just wanted
                | to clarify. Thanks for the discussion!
        
         | GordonS wrote:
         | How many data points were those aggregations being computed
         | over? How much memory does your Postgres server have, and are
         | you using SSD storage (with associated postgres config tweaks)?
        
           | ryanbooz wrote:
           | (Post author)
           | 
           | Howdy! We provided all of those details in the post and
           | you're welcome to join us next week when we live-stream our
           | setup and test!
           | 
           | https://blog.timescale.com/blog/what-is-clickhouse-how-
           | does-...
        
             | GordonS wrote:
             | I was responding to @andrejserafim, asking about their
             | scenario, not the article.
        
               | ryanbooz wrote:
               | Gotcha! My apologies for not seeing the thread nature. HN
               | threads get me sometimes. :-)
        
         | ryanbooz wrote:
         | (N.B. post author)
         | 
         | Thanks for the feedback. Without knowing your situation, one of
         | the things we show in the blog post is that TimescaleDB
         | compression often changes the game on those kinds of queries
         | (data is transformed to columnar storage when you compress).
         | You don't mention if you did that or not, but it's something
         | we've seen/noticed in every other benchmark at this point -
         | that folks don't enable it for the benchmark.
         | 
          | And the second point of the article is that you have lots of
          | options for whatever works in your specific situation. But
          | make sure you're using the chosen database's features before
          | counting it out. :-)
        
           | maxmcd wrote:
           | I wonder if it's worth taking a page out of the MongoDB book
           | and enabling these kinds of benchmark altering settings by
            | default. We certainly selected clickhouse over timescale
           | internally because of major performance differences in our
           | internal testing that might have gone the other way had we
           | "known better".
        
             | ryanbooz wrote:
             | Indeed. Lots of discussion over this in the last few
             | months. There are nuances, but I think you'll see some
             | progress in this area over the next year.
        
       | brightball wrote:
       | That's a really thorough comparison. Much more detailed than I
       | expected.
       | 
       | From what I see, the trade off in disk space usage would point me
       | toward Timescale for most of my workloads. The insert performance
       | tradeoff just wouldn't justify the difference for me.
        
         | ericb wrote:
         | A licensing comparison would be a good addition.
        
           | PeterZaitsev wrote:
            | Good question. Was it the Open Source version of TimescaleDB
            | that was compared, or the Source Available one?
        
             | ryanbooz wrote:
             | Sure. As we shared in the blog post it was tested (like
             | other benchmarks) on dedicated EC2 instances using the
             | freely available Community version.
        
               | PeterZaitsev wrote:
               | This does not answer the question - is it Open Source
               | License or Source Available (TSL)
               | https://www.timescale.com/legal/licenses
        
               | CodesInChaos wrote:
               | the original article and the parent poster say that the
               | community edition was used:
               | 
               | > Versions: TimescaleDB version 2.4.0, community edition,
               | with PostgreSQL 13
               | 
               | and the link you posted explains that it's the non OSI
               | license version:
               | 
               | > TimescaleDB Community is made available under the
               | Timescale License ("TSL")
        
         | darthShadow wrote:
         | Does the disk usage go down later once the numerous parts are
         | merged or not?
         | 
         | I would assume it does but reading the article implies that it
         | does not.
        
           | ryanbooz wrote:
           | Great question. Yes, eventually it does, but (at least for
           | now) it wasn't something we could reliably force as part of
           | the query cycle and know everything was in its "best" state
           | with ClickHouse. To be honest, we didn't provide the final
           | compressed size of either database because of the need to
           | wait.
           | 
           | The code that's currently used by TSBS was submitted by
           | Altinity, a heavy supporter of ClickHouse in the U.S., but
           | TSBS is open source and anyone is welcome to contribute and
           | make the process/test better!
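For reference, ClickHouse does expose a statement to force outstanding parts to merge; the table name below is hypothetical, and the operation rewrites data, so it can be expensive on large tables (which is presumably why it is hard to fold reliably into a benchmark cycle):

```sql
-- Force ClickHouse to merge the remaining data parts of a table.
-- Heavyweight: rewrites the affected parts, so disk usage measured
-- right after ingest differs from this "settled" state.
OPTIMIZE TABLE benchmark.cpu FINAL;
```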
        
             | darthShadow wrote:
             | Thanks for the answer.
             | 
             | May be worth pointing that out in the article since the
             | increased disk usage has been mentioned multiple times in
             | the article without any indication that it's only temporary
             | until ClickHouse merges the parts.
        
           | zepearl wrote:
           | Concerning ClickHouse, yes it does - it's exactly the same
           | as when you have 2 compressed files, each containing 100
           | sorted rows: when you merge those 200 rows into a single
           | file, sort them, and compress them, the result will be
           | smaller than the sum of the 2 separate files.
           | 
           | How much you save is again exactly the same as when dealing
           | directly with files: it depends on the data and on the
           | compression algo.
        
         | ryanbooz wrote:
         | (Post author)
         | 
         | Thanks for the compliment! It's becoming a habit with us and
         | benchmarks. We just really want to dig in and understand what's
         | going on and why things work the way they do. ;-)
         | 
         | There really are so many nuances and as we tried to say a
         | number of times, ClickHouse is really great at what it does
         | well. But it's still OLAP at heart (which precludes OLTP
         | features many apps take for granted) and after enabling
         | TimescaleDB compression, the query story isn't as cut and dry.
         | We don't claim that TimescaleDB is the fastest in all
         | circumstances, or that it absolutely has to be for every
         | workload. Features and versatility play a major part in the
         | decision.
        
       | chalcolithic wrote:
       | Could putting RedPanda/Kafka in front of ClickHouse make it
       | insert benchmark winner? Of course it means operational expenses
       | but I wonder if this route is worth exploring?
        
       | cyber1 wrote:
       | Very interesting to see a comparison of TimescaleDB vs
       | VictoriaMetrics. Car vs car :)
        
       | eatonphil wrote:
       | I was surprised to see that ClickHouse and ElasticSearch have the
       | same number of contributors. That's pretty astounding given how
       | much older and more prominent ElasticSearch has been.
       | 
       | https://github.com/ClickHouse/ClickHouse/graphs/contributors
       | 
       | https://github.com/elastic/elasticsearch/graphs/contributors
       | 
       | Edit: I was very off. The Github contributor graph does not show
       | all actual contributors. ElasticSearch has somewhere around 2-3
       | times as many contributors as ClickHouse.
        
         | hodgesrm wrote:
         | ClickHouse now has more unique contributors with merged PRs on
         | an annual basis. The lines crossed early this year, or even
         | late last year.
        
           | eatonphil wrote:
            | Thanks! Could you point me at something concrete? :)
        
             | hodgesrm wrote:
              | Go to one of the several public ClickHouse endpoints and
              | run this query:
              | 
              |     -- Elastic vs CH in a single table.
              |     SELECT toYear(created_at) Year,
              |         uniqIf(creator_user_login, repo_name in
              |             ('elastic/elasticsearch')) "Elastic",
              |         uniqIf(creator_user_login, repo_name in
              |             ('yandex/ClickHouse', 'ClickHouse/ClickHouse')) "ClickHouse"
              |     FROM github_events
              |     WHERE event_type = 'PullRequestEvent'
              |         AND merged = 1
              |         AND repo_name in ('yandex/ClickHouse',
              |             'ClickHouse/ClickHouse', 'elastic/elasticsearch')
              |     GROUP BY Year ORDER BY Year
             | 
             | You can access the ClickHouse web UI for this dataset here:
             | https://github.demo.trial.altinity.cloud:8443/play?user=dem
             | o. The password is "demo" (type it in the left side.) This
             | is the Altinity.Cloud copy of Alexey Milovidov's excellent
             | github_events dataset.
             | 
              | When I run this query I get the following numbers.
              | 
              |     Year|Elastic|ClickHouse|
              |     ----|-------|----------|
              |     2015|    191|         0|
              |     2016|    299|        40|
              |     2017|    296|        85|
              |     2018|    284|       142|
              |     2019|    341|       232|
              |     2020|    339|       300|
              |     2021|    243|       294|
              | 
              | Just speculation on my part, but the drop in Elastic
              | contributors may be a side effect of the licensing change.
        
         | rohitnair wrote:
         | As per the landing pages of the projects, ES has 1.6k
         | contributors whereas ClickHouse has 803. The contributors page
         | likely only lists the top contributors to keep the page load
         | time manageable.
        
           | eatonphil wrote:
           | That makes much more sense. Thanks for pointing that out.
        
         | AdamProut wrote:
         | Clickhouse is by far the leading open source columnar SQL data
         | warehouse at this point. We have had strong open source
         | operational SQL DBs for many years (MySQL, Postgres), but no
         | open source systems that mirrored closed source MPP columnstore
         | until ClickHouse. It's interesting that it took "this long" for
         | a strong open source SQL DW to emerge.
        
         | LoriP wrote:
         | ClickHouse has gained a huge following and honestly that's been
         | pretty well earned. For the kinds of apps that they target it's
         | a great choice, it's great technology with a very able team
         | behind it.
        
       | rkwasny wrote:
       | There is some creative engineering going on here :) have a look:
       | 
       | https://github.com/timescale/tsbs/blob/master/scripts/load/l...
       | 
       | vs
       | 
       | https://github.com/timescale/tsbs/blob/master/scripts/load/l...
        
         | csdvrx wrote:
         | > There is some creative engineering going here
         | 
         | Agreed. At a previous job, ClickHouse outperformed Timescale
         | by several orders of magnitude, under about every condition.
         | 
         | The timescale team seems to recognize that (look for the
         | comment about clickhouse being a bulldozer) but they seem to
         | say timescale can be better suited.
         | 
         | In my experience, in about 1% of the cases, yes, timescale will
         | be a better choice (ex: if you do very small batches of
         | insertions, if you need to remove some datapoints) but in 99%
         | of the usecases for a time series database, clickhouse is the
         | right answer.
         | 
         | There seem to have been several improvements to timescale
         | since 2018, with columnar storage, compression, etc., and
         | that's good because more competition is always better.
         | 
         | But in 2021, clickhouse vs timescale for a timeseries is like
         | postgres vs mongo for a regular database: unless you have
         | special constraints [*], the "cool" solution (timescale or
         | mongo) is the wrong one.
         | 
         | [*]: you may think you have a unique problem and you need
         | unique features, but odds are, YAGNI
         | 
         | https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it
        
         | rkwasny wrote:
         | Also, these queries are different: order by "time" vs order by
         | "created_at"
         | 
         | https://github.com/timescale/tsbs/blob/a045665d9c94426bbc405...
         | 
         | https://github.com/timescale/tsbs/blob/a045665d9c94426bbc405...
        
           | ryanbooz wrote:
           | We were using tags, so that "else" block isn't the one being
           | used for ClickHouse. Regardless, the table that is created
           | (by the community and verified by former CH engineers) orders
           | by created_at, not time, and so that query should be the
           | "fastest" distinct possible.
        
         | ryanbooz wrote:
         | (Post author)
         | 
         | I'm not sure why you think that's creative engineering. What
         | you're pointing to is the depth of available configuration that
         | the contributors to TSBS have exposed for each database. It's
         | totally open source and anyone is welcome to add more
         | configuration and options! I believe (although not totally
         | sure) that Altinity and ClickHouse folks added their code a few
         | years ago - at least it wasn't anyone on the Timescale team.
         | 
         | That said, we didn't actually use those scripts to run our
         | tests. Please join us next Wednesday (10AM ET/4PM CET) to see
         | how we set the databases up and ran the benchmarks. We'd be
         | delighted to have you try it on your own too!
        
           | rkwasny wrote:
           | Ah so the tests you have used are not the ones in
           | https://github.com/timescale/tsbs ?
        
             | ryanbooz wrote:
             | All the same tests. You simply pointed to a shell script
             | that's configurable to run tests for each database. We
             | provided details in the blog post of exactly what settings
             | we used for each database (cardinality, batch size, time
             | range, TimescaleDB chunk size, etc.) so you can use those
             | scripts to configure and run the tests too.
        
       | darksaints wrote:
       | It would be awesome to combine the following things:
       | 
       | * PostGIS
       | 
       | * Timescale
       | 
       | * Citus
       | 
       | * Zedstore
       | 
       | This truly would be the relational DB to end all relational DBs.
       | Unfortunately, we run into a couple problems:
       | 
       | * Managing multiple extensions is a burdensome task, which should
       | be in the wheelhouse of cloud providers, but...
       | 
       | * Timescale and Citus are open core, holding back features
       | for customers. Their primary revenue channels are their cloud
       | offerings. Unfortunately you can't get Citus and Timescale in the
       | same cloud offering, cause you're dealing with two separate
       | companies.
       | 
       | * PostGIS has multiple cloud providers, but none of them have
       | Timescale or Citus available.
       | 
       | * Citus only has cloud offerings on Azure, excluding the other
       | two major players that often have exclusive relationships with
       | companies.
       | 
       | * Zedstore is really cool and together with Citus could be a
       | massive gamechanger by having columnstore and rowstores in the
       | same distributed database. However, development has stalled, and
       | nobody seems to be able to explain what happened.
       | 
       | Sigh...maybe 5 years from now.
        
         | mfreed wrote:
         | Timescale Cloud indeed comes with PostGIS installed by default.
         | 
         | Regarding distributed (Citus) and columnar (Zedstore):
         | 
         | - TimescaleDB's compression actually takes a columnar approach
         | (including that it only reads the individual compressed columns
         | that you SELECT), and so combines both row- and column-oriented
         | data. [0]
         | 
         | - TimescaleDB 2.0 also supports distributed deployment, and
         | Timescale Cloud will (very soon) offer one-click deployment of
         | fully-managed multi-node TimescaleDB. [1]
         | 
         | [0] https://blog.timescale.com/blog/building-columnar-
         | compressio...
         | 
         | [1] https://blog.timescale.com/blog/building-a-distributed-
         | time-...
        
         | avthar wrote:
         | > Timescale and Citus are are open core, holding back features
         | for customers.
         | 
         | One clarification. While TimescaleDB is open-core, our
         | community version is source-available and 100% free to use. We
         | do not "hold back features for customers". You do not need to
         | pay to use any of TimescaleDB's best features, it's all free
         | via the Timescale Community license.
         | 
         | You only pay if you'd like to use our hosted offerings (and
         | save the hassle of self-managing your DB): Timescale Cloud or
         | Managed Service for TimescaleDB.
         | 
         | For more see: https://www.timescale.com/products
         | 
         | (Disclaimer: I work at Timescale)
        
         | contrahax wrote:
         | If you use Aiven for a cloud PG instance you can get both
         | Timescale and PostGIS installed.
         | 
         | I also really wish ClickHouse would prioritize PostGIS support
         | - IIRC it has been on their roadmap for a while but keeps
         | getting kicked around every year or so. Same thing with
         | CockroachDB - PostGIS support kicked down the road every year.
        
           | otan_ wrote:
           | Hi there, CockroachDB dev here. We've supported spatial
           | features from PostGIS since 20.2 - you can get started here
           | https://www.cockroachlabs.com/docs/stable/spatial-data.html!
           | Of course, there's bits and pieces we've missed but if
           | there's something you're missing in particular you can let me
           | know here or through a GitHub issue.
        
       | sam0x17 wrote:
       | Can someone give me a real-world example of a scenario where they
       | actually need a time series database, like an example query with
       | the business use case / justification? Just super curious.
        
         | joshxyz wrote:
         | - https://clickhouse.com/docs/en/faq/use-cases/time-series/
         | 
         | - https://clickhouse.com/docs/en/faq/general/olap/
        
         | jedberg wrote:
         | At Netflix all of our monitoring was in a time series database
         | so we could get real time insights into pretty much anything we
         | were monitoring (which was most everything).
        
         | ryanbooz wrote:
         | (Post author)
         | 
         | This is a great post to give you some talking points:
         | 
         | https://blog.timescale.com/blog/what-the-heck-is-time-series...
         | 
         | I also love this recent one we did with some non-standard time-
         | series data that the NFL provided! Really fun working on that
         | data set.
         | 
         | https://blog.timescale.com/blog/hacking-nfl-data-with-postgr...
        
         | LoriP wrote:
         | Here's some posts from users on the Timescale blog
         | https://blog.timescale.com/tag/dev-q-a/
        
         | GordonS wrote:
         | We're using TimescaleDB to store log events - so each row has a
         | timestamp, some properties, and a log message. And a lot of
         | that is actually in a JSONB column.
         | 
         | Not the archetypal time series use case, but TimescaleDB is
         | still really useful.
         | 
         | TSDB's compression means we can store a huge volume of data in
         | a fraction of the space of a standard Postgres table. You can
         | achieve even better compression ratios and performance if you
         | spend time designing your schema carefully, but honestly we
         | didn't see the need, as just throwing data in gets us something
         | like 10:1 compression and great performance.
         | 
         | TSDB's chunked storage engine means that queries along chunking
         | dimensions (e.g. timestamp) are super-fast, as it knows exactly
         | which files to read.
         | 
         | Chunking also means that data retention policies execute nearly
         | instantaneously, as it's literally just deleting files from
         | disk, rather than deleting rows one-by-one - millions of rows
         | are gone in an instant!
         | 
         | And best of all, this all works in Postgres, and we can query
         | TSDB data just the same as regular data.
         | 
         | All that combined easily justified the decision to use TSDB -
         | and if you're familiar with Postgres, it's actually really
         | simple to get started with. Really, we'd have needed a business
         | justification _not_ to use it!
         | 
         | Much love for the TimescaleDB team!
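As a rough sketch of the pieces described above (table name, columns, and policy intervals here are illustrative, not the poster's actual schema):

```sql
-- Turn a plain PostgreSQL table into a hypertable: data is chunked by time.
CREATE TABLE logs (
    time       TIMESTAMPTZ NOT NULL,
    service    TEXT,
    properties JSONB,
    message    TEXT
);
SELECT create_hypertable('logs', 'time');

-- Enable columnar compression and compress chunks older than a week.
ALTER TABLE logs SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'service'
);
SELECT add_compression_policy('logs', INTERVAL '7 days');

-- Retention: drops whole chunks (files on disk), not individual rows.
SELECT add_retention_policy('logs', INTERVAL '90 days');
```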
        
           | sam0x17 wrote:
           | So services like Sentry / Honeybadger / etc probably use this
           | architecture?
        
             | GordonS wrote:
             | I think most online logging SaaS services actually use
             | ElasticSearch.
        
         | tnolet wrote:
         | We just migrated to Clickhouse. We collect monitoring data. So
         | response times from 20+ different locations. We are not super
         | duper big but at least 100M+ individual metrics per month. We
         | want to give our users a snappy, interactive dashboard that
         | lets them explore aggregates of that data over time: averages,
         | p99 etc..
         | 
         | That is where a time series DB is very handy
        
         | bradstewart wrote:
         | We capture and store energy readings at ~5second intervals,
         | then display total energy at various time granularity by
         | aggregating the values over minutes, hours, days, months, etc.
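With TimescaleDB, for example, that kind of rollup is a single time_bucket aggregate; a hedged sketch (table and column names assumed, not the poster's actual schema):

```sql
-- Roll 5-second power samples (watts) up to hourly energy per device.
SELECT time_bucket('1 hour', time) AS hour,
       device_id,
       SUM(watts) * 5 / 3600.0 AS watt_hours  -- each sample covers ~5 seconds
FROM readings
GROUP BY hour, device_id
ORDER BY hour;
```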
        
         | ur-whale wrote:
         | > Can someone give me a real-world example of a scenario where
         | they actually need a time series database
         | 
         | Large scale infrastructure monitoring?
         | 
         | If you run a data center with 10K machines in it, whitebox
         | monitoring of these machines and what runs on it generates tons
         | of timestamped data.
         | 
         | These time series can be used to inform an automated alerting
         | system (eg using trends to forecast bad things before they
         | happen).
         | 
         | They can also be analyzed in batch mode to figure out how to
         | optimize many things (power / cooling / workload assignment /
         | etc ...)
        
         | claytonjy wrote:
         | I introduced Timescale at an IIOT company, we had thousands of
         | sensors regularly sending data up and wanted to efficiently
         | display such metrics to users on a per-sensor basis, and one
         | "tick" of data had a lot of metrics. Timescale let us go from
         | Postgres for metadata and OpenTSDB (awful, stay far away) for
         | time series to just one Timescale instance for everything. Huge
         | win for us. We had enough data that doing the same with vanilla
         | Postgres would have performed much worse (billions of rows).
         | 
         | We wrote more about this for an earlier Timescale blog post:
         | https://blog.timescale.com/blog/how-everactive-powers-a-dens...
        
       | nojvek wrote:
       | I'm surprised Timescale hasn't given a comparison with
       | SingleStoreDB.
       | 
       | I've found SingleStore column scans at parity with ClickHouse in
       | speed. At the same time, SingleStore uses a hybrid skip-list,
       | columnstore data structure in their universal storage (which is
       | the default table format).
       | 
       | So you have high throughput transactions, as well as insanely
       | fast aggregate scans.
       | 
       | Column stores are usually great at appends, not so much at
       | updates and deletes.
        
       | avinassh wrote:
       | Related to TimescaleDB, there was a blog post which explained
       | their internals and also compared with another similar time
       | series DB. I can't seem to find the link, anyone remembers?
        
         | avthar wrote:
         | Timescaler here. I think you're referring to this comparison of
         | InfluxDB vs TimescaleDB [0]?
         | 
         | There's also comparisons of TimescaleDB vs MongoDB[1] and AWS
         | Timestream [2].
         | 
         | [0]: https://blog.timescale.com/blog/timescaledb-vs-influxdb-
         | for-...
         | 
         | [1]: https://blog.timescale.com/blog/how-to-store-time-series-
         | dat...
         | 
         | [2]:https://blog.timescale.com/blog/timescaledb-vs-amazon-
         | timest...
        
       | cercatrova wrote:
       | Has anyone else been seeing an influx of timescale.com articles?
       | I count around 10 in the last month.
        
         | carlotasoto wrote:
         | (Timescale Team member here)
         | 
         | We've been working really hard on our launches / releases this
         | month! We called it "Always Be Launching" - we've been aiming
         | for releasing multiple things per week during October :)
        
           | yumraj wrote:
           | That sounds great.
           | 
           | However, as a DB where users may store critical data, should
           | you really be "Always be launching"? That sounds a little
           | like FB's "move fast and break things". There's a reason why
           | some of the mission critical open source technologies move
           | slowly.
        
             | mfreed wrote:
             | We actually are only having one database software release
             | this month (TimescaleDB v2.5), which is aligned with our
             | normal database release cadence.
             | 
             | Timescale (the company) also provides a managed cloud
             | offering, as well as Promscale (an observability product
             | built on top of TimescaleDB).
             | 
             | So #AlwaysBeLaunching is a company-wide effort across
             | different product & engineering teams, as well as folks in
             | Developer Advocacy and others (e.g., who worked on this
              | comparison benchmark).
             | 
              | What might also be interesting is our introduction of
             | Experimental Schema features in TimescaleDB - explicitly so
             | that we can "Move fast, but don't break things" (which is
             | also key to getting good community feedback):
             | 
             | https://blog.timescale.com/blog/move-fast-but-dont-break-
             | thi...
             | 
             | (Timescale co-founder)
        
             | avthar wrote:
             | Timescale team member here. We take our responsibility to
             | build a rock-solid platform very seriously. We have
             | multiple "levels" of product within Timescale. At our core,
             | we have the open-source database, TimescaleDB. This product
             | releases on a more deliberate and careful cadence, always
             | making sure that we are optimizing for reliability,
             | security, and performance. This has been our approach since
             | our initial launch [0], where we embraced the mantra
             | "boring is awesome", recognizing that for our users
             | stability and reliability is of paramount importance.
             | 
             | Within the core database, we offer features that are
             | carefully marked as "experimental", which we discuss at
             | length in this blog post [1].
             | 
             | Beyond TimescaleDB, we also offer other products that are
             | more SaaS-y in nature. While they're all based on the rock-
             | solid foundation of TimescaleDB, we are also able to ship
             | new features more quickly because they are UI components
             | that make using the database even easier.
             | 
             | Finally, some of our "launches" are more textual in nature,
             | such as this benchmark, which we have spent months
             | researching and compiling.
             | 
             | [0]: https://blog.timescale.com/blog/when-boring-is-
             | awesome-build...
             | 
             | [1]: https://blog.timescale.com/blog/move-fast-but-dont-
             | break-thi...
        
       | pradeepchhetri wrote:
       | One suggestion, if you really want to benchmark systems:
       | 
       | * Create a setup which is production grade i.e. run a multi-node
       | HA setup of those systems.
       | 
       | * Understand the best practices of those systems; otherwise the
       | results get biased.
       | 
       | * Validate the results with experts of those systems before
       | publishing.
        
       | didip wrote:
       | Question, which helm chart is the best to install ClickHouse
       | these days?
        
         | hodgesrm wrote:
         | Don't use helm. The ClickHouse Kubernetes Operator is the way
         | to go. Here's the project:
         | https://github.com/Altinity/clickhouse-operator
         | 
         | This is generally true for most databases these days. Use an
         | operator if it's available. Helm can't handle the dynamic
         | management required to run databases properly.
        
           | didip wrote:
           | Thank you!
        
         | [deleted]
        
       | Upitor wrote:
       | Would either of these database systems be suitable for a case
       | where you have a mix of large measurement data and small
       | reference/master data that you need to join, filter, etc.?
       | Example:
       | 
       | SELECT r.country, m.time, SUM(m.measurement) FROM
       | measurement_table AS m INNER JOIN reference_table AS r ON
       | m.device_id = r.device_id GROUP BY r.country, m.time
        
         | ryanbooz wrote:
         | In its current form/state, ClickHouse is not optimized for
         | typical JOIN-type queries, a point we make in the post. You
         | would have to re-write your statement to get better
         | performance. The other main point is that all data is
         | "immutable", so if your reference data needs to be updated, it
         | would still need to go through some kind of asynchronous
         | transform process to ensure you're getting the correct values
         | at query time.
         | 
         | TimescaleDB is PostgreSQL, so it can easily handle this kind of
         | join aggregate like you would expect. If "m.measurement" was
         | compressed, historical queries with a time predicate would
         | likely be faster than in the uncompressed state.
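To illustrate the "asynchronous transform" point above: ClickHouse expresses updates as mutations, which rewrite affected data parts in the background (the table and values below are hypothetical):

```sql
-- A ClickHouse mutation: scheduled asynchronously, rewrites whole data
-- parts, and is not transactional - queries may see old values until
-- the mutation finishes.
ALTER TABLE reference_table UPDATE country = 'DE' WHERE device_id = 42;
```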
        
       | PeterZaitsev wrote:
       | Altinity folks have suggested a number of ClickHouse
       | optimizations for the time series benchmark - did you enable any
       | of those?
       | https://altinity.com/blog/clickhouse-continues-to-crush-time...
        
         | ryanbooz wrote:
         | Hello @PeterZaitsev!
         | 
         | Actually Altinity is the one that contributed the bits to TSBS
         | for benchmarking ClickHouse[1], so we are using the work that
         | they contributed (and anyone is welcome to make a PR for
         | updates or changes). We also had a former ClickHouse engineer
         | look at the setup to verify it matched best practices with how
         | CH is currently designed, given the TSBS dataset.
         | 
         | As for the optimizations in the article you pointed to from
         | 2019 (specifically how to query "last point" data more
         | efficiently in ClickHouse), it uses a different table type
         | (AggregatingMergeTree) and a materialized view to get better
         | query response times for this query type.
         | 
         | We (or someone in the community) could certainly add that
         | optimization to the benchmark, but it wouldn't be using raw
         | data - which we didn't think was appropriate for the benchmark
         | analysis. But if one wanted to use that optimization, then one
         | should also use Continuous Aggregates for TimescaleDB - ie for
         | an apples to apples comparison - which I think would also lead
         | to similar results to what we show today.
         | 
         | It's actually something we've talked about adding to TSBS for
         | TimescaleDB (as an option to turn on/off) and maybe other DBs
         | could do the same.
         | 
         | [1]: https://github.com/timescale/tsbs/pull/26
        
           | PeterZaitsev wrote:
           | Thank you for your prompt response!
           | 
           | I think the most important thing is that ClickHouse is NOT
           | designed for small-batch insertion; if you need to do 1000s
           | of inserts/sec you put a queue in front of ClickHouse. And
           | query speed can be impacted by batch size a lot. So have you
           | looked at query performance with an optimal batch size?
        
             | mfreed wrote:
             | Yep! The blog post includes data and graphs from both large
             | (5000-15,000 rows / batch) and small (100-500 rows / batch)
             | sizes. Please see the section "Insert Performance". Thanks!
             | 
             | https://blog.timescale.com/blog/what-is-clickhouse-how-
             | does-...
        
               | PeterZaitsev wrote:
                | 1) This is also a small batch size. If you're inserting
                | 500,000 rows/sec, 5,000 rows is not a particularly large
                | batch size.
               | 
               | 2) I see different graphs for ingest but not for queries.
               | The data layout will depend on the batch size, unless of
               | course you did OPTIMIZE before running queries
        
               | ryanbooz wrote:
                | 1) You're absolutely right. 5k rows isn't "large". We
                | also mentioned that we ran hundreds of tests, often
                | going between 5k and 15k rows/batch. The overall
                | ingest/query cycle didn't change dramatically in any of
                | these. That is, 5k rows was within a few percentage
                | points of 10k rows. Interestingly, the benchmarks that
                | Altinity has only used 10k rows/batch (which we also
                | did; it just didn't have any major impact in the grand
                | scheme of things).
               | 
               | 2) We did not specifically call OPTIMIZE before running
               | queries. Again, learning from the leaders at Altinity and
               | their published benchmarks, I don't see any references
               | that they did either, and neither does the TSBS code
               | appear to call it after ingest.
               | 
               | Happy to try both of these during our live stream next
               | week to demonstrate and learn!
               | 
               | Altinity benchmark (10k rows/batch mention):
               | https://altinity.com/blog/clickhouse-for-time-series
        
       | benwilson-512 wrote:
       | We've got a few billion rows in TSDB, pretty happy with it so
       | far. Our workload fits the OLTP workflow more than OLAP though,
       | we're processing / analyzing individual data points from IoT
       | devices as they come in, and then providing various
       | visualizations. This tends to mean that we're doing lots of
       | fetches to relatively small subsets of the data at a time, vs
       | trying to compute summaries of large subsets.
       | 
       | Compression is seriously impressive, we see ~90% compression rate
       | on our real world datasets. Having that data right next to our
       | regular postgres tables and being able to operate on it all
       | transactionally definitely simplifies our application logic.
       | 
       | Where I see a lot of folks run into issues with TimescaleDB is
       | that it does require that your related data models hold on to
       | relevant timestamps. If you want to query a hypertable
       | efficiently, you always want to be able to specify the relevant
       | time range so that it can ignore irrelevant chunks. This may mean
       | that you need to put data_starts_at, data_ends_at columns on
       | various other tables in your database to make sure you always
       | know where you find your data. This is actually just fine though,
       | because it also means you have an easy record of those min / max
       | values on hand and don't need to hit the hypertable at all just
       | to go "When did I last get data for this device".
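        | 
        | The pattern is simple: look up the stored bounds first, then
        | query the hypertable with an explicit time range so chunk
        | exclusion can kick in (schema below is illustrative, not our
        | actual one):
        | 
        |     -- Bounds live on the device row, so no hypertable scan:
        |     SELECT data_starts_at, data_ends_at
        |     FROM devices WHERE id = 42;
        | 
        |     -- Then constrain the hypertable query by time so
        |     -- irrelevant chunks are pruned:
        |     SELECT time, value
        |     FROM readings
        |     WHERE device_id = 42
        |       AND time BETWEEN '2021-10-01' AND '2021-10-21';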
        
         | qorrect wrote:
         | > Compression is seriously impressive
         | 
          | Does this affect your query performance?
        
           | benwilson-512 wrote:
           | In practice we've seen it actually improve performance,
           | because when fetching a data range for a device fewer actual
           | rows have to be fetched from the disk. You pick certain
           | columns (like device ID) that remain uncompressed and indexed
           | for rapid querying, and then the actual value columns are
           | compressed for a range of time.
        
             | qorrect wrote:
             | Very cool thanks for sharing
             | 
             | > This may mean that you need to put data_starts_at,
             | data_ends_at columns on various other tables in your
             | database to make sure you always know where you find your
             | data.
             | 
             | Do you have a link to docs for this ? Does this mean
             | literally put a first column named (xstartx) and an end
             | column (xendx) as the last column ? How do you then utilize
             | it ?
             | 
             | Thanks so much!
        
       | nojito wrote:
        | None of these ClickHouse queries are optimized.
        | 
        | It is very, very hard to beat ClickHouse in terms of
        | performance if it is set up properly.
        
       | fnord77 wrote:
       | Apache Druid and Apache Pinot are two others to consider for time
       | series. We're using druid at scale and it works pretty well.
       | Pinot appears to be faster for retrieval but it is less mature.
        
       | zekrioca wrote:
       | I know it is not related, but "ClickHouse" ("_Click_stream" and
       | "Data ware_House_") doesn't sound like a database name.
        
         | LoriP wrote:
          | I think the name comes from the project's origin: ClickHouse
          | started at Yandex and was only recently spun out as its own
          | entity. There are probably others better able to give that
          | history, but that's the gist of it.
        
       | arunmu wrote:
        | What is the difference w.r.t. the comparison Altinity did of
        | ClickHouse with TimescaleDB? ClickHouse performed better there
        | for the same test. What gives?
        
         | ryanbooz wrote:
         | (Post author)
         | 
         | The two big things, which we discuss at length in the post,
         | are:
         | 
          | - Altinity (and others) did not enable compression in
          | TimescaleDB, which converts data into columnar storage and
          | improves queries over historical data because individual
          | columns can be retrieved in compressed form, similar to CH
         | 
         | - They didn't explore different batch sizes to help understand
         | how each database is impacted at various batch sizes.
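          | 
          | Enabling TimescaleDB compression is a couple of statements
          | (table and column names here are illustrative):
          | 
          |     ALTER TABLE readings SET (
          |       timescaledb.compress,
          |       timescaledb.compress_segmentby = 'device_id',
          |       timescaledb.compress_orderby   = 'time DESC'
          |     );
          |     SELECT add_compression_policy('readings',
          |                                   INTERVAL '7 days');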
        
           | PeterZaitsev wrote:
           | Have you from your side followed all Clickhouse best
           | practices?
           | 
            | ClickHouse's design in particular suggests doing an ingest
            | request approximately once per second; if you do much more
            | than that, you are using it outside its intended usage, and
            | if you need that, you usually put some sort of queue
            | between whatever produces the data and ClickHouse.
            | 
            | Note that ingesting in small batches can also significantly
            | affect query performance
        
             | ryanbooz wrote:
             | Yep - it's all detailed in the post! The question is how it
             | compares to TimescaleDB, which is an OLTP time-series
             | database that has a lot of other possible use cases (and
              | extensibility). I think it's very fair to explore how
              | smaller batches work, since others haven't actually shown
              | that (as far as we can see), so that users coming from a
              | database like PostgreSQL can understand the impact of
              | something like small batches.
             | 
             | As for ingest queueing, TSBS does not queue results. We
             | agree, and tell most users that they should queue and batch
             | insert in larger numbers. Not every app is designed that
             | way and so we wanted to understand what that would look
             | like.
             | 
              | But CH did amazingly well regardless with batches above
              | 1k-2k, and lived up to its name as a really fast database
              | for ingest!
        
         | akulkarni wrote:
         | If you are referring to this post:
         | https://altinity.com/blog/clickhouse-for-time-series
         | 
         | That post was written in November 2018 - 3 years ago - when
         | TimescaleDB was barely 1.0.
         | 
         | A lot has changed since then:
         | 
         | 1. TimescaleDB launched native columnar compression in 2019,
         | which completely changed its story around storage footprint and
         | query performance [0]
         | 
         | 2. TimescaleDB has gotten much better
         | 
         | 3. PostgreSQL has also gotten better (which in turn makes
         | TimescaleDB better)
         | 
          | In fact, IIRC Altinity contributed ClickHouse support to
          | TSBS [1], which is also what this newer benchmark uses
         | 
         | (Disclaimer: TimescaleDB co-founder)
         | 
         | [0] https://blog.timescale.com/blog/building-columnar-
         | compressio...
         | 
         | [1] https://github.com/timescale/tsbs
        
           | arunmu wrote:
            | Thank you. My only nit is the way the ratio (CH/TS) is
            | shown. What is the purpose of that? It will show a bigger
            | percentage for cases in which TS is better, but a lower
            | percentage for cases where CH gives better results. From a
            | data-representation perspective, I do not think that is
            | fair.
        
       | jurajmasar wrote:
        | Disclaimer: I'm a co-founder of https://logtail.com, a
        | ClickHouse-based hosted log management platform.
       | 
       | PostgreSQL, TimescaleDB, and ClickHouse are all impressive pieces
       | of software. We use both PostgreSQL and ClickHouse at Logtail.
       | 
       | ClickHouse shines for true OLAP use-cases and is _very_ hard to
       | beat performance-wise when configured properly.
       | 
       | Example:
       | 
       | > Poor inserts and much higher disk usage (e.g., 2.7x higher disk
       | usage than TimescaleDB) at small batch sizes (e.g., 100-300
       | rows/batch).
       | 
       | If your consistency requirements allow, you could use the Buffer
       | Table Engine to get blazing fast inserts:
       | https://clickhouse.com/docs/en/engines/table-engines/special...
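        | 
        | A Buffer table sketch (names and thresholds are illustrative;
        | data is flushed to the destination table when any max
        | threshold, or all min thresholds, are reached):
        | 
        |     CREATE TABLE logs_buffer AS logs
        |     ENGINE = Buffer(currentDatabase(), logs, 16,
        |                     10, 100,              -- min/max seconds
        |                     10000, 1000000,       -- min/max rows
        |                     10000000, 100000000); -- min/max bytes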
       | 
        | Horizontal scalability and compression are also unbeatable
        | from what I've seen.
       | 
       | There's a hefty price tag, however: ClickHouse is quite ops heavy
       | and its observability has a seriously steep learning curve. Only
       | go for ClickHouse in production if you really know what you're
       | doing :)
        
         | MichaelRazum wrote:
         | Could you explain what you mean by ops heavy? Just curious.
          | Actually, we have a production system where TimescaleDB and
          | ClickHouse run in parallel. So far ClickHouse hasn't given
          | us any trouble, but it is rarely used right now.
        
           | eternalban wrote:
           | >> Disclaimer: I'm a co-founder of https://logtail.com,
           | ClickHouse-based hosted log management platform.
        
       | PeterZaitsev wrote:
        | Benchmarks done a while back did not use compression for
        | TimescaleDB, but they also did not use the newer compression
        | settings for ClickHouse either.
       | 
       | https://altinity.com/blog/2019/7/new-encodings-to-improve-cl...
       | 
        | In particular, LowCardinality for strings and time-series-specific
        | codecs may be very valuable
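        | 
        | For example (illustrative schema):
        | 
        |     CREATE TABLE metrics (
        |         time  DateTime CODEC(DoubleDelta),
        |         host  LowCardinality(String),
        |         value Float64 CODEC(Gorilla)
        |     ) ENGINE = MergeTree()
        |     ORDER BY (host, time);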
        
       | mt42or wrote:
        | Nobody talking about VictoriaMetrics?
        
       ___________________________________________________________________
       (page generated 2021-10-21 23:00 UTC)