[HN Gopher] ClickHouse as an alternative to Elasticsearch for lo...
___________________________________________________________________
ClickHouse as an alternative to Elasticsearch for log storage and
analysis
Author : jetter
Score : 269 points
Date : 2021-03-02 15:40 UTC (7 hours ago)
(HTM) web link (pixeljets.com)
(TXT) w3m dump (pixeljets.com)
| wakatime wrote:
| A related database using ideas from Clickhouse:
|
| https://github.com/VictoriaMetrics/VictoriaMetrics
| wikibob wrote:
| Are you familiar with VictoriaMetrics?
|
| Can you elaborate on how it is similar and dissimilar to
| Clickhouse?
|
| What specific techniques are the same?
| ekimekim wrote:
| The core storage engine borrows heavily from it - I'll
| attempt to summarize and apologies for any errors, it's been
| a while since I worked with VictoriaMetrics or ClickHouse.
|
| Basically data is stored in sorted "runs". Appending is cheap
| because you just create a new run. You have a background
| "merge" operation that coalesces runs into larger runs
| periodically, amortizing write costs. Reads are very
| efficient as long as you're doing range queries (very likely
| on a time-series database) as you need only linearly scan the
| portion of each run that contains your time range.
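The run-based design described above can be sketched in a few lines of Python (a toy model, not the actual VictoriaMetrics or ClickHouse code): appends create new sorted runs, a background merge coalesces them, and a range query scans only the relevant slice of each run.

```python
import bisect

class RunStore:
    """Toy LSM-style store: rows live in sorted runs of (timestamp, value)."""

    def __init__(self):
        self.runs = []  # each run is a sorted list of (timestamp, value)

    def append(self, batch):
        # Appending is cheap: sort the incoming batch and keep it as a new run.
        self.runs.append(sorted(batch))

    def merge(self):
        # Background "merge": coalesce all runs into one larger sorted run,
        # amortizing write costs over time.
        self.runs = [sorted(row for run in self.runs for row in run)]

    def range_query(self, start, end):
        # Scan only the portion of each run inside [start, end).
        out = []
        for run in self.runs:
            lo = bisect.bisect_left(run, (start,))
            hi = bisect.bisect_left(run, (end,))
            out.extend(run[lo:hi])
        return sorted(out)
```

Real engines keep runs on disk and merge incrementally rather than all at once, but the read/write cost trade-off is the same.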
| kaak3 wrote:
| Uber recently blogged that they rebuilt the log analytics
| platform based on ClickHouse, replacing the previous ELK based
| one. The table schema choices made it easy to handle JSON
| formatted logs with changing schemas.
| https://eng.uber.com/logging/
| jetter wrote:
| Nice! Adding this to the post, thanks for the link!
| sylvinus wrote:
| ClickHouse is incredible. It has also replaced a large, expensive
| and slow Elasticsearch cluster at Contentsquare. We are actually
| starting an internal team to improve it and upstream patches,
| email me if interested!
| hodgesrm wrote:
| Your 2019 talk on this was great. I cited it above. Here's a
| link to slides to supplement others elsewhere in the thread:
| https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...
| jetter wrote:
| Yep, do you guys have a writeup on this? Altinity actually
| mention Contentsquare case in their video, here:
| https://www.youtube.com/watch?t=2479&v=pZkKsfr8n3M&feature=y...
| mgachka wrote:
| Hi
|
         | I'm the guy who did the 2 presentations on Clickhouse at
         | ContentSquare. There are no blog posts on the migration from
| ES to CH. But you can find the slides of the 2018
| presentation here
| https://www.slideshare.net/VianneyFOUCAULT/clickhouse-
| meetup... And the slides of the 2019 presentation here
| https://www.slideshare.net/VianneyFOUCAULT/meetup-a-
| successf...
|
| There is also a video recording of the 2019 presentation
| available here. https://www.youtube.com/watch?v=lwYSYMwpJOU
| nb: The video is not great because the camera is often losing
| focus but it's still understandable.
| polote wrote:
| I'm not sure there is a public writeup. I know that the
| incredibly talented guy who created the first CH setup at CS
         | planned to write a more global post about data analytics at
         | scale, but after 2 years I'm still waiting for it
| mgachka wrote:
         | I stopped answering people about the release date of my
         | next blog post because I'm always postponing it ;-).
         |
         | But don't worry Paul, the day I release it you'll be one
         | of the first to be informed.
| sylvinus wrote:
| I'll remind him about the post ;)
| wikibob wrote:
| Can you share some more details?
|
| How many nodes on both? How much data ingested and stored?
| What's the query load?
| AurimasJLT wrote:
| https://github.com/ClickHouse/clickhouse-
| presentations/blob/... and the presentation itself
       | https://www.youtube.com/watch?v=lwYSYMwpJOU 300 Elastic nodes
       | vs 12 ClickHouse nodes / 260 TB / lots of queries
| harporoeder wrote:
| I don't have any production experience running Clickhouse, but I
| have used it on a side project for an OLAP workload. Compared to
| Postgres Clickhouse was a couple orders of magnitude faster (for
| the query pattern), and it was pretty easy to setup a single node
| configuration compared to lots of the "big data" stuff.
| Clickhouse is really a game changer.
| dominotw wrote:
| what happens when your data doesn't fit in a single node
| anymore?
| aseipp wrote:
       | There's replication in ClickHouse and you can just shove
       | reads off to a replica if you'd like. From a backup/safety
| standpoint that's important, but I think there are other
| options besides just replicas, of course.
|
| From an operations standpoint, however, ClickHouse is
| ridiculously efficient at what it does. You can store tens of
| billions, probably trillions of records on a single node
| machine. You can query at tens of billions of rows a second,
| etc, all with SQL. (The only competitor I know of in the same
| class is MemSQL.) So another thing to keep in mind is you'll
| be able to go much further with a single node using
| ClickHouse than the alternatives. For OLAP style workloads,
| it's well worth investigating.
| [deleted]
| qaq wrote:
       | It scales to crazy numbers as you add nodes. In 2018
       | Cloudflare was ingesting 11M rows per second into their CH
       | cluster.
| tgtweak wrote:
| I think it's an unfair comparison, notably because:
|
| 1) Clickhouse is rigid-schema + append-only - you can't simply
| dump semi-structured data (csv/json/documents) into it and worry
| about schema (index definition) + querying later. The only
| clickhouse integration I've seen up close had a lot of "json"
| blobs in it as a workaround, which cannot be queried with the
| same ease as in ES.
|
| 2) Clickhouse scalability is not as simple/documented as
| elasticsearch. You can set up a 200-node ES cluster with a
| relatively simple helm config or readily-available cloudformation
| recipe.
|
| 3) Elastic is more than elasticsearch - kibana and the "on top of
| elasticsearch" featureset is pretty substantial.
|
| 4) Every language/platform under the sun (except powerbi... god
| damnit) has native + mature client drivers for elasticsearch, and
| you can fall back to bog-standard http calls for querying if you
| need/want. ClickHouse supports some very elementary SQL
| primitives ("ANSI") and even those have some gotchas and are far
| from drop-in.
|
| In this manner, I think that clickhouse is better compared as a
| self-hosted alternative to Aurora and other cloud-native scalable
 | SQL databases, and less as a replacement for elasticsearch. If
 | you're using Elasticsearch for OLAP, you're probably better off
 | ETLing the semi-structured/raw data you specifically want out of
 | ES into a more suitable database which is meant for that.
| hodgesrm wrote:
| I'm the author of at least one of the ClickHouse video
| presentations referenced in the article as well as here on HN.
| ElasticSearch is a great product, but three of your points
| undersell ClickHouse capabilities considerably.
|
| 1.) ClickHouse JSON blobs are queryable and can be turned into
| columns as needed. The Uber engineering team posted a great
| write-up on their new log management platform, which uses these
| capabilities at large scale. One of the enabling ClickHouse
| features is ALTER TABLE commands that just change metadata, so
| you can extend schema very efficiently. [1]
|
| 2.) With reference to scalability, the question is not what it
| takes to get 200 nodes up and running but what you get from
| them. ClickHouse typically gets better query results on log
| management using far fewer resources than ElasticSearch.
| ContentSquare did a great talk on the performance gains
| including 10x speed-up in queries and 11x reduction in cost.
| [2]
|
| 3.) Kibana is excellent and well-liked by users. Elastic has
| done a great job on it. This is an area where the ClickHouse
| ecosystem needs to grow.
|
| 4.) This is just flat-out wrong. ClickHouse has a very powerful
   | SQL implementation that is particularly strong at helping to
| reduce I/O, compute aggregations efficiently and solve specific
| use cases like funnel analysis. It has the best implementation
| of arrays of any DBMS I know of. [3] Drivers are maturing
| rapidly but to be honest it's so easy to submit queries via
| HTTP that you don't need a driver for many use cases. My own
| team does that for PHP.
|
| I don't want to take away anything from Elastic's work--
| ElasticSearch and the ecosystem products are great, as shown by
| their wide adoption. At the same time ClickHouse is advancing
| very quickly and has much better capabilities than many people
| know.
|
| p.s., As far as ANSI capability, we're working on TPC-DS and
| have ClickHouse running at full steam on over 60% of the cases.
| That's up from 15% a year ago. We'll have more to say on that
| publicly later this year.
|
| [1] https://eng.uber.com/logging/
|
| [2] https://www.slideshare.net/VianneyFOUCAULT/meetup-a-
| successf...
|
| [3] https://altinity.com/blog/harnessing-the-power-of-
| clickhouse...
|
| p.s., I'm CEO of Altinity and work on ClickHouse, so usual
| disclaimers.
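The JSON-handling pattern from point 1 above (keep raw blobs queryable, promote hot fields to real columns later) can be sketched as a toy Python model; the class and method names here are made up for illustration and are not ClickHouse's API.

```python
import json

class LogTable:
    """Toy log table: raw JSON blobs, plus columns promoted on demand."""

    def __init__(self):
        self.raw = []      # every log line is kept as a raw JSON string
        self.columns = {}  # field name -> list of extracted values

    def insert(self, blob):
        self.raw.append(blob)
        parsed = json.loads(blob)
        for field, col in self.columns.items():
            col.append(parsed.get(field))

    def materialize(self, field):
        # Crude stand-in for promoting a JSON field to a real column:
        # extract it once so later reads skip JSON parsing entirely.
        self.columns[field] = [json.loads(b).get(field) for b in self.raw]

    def select(self, field):
        # Schema-on-read: use the column if it exists, else parse the blobs.
        if field in self.columns:
            return self.columns[field]
        return [json.loads(b).get(field) for b in self.raw]
```

In ClickHouse itself the schema change is a cheap metadata-only ALTER rather than an eager backfill like this, but the query-side effect is similar: fields you care about become ordinary columns.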
| jetter wrote:
| Thank you for what you guys do. Altinity blog and videos are
| an outstanding source of practical in-depth knowledge on the
| subject, so much needed for Clickhouse recognition.
| hodgesrm wrote:
| You are most welcome. The webinars and blog articles are
| incredibly fun to work on.
| lovedswain wrote:
| > you can't simply dump semi-structured data
| (csv/json/documents) into it and worry about schema (index
| definition) + querying later
|
| Unless you love rewrites, you can't simply dump semi-structured
| data into ElasticSearch either. Seen multiple apps with 5x or
| worse ES storage usage tied to 'data model' or lack thereof,
| and fixing it inevitably means revisiting every piece of code
| pushing stuff into and out of ES.
|
| I love ES but this notion of schema free is dumb, in practice
| it's a nightmare.
| ignoramous wrote:
| > _Elastic is more than elasticsearch..._
|
     | Grafana Labs' sponsored FOSS projects are probably an adequate
     | replacement for Elasticsearch? https://grafana.com/oss/
|
| > _...clickhouse is better compared as a self-hosted
| alternative to Aurora and other cloud-native scalable SQL
| databases_
|
     | Aurora would likely be worse at this than Redshift or
     | Snowflake.
| jetter wrote:
| I address your concern from #1 in "2. Flexible schema - but
| strict when you need it" section - take a look at
| https://www.youtube.com/watch?v=pZkKsfr8n3M&feature=emb_titl...
|
| Regarding #2: Clickhouse scalability is not simple, but I think
   | Elasticsearch scalability is not that simple either; they just
| have it out of the box, while in Clickhouse you have to use
| Zookeeper for it. I agree that for 200 nodes ES may be a better
| choice, especially for full text search. For 5 nodes of 10 TB
| logs data I would choose Clickhouse.
|
| #3 is totally true. I mention it in "Cons" section - Kibana and
| ecosystem may be a deal breaker for a lot of people.
|
   | #4. Clickhouse in 2021 has pretty good support in all major
| languages. And it can talk HTTP, too.
| hodgesrm wrote:
| Hi! Assuming you are author of the PixelJets article would
| you consider submitting a talk to the Percona Live Online
| 2021 Conference? It's all about open source databases. We're
| doing an analytics track and welcome submissions on any and
| all solutions based on open source analytic databases. CFP
| runs through 14 March.
|
| p.s., Everyone is welcome! If you see this and have a story
| please consider submitting. No marketing please. We are DBMS
| geeks.
|
| https://altinity.com/blog/call-for-papers-on-analytics-at-
| pe...
| jetter wrote:
       | Thank you for the invitation! Will definitely consider
| submitting my story.
| anaphor wrote:
| You might be able to just put whatever you want into an
| Elasticsearch index, but I wouldn't recommend doing that. It
| could severely limit how you can query your data later, see:
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
|
| Also it can cause performance problems if you have really
| heterogeneous data with lots of different fields
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
| BoorishBears wrote:
     | Yup, reading that comment all I thought was exactly what I
     | said in another comment here: it'll work great until it
     | doesn't, and by then you'll suffer a lot to work around it
|
| Same with scaling, scaling ES is super easy until you realize
| your index sizes aren't playing nicely with sharding or
| something and have to start working around that.
|
     | Clickhouse feels like it's targeting what most people end up
     | using ES for. Comparing it to ES and talking about what's
     | missing is kind of missing the point imo.
| free652 wrote:
| I manage a fairly small ES cluster of 20 i3en.2xlarge
| instances that ingest data from 300+ apps. Yes, the only
| problem I see is the field type collision and it happens
| occasionally.
|
       | Otherwise elastic doesn't require much operational time,
       | maybe an hour a week.
|
| You pretty much want to keep your indices around 50gb and
| the ILM works well to manage that.
| LASR wrote:
| Yeah we learned this the hard way. We had a field, status:
| 200. After a while, we introduced some new logging that
| emitted status: success.
|
| Since the auto-index used a number type instead of string, we
| hit some issues trying to reindex the already ingested data.
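The failure mode above (the first document seen locking in a field's type, as with dynamic mapping) can be mimicked in a short Python sketch; this is a toy illustration of the general problem, not Elasticsearch's actual mapper.

```python
class DynamicIndex:
    """Toy dynamic mapping: the first value seen locks in a field's type."""

    def __init__(self):
        self.mapping = {}  # field name -> locked-in Python type
        self.docs = []

    def index(self, doc):
        for field, value in doc.items():
            locked = self.mapping.setdefault(field, type(value))
            if type(value) is not locked:
                # Analogous to a mapping conflict: the field's type was
                # fixed by the first document, so this one is rejected.
                raise TypeError(
                    f"{field!r} is mapped as {locked.__name__}, "
                    f"got {type(value).__name__}"
                )
        self.docs.append(doc)
```

Once a document like `{"status": 200}` has been indexed, a later `{"status": "success"}` is rejected, and fixing the mapping means reindexing everything already ingested.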
| outworlder wrote:
| > In this manner, I think that clickhouse is better compared as
| a self-hosted alternative to Aurora and other cloud-native
| scalable SQL databases, and less a replacement for
| elasticsearch.
|
| Neither of which is normally used for logging.
|
| I am glad there are some alternatives to ELK. Elasticsearch is
| great, but it's not as great when you have to ingest terabytes
| of logs daily. You can do it, but at a very large resource cost
| (both computing and human). Managing shards is a headache with
| the logging use-case.
|
| Most logs don't have that much structure. A few fields, sure.
| For this, Elasticsearch is not only overkill, but also not very
| well suited. This is the reason why placing Kafka in front of
| Elasticsearch for ingestion is rather popular.
| LastMuel wrote:
| I think the author addresses your point one in the article:
|
| > SQL is a perfect language for analytics. I love SQL query
| language and SQL schema is a perfect example of boring tech
| that I recommend to use as a source of truth for all the data
| in 99% of projects: if the project code is not perfect, you can
| improve it relatively easily if your database state is strongly
| structured. If your database state is a huge JSON blob (NoSQL)
| and no-one can fully grasp the structure of this data, this
| refactoring usually gets much more problematic.
|
| > I saw this happening, especially in older projects with
| MongoDB, where every new analytics report and every new
| refactoring involving data migration is a big pain.
|
| They're arguing that using non-structured, or variable
| structured data is actually a developmental burden and the
| flexibility it provides actually makes log analysis harder.
|
| It seems that the "json" blobs are a symptom of the problem,
| not the cause of it.
| marcinzm wrote:
| That argument would apply to production backend databases but
| I don't see how it really applies to logs. It's like they
       | just copy and pasted a generic argument regarding structured
       | data without taking the context into account.
|
| Logs tend to be rarely read but often written. They also age
| very quickly and old logs are very rarely read. So putting
| effort to unify the schemas on write seems very wasteful
| versus doing so on read. Most of the queries are also text
| search rather than structured requests so the chance of
| missing something on read due to bad unification is very low.
| leokennis wrote:
| I disagree with the author on that.
|
| Yes, SQL is nicer for structured queries, sure ("KQL" in
| Kibana is sort of a baby step into querying data stored in
| Elastic).
|
| But in Kibana, I can just type in (for example) a filename,
| and it will return any result row where that filename is part
| of any column of data.
|
| Also, if I need more structured results (for example, HTTP
| responses by an API grouped per hour per count), I can pretty
| easily do a visualization in Kibana.
|
| So yes, for 5% of use cases regarding exposing logging data,
| an SQL database of structured log events is preferred or
| necessary. For the other 95%, the convenience of just dumping
| files into Elastic makes it totally worth it.
| eeZah7Ux wrote:
| > ElasticSearch repo has jaw-dropping 1076 PRs merged for the
| same month
|
| Code change frequency is not a measure of quality or development
| speed.
|
| One organization can encourage bigger PRs while another encourage
| tiny, frequent changes.
|
| One can care about quality and stability while another can care
| very little about bugs.
| pachico wrote:
 | I've been using it successfully in production for a year and a
 | half. I can think of no other database that would give me real-
 | time aggregation over hundreds of millions of rows inserted every
 | day for virtually zero cost. It's just marvelous work.
| e12e wrote:
 | I just wish there was a FOSS Loki-like solution built on CH that
 | was stable and used in production.
 |
 | I know there are a few projects (see below) - but I'm not aware
 | of anything mature.
|
| https://github.com/QXIP/cloki-go
|
| https://github.com/lmangani/cloki
| guardiangod wrote:
| I am using Clickhouse at my workplace as a side project. I wrote
| a Rust app that dumps the daily traffic data collected from my
| company's products into a ClickHouse database.
|
| That's 1-5 billion rows, per day, with 60 days of data, onto a
| single i5 3500 desktop I have laying around. It returns a complex
| query in less than 5 minutes.
|
 | I was gonna get a beefier server, but 5 minutes is fine for my
| task. I was flabbergasted.
| akudha wrote:
| 5 billion rows per day? What does your product do?
| guardiangod wrote:
| Security products
| crb002 wrote:
| I wish they had a data store shoot-out like Techempower has for
| Web stacks.
| moralestapia wrote:
| Sorry to hijack the thread but can anyone suggest alternatives to
| the 'search' side of Elasticsearch?
|
| I haven't been following the topic and there's probably new and
| interesting developments like ClickHouse is for logging.
| ericcholis wrote:
| https://typesense.org/ comes to mind. Has support for Algolia's
| instantsearch.js as well.
| jetter wrote:
| https://github.com/meilisearch/MeiliSearch gets a lot of
| traction recently. There are also Sphinx and its fork
| https://manticoresearch.com/ - very lightweight and fast.
| KaoruAoiShiho wrote:
| I just use elasticsearch, nothing wrong with it so far.
| abrookins wrote:
| You can use Redis for full-text search and some more SQL-like
| queries, including aggregations, with RediSearch:
| https://oss.redislabs.com/redisearch/
| drewda wrote:
| Not sure if it's "new" but it's always interesting: Postgres
| offers a fine set of full-text search functionality, with the
| advantage of being able to also use the other ways in which
| Postgres shines: document-like data (storing JSON in columns),
| horizontal replication, PostGIS, and so on.
| stanmancan wrote:
| Someone recommended Meilisearch the other day. I've been
| playing around with it and it's pretty great so far. Still
| early in the project right now.
| valiant-comma wrote:
| Regarding #1 in the article, Elastic does have SQL query
| support[1]. I can't speak to performance or other comparative
| metrics, but it's worked well for my purposes.
|
| [1]
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
| proddata wrote:
 | If you are looking for an OSS ES replacement, CrateDB might also
 | be worth a look :)
|
| Basically a best of both worlds combination of ES and PostgreSQL,
| perfect for time-series and log analytics.
| dominotw wrote:
| How does clickhouse compare to druid, pinot, rockset
| (commercial), memsql (commercial). I know clickhouse is easier to
| deploy.
|
| But from user's perspective is clickhouse superior to the others?
| mgachka wrote:
   | FYI, we've been using clickhouse since 2018 at ContentSquare.
|
| I did a few POCs to compare clickhouse vs other databases on
| ContentSquare's use case. One of them was memSQL. Although
| memSQL was very good, since we don't need to JOIN big datasets
| or need killer features like fulltext search, clickhouse gave a
| better perf/cost ratio for us (I don't remember exactly but it
| was at least twice cheaper).
| caust1c wrote:
   | When Cloudflare was considering clickhouse, we did estimates on
   | just the hardware cost, and Druid's came out well over 10x what
   | clickhouse's would be, based on Druid's stated numbers on events
   | processed per compute unit.
| advaita wrote:
| Are you saying druid hardware costs were coming out to be 10x
| of clickhouse hardware costs?
|
| Caveat: English is not my first language so might have missed
| your point in translation. :)
| caust1c wrote:
| That's right, yeah. We would have had to buy 10x the
| servers in order to support the same workload that
| clickhouse could.
| jetter wrote:
| I've mentioned Pinot and Druid briefly in 2018 writeup:
| https://pixeljets.com/blog/clickhouse-as-a-replacement-for-e...
| (see "Compete with Pinot and Druid" )
| AurimasJLT wrote:
   | Yeah, Druid got blown away by ClickHouse at eBay: 700+ Druid
   | servers versus a 2-region fully replicated ClickHouse system of
   | 40 nodes. https://tech.ebayinc.com/engineering/ou-online-
| analytical-pr... and a webinar they did with us at Altinity
| https://www.youtube.com/watch?v=KI0AqpmcSOk&t=20s
| BoorishBears wrote:
| My biggest problem with Elasticsearch is how easy it is to get
| data in there and think everything is just fine... until it falls
| flat on its face the moment you hit some random use case that,
| according to Murphy's law, will also be a very important one.
|
| I wish Elasticsearch were maybe a little more opinionated in its
| defaults. In some ways Clickhouse feels like they filled the gap
| _not_ having opinionated defaults created. My usage is from a few
| years back so maybe things have improved
| djxfade wrote:
   | Would you care to elaborate on what happened in your case? My
| company is using ElasticSearch extensively, and it is mission
| critical for us. I fear something might happen one day
| BoorishBears wrote:
| It's been a few years so the details are fuzzy, but iirc it
| was just simple things like index sizing, managing shard
| count, some JVM tuning, certain aggregated fields blowing up
| once we had more data in our instances...
|
| We also got our data very out of order. We had embedded
| devices logging analytics that would phone home very
| infrequently, think months between check-ins
|
| I forget why but that became a big issue at some point,
| bringing the instance to its knees when a few devices started
| to phone-in covering large periods of time.
|
| ES just has a ton of knobs, I imagine if its been important
| to you, you have people specializing in keeping it running,
| which is great... but the amount of complexity there that is
| specific to ES is really high.
|
| It's not like there's no such thing as a Postgres expert for
| example, but you don't need to hire a Postgres wizard until
| you're pretty far in the weeds. But I feel like you should
| have an ES wizard to use ES, which is a little unfortunate
| the-alchemist wrote:
| Also wanted to share my overall positive experience with
| Clickhouse.
|
| UPSIDES
|
| * started a 3-node cluster using the official Docker images super
| quickly
|
| * ingested billions of rows super fast
|
| * great compression (of course, depends on your data's
| characteristics)
|
| * features like https://clickhouse.tech/docs/en/engines/table-
| engines/merget... are amazing to see
|
 | * ODBC support. I initially said "Who uses that??", but we used
 | it to connect to PostgreSQL, so we can keep the non-timeseries
 | data in PostgreSQL but still access PostgreSQL tables from
 | Clickhouse (!)
|
| * you can go the other way too: read Clickhouse from PostgreSQL
| (see https://github.com/Percona-Lab/clickhousedb_fdw, although we
| didn't try this)
|
| * PRs welcome, and quickly reviewed. (We improved the ODBC UUID
| support)
|
| * code quality is pretty high.
|
| DOWNSIDES
|
| * limited JOIN capabilities, which is expected from a timeseries-
| oriented database like Clickhouse. It's almost impossible to
| implement JOINs at this kind of scale. The philosophy is "If it
 | won't be fast at scale, we don't support it"
|
| * not-quite-standard SQL syntax, but they've been improving it
|
| * limited DELETE support, which is also expected from this kind
| of database, but rarely used in the kinds of environments that CH
| usually runs in (how often do people delete data from
| ElasticSearch?)
|
| It's really an impressive piece of engineering. Hats off to the
| Yandex crew.
| FridgeSeal wrote:
| Most minor of nitpicks:
|
| > timeseries-oriented database
|
| Technically it's a column oriented database that is good at
| time series stuff. I only say that because I know there are
| some databases that are even more specialised towards
| timeseries and ClickHouse can do way more.
| yamrzou wrote:
| Could you share more details about the limited JOIN
| capabilities? AFAIK, Clickhouse has multiple join algorithms
| and supports on-disk joins to avoid out of memory:
|
| https://github.com/ClickHouse/ClickHouse/issues/10830
|
| https://github.com/ClickHouse/ClickHouse/issues/9702#issueco...
| FridgeSeal wrote:
| Maybe I'm not doing anything particularly challenging with
| it, but I've not found anything lacking with the join
| functionality.
| hodgesrm wrote:
| > It's really an impressive piece of engineering. Hats off to
| the Yandex crew.
|
| And thousands of contributors! Toward the end of 2020 over 680
| unique users had submitted PRs and close to 2000 had opened
| issues. It's becoming a very large community.
| jcims wrote:
 | Does ClickHouse, or anything else out there, even remotely
 | compete with Splunk for adhoc troubleshooting/forensics/threat
 | hunting type work?
|
| I started off with Splunk and every time I try Elasticsearch I
| feel like I'm stuck in a cage. Probably why they can charge so
| much for it.
| supergirl wrote:
| why is splunk better than ES?
| jcims wrote:
| I really want to answer you but I'm struggling a bit b/c I
| haven't worked with ES in a minute. Splunk just tends to be
   | able to eat just about any kind of structured or unstructured
   | content, operates on a pipeline concept similar to that of a
   | unix shell or (gasp) powershell, and has a rich set of data
   | manipulation and modification commands built in:
|
| https://docs.splunk.com/Documentation/SplunkLight/7.3.6/Refe.
| ..
|
| I primarily use it for security-related analysis, which is
| lowish on metrics and high on adhoc folding and mutilation of
| a very diverse set of data structures and types.
| phillc73 wrote:
| > SQL is a perfect language for analytics.
|
| Slightly off topic, but I strongly agree with this statement and
| wonder why the languages used for a lot of data science work (R,
| Python) don't have such a strong focus on SQL.
|
| It might just be my brain, but SQL makes so much logical sense as
| a query language and, with small variances, is used to directly
| query so many databases.
|
 | In R, why learn the data.table (OK, speed) or dplyr paradigms,
| when SQL can be easily applied directly to dataframes? There are
| libraries to support this like sqldf[1], tidyquery[2] and
| duckdf[3] (author). And I'm sure the situation is similar in
| Python.
|
| This is not a post against great libraries like data.table and
| dplyr, which I do use from time to time. It's more of a question
 | about why SQL is not more popular as the query language du jour
| for data science.
|
| [1] https://cran.r-project.org/web/packages/sqldf/index.html
|
| [2] https://github.com/ianmcook/tidyquery
|
| [3] https://github.com/phillc73/duckdf
| jfim wrote:
| SQL doesn't compose all that well.
|
| For example, imagine that you have a complex query that handles
| a report. If someone says "hey we need the same report but with
| another filter on X," your options are to copy paste the SQL
| query with the change, create a view that can optionally have
| the filter (assuming the field that you'd want to filter on
| actually is still visible at the view level), or parse the SQL
| query into its tree form, mutate the tree, then turn it back
| into SQL.
|
| If you're using something like dplyr, then it's just an if
| statement when building your pipeline. Dbplyr also will
| generate SQL for you out of dplyr statements, it's pretty
| amazing IMHO.
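The composability argument can be made concrete with a small Python sketch in the same spirit as a dplyr pipeline: an optional filter becomes an if statement inside an ordinary function, not string surgery on SQL. The row fields used here (`last_login_days_ago`, `friend_count`, `country`) are hypothetical names for illustration.

```python
def active_user_report(rows, min_friends=None):
    """Active users per country; optional conditions are plain branches."""
    # Base filter: active within the last 30 days.
    users = [r for r in rows if r["last_login_days_ago"] <= 30]
    # Composing an extra condition is just an if statement.
    if min_friends is not None:
        users = [r for r in users if r["friend_count"] >= min_friends]
    # Aggregate: count of users per country, descending.
    counts = {}
    for r in users:
        counts[r["country"]] = counts.get(r["country"], 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])
```

Doing the same to a SQL report usually means editing the query text itself, because the filter has to land inside the right subquery or CTE.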
| MattConfluence wrote:
| > SQL doesn't compose all that well.
|
| On that topic, I really enjoy working in Elixir because Ecto
| [1] lets you write "SQL" with Elixir's composable functional
| syntax. It sits somewhere between "the language is compiled
| to SQL" and ORM. The Ruby-esque syntax took some getting used
| to, but once I was past that hurdle my productivity
| skyrocketed. It's not 100% feature complete compatibility
| with all the different SQL dialects, but most of what you'll
| need is there.
|
| [1] https://github.com/elixir-ecto/ecto
| phillc73 wrote:
| In a similar vein, I really like the LINQ style queries in
| the Julia package, Query.jl[1].
|
| [1] https://www.queryverse.org/Query.jl/stable/linqquerycom
| mands...
| dominotw wrote:
| > your options are to copy paste the SQL query with the
| change
|
     | There are things like dbt and LookML that bring templating and
     | macros on top of SQL. It's much better than copy/pasting
     | strings but worse than a full-fledged programming language.
| phillc73 wrote:
| I agree that dbplyr is a nice way to query databases, if
| already familiar with dplyr (actually I think dtplyr is more
| interesting for operating on data.tables). However, I'm not
| sure I really understand your point about the "if" statement.
|
| If the data is already in a dataframe, why not still use the
| "if" statement, but one of the packages I mentioned earlier
| to further modify the data?
|
| E.g.
|
     |     if (x == 10) {
     |       duckdf("SELECT * FROM df WHERE y >= 20")
     |     }
|
| I guess I was thinking about one query language to rule them
| all. If SQL was used everywhere, it's just about learning one
| paradigm and largely being able to use that across platform
| and language.
| jfim wrote:
       | It's for more complex queries, e.g. active users per
       | country:
       |
       |     WITH active_users AS
       |       (SELECT DISTINCT user_id, user_country
       |        FROM ...
       |        WHERE last_login >= NOW() - 1 month)
       |     SELECT user_country, COUNT(user_id) AS user_count
       |     FROM active_users
       |     GROUP BY user_country
       |     ORDER BY user_count DESC
|
       | Now imagine someone says "what about users that have at
       | least 5 friends?" If you're using dplyr and want to reuse
       | most of your logic, it's just a matter of doing something
       | like:
       |
       |     active_users_with_friends = active_users %>%
       |       filter(friend_count >= 5)
|
| The SQL version is much hairier, since it's just code
| that's within a string.
| phillc73 wrote:
| I think we're talking at cross purposes, and I don't want
| to belabour the point too much, but there's no need to
| modify the original SQL query (I'm imagining this pulls
| data from a database). If the data is already in a
| dataframe, use one of the libraries which can apply SQL
| to in-memory dataframes.
|
| e.g.
|
| active_users_with_friends <- duckdf("SELECT * FROM
| active_users WHERE friend_count >= 5")
| jfim wrote:
| There's an aggregation part that comes after the CTE, at
| which point the "friend_count" or "last_login" fields are
| not available anymore. "active_users" isn't a table, it's
| the result of the CTE in the query that returns distinct
| users that have a last_login in the last 30 days.
|
| Also, keep in mind that this is a pretty simple example,
| a more realistic one would probably have a half dozen to
| a dozen CTEs, some nested, at which point correctly
| mutating the SQL statement is not trivial.
| phillc73 wrote:
| > "active_users" isn't a table, it's the result of the
| CTE in the query
|
| Maybe we really are at cross purposes. In the duckdf
| example above "active_users" is a dataframe. The duckdf
| library applies SQL queries on dataframes. It's not
| (necessarily) querying an on-disk database.
|
| If you're querying data using dplyr then it's highly
| likely that data is already in a dataframe. By the same
| principle write the original SQL query to use SELECT with
| wildcard, to return everything. Then use a relevant
| library to apply SQL on dataframes and it's pretty much
| the same result as dplyr.
|
| That's not to say complex SQL queries shouldn't be used
| to only return the minimally required information from
| the database. Clearly, sometimes there's too much data to
| fit in memory. My point is that if you're using dplyr on
| dataframes (or tibbles), then you could just as easily
| use an SQL based query on the same dataframe.
| jfim wrote:
| > Then your dplyr example above doesn't work, because
| active_users doesn't exist.
|
| You can assume that it would contain the same data that
| would be in the CTE in the SQL query above. There's no
| such table in the database, it's a subquery.
|
| If your argument is that chaining multiple SQL queries on
| data frames is needed, then we're both saying the same
| thing. SQL doesn't compose well, hence the need for
| multiple queries. At that point it's not just SQL though,
| it's querying of dataframes using SQL _and_ composing the
| dataframes together using non-SQL code.
|
| Luckily, dplyr will actually compose properly, and you
| can get the actual SQL statement that is equivalent to
| the dplyr expression, which would be rather annoying to
| obtain programmatically.
| kristjansson wrote:
| I think the point from GP (which aligns with my thinking)
| is that more programmatic interfaces for querying data
| allow for control flow in constructing the query, not just
| in processing it. So you can conditionally add a predicate
| or whatever, and then run the whole query at once.
|
| Whereas with the SQL packages you mentioned, you either
| have to conditionally construct your query string (ew) or
| pull results back and express your conditional as
| operations on the resulting data frame.
|
| For pure dplyr-on-in-memory-data-frame there isn't much
| difference between the two, to be sure. For dplyr on
| dbplyr, or sparklyr, or for SparkR, or PySpark, or Dask, or
| pick-your-favorite-framework, the difference between
| operating on (effectively) query AST vs. result sets is
| huge.
| phillc73 wrote:
| Good points. That makes the larger than memory use case
| quite clear, and interesting.
| jerf wrote:
| I read down the thread as it stands now, and the basic
| problem is this. Write the following function:
|
|     Compose("SELECT a, b.revenue, b.name
|              FROM table INNER JOIN b ON whatever
|              WHERE name LIKE 'b%'",
|             "revenue > 10000")
|
| to yield a new valid SQL statement that composes these two
| fragments together in the way that it is obvious that I
| want. You basically can't. If you can in this small
| example, I can easily complexify the SQL until you can't
| anymore. Note how "revenue" in my supplemental WHERE clause
| may or may not be that "b.revenue", both in the sense that
| I may be asking for some _other_ revenue entirely, and in
| the sense that that particular fragment may have come from
| somewhere that has no way to know about the "b" part of
| the name; one of the major issues is namespacing like this,
| though it is far from the only one!
|
| It is almost certainly impossible to sanely implement this
| function literally in terms of strings. (If such a thing
| was possible, it would almost certainly have trash
| performance.) You need a richer set of data types and
| operations to permit this.
|
| It has been done. You can hold on to a symbolic
| representation of the SQL in some internal library
| representation. However, to "bind" to SQL in this manner
| requires a binding to every single feature of the SQL you
| want to use at a pretty deep level (more than just strings,
| you need to understand the full syntax tree), and speaking
| from experience, no matter how cleverly you try to start
| writing it at first it gets more complicated than you
| think. It is certainly a thing that has been done, but it
| is a _huuuuuuge_ project. Seriously. It would be a lot
| easier if we were using an underlying representation
| designed to do this sort of thing from the beginning.
|
| I like what you can do with SQL, but I kinda hate the way
| it was _juuust_ good enough to get ensconced and apparently
| permanently ensure that no improvement on it can ever get
| off the ground because no conceivable improvement can
| overcome the entrenched advantages SQL has.
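The "richer set of data types" point can be sketched with a toy Python query builder (entirely hypothetical, far from a real library): the query lives as structured data, composing becomes a list append, and a SQL string is only rendered at the end. Note this sketch sidesteps the namespacing problem raised above by making the caller qualify names like `b.revenue` explicitly.

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    select: list
    source: str
    where: list = field(default_factory=list)

    def compose(self, predicate: str) -> "Query":
        # Composition is a list append on the symbolic form,
        # not string surgery on an existing SQL statement.
        return Query(self.select, self.source, self.where + [predicate])

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.select)} FROM {self.source}"
        if self.where:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.where)
        return sql

q = Query(["a", "b.revenue", "b.name"],
          "table INNER JOIN b ON whatever",
          where=["name LIKE 'b%'"])
print(q.compose("b.revenue > 10000").to_sql())
```

This prints `SELECT a, b.revenue, b.name FROM table INNER JOIN b ON whatever WHERE (name LIKE 'b%') AND (b.revenue > 10000)`. Covering the full SQL grammar this way is exactly the "huge project" described above; the sketch only shows why the symbolic form composes where strings don't.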
| kristjansson wrote:
| I think the answer to your question is that, until relatively
| recently it was not possible to run arbitrary SQL against data
| frames in either language. sqldf and duckdf pass data off to
| another program (albeit in-process), run the query there, and
| pull the result set back to R, paying data movement and
| transportation costs along the way. Tidyquery is neat and
| avoids that issue, but it's also only 18 months old or so.
| phillc73 wrote:
| I agree that Tidyquery and duckdf are quite new, but sqldf
| has been around for a long time. The first package upload to
| CRAN was 2007.
|
| Also, duckdf uses duckdb[1] to register a virtual in-memory
| table, so there's very little overhead for data transport.
|
| [1] https://duckdb.org/docs/api/r
| kristjansson wrote:
| While sqldf has been around for a while, it's not been a
| performant choice, since it requires transforming and
| moving data into a database, albeit one that's in process.
| So with something like:
|
|     df <- data.frame(a=rnorm(5e8), b=runif(5e8))
|     sqldf::sqldf("select count(*) from df where a > 3")
|
| the query works, but takes 200 seconds and doubles the memory
| footprint of the process, whereas
|
|     nrow(df[df$a > 3, ])
|     sum(df$a > 3)
|
| take ~1.5 seconds and ~1s respectively on my machine.
|
| I appear to have been too pessimistic about duckdf/duckdb
| though. It's docs[1] claim that registering an R data.frame
| as a table
|
| > does not actually transfer data into DuckDB yet.
|
| Which implied to me that it just deferred transfer until
| the user runs a query. However, in a basic test just now:
|
|     con = dbConnect(duckdb::duckdb(), ":memory:")
|     duckdb::duckdb_register(con, "iris_view", iris)
|     duckdb::duckdb_register(con, "df_view", df)
|     dbGetQuery(con, "select count(*) from df_view where a > 0")
|
| it appears to execute the query directly against the data
| frame. At least, it runs in ~2.2s, and doesn't allocate
| more memory. Cool! As you've noted though, it's very new -
| looks like they released 0.1.0 last May?
|
| I think the point stands: until very recently, SQL-on-
| dataframes was not a viable choice for anyone working at
| even moderate scales in R or Python, so preference has been
| for APIs exposed by libraries (pandas, data.table, dplyr,
| ...) that offered users reasonable performance, even if SQL
| would have been a more ergonomic interface.
|
| [1] https://duckdb.org/docs/api/r
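The transfer cost described above can be imitated with Python's stdlib sqlite3, another in-process SQL engine (sizes shrunk so it runs in moments; at the 5e8-row scale in the benchmark, the copy step is what dominates; all names here are illustrative):

```python
import sqlite3
import random

random.seed(0)
data = [(random.gauss(0, 1),) for _ in range(100_000)]

# sqldf-style: copy every row into an in-process database, then query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (a REAL)")
con.executemany("INSERT INTO df VALUES (?)", data)  # the costly transfer
(sql_count,) = con.execute("SELECT COUNT(*) FROM df WHERE a > 3").fetchone()

# Native-style: filter the data where it already lives, no copy.
native_count = sum(1 for (a,) in data if a > 3)

# Same answer either way; only the data movement differs.
assert sql_count == native_count
```

The duckdb virtual-table approach described above avoids the `executemany` copy step entirely, which is why it neither stalls nor doubles memory.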
| phillc73 wrote:
| Yeah, I agree sqldf is quite slow. Fair point.
|
| As you've seen, duckdb registers an "R data frame as a
| virtual table." I'm not sure what they mean by "yet"
| either.
|
| Of course it is possible to write an R dataframe to an
| on-disk duckdb table, if that's what you want to do.
|
| There are some simple benchmarks on the bottom of the
| duckdf README[1]. Essentially I found for basic SQL
| SELECT queries, dplyr is quicker, but for much more
| complex queries, the duckdf/duckdb combination performs
| better.
|
| If you really want speed of course, just use data.table.
|
| [1] https://github.com/phillc73/duckdf
| kristjansson wrote:
| Didn't realize duckdf was your package. Cool! Definitely
| something I'll consider for future work, though I spend
| more time on the Spark-and-Python side of the fence these
| days.
|
| If you wanted to add corresponding memory benchmarks the
| value-prop of duckdf might be clearer to those of us that
| have been scarred by sqldf :).
| phillc73 wrote:
| Sounds like it could be an interesting comparison. I'll
| look into it.
| alexisread wrote:
| I'm late to the party here, but as stated in other comments,
| dplyr / LINQ etc. compose better, so you can extend a query
| more easily.
|
| Another advantage of composable syntaxes is that you can lazily
| evaluate parts of a query, and combine others, which allows the
| equivalent of a CTE on parts of a query you need to optimise.
|
| It's also true that SQL allows pushdown of queries to the
| database much more easily than composable syntaxes do, as SQL
| is usually the native language there. As such it can make for
| a more exploratory syntax than, say, GraphQL, where you need
| to define joins at design time.
| marcinzm wrote:
| SQL tends to be non-composable, which makes complicated
| scripts really messy to refactor, modify, or even read. CTEs
| make it more manageable, but they're a relatively recent
| addition and don't fully solve the problem. Data science
| tends to involve a lot of modifying of the same code rather
| than writing one-off or static scripts.
| js2 wrote:
| Sentry.io is using ClickHouse for search, with an API they built
| on top of it to make it easier to transition if need be. They
| blogged about it at the time they adopted it:
|
| https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-...
| moralsupply wrote:
| I'm happy that more people are "discovering" ClickHouse.
|
| ClickHouse is an outstanding product, with great capabilities
| that serve a wide array of big data use cases.
|
| It's simple to deploy, simple to operate, simple to ingest large
| amounts of data, simple to scale, and simple to query.
|
| We've been using ClickHouse to handle 100's of TB of data for
| workloads that require ranking on multi-dimensional timeseries
| aggregations, and we can resolve most complex queries in less
| than 500ms under load.
| Bluestein wrote:
| Read that as "Clubhouse" on first glance ...
|
| ... and was quite intrigued :)
|
| PS: Tough crowd tonite ...
|
It would appear the local crowd does not take well to a bit of
tongue-in-cheek. Besides, it actually happened as described.
| pantulis wrote:
| Take my upvote, because the same happened to me.
| Bluestein wrote:
| Could not have been the only one :)
| didip wrote:
| Does ClickHouse have integration with Superset and Grafana?
| daniel_levine wrote:
| Yes to both, Altinity maintains the ClickHouse Grafana plugin
| https://altinity.com/blog/2019/12/28/creating-beautiful-graf...
|
| And Superset has a recommendation of a ClickHouse connector
| https://superset.apache.org/docs/databases/clickhouse
| dabeeeenster wrote:
| I've been recording a podcast with Commercial Open Source company
| founders (Plug! https://www.flagsmith.com/podcast) and have been
| surprised how often Clickhouse has come up. It is _always_
| referred to with glowing praise: "couldn't have built our
| business without it", etc.
| jabo wrote:
| Thanks for sharing your podcast! Just subscribed.
|
| To add yet another data point, we use Clickhouse as well for
| centralized logging for the SaaS version of our open source
| product, and can't imagine what we would have done without it.
| wiradikusuma wrote:
| Anyone know more lightweight alternative to (ELK) Elastic Stack?
| I found https://vector.dev but it seems to be only the "L" part.
| webo wrote:
| Have you looked into Google Cloud Logging (Stackdriver)? It's
| the most affordable and decent-enough solution we've found. The
| only issue is querying can be slow on large volumes.
| spyspy wrote:
| Have you tried creating a sink from stackdriver to bigquery?
| webo wrote:
| Hi, not really, because my understanding is BigQuery
| requires structured data, whereas Stackdriver just accepts
| an arbitrary JSON payload that can then be queried. Do you
| have any experience with it?
| ehnto wrote:
| I found Lucene's base library really easy to use without the
| configuration/infrastructure overhead of Elasticsearch, but
| haven't experienced it at scale:
|
| https://lucene.apache.org/
| rzzzt wrote:
| Solr is the equivalent-ish of ES, if you are looking for a
| search server instead of a library that can be embedded:
| https://lucene.apache.org/solr/
| johnx123-up wrote:
| What about https://github.com/meilisearch/MeiliSearch ?
| Xylakant wrote:
| Promtail/Loki https://github.com/grafana/loki is an alternative
| to ELK, but while it seems more lightweight, it is definitely
| less featureful. The integration with Grafana/Prometheus seems
| nice, but I've only toyed with it, not used it in production.
___________________________________________________________________
(page generated 2021-03-02 23:00 UTC)