[HN Gopher] ClickHouse as an alternative to Elasticsearch for lo...
___________________________________________________________________
ClickHouse as an alternative to Elasticsearch for log storage and
analysis
Author : jetter
Score : 269 points
Date : 2021-03-02 15:40 UTC (7 hours ago)
(HTM) web link (pixeljets.com)
(TXT) w3m dump (pixeljets.com)
| wakatime wrote:
| A related database using ideas from Clickhouse:
|
| https://github.com/VictoriaMetrics/VictoriaMetrics
| wikibob wrote:
| Are you familiar with VictoriaMetrics?
|
| Can you elaborate on how it is similar and dissimilar to
| Clickhouse?
|
| What specific techniques are the same?
| ekimekim wrote:
| The core storage engine borrows heavily from it - I'll
| attempt to summarize and apologies for any errors, it's been
| a while since I worked with VictoriaMetrics or ClickHouse.
|
| Basically data is stored in sorted "runs". Appending is cheap
| because you just create a new run. You have a background
| "merge" operation that coalesces runs into larger runs
| periodically, amortizing write costs. Reads are very
| efficient as long as you're doing range queries (very likely
| on a time-series database) as you need only linearly scan the
| portion of each run that contains your time range.
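The run-based design described above can be sketched in a few lines of Python (a toy model, not the actual VictoriaMetrics or ClickHouse code): appends create new sorted runs, a background merge coalesces them, and a range query scans only the relevant slice of each run.

```python
import bisect

class RunStore:
    """Toy LSM-style store: rows live in sorted runs of (timestamp, value)."""

    def __init__(self):
        self.runs = []  # each run is a sorted list of (timestamp, value)

    def append(self, batch):
        # Appending is cheap: sort the incoming batch and keep it as a new run.
        self.runs.append(sorted(batch))

    def merge(self):
        # Background "merge": coalesce all runs into one larger sorted run,
        # amortizing write costs over time.
        self.runs = [sorted(row for run in self.runs for row in run)]

    def range_query(self, start, end):
        # Scan only the portion of each run inside [start, end).
        out = []
        for run in self.runs:
            lo = bisect.bisect_left(run, (start,))
            hi = bisect.bisect_left(run, (end,))
            out.extend(run[lo:hi])
        return sorted(out)
```

Real engines keep runs on disk and merge incrementally rather than all at once, but the read/write cost trade-off is the same.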
| kaak3 wrote:
| Uber recently blogged that they rebuilt the log analytics
| platform based on ClickHouse, replacing the previous ELK based
| one. The table schema choices made it easy to handle JSON
| formatted logs with changing schemas.
| https://eng.uber.com/logging/
| jetter wrote:
| Nice! Adding this to the post, thanks for the link!
| sylvinus wrote:
| ClickHouse is incredible. It has also replaced a large, expensive
| and slow Elasticsearch cluster at Contentsquare. We are actually
| starting an internal team to improve it and upstream patches,
| email me if interested!
| hodgesrm wrote:
| Your 2019 talk on this was great. I cited it above. Here's a
| link to slides to supplement others elsewhere in the thread:
| https://www.slideshare.net/VianneyFOUCAULT/meetup-a-successf...
| jetter wrote:
| Yep, do you guys have a writeup on this? Altinity actually
| mention Contentsquare case in their video, here:
| https://www.youtube.com/watch?t=2479&v=pZkKsfr8n3M&feature=y...
| mgachka wrote:
| Hi
|
         | I'm the guy who did the 2 presentations on Clickhouse at
         | ContentSquare. There are no blog posts on the migration from
| ES to CH. But you can find the slides of the 2018
| presentation here
| https://www.slideshare.net/VianneyFOUCAULT/clickhouse-
| meetup... And the slides of the 2019 presentation here
| https://www.slideshare.net/VianneyFOUCAULT/meetup-a-
| successf...
|
| There is also a video recording of the 2019 presentation
| available here. https://www.youtube.com/watch?v=lwYSYMwpJOU
| nb: The video is not great because the camera is often losing
| focus but it's still understandable.
| polote wrote:
| I'm not sure there is a public writeup. I know that the
| incredibly talented guy who created the first CH setup at CS
         | planned to write a more global post about data analytics at
         | scale, but after 2 years I'm still waiting for it
| mgachka wrote:
         | I stopped answering people about the release date of my
         | next blog post because I'm always postponing it ;-).
         |
         | But don't worry Paul, the day I release it you'll be one
         | of the first to be informed.
| sylvinus wrote:
| I'll remind him about the post ;)
| wikibob wrote:
| Can you share some more details?
|
| How many nodes on both? How much data ingested and stored?
| What's the query load?
| AurimasJLT wrote:
| https://github.com/ClickHouse/clickhouse-
| presentations/blob/... and the presentation itself
       | https://www.youtube.com/watch?v=lwYSYMwpJOU 300 Elastic nodes
       | vs 12 ClickHouse nodes / 260 TB / lots of queries
| harporoeder wrote:
| I don't have any production experience running Clickhouse, but I
| have used it on a side project for an OLAP workload. Compared to
| Postgres Clickhouse was a couple orders of magnitude faster (for
| the query pattern), and it was pretty easy to setup a single node
| configuration compared to lots of the "big data" stuff.
| Clickhouse is really a game changer.
| dominotw wrote:
| what happens when your data doesn't fit in a single node
| anymore?
| aseipp wrote:
       | There's replication in ClickHouse and you can just shove
       | reads off to a replica if you'd like. From a backup/safety
| standpoint that's important, but I think there are other
| options besides just replicas, of course.
|
| From an operations standpoint, however, ClickHouse is
| ridiculously efficient at what it does. You can store tens of
| billions, probably trillions of records on a single node
| machine. You can query at tens of billions of rows a second,
| etc, all with SQL. (The only competitor I know of in the same
| class is MemSQL.) So another thing to keep in mind is you'll
| be able to go much further with a single node using
| ClickHouse than the alternatives. For OLAP style workloads,
| it's well worth investigating.
| [deleted]
| qaq wrote:
       | It scales to crazy numbers as you add nodes. In 2018
       | Cloudflare was ingesting 11M rows per second into their CH
       | cluster.
| tgtweak wrote:
| I think it's an unfair comparison, notably because:
|
| 1) Clickhouse is rigid-schema + append-only - you can't simply
| dump semi-structured data (csv/json/documents) into it and worry
| about schema (index definition) + querying later. The only
| clickhouse integration I've seen up close had a lot of "json"
| blobs in it as a workaround, which cannot be queried with the
| same ease as in ES.
|
| 2) Clickhouse scalability is not as simple/documented as
| elasticsearch. You can set up a 200-node ES cluster with a
| relatively simple helm config or readily-available cloudformation
| recipe.
|
| 3) Elastic is more than elasticsearch - kibana and the "on top of
| elasticsearch" featureset is pretty substantial.
|
| 4) Every language/platform under the sun (except powerbi... god
| damnit) has native + mature client drivers for elasticsearch, and
| you can fall back to bog-standard http calls for querying if you
| need/want. ClickHouse supports some very elementary SQL
| primitives ("ANSI") and even those have some gotchas and are far
| from drop-in.
|
| In this manner, I think that clickhouse is better compared as a
| self-hosted alternative to Aurora and other cloud-native scalable
 | SQL databases, and less as a replacement for elasticsearch. If
 | you're using Elasticsearch for OLAP, you're probably better off
 | ETLing the semi-structured/raw data you specifically want out of
 | ES into a more suitable database which is meant for that.
| hodgesrm wrote:
| I'm the author of at least one of the ClickHouse video
| presentations referenced in the article as well as here on HN.
| ElasticSearch is a great product, but three of your points
| undersell ClickHouse capabilities considerably.
|
| 1.) ClickHouse JSON blobs are queryable and can be turned into
| columns as needed. The Uber engineering team posted a great
| write-up on their new log management platform, which uses these
| capabilities at large scale. One of the enabling ClickHouse
| features is ALTER TABLE commands that just change metadata, so
| you can extend schema very efficiently. [1]
|
| 2.) With reference to scalability, the question is not what it
| takes to get 200 nodes up and running but what you get from
| them. ClickHouse typically gets better query results on log
| management using far fewer resources than ElasticSearch.
| ContentSquare did a great talk on the performance gains
| including 10x speed-up in queries and 11x reduction in cost.
| [2]
|
| 3.) Kibana is excellent and well-liked by users. Elastic has
| done a great job on it. This is an area where the ClickHouse
| ecosystem needs to grow.
|
| 4.) This is just flat-out wrong. ClickHouse has a very powerful
   | SQL implementation that is particularly strong at helping to
| reduce I/O, compute aggregations efficiently and solve specific
| use cases like funnel analysis. It has the best implementation
| of arrays of any DBMS I know of. [3] Drivers are maturing
| rapidly but to be honest it's so easy to submit queries via
| HTTP that you don't need a driver for many use cases. My own
| team does that for PHP.
|
| I don't want to take away anything from Elastic's work--
| ElasticSearch and the ecosystem products are great, as shown by
| their wide adoption. At the same time ClickHouse is advancing
| very quickly and has much better capabilities than many people
| know.
|
| p.s., As far as ANSI capability, we're working on TPC-DS and
| have ClickHouse running at full steam on over 60% of the cases.
| That's up from 15% a year ago. We'll have more to say on that
| publicly later this year.
|
| [1] https://eng.uber.com/logging/
|
| [2] https://www.slideshare.net/VianneyFOUCAULT/meetup-a-
| successf...
|
| [3] https://altinity.com/blog/harnessing-the-power-of-
| clickhouse...
|
| p.s., I'm CEO of Altinity and work on ClickHouse, so usual
| disclaimers.
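The JSON-handling pattern from point 1 above (keep raw blobs queryable, promote hot fields to real columns later) can be sketched as a toy Python model; the class and method names here are made up for illustration and are not ClickHouse's API.

```python
import json

class LogTable:
    """Toy log table: raw JSON blobs, plus columns promoted on demand."""

    def __init__(self):
        self.raw = []      # every log line is kept as a raw JSON string
        self.columns = {}  # field name -> list of extracted values

    def insert(self, blob):
        self.raw.append(blob)
        parsed = json.loads(blob)
        for field, col in self.columns.items():
            col.append(parsed.get(field))

    def materialize(self, field):
        # Crude stand-in for promoting a JSON field to a real column:
        # extract it once so later reads skip JSON parsing entirely.
        self.columns[field] = [json.loads(b).get(field) for b in self.raw]

    def select(self, field):
        # Schema-on-read: use the column if it exists, else parse the blobs.
        if field in self.columns:
            return self.columns[field]
        return [json.loads(b).get(field) for b in self.raw]
```

In ClickHouse itself the schema change is a cheap metadata-only ALTER rather than an eager backfill like this, but the query-side effect is similar: fields you care about become ordinary columns.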
| jetter wrote:
| Thank you for what you guys do. Altinity blog and videos are
| an outstanding source of practical in-depth knowledge on the
| subject, so much needed for Clickhouse recognition.
| hodgesrm wrote:
| You are most welcome. The webinars and blog articles are
| incredibly fun to work on.
| lovedswain wrote:
| > you can't simply dump semi-structured data
| (csv/json/documents) into it and worry about schema (index
| definition) + querying later
|
| Unless you love rewrites, you can't simply dump semi-structured
| data into ElasticSearch either. Seen multiple apps with 5x or
| worse ES storage usage tied to 'data model' or lack thereof,
| and fixing it inevitably means revisiting every piece of code
| pushing stuff into and out of ES.
|
| I love ES but this notion of schema free is dumb, in practice
| it's a nightmare.
| ignoramous wrote:
| > _Elastic is more than elasticsearch..._
|
     | Grafana Labs' sponsored FOSS projects are probably an adequate
     | replacement for Elasticsearch? https://grafana.com/oss/
|
| > _...clickhouse is better compared as a self-hosted
| alternative to Aurora and other cloud-native scalable SQL
| databases_
|
     | Aurora would likely be worse at this than Redshift or
     | Snowflake.
| jetter wrote:
| I address your concern from #1 in "2. Flexible schema - but
| strict when you need it" section - take a look at
| https://www.youtube.com/watch?v=pZkKsfr8n3M&feature=emb_titl...
|
| Regarding #2: Clickhouse scalability is not simple, but I think
   | Elasticsearch scalability is not that simple either; they just
| have it out of the box, while in Clickhouse you have to use
| Zookeeper for it. I agree that for 200 nodes ES may be a better
| choice, especially for full text search. For 5 nodes of 10 TB
| logs data I would choose Clickhouse.
|
| #3 is totally true. I mention it in "Cons" section - Kibana and
| ecosystem may be a deal breaker for a lot of people.
|
   | #4. Clickhouse in 2021 has pretty good support in all major
| languages. And it can talk HTTP, too.
| hodgesrm wrote:
| Hi! Assuming you are author of the PixelJets article would
| you consider submitting a talk to the Percona Live Online
| 2021 Conference? It's all about open source databases. We're
| doing an analytics track and welcome submissions on any and
| all solutions based on open source analytic databases. CFP
| runs through 14 March.
|
| p.s., Everyone is welcome! If you see this and have a story
| please consider submitting. No marketing please. We are DBMS
| geeks.
|
| https://altinity.com/blog/call-for-papers-on-analytics-at-
| pe...
| jetter wrote:
       | Thank you for the invitation! Will definitely consider
| submitting my story.
| anaphor wrote:
| You might be able to just put whatever you want into an
| Elasticsearch index, but I wouldn't recommend doing that. It
| could severely limit how you can query your data later, see:
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
|
| Also it can cause performance problems if you have really
| heterogeneous data with lots of different fields
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
| BoorishBears wrote:
     | Yup, reading that comment all I thought was exactly what I
     | said in another comment here: it'll work great until it
     | doesn't, and by then you'll suffer a lot to work around it
|
| Same with scaling, scaling ES is super easy until you realize
| your index sizes aren't playing nicely with sharding or
| something and have to start working around that.
|
     | Clickhouse feels like it's targeting what most people end up
     | using ES for. Comparing it to ES and talking about what's
     | missing is kind of missing the point imo.
| free652 wrote:
| I manage a fairly small ES cluster of 20 i3en.2xlarge
| instances that ingest data from 300+ apps. Yes, the only
| problem I see is the field type collision and it happens
| occasionally.
|
       | Otherwise elastic doesn't require much operational time,
       | maybe an hour a week.
|
| You pretty much want to keep your indices around 50gb and
| the ILM works well to manage that.
| LASR wrote:
| Yeah we learned this the hard way. We had a field, status:
| 200. After a while, we introduced some new logging that
| emitted status: success.
|
| Since the auto-index used a number type instead of string, we
| hit some issues trying to reindex the already ingested data.
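The failure mode above (the first document seen locking in a field's type, as with dynamic mapping) can be mimicked in a short Python sketch; this is a toy illustration of the general problem, not Elasticsearch's actual mapper.

```python
class DynamicIndex:
    """Toy dynamic mapping: the first value seen locks in a field's type."""

    def __init__(self):
        self.mapping = {}  # field name -> locked-in Python type
        self.docs = []

    def index(self, doc):
        for field, value in doc.items():
            locked = self.mapping.setdefault(field, type(value))
            if type(value) is not locked:
                # Analogous to a mapping conflict: the field's type was
                # fixed by the first document, so this one is rejected.
                raise TypeError(
                    f"{field!r} is mapped as {locked.__name__}, "
                    f"got {type(value).__name__}"
                )
        self.docs.append(doc)
```

Once a document like `{"status": 200}` has been indexed, a later `{"status": "success"}` is rejected, and fixing the mapping means reindexing everything already ingested.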
| outworlder wrote:
| > In this manner, I think that clickhouse is better compared as
| a self-hosted alternative to Aurora and other cloud-native
| scalable SQL databases, and less a replacement for
| elasticsearch.
|
| Neither of which is normally used for logging.
|
| I am glad there are some alternatives to ELK. Elasticsearch is
| great, but it's not as great when you have to ingest terabytes
| of logs daily. You can do it, but at a very large resource cost
| (both computing and human). Managing shards is a headache with
| the logging use-case.
|
| Most logs don't have that much structure. A few fields, sure.
| For this, Elasticsearch is not only overkill, but also not very
| well suited. This is the reason why placing Kafka in front of
| Elasticsearch for ingestion is rather popular.
| LastMuel wrote:
| I think the author addresses your point one in the article:
|
| > SQL is a perfect language for analytics. I love SQL query
| language and SQL schema is a perfect example of boring tech
| that I recommend to use as a source of truth for all the data
| in 99% of projects: if the project code is not perfect, you can
| improve it relatively easily if your database state is strongly
| structured. If your database state is a huge JSON blob (NoSQL)
| and no-one can fully grasp the structure of this data, this
| refactoring usually gets much more problematic.
|
| > I saw this happening, especially in older projects with
| MongoDB, where every new analytics report and every new
| refactoring involving data migration is a big pain.
|
| They're arguing that using non-structured, or variable
| structured data is actually a developmental burden and the
| flexibility it provides actually makes log analysis harder.
|
| It seems that the "json" blobs are a symptom of the problem,
| not the cause of it.
| marcinzm wrote:
| That argument would apply to production backend databases but
| I don't see how it really applies to logs. It's like they
       | just copy and pasted a generic argument regarding structured
       | data without taking the context into account.
|
| Logs tend to be rarely read but often written. They also age
| very quickly and old logs are very rarely read. So putting
| effort to unify the schemas on write seems very wasteful
| versus doing so on read. Most of the queries are also text
| search rather than structured requests so the chance of
| missing something on read due to bad unification is very low.
| leokennis wrote:
| I disagree with the author on that.
|
| Yes, SQL is nicer for structured queries, sure ("KQL" in
| Kibana is sort of a baby step into querying data stored in
| Elastic).
|
| But in Kibana, I can just type in (for example) a filename,
| and it will return any result row where that filename is part
| of any column of data.
|
| Also, if I need more structured results (for example, HTTP
| responses by an API grouped per hour per count), I can pretty
| easily do a visualization in Kibana.
|
| So yes, for 5% of use cases regarding exposing logging data,
| an SQL database of structured log events is preferred or
| necessary. For the other 95%, the convenience of just dumping
| files into Elastic makes it totally worth it.
| eeZah7Ux wrote:
| > ElasticSearch repo has jaw-dropping 1076 PRs merged for the
| same month
|
| Code change frequency is not a measure of quality or development
| speed.
|
| One organization can encourage bigger PRs while another encourage
| tiny, frequent changes.
|
| One can care about quality and stability while another can care
| very little about bugs.
| pachico wrote:
 | I've been using it successfully in production for a year and a
 | half. I can think of no other database that would give me real-
 | time aggregation over hundreds of millions of rows inserted every
 | day for virtually zero cost. It's just marvelous work.
| e12e wrote:
 | I just wish there was a FOSS Loki-like solution built on CH that
 | was stable and used in production.
 |
 | I know there are a few projects (see below) - but I'm not aware
 | of anything mature.
|
| https://github.com/QXIP/cloki-go
|
| https://github.com/lmangani/cloki
| guardiangod wrote:
| I am using Clickhouse at my workplace as a side project. I wrote
| a Rust app that dumps the daily traffic data collected from my
| company's products into a ClickHouse database.
|
| That's 1-5 billion rows, per day, with 60 days of data, onto a
| single i5 3500 desktop I have laying around. It returns a complex
| query in less than 5 minutes.
|
 | I was gonna get a beefier server, but 5 minutes is fine for my
| task. I was flabbergasted.
| akudha wrote:
| 5 billion rows per day? What does your product do?
| guardiangod wrote:
| Security products
| crb002 wrote:
| I wish they had a data store shoot-out like Techempower has for
| Web stacks.
| moralestapia wrote:
| Sorry to hijack the thread but can anyone suggest alternatives to
| the 'search' side of Elasticsearch?
|
| I haven't been following the topic and there's probably new and
| interesting developments like ClickHouse is for logging.
| ericcholis wrote:
| https://typesense.org/ comes to mind. Has support for Algolia's
| instantsearch.js as well.
| jetter wrote:
| https://github.com/meilisearch/MeiliSearch gets a lot of
| traction recently. There are also Sphinx and its fork
| https://manticoresearch.com/ - very lightweight and fast.
| KaoruAoiShiho wrote:
| I just use elasticsearch, nothing wrong with it so far.
| abrookins wrote:
| You can use Redis for full-text search and some more SQL-like
| queries, including aggregations, with RediSearch:
| https://oss.redislabs.com/redisearch/
| drewda wrote:
| Not sure if it's "new" but it's always interesting: Postgres
| offers a fine set of full-text search functionality, with the
| advantage of being able to also use the other ways in which
| Postgres shines: document-like data (storing JSON in columns),
| horizontal replication, PostGIS, and so on.
| stanmancan wrote:
| Someone recommended Meilisearch the other day. I've been
| playing around with it and it's pretty great so far. Still
| early in the project right now.
| valiant-comma wrote:
| Regarding #1 in the article, Elastic does have SQL query
| support[1]. I can't speak to performance or other comparative
| metrics, but it's worked well for my purposes.
|
| [1]
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
| proddata wrote:
 | If you are looking for an OSS ES replacement, CrateDB might also
 | be worth a look :)
|
| Basically a best of both worlds combination of ES and PostgreSQL,
| perfect for time-series and log analytics.
| dominotw wrote:
| How does clickhouse compare to druid, pinot, rockset
| (commercial), memsql (commercial). I know clickhouse is easier to
| deploy.
|
| But from user's perspective is clickhouse superior to the others?
| mgachka wrote:
   | FYI, we've been using clickhouse since 2018 at ContentSquare.
|
| I did a few POCs to compare clickhouse vs other databases on
| ContentSquare's use case. One of them was memSQL. Although
| memSQL was very good, since we don't need to JOIN big datasets
| or need killer features like fulltext search, clickhouse gave a
| better perf/cost ratio for us (I don't remember exactly but it
| was at least twice cheaper).
| caust1c wrote:
   | When Cloudflare was considering clickhouse, we did estimates on
   | just the hardware cost, and Druid's came out well over 10x what
   | clickhouse's would be, based on Druid's stated numbers on events
   | processed per compute unit.
| advaita wrote:
| Are you saying druid hardware costs were coming out to be 10x
| of clickhouse hardware costs?
|
| Caveat: English is not my first language so might have missed
| your point in translation. :)
| caust1c wrote:
| That's right, yeah. We would have had to buy 10x the
| servers in order to support the same workload that
| clickhouse could.
| jetter wrote:
| I've mentioned Pinot and Druid briefly in 2018 writeup:
| https://pixeljets.com/blog/clickhouse-as-a-replacement-for-e...
| (see "Compete with Pinot and Druid" )
| AurimasJLT wrote:
   | Yeah, Druid got blown away by ClickHouse at eBay: 700+ Druid
   | servers versus a 2-region fully replicated ClickHouse system of
   | 40 nodes. https://tech.ebayinc.com/engineering/ou-online-
| analytical-pr... and a webinar they did with us at Altinity
| https://www.youtube.com/watch?v=KI0AqpmcSOk&t=20s
| BoorishBears wrote:
| My biggest problem with Elasticsearch is how easy it is to get
| data in there and think everything is just fine... until it falls
| flat on its face the moment you hit some random use case that,
| according to Murphy's law, will also be a very important one.
|
| I wish Elasticsearch were maybe a little more opinionated in its
| defaults. In some ways Clickhouse feels like they filled the gap
| _not_ having opinionated defaults created. My usage is from a few
| years back so maybe things have improved
| djxfade wrote:
   | Would you care to elaborate on what happened in your case? My
| company is using ElasticSearch extensively, and it is mission
| critical for us. I fear something might happen one day
| BoorishBears wrote:
| It's been a few years so the details are fuzzy, but iirc it
| was just simple things like index sizing, managing shard
| count, some JVM tuning, certain aggregated fields blowing up
| once we had more data in our instances...
|
| We also got our data very out of order. We had embedded
| devices logging analytics that would phone home very
| infrequently, think months between check-ins
|
| I forget why but that became a big issue at some point,
| bringing the instance to its knees when a few devices started
| to phone-in covering large periods of time.
|
| ES just has a ton of knobs, I imagine if its been important
| to you, you have people specializing in keeping it running,
| which is great... but the amount of complexity there that is
| specific to ES is really high.
|
| It's not like there's no such thing as a Postgres expert for
| example, but you don't need to hire a Postgres wizard until
| you're pretty far in the weeds. But I feel like you should
| have an ES wizard to use ES, which is a little unfortunate
| the-alchemist wrote:
| Also wanted to share my overall positive experience with
| Clickhouse.
|
| UPSIDES
|
| * started a 3-node cluster using the official Docker images super
| quickly
|
| * ingested billions of rows super fast
|
| * great compression (of course, depends on your data's
| characteristics)
|
| * features like https://clickhouse.tech/docs/en/engines/table-
| engines/merget... are amazing to see
|
 | * ODBC support. I initially said "Who uses that??", but we used
 | it to connect to PostgreSQL, so we can keep the non-timeseries
 | data in PostgreSQL but still access PostgreSQL tables from
 | Clickhouse (!)
|
| * you can go the other way too: read Clickhouse from PostgreSQL
| (see https://github.com/Percona-Lab/clickhousedb_fdw, although we
| didn't try this)
|
| * PRs welcome, and quickly reviewed. (We improved the ODBC UUID
| support)
|
| * code quality is pretty high.
|
| DOWNSIDES
|
| * limited JOIN capabilities, which is expected from a timeseries-
| oriented database like Clickhouse. It's almost impossible to
| implement JOINs at this kind of scale. The philosophy is "If it
 | won't be fast at scale, we don't support it"
|
| * not-quite-standard SQL syntax, but they've been improving it
|
| * limited DELETE support, which is also expected from this kind
| of database, but rarely used in the kinds of environments that CH
| usually runs in (how often do people delete data from
| ElasticSearch?)
|
| It's really an impressive piece of engineering. Hats off to the
| Yandex crew.
| FridgeSeal wrote:
| Most minor of nitpicks:
|
| > timeseries-oriented database
|
| Technically it's a column oriented database that is good at
| time series stuff. I only say that because I know there are
| some databases that are even more specialised towards
| timeseries and ClickHouse can do way more.
| yamrzou wrote:
| Could you share more details about the limited JOIN
| capabilities? AFAIK, Clickhouse has multiple join algorithms
| and supports on-disk joins to avoid out of memory:
|
| https://github.com/ClickHouse/ClickHouse/issues/10830
|
| https://github.com/ClickHouse/ClickHouse/issues/9702#issueco...
| FridgeSeal wrote:
| Maybe I'm not doing anything particularly challenging with
| it, but I've not found anything lacking with the join
| functionality.
| hodgesrm wrote:
| > It's really an impressive piece of engineering. Hats off to
| the Yandex crew.
|
| And thousands of contributors! Toward the end of 2020 over 680
| unique users had submitted PRs and close to 2000 had opened
| issues. It's becoming a very large community.
| jcims wrote:
 | Does ClickHouse, or anything else out there, even remotely
 | compete with Splunk for adhoc troubleshooting/forensics/threat
 | hunting type work?
|
| I started off with Splunk and every time I try Elasticsearch I
| feel like I'm stuck in a cage. Probably why they can charge so
| much for it.
| supergirl wrote:
| why is splunk better than ES?
| jcims wrote:
| I really want to answer you but I'm struggling a bit b/c I
| haven't worked with ES in a minute. Splunk just tends to be
   | able to eat just about any kind of structured or unstructured
   | content, operates on a pipeline concept similar to that of a
   | unix shell or (gasp) powershell, and has a rich set of data
   | manipulation and modification commands built in:
|
| https://docs.splunk.com/Documentation/SplunkLight/7.3.6/Refe.
| ..
|
| I primarily use it for security-related analysis, which is
| lowish on metrics and high on adhoc folding and mutilation of
| a very diverse set of data structures and types.
| phillc73 wrote:
| > SQL is a perfect language for analytics.
|
| Slightly off topic, but I strongly agree with this statement and
| wonder why the languages used for a lot of data science work (R,
| Python) don't have such a strong focus on SQL.
|
| It might just be my brain, but SQL makes so much logical sense as
| a query language and, with small variances, is used to directly
| query so many databases.
|
 | In R, why learn the data.table (OK, speed) or dplyr paradigms,
| when SQL can be easily applied directly to dataframes? There are
| libraries to support this like sqldf[1], tidyquery[2] and
| duckdf[3] (author). And I'm sure the situation is similar in
| Python.
|
| This is not a post against great libraries like data.table and
| dplyr, which I do use from time to time. It's more of a question
 | about why SQL is not more popular as the query language du jour
| for data science.
|
| [1] https://cran.r-project.org/web/packages/sqldf/index.html
|
| [2] https://github.com/ianmcook/tidyquery
|
| [3] https://github.com/phillc73/duckdf
| jfim wrote:
| SQL doesn't compose all that well.
|
| For example, imagine that you have a complex query that handles
| a report. If someone says "hey we need the same report but with
| another filter on X," your options are to copy paste the SQL
| query with the change, create a view that can optionally have
| the filter (assuming the field that you'd want to filter on
| actually is still visible at the view level), or parse the SQL
| query into its tree form, mutate the tree, then turn it back
| into SQL.
|
| If you're using something like dplyr, then it's just an if
| statement when building your pipeline. Dbplyr also will
| generate SQL for you out of dplyr statements, it's pretty
| amazing IMHO.
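The composability argument can be made concrete with a small Python sketch in the same spirit as a dplyr pipeline: an optional filter becomes an if statement inside an ordinary function, not string surgery on SQL. The row fields used here (`last_login_days_ago`, `friend_count`, `country`) are hypothetical names for illustration.

```python
def active_user_report(rows, min_friends=None):
    """Active users per country; optional conditions are plain branches."""
    # Base filter: active within the last 30 days.
    users = [r for r in rows if r["last_login_days_ago"] <= 30]
    # Composing an extra condition is just an if statement.
    if min_friends is not None:
        users = [r for r in users if r["friend_count"] >= min_friends]
    # Aggregate: count of users per country, descending.
    counts = {}
    for r in users:
        counts[r["country"]] = counts.get(r["country"], 0) + 1
    return sorted(counts.items(), key=lambda kv: -kv[1])
```

Doing the same to a SQL report usually means editing the query text itself, because the filter has to land inside the right subquery or CTE.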
| MattConfluence wrote:
| > SQL doesn't compose all that well.
|
| On that topic, I really enjoy working in Elixir because Ecto
| [1] lets you write "SQL" with Elixir's composable functional
| syntax. It sits somewhere between "the language is compiled
| to SQL" and ORM. The Ruby-esque syntax took some getting used
| to, but once I was past that hurdle my productivity
| skyrocketed. It's not 100% feature complete compatibility
| with all the different SQL dialects, but most of what you'll
| need is there.
|
| [1] https://github.com/elixir-ecto/ecto
| phillc73 wrote:
| In a similar vein, I really like the LINQ style queries in
| the Julia package, Query.jl[1].
|
| [1] https://www.queryverse.org/Query.jl/stable/linqquerycom
| mands...
| dominotw wrote:
| > your options are to copy paste the SQL query with the
| change
|
     | There are things like dbt and LookML that bring templating and
     | macros on top of SQL. It's much better than copy/pasting
     | strings but worse than a full-fledged programming language.
| phillc73 wrote:
| I agree that dbplyr is a nice way to query databases, if
| already familiar with dplyr (actually I think dtplyr is more
| interesting for operating on data.tables). However, I'm not
| sure I really understand your point about the "if" statement.
|
| If the data is already in a dataframe, why not still use the
| "if" statement, but one of the packages I mentioned earlier
| to further modify the data?
|
| E.g.
|
     |     if (x == 10) {
     |       duckdf("SELECT * FROM df WHERE y >= 20")
     |     }
|
| I guess I was thinking about one query language to rule them
| all. If SQL was used everywhere, it's just about learning one
| paradigm and largely being able to use that across platform
| and language.
| jfim wrote:
       | It's for more complex queries, e.g. active users per
       | country:
       |
       |     WITH active_users AS
       |       (SELECT DISTINCT user_id, user_country
       |        FROM ...
       |        WHERE last_login >= NOW() - 1 month)
       |     SELECT user_country, COUNT(user_id) AS user_count
       |     FROM active_users
       |     GROUP BY user_country
       |     ORDER BY user_count DESC
|
       | Now imagine someone says "what about users that have at
       | least 5 friends?" If you're using dplyr and want to reuse
       | most of your logic, it's just a matter of doing something
       | like:
       |
       |     active_users_with_friends = active_users %>%
       |       filter(friend_count >= 5)
|
| The SQL version is much hairier, since it's just code
| that's within a string.
| phillc73 wrote:
| I think we're talking at cross purposes, and I don't want
| to belabour the point too much, but there's no need to
| modify the original SQL query (I'm imagining this pulls
| data from a database). If the data is already in a
| dataframe, use one of the libraries which can apply SQL
| to in-memory dataframes.
|
| e.g.
|
| active_users_with_friends <- duckdf("SELECT * FROM
| active_users WHERE friend_count >= 5")
| jfim wrote:
| There's an aggregation part that comes after the CTE, at
| which point the "friend_count" or "last_login" fields are
| not available anymore. "active_users" isn't a table, it's
| the result of the CTE in the query that returns distinct
| users that have a last_login in the last 30 days.
|
| Also, keep in mind that this is a pretty simple example,
| a more realistic one would probably have a half dozen to
| a dozen CTEs, some nested, at which point correctly
| mutating the SQL statement is not trivial.
| phillc73 wrote:
| > "active_users" isn't a table, it's the result of the
| CTE in the query
|
| Maybe we really are at cross purposes. In the duckdf
| example above "active_users" is a dataframe. The duckdf
| library applies SQL queries on dataframes. It's not
| (necessarily) querying an on-disk database.
|
| If you're querying data using dplyr then it's highly
| likely that data is already in a dataframe. By the same
| principle write the original SQL query to use SELECT with
| wildcard, to return everything. Then use a relevant
| library to apply SQL on dataframes and it's pretty much
| the same result as dplyr.
|
| That's not to say complex SQL queries shouldn't be used
| to only return the minimally required information from
| the database. Clearly, sometimes there's too much data to
| fit in memory. My point is that if you're using dplyr on
| dataframes (or tibbles), then you could just as easily
| use an SQL based query on the same dataframe.
| jfim wrote:
| > Then your dplyr example above doesn't work, because
| active_users doesn't exist.
|
| You can assume that it would contain the same data that
| would be in the CTE in the SQL query above. There's no
| such table in the database, it's a subquery.
|
| If your argument is that chaining multiple SQL queries on
| data frames is needed, then we're both saying the same
| thing. SQL doesn't compose well, hence the need for
| multiple queries. At that point it's not just SQL though,
| it's querying of dataframes using SQL _and_ composing the
| dataframes together using non-SQL code.
|
| Luckily, dplyr will actually compose properly, and you
| can get the actual SQL statement that is equivalent to
| the dplyr expression, which would be rather annoying to
| obtain programmatically.
| kristjansson wrote:
| I think the point from GP (which aligns with my thinking)
| is that more programmatic interfaces for querying data
| allow for control flow in constructing the query, not just
| in processing it. So you can conditionally add a predicate
| or whatever, and then run the whole query at once.
|
| Whereas with the SQL packages you mentioned, you either
| have to conditionally construct your query string (ew) or
| pull results back and express your conditional as
| operations on the resulting data frame.
|
| For pure dplyr-on-in-memory-data-frame there isn't much
| difference between the two, to be sure. For dplyr on
| dbplyr, or sparklyr, or for SparkR, or PySpark, or Dask, or
| pick-your-favorite-framework, the difference between
| operating on (effectively) query AST vs. result sets is
| huge.
| phillc73 wrote:
| Good points. That makes the larger than memory use case
| quite clear, and interesting.
| jerf wrote:
| I read down the thread as it stands now, and the basic
| problem is this. Write the following function:
|
|     Compose("SELECT a, b.revenue, b.name
|              FROM table INNER JOIN b ON whatever
|              WHERE name LIKE 'b%'",
|             "revenue > 10000")
|
| to yield a new valid SQL statement that composes these two
| fragments together in the way that it is obvious that I
| want. You basically can't. If you can in this small
| example, I can easily complexify the SQL until you can't
| anymore. Note how "revenue" in my supplemental WHERE clause
| may or may not be that "b.revenue", both in the sense that
| I may be asking for some _other_ revenue entirely, and in
| the sense that that particular fragment may have come from
| somewhere that has no way to know about the "b" part of
| the name; one of the major issues is namespacing like this,
| though it is far from the only one!
|
| It is almost certainly impossible to sanely implement this
| function literally in terms of strings. (If such a thing
| was possible, it would almost certainly have trash
| performance.) You need a richer set of data types and
| operations to permit this.
|
| It has been done. You can hold on to a symbolic
| representation of the SQL in some internal library
| representation. However, to "bind" to SQL in this manner
| requires a binding to every single feature of the SQL you
| want to use at a pretty deep level (more than just strings,
| you need to understand the full syntax tree), and speaking
| from experience, no matter how cleverly you try to start
| writing it at first it gets more complicated than you
| think. It is certainly a thing that has been done, but it
| is a _huuuuuuge_ project. Seriously. It would be a lot
| easier if we were using an underlying representation
| designed to do this sort of thing from the beginning.
|
| I like what you can do with SQL, but I kinda hate the way
| it was _juuust_ good enough to get ensconced and apparently
| permanently ensure that no improvement on it can ever get
| off the ground because no conceivable improvement can
| overcome the entrenched advantages SQL has.
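The "richer set of data types" point can be sketched with a toy Python query builder (entirely hypothetical, far from a real library): the query lives as structured data, composing becomes a list append, and a SQL string is only rendered at the end. Note this sketch sidesteps the namespacing problem raised above by making the caller qualify names like `b.revenue` explicitly.

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    select: list
    source: str
    where: list = field(default_factory=list)

    def compose(self, predicate: str) -> "Query":
        # Composition is a list append on the symbolic form,
        # not string surgery on an existing SQL statement.
        return Query(self.select, self.source, self.where + [predicate])

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.select)} FROM {self.source}"
        if self.where:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.where)
        return sql

q = Query(["a", "b.revenue", "b.name"],
          "table INNER JOIN b ON whatever",
          where=["name LIKE 'b%'"])
print(q.compose("b.revenue > 10000").to_sql())
```

This prints `SELECT a, b.revenue, b.name FROM table INNER JOIN b ON whatever WHERE (name LIKE 'b%') AND (b.revenue > 10000)`. Covering the full SQL grammar this way is exactly the "huge project" described above; the sketch only shows why the symbolic form composes where strings don't.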
| kristjansson wrote:
| I think the answer to your question is that, until relatively
| recently it was not possible to run arbitrary SQL against data
| frames in either language. sqldf and duckdf pass data off to
| another program (albeit in-process), run the query there, and
| pull the result set back to R, paying data movement and
| transportation costs along the way. Tidyquery is neat and
| avoids that issue, but it's also only 18 months old or so.
| phillc73 wrote:
| I agree that Tidyquery and duckdf are quite new, but sqldf
| has been around for a long time. The first package upload to
| CRAN was 2007.
|
| Also, duckdf uses duckdb[1] to register a virtual in-memory
| table, so there's very little overhead for data transport.
|
| [1] https://duckdb.org/docs/api/r
| kristjansson wrote:
| While sqldf has been around for a while, it's not been a
| performant choice, since it requires transforming and
| moving data into a database, albeit one that's in process.
| So with something like:
|
|     df <- data.frame(a=rnorm(5e8), b=runif(5e8))
|     sqldf::sqldf("select count(*) from df where a > 3")
|
| the query works, but takes 200 seconds and doubles the memory
| footprint of the process, whereas
|
|     nrow(df[df$a > 3, ])
|     sum(df$a > 3)
|
| take ~1.5 seconds and ~1s respectively on my machine.
|
| I appear to have been too pessimistic about duckdf/duckdb
| though. It's docs[1] claim that registering an R data.frame
| as a table
|
| > does not actually transfer data into DuckDB yet.
|
| Which implied to me that it just deferred transfer until
| the user runs a query. However, in a basic test just now:
|
|     con = dbConnect(duckdb::duckdb(), ":memory:")
|     duckdb::duckdb_register(con, "iris_view", iris)
|     duckdb::duckdb_register(con, "df_view", df)
|     dbGetQuery(con, "select count(*) from df_view where a > 0")
|
| it appears to execute the query directly against the data
| frame. At least, it runs in ~2.2s, and doesn't allocate
| more memory. Cool! As you've noted though, it's very new -
| looks like they released 0.1.0 last May?
|
| I think the point stands: until very recently, SQL-on-
| dataframes was not a viable choice for anyone working at
| even moderate scales in R or Python, so preference has been
| for APIs exposed by libraries (pandas, data.table, dplyr,
| ...) that offered users reasonable performance, even if SQL
| would have been a more ergonomic interface.
|
| [1] https://duckdb.org/docs/api/r
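The transfer cost described above can be imitated with Python's stdlib sqlite3, another in-process SQL engine (sizes shrunk so it runs in moments; at the 5e8-row scale in the benchmark, the copy step is what dominates; all names here are illustrative):

```python
import sqlite3
import random

random.seed(0)
data = [(random.gauss(0, 1),) for _ in range(100_000)]

# sqldf-style: copy every row into an in-process database, then query.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (a REAL)")
con.executemany("INSERT INTO df VALUES (?)", data)  # the costly transfer
(sql_count,) = con.execute("SELECT COUNT(*) FROM df WHERE a > 3").fetchone()

# Native-style: filter the data where it already lives, no copy.
native_count = sum(1 for (a,) in data if a > 3)

# Same answer either way; only the data movement differs.
assert sql_count == native_count
```

The duckdb virtual-table approach described above avoids the `executemany` copy step entirely, which is why it neither stalls nor doubles memory.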
| phillc73 wrote:
| Yeah, I agree sqldf is quite slow. Fair point.
|
| As you've seen, duckdb registers an "R data frame as a
| virtual table." I'm not sure what they mean by "yet"
| either.
|
| Of course it is possible to write an R dataframe to an
| on-disk duckdb table, if that's what you want to do.
|
| There are some simple benchmarks on the bottom of the
| duckdf README[1]. Essentially I found for basic SQL
| SELECT queries, dplyr is quicker, but for much more
| complex queries, the duckdf/duckdb combination performs
| better.
|
| If you really want speed of course, just use data.table.
|
| [1] https://github.com/phillc73/duckdf
| kristjansson wrote:
| Didn't realize duckdf was your package. Cool! Definitely
| something I'll consider for future work, though I spend
| more time on the Spark-and-Python side of the fence these
| days.
|
| If you wanted to add corresponding memory benchmarks the
| value-prop of duckdf might be clearer to those of us that
| have been scarred by sqldf :).
| phillc73 wrote:
| Sounds like it could be an interesting comparison. I'll
| look into it.
| alexisread wrote:
| I'm late to the party here, but as stated in other comments,
| dplyr / LINQ etc. compose better, so you can extend a query
| more easily.
|
| Another advantage of composable syntaxes is that you can lazily
| evaluate parts of a query, and combine others, which allows the
| equivalent of a CTE on parts of a query you need to optimise.
|
| It's also true that SQL allows pushdown of queries to the
| database much more easily than composable syntaxes do, as SQL
| is usually the native language there. As such it can make for
| a more exploratory syntax than, say, GraphQL, where you need
| to define joins at design time.
| marcinzm wrote:
| SQL tends to be non-composable, which makes complicated
| scripts really messy to refactor, modify, or even read. CTEs
| make it more manageable, but they're a relatively recent
| addition and don't fully solve the problem. Data science
| tends to involve a lot of modifying of the same code rather
| than writing one-off or static scripts.
| js2 wrote:
| Sentry.io is using ClickHouse for search, with an API they built
| on top of it to make it easier to transition if need be. They
| blogged about it at the time they adopted it:
|
| https://blog.sentry.io/2019/05/16/introducing-snuba-sentrys-...
| moralsupply wrote:
| I'm happy that more people are "discovering" ClickHouse.
|
| ClickHouse is an outstanding product, with great capabilities
| that serve a wide array of big data use cases.
|
| It's simple to deploy, simple to operate, simple to ingest large
| amounts of data, simple to scale, and simple to query.
|
| We've been using ClickHouse to handle 100's of TB of data for
| workloads that require ranking on multi-dimensional timeseries
| aggregations, and we can resolve most complex queries in less
| than 500ms under load.
| Bluestein wrote:
| Read that as "Clubhouse" on first glance ...
|
| ... and was quite intrigued :)
|
| PS: Tough crowd tonite ...
|
It would appear the local crowd does not take well to a bit of
tongue-in-cheek. Besides, it actually happened as described.
| pantulis wrote:
| Take my upvote, because the same happened to me.
| Bluestein wrote:
| Could not have been the only one :)
| didip wrote:
| Does ClickHouse have integration with Superset and Grafana?
| daniel_levine wrote:
| Yes to both, Altinity maintains the ClickHouse Grafana plugin
| https://altinity.com/blog/2019/12/28/creating-beautiful-graf...
|
| And Superset has a recommendation of a ClickHouse connector
| https://superset.apache.org/docs/databases/clickhouse
| dabeeeenster wrote:
| I've been recording a podcast with Commercial Open Source company
| founders (Plug! https://www.flagsmith.com/podcast) and have been
| surprised how often Clickhouse has come up. It is _always_
| referred to with glowing praise: "couldn't have built our
| business without it", etc.
| jabo wrote:
| Thanks for sharing your podcast! Just subscribed.
|
| To add yet another data point, we use Clickhouse as well for
| centralized logging for the SaaS version of our open source
| product, and can't imagine what we would have done without it.
| wiradikusuma wrote:
| Anyone know more lightweight alternative to (ELK) Elastic Stack?
| I found https://vector.dev but it seems to be only the "L" part.
| webo wrote:
| Have you looked into Google Cloud Logging (Stackdriver)? It's
| the most affordable and decent-enough solution we've found. The
| only issue is querying can be slow on large volumes.
| spyspy wrote:
| Have you tried creating a sink from stackdriver to bigquery?
| webo wrote:
| Hi, not really, because my understanding is BigQuery
| requires structured data, whereas Stackdriver just accepts
| an arbitrary JSON payload that can then be queried. Do you
| have any experience with it?
| ehnto wrote:
| I found Lucene's base library really easy to use without the
| configuration/infrastructure overhead of Elasticsearch, but
| haven't experienced it at scale:
|
| https://lucene.apache.org/
| rzzzt wrote:
| Solr is the equivalent-ish of ES, if you are looking for a
| search server instead of a library that can be embedded:
| https://lucene.apache.org/solr/
| johnx123-up wrote:
| What about https://github.com/meilisearch/MeiliSearch ?
| Xylakant wrote:
| Promtail/Loki https://github.com/grafana/loki is an alternative
| to ELK, but while it seems more lightweight, it is definitely
| less featureful. The integration with Grafana/Prometheus seems
| nice, but I've only toyed with it, not used it in production.
___________________________________________________________________
(page generated 2021-03-02 23:00 UTC)