[HN Gopher] Full text search over Postgres: Elasticsearch vs. al...
___________________________________________________________________
Full text search over Postgres: Elasticsearch vs. alternatives
Author : philippemnoel
Score : 120 points
Date : 2024-08-06 17:48 UTC (5 hours ago)
(HTM) web link (blog.paradedb.com)
(TXT) w3m dump (blog.paradedb.com)
| ashconnor wrote:
| Facets are possible in Postgres but it looks complex:
|
| https://web.archive.org/web/20200815141031/https://roamanaly...
|
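    | For a sense of the basic shape, a single facet count is just a
    | GROUP BY over the full-text predicate; a minimal sketch with
    | psycopg2, where the "products" table, "search_vec" tsvector
    | column and "brand" facet are hypothetical names. The complexity
    | the linked post gets into starts when you need many facets and
    | their counts efficiently on every query:
    |
    |     import psycopg2
    |
    |     conn = psycopg2.connect("dbname=shop")
    |     with conn.cursor() as cur:
    |         # Count matches per facet value for one FTS query.
    |         # Table and column names here are hypothetical.
    |         cur.execute(
    |             """
    |             SELECT brand, count(*)
    |             FROM products
    |             WHERE search_vec @@ plainto_tsquery('english', %s)
    |             GROUP BY brand
    |             ORDER BY count(*) DESC
    |             """,
    |             ("running shoes",),
    |         )
    |         facet_counts = cur.fetchall()
    |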
| pg_search [0] not to be confused with the ruby gem of the same
| name [1]:
|
| [0] -
| https://github.com/paradedb/paradedb/tree/dev/pg_search#over...
|
| [1] - https://github.com/Casecommons/pg_search
| simonw wrote:
| I implemented faceted search on my own blog entirely on top of
| PostgreSQL. It works well, but that's because my blog has only
| 11,348 documents total.
|
| Notes on how I built that here:
| https://simonwillison.net/2017/Oct/5/django-postgresql-facet...
| bdzr wrote:
| Faceting over even small result sets and producing the relevant
| counts is really hard for postgres, and caching tends to be
| impossible due to the combinatorics of the problem. There's
| some discussion around supporting better indexing options here
    | that is relevant:
    | https://www.postgresql.org/message-id/CAJqPDh9o%3DoxRP%3DxZh....
    | Practically,
| https://github.com/cybertec-postgresql/pgfaceting is the best
| bet for people who must use postgres even though it still uses
| a manual maintenance step.
| radpanda wrote:
| > Because Elasticsearch is not a reliable data store,
| organizations that use Postgres typically extract, transform, and
| load (ETL) data from Postgres to Elasticsearch
|
    | I'll admit I haven't kept up with this, but is it still the case
| that Elasticsearch is "not a reliable data store"?
|
| I remember there used to be a line in the Elasticsearch docs
    | saying that Elasticsearch shouldn't be your primary data store or
| something to that effect. At some point they removed that
| verbiage, seemingly indicating more confidence in their
| reliability but I still hear people sticking with the previous
| guidance.
| wordofx wrote:
    | ES is just unreliable. It can be running smoothly for a year and
    | then, boom, it falls over and you're left scratching your head.
| cyberes wrote:
| That exact scenario just happened to me a few days ago.
| agumonkey wrote:
    | What would you consider as a replacement?
| alexpeng wrote:
| SingleStore is pretty good at ES use cases. Low latency,
| scalability, real time ingest, full text search +
| millisecond query times.
| whartung wrote:
| How did you fix it? What happened?
| fizx wrote:
| It got a lot better in the ~7 series IIRC when they added
| checksums to the on-disk files. I don't know if you still have
| to recover corruptions by hand, or whether the correct file
| gets copied in from a replica.
|
| The replication protocols and leader election were IMO not
| battle-hardened or likely to pass Aphyr-style testing. It was
| pretty easy to get into a state where the source of truth was
| unclear.
|
| Source: Ran an Elasticsearch hosting company in the 2010's. A
| little out of the loop, but not sure much has changed.
| willio58 wrote:
| I'm not sure what the author was referring to, but in our stack
| ES is the only non-serverless tech we have to work with. I know
| there's a lot of hate in HN around serverless for many reasons
| but for us for several years, we've been able to scale without
| any worry of our systems being affected performance-wise (I
| know this won't last forever).
|
| ES is not this way, we have to manage our nodes ourselves and
| figure out "that one node is failing, why?" type questions. I
| hear they're working on a serverless version, but honestly, I
| think we will be leaving ES before that happens.
| maxxxxxx wrote:
| Elastic's serverless offering went GA recently:
| https://www.elastic.co/elasticsearch/serverless
| jrochkind1 wrote:
| What are you considering replacing it with for full-text
| search?
| amai wrote:
    | Just read the article: "Elasticsearch's lack of ACID
    | transactions and MVCC can lead to data inconsistencies and
    | loss, while its lack of relational properties and real-time
    | consistency makes many database queries challenging."
| simonw wrote:
| The BM25 thing is actually a really big deal.
|
| BM25 is similar to TF/IDF. In both cases, the key idea is to
| consider statistics of the overall corpus as part of relevance
| calculations. If the user searches for "charities in new orleans"
| in a corpus where "new orleans" is only represented in a few
| documents, those should clearly rank highly. If the corpus has
| "new orleans" in almost every document then the term "charity" is
| more important.
|
| PostgreSQL FTS cannot do this, because it doesn't maintain
| statistics for word frequencies across the entire corpus. This
| severely limits what it can implement in terms of relevance
    | scoring - each result is scored based purely on whether the
    | search terms are present or not.
|
| For comparison, SQLite FTS (which a lot of people are unaware of)
| actually does implement full index statistics, and SQLite FTS5
| implements BM25 out of the box.
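    |
    | A minimal sketch of the FTS5 side with Python's stdlib sqlite3
    | (it assumes an SQLite build with FTS5 enabled; the table and
    | documents are made up):
    |
    |     import sqlite3
    |
    |     conn = sqlite3.connect(":memory:")
    |     conn.execute(
    |         "CREATE VIRTUAL TABLE docs USING fts5(title, body)"
    |     )
    |     # Toy documents, just for illustration.
    |     conn.executemany(
    |         "INSERT INTO docs VALUES (?, ?)",
    |         [
    |             ("a", "charities in new orleans"),
    |             ("b", "new orleans jazz history"),
    |             ("c", "a directory of charities"),
    |         ],
    |     )
    |     # "rank" is FTS5's built-in BM25 score; smaller values are
    |     # more relevant, so ORDER BY rank puts the best match first.
    |     rows = conn.execute(
    |         "SELECT title, rank FROM docs"
    |         " WHERE docs MATCH ? ORDER BY rank",
    |         ('charities OR "new orleans"',),
    |     ).fetchall()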
| beardedetim wrote:
| BM25 is definitely a big deal when you're doing FTS. It's one
| of the reasons I've switched to Arango as my DB for the project
| I'm working on that needs FTS. The fact it also comes with
| graphs means I don't need to bang my head against CTEs either.
|
| Not saying Arango can replace Postgres but for my needs, it's a
| much better fit AND it offers the FTS that I need out of the
| box.
| philippemnoel wrote:
| Arango is sweet! We've actually talked to many people who
| switched from Postgres to Arango because of its better FTS.
| This was one of the reasons for creating ParadeDB in the
| first place. Many of the users who made this switch wished
| they could have stayed in Postgres without compromising on
| FTS.
| packetlost wrote:
    | I wish SQLite FTS had a nicer-to-use API. Last time I looked (a
| few months ago) it was very confusing and not very user
| friendly.
| simonw wrote:
| I've built a bunch of Python (and CLI) code to try and help
    | make that work better:
    | https://sqlite-utils.datasette.io/en/stable/python-api.html#...
    |
    |     import sqlite_utils
    |     db = sqlite_utils.Database("news.db")
    |     db["articles"].enable_fts(["headline", "body"])
    |     results = list(db["articles"].search("softball"))
|
    | Or with the CLI tool:
    | https://sqlite-utils.datasette.io/en/stable/cli.html#configu...
    |
    |     sqlite-utils enable-fts news.db articles headline body
    |     sqlite-utils search news.db articles softball
| packetlost wrote:
| That's cool, maybe I can use it as a reference. Thank you!
| sroussey wrote:
    | The problem with BM25 in a database is that it can have
    | unexpected outcomes for some common use cases.
|
| Take multi-tenancy.
|
    | What if user 1 has many more documents than user 2, and uses
    | "new orleans" a lot, but user 2 does not? Then user 2 does a
    | search.
    |
    | The db will first use FTS, and then filter. So user 1's documents
    | will bias the results of user 2 - perhaps enough for user 2 to
    | discover what words are in user 1's corpus.
| simonw wrote:
| Doesn't that affect BM25 with a solution like Elasticsearch
| as well? Or is that smart enough to apply filters to the
| corpus statistics before calculating relevance?
|
| You could solve that in SQLite by giving each user their own
| separate FTS table - not impossibly complex, but would grow
| increasingly messy if you have 10s of thousands of users.
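    |
    | A rough sketch of that per-user-table approach with stdlib
    | sqlite3 (the naming convention and tenant ids are made up):
    |
    |     import sqlite3
    |
    |     conn = sqlite3.connect("app.db")
    |
    |     def tenant_table(tenant_id: int) -> str:
    |         # One FTS5 table per tenant keeps BM25 corpus stats
    |         # (and term visibility) isolated per tenant.
    |         return f"docs_fts_{int(tenant_id)}"
    |
    |     def ensure_index(tenant_id: int) -> None:
    |         conn.execute(
    |             "CREATE VIRTUAL TABLE IF NOT EXISTS "
    |             f"{tenant_table(tenant_id)} USING fts5(body)"
    |         )
    |
    |     def search(tenant_id: int, query: str):
    |         table = tenant_table(tenant_id)
    |         return conn.execute(
    |             f"SELECT body FROM {table}"
    |             f" WHERE {table} MATCH ? ORDER BY rank",
    |             (query,),
    |         ).fetchall()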
| vosper wrote:
    | One way to address this in Elasticsearch would be to put
    | each customer's documents in their own index. Other than
| that, as far as I can tell it's not smart enough to apply
| filters first.
|
| Also, shards can affect BM25 scoring:
    | https://www.elastic.co/blog/practical-bm25-part-1-how-shards...
| giovannibonetti wrote:
| > Take multi-tenancy.
|
| Using a SQLite DB per tenant is a good alternative to handle
| that: https://turso.tech/
| Guillaume86 wrote:
| It could be solved with an index partitioning feature, no
| idea if it already exists somewhere...
| formerly_proven wrote:
    | I learned information retrieval in university before ever
    | using any full-text search engine, and I was honestly kinda
| surprised that most engines are literal textbook
| implementations. IR was one of those courses where I definitely
| expected the SotA to be further out than the third slide deck.
| jhanschoo wrote:
| It kind of is now if you count deep-learning based solutions
| to an information need haha
| formerly_proven wrote:
| That is true
| samsk wrote:
    | For internet-user-facing full-text search I would always prefer
| to use a separate tool and not a SQL database, because
|
    | - the fulltext tool can and should hold only 'active' data
|
| - as it has only active data, data size is usually much much
| smaller
|
| - as data size is smaller, it better fits in RAM
|
    | - as data size is smaller, it can probably run on poorer HW
    | than the full ACID db
|
| - as the indexed data are mostly read-only, the VM where it runs
| can be relatively easily cloned (never seen a corruption till
| now)
|
| - as FTS tools are usually schema-less, there is no outage during
| schema changes (compared to doing changes in ACID db)
|
    | - as the indexed data are mostly read-only, they can be easily
    | backed up
|
| - as the backups are smaller, restoring a backup can be very fast
|
    | - and there is no such thing as a database upgrade outage; you
    | just spin up a new version, feed it with new data and then
    | change the backends
|
| - functionality and extensibility
|
    | There is probably more, but if one doesn't need to do a fulltext
    | search over the whole database (and you usually don't), then it's
    | IMHO better to use a separate tool that doesn't come with all the
    | ACID constraints. Probably the only downside is that you need to
    | format the data for the FTS engine and index it, but if you want
    | to run a serious full-text search, you will have to take almost
    | the same steps in the database.
|
    | On a 15y old side project, I use SOLR for full-text search,
    | serving 20-30k requests per day on a cheap VM, and PostgreSQL is
    | used as the primary data source. The PostgreSQL has had several
    | longer outages - during major upgrades, because of disk
    | corruption, because of failed schema migrations, because of
    | 'problems' between the chair and keyboard, etc... During those
    | outages the full-text search always worked - it didn't have the
    | most recent data, but most users probably never noticed.
| whalesalad wrote:
    | playing devil's advocate
|
    | > the fulltext tool can and should hold only 'active' data
|
| very possible with postgres, too. instead of augmenting your
| primary table to support search, you would have a
| secondary/ephemeral table serving search duties
|
| > as data size is smaller, it better fits in RAM
|
| likewise, a standalone table for search helps here, containing
| only the relevant fields and attributes. this can be further
| optimized by using partial indexes.
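    |
    | As a rough illustration of the dedicated-search-table idea with
    | a partial index (psycopg2; the schema, "active" flag and column
    | names are all hypothetical):
    |
    |     import psycopg2
    |
    |     conn = psycopg2.connect("dbname=app")
    |     with conn.cursor() as cur:
    |         # Slim side table holding only what search needs
    |         # (hypothetical schema).
    |         cur.execute("""
    |             CREATE TABLE IF NOT EXISTS product_search (
    |                 product_id bigint PRIMARY KEY,
    |                 active     boolean NOT NULL DEFAULT true,
    |                 search_vec tsvector NOT NULL
    |             )
    |         """)
    |         # Partial GIN index: only 'active' rows are indexed,
    |         # keeping the index small and likely to stay in RAM.
    |         cur.execute("""
    |             CREATE INDEX IF NOT EXISTS search_active_idx
    |             ON product_search USING gin (search_vec)
    |             WHERE active
    |         """)
    |         cur.execute(
    |             "SELECT product_id FROM product_search"
    |             " WHERE active"
    |             " AND search_vec @@ plainto_tsquery('english', %s)",
    |             ("running shoes",),
    |         )
    |         ids = [row[0] for row in cur.fetchall()]
    |     conn.commit()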
|
| > as FTS tools are usually schema-less, there is no outage
| during schema changes (compared to doing changes in ACID db)
|
| postgresql can be used in this manner by using json/jsonb
| fields. instead of defining every field, just define one field
| and drop whatever you want in it.
|
    | > as the indexed data are mostly read-only, they can be easily
    | backed up
|
| same for postgres. the search table can be exported very easily
| as parquet, csv, etc.
|
| > as the backups are smaller, restoring a backup can be very
| fast
|
| tbh regardless of underlying mechanism, if your search index is
| based on upstream data it is likely easier to just rebuild it
| versus restoring a backup of throwaway data.
|
| > The PostgreSQL has had several longer outages - during major
| upgrades, because of disk corruption, because of failed schema
| migrations, because of 'problems' between the chair and
| keyboard etc...
|
| to be fair, these same issues can happen with elasticsearch or
| any other tool.
|
| how big was your data in solr?
| samsk wrote:
| PostgreSQL is over 90GB, SOLR is under 1GB.
|
    | The PostgreSQL has to handle writes, reports, etc., so I
    | doubt it will cache as efficiently as a full-text engine;
    | you'll need to have full or partial replicas to distribute
    | the load.
|
    | And, yes, I agree, almost all of this can be done with
    | separate search table(s), but those table(s) will still live
    | in a 'crowded house', so again replicas will probably be
    | necessary at some point.
|
| And using replicas brings new set of problems and costs ;-)
|
    | One client used MySQL for fulltext search on a single
    | beefy RDS server, costing well over $1k per month, and the
    | costs kept rising. It was replaced with a single ~$100 EC2
    | machine running Meilisearch.
| sidmitra wrote:
    | While I agree somewhat with the post above, the answer isn't
    | really so black and white; it depends on your context, i.e.
    | scale, app complexity, search needs, data size, etc.
|
    | > the fulltext tool can and should hold only 'active' data
|
    | The same can be said about your DB. You can create separate
    | tables or partitions to hold only active data. I assume
    | materialized views are also an option (but I've never used them
    | for FTS). You can even choose to create a separate postgres
    | instance but only use it for FTS data. The reason to do that
    | might be to avoid coupling your business logic to another
    | ORM/DSL and having your team learn another query language and
    | its gotchas.
|
| > as data size is smaller, it better fits in RAM
|
| > as the indexed data are mostly read-only, the VM where it
| runs can be relatively easily cloned
|
    | > as the indexed data are mostly read-only, they can be easily
    | backed up
|
| > as the backups are smaller, restoring a backup can be very
| fast
|
    | Once the pg tables are separate and have the relevant indexing,
    | I assume PG can also keep most data in memory. There isn't
    | anything stopping you from using a different instance of PG for
    | FTS if needed.
|
| > as FTS tools are usually schema-less, there is no outage
| during schema changes
|
    | True. But in practice ES, for example, does have a schema
    | (mappings, columns, indexes), and will have you re-index your
    | rows/data, or in some cases rebuild your index entirely, to be
    | safe. There are field types, and your querying will depend on
    | the field types you choose. I remember even SOLR did, because I
    | had to figure out geospatial field types to do those queries,
    | but I haven't used it in a decade so can't say how things stand
    | now.
|
| https://www.elastic.co/guide/en/elasticsearch/reference/curr...
|
    | While the OP's point stands, in a sufficiently complex FTS
    | project you'll need all of the features, and you'll have to
    | deal with the following on search-oriented DBs:
|
    | - Schema migrations or some async jobs to re-index data. In fact
    | it was worse than postgres, because at least in an RDBMS
    | migrations are well understood. In ES, devs would change field
    | types and expect everything to work, without realizing that only
    | the new data was getting the new mapping. So we sometimes had to
    | re-index entire indexes to get around this for each change in
    | schema.
|
    | - At scale you'll have to tap into WAL logs via CDC/Debezium to
    | ensure the data in your search index is up to date and no rows
    | were missed, which means dealing with robust queues/pub-sub.
|
    | - A whole other ORM or DSL for elasticsearch. If you don't use
    | these, your queries will soon become a mish-mash of string
    | concats or f-strings, which is even worse for maintainability.
|
| https://elasticsearch-py.readthedocs.io/en/v8.14.0/
| https://elasticsearch-dsl.readthedocs.io/en/latest/
|
| - Unless your search server is directly serving browser
| traffic, you'll add additional latency traversing hops. In some
| cases meilisearch, typesense might work here.
|
    | I usually recommend that engineers (starting out on a new search
    | product feature) start with FTS on postgres and jump to another
    | search DB as and when needed. FTS support has improved greatly
    | in python frameworks like Django. I've made the other choice of
    | jumping too soon to a separate search DB and come to regret it,
    | because it required me to either build abstractions on top or
    | use the DSL sdk, then ensure the data in both is "synced" up and
    | maintain observability/telemetry on this new DB and so on. The
    | time/effort investment was not linear and the ROI wasn't in the
    | same range for the use case I was working on.
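    |
    | A minimal sketch of that Django route (the Article model and
    | its title/body fields are hypothetical; it requires
    | 'django.contrib.postgres' in INSTALLED_APPS):
    |
    |     from django.contrib.postgres.search import (
    |         SearchQuery,
    |         SearchRank,
    |         SearchVector,
    |     )
    |
    |     from myapp.models import Article  # hypothetical model
    |
    |     # Filter and rank with Postgres FTS through the ORM.
    |     query = SearchQuery("new orleans charities")
    |     vector = SearchVector("title", "body")
    |     results = (
    |         Article.objects.annotate(rank=SearchRank(vector, query))
    |         .filter(rank__gt=0)
    |         .order_by("-rank")
    |     )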
|
| I actually got more mileage out of search by just dumping small
| CSV datasets into S3 and downloading them in the browser and
| doing FTS client side via JS libs. This basically got me zero
| latency search, albeit for small enough per-user datasets.
| samsk wrote:
| Yes, it always depends on application and purpose.
|
    | But once you have to deal with a real FTS load, as you say, you
    | have to use separate instances and replication, use materialized
    | views etc., and you find yourself almost halfway to implementing
    | an ETL pipeline and, because of the replicas, with a more
    | complicated setup than just having an FTS tool. And then
    | somebody finds out what vector search is, and asks you if there
    | is a PG extension for it (yes, there is).
|
    | So IMHO with FTS in the database, you'll probably have to deal
    | with almost the same problems as with an external FTS
    | (materialized views, triggers, reindexing, replication,
    | migrations) but without all its features, and with the
    | constraints of an ACID database (locks, transactions, writes)...
|
    | Btw. I've got SOLR right behind OpenResty, so no extra hops.
    | With a database there would be one more hop and a bunch of SQL
    | queries, because it doesn't speak HTTP (although I'm sure there
    | is a PG extension for that ;-)
| corytheboyd wrote:
| > We've talked to many companies who have tried and regretted
| their decision to use Elasticsearch as their primary data store
|
| oh no...
| haolez wrote:
| Are there success cases of using ElasticSearch/OpenSearch as the
| primary data store in production? Just curious.
| amai wrote:
| Alternatives to both are
|
| https://www.meilisearch.com/
|
| https://typesense.org/
|
| and maybe
|
| https://github.com/Sygil-Dev/whoosh-reloaded
___________________________________________________________________
(page generated 2024-08-06 23:00 UTC)