[HN Gopher] Full text search over Postgres: Elasticsearch vs. al...
       ___________________________________________________________________
        
       Full text search over Postgres: Elasticsearch vs. alternatives
        
       Author : philippemnoel
       Score  : 120 points
       Date   : 2024-08-06 17:48 UTC (5 hours ago)
        
 (HTM) web link (blog.paradedb.com)
 (TXT) w3m dump (blog.paradedb.com)
        
       | ashconnor wrote:
       | Facets are possible in Postgres but it looks complex:
       | 
       | https://web.archive.org/web/20200815141031/https://roamanaly...
       | 
       | pg_search [0] not to be confused with the ruby gem of the same
       | name [1]:
       | 
       | [0] -
       | https://github.com/paradedb/paradedb/tree/dev/pg_search#over...
       | 
       | [1] - https://github.com/Casecommons/pg_search
        
         | simonw wrote:
         | I implemented faceted search on my own blog entirely on top of
         | PostgreSQL. It works well, but that's because my blog has only
         | 11,348 documents total.
         | 
         | Notes on how I built that here:
         | https://simonwillison.net/2017/Oct/5/django-postgresql-facet...
        
         | bdzr wrote:
         | Faceting over even small result sets and producing the relevant
         | counts is really hard for postgres, and caching tends to be
         | impossible due to the combinatorics of the problem. There's
         | some discussion around supporting better indexing options here
         | that is relevant: https://www.postgresql.org/message-
         | id/CAJqPDh9o%3DoxRP%3DxZh.... Practically,
         | https://github.com/cybertec-postgresql/pgfaceting is the best
         | bet for people who must use postgres even though it still uses
         | a manual maintenance step.
        
       | radpanda wrote:
       | > Because Elasticsearch is not a reliable data store,
       | organizations that use Postgres typically extract, transform, and
       | load (ETL) data from Postgres to Elasticsearch
       | 
       | I'll admit haven't kept up with this but is it still the case
       | that Elasticsearch is "not a reliable data store"?
       | 
       | I remember there used to be a line in the Elasticsearch docs
       | saying that Elasticseach shouldn't be your primary data store or
       | something to that effect. At some point they removed that
       | verbiage, seemingly indicating more confidence in their
       | reliability but I still hear people sticking with the previous
       | guidance.
        
         | wordofx wrote:
         | ES is just unreliable. Can be running smoothly for a year and
         | boom it falls over and you're left scratching your head.
        
           | cyberes wrote:
           | That exact scenario just happened to me a few days ago.
        
             | agumonkey wrote:
             | What would you consider as replacement
        
               | alexpeng wrote:
               | SingleStore is pretty good at ES use cases. Low latency,
               | scalability, real time ingest, full text search +
               | millisecond query times.
        
             | whartung wrote:
             | How did you fix it? What happened?
        
         | fizx wrote:
         | It got a lot better in the ~7 series IIRC when they added
         | checksums to the on-disk files. I don't know if you still have
         | to recover corruptions by hand, or whether the correct file
         | gets copied in from a replica.
         | 
         | The replication protocols and leader election were IMO not
         | battle-hardened or likely to pass Aphyr-style testing. It was
         | pretty easy to get into a state where the source of truth was
         | unclear.
         | 
         | Source: Ran an Elasticsearch hosting company in the 2010's. A
         | little out of the loop, but not sure much has changed.
        
         | willio58 wrote:
         | I'm not sure what the author was referring to, but in our stack
         | ES is the only non-serverless tech we have to work with. I know
         | there's a lot of hate in HN around serverless for many reasons
         | but for us for several years, we've been able to scale without
         | any worry of our systems being affected performance-wise (I
         | know this won't last forever).
         | 
         | ES is not this way, we have to manage our nodes ourselves and
         | figure out "that one node is failing, why?" type questions. I
         | hear they're working on a serverless version, but honestly, I
         | think we will be leaving ES before that happens.
        
           | maxxxxxx wrote:
           | Elastic's serverless offering went GA recently:
           | https://www.elastic.co/elasticsearch/serverless
        
           | jrochkind1 wrote:
           | What are you considering replacing it with for full-text
           | search?
        
         | amai wrote:
         | Just read the article: ,,Elasticsearch's lack of ACID
         | transactions and MVCC can lead to data inconsistencies and
         | loss, while its lack of relational properties and real-time
         | consistency makes many database queries challenging."
        
       | simonw wrote:
       | The BM25 thing is actually a really big deal.
       | 
       | BM25 is similar to TF/IDF. In both cases, the key idea is to
       | consider statistics of the overall corpus as part of relevance
       | calculations. If the user searches for "charities in new orleans"
       | in a corpus where "new orleans" is only represented in a few
       | documents, those should clearly rank highly. If the corpus has
       | "new orleans" in almost every document then the term "charity" is
       | more important.
       | 
       | PostgreSQL FTS cannot do this, because it doesn't maintain
       | statistics for word frequencies across the entire corpus. This
       | severely limits what it can implement in terms of relevance
       | scoring - each result is scored based purely on if the search
       | terms are present or not.
       | 
       | For comparison, SQLite FTS (which a lot of people are unaware of)
       | actually does implement full index statistics, and SQLite FTS5
       | implements BM25 out of the box.
        
         | beardedetim wrote:
         | BM25 is definitely a big deal when you're doing FTS. It's one
         | of the reasons I've switched to Arango as my DB for the project
         | I'm working on that needs FTS. The fact it also comes with
         | graphs means I don't need to bang my head against CTEs either.
         | 
         | Not saying Arango can replace Postgres but for my needs, it's a
         | much better fit AND it offers the FTS that I need out of the
         | box.
        
           | philippemnoel wrote:
           | Arango is sweet! We've actually talked to many people who
           | switched from Postgres to Arango because of its better FTS.
           | This was one of the reasons for creating ParadeDB in the
           | first place. Many of the users who made this switch wished
           | they could have stayed in Postgres without compromising on
           | FTS.
        
         | packetlost wrote:
         | I wish SQLite FTS had a nicer to use API. Last time I looked (a
         | few months ago) it was very confusing and not very user
         | friendly.
        
           | simonw wrote:
           | I've built a bunch of Python (and CLI) code to try and help
           | make that work better: https://sqlite-
           | utils.datasette.io/en/stable/python-api.html#...
           | import sqlite_utils                  db =
           | sqlite_utils.Database("news.db")
           | db["articles"].enable_fts(["headline", "body"])
           | results = list(db["articles"].search("softball"))
           | 
           | Or with the CLI tool: https://sqlite-
           | utils.datasette.io/en/stable/cli.html#configu...
           | sqlite-utils enable-fts news.db articles headline body
           | sqlite-utils search news.db articles softball
        
             | packetlost wrote:
             | That's cool, maybe I can use it as a reference. Thank you!
        
         | sroussey wrote:
         | The problem with BM25 in a database, is that is can have
         | unexpected outcomes for some common use cases.
         | 
         | Take multi-tenancy.
         | 
         | What if user 1 has many more documents than user 2, and uses
         | "new orleans" a lot. But user 2 does not. User 2 does the
         | search.
         | 
         | The db will first use FTS, and then filter. So user 1 will bias
         | the results of user 2. Perhaps enough for user 2 to discover
         | what words are in user 1 corpus.
        
           | simonw wrote:
           | Doesn't that affect BM25 with a solution like Elasticsearch
           | as well? Or is that smart enough to apply filters to the
           | corpus statistics before calculating relevance?
           | 
           | You could solve that in SQLite by giving each user their own
           | separate FTS table - not impossibly complex, but would grow
           | increasingly messy if you have 10s of thousands of users.
        
             | vosper wrote:
             | One way to address this in Elasticsearch would be to put
             | each customers documents in their own index. Other than
             | that, as far as I can tell it's not smart enough to apply
             | filters first.
             | 
             | Also, shards can affect BM25 scoring:
             | https://www.elastic.co/blog/practical-bm25-part-1-how-
             | shards...
        
           | giovannibonetti wrote:
           | > Take multi-tenancy.
           | 
           | Using a SQLite DB per tenant is a good alternative to handle
           | that: https://turso.tech/
        
           | Guillaume86 wrote:
           | It could be solved with an index partitioning feature, no
           | idea if it already exists somewhere...
        
         | formerly_proven wrote:
         | I've learned information retrieval in university before ever
         | using any full text search engine and I was honestly kinda
         | surprised that most engines are literal textbook
         | implementations. IR was one of those courses where I definitely
         | expected the SotA to be further out than the third slide deck.
        
           | jhanschoo wrote:
           | It kind of is now if you count deep-learning based solutions
           | to an information need haha
        
             | formerly_proven wrote:
             | That is true
        
       | samsk wrote:
       | For internet user facing full-text search I would always prefer
       | to use a separate tool and not a SQL database, because
       | 
       | - the fulltext tool, can and should hold only 'active' data
       | 
       | - as it has only active data, data size is usually much much
       | smaller
       | 
       | - as data size is smaller, it better fits in RAM
       | 
       | - as data size is smaller, it can be probably run on poorer HW
       | the full ACID db
       | 
       | - as the indexed data are mostly read-only, the VM where it runs
       | can be relatively easily cloned (never seen a corruption till
       | now)
       | 
       | - as FTS tools are usually schema-less, there is no outage during
       | schema changes (compared to doing changes in ACID db)
       | 
       | - as the indexed data are mostly read-only, the can be easily
       | backup-ed
       | 
       | - as the backups are smaller, restoring a backup can be very fast
       | 
       | - and there is no such thing as database upgrade outage, you just
       | spin a new version, feed it with new data and than change the
       | backends
       | 
       | - functionality and extensibility
       | 
       | There is probably more, but if one doesn't needs to do a fulltext
       | search on whole database (and you usually don't), than its IMHO
       | better to use separate tool, that doesn't comes with all the ACID
       | constraints. Probably only downside is that you need to format
       | data for the FTS and index them, but if you want run a serious
       | full-text search, you will have to take almost the same steps in
       | the database.
       | 
       | On a 15y old side project, I use SOLR for full-text search,
       | serving 20-30k/request per day on a cheap VM, and PostgreSQL is
       | used as primary data source. The PostgreSQL has had several
       | longer outages - during major upgrades, because of disk
       | corruption, because of failed schema migrations, because of
       | 'problems' between the chair and keyboard etc... During that
       | outages the full-text search always worked - it didn't had most
       | recent data, but most users probably never noticed.
        
         | whalesalad wrote:
         | playing devils avocate
         | 
         | > the fulltext tool, can and should hold only 'active' data
         | 
         | very possible with postgres, too. instead of augmenting your
         | primary table to support search, you would have a
         | secondary/ephemeral table serving search duties
         | 
         | > as data size is smaller, it better fits in RAM
         | 
         | likewise, a standalone table for search helps here, containing
         | only the relevant fields and attributes. this can be further
         | optimized by using partial indexes.
         | 
         | > as FTS tools are usually schema-less, there is no outage
         | during schema changes (compared to doing changes in ACID db)
         | 
         | postgresql can be used in this manner by using json/jsonb
         | fields. instead of defining every field, just define one field
         | and drop whatever you want in it.
         | 
         | > as the indexed data are mostly read-only, the can be easily
         | backup-ed
         | 
         | same for postgres. the search table can be exported very easily
         | as parquet, csv, etc.
         | 
         | > as the backups are smaller, restoring a backup can be very
         | fast
         | 
         | tbh regardless of underlying mechanism, if your search index is
         | based on upstream data it is likely easier to just rebuild it
         | versus restoring a backup of throwaway data.
         | 
         | > The PostgreSQL has had several longer outages - during major
         | upgrades, because of disk corruption, because of failed schema
         | migrations, because of 'problems' between the chair and
         | keyboard etc...
         | 
         | to be fair, these same issues can happen with elasticsearch or
         | any other tool.
         | 
         | how big was your data in solr?
        
           | samsk wrote:
           | PostgreSQL is over 90GB, SOLR is under 1GB.
           | 
           | The PostgreSQL has to handle writes, reports, etc..., so I
           | doubt it will cache as efficiently as full-text engine,
           | you'll need to have full or partial replicas to distribute
           | the load.
           | 
           | And, yes, I agree, almost all of this can be done with
           | separate search table(s), but this table(s) will still live
           | in a 'crowded house', so again replicas will be probably
           | necessary at some point.
           | 
           | And using replicas brings new set of problems and costs ;-)
           | 
           | One client used MySQL for fulltext search, it was a single
           | beefy RDS server, costing well over $1k per month and the
           | costs kept raising. It was replaced with a single ~$100 EC2
           | machine running Meilisearch.
        
         | sidmitra wrote:
         | While agreeing somewhat with the post above, the answer isn't
         | really so black and white but depends on your context, i.e.
         | scale, app-complexity, search needs, data size etc.
         | 
         | >the fulltext tool, can and should hold only 'active' data
         | 
         | Same can be said about your DB. You can create separate tables,
         | partitions to hold only active data. I assume materialized
         | views are also there(but never used them for FTS). You can even
         | choose to create a separate postgres instance but only use it
         | for FTS data. The reason to do that might be to avoid coupling
         | your business logic to another ORM/DSL and having your team t
         | learn another query language and its gotchas.
         | 
         | > as data size is smaller, it better fits in RAM
         | 
         | > as data size is smaller, it better fits in RAM
         | 
         | > as the indexed data are mostly read-only, the VM where it
         | runs can be relatively easily cloned
         | 
         | > as the indexed data are mostly read-only, the can be easily
         | backup-ed
         | 
         | > as the backups are smaller, restoring a backup can be very
         | fast
         | 
         | Once the pg tables are separate and relevant indexing, i assume
         | PG can also keep most data in memory. There isn't anything
         | stopping you from using a different instance of PG for FTS if
         | needed.
         | 
         | > as FTS tools are usually schema-less, there is no outage
         | during schema changes
         | 
         | True. But in practice for example ES does have schema(mappings,
         | columns, indexes), and will have you re-index your rows/data in
         | some cases rebuild your index entirely to be safe. There are
         | field types and your querying will depend on the field types
         | you choose. i remember even SOLR did, because i had to figure
         | out Geospatial field types to do those queries, but haven't
         | used it in a decade so can't say how things stand now.
         | 
         | https://www.elastic.co/guide/en/elasticsearch/reference/curr...
         | 
         | While the OPs point stands, in a sufficiently complex FTS
         | search project you'll need all of the features and you'll have
         | to deal with the following on search oriented DBs
         | 
         | - Schema migrations or some async jobs to re-index data. Infact
         | it was worse than postgres because atleast in RDBMS migrations
         | are well understood. In ES devs would change field types and
         | expect everything to work without realizing only the new data
         | was getting it. So we had to re-index entire indexes sometimes
         | to get around this for each change in schema.
         | 
         | - At scale you'll have to tap into WAL logs via CDC/Debezium to
         | ensure your data in your search index is up-to-date and no rows
         | were missed. Which means dealing with robust queues/pub-sub.
         | 
         | - A whole another ORM or DSL for elasticsearch. If you don't
         | use these, your queries will soon start to become a mish-mash
         | of string concats or f-strings which is even worse for
         | maintainability.
         | 
         | https://elasticsearch-py.readthedocs.io/en/v8.14.0/
         | https://elasticsearch-dsl.readthedocs.io/en/latest/
         | 
         | - Unless your search server is directly serving browser
         | traffic, you'll add additional latency traversing hops. In some
         | cases meilisearch, typesense might work here.
         | 
         | I usually recommend engineers(starting out on a new search
         | product feature) to start with FTS on postgres and jump to
         | another search DB as and when needed. FTS support has improved
         | greatly on python frameworks like Django. I've made the other
         | choice of jumping too soon to a separate search DB and come to
         | regret it because it needed me to either build abstractions on
         | top or use DSL sdk, then ensure the data in both is "synced" up
         | and maintain observability/telemetry on this new DB and so on.
         | The time/effort investment was not linear is and the ROI wasn't
         | in the same range for the use-case i was working on.
         | 
         | I actually got more mileage out of search by just dumping small
         | CSV datasets into S3 and downloading them in the browser and
         | doing FTS client side via JS libs. This basically got me zero
         | latency search, albeit for small enough per-user datasets.
        
           | samsk wrote:
           | Yes, it always depends on application and purpose.
           | 
           | But once you will have to deal with a real FTS load, as you
           | say, you have to use separate instances and replication, use
           | materialized views etc.. and you find your self almost
           | halfway to implementing ETL pipeline and because of replicas,
           | with more complicated setup than having a FTS tool. And than
           | somebody finds out what vector search is, and ask you if
           | there is an PG extension for it (yes it is).
           | 
           | So IMHO with FTS in database, you'll probably have to deal
           | with the almost same problems as with external FTS
           | (materialized views, triggers, reindexing, replication,
           | migrations) but without all its features, and with constrains
           | of ACID database (locks, transactions, writes)...
           | 
           | Btw. I've SOLR right behind the OpenResty, so no hops. With
           | database there would be one more hop and bunch of SQL
           | queries, because it doesn't speaks HTTP (although I'm sure
           | there is an PG extension for that ;-)
        
       | corytheboyd wrote:
       | > We've talked to many companies who have tried and regretted
       | their decision to use Elasticsearch as their primary data store
       | 
       | oh no...
        
       | haolez wrote:
       | Are there success cases of using ElasticSearch/OpenSearch as the
       | primary data store in production? Just curious.
        
       | amai wrote:
       | Alternatives to both are
       | 
       | https://www.meilisearch.com/
       | 
       | https://typesense.org/
       | 
       | and maybe
       | 
       | https://github.com/Sygil-Dev/whoosh-reloaded
        
       ___________________________________________________________________
       (page generated 2024-08-06 23:00 UTC)