[HN Gopher] Postgres Full Text Search vs. the Rest
       ___________________________________________________________________
        
       Postgres Full Text Search vs. the Rest
        
       Author : kiwicopple
       Score  : 214 points
       Date   : 2022-10-14 13:41 UTC (9 hours ago)
        
 (HTM) web link (supabase.com)
 (TXT) w3m dump (supabase.com)
        
       | endisneigh wrote:
       | I always thought it was interesting that full text search handles
       | misspellings, as opposed to just telling the client that they've
       | misspelled and to clarify their intent and retry.
       | 
       | From this it seems like the best solution would be to use
       | SQLite's FTS and replicate the database using something like
       | rqlite, litestream or mvsqlite, and then loadbalance the requests
       | to SQLite.
       | 
       | Given that SQLite is serverless in the classic sense, seems like
       | a nobrainer.
        
         | hardwaresofton wrote:
         | This hasn't made the front page of HN (maybe someone should
         | post it), but Litestream actually doesn't do replication that
         | way any more -- I stumbled upon this commit like 2 days ago:
         | 
         | https://github.com/benbjohnson/litestream/pull/411
         | 
         | [EDIT] - I indulged myself:
         | https://news.ycombinator.com/item?id=33204347
        
         | avereveard wrote:
         | Postgres comes with a trigram module. Supports indexing and its
         | much more stable in its result than relying on stemming
         | especially if you're interested in misspellings, the similarity
         | comes with variants with specialized behavior for total
         | similarity, word similarity etc.
         | 
         | You might want to give that a chance.
        
       | knutwannheden wrote:
       | I don't have experience with PG FTS, but I've used SQLite FTS5
       | and love it for its ease of use. Also I really like its NEAR()
       | queries. I was a bit surprised that the post doesn't mention this
       | capability and compare it with the other implementations (i.e. is
       | it available and how does it compare performance-wise).
        
         | canadiantim wrote:
         | What do near queries do?
        
           | knutwannheden wrote:
           | It lets you specify a list of terms which mustn't be further
           | apart from each other than a given number of tokens.
        
       | ianbutler wrote:
       | This lines up with what I've seen. I did a full deep dive
       | benchmark against Elasticsearch and a Postgres FTS instance a few
       | months ago.
       | 
       | With some tuning and memory page size adjustments in Postgres you
       | get very compelling speeds.
       | 
       | I've actually previously written a full twitter thread on a
       | somewhat inactive account about why we don't use existing dbs for
       | search more.
       | 
       | https://twitter.com/kinglycrow/status/1533270619353231360?s=...
       | 
       | That thread led me to a project/product idea where you take an
       | existing Postgres instance used for normal products or whatever,
       | replicate it to various read only clusters with a custom search
       | extension loaded and some orchestrator sitting on top (I've
       | written most of one in rust that uses 0mq to communicate with
       | it's nodes) and create drop in search from existing databases
       | with a nice guided web gui for automatic tuning suitable for most
       | business use cases.
       | 
       | It fell off when my friend who wanted to help work on it went off
       | to Google and you know working on a search engine for them became
       | a bit of a no no. I still think it's a great idea with a lot of
       | value, I should circle back.
       | 
       | I also write (at length) about how ease of use will win search
       | here: https://twitter.com/KinglyCrow/status/1532402654218964993
        
         | hardwaresofton wrote:
         | > That thread led me to a project/product idea where you take
         | an existing Postgres instance used for normal products or
         | whatever, replicate it to various read only clusters with a
         | custom search extension loaded and some orchestrator sitting on
         | top (I've written most of one in rust that uses 0mq to
         | communicate with it's nodes) and create drop in search from
         | existing databases with a nice guided web gui for automatic
         | tuning suitable for most business use cases.
         | 
         | Very interesting idea -- just want to add one thing, write it
         | in rust (with pgx?[0]) :)
         | 
         | [0]: https://github.com/tcdi/pgx
        
           | ianbutler wrote:
           | I was unaware, and absolutely! Thanks for showing me this.
        
             | hardwaresofton wrote:
             | Absolutely -- I think the next wave of awesome pg
             | extensions will be built on this, and I'm excited to use
             | yours!
        
       | matterhorn2000 wrote:
       | Had a use case recently where a vendor dataset of approximately
       | 1.5 million companies needed to be searchable. Fiddled around
       | with Postgres for quite a while, not really managing to get good
       | results while keeping query times below 1 second on average.
       | Ended up going with elastic search instead but I hate the
       | additional complexity. Anyone with similar experience or who
       | actually made a similar setup work with pgsql?
        
       | CSDude wrote:
       | We use Gin for searching cyber asset configuration at Resmo. It
       | works good enough because users search for specific things like
       | "logs bucket" or UUIDs and there are exact matches. Typos are not
       | important. But if they'd be, Postgres search is not very helpful.
       | But it beats hosting another thing just for this use case if
       | you'd be happy enough.
        
       | psanford wrote:
       | I love sqlite's FTS. I have a number of personal projects where I
       | have an archive of data and want to occasionally be able to do
       | ad-hoc searches over and get reasonable ranking and results.
       | Throwing sqlite fts at the problem gets me that in just a few
       | lines of code.
       | 
       | You can even combine it with a sqlite vfs[0] to run full text
       | searches against a sqlite db stored in s3 relatively efficiently.
       | 
       | [0]: https://github.com/psanford/sqlite3vfshttp
        
         | icedchai wrote:
         | That project looks neat! I could see it being useful in
         | searching archived, infrequently accessed data.
        
       | theandrewbailey wrote:
       | > Only Typesense and MeiliSearch properly handled mis-spellings
       | (the "suprman" query).
       | 
       | That's where trigrams come in. pg_trgm can fix mis-spellings, and
       | even compound words ("super man" to "superman"). I opt for
       | performing the search as entered, and use trigrams to offer
       | suggestions to the user, e.g. do the search for "suprman", but
       | show an option to search for "superman".
        
         | hardwaresofton wrote:
         | Absolutely -- this is something I didn't include in the article
         | -- it wasn't _really_ fast-search /type-ahead search per say.
         | 
         | Gitlab actually has a great article about this for those who
         | want to read more:
         | 
         | https://about.staging.gitlab.com/blog/2016/03/18/fast-search...
        
       | jonstewart wrote:
       | I use both ElasticSearch/OpenSearch and Postgres Full Text
       | Search.
       | 
       | It's not even close: Postgres Full Text Search is a kludge
       | compared to Lucene-based services. It is better than nothing, so,
       | yes, I agree with the Pareto improvement idea. But it is also
       | much, much worse than a Lucene-based alternative. If you need
       | something and can't support having ElasticSearch/OpenSearch
       | (heck, even Solr) running at the same time to support full text
       | search, then sure, use Postgres FTS. But if the queries are still
       | too slow or you need ranking or other text/information
       | retrieval/NLP-type features, you'll want to give up on Postgres
       | FTS and move to something else.
        
       | nemothekid wrote:
       | I think Postgres Full Text Search is "good enough" for some
       | internal use cases, but for user facing search I had a project
       | where there are a lot of weird edge cases that eventually makes
       | it worth it to move something dedicated. I believed we used
       | `websearch_to_tsquery` and users complained often enough where we
       | swapped it out for quickwit. A big issue I remember is when users
       | wanted to search for something and order by another field (such
       | as find all documents matching query "x" order by creation date),
       | and PG performance just took a massive nosedive. Even though we
       | knew where was one or two fields we wanted to co-index with FTS,
       | you can't have compound indexes when using FTS.
        
       | stuckinhell wrote:
       | Postgres Full Text Search has ALOT of limitations doesn't it?
       | 
       | I briefly looked into for storing long form text archive records
       | for my company two years ago. There are EXTREME limitations in
       | the source code around it that no one really talks about but have
       | important implications.
       | 
       | Phrase searching doesn't really work the way our analysts would
       | have liked and needed. There are a bunch of technical limitations
       | in the source about how much data is tracked about the tokens. I
       | can't remember exactly but there was something weird about
       | stemming or lemming in the phrase search too.
       | 
       | The following variables need to be bumped up to get phrase
       | searching more accurate.
       | 
       | - MAXSTRLEN (2047)
       | https://github.com/postgres/postgres/blob/master/src/include...
       | 
       | - MAXSTRPOS (1048575)
       | https://github.com/postgres/postgres/blob/master/src/include...
       | 
       | - MAXENTRYPOS (16363)
       | https://github.com/postgres/postgres/blob/master/src/include...
       | 
       | - MAXNUMPOS (256)
       | https://github.com/postgres/postgres/blob/master/src/include...
       | 
       | TsHeadline for highlighting doesn't consider phrase searching so
       | you can get weird results. It probably needs to be rewritten to
       | match websearch_to_tsquery.
       | 
       | The accuracy issue drained the blood from my BA's faces. I
       | eventually just went with on-premise SOLR because it's easier to
       | add new hardware for it than elasticsearch.
       | 
       | TLDR: postgres search is probably fine for short-form content,
       | but major gotcha's once you go past those max limits. Also phrase
       | searching will probably not work the way people are used too.
        
       | erikcw wrote:
       | Does anyone have any comparable experience using FTS in DuckDB?
       | 
       | https://duckdb.org/docs/extensions/full_text_search
        
       | hardwaresofton wrote:
       | Oh also shameless plug -- If you liked all the F/OSS search
       | engines in this post, you're going to love Awesome F/OSS[0][1] --
       | I'm going to highlight a new open source project _every day_ for
       | uhhh forever (terms and conditions apply).
       | 
       | I know of at least 5 more search engines I can't wait to share
       | and they're burning a hole in my bookmarks folder.
       | 
       | Also, donating 60% of subscription revenue towards supporting the
       | open source project of your choice (or one that was featured if
       | you have no preference) -- you can sign up for that as well[2].
       | 
       | [0]: https://awsmfoss.com
       | 
       | [1]: https://twitter.com/awsmfoss
       | 
       | [2]: https://baserow.vadosware.io/form/Xv5rChuZb-
       | YodDOKjpJJpuDhrE...
        
       | JoshGlazebrook wrote:
       | I've used Meilisearch for the last two years with around 300k
       | items in the search index for a personal project and its pretty
       | good. The earlier verisons were a bit rough but recently even
       | indexing a ton of data is speedy.
        
       | SigmundA wrote:
       | I still look forward to RUM indexes [1] getting integrated into
       | PG which would build the foundation for better ranking functions
       | such as TF/IDF or BM25. PG seems to lag behind here and hasn't be
       | a lot of movement in a while
       | 
       | 1. https://github.com/postgrespro/rum
        
       | hardwaresofton wrote:
       | Hey everyone -- I did the exploration for this piece and enjoyed
       | getting a chance to try so many of the new search solutions!
       | 
       | Feel free to ask me anything and let me know just how many
       | postgres features I missed out on (pg_trgm is definitely one) --
       | I love to learn about corners of Postgres I've missed.
        
         | amjith wrote:
         | Can you elaborate what this means?
         | 
         | > Typesense was relatively strict with matches compared to
         | other engines.
         | 
         | Perhaps an example to clarify the statement?
        
       | forinti wrote:
       | I've worked with FTS in the database before, and I'd much rather
       | use Elastic.
       | 
       | There's an FDW for Elastic, so they work together quite nicely
       | and you can have the best of both worlds.
        
       | canadiantim wrote:
       | Has anyone looked into Vespa? Yahoo's relatively recently (2017)
       | open-sourced search engine. Seems incredibly full featured but
       | also an exceptionally large code base (which is less than ideal).
       | Still, worthy of inspection.
        
       | freewizard wrote:
       | PostgreSQL has good enough built in search but I find it limited
       | in large real world global user cases:
       | 
       | - CJK support is tricky for almost all of these search products,
       | but postgresql is probably the only one that needs additional
       | extensions, and some extensions are not updated for a while.
       | 
       | - Scaling is tricky, bc on one hand, your search and other normal
       | queries may not have same load pattern, but you only have one
       | schema in multiple nodes, so sometimes you have to waste some
       | resource; on the other hand, dependencies of the 3rd party CJK
       | extensions makes it hard to use managed instances and maintaining
       | your own is time consuming.
       | 
       | - Most other search engines have well tuned APIs for typical use
       | cases like filters/typo/etc, but you have to build your own in
       | PostgreSQL. Of course it's not entirely bad thing bc PostgreSQL
       | does have most flexible query capabilities for those less than
       | typical use cases.
       | 
       | End of day, search for end users have many semi-edge cases, one
       | have to try all these engines to find the best fit. For large use
       | cases, operational cost and user tolerance of search errors are
       | all part of consideration. A conventional wisdom is do less work
       | in the early days , and find better solutions later if have to.
        
       | eric4smith wrote:
       | Pretty much all these engines (and PG) are fast enough.
       | 
       | Gist/Gin indexes are great and do a fine job to make millions of
       | records searchable in very few milliseconds.
       | 
       | The problem is accuracy.
       | 
       | I've tried a few of these and accuracy is wildly different with
       | most of them.
       | 
       | Accuracy depends on how much you index.
       | 
       | For example, even with decently designed weighting: If you index
       | title, subtitle, tags and content -- too much of content ruin
       | relevancy.
       | 
       | And yes, we have proper relevancy sorting setup nicely.
       | 
       | The best is something like elastic search, but it does not
       | integrate nicely with PG for our use case. Because of the multi
       | tenant nature of our data setup.
       | 
       | PG will get us to that magic 80% but that leaves the all
       | important 20% which is not great.
       | 
       | Not to mention the inability to index and search Asian character
       | sets.
       | 
       | So even though we exclusively use it, it's not great and every
       | time our team can't find something that drives my team into the
       | Psql CLI -- I start searching for alternatives again.
       | 
       | The big brick wall is updating, inserting and deleting from an
       | external solution fast enough so we don't miss stuff.
       | 
       | And quickly searching multi tenant records in that external
       | solution.
       | 
       | And no I'm not ready to use ES as my primary database.
        
         | sicp-enjoyer wrote:
         | Do you index keywords or bigrams? Bigrams provide context and
         | help a ton with relevancy.
        
         | ianbutler wrote:
         | Yup this is exactly the problem, which is why I'm surprised
         | with such decent extension support and heavy use no body has
         | created something that builds better analysis and query
         | pipelines specifically for full text search on top of these
         | databases to solve the relevancy issues.
        
           | benpacker wrote:
           | Do you two know about ZomboDB
           | (https://github.com/zombodb/zombodb)?
           | 
           | Quite deep Postgres/Elasticsearch integration
        
             | ianbutler wrote:
             | That's really cool but that's not quite the solution to
             | this problem that I would like.
             | 
             | I haven't used ZomboDB but I have managed plenty of
             | applications where Postgres was the main db and elastic was
             | used for FTS. Zombo looks like it makes it easier to do
             | that type of setup, but Postgres is so high performance (at
             | least at parity on speed w/ Elastic) and Elastic is such a
             | pain in the ass to manage from a DevOps perspective that
             | I'd like to eliminate the need for Elastic by investing
             | some time into Postgres Extensions.
             | 
             | Plenty of great db solutions have come out of Postgres in
             | the last few years based on extensions and fulltext search
             | is one of the areas that has been very quiet, I think we
             | can do better and I'd like to try.
             | 
             | I think in the interim Zombo looks like a really good
             | stopgap though!
        
       | jrochkind1 wrote:
       | I'm at least as interested in quality of results as I am in
       | performance; full-text search is not something where there is an
       | obvious single correct algorithm, and is something where results
       | are going to vary between products (and depending on how they are
       | configured, and what configuration options are available).
       | 
       | This is a lot harder to evaluate and compare because it's so not-
       | black-and-white, but... I don't think most people are choosing a
       | full-text search option based on latency alone, are they? While
       | some products may have unsatisfactory latency, most popular
       | products probably do okay, and once you have good enough latency,
       | the issue is results quality.
       | 
       | Getting an overview of pg's fts capabilities, and a list of other
       | products with similar, is I guess a useful starting point. But
       | the article's focus on performance is not too useful to me; are
       | there really an audience of people choosing an fts solution based
       | mostly on performance? I want to know if pg can provide good
       | enough results compared to fulltext-search-focused products --
       | it's true it's less clear how to measure that, and may depend on
       | your exact situation. Which is why I'd be interested in reading
       | from someone who has something to say on it!
        
       | kiwicopple wrote:
       | In case you skip to the comments like I usually do, this post is
       | a light comparison of Postgres FTS vs:                 -
       | MeiliSearch: https://www.meilisearch.com/       - OpenSearch:
       | https://opensearch.org/       - SQLite FTS:
       | https://www.sqlite.org/fts5.html       - Typesense:
       | https://typesense.org/
       | 
       | Some of the callouts from the results:                 - Even
       | when consuming similar content, engines can produce different
       | results, but generally ratios between queries on the same engine
       | should be consistent.       - Postgres FTS is quite close
       | performance-wise to many other solutions, at least in their
       | default configuration.       - Only Typesense and MeiliSearch
       | properly handled mis-spellings (the "suprman" query).       -
       | Typesense was relatively strict with matches compared to other
       | engines.       - OpenSearch was very fast with ingest, but also
       | failed with the misspelling out of the box.       - In-memory
       | SQLite is by far the fastest, and PG isn't too far behind for
       | this small data set.
        
         | dewey wrote:
         | MeiliSearch is really nice to work with, can recommend it. I
         | built a Twitter bookmark search recently and used PG +
         | MeiliSearch after using PG FTS in another project. Very happy
         | with that choice. I like PG FTS for some use cases (internal
         | search where you don't need the best performance) just as it's
         | very simple and works well enough.
        
           | canadiantim wrote:
           | MeiliSearch doesn't seem secure enough for my needs, so I'm
           | opting to use Typesense instead. E.g. no encryption at rest
           | in MeiliSearch and they have no intention to support it.
        
             | qdequelen wrote:
             | Hello, I'm Quentin, Meilisearch's CEO. Meilisearch Cloud is
             | encrypted at rest. We're using, at this time, a k8s with
             | replicated and encrypted persistent volumes. We just didn't
             | talk about it yet.
        
         | dmitriid wrote:
         | I'll add: (only?) Meilisearch is fully multi-lingual.
         | 
         | I didn't dive into the various engines, but I was looking for
         | one that would support Russian in a small side-project, and
         | Meilisearch was the only one [1] that had it right there out of
         | the box
         | 
         | [1] Criteria for "only one" where "out of the box, ease of
         | operation, no fiddling with configs, if not directly inside DB
         | then with an easy HTTP API"
        
           | qdequelen wrote:
           | Good to know. If you find some Russian language support
           | improvement we can do, don't hesitate to create an issue on
           | our tokeniser https://github.com/meilisearch/charabia.
           | 
           | Moreover, it's Hacktoberfest. If you want to help us improve
           | the language support, it would be awesome!
        
         | _tom_ wrote:
         | The fact that misspelling need to be turned in open search does
         | not mean anything "failed". If you want spelling correction,
         | turn it on, don't call it failure.
        
           | kiwicopple wrote:
           | that's a fair comment - I'll get the author to update the
           | post (I can't edit my comment anymore though)
           | 
           | edit: fixed https://github.com/supabase/supabase/pull/9565
        
         | joshxyz wrote:
         | there's also quickwit and sonic
         | 
         | different strengths and best use cases
         | 
         | https://jxyz.me/modern-search-engines
        
           | hardwaresofton wrote:
           | Yup! I'd heard of Quickwit and sonic, but Quickwit seems to
           | have pivoted to being a log-search focused engine. It's built
           | on Tantivy[0] IIRC so I could have used something like
           | Toshi[1].
           | 
           | Sonic[2] I know much less about but it also seems good.
           | Honestly anything _except_ ES is what I like to hear about
           | (though OpenSearch is interesting).
           | 
           | Another thing I think the world really needs is a CLI +/- API
           | tool (ideally rust lib + CLI + API) that _unifies_
           | interacting with these things. I got _REALLY_ close to
           | writing it while working on this article, but I was already
           | running late and I have a penchant for yak shaving.
           | 
           | This won't be the last thing I write about search engines --
           | there's been a LOT of movement in the space that has nothing
           | to do with the elastic/opensearch debacle and I don't see
           | enough tires getting kicked.
           | 
           | [0]: https://github.com/quickwit-oss/tantivy
           | 
           | [1]: https://github.com/toshi-search/Toshi
           | 
           | [2]: https://github.com/valeriansaliou/sonic
        
       | piyh wrote:
       | I built a website around pg full text search and hit a lot of
       | weird corner cases that ultimately will require me to move to
       | something like elastic search. It's very powerful and useful, but
       | if your main product is search, I wouldn't use it as a
       | foundational pillar.
       | 
       | They're fundamentally different products, but sometimes "pg full
       | text search is good enough" isn't true.
        
         | adhoc_slime wrote:
         | I'm coming up to a large segment of work where we'll adding
         | search with elasticsearch and this was the assessment I came to
         | after reviewing PG FTS.
         | 
         | I would say, frankly, that if you already have a PG database
         | and
         | 
         | 1. you want better full text search than a non-existing
         | solution.
         | 
         | 2. you don't need the excellent searching like we've come to
         | expect from search engines.
         | 
         | then use the full text search feature your PG database already
         | has. I wouldn't necessarily go out of my way to use a PG
         | database for FTS but I think its a good solution when you're in
         | the right position to use it and its not too shabby.
        
         | JackC wrote:
         | I also had this experience. I read blog posts like this one
         | that said "just start with postgres," so tried that, and spent
         | a bunch of time with GIN and RUM indexes trying to work around
         | edge cases before going to Elasticsearch, where things worked
         | out of the box because it's designed to be a search engine. I'm
         | sorry I can't remember specifics anymore -- it was stuff like
         | filtering and result counts. There just tended to be more
         | surprises because everything was a custom addition instead of a
         | feature enabled by default. In retrospect for an evolving
         | product, adding Elasticsearch was more of a one time up front
         | integration investment, while working around edge cases in
         | Postgres felt like an ongoing limitation we'd be stuck with.
        
         | KaoruAoiShiho wrote:
         | Agreed, it's okay for an MVP but not okay for a real app.
        
           | piyh wrote:
           | The amount of time needed for pg fts integration is less than
           | elastic, but if you don't replace it ~immediately~ after your
           | MVP has been validated, you might as well have gone elastic
           | from the beginning.
        
           | sicp-enjoyer wrote:
           | Since most deployed applications search with:
           | "for item in ... if item.containsSubstring(query)"
           | 
           | I would say it's definitely useful for real apps.
        
       | izietto wrote:
       | We migrated FTS to OpenSearch from PostgreSQL a couple of months
       | ago. Here the main issues we had with PostgreSQL FTS:
       | 
       | - cross-tables FTS requires at least materialized views, which
       | lock data for writing at refresh. This was too much worrying for
       | us.
       | 
       | - sorting by rank is not indexable, so we can't sort our dataset
       | and have acceptable timings at the same time. Our dataset isn't
       | enormous, but neither small (~1.5m records)
        
         | mdavidn wrote:
         | Large materialized views in general are really slow in
         | PostgreSQL. If you go down the path of indexing joined data,
         | consider database triggers or application-level hooks to update
         | individual rows as-needed in a separate search table.
        
       | simonw wrote:
       | One thing that isn't really covered in this article is relevance
       | calculations.
       | 
       | A surprising (to me) thing about PostgreSQL FTS is that it
       | doesn't do TF/IDF or BM25 relevance calculations.
       | 
       | These calculations take statistics about the entire corpus into
       | account - they do things like ranking a document higher if it
       | contains terms from the search which are statistically rare in
       | the overall corpus.
       | 
       | PostgreSQL FTS uses how often the search terms appear in the
       | document as part of the ranking score, but doesn't look at how
       | common/rare individual terms are.
       | https://www.postgresql.org/docs/current/textsearch-controls....
       | 
       | OpenSearch/Elasticsearch uses BM25:
       | https://opensearch.org/docs/latest/opensearch/rest-api/expla...
       | 
       | SQLite stores these kinds of statistics and can support BM25 or
       | TF/IDF or custom ranking functions. I wrote more about that here:
       | https://simonwillison.net/2019/Jan/7/exploring-search-releva...
       | 
       | Meilisearch does something a bit different (on reading this I
       | don't think it considers full corpus statistics, but I may be
       | misinterpreting that):
       | https://docs.meilisearch.com/learn/core_concepts/relevancy.h...
       | 
       | It looks like Typesense only considers the documents, not the
       | overall corpus: https://typesense.org/docs/guide/ranking-and-
       | relevance.html
        
         | billythemaniam wrote:
         | Yep. PG FTS works fine for a recall-based problem. If your data
         | has a high cardinality then recall search works great. But with
         | low cardinality data, you start needing BM25, ML, etc. The
         | larger the dataset, the lower the cardinality in general. So PG
         | FTS tends to work okay for small/medium datasets where most
         | records are unique and you don't need advanced features like
         | misspellings, synonyms, etc. Some of that advanced stuff is
         | possible, but it's not out of the box and can be hard to set
         | up.
        
         | ianbutler wrote:
         | Postgres Extensions should allow you to make a TF/IDF and BM25
         | plugin pretty easily as well as dealing with the other
         | relevance problems.
        
         | throwaway81523 wrote:
         | > doesn't do TF/IDF or BM25 relevance calculations
         | 
         | In practice these are nearly worthless. Useful relevance
         | ranking is difficult. Google sort of gamed it with PageRank
         | (using inbound links). "Information Retrieval" by Buettcher et
         | al is a good book on search implementation with a decent amount
         | of info about relevance ranking, though maybe it no longer up
         | to date: https://mitpress.mit.edu/9780262528870/information-
         | retrieval...
        
           | jrochkind1 wrote:
           | Useful relevance ranking is definitely difficult, but a lot
           | of people are getting something that they believe to be
           | useful from, say, ElasticSearch or Solr, both of which use
           | broadly TF/IDF-baesd algorithms.
           | 
           | Are you suggesting that you can do just as well without
           | TF/IDF, or with postgres fts specifically? In some/all
           | circumstances? I'd be interested in hearing more about that,
           | if it comes from experience.
           | 
           | Looking at the table of contents for the Buttcher et al book,
           | it looks to me like it covers TF/IDF-based algorithms pretty
           | extensively. BM25 is in the table of contents specifically.
           | Buttcher's own pedagogical search engine, _Wumpus_, includes
           | a BM25 implementation.
           | http://stefan.buettcher.org/cs/wumpus/docs/relevance.html
        
           | billythemaniam wrote:
           | In practice, it depends on the dataset size and use case. For
           | web search? Mostly worthless, but can be a valuable signal to
           | train ML. For small corpus of documents? BM25 alone does a
           | pretty good job in general.
        
         | hardwaresofton wrote:
         | This is within reach due to the flexibility of features like
         | setweight[0], but I don't know enough about TF/IDF or BM25.
         | 
         | Seems like an excellent weekend project.
         | 
         | [0]: https://www.postgresql.org/docs/current/textsearch-
         | features....
        
           | jrochkind1 wrote:
           | I don't see how the setweight function could be used to give
           | you something like a traditional TF/IDF algorithm, but
           | perhaps I'm missing it. I would be interested in reading more
           | about that if I am.
        
             | hardwaresofton wrote:
             | What I'm thinking is a little far out but it came up
             | recently on a project where I'm using Postgres FTS (it's
             | slow but I guess I might as well link it for now[0] --
             | please do not HN hug it).
             | 
             | Basically, I read on the internet (and was surprised by)
             | the fact that setweight can be used _and combined_ with
             | individual terms on tsvectors, and then those tsvectors can
             | be combined and they keep their weightings.
             | 
             | Some code from that project to illustrate:
             | UPDATE podcasts         SET fts_doc =
             | setweight(to_tsvector(COALESCE(title, ' ')), 'A')
             | || setweight(to_tsvector(COALESCE(homepage_url, ' ')), 'A')
             | ||
             | setweight(to_tsvector(COALESCE(podcast_idx_itunes_author, '
             | ')), 'A')                       || setweight(to_tsvector(CO
             | ALESCE(podcast_idx_itunes_ownername, ' ')), 'A')
             | || setweight(to_tsvector(COALESCE(podcast_idx_host, ' ')),
             | 'A')                       ||
             | setweight(to_tsvector(array_to_string(categories, ' ')),
             | 'B')                       ||
             | setweight(to_tsvector(COALESCE(description_html, ' ')),
             | 'D')
             | 
             | Basically I'm making tsvectors out of chunks of the
             | document, weighting them differently then recombining with
             | other vectors without losing the weightings -- I'm thinking
             | this could be applied to words identified by the corpus-
             | level algos.
             | 
             | So my simplistic thinking here is that if you've done the
             | corpus level processing, you could build an intermediate
             | data structure and re-evaluate each search document with
             | the appropriate weighting. It would likely be quite the
             | lengthy stored procedure, but seems like setweight could
             | support the usecase? Maybe I'm being a bit optimistic.
             | 
             | [0]: https://podcastsaver.com
        
               | jrochkind1 wrote:
               | Hm, I'm not totally following, but... would you have to
               | recalculate all row values every time the corpus changes?
               | I guess that could work for a seldom-changing corpus, not
               | sure how popular a use case that is. I suspect most
               | people would not be interested in such an approach,
               | instead either making do without TF/IDF, or moving to a
               | non-pg solution.
        
               | [deleted]
        
               | pbronez wrote:
               | If you could figure that out, it would be an awesome
               | plugin.
               | 
               | PS podcastsaver looks neat!
               | 
               | some quick feedback:
               | 
               | 1) your "switch back to light mode" icon looks a LOT like
               | a gear for a settings menu. I turned on dark mode, did a
               | search, saw the "back to light mode" icon and thought
               | "huh, the dark mode toggle is settings now? Weird choice,
               | let's see what's there..."
               | 
               | 2) the show notes seem truncated. It would be helpful for
               | me to be able to search the show notes for a defined set
               | of podcasts. Sometimes I remember that a podcast
               | mentioned a product or service that I wanted to check
               | out, but I can't remember the name of the product or the
               | overall episode, and it's painful to find the right one
               | by scrolling back through everything in my pod catcher.
               | 
               | 3) are you tracking Podcasts 2.0? Some interesting
               | additional stuff to index there.
               | https://origin.fm/blog/podcasting-2point0/
        
         | vosper wrote:
         | It's curious they omitted MySQL, which is the usual point of
         | comparison for Postgres.
         | 
         | > InnoDB full-text search is modeled on the Sphinx full-text
         | search engine, and the algorithms used are based on BM25 and
         | TF-IDF ranking algorithms. For these reasons, relevancy
         | rankings for InnoDB boolean full-text search may differ from
         | MyISAM relevancy rankings.
         | 
         | > InnoDB uses a variation of the "term frequency-inverse
         | document frequency" (TF-IDF) weighting system to rank a
         | document's relevance for a given full-text search query. The
         | TF-IDF weighting is based on how frequently a word appears in a
         | document, offset by how frequently the word appears in all
         | documents in the collection. In other words, the more
         | frequently a word appears in a document, and the less
         | frequently the word appears in the document collection, the
         | higher the document is ranked.
         | 
         | MySQL's FTS is fine. We're using it at work for fairly basic
         | boolean searches on millions of documents, also retrieving the
         | relevancy score, and it's plenty fast. We'll outgrow it one
         | day, but for now it's pretty easy and does well enough.
         | 
         | https://dev.mysql.com/doc/refman/8.0/en/fulltext-boolean.htm...
        
           | sandinmyjoints wrote:
           | Agreed that MySQL FTS is "fine". Have you run into issues
           | with the default value of innodb_ft_result_cache_limit (2
           | GB)? That's bit us (OOM restarts) a couple times.
        
           | sroussey wrote:
           | BTW: when you outgrow it, it will be sudden. It's performance
           | fails in a very non-linear way.
           | 
           | Edit: keep the index in memory
        
       | akoumjian wrote:
       | Broad full text search is incredibly hard to do well. I've had to
       | build, maintain, and improve multiple search systems. The
       | difficulty is largely dependent on the context of the type of
       | search problem you are solving. A lot of complexity depends on
       | your answer to some of these questions:                 - Does
       | the searcher already know the result they are looking for? (If
       | yes, much easier)       - Are there subjective and objective
       | qualities of the results which should alter the search score,
       | sometimes separate from the text being indexed? (If yes, much
       | harder)       - What is the quality of the text being indexed?
       | (If end-user provided, this will vary widely)
       | 
       | Ultimately, building good search is often a struggle against
       | providing the best possible results between searcher intent and
       | incomplete document evaluation criteria. People never really
       | think about when a search is working really well, but they
       | definitely know and complain when it's working poorly.
        
         | hardwaresofton wrote:
         | How do you classify the difference (in problem/scope, etc) of
         | type ahead versus full blown search? It feels like these
         | systems can be grown almost completely differently -- you could
         | hack together completely in-browser "search" with datalists[0]
         | and just prune it actively (and fool most users, depending on
         | how varied searches were).
         | 
         | I do wonder how much deep search really matters when people
         | only really expect to look at the first page.
         | 
         | [0]: https://developer.mozilla.org/en-
         | US/docs/Web/HTML/Element/da...
        
           | akoumjian wrote:
           | "Type ahead" or "autocomplete" is absolutely a different type
           | of problem, and often simpler. This generally falls into the
           | use case where the searcher already knows the specific item
           | they are looking for. Often the results are objects owned by
           | or known to the user in question, or you are searching
           | through a very limited and relatively static set of documents
           | and topics. Reference documentation for software often falls
           | into this category.
           | 
           | In my experience, you don't have to spend a lot of time
           | thinking about scoring and relevancy for these types of
           | search. Generally you only want to include a small edit
           | distance in the results at all to handle misspellings.
           | 
           | This is so vastly different when you have a corpus of
           | millions of documents about an encyclopedia's worth of
           | topics.
           | 
           | > I do wonder how much deep search really matters when people
           | only really expect to look at the first page.
           | 
           | Getting the first page to have the best quality and relevancy
           | is much more difficult if the user is searching through
           | something like scientific papers, stock video footage. It is
           | a challenge in bridging the distance between ideas and
           | expectations.
        
       | zhxshen wrote:
       | The combinatorics of full-text search are astronomical. A dashed-
       | off comparison like this is pure noise. There are so many knobs
       | to twiddle, and a nudge to any one of them can yield remarkably
       | different results.
        
       ___________________________________________________________________
       (page generated 2022-10-14 23:00 UTC)