[HN Gopher] Postgres Full Text Search vs. the Rest
___________________________________________________________________
Postgres Full Text Search vs. the Rest
Author : kiwicopple
Score : 214 points
Date : 2022-10-14 13:41 UTC (9 hours ago)
(HTM) web link (supabase.com)
(TXT) w3m dump (supabase.com)
| endisneigh wrote:
| I always thought it was interesting that full text search handles
| misspellings, as opposed to just telling the client that they've
| misspelled and to clarify their intent and retry.
|
| From this it seems like the best solution would be to use
| SQLite's FTS and replicate the database using something like
| rqlite, litestream or mvsqlite, and then loadbalance the requests
| to SQLite.
|
| Given that SQLite is serverless in the classic sense, seems like
| a nobrainer.
| hardwaresofton wrote:
| This hasn't made the front page of HN (maybe someone should
| post it), but Litestream actually doesn't do replication that
| way any more -- I stumbled upon this commit like 2 days ago:
|
| https://github.com/benbjohnson/litestream/pull/411
|
| [EDIT] - I indulged myself:
| https://news.ycombinator.com/item?id=33204347
| avereveard wrote:
| Postgres comes with a trigram module. Supports indexing and its
| much more stable in its result than relying on stemming
| especially if you're interested in misspellings, the similarity
| comes with variants with specialized behavior for total
| similarity, word similarity etc.
|
| You might want to give that a chance.
| knutwannheden wrote:
| I don't have experience with PG FTS, but I've used SQLite FTS5
| and love it for its ease of use. Also I really like its NEAR()
| queries. I was a bit surprised that the post doesn't mention this
| capability and compare it with the other implementations (i.e. is
| it available and how does it compare performance-wise).
| canadiantim wrote:
| What do near queries do?
| knutwannheden wrote:
| It lets you specify a list of terms which mustn't be further
| apart from each other than a given number of tokens.
| ianbutler wrote:
| This lines up with what I've seen. I did a full deep dive
| benchmark against Elasticsearch and a Postgres FTS instance a few
| months ago.
|
| With some tuning and memory page size adjustments in Postgres you
| get very compelling speeds.
|
| I've actually previously written a full twitter thread on a
| somewhat inactive account about why we don't use existing dbs for
| search more.
|
| https://twitter.com/kinglycrow/status/1533270619353231360?s=...
|
| That thread led me to a project/product idea where you take an
| existing Postgres instance used for normal products or whatever,
| replicate it to various read only clusters with a custom search
| extension loaded and some orchestrator sitting on top (I've
| written most of one in rust that uses 0mq to communicate with
| it's nodes) and create drop in search from existing databases
| with a nice guided web gui for automatic tuning suitable for most
| business use cases.
|
| It fell off when my friend who wanted to help work on it went off
| to Google and you know working on a search engine for them became
| a bit of a no no. I still think it's a great idea with a lot of
| value, I should circle back.
|
| I also write (at length) about how ease of use will win search
| here: https://twitter.com/KinglyCrow/status/1532402654218964993
| hardwaresofton wrote:
| > That thread led me to a project/product idea where you take
| an existing Postgres instance used for normal products or
| whatever, replicate it to various read only clusters with a
| custom search extension loaded and some orchestrator sitting on
| top (I've written most of one in rust that uses 0mq to
| communicate with it's nodes) and create drop in search from
| existing databases with a nice guided web gui for automatic
| tuning suitable for most business use cases.
|
| Very interesting idea -- just want to add one thing, write it
| in rust (with pgx?[0]) :)
|
| [0]: https://github.com/tcdi/pgx
| ianbutler wrote:
| I was unaware, and absolutely! Thanks for showing me this.
| hardwaresofton wrote:
| Absolutely -- I think the next wave of awesome pg
| extensions will be built on this, and I'm excited to use
| yours!
| matterhorn2000 wrote:
| Had a use case recently where a vendor dataset of approximately
| 1.5 million companies needed to be searchable. Fiddled around
| with Postgres for quite a while, not really managing to get good
| results while keeping query times below 1 second on average.
| Ended up going with elastic search instead but I hate the
| additional complexity. Anyone with similar experience or who
| actually made a similar setup work with pgsql?
| CSDude wrote:
| We use Gin for searching cyber asset configuration at Resmo. It
| works good enough because users search for specific things like
| "logs bucket" or UUIDs and there are exact matches. Typos are not
| important. But if they'd be, Postgres search is not very helpful.
| But it beats hosting another thing just for this use case if
| you'd be happy enough.
| psanford wrote:
| I love sqlite's FTS. I have a number of personal projects where I
| have an archive of data and want to occasionally be able to do
| ad-hoc searches over and get reasonable ranking and results.
| Throwing sqlite fts at the problem gets me that in just a few
| lines of code.
|
| You can even combine it with a sqlite vfs[0] to run full text
| searches against a sqlite db stored in s3 relatively efficiently.
|
| [0]: https://github.com/psanford/sqlite3vfshttp
| icedchai wrote:
| That project looks neat! I could see it being useful in
| searching archived, infrequently accessed data.
| theandrewbailey wrote:
| > Only Typesense and MeiliSearch properly handled mis-spellings
| (the "suprman" query).
|
| That's where trigrams come in. pg_trgm can fix mis-spellings, and
| even compound words ("super man" to "superman"). I opt for
| performing the search as entered, and use trigrams to offer
| suggestions to the user, e.g. do the search for "suprman", but
| show an option to search for "superman".
| hardwaresofton wrote:
| Absolutely -- this is something I didn't include in the article
| -- it wasn't _really_ fast-search /type-ahead search per say.
|
| Gitlab actually has a great article about this for those who
| want to read more:
|
| https://about.staging.gitlab.com/blog/2016/03/18/fast-search...
| jonstewart wrote:
| I use both ElasticSearch/OpenSearch and Postgres Full Text
| Search.
|
| It's not even close: Postgres Full Text Search is a kludge
| compared to Lucene-based services. It is better than nothing, so,
| yes, I agree with the Pareto improvement idea. But it is also
| much, much worse than a Lucene-based alternative. If you need
| something and can't support having ElasticSearch/OpenSearch
| (heck, even Solr) running at the same time to support full text
| search, then sure, use Postgres FTS. But if the queries are still
| too slow or you need ranking or other text/information
| retrieval/NLP-type features, you'll want to give up on Postgres
| FTS and move to something else.
| nemothekid wrote:
| I think Postgres Full Text Search is "good enough" for some
| internal use cases, but for user facing search I had a project
| where there are a lot of weird edge cases that eventually makes
| it worth it to move something dedicated. I believed we used
| `websearch_to_tsquery` and users complained often enough where we
| swapped it out for quickwit. A big issue I remember is when users
| wanted to search for something and order by another field (such
| as find all documents matching query "x" order by creation date),
| and PG performance just took a massive nosedive. Even though we
| knew where was one or two fields we wanted to co-index with FTS,
| you can't have compound indexes when using FTS.
| stuckinhell wrote:
| Postgres Full Text Search has ALOT of limitations doesn't it?
|
| I briefly looked into for storing long form text archive records
| for my company two years ago. There are EXTREME limitations in
| the source code around it that no one really talks about but have
| important implications.
|
| Phrase searching doesn't really work the way our analysts would
| have liked and needed. There are a bunch of technical limitations
| in the source about how much data is tracked about the tokens. I
| can't remember exactly but there was something weird about
| stemming or lemming in the phrase search too.
|
| The following variables need to be bumped up to get phrase
| searching more accurate.
|
| - MAXSTRLEN (2047)
| https://github.com/postgres/postgres/blob/master/src/include...
|
| - MAXSTRPOS (1048575)
| https://github.com/postgres/postgres/blob/master/src/include...
|
| - MAXENTRYPOS (16363)
| https://github.com/postgres/postgres/blob/master/src/include...
|
| - MAXNUMPOS (256)
| https://github.com/postgres/postgres/blob/master/src/include...
|
| TsHeadline for highlighting doesn't consider phrase searching so
| you can get weird results. It probably needs to be rewritten to
| match websearch_to_tsquery.
|
| The accuracy issue drained the blood from my BA's faces. I
| eventually just went with on-premise SOLR because it's easier to
| add new hardware for it than elasticsearch.
|
| TLDR: postgres search is probably fine for short-form content,
| but major gotcha's once you go past those max limits. Also phrase
| searching will probably not work the way people are used too.
| erikcw wrote:
| Does anyone have any comparable experience using FTS in DuckDB?
|
| https://duckdb.org/docs/extensions/full_text_search
| hardwaresofton wrote:
| Oh also shameless plug -- If you liked all the F/OSS search
| engines in this post, you're going to love Awesome F/OSS[0][1] --
| I'm going to highlight a new open source project _every day_ for
| uhhh forever (terms and conditions apply).
|
| I know of at least 5 more search engines I can't wait to share
| and they're burning a hole in my bookmarks folder.
|
| Also, donating 60% of subscription revenue towards supporting the
| open source project of your choice (or one that was featured if
| you have no preference) -- you can sign up for that as well[2].
|
| [0]: https://awsmfoss.com
|
| [1]: https://twitter.com/awsmfoss
|
| [2]: https://baserow.vadosware.io/form/Xv5rChuZb-
| YodDOKjpJJpuDhrE...
| JoshGlazebrook wrote:
| I've used Meilisearch for the last two years with around 300k
| items in the search index for a personal project and its pretty
| good. The earlier verisons were a bit rough but recently even
| indexing a ton of data is speedy.
| SigmundA wrote:
| I still look forward to RUM indexes [1] getting integrated into
| PG which would build the foundation for better ranking functions
| such as TF/IDF or BM25. PG seems to lag behind here and hasn't be
| a lot of movement in a while
|
| 1. https://github.com/postgrespro/rum
| hardwaresofton wrote:
| Hey everyone -- I did the exploration for this piece and enjoyed
| getting a chance to try so many of the new search solutions!
|
| Feel free to ask me anything and let me know just how many
| postgres features I missed out on (pg_trgm is definitely one) --
| I love to learn about corners of Postgres I've missed.
| amjith wrote:
| Can you elaborate what this means?
|
| > Typesense was relatively strict with matches compared to
| other engines.
|
| Perhaps an example to clarify the statement?
| forinti wrote:
| I've worked with FTS in the database before, and I'd much rather
| use Elastic.
|
| There's an FDW for Elastic, so they work together quite nicely
| and you can have the best of both worlds.
| canadiantim wrote:
| Has anyone looked into Vespa? Yahoo's relatively recently (2017)
| open-sourced search engine. Seems incredibly full featured but
| also an exceptionally large code base (which is less than ideal).
| Still, worthy of inspection.
| freewizard wrote:
| PostgreSQL has good enough built in search but I find it limited
| in large real world global user cases:
|
| - CJK support is tricky for almost all of these search products,
| but postgresql is probably the only one that needs additional
| extensions, and some extensions are not updated for a while.
|
| - Scaling is tricky, bc on one hand, your search and other normal
| queries may not have same load pattern, but you only have one
| schema in multiple nodes, so sometimes you have to waste some
| resource; on the other hand, dependencies of the 3rd party CJK
| extensions makes it hard to use managed instances and maintaining
| your own is time consuming.
|
| - Most other search engines have well tuned APIs for typical use
| cases like filters/typo/etc, but you have to build your own in
| PostgreSQL. Of course it's not entirely bad thing bc PostgreSQL
| does have most flexible query capabilities for those less than
| typical use cases.
|
| End of day, search for end users have many semi-edge cases, one
| have to try all these engines to find the best fit. For large use
| cases, operational cost and user tolerance of search errors are
| all part of consideration. A conventional wisdom is do less work
| in the early days , and find better solutions later if have to.
| eric4smith wrote:
| Pretty much all these engines (and PG) are fast enough.
|
| Gist/Gin indexes are great and do a fine job to make millions of
| records searchable in very few milliseconds.
|
| The problem is accuracy.
|
| I've tried a few of these and accuracy is wildly different with
| most of them.
|
| Accuracy depends on how much you index.
|
| For example, even with decently designed weighting: If you index
| title, subtitle, tags and content -- too much of content ruin
| relevancy.
|
| And yes, we have proper relevancy sorting setup nicely.
|
| The best is something like elastic search, but it does not
| integrate nicely with PG for our use case. Because of the multi
| tenant nature of our data setup.
|
| PG will get us to that magic 80% but that leaves the all
| important 20% which is not great.
|
| Not to mention the inability to index and search Asian character
| sets.
|
| So even though we exclusively use it, it's not great and every
| time our team can't find something that drives my team into the
| Psql CLI -- I start searching for alternatives again.
|
| The big brick wall is updating, inserting and deleting from an
| external solution fast enough so we don't miss stuff.
|
| And quickly searching multi tenant records in that external
| solution.
|
| And no I'm not ready to use ES as my primary database.
| sicp-enjoyer wrote:
| Do you index keywords or bigrams? Bigrams provide context and
| help a ton with relevancy.
| ianbutler wrote:
| Yup this is exactly the problem, which is why I'm surprised
| with such decent extension support and heavy use no body has
| created something that builds better analysis and query
| pipelines specifically for full text search on top of these
| databases to solve the relevancy issues.
| benpacker wrote:
| Do you two know about ZomboDB
| (https://github.com/zombodb/zombodb)?
|
| Quite deep Postgres/Elasticsearch integration
| ianbutler wrote:
| That's really cool but that's not quite the solution to
| this problem that I would like.
|
| I haven't used ZomboDB but I have managed plenty of
| applications where Postgres was the main db and elastic was
| used for FTS. Zombo looks like it makes it easier to do
| that type of setup, but Postgres is so high performance (at
| least at parity on speed w/ Elastic) and Elastic is such a
| pain in the ass to manage from a DevOps perspective that
| I'd like to eliminate the need for Elastic by investing
| some time into Postgres Extensions.
|
| Plenty of great db solutions have come out of Postgres in
| the last few years based on extensions and fulltext search
| is one of the areas that has been very quiet, I think we
| can do better and I'd like to try.
|
| I think in the interim Zombo looks like a really good
| stopgap though!
| jrochkind1 wrote:
| I'm at least as interested in quality of results as I am in
| performance; full-text search is not something where there is an
| obvious single correct algorithm, and is something where results
| are going to vary between products (and depending on how they are
| configured, and what configuration options are available).
|
| This is a lot harder to evaluate and compare because it's so not-
| black-and-white, but... I don't think most people are choosing a
| full-text search option based on latency alone, are they? While
| some products may have unsatisfactory latency, most popular
| products probably do okay, and once you have good enough latency,
| the issue is results quality.
|
| Getting an overview of pg's fts capabilities, and a list of other
| products with similar, is I guess a useful starting point. But
| the article's focus on performance is not too useful to me; are
| there really an audience of people choosing an fts solution based
| mostly on performance? I want to know if pg can provide good
| enough results compared to fulltext-search-focused products --
| it's true it's less clear how to measure that, and may depend on
| your exact situation. Which is why I'd be interested in reading
| from someone who has something to say on it!
| kiwicopple wrote:
| In case you skip to the comments like I usually do, this post is
| a light comparison of Postgres FTS vs: -
| MeiliSearch: https://www.meilisearch.com/ - OpenSearch:
| https://opensearch.org/ - SQLite FTS:
| https://www.sqlite.org/fts5.html - Typesense:
| https://typesense.org/
|
| Some of the callouts from the results: - Even
| when consuming similar content, engines can produce different
| results, but generally ratios between queries on the same engine
| should be consistent. - Postgres FTS is quite close
| performance-wise to many other solutions, at least in their
| default configuration. - Only Typesense and MeiliSearch
| properly handled mis-spellings (the "suprman" query). -
| Typesense was relatively strict with matches compared to other
| engines. - OpenSearch was very fast with ingest, but also
| failed with the misspelling out of the box. - In-memory
| SQLite is by far the fastest, and PG isn't too far behind for
| this small data set.
| dewey wrote:
| MeiliSearch is really nice to work with, can recommend it. I
| built a Twitter bookmark search recently and used PG +
| MeiliSearch after using PG FTS in another project. Very happy
| with that choice. I like PG FTS for some use cases (internal
| search where you don't need the best performance) just as it's
| very simple and works well enough.
| canadiantim wrote:
| MeiliSearch doesn't seem secure enough for my needs, so I'm
| opting to use Typesense instead. E.g. no encryption at rest
| in MeiliSearch and they have no intention to support it.
| qdequelen wrote:
| Hello, I'm Quentin, Meilisearch's CEO. Meilisearch Cloud is
| encrypted at rest. We're using, at this time, a k8s with
| replicated and encrypted persistent volumes. We just didn't
| talk about it yet.
| dmitriid wrote:
| I'll add: (only?) Meilisearch is fully multi-lingual.
|
| I didn't dive into the various engines, but I was looking for
| one that would support Russian in a small side-project, and
| Meilisearch was the only one [1] that had it right there out of
| the box
|
| [1] Criteria for "only one" where "out of the box, ease of
| operation, no fiddling with configs, if not directly inside DB
| then with an easy HTTP API"
| qdequelen wrote:
| Good to know. If you find some Russian language support
| improvement we can do, don't hesitate to create an issue on
| our tokeniser https://github.com/meilisearch/charabia.
|
| Moreover, it's Hacktoberfest. If you want to help us improve
| the language support, it would be awesome!
| _tom_ wrote:
| The fact that misspelling need to be turned in open search does
| not mean anything "failed". If you want spelling correction,
| turn it on, don't call it failure.
| kiwicopple wrote:
| that's a fair comment - I'll get the author to update the
| post (I can't edit my comment anymore though)
|
| edit: fixed https://github.com/supabase/supabase/pull/9565
| joshxyz wrote:
| there's also quickwit and sonic
|
| different strengths and best use cases
|
| https://jxyz.me/modern-search-engines
| hardwaresofton wrote:
| Yup! I'd heard of Quickwit and sonic, but Quickwit seems to
| have pivoted to being a log-search focused engine. It's built
| on Tantivy[0] IIRC so I could have used something like
| Toshi[1].
|
| Sonic[2] I know much less about but it also seems good.
| Honestly anything _except_ ES is what I like to hear about
| (though OpenSearch is interesting).
|
| Another thing I think the world really needs is a CLI +/- API
| tool (ideally rust lib + CLI + API) that _unifies_
| interacting with these things. I got _REALLY_ close to
| writing it while working on this article, but I was already
| running late and I have a penchant for yak shaving.
|
| This won't be the last thing I write about search engines --
| there's been a LOT of movement in the space that has nothing
| to do with the elastic/opensearch debacle and I don't see
| enough tires getting kicked.
|
| [0]: https://github.com/quickwit-oss/tantivy
|
| [1]: https://github.com/toshi-search/Toshi
|
| [2]: https://github.com/valeriansaliou/sonic
| piyh wrote:
| I built a website around pg full text search and hit a lot of
| weird corner cases that ultimately will require me to move to
| something like elastic search. It's very powerful and useful, but
| if your main product is search, I wouldn't use it as a
| foundational pillar.
|
| They're fundamentally different products, but sometimes "pg full
| text search is good enough" isn't true.
| adhoc_slime wrote:
| I'm coming up to a large segment of work where we'll adding
| search with elasticsearch and this was the assessment I came to
| after reviewing PG FTS.
|
| I would say, frankly, that if you already have a PG database
| and
|
| 1. you want better full text search than a non-existing
| solution.
|
| 2. you don't need the excellent searching like we've come to
| expect from search engines.
|
| then use the full text search feature your PG database already
| has. I wouldn't necessarily go out of my way to use a PG
| database for FTS but I think its a good solution when you're in
| the right position to use it and its not too shabby.
| JackC wrote:
| I also had this experience. I read blog posts like this one
| that said "just start with postgres," so tried that, and spent
| a bunch of time with GIN and RUM indexes trying to work around
| edge cases before going to Elasticsearch, where things worked
| out of the box because it's designed to be a search engine. I'm
| sorry I can't remember specifics anymore -- it was stuff like
| filtering and result counts. There just tended to be more
| surprises because everything was a custom addition instead of a
| feature enabled by default. In retrospect for an evolving
| product, adding Elasticsearch was more of a one time up front
| integration investment, while working around edge cases in
| Postgres felt like an ongoing limitation we'd be stuck with.
| KaoruAoiShiho wrote:
| Agreed, it's okay for an MVP but not okay for a real app.
| piyh wrote:
| The amount of time needed for pg fts integration is less than
| elastic, but if you don't replace it ~immediately~ after your
| MVP has been validated, you might as well have gone elastic
| from the beginning.
| sicp-enjoyer wrote:
| Since most deployed applications search with:
| "for item in ... if item.containsSubstring(query)"
|
| I would say it's definitely useful for real apps.
| izietto wrote:
| We migrated FTS to OpenSearch from PostgreSQL a couple of months
| ago. Here the main issues we had with PostgreSQL FTS:
|
| - cross-tables FTS requires at least materialized views, which
| lock data for writing at refresh. This was too much worrying for
| us.
|
| - sorting by rank is not indexable, so we can't sort our dataset
| and have acceptable timings at the same time. Our dataset isn't
| enormous, but neither small (~1.5m records)
| mdavidn wrote:
| Large materialized views in general are really slow in
| PostgreSQL. If you go down the path of indexing joined data,
| consider database triggers or application-level hooks to update
| individual rows as-needed in a separate search table.
| simonw wrote:
| One thing that isn't really covered in this article is relevance
| calculations.
|
| A surprising (to me) thing about PostgreSQL FTS is that it
| doesn't do TF/IDF or BM25 relevance calculations.
|
| These calculations take statistics about the entire corpus into
| account - they do things like ranking a document higher if it
| contains terms from the search which are statistically rare in
| the overall corpus.
|
| PostgreSQL FTS uses how often the search terms appear in the
| document as part of the ranking score, but doesn't look at how
| common/rare individual terms are.
| https://www.postgresql.org/docs/current/textsearch-controls....
|
| OpenSearch/Elasticsearch uses BM25:
| https://opensearch.org/docs/latest/opensearch/rest-api/expla...
|
| SQLite stores these kinds of statistics and can support BM25 or
| TF/IDF or custom ranking functions. I wrote more about that here:
| https://simonwillison.net/2019/Jan/7/exploring-search-releva...
|
| Meilisearch does something a bit different (on reading this I
| don't think it considers full corpus statistics, but I may be
| misinterpreting that):
| https://docs.meilisearch.com/learn/core_concepts/relevancy.h...
|
| It looks like Typesense only considers the documents, not the
| overall corpus: https://typesense.org/docs/guide/ranking-and-
| relevance.html
| billythemaniam wrote:
| Yep. PG FTS works fine for a recall-based problem. If your data
| has a high cardinality then recall search works great. But with
| low cardinality data, you start needing BM25, ML, etc. The
| larger the dataset, the lower the cardinality in general. So PG
| FTS tends to work okay for small/medium datasets where most
| records are unique and you don't need advanced features like
| misspellings, synonyms, etc. Some of that advanced stuff is
| possible, but it's not out of the box and can be hard to set
| up.
| ianbutler wrote:
| Postgres Extensions should allow you to make a TF/IDF and BM25
| plugin pretty easily as well as dealing with the other
| relevance problems.
| throwaway81523 wrote:
| > doesn't do TF/IDF or BM25 relevance calculations
|
| In practice these are nearly worthless. Useful relevance
| ranking is difficult. Google sort of gamed it with PageRank
| (using inbound links). "Information Retrieval" by Buettcher et
| al is a good book on search implementation with a decent amount
| of info about relevance ranking, though maybe it no longer up
| to date: https://mitpress.mit.edu/9780262528870/information-
| retrieval...
| jrochkind1 wrote:
| Useful relevance ranking is definitely difficult, but a lot
| of people are getting something that they believe to be
| useful from, say, ElasticSearch or Solr, both of which use
| broadly TF/IDF-baesd algorithms.
|
| Are you suggesting that you can do just as well without
| TF/IDF, or with postgres fts specifically? In some/all
| circumstances? I'd be interested in hearing more about that,
| if it comes from experience.
|
| Looking at the table of contents for the Buttcher et al book,
| it looks to me like it covers TF/IDF-based algorithms pretty
| extensively. BM25 is in the table of contents specifically.
| Buttcher's own pedagogical search engine, _Wumpus_, includes
| a BM25 implementation.
| http://stefan.buettcher.org/cs/wumpus/docs/relevance.html
| billythemaniam wrote:
| In practice, it depends on the dataset size and use case. For
| web search? Mostly worthless, but can be a valuable signal to
| train ML. For small corpus of documents? BM25 alone does a
| pretty good job in general.
| hardwaresofton wrote:
| This is within reach due to the flexibility of features like
| setweight[0], but I don't know enough about TF/IDF or BM25.
|
| Seems like an excellent weekend project.
|
| [0]: https://www.postgresql.org/docs/current/textsearch-
| features....
| jrochkind1 wrote:
| I don't see how the setweight function could be used to give
| you something like a traditional TF/IDF algorithm, but
| perhaps I'm missing it. I would be interested in reading more
| about that if I am.
| hardwaresofton wrote:
| What I'm thinking is a little far out but it came up
| recently on a project where I'm using Postgres FTS (it's
| slow but I guess I might as well link it for now[0] --
| please do not HN hug it).
|
| Basically, I read on the internet (and was surprised by)
| the fact that setweight can be used _and combined_ with
| individual terms on tsvectors, and then those tsvectors can
| be combined and they keep their weightings.
|
| Some code from that project to illustrate:
| UPDATE podcasts SET fts_doc =
| setweight(to_tsvector(COALESCE(title, ' ')), 'A')
| || setweight(to_tsvector(COALESCE(homepage_url, ' ')), 'A')
| ||
| setweight(to_tsvector(COALESCE(podcast_idx_itunes_author, '
| ')), 'A') || setweight(to_tsvector(CO
| ALESCE(podcast_idx_itunes_ownername, ' ')), 'A')
| || setweight(to_tsvector(COALESCE(podcast_idx_host, ' ')),
| 'A') ||
| setweight(to_tsvector(array_to_string(categories, ' ')),
| 'B') ||
| setweight(to_tsvector(COALESCE(description_html, ' ')),
| 'D')
|
| Basically I'm making tsvectors out of chunks of the
| document, weighting them differently then recombining with
| other vectors without losing the weightings -- I'm thinking
| this could be applied to words identified by the corpus-
| level algos.
|
| So my simplistic thinking here is that if you've done the
| corpus level processing, you could build an intermediate
| data structure and re-evaluate each search document with
| the appropriate weighting. It would likely be quite the
| lengthy stored procedure, but seems like setweight could
| support the usecase? Maybe I'm being a bit optimistic.
|
| [0]: https://podcastsaver.com
| jrochkind1 wrote:
| Hm, I'm not totally following, but... would you have to
| recalculate all row values every time the corpus changes?
| I guess that could work for a seldom-changing corpus, not
| sure how popular a use case that is. I suspect most
| people would not be interested in such an approach,
| instead either making do without TF/IDF, or moving to a
| non-pg solution.
| [deleted]
| pbronez wrote:
| If you could figure that out, it would be an awesome
| plugin.
|
| PS podcastsaver looks neat!
|
| some quick feedback:
|
| 1) your "switch back to light mode" icon looks a LOT like
| a gear for a settings menu. I turned on dark mode, did a
| search, saw the "back to light mode" icon and thought
| "huh, the dark mode toggle is settings now? Weird choice,
| let's see what's there..."
|
| 2) the show notes seem truncated. It would be helpful for
| me to be able to search the show notes for a defined set
| of podcasts. Sometimes I remember that a podcast
| mentioned a product or service that I wanted to check
| out, but I can't remember the name of the product or the
| overall episode, and it's painful to find the right one
| by scrolling back through everything in my pod catcher.
|
| 3) are you tracking Podcasts 2.0? Some interesting
| additional stuff to index there.
| https://origin.fm/blog/podcasting-2point0/
| vosper wrote:
| It's curious they omitted MySQL, which is the usual point of
| comparison for Postgres.
|
| > InnoDB full-text search is modeled on the Sphinx full-text
| search engine, and the algorithms used are based on BM25 and
| TF-IDF ranking algorithms. For these reasons, relevancy
| rankings for InnoDB boolean full-text search may differ from
| MyISAM relevancy rankings.
|
| > InnoDB uses a variation of the "term frequency-inverse
| document frequency" (TF-IDF) weighting system to rank a
| document's relevance for a given full-text search query. The
| TF-IDF weighting is based on how frequently a word appears in a
| document, offset by how frequently the word appears in all
| documents in the collection. In other words, the more
| frequently a word appears in a document, and the less
| frequently the word appears in the document collection, the
| higher the document is ranked.
|
| MySQL's FTS is fine. We're using it at work for fairly basic
| boolean searches on millions of documents, also retrieving the
| relevancy score, and it's plenty fast. We'll outgrow it one
| day, but for now it's pretty easy and does well enough.
|
| https://dev.mysql.com/doc/refman/8.0/en/fulltext-boolean.htm...
| sandinmyjoints wrote:
| Agreed that MySQL FTS is "fine". Have you run into issues
| with the default value of innodb_ft_result_cache_limit (2
| GB)? That's bit us (OOM restarts) a couple times.
| sroussey wrote:
| BTW: when you outgrow it, it will be sudden. It's performance
| fails in a very non-linear way.
|
| Edit: keep the index in memory
| akoumjian wrote:
| Broad full text search is incredibly hard to do well. I've had to
| build, maintain, and improve multiple search systems. The
| difficulty is largely dependent on the context of the type of
| search problem you are solving. A lot of complexity depends on
| your answer to some of these questions: - Does
| the searcher already know the result they are looking for? (If
| yes, much easier) - Are there subjective and objective
| qualities of the results which should alter the search score,
| sometimes separate from the text being indexed? (If yes, much
| harder) - What is the quality of the text being indexed?
| (If end-user provided, this will vary widely)
|
| Ultimately, building good search is often a struggle against
| providing the best possible results between searcher intent and
| incomplete document evaluation criteria. People never really
| think about when a search is working really well, but they
| definitely know and complain when it's working poorly.
| hardwaresofton wrote:
| How do you classify the difference (in problem/scope, etc) of
| type ahead versus full blown search? It feels like these
| systems can be grown almost completely differently -- you could
| hack together completely in-browser "search" with datalists[0]
| and just prune it actively (and fool most users, depending on
| how varied searches were).
|
| I do wonder how much deep search really matters when people
| only really expect to look at the first page.
|
| [0]: https://developer.mozilla.org/en-
| US/docs/Web/HTML/Element/da...
| akoumjian wrote:
| "Type ahead" or "autocomplete" is absolutely a different type
| of problem, and often simpler. This generally falls into the
| use case where the searcher already knows the specific item
| they are looking for. Often the results are objects owned by
| or known to the user in question, or you are searching
| through a very limited and relatively static set of documents
| and topics. Reference documentation for software often falls
| into this category.
|
| In my experience, you don't have to spend a lot of time
| thinking about scoring and relevancy for these types of
| search. Generally you only want to include a small edit
| distance in the results at all to handle misspellings.
|
| This is so vastly different when you have a corpus of
| millions of documents about an encyclopedia's worth of
| topics.
|
| > I do wonder how much deep search really matters when people
| only really expect to look at the first page.
|
| Getting the first page to have the best quality and relevancy
| is much more difficult if the user is searching through
| something like scientific papers, stock video footage. It is
| a challenge in bridging the distance between ideas and
| expectations.
| zhxshen wrote:
| The combinatorics of full-text search are astronomical. A dashed-
| off comparison like this is pure noise. There are so many knobs
| to twiddle, and a nudge to any one of them can yield remarkably
| different results.
___________________________________________________________________
(page generated 2022-10-14 23:00 UTC)