[HN Gopher] Postgres Full-Text Search: A search engine in a data...
___________________________________________________________________
Postgres Full-Text Search: A search engine in a database
Author : twakefield
Score : 557 points
Date : 2021-07-27 15:20 UTC (1 day ago)
(HTM) web link (blog.crunchydata.com)
(TXT) w3m dump (blog.crunchydata.com)
| eric4smith wrote:
| Postgres FTS is normally quite good.
|
| But it does not know how to deal with languages like Chinese,
| Japanese and Thai.
|
| For that you have to use something like PGroonga extension.
|
| The rest of PostgreSQL mostly handles things ok, unless you try
| to sort on one of these languages and the same things happen
| again.
|
| There are ways around all of these problems. But it's not as easy
| as turning on Unicode and expecting everything to just work!
|
| Yes, I'm a native English speaker who started developing in Asia
| and discovered all of this recently.
| simonw wrote:
| The Django ORM includes support for PostgreSQL search and I've
| found it a really productive way to add search to a project:
| https://docs.djangoproject.com/en/3.2/ref/contrib/postgres/s...
| lettergram wrote:
| I actually built a search engine back in 2018 using postgresql
|
| https://austingwalters.com/fast-full-text-search-in-postgres...
|
| Worked quite well and still use it daily. Basically doing
| weighted searches on vectors is slower than my approach, but
| definitely good enough.
|
| Currently, I can search around 50m HN & Reddit comments in 200ms
| on the postgresql running on my machine.
| vincnetas wrote:
| Off-topic, but curious: what are your use cases for searching all
| HN and Reddit comments? I'm at the beginning of this path, having
| just crawled HN, but what to do with it is still a bit cloudy.
| lettergram wrote:
| I built a search engine that quantified the expertise of
| authors of comments. Then I created what I called "expert
| rank" that allowed me to build a really good search engine.
|
| Super good if you're at a company or something
|
| https://twitter.com/austingwalters/status/104189476543920128.
| ..
| turbocon wrote:
| Wow, that led me down quite a rabbit hole, impressive
| work.
| busymom0 wrote:
| If you are just interested in searching HN and don't need
| full data dumps, the Algolia search for HN is quite good and
| fast:
|
| https://hn.algolia.com
|
| They also have a free API.
|
| If you need data dumps, maybe look into Google BigQuery.
| rattray wrote:
| Nice - looks like the ~same approach recommended here of adding
| a generated `tsvector` column with a GIN index and querying it
| with `col @@ to_tsquery('english', query)`.
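|
| A minimal sketch of that pattern, assuming a hypothetical
| `comments` table with a `body` column (generated columns need
| Postgres 12+):
|
|     ALTER TABLE comments
|       ADD COLUMN tsv tsvector
|       GENERATED ALWAYS AS
|         (to_tsvector('english', coalesce(body, ''))) STORED;
|
|     CREATE INDEX comments_tsv_idx ON comments USING gin (tsv);
|
|     -- '&' means AND in to_tsquery's syntax
|     SELECT id, body
|     FROM comments
|     WHERE tsv @@ to_tsquery('english', 'postgres & search');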
| lettergram wrote:
| Yeah my internal approach was creating custom vectors which
| are quicker to search.
| pvsukale3 wrote:
| If you are using Rails with Postgres you can use pg_search gem to
| build the named scopes to take advantage of full text search.
|
| https://github.com/Casecommons/pg_search
| theandrewbailey wrote:
| > You could also look into enabling extensions such as unaccent
| (remove diacritic signs from lexemes) or pg_trgm (for fuzzy
| search).
|
| Trigrams (pg_trgm) are practically needed for usable search when
| it comes to misspellings and compound words (e.g. a search for
| "down loads" won't return "downloads").
|
| I also recommend using websearch_to_tsquery instead of using the
| cryptic syntax of to_tsquery.
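|
| For illustration, a rough sketch of both (table and column names
| are made up; pg_trgm's default similarity threshold is 0.3):
|
|     CREATE EXTENSION IF NOT EXISTS pg_trgm;
|     CREATE INDEX posts_title_trgm
|         ON posts USING gin (title gin_trgm_ops);
|
|     -- fuzzy match: '%' is pg_trgm's similarity operator, so
|     -- 'down loads' will still find 'downloads'
|     SELECT title FROM posts WHERE title % 'down loads';
|
|     -- friendlier query syntax than to_tsquery
|     SELECT title FROM posts
|     WHERE to_tsvector('english', title)
|           @@ websearch_to_tsquery('english', 'download -virus');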
| kyrra wrote:
| Trigrams are amazing. I was doing a sideproject where I wanted
| to allow for substring searching, and trigrams seemed to be the
| only way to do it (easily/well) in postgres. Gitlab did a great
| writeup on this a few years ago that really helped me
| understand it:
|
| https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
|
| You can also always read the official docs:
|
| https://www.postgresql.org/docs/current/pgtrgm.html
| Grimm1 wrote:
| Postgres Full-Text search is a great way to get search running
| for a lot of standard web applications. I recently used just this
| in Elixir to set up a simple search by keyword. My only complaint
| was Ecto (Elixir's query builder library) doesn't have first
| class support for it and neither does Postgrex the lower level
| connector they use. Still, using fragments with sanitized SQL
| wasn't too messy at all.
| mrinterweb wrote:
| I've seen Elasticsearch set up for applications that would have
| equal benefit from just using the postgresql db's full-text
| search they already have access to.
|
| The additional complexity is usually incurred when the data in
| postgresql changes, and those changes need to be mirrored up to
| Elasticsearch. Elasticsearch obviously has its uses, but for some
| cases, postgresql's built in full-text search can make more
| sense.
| [deleted]
| jcuenod wrote:
| Huh, just yesterday I blogged[0] about using FTS in SQLite[1] to
| search my PDF database. SQLite's full-text search is really
| excellent. The thing that tripped me up for a while was `GROUP
| BY` with the `snippet`/`highlight` function but that's the point
| of the blog post.
|
| [0] https://jcuenod.github.io/bibletech/2021/07/26/full-text-
| sea...
|
| [1] https://www.sqlite.org/fts5.html
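|
| For reference, a tiny FTS5 sketch along those lines (names are
| illustrative):
|
|     CREATE VIRTUAL TABLE pages
|         USING fts5(content, page_number UNINDEXED);
|
|     SELECT page_number,
|            snippet(pages, 0, '[', ']', '...', 8) AS excerpt
|     FROM pages
|     WHERE pages MATCH 'full text search'
|     ORDER BY rank;   -- rank is FTS5's built-in BM25 ordering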
| bob1029 wrote:
| We've been using SQLite's FTS capabilities to index customer
| log files since 2017 or so. It's been a wonderful approach for
| us. Even if we move to our own in-house data store (event
| sourced log), we would still continue using SQLite for tracing
| because it brings so many of these sorts of benefits.
| edwinyzh wrote:
| A very well written article about SQLite FTS5! One question - it
| seems that your search result displays the matching page number;
| how did you do that? Because as far as I know, unlike FTS4, FTS5
| has no `offsets` function.
| jcuenod wrote:
| I index each page individually. It's not in the article
| (because I also don't explain separating out the metadata)
| but I did weigh the options. I considered indexing page
| pairs, for example (and having every page indexed twice). But
| I figured that if search terms are broken across pages,
| they're likely to also be separated by text in headers &
| footers, page numbers, and footnotes. So in the end I decided
| to just index individual pages. The FTS table now has `id`,
| `page_number`, and `content` columns (where `id` is the
| foreign key to a table that stores the metadata).
| MushyRoom wrote:
| I was hyped when I found out about it a while ago. Then I wasn't
| anymore.
|
| When you have 12 locales (kr/ru/cn/jp/..) it's not that fun
| anymore. Especially on a one man project :)
| freewizard wrote:
| For a small project with simple full-text search requirements,
| try this generic parser:
| https://github.com/freewizard/pg_cjk_parser
| oauea wrote:
| Why support so many locales in a one man project?
| MushyRoom wrote:
| It's just a project to learn all the things that are web.
| It's mainly a database for a game now with most of the
| information sourced from the game (including its localization
| files).
|
| I'm slowly transitioning from MariaDB to Postgres - again as a
| learning experience. There is cool stuff, and there is annoying
| stuff, like reproducing case-insensitive + accent-insensitive
| matching (utf8_general_ci) in Postgres.
|
| I've looked into FTS and searched for the missing dictionaries to
| support all the locales, but Chinese is one of the harder ones.
| blacktriangle wrote:
| On one (dev) project here, we're up to 5 locales with a
| surprisingly small number of customers. The problem is that when
| your customers are global, a single customer can suddenly bring
| along multiple locales. I very much regret not taking
| localization far more seriously early in development, but we were
| blindsided by the interest outside the Anglosphere.
| habibur wrote:
| To attract more visitors/customers, I guess. I plan to extend to
| 10 languages too, as a one-man project.
| tabbott wrote:
| Zulip's search is powered by this built-in Postgres full-text
| search feature, and it's been a fantastic experience. There's a
| few things I love about it:
|
| * One can cheaply compose full-text search with other search
| operators by just doing normal joins on database indexes, which
| means we can cheaply and performantly support tons of useful
| operators (https://zulip.com/help/search-for-messages).
|
| * We don't have to build a pipeline to synchronize data between
| the real database and the search database. Being a chat product,
| a lot of the things users search for are things that changed
| recently; so lag, races, and inconsistencies are important to
| avoid. With the Postgres full-text search, all one needs to do is
| commit database transactions as usual, and we know that all
| future searches will return correct results.
|
| * We don't have to operate, manage, and scale a separate service
| just to support search. And neither do the thousands of self-
| hosted Zulip installations.
|
| Responding to the "Scaling bottleneck" concerns in comments
| below, one can send search traffic (which is fundamentally read-
| only) to a replica, with much less complexity than a dedicated
| search service.
|
| Doing fancy scoring pipelines is a good reason to use a
| specialized search service over the Postgres feature.
|
| I should also mention that a weakness of Postgres full-text
| search is that it only supports doing stemming for one language.
| The excellent PGroonga extension (https://pgroonga.github.io/)
| supports search in all languages; it's a huge improvement
| especially for character-based languages like Japanese. We're
| planning to migrate Zulip to using it by default; right now it's
| available as an option.
|
| More details are available here:
| https://zulip.readthedocs.io/en/latest/subsystems/full-text-...
| stavros wrote:
| I cannot tell you how much I love Zulip, but I can tell you
| that I have no friends any more because everyone is tired of me
| evangelizing it.
| brightball wrote:
| All of this. It's such a good operational experience that I
| will actively fight against the introduction of a dedicated
| search tool unless it's absolutely necessary.
| rattray wrote:
| TBH I hadn't known you could do weighted ranking with Postgres
| search before.
|
| Curious there's no mention of zombodb[0] though, which gives you
| the full power of elasticsearch from within postgres (with
| _consistency_, no less!). You have to be willing to tolerate slow
| writes, of course, so using postgres' built-in search
| functionality still makes sense for a lot of cases.
|
| [0] https://github.com/zombodb/zombodb
| craigkerstiens wrote:
| Zombo is definitely super interesting and we should probably
| add a bit in the post about it. Part of the goal here was that
| you can do a LOT with Postgres, without adding one more system
| to maintain. Zombo is great if you have Elastic around but want
| Postgres as the primary interface - but what if you don't want to
| maintain Elastic?
|
| My ideal is always to start with Postgres, and then see if it can
| solve my problem. I would never say Postgres is the best at
| everything it can do, but for most things it is good enough
| without having another system to maintain and wear a pager for.
| rattray wrote:
| Zombo does at least promise to handle "complex reindexing
| processes" for you (which IME can be very painful) but yeah,
| I assume you'd still have to deal with shard rebalancing,
| hardware issues, network failures or latency between postgres
| and elastic, etc etc.
|
| The performance and cost implications of Zombo are more
| salient tradeoffs in my mind - if you want to index one of
| the main tables in your app, you'll have to wait for a
| network roundtrip and a multi-node write consensus on every
| update (~150ms or more[0]), you can't `CREATE INDEX
| CONCURRENTLY`, etc.
|
| All that said, IMO the fact that Zombo exists makes it easier
| to pitch "hey lets just build search with postgres for now
| and if we ever need ES's features, we can easily port it to
| Zombo without rearchitecting our product".
|
| [0] https://github.com/zombodb/zombodb/issues/640
| zombodb wrote:
| I wonder what the ZomboDB developers are up to now? What
| great text-search-in-postgres things could they be secretly
| working on?
| syoc wrote:
| My worst search experiences always come from the features
| applauded here. Word stemming and removing stop words is a big
| hurdle when you know what you are looking for but get flooded by
| noise because some part of the search string was ignored. Another
| issue is having to type out a full word before you get a hit in
| dynamic search boxes (looking at you Confluence).
| Someone1234 wrote:
| I'd argue that isn't a problem with the _feature_, but a
| thoughtless implementation.
|
| A _good_ implementation will weigh verbatim results highest
| before considering the stop-word stripped or stemmed version.
| Configuring to_tsvector() to not strip stop words or apply a
| stemming dictionary is, in my opinion, a little clunky in
| Postgres: you'll want to make a new [language] dictionary and
| then call to_tsvector() using your new dictionary as the first
| parameter.
|
| After you've set up the dictionary globally, this would look
| _something_ like:
|
| setweight(to_tsvector('english_no_stem_stop', col), 'A') ||
| setweight(to_tsvector('english', col), 'B'))
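|
| For reference, a rough sketch of defining such a configuration
| (the name matches the snippet above; the `simple` template
| neither stems nor drops stop words when no StopWords file is
| given):
|
|     CREATE TEXT SEARCH DICTIONARY verbatim_dict
|         (TEMPLATE = simple);
|
|     CREATE TEXT SEARCH CONFIGURATION english_no_stem_stop
|         (COPY = english);
|
|     ALTER TEXT SEARCH CONFIGURATION english_no_stem_stop
|         ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
|                           word, hword, hword_part
|         WITH verbatim_dict;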
|
| I think blaming Postgres for adding stemming/stop-word support
| because it can be [ab]used for a poor search user experience is
| like blaming a hammer for a poorly built home. It is just a
| tool, it can be used for good or evil.
|
| PS - You can do a verbatim search without using to_tsvector(),
| but that cannot be easily passed into setweight() and you
| cannot use features like ts_rank().
| shakascchen wrote:
| No fun doing it for Chinese, especially for traditional Chinese.
|
| I had to install software, but on Cloud SQL you can't - you have
| to do it on your own instances.
| justusw wrote:
| Same story for performing searches in Japanese.
| rattray wrote:
| Something that's missing from this which I'm curious about is how
| far _can't_ postgres search take you?
|
| That is, what tends to be the "killer feature" that makes teams
| groan and set up Elasticsearch because you just can't do it in
| Postgres and your business needs it?
|
| Having dealt with ES, I'd really like to avoid the operational
| burden if possible, but I wouldn't want to choose an intermediary
| solution without being able to say, "keep in mind we'll need to
| budget a 3-mo transition to ES once we need X, Y, or Z".
| nostrademons wrote:
| Used to work on Google Search, used ES extensively for a
| startup I founded (which was sort of quasi-search...it was
| built around feed ranking, where the query is constant and a
| stream of documents is constantly coming in), and have also
| used Postgres extensively in other companies.
|
| The big problem with all the off-the-shelf search solutions
| (RDBMS full-text search, ES, Algolia) is that _search ranking is
| a complicated and subtle problem_, and frequently depends on
| signals that are _not in the document itself_. Google's big
| insight is that how _other_ people talk about a website is more
| important than how the website talks about itself, and its
| ranking algorithm weights accordingly.
|
| ES has the basic building blocks to construct such a ranking
| algorithm. In terms of fundamental infrastructure I found ES to
| be just as good as Google, and better in some ways. But its
| out-of-the-box ranking function _sucks_. Expect to put a domain
| expert just on search ranking and evaluation to get decent
| results, and they're going to have to delve pretty deeply into
| advanced features of ES to get there.
|
| AFAICT Postgres search only lets you tweak the ranking
| algorithm by assigning different weights to fields, assuming
| that the final document score is a linear combination of
| individual fields. This is usually not what you want - it's
| pretty common to have non-linear terms from different signals.
| wraptile wrote:
| > Google's big insight is that how other people talk about a
| website is more important than how the website talks about
| itself, and its ranking algorithm weights accordingly.
|
| I'm having trouble believing that, seeing how the top results for
| opinionated keywords are all SEO spam from websites no one visits
| on their own.
| lmm wrote:
| You don't remember what a game-changer Google was back in
| 2000 or so. Its results are pretty awful now, probably
| because of Goodhart's Law, but they were such a huge
| improvement over Jeeves et al..
| sdoering wrote:
| I remember when Google came into being and how amazed I was. It
| was something totally different, and I switched and never looked
| back for a lot of years.
|
| In recent years, however, I tend to massively agree with your
| sentiment and experience. Sadly, these days I don't find the
| things that are really helpful on pages 1-3.
| SOLAR_FIELDS wrote:
| Nice, insightful comment. Can you give a bit of additional
| insight into when it makes sense to use ES vs Postgres? My
| takeaway from your comment is that unless your product depends on
| search as a central component, the built-in search is _good
| enough_. Is that an incorrect takeaway?
| nostrademons wrote:
| That's probably a decent takeaway. If full-text search is a
| "nice to have" (a bonus feature for advanced users, for
| example) you can probably rely on built-in Postgres search.
| If search is a fundamental way that users interact with
| your product, you want to spend the time and money to get
| it right, and that probably means ElasticSearch and a
| custom ranking function.
| simondotau wrote:
| There is an incredibly large difference between effective
| scoring for data that is being gamed and data that is not
| being gamed. Most people's full text data isn't being gamed
| quite like web search SEO.
|
| There is also an incredibly large difference between
| effective scoring for data that has deep relationships and
| data that does not. Most people's full text data doesn't have
| deep relationships like web pages do with inbound links.
| __jem wrote:
| I built a big feature with percolate as well and it really
| sold me on the possibilities of ES for product architecture.
| cmckn wrote:
| I've never really had a good idea of what Percolator _is_.
| I understand it to be a "streaming" indexer that replaced a
| "batch" indexer based on MapReduce. ES allows indexing
| individual documents (or small batches) vs. rebuilding the
| entire index; is that the connection you're making?
| IgorPartola wrote:
| So let's say that you are building a search engine for
| performance car parts. There are going to be a bunch of
| technical terms you use there that are not necessarily going
| to stand out in the document itself but you know them to be
| important. For example, the amount of boost pressure a turbo
| can provide or the number of pistons in a brake caliper. Is
| there some structured way to specify the grammar which is
| used for such terms and treat those as important such that
| when a user puts in "19psi turbo" that they don't get a bunch
| of results for just "turbo" which isn't exactly what they
| want?
| dreamer7 wrote:
| When I implemented search for my app, I built a hierarchy
| of the different attributes weighted by levels and scored
| search phrases for how closely they matched. This is for
| mobile phones. For example, searching for "Silver iphone 11
| pro" would give a higher ranking to all iphone 11 pros that
| are Silver, then comes differently coloured 11 Pros,
| followed by 11 Pro Max and 11s, lastly all silver coloured
| phones.
| bogomipz wrote:
| >"Used to work on Google Search ..."
|
| Might you or anyone else have some recommendations for books or
| other resources on large-scale search architecture that you think
| are worthy reads on the subject?
|
| I would also be interested in hearing whether you or anyone else
| might have any similar resources you could recommend on the
| subject of "search ranking"?
| johnx123-up wrote:
| Did you get a chance to compare MeiliSearch?
| rawoke083600 wrote:
| >search ranking is a complicated and subtle problem
|
| Absolutely! I think the saying "the devil is in the details" has
| never been more appropriate than here.
|
| Sure, most users agree on the extremes: the REALLY bad search
| results (putting in 'apple' and getting LCD TVs [maybe the TVs
| are in the parent category 'Electronics' and you have the
| category and popularity boosts set too high?]) and the REALLY
| good (results that you expected).
|
| Search results are commonly evaluated with recall (did all the
| relevant documents get returned?) vs precision (how many of the
| results are 'correct'?).
|
| But that is just one of many, many metrics.
|
| The biggest issues are non-tech people (like your boss or
| manager) walking in and "discussing" their pet-peeve search
| query, because in their mind, when they put in APPLE iPhone we
| should be returning ONLY Apple iPhones and NOT Apple iPhone
| accessories - or maybe we should be returning only the "latest
| iPhone", not the model from 2 generations back.
|
| I've commented this before: you can usually only shoot for an
| "average amount of happiness" (sounds like Arthur Schopenhauer?
| :P) for most users. Never "ok people, search works perfectly" for
| everyone.
|
| As for practical matters, we found it helps to build a
| "search-test-suite" where you put in the manager's pet peeves as
| well as any angry emails about search queries; whenever you do
| search tuning it's then easy to spot any query regression. And of
| course this needs to be an automated search-test-suite.
| sandGorgon wrote:
| It doesn't do TF-IDF or BM25 - the current state of the art in
| search relevance algorithms.
|
| That's where it can't be used for anything serious.
| SigmundA wrote:
| That's why this really needs to get merged:
|
| https://github.com/postgrespro/rum
| rattray wrote:
| TF/IDF is listed as a TODO on that repo, and I don't see a
| PR which promises to provide it.
| SigmundA wrote:
| It lays the groundwork by storing the needed metadata in the
| index; yes, it needs more work, and I wish there was more
| interest in doing it.
|
| I really think this is a neglected area, and if PG were able to
| merge in TF-IDF and BM25 there would be little reason to use a
| separate search db / engine, and many advantages to having it
| integrated.
| Thaxll wrote:
| PG is average at best for text search, it's not even good.
| nextaccountic wrote:
| If your search needs outgrow Postgres' native search engine,
| you can use Postgres search with an ElasticSearch backend,
| using Zombo
|
| https://github.com/zombodb/zombodb
|
| It basically gives you a new kind of index (create index ..
| using zombodb(..) ..)
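|
| Roughly, per the ZomboDB README (options and operator syntax may
| differ between versions):
|
|     CREATE EXTENSION zombodb;
|
|     CREATE INDEX idx_products_zdb ON products
|            USING zombodb ((products.*))
|            WITH (url = 'http://localhost:9200/');
|
|     -- '==>' is ZomboDB's full-text query operator
|     SELECT * FROM products WHERE products ==> 'sports or box';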
| mjewkes wrote:
| Exact phrase matching. This generally requires falling back to
| ILIKE, which is not performant.
| e12e wrote:
| Even if using ILIKE over the result of an imprecise query?
| [deleted]
| mjewkes wrote:
| ILIKE itself is a linear scan. The only way to index it is with
| trigram indices, which are very inefficient (and sometimes not
| usable) if you're searching document-length content.
|
| Whether or not it works in your specific situation depends
| on your use case.
| simonw wrote:
| Exact phrase searching works in PostgreSQL full-text search -
| here's an example:
| https://simonwillison.net/search/?q=%22nosql+database%22
|
| I'm using search_type=websearch https://github.com/simonw/sim
| onwillisonblog/blob/a5b53a24b00...
|
| That's using websearch_to_tsquery() which was added in
| PostgreSQL 11: https://www.postgresql.org/docs/11/textsearch-
| controls.html#...
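|
| For the curious: quoted phrases become `<->` (followed-by)
| queries, and the terms are still stemmed. Something like:
|
|     SELECT websearch_to_tsquery('english', '"nosql database"');
|     -- roughly: 'nosql' <-> 'databas'
|
|     -- hypothetical table/column names
|     SELECT id FROM blog_entry
|     WHERE search_document
|           @@ websearch_to_tsquery('english', '"nosql database"');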
| mjewkes wrote:
| Try https://simonwillison.net/search/?q=%22your+own+benchma
| rks%2...
|
| Looks like we get 37 results, of which 2 are true
| positives.
|
| Looks like "your" and "own" are both contained in the
| english.stop stopwords list. So you could fix this by
| removing stopwords from your dictionary.
|
| While disabling the stemmer is relatively easy (use the
| 'simple' language setting for your ts_query), altering the
| stopword dictionaries is more involved, and not easy to
| maintain or pass between developers/environments, and not
| at all easy to share between queries.
|
| And so the most common suggestion is to use ILIKE.
|
| Lucene has no problems with any of this.
| SahAssar wrote:
| I had some issues with this recently as I couldn't get a
| FTS query to find something looking like a path or url in a
| query. As an example from your site:
| https://simonwillison.net/2020/Jan/6/sitemap-xml/ contains
| the exact text https://www.niche-museums.com/, but I cannot
| find a way to search for that phrase exactly (trying
| https://simonwillison.net/search/?q=%22www.niche-
| museums.com... works though). I tried both in my own psql
| setup and on your site, and it seems like exact phrase
| searching is limited to the language used, even if it would
| be an exact string match.
|
| Are there any workarounds for that?
| rattray wrote:
| Pardon my ignorance - what is exact phrase matching and why
| doesn't it work with tsvector?
| mjewkes wrote:
| Exact phrase matching is what google (sometimes? used to?)
| do for you if you put your search terms in double "full
| quotes".
|
| It returns only results that contain the exact multi word
| sequence in exactly the same order.
| amichal wrote:
| Postgres was not good at (for us):
|
| - IDF and other corpus-based relevancy measures; had to hand-roll
|
| - thesaurus and misspelling handling - again possible, of course,
| with preprocessing and by adding config files
|
| - non-Latin-alphabet languages, e.g. Arabic - needed filesystem
| access (we used AWS RDS so couldn't do it) to add a
| dictionary-based stemmer / word breaker
|
| We used ES or Solr for those cases. For English FTS with 100k
| documents, doing it in PG is super easy and one less dependency.
| some_developer wrote:
| Anecdotal note:
|
| A few years ago we added yet another part to our product and,
| whilst ES worked "okay", we got a bit weary of ES due to "some
| issues" (some bug in the architecture keeping things not
| perfectly in sync, certain queries with "joins" of types taking
| long, demand on HW due to the size of the database, no proper
| multi-node setup due to $$$ and time constraints, etc.; small
| things piling up over time).
|
| Bright idea: let's see how far Postgres, which is our primary
| datastore, can take us!
|
| Unfortunately, the feature never made it fully into production.
|
| We thought that on paper, the basic requirements were ideal:
|
| - although the table has multiple hundreds of millions of
| entries, natural segmentation by customer IDs made individual
| result sets much smaller
|
| - no weighted search results needed: datetime-based ordering is
| perfect enough for this use case, so we thought it would be easy
| to come up with the "perfect index [tm]"
|
| Alas, we didn't even get that far:
|
| - we identified ("only") 2 columns necessary for the search =>
| "yay, easy"
|
| - one of those columns was multi-language; though we didn't have
| specific requirements and did not have to deal with
| language-specific behaviour in ES, we had to decide on one for
| the TS vectorization (the details of why "simple" wasn't
| appropriate for this one column elude me; it certainly was for
| the other one)
|
| - unsure which one, or both, we would need, for one of the
| columns we created both indices (difference being the
| "language")
|
| - we started out with a GIN index (see
| https://www.postgresql.org/docs/9.6/textsearch-indexes.html )
|
| - creating a single index took > 15 hours
|
| But once the second index was done - and we had not even rolled
| out the feature in the app itself (which at this point was still
| an ever-changing MVP) - we suddenly, and seemingly unrelatedly,
| got hit by a lot of customer complaints that totally different
| operations on this table (INSERTs and UPDATEs) had started
| getting slow (like 5-15 seconds slow, something which usually
| takes tiny ms).
|
| Backend developer eyes were wide open O_O
|
| But since we knew that the second index had just finished, after
| checking the Postgres logs we decided to drop the FTS indices
| and, lo and behold, "performance problem solved".
|
| Communication lines were very short back then (still are today,
| actually) and it was promptly decided we just cut the search
| functionality from this new part of the product and be done
| with it. This also solved the problem, basically (guess there's
| some "business lesson" to be learned here too, not just
| technical ones).
|
| Since no one within the company counter argued this decision,
| we did not spend more time analyzing the details of the
| performance issue though I would have loved to dig into this
| and get an expert on board to dissect this.
|
| --
|
| A year or so later I had a bit of free time and analyzed one
| annoying, recurring slow UPDATE query problem on a completely
| different table, but one also involving FTS on a single column,
| again with a GIN index. That's when I stumbled over
| https://www.postgresql.org/docs/9.6/gin-implementation.html
|
| > Updating a GIN index tends to be slow because of the
| intrinsic nature of inverted indexes: inserting or updating one
| heap row can cause many inserts into the index (one for each
| key extracted from the indexed item). As of PostgreSQL 8.4, GIN
| is capable of postponing much of this work by inserting new
| tuples into a temporary, unsorted list of pending entries. When
| the table is vacuumed or autoanalyzed, or when
| gin_clean_pending_list function is called, or if the pending
| list becomes larger than gin_pending_list_limit, the entries
| are moved to the main GIN data structure using the same bulk
| insert techniques used during initial index creation. This
| greatly improves GIN index update speed, even counting the
| additional vacuum overhead. Moreover the overhead work can be
| done by a background process instead of in foreground query
| processing.
|
| In this particular case I was able to solve the occasional slow
| UPDATE queries with "FASTUPDATE=OFF" on that table and,
| thinking back about the other issue, it might have solved or
| minimized the impact.
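|
| For anyone hitting the same thing, a sketch of the knobs involved
| (index and table names are hypothetical):
|
|     -- disable the pending list on an existing GIN index
|     ALTER INDEX comments_tsv_gin SET (fastupdate = off);
|
|     -- flush entries already queued in the pending list
|     SELECT gin_clean_pending_list('comments_tsv_gin');
|
|     -- or build the index that way from the start
|     CREATE INDEX comments_tsv_gin2 ON comments
|            USING gin (tsv) WITH (fastupdate = off);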
|
| Back to the original story: yep, this one table can have "peaks"
| of inserts, but it's far from "facebook scale" or whatever;
| basically 1.5k inserts / second was the absolute rare peak I
| measured and usually it's in the <500 area. But I guess it was
| enough in this scenario to add latency within the database.
|
| --
|
| Turning my memory back further: I was always "pro" trying to
| minimize / get rid of ES after learning about
| http://rachbelaid.com/postgres-full-text-search-is-good-enou...
| even before we used any FTS feature. It also mentions the
| GIN/GiST issue, but alas, in our case ElasticSearch is good
| enough and, besides the issues we've had with it, actually easier
| to reason about (so far).
| manigandham wrote:
| It can't go that far. Postgres is limited by dictionaries,
| stemming/processing, query semantics (like fuzzy searching),
| and the biggest issue of all being a lack of modern relevance
| algorithms. It's good for limited scenarios where you just need
| more than a SQL LIKE statement, and chaining some functions
| together can get you decent results [1] without adding another
| datastore.
|
| However search tech is pretty mature with Lucene at the core
| and there are many better options [2] from in-process libraries
| to simple standalone servers to full distributed systems like
| Elastic. There are also other databases (relational like
| MemSQL, or documentstores like MongoDB/RavenDB) that are adding
| search as native querying functions with most of the abilities
| of ES. If search is a core or complex part of your application
| (like patterns in raw image data or similarities in audio
| waveforms) then that's where ES will excel.
|
| 1. https://stackoverflow.com/questions/46122175/fulltext-
| search...
|
| 2.
| https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
| grncdr wrote:
| My experience has been that sorting by relevance ranking is
| quite expensive. I looked into this a bit and found
| https://github.com/postgrespro/rum (and some earlier slide
| decks about it) that explains why the GIN index type can't
| support searching and ranking itself (meaning you need to do
| heap scans for ranking). This is especially problematic if your
| users routinely do searches that match a lot of documents and
| you only want to show the top X results.
|
| Edit: if any of the Crunchy Data people are reading this:
| support for RUM indexes would be super cool to have in your
| managed service.
| thom wrote:
| Top X queries should be optimised with gist indexes.
| grncdr wrote:
| Ah sounds like I need to update myself on this. I don't
| recall what the other trade-offs vs GIN indexes are.
| craigkerstiens wrote:
| Actually thats a great suggestion. We need to take a deeper
| look at the code itself and ability to support the extension,
| but we'll definitely take and evaluate it at some of our
| upcoming roadmap planning for Crunchy Bridge
| (https://www.crunchybridge.com).
| grncdr wrote:
| Cool cool
|
| If I could have your other ear for a moment: support for
| _any_ EU-based cloud provider would be super nice. Ever
| since the Privacy Shield fig leaf was removed, questions
| about whether we store any data under US jurisdiction have
| become a lot more frequent.
| craigkerstiens wrote:
| Do you have a preferred one?
|
| We're currently on the big 3 US ones, in process of
| working on our 4th provider, our goal is very much to
| deliver the best Postgres experience whether on bare
| metal/on-premise or in the cloud, self hosted or fully
| managed.
|
| Edit: Always feel free to reach out directly, I'm always
| happy to spend time with anyone that has questions and
| usually pretty easy to track me down.
| james_in_the_uk wrote:
| Dublin usually works well from a regulatory perspective.
| jhoechtl wrote:
| They are the 52nd state of America. The 51st was already taken.
|
| https://en.wikipedia.org/wiki/51st_State_(song)
| animeshjain wrote:
| From what I know, full text search in Postgres (and MySQL) does
| not have faceted search. So it only supports returning full
| text results from the entire index.
|
| Actually, it is possible, but doing a search on a particular
| segment of rows is a very slow operation - say, a text search for
| all employees with a name matching 'x' in organization id 'y'.
|
| It is not able to utilise the index on organization id in this
| case, and it results in a full scan.
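|
| One workaround (besides the RUM index type mentioned elsewhere in
| this thread) is the btree_gin extension, which lets a scalar
| column share a GIN index with the tsvector; a sketch with made-up
| names:
|
|     CREATE EXTENSION IF NOT EXISTS btree_gin;
|
|     CREATE INDEX employees_org_search_idx
|         ON employees USING gin (organization_id, search_tsv);
|
|     SELECT id, name FROM employees
|     WHERE organization_id = 42
|       AND search_tsv @@ to_tsquery('english', 'x');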
| xcambar wrote:
| This is an anecdote, not proper feedback, since I wasn't directly
| involved in the topic.
|
| My company relied on PG as its search engine and everything
| went well from POC to production. After a few years of
| production and new clients requiring volumes of data an order
| of magnitude above our comfort zone, things went south pretty
| fast.
|
| Not many months later but many sweaty weeks of engineering
| after, we switched to ES and we're not looking back.
|
| tl;dr; even with great DB engineers (which we had), I'd suggest
| that scale is a strong limiting factor on this feature.
| dante_dev wrote:
| Can you tell us the scale you're talking about? We're getting
| good-enough results with ~1 billion rows.
| lacksconfidence wrote:
| Can you tell us about your scoring function? Selecting 40M
| results from a dataset of ~1B and returning the top 10
| based on some trivial scoring function is easy, any
| reasonable search system will handle that. The problem is
| when you have to run your scoring function on all 40M
| matching docs to decide which 10 are the most relevant.
| It's even more of a problem when your scoring function captures
| some of the complexities of the real world, rather than something
| trivial like tf/idf or bm25.
| _009 wrote:
| If you are looking to do semantic search (cosine similarity) +
| filtering (SQL) on data that can be represented as vectors
| (audio, text, video, bio), I suggest
| https://github.com/ankane/pgvector
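|
| A minimal sketch of what that looks like (3-dimensional vectors
| just for illustration; `<=>` is pgvector's cosine-distance
| operator):
|
|     CREATE EXTENSION vector;
|
|     CREATE TABLE items (
|         id bigserial PRIMARY KEY,
|         body text,
|         embedding vector(3)
|     );
|
|     CREATE INDEX ON items
|         USING ivfflat (embedding vector_cosine_ops);
|
|     SELECT id, body
|     FROM items
|     WHERE body IS NOT NULL              -- any ordinary SQL filter
|     ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
|     LIMIT 10;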
| iav wrote:
| I moved from ElasticSearch to PG FTS in production, and here
| are the things I had to give up:
|
| 1. PostgreSQL has a cap on column length, and the search index
| has to be stored in a column. The length of the column is
| indeterminate - it is storing every word in the document and
| where it's located, so a short document with very many unique
| words (numbers are treated as words too) can easily burst the
| cap. This means you have to truncate each document before
| indexing it, and pray that your cap is set low enough. You can
| use multiple columns but that slows down search and makes
| ranking a lot more complicated. I truncate documents at 4MB.
|
| 2. PostgreSQL supports custom dictionaries for specific
| languages, stemmers, and other nice tricks, but none of those
| are supported by AWS because the dictionary gets stored as a
| file on the filesystem (it's not a config setting). You can
| still have custom rules like whether or not numbers count as
| words.
| gk1 wrote:
| Semantic search using text embeddings. With Open Distro for
| Elasticsearch you can store your text embeddings and then
| perform a nearest-neighbor search[1] to find most similar
| documents using cosine similarity[2]. Elasticsearch (vanilla)
| will get this feature with 8.0.
|
| If migrating to ES makes you groan you can use a managed
| service like Pinecone[3] (disclaimer: I work there) just for
| storing and searching through text embeddings in-memory through
| an API while keeping the rest of your data in PG.
|
| [1] Nearest-neighbor searches in Open Distro:
| https://opendistro.github.io/for-elasticsearch-docs/docs/knn...
|
| [2] More on how semantic similarity is measured:
| https://www.pinecone.io/learn/semantic-search/
|
| [3] https://www.pinecone.io
| matsemann wrote:
| I'm not well versed on modern pg for this use, but when I
| managed a Solr instance ~5 years ago, it was the ranking of the
| results that was the killer feature. Finding results fast most
| systems can do. Knowing which results to present is harder.
|
| Our case was a domain specific knowledge base, with certain
| terms occurring often in many articles. Searching for a term
| could bring up thousands of results, but few of them were
| actually relevant to show in context of the search, they just
| happened to use the term.
| jka wrote:
| "Faceted search"[1] (aka aggregates in Elasticsearch) tends to
| be a popular one, to provide user-facing content navigation.
|
| That said, simonw has been on the case[2] demonstrating an
| implementation of that using Django and PostgreSQL.
|
| [1] - https://en.wikipedia.org/wiki/Faceted_search
|
| [2] - https://simonwillison.net/2017/Oct/5/django-postgresql-
| facet...
| craigds wrote:
| This is the key one for us that makes Postgres a non-starter for
| FTS (we use postgres for everything _else_)
|
| We begrudgingly use Solr instead (we started before ES was
| really a thing and haven't found a _need_ to switch yet)
|
| When you get more than about two different _types_ of filters
| (e.g. types of filters could be 'tags', 'categories',
| 'geotags', 'media type', 'author' etc), the combinatorial
| explosion of Postgres queries required to provide facet
| counts gets unmanageable.
|
| For example, when I do a query filtered by
| `?tag=abc&category=category1`, I need to do these queries:
|
| - `... where tag = 'abc' and category_id = 1` (the current
| results)
|
| - `count(*) ... where tag = 'def' and category_id = 1` (for each
| other tag present in the results)
|
| - `count(*) ... where tag = 'abc' and category_id = 2` (for each
| other category present in the results)
|
| - `count(*) ... where category_id = 1`
|
| - `count(*) ... where tag = 'abc'`
|
| - `count(*)`
|
| There are certainly smarter ways to do this than lots of tiny
| queries, but all this complexity still ends up somewhere and
| isn't likely to be great for performance.
|
| Whereas solr/elasticsearch have faceting built in and handle
| this with ease.
| kalev wrote:
| This is exactly the issue I'm currently facing. We do a
| bunch of count queries to calculate facets and am looking
| for something that can do this out of the box. I'm glad I
| came to the same conclusion myself, either solr or
| elasticsearch might be the way to go. Starting this from
| scratch, which of the two would you recommend and why?
| craigds wrote:
| Unfortunately I don't know enough about elasticsearch to
| provide a useful comparison. I like the way ES queries
| are structured JSON instead of obscure compact
| querystring parameters, but that's not a good reason to
| choose one over the other :)
| Ueland wrote:
| I myself is currently setting up Solr for a project as my
| experience with ES as a DevOps is not a happy one. Always
| nodes/indexes having some kind of problem. In the same
| time i have also worked with Solr for years before and
| never met any major issues. It just works and does the
| job well.
| radiospiel wrote:
| You could run this in a single query `count(*) .. group by
| (tag, category_id)` and then post process the results.
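|
| Something like the following, with hypothetical names; GROUPING
| SETS even returns the per-tag, per-category, combined, and
| overall counts in one pass:
|
|     SELECT tag, category_id, count(*)
|     FROM items
|     WHERE tsv @@ websearch_to_tsquery('english', 'some search')
|     GROUP BY GROUPING SETS
|         ((tag), (category_id), (tag, category_id), ());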
|
| Counting by groups is still relatively expensive in postgres,
| though; I built an implementation which uses estimated counts
| (via the query planner) for larger buckets, and exact counting
| only for smaller buckets, but my use case allowed for some degree
| of inconsistency.
| brightball wrote:
| Streaming data ingestion is the biggest. If you're constantly
| writing data to be searched, this is where ES really outshines
| everything.
| kayodelycaon wrote:
| If I recall correctly, Postgres search doesn't scale well. Not
| sure where it falls apart but it isn't optimized in the same
| way something like Solr is.
| ezekg wrote:
| I have a table with over a billion rows and most full-text
| searches still respond in around a few milliseconds. I think
| this will depend on a lot of factors, such as proper
| indexing, and filtering down the dataset as much as possible
| before performing the full-text ops. I've spent a
| considerable amount of time on optimizing these queries,
| thanks to tools like PgMustard [0]. Granted, I do still have
| a couple slow queries (1-10s query time), but that's likely
| due to very infrequent access i.e. cold cache.
|
| I will say, if you use open source libraries like pg_search,
| you are unlikely to ever have performant full-text search.
| Most full-text queries need to be written by hand to actually
| utilize indexes, instead of the query-soup that these types
| of libraries output. (No offense to the maintainers -- it's
| just how it be when you create a "general" solution.)
|
| [0]: https://pgmustard.com
| snack-boye wrote:
| Silly question: I'm using pg right now and most of my queries are
| something like this (in English):
|
| Find me some results in my area that contain these categoryIds
| and are slotted to start between now and the next 10 days.
|
| Since it's already quite a filtered set of data, would that mean
| I should have little issue adding pg text search, because with
| correct indexing and all, it will usually be applied to a small
| set of data?
|
| Thanks
| ezekg wrote:
| I'm not a DBA, so I can't say for certain simply due to a
| gap in my knowledge. But in my experience, it depends on
| a lot of factors. Sometimes pg will use an index before
| performing the search ops, other times a subquery is
| needed. Check out pgmustard and dig into your slow query
| plans. :)
| john-shaffer wrote:
| You might be just fine adding an unindexed tsvector
| column, since you've already filtered down the results.
|
| The GIN indexes for FTS don't really work in conjunction
| with other indices, which is why
| https://github.com/postgrespro/rum exists. Luckily, it
| sounds like you can use your existing indices to filter
| and let postgres scan for matches on the tsvector. The
| GIN tsvector indices are quite expensive to build, so
| don't add one if postgres can't make use of it!
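|
| A sketch of what that can look like (table/column names are
| invented); the btree-indexed predicates narrow the rows first and
| the tsvector match is simply evaluated on what's left:
|
|     SELECT id, title
|     FROM events
|     WHERE category_id IN (3, 7, 19)
|       AND starts_at BETWEEN now() AND now() + interval '10 days'
|       AND to_tsvector('english', title)
|           @@ websearch_to_tsquery('english', 'salsa class')
|     ORDER BY starts_at
|     LIMIT 50;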
| kayodelycaon wrote:
| Oh cool. I was right and wrong. Thanks!
| fizx wrote:
| Hi, I started an Elasticsearch hosting company, since sold, and
| have built products on PG's search and SQLite FTS search.
|
| There are in my mind two reasons to not use PG's search.
|
| 1. Elasticsearch allows you to build sophisticated linguistic
| and feature scoring pipelines to optimize your search quality.
| This is not a typical use case in PG.
|
| 2. Your primary database is usually your scaling bottleneck
| even without adding a relatively expensive search workload into
| the mix. A full-text search tends to be around as expensive as
| a 5% table scan of the related table. Most DBAs don't like
| large scan workloads.
| busymom0 wrote:
| Regarding point 2, my implementation was to have a duplicate
| database server where all the search queries are sent. This
| would ensure that the search wouldn't slow down the main
| database. And most of the duplication would happen quick
| enough so that the search results were almost up to date.
| kurko wrote:
| I agree with reason 1, but reason 2 is an answer for, "should
| I use PG search in the same PG instance I already have", and
| that's a different discussion. You can set up a replica for
| that.
| quietbritishjim wrote:
| Do you see any particular reasons to use or not use ZomboDB
| [1]? It claims to let you use ElasticSearch from PG seamlessly,
| e.g. it manages coherency of which results ought
| to be returned according to the current transaction. (I've
| never quite ended up needing to use ES but it's always seemed
| to me I'd be likely to need ZomboDB if I did.)
|
| [1] https://github.com/zombodb/zombodb
| mixmastamyk wrote:
| You can do anything, anything at all, at https://zombo.com/
|
| _" The only limit, is yourself..."_
| FpUser wrote:
| This one is my favorite for many many years. Good for
| relaxing.
| habibur wrote:
| It's still around?
| mumblemumble wrote:
| That seems like one for the philosophers. If you
| completely rewrite a Flash site in HTML5, but it looks
| the same and has the same URL, is it still the same site?
| da_chicken wrote:
| So, YouTube?
| mixmastamyk wrote:
| Click the link.
| Conlectus wrote:
| Though I have only a recreational interest in datastores, I
| would be pretty wary of a service that claims to strap a
| Consistent-Unavailable database (Postgres) to an
| Inconsistent-Available database (ElasticSearch) in a useful
| way.
|
| Doing so would require ElasticSearch to reach consensus on
| every read/write, which would remove most of the point of a
| distributed cluster. Despite this, ZomboDB's documentation
| says "complex aggregate queries can be answered in parallel
| across your ElasticSearch cluster".
|
| They also claim that transactions will abort if
| ElasticSearch runs into network trouble, but the
| ElasticSearch documentation notes that writes during
| network partitions don't wait for confirmation of
| success[1], so I'm not sure how they would be able to
| detect that.
|
| In short: I'll wait for the Jepsen analysis.
|
| [1] https://www.elastic.co/blog/tracking-in-sync-shard-
| copies#:~...
| zombodb wrote:
| > Doing so would require ElasticSearch to reach consensus
| on every read/write
|
| ZomboDB only requires that ES have a view of its index
| that's consistent with the active Postgres transaction
| snapshot. ZDB handles this by ensuring that the ES index
| is fully refreshed after writes.
|
| This doesn't necessarily make ZDB great for high-update
| loads, but that's not ZDB's target usage.
|
| > They also claim that transactions will abort if
| ElasticSearch runs into network trouble...
|
| I had to search my own repo to see where I make this
| claim. I don't. I do note that network failures between
| PG & ES will cause the active Postgres xact to abort. On
| top of that, any error that ES is capable of reporting
| back to the client will cause the PG xact to abort --
| ensuring consistency between the two.
|
| Because the ES index is properly refreshed as it relates
| to the active Postgres transaction, all of ES' aggregate
| search functions are capable of providing proper MVCC-
| correct results, using the parallelism provided by the ES
| cluster.
|
| I don't have the time to detail everything that ZDB does
| to project Postgres xact snapshots on top of ES, but the
| above two points are the easy ones to solve.
| Conlectus wrote:
| I suppose I must have misinterpreted this line
|
| > As such, any sort of failure either with ZomboDB
| itself, between Postgres and Elasticsearch (network
| layer), or within Elasticsearch will cause the operating
| Postgres transaction to ABORT. [1]
|
| In my defense, there is a fairly important distinction
| between "any error that ES is capable of reporting back"
| and "any sort of failure within Elasticsearch".
|
| That said, I trust that you're more familiar with consistency
| levels than me, so I'll bow out here.
|
| [1] https://www.zombodb.com/documentation/things-to-know/
| zombodb wrote:
| I can definitely reword that if it's confusing.
| mrslave wrote:
| ZomboDB scratches an itch in a way I find fascinating, though I
| have yet to do more with it than shoehorn it into a prototype
| where it was a bit square-peg-in-a-round-hole. It is in my
| catalogue of technologies I hope to exploit someday. And I hope
| you're enjoying the work and making some $$$ too.
|
| While I'm here might I ask, are you finding the hosted
| PostgreSQL services (AWS, Azure, etc.) growing or
| shrinking your market opportunities? Also, does it play
| nice with Citus?
| zombodb wrote:
| I don't pay any attention to what the cloud providers are
| doing. I have no control over them so..... eh.
|
| ZDB is still alive and well but I have a real job now
| too.
| NortySpock wrote:
| Regarding point 2: Shouldn't you be moving your search
| queries from your transaction server to a separate analysis
| or read-replica server? OLTP copies to OLAP and suddenly
| you've separated these two problems.
| blowski wrote:
| ...and created a new one if the projection is only
| eventually consistent. No free lunches here.
| zepolen wrote:
| How is that different from running Postgres and
| Elasticsearch separately?
| blowski wrote:
| It's not.
|
| It's a difference between having separate OLAP and OLTP
| databases, which is what the parent post suggested.
| justinclift wrote:
| Probably skill set of staff?
|
| For example, if a place has developed fairly good
| knowledge of PG already, they can "just" (!) continue
| developing their PG knowledge.
|
| Adding ES into the mix though, introduces a whole new
| thing that needs to be learned, optimised, etc.
| pmarreck wrote:
| > sophisticated linguistic and feature scoring pipelines to
| optimize your search quality
|
| You CAN score results using setweight, although it's likely
| not as sophisticated as Elasticsearch's
|
| https://www.postgresql.org/docs/9.1/textsearch-controls.html
|
| Disclaimer: I use Postgres fulltext search in production,
| very happy with it although maintaining the various triggers
| and stored procs it requires to work becomes cumbersome
| whenever you have to write a migration that alters any of
| them (or that in fact touches any related field, as you may
| be required to drop and recreate all of the parts in order
| not to violate referential integrity)
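|
| For reference, the built-in trigger helper looks something like
| this (table/column names invented; older Postgres versions use
| EXECUTE PROCEDURE instead of EXECUTE FUNCTION):
|
|     CREATE TRIGGER docs_tsv_update
|     BEFORE INSERT OR UPDATE ON docs
|     FOR EACH ROW EXECUTE FUNCTION
|       tsvector_update_trigger(tsv, 'pg_catalog.english',
|                               title, body);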
|
| It is certainly nice not having to worry about one additional
| dependency when deploying, though.
| thom wrote:
| We get really nice results with gist indexes (gist_trgm_ops)
| searching across multiple entity types to do top X queries. It's
| very useful to be able to make a stab at a difficult-to-spell
| foreign football player's name, possibly with lots of diacritics,
| and get quick results back. I'm always surprised when I find a
| search engine on any site that is so unkind as to make you spell
| things exactly.
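|
| A sketch of that kind of query (names invented); `<->` is
| pg_trgm's distance operator, and a GiST index can satisfy the
| ORDER BY as a nearest-neighbour search:
|
|     CREATE EXTENSION IF NOT EXISTS pg_trgm;
|     CREATE INDEX players_name_trgm
|         ON players USING gist (name gist_trgm_ops);
|
|     SELECT name, name <-> 'ibrahimovic' AS distance
|     FROM players
|     ORDER BY name <-> 'ibrahimovic'
|     LIMIT 10;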
| nuker wrote:
| Is there an alternative to ES that scales nicely? I'm running an
| ELK stack for logging using AWS Elasticsearch. Logs have an
| unpredictable traffic volume and even an overprovisioned ES
| cluster gets clogged sometimes. I wonder if there is something
| more scalable than ES, with a nice GUI like Kibana?
| shard972 wrote:
| loki
| jillesvangurp wrote:
| It's more a matter of configuring it right. I'd recommend
| trying out Elastic Cloud. It's a bit easier to deal with than
| Amazon's offering and much better supported. AWS has always
| been a bit hands-off on that front. Their opensearch project
| does not seem to break that pattern so far.
|
| Also, with Elastic Cloud you get some access to useful features
| for logging (like life cycle management and data streams) that
| will help you scale the setup.
|
| Kibana in recent iterations has actually improved quite a bit.
| The version you are getting from Amazon is probably a bit bare
| bones in comparison. One nice thing with Elastic is that going
| with the defaults gets you some useful dashboards out of the
| box if you use e.g. file or docker beats for collecting logs.
| nuker wrote:
| Thanks mate. I did try Elastic before settling on AWS ES; it was
| slower somehow. As for Kibana features, I'm happy with barebones
| :)
|
| > you use e.g. file or docker beats for collecting logs.
|
| My setup is custom app reading from CW Logs.
|
| But can an Elastic cluster scale automatically on indexing
| latency spikes, so my app's writes do not time out? If yes, then
| how, please?
| jillesvangurp wrote:
| Autoscaling is of course something they can do:
| https://www.elastic.co/guide/en/cloud/current/ec-
| autoscaling...
|
| If you use life cycle management and data streams (which
| AWS doesn't have), you'd be able to control the sizes of
| your hot indices (i.e. the ones you write to). Basically
| keeping your hot indices small helps keeping things fast.
| If you have issues with app writes spiking, I'd use some
| queuing solution in between. Plenty of solutions for that.
|
| The rest is just a matter of configuring things right in
| terms of number of shards and setting up properly.
| Basically, you get what you pay for in the end.
| hnarn wrote:
| I don't know what features AWS provides but in general terms
| logs benefit a lot from compression, so if I were to set up
| this on my own I'd probably want to try something like a VDO or
| ZFS backed storage system as well as compressed transfers
| (perhaps in batch if that's required).
| nuker wrote:
| > something like a VDO or ZFS backed storage system
|
| And what text search engine?
| hnarn wrote:
| I have no idea, but I'm assuming that regardless of which
| one you use there will be disk level duplication that could
| be taken care of to alleviate uncontrolled growth somewhat.
| kureikain wrote:
| I used Postgres full-text search for mail log feature on my email
| forward app https://hanami.run
|
| Essentially it allows arbitrary queries on from/to/subject/body.
| One thing that makes full-text search work great for me is that I
| don't need to sort or rank the relevance of results; I just show
| a list of emails that match the query, ordered by their id.
|
| I also don't do pagination and counting; instead users have to
| load more pages, and the ID of the last email is passed to the
| query as a point of comparison (where id < requests.get.before).
|
| With those strategies, full-text search works great for us, since
| we don't really want to bring in ElasticSearch when only about
| 20% of users use this feature.
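|
| A sketch of that query shape (names invented):
|
|     SELECT id, mail_from, mail_to, subject
|     FROM mail_log
|     WHERE tsv @@ websearch_to_tsquery('simple', 'invoice')
|       AND id < 123456      -- the "before" id sent by the client
|     ORDER BY id DESC
|     LIMIT 50;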
| bityard wrote:
| I know Postgres and SQLite have mostly different purposes but
| FWIW, SQLite also has a surprisingly capable full-text search
| extension built right in: https://www.sqlite.org/fts5.html
| jjice wrote:
| It's very impressive, especially considering the SQLite version
| you're already using probably has it enabled already. I use it
| for a small site I run and it works fantastically. It's a little
| finicky with deletes and updates due to virtual tables in SQLite,
| but definitely impressive and has its uses.
| SigmundA wrote:
| I keep wondering if RUM indexes [1] will ever get merged, for
| faster search and better ranking (TF/IDF). It really would make
| PG a much more complete text search engine.
|
| https://github.com/postgrespro/rum
___________________________________________________________________
(page generated 2021-07-28 19:02 UTC)