[HN Gopher] Postgres Full-Text Search: A search engine in a data...
___________________________________________________________________
Postgres Full-Text Search: A search engine in a database
Author : twakefield
Score : 318 points
Date : 2021-07-27 15:20 UTC (7 hours ago)
(HTM) web link (blog.crunchydata.com)
(TXT) w3m dump (blog.crunchydata.com)
| simonw wrote:
| The Django ORM includes support for PostgreSQL search and I've
| found it a really productive way to add search to a project:
| https://docs.djangoproject.com/en/3.2/ref/contrib/postgres/s...
| lettergram wrote:
| I actually built a search engine back in 2018 using postgresql
|
| https://austingwalters.com/fast-full-text-search-in-postgres...
|
| Worked quite well and still use it daily. Basically doing
| weighted searches on vectors is slower than my approach, but
| definitely good enough.
|
| Currently, I can search around 50m HN & Reddit comments in 200ms
| on the postgresql running on my machine.
| vincnetas wrote:
 | Off-topic, but curious: what are your use cases when searching
 | all HN and Reddit comments? I'm at the beginning of this path,
 | just crawled HN, but what to do with this is still a bit cloudy.
| lettergram wrote:
| I built a search engine that quantified the expertise of
| authors of comments. Then I created what I called "expert
| rank" that allowed me to build a really good search engine.
|
| Super good if you're at a company or something
|
| https://twitter.com/austingwalters/status/104189476543920128.
| ..
| turbocon wrote:
 | Wow, that led me down quite a rabbit hole, impressive
| work.
| rattray wrote:
| Nice - looks like the ~same approach recommended here of adding
| a generated `tsvector` column with a GIN index and querying it
 | with `col @@ to_tsquery('english', query)`.
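 | A minimal sketch of that setup (table and column names are
 | illustrative, not from the comment):

```sql
-- Generated column (PostgreSQL 12+) kept in sync automatically
ALTER TABLE posts
  ADD COLUMN search tsvector
  GENERATED ALWAYS AS (to_tsvector('english', coalesce(body, ''))) STORED;

CREATE INDEX posts_search_idx ON posts USING GIN (search);

-- Query with the match operator
SELECT id FROM posts
WHERE search @@ to_tsquery('english', 'postgres & search');
```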
| lettergram wrote:
| Yeah my internal approach was creating custom vectors which
| are quicker to search.
| pvsukale3 wrote:
| If you are using Rails with Postgres you can use pg_search gem to
| build the named scopes to take advantage of full text search.
|
| https://github.com/Casecommons/pg_search
| theandrewbailey wrote:
| > You could also look into enabling extensions such as unaccent
| (remove diacritic signs from lexemes) or pg_trgm (for fuzzy
| search).
|
| Trigrams (pg_trgm) are practically needed for usable search when
| it comes to misspellings and compound words (e.g. a search for
| "down loads" won't return "downloads").
|
| I also recommend using websearch_to_tsquery instead of using the
| cryptic syntax of to_tsquery.
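 | A hedged sketch of both suggestions (table and column names are
 | illustrative):

```sql
-- websearch_to_tsquery (PG 11+) accepts familiar search-box syntax
SELECT websearch_to_tsquery('english', '"full text" search -sphinx');

-- pg_trgm catches misspellings that exact lexeme matching misses
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX posts_title_trgm_idx ON posts USING GIN (title gin_trgm_ops);

SELECT title, similarity(title, 'down loads') AS sml
FROM posts
WHERE title % 'down loads'   -- % is pg_trgm's similarity operator
ORDER BY sml DESC;
```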
| kyrra wrote:
| Trigrams are amazing. I was doing a sideproject where I wanted
| to allow for substring searching, and trigrams seemed to be the
| only way to do it (easily/well) in postgres. Gitlab did a great
| writeup on this a few years ago that really helped me
| understand it:
|
| https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
|
| You can also always read the official docs:
|
| https://www.postgresql.org/docs/current/pgtrgm.html
| Grimm1 wrote:
| Postgres Full-Text search is a great way to get search running
| for a lot of standard web applications. I recently used just this
| in Elixir to set up a simple search by keyword. My only complaint
| was Ecto (Elixir's query builder library) doesn't have first
| class support for it and neither does Postgrex the lower level
| connector they use. Still, using fragments with sanitized SQL
| wasn't too messy at all.
| mrinterweb wrote:
| I've seen Elasticsearch set up for applications that would have
| equal benefit from just using the postgresql db's full-text
| search they already have access to.
|
| The additional complexity is usually incurred when the data in
| postgresql changes, and those changes need to be mirrored up to
| Elasticsearch. Elasticsearch obviously has its uses, but for some
| cases, postgresql's built in full-text search can make more
| sense.
| [deleted]
| jcuenod wrote:
| Huh, just yesterday I blogged[0] about using FTS in SQLite[1] to
| search my PDF database. SQLite's full-text search is really
| excellent. The thing that tripped me up for a while was `GROUP
| BY` with the `snippet`/`highlight` function but that's the point
| of the blog post.
|
| [0] https://jcuenod.github.io/bibletech/2021/07/26/full-text-
| sea...
|
| [1] https://www.sqlite.org/fts5.html
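 | A minimal FTS5 sketch of the kind described (the schema is
 | illustrative):

```sql
-- SQLite FTS5: virtual table, MATCH, and highlighted snippets
CREATE VIRTUAL TABLE docs USING fts5(title, body);

INSERT INTO docs VALUES ('intro', 'full-text search in SQLite');

-- snippet(table, column-index, open, close, ellipsis, max-tokens)
SELECT title, snippet(docs, 1, '[', ']', '...', 8)
FROM docs
WHERE docs MATCH 'search';
```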
| MushyRoom wrote:
| I was hyped when I found out about it a while ago. Then I wasn't
| anymore.
|
| When you have 12 locales (kr/ru/cn/jp/..) it's not that fun
| anymore. Especially on a one man project :)
| freewizard wrote:
| For small project and simple full text search requirement, try
| this generic parser:
| https://github.com/freewizard/pg_cjk_parser
| oauea wrote:
| Why support so many locales in a one man project?
| MushyRoom wrote:
| It's just a project to learn all the things that are web.
| It's mainly a database for a game now with most of the
| information sourced from the game (including its localization
| files).
|
| I'm slowly transitioning from MariaDB to Postgres - again as
| a learning experience. There is cool stuff and there is
| annoying stuff to reproduce things like case-insensitive +
| ignore accents (utf8_general_ci) in Postgres.
|
| I've looked into FTS and searching for missing dictionaries
| to support all the locales but Chinese is one of the harder
| ones.
| blacktriangle wrote:
| One (dev) project here, we're up to 5 locales at a
| surprisingly small number of customers. Problem is when your
| customers are global, all of a sudden a single customer can
| bring along multiple locales. I very much regret not taking
| localization far more seriously early in development but we
 | were blindsided by the interest outside the Anglosphere.
| habibur wrote:
| To attract more visitors/customers I guess. I plan to extend
| to 10 languages too. 1 man project.
| tabbott wrote:
| Zulip's search is powered by this built-in Postgres full-text
| search feature, and it's been a fantastic experience. There's a
| few things I love about it:
|
| * One can cheaply compose full-text search with other search
| operators by just doing normal joins on database indexes, which
| means we can cheaply and performantly support tons of useful
| operators (https://zulip.com/help/search-for-messages).
|
| * We don't have to build a pipeline to synchronize data between
| the real database and the search database. Being a chat product,
| a lot of the things users search for are things that changed
| recently; so lag, races, and inconsistencies are important to
| avoid. With the Postgres full-text search, all one needs to do is
| commit database transactions as usual, and we know that all
| future searches will return correct results.
|
| * We don't have to operate, manage, and scale a separate service
| just to support search. And neither do the thousands of self-
| hosted Zulip installations.
|
| Responding to the "Scaling bottleneck" concerns in comments
| below, one can send search traffic (which is fundamentally read-
| only) to a replica, with much less complexity than a dedicated
| search service.
|
| Doing fancy scoring pipelines is a good reason to use a
| specialized search service over the Postgres feature.
|
| I should also mention that a weakness of Postgres full-text
| search is that it only supports doing stemming for one language.
| The excellent PGroonga extension (https://pgroonga.github.io/)
| supports search in all languages; it's a huge improvement
| especially for character-based languages like Japanese. We're
| planning to migrate Zulip to using it by default; right now it's
| available as an option.
|
| More details are available here:
| https://zulip.readthedocs.io/en/latest/subsystems/full-text-...
| stavros wrote:
| I cannot tell you how much I love Zulip, but I can tell you
| that I have no friends any more because everyone is tired of me
| evangelizing it.
| brightball wrote:
| All of this. It's such a good operational experience that I
| will actively fight against the introduction of a dedicated
| search tool unless it's absolutely necessary.
| rattray wrote:
| TBH I hadn't known you could do weighted ranking with Postgres
| search before.
|
| Curious there's no mention of zombodb[0] though, which gives you
| the full power of elasticsearch from within postgres (with
 | _consistency_, no less!). You have to be willing to tolerate slow
 | writes, of course, so using Postgres' built-in search
| functionality still makes sense for a lot of cases.
|
| [0] https://github.com/zombodb/zombodb
| craigkerstiens wrote:
| Zombo is definitely super interesting and we should probably
| add a bit in the post about it. Part of the goal here was that
| you can do a LOT with Postgres, without adding one more system
 | to maintain. Zombo is great if you have Elastic around but want
 | Postgres as the primary interface - but what if you don't want
 | to maintain Elastic?
|
| My ideal is always though to start with Postgres, and then see
 | if it can solve my problem. I would never say Postgres is the best
| at everything it can do, but for most things it is good enough
| without having another system to maintain and wear a pager for.
| rattray wrote:
| Zombo does at least promise to handle "complex reindexing
| processes" for you (which IME can be very painful) but yeah,
| I assume you'd still have to deal with shard rebalancing,
| hardware issues, network failures or latency between postgres
| and elastic, etc etc.
|
| The performance and cost implications of Zombo are more
| salient tradeoffs in my mind - if you want to index one of
| the main tables in your app, you'll have to wait for a
| network roundtrip and a multi-node write consensus on every
| update (~150ms or more[0]), you can't `CREATE INDEX
| CONCURRENTLY`, etc.
|
| All that said, IMO the fact that Zombo exists makes it easier
 | to pitch "hey, let's just build search with postgres for now
| and if we ever need ES's features, we can easily port it to
| Zombo without rearchitecting our product".
|
| [0] https://github.com/zombodb/zombodb/issues/640
| zombodb wrote:
| I wonder what the ZomboDB developers are up to now? What
| great text-search-in-postgres things could they be secretly
| working on?
| syoc wrote:
| My worst search experiences always come from the features
| applauded here. Word stemming and removing stop words is a big
| hurdle when you know what you are looking for but get flooded by
| noise because some part of the search string was ignored. Another
| issue is having to type out a full word before you get a hit in
| dynamic search boxes (looking at you Confluence).
| Someone1234 wrote:
 | I'd argue that isn't a problem with the _feature_, but a
| thoughtless implementation.
|
| A _good_ implementation will weigh verbatim results highest
| before considering the stop-word stripped or stemmed version.
| Configuring to_tsvector() to not strip stop words or using a
| stemming dictionary is, in my opinion, a little clunky in
 | Postgres: You'll want to make a new [language] dictionary and
| then call to_tsvector() using your new dictionary as the first
| parameter.
|
| After you've set up the dictionary globally, this would look
| _something_ like:
|
| setweight(to_tsvector('english_no_stem_stop', col), 'A') ||
 | setweight(to_tsvector('english', col), 'B')
|
| I think blaming Postgres for adding stemming/stop-word support
| because it can be [ab]used for a poor search user experience is
| like blaming a hammer for a poorly built home. It is just a
| tool, it can be used for good or evil.
|
| PS - You can do a verbatim search without using to_tsvector(),
| but that cannot be easily passed into setweight() and you
| cannot use features like ts_rank().
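 | Put together, a ranked query over such a weighted vector could
 | look something like this (the 'english_no_stem_stop' configuration
 | is the hypothetical custom dictionary described above):

```sql
SELECT id,
       ts_rank(
         setweight(to_tsvector('english_no_stem_stop', col), 'A') ||
         setweight(to_tsvector('english', col), 'B'),
         websearch_to_tsquery('english', 'search terms')
       ) AS rank
FROM docs
WHERE to_tsvector('english', col)
      @@ websearch_to_tsquery('english', 'search terms')
ORDER BY rank DESC;
```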
| shakascchen wrote:
| No fun doing it for Chinese, especially for traditional Chinese.
|
| I had to install software but on Cloud SQL you can't. You have to
| do it on your instances.
| rattray wrote:
| Something that's missing from this which I'm curious about is how
 | far _can't_ Postgres search take you?
|
| That is, what tends to be the "killer feature" that makes teams
| groan and set up Elasticsearch because you just can't do it in
| Postgres and your business needs it?
|
| Having dealt with ES, I'd really like to avoid the operational
| burden if possible, but I wouldn't want to choose an intermediary
| solution without being able to say, "keep in mind we'll need to
| budget a 3-mo transition to ES once we need X, Y, or Z".
| nostrademons wrote:
| Used to work on Google Search, used ES extensively for a
| startup I founded (which was sort of quasi-search...it was
| built around feed ranking, where the query is constant and a
| stream of documents is constantly coming in), and have also
| used Postgres extensively in other companies.
|
| The big problem with all the off-the-shelf search solutions
| (RDBMS full-text search, ES, Algolia) is that _search ranking
| is a complicated and subtle problem_ , and frequently depends
 | on signals that are _not in the document itself_. Google's big
| insight is that how _other_ people talk about a website is more
| important than how the website talks about itself, and its
| ranking algorithm weights accordingly.
|
| ES has the basic building blocks to construct such a ranking
| algorithm. In terms of fundamental infrastructure I found ES to
| be just as good as Google, and better in some ways. But its
| out-of-the-box ranking function _sucks_. Expect to put a domain
| expert just on search ranking and evaluation to get decent
 | results, and they're going to have to delve pretty deeply into
| advanced features of ES to get there.
|
| AFAICT Postgres search only lets you tweak the ranking
| algorithm by assigning different weights to fields, assuming
| that the final document score is a linear combination of
| individual fields. This is usually not what you want - it's
| pretty common to have non-linear terms from different signals.
| sandGorgon wrote:
 | It doesn't do TF-IDF or BM25 - the current state of the art in
 | search relevance algorithms.
 |
 | That's where it can't be used for anything serious.
| SigmundA wrote:
| That why this really needs to get merged:
|
| https://github.com/postgrespro/rum
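 | For reference, a RUM index sketch; unlike GIN it stores
 | positional information, so ranked retrieval can be served from
 | the index via its distance operator (names are illustrative):

```sql
CREATE EXTENSION rum;
CREATE INDEX docs_rum_idx ON docs USING rum (search rum_tsvector_ops);

-- Top-N by relevance without a separate heap-scan ranking pass
SELECT id FROM docs
WHERE search @@ to_tsquery('english', 'query')
ORDER BY search <=> to_tsquery('english', 'query')
LIMIT 10;
```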
| rattray wrote:
| TF/IDF is listed as a TODO on that repo, and I don't see a
| PR which promises to provide it.
| Thaxll wrote:
| PG is average at best for text search, it's not even good.
| nextaccountic wrote:
| If your search needs outgrow Postgres' native search engine,
| you can use Postgres search with an ElasticSearch backend,
| using Zombo
|
| https://github.com/zombodb/zombodb
|
| It basically gives you a new kind of index (create index ..
| using zombodb(..) ..)
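 | A hedged sketch of what that looks like (the URL and table are
 | illustrative):

```sql
-- The index is backed by a remote Elasticsearch cluster
CREATE INDEX products_zdb ON products
  USING zombodb ((products.*))
  WITH (url='http://localhost:9200/');

-- Queries use ZomboDB's ==> operator with ES query syntax
SELECT * FROM products WHERE products ==> 'sports box';
```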
| mjewkes wrote:
| Exact phrase matching. This generally requires falling back to
| ILIKE, which is not performant.
| e12e wrote:
| Even if using ILIKE over the result of an imprecise query?
| [deleted]
| simonw wrote:
| Exact phrase searching works in PostgreSQL full-text search -
| here's an example:
| https://simonwillison.net/search/?q=%22nosql+database%22
|
 | I'm using search_type=websearch
 | https://github.com/simonw/simonwillisonblog/blob/a5b53a24b00...
|
| That's using websearch_to_tsquery() which was added in
| PostgreSQL 11: https://www.postgresql.org/docs/11/textsearch-
| controls.html#...
| SahAssar wrote:
| I had some issues with this recently as I couldn't get a
| FTS query to find something looking like a path or url in a
| query. As an example from your site:
| https://simonwillison.net/2020/Jan/6/sitemap-xml/ contains
| the exact text https://www.niche-museums.com/, but I cannot
| find a way to search for that phrase exactly (trying
| https://simonwillison.net/search/?q=%22www.niche-
| museums.com... works though). I tried both in my own psql
| setup and on your site, and it seems like exact phrase
| searching is limited to the language used, even if it would
| be an exact string match.
|
| Are there any workarounds for that?
| rattray wrote:
| Pardon my ignorance - what is exact phrase matching and why
| doesn't it work with tsvector?
| amichal wrote:
 | Postgres was not good at (for us):
 |
 | - IDF and other corpus-based relevancy measures: had to hand-roll
 |
 | - thesaurus and misspelling: again, possible of course with
 | preprocessing and by adding config files
 |
 | - non-Latin-alphabet languages, e.g. Arabic: needed filesystem
 | access (we used AWS RDS so couldn't do it) to add a
 | dictionary-based stemmer/word breaker
|
| We used es or solr for those cases. For English FTS with 100k
| documents doing it in PG is super easy and one less dependency
| some_developer wrote:
 | Anecdotal note:
|
| A few years ago we added yet-another part to our product and,
| whilst ES worked "okay", we got a bit weary of ES due to "some
| issues" (some bug in the architecture keeping things not
| perfect in sync, certain queries with "joins" of types taking
| long, demand on HW due to the size of database, no proper
| multi-node setup due to $$$ and time constraint, etc.; small
| things piling up over time).
|
| Bright idea: let's see how far Postgres, which is our primary
| datastore, can take us!
|
| Unfortunately, the feature never made it fully into production.
|
| We thought that on paper, the basic requirements were ideal:
|
| - although the table has multiple hundreds of millions of
| entries, natural segmentation by customer IDs made possible
| individual results much smaller
|
| - no weighted search result needed: datetime based is perfect
| enough for this use-case, we thought it would be easy to come
| up with the "perfect index [tm]"
|
| Alas, we didn't even get that far:
|
| - we identified ("only") 2 columns necessary for the search =>
| "yay, easy"
|
| - one of those columns was multi-language; though we didn't
| have specific requirements and did not have to deal with
| language specific behaviour in ES, we had to decide on one for
| the TS vectorization (details elude me why "simple" wasn't
| appropriate for this one column; it was certainly for the other
| one)
|
| - unsure which one, or both, we would need, for one of the
| columns we created both indices (difference being the
| "language")
|
| - we started out with a GIN index (see
| https://www.postgresql.org/docs/9.6/textsearch-indexes.html )
|
| - creating a single index took > 15 hours
|
| But once the second index was done, and had not even rolled out
| the feature in the app itself (which at this point was still an
 | ever-changing MVP), we suddenly got hit by a lot of customer
 | complaints that totally different operations on this table
 | (INSERTs and UPDATEs) started getting slow (like
| 5-15 seconds slow, something which usually takes tiny ms).
|
| Backend developer eyes were wide open O_O
|
| But since we knew that second index just finished, after
 | checking the Postgres logs we decided to drop the FTS indices
| and, lo' and behold, "performance problem solved".
|
| Communication lines were very short back then (still are today,
| actually) and it was promptly decided we just cut the search
| functionality from this new part of the product and be done
| with it. This also solved the problem, basically (guess there's
| some "business lesson" to be learned here too, not just
| technical ones).
|
| Since no one within the company counter argued this decision,
| we did not spend more time analyzing the details of the
| performance issue though I would have loved to dig into this
| and get an expert on board to dissect this.
|
| --
|
| A year later or so I had a bit free time and analyzed one
| annoying recurring slow UPDATE query problem on a completely
| different table, but also involving FTS on a single column
| there also using a GIN index. That's when I stumble over
| https://www.postgresql.org/docs/9.6/gin-implementation.html
|
| > Updating a GIN index tends to be slow because of the
| intrinsic nature of inverted indexes: inserting or updating one
| heap row can cause many inserts into the index (one for each
| key extracted from the indexed item). As of PostgreSQL 8.4, GIN
| is capable of postponing much of this work by inserting new
| tuples into a temporary, unsorted list of pending entries. When
| the table is vacuumed or autoanalyzed, or when
| gin_clean_pending_list function is called, or if the pending
| list becomes larger than gin_pending_list_limit, the entries
| are moved to the main GIN data structure using the same bulk
| insert techniques used during initial index creation. This
| greatly improves GIN index update speed, even counting the
| additional vacuum overhead. Moreover the overhead work can be
| done by a background process instead of in foreground query
| processing.
|
| In this particular case I was able to solve the occasional slow
| UPDATE queries with "FASTUPDATE=OFF" on that table and,
| thinking back about the other issue, it might have solved or
| minimized the impact.
|
| Back to the original story: yep, this one table can have
| "peaks" of inserts but it's far from "facebook scale" or
| whatever, basically 1.5k inserts / second were the absolute
| rare peak I measured and usually it's in the <500 area. But I
| guess it was enough for this scenario to add latency within the
| database.
|
| --
|
| Turning back my memory further, I was always "pro" trying to
| minimize / get rid of ES after learning about
| http://rachbelaid.com/postgres-full-text-search-is-good-enou...
 | even before we used any FTS feature. It also mentions the
 | GIN/GiST issue, but alas, in our case Elasticsearch is good
 | enough and, despite the troubles we've had with it, actually
 | easier to reason about (so far).
| grncdr wrote:
| My experience has been that sorting by relevance ranking is
| quite expensive. I looked into this a bit and found
| https://github.com/postgrespro/rum (and some earlier slide
| decks about it) that explains why the GIN index type can't
| support searching and ranking itself (meaning you need to do
| heap scans for ranking). This is especially problematic if your
| users routinely do searches that match a lot of documents and
| you only want to show the top X results.
|
| Edit: if any of the Crunchy Data people are reading this:
| support for RUM indexes would be super cool to have in your
| managed service.
| thom wrote:
| Top X queries should be optimised with gist indexes.
| grncdr wrote:
| Ah sounds like I need to update myself on this. I don't
| recall what the other trade-offs vs GIN indexes are.
| craigkerstiens wrote:
 | Actually, that's a great suggestion. We need to take a deeper
 | look at the code itself and our ability to support the extension,
 | but we'll definitely evaluate it at some of our upcoming
 | roadmap planning for Crunchy Bridge
| (https://www.crunchybridge.com).
| grncdr wrote:
| Cool cool
|
| If I could have your other ear for a moment: support for
| _any_ EU-based cloud provider would be super nice. Ever
| since the Privacy Shield fig leaf was removed, questions
| about whether we store any data under US jurisdiction have
| become a lot more frequent.
| craigkerstiens wrote:
| Do you have a preferred one?
|
| We're currently on the big 3 US ones, in process of
| working on our 4th provider, our goal is very much to
| deliver the best Postgres experience whether on bare
| metal/on-premise or in the cloud, self hosted or fully
| managed.
|
| Edit: Always feel free to reach out directly, I'm always
 | happy to spend time with anyone that has questions, and it's
 | usually pretty easy to track me down.
| xcambar wrote:
 | This is an anecdote, not proper feedback, since I wasn't directly
| involved in the topic.
|
| My company relied on PG as its search engine and everything
| went well from POC to production. After a few years of
| production and new clients requiring volumes of data an order
| of magnitude above our comfort zone, things went south pretty
| fast.
|
 | Not many months, but many sweaty weeks of engineering later, we
 | switched to ES and we're not looking back.
|
| tl;dr; even with great DB engineers (which we had), I'd suggest
| that scale is a strong limiting factor on this feature.
| dante_dev wrote:
| can you tell us the scale you're talking about? getting good
| enough results with ~1 billion rows
| lacksconfidence wrote:
| Can you tell us about your scoring function? Selecting 40M
| results from a dataset of ~1B and returning the top 10
| based on some trivial scoring function is easy, any
| reasonable search system will handle that. The problem is
| when you have to run your scoring function on all 40M
| matching docs to decide which 10 are the most relevant.
| It's even more of a problem when your scoring function
 | captures some of the complexities of the real world, rather
| than something trivial like tf/idf or bm25.
| _009 wrote:
| If you are looking to do semantic search (Cosine similarity) +
| filtering (SQL) on data that can be represented as vectors
| (audio, text, video, bio) I suggest,
| https://github.com/ankane/pgvector
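 | A minimal pgvector sketch (dimensions and values are
 | illustrative):

```sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

-- <=> is cosine distance; combine with ordinary SQL WHERE filters
SELECT id FROM items
WHERE id > 100
ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
LIMIT 5;
```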
| iav wrote:
| I moved from ElasticSearch to PG FTS in production, and here
| are the things I had to give up:
|
| 1. PostgreSQL has a cap on column length, and the search index
| has to be stored in a column. The length of the column is
| indeterminate - it is storing every word in the document and
| where it's located, so a short document with very many unique
| words (numbers are treated as words too) can easily burst the
| cap. This means you have to truncate each document before
| indexing it, and pray that your cap is set low enough. You can
| use multiple columns but that slows down search and makes
| ranking a lot more complicated. I truncate documents at 4MB.
|
| 2. PostgreSQL supports custom dictionaries for specific
| languages, stemmers, and other nice tricks, but none of those
| are supported by AWS because the dictionary gets stored as a
| file on the filesystem (it's not a config setting). You can
| still have custom rules like whether or not numbers count as
| words.
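 | The truncation workaround from point 1 might look something like
 | this (the 4 MB cutoff mirrors the figure in the comment; table
 | and column names are illustrative):

```sql
-- Truncate the raw document before vectorizing so the resulting
-- tsvector stays under the size cap
UPDATE docs
SET search = to_tsvector('english', left(body, 4 * 1024 * 1024));
```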
| gk1 wrote:
| Semantic search using text embeddings. With Open Distro for
| Elasticsearch you can store your text embeddings and then
| perform a nearest-neighbor search[1] to find most similar
| documents using cosine similarity[2]. Elasticsearch (vanilla)
| will get this feature with 8.0.
|
| If migrating to ES makes you groan you can use a managed
| service like Pinecone[3] (disclaimer: I work there) just for
| storing and searching through text embeddings in-memory through
| an API while keeping the rest of your data in PG.
|
| [1] Nearest-neighbor searches in Open Distro:
| https://opendistro.github.io/for-elasticsearch-docs/docs/knn...
|
| [2] More on how semantic similarity is measured:
| https://www.pinecone.io/learn/semantic-search/
|
| [3] https://www.pinecone.io
| matsemann wrote:
| I'm not well versed on modern pg for this use, but when I
| managed a Solr instance ~5 years ago, it was the ranking of the
| results that was the killer feature. Finding results fast most
| systems can do. Knowing which results to present is harder.
|
| Our case was a domain specific knowledge base, with certain
| terms occurring often in many articles. Searching for a term
| could bring up thousands of results, but few of them were
| actually relevant to show in context of the search, they just
| happened to use the term.
| jka wrote:
| "Faceted search"[1] (aka aggregates in Elasticsearch) tends to
| be a popular one, to provide user-facing content navigation.
|
| That said, simonw has been on the case[2] demonstrating an
| implementation of that using Django and PostgreSQL.
|
| [1] - https://en.wikipedia.org/wiki/Faceted_search
|
| [2] - https://simonwillison.net/2017/Oct/5/django-postgresql-
| facet...
| craigds wrote:
| This is the key one for us that makes Postgres a non-starter
| for FTS (We use postgres for everything _else_ )
|
| We begrudgingly use Solr instead (we started before ES was
| really a thing and haven't found a _need_ to switch yet)
|
| When you get more than about two different _types_ of filters
| (e.g. types of filters could be 'tags', 'categories',
| 'geotags', 'media type', 'author' etc), the combinatorial
| explosion of Postgres queries required to provide facet
| counts gets unmanageable.
|
 | For example, when I do a query filtered by
 | `?tag=abc&category=category1`, I need to do these queries:
 |
 | - `... where tag = 'abc' and category_id = 1` (the current
 | results)
 |
 | - `count(*) ... where tag = 'def' and category_id = 1` (for each
 | other tag present in the results)
 |
 | - `count(*) ... where tag = 'abc' and category_id = 2` (for each
 | other category present in the results)
 |
 | - `count(*) ... where category_id = 1`
 |
 | - `count(*) ... where tag = 'abc'`
 |
 | - `count(*)`
|
| There are certainly smarter ways to do this than lots of tiny
| queries, but all this complexity still ends up somewhere and
| isn't likely to be great for performance.
|
| Whereas solr/elasticsearch have faceting built in and handle
| this with ease.
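 | For what it's worth, one of the "smarter ways" available in
 | Postgres is collecting several facet counts in a single pass with
 | GROUPING SETS (column names are illustrative):

```sql
-- One scan of the filtered results instead of one query per facet
SELECT tag, category_id, count(*)
FROM posts
WHERE search @@ websearch_to_tsquery('english', 'query')
GROUP BY GROUPING SETS ((tag), (category_id), ());
```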
| brightball wrote:
| Streaming data ingestion is the biggest. If you're constantly
| writing data to be searched, this is where ES really outshines
| everything.
| kayodelycaon wrote:
| If I recall correctly, Postgres search doesn't scale well. Not
| sure where it falls apart but it isn't optimized in the same
| way something like Solr is.
| ezekg wrote:
| I have a table with over a billion rows and most full-text
| searches still respond in around a few milliseconds. I think
| this will depend on a lot of factors, such as proper
| indexing, and filtering down the dataset as much as possible
| before performing the full-text ops. I've spent a
| considerable amount of time on optimizing these queries,
| thanks to tools like PgMustard [0]. Granted, I do still have
| a couple slow queries (1-10s query time), but that's likely
| due to very infrequent access i.e. cold cache.
|
| I will say, if you use open source libraries like pg_search,
| you are unlikely to ever have performant full-text search.
| Most full-text queries need to be written by hand to actually
| utilize indexes, instead of the query-soup that these types
| of libraries output. (No offense to the maintainers -- it's
| just how it be when you create a "general" solution.)
|
| [0]: https://pgmustard.com
| snack-boye wrote:
| Silly question, I'm using pg right now and most of my
| queries are something like this (in english)
|
| Find me some results in my area that contain these
| categoryIds and are slotted to start between now and next
| 10 days.
|
 | Since it's already quite a filtered set of data, would that
 | mean I should have few issues adding pg text search
 | because, with correct indexing and all, it will usually be
 | applied to a small set of data?
|
| Thanks
| ezekg wrote:
| I'm not a DBA, so I can't say for certain simply due to a
| gap in my knowledge. But in my experience, it depends on
| a lot of factors. Sometimes pg will use an index before
| performing the search ops, other times a subquery is
| needed. Check out pgmustard and dig into your slow query
| plans. :)
| john-shaffer wrote:
| You might be just fine adding an unindexed tsvector
| column, since you've already filtered down the results.
|
| The GIN indexes for FTS don't really work in conjunction
| with other indices, which is why
| https://github.com/postgrespro/rum exists. Luckily, it
| sounds like you can use your existing indices to filter
| and let postgres scan for matches on the tsvector. The
| GIN tsvector indices are quite expensive to build, so
| don't add one if postgres can't make use of it!
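 | That filter-first pattern might look something like this (the
 | schema is illustrative):

```sql
-- Btree indexes narrow the candidate rows; the tsvector match is
-- then computed only on the survivors, no GIN index needed
SELECT *
FROM events
WHERE category_id = ANY ('{1,2,3}')
  AND starts_at BETWEEN now() AND now() + interval '10 days'
  AND to_tsvector('simple', title)
      @@ websearch_to_tsquery('simple', 'query');
```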
| kayodelycaon wrote:
| Oh cool. I was right and wrong. Thanks!
| fizx wrote:
| Hi, I started an Elasticsearch hosting company, since sold, and
| have built products on PG's search and SQLite FTS search.
|
| There are in my mind two reasons to not use PG's search.
|
| 1. Elasticsearch allows you to build sophisticated linguistic
| and feature scoring pipelines to optimize your search quality.
| This is not a typical use case in PG.
|
| 2. Your primary database is usually your scaling bottleneck
| even without adding a relatively expensive search workload into
| the mix. A full-text search tends to be around as expensive as
| a 5% table scan of the related table. Most DBAs don't like
| large scan workloads.
| kurko wrote:
| I agree with reason 1, but reason 2 is an answer for, "should
| I use PG search in the same PG instance I already have", and
| that's a different discussion. You can set up a replica for
| that.
| quietbritishjim wrote:
| Do you see any particular reasons to use or not use ZomboDB
| [1]? It claims to let you use ElasticSearch from PG
| seamlessly e.g. it manages coherency of which results ought
| to be returned according to the current transaction. (I've
| never quite ended up needing to use ES but it's always seemed
| to me I'd be likely to need ZomboDB if I did.)
|
| [1] https://github.com/zombodb/zombodb
| mixmastamyk wrote:
| You can do anything, anything at all, at https://zombo.com/
|
| _" The only limit, is yourself..."_
| FpUser wrote:
| This one is my favorite for many many years. Good for
| relaxing.
| habibur wrote:
| It's still around?
| mumblemumble wrote:
| That seems like one for the philosophers. If you
| completely rewrite a Flash site in HTML5, but it looks
| the same and has the same URL, is it still the same site?
| da_chicken wrote:
| So, YouTube?
| mixmastamyk wrote:
| Click the link.
| Conlectus wrote:
| Though I have only a recreational interest in datastores, I
| would be pretty wary of a service that claims to strap a
| Consistent-Unavailable database (Postgres) to an
| Inconsistent-Available database (ElasticSearch) in a useful
| way.
|
| Doing so would require ElasticSearch to reach consensus on
| every read/write, which would remove most of the point of a
| distributed cluster. Despite this, ZomboDB's documentation
| says "complex aggregate queries can be answered in parallel
| across your ElasticSearch cluster".
|
| They also claim that transactions will abort if
| ElasticSearch runs into network trouble, but the
| ElasticSearch documentation notes that writes during
| network partitions don't wait for confirmation of
| success[1], so I'm not sure how they would be able to
| detect that.
|
| In short: I'll wait for the Jepsen analysis.
|
| [1] https://www.elastic.co/blog/tracking-in-sync-shard-
| copies#:~...
| zombodb wrote:
| > Doing so would require ElasticSearch to reach consensus
| on every read/write
|
| ZomboDB only requires that ES have a view of its index
| that's consistent with the active Postgres transaction
| snapshot. ZDB handles this by ensuring that the ES index
| is fully refreshed after writes.
|
| This doesn't necessarily make ZDB great for high-update
| loads, but that's not ZDB's target usage.
|
| > They also claim that transactions will abort if
| ElasticSearch runs into network trouble...
|
| I had to search my own repo to see where I make this
| claim. I don't. I do note that network failures between
| PG & ES will cause the active Postgres xact to abort. On
| top of that, any error that ES is capable of reporting
| back to the client will cause the PG xact to abort --
| ensuring consistency between the two.
|
| Because the ES index is properly refreshed as it relates
| to the active Postgres transaction, all of ES' aggregate
| search functions are capable of providing proper MVCC-
| correct results, using the parallelism provided by the ES
| cluster.
|
| I don't have the time to detail everything that ZDB does
| to project Postgres xact snapshots on top of ES, but the
| above two points are the easy ones to solve.
| NortySpock wrote:
| Regarding point 2: Shouldn't you be moving your search
| queries from your transaction server to a separate analysis
| or read-replica server? OLTP copies to OLAP and suddenly
| you've separated these two problems.
| blowski wrote:
| ...and created a new one if the projection is only
| eventually consistent. No free lunches here.
| zepolen wrote:
| How is that different from running Postgres and
| Elasticsearch separately?
| blowski wrote:
| It's not.
|
| Either way you end up with separate OLAP and OLTP
| databases, which is what the parent post suggested.
| justinclift wrote:
| Probably skill set of staff?
|
| For example, if a place has developed fairly good
| knowledge of PG already, they can "just" (!) continue
| developing their PG knowledge.
|
| Adding ES into the mix though, introduces a whole new
| thing that needs to be learned, optimised, etc.
| pmarreck wrote:
| > sophisticated linguistic and feature scoring pipelines to
| optimize your search quality
|
| You CAN score results using setweight, although it's likely
| not as sophisticated as Elasticsearch's
|
| https://www.postgresql.org/docs/9.1/textsearch-controls.html
|
| Disclaimer: I use Postgres full-text search in production
| and am very happy with it, although maintaining the
| various triggers and stored procs it requires becomes
| cumbersome whenever you have to write a migration that
| alters any of them (or that touches any related field,
| since you may be required to drop and recreate all of the
| parts in order not to violate referential integrity).
|
| It is certainly nice not having to worry about one
| additional dependency when deploying, though.
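|
| A rough sketch of the setweight approach (the schema is
| made up):
|
| ```sql
| -- Weight title matches above body matches (A > B > C > D)
| UPDATE articles
|    SET search_tsv =
|        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
|        setweight(to_tsvector('english', coalesce(body,  '')), 'D');
|
| SELECT title, ts_rank(search_tsv, query) AS rank
|   FROM articles,
|        to_tsquery('english', 'postgres & search') AS query
|  WHERE search_tsv @@ query
|  ORDER BY rank DESC
|  LIMIT 10;
| ```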
| thom wrote:
| We get really nice results with gist indexes (gist_trgm_ops)
| searching across multiple entity types to do top X queries. It's
| very useful to be able to take a stab at a difficult-to-spell
| foreign football player's name, possibly with lots of diacritics,
| and get quick results back. I'm always surprised when I find a
| search engine on any site that is so unkind as to make you spell
| things exactly.
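|
| For reference, the trigram setup is roughly (names are
| illustrative):
|
| ```sql
| CREATE EXTENSION IF NOT EXISTS pg_trgm;
|
| CREATE INDEX players_name_trgm_idx
|     ON players USING gist (name gist_trgm_ops);
|
| -- The % operator matches on trigram similarity, so a
| -- misspelled query can still find the right row.
| SELECT name, similarity(name, 'zlatan ibrahimovitch') AS sim
|   FROM players
|  WHERE name % 'zlatan ibrahimovitch'
|  ORDER BY sim DESC
|  LIMIT 5;
| ```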
| bityard wrote:
| I know Postgres and SQLite have mostly different purposes but
| FWIW, SQLite also has a surprisingly capable full-text search
| extension built right in: https://www.sqlite.org/fts5.html
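|
| A quick taste (the table is made up):
|
| ```sql
| CREATE VIRTUAL TABLE notes USING fts5(title, body);
|
| INSERT INTO notes VALUES
|   ('groceries', 'eggs milk bread'),
|   ('blog',      'draft a post about full-text search');
|
| -- MATCH queries the index; bm25() gives a relevance rank
| SELECT title
|   FROM notes
|  WHERE notes MATCH 'search'
|  ORDER BY bm25(notes);
| ```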
| jjice wrote:
| It's very impressive, especially considering the SQLite
| version you're already using probably has it enabled. I
| use it for a small site I run, and it works fantastically.
| It's a little finicky with deletes and updates due to
| SQLite's virtual tables, but definitely impressive and
| has its uses.
| SigmundA wrote:
| Keep wondering if RUM Indexes [1] will ever get merged for faster
| and better ranking (TF/IDF). Really would make PG a much more
| complete text search engine.
|
| [1] https://github.com/postgrespro/rum
___________________________________________________________________
(page generated 2021-07-27 23:00 UTC)