[HN Gopher] Postgres Full-Text Search: A search engine in a data...
       ___________________________________________________________________
        
       Postgres Full-Text Search: A search engine in a database
        
       Author : twakefield
       Score  : 557 points
       Date   : 2021-07-27 15:20 UTC (1 day ago)
        
 (HTM) web link (blog.crunchydata.com)
 (TXT) w3m dump (blog.crunchydata.com)
        
       | eric4smith wrote:
       | Postgres FTS is normally quite good.
       | 
       | But it does not know how to deal with languages like Chinese,
       | Japanese and Thai.
       | 
       | For that you have to use something like PGroonga extension.
       | 
        | The rest of PostgreSQL mostly handles these languages OK, unless
        | you try to sort on one of them, and then the same problems show
        | up again.
       | 
        | There are ways around all of these problems, but it's not as
        | easy as turning on Unicode and expecting everything to just
        | work!
       | 
        | Yes, I'm a native English speaker who started to develop in Asia
        | and discovered all of this recently.
        
       | simonw wrote:
       | The Django ORM includes support for PostgreSQL search and I've
       | found it a really productive way to add search to a project:
       | https://docs.djangoproject.com/en/3.2/ref/contrib/postgres/s...
        
       | lettergram wrote:
       | I actually built a search engine back in 2018 using postgresql
       | 
       | https://austingwalters.com/fast-full-text-search-in-postgres...
       | 
       | Worked quite well and still use it daily. Basically doing
       | weighted searches on vectors is slower than my approach, but
       | definitely good enough.
       | 
       | Currently, I can search around 50m HN & Reddit comments in 200ms
       | on the postgresql running on my machine.
        
         | vincnetas wrote:
          | Off topic, but curious: what are your use cases for searching
          | all HN and Reddit comments? I'm at the beginning of this path,
          | having just crawled HN, but what to do with it is still a bit
          | cloudy.
        
           | lettergram wrote:
           | I built a search engine that quantified the expertise of
           | authors of comments. Then I created what I called "expert
           | rank" that allowed me to build a really good search engine.
           | 
           | Super good if you're at a company or something
           | 
           | https://twitter.com/austingwalters/status/104189476543920128.
           | ..
        
             | turbocon wrote:
              | Wow, that led me down quite a rabbit hole, impressive
              | work.
        
           | busymom0 wrote:
           | If you are just interested in searching HN and don't need
           | full data dumps, the Algolia search for HN is quite good and
           | fast:
           | 
           | https://hn.algolia.com
           | 
           | They also have a free API.
           | 
           | If you need data dumps, maybe look into Google BigQuery.
        
         | rattray wrote:
          | Nice - looks like the ~same approach recommended here of adding
          | a generated `tsvector` column with a GIN index and querying it
          | with `col @@ to_tsquery('english', query)`.
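          | 
          | (For reference, a minimal sketch of that approach on Postgres
          | 12+; the table and column names here are made up:)
          | 
          |     ALTER TABLE posts
          |       ADD COLUMN search tsvector
          |       GENERATED ALWAYS AS (
          |         to_tsvector('english',
          |           coalesce(title, '') || ' ' || coalesce(body, ''))
          |       ) STORED;
          | 
          |     CREATE INDEX posts_search_idx ON posts USING GIN (search);
          | 
          |     SELECT id, title
          |     FROM posts
          |     WHERE search @@ to_tsquery('english', 'full & text & search');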
        
           | lettergram wrote:
           | Yeah my internal approach was creating custom vectors which
           | are quicker to search.
        
       | pvsukale3 wrote:
       | If you are using Rails with Postgres you can use pg_search gem to
       | build the named scopes to take advantage of full text search.
       | 
       | https://github.com/Casecommons/pg_search
        
       | theandrewbailey wrote:
       | > You could also look into enabling extensions such as unaccent
       | (remove diacritic signs from lexemes) or pg_trgm (for fuzzy
       | search).
       | 
       | Trigrams (pg_trgm) are practically needed for usable search when
       | it comes to misspellings and compound words (e.g. a search for
       | "down loads" won't return "downloads").
       | 
       | I also recommend using websearch_to_tsquery instead of using the
       | cryptic syntax of to_tsquery.
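        | 
        | (A small illustration of the difference - the queries are made
        | up:)
        | 
        |     -- to_tsquery: boolean operator syntax, errors on bad input
        |     SELECT to_tsquery('english', 'postgres & (search | index)');
        | 
        |     -- websearch_to_tsquery: forgiving "web search" syntax,
        |     -- quoted phrases and -exclusions just work
        |     SELECT websearch_to_tsquery('english',
        |                                 '"full text" search -mysql');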
        
         | kyrra wrote:
         | Trigrams are amazing. I was doing a sideproject where I wanted
         | to allow for substring searching, and trigrams seemed to be the
         | only way to do it (easily/well) in postgres. Gitlab did a great
         | writeup on this a few years ago that really helped me
         | understand it:
         | 
         | https://about.gitlab.com/blog/2016/03/18/fast-search-using-p...
         | 
         | You can also always read the official docs:
         | 
         | https://www.postgresql.org/docs/current/pgtrgm.html
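          | 
          | (The gist of the trigram approach, with made-up names - a GIN
          | index using gin_trgm_ops lets ILIKE '%substring%' queries use
          | an index:)
          | 
          |     CREATE EXTENSION IF NOT EXISTS pg_trgm;
          | 
          |     CREATE INDEX files_path_trgm_idx
          |       ON files USING GIN (path gin_trgm_ops);
          | 
          |     SELECT path FROM files WHERE path ILIKE '%search%';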
        
       | Grimm1 wrote:
       | Postgres Full-Text search is a great way to get search running
       | for a lot of standard web applications. I recently used just this
        | in Elixir to set up a simple search by keyword. My only complaint
        | was that Ecto (Elixir's query builder library) doesn't have
        | first-class support for it, and neither does Postgrex, the
        | lower-level connector it uses. Still, using fragments with
        | sanitized SQL wasn't too messy at all.
        
       | mrinterweb wrote:
       | I've seen Elasticsearch set up for applications that would have
       | equal benefit from just using the postgresql db's full-text
       | search they already have access to.
       | 
       | The additional complexity is usually incurred when the data in
       | postgresql changes, and those changes need to be mirrored up to
       | Elasticsearch. Elasticsearch obviously has its uses, but for some
       | cases, postgresql's built in full-text search can make more
       | sense.
        
       | [deleted]
        
       | jcuenod wrote:
       | Huh, just yesterday I blogged[0] about using FTS in SQLite[1] to
       | search my PDF database. SQLite's full-text search is really
       | excellent. The thing that tripped me up for a while was `GROUP
       | BY` with the `snippet`/`highlight` function but that's the point
       | of the blog post.
       | 
       | [0] https://jcuenod.github.io/bibletech/2021/07/26/full-text-
       | sea...
       | 
       | [1] https://www.sqlite.org/fts5.html
        
         | bob1029 wrote:
         | We've been using SQLite's FTS capabilities to index customer
         | log files since 2017 or so. It's been a wonderful approach for
         | us. Even if we move to our own in-house data store (event
         | sourced log), we would still continue using SQLite for tracing
         | because it brings so many of these sorts of benefits.
        
         | edwinyzh wrote:
          | A very well written article about SQLite FTS5! One question -
          | it seems that your search result displays the matching page
          | number. How did you do that? Because as far as I know, unlike
          | FTS4, FTS5 has no `offsets` function.
        
           | jcuenod wrote:
           | I index each page individually. It's not in the article
           | (because I also don't explain separating out the metadata)
           | but I did weigh the options. I considered indexing page
           | pairs, for example (and having every page indexed twice). But
           | I figured that if search terms are broken across pages,
           | they're likely to also be separated by text in headers &
           | footers, page numbers, and footnotes. So in the end I decided
           | to just index individual pages. The FTS table now has `id`,
           | `page_number`, and `content` columns (where `id` is the
           | foreign key to a table that stores the metadata).
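            | 
            | (Roughly, in SQLite FTS5 terms - an illustrative sketch
            | rather than the exact schema from the post:)
            | 
            |     CREATE VIRTUAL TABLE pages USING fts5(
            |       id UNINDEXED, page_number UNINDEXED, content
            |     );
            | 
            |     SELECT id, page_number,
            |            snippet(pages, 2, '<b>', '</b>', '...', 10)
            |     FROM pages
            |     WHERE pages MATCH 'search terms'
            |     ORDER BY rank;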
        
       | MushyRoom wrote:
       | I was hyped when I found out about it a while ago. Then I wasn't
       | anymore.
       | 
       | When you have 12 locales (kr/ru/cn/jp/..) it's not that fun
       | anymore. Especially on a one man project :)
        
         | freewizard wrote:
         | For small project and simple full text search requirement, try
         | this generic parser:
         | https://github.com/freewizard/pg_cjk_parser
        
         | oauea wrote:
         | Why support so many locales in a one man project?
        
           | MushyRoom wrote:
           | It's just a project to learn all the things that are web.
           | It's mainly a database for a game now with most of the
           | information sourced from the game (including its localization
           | files).
           | 
            | I'm slowly transitioning from MariaDB to Postgres - again as
            | a learning experience. There is cool stuff, and there is
            | annoying stuff, like reproducing case-insensitive +
            | accent-insensitive matching (utf8_general_ci) in Postgres.
           | 
           | I've looked into FTS and searching for missing dictionaries
           | to support all the locales but Chinese is one of the harder
           | ones.
        
           | blacktriangle wrote:
           | One (dev) project here, we're up to 5 locales at a
           | surprisingly small number of customers. Problem is when your
           | customers are global, all of a sudden a single customer can
           | bring along multiple locales. I very much regret not taking
           | localization far more seriously early in development but we
            | were blindsided by the interest outside the Anglosphere.
        
           | habibur wrote:
           | To attract more visitors/customers I guess. I plan to extend
           | to 10 languages too. 1 man project.
        
       | tabbott wrote:
       | Zulip's search is powered by this built-in Postgres full-text
       | search feature, and it's been a fantastic experience. There's a
       | few things I love about it:
       | 
       | * One can cheaply compose full-text search with other search
       | operators by just doing normal joins on database indexes, which
       | means we can cheaply and performantly support tons of useful
       | operators (https://zulip.com/help/search-for-messages).
       | 
       | * We don't have to build a pipeline to synchronize data between
       | the real database and the search database. Being a chat product,
       | a lot of the things users search for are things that changed
       | recently; so lag, races, and inconsistencies are important to
       | avoid. With the Postgres full-text search, all one needs to do is
       | commit database transactions as usual, and we know that all
       | future searches will return correct results.
       | 
       | * We don't have to operate, manage, and scale a separate service
       | just to support search. And neither do the thousands of self-
       | hosted Zulip installations.
       | 
       | Responding to the "Scaling bottleneck" concerns in comments
       | below, one can send search traffic (which is fundamentally read-
       | only) to a replica, with much less complexity than a dedicated
       | search service.
       | 
       | Doing fancy scoring pipelines is a good reason to use a
       | specialized search service over the Postgres feature.
       | 
       | I should also mention that a weakness of Postgres full-text
       | search is that it only supports doing stemming for one language.
       | The excellent PGroonga extension (https://pgroonga.github.io/)
       | supports search in all languages; it's a huge improvement
       | especially for character-based languages like Japanese. We're
       | planning to migrate Zulip to using it by default; right now it's
       | available as an option.
       | 
       | More details are available here:
       | https://zulip.readthedocs.io/en/latest/subsystems/full-text-...
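        | 
        | (To illustrate the first point - this is not Zulip's actual
        | schema, just a sketch of composing FTS with ordinary relational
        | filters in one query:)
        | 
        |     SELECT m.id, m.content
        |     FROM messages m
        |     JOIN streams s ON s.id = m.stream_id
        |     WHERE s.name = 'engineering'
        |       AND m.sender_id = 42
        |       AND m.search_tsvector @@
        |           websearch_to_tsquery('english', 'deploy error')
        |     ORDER BY m.sent_at DESC
        |     LIMIT 50;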
        
         | stavros wrote:
         | I cannot tell you how much I love Zulip, but I can tell you
         | that I have no friends any more because everyone is tired of me
         | evangelizing it.
        
         | brightball wrote:
         | All of this. It's such a good operational experience that I
         | will actively fight against the introduction of a dedicated
         | search tool unless it's absolutely necessary.
        
       | rattray wrote:
       | TBH I hadn't known you could do weighted ranking with Postgres
       | search before.
       | 
        | Curious there's no mention of zombodb[0] though, which gives you
        | the full power of elasticsearch from within postgres (with
        | _consistency_, no less!). You have to be willing to tolerate slow
        | writes, of course, so using postgres' built-in search
        | functionality still makes sense for a lot of cases.
       | 
       | [0] https://github.com/zombodb/zombodb
        
         | craigkerstiens wrote:
          | Zombo is definitely super interesting and we should probably
          | add a bit in the post about it. Part of the goal here was to
          | show that you can do a LOT with Postgres, without adding one
          | more system to maintain. Zombo is great if you already have
          | Elastic around but want Postgres as the primary interface -
          | but what if you don't want to maintain Elastic at all?
         | 
          | My ideal, though, is always to start with Postgres and then see
          | if it can solve my problem. I would never say Postgres is the
          | best at everything it can do, but for most things it is good
          | enough without having another system to maintain and wear a
          | pager for.
        
           | rattray wrote:
           | Zombo does at least promise to handle "complex reindexing
           | processes" for you (which IME can be very painful) but yeah,
           | I assume you'd still have to deal with shard rebalancing,
           | hardware issues, network failures or latency between postgres
           | and elastic, etc etc.
           | 
           | The performance and cost implications of Zombo are more
           | salient tradeoffs in my mind - if you want to index one of
           | the main tables in your app, you'll have to wait for a
           | network roundtrip and a multi-node write consensus on every
           | update (~150ms or more[0]), you can't `CREATE INDEX
           | CONCURRENTLY`, etc.
           | 
           | All that said, IMO the fact that Zombo exists makes it easier
           | to pitch "hey lets just build search with postgres for now
           | and if we ever need ES's features, we can easily port it to
           | Zombo without rearchitecting our product".
           | 
           | [0] https://github.com/zombodb/zombodb/issues/640
        
             | zombodb wrote:
             | I wonder what the ZomboDB developers are up to now? What
             | great text-search-in-postgres things could they be secretly
             | working on?
        
       | syoc wrote:
       | My worst search experiences always come from the features
       | applauded here. Word stemming and removing stop words is a big
       | hurdle when you know what you are looking for but get flooded by
       | noise because some part of the search string was ignored. Another
       | issue is having to type out a full word before you get a hit in
       | dynamic search boxes (looking at you Confluence).
        
         | Someone1234 wrote:
          | I'd argue that isn't a problem with the _feature_, but a
          | thoughtless implementation.
         | 
         | A _good_ implementation will weigh verbatim results highest
         | before considering the stop-word stripped or stemmed version.
          | Configuring to_tsvector() to not strip stop words or not use a
          | stemming dictionary is, in my opinion, a little clunky in
          | Postgres: you'll want to make a new [language] dictionary and
          | then call to_tsvector() with your new dictionary as the first
          | parameter.
         | 
         | After you've set up the dictionary globally, this would look
         | _something_ like:
         | 
          | setweight(to_tsvector('english_no_stem_stop', col), 'A') ||
          | setweight(to_tsvector('english', col), 'B')
         | 
         | I think blaming Postgres for adding stemming/stop-word support
         | because it can be [ab]used for a poor search user experience is
         | like blaming a hammer for a poorly built home. It is just a
         | tool, it can be used for good or evil.
         | 
         | PS - You can do a verbatim search without using to_tsvector(),
         | but that cannot be easily passed into setweight() and you
         | cannot use features like ts_rank().
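          | 
          | (Setting up such a configuration might look roughly like this;
          | the names match the example above and are otherwise arbitrary:)
          | 
          |     -- lowercases tokens, no stemming, no stop-word removal
          |     CREATE TEXT SEARCH DICTIONARY simple_nostop (
          |       TEMPLATE = pg_catalog.simple
          |     );
          | 
          |     CREATE TEXT SEARCH CONFIGURATION english_no_stem_stop (
          |       COPY = english
          |     );
          | 
          |     ALTER TEXT SEARCH CONFIGURATION english_no_stem_stop
          |       ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
          |                         word, hword, hword_part
          |       WITH simple_nostop;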
        
       | shakascchen wrote:
       | No fun doing it for Chinese, especially for traditional Chinese.
       | 
        | I had to install extra software, but on Cloud SQL you can't. You
        | have to run it on your own instances.
        
         | justusw wrote:
         | Same story for performing searches in Japanese.
        
       | rattray wrote:
        | Something that's missing from this which I'm curious about is how
        | far _can't_ postgres search take you?
       | 
       | That is, what tends to be the "killer feature" that makes teams
       | groan and set up Elasticsearch because you just can't do it in
       | Postgres and your business needs it?
       | 
       | Having dealt with ES, I'd really like to avoid the operational
       | burden if possible, but I wouldn't want to choose an intermediary
       | solution without being able to say, "keep in mind we'll need to
       | budget a 3-mo transition to ES once we need X, Y, or Z".
        
         | nostrademons wrote:
         | Used to work on Google Search, used ES extensively for a
         | startup I founded (which was sort of quasi-search...it was
         | built around feed ranking, where the query is constant and a
         | stream of documents is constantly coming in), and have also
         | used Postgres extensively in other companies.
         | 
          | The big problem with all the off-the-shelf search solutions
          | (RDBMS full-text search, ES, Algolia) is that _search ranking
          | is a complicated and subtle problem_, and frequently depends
          | on signals that are _not in the document itself_. Google's big
          | insight is that how _other_ people talk about a website is more
          | important than how the website talks about itself, and its
          | ranking algorithm weights accordingly.
         | 
         | ES has the basic building blocks to construct such a ranking
         | algorithm. In terms of fundamental infrastructure I found ES to
         | be just as good as Google, and better in some ways. But its
         | out-of-the-box ranking function _sucks_. Expect to put a domain
         | expert just on search ranking and evaluation to get decent
          | results, and they're going to have to delve pretty deeply into
         | advanced features of ES to get there.
         | 
         | AFAICT Postgres search only lets you tweak the ranking
         | algorithm by assigning different weights to fields, assuming
         | that the final document score is a linear combination of
         | individual fields. This is usually not what you want - it's
         | pretty common to have non-linear terms from different signals.
        
           | wraptile wrote:
           | > Google's big insight is that how other people talk about a
           | website is more important than how the website talks about
           | itself, and its ranking algorithm weights accordingly.
           | 
            | I'm having trouble believing that, seeing how the top results
            | for opinionated keywords are all SEO spam from websites no
            | one would visit on their own.
        
             | lmm wrote:
              | You don't remember what a game-changer Google was back in
              | 2000 or so. Its results are pretty awful now, probably
              | because of Goodhart's Law, but they were such a huge
              | improvement over Jeeves et al.
        
             | sdoering wrote:
              | I remember when Google came into being and how amazed I
              | was. It was something totally different, and I switched
              | and never looked back for many years.
              | 
              | In recent years, however, I tend to massively agree with
              | your sentiment and experience. Sadly, these days I don't
              | find the really helpful results on pages 1-3.
        
           | SOLAR_FIELDS wrote:
            | Nice insightful comment. Can you give a bit of additional
            | insight on when it makes sense to use ES vs Postgres? My
            | takeaway from your comment is that unless your product
            | depends on search as a central component, the built-in search
            | is _good enough_. Is that an incorrect takeaway?
        
             | nostrademons wrote:
             | That's probably a decent takeaway. If full-text search is a
             | "nice to have" (a bonus feature for advanced users, for
             | example) you can probably rely on built-in Postgres search.
             | If search is a fundamental way that users interact with
             | your product, you want to spend the time and money to get
             | it right, and that probably means ElasticSearch and a
             | custom ranking function.
        
           | simondotau wrote:
           | There is an incredibly large difference between effective
           | scoring for data that is being gamed and data that is not
           | being gamed. Most people's full text data isn't being gamed
           | quite like web search SEO.
           | 
           | There is also an incredibly large difference between
           | effective scoring for data that has deep relationships and
           | data that does not. Most people's full text data doesn't have
           | deep relationships like web pages do with inbound links.
        
           | __jem wrote:
           | I built a big feature with percolate as well and it really
           | sold me on the possibilities of ES for product architecture.
        
             | cmckn wrote:
             | I've never really had a good idea of what Percolator _is_.
             | I understand it to be a "streaming" indexer that replaced a
             | "batch" indexer based on MapReduce. ES allows indexing
             | individual documents (or small batches) vs. rebuilding the
             | entire index; is that the connection you're making?
        
           | IgorPartola wrote:
           | So let's say that you are building a search engine for
           | performance car parts. There are going to be a bunch of
           | technical terms you use there that are not necessarily going
           | to stand out in the document itself but you know them to be
           | important. For example, the amount of boost pressure a turbo
           | can provide or the number of pistons in a brake caliper. Is
           | there some structured way to specify the grammar which is
           | used for such terms and treat those as important such that
           | when a user puts in "19psi turbo" that they don't get a bunch
           | of results for just "turbo" which isn't exactly what they
           | want?
        
             | dreamer7 wrote:
             | When I implemented search for my app, I built a hierarchy
             | of the different attributes weighted by levels and scored
             | search phrases for how closely they matched. This is for
             | mobile phones. For example, searching for "Silver iphone 11
             | pro" would give a higher ranking to all iphone 11 pros that
             | are Silver, then comes differently coloured 11 Pros,
             | followed by 11 Pro Max and 11s, lastly all silver coloured
             | phones.
        
           | bogomipz wrote:
           | >"Used to work on Google Search ..."
           | 
            | Might you or anyone else have some recommendations for books
            | or other resources on large-scale search architecture that
            | you think are worthy reads on the subject?
            | 
            | I would also be interested in hearing whether you or anyone
            | else might have any similar resources you could recommend on
            | the subject of "search ranking".
        
           | johnx123-up wrote:
           | Did you get a chance to compare MeiliSearch?
        
           | rawoke083600 wrote:
           | >search ranking is a complicated and subtle problem
           | 
            | Absolutely! I think the saying "the devil is in the details"
            | has never been more appropriate than here.
            | 
            | Sure, most users agree on the extremes: the REALLY bad search
            | result (putting in 'apple' and getting out LCD TVs [maybe the
            | TVs are in the parent category 'Electronics' and you have the
            | category and popularity boost set too high?]) and the REALLY
            | good one (results that you expected).
            | 
            | Search results are commonly evaluated with Recall (did ALL
            | the relevant documents get returned?) vs Precision (how many
            | of the returned results are 'correct'?).
            | 
            | But that is just one of many, many metrics.
            | 
            | The biggest issues are non-tech people (like your boss or
            | manager) walking in and "discussing" their pet-peeve search
            | query, because in their mind, when they put in APPLE iPhone
            | we should be returning ONLY Apple iPhones and NOT Apple
            | iPhone accessories - or maybe we should be returning ONLY
            | the "latest iPhone", not the model from 2 generations back.
            | 
            | I've commented on this before: you can usually only shoot for
            | an "average amount of happiness" (sounds like Arthur
            | Schopenhauer? :P) for most users. Never "OK people, search
            | works perfectly" for everyone.
            | 
            | As for the practical matters, we found it helps to build a
            | search test suite where you put in the "manager's pet peeve"
            | queries as well as any angry emails about search queries;
            | then, whenever you do search tuning, it's easy to see any
            | query regression. And of course this needs to be an automated
            | test suite.
        
         | sandGorgon wrote:
          | It doesn't do TF-IDF or BM25 - the current state of the art in
          | search relevance algorithms.
          | 
          | That's why it can't be used for anything serious.
        
           | SigmundA wrote:
            | That's why this really needs to get merged:
           | 
           | https://github.com/postgrespro/rum
        
             | rattray wrote:
             | TF/IDF is listed as a TODO on that repo, and I don't see a
             | PR which promises to provide it.
        
               | SigmundA wrote:
                | It lays the groundwork by storing the needed metadata in
                | the index; yes, it needs more work, and I wish there was
                | more interest in doing it.
                | 
                | I really think this is a neglected area, and if PG were
                | able to merge in TF-IDF and BM25 there would be little
                | reason to use a separate search db / engine, and many
                | advantages to having it integrated.
        
         | Thaxll wrote:
         | PG is average at best for text search, it's not even good.
        
         | nextaccountic wrote:
         | If your search needs outgrow Postgres' native search engine,
         | you can use Postgres search with an ElasticSearch backend,
         | using Zombo
         | 
         | https://github.com/zombodb/zombodb
         | 
         | It basically gives you a new kind of index (create index ..
         | using zombodb(..) ..)
        
         | mjewkes wrote:
         | Exact phrase matching. This generally requires falling back to
         | ILIKE, which is not performant.
        
           | e12e wrote:
           | Even if using ILIKE over the result of an imprecise query?
        
             | [deleted]
        
             | mjewkes wrote:
              | ILIKE itself is a linear scan. The only way to index such
              | queries is with trigram indices, which are very inefficient
              | (and sometimes not usable) if you're searching
              | document-length content.
             | 
             | Whether or not it works in your specific situation depends
             | on your use case.
        
           | simonw wrote:
           | Exact phrase searching works in PostgreSQL full-text search -
           | here's an example:
           | https://simonwillison.net/search/?q=%22nosql+database%22
           | 
           | I'm using search_type=websearch https://github.com/simonw/sim
           | onwillisonblog/blob/a5b53a24b00...
           | 
           | That's using websearch_to_tsquery() which was added in
           | PostgreSQL 11: https://www.postgresql.org/docs/11/textsearch-
           | controls.html#...
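            | 
            | (The quoted phrase becomes a tsquery with positional <->
            | operators, roughly:)
            | 
            |     SELECT websearch_to_tsquery('english', '"nosql database"');
            |     -- roughly: 'nosql' <-> 'databas'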
        
             | mjewkes wrote:
             | Try https://simonwillison.net/search/?q=%22your+own+benchma
             | rks%2...
             | 
             | Looks like we get 37 results, of which 2 are true
             | positives.
             | 
             | Looks like "your" and "own" are both contained in the
             | english.stop stopwords list. So you could fix this by
             | removing stopwords from your dictionary.
             | 
             | While disabling the stemmer is relatively easy (use the
             | 'simple' language setting for your ts_query), altering the
             | stopword dictionaries is more involved, and not easy to
             | maintain or pass between developers/environments, and not
             | at all easy to share between queries.
             | 
             | And so the most common suggestion is to use ILIKE.
             | 
             | Lucene has no problems with any of this.
        
             | SahAssar wrote:
             | I had some issues with this recently as I couldn't get a
             | FTS query to find something looking like a path or url in a
             | query. As an example from your site:
             | https://simonwillison.net/2020/Jan/6/sitemap-xml/ contains
             | the exact text https://www.niche-museums.com/, but I cannot
             | find a way to search for that phrase exactly (trying
             | https://simonwillison.net/search/?q=%22www.niche-
             | museums.com... works though). I tried both in my own psql
             | setup and on your site, and it seems like exact phrase
             | searching is limited to the language used, even if it would
             | be an exact string match.
             | 
             | Are there any workarounds for that?
        
           | rattray wrote:
           | Pardon my ignorance - what is exact phrase matching and why
           | doesn't it work with tsvector?
        
             | mjewkes wrote:
             | Exact phrase matching is what google (sometimes? used to?)
             | do for you if you put your search terms in double "full
             | quotes".
             | 
             | It returns only results that contain the exact multi word
             | sequence in exactly the same order.
        
         | amichal wrote:
          | Postgres was not good at (for us):
          | 
          | - IDF and other corpus-based relevancy measures (had to hand
          | roll)
          | 
          | - thesaurus and misspelling handling (again possible, of
          | course, with preprocessing and by adding config files)
          | 
          | - non-Latin-alphabet languages, e.g. Arabic (needed filesystem
          | access, which we couldn't get on AWS RDS, to add a
          | dictionary-based stemmer / word breaker)
          | 
          | We used ES or Solr for those cases. For English FTS with 100k
          | documents, doing it in PG is super easy and one less dependency.
        
         | some_developer wrote:
          | Anecdotal note:
         | 
          | A few years ago we added yet another part to our product and,
          | whilst ES worked "okay", we got a bit weary of ES due to "some
          | issues" (some bug in the architecture keeping things not
          | perfectly in sync, certain queries with "joins" of types taking
          | a long time, demands on hardware due to the size of the
          | database, no proper multi-node setup due to $$$ and time
          | constraints, etc.; small things piling up over time).
         | 
         | Bright idea: let's see how far Postgres, which is our primary
         | datastore, can take us!
         | 
         | Unfortunately, the feature never made it fully into production.
         | 
         | We thought that on paper, the basic requirements were ideal:
         | 
          | - although the table has multiple hundreds of millions of
          | entries, natural segmentation by customer IDs made individual
          | result sets much smaller
         | 
         | - no weighted search result needed: datetime based is perfect
         | enough for this use-case, we thought it would be easy to come
         | up with the "perfect index [tm]"
         | 
         | Alas, we didn't even get that far:
         | 
         | - we identified ("only") 2 columns necessary for the search =>
         | "yay, easy"
         | 
         | - one of those columns was multi-language; though we didn't
         | have specific requirements and did not have to deal with
         | language specific behaviour in ES, we had to decide on one for
         | the TS vectorization (details elude me why "simple" wasn't
         | appropriate for this one column; it was certainly for the other
         | one)
         | 
          | - unsure which one, or both, we would need, we created both
          | indices for one of the columns (the difference being the
          | "language")
         | 
         | - we started out with a GIN index (see
         | https://www.postgresql.org/docs/9.6/textsearch-indexes.html )
         | 
         | - creating a single index took > 15 hours
         | 
          | But once the second index was done - we had not even rolled out
          | the feature in the app itself (which at this point was still an
          | ever-changing MVP) - we suddenly got hit by a lot of seemingly
          | unrelated customer complaints that totally different operations
          | on this table (INSERTs and UPDATEs) were getting slow (like
          | 5-15 seconds slow, something which usually takes tiny ms).
         | 
         | Backend developer eyes were wide open O_O
         | 
          | But since we knew that the second index had just finished,
          | after checking the Postgres logs we decided to drop the FTS
          | indices and, lo and behold, "performance problem solved".
         | 
         | Communication lines were very short back then (still are today,
         | actually) and it was promptly decided we just cut the search
         | functionality from this new part of the product and be done
         | with it. This also solved the problem, basically (guess there's
         | some "business lesson" to be learned here too, not just
         | technical ones).
         | 
         | Since no one within the company counter argued this decision,
         | we did not spend more time analyzing the details of the
         | performance issue though I would have loved to dig into this
         | and get an expert on board to dissect this.
         | 
         | --
         | 
          | A year later or so I had a bit of free time and analyzed one
          | annoying recurring slow UPDATE query problem on a completely
          | different table, but also involving FTS on a single column
          | there, also using a GIN index. That's when I stumbled over
         | https://www.postgresql.org/docs/9.6/gin-implementation.html
         | 
         | > Updating a GIN index tends to be slow because of the
         | intrinsic nature of inverted indexes: inserting or updating one
         | heap row can cause many inserts into the index (one for each
         | key extracted from the indexed item). As of PostgreSQL 8.4, GIN
         | is capable of postponing much of this work by inserting new
         | tuples into a temporary, unsorted list of pending entries. When
         | the table is vacuumed or autoanalyzed, or when
         | gin_clean_pending_list function is called, or if the pending
         | list becomes larger than gin_pending_list_limit, the entries
         | are moved to the main GIN data structure using the same bulk
         | insert techniques used during initial index creation. This
         | greatly improves GIN index update speed, even counting the
         | additional vacuum overhead. Moreover the overhead work can be
         | done by a background process instead of in foreground query
         | processing.
         | 
         | In this particular case I was able to solve the occasional slow
         | UPDATE queries with "FASTUPDATE=OFF" on that table and,
         | thinking back about the other issue, it might have solved or
         | minimized the impact.
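          | 
          | (For anyone curious, that knob looks roughly like this; the
          | index name is made up:)
          | 
          |     ALTER INDEX documents_fts_gin_idx SET (fastupdate = off);
          | 
          |     -- or keep fastupdate on but flush the pending list sooner
          |     ALTER INDEX documents_fts_gin_idx
          |       SET (gin_pending_list_limit = 512);  -- kB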
         | 
         | Back to the original story: yep, this one table can have
         | "peaks" of inserts but it's far from "facebook scale" or
         | whatever, basically 1.5k inserts / second were the absolute
         | rare peak I measured and usually it's in the <500 area. But I
         | guess it was enough for this scenario to add latency within the
         | database.
         | 
         | --
         | 
          | Turning my memory back further, I was always "pro" trying to
          | minimize / get rid of ES after learning about
          | http://rachbelaid.com/postgres-full-text-search-is-good-enou...
          | even before we used any FTS feature. It also mentions the
          | GIN/GiST issue, but alas, in our case: ElasticSearch is good
          | enough and, besides the troubles we've had with it, actually
          | easier to reason about (so far).
        
         | manigandham wrote:
         | It can't go that far. Postgres is limited by dictionaries,
         | stemming/processing, query semantics (like fuzzy searching),
         | and the biggest issue of all being a lack of modern relevance
         | algorithms. It's good for limited scenarios where you just need
         | more than a SQL LIKE statement, and chaining some functions
         | together can get you decent results [1] without adding another
         | datastore.
         | 
         | However search tech is pretty mature with Lucene at the core
         | and there are many better options [2] from in-process libraries
         | to simple standalone servers to full distributed systems like
         | Elastic. There are also other databases (relational like
         | MemSQL, or documentstores like MongoDB/RavenDB) that are adding
         | search as native querying functions with most of the abilities
         | of ES. If search is a core or complex part of your application
         | (like patterns in raw image data or similarities in audio
         | waveforms) then that's where ES will excel.
         | 
         | 1. https://stackoverflow.com/questions/46122175/fulltext-
         | search...
         | 
         | 2.
         | https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...
        
         | grncdr wrote:
         | My experience has been that sorting by relevance ranking is
         | quite expensive. I looked into this a bit and found
         | https://github.com/postgrespro/rum (and some earlier slide
         | decks about it) that explains why the GIN index type can't
         | support searching and ranking itself (meaning you need to do
         | heap scans for ranking). This is especially problematic if your
         | users routinely do searches that match a lot of documents and
         | you only want to show the top X results.
         | 
         | Edit: if any of the Crunchy Data people are reading this:
         | support for RUM indexes would be super cool to have in your
         | managed service.
        
           | thom wrote:
           | Top X queries should be optimised with gist indexes.
        
             | grncdr wrote:
             | Ah sounds like I need to update myself on this. I don't
             | recall what the other trade-offs vs GIN indexes are.
        
           | craigkerstiens wrote:
           | Actually thats a great suggestion. We need to take a deeper
           | look at the code itself and ability to support the extension,
           | but we'll definitely take and evaluate it at some of our
           | upcoming roadmap planning for Crunchy Bridge
           | (https://www.crunchybridge.com).
        
             | grncdr wrote:
             | Cool cool
             | 
             | If I could have your other ear for a moment: support for
             | _any_ EU-based cloud provider would be super nice. Ever
             | since the Privacy Shield fig leaf was removed, questions
             | about whether we store any data under US jurisdiction have
             | become a lot more frequent.
        
               | craigkerstiens wrote:
               | Do you have a preferred one?
               | 
               | We're currently on the big 3 US ones, in process of
               | working on our 4th provider, our goal is very much to
               | deliver the best Postgres experience whether on bare
               | metal/on-premise or in the cloud, self hosted or fully
               | managed.
               | 
               | Edit: Always feel free to reach out directly, I'm always
               | happy to spend time with anyone that has questions and
               | usually pretty easy to track me down.
        
               | james_in_the_uk wrote:
               | Dublin usually works well from a regulatory perspective.
        
               | jhoechtl wrote:
                | They are the 52nd state of America. The 51st was already
                | taken.
               | 
               | https://en.wikipedia.org/wiki/51st_State_(song)
        
         | animeshjain wrote:
         | From what I know, full text search in Postgres (and MySQL) does
         | not have faceted search. So it only supports returning full
         | text results from the entire index.
         | 
         | Actually, it is possible, but doing a search on a particular
         | segment of rows is a very slow operation - say text search for
         | all employees with name matching 'x', in organization id 'y'.
         | 
         | It is not able to utilise the index on organization id in this
         | case, and it results in a full scan.
        
         | xcambar wrote:
          | This is an anecdote, not proper feedback, since I wasn't
          | directly involved in the topic.
         | 
         | My company relied on PG as its search engine and everything
         | went well from POC to production. After a few years of
         | production and new clients requiring volumes of data an order
         | of magnitude above our comfort zone, things went south pretty
         | fast.
         | 
          | Not many months, but many sweaty weeks of engineering later,
          | we switched to ES and we're not looking back.
         | 
         | tl;dr; even with great DB engineers (which we had), I'd suggest
         | that scale is a strong limiting factor on this feature.
        
           | dante_dev wrote:
            | Can you tell us the scale you're talking about? We're getting
            | good-enough results with ~1 billion rows.
        
             | lacksconfidence wrote:
             | Can you tell us about your scoring function? Selecting 40M
             | results from a dataset of ~1B and returning the top 10
             | based on some trivial scoring function is easy, any
             | reasonable search system will handle that. The problem is
             | when you have to run your scoring function on all 40M
             | matching docs to decide which 10 are the most relevant.
              | It's even more of a problem when your scoring function
              | captures some of the complexities of the real world, rather
              | than something trivial like tf/idf or bm25.
        
         | _009 wrote:
         | If you are looking to do semantic search (Cosine similarity) +
         | filtering (SQL) on data that can be represented as vectors
         | (audio, text, video, bio) I suggest,
         | https://github.com/ankane/pgvector
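          | 
          | (A tiny pgvector sketch - the table name and dimension are
          | arbitrary:)
          | 
          |     CREATE EXTENSION IF NOT EXISTS vector;
          | 
          |     CREATE TABLE items (
          |       id bigserial PRIMARY KEY,
          |       embedding vector(3)
          |     );
          | 
          |     -- <=> is cosine distance; mix it with ordinary SQL filters
          |     SELECT id FROM items
          |     ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
          |     LIMIT 5;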
        
         | iav wrote:
         | I moved from ElasticSearch to PG FTS in production, and here
         | are the things I had to give up:
         | 
         | 1. PostgreSQL has a cap on column length, and the search index
         | has to be stored in a column. The length of the column is
         | indeterminate - it is storing every word in the document and
         | where it's located, so a short document with very many unique
         | words (numbers are treated as words too) can easily burst the
         | cap. This means you have to truncate each document before
         | indexing it, and pray that your cap is set low enough. You can
         | use multiple columns but that slows down search and makes
         | ranking a lot more complicated. I truncate documents at 4MB.
         | 
         | 2. PostgreSQL supports custom dictionaries for specific
         | languages, stemmers, and other nice tricks, but none of those
         | are supported by AWS because the dictionary gets stored as a
         | file on the filesystem (it's not a config setting). You can
         | still have custom rules like whether or not numbers count as
         | words.
        
         | gk1 wrote:
         | Semantic search using text embeddings. With Open Distro for
         | Elasticsearch you can store your text embeddings and then
         | perform a nearest-neighbor search[1] to find most similar
         | documents using cosine similarity[2]. Elasticsearch (vanilla)
         | will get this feature with 8.0.
         | 
         | If migrating to ES makes you groan you can use a managed
         | service like Pinecone[3] (disclaimer: I work there) just for
         | storing and searching through text embeddings in-memory through
         | an API while keeping the rest of your data in PG.
         | 
         | [1] Nearest-neighbor searches in Open Distro:
         | https://opendistro.github.io/for-elasticsearch-docs/docs/knn...
         | 
         | [2] More on how semantic similarity is measured:
         | https://www.pinecone.io/learn/semantic-search/
         | 
         | [3] https://www.pinecone.io
        
         | matsemann wrote:
         | I'm not well versed on modern pg for this use, but when I
         | managed a Solr instance ~5 years ago, it was the ranking of the
         | results that was the killer feature. Finding results fast most
         | systems can do. Knowing which results to present is harder.
         | 
         | Our case was a domain specific knowledge base, with certain
         | terms occurring often in many articles. Searching for a term
         | could bring up thousands of results, but few of them were
         | actually relevant to show in context of the search, they just
         | happened to use the term.
        
         | jka wrote:
         | "Faceted search"[1] (aka aggregates in Elasticsearch) tends to
         | be a popular one, to provide user-facing content navigation.
         | 
         | That said, simonw has been on the case[2] demonstrating an
         | implementation of that using Django and PostgreSQL.
         | 
         | [1] - https://en.wikipedia.org/wiki/Faceted_search
         | 
         | [2] - https://simonwillison.net/2017/Oct/5/django-postgresql-
         | facet...
        
           | craigds wrote:
            | This is the key one for us that makes Postgres a non-starter
            | for FTS (we use postgres for everything _else_).
           | 
           | We begrudgingly use Solr instead (we started before ES was
           | really a thing and haven't found a _need_ to switch yet)
           | 
           | When you get more than about two different _types_ of filters
           | (e.g. types of filters could be  'tags', 'categories',
           | 'geotags', 'media type', 'author' etc), the combinatorial
           | explosion of Postgres queries required to provide facet
           | counts gets unmanageable.
           | 
            | For example, when I do a query filtered by
            | `?tag=abc&category=category1`, I need to do these queries:
            | 
            | - `... where tag = 'abc' and category_id = 1` (the current
            | results)
            | 
            | - `count(*) ... where tag = 'def' and category_id = 1` (for
            | each other tag present in the results)
            | 
            | - `count(*) ... where tag = 'abc' and category_id = 2` (for
            | each other category present in the results)
            | 
            | - `count(*) ... where category_id = 1`
            | 
            | - `count(*) ... where tag = 'abc'`
            | 
            | - `count(*)`
           | 
           | There are certainly smarter ways to do this than lots of tiny
           | queries, but all this complexity still ends up somewhere and
           | isn't likely to be great for performance.
           | 
           | Whereas solr/elasticsearch have faceting built in and handle
           | this with ease.
        
             | kalev wrote:
              | This is exactly the issue I'm currently facing. We do a
              | bunch of count queries to calculate facets, and I'm looking
              | for something that can do this out of the box. I'm glad I
              | came to the same conclusion myself: either Solr or
              | Elasticsearch might be the way to go. Starting this from
              | scratch, which of the two would you recommend, and why?
        
               | craigds wrote:
               | Unfortunately I don't know enough about elasticsearch to
               | provide a useful comparison. I like the way ES queries
               | are structured JSON instead of obscure compact
               | querystring parameters, but that's not a good reason to
               | choose one over the other :)
        
               | Ueland wrote:
                | I myself am currently setting up Solr for a project, as
                | my experience with ES from a DevOps perspective is not a
                | happy one. There were always nodes/indexes having some
                | kind of problem. At the same time, I have also worked
                | with Solr for years before and never hit any major
                | issues. It just works and does the job well.
        
             | radiospiel wrote:
             | You could run this in a single query `count(*) .. group by
             | (tag, category_id)` and then post process the results.
             | 
              | Counting by groups is still relatively expensive in
              | postgres, though; I built an implementation which uses
              | estimated counts (via the query planner) for larger
              | buckets, and exact counting only for smaller buckets, but
              | my use case allowed for some degree of inconsistency.
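              | 
              | (One single-query variant of that idea, sketched with the
              | column names from the example above; GROUPING SETS gives
              | the per-tag counts, per-category counts and the total in
              | one pass:)
              | 
              |     SELECT tag, category_id, count(*)
              |     FROM items
              |     WHERE search @@ websearch_to_tsquery('english', 'query')
              |     GROUP BY GROUPING SETS ((tag), (category_id), ());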
        
         | brightball wrote:
         | Streaming data ingestion is the biggest. If you're constantly
         | writing data to be searched, this is where ES really outshines
         | everything.
        
         | kayodelycaon wrote:
         | If I recall correctly, Postgres search doesn't scale well. Not
         | sure where it falls apart but it isn't optimized in the same
         | way something like Solr is.
        
           | ezekg wrote:
            | I have a table with over a billion rows, and most full-text
            | searches still respond within a few milliseconds. I think
           | this will depend on a lot of factors, such as proper
           | indexing, and filtering down the dataset as much as possible
           | before performing the full-text ops. I've spent a
           | considerable amount of time on optimizing these queries,
           | thanks to tools like PgMustard [0]. Granted, I do still have
           | a couple slow queries (1-10s query time), but that's likely
           | due to very infrequent access i.e. cold cache.
           | 
           | I will say, if you use open source libraries like pg_search,
           | you are unlikely to ever have performant full-text search.
           | Most full-text queries need to be written by hand to actually
           | utilize indexes, instead of the query-soup that these types
           | of libraries output. (No offense to the maintainers -- it's
           | just how it be when you create a "general" solution.)
           | 
           | [0]: https://pgmustard.com
        
             | snack-boye wrote:
             | Silly question, I'm using pg right now and most of my
             | queries are something like this (in english)
             | 
             | Find me some results in my area that contain these
             | categoryIds and are slotted to start between now and next
             | 10 days.
             | 
              | Since it's already quite a filtered set of data, would that
              | mean I should have few issues adding pg text search,
              | because with correct indexing and all, it will usually be
              | applied to a small set of data?
             | 
             | Thanks
        
               | ezekg wrote:
               | I'm not a DBA, so I can't say for certain simply due to a
               | gap in my knowledge. But in my experience, it depends on
               | a lot of factors. Sometimes pg will use an index before
               | performing the search ops, other times a subquery is
               | needed. Check out pgmustard and dig into your slow query
               | plans. :)
        
               | john-shaffer wrote:
               | You might be just fine adding an unindexed tsvector
               | column, since you've already filtered down the results.
               | 
               | The GIN indexes for FTS don't really work in conjunction
               | with other indices, which is why
               | https://github.com/postgrespro/rum exists. Luckily, it
               | sounds like you can use your existing indices to filter
               | and let postgres scan for matches on the tsvector. The
               | GIN tsvector indices are quite expensive to build, so
               | don't add one if postgres can't make use of it!
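                | 
                | (i.e. something along these lines, with made-up names:
                | the btree indexes do the filtering, and the tsvector
                | match only runs on the rows that survive it:)
                | 
                |     SELECT *
                |     FROM events
                |     WHERE category_id = ANY ('{3,7}')
                |       AND starts_at BETWEEN now()
                |                         AND now() + interval '10 days'
                |       AND to_tsvector('english', description) @@
                |           websearch_to_tsquery('english', 'live music');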
        
             | kayodelycaon wrote:
             | Oh cool. I was right and wrong. Thanks!
        
         | fizx wrote:
         | Hi, I started an Elasticsearch hosting company, since sold, and
         | have built products on PG's search and SQLite FTS search.
         | 
         | There are in my mind two reasons to not use PG's search.
         | 
         | 1. Elasticsearch allows you to build sophisticated linguistic
         | and feature scoring pipelines to optimize your search quality.
         | This is not a typical use case in PG.
         | 
         | 2. Your primary database is usually your scaling bottleneck
         | even without adding a relatively expensive search workload into
         | the mix. A full-text search tends to be around as expensive as
         | a 5% table scan of the related table. Most DBAs don't like
         | large scan workloads.
        
           | busymom0 wrote:
            | Regarding point 2, my implementation was to have a duplicate
            | database server where all the search queries are sent. This
            | would ensure that the search wouldn't slow down the main
            | database. And most of the replication would happen quickly
            | enough that the search results were almost up to date.
        
           | kurko wrote:
           | I agree with reason 1, but reason 2 is an answer for, "should
           | I use PG search in the same PG instance I already have", and
           | that's a different discussion. You can set up a replica for
           | that.
        
           | quietbritishjim wrote:
           | Do you see any particular reasons to use or not use ZomboDB
            | [1]? It claims to let you use ElasticSearch from PG
            | seamlessly, e.g. it manages coherency of which results ought
           | to be returned according to the current transaction. (I've
           | never quite ended up needing to use ES but it's always seemed
           | to me I'd be likely to need ZomboDB if I did.)
           | 
           | [1] https://github.com/zombodb/zombodb
        
             | mixmastamyk wrote:
             | You can do anything, anything at all, at https://zombo.com/
             | 
             |  _" The only limit, is yourself..."_
        
               | FpUser wrote:
               | This one is my favorite for many many years. Good for
               | relaxing.
        
               | habibur wrote:
               | It's still around?
        
               | mumblemumble wrote:
               | That seems like one for the philosophers. If you
               | completely rewrite a Flash site in HTML5, but it looks
               | the same and has the same URL, is it still the same site?
        
               | da_chicken wrote:
               | So, YouTube?
        
               | mixmastamyk wrote:
              | Click the link.
        
             | Conlectus wrote:
             | Though I have only a recreational interest in datastores, I
             | would be pretty wary of a service that claims to strap a
              | Consistent-Unavailable database (Postgres) to an
             | Inconsistent-Available database (ElasticSearch) in a useful
             | way.
             | 
             | Doing so would require ElasticSearch to reach consensus on
             | every read/write, which would remove most of the point of a
             | distributed cluster. Despite this, ZomboDB's documentation
             | says "complex aggregate queries can be answered in parallel
             | across your ElasticSearch cluster".
             | 
             | They also claim that transactions will abort if
             | ElasticSearch runs into network trouble, but the
             | ElasticSearch documentation notes that writes during
             | network partitions don't wait for confirmation of
             | success[1], so I'm not sure how they would be able to
             | detect that.
             | 
             | In short: I'll wait for the Jepsen analysis.
             | 
             | [1] https://www.elastic.co/blog/tracking-in-sync-shard-
             | copies#:~...
        
               | zombodb wrote:
               | > Doing so would require ElasticSearch to reach consensus
               | on every read/write
               | 
               | ZomboDB only requires that ES have a view of its index
               | that's consistent with the active Postgres transaction
               | snapshot. ZDB handles this by ensuring that the ES index
               | is fully refreshed after writes.
               | 
               | This doesn't necessarily make ZDB great for high-update
               | loads, but that's not ZDB's target usage.
               | 
               | > They also claim that transactions will abort if
               | ElasticSearch runs into network trouble...
               | 
               | I had to search my own repo to see where I make this
               | claim. I don't. I do note that network failures between
               | PG & ES will cause the active Postgres xact to abort. On
               | top of that, any error that ES is capable of reporting
               | back to the client will cause the PG xact to abort --
               | ensuring consistency between the two.
               | 
               | Because the ES index is properly refreshed as it relates
               | to the active Postgres transaction, all of ES' aggregate
               | search functions are capable of providing proper MVCC-
               | correct results, using the parallelism provided by the ES
               | cluster.
               | 
               | I don't have the time to detail everything that ZDB does
               | to project Postgres xact snapshots on top of ES, but the
               | above two points are the easy ones to solve.
        
               | Conlectus wrote:
               | I suppose I must have misinterpreted this line
               | 
               | > As such, any sort of failure either with ZomboDB
               | itself, between Postgres and Elasticsearch (network
               | layer), or within Elasticsearch will cause the operating
               | Postgres transaction to ABORT. [1]
               | 
               | In my defense, there is a fairly important distinction
               | between "any error that ES is capable of reporting back"
               | and "any sort of failure within Elasticsearch".
               | 
                | That said, I trust that you're more familiar with
               | consistency levels than me, so I'll bow out here.
               | 
               | [1] https://www.zombodb.com/documentation/things-to-know/
        
               | zombodb wrote:
               | I can definitely reword that if it's confusing.
        
               | mrslave wrote:
               | ZomboDB scratches an itch in a way I find fascinating,
               | though I have yet to do more with it than shoehorn it
               | into a prototype that was a bit square-peg-in-a-round-
                | hole. It is in my catalogue of technologies I hope to
                | exploit someday. And I hope you're enjoying the work and
               | making some $$$ too.
               | 
               | While I'm here might I ask, are you finding the hosted
               | PostgreSQL services (AWS, Azure, etc.) growing or
               | shrinking your market opportunities? Also, does it play
               | nice with Citus?
        
               | zombodb wrote:
               | I don't pay any attention to what the cloud providers are
               | doing. I have no control over them so..... eh.
               | 
               | ZDB is still alive and well but I have a real job now
               | too.
        
           | NortySpock wrote:
           | Regarding point 2: Shouldn't you be moving your search
           | queries from your transaction server to a separate analysis
           | or read-replica server? OLTP copies to OLAP and suddenly
           | you've separated these two problems.
        
             | blowski wrote:
             | ...and created a new one if the projection is only
             | eventually consistent. No free lunches here.
        
               | zepolen wrote:
               | How is that different from running Postgres and
               | Elasticsearch separately?
        
               | blowski wrote:
               | It's not.
               | 
                | Either way, it's the same split between OLAP and OLTP
                | databases, which is what the parent post suggested.
        
               | justinclift wrote:
               | Probably skill set of staff?
               | 
               | For example, if a place has developed fairly good
               | knowledge of PG already, they can "just" (!) continue
               | developing their PG knowledge.
               | 
               | Adding ES into the mix though, introduces a whole new
               | thing that needs to be learned, optimised, etc.
        
           | pmarreck wrote:
           | > sophisticated linguistic and feature scoring pipelines to
           | optimize your search quality
           | 
           | You CAN score results using setweight, although it's likely
           | not as sophisticated as Elasticsearch's
           | 
           | https://www.postgresql.org/docs/9.1/textsearch-controls.html
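            | 
            | For example, a rough sketch (assuming a hypothetical
            | "articles" table; the names here are made up, and you'd
            | want coalesce() around columns that can be NULL):
            | 
            |   -- Weight the title ('A') above the body ('B'), then
            |   -- rank and sort the matches.
            |   SELECT id, title,
            |          ts_rank(doc, query) AS rank
            |   FROM (SELECT id, title,
            |                setweight(to_tsvector('english', title), 'A')
            |                || setweight(to_tsvector('english', body), 'B')
            |                  AS doc
            |         FROM articles) a,
            |        to_tsquery('english', 'postgres & search') query
            |   WHERE doc @@ query
            |   ORDER BY rank DESC
            |   LIMIT 20;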
           | 
           | Disclaimer: I use Postgres fulltext search in production,
           | very happy with it although maintaining the various triggers
           | and stored procs it requires to work becomes cumbersome
           | whenever you have to write a migration that alters any of
           | them (or that in fact touches any related field, as you may
           | be required to drop and recreate all of the parts in order
           | not to violate referential integrity)
           | 
            | It is certainly nice not having to worry about one
            | additional dependency when deploying, though.
        
       | thom wrote:
       | We get really nice results with gist indexes (gist_trgm_ops)
       | searching across multiple entity types to do top X queries. It's
       | very useful to be able to make a stab at a difficult-to-spell
       | foreign football player's name, possibly with lots of diacritics,
       | and get quick results back. I'm always surprised when I find a
       | search engine on any site that is so unkind as to make you spell
       | things exactly.
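        | 
        | Something along these lines (a sketch assuming a hypothetical
        | "players" table and the pg_trgm extension):
        | 
        |   CREATE EXTENSION IF NOT EXISTS pg_trgm;
        | 
        |   -- A GiST trigram index supports the similarity (%) and
        |   -- distance (<->) operators.
        |   CREATE INDEX players_name_trgm_idx
        |       ON players USING gist (name gist_trgm_ops);
        | 
        |   -- A misspelled query still surfaces the closest names.
        |   SELECT name, similarity(name, 'ibrahimovich') AS score
        |   FROM players
        |   WHERE name % 'ibrahimovich'       -- above similarity threshold
        |   ORDER BY name <-> 'ibrahimovich'  -- nearest match first
        |   LIMIT 10;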
        
       | nuker wrote:
        | Is there an alternative to ES that scales nicely? I'm running an
        | ELK stack for logging using AWS Elasticsearch. Logs have
        | unpredictable traffic volume, and even an overprovisioned ES
        | cluster gets clogged sometimes. I wonder if there is something
        | more scalable than ES that also has a nice GUI like Kibana.
        
         | shard972 wrote:
         | loki
        
         | jillesvangurp wrote:
         | It's more a matter of configuring it right. I'd recommend
         | trying out Elastic Cloud. It's a bit easier to deal with than
         | Amazon's offering and much better supported. AWS has always
         | been a bit hands-off on that front. Their opensearch project
         | does not seem to break that pattern so far.
         | 
         | Also, with Elastic Cloud you get some access to useful features
         | for logging (like life cycle management and data streams) that
         | will help you scale the setup.
         | 
         | Kibana in recent iterations has actually improved quite a bit.
         | The version you are getting from Amazon is probably a bit bare
         | bones in comparison. One nice thing with Elastic is that going
         | with the defaults gets you some useful dashboards out of the
         | box if you use e.g. file or docker beats for collecting logs.
        
           | nuker wrote:
            | Thanks mate. I did try Elastic before settling on AWS ES;
            | it was slower somehow. As for Kibana features, I'm happy
            | with bare bones :)
           | 
           | > you use e.g. file or docker beats for collecting logs.
           | 
           | My setup is custom app reading from CW Logs.
           | 
            | But can an Elastic cluster scale automatically on indexing
            | latency spikes, so my app's writes do not time out? If yes,
            | then how, please?
        
             | jillesvangurp wrote:
             | Autoscaling is of course something they can do:
             | https://www.elastic.co/guide/en/cloud/current/ec-
             | autoscaling...
             | 
             | If you use life cycle management and data streams (which
             | AWS doesn't have), you'd be able to control the sizes of
             | your hot indices (i.e. the ones you write to). Basically
              | keeping your hot indices small helps keep things fast.
             | If you have issues with app writes spiking, I'd use some
             | queuing solution in between. Plenty of solutions for that.
             | 
             | The rest is just a matter of configuring things right in
             | terms of number of shards and setting up properly.
             | Basically, you get what you pay for in the end.
        
         | hnarn wrote:
         | I don't know what features AWS provides but in general terms
          | logs benefit a lot from compression, so if I were to set this
          | up on my own I'd probably want to try something like a VDO or
         | ZFS backed storage system as well as compressed transfers
         | (perhaps in batch if that's required).
        
           | nuker wrote:
           | > something like a VDO or ZFS backed storage system
           | 
           | And what text search engine?
        
             | hnarn wrote:
             | I have no idea, but I'm assuming that regardless of which
             | one you use there will be disk level duplication that could
             | be taken care of to alleviate uncontrolled growth somewhat.
        
       | kureikain wrote:
       | I used Postgres full-text search for mail log feature on my email
       | forward app https://hanami.run
       | 
        | Essentially it allows arbitrary queries on from/to/subject/body.
        | One thing that makes full-text search work great for me is that
        | I don't need to sort or rank the relevance of results. I just
        | show a list of emails that match the query, ordered by their id.
        | 
        | I also don't do pagination or counting; instead users have to
        | load more pages, and the ID of the last email is passed to the
        | query as a point of comparison (where id < requests.get.before).
        | 
        | And with that strategy, full-text search works great for us,
        | since we don't really want to bring in Elasticsearch when only
        | about 20% of users use this feature.
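        | 
        | A rough sketch of that query shape (table and column names are
        | made up for illustration, assuming a precomputed tsvector
        | column over the from/to/subject/body fields):
        | 
        |   -- No ts_rank, no OFFSET, no count: order by id and use the
        |   -- last id seen on the previous page as the cursor.
        |   SELECT id, sender, recipient, subject
        |   FROM mail_logs
        |   WHERE search_vector @@
        |         websearch_to_tsquery('english', 'invoice acme')
        |     AND id < 123456   -- the "before" id sent by the client
        |   ORDER BY id DESC
        |   LIMIT 50;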
        
       | bityard wrote:
       | I know Postgres and SQLite have mostly different purposes but
       | FWIW, SQLite also has a surprisingly capable full-text search
       | extension built right in: https://www.sqlite.org/fts5.html
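        | 
        | For reference, a tiny sketch of standard FTS5 usage (the table
        | name is made up):
        | 
        |   -- Create an FTS5 virtual table and query it with MATCH;
        |   -- ORDER BY rank sorts best matches first (bm25 scoring).
        |   CREATE VIRTUAL TABLE notes USING fts5(title, body);
        | 
        |   INSERT INTO notes (title, body)
        |   VALUES ('Postgres FTS', 'A search engine in a database');
        | 
        |   SELECT title
        |   FROM notes
        |   WHERE notes MATCH 'search engine'
        |   ORDER BY rank
        |   LIMIT 10;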
        
         | jjice wrote:
          | It's very impressive, especially considering the SQLite build
          | you're already using probably has it enabled. I use it for a
          | small site I run and it works fantastically. It's a little
          | finicky with deletes and updates due to virtual tables in
          | SQLite, but definitely impressive and has its uses.
        
       | SigmundA wrote:
       | Keep wondering if RUM Indexes [1] will ever get merged for faster
       | and better ranking (TF/IDF). Really would make PG a much more
       | complete text search engine.
       | 
       | https://github.com/postgrespro/rum
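        | 
        | It's already usable as an out-of-core extension; roughly what
        | that looks like (from memory of the project's examples, with
        | made-up table/column names):
        | 
        |   CREATE EXTENSION rum;
        | 
        |   -- RUM stores extra positional information, so ordering by
        |   -- relevance can be served from the index itself.
        |   CREATE INDEX docs_tsv_rum_idx
        |       ON docs USING rum (tsv rum_tsvector_ops);
        | 
        |   SELECT id, title
        |   FROM docs
        |   WHERE tsv @@ to_tsquery('english', 'postgres & search')
        |   ORDER BY tsv <=> to_tsquery('english', 'postgres & search')
        |   LIMIT 10;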
        
       ___________________________________________________________________
       (page generated 2021-07-28 19:02 UTC)