[HN Gopher] Show HN: Full text search Project Gutenberg (60m par...
       ___________________________________________________________________
        
       Show HN: Full text search Project Gutenberg (60m paragraphs)
        
       Author : gutensearch
       Score  : 183 points
       Date   : 2021-01-24 04:58 UTC (18 hours ago)
        
 (HTM) web link (gutensearch.com)
 (TXT) w3m dump (gutensearch.com)
        
       | ilaksh wrote:
        | I am sure Postgres full-text search is pretty good by now,
        | but I always expect projects like these to eventually,
        | reluctantly, switch over to Elasticsearch in order to handle
        | 10-20% of the queries better.
        
       | snikolaev wrote:
        | Hi. Nice project! I don't know why there's no BM25 (or at least
        | SOME TF-IDF implementation) in Postgres FTS, but if you decide
        | you need it (along with support for more languages, highlighting,
        | and lower response times) ping us at contact@manticoresearch.com
        | and we'll help you integrate your Postgres dataset with
        | Manticore Search. 60M docs shouldn't be a problem at all (it
        | should take about an hour to build an index) and you'll get proper
        | ranking and nice highlighting with just a few lines of new code.
       | Here's an interactive course about indexing from mysql
       | https://play.manticoresearch.com/mysql/ , but with postgres it's
       | the same.
        
         | gutensearch wrote:
         | Thank you for your generous offer of help! I look forward to
         | taking it up (may take a while as I'm about to move countries
         | and quarantine).
         | 
         | In particular I love that one of the examples in your comment
         | history is in Latin as that language is not currently supported
         | by Postgres FTS. Are Latin and Ancient Greek supported by
         | Manticore? (dare I hope for Anglo Saxon...)
        
           | snikolaev wrote:
            | In terms of advanced NLP (stemming, lemmatization, stopwords,
            | wordforms) - no. In terms of general tokenization - I've
            | never dealt with Latin and Ancient Greek characters (if
            | there are specific characters for those languages), but even
            | if they are not supported by default it's not a problem to
            | add them in the config (https://mnt.cr/charset_table)
        
             | yorwba wrote:
             | For the character mappings, it might be useful to have a
             | look at the config for https://tatoeba.org (or rather, the
              | PHP script that generates the config):
              | https://github.com/Tatoeba/tatoeba2/blob/dev/src/Shell/Sphin...
             | 
             | There's one big list of mappings for almost every script
             | under the sun, including Greek. (With mappings like
              | 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL
              | ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI],
             | and the same for seven other accented alphas. I've
             | considered turning them all into unaccented alpha instead,
             | but I don't know enough about Greek orthography to decide
              | that.)
              | https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
             | 
             | For Latin, there are some special exceptions so that "GAIVS
             | IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the
              | same:
              | https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
             | 
             | It's not beautiful, but it's used in production. People who
             | don't need to support quite as many languages as Tatoeba
             | will probably want a simpler config, but it might still be
             | useful as a reference.
        
       | kristopolous wrote:
        | "What is a cynic?", the famous line from Lady Windermere's Fan,
        | comes up empty. You can find it here:
       | https://www.gutenberg.org/files/790/790-h/790-h.htm
       | 
       | The next line I can find though. How are you parsing?
        
         | gutensearch wrote:
         | Thanks for noticing!
         | 
         | To be more specific, "what is a" are stop words and "cynic" is
         | very common, so a lot of rows are returned (see my other
         | comment). ts_rank takes too long to rank them, and the server
         | times out, leaving you with the previous query's table because
         | I didn't take the time to program a correct response to this
         | issue. "Cecil Graham. What is a cynic?" returns Lady
         | Windermere's Fan almost instantly.
         | 
         | The workarounds I've thought of would be to cache these queries
         | (assuming I've seen them before, and after I've set up
         | logging), buy a larger server, or pay Second Quadrant to speed
         | up ts_rank... I'd love any suggestions from more experienced
         | Postgres engineers!
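          | 
          | One low-tech version of the caching idea, as a sketch (the
          | table and column names here are hypothetical, not from the
          | project):
          | 
          |     -- cache pre-ranked results for queries known to be slow
          |     create table if not exists slow_query_cache (
          |         language   text        not null,
          |         query_text text        not null,
          |         results    jsonb       not null,  -- serialized ranked rows
          |         cached_at  timestamptz not null default now(),
          |         primary key (language, query_text)
          |     );
          | 
          |     -- on a cache miss, run the expensive ranked query once, then:
          |     insert into slow_query_cache (language, query_text, results)
          |     values (%s, %s, %s)
          |     on conflict (language, query_text) do update
          |         set results = excluded.results, cached_at = now();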
        
           | gutensearch wrote:
            | Edit to your edit, re parsing. The subset of rows returned
            | follows:
            | 
            |     where language = %s::regconfig
            |       and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
            | 
            | and relevance is determined by:
            | 
            |     ts_rank_cd(textsearchable_index_col,
            |                phraseto_tsquery(%s::regconfig, %s), 32)
            | 
            | with the %s being language and paragraph respectively.
        
             | talolard wrote:
             | I have a few suggestions regarding search performance.
             | 
             | As others have mentioned, there isn't one "right answer"
             | but there are a few nifty tricks you can use to boost
             | performance.
             | 
              | You could try a trigram index[0] as the primary search
              | index. You'd lose stemming, but gain language support and
              | indexed regular-expression search, which is a nice
              | power-user feature.
             | 
              | Looking at your where clause, did you index language and
              | textsearchable_index_col together? If not, you are giving
              | the planner the opportunity to skip your full text index.
             | 
              | If you order by ts_rank_cd then the query will sort all
              | rows, including those that don't match. A better pattern is
              | to take the ranking score as a column and then sort it in a
              | subquery. [1] from Stack Overflow has an example. (As an
              | aside, from Postgres 12, CTEs are no longer an optimization
              | fence, so you can write the query out more cleanly with a
              | CTE and still get the desired performance.)
             | 
              | You should experiment with GIN vs GiST indexes. GIN indexes
              | are larger, so on a large dataset the index might not fit in
              | memory. You could pay for more memory, but it's worth
              | trying a GiST index to see if that makes things faster
              | simply because it fits in memory.
             | 
             | A final frontend comment, I'm a fan of infinite scroll for
             | this kind of stuff. You already have pagination effectively
             | set up, you could spoil me as a user with infinite scroll.
             | react-virtualized[2] is a great library for that.
             | 
             | Hope that helps.
             | 
              | [0] https://www.postgresql.org/docs/9.6/pgtrgm.html
              | [1] https://dba.stackexchange.com/questions/4054/ts-rank-massive...
              | [2] https://github.com/bvaughn/react-virtualized
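              | 
              | Pulling the trigram and filter-then-rank suggestions above
              | together, a sketch (the table and column names follow those
              | quoted elsewhere in the thread; the LIMIT values are
              | arbitrary, and capping the candidate set trades recall for
              | speed):
              | 
              |     -- optional: trigram index for indexed LIKE/regex search
              |     create extension if not exists pg_trgm;
              |     create index paragraphs_trgm_idx on gutenberg.paragraphs
              |         using gin (paragraph gin_trgm_ops);
              | 
              |     -- filter first, rank only the survivors
              |     with matches as (
              |         select num, paragraph, textsearchable_index_col
              |         from gutenberg.paragraphs
              |         where language = %s::regconfig
              |           and textsearchable_index_col @@
              |               phraseto_tsquery(%s::regconfig, %s)
              |         limit 10000  -- cap before the expensive ranking step
              |     )
              |     select num, paragraph,
              |            ts_rank_cd(textsearchable_index_col,
              |                       phraseto_tsquery(%s::regconfig, %s), 32) as rank
              |     from matches
              |     order by rank desc
              |     limit 20;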
        
               | gutensearch wrote:
               | Thanks for the suggestions! I will try them next.
               | 
               | > Looking at your where clause, did you index language
               | and textsearchable_index_col together ? If not, you are
               | giving the planner the opportunity to skip your full text
               | index.
               | 
                | Here is the index:
                | 
                |     update gutenberg.paragraphs
                |     set textsearchable_index_col =
                |         to_tsvector(b.cfgname::regconfig, coalesce(paragraph, ' '))
                |     from gutenberg.all_data a
                |     inner join pg_ts_config b on lower(a.language) = b.cfgname
                |     where gutenberg.paragraphs.num = a.num;
                | 
                |     create index textsearch_paragraph_idx on gutenberg.paragraphs
                |         using gin (textsearchable_index_col);
               | 
                | > If you order by ts_rank_cd then the query will sort all
                | rows, including those that don't match. A better pattern
                | is to take the ranking score as a column and then sort it
                | in a subquery. [1] from Stack Overflow has an example.
                | (As an aside, from Postgres 12, CTEs are no longer an
                | optimization fence, so you can write the query out more
                | cleanly with a CTE and still get the desired
                | performance.)
               | 
                | Ranking was already done as a column in a CTE and ranked
                | further down:
                | 
                |     with paragraphs as (
                |         select
                |             num
                |             , paragraph
                |             , ts_rank_cd(textsearchable_index_col,
                |                          phraseto_tsquery(%s::regconfig, %s), 32) as rank
                |             , ts_headline(%s, paragraph,
                |                           phraseto_tsquery(%s::regconfig, %s),
                |                           'MaxFragments=1000, StartSel=**, StopSel=**')
                |               as highlighted_result
                |         from gutenberg.paragraphs
                |         where language = %s::regconfig
                |           and textsearchable_index_col @@
                |               phraseto_tsquery(%s::regconfig, %s)
                |     )
               | 
               | I did rewrite the query as per your [1] but EXPLAIN
               | ANALYZE found no difference.
               | 
                | Pagination is a tricky one. I was in a rush and hit an
                | issue where building large Dash tables was very expensive
                | and prone to crashing. I initially set up pagination as
                | per the Dash examples, but that didn't play nice with
                | markdown, which I needed for the URLs and highlighting
                | (giving these up for plain-text tables made it HN-fast,
                | but it confused my trial users). So the quick and dirty
                | solution "to ship now" was to set up custom pagination in
                | the SQL and show the results.
               | 
               | I think that if there is enough interest in this project,
               | the front end will have to move off Dash.
        
             | kristopolous wrote:
             | full-text searching is contentious, it's a rabbit-hole of
             | holy-wars.
             | 
             | I don't believe I'm going to summon cranks with the
             | following:
             | 
              | There are more specialized tools for your use case than
              | Postgres, and you should be looking into "n-gram indexing".
              | 
              | Lucene-based systems such as Elasticsearch are quite
              | popular; there are also Sphinx and Xapian, which are fairly
              | widespread.
              | 
              | You need to read the documentation and configure them; they
              | are flexible and need to be tuned to this use case.
             | 
              | In the end, there is no "correct" way to do things. For
              | instance, sometimes stemming words is the right way to go.
              | But if you are, say, building a medical system where two
              | very different medicines could be spelled similarly and
              | stem to the same word, a mistake could lead to death, so
              | no, stemming is very bad there.
              | 
              | Sometimes stripping stop words is the way to go, but if
              | people are looking for quotes, such as "to be or not to
              | be", well, now you have the empty string. Splendid.
             | 
             | So yeah, configure configure configure.
             | 
             | This may bring out the cranks:
             | 
              | If you want to roll your own, NoSQL systems like Redis,
              | Mongo, or Couch seem to work really well (I've rolled my
              | own in all three on separate occasions).
              | 
              | I guarantee there are advanced features of Maria and
              | Postgres that aren't widely used, and some people reading
              | this will confidently claim they are superior, but I assure
              | you, that is a minority opinion. Most people go with the
              | other options.
             | 
              | If you ever doubt it, ask the actual developers of Postgres
              | or Maria on chat. They are extremely nice and way more
              | honest about the limitations than their hardcore fans are.
              | The databases are under constant development with advanced
              | features and you'll learn a lot (really, they are both
              | super chill dev communities, impressively so).
             | 
              | Perhaps your solution (as mine has been) is a hybrid. You
              | can store existence, for instance, in one system and the
              | offsets and chunks in another, so you get a parallelizable,
              | asynchronous worker pipeline; it's impressively fast when
              | you horizontally scale it: <0.5 sec for multiple terabytes
              | of text (and yes, I'm talking NVMe / hundreds of GB of RAM
              | per node / >=10Gb network). I've legitimately just run
              | random queries to marvel at the speed.
              | 
              | I really hope I save myself from the rock throwing.
        
               | thaumaturgy wrote:
               | > _For instance, sometimes stemming words is the right
               | way to go, but if you are say doing a medical system
               | where two very different medicines could be spelled
               | similar and stem to the same word..._
               | 
               | FWIW lemmatization may be a good alternative to stemming.
               | Stemming is algorithmic and can generate errors, as you
               | point out; "caring" for example might naively be stemmed
                | to "car". Lemmatization uses a dictionary of senses and
                | their root words to avoid this. For common English,
                | there's Princeton's WordNet
                | (https://wordnet.princeton.edu/). Lemmatizing technical
                | niches, like medicine, would require an additional
                | dictionary.
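                | 
                | For illustration, a dictionary-based lemma lookup can be
                | as simple as a mapping table (a toy SQL sketch; the table
                | and its entries are made up, not from WordNet):
                | 
                |     create table lemma_map (form text primary key,
                |                             lemma text not null);
                |     insert into lemma_map values
                |         ('caring', 'care'), ('cars', 'car');
                | 
                |     -- look up each token's lemma, falling back to
                |     -- the token itself
                |     select t.token, coalesce(m.lemma, t.token) as lemma
                |     from unnest(array['caring', 'cars', 'carpet']) as t(token)
                |     left join lemma_map m on m.form = t.token;
                |     -- 'caring' -> 'care', 'cars' -> 'car',
                |     -- 'carpet' -> 'carpet'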
        
               | gutensearch wrote:
               | Thanks! I really appreciate the pointers. I already had
               | planned to explore some of these and you've expanded and
               | directed the search nicely.
        
               | yorwba wrote:
               | > there's also sphinx and xapian, also fairly widespread.
               | 
               | Sphinx is now Manticore, and as luck has it, a Manticore
               | dev is in this thread, offering support:
               | https://news.ycombinator.com/item?id=25890828
        
       | gebt wrote:
        | What a useful thing! It would be much better if it were possible
        | to pass the search query using GET requests,
        | 
        | e.g. https://gutensearch.com/?q=avicenna
        
         | gutensearch wrote:
          | An API is already on my roadmap! I couldn't quite figure out
          | the state of passing query parameters in the URL with Dash,
          | though.
        
           | gebt wrote:
            | Having an API is good, but it's not quite what I mean.
            | 
            | I'm not familiar with Dash, but in the worst case I think you
            | can add a route to nginx (an additional app) to pass GET
            | parameters to the app.
        
       | DerWOK wrote:
        | Hm. I cannot do language selection on iOS Safari. Not even with
        | the mentioned workaround of first typing the language name
        | prefix.
        
       | jjt-yn_t wrote:
        | I changed the default search to 'wherefore' and it came up with
        | results for 'it was the best of times', which were already there
        | when the page loaded. Twice. So there's a bug somewhere.
        
       | Ninjinka wrote:
       | I enjoy being able to grep the 200,000 books I have downloaded as
       | part of the Pile that was on HN a while back. It allowed me to do
       | things like "show me all the paragraphs where 'Edmund Burke' and
        | 'patriotism' appear close to each other." This seems to be a
        | similar thing, just with fewer books.
        
       | rahimnathwani wrote:
       | This is really cool. Something like this should exist.
       | 
       | It seems like you could do it more easily, include all recent
       | additions, and have faster search responses:
       | 
       | 1. Mirror the current gutenberg archive (e.g. rsync -av --del
       | aleph.gutenberg.org::gutenberg gutenberg)
       | 
       | 2. Install recoll-webui from
       | https://www.lesbonscomptes.com/recoll/pages/recoll-webui-ins...
       | or using docker-recoll-webui: https://github.com/sunde41/recoll
       | 
       | 3. Run the recoll indexer
       | 
       | 4. Each week, repeat steps #1 and #3
        
       | loughnane wrote:
       | I love this. So often when I search for passages I get bombarded
       | with links to quote websites.
       | 
       | This is much closer to what I'm looking for.
       | 
       | EDIT: and it's going to be open sourced? I love it
        
       | imaginenore wrote:
       | I searched for "it was a dark and stormy night", and it didn't
       | find the origin (1830 "Paul Clifford" novel).
        
       | tornato7 wrote:
       | I'd love to see this dataset used as a performance and relevance
       | benchmark for different search engines!
        
         | gutensearch wrote:
         | That was definitely part of the original plan! I spotted two
         | other attempts [1] [2] here using BERT and ElasticSearch
         | respectively.
         | 
          | The main performance issue with the Postgres FTS approach
          | (possibly also with the others?) is ranking. Matching uses
          | the index, but ts_rank cannot.
         | 
         | Most of the time, few results are returned and the front end
         | gets its answer in ~300ms including formatting the text for the
         | front end (~20ms without).
         | 
         | However, a reasonably common sentence will return tens or
         | hundreds of thousands of rows, which takes a minute or more to
         | get ranked. In production, this could be worked around by
         | tracking and caching such queries if they are common enough.
         | 
         | I'd love to hear from anyone experienced with the other options
         | (Lucene, Solr, ElasticSearch, etc.) whether and how they get
         | around this.
         | 
         | [1] https://news.ycombinator.com/item?id=19095963
         | 
         | [2] https://news.ycombinator.com/item?id=6562126 (the link does
         | not load for me)
        
           | ngrilly wrote:
           | I suggest to have a look at
           | https://github.com/postgrespro/rum if you haven't yet. It
           | solves the issue of slow ranking in PostgreSQL FTS.
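            | 
            | Per that extension's README, a RUM index stores extra
            | positional information and supports ordering by rank through
            | a distance operator, so ranking can use the index. A sketch
            | (the index name is made up; the column names follow those
            | quoted elsewhere in the thread):
            | 
            |     create extension rum;
            |     create index paragraphs_rum_idx on gutenberg.paragraphs
            |         using rum (textsearchable_index_col rum_tsvector_ops);
            | 
            |     select num, paragraph
            |     from gutenberg.paragraphs
            |     where textsearchable_index_col @@
            |           phraseto_tsquery('english', 'fat rat')
            |     order by textsearchable_index_col <=>
            |           phraseto_tsquery('english', 'fat rat')
            |     limit 20;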
        
           | karterk wrote:
           | What kind of hardware are you using to host the Postgres
           | instance?
        
             | gutensearch wrote:
             | Same place as the app: a Start-2-M-SSD from online.net in
             | their AMS1 DC (Amsterdam).
             | 
              | Subset of sudo lshw -short:
              | 
              |     Class          Description
              |     ======================================================
              |     processor      Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
              |     memory         16GiB System Memory
              |     disk           256GB Micron_1100_MTFD
        
       | gerardnll wrote:
       | I cannot choose another language, some weird css/html issue
       | happens that hides all other fields. Looks like an
       | overflow:hidden hides the possible selections.
        
         | gutensearch wrote:
         | Thanks for the bug report!
         | 
          | It has to do (I think) with Dash's columnar layout, which
          | unfolds the menu over the next few columns, at least in Chrome.
         | 
         | The quick workaround I found was to type out the language until
         | it appears below and click on it or finish typing, then press
         | enter. This should select it.
         | 
         | I'd love to hear from other Dash developers who've had, and
         | solved this issue.
        
       | _Microft wrote:
       | Oh, so Project Gutenberg is still a thing? I used to use it until
       | they blocked German users (context: they were asked to prevent
       | access to certain pieces for German users but they decided to go
       | nuclear instead).
       | 
       | Nowadays I just go to LibGen when I want to have a look into a
       | book. That LibGen doesn't limit itself to works in the public
       | domain is rather a feature ;)
        
         | IndySun wrote:
         | What's the German story?
        
           | robin_reala wrote:
            | Work that's PD in the US but not in Germany. German
            | rights-holders took PG to court (or rather, a poor German
            | sysadmin for PG); the court demanded PG pull the titles for
            | Germany; PG refused, saying that it was down to the users to
            | determine PD status for their home country; and because of
            | HTTPS the German court's only option was to block PG at the
            | domain level.
        
             | _Microft wrote:
             | Project Gutenberg themselves decided to block all access
             | from Germany instead of just the items in question.
             | 
              | "On February 9 2018, the Court issued a judgment granting
             | essentially most of the Plaintiff's demands. The Court did
             | not order that the 18 items no longer be made available by
             | Project Gutenberg, and instead wrote that it is sufficient
             | to instead make them no longer accessible to German
             | Internet (IP) addresses.
             | 
             | PGLAF complied with the Court's order on February 28, 2018
             | by blocking all access to www.gutenberg.org and sub-pages
             | to all of Germany.", from
             | https://cand.pglaf.org/germany/index.html
        
         | gutensearch wrote:
         | And this is why the server is in Amsterdam, even though I have
         | had good experiences with Hetzner in the past.
         | 
         | I was quite sad to read about the case in 2018, and it is
         | unfortunate that it is still not resolved.
        
           | SahAssar wrote:
            | Hetzner has locations in Finland too, if that would work.
        
       | abhayhegde wrote:
       | Seems like a nice project. However, I am currently receiving
       | PR_CONNECT_RESET_ERROR on Firefox. Bringing it to your attention.
        
       | Kaknut wrote:
        | This is indeed very useful. But I think there are some bugs:
        | e.g., when I click on the dropdown to choose a language, it
        | breaks (malfunctions).
        
         | gutensearch wrote:
         | Thanks!
         | 
         | There's a workaround to the dropdown issue, see this other
         | comment thread: https://news.ycombinator.com/item?id=25890458
        
       | tkgally wrote:
       | This is a very useful tool. I've known people doing research on
       | the evolution of grammar, vocabulary, literary style, etc. who
       | use only small subsets of the Project Gutenberg data. I'm sure
       | they would appreciate being able to search the entire corpus.
       | 
       | The corpus-search functions those researchers use include
       | wildcards, exact-phrase specification with quotation marks,
       | proximity searches, and Boolean search strings. When you have a
       | chance, you might want to add a list of the syntax formats that
       | currently work. (I tried using * as a wildcard in a phrase
       | surrounded by quotation marks, and it didn't seem to work.)
       | 
       | One small improvement you could make would be to widen the
       | "Search terms" field so that longer search strings are visible.
        
         | andai wrote:
         | Do you know what tools they're using for that?
        
         | gutensearch wrote:
          | Thank you for taking the time to lay out feature requests in
          | such detail! I really appreciate it.
         | 
         | The current search box is a wrapper around Postgres
         | phraseto_tsquery [1] whilst the Discovery tab uses
         | plainto_tsquery, so you could play with either as an ersatz for
         | some of these features for now, although special characters
         | might get stripped or parsed incorrectly.
         | 
         | Do you know where the people you are talking about hang out
         | online (for example, subreddits)? I'd love to get in touch with
         | them once the features are built and for more general feedback.
         | 
         | [1] https://www.postgresql.org/docs/12/textsearch-controls.html
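          | 
          | For reference, the difference between the two (example strings
          | in the style of the Postgres docs):
          | 
          |     -- phraseto_tsquery keeps word order and adjacency:
          |     select phraseto_tsquery('english', 'The Fat Rats');
          |     -- => 'fat' <-> 'rat'
          | 
          |     -- plainto_tsquery just ANDs the lexemes, order-insensitive:
          |     select plainto_tsquery('english', 'The Fat Rats');
          |     -- => 'fat' & 'rat'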
        
           | StavrosK wrote:
           | I'm curious if you ever tried MeiliSearch for this, I tried
           | it recently for something unrelated and had a very good
           | experience with it. Since you have the corpus already, it
           | might be worth trying and seeing if it speeds search up? I'd
           | be interested in the results either way.
        
             | gutensearch wrote:
              | Thank you for the suggestion! The demo looks great and I'm
              | curious to see how they automatically handle language. It
              | would be nice to add support for Chinese, which is the big
              | omission from this attempt, even at the cost of several
              | other languages, which in any case have few books
              | transcribed to text in Gutenberg.
             | 
             | I will probably write a blog post once I've tried a few of
             | the approaches suggested in this thread.
        
               | StavrosK wrote:
               | That would be great, thanks!
        
           | iiv wrote:
           | I know /r/CompLing on Reddit is quite popular.
        
         | Jabbles wrote:
         | Surely someone doing serious research would download the entire
         | corpus and use grep (or equivalent)?
         | 
          | It wouldn't have the same nice interface and searches may take
          | several seconds, but there are only 60,000 books...
        
       | nestorD wrote:
        | Great! It is not far from one of my dream projects: digitising
        | as many old texts as possible and letting historians do research
        | on them using state-of-the-art tools (that work across synonyms
        | and languages), with parameters to restrict by time of
        | publication and location, obviously.
        
         | toomuchtodo wrote:
         | Have you considered working with the Internet Archive on this
         | across their corpus? They are open to such work being done. And
         | if some of the material you need isn't in the archive, let's
         | get it in there.
        
           | nestorD wrote:
           | I have not but I am going to file the idea, it would indeed
           | be a good starting point.
        
       | JensRantil wrote:
        | Nice project! You might want to ask a designer to give it a look;
        | the UI is a little cluttered.
        
         | gutensearch wrote:
         | Thanks! I would _love_ for an experienced designer to join the
         | project (and DevOps, and front end, and...). My email is in the
         | profile.
        
       ___________________________________________________________________
       (page generated 2021-01-24 23:02 UTC)