[HN Gopher] Show HN: Full text search Project Gutenberg (60m par...
___________________________________________________________________
Show HN: Full text search Project Gutenberg (60m paragraphs)
Author : gutensearch
Score : 183 points
Date : 2021-01-24 04:58 UTC (18 hours ago)
(HTM) web link (gutensearch.com)
(TXT) w3m dump (gutensearch.com)
| ilaksh wrote:
| I am sure the Postgres full text search is pretty good by now.
| But I always expect projects like these to eventually reluctantly
| switch over to Elasticsearch in order to handle 10-20% of the
| queries better.
| snikolaev wrote:
| Hi. Nice project! I don't know why there's no BM25 (or at least
| SOME TF-IDF implementation) in Postgres FTS, but if you decide
| you need it (plus more language support, highlighting, and lower
| response times), ping us at contact@manticoresearch.com and
| we'll help you integrate your Postgres dataset with Manticore
| Search. 60M docs shouldn't be a problem at all (it should take
| about an hour to build an index) and you'll get proper ranking
| and nice highlighting with just a few lines of new code. Here's
| an interactive course about indexing from MySQL:
| https://play.manticoresearch.com/mysql/ , but with Postgres it's
| the same.
| gutensearch wrote:
| Thank you for your generous offer of help! I look forward to
| taking you up on it (it may take a while, as I'm about to
| move countries and quarantine).
|
| In particular I love that one of the examples in your comment
| history is in Latin as that language is not currently supported
| by Postgres FTS. Are Latin and Ancient Greek supported by
| Manticore? (dare I hope for Anglo Saxon...)
| snikolaev wrote:
| In terms of advanced NLP (stemming, lemmatization, stopwords,
| wordforms) - no. In terms of general tokenization - I've
| never dealt with Latin and Ancient Greek characters (if
| there are specific characters for those languages), but even
| if they are not supported by default, it's not a problem to
| add them in the config (https://mnt.cr/charset_table)
| yorwba wrote:
| For the character mappings, it might be useful to have a
| look at the config for https://tatoeba.org (or rather, the
| PHP script that generates the config): https://github.com/T
| atoeba/tatoeba2/blob/dev/src/Shell/Sphin...
|
| There's one big list of mappings for almost every script
| under the sun, including Greek. (With mappings like
| 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 A [CAPITAL
| ALPHA WITH PSILI] into U+1F00 a [SMALL ALPHA WITH PSILI],
| and the same for seven other accented alphas. I've
| considered turning them all into unaccented alpha instead,
| but I don't know enough about Greek orthography to decide
| that.) https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad
| 2939c691...
|
| For Latin, there are some special exceptions so that "GAIVS
| IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the
| same: https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2
| 939c691...
|
| It's not beautiful, but it's used in production. People who
| don't need to support quite as many languages as Tatoeba
| will probably want a simpler config, but it might still be
| useful as a reference.
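|
| For concreteness, the Greek mapping quoted above would look
| roughly like this in a Manticore/Sphinx config (a sketch only;
| the index name and the extra ranges are placeholders):
|
|     index paragraphs
|     {
|         # source, path and other required settings omitted
|         charset_table = 0..9, a..z, A..Z->a..z, U+0370..U+03FF, \
|             U+1F08..U+1F0F->U+1F00..U+1F07
|     }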
| kristopolous wrote:
| "What is a cynic" the famous line from Lady Windmere's Fan comes
| up empty. You can find it here:
| https://www.gutenberg.org/files/790/790-h/790-h.htm
|
| The next line I can find though. How are you parsing?
| gutensearch wrote:
| Thanks for noticing!
|
| To be more specific, "what is a" are stop words and "cynic" is
| very common, so a lot of rows are returned (see my other
| comment). ts_rank takes too long to rank them, and the server
| times out, leaving you with the previous query's table because
| I didn't take the time to program a correct response to this
| issue. "Cecil Graham. What is a cynic?" returns Lady
| Windermere's Fan almost instantly.
|
| The workarounds I've thought of would be to cache these queries
| (assuming I've seen them before, and after I've set up
| logging), buy a larger server, or pay Second Quadrant to speed
| up ts_rank... I'd love any suggestions from more experienced
| Postgres engineers!
| gutensearch wrote:
| Edit to your edit, re parsing. The subset of rows returned
| follows:
|
|     where language = %s::regconfig
|       and textsearchable_index_col @@
|           phraseto_tsquery(%s::regconfig, %s)
|
| and relevance is determined by:
|
|     ts_rank_cd(textsearchable_index_col,
|                phraseto_tsquery(%s::regconfig, %s), 32)
|
| with the %s being language and paragraph respectively.
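|
| You can see the stop-word behaviour directly in psql; a quick
| illustration (output from memory, so approximate):
|
|     SELECT phraseto_tsquery('english', 'What is a cynic');
|     --  'cynic'
|     -- "what", "is" and "a" are dropped as stop words, so the
|     -- query degenerates to the single, very common term
|     -- 'cynic', and ranking the huge result set is what times
|     -- out.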
| talolard wrote:
| I have a few suggestions regarding search performance.
|
| As others have mentioned, there isn't one "right answer"
| but there are a few nifty tricks you can use to boost
| performance.
|
| You could try a trigram index[0] as the primary search
| index. You'd lose stemming, but gain language support and
| indexed regular-expression search, which is a nice power-user
| feature.
|
| Looking at your where clause, did you index language and
| textsearchable_index_col together? If not, you are giving
| the planner the opportunity to skip your full-text index.
|
| If you order by ts_rank_cd, the query will sort all rows,
| including those that don't match. A better pattern is to
| compute the ranking score as a column and then sort it in a
| subquery; [1] from Stack Overflow has an example. (As an
| aside, since PostgreSQL 12, CTEs are no longer an
| optimization fence, so you can write the query out more
| cleanly with a CTE and still get the desired performance.)
|
| You should experiment with GIN vs GiST indices. GIN indexes
| are larger, so on a large dataset the index might not fit in
| memory. You could pay for more memory, but it is worth trying
| a GiST index to see whether it makes things faster simply
| because it fits in memory.
|
| A final frontend comment: I'm a fan of infinite scroll for
| this kind of thing. You already have pagination effectively
| set up; you could spoil me as a user with infinite scroll.
| react-virtualized[2] is a great library for that.
|
| Hope that helps.
|
| [0] https://www.postgresql.org/docs/9.6/pgtrgm.html [1]
| https://dba.stackexchange.com/questions/4054/ts-rank-
| massive... [2] https://github.com/bvaughn/react-virtualized
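|
| Roughly what I mean by the trigram and subquery suggestions,
| using the table and column names you posted elsewhere in the
| thread (untested sketches):
|
|     -- trigram index: no stemming, but language-agnostic and
|     -- able to serve indexed ILIKE / regex matching
|     CREATE EXTENSION IF NOT EXISTS pg_trgm;
|     CREATE INDEX paragraph_trgm_idx
|         ON gutenberg.paragraphs
|         USING gin (paragraph gin_trgm_ops);
|
|     -- rank-in-a-subquery pattern: rank only the matching rows
|     SELECT *
|     FROM (
|         SELECT num, paragraph,
|                ts_rank_cd(textsearchable_index_col,
|                           phraseto_tsquery('english', 'cynic'),
|                           32) AS rank
|         FROM gutenberg.paragraphs
|         WHERE textsearchable_index_col @@
|               phraseto_tsquery('english', 'cynic')
|     ) AS matches
|     ORDER BY rank DESC
|     LIMIT 20;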
| gutensearch wrote:
| Thanks for the suggestions! I will try them next.
|
| > Looking at your where clause, did you index language
| and textsearchable_index_col together? If not, you are giving
| the planner the opportunity to skip your full-text index.
|
| Here is the index:
|
|     update gutenberg.paragraphs
|     set textsearchable_index_col =
|         to_tsvector(b.cfgname::regconfig,
|                     coalesce(paragraph, ' '))
|     from gutenberg.all_data a
|     inner join pg_ts_config b on lower(a.language) = b.cfgname
|     where gutenberg.paragraphs.num = a.num;
|
|     create index textsearch_paragraph_idx
|         on gutenberg.paragraphs
|         using gin (textsearchable_index_col);
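|
| One thing I could try, if the planner is skipping the language
| filter, is a per-language partial index. An untested sketch,
| assuming the language column really is a regconfig as the
| where clause suggests:
|
|     create index textsearch_paragraph_en_idx
|         on gutenberg.paragraphs
|         using gin (textsearchable_index_col)
|         where language = 'english'::regconfig;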
|
| > If you order by ts_rank_cd, the query will sort all rows,
| including those that don't match. A better pattern is to
| compute the ranking score as a column and then sort it in a
| subquery; [1] from Stack Overflow has an example. (As an
| aside, since PostgreSQL 12, CTEs are no longer an
| optimization fence, so you can write the query out more
| cleanly with a CTE and still get the desired performance.)
|
| Ranking was already done as a column in a CTE and ranked
| further down:
|
|     with paragraphs as (
|         select num
|              , paragraph
|              , ts_rank_cd(textsearchable_index_col,
|                           phraseto_tsquery(%s::regconfig, %s),
|                           32) as rank
|              , ts_headline(%s, paragraph,
|                            phraseto_tsquery(%s::regconfig, %s),
|                            'MaxFragments=1000, StartSel=**,
|                            StopSel=**') as highlighted_result
|         from gutenberg.paragraphs
|         where language = %s::regconfig
|           and textsearchable_index_col @@
|               phraseto_tsquery(%s::regconfig, %s)
|     )
|
| I did rewrite the query as per your [1] but EXPLAIN
| ANALYZE found no difference.
|
| Pagination is a tricky one. I was in a rush and hit an
| issue where building large Dash tables was very expensive
| and prone to crashing. I initially set up pagination as
| per the Dash examples, but that didn't play nicely with
| the markdown I needed for the URLs and highlighting
| (giving these up for plain-text tables made it HN-fast,
| but it confused my trial users). So the quick and dirty
| solution "to ship now" was to set up custom pagination in
| the SQL and show the results.
|
| I think that if there is enough interest in this project,
| the front end will have to move off Dash.
| kristopolous wrote:
| Full-text search is contentious; it's a rabbit hole of
| holy wars.
|
| I don't believe I'm going to summon cranks with the
| following:
|
| There are more specialized tools for your use case than
| Postgres, and you should be looking into "n-gram indexing".
|
| Lucene-based systems such as Elasticsearch are quite
| popular; there are also Sphinx and Xapian, which are fairly
| widespread.
|
| You need to read the documentation and configure them; they
| are flexible and need to be tuned to this use case.
|
| In the end, there is no "correct" way to do things. For
| instance, sometimes stemming words is the right way to go,
| but if you are, say, building a medical system where two
| very different medicines could be spelled similarly and
| stem to the same word, a mistake could lead to death, so
| no, stemming is very bad there.
|
| Sometimes stop words are the way to go, but if people are
| looking for quotes such as "to be or not to be", well, now
| you have the empty string. Splendid.
|
| So yeah, configure configure configure.
|
| This may bring out the cranks:
|
| If you want to roll your own, NoSQL systems like Redis,
| Mongo, or Couch seem to work really well (I've rolled my
| own in all three on separate occasions).
|
| I guarantee there are advanced features of Maria and
| Postgres that aren't widely used, and some people reading
| this will confidently claim they are superior, but I assure
| you, that is a minority opinion. Most people go with the
| other options.
|
| If you ever doubt it, ask the actual developers of Postgres
| or Maria on chat. They are extremely nice and way more
| honest about the limitations than their hardcore fans are.
| The databases are under constant development with advanced
| features and you'll learn a lot (really, they are both
| super chill dev communities, impressively so).
|
| Perhaps your solution (as mine has been) is a hybrid. You
| can store the existence, for instance, in one system and
| the offsets and chunks in another, so you get a
| parallelizable asynchronous worker pipeline; it's
| impressively fast when you scale it horizontally. <0.5 sec
| for multiple terabytes of text (and yes, I'm talking NVMe,
| hundreds of GB of RAM per node, >=10Gb network). I've
| legitimately just run random queries to marvel at the speed.
|
| I really hope I save myself from the rock throwing.
| thaumaturgy wrote:
| > _For instance, sometimes stemming words is the right
| way to go, but if you are say doing a medical system
| where two very different medicines could be spelled
| similar and stem to the same word..._
|
| FWIW lemmatization may be a good alternative to stemming.
| Stemming is algorithmic and can generate errors, as you
| point out; "caring" for example might naively be stemmed
| to "car". Lemmatization uses a dictionary of senses and
| their root words to avoid this. For common English,
| there's Princeton's Wordnet
| (https://wordnet.princeton.edu/). Lemmatizing technical
| niches, like medicine, would require an additional
| dictionary.
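|
| You can see both behaviours in Postgres's snowball stemmer;
| a quick illustration (outputs approximate):
|
|     SELECT to_tsvector('english', 'caring');
|     --  'care':1   (snowball gets this one right)
|
|     SELECT to_tsvector('english',
|                        'university universal universe');
|     --  'univers':1,2,3   (three distinct words collapse
|     --  into one term, the kind of conflation that worries
|     --  you)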
| gutensearch wrote:
| Thanks! I really appreciate the pointers. I already had
| planned to explore some of these and you've expanded and
| directed the search nicely.
| yorwba wrote:
| > there's also sphinx and xapian, also fairly widespread.
|
| Sphinx is now Manticore, and as luck has it, a Manticore
| dev is in this thread, offering support:
| https://news.ycombinator.com/item?id=25890828
| gebt wrote:
| What a useful thing! It would be much better if it were
| possible to pass the search query using GET requests.
|
| e.g. https://gutensearch.com/?q=avicenna
| gutensearch wrote:
| An API is already on my roadmap! I couldn't quite figure out
| the state of passing query parameters in the URL with Dash,
| though.
| gebt wrote:
| Having an API is good, but it's not quite what I mean.
|
| I'm not familiar with Dash, but in the worst case I think
| you can add a route to nginx (an additional app) to pass
| GET parameters to the app.
| DerWOK wrote:
| Hm. I cannot do language selection on iOS Safari. Not even
| with the mentioned workaround of first typing the language
| name prefix.
| jjt-yn_t wrote:
| I changed the default search to 'wherefore' and it came back
| with the results for 'it was the best of times' that were there
| when the page loaded. Twice. So there's a bug somewhere.
| Ninjinka wrote:
| I enjoy being able to grep the 200,000 books I downloaded as
| part of the Pile that was on HN a while back. It allowed me to
| do things like "show me all the paragraphs where 'Edmund
| Burke' and 'patriotism' appear close to each other." This
| seems to be a similar thing, just with fewer books.
| rahimnathwani wrote:
| This is really cool. Something like this should exist.
|
| It seems like you could do it more easily, include all recent
| additions, and have faster search responses:
|
| 1. Mirror the current gutenberg archive (e.g. rsync -av --del
| aleph.gutenberg.org::gutenberg gutenberg)
|
| 2. Install recoll-webui from
| https://www.lesbonscomptes.com/recoll/pages/recoll-webui-ins...
| or using docker-recoll-webui: https://github.com/sunde41/recoll
|
| 3. Run the recoll indexer
|
| 4. Each week, repeat steps #1 and #3
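|
| For step 4, a rough sketch (assuming recollindex, the recoll
| command-line indexer, is on PATH; paths adjusted to taste):
|
|     #!/bin/sh
|     # drop in /etc/cron.weekly/gutenberg-reindex (executable)
|     # step 1: refresh the mirror, then step 3: reindex
|     rsync -av --del aleph.gutenberg.org::gutenberg gutenberg
|     recollindex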
| loughnane wrote:
| I love this. So often when I search for passages I get bombarded
| with links to quote websites.
|
| This is much closer to what I'm looking for.
|
| EDIT: and it's going to be open sourced? I love it
| imaginenore wrote:
| I searched for "it was a dark and stormy night", and it didn't
| find the origin (1830 "Paul Clifford" novel).
| tornato7 wrote:
| I'd love to see this dataset used as a performance and relevance
| benchmark for different search engines!
| gutensearch wrote:
| That was definitely part of the original plan! I spotted two
| other attempts [1] [2] here using BERT and ElasticSearch
| respectively.
|
| The main performance issue with the Postgres FTS approach
| (possibly also with the others?) is ranking. Matching results
| uses the index, but ts_rank cannot use it.
|
| Most of the time, few results are returned and the front end
| gets its answer in ~300ms including formatting the text for the
| front end (~20ms without).
|
| However, a reasonably common sentence will return tens or
| hundreds of thousands of rows, which takes a minute or more to
| get ranked. In production, this could be worked around by
| tracking and caching such queries if they are common enough.
|
| I'd love to hear from anyone experienced with the other options
| (Lucene, Solr, ElasticSearch, etc.) whether and how they get
| around this.
|
| [1] https://news.ycombinator.com/item?id=19095963
|
| [2] https://news.ycombinator.com/item?id=6562126 (the link does
| not load for me)
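|
| One stopgap I've considered (not deployed, and it trades
| recall for latency) is to cap the candidate set before
| ranking:
|
|     with candidates as (
|         select num, paragraph, textsearchable_index_col
|         from gutenberg.paragraphs
|         where textsearchable_index_col @@
|               phraseto_tsquery('english', 'cynic')
|         limit 50000
|     )
|     select num, paragraph,
|            ts_rank_cd(textsearchable_index_col,
|                       phraseto_tsquery('english', 'cynic'),
|                       32) as rank
|     from candidates
|     order by rank desc
|     limit 20;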
| ngrilly wrote:
| I suggest having a look at
| https://github.com/postgrespro/rum if you haven't yet. It
| addresses the issue of slow ranking in PostgreSQL FTS.
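|
| Roughly, following the examples in the rum README (untested
| here; the index name is a placeholder):
|
|     CREATE EXTENSION rum;
|     CREATE INDEX paragraph_rum_idx
|         ON gutenberg.paragraphs
|         USING rum (textsearchable_index_col rum_tsvector_ops);
|
|     -- ranking can then use the index via the distance
|     -- operator
|     SELECT num, paragraph
|     FROM gutenberg.paragraphs
|     WHERE textsearchable_index_col @@
|           phraseto_tsquery('english', 'cynic')
|     ORDER BY textsearchable_index_col <=>
|              phraseto_tsquery('english', 'cynic')
|     LIMIT 20;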
| karterk wrote:
| What kind of hardware are you using to host the Postgres
| instance?
| gutensearch wrote:
| Same place as the app: a Start-2-M-SSD from online.net in
| their AMS1 DC (Amsterdam).
|
| Subset of sudo lshw --short:
|
|     Class       Description
|     ======================================================
|     processor   Intel(R) Atom(TM) CPU C2750 @ 2.40GHz
|     memory      16GiB System Memory
|     disk        256GB Micron_1100_MTFD
| gerardnll wrote:
| I cannot choose another language; some weird CSS/HTML issue
| hides all the other fields. It looks like an overflow:hidden
| is hiding the possible selections.
| gutensearch wrote:
| Thanks for the bug report!
|
| It has to do (I think) with Dash's columnar layout, which
| unfolds the menu over the next few columns, at least in
| Chrome.
|
| The quick workaround I found is to type out the language until
| it appears below and click on it, or finish typing and press
| enter. This should select it.
|
| I'd love to hear from other Dash developers who've had, and
| solved this issue.
| _Microft wrote:
| Oh, so Project Gutenberg is still a thing? I used to use it until
| they blocked German users (context: they were asked to prevent
| access to certain pieces for German users but they decided to go
| nuclear instead).
|
| Nowadays I just go to LibGen when I want to have a look
| inside a book. That LibGen doesn't limit itself to works in
| the public domain is rather a feature ;)
| IndySun wrote:
| What's the German story?
| robin_reala wrote:
| Work that's PD in the US but not in Germany. German
| rights-holders took PG to court (or rather, a poor German
| sysadmin for PG), the court demanded PG pull the titles for
| Germany, PG refused, saying that it was down to the users to
| determine PD status for their home country, and because of
| HTTPS the German court's only option was to block PG at the
| domain level.
| _Microft wrote:
| Project Gutenberg themselves decided to block all access
| from Germany instead of just the items in question.
|
| "On February 9, 2018, the Court issued a judgment granting
| essentially most of the Plaintiff's demands. The Court did
| not order that the 18 items no longer be made available by
| Project Gutenberg, and instead wrote that it is sufficient
| to instead make them no longer accessible to German
| Internet (IP) addresses.
|
| PGLAF complied with the Court's order on February 28, 2018
| by blocking all access to www.gutenberg.org and sub-pages
| to all of Germany.", from
| https://cand.pglaf.org/germany/index.html
| gutensearch wrote:
| And this is why the server is in Amsterdam, even though I have
| had good experiences with Hetzner in the past.
|
| I was quite sad to read about the case in 2018, and it is
| unfortunate that it is still not resolved.
| SahAssar wrote:
| Hetzner has locations in Finland too, if that would work.
| abhayhegde wrote:
| Seems like a nice project. However, I am currently receiving
| PR_CONNECT_RESET_ERROR on Firefox. Bringing it to your attention.
| Kaknut wrote:
| This is indeed very useful. But I think there are some bugs;
| for example, when I click on the dropdown to choose a
| language, it breaks (malfunctions).
| gutensearch wrote:
| Thanks!
|
| There's a workaround to the dropdown issue, see this other
| comment thread: https://news.ycombinator.com/item?id=25890458
| tkgally wrote:
| This is a very useful tool. I've known people doing research on
| the evolution of grammar, vocabulary, literary style, etc. who
| use only small subsets of the Project Gutenberg data. I'm sure
| they would appreciate being able to search the entire corpus.
|
| The corpus-search functions those researchers use include
| wildcards, exact-phrase specification with quotation marks,
| proximity searches, and Boolean search strings. When you have a
| chance, you might want to add a list of the syntax formats that
| currently work. (I tried using * as a wildcard in a phrase
| surrounded by quotation marks, and it didn't seem to work.)
|
| One small improvement you could make would be to widen the
| "Search terms" field so that longer search strings are visible.
| andai wrote:
| Do you know what tools they're using for that?
| gutensearch wrote:
| Thank you for taking the time to lay out feature requests in
| such detail! I really appreciate it.
|
| The current search box is a wrapper around Postgres
| phraseto_tsquery [1] whilst the Discovery tab uses
| plainto_tsquery, so you could play with either as an ersatz for
| some of these features for now, although special characters
| might get stripped or parsed incorrectly.
|
| Do you know where the people you are talking about hang out
| online (for example, subreddits)? I'd love to get in touch with
| them once the features are built and for more general feedback.
|
| [1] https://www.postgresql.org/docs/12/textsearch-controls.html
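|
| The difference between the two is easy to see in psql, using
| the documented examples:
|
|     SELECT phraseto_tsquery('english', 'The Fat Rats');
|     --  'fat' <-> 'rat'  (terms must be adjacent, in order)
|
|     SELECT plainto_tsquery('english', 'The Fat Rats');
|     --  'fat' & 'rat'    (terms may appear anywhere)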
| StavrosK wrote:
| I'm curious whether you ever tried MeiliSearch for this. I
| tried it recently for something unrelated and had a very good
| experience with it. Since you already have the corpus, it
| might be worth trying it to see if it speeds up search. I'd
| be interested in the results either way.
| gutensearch wrote:
| Thank you for the suggestion! The demo looks great and I'm
| curious to see how they automatically handle language. It
| would be nice to add support for Chinese, which is
| conspicuously absent from this attempt, even at the cost of
| several other languages which in any case have few books
| transcribed to text in Gutenberg.
|
| I will probably write a blog post once I've tried a few of
| the approaches suggested in this thread.
| StavrosK wrote:
| That would be great, thanks!
| iiv wrote:
| I know /r/CompLing on Reddit is quite popular.
| Jabbles wrote:
| Surely someone doing serious research would download the entire
| corpus and use grep (or equivalent)?
|
| It wouldn't have the same nice interface and searches might
| take several seconds, but there are only 60,000 books...
| nestorD wrote:
| Great! It is not far from one of my dream projects: digitising
| as many old texts as possible and letting historians do
| research on them using state-of-the-art tools (that work
| across synonyms and languages), with parameters to restrict by
| time of publication and location, obviously.
| toomuchtodo wrote:
| Have you considered working with the Internet Archive on this
| across their corpus? They are open to such work being done. And
| if some of the material you need isn't in the archive, let's
| get it in there.
| nestorD wrote:
| I have not, but I am going to file the idea away; it would
| indeed be a good starting point.
| JensRantil wrote:
| Nice project! You might want to ask a designer to give it a
| look; the UI is a little cluttered.
| gutensearch wrote:
| Thanks! I would _love_ for an experienced designer to join the
| project (and DevOps, and front end, and...). My email is in the
| profile.
___________________________________________________________________
(page generated 2021-01-24 23:02 UTC)