[HN Gopher] PruningRadixTrie - Faster Radix trie for prefix sear...
       ___________________________________________________________________
        
       PruningRadixTrie - Faster Radix trie for prefix search and auto-
       complete
        
       Author : g0xA52A2A
       Score  : 157 points
       Date   : 2023-09-30 12:33 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ParallelThread wrote:
        | What is the state of the art in longest prefix matching, e.g.
        | for router prefixes in routing tables?
        
       | vvrm wrote:
       | The fine-grained results look like:
       | 
       | - 1444x faster for single character prefixes
       | 
       | - 252x faster for two character prefixes
       | 
       | - 55x faster for three character prefixes
       | 
       | - ~20x faster for 4 and 5 character prefixes
       | 
       | - <= 5x faster for longer prefixes
       | 
       | I used to work on a production auto-complete system operating at
       | over 100k peak QPS. For prefixes of length one and two we would
       | not even bother hitting the server, just from a quality
       | perspective, not because of latency/throughput considerations.
       | Btw, up until 3 characters, you could store everything in an in-
        | memory hash map. A 20x speedup on length-4 and length-5 prefixes
        | is still very impressive, but it's not the headline 1000x either.
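        | 
        | For illustration, a minimal sketch of that short-prefix shortcut
        | (Python, with hypothetical data and names): precompute the top-k
        | completions for every prefix up to length 3 into a plain dict,
        | and fall back to the trie only for longer prefixes.
        | 
        |   from collections import defaultdict
        | 
        |   def build_short_prefix_index(terms, max_len=3, k=10):
        |       # terms: iterable of (term, score) pairs
        |       index = defaultdict(list)
        |       for term, score in terms:
        |           for n in range(1, min(max_len, len(term)) + 1):
        |               index[term[:n]].append((score, term))
        |       # keep only the k best completions per prefix
        |       return {p: [t for _, t in sorted(v, reverse=True)[:k]]
        |               for p, v in index.items()}
        | 
        |   idx = build_short_prefix_index(
        |       [("the", 90), ("this", 70), ("there", 50)])
        |   print(idx["th"])  # ['the', 'this', 'there']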
        
         | jjice wrote:
          | I also worked on a production autocomplete feature for a web
          | app a while ago, and I couldn't agree more with the quality
         | sentiment. One or two characters is almost never enough to give
         | a meaningful result. Using history or similar user search is
         | much more effective than trying to guess what someone meant by
         | "th".
        
           | Kalabasa wrote:
           | How does that fare for non-English queries? E.g. are two
           | chars still not enough for Chinese languages?
        
             | meepmorp wrote:
             | I've seen 1.5 chars/token used as a rule of thumb for
             | estimating token counts in Chinese text.
        
           | ComputerGuru wrote:
           | > One or two characters is almost never enough to give a
           | meaningful result.
           | 
           | TFA acknowledges that and mentions the exception:
           | 
           |  _While a prefix of length=1 is not very useful for the Latin
           | alphabet, it does make sense for CJK languages_
        
       | somat wrote:
       | In the game "keep talking and nobody explodes" one of the puzzles
       | has you match a limited set of characters in each position to a
        | dictionary of possible words - a task I was having a lot of
        | trouble doing in the limited time provided. I ended up rewriting
        | the dictionary as a prefix tree and including it as an addendum
        | to the manual.
       | 
        | It was neat seeing these algorithms work at human scale and how
        | much better they made the process.
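        | 
        | Roughly the same idea at toy scale (a Python sketch, not the
        | actual addendum): build a prefix tree from the word list, then
        | walk it position by position, only following branches whose
        | character is allowed at that position.
        | 
        |   def build_trie(words):
        |       root = {}
        |       for w in words:
        |           node = root
        |           for ch in w:
        |               node = node.setdefault(ch, {})
        |           node["$"] = w  # mark a complete word
        |       return root
        | 
        |   def match(trie, allowed):
        |       # allowed: one set of candidate characters per position
        |       hits, stack = [], [(trie, 0)]
        |       while stack:
        |           node, i = stack.pop()
        |           if i == len(allowed) and "$" in node:
        |               hits.append(node["$"])
        |           elif i < len(allowed):
        |               for ch in allowed[i]:
        |                   if ch in node:
        |                       stack.append((node[ch], i + 1))
        |       return hits
        | 
        |   words = ["about", "after", "again", "below"]
        |   print(match(build_trie(words),
        |               [set("ab"), set("be"), set("go"),
        |                set("au"), set("tw")]))  # ['about']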
        
       | cirwin wrote:
        | I wrote something like this for contact auto-completion -
        | running in the browser, we had to ensure both that tree
        | construction was fast and that completion was instantaneous. So
        | the next level of optimization is to amortize tree construction
        | over searches (using the observation that most of the tree is
        | never accessed).
       | 
       | https://github.com/superhuman/trie-ing
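        | 
        | A sketch of the amortization idea (Python, not the library's
        | actual code): nodes start as flat buckets of entries and are
        | only split into children when a search actually descends into
        | them, so you only pay construction cost for the parts of the
        | tree that get used.
        | 
        |   class LazyNode:
        |       BUCKET_LIMIT = 8
        | 
        |       def __init__(self, entries):
        |           self.entries = entries  # unexpanded (word, score)
        |           self.children = None    # built on demand
        | 
        |       def _expand(self, depth):
        |           buckets = {}
        |           for word, score in self.entries:
        |               key = word[depth] if depth < len(word) else ""
        |               buckets.setdefault(key, []).append((word, score))
        |           self.children = {k: LazyNode(v)
        |                            for k, v in buckets.items()}
        | 
        |       def search(self, prefix, depth=0):
        |           if self.children is None:
        |               if (len(self.entries) <= self.BUCKET_LIMIT
        |                       or depth >= len(prefix)):
        |                   # small bucket: just scan it directly
        |                   return [w for w, _ in self.entries
        |                           if w.startswith(prefix)]
        |               self._expand(depth)  # pay the cost only here
        |           if depth >= len(prefix):
        |               out = []
        |               for child in self.children.values():
        |                   out += child.search(prefix, depth + 1)
        |               return out
        |           child = self.children.get(prefix[depth])
        |           return (child.search(prefix, depth + 1)
        |                   if child else [])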
        
         | rockwotj wrote:
          | If you're interested in a TypeScript fork of this that also
         | supports deletion, see here: https://github.com/shortwave/trie
         | 
         | There are also a couple of bug fixes in there, for example:
         | https://github.com/shortwave/trie/commit/1e7045d89cc20011251...
        
         | leeoniya wrote:
          | Interesting. I made something much more stupid, but brutally
         | effective: https://github.com/leeoniya/uFuzzy
         | 
          | I guess if you wanted results ordered by contact frequency you
          | could just keep the original haystack sorted by frequency, and
          | would get the Richard match first. Or keep the frequency in an
         | adjacent lookup and sort/pick the result afterwards.
         | 
          | PSA: don't use the ultra-popular Fuse.js for fuzzy matching;
          | it's super slow and has pretty terrible match quality. E.g.:
         | https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...
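          | 
          | For the frequency-ordering idea, a small sketch (Python,
          | hypothetical names): keep the frequencies in a separate map
          | and sort whatever the matcher returns by it afterwards.
          | 
          |   haystack = ["Rich Harris", "Richard Feynman"]
          |   freq = {"Richard Feynman": 42, "Rich Harris": 7}
          | 
          |   def complete(needle, k=10):
          |       hits = [s for s in haystack
          |               if needle.lower() in s.lower()]
          |       hits.sort(key=lambda s: freq.get(s, 0), reverse=True)
          |       return hits[:k]
          | 
          |   print(complete("rich"))  # Feynman first, by frequency
          | 
          | The same effect falls out of pre-sorting the haystack by
          | frequency, as suggested above.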
        
       | a1o wrote:
        | This is interesting; I am curious whether it's possible to
        | return more than 10 results and how that would look.
        | 
        | Also whether it makes sense to use something like this for code
        | autocomplete - I would prefer to be able to just scroll to find
        | everything.
        | 
        | I think, though, that in the case of code, parsing the code and
        | adding things to the dictionary may be more expensive than
        | looking through it.
        
       | tgv wrote:
       | Or put a cache in front.
        
         | TOMDM wrote:
         | This is a data structure for implementing dictionary lookup,
         | the sort of thing you'd use to make a cache...
        
           | tgv wrote:
           | I don't think a trie makes a good cache.
        
       | DennisL123 wrote:
       | Nice piece of engineering and surely worth a read. The ideas are
       | hardly novel, tho.
        
       | ot wrote:
        | It's 1000x compared to a very naive baseline: exhaustive
        | traversal of the matching subtree. Nobody in practice would do
        | that; storing the maximum score for each subtree so you can
        | prune is a very common technique, and definitely not novel here.
        | 
        | Even doing a Levenshtein-bounded beam search along the trie is
        | pretty much common practice; see, for example, this paper [1]
        | from 2011.
        | 
        | There are more sophisticated, space-efficient ways of doing
        | this; see, for example, this paper [2] I co-authored in 2013.
       | 
       | [1] https://www.microsoft.com/en-us/research/wp-
       | content/uploads/...
       | 
       | [2] https://www.microsoft.com/en-us/research/wp-
       | content/uploads/...
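        | 
        | A sketch of that pruning technique (Python, not the repo's C#
        | implementation): every node stores the maximum score found
        | anywhere in its subtree, and a top-k search skips any subtree
        | whose maximum cannot beat the current k-th best result.
        | 
        |   import heapq
        | 
        |   class Node:
        |       def __init__(self):
        |           self.children = {}
        |           self.score = None   # score if a term ends here
        |           self.max_score = 0  # best score in this subtree
        | 
        |   class ScoredTrie:
        |       def __init__(self):
        |           self.root = Node()
        | 
        |       def insert(self, term, score):
        |           node = self.root
        |           node.max_score = max(node.max_score, score)
        |           for ch in term:
        |               node = node.children.setdefault(ch, Node())
        |               node.max_score = max(node.max_score, score)
        |           node.score = score
        | 
        |       def top_k(self, prefix, k=10):
        |           node = self.root
        |           for ch in prefix:  # walk down to the prefix node
        |               node = node.children.get(ch)
        |               if node is None:
        |                   return []
        |           heap = []  # min-heap of (score, term), size <= k
        |           stack = [(node, prefix)]
        |           while stack:
        |               node, term = stack.pop()
        |               # prune: nothing below can beat the k-th best
        |               if (len(heap) == k
        |                       and node.max_score <= heap[0][0]):
        |                   continue
        |               if node.score is not None:
        |                   heapq.heappush(heap, (node.score, term))
        |                   if len(heap) > k:
        |                       heapq.heappop(heap)
        |               for ch, child in node.children.items():
        |                   stack.append((child, term + ch))
        |           return [t for _, t in sorted(heap, reverse=True)]
        | 
        | The PruningRadixTrie goes further and keeps each node's children
        | sorted by their subtree maximum, so iteration over siblings can
        | stop as soon as one falls below the threshold.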
        
         | alexchamberlain wrote:
         | Is the `PruningRadixTrie` the same as the Completion Trie?
        
         | agencies wrote:
         | What is state of the art currently?
        
           | anonzzzies wrote:
            | And also: are there implementations to look at? Or
            | libraries/open-source DBs/search engines that use these?
        
           | rutgersnj22 wrote:
            | Here is a more recent paper where I am one of the authors:
           | http://www.vldb.org/pvldb/vol9/p828-deng.pdf
        
             | itake wrote:
             | https link:
             | 
             | https://www.vldb.org/pvldb/vol9/p828-deng.pdf
        
           | sa-code wrote:
           | Lucene's WFST is absurdly fast
        
       | xwdv wrote:
       | Inspiring, but we use Postgres full text search for all
       | autocomplete functionality and it has worked just fine.
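        | 
        | For the curious, a minimal sketch of the kind of prefix query
        | such a setup could run (Python + psycopg2; the table and column
        | names here are made up): Postgres tsquery supports ":*" prefix
        | matching.
        | 
        |   import psycopg2
        | 
        |   conn = psycopg2.connect("dbname=app")  # hypothetical DSN
        | 
        |   def autocomplete(prefix, limit=10):
        |       sql = """
        |         SELECT title FROM items
        |         WHERE search_vector @@ to_tsquery('simple', %s)
        |         ORDER BY popularity DESC
        |         LIMIT %s"""
        |       with conn.cursor() as cur:
        |           cur.execute(sql, (prefix + ":*", limit))
        |           return [row[0] for row in cur.fetchall()]
        | 
        |   print(autocomplete("rad"))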
        
         | saaspirant wrote:
          | Can you elaborate a bit on this? I am a noob at full-text
          | search, by the way, who might have to implement this at {work}
          | soon.
        
       | eatonphil wrote:
       | Part of the optimization here seems to be that you only return
       | the top 10 results. This might result in a really annoying UX
       | though if you can't page through other results.
       | 
        | For example, to tag someone on LinkedIn in a post you have to
        | type their name (this comes up when I share a blog post on
        | LinkedIn by someone I'm not connected with). But if their name
        | doesn't come up within the top 10 results, there seems to be
        | literally no way to tag them.
       | 
       | And this isn't a permissions issue about tagging people I'm not
       | connected with either. If the person I'm trying to tag who I'm
       | not connected with has an uncommon name, I can normally find them
       | in the auto complete list and tag them. If they have a common
       | name, I often can't.
       | 
       | I'm not posting this to ask for solutions, just to say that being
       | arbitrarily limited to top results can sometimes be infuriating.
        
         | gwern wrote:
         | OP includes the entire trie if you set the minimum rank to
         | -INF, so that's not an issue with the data structure but the
         | UX, I would think? The data structure _can_ show all the
          | results, it's just a matter of knowing when the user wants to
         | see more.
         | 
         | You'd need something like a button for 'show more results', or
         | perhaps use a timeout ('if no user action in 10s, assume the
          | desired result is missing & show more results') where it
         | increases the _n_ limit (say, doubling each time). Then the
         | user only pays for what they use, and the common case remains
         | accelerated.
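          | 
          | A tiny sketch of that escalation (Python, against any index
          | exposing a hypothetical top_k(prefix, k)): keep doubling the
          | limit each time the user asks for more, so the common case
          | stays cheap.
          | 
          |   def show_more(index, prefix, limit=10):
          |       # each next() fetches twice as many results
          |       while True:
          |           yield index.top_k(prefix, k=limit)
          |           limit *= 2
          | 
          |   pages = show_more(index, "rich")  # 'index' assumed above
          |   first = next(pages)  # cheap common case: top 10
          |   more = next(pages)   # user asked for more: top 20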
        
           | swader999 wrote:
           | You could use scroll event(s) to trigger more results.
        
         | gpderetta wrote:
          | It seems to me that a hybrid solution using a pruning trie for
          | prefixes and a normal trie for almost-complete queries would
          | work.
        
           | gchamonlive wrote:
            | You mean you run both in parallel, asynchronously, return
            | the top 10 results first, and by the time the user gets to
            | the point where they need the rest of the results, the
            | classical trie is done, or at least close to ready?
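            | 
            | Something like this, say (a Python sketch; pruned_top_k,
            | full_scan and render are hypothetical): fire both lookups,
            | show the pruned top 10 immediately, and keep a future
            | around for the exhaustive results.
            | 
            |   from concurrent.futures import ThreadPoolExecutor
            | 
            |   pool = ThreadPoolExecutor(max_workers=2)
            | 
            |   def autocomplete(prefix):
            |       quick = pool.submit(pruned_top_k, prefix, 10)
            |       full = pool.submit(full_scan, prefix)
            |       return quick.result(), full
            | 
            |   top10, rest = autocomplete("rich")
            |   render(top10)          # shown right away
            |   # later, when the user scrolls past page one:
            |   render(rest.result())  # likely already finished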
        
         | mgaunard wrote:
         | Not "part", this is the whole basis of the data structure.
        
         | ok123456 wrote:
         | He keeps the sorted frequency count. It doesn't have to be the
         | top 10. You can query arbitrary k.
        
       ___________________________________________________________________
       (page generated 2023-09-30 23:00 UTC)