[HN Gopher] PruningRadixTrie - Faster Radix trie for prefix sear...
___________________________________________________________________
PruningRadixTrie - Faster Radix trie for prefix search and auto-
complete
Author : g0xA52A2A
Score : 157 points
Date : 2023-09-30 12:33 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ParallelThread wrote:
| What is the state of art in longest prefix matching e.g for
| router prefixes in routing tables?
| vvrm wrote:
| The fine-grained results look like:
|
| - 1444x faster for single character prefixes
|
| - 252x faster for two character prefixes
|
| - 55x faster for three character prefixes
|
| - ~20x faster for 4 and 5 character prefixes
|
| - <= 5x faster for longer prefixes
|
| I used to work on a production auto-complete system operating at
| over 100k peak QPS. For prefixes of length one and two we would
| not even bother hitting the server, just from a quality
| perspective, not because of latency/throughput considerations.
| Btw, up until 3 characters, you could store everything in an in-
| memory hash map. 20x speedup on length 4 and 5 prefixes is still
| very impressive, but not quite 1000x speedup either.
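The in-memory hash map vvrm describes for prefixes of up to three characters can be precomputed ahead of time. A minimal Python sketch (function and parameter names are illustrative, not from any of the linked projects):

```python
from collections import defaultdict

def build_short_prefix_index(terms, max_len=3, k=10):
    """Precompute the top-k completions for every prefix up to max_len.

    terms: iterable of (term, score) pairs; higher score ranks first.
    Returns a dict mapping each short prefix to its top-k terms.
    """
    buckets = defaultdict(list)
    for term, score in terms:
        for n in range(1, min(max_len, len(term)) + 1):
            buckets[term[:n]].append((score, term))
    return {
        prefix: [t for _, t in sorted(pairs, reverse=True)[:k]]
        for prefix, pairs in buckets.items()
    }

# Serving a short-prefix query is then a single dict lookup,
# with no tree traversal at all.
index = build_short_prefix_index([("the", 50), ("them", 30), ("than", 20)])
```

Memory grows with the number of distinct short prefixes, which is small for `max_len <= 3`, so this trades a cheap precomputation pass for constant-time queries on exactly the prefixes where tries are slowest.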
| jjice wrote:
| I also worked on a production auto complete feature for a web
| app a bit ago and I couldn't agree more with the quality
| sentiment. One or two characters is almost never enough to give
| a meaningful result. Using history or similar user search is
| much more effective than trying to guess what someone meant by
| "th".
| Kalabasa wrote:
| How does that fare for non-English queries? E.g. are two
| chars still not enough for Chinese languages?
| meepmorp wrote:
| I've seen 1.5 chars/token used as a rule of thumb for
| estimating token counts in Chinese text.
| ComputerGuru wrote:
| > One or two characters is almost never enough to give a
| meaningful result.
|
| TFA acknowledges that and mentions the exception:
|
| _While a prefix of length=1 is not very useful for the Latin
| alphabet, it does make sense for CJK languages_
| somat wrote:
| In the game "keep talking and nobody explodes" one of the puzzles
| has you match a limited set of characters in each position to a
| dictionary of possible words. A task I was having a lot of
| trouble doing in the limited time provided. I ended up rewriting
 | the dictionary as a prefix tree and including it as an addendum to
| the manual.
|
 | It was neat seeing these algorithms work at human scale
| and how much better it made the process.
| cirwin wrote:
| I wrote something like this for contact auto completion - running
| in the browser we had to ensure that both tree construction was
| fast, and also that completion was instantaneous. So the next
| level of optimization is to amortize tree construction over
| searches (using the observation that most of the tree is never
| accessed)
|
| https://github.com/superhuman/trie-ing
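The amortization idea — deferring each node's construction until a search first reaches it — can be sketched as below. This is a guess at the general technique, not the actual trie-ing implementation; all names are hypothetical:

```python
class LazyNode:
    def __init__(self):
        self.pending = []     # raw (suffix, score) pairs, not yet expanded
        self.children = None  # char -> LazyNode, built on first access
        self.scores = []      # scores of terms ending at this node

    def expand(self):
        # A node pays its construction cost only when a search reaches
        # it, so untouched parts of the tree are never built.
        if self.children is None:
            self.children = {}
            for suffix, score in self.pending:
                if suffix:
                    child = self.children.setdefault(suffix[0], LazyNode())
                    child.pending.append((suffix[1:], score))
                else:
                    self.scores.append(score)
            self.pending = []
        return self.children

def lazy_trie(terms):
    root = LazyNode()
    root.pending = list(terms)
    return root

def lookup(root, term):
    node = root
    for ch in term:
        node = node.expand().get(ch)
        if node is None:
            return []
    node.expand()  # materialize scores at the final node
    return node.scores

root = lazy_trie([("car", 5), ("cat", 7), ("dog", 1)])
```

Construction is O(1) up front; each search expands only the nodes on its own path, which matches the observation that most of the tree is never accessed.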
| rockwotj wrote:
 | If you're interested in a TypeScript fork of this that also
| supports deletion, see here: https://github.com/shortwave/trie
|
| There are also a couple of bug fixes in there, for example:
| https://github.com/shortwave/trie/commit/1e7045d89cc20011251...
| leeoniya wrote:
| interesting. i made something much more stupid, but brutally
| effective: https://github.com/leeoniya/uFuzzy
|
| i guess if you wanted results ordered by contact frequency you
| can just keep the original haystack sorted by frequency, and
| would get the Richard match first. or keep the frequency in an
| adjacent lookup and sort/pick the result afterwards.
|
| PSA: don't use the ultra popular Fuse.js for fuzzy matching,
| it's super slow and has pretty terrible match quality. e.g.
| https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uF...
| a1o wrote:
| This is interesting, I am curious if it's possible to add more
| than 10 results and how it would look.
|
| Also if it makes sense to use something like this for code
| autocomplete - I would prefer I could just scroll to find
| everything.
|
 | I think, though, that in the case of code, parsing through the
 | code and adding things to the dictionary may be more expensive
 | than looking through it.
| tgv wrote:
| Or put a cache in front.
| TOMDM wrote:
| This is a data structure for implementing dictionary lookup,
| the sort of thing you'd use to make a cache...
| tgv wrote:
| I don't think a trie makes a good cache.
| DennisL123 wrote:
| Nice piece of engineering and surely worth a read. The ideas are
| hardly novel, tho.
| ot wrote:
 | It's 1000x compared to a very naive baseline: exhaustive
 | traversal of the matching subtree. Nobody in practice would do
 | that; storing the maximum score for each subtree so you can
 | prune is a very common technique, definitely not novel here.
|
| Even doing a Levenshtein-bounded beam search along the trie is
| pretty much common practice, see for example this paper [1] from
| 2011.
|
| There are more sophisticated ways of doing this in space-
| efficient ways, for example this paper [2] I co-authored in 2013.
|
| [1] https://www.microsoft.com/en-us/research/wp-
| content/uploads/...
|
| [2] https://www.microsoft.com/en-us/research/wp-
| content/uploads/...
| alexchamberlain wrote:
| Is the `PruningRadixTrie` the same as the Completion Trie?
| agencies wrote:
| What is state of the art currently?
| anonzzzies wrote:
 | And also: are there implementations to look at? Or
 | libraries/open-source dbs/search engines that use these?
| rutgersnj22 wrote:
| here is a more recent paper where I am one of the authors:
| http://www.vldb.org/pvldb/vol9/p828-deng.pdf
| itake wrote:
| https link:
|
| https://www.vldb.org/pvldb/vol9/p828-deng.pdf
| sa-code wrote:
| Lucene's WFST is absurdly fast
| xwdv wrote:
| Inspiring, but we use Postgres full text search for all
| autocomplete functionality and it has worked just fine.
| saaspirant wrote:
 | Can you elaborate a bit about this? I'm a full-text search
 | noob, by the way, who might have to implement this at {work}
 | soon.
| eatonphil wrote:
| Part of the optimization here seems to be that you only return
| the top 10 results. This might result in a really annoying UX
| though if you can't page through other results.
|
| For example to tag someone on LinkedIn in a post you have to type
| their name (this comes up when I share someone's blog post on
| LinkedIn who I'm not connected with). But if their name doesn't
| come up within the top 10 results there seems to be literally no
| way to tag them.
|
| And this isn't a permissions issue about tagging people I'm not
| connected with either. If the person I'm trying to tag who I'm
| not connected with has an uncommon name, I can normally find them
| in the auto complete list and tag them. If they have a common
| name, I often can't.
|
| I'm not posting this to ask for solutions, just to say that being
| arbitrarily limited to top results can sometimes be infuriating.
| gwern wrote:
| OP includes the entire trie if you set the minimum rank to
| -INF, so that's not an issue with the data structure but the
| UX, I would think? The data structure _can_ show all the
| results, it 's just a matter of knowing when the user wants to
| see more.
|
| You'd need something like a button for 'show more results', or
| perhaps use a timeout ('if no user action in 10s, assume the
 | desired result is missing & show more results') where it
| increases the _n_ limit (say, doubling each time). Then the
| user only pays for what they use, and the common case remains
| accelerated.
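The escalating limit gwern suggests is a few lines; a sketch with hypothetical defaults:

```python
def expand_limits(initial=10, cap=1000):
    """Yield result-count limits that double on each 'show more'
    request, so the common case stays cheap and the worst case
    stays bounded by cap."""
    n = initial
    while n < cap:
        yield n
        n *= 2
    yield cap
```

Each "show more" action (button press, scroll, or timeout) advances the generator and re-queries with the larger k, so users who stop at the top 10 never pay for the deeper traversal.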
| swader999 wrote:
| You could use scroll event(s) to trigger more results.
| gpderetta wrote:
 | It seems to me that a hybrid solution using a pruning trie for
| prefixes and a normal trie for almost complete completions
| would work.
| gchamonlive wrote:
 | You mean you do both in parallel, asynchronously, return the
 | first top 10 results, and by the time the user gets to the
 | point of needing the rest of the results, the classical trie
 | is done, or at least close to ready?
| mgaunard wrote:
| Not "part", this is the whole basis of the data structure.
| ok123456 wrote:
| He keeps the sorted frequency count. It doesn't have to be the
| top 10. You can query arbitrary k.
___________________________________________________________________
(page generated 2023-09-30 23:00 UTC)