[HN Gopher] A search engine in 80 lines of Python
       ___________________________________________________________________
        
       A search engine in 80 lines of Python
        
       Author : alexmolas
       Score  : 583 points
       Date   : 2024-02-07 19:28 UTC (1 day ago)
        
 (HTM) web link (www.alexmolas.com)
 (TXT) w3m dump (www.alexmolas.com)
        
       | softwaredoug wrote:
       | This is really cool. I have a pretty fast BM25 search engine in
       | Pandas I've been working on for local testing.
       | 
       | https://github.com/softwaredoug/searcharray
       | 
       | Why Pandas? Because BM25 is one thing, but you also want to
       | combine with other factors (recency, popularity, etc) easily
       | computed in pandas / numpy...
       | 
        | BTW phrases are the hard thing. There are a lot of edge cases
        | in phrase matching. Not to mention slop, etc. And you want to
        | compact positions into as little memory as possible.
       | 
       | https://github.com/softwaredoug/searcharray/blob/main/search...
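A minimal sketch of the positional-index approach to exact phrase matching (consecutive positions only, no slop handling; all names are illustrative, not taken from SearchArray):

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> {doc_id: [positions]} so phrase queries can check adjacency
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_match(index, phrase):
    terms = phrase.lower().split()
    if not terms:
        return set()
    # candidates must contain every term of the phrase
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            # each following term must sit exactly one position further on
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits
```

The edge cases the comment hints at (slop, repeated terms, compact position encoding) are exactly what this sketch leaves out.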
        
         | alexmolas wrote:
         | Thanks for your comment Doug! I just bought your book
         | (Relevance Search) and I'm planning to read it asap.
         | 
          | I'll take a look at your project and try experimenting with
          | it, thanks for sharing.
        
           | Xenoamorphous wrote:
           | I also bought the book (it's Relevant not Relevance BTW),
           | that one and Deep Learning for Search were invaluable when
           | building a news search engine a few years ago (right before
           | and in the middle of the pandemic).
        
             | boyter wrote:
             | Can confirm this book is excellent.
             | 
             | It's one I point people at all the time when they ask me
             | why something isn't working as expected in any standard
             | search tool, and something I reference from time to time to
             | refresh my own knowledge.
             | 
             | Well worth the money.
        
           | softwaredoug wrote:
           | Thank you! It's a bit old now, but I think the principles in
           | it are still valid :)
        
         | thierrydamiba wrote:
          | Have you found that including sentiment analysis has helped
          | (or hurt) with phrases? I also have found that phrases are
         | difficult to work with and I'm wondering what I can do to
         | improve performance.
        
         | karolist wrote:
          | How come you found this post and commented on it so quickly?
          | Do you have some sort of search scanning the front page for
          | terms of interest, or was it by chance? Curious
        
           | serialNumber wrote:
           | Huh? I check Hacker News multiple times a day - it's not odd
           | to click on an article within an hour of it being posted.
        
             | karolist wrote:
              | It's not the first time I've seen an article posted and
              | then an expert in the field comment on it rather quickly.
              | I thought I might be missing something about how other
              | people use this site. I had no negative intentions asking
              | this, and thanks for the answer ;)
        
               | abetusk wrote:
               | HN has an RSS feed [0] so there's no need to keep
               | refreshing or make the rounds of this and other sites
               | that have interesting information.
               | 
               | I have my own feed setup with sites I like to frequent
               | [1].
               | 
               | [0] https://hackaday.com/blog/feed/
               | 
               | [1] https://mechaelephant.com/feed
        
               | tsukurimashou wrote:
               | what software did you use for your own feed?
        
               | abetusk wrote:
               | Various scripts in python, shell and javascript [0].
               | 
               | [0] https://github.com/abetusk/www.mechaelephant.com/tree
               | /releas...
        
               | tsukurimashou wrote:
               | thanks! I'm currently looking at the files
               | 
               | how did you filter which posts appear in the feed?
               | 
               | EDIT: I was able to make it work, I replaced the node
               | module xmltojson with xq (python lib available from pip)
        
               | n_plus_1_acc wrote:
               | There's an RSS feed https://news.ycombinator.com/rss
        
               | pests wrote:
               | Some people set up something like a Google Search Alert
               | for terms about them or their writings, etc.
               | 
               | Also, a lot of subject matter experts hang out here so it
               | just happens a lot :)
        
               | robertlacok wrote:
               | https://f5bot.com is a way to get an email notification
               | when a keyword is mentioned
        
           | softwaredoug wrote:
           | I got that HN gaaaame
        
         | alpinelogic wrote:
         | Hey, I tackled phrase matching in my toy project here:
         | https://github.com/vasilionjea/lofi-dx/blob/main/test/search...
         | 
         | I think I tested it thoroughly but any feedback would be
         | appreciated!
         | 
         | Edit: I delta-encoded and base36-encoded the positions
        
           | softwaredoug wrote:
           | Oh nice I'll take a look!
        
       | marginalia_nu wrote:
       | Checks out. Most of the difficulty in search is dealing with the
       | data volume. The logic itself is surprisingly easy, or it can be.
        | Of course you can complicate it endlessly, but this project
        | successfully cuts most of the fluff...
       | 
       | So with that in mind, approaching the problem not as one of how
       | to make the search engine bigger, but the data smaller
       | (physically or better signal/noise ratio) goes a very long way.
        
       | nonrandomstring wrote:
        | Thanks for this lovely, well-written account. It made me wish I
        | still taught this kind of coding, because it would be such a
        | lovely example for students to work through. (We used to do a
        | bootcamp of "build your own Facebook and Google in a week".)
        | Very inspirational.
        
       | 3rd3 wrote:
       | Last time I checked lxml.html and lxml.html.clean (possibly with
       | cssselect) were much faster than BeautifulSoup. Not sure this is
       | still the case.
        
       | behnamoh wrote:
       | Is it really a good idea to build something like this (big data,
       | needs to crunch data fast) in Python (slow)?
        
         | quickthrower2 wrote:
          | This is a toy project to understand the space. He is using
          | Solr at work.
        
         | BD103 wrote:
         | I would argue that Python does not inhibit this project as much
         | as you imply. There are multiple benefits to Python, such as:
         | 
         | - A built-in dictionary type, used for indexing words
         | 
         | - Clean and easy to read code, which is one of Python's core
         | strengths
         | 
         | - It's fast to draft code in, perfect for toy programs
         | 
         | - Easy async support, which the author comments on
         | 
         | - Plenty of libraries to do the heavy lifting of tasks not
         | focused on by the post, such as hosting a web server, rendering
         | template HTML, and parsing CLI arguments
         | 
         | Yes, Python is not fast relative to C or Rust, but it's perfect
         | for this type of project.
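The first bullet can be sketched in a few lines: a built-in dict (here a defaultdict of sets) already serves as a workable inverted index. The data and names below are invented for illustration:

```python
from collections import defaultdict

# inverted index: term -> set of document ids
index = defaultdict(set)

docs = {1: "python search engine", 2: "python web crawler"}
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    # AND semantics: intersect the posting sets of every query term
    postings = [index[t] for t in query.split()]
    return set.intersection(*postings) if postings else set()
```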
        
         | eichin wrote:
         | Yes. Paragraph 3:
         | 
         | > This implementation doesn't pretend to be a production-ready
         | search engine, just a usable toy example showing how a search
         | engine works under the hood.
         | 
         | The whole point is expressiveness.
        
         | vidarh wrote:
         | I'm not a fan of Python, but the bigger "issue" here if this
         | was meant to be production code (it's clearly not) is not the
         | choice of Python, but algorithm choices. It shows the
         | principles, but for a production search engine there are a
         | whole lot of tricks you'd apply whether you stick with Python
         | or not.
         | 
         | And once you do, odds are you'd find most of the runtime spent
         | in very small portions of code doing fairly basic composable
         | operations on large compressed arrays, and you can get very far
         | with just rewriting a tiny core in something faster.
         | 
         | The number of people who need a scale where this is hard to do
         | fast enough is small...
        
         | softwaredoug wrote:
         | "Crunch data fast" is something numpy is really good at :)
         | 
         | You can see my comment about SearchArray above, but you can do
         | a lot of native performance comparable things if you embrace
         | array based programming
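A sketch of what that array-based style looks like: hypothetical per-document signal arrays combined in one vectorized expression, with no per-document Python loop. The signals and weights here are made up for illustration:

```python
import numpy as np

# hypothetical per-document signals, one array entry per document
bm25 = np.array([2.1, 0.4, 3.3, 1.2])   # text relevance
days_old = np.array([1, 300, 30, 5])    # recency
clicks = np.array([10, 500, 3, 50])     # popularity

# one vectorized expression blends the signals; numpy runs the loop in C
score = bm25 * np.exp(-days_old / 90) + 0.1 * np.log1p(clicks)
ranked = np.argsort(-score)  # document indices, best first
```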
        
       | smalu wrote:
       | What is the point of flexing about LOC, if it is not a total
       | number of \r\n since we are using external deps? I know that
       | there is no unit for codebase in SI system, but I think we should
       | measure cognitive load somehow.
        
         | hobs wrote:
         | Well, the old school way is
         | https://en.wikipedia.org/wiki/Cyclomatic_complexity
        
           | smalu wrote:
           | I am aware of it, but as far as I understand, cyclomatic
           | complexity is used to compare implementations of single flows
           | (algorithms), not for codebases. What is the cyclomatic
           | complexity of Facebook? :)
        
             | seanw444 wrote:
             | Probably at least 5
        
         | pphysch wrote:
         | It's meaningful here because if it said "A search engine in
         | 4000 lines of Python" most readers' eyes would glaze over, but
         | 80 is short enough to warrant a glance.
        
           | einpoklum wrote:
           | "A search engine in 80 columns of Python code!"
        
             | archargelod wrote:
              | "A search engine in 80 lines of Python code" (but every
              | line is 10 statements separated by semicolons)
        
         | dharmab wrote:
         | Although it's not formal, my team sometimes says "this code is
         | not grug" or "this code is pretty grug" in reference to
         | https://grugbrain.dev
        
           | DontchaKnowit wrote:
           | Thats awesome im stealing this
        
           | FergusArgyll wrote:
           | Best thing I've read today, thanks!
        
           | supperrtadderr wrote:
           | > "grug tempted reach for club when too much agile talk
           | happen but always stay calm"
        
           | noman-land wrote:
           | One of the best scholarly essays on the net.
        
           | itronitron wrote:
           | me nod head save to pdf not just bookmark link
        
           | burning_hamster wrote:
           | This grug not big brain. This grug ate food and read article
           | on big screen at the same time. Now food on screen, not in
           | belly. Grug still hungry, but need to find rag now to clean
           | screen. Otherwise no more work and no more shiny rocks for
           | grug. But grug thanks other grug for article. Grug belly
           | empty, but grug brain full now.
        
         | lijok wrote:
         | Why not celebrate the achievements of the industry allowing us
         | to build something like this in 80 LOC?
        
           | Drakim wrote:
           | I mean, I could import this achievement in my own project and
           | build a search engine in 1 LOC.
        
             | duck wrote:
              | You could just use an existing search engine and it would
              | be 0 LOC, but I think you're missing the point. The focus
              | wasn't on 80 LOC, but rather being able to talk through it
              | in a short blog post.
        
               | Arisaka1 wrote:
                | I don't think that LOC affects one's ability to
                | effectively communicate the way their code operates.
                | And, if we really need to, lowering the number of
                | operations required to attain the desired result would
                | be better than lowering lines of code, since fewer ops =
                | less explanation.
        
             | hombre_fatal wrote:
             | "X in N lines" is interesting because it's going to show
             | you the minimal implementation of the interesting or
             | central part of the solution without all the other moving
             | parts of a production system, usually for the purpose of
             | demonstrating how something works.
             | 
             | You can see how your 1-liner pitch doesn't fulfill this
             | expectation.
        
         | golergka wrote:
         | Operating system and programming language are also external
         | dependencies.
        
         | wongarsu wrote:
         | The 80 line search engine isn't using any external deps though.
         | It only imports collections, math and string, all in the
         | standard library. Maybe it'd be more accurate to call it a
         | search engine engine though. The crawler and interface aren't
         | counted towards that goal, but they are obviously needed in
         | some form, and the implementations presented add a bunch of
         | lines and a lot of libraries.
         | 
         | But even then, those libraries aren't related to search
         | engines. If we start counting generic dependencies like pandas
         | and fastapi, we might as well start counting the millions of
         | loc of the operating system needed to run this search engine,
         | and the firmware in the network card. Maybe even account for
         | the complexity of the hardware this is running on.
        
       | maaaaattttt wrote:
       | I like it!
       | 
       | Here is a recommendation engine in < 20 lines of python to go
       | along your search engine (if you keep session logs of clicked
        | urls).
        | 
        |     def build_recommendations(logs: List[List[str]],
        |                               window_size: int = 10,
        |                               max_recommendations_per_url: int = 50) -> dict:
        |         recommendations = {}
        |         for session in logs:
        |             for i in range(0, len(session) - 1):
        |                 url_id = session[i]  # reference element
        |                 window = session[i+1 : i+window_size]  # sliding window
        |                 recommendations[url_id] = recommendations.get(url_id, {})
        |                 for pos, next_link in enumerate(window):
        |                     # elements in the window get decreasing weight
        |                     # proportional to their distance from the
        |                     # reference element
        |                     weight = window_size - pos
        |                     recommendations[url_id][next_link] = \
        |                         recommendations[url_id].get(next_link, 0)
        |                     recommendations[url_id][next_link] += weight
        |         for url_id, link_recommendations in recommendations.items():
        |             # sort and truncate the recommendations
        |             recommendations[url_id] = dict(
        |                 sorted(link_recommendations.items(),
        |                        key=lambda item: item[1],
        |                        reverse=True)[:max_recommendations_per_url])
        |         return recommendations
        | 
        |     recommendations = build_recommendations(your_logs)
        |     # prints an ordered list of the recommended url_ids
        |     # for the given url_id
        |     print(list(recommendations[some_url_id].keys()))
       | 
       | With some tweaking (mix typed queries and clicked urls in the
       | logs you feed) you can get a spellcheck suggestion out of it as
       | well :)
        
       | renegat0x0 wrote:
       | I have myself dabbled a little bit in that subject. Some of my
       | notes:
       | 
        | - some RSS feeds are protected by Cloudflare. It is true however
        | that this is not an issue for the majority of blogs. If you
        | would like to do more, then Selenium would be a way to handle
        | "cloudflare"-protected links
       | 
        | - sometimes even headless Selenium is not enough and a full-
        | blown browser in Selenium is necessary to fool its protection
       | 
       | - sometimes even that is not enough
       | 
       | - then I started to wonder, why some RSS feeds are so well
       | protected by cloudflare, but who am I to judge?
       | 
        | - sometimes it is beneficial to spoof the user agent. I feel bad
        | for setting my user agent to Chrome, but again, why are RSS
        | feeds so well protected?
       | 
        | - you cannot parse and read the entire Internet, therefore you
        | always need to think about compromises. For example, I have
        | narrowed the area of my searches in one of my projects to
        | domains only. Now I can find most of the common domains, and I
        | sort them by their "importance"
       | 
        | - RSS links do change. There need to be automated means to
        | disable dead feeds, to prevent checking inactive domains
       | 
       | - I do not see any configurable timeout for reading a page, but I
       | am not familiar with aiohttp. Some pages might waste your time
       | 
       | - I hate that some RSS feeds are not configured properly. Some
       | sites do not provide a valid meta "link" with
       | "application/rss+xml". Some RSS feeds have naive titles like
       | "Home", or no title at all. Such a waste of opportunity
       | 
       | My RSS feed parser, link archiver, web crawler:
       | https://github.com/rumca-js/Django-link-archive. Especially
       | interesting could be file rsshistory/webtools.py. It is not
       | advanced programming craft, but it got the job done.
       | 
        | Additionally, in another project I have collected around 2378
        | personal sites. I collect domains in https://github.com/rumca-
        | js/Internet-Places-Database/tree/ma... . These files are JSONs.
        | All personal sites have tag "personal".
       | 
       | Most of the things are collected from:
       | 
       | https://nownownow.com/
       | 
       | https://searchmysite.net/
       | 
       | I wanted also to process domains from
       | https://downloads.marginalia.nu/, but haven't got time to read
       | structure of the files
        
         | marginalia_nu wrote:
         | > I wanted also to process domains from
         | https://downloads.marginalia.nu/, but haven't got time to read
         | structure of the files
         | 
         | Feel free to shoot me an email if you have any questions about
         | how to use the files. They're very much there to fan the flames
         | of other indie search projects :D
        
       | hodanli wrote:
       | i know it is unrelated but i liked your website.
        
         | alexmolas wrote:
          | thanks! The design has been carefully chosen; I was looking
          | for something minimalistic. I'm glad you enjoyed it :)
        
       | MultiVectors wrote:
       | Really cool project! Building a search engine from the ground up
       | to tackle small site discoverability? That's a challenge I can
       | get behind. I'm especially into how you used asyncio in Python
       | for faster crawling - smart move.
       | 
       | But, let's talk scale and features. As it stands, handling bigger
       | data sets or adding more complex search features might be tough.
       | On the features side, playing around with query operators or
       | n-gram indexing could seriously level up your search results.
       | 
       | Expanding beyond RSS for content could also give your engine a
       | nice touch. Just throwing in my thoughts - been around the block
       | with search tech a bit. Can't wait to see what's next for your
       | project!
        
       | CephalopodMD wrote:
       | This checks out! Pretty much exactly what we did in my undergrad
       | Information Retrieval class.
        
       | adw wrote:
       | This is very cool and very educational. Don't deploy it, though.
       | :-)
       | 
       | I needed something like this once, but at a little bit larger
       | scale (few tens of thousands of documents). The answer, as
       | always, was [sqlite](https://www.sqlite.org/fts5.html);
       | structurally though it's just what you have here but with someone
       | else writing the inverted-index persistence layer.
        
         | rcarmo wrote:
         | I use SQLite FTS for just about everything. Has never let me
         | down.
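A minimal sketch of that route with Python's standard sqlite3 module (this assumes the bundled SQLite was compiled with FTS5, which is the common case; table and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table: every column is full-text indexed
con.execute("CREATE VIRTUAL TABLE docs USING fts5(url, body)")
con.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("a.com", "building a tiny search engine"),
     ("b.com", "cooking pasta at home")],
)
# bm25() is FTS5's built-in ranking function (lower scores rank better)
rows = con.execute(
    "SELECT url FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("search engine",),
).fetchall()
```

The inverted index, persistence, and ranking all come for free; the application only supplies documents and queries.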
        
       | pshirshov wrote:
        | Don't use keywords (1-grams); the best results for English can
        | be achieved with 2+3-grams. n-grams retain context.
        
         | marginalia_nu wrote:
         | I think you'd probably get the best result with both. There's
         | definitely real merit to a keyword-understanding of which terms
         | appear in the title or as a named entity for example.
        
           | pshirshov wrote:
            | Usually you would want to have weighted n-grams, with
            | 1-grams having the lowest weights. In many cases it's better
            | to have zeros. For English, 4-grams react to generic
            | phrases/idioms, 5+ are way too selective, and keywords alone
            | usually reduce relevance too much. 2+3 are the best.
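A sketch of word-level 2- and 3-gram extraction along those lines (helper names are illustrative; the weighting scheme is left out):

```python
def word_ngrams(text, n):
    # sliding window of n consecutive tokens; n-grams retain local context
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def index_terms(text, sizes=(2, 3)):
    # 2-grams and 3-grams as suggested; 1-grams could be mixed in
    # with a low weight
    terms = []
    for n in sizes:
        terms.extend(word_ngrams(text, n))
    return terms
```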
        
       | cabalamat wrote:
       | Looking at the code (src/microsearch/engine.py), we have:
        |     class SearchEngine:
        |         def __init__(self, k1: float = 1.5, b: float = 0.75):
        |             self._index: dict[str, dict[str, int]] = \
        |                 defaultdict(lambda: defaultdict(int))
        |             self._documents: dict[str, str] = {}
        |             self.k1 = k1
        |             self.b = b
       | 
       | I've no idea what `k1` or `b` are. Nor is there a single comment
       | in the entire file. Are comments considered unfashionable these
       | days?
       | 
       | Looking at `_documents`, I'm guessing the keys are URLs and the
       | values contents of those URLs, but i might be wrong.
       | 
       | The whole thing looks like it would've been a useful resource for
       | people to learn how to build search engines with, that people
       | could build on, had the writer been bothered to document it. But
       | as it is I'm disappointed with the poor code.
        
         | boyter wrote:
          | On a mobile device, but those are the standard weighting
          | values for either TF-IDF or BM25. In this case, BM25.
          | 
          | A comment would be useful, but they are also instantly
          | recognisable to anyone familiar with the problem.
        
           | 6510 wrote:
           | > instantly recognisable to anyone familiar with the problem.
           | 
           | I always love reading those when not familiar. It's almost as
           | funny as reading something one already knows, waiting for the
           | punch line...
        
         | jffry wrote:
         | That's explained in the article, which serves as the
         | documentation for the code within the article. The link for
         | BM25 goes to some of the math, and a little more searching the
         | internet about BM25 parameters can lead you to some relevant
         | articles on how to choose them.
        
         | Barrin92 wrote:
          | this trend of `a: float` always reminds me of the Rich Hickey
          | "you don't want types, you want proper names" talk. I really
          | hate this (to me, Go-inspired) tendency of undescriptive
          | single-letter variables, with the type system abused as a
          | naming assistant.
         | 
         | Names can convey proper semantic information about what your
         | program does, use them godammit
        
           | alexmolas wrote:
           | I agree with you that better names are always preferable. But
           | type hints also work as documentation. We can have both.
           | 
            | However, in this particular case the undescriptive names
            | are like this for historical reasons. I agree these are not
            | the best names, but they are the names used in the
            | literature. If I was working in physics I would probably
            | use "c" for the speed of light or "kb" for the Boltzmann
            | constant, which are not very descriptive names either.
        
           | planb wrote:
           | This is right, but if you are implementing a formula known in
           | the domain or documented elsewhere, you can (and should) use
           | the letters used in the formula (in this case, b and k1)
           | instead of making up names.
        
           | sampo wrote:
           | > tendency of undescriptive single letter variable
           | 
            | There are 2 schools of thought on which one is clearer:
            | 
            |     F = G * m1 * m2 / r**2
            | 
            | or
            | 
            |     force = gravitational_constant * mass_of_body_1
            |             * mass_of_body_2 / distance_between_bodies ** 2
        
             | trashtester wrote:
             | Phycisist :
             | 
             | > F = G * m1 * m2 / r**2
             | 
             | Computer scientist:
             | nonrelativistic_gravitational_force = (
             | Physics.NonRelativistic.Gravity.gravitational_constant
             | * body1.NonRelativistic.mass()            *
             | body2.NonRelativistic.mass()            /
             | body1.NonRelativistic.distanceTo(body2) ** 2         )*
        
               | ZaoLahma wrote:
               | Software engineer straight out of uni:
               | 
               | # Get force of gravity as below to be used later
               | 
               | F = G * m1 * m2 / r*2
               | 
               | Software engineer 5 years after uni:
               | gravitational_force = (            PhysicsContextConstruc
               | torFactory.createByRelativisticEnum(SystemConfig.getRelat
               | ivisticEnum()).construct().getGravitationalConstant().val
               | ue()           * body1.getRelativisticContext(SystemConfi
               | g.getRelativisticEnum()).mass().value()            * body
               | 2.getRelativisticContext(SystemConfig.getRelativisticEnum
               | ()).mass().value()            / SystemConfig.getRelativis
               | ticContext(SystemConfig.getRelativisticEnum()).distanceTo
               | (body1, body2).value() \* PlatformUnsignedInt(PlatformMin
               | MaxAwareUnsignedInt(2).value()).getUnsignedInt().value()
               | )*
               | 
               | Software engineer after 15 years:
               | 
               | # Newton's law of universal gravitation is well within
               | wanted margin of error
               | 
               | F = G * m1 * m2 / r*2
        
               | lloeki wrote:
               | Sadly the 5y SE forgot about dependency injecting the
               | computation... after all we're always on the brink of new
               | physics discovery, we'd hate to have to rewrite all that
               | code when we can readily swap in fresh laws with the
               | proper architecture!
        
               | ZaoLahma wrote:
               | Oh, but that comes as a needs-to-be-fixed to the pull
               | request from the 6y SE, so no worries at all.
        
             | cabalamat wrote:
             | Short names are OK if they are explained somewhere. I often
             | explain them in a comment at the start of a class or
             | function.
        
         | alexmolas wrote:
         | Hi, author here.
         | 
          | If I wanted a catchy title for the post, I needed to cut the
          | number of LOC as much as possible ;)
         | 
         | Joking apart, thanks for your feedback. I agree that usually
         | it's better to have documentation and code together, but in
         | this case since it's an educational project I decided to split
         | code and documentation, and document the code in a blog post.
        
           | cabalamat wrote:
           | Fair enough
        
         | marginalia_nu wrote:
          | k1 and b are tuning parameters for the BM25 ranking
          | function[1]. These are not names the OP invented. Virtually
          | every implementation and definitely every textbook uses these
          | variable names. You give them the names k1 and b because
          | otherwise nobody who's into information retrieval will
          | understand what they do.
         | 
         | [1] https://en.wikipedia.org/wiki/Okapi_BM25
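For reference, a sketch of the per-term BM25 score following the formulation linked above, showing exactly where k1 and b enter (the function name and argument layout are ours, not from the post):

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.5, b=0.75):
    # idf: rare terms count for more
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # b scales how strongly long documents are penalized
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences stop helping (saturation)
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

Setting b=0 turns off length normalization entirely; a very large k1 makes the score nearly linear in term frequency.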
        
       | hipadev23 wrote:
       | > chunk for chunk in chunks if chunk
       | 
       | I believe there's a saying about a woodchuck somewhere in here
        
       | userbinator wrote:
       | _Google allows you to search for "search engine" (notice the
       | double quotes) and it'll only show you results where the two
       | words appear in this specific order._
       | 
       | At least only some of the time, unfortunately. What power users
       | want is "grep for the Web", not "Google, tell me what you want me
       | to see."
        
         | marginalia_nu wrote:
         | > What power users want is "grep for the Web", not "Google,
         | tell me what you want me to see."
         | 
         | I can almost guarantee that nobody actually wants this. "Grep
         | for the web" is strictly bad compared to a search engine that
         | does the tiniest amount of query expansion. Google is
         | definitely taking too many liberties in interpreting the query,
         | but there are many things any search engine should do that will
         | be a straight improvement over not doing them.
         | 
         | The problem with Google search right now is that it's hard to
         | reason about why it gives the results it does, seemingly
         | because they rely too heavily on embeddings to compare strings.
         | It's frustrating when "cat food" matches "dog restaurant"
         | because the two were semantically close in some embedding space
         | that doesn't quite align with human reasoning.
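By contrast, even the simplest rule-based query expansion stays easy to reason about. A toy sketch with a hand-rolled synonym table (entries invented for illustration):

```python
# tiny synonym table; a real engine would use curated or mined synonyms
SYNONYMS = {"car": ["automobile"], "fast": ["quick", "rapid"]}

def expand(query):
    terms = query.lower().split()
    expanded = set(terms)
    for t in terms:
        # every expansion is explicit, so results remain explainable
        expanded.update(SYNONYMS.get(t, []))
    return expanded
```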
        
       | malomalsky wrote:
        | Imo, it's not fair to talk about 80 lines of code while using
        | third-party libraries (feedparser, bs4, etc)
        
         | marginalia_nu wrote:
         | If they'd built it on top of elasticsearch I'd agree with this
         | sentiment, but given the actual search engine bit is
         | implemented in those 80 lines, I think it's fair. The libraries
         | being pulled are exactly the sort of libraries you shouldn't
         | try to hand-roll.
         | 
         | (sometimes you see those articles like 'built your own search
         | engine' and it's a guide on how to install searxng or yacy or
         | something.)
        
         | make3 wrote:
          | it is if they're extremely general & mainstream
        
       | alpinelogic wrote:
       | For an inverted index with JavaScript/Typescript, here is my
       | implementation: https://github.com/vasilionjea/lofi-dx
       | 
       | It was a fun little project but definitely way more than 80 LOC
       | :)
        
       | lqet wrote:
       | Nice! It wouldn't be much work to add fuzzy-search functionality
       | (all results with a prefix edit-distance below some threshold
       | delta, so that a search for "hackrnew" matches "hackernews").
       | Basically, what you would do is you add an additional inverted
       | index, but this time the keys are n-grams of words in your
       | document collection (typically 3-grams), and the postings are the
       | words (or their ID) in which these n-grams occur. There is then a
       | nice lemma that basically says the following (where PED is the
       | prefix edit distance, and N(x) are the n-grams of word x):
       | 
        | If PED(x, y) <= delta, then |N(x) ∩ N(y)| >= |N(x)| - n * delta
        | 
        | That is, x and y must have at least |N(x)| - n * delta n-grams
        | in common to have a PED(x, y) less than or equal to delta.
       | 
        | If you now have an input x, you calculate N(x) and retrieve all
        | postings from the n-gram index for each n-gram of x. You can now
        | merge all of these postings and get a list that looks like this:
        | [worda, worda, worda, wordb, wordb, wordc] (for each n-gram that
        | x and some word y have in common, you get one entry of y). If
        | you merge the duplicates, you get: [(worda, 3), (wordb, 2),
        | (wordc, 1)], and for each y (in this case, worda, wordb, wordc),
        | the number in the corresponding tuple is |N(x) ∩ N(y)|.
       | 
        | If this number is at least |N(x)| - n * delta, you explicitly
        | compute PED(x, y) and check whether it is below your threshold.
        | If the number is smaller, you can simply skip it, saving you
        | large amounts of costly PED calculations.
       | 
       | Your result is a list of words y with a PED(x, y) to the input x
       | below some threshold, and you can then use this list of words to
       | query your existing index.
       | 
       | I used this approach many years ago to implement a fuzzy client-
       | side JS search engine on https://dont.watch/ (if you look into
       | the JS code, you can see that the inverted index and the
       | (compressed) n-gram index are simply transferred in the JS-file).
       | The actual search engine is around 300 lines of JS, with no
       | external dependencies and some very basic heuristics to improve
        | the search results.
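The filter-then-verify scheme described above can be sketched as follows. For brevity this uses plain Levenshtein distance instead of the prefix edit distance (PED), and pads words so boundary n-grams exist; all names are illustrative:

```python
def ngrams(word, n=3):
    padded = f"${word}$"  # pad so prefixes/suffixes yield n-grams too
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def fuzzy_candidates(query, vocabulary, delta=2, n=3):
    q_grams = ngrams(query, n)
    threshold = len(q_grams) - n * delta  # the lemma's lower bound
    results = []
    for word in vocabulary:
        common = len(q_grams & ngrams(word, n))
        if common >= threshold:                          # cheap filter first...
            if edit_distance(query, word) <= delta:      # ...costly check after
                results.append(word)
    return results
```

A real implementation would walk the n-gram inverted index and merge postings instead of scanning the whole vocabulary, but the pruning logic is the same.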
        
         | snowstormsun wrote:
         | How much does that increase the index size?
        
       | pshirshov wrote:
        | Also, forgot to add: for this approach you'd better use sketches
        | (Bloom filter/HyperMinHash). Same results with lightning
        | performance.
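A sketch of the Bloom-filter half of that suggestion using only the standard library (the bit count, hash count, and salting scheme are arbitrary choices here, not a tuned design):

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # m-bit array packed into one int

    def _positions(self, item):
        # derive k bit positions by salting the hash input with an index
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # can report false positives, but never false negatives
        return all(self.bits >> p & 1 for p in self._positions(item))
```

The win for search is cheap set-membership tests (e.g. "does this document possibly contain this term?") before touching the full index.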
        
       | make3 wrote:
       | just fyi, PageRank and BM25 address orthogonal problems of
       | reputation/popularity (PageRank) and query/semantic matching
       | (BM25). Saying you're using bm25 instead of PageRank makes no
       | sense.
        
       | amai wrote:
       | Why not use whoosh?
       | https://whoosh.readthedocs.io/en/latest/quickstart.html
        
       ___________________________________________________________________
       (page generated 2024-02-08 23:01 UTC)