[HN Gopher] A search engine in 80 lines of Python
___________________________________________________________________
A search engine in 80 lines of Python
Author : alexmolas
Score : 583 points
Date : 2024-02-07 19:28 UTC (1 day ago)
(HTM) web link (www.alexmolas.com)
(TXT) w3m dump (www.alexmolas.com)
| softwaredoug wrote:
| This is really cool. I have a pretty fast BM25 search engine in
| Pandas I've been working on for local testing.
|
| https://github.com/softwaredoug/searcharray
|
| Why Pandas? Because BM25 is one thing, but you also want to
| combine with other factors (recency, popularity, etc) easily
| computed in pandas / numpy...
|
| BTW phrases are the hard thing. There are a lot of edge cases in
| phrase matching. Not to mention slop, etc. And you want to
| compact positions into as little memory as possible (see the
| sketch below).
|
| https://github.com/softwaredoug/searcharray/blob/main/search...
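|
| For illustration, here's a minimal sketch of positional phrase
| matching with slop (a toy postings layout made up for this
| comment, not SearchArray's actual implementation):
|     # toy positional index: term -> {doc_id: [positions]}
|     index = {
|         "search": {1: [0, 7], 2: [3]},
|         "engine": {1: [1, 9], 2: [5]},
|     }
|     def phrase_match(terms, index, slop=0):
|         # candidate docs must contain every term
|         docs = set.intersection(*(set(index[t]) for t in terms))
|         hits = []
|         for doc in docs:
|             for start in index[terms[0]][doc]:
|                 pos, ok = start, True
|                 for term in terms[1:]:
|                     # the next term must appear within slop+1 positions
|                     nxt = [p for p in index[term][doc]
|                            if pos < p <= pos + 1 + slop]
|                     if not nxt:
|                         ok = False
|                         break
|                     pos = nxt[0]
|                 if ok:
|                     hits.append(doc)
|                     break
|         return hits
|     print(phrase_match(["search", "engine"], index))  # -> [1]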
| alexmolas wrote:
| Thanks for your comment Doug! I just bought your book
| (Relevance Search) and I'm planning to read it asap.
|
| I'll take a look at your project and experiment with it; thanks
| for sharing.
| Xenoamorphous wrote:
| I also bought the book (it's Relevant not Relevance BTW),
| that one and Deep Learning for Search were invaluable when
| building a news search engine a few years ago (right before
| and in the middle of the pandemic).
| boyter wrote:
| Can confirm this book is excellent.
|
| It's one I point people at all the time when they ask me
| why something isn't working as expected in any standard
| search tool, and something I reference from time to time to
| refresh my own knowledge.
|
| Well worth the money.
| softwaredoug wrote:
| Thank you! It's a bit old now, but I think the principles in
| it are still valid :)
| thierrydamiba wrote:
| Have you found that including sentiment analysis has helped(or
| hurt) with phrases? I also have found that phrases are
| difficult to work with and I'm wondering what I can do to
| improve performance.
| karolist wrote:
| How come you found this post and commented on it so quickly? Do
| you have some sort of search scanning the front page for terms of
| interest, or was it by chance? Curious
| serialNumber wrote:
| Huh? I check Hacker News multiple times a day - it's not odd
| to click on an article within an hour of it being posted.
| karolist wrote:
| It's not the first time I've seen an article posted and then an
| expert in the field comment on it rather quickly. I thought I
| might be missing something about how other people use this site.
| I had no negative intentions asking this, and thanks for the
| answer ;)
| abetusk wrote:
| HN has an RSS feed [0] so there's no need to keep
| refreshing or make the rounds of this and other sites
| that have interesting information.
|
| I have my own feed setup with sites I like to frequent
| [1].
|
| [0] https://hackaday.com/blog/feed/
|
| [1] https://mechaelephant.com/feed
| tsukurimashou wrote:
| what software did you use for your own feed?
| abetusk wrote:
| Various scripts in python, shell and javascript [0].
|
| [0] https://github.com/abetusk/www.mechaelephant.com/tree
| /releas...
| tsukurimashou wrote:
| thanks! I'm currently looking at the files
|
| how did you filter which posts appear in the feed?
|
| EDIT: I was able to make it work, I replaced the node
| module xmltojson with xq (python lib available from pip)
| n_plus_1_acc wrote:
| There's an RSS feed https://news.ycombinator.com/rss
| pests wrote:
| Some people set up something like a Google Search Alert
| for terms about them or their writings, etc.
|
| Also, a lot of subject matter experts hang out here so it
| just happens a lot :)
| robertlacok wrote:
| https://f5bot.com is a way to get an email notification
| when a keyword is mentioned
| softwaredoug wrote:
| I got that HN gaaaame
| alpinelogic wrote:
| Hey, I tackled phrase matching in my toy project here:
| https://github.com/vasilionjea/lofi-dx/blob/main/test/search...
|
| I think I tested it thoroughly but any feedback would be
| appreciated!
|
| Edit: I delta-encoded and base36-encoded the positions
| softwaredoug wrote:
| Oh nice I'll take a look!
| marginalia_nu wrote:
| Checks out. Most of the difficulty in search is dealing with the
| data volume. The logic itself is surprisingly easy, or it can be.
| Of course you can complicate it endlessly, but this project
| successfully cuts most of the fluff...
|
| So with that in mind, approaching the problem not as one of how
| to make the search engine bigger, but the data smaller
| (physically or better signal/noise ratio) goes a very long way.
| nonrandomstring wrote:
| Thanks for this lovely, well-written account. Made me wish I still
| taught this kind of coding because it would be such a lovely
| example for students to work through. (We used to do a bootcamp
| of build your own Facebook and Google in a week) Very
| inspirational.
| 3rd3 wrote:
| Last time I checked lxml.html and lxml.html.clean (possibly with
| cssselect) were much faster than BeautifulSoup. Not sure this is
| still the case.
| behnamoh wrote:
| Is it really a good idea to build something like this (big data,
| needs to crunch data fast) in Python (slow)?
| quickthrower2 wrote:
| This is a toy project to understand the space. He is using solr
| at work.
| BD103 wrote:
| I would argue that Python does not inhibit this project as much
| as you imply. There are multiple benefits to Python, such as:
|
| - A built-in dictionary type, used for indexing words (see the
| sketch below this list)
|
| - Clean and easy to read code, which is one of Python's core
| strengths
|
| - It's fast to draft code in, perfect for toy programs
|
| - Easy async support, which the author comments on
|
| - Plenty of libraries to do the heavy lifting of tasks not
| focused on by the post, such as hosting a web server, rendering
| template HTML, and parsing CLI arguments
|
| Yes, Python is not fast relative to C or Rust, but it's perfect
| for this type of project.
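|
| For instance, the core of such an index is just nested dicts (a
| minimal sketch in the same spirit as the post's engine, not its
| exact code):
|     from collections import defaultdict
|     # term -> {url: term frequency}
|     index: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
|     def index_document(url: str, text: str) -> None:
|         for word in text.lower().split():
|             index[word][url] += 1
|     index_document("https://example.com", "a tiny search engine")
|     print(dict(index["search"]))  # {'https://example.com': 1}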
| eichin wrote:
| Yes. Paragraph 3:
|
| > This implementation doesn't pretend to be a production-ready
| search engine, just a usable toy example showing how a search
| engine works under the hood.
|
| The whole point is expressiveness.
| vidarh wrote:
| I'm not a fan of Python, but the bigger "issue" here if this
| was meant to be production code (it's clearly not) is not the
| choice of Python, but algorithm choices. It shows the
| principles, but for a production search engine there are a
| whole lot of tricks you'd apply whether you stick with Python
| or not.
|
| And once you do, odds are you'd find most of the runtime spent
| in very small portions of code doing fairly basic composable
| operations on large compressed arrays, and you can get very far
| with just rewriting a tiny core in something faster.
|
| The number of people who need a scale where this is hard to do
| fast enough is small...
| softwaredoug wrote:
| "Crunch data fast" is something numpy is really good at :)
|
| You can see my comment about SearchArray above, but you can get
| performance comparable to native code if you embrace array-based
| programming.
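|
| A minimal sketch of the idea (toy numbers and weights, nothing
| from SearchArray itself):
|     import numpy as np
|     # one entry per document; everything is a whole-array operation
|     bm25 = np.array([2.1, 0.3, 1.7])        # text relevance scores
|     days_old = np.array([1, 30, 365])       # recency signal
|     popularity = np.array([0.9, 0.5, 0.1])  # popularity signal
|     score = 0.7 * bm25 + 0.2 * np.exp(-days_old / 90) + 0.1 * popularity
|     print(np.argsort(-score))  # doc ids, best first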
| smalu wrote:
| What is the point of flexing about LOC if it is not the total
| number of \r\n, given that we are using external deps? I know
| that there is no unit for a codebase in the SI system, but I
| think we should measure cognitive load somehow.
| hobs wrote:
| Well, the old school way is
| https://en.wikipedia.org/wiki/Cyclomatic_complexity
| smalu wrote:
| I am aware of it, but as far as I understand, cyclomatic
| complexity is used to compare implementations of single flows
| (algorithms), not for codebases. What is the cyclomatic
| complexity of Facebook? :)
| seanw444 wrote:
| Probably at least 5
| pphysch wrote:
| It's meaningful here because if it said "A search engine in
| 4000 lines of Python" most readers' eyes would glaze over, but
| 80 is short enough to warrant a glance.
| einpoklum wrote:
| "A search engine in 80 columns of Python code!"
| archargelod wrote:
| "A search engine in 80 lines of Python code" (but every
| line is 10 statements separated by semicolon)
| dharmab wrote:
| Although it's not formal, my team sometimes says "this code is
| not grug" or "this code is pretty grug" in reference to
| https://grugbrain.dev
| DontchaKnowit wrote:
| Thats awesome im stealing this
| FergusArgyll wrote:
| Best thing I've read today, thanks!
| supperrtadderr wrote:
| > "grug tempted reach for club when too much agile talk
| happen but always stay calm"
| noman-land wrote:
| One of the best scholarly essays on the net.
| itronitron wrote:
| me nod head save to pdf not just bookmark link
| burning_hamster wrote:
| This grug not big brain. This grug ate food and read article
| on big screen at the same time. Now food on screen, not in
| belly. Grug still hungry, but need to find rag now to clean
| screen. Otherwise no more work and no more shiny rocks for
| grug. But grug thanks other grug for article. Grug belly
| empty, but grug brain full now.
| lijok wrote:
| Why not celebrate the achievements of the industry allowing us
| to build something like this in 80 LOC?
| Drakim wrote:
| I mean, I could import this achievement in my own project and
| build a search engine in 1 LOC.
| duck wrote:
| You could just use an existing search engine and it would be
| 0 LOC, but I think you're missing the point. The focus
| wasn't on 80 LOC, but rather on being able to talk through it
| in a short blog post.
| Arisaka1 wrote:
| I don't think that LOC affects one's ability to
| effectively communicate the way their code operates. And,
| if we really need a metric, lowering the number of operations
| required to attain the desired result would be better than
| lowering lines of code, since fewer ops = less explanation.
| hombre_fatal wrote:
| "X in N lines" is interesting because it's going to show
| you the minimal implementation of the interesting or
| central part of the solution without all the other moving
| parts of a production system, usually for the purpose of
| demonstrating how something works.
|
| You can see how your 1-liner pitch doesn't fulfill this
| expectation.
| golergka wrote:
| Operating system and programming language are also external
| dependencies.
| wongarsu wrote:
| The 80 line search engine isn't using any external deps though.
| It only imports collections, math and string, all in the
| standard library. Maybe it'd be more accurate to call it a
| search engine engine though. The crawler and interface aren't
| counted towards that goal, but they are obviously needed in
| some form, and the implementations presented add a bunch of
| lines and a lot of libraries.
|
| But even then, those libraries aren't related to search
| engines. If we start counting generic dependencies like pandas
| and fastapi, we might as well start counting the millions of
| loc of the operating system needed to run this search engine,
| and the firmware in the network card. Maybe even account for
| the complexity of the hardware this is running on.
| maaaaattttt wrote:
| I like it!
|
| Here is a recommendation engine in < 20 lines of python to go
| along your search engine (if you keep session logs of clicked
| urls).
|     from typing import List
|     def build_recommendations(logs: List[List[str]],
|                               window_size: int = 10,
|                               max_recommendations_per_url: int = 50) -> dict:
|         recommendations = {}
|         for session in logs:
|             for i in range(0, len(session) - 1):
|                 url_id = session[i]  # reference element
|                 window = session[i+1 : i+window_size]  # sliding window
|                 recommendations[url_id] = recommendations.get(url_id, {})
|                 for pos, next_link in enumerate(window):
|                     # elements in the window get decreasing weight
|                     # proportional to their distance from the
|                     # reference element
|                     weight = window_size - pos
|                     recommendations[url_id][next_link] = \
|                         recommendations[url_id].get(next_link, 0)
|                     recommendations[url_id][next_link] += weight
|         for url_id, link_recommendations in recommendations.items():
|             # sort and truncate the recommendations
|             recommendations[url_id] = dict(
|                 sorted(link_recommendations.items(),
|                        key=lambda item: item[1],
|                        reverse=True)[:max_recommendations_per_url])
|         return recommendations
|     recommendations = build_recommendations(your_logs)
|     # gives an ordered list of the recommended url_ids
|     # for the given url_id
|     print(list(recommendations[some_url_id].keys()))
|
| With some tweaking (mix typed queries and clicked urls in the
| logs you feed) you can get a spellcheck suggestion out of it as
| well :)
| renegat0x0 wrote:
| I have dabbled a little bit in this subject myself. Some of my
| notes:
|
| - some RSS feeds are protected by Cloudflare. It is true, however,
| that this is not an issue for the majority of blogs. If you would
| like to do more, then Selenium would be a way to handle
| Cloudflare-protected links
|
| - sometimes even headless Selenium is not enough, and a full-blown
| browser in Selenium is necessary to fool the protection
|
| - sometimes even that is not enough
|
| - then I started to wonder why some RSS feeds are so well
| protected by Cloudflare, but who am I to judge?
|
| - sometimes it is beneficial to spoof the user agent. I feel bad
| for setting my user agent to Chrome, but again, why are RSS feeds
| so well protected?
|
| - you cannot parse or read the entire Internet, so you always
| need to think about compromises. For example, in one of my
| projects I have narrowed the area of my searches to domains only.
| Now I can find most of the common domains, and I sort them by
| their "importance"
|
| - RSS links do change. There needs to be an automated means to
| disable some feeds to prevent checking inactive domains
|
| - I do not see any configurable timeout for reading a page, but I
| am not familiar with aiohttp. Some pages might waste your time
| (see the sketch after these notes)
|
| - I hate that some RSS feeds are not configured properly. Some
| sites do not provide a valid meta "link" with
| "application/rss+xml". Some RSS feeds have naive titles like
| "Home", or no title at all. Such a waste of opportunity
|
| My RSS feed parser, link archiver, web crawler:
| https://github.com/rumca-js/Django-link-archive. The file
| rsshistory/webtools.py could be especially interesting. It is not
| advanced programming craft, but it gets the job done.
|
| Additionally, in another project I have collected around 2378
| personal sites. I collect domains in https://github.com/rumca-
| js/Internet-Places-Database/tree/ma... . These files are JSONs.
| All personal sites have the tag "personal".
|
| Most of the things are collected from:
|
| https://nownownow.com/
|
| https://searchmysite.net/
|
| I wanted also to process domains from
| https://downloads.marginalia.nu/, but haven't got time to read
| structure of the files
| marginalia_nu wrote:
| > I wanted also to process domains from
| https://downloads.marginalia.nu/, but haven't got time to read
| structure of the files
|
| Feel free to shoot me an email if you have any questions about
| how to use the files. They're very much there to fan the flames
| of other indie search projects :D
| hodanli wrote:
| i know it is unrelated but i liked your website.
| alexmolas wrote:
| thanks! The design has been carefully chosen, I was looking for
| something minimalistic. I'm glad you enjoyed it :)
| MultiVectors wrote:
| Really cool project! Building a search engine from the ground up
| to tackle small site discoverability? That's a challenge I can
| get behind. I'm especially into how you used asyncio in Python
| for faster crawling - smart move.
|
| But, let's talk scale and features. As it stands, handling bigger
| data sets or adding more complex search features might be tough.
| On the features side, playing around with query operators or
| n-gram indexing could seriously level up your search results.
|
| Expanding beyond RSS for content could also give your engine a
| nice touch. Just throwing in my thoughts - been around the block
| with search tech a bit. Can't wait to see what's next for your
| project!
| CephalopodMD wrote:
| This checks out! Pretty much exactly what we did in my undergrad
| Information Retrieval class.
| adw wrote:
| This is very cool and very educational. Don't deploy it, though.
| :-)
|
| I needed something like this once, but at a little bit larger
| scale (few tens of thousands of documents). The answer, as
| always, was [sqlite](https://www.sqlite.org/fts5.html);
| structurally though it's just what you have here but with someone
| else writing the inverted-index persistence layer.
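|
| A minimal sketch of that setup (assumes your Python's sqlite3 is
| built with FTS5 enabled, which most distributions are):
|     import sqlite3
|     db = sqlite3.connect(":memory:")
|     db.execute("CREATE VIRTUAL TABLE docs USING fts5(url, body)")
|     db.execute("INSERT INTO docs VALUES (?, ?)",
|                ("https://example.com", "a search engine in 80 lines"))
|     # bm25() is FTS5's built-in rank; more negative = better match,
|     # so ascending ORDER BY returns the best hits first
|     rows = db.execute("SELECT url FROM docs WHERE docs MATCH ?"
|                       " ORDER BY bm25(docs)", ("search engine",)).fetchall()
|     print(rows)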
| rcarmo wrote:
| I use SQLite FTS for just about everything. Has never let me
| down.
| pshirshov wrote:
| Don't use keywords (1-grams); the best results for English can be
| achieved with 2- and 3-grams. n-grams retain context.
| marginalia_nu wrote:
| I think you'd probably get the best result with both. There's
| definitely real merit to a keyword-understanding of which terms
| appear in the title or as a named entity for example.
| pshirshov wrote:
| Usually you would want to have weighted n-grams, with 1-grams
| having the lowest weights. In many cases it's better to have
| zeros. For English, 4-grams react to generic phrases/idioms, 5+
| are way too selective, and keywords alone usually reduce
| relevance too much. 2+3 are the best.
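|
| A minimal sketch of what that looks like (the weights here are
| made up for illustration):
|     def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
|         words = text.lower().split()
|         return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
|     # weight 2- and 3-grams above bare keywords
|     weights = {1: 0.2, 2: 1.0, 3: 1.0}
|     for n, w in weights.items():
|         for gram in word_ngrams("build a small search engine", n):
|             print(w, gram)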
| cabalamat wrote:
| Looking at the code (src/microsearch/engine.py), we have:
|
|     class SearchEngine:
|         def __init__(self, k1: float = 1.5, b: float = 0.75):
|             self._index: dict[str, dict[str, int]] = \
|                 defaultdict(lambda: defaultdict(int))
|             self._documents: dict[str, str] = {}
|             self.k1 = k1
|             self.b = b
|
| I've no idea what `k1` or `b` are. Nor is there a single comment
| in the entire file. Are comments considered unfashionable these
| days?
|
| Looking at `_documents`, I'm guessing the keys are URLs and the
| values contents of those URLs, but i might be wrong.
|
| The whole thing looks like it would've been a useful resource for
| people to learn how to build search engines with, that people
| could build on, had the writer been bothered to document it. But
| as it is I'm disappointed with the poor code.
| boyter wrote:
| On a mobile device, but those are the standard weighting values
| for either TF/IDF or BM25; in this case, BM25.
|
| A comment would be useful, but they are also instantly
| recognisable to anyone familiar with the problem.
| 6510 wrote:
| > instantly recognisable to anyone familiar with the problem.
|
| I always love reading those when not familiar. It's almost as
| funny as reading something one already knows, waiting for the
| punch line...
| jffry wrote:
| That's explained in the article, which serves as the
| documentation for the code within the article. The link for
| BM25 goes to some of the math, and a little more searching the
| internet about BM25 parameters can lead you to some relevant
| articles on how to choose them.
| Barrin92 wrote:
| this trend of `a: float` always reminds me of the Rich Hickey
| "you don't want types, you want proper names" talk. I really
| hate this (Go-inspired, it feels to me) tendency toward
| undescriptive single-letter variables, with the type system
| abused as a naming assistant.
|
| Names can convey proper semantic information about what your
| program does; use them, goddammit
| alexmolas wrote:
| I agree with you that better names are always preferable. But
| type hints also work as documentation. We can have both.
|
| However, in this particular case the undescriptive names are
| like this for historical reasons. I agree these are not the
| best names, but they are the names used in the literature. If I
| were working in physics I would probably use "c" for the speed
| of light or "kb" for the Boltzmann constant, which are not very
| descriptive names either.
| planb wrote:
| This is right, but if you are implementing a formula known in
| the domain or documented elsewhere, you can (and should) use
| the letters used in the formula (in this case, b and k1)
| instead of making up names.
| sampo wrote:
| > tendency of undescriptive single letter variable
|
| There are 2 schools of thought on which one is clearer,
|
|     F = G * m1 * m2 / r**2
|
| or
|
|     force = (gravitational_constant * mass_of_body_1 * mass_of_body_2
|              / distance_between_bodies ** 2)
| trashtester wrote:
| Physicist:
|
| > F = G * m1 * m2 / r**2
|
| Computer scientist:
|
|     nonrelativistic_gravitational_force = (
|         Physics.NonRelativistic.Gravity.gravitational_constant
|         * body1.NonRelativistic.mass()
|         * body2.NonRelativistic.mass()
|         / body1.NonRelativistic.distanceTo(body2) ** 2
|     )
| ZaoLahma wrote:
| Software engineer straight out of uni:
|
|     # Get force of gravity as below to be used later
|     F = G * m1 * m2 / r**2
|
| Software engineer 5 years after uni:
|
|     gravitational_force = (
|         PhysicsContextConstructorFactory
|             .createByRelativisticEnum(SystemConfig.getRelativisticEnum())
|             .construct().getGravitationalConstant().value()
|         * body1.getRelativisticContext(
|             SystemConfig.getRelativisticEnum()).mass().value()
|         * body2.getRelativisticContext(
|             SystemConfig.getRelativisticEnum()).mass().value()
|         / SystemConfig.getRelativisticContext(
|             SystemConfig.getRelativisticEnum())
|             .distanceTo(body1, body2).value()
|         ** PlatformUnsignedInt(PlatformMinMaxAwareUnsignedInt(2)
|             .value()).getUnsignedInt().value()
|     )
|
| Software engineer after 15 years:
|
|     # Newton's law of universal gravitation is well within
|     # wanted margin of error
|     F = G * m1 * m2 / r**2
| lloeki wrote:
| Sadly the 5y SE forgot about dependency injecting the
| computation... after all we're always on the brink of new
| physics discovery, we'd hate to have to rewrite all that
| code when we can readily swap in fresh laws with the
| proper architecture!
| ZaoLahma wrote:
| Oh, but that comes as a needs-to-be-fixed to the pull
| request from the 6y SE, so no worries at all.
| cabalamat wrote:
| Short names are OK if they are explained somewhere. I often
| explain them in a comment at the start of a class or
| function.
| alexmolas wrote:
| Hi, author here.
|
| If I wanted a catchy title for the post I needed to cut the
| number of LOC as much as possible;)
|
| Joking apart, thanks for your feedback. I agree that usually
| it's better to have documentation and code together, but in
| this case since it's an educational project I decided to split
| code and documentation, and document the code in a blog post.
| cabalamat wrote:
| Fair enough
| marginalia_nu wrote:
| k1 and b are tuning parameters for the BM25 ranking
| function [1]. These are not names the OP invented. Virtually
| every implementation, and definitely every textbook, uses these
| variable names. You give them the names k1 and b because
| otherwise nobody who's into information retrieval will
| understand what they do.
|
| [1] https://en.wikipedia.org/wiki/Okapi_BM25
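|
| Concretely, they show up in the per-term score roughly like this
| (a standard BM25 formulation written out as a sketch, not the
| post's exact code):
|     import math
|     def bm25_term_score(tf: int, df: int, n_docs: int,
|                         doc_len: int, avg_doc_len: float,
|                         k1: float = 1.5, b: float = 0.75) -> float:
|         # k1 caps how much repeated occurrences of a term can add;
|         # b controls how strongly long documents are penalized
|         idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
|         return idf * tf * (k1 + 1) / (
|             tf + k1 * (1 - b + b * doc_len / avg_doc_len))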
| hipadev23 wrote:
| > chunk for chunk in chunks if chunk
|
| I believe there's a saying about a woodchuck somewhere in here
| userbinator wrote:
| _Google allows you to search for "search engine" (notice the
| double quotes) and it'll only show you results where the two
| words appear in this specific order._
|
| At least only some of the time, unfortunately. What power users
| want is "grep for the Web", not "Google, tell me what you want me
| to see."
| marginalia_nu wrote:
| > What power users want is "grep for the Web", not "Google,
| tell me what you want me to see."
|
| I can almost guarantee that nobody actually wants this. "Grep
| for the web" is strictly worse than a search engine that does
| the tiniest amount of query expansion. Google is definitely
| taking too many liberties in interpreting the query, but there
| are many things any search engine should do that will be a
| straight improvement over not doing them.
|
| The problem with Google search right now is that it's hard to
| reason about why it gives the results it does, seemingly
| because they rely too heavily on embeddings to compare strings.
| It's frustrating when "cat food" matches "dog restaurant"
| because the two were semantically close in some embedding space
| that doesn't quite align with human reasoning.
| malomalsky wrote:
| Imo, it's not fair to talk about 80 lines of code while using
| third-party libraries (feedparser, bs4, etc.)
| marginalia_nu wrote:
| If they'd built it on top of elasticsearch I'd agree with this
| sentiment, but given the actual search engine bit is
| implemented in those 80 lines, I think it's fair. The libraries
| being pulled are exactly the sort of libraries you shouldn't
| try to hand-roll.
|
| (sometimes you see those articles like "build your own search
| engine" and it's a guide on how to install searxng or yacy or
| something.)
| make3 wrote:
| it is if they're extremely general & mainstream
| alpinelogic wrote:
| For an inverted index with JavaScript/Typescript, here is my
| implementation: https://github.com/vasilionjea/lofi-dx
|
| It was a fun little project but definitely way more than 80 LOC
| :)
| lqet wrote:
| Nice! It wouldn't be much work to add fuzzy-search functionality
| (all results with a prefix edit-distance below some threshold
| delta, so that a search for "hackrnew" matches "hackernews").
| Basically, what you would do is you add an additional inverted
| index, but this time the keys are n-grams of words in your
| document collection (typically 3-grams), and the postings are the
| words (or their ID) in which these n-grams occur. There is then a
| nice lemma that basically says the following (where PED is the
| prefix edit distance, and N(x) are the n-grams of word x):
|
| If PED(x, y) <= delta, then |N(x) ∩ N(y)| >= |N(x)| - n · delta
|
| That is, x and y must have at least |N(x)| - n · delta n-grams
| in common to have a PED(x, y) less than or equal to delta.
|
| If you now have an input x, you calculate N(x) and retrieve all
| postings from the n-gram index for each n-gram of x. You can now
| merge all of these postings and get a list that looks like this:
| [worda, worda, worda, wordb, wordb, wordc] (for each n-gram that
| x and some word y have in common, you get one entry of y). If you
| merge the duplicates, you get: [(worda, 3), (wordb, 2),
| (wordc, 1)], and for each y (in this case worda, wordb, wordc),
| the number in the corresponding tuple is |N(x) ∩ N(y)|.
|
| If this number is larger than |N(x)| - n · delta, you
| explicitly compute PED(x, y) and check whether it is below your
| threshold. If the number is smaller, you can simply skip it,
| saving you large amounts of costly PED calculations.
|
| Your result is a list of words y with a PED(x, y) to the input x
| below some threshold, and you can then use this list of words to
| query your existing index.
|
| I used this approach many years ago to implement a fuzzy client-
| side JS search engine on https://dont.watch/ (if you look into
| the JS code, you can see that the inverted index and the
| (compressed) n-gram index are simply transferred in the JS file).
| The actual search engine is around 300 lines of JS, with no
| external dependencies and some very basic heuristics to improve
| the search results.
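|
| A minimal sketch of the counting/filtering step described above
| (the "$" padding is my own choice, so prefixes and suffixes also
| form n-grams):
|     from collections import Counter
|     def ngrams(word: str, n: int = 3) -> list[str]:
|         padded = f"${word}$"
|         return [padded[i:i + n] for i in range(len(padded) - n + 1)]
|     # n-gram -> words containing it
|     ngram_index: dict[str, list[str]] = {}
|     for w in ["hackernews", "hacker", "python"]:
|         for g in ngrams(w):
|             ngram_index.setdefault(g, []).append(w)
|     def fuzzy_candidates(query: str, delta: int = 1,
|                          n: int = 3) -> list[str]:
|         counts = Counter()
|         for g in ngrams(query, n):
|             counts.update(ngram_index.get(g, []))
|         threshold = len(ngrams(query, n)) - n * delta
|         # only these survivors need the expensive PED check
|         return [w for w, c in counts.items() if c >= threshold]
|     print(fuzzy_candidates("hackrnew"))  # -> ['hackernews']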
| snowstormsun wrote:
| How much does that increase the index size?
| pshirshov wrote:
| Also, forgot to add, for this approach you better use sketches
| (bloom filter/hyperminhash). Same results with lightning
| performance.
| make3 wrote:
| just fyi, PageRank and BM25 address orthogonal problems:
| reputation/popularity (PageRank) and query/semantic matching
| (BM25). Saying you're using BM25 instead of PageRank makes no
| sense.
| amai wrote:
| Why not use whoosh?
| https://whoosh.readthedocs.io/en/latest/quickstart.html
___________________________________________________________________
(page generated 2024-02-08 23:01 UTC)