[HN Gopher] Show HN: Building a web search engine from scratch w...
       ___________________________________________________________________
        
       Show HN: Building a web search engine from scratch with 3B neural
       embeddings
        
       Author : wilsonzlin
       Score  : 260 points
       Date   : 2025-08-12 16:02 UTC (6 hours ago)
        
 (HTM) web link (blog.wilsonl.in)
 (TXT) w3m dump (blog.wilsonl.in)
        
       | abraxas wrote:
       | Very nice project. Do you have plans to commercialize it next?
        
       | giancarlostoro wrote:
       | This then begs the question for me, without an LLM what is the
       | approach to build a search engine? Google search used to be razor
       | sharp, then it degraded in the late 2000s and early 2010s and now
       | it's meh. They filter out so much content for a billion different
       | reasons and the results are just not what they used to be. I've
       | found better results from some LLMs like Grok (surprisingly) but
       | I can't seem to understand why what was once a razor-exact search
       | engine like Google cannot find verbatim or near-verbatim quotes
       | of content I remember seeing on the internet.
        
         | thr0w wrote:
         | I see you're also having trouble coping with this. Fact is,
         | "that" internet is simply gone.
        
           | giancarlostoro wrote:
           | Nah, it's a series of tubes, just gotta get the right tubes
           | together.
        
         | andai wrote:
         | My understanding was that every few months Google was forced to
         | adjust their algorithms because the search results would get
         | flooded by people using black hat SEO techniques. At least
         | that's the excuse I heard for why it got so much worse over
         | time.
         | 
         | Not sure if that's related to it ignoring quotes and operators
         | though. I'd imagine that to be a cost saving measure (and very
         | rarely used, considering it keeps accusing me of being a robot
         | when I do...)
         | 
         | From what I understand, that good old Google from the 2000s was
         | built entirely without any kind of machine learning. Just a
         | keyword index and PageRank. Everything they added since then
         | seems to have made it worse (though it did also degrade
         | "organically" from the SEO spam).
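
       The "keyword index and PageRank" design described above can be
       sketched in a few lines. This is a toy illustration only, not
       Google's actual implementation: the page texts, the link graph, the
       damping factor of 0.85, and all function names are assumptions made
       for the example.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> outgoing links."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


def search(query, index, rank):
    """Return pages containing every query word, highest PageRank first."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches, key=lambda p: rank.get(p, 0.0), reverse=True)


# Hypothetical three-page web: "a" and "b" link to each other, "c" links to "b".
pages = {
    "a": "rust search engine tutorial",
    "b": "search engine history",
    "c": "cooking recipes",
}
links = {"a": ["b"], "b": ["a"], "c": ["b"]}

index = build_index(pages)
rank = pagerank(links)
results = search("search engine", index, rank)
```

       Here both "a" and "b" match the query, and "b" ranks first because
       it has more inbound links. Nothing in this pipeline is learned,
       which is also why it is easy to game with link farms and keyword
       stuffing.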
        
           | giancarlostoro wrote:
           | That begs the question, if you can recreate their engine from
           | the 2000s with high quality search results, would investors
           | even fund you? Lol
        
             | entropie wrote:
             | > if you can recreate their engine from the 2000s
             | 
             | Seriously, how? I'm pretty sure you have to have a very
             | different approach than google had in its best times. The
             | web is a very different place now
        
           | xnx wrote:
           | The majority of the public internet shifted to "SEO
           | optimized" garbage while the real user-generated content
           | shifted to walled gardens like Instagram, Facebook, and
           | Reddit (somewhat open). More recently, even user-generated
           | content is poisoned by wannabe influencers shilling some
           | snake oil or scam.
        
           | ASalazarMX wrote:
           | This is my take as well. When websites were few, directories
           | were awesome. When websites multiplied, Google was awesome.
           | When websites became SEO trash, social networks were awesome.
           | Now that social networks have become trash, I'm hoping the
           | Fediverse becomes the next awesome.
           | 
           | I don't see AI in any form becoming the next awesome.
        
             | Imustaskforhelp wrote:
             | I wish the Fediverse all the best too. I'd take this one step
             | further: communities have gone through a similar transition,
             | from forums to mostly Discord now, and I wish they would move
             | to something like Matrix, which is federated (yes, I know it
             | has issues, but trust me, sacrifices must be made).
             | 
             | What are your thoughts on things like bluesky/nostr and
             | (matrix) too.
             | 
             | Bluesky does seem centralized in its current stage but its
             | idea of (pds?) makes it fundamentally hack proof in the
             | sense that if you are on a server which gets hacked, then
             | your account is still safe or at least that's the plan, not
             | sure about its current implementation.
             | 
             | I also agree with AI not being the next awesome. Maybe for
             | coding sure, but not in general yeah. But even in coding
             | man, I feel like it's good enough and it's hard to catch more
             | progress from now on and it's just not worth it but honestly
             | that's just me.
        
               | ASalazarMX wrote:
               | I think BlueSky still needs to prove itself. It is what
               | Twitter/X was a decade ago, before the enshittification,
               | and I enjoy the content a lot, with my reservations.
               | 
               | The weakness of Mastodon (and the Fediverse IMO), is that
               | you can join one of many instances, and it becomes easier
               | to form an echo chamber. Your feed will be the Fediverse
               | hose (lots of irrelevant content), your local instance
               | (an echo chamber), or your subscriptions (curating them
               | takes effort). Nevertheless, that might be as well a
               | strength I'm not truly appreciating.
        
               | Imustaskforhelp wrote:
               | I mean both bluesky and fediverse are just decentralized
               | technologies, so let's say that you are worried about
               | bluesky "enshittening"
               | 
               | I doubt it to happen because of its decentralized-enough
               | nature.
               | 
               | I also agree with the subscriptions curation part the
               | last time I checked, but I didn't use mastodon as often
               | as I used lemmy and it was less of an issue on lemmy.
               | 
               | Still, I feel like bluesky as a technology is goated and
               | doesn't feel like it can be enshittened.
               | 
               | Nostr on the other hand does seem to me as an echo
               | chamber of crypto bros but honestly, that's the most
               | decentralization as you can ever get. Shame that we are
               | going to get mostly nothing meaningful out of it imo.
               | Which in that case bluesky seems to me as good enough but
               | things like search etc. / the current bluesky is
               | definitely centralized but honestly the same problems
               | kept coming up on fediverse too, lemmy.world was getting
               | too bloated with too many members and even mastodon had
               | only one really famous home server afaik iirc
               | mastodon.social right?
               | 
               | Also I may be wrong, I usually am but iirc mastodon only
               | allows you to comment/ interact with posts on your own
               | server like, I wanted to comment on mastodon.social from
               | some other server but I don't remember being able to do
               | so, maybe skill issue from my side.
        
               | mwcz wrote:
               | There was a Neal Stephenson novel where curated feeds had
               | become a big business because it was the only tolerable
               | way to browse the Internet. Lately I've been thinking
               | that's more likely to happen.
        
           | reactordev wrote:
           | This is correct. Marketing and Advertising manipulated pages
           | to gain higher rankings because they figured out the
           | algorithm behind it. Forcing Google to change the algorithm.
           | Originally, prior to the flood of <meta> garbage and hidden
           | <div>'s it was very good at linking content together. Now,
           | it's a weighted database.
        
           | h2zizzle wrote:
           | This has always been the explanation, but I've always
           | wondered if it wasn't so much battling SEO as balancing the
           | appearance of battling SEO while not killing some factor
           | related to their revenue.
        
           | masfuerte wrote:
           | Google certainly had to update their algorithms to cope with
           | SEO, but that's not why their results have become so poor in
           | the last five years or so. They made a conscious decision to
           | prioritize profit over search quality. This came out in
           | internal emails that were published as part of discovery for
           | one of the antitrust suits.
           | 
           | To reiterate: Google search results are shit because shit ad-
           | laden results make them more money in the short term.
           | 
           | That's it. And it's sad that so many people continue to give
           | them the benefit of the doubt when there is no doubt.
        
         | yorwba wrote:
         | When I encounter the "cannot find verbatim quote I remember"
         | problem and then later find what I was looking for in some
         | other way, I usually discover that I misremembered and the
         | actual quote was different. I do prefer getting zero results in
         | that case, though.
        
         | msgodel wrote:
         | I wish there was an old fashioned n-gram + page rank search
         | engine for those of us who don't mind the issues the older
         | Google had. I've thought about making my own a few times.
        
         | mike_hearn wrote:
         | The internet itself has changed over time, and a lot of content
         | has just disappeared. It shouldn't appear in search because
         | it's just not there anymore, it'd be a 404.
        
           | cosmic_cheese wrote:
           | A search engine that kept dead entries but maybe put them in
           | a "missing" tab or something would've been _monstrously_
           | useful for me in so many situations. There's been numerous
           | times I've remembered looking at something N years ago only
           | for all but the faintest traces of it to have disappeared
           | from the internet. With a "missing" tab I'd at least have
           | former URLs, page titles, etc to work with (archive.org,
           | etc).
        
       | randomcatuser wrote:
       | This is so cool. A question on the service mesh - is building
       | your own typically the best way to do things?
       | 
       | I'm new to networking..
        
       | ccgreg wrote:
       | At the end, the author thinks about adding Common Crawl data. Our
       | ranking information, generated from our web graph, would probably
       | be a big help in picking which pages to crawl.
       | 
       | I love seeing the worked out example at scale -- I'm surprised at
       | how cost effective the vector database was.
        
       | jobswithgptcom wrote:
       | I've been doing a smaller version of the same idea for just the
       | domain of job listings. Initially I looked at HNSW but couldn't
       | reason about how to scale it with predictable compute time cost.
       | I ended up using IVF because I am a bit memory starved. I will
       | have to take a look at CoreNN.
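
       For readers unfamiliar with IVF (an inverted-file index), a minimal
       pure-Python sketch follows. It is not the commenter's code: the
       2-D vectors, the choice of 2 clusters, the naive k-means seeding,
       and every function name are assumptions for illustration. Vectors
       are grouped under their nearest centroid, and a query scans only
       the `nprobe` closest groups, so the work per query is roughly the
       number of centroids plus the size of the probed lists.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: dist(centroids[i], v))


def train_centroids(vectors, k, iterations=10):
    """Plain k-means, seeded with the first k vectors (a simplification)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(centroids, v)].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def build_ivf(vectors, centroids):
    """One inverted list of vector ids per centroid."""
    lists = [[] for _ in centroids]
    for vid, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, top_k=3, nprobe=1):
    """Scan only the nprobe closest lists, then rank those candidates."""
    order = sorted(range(len(centroids)), key=lambda i: dist(centroids[i], query))
    candidates = [vid for i in order[:nprobe] for vid in lists[i]]
    candidates.sort(key=lambda vid: dist(vectors[vid], query))
    return candidates[:top_k]


# Hypothetical data: two well-separated clusters of three points each.
vectors = [(0.0, 0.0), (10.0, 10.0), (0.1, 0.2),
           (9.8, 10.1), (0.2, 0.1), (10.2, 9.9)]
centroids = train_centroids(vectors, k=2)
lists = build_ivf(vectors, centroids)
hits = ivf_search((0.0, 0.1), vectors, centroids, lists, top_k=2, nprobe=1)
```

       Raising `nprobe` trades speed for recall: a true neighbour that
       sits in an unprobed list is simply missed. Unlike HNSW there is no
       graph of neighbour links to keep in memory, which is one plausible
       reason IVF suits a memory-constrained setup.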
        
       | Imustaskforhelp wrote:
       | This is really really cool. I had earlier wanted to entirely run
       | my searches on it and though that seems possible, I feel like it
       | would be sadly a little bit more waste of time in terms of
       | searches but still I'll maybe try to run some of my searches
       | against this too and give my thoughts on this after doing
       | something like this if I could, like, it is a big hit or miss but
       | it will almost land you to the right spot, like not exactly.
       | 
       | For example, I searched lemmy hoping to find the fediverse and it
       | gave me their liberapay page though.
       | 
       | Please, actually follow up on that common crawl promise and maybe
       | even archive.org or other websites too and I hope that people are
       | spending billions in this AI industry, I just hope that you can
       | whether even through funding or just community crowdwork,
       | actually succeed in creating such an alternative. People are
       | honestly fed up with the current search engine almost monopoly.
       | 
       | Wasn't Ecosia trying to roll out their own search engine? They
       | should definitely take your help or have you in their team..
       | 
       | I just want a decentralized search engine man, I understand that
       | you want to make it sustainable and that's why you haven't open
       | sourced but please, there is honestly so much money going into
       | potholes doing nothing but make our society worse and this
       | project almost works good enough and has insane potential...
       | 
       | Please open source it and let's hope that the community tries to
       | figure out a way around some ways of monetization/crowd funding
       | to actually make it sustainable
       | 
       | But still, I haven't read the blog post in its entirety since I
       | was so excited that I just started using the search engine... But
       | I feel like the article feels super indepth and that this idea
       | can definitely help others to create their own proof of concepts
       | or actually create some open source search engine that's decent
       | once and for all.
       | 
       | Not going to lie, but this feels like a little magic and I am all
       | for it. I have never been this excited the more I think about it
       | of such projects in actual months!
       | 
       | I know open source is tough and I come from a third country but
       | this is actually so cool that I will donate ya as much as I can /
       | have for my own right now. Not much around 50$ but this is coming
       | from a guy who has not spent a single penny online and wanting to
       | donate to ya, please I beg ya to open source and use that common
       | crawl, but I just wish you all the best wishes in your life and
       | career man.
        
       | 1gn15 wrote:
       | This is incredibly, incredibly cool. Creating a search engine
       | that beats Google in quality in just 2 months and less than a
       | thousand dollars.
       | 
       | Really great idea about the federated search index too! YaCy has
       | it but it's really heavy and never really gave good results for
       | me.
        
       | AndrewKemendo wrote:
       | That stack element is amazing
       | 
       | I wish more people showed their whole exploded stack like that
       | and in an elegant way
       | 
       | Really well done writeup!
        
       | lysecret wrote:
       | Just wow. My greatest respect! Also an incredible write up. I
       | like the take-away that an essential ingredient to a search
       | engine is curated and well filtered data (garbage in garbage out)
       | I feel like this has been a big learning of the LLM training too,
       | rather work with less much higher quality data. I'm curious how a
       | search engine would perform where all content has been judged by
       | an LLM.
        
         | throwawaylaptop wrote:
         | I'm currently trying to get a friend's small business website to
         | rank. I have a decent understanding of SEO, doing more
         | technically correct things and did a decent amount of hand
         | written content specific to local areas and services provided.
         | 
         | Two months in, Bing still hasn't crawled the favicon. Google
         | finally did after a month. I'm still getting outranked by
         | tangentially related services, garbage national lead collection
         | sites, Yelp top 10 blog spam, and even exact service providers
         | from 300 miles away that definitely don't serve the area.
         | 
         | Something is definitely wrong with pagerank and crawling in
         | general.
        
           | mv4 wrote:
           | Sadly, that ship has sailed. The web is dead. SEO should be
           | called SEM (Search Engine Manipulation).
        
       | A_Stefan wrote:
       | Such a big inspiration! One of the few times where I genuinely
       | read and liked the work - didn't even notice how the time flew
       | by.
       | 
       | Feels like it's more and more about consuming data & outputting
       | the desired result.
        
       | tmelm wrote:
       | Incredibly cool. What a write-up. What an engineer.
        
       | voiper1 wrote:
       | Wow, looks like a tremendous commitment and depth of knowledge
       | went into this one-man project. I couldn't even read the whole
       | write up, I had to skim part of it. I'm super impressed.
        
       | rkunnamp wrote:
       | I couldn't get the search working (there was some CORS error).
       | But what a feat and writeup. Wonderstruck!
        
       | Flux159 wrote:
       | Getting a CORS error from the API - is the demo at
       | https://search.wilsonl.in/ working for anyone else?
        
         | bstsb wrote:
         | CORS error is due to the actual request failing (502 Bad
         | Gateway). hug of death?
        
           | Flux159 wrote:
           | Yeah just saw the 502 - probably hug of death.
        
       | mmargenot wrote:
       | I love this and think that your write-up is fantastic, thank you
       | for sharing your work in such detail.
       | 
       | What are you thinking in terms of improving [and using] the
       | knowledge graph beyond the knowledge panel on the side? If I'm
       | reading this correctly, it seems like you only have knowledge
       | panel results for those top results that exist in Wikipedia, is
       | that correct?
        
       | a_spy wrote:
       | Man! This is incredible.. It gives me motivation to continue with my
       | document search engine..
        
       | poly2it wrote:
       | Very cool project!
       | 
       | Just out of interest, I sent a query I've had difficulties
       | getting good results for with major engines: "what are some good
       | options for high-resolution ultrawide monitors?".
       | 
       | The response in this engine for this query at this point seems to
       | have the same fallacy as I've seen in other engines. Meta-pages
       | "specialising" in broad rankings are preferred above specialist
       | data about the specific sought-after item. It seems that the
       | desire for a ranking weighs the most.
       | 
       | If I were to manually try to answer this query, I would start by
       | looking at hardware forums and geeky blogs, pick N candidates,
       | then try to find the specifications and quirks for all products.
       | 
       | Of course, it is difficult to generically answer if a given
       | website has performed this analysis. It can be favourable to rank
       | sites citing specific data higher in these circumstances.
       | 
       | As a user, I would prefer to be presented with the initial
       | sources used for assembling this analysis. Of course, this
       | doesn't happen because engines don't perform this kind of bottom-
       | to-top evaluation.
        
       | sciencesama wrote:
       | how much did it cost ?
        
       | dangoodmanUT wrote:
       | This is really well written, especially considering the
       | complexity
        
         | dangoodmanUT wrote:
         | I also love how often the author finds that traditionally
         | glazed databases don't scale as you might think, and turns to
         | stronger storage primitives that were designed exactly to do
         | that
        
       | de6u99er wrote:
       | Thank you for sharing! This is one of the coolest articles I have
       | seen in a while on HN. I did some searches and I think the search
       | results looked very useful so far. What I particularly loved about
       | your article is that most of the questions I had while reading got
       | answered in a very structured way.
       | 
       | I still have questions:
       | 
       | * How long do you plan to keep the live demo up?
       | 
       | * Are you planning to make the source code public?
       | 
       | * How many hours in total did you invest into this "hobby
       | project" in the two months you mentioned in your write-up?
        
       | divineg wrote:
       | It's incredible. I can't believe it but it actually works quite
       | nicely.
       | 
       | If 10K $5 subscriptions can cover its cost, maybe a community run
       | search engine funded through donations isn't _that_ insane?
        
       ___________________________________________________________________
       (page generated 2025-08-12 23:00 UTC)