[HN Gopher] Show HN: Building a web search engine from scratch w...
___________________________________________________________________
Show HN: Building a web search engine from scratch with 3B neural
embeddings
Author : wilsonzlin
Score : 260 points
Date : 2025-08-12 16:02 UTC (6 hours ago)
(HTM) web link (blog.wilsonl.in)
(TXT) w3m dump (blog.wilsonl.in)
| abraxas wrote:
| Very nice project. Do you have plans to commercialize it next?
| giancarlostoro wrote:
| This then begs the question for me, without an LLM what is the
| approach to build a search engine? Google search used to be razor
| sharp, then it degraded in the late 2000s and early 2010s and now
| it's meh. They filter out so much content for a billion different
| reasons and the results are just not what they used to be. I've
| found better results from some LLMs like Grok (surprisingly) but
| I can't understand why Google, once a razor-exact search
| engine, cannot find verbatim or near-verbatim
| quotes of content I remember seeing on the internet.
| thr0w wrote:
| I see you're also having trouble coping with this. Fact is,
| "that" internet is simply gone.
| giancarlostoro wrote:
| Nah, it's a series of tubes, just gotta get the right tubes
| together.
| andai wrote:
| My understanding was that every few months Google was forced to
| adjust their algorithms because the search results would get
| flooded by people using black hat SEO techniques. At least
| that's the excuse I heard for why it got so much worse over
| time.
|
| Not sure if that's related to it ignoring quotes and operators
| though. I'd imagine that to be a cost saving measure (and very
| rarely used, considering it keeps accusing me of being a robot
| when I do...)
|
| From what I understand, that good old Google from the 2000s was
| built entirely without any kind of machine learning. Just a
| keyword index and PageRank. Everything they added since then
| seems to have made it worse (though it did also degrade
| "organically" from the SEO spam).
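For readers unfamiliar with the PageRank half of "a keyword index and PageRank", here is a minimal sketch of the textbook power-iteration form. It is not Google's production code; the `pagerank` function, the `links` dictionary layout, and the damping value of 0.85 are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    links: dict mapping a page name to the list of pages it links to
    (a hypothetical in-memory link graph, not a real crawl format).
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

On a toy graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page `c` ends up ranked highest because three pages link to it, and `d` lowest because nothing links to it. That dependence on inbound links is also what link-farm SEO exploits.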
| giancarlostoro wrote:
| That begs the question, if you can recreate their engine from
| the 2000s with high quality search results, would investors
| even fund you? Lol
| entropie wrote:
| > if you can recreate their engine from the 2000s
|
| Seriously, how? I'm pretty sure you have to have a very
| different approach than Google had in its best times. The
| web is a very different place now
| xnx wrote:
| The majority of the public internet shifted to "SEO
| optimized" garbage while the real user-generated content
| shifted to walled gardens like Instagram, Facebook, and
| Reddit (somewhat open). More recently, even user-generated
| content is poisoned by wannabe influencers shilling some
| snake oil or scam.
| ASalazarMX wrote:
| This is my take as well. When websites were few, directories
| were awesome. When websites multiplied, Google was awesome.
| When websites became SEO trash, social networks were awesome.
| Now that social networks have become trash, I'm hoping the
| Fediverse becomes the next awesome.
|
| I don't see AI in any form becoming the next awesome.
| Imustaskforhelp wrote:
| I wish the Fediverse all the best too. I'd take this one
| step further: communities have gone through a similar
| transition, from forums to mostly Discord now, and I wish
| they would move to something like Matrix, which is federated
| (yes, I know it has issues, but trust me, sacrifices must be
| made).
|
| What are your thoughts on things like Bluesky, Nostr, and
| Matrix?
|
| Bluesky does seem centralized in its current stage but its
| idea of (pds?) makes it fundamentally hack proof in the
| sense that if you are on a server which gets hacked, then
| your account is still safe, or at least that's the plan; not
| sure about its current implementation.
|
| I also agree with AI not being the next awesome. Maybe for
| coding, sure, but not in general. Even in coding, though,
| I feel like it's good enough already, it's hard to squeeze out
| more progress from here, and it's just not worth it. But
| honestly that's just me.
| ASalazarMX wrote:
| I think BlueSky still needs to prove itself. It is what
| Twitter/X was a decade ago, before the enshittification,
| and I enjoy the content a lot, with my reservations.
|
| The weakness of Mastodon (and the Fediverse IMO), is that
| you can join one of many instances, and it becomes easier
| to form an echo chamber. Your feed will be the Fediverse
| hose (lots of irrelevant content), your local instance
| (an echo chamber), or your subscriptions (curating them
| takes effort). Nevertheless, that might as well be a
| strength I'm not truly appreciating.
| Imustaskforhelp wrote:
| I mean, both Bluesky and the Fediverse are decentralized
| technologies. Let's say you are worried about
| Bluesky "enshittening":
|
| I doubt it will happen because of its decentralized-enough
| nature.
|
| I also agree with the subscriptions curation part the
| last time I checked, but I didn't use mastodon as often
| as I used Lemmy, and it was less of an issue on Lemmy.
|
| Still, I feel like Bluesky as a technology is goated and
| doesn't feel like it can be enshittened.
|
| Nostr, on the other hand, seems to me like an echo
| chamber of crypto bros, but honestly, that's the most
| decentralization you can ever get. Shame that we are
| going to get mostly nothing meaningful out of it, IMO.
| In that case Bluesky seems good enough to me, though
| things like search in the current Bluesky are
| definitely centralized. But honestly, the same problems
| kept coming up on the Fediverse too: lemmy.world was getting
| too bloated with too many members, and even Mastodon had
| only one really famous home server, IIRC
| mastodon.social, right?
|
| Also, I may be wrong (I usually am), but IIRC Mastodon only
| allows you to comment on or interact with posts on your own
| server. I wanted to comment on a mastodon.social post from
| some other server, but I don't remember being able to do
| so. Maybe a skill issue on my side.
| mwcz wrote:
| There was a Neal Stephenson novel where curated feeds had
| become a big business because it was the only tolerable
| way to browse the Internet. Lately I've been thinking
| that's more likely to happen.
| reactordev wrote:
| This is correct. Marketing and Advertising manipulated pages
| to gain higher rankings because they figured out the
| algorithm behind it, forcing Google to change the algorithm.
| Originally, prior to the flood of <meta> garbage and hidden
| <div>s, it was very good at linking content together. Now,
| it's a weighted database.
| h2zizzle wrote:
| This has always been the explanation, but I've always
| wondered if it wasn't so much battling SEO as balancing the
| appearance of battling SEO while not killing some factor
| related to their revenue.
| masfuerte wrote:
| Google certainly had to update their algorithms to cope with
| SEO, but that's not why their results have become so poor in
| the last five years or so. They made a conscious decision to
| prioritize profit over search quality. This came out in
| internal emails that were published as part of discovery for
| one of the antitrust suits.
|
| To reiterate: Google search results are shit because shit ad-
| laden results make them more money in the short term.
|
| That's it. And it's sad that so many people continue to give
| them the benefit of the doubt when there is no doubt.
| yorwba wrote:
| When I encounter the "cannot find verbatim quote I remember"
| problem and then later find what I was looking for in some
| other way, I usually discover that I misremembered and the
| actual quote was different. I do prefer getting zero results in
| that case, though.
| msgodel wrote:
| I wish there was an old-fashioned n-gram + PageRank search
| engine for those of us who don't mind the issues the older
| Google had. I've thought about making my own a few times.
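The n-gram half of that idea is what makes the verbatim-quote searches discussed upthread possible. A minimal sketch, assuming word-level bigrams, whitespace tokenization, and an optional precomputed rank score per page; the `NgramIndex` class and its methods are hypothetical names for illustration, not any existing engine's API.

```python
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def ngrams(tokens, n):
    # All runs of n consecutive tokens, as hashable tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def contains(tokens, phrase):
    # True if `phrase` occurs as a contiguous run inside `tokens`.
    m = len(phrase)
    return any(tokens[i:i + m] == phrase for i in range(len(tokens) - m + 1))


class NgramIndex:
    def __init__(self, n=2):
        self.n = n
        self.postings = defaultdict(set)  # n-gram -> ids of docs containing it
        self.docs = {}                    # doc id -> token list

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.docs[doc_id] = tokens
        for gram in ngrams(tokens, self.n):
            self.postings[gram].add(doc_id)

    def search_phrase(self, phrase, rank=None):
        query = tokenize(phrase)
        if not query:
            return []
        if len(query) < self.n:
            # Query too short to form an n-gram: fall back to scanning.
            candidates = set(self.docs)
        else:
            # A doc can only match if it contains every n-gram of the query.
            sets = [self.postings.get(g, set()) for g in ngrams(query, self.n)]
            candidates = set.intersection(*sets)
        # Verify the exact phrase; n-grams alone can match out of order.
        hits = [d for d in candidates if contains(self.docs[d], query)]
        rank = rank or {}
        # Highest rank first; ties broken by doc id for stable output.
        return sorted(hits, key=lambda d: (-rank.get(d, 0.0), d))
```

A query either matches the exact word sequence or returns nothing, which is the "zero results when I misremembered" behaviour mentioned earlier in the thread. The `rank` argument is where a link-based score could plug in.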
| mike_hearn wrote:
| The internet itself has changed over time, and a lot of content
| has just disappeared. It shouldn't appear in search because
| it's just not there anymore, it'd be a 404.
| cosmic_cheese wrote:
| A search engine that kept dead entries but maybe put them in
| a "missing" tab or something would've been _monstrously_
| useful for me in so many situations. There have been numerous
| times I've remembered looking at something N years ago only
| for all but the faintest traces of it to have disappeared
| from the internet. With a "missing" tab I'd at least have
| former URLs, page titles, etc to work with (archive.org,
| etc).
| randomcatuser wrote:
| This is so cool. A question on the service mesh - is building
| your own typically the best way to do things?
|
| I'm new to networking..
| ccgreg wrote:
| At the end, the author thinks about adding Common Crawl data. Our
| ranking information, generated from our web graph, would probably
| be a big help in picking which pages to crawl.
|
| I love seeing the worked out example at scale -- I'm surprised at
| how cost effective the vector database was.
| jobswithgptcom wrote:
| I've been doing a smaller version of the same idea for just the
| domain of job listings. Initially I looked at HNSW but couldn't
| reason about how to scale it with predictable compute cost. I
| ended up using IVF because I am a bit memory starved. I will have
| to take a look at CoreNN.
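For context on the IVF choice, here is a minimal pure-Python sketch of an inverted-file (IVF) vector index: vectors are bucketed by nearest k-means centroid, and a query scans only the `nprobe` closest buckets. The `IVFIndex` class, the parameter names `nlist` and `nprobe`, and the toy k-means are assumptions for illustration, not the commenter's or the article's implementation.

```python
import random


def dist2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    # Component-wise mean of a non-empty list of vectors.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, iterations=10, seed=0):
    # Plain Lloyd's algorithm with k distinct data points as initial centroids.
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        cells = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist2(v, centroids[c]))
            cells[nearest].append(v)
        # Keep the old centroid if a cell ended up empty.
        centroids = [mean(cell) if cell else centroids[i]
                     for i, cell in enumerate(cells)]
    return centroids


class IVFIndex:
    def __init__(self, vectors, nlist, seed=0):
        self.vectors = vectors
        self.centroids = kmeans(vectors, nlist, seed=seed)
        # Inverted lists: one list of vector ids per centroid.
        self.cells = [[] for _ in range(nlist)]
        for idx, v in enumerate(vectors):
            cell = min(range(nlist), key=lambda c: dist2(v, self.centroids[c]))
            self.cells[cell].append(idx)

    def search(self, query, k=1, nprobe=1):
        # Rank the cells by centroid distance and scan only the closest nprobe.
        order = sorted(range(len(self.centroids)),
                       key=lambda c: dist2(query, self.centroids[c]))
        candidates = [i for c in order[:nprobe] for i in self.cells[c]]
        candidates.sort(key=lambda i: dist2(query, self.vectors[i]))
        return candidates[:k]
```

Per-query work is roughly `nlist` centroid comparisons plus the sizes of the probed cells, which is easier to budget than a graph walk. Raising `nprobe` trades time for recall, and at `nprobe == nlist` the search becomes an exact brute-force scan.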
| Imustaskforhelp wrote:
| This is really, really cool. I had earlier wanted to run all
| my searches on it, and though that seems possible, I feel like it
| would sadly waste a bit more time per search. Still, I'll maybe
| try to run some of my searches against this too and give my
| thoughts afterwards. It is hit or miss, but it will almost land
| you in the right spot, just not exactly.
|
| For example, I searched lemmy hoping to find the fediverse and it
| gave me their liberapay page though.
|
| Please, actually follow up on that Common Crawl promise, and maybe
| even archive.org or other websites too. People are spending
| billions in this AI industry, so I just hope that you can,
| whether through funding or just community crowdwork,
| actually succeed in creating such an alternative. People are
| honestly fed up with the current search engine near-monopoly.
|
| Wasn't Ecosia trying to roll out their own search engine? They
| should definitely take your help or have you on their team.
|
| I just want a decentralized search engine man, I understand that
| you want to make it sustainable and that's why you haven't open
| sourced but please, there is honestly so much money going into
| potholes doing nothing but make our society worse and this
| project almost works good enough and has insane potential...
|
| Please open source it and let's hope that the community tries to
| figure out a way around some ways of monetization/crowd funding
| to actually make it sustainable
|
| But still, I haven't read the blog post in its entirety since I
| was so excited that I just started using the search engine. But
| the article feels super in-depth, and I think this idea
| can definitely help others to create their own proof of concepts
| or actually create some open source search engine that's decent
| once and for all.
|
| Not going to lie, this feels like a little magic and I am all
| for it. The more I think about it, the more excited I get; I
| haven't been this excited about a project in months!
|
| I know open source is tough and I come from a third country but
| this is actually so cool that I will donate as much as I can /
| have of my own right now. Not much, around $50, but this is coming
| from a guy who has not spent a single penny online and wanting to
| donate to ya, please I beg ya to open source and use that common
| crawl, but I just wish you all the best wishes in your life and
| career man.
| 1gn15 wrote:
| This is incredibly, incredibly cool. Creating a search engine
| that beats Google in quality in just 2 months and less than a
| thousand dollars.
|
| Really great idea about the federated search index too! YaCy has
| it but it's really heavy and never really gave good results for
| me.
| AndrewKemendo wrote:
| That stack element is amazing
|
| I wish more people showed their whole exploded stack like that
| and in an elegant way
|
| Really well done writeup!
| lysecret wrote:
| Just wow. My greatest respect! Also an incredible write up. I
| like the take-away that an essential ingredient to a search
| engine is curated and well-filtered data (garbage in, garbage
| out). I feel like this has been a big learning of LLM training
| too: better to work with less but much higher quality data. I'm
| curious how a
| search engine would perform where all content has been judged by
| an LLM.
| throwawaylaptop wrote:
| I'm currently trying to get a friend's small business website to
| rank. I have a decent understanding of SEO, doing more
| technically correct things and did a decent amount of hand
| written content specific to local areas and services provided.
|
| Two months in, Bing still hasn't crawled the favicon. Google
| finally did after a month. I'm still getting outranked by
| tangentially related services, garbage national lead collection
| sites, yelp top 10 blog spam, and even exact service providers
| from 300 miles away that definitely don't serve the area.
|
| Something is definitely wrong with PageRank and crawling in
| general.
| mv4 wrote:
| Sadly, that ship has sailed. The web is dead. SEO should be
| called SEM (Search Engine Manipulation).
| A_Stefan wrote:
| Such a big inspiration! One of the few times where I genuinely
| read and liked the work - didn't even notice how the time flew
| by.
|
| Feels like it's more and more about consuming data & outputting
| the desired result.
| tmelm wrote:
| Incredibly cool. What a write-up. What an engineer.
| voiper1 wrote:
| Wow, looks like a tremendous commitment and depth of knowledge
| went into this one-man project. I couldn't even read the whole
| write up, I had to skim part of it. I'm super impressed.
| rkunnamp wrote:
| I couldn't get the search working (there was some CORS error).
| But what a feat and writeup. Wonderstruck!
| Flux159 wrote:
| Getting a CORS error from the API - is the demo at
| https://search.wilsonl.in/ working for anyone else?
| bstsb wrote:
| CORS error is due to the actual request failing (502 Bad
| Gateway). hug of death?
| Flux159 wrote:
| Yeah just saw the 502 - probably hug of death.
| mmargenot wrote:
| I love this and think that your write-up is fantastic, thank you
| for sharing your work in such detail.
|
| What are you thinking in terms of improving [and using] the
| knowledge graph beyond the knowledge panel on the side? If I'm
| reading this correctly, it seems like you only have knowledge
| panel results for those top results that exist in Wikipedia, is
| that correct?
| a_spy wrote:
| Man! This is incredible. It gives me motivation to continue with
| my document search engine.
| poly2it wrote:
| Very cool project!
|
| Just out of interest, I sent a query I've had difficulties
| getting good results for with major engines: "what are some good
| options for high-resolution ultrawide monitors?".
|
| The response in this engine for this query at this point seems to
| have the same fallacy as I've seen in other engines. Meta-pages
| "specialising" in broad rankings are preferred above specialist
| data about the specific sought-after item. It seems that the
| desire for a ranking weighs the most.
|
| If I were to manually try to answer this query, I would start by
| looking at hardware forums and geeky blogs, pick N candidates,
| then try to find the specifications and quirks for all products.
|
| Of course, it is difficult to generically answer if a given
| website has performed this analysis. It can be favourable to rank
| sites citing specific data higher in these circumstances.
|
| As a user, I would prefer to be presented with the initial
| sources used for assembling this analysis. Of course, this
| doesn't happen because engines don't perform this kind of bottom-
| to-top evaluation.
| sciencesama wrote:
| How much did it cost?
| dangoodmanUT wrote:
| This is really well written, especially considering the
| complexity
| dangoodmanUT wrote:
| I also love how often the author finds that traditionally
| glazed databases don't scale as you might think, and turns to
| stronger storage primitives that were designed exactly to do
| that
| de6u99er wrote:
| Thank you for sharing! This is one of the coolest articles I have
| seen in a while on HN. I did some searches and I think the search
| results looked very useful so far. I particularly loved about
| your article that most of the questions I had while reading got
| answered in a most structured way.
|
| I still have questions:
|
| * How long do you plan to keep the live demo up?
|
| * Are you planning to make the source code public?
|
| * How many hours in total did you invest into this "hobby
| project" in the two months you mentioned in your write-up?
| divineg wrote:
| It's incredible. I can't believe it but it actually works quite
| nicely.
|
| If 10K $5 subscriptions can cover its cost, maybe a community run
| search engine funded through donations isn't _that_ insane?
___________________________________________________________________
(page generated 2025-08-12 23:00 UTC)