[HN Gopher] Why Writing Your Own Search Engine Is Hard (2004)
___________________________________________________________________
Why Writing Your Own Search Engine Is Hard (2004)
Author : georgehill
Score : 86 points
Date : 2022-07-23 17:34 UTC (5 hours ago)
(HTM) web link (queue.acm.org)
(TXT) w3m dump (queue.acm.org)
| ldjkfkdsjnv wrote:
| Theory I have:
|
| Text search on the web will slowly die. People will search
| video-based content, and use the fact that a human spoke the
| information, as well as comments/upvotes, to vet it as
| trustworthy material. Google search as we know it will slowly
| die, declining the way Facebook has. TikTok will steal search
| market share as its video clips come to span all of human life.
| xnx wrote:
| Returning text results in response to queries will continue to
| decline in favor of returning answers and synthesized responses
| directly. I don't want Google to point me to a page that
| contains the answer somewhere, when it could provide me an even
| better summary based on thousands of related pages it has read.
| ldjkfkdsjnv wrote:
| Right, but the main flaw with Google is that people
| increasingly don't trust the result, whether it is
| synthesized or not. And Google is in the adversarial
| position of wanting to censor certain answers as well as
| present answers that maximize its own revenue. An
| alternative (like video-based TikTok) will arise and crush
| them eventually.
| wizofaus wrote:
| Doesn't mention the hardest part I found when developing a
| crawler: dealing with pages whose content is mostly dynamic and
| generated client-side (SPAs). Even using V8 it's hard to do
| reliably and performantly at scale.
| sanjayts wrote:
| > Doesn't mention the hardest part ... dealing with pages whose
| content is mostly dynamic and generated client-side (SPAs)
|
| Given this is from 2004 I'm not surprised.
| wizofaus wrote:
| That was about when I was writing my crawler (not for search
| but for rules-based analysis). Even in 2004 a lot of key DOM
| elements were created/modified client side.
| wizofaus wrote:
| Though I do remember now that we solved it with a separate
| mechanism for pages that required logging in or did
| significant client-side rendering: the user could record a
| macro that was then played back in a headless browser.
| Within a few years, though, it was obvious a crawler would
| need to handle client scripts automatically.
| [deleted]
| wolfgang42 wrote:
| I've been puttering away at making a search engine of my own (I
| should really do a Show HN sometime); let's see how my experience
| compares with 18 years ago:
|
| Bandwidth: This is now also cheap; my residential service is 1
| Gbit. However, the suggestion to wait until you've got indexing
| working well before optimizing crawling is IMO still spot-on;
| trying to make a polite, performant crawler that can deal with
| all the bizarre edge cases
| (https://memex.marginalia.nu/log/32-bot-apologetics.gmi) on the
| Web will drag you down. (I bypassed this problem by starting with
| the Stack Exchange data dumps and Wikipedia crawls, which are a
| lot more consistent than trying to deal with random websites.)
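|
| For a flavor of what "polite" means, here's a rough sketch
| (Python; the URLs and user agent are placeholders): respect
| robots.txt and keep a fixed delay per host. A real crawler
| also needs retries, canonicalization, and all the edge cases
| from the link above on top of this.
|
| # Minimal polite-crawler sketch: obey robots.txt, rate-limit per host.
| # The example URLs and user agent are placeholders.
| import time
| import urllib.robotparser
| from urllib.parse import urlparse
| from urllib.request import urlopen
|
| USER_AGENT = "MyToyCrawler/0.1"
| CRAWL_DELAY = 5.0             # seconds between hits to the same host
| last_hit = {}                 # host -> time of last request
| robots = {}                   # host -> parsed robots.txt
|
| def allowed(url):
|     host = urlparse(url).netloc
|     if host not in robots:
|         rp = urllib.robotparser.RobotFileParser()
|         rp.set_url(f"https://{host}/robots.txt")
|         try:
|             rp.read()
|         except OSError:
|             pass              # fetch failed: can_fetch() returns False
|         robots[host] = rp
|     return robots[host].can_fetch(USER_AGENT, url)
|
| def fetch(url):
|     host = urlparse(url).netloc
|     wait = CRAWL_DELAY - (time.time() - last_hit.get(host, 0))
|     if wait > 0:
|         time.sleep(wait)      # be polite: never hammer one host
|     last_hit[host] = time.time()
|     return urlopen(url, timeout=30).read()
|
| for url in ["https://example.com/", "https://example.com/about"]:
|     if allowed(url):
|         page = fetch(url)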
|
| CPU: Computers are _really_ fast now; I'm using a 2-core computer
| from 2014 and it does what I need just fine.
|
| Disk: SATA is the new thing now, of course, but the difference
| these days is HDD vs SSD. SSD is faster: but you can design your
| architecture so that this mostly doesn't matter, and even a
| "slow" HDD will be running at capacity. (The trick is to do
| linear streaming as much as possible, and avoid seeks at all
| costs.) Still, it's probably a good idea to store your production
| index on an SSD, and it's useful for intermediate data as well;
| by happenstance more than design I have a large HDD and a small
| SSD and they balance each other nicely.
|
| Storing files: 100% agree with this section, for the disk-seek
| reasons I mention above. Also, pages from the same website often
| compress very well against each other (since they're using the
| same templates, large chunks of HTML can be squished down
| considerably), so if you're pressed for space consider storing
| one GZIPped file per domain. (The tradeoff with zipping is that
| you can't arbitrarily seek, but ideally you've designed things so
| you don't need to do that anyway.) Also, WARC is a standard file
| format that has a lot of tooling for this exact use case.
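|
| A minimal sketch of the one-file-per-domain idea (Python
| stdlib only; the (domain, url, html) tuples would come from
| your own crawler): one gzip stream per domain keeps writes
| linear and lets the repeated template HTML compress against
| itself.
|
| # Sketch: write all of a domain's pages into one gzip stream.
| import gzip
| import json
| from collections import defaultdict
|
| def store(pages, out_dir="./store"):
|     # pages: iterable of (domain, url, html) from your crawler
|     by_domain = defaultdict(list)
|     for domain, url, html in pages:
|         by_domain[domain].append((url, html))
|     for domain, items in by_domain.items():
|         # One stream per domain: similar pages from the same site
|         # compress well against each other, and writes stay linear.
|         path = f"{out_dir}/{domain}.jsonl.gz"
|         with gzip.open(path, "wt", encoding="utf-8") as f:
|             for url, html in items:
|                 f.write(json.dumps({"url": url, "html": html}) + "\n")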
|
| Networking: I skipped this by just storing everything on one
| computer; I expect to be able to continue doing this for a long
| time, since vertical scaling can get you _very_ far these days.
|
| Indexing: You basically don't need to write _anything_ to get
| started with this these days! I'm just using bog-standard
| Elasticsearch with some glue code to do html2text; it's working
| fine and took all of an afternoon to set up from scratch. (That
| said, I'm not sure I'll _continue_ using Elastic: it has a ton of
| features I don't need, which makes it very hard to understand and
| work with since there's so much that's irrelevant to me. I'm
| probably going to switch to either straight Lucene or Bleve
| soon.)
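|
| For flavor, a minimal sketch of that glue code (assuming a
| local Elasticsearch, the official Python client, and
| BeautifulSoup for the html2text step; the "pages" index and
| its fields are names I made up):
|
| # Sketch: strip HTML to text and index it into Elasticsearch.
| from bs4 import BeautifulSoup
| from elasticsearch import Elasticsearch
|
| es = Elasticsearch("http://localhost:9200")
|
| def index_page(url, html):
|     soup = BeautifulSoup(html, "html.parser")
|     es.index(index="pages", document={
|         "url": url,
|         "title": soup.title.string if soup.title else "",
|         "text": soup.get_text(" ", strip=True),  # crude html2text
|     })
|
| def search(q):
|     resp = es.search(index="pages", query={"match": {"text": q}})
|     return [hit["_source"] for hit in resp["hits"]["hits"]]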
|
| Page rank: I added pagerank very early on in the hopes that it
| would improve my results, and I'm not really sure how helpful it
| is if your results aren't decent to begin with. However, the
| march of Moore's law has made it an easy experiment: what Page
| and Brin's server could compute in a week with carefully
| optimized C code, mine can do in less than 5 minutes (!) with a
| bit of JavaScript.
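|
| The algorithm itself is only a few lines of power iteration;
| here's the same idea as a Python sketch (mine is in
| JavaScript, and the toy link graph below is made up):
|
| # Power-iteration PageRank over a tiny made-up link graph.
| def pagerank(links, damping=0.85, iterations=50):
|     # links: {page: [pages it links to]}
|     pages = list(links)
|     n = len(pages)
|     rank = {p: 1.0 / n for p in pages}
|     for _ in range(iterations):
|         new = {p: (1.0 - damping) / n for p in pages}
|         for p, outs in links.items():
|             if not outs:
|                 for q in pages:          # dangling page: spread evenly
|                     new[q] += damping * rank[p] / n
|             else:
|                 for q in outs:
|                     new[q] += damping * rank[p] / len(outs)
|         rank = new
|     return rank
|
| graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
| print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))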
|
| Serving: Again, ElasticSearch will solve this entire problem for
| you (at least to start with); all your frontend has to do is take
| the JSON result and poke it into an HTML template.
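|
| A sketch of how thin that layer can be (Flask here; same
| assumptions as the indexing snippet above, with a made-up
| inline template):
|
| # Sketch: query Elasticsearch and poke the hits into a template.
| from elasticsearch import Elasticsearch
| from flask import Flask, request, render_template_string
|
| es = Elasticsearch("http://localhost:9200")
| app = Flask(__name__)
|
| TEMPLATE = """
| <form><input name="q" value="{{ q }}"><button>Search</button></form>
| <ul>{% for hit in hits %}
|   <li><a href="{{ hit.url }}">{{ hit.title or hit.url }}</a></li>
| {% endfor %}</ul>
| """
|
| @app.route("/")
| def search_page():
|     q = request.args.get("q", "")
|     hits = []
|     if q:
|         resp = es.search(index="pages", query={"match": {"text": q}})
|         hits = [h["_source"] for h in resp["hits"]["hits"]]
|     return render_template_string(TEMPLATE, q=q, hits=hits)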
|
| It's easier than ever to start building a search engine in your
| own home; the recent explosion of such services (as seen on HN)
| is an indicator of the feasibility, and the rising complaints
| about Google show that the demand is there. Come and join us, the
| water's fine!
| boyter wrote:
| Please do write about it and your thinking behind it. There is
| so little out there written in the space.
| t_mann wrote:
| Would be interesting to see stats from that time on how many
| people were working on search engines and how it turned out for
| them. Did they end up getting acquired, at least funded for a
| while, exited, or just bootstrap themselves until they realized
| there would only be one winner?
| boyter wrote:
| Glad to see this on the front page. One of those posts I reread
| every now and then. Better yet, it's written by Anna Patterson,
| who in addition to the search engines mentioned at the bottom
| wrote chunks of Cuil (interesting even if it failed) and has
| worked on parts of Google's index both before Cuil and, I
| think, now.
|
| Sadly it's a little out of date. I'd love to see a more modern
| post by someone. Perhaps the authors of Mojeek, Right Dao, or
| someone else running their own custom index. Heck, I'd pay for
| something by Matt Wells of Gigablast or those behind Blekko. The
| whole space is so secretive that for those really interested in
| it, only crumbs of information are ever released.
|
| If you are into this space or just curious, the videos about
| BitFunnel, which forms part of the Bing index, are an excellent
| watch: https://www.youtube.com/watch?v=1-Xoy5w5ydM and
| https://www.clsp.jhu.edu/events/mike-hopcroft-microsoft/#.YT...
| Xeoncross wrote:
| Yeah, there are certainly more problems these days. For one, the
| web is larger and more of it is spam, which causes problems for
| pure PageRank: you have to detect networks of sites that heavily
| link to each other.
|
| Important sites have a bunch of anti-crawling detection set up
| (especially news sites). Even worse, the best user-generated
| content is behind walled gardens in Facebook groups, Slack
| channels, Quora threads, etc...
|
| The rest of the good sites are JavaScript-heavy and you often
| have to run headless Chrome to render the page and find the
| content - but that is detectable, so you end up renting IPs from
| mobile number farms or trying to build your own 4G network.
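|
| The rendering step itself is the easy part - roughly this
| Playwright sketch (the URL is a placeholder); it's the
| detection arms race around it that hurts:
|
| # Sketch: render a JS-heavy page with headless Chromium, then
| # hand the final DOM to your normal HTML parser.
| from playwright.sync_api import sync_playwright
|
| def render(url):
|     with sync_playwright() as p:
|         browser = p.chromium.launch(headless=True)
|         page = browser.new_page(user_agent="MyToyCrawler/0.1")
|         page.goto(url, wait_until="networkidle", timeout=30000)
|         html = page.content()   # DOM after client-side scripts ran
|         browser.close()
|         return html
|
| print(len(render("https://example.com/")))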
|
| On the upside, https://commoncrawl.org/ now exists and makes the
| prototype crawling work much easier. It's not the full internet,
| but it gives you plenty to work with and test against, so you
| can skip ahead to figuring out whether you could produce
| anything useful before you actually try to crawl the whole
| internet.
| ArrayBoundCheck wrote:
| I don't know how people can use the data. There's so much of
| it! I don't see any hard drives that are 80TB. It seems like
| people would need some kind of RAID setup that can handle
| 200+TB of uncompressed data.
| francoismassot wrote:
| A search index is often made of smaller independent pieces,
| usually called segments. So you can download and process the
| data progressively on a local machine, upload the resulting
| segments to object storage, and run queries on them. That's
| what we did for this project:
| https://quickwit.io/blog/commoncrawl
|
| Also an interesting blog post here:
| https://fulmicoton.com/posts/commoncrawl/
| Xeoncross wrote:
| You don't need to download the whole thing. You can parse the
| WARC files from S3 to only extract the information you want
| (like pages with content). It's a lot smaller when you only
| keep the links and text.
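|
| Roughly like this, assuming the warcio library and one
| downloaded (or streamed) segment; the filename is a
| placeholder:
|
| # Sketch: pull just URL + HTML out of a Common Crawl WARC file.
| from warcio.archiveiterator import ArchiveIterator
|
| with open("CC-MAIN-example.warc.gz", "rb") as stream:
|     for record in ArchiveIterator(stream):
|         if record.rec_type != "response":
|             continue              # skip request/metadata records
|         url = record.rec_headers.get_header("WARC-Target-URI")
|         ctype = record.http_headers.get_header("Content-Type") or ""
|         if "text/html" not in ctype:
|             continue
|         html = record.content_stream().read()
|         # ...extract links and text here, drop the raw HTML...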
| nonrandomstring wrote:
| > but that is detectable, so you end up renting IPs from mobile
| number farms or trying to build your own 4G network.
|
| Something is deeply wrong with such an adversarial ecosystem.
| If sites don't want to be found and indexed why go to any
| effort to include them? On the other hand there are millions of
| small sites out there keen to be found.
|
| The established idea of a "search engine" seems stuck, limited
| and based on some 90's technology that worked on a 90's web
| that no longer exists. Surely after 30 years we can build some
| kind of content discovery layer on top of what's out there?
| noncoml wrote:
| They don't want to be indexed unless you are Google
| amelius wrote:
| > Something is deeply wrong with such an adversarial
| ecosystem. If sites don't want to be found and indexed why go
| to any effort to include them?
|
| I think it is not about being found. It is more about being
| copied.
|
| These sites are afraid their content will be stolen, so they
| only allow Google to crawl them.
| jonhohle wrote:
| Maybe we need a categorized, hand curated directory of sites
| that users can submit their own sites to for inclusion and
| categorization. Maybe like an open directory. Perhaps Mozilla
| could operate it, or maybe Yahoo!
| groffee wrote:
| With Goggles[0] (goggles/googles/potato/potato) you can get
| them. Curated lists by topic.
|
| [0] https://search.brave.com/help/goggles
| noduerme wrote:
| I know, right? Imagine if you went to the front page of
| Yahoo! and it was like a curated directory of websites.
| Like... a _portal_.
|
| It could look something like this:
| https://web.archive.org/web/20000302042007/http://www1.yahoo...
| wongarsu wrote:
| We could also make a website where people can submit
| links to great websites they find, and also allow them to
| vote on the submissions of other users. That way you have
| a page filled with the best links, as determined by
| users. Maybe call it "the homepage of the internet".
|
| You could even add the ability to discuss these links,
| and add a similar voting system to those discussions.
| zeroonetwothree wrote:
| Wow this gave me such an overwhelming feeling of
| nostalgia. I really miss the early years of the web.
| Xeoncross wrote:
| https://blogsurf.io/ is an example of a small search engine
| that just stuck to a directory of known blogs instead of
| indexing the big sites or randomly crawling the web and
| ending up with mostly gibberish pages from all the spam
| sites.
| mannyistyping wrote:
| Thank you for sharing this! I read through the site's About
| page and I really enjoy how the creator wanted to stick to a
| specific area for quality over quantity.
| altdataseller wrote:
| >> If sites don't want to be found and indexed why go to any
| effort to include them? On the other hand there are millions
| of small sites out there keen to be found.
|
| Then they should treat all bots equally and block Google as
| well. If they block Google as well, then yes, we should leave
| them alone.
|
| Why give Google preferential treatment? That's anti-competitive
| behavior and it just prevents new search engines from being
| created.
| nonrandomstring wrote:
| I think I understand, combined with jeffbee's answer, that
| these sites are behaving selectively according to who you
| are. So we're back to "No Blacks or Irish" on the 2022
| Internet?
|
| What do you think they have against smaller search engines?
| I can't quite fathom the motives.
| wolfgang42 wrote:
| There are a lot of crawlers out there, and many of them
| are ill-behaved. When GoogleBot crawls your site, you get
| more visitors. When FizzBuzzBot/0.1.3 comes along, you're
| more likely to get an overloaded server, weird URLs in
| your crash logs, spam, or any other manner of mess.
|
| Small search engines getting blocked is just collateral
| damage from websites trying to solve this problem with a
| blunt ban-hammer.
| jeffbee wrote:
| I think that is not what they mean. I think what they meant
| is the site will detect your headless robot and serve it good
| content, while serving spam and malware to everyone else. The
| crawlers need their own distributed networks of unrelated
| addresses to prevent or detect this behavior.
| thanksgiving wrote:
| > Something is deeply wrong with such an adversarial
| ecosystem. If sites don't want to be found and indexed why go
| to any effort to include them? On the other hand there are
| millions of small sites out there keen to be found.
|
| I work on a small-to-medium e-commerce website and my code
| just... sucks. I kind of don't want to admit it, but it is
| true. When some Chinese search engine tries to crawl all the
| product detail pages during the day (presumably at night for
| them?), it slows the site to a crawl. Technically I should have
| the pages set up so requests can't pierce through the
| Cloudflare cache, but it is easier to just ask Cloudflare to
| challenge a user (with a CAPTCHA?) if there are more than n (I
| think currently set to something small like ten) requests per
| second from any single source.
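|
| (The rule itself is basically just a per-source sliding
| window - here's a rough Python sketch of the idea, not actual
| Cloudflare configuration:)
|
| # Rough sketch of a "more than n requests per second from one
| # source" check, the kind of rule the challenge stands in for.
| import time
| from collections import defaultdict, deque
|
| LIMIT = 10        # requests allowed...
| WINDOW = 1.0      # ...per this many seconds, per source IP
| hits = defaultdict(deque)   # ip -> timestamps of recent requests
|
| def should_challenge(ip):
|     now = time.time()
|     q = hits[ip]
|     while q and now - q[0] > WINDOW:
|         q.popleft()           # drop requests outside the window
|     q.append(now)
|     return len(q) > LIMIT     # over the limit -> show a CAPTCHA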
|
| I don't understand all the business decisions but yeah, I'd
| suspect the biggest reason is we simply have poor codebases
| and can't spend too much time fixing this while we have so
| many backlog items from marketing to work on...
| [deleted]
| ALittleLight wrote:
| Why are page loads so slow or demanding? I can't imagine
| how a web crawler could be DoS'ing you if it's in good
| faith. What is the TPS? What caching are you doing? What's
| your stack like?
| Gh0stRAT wrote:
| Not GP, but from having run a small/niche search engine
| that got hammered by a crawler in the past:
|
| Webserver was a single VM running a Java + Spring
| webserver in Tomcat, connecting to an overworked Solr
| cluster to do the actual faceted searching.
|
| Caches kept most page loads for organic traffic within
| respectable bounds, but the crawler destroyed our cache
| hit rate when it was scraping our site and at one point
| did exhaust a concurrent connection limit of some kind
| because there were so many slow/timing-out requests in
| progress at the same time.
| ALittleLight wrote:
| I would expect that a small to medium e-commerce site
| would cache all their pages.
| Xeoncross wrote:
| There isn't a one-size-fits-all approach, but I've never worked
| on a project that encompasses as many computer science
| algorithms as a search engine.
|
| - Tries (patricia, radix, etc...)
|
| - Trees (b-trees, b+trees, merkle trees, log-structured merge-
| tree, etc..)
|
| - Consensus (raft, paxos, etc..)
|
| - Block storage (disk block size optimizations, mmap files,
| delta storage, etc..)
|
| - Probabilistic filters (hyperloglog, bloom filters, etc...)
|
| - Binary Search (sstables, sorted inverted indexes)
|
| - Ranking (pagerank, tf/idf, bm25, etc...)
|
| - NLP (stemming, POS tagging, subject identification, etc...)
|
| - HTML (document parsing/lexing)
|
| - Images (exif extraction, removal, resizing / proxying,
| etc...)
|
| - Queues (SQS, NATS, Apollo, etc...)
|
| - Clustering (k-means, density, hierarchical, gaussian
| distributions, etc...)
|
| - Rate limiting (leaky bucket, windowed, etc...)
|
| - text processing (unicode-normalization, slugify, sanitation,
| lossless and lossy hashing like metaphone and document
| fingerprinting)
|
| - etc...
|
| I'm sure there is plenty more I've missed. There are lots of
| generic structures involved, like hashes, linked lists, skip
| lists, heaps, and priority queues - and this is just to get to
| 2000s-level basic tech.
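|
| To give a flavor of just one bullet (ranking), here's a
| bare-bones BM25 scorer over a toy in-memory inverted index -
| made-up documents, no stemming or stopwords, the usual k1/b
| constants:
|
| # Bare-bones BM25 over a toy in-memory inverted index.
| import math
| from collections import Counter, defaultdict
|
| docs = {
|     "d1": "writing your own search engine is hard",
|     "d2": "search engines rank documents with bm25 and pagerank",
|     "d3": "writing crawlers is a different kind of hard",
| }
|
| index = defaultdict(dict)   # term -> {doc_id: term frequency}
| lengths = {}
| for doc_id, text in docs.items():
|     terms = text.split()
|     lengths[doc_id] = len(terms)
|     for term, tf in Counter(terms).items():
|         index[term][doc_id] = tf
| avgdl = sum(lengths.values()) / len(lengths)
|
| def bm25(query, k1=1.2, b=0.75):
|     scores = defaultdict(float)
|     for term in query.split():
|         postings = index.get(term, {})
|         if not postings:
|             continue
|         n = len(postings)
|         idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
|         for doc_id, tf in postings.items():
|             norm = k1 * (1 - b + b * lengths[doc_id] / avgdl)
|             scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
|     return sorted(scores.items(), key=lambda kv: -kv[1])
|
| print(bm25("search engine hard"))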
|
| - https://github.com/quickwit-oss/tantivy
|
| - https://github.com/valeriansaliou/sonic
|
| - https://github.com/mosuka/phalanx
|
| - https://github.com/meilisearch/MeiliSearch
|
| - https://github.com/blevesearch/bleve
|
| - https://github.com/thomasjungblut/go-sstables
|
| A lot of people new to this space mistakenly think you can just
| throw Elasticsearch or Postgres full-text search in front of
| terabytes of records and have something decent. That might work
| for something small like a curated collection of a few hundred
| sites.
| kreeben wrote:
| Yes, yes, yes :D There are so many topics in this space that
| are so interesting it's like a dream. I would add to your
| list
|
| - sentiment analysis
|
| - roaring bitmaps
|
| - compression
|
| - applied linear algebra
|
| - ai
|
| In a Venn diagram intersecting all of these topics is
| search. Coding a search engine from scratch is a beautiful
| way to spend one's days, if you're into programming.
| boyter wrote:
| > That might work for something small like a curated
| collection of a few hundred sites.
|
| Probably more like a few million but otherwise 100% true.
| Once you really need to scale you have to start losing some
| accuracy or correctness.
|
| It helps that the goal of a search engine is not to find all
| the results but instead delight the user by finding the
| things they want.
| [deleted]
| streets1627 wrote:
| Hey folks, I am one of the co-founders of neeva.com
|
| While writing a search engine is hard, it is also incredibly
| rewarding. Over the past two years, we have brought up a
| meaningful crawl / index / serve pipeline for Neeva. Being able
| to create pages like https://neeva.com/search?q=tomato%20soup or
| https://neeva.com/search?q=golang+struct+split which are so much
| better than what is out there in commercial search engines is so
| worth it.
|
| We are private, ad-free, and customer-paid.
| amelius wrote:
| The article doesn't touch upon the hardest and most interesting
| part: NLP and finding the most relevant results. I would like to
| see a post on this.
___________________________________________________________________
(page generated 2022-07-23 23:00 UTC)