[HN Gopher] Ask HN: Why doesn't anyone create a search engine co...
       ___________________________________________________________________
        
       Ask HN: Why doesn't anyone create a search engine comparable to
       2005 Google?
        
       I seem to recall that Google consistently produced relevant results
       and strictly respected search operators in 2005 (?), unlike the
       modern Google. And back then, I think search results were the same
       for everyone, rather than being customized for each user.
        
       Author : syedkarim
       Score  : 233 points
       Date   : 2021-12-02 15:10 UTC (7 hours ago)
        
       | keddad wrote:
       | While I feel that Google has become worse in the last couple of
       | years, I'm pretty sure it is still better now than 15 years ago.
       | Maybe it is just some kind of nostalgia?
        
         | micromacrofoot wrote:
         | the internet has changed, partially due to google's influence
         | 
         | instead of discussion forums and Q&A sites, everyone's on
         | facebook/twitter/discord/slack/snapchat/tiktok/etc... none of
         | that is really very google friendly
         | 
         | online marketing and SEO is a _much_ larger industry now, so
         | with less (by % of total) searchable content generated by
         | people (which is on social media) a lot of the high-ranking
         | content that appears in search is highly optimized marketing
         | 
         | then you have other kinds of weird things like... half of all
         | internet traffic being bots
        
       | abhaynayar wrote:
       | I'm probably the only person who doesn't think Google search has
       | deteriorated. I play security CTFs, so a lot of times I have to
       | search for peculiar technical details on various software. Also,
       | like any other human being, I also make generic queries. In both
       | cases, I feel like I almost always get to the desired webpage
       | within the top few results.
        
         | Ellipse0934 wrote:
         | It honestly depends on what you are searching for.
         | 
         | Case 1: You just want the name of a website or an article,
         | for example "Facebook" -> fb.com, "Gordon Ramsay" -> Wiki/official
         | website/celeb gossip website. You are good. Not much competition
         | here.
         | 
         | Case 2: You are looking for something technical like "GNU rnano
         | CVE-abcde"/"OpenBSD ARM64 Qualcomm Wifi driver not working",
         | you are again in fine territory; there is not much, if any, money
         | to be made here, so very little competition. There will be the
         | official forums, websites, maybe some conference websites in
         | this category.
         | 
         | Case 3: "Chicken potpie recipe", "How to be more organised":
         | This is the category where people are trying to game the SEO
         | algos. How the hell do recipe websites with 27 popups and a 12000
         | word essay on the secret family history end up on top? There
         | are a huge number of passionately made simple recipe websites
         | but they have to be "found" by us. For the second query, about
         | being more organized, I think most people are looking for some
         | sort of review article which looks at various schools of thought
         | regarding discipline and cleanliness, pointing to further
         | resources and exploring the why and the what to do. Here the
         | search engine needs to determine the
         | context of the query which is fairly abstract and then the
         | internal heuristics it uses are supposed to drive it to a
         | meaningful list of websites. Maybe the average joe would like
         | to click cosmopolitan's article but I would never do that.
         | Based on my previous click history maybe google should
         | determine what kind of links I am looking for. But when they
         | figure that out they'd much sooner use this behavioral insight
         | for advertisers. A great search engine is basically a primitive
         | personal librarian, I'd pay a yearly subscription for one.
         | 
         | The internet is vast and it has stuff that I don't know about.
         | How my 7 word abstract query is gonna get me there is the
         | question mark. Also, for a lot of queries the top results can
         | be plagued by spammy/fraud results which are on top because
         | they managed to trick the SEO algos. These bad actors were not
         | as prevalent for 2005 google.
        
         | jeffbee wrote:
         | Well no, it's you and me and the whole google search quality
         | evaluation team and everyone who works on google search and
         | like 99% of the general public as well. The meme of falling
         | search quality infects only HN. Mostly what people are
         | complaining about is that the quality of the web itself is in
         | free-fall.
        
       | wodenokoto wrote:
       | The web has changed drastically. I'd imagine 2005-google engine
       | today would be nothing but abandoned Wordpress blogs with comment
       | spam.
        
         | warning26 wrote:
         | I suspect this is exactly it--a lot of what made 2005-era-
         | Google good wasn't necessarily Google's own doing. It was that
         | SEO people hadn't yet fully figured out how to game the
         | system.
         | 
         | If you took an exact copy of Google circa-2005 and had it crawl
         | today's web, you'd probably get mostly "SEO optimized"
         | irrelevant blogspam.
        
         | ghaff wrote:
         | And even more copy-pasted spam than already exists.
         | 
         | The early Google (and other even earlier search engines) were
         | invented for an Internet world which, if not pristine and pure,
         | was at least mostly fairly legit content. Today's Internet is
         | probably 90% deliberate spammers and scammers.
        
       | simonebrunozzi wrote:
       | These guys [0] have built something really close to 2005-Google,
       | and possibly slightly better.
       | 
       | The parent company, Tiscali, was a huge hit in the 1990s, as it
       | provided internet access to millions of Italians. It went through
       | some struggle for several years, but lately the original founder,
       | Renato Soru, came back to run the company.
       | 
       | The company is based in Cagliari, the capital of Sardinia, Italy.
       | 
       | [0]: https://www.istella.it/en/
        
       | jakub_g wrote:
       | Cliqz wanted to build a new search engine but failed. It's just too
       | difficult to operate at that scale and break the existing
       | monopoly of big G.
       | 
       | https://www.burda.com/en/news/cliqz-closes-areas-browser-and...
       | 
       | https://news.ycombinator.com/item?id=23031520
       | 
       | https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...
        
         | hunterb123 wrote:
         | And then Brave bought them and it succeeded.
         | 
         | Cliqz is now Brave Search, I use it for all my devices, it's
         | great.
         | 
         | Works better than DDG and sometimes better than Google.
         | 
         | I only hash bang every 100 searches or so, most of the time
         | Google doesn't have it either. It's just to make sure.
         | 
         | http://search.brave.com/
        
         | arthur_sav wrote:
         | What if we didn't try to replicate google. Smaller and niche
         | search engines would probably work better in this new world of
         | vast information.
        
       | vangelis wrote:
       | They have, sort of: https://search.marginalia.nu/
        
       | pkamb wrote:
       | I would use a search engine that only indexed Reddit, Stack
       | Exchange, Wikipedia, and a small number of other sites.
       | 
       | And that specifically blocked Pinterest, Quora, most non-personal
       | "blogs", etc.
       | 
       | People suggest DDG ! operators, but I don't want to use a site's
       | (bad, single-site) search box. I want a multi-site SERP that only
       | displays results from known good sites, which are customizable.
        
         | monkeybutton wrote:
         | Too bad they whitelist which bots can access their sitemaps!
        
         | pkamb wrote:
         | Even rules such as "if there is a Wikipedia result in the top
         | 10, display it first".
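The two ideas in this sub-thread (only show results from known good sites, and promote a Wikipedia hit found in the top 10) can be sketched as a post-processing step over any ranked list of URLs. This is a minimal sketch, not how any real engine does it: `ALLOWED_SITES`, `site_of`, and `filter_and_rerank` are hypothetical names, the allowlist contents are just the sites named above, and the "last two hostname labels" trick is a simplification that breaks on suffixes such as `co.uk`. The top-10 rule is applied after filtering, which is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the thread names these sites only as examples.
ALLOWED_SITES = {"reddit.com", "stackexchange.com", "wikipedia.org"}


def site_of(url: str) -> str:
    """Return the last two hostname labels, e.g. 'en.wikipedia.org' -> 'wikipedia.org'."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def filter_and_rerank(urls: list[str], top_n: int = 10) -> list[str]:
    """Keep allowlisted results, then move the first Wikipedia hit in the top N to the front."""
    kept = [u for u in urls if site_of(u) in ALLOWED_SITES]
    for i, url in enumerate(kept[:top_n]):
        if site_of(url) == "wikipedia.org":
            # Promote the hit; every other result keeps its relative order.
            return [url] + kept[:i] + kept[i + 1:]
    return kept
```

Because it only reorders an existing result list, a filter like this could sit in front of any engine or metasearch backend; making the allowlist user-editable would cover the "customizable" part of the request.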
        
         | all2 wrote:
         | If I could add sites I liked to the index that'd be great. Find
         | a blogger/hacker I like? Add to the index. Can I share my index
         | with others? Can I include their indices in my searches?
         | 
         | Search engine as a social media platform? If I follow you, now
         | I can search in your indices?
        
       | vincent_s wrote:
       | Some people try:
       | 
       | https://www.mojeek.com/
       | 
       | https://fireball.com/
       | 
       | https://search.brave.com/
        
         | 7373737373 wrote:
         | Time for an https://github.com/sindresorhus/awesome search
         | engines?
        
         | ColinHayhurst wrote:
         | Mojeek founder story here: https://blog.mojeek.com/2021/03/to-
         | track-or-not-to-track.htm...
         | 
         | No-tracking and independent from the start. Now at 4.6 billion
         | pages with own infrastructure and IP. Went to market in 2020
         | with contextual ads and API. Self-disclosure: CEO
        
           | bullen wrote:
           | Do you use some sort of PageRank?
        
           | prox wrote:
           | Never heard of Mojeek. I will try it for a month and see how
           | it works. Currently using DDG 99% of the time.
        
           | snovv_crash wrote:
           | HN is wild: 30m after something is mentioned, the CEO chimes
           | in.
        
       | ColinHayhurst wrote:
       | You might call this a search engine based on the principle of
       | Information Neutrality.
       | 
       | "Information Neutrality is the principle to treat all information
       | provided (by a service) equally. The information provided, after
       | being processed by an information-neutral service, is the same
       | for every user requesting it, independent of the user's
       | attributes, including, e.g., origin, history or personal
       | preferences and independent of the financial or influential
       | interest of the service provider, as well as independent of the
       | timeliness of information."
       | 
       | I wrote about this in relation to search [0]. We need to be
       | allowed more freedom to choose search engines and services. One
       | (default or selected) choice for search is unhealthy. We
       | shouldn't have to choose between Google or Bing; DuckDuckGo or
       | Startpage; Brave or Ecosia; Mojeek or Gigablast ..... Personally
       | I use all 8 of these and more, as also explained [0].
       | 
       | [0] https://blog.mojeek.com/2021/09/multiple-choice-in-
       | search.ht...
        
       | gorgoiler wrote:
       | Random thought, based especially on using DuckDuckGo for two
       | years:
       | 
       | Search engine isn't singular, it's plural.
       | 
       | (1) Search engine for something I know exists.
       | 
       | (2) Search engine for finding something new.
       | 
       | There's a market for both, but you don't have to solve both
       | problems with the same product.
       | 
       | Sometimes I switch to Google for the former, but the latter works
       | well enough for me that I don't care what else Google would've
       | shown me.
       | 
       | More often than not, my feeling is Google would only have shown
       | me more ads in addition to whatever I could already find
       | elsewhere.
        
       | llaolleh wrote:
       | Everyone runs in the other direction anytime a search engine is
       | mentioned. The thought of competing with Google turns people off.
       | 
       | Even in 2021, despite how bad it's become, it's still miles ahead
       | of other competitors.
        
         | prox wrote:
         | I disagree. A lot of people I know already switched to
         | Duckduckgo. Google's ability to get relevant results is
         | dropping like a brick, while the quality of DDG has been
         | improving slowly but steadily.
        
           | datenarsch wrote:
           | I wish I could agree but from my experience, DDG's search
           | results aren't really that great. Often even worse than
           | Google's.
           | 
           | And another private company is not the answer I believe. We
           | need something more drastic, an open-source search engine
           | organized as a genuine non-profit organization. Something
           | like that. Otherwise, whatever replaces Google will just turn
           | into another Google as soon as it gets any momentum.
        
       | amelius wrote:
       | Also, where are the books about writing a search engine?
       | 
       | Knuth's "Searching and Sorting" volume desperately needs an
       | update.
        
         | mindcrime wrote:
         | I don't even know if anybody has written a book specifically
         | about search at "web scale" (no MongoDB jokes here, please).
         | But about the closest things I know of would be something like:
         | 
         | https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...
         | 
         | https://www.amazon.com/Information-Retrieval-Implementing-Ev...
         | 
         | https://www.amazon.com/Introduction-Information-Retrieval-Ch...
        
       | indymike wrote:
       | Brave's new search engine seems to work pretty well. Have been
       | using it as my primary for about 10 days, and so far, I've only
       | had to revert to Google once, and when I did the results were
       | chock full of spam.
        
         | concinds wrote:
         | The nice thing about Brave Search is that they're trying to
         | create an index completely independent from Bing/Google, and
         | they seem to be trying to innovate on ways to get there as well
         | with their Web Discovery Project[0], unlike DuckDuckGo. They've
         | announced Brave Search will get ads soon, with a premium
         | version without ads, which I think is acceptable given the
         | costs of running an independent index sustainably.
         | 
         | [0]: https://brave.com/privacy/browser/#web-discovery-project
        
         | travisgriggs wrote:
         | Can echo this. About 30 days going all devices. I'd say about
         | once a day I do a !g, and rarely do I actually find something
         | there, it usually just ends up being a confirmation search.
        
       | lgrialn wrote:
       | What I miss most of all from the Good Old Days was getting as
       | many hits back as I could read.
       | 
       | Rather than being told "No, there are only eight pages of results
       | on anything in the goddamned world. Really. Would I lie to you?"
        
       | greyman wrote:
       | 1) Google is better at AI, for example let's take this sloppy
       | search: "some joke where you can't tell if it is serious or joke"
       | 
       | It is called Poe's law, and Google returned it at #4. Bing or
       | Duckduckgo don't have a clue...
       | 
       | 2) They have years of users' data; for a specific term, they
       | see what users clicked most, so they see which results were
       | perceived as most relevant. It is hard to catch up if you don't
       | have such data.
       | 
       | 3) They developed anti-spamming tools during the years of
       | fighting against SEO-spammers.
        
         | wmil wrote:
         | > Google is better at AI, for example let's take this sloppy
         | search: "some joke where you can't tell if it is serious or
         | joke"
         | 
         | My problem there is that I don't expect or want my search
         | engine to do that. The counter case is where I remember a quote
         | from an article and want to find the article. Old Google would
         | help me find matching text and I could quickly find the
         | original article. Current Google will try to interpret the text
         | and give me some nonsense based on that.
         | 
         | AI has ruined other Google features... the "search by image"
         | feature now analyzes the image, returns a generic tag like
         | "woman", and shows me the wikipedia article on women as the
         | first result.
         | 
         | Old search by image had tineye like functionality and you could
         | find the source of images.
        
         | lolpython wrote:
         | > 1) Google is better at AI, for example let's take this sloppy
         | search: "some joke where you can't tell if it is serious or
         | joke"
         | 
         | > It is called Poe's law, and Google returned it at #4. Bing or
         | Duckduckgo don't have a clue...
         | 
         | Interesting, I was looking for a good benchmark like this. For
         | me Google returned it at #5 with an image/related terms
         | carousel before it which places it physically more around #7 on
         | the page. Brave Search (never tried it before today) puts Poe's
         | Law at #8. So Google is still better.
         | 
         | But the other results are mostly worse (IMO) on Google. Here
         | are the first 8 results:
         | 
         | - 175 Bad Jokes That You Can't Help But Laugh At - Reader's
         | (rd.com)
         | 
         | - 57 Hilarious, Silly Jokes No One Is Too Old to Laugh At
         | (bestlifeonline.com)
         | 
         | - 145 Best Dad Jokes That Will Have the Whole Family Laughing
         | (countryliving.com)
         | 
         | - Sarcasm, Self-Deprecation, and Inside Jokes: A User's Guide
         | (hbr.org)
         | 
         | - Poe's law - Wikipedia (wikipedia.org)
         | 
         | - Managing Conflict with Humor - HelpGuide.org (helpguide.org)
         | 
         | - 175 Bad Jokes That Are So Cringeworthy, You Can't ... -
         | Parade (parade.com)
         | 
         | - Encouraging Your Child's Sense of Humor (for Parents) - Kids
         | ... (kidshealth.org)
         | 
         | And here are the first 8 results from Brave Search:
         | 
         | - phrase requests - Is there a word for "pretending to joke
         | when ... (english.stackexchange.com)
         | 
         | - Joke - Wikipedia (wikipedia.org)
         | 
         | - "Are you joking or serious?" - The Caffeinated Autistic
         | (thecaffeinatedautistic.wordpress.com)
         | 
         | - How do I tell when people are joking or being serious?
         | (reddit.com/r/socialskills)
         | 
         | - be a joke | meaning of be a joke in Longman Dictionary of
         | (ldoceonline.com)
         | 
         | - Quote by Ricky Gervais: "If you can't joke about the most
         | (goodreads.com)
         | 
         | - How can you tell if someone is joking with you or not?
         | (quora.com)
         | 
         | - Poe's law - Wikipedia (wikipedia.org)
         | 
         | -----
         | 
         | edit: I did not count to 8 correctly the first time. Fixed
         | that.
        
           | pkamb wrote:
           | The Brave results though seem to contain "good sites" whereas
           | the Google results are content mill blogspam. The exact
           | placement of Poe's Law is somewhat less important.
        
             | lolpython wrote:
             | I agree. I switched to Brave Search after running this
             | test.
        
       | gbmatt wrote:
       | Ha, yes, I've done that at https://gigablast.com/ . The biggest
       | problems now are the following: 1) Too hard to spider the web.
       | Gatekeeper companies like Cloudflare (owned in part by Google)
       | and Cloudfront make it really difficult for upstart search
       | engines to download web pages. 2) Hardware costs are too high.
       | It's much more expensive now to build a large index (50B+ pages)
       | to be competitive.
       | 
       | I believe my algorithms are decent, but the biggest problem for
       | Gigablast is now the index size. You do a search on Gigablast and
       | say, well, why didn't it get this result that Google got. And
       | that's because the index isn't big enough because I don't have
       | the cash for the hardware. btw, I've been working on this engine
       | for over 20 years and have coded probably 1-2M lines of code on
       | it.
        
         | easton wrote:
         | You can be whitelisted so Cloudflare doesn't slow you down (or
         | block you): https://support.cloudflare.com/hc/en-
         | us/articles/36003538743...
        
         | fsflover wrote:
         | > 2) Hardware costs are too high.
         | 
         | Which is why the next big search engine should be distributed:
         | https://yacy.net.
        
           | wruza wrote:
           | No way to test it right away, demo peer 502-es.
        
         | indymike wrote:
         | I've used Gigablast off and on for a long time (I think I first
         | discovered Gigablast in 2006 or so). Would be cool to have a
         | registration service for legitimate spiders. I used to run a
         | team that scraped jobs and delivered them (by fax, email, US
         | mail as required by law) to local veterans' employment staffers
         | for compliance. We were contracted by huge companies (at one
         | point about 700 of the fortune 1000) to do so, and often our
         | spiders would be blocked by the employer's IT department even
         | though the HR team was paying us big bucks to do so.
        
         | mrlinx wrote:
         | Where did you read that google/alphabet owns part of
         | Cloudflare?
        
           | [deleted]
        
           | bloudermilk wrote:
           | Assuming OP is referring to Google Ventures' participation in
           | at least one of Cloudflare's rounds.
           | 
           | https://www.crunchbase.com/funding_round/cloudflare-
           | series-d...
        
         | skyde wrote:
         | what kind of index is Gigablast using? traditional inverted
         | index like Lucene or something more esoteric?
         | 
         | I know Google and Bing both use weird data structures like
         | BitFunnel
         | 
         | https://www.microsoft.com/en-us/research/publication/bitfunn...
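For readers unfamiliar with the term, a "traditional inverted index" maps each term to the set of documents containing it, and a multi-word query is answered by intersecting those sets. The sketch below is a textbook toy under that definition only. It says nothing about what Gigablast, Lucene, or BitFunnel actually do, and `tokenize`, `build_index`, and `search` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map every term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """Return sorted ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))
```

A production index would also store term positions and frequencies for phrase matching and ranking, and would compress the posting lists; none of that is shown here.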
        
         | collin128 wrote:
         | Have you ever looked at the Amazon file?
         | 
         | I'll see if I can track down the link but I remember somebody
         | sharing a dump with me from Amazon that apparently was a recent
         | scrape.
         | 
         | Edit: https://registry.opendata.aws/commoncrawl/
        
           | web007 wrote:
           | That's Common Crawl, they do the spidering of some billions
           | of webpages but that's still a tiny percentage of the web
           | versus Google or Bing.
        
             | visarga wrote:
             | Common Crawl is being used to train the likes of GPT-3 and
             | mine image-text pairs for CLIP. I wonder how much useful
             | content is missing, we're going to use all the web text,
             | images and video soon and then what do we do? We run out of
             | natural content. No more scaling laws.
        
             | cschmidt wrote:
             | Do you have any stats on that? I've always wondered about
             | the coverage of Common Crawl, if you include all the
             | historical crawl files too.
        
         | DavidCole1 wrote:
         | Interesting. I had some interests in building a search engine
         | myself (for playing around of course). I had read a blog post by
         | Michael Nielsen [1] which had sparked my interest. Do you have
         | any written material about your architecture and stuff like
         | that? Would love to read up.
         | 
         | [1]: https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-
         | billio...
        
           | gbmatt wrote:
           | there's some stuff here : https://github.com/gigablast/open-
           | source-search-engine
        
             | DavidCole1 wrote:
             | Thank you.
        
             | entropie wrote:
             | Holy, that's a huge codebase. Github even shows no
             | code/syntax hl for many cpp files because they are so big.
             | 
             | I fiddled around and searched for some not so well known
             | sites in germany and the results were surprisingly good.
             | But it looks really... aged.
        
         | ramboldio wrote:
         | maybe just add small webpages into your index, don't bother to
         | execute JS and don't download any images.
         | 
         | The content quality will be higher and it's a lot cheaper.
        
           | woutr_be wrote:
           | Out of curiosity, why would not executing JavaScript or not
           | downloading images equal higher content quality?
        
         | lloydatkinson wrote:
         | With a slightly fresher coat of paint this could be very
         | popular. For example, no grey background.
        
         | justinzollars wrote:
         | This is great! I found something other engines do not pick up!
         | apparently I signed an agile manifesto in 2010
         | https://agilemanifesto.org/display/000000190.html
        
         | yumraj wrote:
         | > Cloudflare (owned in part by Google)
         | 
         | Please elaborate. Is there a special relationship between
         | Cloudflare and Google?
        
           | spullara wrote:
           | Google Capital is an investor:
           | https://www.forbes.com/sites/katevinton/2015/09/22/google-
           | mi...
        
             | yumraj wrote:
             | That is not the same as being owned by Google.
        
         | SamBam wrote:
         | > You do a search on Gigablast and say, well, why didn't it get
         | this result that Google got. And that's because the index isn't
         | big enough
         | 
         | I wonder how much this is true, and how much (despite all our
         | rhetoric to the contrary) it's because we have actually come to
         | expect Google's modern proprietary page ranking, which counts
         | more than just inbound links but all sorts of other signals
         | (freshness, relevance to our previous queries, etc.).
         | 
         | We dislike the additional signals when it feels like Google is
         | trying to second-guess our intentions, but we probably don't
         | notice how well they work when they give us the result we
         | expect in the first three links.
        
           | JacobThreeThree wrote:
           | I think people also have an inflated recollection of how good
           | Google actually was back in 2005.
           | 
           | Back then Google was only going up against indexes and link-
           | rings, not 2021 Google/Bing/DDG/etc.
        
             | pbhjpbhj wrote:
             | 2005? There were loads of other search engines (SE), and
             | many meta-SE: hotbot, dogpile, metacrawler, ... (IIRC),
             | plenty more.
             | 
             | There were also indexes, which Yahoo, AOL (remember them!)
             | had but there was, what was it called, dmoz?, the open web
             | directory. When Google started, being in the right web
             | directory gave you a boost in SERPs as it was used as a
             | domain trust indicator, and the categories were used for
             | keywords. Of course it got gamed hard.
             | 
             | Google was good, but I used it as an alt for maybe 6 months
             | before it won over my main SE at the time. I've tried but
             | can't remember what SE that was, Omni-something??
             | 
             | One of the main things Google had was all the extra
             | operators like link: inurl:, etc., but they had Boolean
             | logic operators too at one point I think.
        
             | pronik wrote:
             | I've been using Altavista at that time, every now and then
             | switching to Northern Light. Everything else was abysmal.
             | Google blew them out of the water in terms of speed,
             | quality, simplicity, unclutteredness and everything else. I
             | can't remember ever retraining muscle memory so fast when
             | switching to Google. So, no, Google has been great then and
             | apart from people actively working against the algorithm is
             | still good now, but obviously a completely different beast.
        
           | romwell wrote:
           | Well if the result didn't appear in the first 5-10 pages,
           | it's probably not in the index.
           | 
           | You can see it with other search engines. I challenge you to
           | come up with a Google query for which a first-page result
           | won't be seen within the first 10 pages of Bing results for
           | the same query.
           | 
           | (Bonus points if that result is relevant).
           | 
           | There's only so much tweaking that personalization and other
           | heuristic can do.
           | 
           | But if something is missing from the index, that's it.
        
           | jldugger wrote:
           | I assume the author has the ability to search the index to
           | see if your preferred Google result is even indexed.
        
         | garaetjjte wrote:
         | >Gigablast has teamed up with Imperial Family Companies
         | 
         | Associating with that crank (responsible for recent freenode
         | drama) is very off-putting.
        
           | loo wrote:
           | Oh no, you see he isn't responsible, it's everyone else! /s
        
             | djbusby wrote:
             | I don't get it, what's the fuss here?
        
               | meepmorp wrote:
               | The guy who took over Freenode styles himself as the
               | crown prince of korea; IFC is his company.
        
               | [deleted]
        
           | [deleted]
        
         | sockaddr wrote:
         | Nice.
         | 
         | I'd pay 5-10$/mo for a search engine that didn't just funnel me
         | into the revenue-extracting regions of the web like Google
         | does.
        
           | RhysU wrote:
           | A subscriber-supported search engine sounds cool to me. Any
           | precedent?
        
             | samcrawford wrote:
             | Kagi.com does this. In closed beta at the moment, but you
             | can email and request access.
        
             | gianthockey495 wrote:
             | You'll like https://neeva.com/
        
             | xtracto wrote:
             | Copernic ( https://copernic.com/ ) had Copernic Agent
             | Professional, a for-pay desktop application that had really
             | good search features, a while ago . Not sure if they
             | discontinued it.
        
               | gompertz wrote:
               | Wow blast from the past. I think I was using Copernic all
               | the way back in 2003... Forgot all about them. Thanks!
        
         | mirker wrote:
         | If you have customers, does that mean the incremental gain from
         | an improved index costs too much to store? Or are you talking
         | about computational costs?
        
           | gbmatt wrote:
           | it's both storage and computational. they go hand in hand.
        
         | Minor49er wrote:
         | I really love how the results organize multiple matching pages
         | from the same domain. This is really cool.
        
         | afrcnc wrote:
         | how recent are your results? 1-2h? 1 day?
        
           | gbmatt wrote:
           | it's continually spidering. just not at a high rate.
           | actually, back in the day i had real time updates while
           | google was doing the 'google dance'. that caused quite a stir
           | in the web dev community because people could see their pages
           | in the index being updated in real time whereas google took
           | up to 30 days to do it.
        
         | 1vuio0pswjnm7 wrote:
         | I really like GigaBlast.
         | 
         | I wrote a "meta" search utility for myself that can query
         | multiple search engines from the command line.^1 It mixes the
         | results into a simplified SERP ("metaSERP"), optimised for a
         | text-only browser, with indicators to show which search engine
         | each result came from. The key feature is that it allows for
         | what I might call "continuation searches". Each metaSERP
         | contains timestamps in its page source to indicate when
         | searches were executed, as well as preformatted HTTP paths. The
         | next search can thus pick up where the previous one left off.
         | Thus I can, if desired, build a maximum-sized metaSERP for each
         | query.
         | 
         | The reason I wrote this is because search engines (not
         | GigaBlast) funded by ads are increasingly trying to keep users
         | on page one, where the "top ads" are, and they want to keep the
         | number of results small. That's one change from 2005 and
         | earlier. With AltaVista I used to dig deep into SERPs and there
         | was a feeling of comprehensiveness; leave no stone unturned.
         | Google has gradually ruined the ability to perform this type of
         | searching with their now secretive and obviously biased behind-
         | the-scenes ranking procedures.
         | 
         | Why is there no way to re-order results according to objective
         | criteria, e.g., alphabetical? The user must accept the search
         | engines' ordering, giving them the ability to "hide" results on
         | pages the user will never view or simply not return them. That
         | design is more favorable to advertising and less favorable to
         | intellectual curiosity.
         | 
         | Each metaSERP, OTOH, is a file and is saved in a search
         | directory for future reference; I will often go back to
         | previous queries. I can later add more results to a metaSERP if
         | desired. I actually like that GigaBlast's results are different
         | than other search engines. The variety of results I get from
         | different sources arguably improves the quality of the
         | metaSERP. And, of course, metaSERPs can be sorted according to
         | objective criteria.
         | 
         | This is, AFAIK, a different way of searching. The "meta-search
         | engines" of yesteryear did not do "continuations", probably
         | because it was not necessary. Nor was there an expectation that
         | users would want to save meta-searches to local files. Users
         | were not trying to minimise their usage of a website, they were
         | not trying to "un-google".
         | 
         | Today's world of web search is different, IMO. There seems to
         | be a belief that the operator of a search engine can guess what
         | a user is searching for, that a user who sends a query is only
         | searching for one specific thing, and that the website has an
         | ad to match with that query. At least, those are the only
         | searches that really matter for advertising purposes.
         | Serendipitous discovery while perusing results is not
         | contemplated in the design. By serendipitous discovery I do not
         | mean sending a random query, e.g., adding an "I'm feeling
         | lucky" button, which to me always seemed like a bad joke.
         | 
         | The only downside so far is I occasionally have to prune "one-
         | off" searches that I do not want to save from the search
         | directory. I am going to add an indicator at search time that a
         | search is to be considered "ephemeral" and not meant to be
         | saved. Periodically these ephemeral searches can then be pruned
         | from the search directory automatically.
         | 
         | 1. Of course this is not limited to web search engines. I also
         | include various individual site search engines, e.g., Github.
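
A minimal sketch of the metaSERP idea described above, assuming each engine's results have already been fetched and parsed into (url, title, engine) records. The names `Result`, `normalize`, `merge`, and `sorted_by_title` are hypothetical and are not taken from the commenter's utility; fetching, timestamps, and the preformatted HTTP paths used for continuations are left out.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Result:
    url: str
    title: str
    engine: str  # which search engine returned this result


def normalize(url):
    """Reduce a URL to host + path so the same page from two engines matches."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def merge(batches, merged=None):
    """Fold per-engine result lists into one metaSERP keyed by normalized URL."""
    merged = {} if merged is None else merged
    for batch in batches:
        for r in batch:
            entry = merged.setdefault(
                normalize(r.url), {"url": r.url, "title": r.title, "engines": []}
            )
            if r.engine not in entry["engines"]:
                entry["engines"].append(r.engine)  # source indicator
    return merged


def sorted_by_title(merged):
    """Order entries alphabetically by title, one possible objective criterion."""
    return sorted(merged.values(), key=lambda e: e["title"].casefold())
```

Passing a previously built metaSERP back in as `merged` is one way a later search could add results to a saved one.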
        
         | xwdv wrote:
         | How much cash do you need?
        
         | JPKab wrote:
         | Regarding the Gatekeeper companies like Cloudflare, it sounds
         | like anti-competitive behavior that could potentially be
         | targeted with anti-trust legislation, correct?
        
           | adolph wrote:
           | At a theoretical level it looks like Cloudflare won't block
           | search engine crawlers. The docs are very Google and Bing
           | oriented and also oriented towards supporting their
           | customers, not a random new search engine's crawler.
           | 
           |  _Cloudflare allows search engine crawlers and bots. If you
           | observe crawl issues or Cloudflare challenges presented to
           | the search engine crawler or bot, contact Cloudflare support
           | with the information you gather when troubleshooting the
           | crawl errors via the methods outlined in this guide._
           | 
           | https://support.cloudflare.com/hc/en-us/articles/200169806
        
           | danielmarkbruce wrote:
           | No, it is not.
           | 
           | Cloudflare is giving its customers what they want. They
           | don't want all kinds of bots claiming to be search engines
           | crawling their sites. Cloudflare isn't hurting cloudflare
           | competitors by doing this. Cloudflare isn't hurting their
           | customers by doing this. To repeat - most websites don't want
           | lots and lots of crawlers. They want the 2 or 3 which matter
           | and no more, because at some point it's difficult to tell
           | what the crawler is doing... (is it a search engine???). They
           | aren't obliged to help search engines. Even if Cloudflare
           | wasn't offering this, bigger customers would roll their own
           | and do.. more or less the same thing.
        
           | shashashasha___ wrote:
           | i would assume it's mostly anti scraping protection which is
           | mostly for privacy. you don't want to allow everyone to scrape
           | your website, pull and use your info. for example from fb,
           | ig, LinkedIn, github, .... you can build a really big
           | profiling db on people that way. so websites need to know you
           | are a legit search engine first
        
             | karmanyaahm wrote:
             | people can still be targeted if that information is public.
             | anti scraping sounds like security by obscurity
        
           | gbmatt wrote:
           | it should be. there should be some sort of 'bots rights' to
           | level the playing field. perhaps this is something the FTC
           | can look into. but, as it is right now big tech continues to
           | keep their iron grip on the web and i don't see that changing
           | any time soon. big tech has all the money and controls access
           | to all the data and supply chains to prevent anyone else from
           | being a competitive threat.
           | 
           | look at linkedin (owned by microsoft, unspiderable by all but
           | google/bing). github (now microsoft using this to fuel its AI
           | coding buddy, but if you try to spider this at capacity your
           | IP is banned) facebook (unspiderable) .. the list goes on and
           | on ..
           | 
           | and as you can see, data is required to train advanced AI
           | systems, too. So big tech has the advantage there as well.
           | especially when they can swoop in and corrupt once non-profit
           | companies like openai, and make them [partially] for-profit.
           | 
           | and to rant on (yes, this is what i do :)) it's very difficult
           | to buy a computer now. have you tried to buy a raspberry pi
           | or even a jetson nano lately? Who is getting preferred access
           | to the chip factories? Does anyone know? Is big tech getting
           | dibs on all the microchips now too?
        
           | technobabbler wrote:
           | Cloudflare functions kinda like a private security company.
           | They don't go around blocking sites willy-nilly, site owners
           | have to specifically choose to use their service (and maybe
           | pay for it), configuring the bot blocking rules themselves.
           | 
           | That's not really Cloudflare's fault. Someone has to do it,
           | whether it's them or a competitor or sys admins manually
           | making firewall rules. Cloudflare just happens to be good
           | enough and darned affordable, so many choose to use them.
           | 
           | Hosting costs for small site owners would be much more
           | expensive without Cloudflare shielding and caching.
        
             | foxfluff wrote:
             | No, I think it is partially Cloudflare's fault because they
             | offer this service and make it easy to deploy. This shit
             | has exploded with Cloudflare's popularity.
             | 
             | Nobody _has_ to do it, but a lot of people will do it when
             | they notice there's an easy way to do it. Cloudflare is
             | very much an enabler of bad behavior here. Now a lot of
             | sites just toggle that on without even thinking about
             | collateral.
        
             | gbmatt wrote:
             | I've had extensive dealings with Cloudflare. They have a
             | complex whitelisting system that is difficult to get on,
             | and they also have an 'AI' system that determines if you
             | should be kicked off that whitelist for whatever reason.
             | 
             | Furthermore, they give Google preferred treatment in their
             | UIs and backend algos because it is the incumbent and
             | nobody cares about other smaller search engines. So there's
             | a lot of detail to how they work in this domain.
             | 
             | It's 100% Cloudflare's fault, and it's up to them to give
             | everyone a fair shot. They just don't care. Also, you are
             | overlooking the fact that Google is a major investor (and
             | so is Bing and Baidu). So really this exacerbates the
             | issue. Should Google be allowed (either directly or
             | indirectly) to block competing crawlers from downloading web
             | pages?
        
               | technobabbler wrote:
               | These are all great points.
        
               | danielmarkbruce wrote:
               | It isn't up to them to give everyone a fair shot. That
               | isn't what their customers actually want. Cloudflare
               | aren't in the "fair shots for all search engines"
               | business. They are in the "stop requests you don't want
               | hitting your servers" business.
        
           | AtlasBarfed wrote:
           | "targeted with anti-trust legislation"
           | 
           | Um, this is America. Every market is basically a trust,
           | cartel, or monopoly.
           | 
           | And I don't know if you can hear that, but there is literally
           | laughter in the halls of power. All the show hearings by
           | congress on social media and tech companies only have to do
           | with two things:
           | 
           | 1) one political party thinking the other is getting an
           | advantage by them
           | 
           | 2) shaking them down for more lobbying and campaign donations
           | 
           | No one in the halls of power gives two shits about
           | competition. Larger companies mean larger campaign donations,
           | and more powerful people to hobnob with if/when you leave or
           | lose your political office.
           | 
           | Of course I think that breaking up the cartels in every major
           | sector would lead to massive improvements: more companies is
           | more employment, more domestic employment, more people trying
           | to innovate in management and product development, more
           | product choice, lower prices, more competition, more
           | redundancy/tolerance to supply chain disruption, less
           | corruption in government and possibly better regulation.
           | 
           | Every large company brazenly does market abuse up to the
           | point of one and only one limiter: the "bad PR" line. So I
           | guess we have that.
        
         | Archelaos wrote:
         | I tried out four search words with your search engine, and I am
         | not convinced that it is mainly the index size and not the
         | algorithm that is to blame for bad search results. There are
         | way too many high-ranking false positives. Here is what I
         | tried:
         | 
         | a) "Berlin":
         | 
         |   1. The movie festival "Berlinale"
         |   2. The Wikipedia entry about Berlin
         |   3. Something about a venue "Little Berlin", but the link
         |      resolves to an online gaming site from Singapore
         |   4. "Visit Berlin", the official tourism site of Berlin
         |   5. The hash tag "#Berlin" on Twitter
         |   6. "1011 Now" a local news site for Lincoln, Nebraska
         |   7. "Freie Universität Berlin"
         |   8. Some random "Berlin" videos on Youtube
         |   9. The Berlin Declaration of the Open Access Initiative
         |   10. Some random "Berlin" entries on IMDb
         |   11. A "Berlin" Nightclub from Chicago
         |   12. Some random "Berlin" books on Amazon
         |   13. The town of Berlin, Maryland
         |   14. Some random "Berlin" entries on Facebook
         |   15. The BMW Berlin Marathon
         | 
         | b) "philosophy"
         | 
         |   1. The Wikipedia entry about philosophy
         |   2. "Skin Care, Fragrances, and Bath & Body Gifts" from
         |      philosophy.com
         |   3. "Unconditional Love Shampoo, Bath & Shower Gel" from
         |      philosophy.com
         |   4. Definition of Philosophy at Dictionary.com
         |   5. The Stanford Encyclopedia of Philosophy
         |   6. PhilPapers, an index and bibliography of philosophy
         |   7. The University of Science and Philosophy, a rather
         |      insignificant institution that happens to use the domain
         |      philosophy.org
         |   8. "What Can I Do With This Major?" section about philosophy
         |   9. Pages on "philosophy" from "Psychology Today". I looked at
         |      the first and found it to be too short and eclectic to be
         |      useful.
         |   10. The Department of philosophy of Tufts University
         | 
         | c) "history"
         | 
         |   1. Some random pages from history.com
         |   2. "Watch Full Episodes of Your Favorite Shows" from
         |      history.com
         |   3. Some random pages from history.org
         |   4. "Battle of Bunker Hill begins" from history.com
         |   5. Some random "History" pages from bbc.co.uk
         |   6. Some random pages from historyplace.com
         |   7. The hash tag "#history" on Twitter
         |   8. The Missouri Historical Society (mohistory.com)
         |   9. Some random pages from History Channel
         |   10. Some random pages from the U.S. Census Bureau
         |       (www.census.gov/history/)
         | 
         | d) "Caesar"
         | 
         |   1. The Wikipedia entry about Caesar
         |   2. Little Caesars Pizza
         |   3. "CAESAR", a source for body measurement data. But the link
         |      is dead and resolves to SAE International, a professional
         |      association for engineering
         |   4. The Caesar Stiftung, a neuroethology institute
         |   5. Some random "Caesar" books on Amazon
         |   6. Hotels and Casinos of a Caesars group
         |   7. A very short bio of Julius Caesar on livius.org
         |   8. Texts on and from Caesar provided by a University of
         |      Chicago scholar
         |   9. (Extremely short) articles related to Caesar from
         |      britannica.com
         |   10. "Syria: Stories Behind Photos of Killed Detainees | Human
         |       Rights Watch". The photos were by an organization called
         |       the Caesar Files Group
         | 
         | So what I can see are some high ranked false positives that are
         | somehow using the search term, but not in its basic meaning
         | (a3, a11, b2, b3, d2, d3, d4, d6) or not even that (a6). Some
         | results are ranking prominently although they are of minor
         | importance for the (general) search term (a9, a13, b7, b8 --
         | perhaps a15 and d10). Then there are the links to the usual
         | suspects such as Wikipedia, Twitter, Amazon, etc. (a2, a5, a8,
         | a10, a12, a14, b7, c5, d1, d5); I understand that Wikipedia
         | articles are featuring prominently, but for the others I would
         | rather go directly to e.g. Amazon when I am interested in
         | finding a book (or use a search term like "Caesar amazon" or
         | "Caesar books"). Well, and then there are the search results
         | that are not completely off, but either contain almost no
         | information, at least compared to the corresponding Wikipedia
         | article and its summary (b4, b9, d7, d9), or that are too
         | specific for the general search term (c1, c2, c3, c4, c6, c9,
         | c10).
         | 
         | That leaves me with the following more or less high quality
         | results (outside of the Wikipedia pages): a1, a4, a7, b5, b6,
         | b10, and d8. The a15 and d10 results I could tolerate if there
         | had been more high quality results in front of them; but as a
         | fourth and second, respectively, good result they seem to me to
         | be too prominent. Also in the case of "Berlin" a4 should have
         | been more prominent than a1, and a7 is somewhat arbitrary,
         | because Humboldt University and the Technical University of
         | Berlin are likewise important; what is completely missing is
         | the official Website of the city of Berlin (English version at
         | www.berlin.de/en/).
         | 
         | All in all, I would say that your ranking algorithm lacks
         | semantic context. It seems the prominence of an entry is mainly
         | determined by either just being from the big players like
         | Twitter, Youtube, Amazon, Facebook, etc. or by the search term
         | appearing in the domain name or the path of the resource,
         | regardless of the quality of the content.
        
           | kilburn wrote:
           | I don't know about others, but when I think of the "good old
           | google days" I'm _not_ expecting the results for your example
           | queries to be any good.
           | 
           | In those days querying took some effort but the effort paid
           | off. The results for "history" just couldn't matter less in
           | this mindset. You search for "USA history" or "house commons
           | history" or "lake whatever history" instead. If the results
           | come up with unexpected things mixed in, you refine the
           | query.
           | 
           | It was almost like a dialog. As a user, you brought in some
           | context. The engine showed you its results, with a healthy
           | mix of examples of everything it thought was in scope. Then
           | you narrowed the scope by adding keywords (or forcing
           | keywords out). Rinse and repeat. As a user, you were in
           | command and the results reflected that.
           | 
           | The idea that the engine should "understand what you mean" is
           | what took us to the current state. Now it feels like queries
           | don't matter anymore. Google thinks it knows the semantics
           | better than you, and steering it off its chosen path is
           | sometimes obnoxiously hard.
        
             | yellowstuff wrote:
             | I get what you mean, but part of the whole initial appeal
             | of Google was that it gave much more relevant results
             | initially than Altavista or the other options. That was why
             | Google put in the audacious "I'm feeling lucky" button.
        
               | DarkmSparks wrote:
               | This was the result of two things: mapreduce, and using links
               | to rank the pages.
               | 
               | Using links to rank the pages is not really possible any
               | longer because of seo spam links.
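
For readers unfamiliar with link-based ranking, here is a small sketch of a PageRank-style power iteration over a toy link graph. It is an illustration under stated assumptions, not Google's production algorithm: the `pagerank` function name, the damping factor of 0.85, and the handling of pages without outgoing links are choices made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by incoming links.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new[p] += damping * rank[page] / n
        rank = new
    return rank
```

The spam problem mentioned above shows up directly in this model: anyone who can create many pages linking to a target raises that target's rank.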
        
               | 1_player wrote:
               | Yeah but it's from that same philosophy that Google
               | Search is useless as it optimises for the first result.
               | 
               | There is no search engine that searches literally for
               | what you asked and nothing else. Search is shit in 2021
               | because it tries to be too clever. I'm more clever than
               | it, let me do the refining.
        
             | 1024core wrote:
             | > The idea that the engine should "understand what you
             | mean" is what took us to the current state. Now it feels
             | like queries don't matter anymore. Google thinks it knows
             | the semantics better than you, and steering it off its
             | chosen path is sometimes obnoxiously hard.
             | 
             | Bingo! If you cede control to Google, it _will_ do what
             | it's optimized to do, and not what _you_ are looking for.
        
           | dgivney wrote:
           | I think you have some great feedback here but for me it also
           | highlights how subjective search results can be for
           | individuals - for example, these false positives that you
           | mention (b2, b3) appear as the top result on Google for me
           | for that query.
           | 
           | It makes me think there must be some fairly large segment of
           | the population that want that domain returned as a result for
           | their query, no?
        
           | 1024core wrote:
           | OK, I'll bite. How would _you_ rank the results for each of
           | those queries?
        
         | melony wrote:
         | What we need is a net neutrality doctrine on the server side.
         | Bandwidth is hardly scarce outside of AWS's business model. Ban
         | the crawler user-agent dominance by the big search engine
         | players. "Good behaviour" should be enforced via rate limiting
         | that equally applies to all crawlers, without exemption for
         | certain big players.
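
One way to picture rate limiting that applies equally to all crawlers is a token bucket per client address with identical parameters for everyone. This is only a sketch: the class names `TokenBucket` and `CrawlerLimiter`, the default rate, and the choice to key on client IP are assumptions for illustration, and the User-Agent header is deliberately never consulted.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, now=None):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class CrawlerLimiter:
    """One identical bucket per client address, with no exemptions."""

    def __init__(self, rate=1.0, capacity=5):
        self.rate, self.capacity = rate, capacity
        self.buckets = {}

    def allow(self, client_ip, now=None):
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_ip] = bucket
        return bucket.allow(now)
```

The optional `now` argument exists only so the behaviour can be checked without waiting on a real clock.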
        
           | ColinHayhurst wrote:
           | https://knuckleheads.club/
        
         | hdjjhhvvhga wrote:
         | Regarding the gatekeeper problem: it's a wild guess but maybe
         | if there was a way to involve users by organizing distributed
         | scraping just for the sake of building a decent index, I'm sure
         | many of them would help.
        
           | gbmatt wrote:
           | yes, large proxy networks are potential solutions. but they
           | cost money, and you are playing a cat and mouse game with
           | turing tests, and some sites require a login. furthermore,
           | people have tried to use these to spider linkedin (sometimes
           | creating fake accounts to login) only to be sued by microsoft
           | who swings the CFAA at them. so you start off with an
           | intellectual desire to make a nice search engine and end up
           | getting sidetracked into this pit of muck and having
           | microsoft try to put you in jail. and, no, i'm not the one
           | microsoft was suing.
        
         | 1cvmask wrote:
         | Why do you have a user account with a login?
        
         | subsubzero wrote:
         | curious how you implemented the index, memory based or disk
         | based? Either way you are right, HW costs are extremely
         | expensive and you would need a lot of high RAM/high core count
         | machines to return such a large index to the endusers in a low
         | latency fashion.
        
         | 1_player wrote:
         | If you're serious about this, add a paid tier. While it's free,
         | I don't trust you will not ever sell my data to make bank.
        
           | jermaustin1 wrote:
           | You are going to pay for a generalized web search when
           | DDG/Google/Bing/etc are free?
        
             | Closi wrote:
             | I would - the problem with those services is that they
             | prioritise the results that generate the search engine the
             | most money rather than give me the best results, and then
             | index my searches to track and advertise to me throughout
             | the web.
             | 
             | A clear pricing transaction sounds much nicer to me. Should
             | generate better results too.
        
             | 1_player wrote:
             | Yes. I use Brave Search and I hope they add a paid tier,
             | which I think they have confirmed they'll add at a later
             | date.
             | 
             | If you don't pay, you are the product. Simple as that.
        
               | pythux wrote:
               | https://twitter.com/brave/status/1466510541128548362?s=20
        
               | duckmysick wrote:
               | > If you don't pay, you are the product.
               | 
               | If not enough people pay, there's no product.
        
               | 1_player wrote:
               | If nobody pays, there's even less of it. Not sure what's
               | your point.
        
           | Nasrudith wrote:
           | Why do people think a paid tier will prevent their data from
           | being sold after the company pockets the fee? For one thing, if
           | they go bankrupt, the data isn't theirs to withhold anymore.
        
         | InfiniteRand wrote:
         | Not sure if you're looking for feedback, but the News search
         | could use some work, I searched for "Ethiopia" and almost all
         | of the articles were unrelated to Ethiopia except for the
         | existence of some link somewhere on the page.
         | 
         | Your general web search seems pretty good, although I've just
         | given it a casual glance. I think your News search could be
         | improved by just filtering the general search results for News-
         | related content, since the "Ethiopia" content I get there is
         | certainly Ethiopia-related.
         | 
         | In any case, an interesting product, I'll try to keep an eye on
         | it.
        
         | smt88 wrote:
         | What if you allowed trusted contributors to "donate" their
         | browsing to your index?
         | 
         | AltaVista and Yahoo did that with browser plugins in the 90s.
        
         | bullen wrote:
         | Do you have some sort of PageRank?
        
         | mrkramer wrote:
         | I'm sorry to say but your project is 20 years old and it had no
         | impact at all. You are doing something wrong. Innovation and
         | initiative are needed a la Bitcoin and DeFi, not hobby projects
         | which are not picking up in popularity and utility.
        
           | ErrrNoMate wrote:
           | Bitcoin and DeFi don't have utility outside of gambling and
           | pump and dumps. Not everything (tbh not really anything)
           | needs crypto.
        
             | jquery wrote:
             | Crypto's biggest achievement is being the financial
             | equivalent of the gulf war oil fires. Just massive
             | pollution. Think of all the good things that computing
             | could be used for... we used to have all kinds of
             | interesting collaboration projects. Instead we are setting
             | those CPU cycles on fire for short term profit.
        
               | Sohcahtoa82 wrote:
               | Imagine if all that processing power was used for
               | Folding@Home.
               | 
               | The problem is that cryptocurrencies do not inherently
               | need tons of processing power to operate. You could
               | theoretically run the entire Bitcoin network on a
               | Raspberry Pi. But the PoW algorithm was designed to
               | always produce a block every 10 minutes, no matter how
               | much hashing power was dedicated to the network. Everyone
               | wanted a piece of the block reward pie, so the arms race
               | was created.
               | 
               | Proof-of-stake algorithms would eliminate this problem
               | entirely, but PoS is a shitty "rich get richer" method.
               | Granted, with how expensive mining power is, even PoW
               | results in the rich getting richer, but at least PoS
               | doesn't result in the wasting of gigawatts of
               | electricity.
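
The "one block every 10 minutes no matter the hashing power" behaviour comes from periodic difficulty retargeting. Below is a simplified sketch of that rule, stated in terms of a plain difficulty number rather than Bitcoin's actual compact target encoding. The function name `retarget` is made up for this example; the 2016-block interval and the factor-of-4 clamp are Bitcoin's parameters.

```python
TARGET_BLOCK_SECONDS = 600   # one block every 10 minutes
RETARGET_INTERVAL = 2016     # blocks between adjustments (about two weeks)


def retarget(difficulty, actual_seconds):
    """Return the new difficulty after one retarget interval.

    actual_seconds is how long the last RETARGET_INTERVAL blocks took.
    """
    expected = TARGET_BLOCK_SECONDS * RETARGET_INTERVAL
    # Clamp so one adjustment changes difficulty by at most a factor of 4.
    clamped = min(max(actual_seconds, expected / 4), expected * 4)
    # Blocks arriving too fast (small actual) push difficulty up.
    return difficulty * expected / clamped
```

If miners double their hash power, the interval finishes in half the expected time and difficulty doubles, so block time returns to roughly ten minutes while energy use stays at the higher level.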
        
             | mrkramer wrote:
             | Read the Bitcoin whitepaper. Bitcoin was meant to decentralize
             | trust and to eradicate fraud through a transparent
             | decentralized database called the blockchain. It is certainly
             | more impactful than hobby search engines, considering
             | Bitcoin was also a hobby project, but a really
             | revolutionary one.
             | 
             | Go search what Larry Page said 20 years ago: If innovation
             | is commercially successful it can have more widespread
             | impact.
        
               | _jal wrote:
               | The bitcoin brainworms do bad things to people.
               | 
               | I suggest you update your patter some, though. A good
               | coin scam needs to sound a lot less dated.
        
               | spiderice wrote:
               | So your response to the author saying "I'm trying to be
               | commercially successful, but it's really hard for these
               | reasons" is "You should try being commercially
               | successful"?
               | 
               | Ok...
        
               | mrkramer wrote:
               | I respect his effort but the project is 20 years old and
               | yet not commercially successful? There must be a reason
               | behind it. The project is not good enough. Like I said
               | only innovation can displace Google. Innovation is not
               | something new and different; innovation is something
               | better.
        
         | ma2rten wrote:
         | _It's much more expensive now to build a large index (50B+
         | pages)_
         | 
         | Do you have a cost estimate? Also, could you be more selective
         | in indexing, e.g. by having users request sites to be crawled?
        
           | ampersandy wrote:
           | Requiring users to know what sites they want in advance
           | somewhat defeats the purpose of a search engine, no?
        
             | robbomacrae wrote:
             | Not at all. You only have to fail the first request. It is
             | an approach I took with my own attempt at a search engine
             | way back! In fact I know personally that there is at least
             | one patent out there that suggests asking the first users
             | who make a request to provide the appropriate response, as
             | an efficient way to teach systems for future users.
             | 
             | Obviously failing first requests isn't ideal but for
             | popular requests it quickly becomes insignificant.
             | Wikipedia might (if they don't already) want to make a
             | similar suggestion for users to contribute when finding a
             | low content/missing page.
        
             | convolvatron wrote:
             | since sites are so desperate to be indexed, doesn't it seem
             | better to put the onus on them to announce themselves? it
             | would be great if DNS registries published public keys ..
             | maybe they do in newer schemes?
        
         | thoughtstheseus wrote:
         | Perhaps trolling the entire web is not useful today? I'd love a
         | search engine where I can whitelist sites or take an existing
         | whitelist from trusted curators.
        
           | hawthornio wrote:
           | I'm really interested in this as well. I use DDG and whenever
           | I'm doing research I tend to add ".edu" because there are so
           | many spam sites.
        
           | erhk wrote:
           | Trusted curators is a dangerous dependency
        
             | thoughtstheseus wrote:
             | It is. The alternative is scooping everything and using
             | algos to curate. That seems worse imo.
        
               | sdfjkl wrote:
               | Perhaps vote on results like on Reddit posts? Gets the
               | junk sites down (and out of the index eventually).
        
               | marginalia_nu wrote:
               | Given Reddit is notorious for its problems with
               | astroturfing and vote bots, I don't think this is a
               | particularly promising approach.
        
               | Retric wrote:
               | Any open voting system is going to be under serious SEO
               | pressure.
               | 
               | That's the real issue, Google has indirectly infected the
               | web with junk sites optimized for it. Any new search
               | engine now has a huge hurdle to sort through all the junk
               | and if it succeeds the SEO industry is just going to
               | target them.
               | 
               | A more robust approach is to simply pay people to evaluate
               | websites. Assuming it costs say $2 per domain to either
               | whitelist or block, that's ~$300 million for the current
               | web, and you need to repeat that effort over time. Of
               | course it's a clear cost vs accuracy tradeoff. Delist
               | sites that have copies and suddenly people will try to
               | poison the well to delist competitors etc etc.
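A back-of-the-envelope version of the estimate above, as a rough sketch only. The comment does not state a domain count; 150 million is simply the figure implied by 2 dollars per domain and a total of roughly 300 million dollars. The yearly re-review fraction is a made-up assumption to illustrate the "repeat that effort over time" part.

```python
def review_cost(domains: int, cost_per_domain: float) -> float:
    """One-off cost of having a person whitelist or block every domain."""
    return domains * cost_per_domain


def yearly_upkeep(domains: int, cost_per_domain: float, recheck_fraction: float) -> float:
    """Recurring cost if a fraction of all domains is re-reviewed each year."""
    return domains * recheck_fraction * cost_per_domain


# Assumption: domain count implied by the comment's own numbers.
IMPLIED_DOMAINS = 150_000_000
COST_PER_DOMAIN = 2.0
# Assumption: a quarter of all domains get re-checked every year.
RECHECK_FRACTION = 0.25

initial = review_cost(IMPLIED_DOMAINS, COST_PER_DOMAIN)
upkeep = yearly_upkeep(IMPLIED_DOMAINS, COST_PER_DOMAIN, RECHECK_FRACTION)

print(f"initial review: ${initial:,.0f}")
print(f"yearly upkeep:  ${upkeep:,.0f}")
```

Changing any of the three constants shifts the totals linearly, which is the cost vs. accuracy tradeoff the comment mentions.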
        
               | Nasrudith wrote:
               | Adding a gatekeeper collecting rent isn't a solution -
               | the people using SEO are already spending money to get
               | their name up high on the list.
        
               | arein3 wrote:
               | Reddit is a community heavily gatekept by the mods with
               | regard to specific topics.
        
               | 1024core wrote:
               | Reddit is an extreme example of group think. Try posting
               | something pro-Trump (I mean, surely even that guy has a
               | positive thing or two to be said about him) and you'll
               | get banned in some subs. Or you may get banned simply
               | because the mod doesn't like the fact that you don't toe
               | the party line.
        
               | notriddle wrote:
               | Also, vote bots
        
               | pessimizer wrote:
               | That just means that you have to curate the people
               | allowed to vote. Otherwise, it would be rule by the
               | obsessed and the search engine optimizers, and the junk
               | sites will dominate the index.
               | 
               | I'm not convinced that Google's recursive AI algos aren't
               | a functional equivalent. They let you vote by tracking
               | your clicks.
        
             | dcow wrote:
             | That's why you don't make it a hard dependency and let
             | people curate their own list of taste makers. They can
             | share and exchange info about who good taste makers are and
             | good ones might even charge for access to exclusive flavors.
        
             | dragonwriter wrote:
             | Plus, it scales less well than pure algorithmic search.
             | This fight already happened, with a much smaller internet.
        
               | shituonui wrote:
               | It works really, really well for libraries. Research
               | libraries (and research librarians) are phenomenally
               | valuable. I've missed them any time I'm not at a
               | university.
               | 
               | Both curators and algorithms are valuable. This goes for
               | finding books, for finding facts and figures, for finding
               | clothes, for finding dishwashers, and for pretty much
               | everything else.
               | 
               | I love the fact that I have search engines and online
               | shopping, but that shouldn't displace libraries and
               | brick-and-mortar. Curation and the ability to talk to a
               | person are complementary to the algorithmic approach.
        
               | dragonwriter wrote:
               | > It works really, really well for libraries
               | 
               | It scales extremely poorly. It works very well for
               | situations where customers/sponsors are willing to spend
               | lots of money for quality, because then the cost scaling
               | doesn't matter as much; research libraries, LexisNexis,
               | Westlaw, etc. all do this, but it's not cheap, and the
               | cost scaling with the size of the corpus _sucks_ compared
               | to algorithmic search.
               | 
               | It is among the approaches to internet search that lost
               | to more purely algorithmic search, because it scales
               | poorly in cost.
        
               | thoughtstheseus wrote:
               | +book stores. Curators can use algorithms to help them
               | curate... Google's SE is taking signals from poor
               | curators imo.
        
             | Zamicol wrote:
             | How about just a meritocratic rating? Even here on HN I
             | would appreciate some sort of weight on expert/experienced
             | opinion. Although in theory I like the idea that every
             | thought is judged on its own, the context of the author is
             | more relevant the deeper the subject. That's one of the
             | reasons I still read https://lobste.rs. It has a niche
             | audience with industry experience.
        
               | marginalia_nu wrote:
               | > meritocratic rating
               | 
               | That is literally PageRank.
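For readers who have not seen it, here is a minimal sketch of the textbook PageRank power iteration: a page's score is fed by the scores of the pages linking to it. This is the published formulation with the usual 0.85 damping factor, not Google's production ranking; the toy graph and the function name `pagerank` are illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outlinks): spread its score over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

The "merit" here is entirely link-derived: "c" ranks highest because the most pages vouch for it, and "a" inherits most of that score through the single link from "c".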
        
             | klankoo wrote:
             | Trusted consumers are better. The original page-rank algo
             | was organic and bottom-up. But now it's the person not the
             | page. Businesses compete for interaction not inbound links.
             | So if you can make a modern page-rank that follows
             | interaction instead of links and isn't a walled garden then
             | I'd invest.
        
               | politician wrote:
               | I could make that work, but what do you mean by "walled
               | garden" in this context?
        
           | technobabbler wrote:
           | That's a great idea.
        
           | GordonS wrote:
           | Heh, I guess you mean "trawling" - trolling the entire web is
           | something very different :)
        
             | giardini wrote:
             | "Trolling" is fine, see e.g.
             | https://grammarist.com/usage/trawl-troll/#:~:text=Troll%20fo....
        
               | romwell wrote:
               | Well, no, it's not fine.
               | 
               | See e.g. _the source you linked_ , which explains the
               | difference.
        
               | GrinningFool wrote:
               | Not in this context - "trolling" as described there would
               | apply to targeted indexing of a specific site; while
               | "trawling" would refer to a wide net that attempts to
               | catch all the sites.
        
             | hdjjhhvvhga wrote:
             | Then again, if you look at today's search results, where
             | everything above the fold belongs to Google, maybe we have
             | been trolled indeed.
        
             | rodiger wrote:
             | Depending on the intended metaphor, trolling could work too
             | :) https://en.wikipedia.org/wiki/Trolling_(fishing)
        
             | xwdv wrote:
             | What would trolling the entire web look like?
        
       | lolinder wrote:
       | The consistent theme every time this comes up is that dealing
       | with the sheer weight of the internet is almost impossible today.
       | SEO spam is hard to fight and the index gets too heavy. However,
       | I wonder if this is a sign that we're looking at the problem
       | wrong.
       | 
       | What if instead of even _trying_ to index the entire web, we
       | moved one step back towards the curated directories of the early
       | web? Give users a search engine and indexer that they control and
       | host. Allow them to  "follow" domains (or any partial URLs, like
       | subreddits) that they trust.
       | 
       | Make it so that you can configure how many hops it is allowed to
       | take from those trusted sources, similar to LinkedIn's levels of
       | connections. If I'm hosting on my laptop, I might set it at 1
       | step removed, but if I've got an S3 bucket for my index I might
       | go as far as 3 or 4 steps removed.
       | 
       | There are further optimizations that you could do, such as having
       | your instance _not_ index Wikipedia or Stack Overflow or whatever
       | (instead using the built-in search and aggregating results).
       | 
       | I'm sure there are technical challenges I'm not thinking of, and
       | this would absolutely be a tool that would best serve power users
       | and programmers rather than average internet users. Such an
       | engine wouldn't ever replace Google, but I'd think it would go a
       | long way to making a better search engine for a single user's (or
       | a certain subset of users') everyday web experience.
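The "hops from trusted sources" idea above can be sketched as a breadth-first walk with a depth limit. This is only an illustration under stated assumptions: the link graph is a hypothetical in-memory dict rather than a real crawler, and the host names and the function name `pages_within_hops` are invented for the example.

```python
from collections import deque


def pages_within_hops(trusted, outlinks, max_hops):
    """Return {url: hops} for every page within max_hops links of a trusted seed.

    trusted  -- list of seed URLs the user follows (hop count 0)
    outlinks -- dict mapping a URL to the URLs it links to
    """
    hops = {url: 0 for url in trusted}
    queue = deque(trusted)
    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue  # at the configured limit: index it, but do not follow its links
        for target in outlinks.get(url, ()):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical link graph; a real indexer would discover this by fetching pages.
outlinks = {
    "blog.example": ["docs.example", "forum.example"],
    "docs.example": ["vendor.example"],
    "vendor.example": ["spam.example"],
}

laptop_index = pages_within_hops(["blog.example"], outlinks, max_hops=1)
bucket_index = pages_within_hops(["blog.example"], outlinks, max_hops=3)
print(sorted(laptop_index))
print(sorted(bucket_index))
```

The toy graph also shows the risk: at 3 hops the spam site is already inside the index, so the hop limit trades coverage against how far trust is allowed to leak.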
        
         | lawwantsin17 wrote:
         | I'm sure the algorithms are making echo chambers worse.
         | Curating news opinion sites based on a prediction score of how
         | often Chicken Little was right about the sky falling after the
         | fact would surface reliable journalists and actual psychics!
        
         | supernovae wrote:
         | It's flawed from the get go if reddit is the basis.
        
           | loonster wrote:
           | As much as I like to hate on reddit (I'm a permanently
           | suspended user), not every sub there is trash. There are some
           | great subs there on very specific niche topics.
        
         | djwayne35 wrote:
         | I agree, I think we are looking at the problem wrong. And this
         | is a very insightful comparison with the LinkedIn levels of
         | connections idea. I am working on something with this. One
         | thing to point out is that when we think of searching through
         | information, we are searching through an information structure
         | aka graph of knowledge. Whatever idea or search term we are
         | thinking of is connected to a bunch of other ideas. All those
         | connected ideas represent the search space or the knowledge
         | graph we are trying to parse. So one way in the past people
         | have tried to approach this is they try to make a predefined
         | knowledge graph or an ontology around a domain. They try to set
         | up the structure of how the information should be and then they
         | fill in the data. The goal is to dynamically create an
         | ontology. Idk if anyone has really figured this out. But
         | Palantir with Foundry does something related. They sorta
         | dynamically create an ontology on top of a company's data. This
         | lets people find relationships between data and more easily
         | search through their data. Check this out to learn more
         | https://sudonull.com/post/89367-Dynamic-ontology-As-Palantir...
        
         | Nasrudith wrote:
         | The retro idea of curation seems popular here but everybody
         | forgets why it lost out in the first place. It just doesn't
         | scale in the first place. Not to mention demands - people
         | usually want tools which lower mental effort and are intuitive
         | as opposed to ones which are precise but in an obtuse metric.
         | Most would not find a hardware mouse that consisted of two
         | keypads for X and Y coordinates and a left click and right
         | click button very useful.
         | 
         | Similarly, everyone maintaining their own index is
         | cumbersome overkill in redundancy, processing power, and human
         | effort in return for a stunted network graph which is worse for
         | all metrics people usually actually care about. In terms of
         | catching on even "antipattern search engines" that attempt to
         | create an ideological echo chamber would probably catch on
         | better.
         | 
         | Short of search engine experiments/start up attempts the only
         | other useful application I can see is "rude web-spidering"
         | which deliberately disrespects all requests to not index pages
         | left publicly accessible as search engines generally try not to
         | be good tools for cracker wardriving for PR and liability
         | reasons. It would be a good whitehat or greyhat tool as doors
         | secured by politeness only are not secure.
        
         | mclightning wrote:
         | That's basically what I'm doing with my search
         | "site:reddit.com" I wonder if anyone at Google is aware of this
         | trend and taking notes.
        
           | a-r-t wrote:
           | Reddit is missing a huge opportunity by not improving their
           | crappy search functionality.
        
           | copperx wrote:
           | I estimate that about half of my searches have either
           | site:reddit.com or site:news.ycombinator.com at the end. In
           | fact, I have an autocomplete snippet on my Mac so I don't
           | have to type all that.
        
             | marksbrown wrote:
             | FYI this is exactly what the bangs in DDG do!
        
         | skyde wrote:
         | what you are suggesting would make the problem of echo chamber
         | (bubble) worse than it is today!
        
           | Nasrudith wrote:
           | Awkwardly, complaints about echo chambers as a problem tend
           | not to refer to feedback dynamics (crudely but unambiguously
           | referred to as a circle jerk) so much as "People disagree
           | with me, the nerve of them!". It is not viable to have
           | parties A through Z sharing the same world and all having
           | absolute control over all others. We see this same complaint
           | every time moderation comes up, let alone the fundamentals
           | of democracy.
        
           | loonster wrote:
           | Bubbles are great if you are on the outside looking in at how
           | a specific group thinks. Bubbles are horrible if you are on
           | the inside trying to explore your thoughts.
        
         | lessname wrote:
         | This might work well in some situations (e.g. research,
         | development), however it would also increase the effect of echo
         | chambers I think.
        
           | theduder99 wrote:
           | echo chambers are what most people want :)
        
           | jonathankoren wrote:
           | Part of the thing with echo chambers is that the search terms
           | themselves can be indicative of a particular bubble. For
           | example, there's a difference in the people that refer to the
           | Bureau of Alcohol, Tobacco, and Firearms by the official
           | initialism, "ATF", and those that use "BATF". There's a
           | strong anti-gun-control bent to the `"BATF" guns` query,
           | compared to the `"ATF" guns` query.
           | 
           | If you're indexing forums or social media, the same site is
           | going to give back the bubbled responses, possibly without
           | the person even being aware they're in a bubble.
           | 
           | https://www.google.com/search?q=%22BATF%22+guns&client=safar...
           | 
           | https://www.google.com/search?q=%22ATF%22+guns&client=safari...
        
           | lolinder wrote:
           | Possibly, but I'm not convinced.
           | 
           | Google's not exactly working against the echo chamber
           | problem, and I think that's because to do so would be to work
           | against its own reason for existing. There are two goals here
           | that are fundamentally at odds with each other:
           | 
           | 1) Finding what you're looking for.
           | 
           | 2) Finding a new perspective on something.
           | 
           | A search engine's job is to address the first challenge:
           | finding something that the user is looking for. The search
           | engine might end up serving both needs if they're looking for
           | a new perspective on something, but if these two goals ever
           | come into conflict with each other the search engine does
           | (and I would argue it _should_ ) choose the first goal.
           | Failing to do so will just lead to people ignoring the
           | results.
        
       | swframe2 wrote:
       | Have a look at gpt-3 if you want to see what the future dominant
       | search engine will be. It will not find relevant results; it will
       | write them on the fly, customized for exactly what you want to read.
       | (Maybe products will just ship to your door and be auto paid
       | because the future ad targeting AIs will know you so well.)
        
         | marginalia_nu wrote:
         | What if you are looking for something written by a human?
        
           | swframe2 wrote:
           | You can always go to a library or bookstore.
        
             | marginalia_nu wrote:
             | Let's imagine I want to talk to the author of the content.
             | How can I do that if it's just a souped up markov chain?
        
               | throwawayffffas wrote:
               | The markov chain can also power a chat bot.
        
               | marginalia_nu wrote:
               | But then they would need to know that the person sending
               | the email is the same person that read a specific
               | article.
        
       | baggachipz wrote:
       | https://kagi.com/ is a new engine (and Orion Browser) which seems
       | like what you're talking about. I've been using it some and like
       | it so far. The browser is fantastic.
        
       | drcongo wrote:
       | I've been using kagi.com for a month or so now, and it
       | consistently beats DDG and Ecosia for result quality. I'd guess
       | it beats Google too, since last time I used Google it was nothing
       | but ads and spam which is why I stopped.
        
         | freediver wrote:
         | Thank you for the vote of confidence! Better than Google is our
         | goal, glad you perceive it that way.
        
           | drcongo wrote:
           | You're welcome. I'm really impressed with it most of the
           | time. Still not made it on to the Orion beta though ;)
        
       | mkbkn wrote:
       | I am a non-dev and Ecosia and DuckDuckGo are perfect for me. I
       | haven't used Google for more than 3 years now.
        
       | moralestapia wrote:
       | Please do it! Google is now complete trash.
       | 
       | Also Gmail used to have the best spam filters out there; now
       | it's utter crap. Emails from my Google Analytics account, for
       | whatever reason and regardless of how many times I have clicked
       | "Not Spam", go to spam, and it's their own service; while
       | messages that are textbook spam ("Hi, I just got some
       | inheritance ...") go to my inbox.
       | 
       | AI (in its current state) is crap; when is the industry going to
       | accept these are the emperor's new clothes?
        
       | chilling wrote:
       | Yesterday there was a discussion[1] about it and someone
       | suggested yandex.com. I've been using it since then and really
       | love it. It's like going back to 2003, when everything was just
       | plain and simple.
       | 
       | [1]: https://news.ycombinator.com/item?id=29393467
        
       | pydry wrote:
       | Early 2000s google index ran in a garage. The current google
       | index has dedicated power stations.
       | 
       | It's a bit like the car industry - you could run a startup from
       | your garage in the early days but you need titanic amounts of
       | capital to compete now thanks to vertical integration.
       | 
       | Major governments and billionaires can compete but everybody else
       | is locked out of the market (most "startups" use Bing's index).
        
         | R0b0t1 wrote:
         | I was thinking about exactly that. If they used a simpler index,
         | would they be getting better results? There's not a lot of
         | selective pressure so they just keep adding to the index
         | algorithm.
        
       | rasengan wrote:
       | This is how Private Search [1] works since it decouples the
       | search from the user. This means nobody knows both who searched
       | and what they searched for. This is a huge leap for privacy in
       | search.
       | 
       | [1] https://private.sh
        
         | jaywalk wrote:
         | Looks like your comment here caused enough curiosity to take
         | the service down.
        
         | snarkypixel wrote:
         | Is it a proxy to other search engines or are they building
         | their own?
        
           | rasengan wrote:
           | It's a multi part partnership with Gigablast. Gigablast sees
           | the searches, but not who searches. Private.sh sees who
           | searches, but not what they search for.
        
         | gadrev wrote:
         | Just tried it and it worked for me.
        
         | gbmatt wrote:
         | and i work with rasengan on private.sh so yes there's some
         | issue there. one of the back end servers is returning a max
         | capacity error of sorts... we are checking into it.
        
         | Lucasoato wrote:
         | Tried it but it just says: "Something went wrong. Please try
         | again."
        
           | ZetaZero wrote:
           | same here
        
             | rasengan wrote:
             | It should be working now! Thanks for the heads up. There
             | was a traffic issue.
        
       | andrewclunn wrote:
       | What about a search engine that only indexed information and
       | technology "alternative" sites, specifically to give you the
       | results most likely to be purged or demoted from Google's
       | results? Would be simple enough in scope and have a built in
       | market and use.
        
       | beefman wrote:
       | Can you also create a web comparable to the 2005 web?
       | 
       | Well, it's wikipedia. So just create a search engine for that,
       | since their search sucks rocks.
        
       | criddell wrote:
       | Why don't you want personalized results? If I search for "subaru
       | service" I want to find Austin Subaru, not Thorp Subaru in Cape
       | Town.
        
         | arthur_sav wrote:
         | I pretty much hate "personalized" search recommendations. If
         | i'm looking for something it's usually not in relation to me
         | but in relation to the world.
         | 
         | If i wanted something more relevant to me, then i would specify
         | what aspect of relevance (country, gender, age etc...) i would
         | like instead of playing the guessing game.
        
           | criddell wrote:
           | > If i'm looking for something it's usually not in relation
           | to me but in relation to the world.
           | 
           | If that's true, then I don't think you are a typical search
           | engine user.
           | 
           | The personalization should just be used for defaults. You can
           | always make a more specific query to focus on aspects you are
           | interested in.
        
         | vikingerik wrote:
         | Why didn't you just search "austin subaru service"? If you want
         | a query narrowed down by location, that's your job to say so.
         | 
         | Sure, it feels great when the engine guesses something like
         | that correctly -- but it comes out worse overall for the
         | plentiful cases where you have to try to compensate for it
         | guessing wrong.
        
           | criddell wrote:
           | Why should I have to do all that work? I want the machine to
           | do it for me.
           | 
           | I can only think of examples where I want personalization.
           | What's an example query where it interferes?
        
             | jeffbee wrote:
             | Amazing that the same site that thinks copilot will just
             | generate programs for us also thinks it is literally a
             | crime for a search engine to infer anything.
        
       | not2b wrote:
       | I think you're being nostalgic for something you don't remember
       | very well.
       | 
       | In that era, Google would return a match based on words that
       | appear in the links to a URL but not in the article itself,
       | meaning that it was easy to produce "Googlebombs". For example,
       | from 2005-2007 the top hit for "miserable failure" was the
       | official White House biography of George W. Bush.
       | 
       | See https://www.screamingfrog.co.uk/google-bombs/ for some of the
       | "better" ones.
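Why matching on link text makes Googlebombs easy can be shown with a toy index that scores a page purely by the words other sites use when linking to it. This is a deliberately simplified sketch, not how Google worked in detail; the URLs, the link tuples, and the function names are all hypothetical.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Index target pages by the words in the links that point at them.

    links is an iterable of (source_url, anchor_text, target_url) tuples.
    The target page's own text is never consulted.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _source, anchor_text, target in links:
        for word in anchor_text.lower().split():
            index[word][target] += 1
    return index


def search(index, query):
    """Return targets whose inbound anchor text contains every query word."""
    words = query.lower().split()
    if not words:
        return []
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    # More inbound links using the query words means a higher position.
    return sorted(candidates, key=lambda t: (-sum(index[w][t] for w in words), t))


# Hypothetical coordinated campaign: three blogs use the same link text.
links = [
    ("https://blog-one.example/", "miserable failure", "https://target.example/bio"),
    ("https://blog-two.example/", "miserable failure", "https://target.example/bio"),
    ("https://blog-three.example/", "Miserable Failure", "https://target.example/bio"),
    ("https://news.example/", "miserable weather", "https://weather.example/"),
]

index = build_anchor_index(links)
print(search(index, "miserable failure"))
```

The target page ranks for a phrase it never contains, which is exactly the property a small group of coordinated linkers could exploit.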
        
       | chrisgoman wrote:
       | too many crappy websites, probably needs a "committee" to
       | whitelist domains (only good quality ones) but probably too much
       | work for not enough money or needs some monetization strategy
        
       | nfriedly wrote:
       | I think DuckDuckGo is closer to what you want. Same results for
       | everyone, better privacy, and they're proactive about improving
       | their results.
       | 
       | https://duckduckgo.com/
       | 
       | Part of the problem is that there's a lot more low-quality
       | content to wade through now than there was in 2005. I think the
       | Google of 2005 would have trouble delivering quality results
       | today also.
        
         | hunterb123 wrote:
         | DDG never worked great for me, and it doesn't have its own
         | index.
         | 
         | Brave search has been my daily driver and it works wonderfully.
        
           | adolph wrote:
           | I'll give it a try, somehow I missed the announcement even
           | though I'm a Brave user...
           | 
           | https://brave.com/search/
        
         | MuffinFlavored wrote:
         | > I think the Google of 2005 would have trouble delivering
         | quality results today also.
         | 
         | What would you attribute their modern 2021 success to, then?
         | Just throwing a ton of money at amazing engineers to hone
         | their complex algorithm and tweak it to still return what we
         | humans quantify as "good" results? Especially if they are
         | wading through a sea of low-quality content, as you say.
        
         | gbmatt wrote:
         | both ddg and brave are bing (microsoft) in disguise.
        
           | pythux wrote:
           | This is not correct. Brave Search owns its own (growing)
           | index and relies on third-parties like Bing for some fraction
           | of the requests. Which is not the same thing as relying fully
           | on Bing or third-parties for results like so many meta-search
           | engines. More detailed answer here:
           | https://search.brave.com/help/independence
           | 
           | Edit: Forgot to say that I work on Brave Search.
        
             | gbmatt wrote:
             | brave 'falls back' to bing. which in my experience is most
             | of the time. in fact, out of all the queries i did a while
             | back, they all seemed to come directly from bing. is there
             | a way to disable the reliance on bing and get pure 'brave
             | only' results? and can you be more specific as to what this
             | fraction is? do you blend at all?
        
               | pythux wrote:
               | You can check exactly which fraction of the results were
               | fetched from Brave's index vs. third parties using the
               | "independence score" found in the settings drawer (open
               | it with the cog icon at the top right of any page on
               | search.brave.com). There is both a global and a
               | personalized independence score (aggregated over all
               | users' queries and over your queries only, respectively).
               | 
               | Explanation is also found here with screenshots:
               | https://search.brave.com/help/independence
        
               | gbmatt wrote:
               | So Brave is still dependent on Google and Bing it seems.
               | Also is this Brave's CEO:
               | https://www.bbc.com/news/technology-26868536
               | https://www.nytimes.com/2020/12/22/business/brave-brendan-ei... ?
               | "Brendan Eich's opposition to same-sex
               | marriage cost him his job at Mozilla." "Covid comments
               | get a tech C.E.O. in hot water, again."
        
         | ricardo81 wrote:
         | Does DDG have any of its own organic results yet, or is it
         | still entirely Bing/Yandex?
        
         | DavideNL wrote:
         | > a lot more low-quality content
         | 
         | I wish there was an easy way to filter ALL search results, by
         | permanently excluding specific websites, and/or keywords.
         | 
         | Surely there has to be some browser extension that does this...
        
           | BoxOfRain wrote:
           | https://news.ycombinator.com/item?id=29404860
           | 
           | Not got round to trying it yet though.
        
             | DavideNL wrote:
             | Great and it even supports iOS...!
        
           | MayeulC wrote:
           | Excluding, or penalizing for, advertising and trackers could
           | do wonders against perverse incentives and SEO, IMO. It would
           | also be a better experience for the reader.
        
         | Kiro wrote:
         | Try searching for the same thing from your computer and your
         | phone and you will get different results. Also, their results
         | come from Bing so any improvement happens at Microsoft.
        
           | JohnFen wrote:
           | They do use Bing, but not solely Bing. DDG isn't just a
           | frontend to a different search engine.
        
             | bla3 wrote:
             | It's a bing frontend with a few special cases handled
             | differently. For most queries, you get bing results. Easy
             | to check by comparing results.
        
         | jpswade wrote:
         | DuckDuckGo isn't really a search engine, it's a website that
         | uses bing's api.
        
           | cyberbanjo wrote:
           | Not just Bing, but nearly every search engine you've ever
           | used https://duckduckgo.com/bang?q=
        
         | Sunspark wrote:
         | This. DDG is my primary search engine now and has been for
         | a while.
         | 
         | I don't use Google anymore to search unless I really need to.
         | The algos they use today are not the same classic ones that
         | actually returned results.
        
           | kspacewalk2 wrote:
           | And if you really need to, DDG !bangs[0] make a search as
           | simple as "!g mother google help me". The keyword thing is
           | also available in Firefox as a browser feature, and elsewhere
           | I'm sure, but nevertheless, it makes switching to DDG easier.
           | 
           | (Plus I can directly go to the wiki page by using "!w", "!gm"
           | for google maps, etc.)
           | 
           | [0] https://duckduckgo.com/bang
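A rough sketch of how bang shortcuts like the ones above could be resolved, in Python. The `BANGS` table and `resolve_bang` function are hypothetical names for illustration, the three URL templates are assumptions, and this is not DDG's actual implementation:

```python
from urllib.parse import quote_plus

# Hypothetical bang table: shortcut -> URL template (assumed, not DDG's real list).
BANGS = {
    "g": "https://www.google.com/search?q={}",
    "w": "https://en.wikipedia.org/wiki/Special:Search?search={}",
    "gm": "https://www.google.com/maps?q={}",
}


def resolve_bang(query):
    """Return a redirect URL if the query contains a known !bang, else None."""
    words = query.split()
    for i, word in enumerate(words):
        key = word[1:].lower()
        if word.startswith("!") and key in BANGS:
            # Everything except the bang itself becomes the forwarded query.
            rest = " ".join(words[:i] + words[i + 1:])
            return BANGS[key].format(quote_plus(rest))
    return None


print(resolve_bang("!g mother google help me"))
```

A query without a known bang returns `None`, which a caller would treat as "run the normal search".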
        
             | eevilspock wrote:
             | The only bang I use is !gvb since DDG doesn't support
             | verbatim searches.
        
               | jay3ss wrote:
               | Is this the same as enclosing the terms in quotes and
               | using the !g bang?
        
           | JohnFen wrote:
           | Same. For the sorts of searching I do, anyway, the results I
           | get from DDG tend to be better than what I get from Google.
           | Google tries to infer what I want rather than take me at my
           | word, and is very bad at it.
        
       | dragonwriter wrote:
       | > Why doesn't anyone create a search engine comparable to
       | 2005-Google?
       | 
       | Because the universe being searched isn't the internet of 2005
       | and earlier, and because user expectations have moved on, too.
       | 
       | Plus the index expense.
        
       | axegon_ wrote:
       | Two major reasons: costs to build and maintain and manpower
       | needed. Both are practically impossible to come by.
        
       | kumarsw wrote:
       | I feel like we are at the low point or even losing the battle
       | between search engines and SEO spam. Maybe it is time for the
       | Yahoo-style curated directory to return? We seem to be getting a
       | microcosm of this with the awesome-* GitHub lists and Gemini with
       | its near-nonexistent search.
        
       | erpellan wrote:
       | Even if Google dusted off their 2005 codebase and ran it on
       | today's web it wouldn't come close to the results quality of
       | Google in 2005. The SEO industry has been in an arms race with
       | the search engines for 16 years. 2005 Google would be like a
       | goldfish in a piranha tank.
        
       | WalterBright wrote:
       | I'd like to see categories like travel, science, history, art,
       | etc. The web pages could pick which categories their page falls
       | into using meta tags. The user has the option of selecting which
       | category they are interested in searching within.
        
       | mrkramer wrote:
       | They do[0] but nobody cares anymore. Google controls web
       | distribution through Google Chrome. I think we are at the point
       | of no return. There won't be any competition anytime soon no
       | matter what US government does. Only innovation can displace
       | Google.
       | 
       | [0] https://search.marginalia.nu/
        
         | BbzzbB wrote:
         | Marginalia is great to find blog posts, personal sites and
         | other long form content, but it's not a replacement for Google,
         | nor does it intend to be.
        
           | mrkramer wrote:
           | But it is a good start and foundation for something bigger
           | and better.
        
           | egberts1 wrote:
           | Funny. Marginalia has an option for No JavaScript but I
           | cannot even do an HTTP "POST" with JavaScript disabled in my
           | web browser.
           | 
           | Disclaimer: I study malicious JS stuff.
        
           | marginalia_nu wrote:
           | It does operate on a scale and principle fairly similar to
           | early 2000s google, so the comparison isn't that far off, but
           | yeah, it's quite some way before it's viable for general
           | search. Dunno if I'll ever get there, but it does
           | consistently seem to get better so who knows.
        
             | BbzzbB wrote:
             | Isn't its similarity to early Google a side-effect of the
             | early Internet being text-heavy sites in the first place
             | rather than a similarity in the search engine? Unless I am
             | misunderstanding your site's intent, even if you reach the
             | dream engine you are trying to achieve, I won't be using it
             | to search answers for coding questions on SO, how-tos for
             | car repair, sites to stream movies, governmental page for X
             | need, transcript for earnings calls, etc.
             | 
             | In my experience it is better than Google at what it does
             | if I'm looking for long-form texts (exception being
             | scientific/peer-reviewed articles, Google tends to shoot me
             | those for the type of queries I make on Marginalia), but is
             | very much complementary rather than a replacement.
        
               | marginalia_nu wrote:
               | I guess it depends on what you are looking for on the
               | Internet.
               | 
               | Right now the biggest problem with Marginalia is that it
               | has a fairly uneven quality level. For some queries it's
               | absolutely incredible. For others, it doesn't really
               | provide many useful results at all. I do think it's
               | possible to even that out a considerable bit, to make it
               | more viable for general queries. It's never going to be
               | able to answer every query, but it probably could answer
               | a lot more than it does.
        
       | freediver wrote:
       | We are building one [1] as well as a few other people that I am
       | aware of with different approaches and business models.
       | 
       | We also need to be aware that when we remember past times it
       | usually carries a romantic, nostalgic note. The web is very
       | different than it was 15 years ago and the problem of search has
       | evolved.
       | 
       | What you are looking for is basically 'grep for the web' but it
       | is just one facet of search that we use today. 15 years ago you
       | would not get an instant answer to a question like you do today
       | and many users would not be able to live without that today.
       | There are also maps and location based answers, all sorts of
       | widgets like translation etc. Also, the world has become more
       | polarized, so an objective best search result has become more
       | difficult to produce, especially for events covered in the news,
       | which means bias inevitably starts to creep in.
       | 
       | This is not to say that Google is good or bad today, it is what
       | it is and they are doing the best they can. Startups like ours see an
       | opportunity on the market, in large part to help savvy users find
       | what they want.
       | 
       | [1] https://kagi.com
        
       | motohagiography wrote:
       | I do like the idea that instead of crawling and indexing, the
       | next generation search will likely be more like a federated
       | community search app that indexes the stuff members actually
       | read. Google search isn't so much a repository as a consensus
       | about what's important, hence why it's so politicized to the
       | point of becoming unreliable, but also why it too is vulnerable
       | to disruption.
       | 
       | Imo, 2005 google got initial traction because of its tech forum
       | post indexing, as I remember my switch to it was because it
       | became an extension and then replacement for manpages. In that
       | sense, what made it good was it reflected the consensus of what
       | its incredibly influential userbase thought was important and
       | just managed that really well. The demographic impact of the U.S.
       | Gen X all using it at once didn't hurt either.
       | 
       | The equivalent today, as a lot of us say, is that blockchains are
       | in the 1997 internet phase, and the service that makes the
       | content of those as navigable as the 90's internet, will likely
       | grow in a similar way.
       | 
       | Search that provides young people with privacy and freedom to
       | pursue their true interests will be the dominant strategy. Its
       | success will be because it's a product that rides growth, and not
       | because it "solved a problem." Imo, we all index too much on the
       | privacy pattern because the freedom pattern is too risky.
       | 
       | What's changed since that time are the maturity of things like
       | Bloom and other probabilistic filters, Apple's private set
       | intersection, differential privacy, zksnarks, and everybody you'd
       | ask an opinion from now gets their content through mobile
       | devices. Apple's ecosystem is equipped to do this kind of search,
       | but they're too exposed politically to get into it. Meta will
       | likely go there, but nobody's going to trust them willingly.
       | 
       | A protocol that generated a cryptographically strong anonymous
       | index from your browsing - and instead of putting it on google's
       | servers, it was on a chain, or the content index information and
       | its evolving consensus score was included in something like a DNS
       | record - may still unseat these ensconced interests. IPFS and
       | other P2P or torrents might do something like that as well.
       | Blockchains may be good for that consensus/desire score.
       | 
       | It's not something you architect and design top down that has to
       | solve all cases, it will be just another useful product that
       | grows while riding a demographic change. It would be on the level
       | of inventing HTML/HTTP again, which, when you think about it, was
       | just another dude making a thing he needed.
        
       | BbzzbB wrote:
       | No mention of DDG in the comments? Is there a reason I'm not
       | seeing or it's just not the preferred alt-search on HN? Seems to
       | have been working fine for me when I struggle to get past the
       | funnels and content mills on Google.
        
         | pantulis wrote:
         | I don't find search results to be too relevant (at least for me,
         | also Spaniard here). It is my default search engine only for
         | the bang commands.
        
         | KennyBlanken wrote:
         | For me, DDG results are even worse than Google. It's set as my
         | default and I'd say at least half of my searches in DDG
         | generate completely useless results...pages of obviously SEO'd
         | garbage.
         | 
         | DDG also doesn't support showing a site's basic structure in
         | the search results (ie, the card of a company's website with
         | Products, Contact Us, Support, etc) and the preview text is
         | garbo as well...it reminds me of 1990's era electronic card
         | catalog search excerpts.
         | 
         | I look at the first page or two, give up, search google. While
         | I have to hunt a bit in the results, I do eventually get what I
         | wanted.
        
           | infinitezest wrote:
           | Every time this comes up I'll see a few people talk about how
           | the results aren't relevant but it has not been my
           | experience. I've been using DDG as my main search engine for
           | a few years and never have to go beyond the first page. I'm
           | really curious why that is.
        
             | JohnFen wrote:
             | My experience is like yours -- DDG is legitimately better
             | than Google. My hypothesis is that it's related to how you
             | construct searches. I expect Google probably does better if
             | you learn how to talk to it, since it seems to want to
             | interpret your query rather than take it literally.
             | 
             | My searches tend to be keyword-oriented rather than natural
             | language. I think DDG does better with those.
        
         | RDaneel0livaw wrote:
         | I was looking for this as well! I use it daily and have for
         | years. Love it.
        
         | Kiro wrote:
         | DDG doesn't have their own index (they're getting their results
         | from Bing) so not really relevant to this question.
        
           | BbzzbB wrote:
           | I.. didn't know that. However, trying it just now in
           | incognito I don't get the same results[0] (some different
           | links, and most re-ordered). Is Duck repurposing Bing's
           | results? I've tested with "how to get rich", a great bait for
           | bad content (try it on Google without an adblocker, if you
           | dare).
           | 
           | [0]: https://pastebin.com/xC45hL1i
        
             | Kiro wrote:
             | I don't know what DDG is doing but I'm imagining that they
             | send in the raw queries while you can't get around Bing's
             | personalisation even in incognito. I get very similar
             | results for "how to get rich", but only after setting "All
             | regions" on DDG.
             | 
             | Bing:
             | 
             | 1. How to Get Rich: 10 Things Wise and Rich People Do
             | 
             | 2. 5 Ways to Get Rich - wikiHow
             | 
             | 3. 16 Proven Ways On How To Get Rich Quick (2021 Edition) -
             | TPS
             | 
             | 4. How to Get Rich - NerdWallet
             | 
             | 5. How to Get Rich: Follow our Step by Step Plan to Build
             | ...
             | 
             | DDG:
             | 
             | 1. How to Get Rich: 10 Things Wise and Rich People Do
             | 
             | 2. 5 Ways to Get Rich - wikiHow
             | 
             | 3. How to Get Rich - NerdWallet
             | 
             | 4. 16 Proven Ways On How To Get Rich Quick (2021 Edition) -
             | TPS
             | 
             | 5. How to Get Rich: 8 Steps to Make Your First Million ...
             | 
             | It's no secret that DDG is using Bing so they're not trying
             | to hide it. An easy way to verify it is to search for "what
             | is my ip" on DDG and look for results where the IP number
             | has been cached in the snippet, e.g.:
             | 
             | www.myipnumber.com What is my IP number - my IP address -
             | MyIpNumber.com What is my IP Number? The IP Number of this
             | machine is: 157.55.39.192. This number can also be
             | represented as a 32-bit decimal number 2637637568, or as a
             | 32-bit hexadecimal number 0x9D3727C0 . (Note that if you
             | are part of an internal network then this is the IP number
             | of your local server, the machine which is connected to the
             | external ...
             | 
             | If you do an IP lookup on 157.55.39.192 you will see that
             | it's in fact "Microsoft bingbot".
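The decimal and hexadecimal forms quoted in that snippet can be checked with a few lines of Python. This is only a sketch using the standard `ipaddress` module; `ip_forms` is a made-up helper name:

```python
import ipaddress


def ip_forms(dotted):
    """Return (32-bit integer, hex string) for a dotted-quad IPv4 address."""
    as_int = int(ipaddress.IPv4Address(dotted))
    return as_int, f"0x{as_int:08X}"


# The address cached in the search snippet above.
print(ip_forms("157.55.39.192"))  # (2637637568, '0x9D3727C0')
```

Both values match the cached snippet, so the page really was rendered for a client at 157.55.39.192.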
        
       | marksbrown wrote:
       | I'd like a way of automatically filtering for websites that :
       | 
       | * Don't use JS
       | * Don't use Google Analytics
       | * Don't weigh more than a few kB per page
       | * Don't show ads
       | 
       | That would be a place to begin.
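A minimal sketch of what such a filter might look like if a crawler already had the raw HTML of each page. Everything here is an assumption for illustration: the `is_lightweight` name, the 20 kB cutoff standing in for "a few kB", and the short, incomplete list of analytics/ad hosts:

```python
from html.parser import HTMLParser

# Illustrative, incomplete list of analytics and ad hosts (an assumption).
BLOCKED_HOSTS = (
    "google-analytics.com",
    "googletagmanager.com",
    "doubleclick.net",
    "googlesyndication.com",
)


class ScriptDetector(HTMLParser):
    """Records whether the page contains any <script> tag."""

    def __init__(self):
        super().__init__()
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.has_script = True


def is_lightweight(html, max_bytes=20_000):
    """True if the page has no scripts, is small, and references no blocked host."""
    detector = ScriptDetector()
    detector.feed(html)
    lowered = html.lower()
    return (
        not detector.has_script
        and len(html.encode("utf-8")) <= max_bytes
        and not any(host in lowered for host in BLOCKED_HOSTS)
    )


print(is_lightweight("<html><body><p>Plain old page</p></body></html>"))
```

A real index would also have to count images, stylesheets and other subresources toward page weight; this only looks at the HTML document itself.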
        
       | fnord77 wrote:
       | information-dense pages of yore have been replaced by really
       | wordy, probably generated SEO optimized blog junk.
        
       | tigerlily wrote:
       | Surely there must be some way to have distributed search compute
       | a la folding/seti@home or those mersenne prime guys.
       | 
       | I'd gladly pool in some of my CPU time if it helps build a better
       | search.
        
         | teddyh wrote:
         | https://yacy.net/
        
           | tigerlily wrote:
           | Thanks!
        
       | michaelyuan2012 wrote:
       | here is Drop Side Trailer information, it should be helpful for
       | those logistic company.
       | 
       | https://www.dreamtruegroup.com/drop-side-trailer/
        
       | mrfusion wrote:
       | I've always wondered why you can't use SEO optimizations for
       | GOOGLE as a negative weight and penalize those pages.
       | 
       | For example if my search term appears in the URL I can almost
       | guarantee I don't want that page.
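As a toy illustration of that idea, here is a Python sketch that re-ranks results by subtracting a penalty when query terms appear in the URL. The result dictionaries, the base `score` field and the `penalty` weight are all hypothetical, and this is not how any real engine is known to rank:

```python
def rerank(results, query, penalty=0.5):
    """Sort results by base score minus a penalty for query terms found in the URL."""
    terms = query.lower().split()
    if not terms:
        return list(results)

    def adjusted(result):
        url = result["url"].lower()
        hits = sum(1 for term in terms if term in url)
        # Penalty scales with the fraction of query terms stuffed into the URL.
        return result["score"] - penalty * hits / len(terms)

    return sorted(results, key=adjusted, reverse=True)


results = [
    {"url": "https://example.com/best-pdp11-emulator", "score": 1.0},
    {"url": "https://example.org/notes/2004/03", "score": 0.8},
]
for r in rerank(results, "pdp11 emulator"):
    print(r["url"])
```

With these made-up scores the keyword-stuffed URL drops below the plain one. The obvious catch is that legitimate pages (documentation, wikis) also put topic words in their URLs, so a signal like this would need to be one weak input among many.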
        
       | ChrisArchitect wrote:
       | related 2 days ago:
       | 
       |  _Ask HN: Has Google search become quantitatively worse?_
       | 
       | https://news.ycombinator.com/item?id=29392702
       | 
       | Inviting all the paranoid/speculative/hearsay/personal experience
       | responses. Lame Ask HNs!!!!!
        
       | michaelyuan2012 wrote:
       | for the people in logistic business area who search in Google,
       | Flatbed Container Semi Trailer, it should be good for your
       | reference.
       | 
       | https://www.dreamtruegroup.com/flatbed-container-semi-traile...
        
       | ravenstine wrote:
       | I think what [some] people actually want isn't the Google of 2005
       | but to have a search engine where they don't feel like they're
       | being manipulated.
        
       | anotheraccount9 wrote:
       | Check out the dead internet theory. If most people browse 1% of
       | the web, what's up with popular search engines?
        
       | ab_testing wrote:
       | I think a lot of people are ignoring the issue that the web has
       | changed considerably since 2005. It is approximately 10 times
       | larger in terms of number of websites and web pages. And a lot of
       | it is SEO junk that is just designed for search engines to be
       | easier to parse and show ads in your face.
       | 
       | Also user preferences have changed in the last decade or so. I
       | know millennials and users in their late 30's or early 40's still
       | yearn for the old web where they would type a search term and
       | correct results would astonish them. However, younger users tend
       | to gravitate to videos and that is why a large portion of the
       | google results are now video results.
        
       | rovingEngine wrote:
       | I think Google was "better" from a user's point of view in 2005
       | because it wasn't that good at selling ads yet. I still remember
       | the epiphany of the first time I used Google in 1999. It was
       | amazing.
       | 
       | I've thought the same about pre-ad Twitter and Facebook.
       | 
       | Early on, startups with free services look a lot like non-profits
       | and just maximize user benefit to grow. The problem is they're
       | not non-profits, and have to make money at some point. That has
       | tended to mean ads.
       | 
       | I'd easily pay, say, $9/mo to have access to an ad-free search
       | engine that made me feel the way 1999 Google did.
        
         | mmmmmbop wrote:
         | $9/mo is not going to cut it. Google's domestic annual revenue
         | per user in 2019 was $256. [0] That's $21.33 per month. Not all
         | of Google's revenue is from Ads, of course, but the vast
         | majority is. (Let's ignore for now the valid counterpoint that
         | Ads are increasingly served on other Google properties than
         | Search.)
         | 
         | But even charging users $21.33/mo for an ad-free search
         | experience most likely wouldn't be enough. By providing such an
         | option, you'd greatly reduce the value of the remaining Ads
         | pool.
         | 
         | The optimistic perspective on this is that if you are one of
         | the users with disposable income, you're essentially
         | subsidizing a great search engine and a suite of other tools
         | for the less well-off ones.
         | 
         | [0] https://miro.medium.com/max/6545/0*YTqXb-F5UiVhtlIS
        
           | rovingEngine wrote:
           | Let's say ads will always make more money (I have no reason
           | to believe they won't), and that's required to be the
           | dominant search engine because the web is big and expensive
           | to organize.
           | 
           | I'd bet there's some way to characterize what I and others
           | liked about the earlier web and create a search engine that
           | just worries about that stuff. I'd pay $9/mo for whatever 1/3
           | of Google's spend per user would get me. That's not to say
           | this thing would "beat" Google, but it could profitably
           | exist.
        
       | BitwiseFool wrote:
       | Natural Language Processing is a pox on modern search engines. I
       | suspect that Google et al. wanted to transform their product
       | into an answer engine that powers voice assistants like Siri and
       | just assumed everyone would naturally like the new way better. I
       | can't stand how Google is always trying to guess what I want,
       | rather than simply returning non-personalized results solely
       | based on exactly what I typed in the textbox.
       | 
       | While that may be good for most people, there is still a lot of
       | power and utility in simple keyword-driven searches. Sadly, it
       | seems like every major search engine _has_ to follow Google's
       | lead.
        
         | maxlamb wrote:
         | What's a pox?
        
           | mattanimation wrote:
           | a disease or plague
        
           | rocqua wrote:
           | Saying X is a pox on Y means saying X is bad for Y.
           | 
           | It originates from the disease 'the pox'.
        
         | marginalia_nu wrote:
         | I think _some_ NLP is strictly beneficial for a search engine.
         | You may think "grep for the web" sounds like a good idea, but
         | let me tell you, having tried this, manually going through
         | every permutation of plural forms of words and manually
         | iterating the order of words to find a result is a chore and a
         | half.
         | 
         | Like, instead of trying
         | 
         |     PDP11 emulator
         |     PDP-11 emulator
         |     "PDP 11" emulator
         |     PDP11 emulators
         |     PDP-11 emulators
         |     "PDP 11" emulators
         |     PDP11 emulation
         |     PDP-11 emulation
         |     "PDP 11" emulation
         | 
         | Basic NLP can do that a lot faster without introducing a lot of
         | problems.
         | 
         | I do think Google currently goes way overboard with the NLP. It
         | often feels like the query parser is an adversary you need to
         | outsmart to get to the good results, rather than something
         | that's actually helpful. That's not a great vibe. However, I
         | think the big problem isn't what they are doing, but how little
         | control you have over the process.
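To make the "basic NLP" point concrete, here is a deliberately crude Python sketch that maps all nine PDP-11 query variants from the comment above onto one normalized form. The suffix list and the letter-plus-digit joining rule are assumptions chosen for this example, not a real stemmer and not what Marginalia or Google actually run:

```python
import re


def crude_stem(word):
    """Strip a few English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ions", "ion", "ors", "or", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def normalize(query):
    """Lowercase, drop quotes, join 'pdp-11' / 'pdp 11' into 'pdp11', stem each token."""
    q = query.lower().replace('"', " ")
    # Join a run of letters and a run of digits separated by hyphens or spaces.
    q = re.sub(r"\b([a-z]+)[-\s]+(\d+)\b", r"\1\2", q)
    return tuple(crude_stem(token) for token in q.split())


variants = [
    "PDP11 emulator", "PDP-11 emulator", '"PDP 11" emulator',
    "PDP11 emulators", "PDP-11 emulators", '"PDP 11" emulators',
    "PDP11 emulation", "PDP-11 emulation", '"PDP 11" emulation',
]
print({normalize(v) for v in variants})
```

All nine variants collapse to the same key, so a single index lookup covers them. The same rules would also merge things a user may want kept apart (for example "windows 10" becomes "windows10"), which is where the lack of user control starts to hurt.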
        
           | kenny11 wrote:
           | I get that for general-purpose searches this is a good idea,
           | but it would be nice if there was an easy way to disable this
           | when you know you don't want it - for example, for most
           | programming searches, if I type SomeAPINameHere the most
           | relevant results will always be those that include my search
           | term verbatim. I don't need Google to helpfully suggest "Did
           | you mean Some API Name Here?", which will virtually always
           | return lower-quality search results.
           | 
           | Early Google was a breath of fresh air compared to the
           | stemming that its competitors at the time did, but nowadays
           | even putting search terms in quotes doesn't seem to return
           | the same quality of results for these types of queries that
           | Google used to have.
        
             | thisisnotatest wrote:
             | I feel your pain. Two workarounds when Google gets it wrong
             | are to put the term in quotation marks, or to enable
             | Verbatim mode in the toolbelt. (I know various people have
             | come up with ways to add "Google Verbatim" as a search
             | engine option in their browser, or use a browser extension
             | to make Verbatim enabled by default.)
             | 
             | Disclaimer: I work on Google search.
        
               | Y_Y wrote:
               | Both of these options are disappointing, in my
               | experience. Verbatim mode seems weirdly broken sometimes
               | (maybe it's overly strict), and quoting things is rarely
               | enough to convince Google that you really want to search
               | for exactly that thing and not some totally different
               | thing that it considers to be a synonym.
               | 
               | One porridge is too hot and the other is too cold. I know
               | Google could find a happy compromise here if it wanted
               | to. In fact, I bet there's some internal-only hacked-
               | together version that works this way and actually gives
               | an acceptable user experience for the kind of people who
               | have shown up to this thread to show their
               | dissatisfaction.
        
               | vdqtp3 wrote:
               | Try this, go to Google and type in "eggzackly this".
               | 
               | Two results not containing "eggz" at all. Two results
               | containing "eggzackly<punctuation>this" Two results
               | containing "eggzackly" but missing "this".
               | 
               | Google Search is broken. It no longer does what it's
               | directed, it just takes a guess. I suspect part of this
               | is because someone decided that "no results found" was
               | the worst possible result a search engine could give.
        
           | KennyBlanken wrote:
           | Google does go way overboard with "NLP". Starting at least 5
           | years ago there was a trend toward "similar" matching and
           | search result quality nose-dived.
           | 
           | You can search for, say, "cycling (insert product category
           | here)" and get motorcycle related results. Why? Because to
           | google "cycling" = "biking" and "motorcycles" are "bikes",
           | bob's your uncle, now you're getting hits for motorcycle
           | products.
           | 
           | Every time I try to do a very specific search I can see from
           | the search results how google tries to "help", especially if
           | the topic is esoteric. The pages actually about the esoteric
           | thing I'm searching for get drowned in a sea of SEO'd
           | bullshit about a word/topic that is 1-2 degrees of separation
           | from each other in a thesaurus. I'm sure someone at google is
           | very, very proud of this because it increases their measure
           | for search user satisfaction X percent.
           | 
           | It does this thesaurus crap even with words in quotes, which
           | is especially infuriating.
        
             | marginalia_nu wrote:
             | Yeah. It's one of those things where it's invisible where
             | it works and enraging when it doesn't. That's generally not
             | a failure mode that's desirable. At the very least, it should
             | require extremely low failure rates to be justified.
        
           | JohnHaugeland wrote:
           | "Basic NLP can do that a lot faster without introducing a lot
           | of problems."
           | 
           | This is called "stemming" and is not sensibly approached with
           | machine learning.
        
             | marginalia_nu wrote:
             | Of course, but stemming is a fairly basic technique in NLP,
             | as is POS-tagging. NLP is not machine learning.
        
               | brokensegue wrote:
               | Modern NLP basically is machine learning
        
               | marginalia_nu wrote:
               | You can still do NLP without machine learning though, and
               | a lot of the sorts of computational linguistics a search
               | engine needs for keyword extraction and query parsing
               | doesn't require particularly fancy algorithms. What it
               | needs is fast algorithms, and that's not something you're
               | gonna get with ML.
        
               | JohnHaugeland wrote:
               | Stemming is not meaningfully a natural language
               | processing technique, any more than arithmetic is a
               | technique of linear equations.
        
               | necovek wrote:
               | At the very least,
               | https://en.wikipedia.org/wiki/Natural_language_processing
               | seems to disagree.
               | 
               | (So do I: NLP does not have to be machine learning/AI
               | based)
        
               | marginalia_nu wrote:
               | Is it not the processing of natural language?
        
               | JohnHaugeland wrote:
               | Would you call addition a system of linear equations?
               | 
               | No, you don't use the college senior label for the
               | highschool freshman topic. You use the smallest label
               | that fits.
               | 
               | It's string processing.
               | 
               | NLP is actually understanding the language. Stemming is
               | simple string matching.
               | 
               | Playing the technicality game to stretch fields to
               | encompass everything you think even marginally related
               | isn't being thorough or inclusive; it's being bloated,
               | and losing track of the meaning of the term.
               | 
               | Splitting on spaces also isn't NLP.
        
               | marginalia_nu wrote:
               | Stemming is a task specific to a natural language. You
               | can't run an English stemmer on French and get good
               | results, for example.
               | 
               | All NLP is, strictly speaking, more or less elaborate
               | string matching.
               | 
               | > Splitting on spaces also isn't NLP.
               | 
               | String splitting can be, but it's a bit borderline. I'll
               | argue you're in NLP territory if it doesn't split "That
               | FBI guy i.e. J. Edgar Hoover." into four "sentences".
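
A minimal sketch of the distinction drawn above, assuming a tiny hand-written abbreviation list (ABBREVIATIONS, split_sentences and naive_split are hypothetical names, not from any library). Splitting on every period yields four fragments for the example sentence, while a splitter with a little English-specific knowledge keeps it whole.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real splitter needs far more.
ABBREVIATIONS = {"i.e.", "e.g.", "mr.", "mrs.", "dr.", "etc."}


def naive_split(text):
    """Baseline: treat every period as a sentence boundary."""
    return [part.strip() for part in text.split(".") if part.strip()]


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known
    abbreviations and single-letter initials such as "J."."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        is_abbreviation = token.lower() in ABBREVIATIONS
        is_initial = re.fullmatch(r"[A-Za-z]\.", token) is not None
        if is_abbreviation or is_initial:
            continue  # the period belongs to the token, not the sentence
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

The abbreviation list is English-only, which is the same language dependence noted for stemmers earlier in this comment.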
        
               | necovek wrote:
               | > NLP is actually understanding the language.
               | 
               | That's actually not an accepted terminology. There's,
               | indeed, this:
               | https://en.wikipedia.org/wiki/Natural-language_understanding
               | 
               | Not sure why you are so adamant that yours is the "true
               | meaning", when NLP existed long before machine learning
               | and AI were used for it. And even if not, every term can
               | be defined differently, so it should be normal to have
               | different institutions/people define NLP differently.
        
           | JPKab wrote:
           | Semantic search requires NLP. So does the Q&A format the OP
           | is complaining about. People conflate all things NLP with the
           | latter, and forget about the former.
        
           | BitwiseFool wrote:
           | Maybe I'm not using the right qualifiers around the term NLP.
           | The kind of NLP I was referring to is something like "Hey
           | google, what is natural language processing?" and orienting
           | the search around people asking questions in standard(ish)
           | English like they would to another person.
        
             | marginalia_nu wrote:
             | NLP is very heavily integrated into search, so I don't
             | think it's really possible to decouple them. But I agree
             | the whole BonziBuddy thing they've got going now is
             | annoying and it's especially unfortunate how it's replaced
             | the search functionality. I'd have a lot more patience with
             | it if I could choose this functionality when I wanted to
             | ask a question.
        
             | gk1 wrote:
             | That's known as Open Domain Question Answering[1] and is
             | only a subset of NLP.
             | 
             | [1] https://www.pinecone.io/learn/question-answering/
        
         | wpietri wrote:
         | I doubt they assumed it was better. I expect they did a ton of
         | user testing and found that it was better for most people. And
         | I'm sure it is. HN users are very much a niche audience these
         | days.
        
           | gk1 wrote:
           | Right. Bing switched to this method as well, as did Facebook,
           | Twitter, Amazon, and pretty much every other company that has
           | the ML resources to do this. They obviously had a good reason
           | to do so, beyond assumptions.
        
       | s1k3s wrote:
       | I don't know how Google was in 2005, but in ~2010 I was able to
       | pull a website up to #1 with 0 cash spent, just by manipulating PR.
       | That doesn't seem great to me.
        
       | flipdot wrote:
       | Not sure if this is anywhere close to what you're trying to find, but
       | there's https://github.com/benbusby/whoogle-search
        
       | nickpp wrote:
       | Because we don't have a 2005 Web anymore. More to the point,
       | SEO & Google have evolved together. To have barely relevant
       | results today you need to be _good_. That takes stellar talent
       | which costs huge amounts of money.
       | 
       | Thus, the Google of today, which is optimized to extract that
       | money from us.
        
         | Const-me wrote:
         | > To have barely relevant results today you need to be good
         | 
         | An easy way to become way better than google -- detect google
         | ads on pages, and penalize these pages in the index. For
         | obvious reasons, google search is incapable of doing so.
        
         | ginko wrote:
         | But shouldn't all the blogspam be so hyperoptimized for
         | Google's algorithm that it should be straightforward to detect
         | and ignore/downrank it?
        
           | marginalia_nu wrote:
           | Yeah, I do this with my search engine. Works pretty well. A
           | complementary approach that works well is to look at where
           | blogs written by humans link. Very few spam blogs get links
           | from humans.
        
           | elcomet wrote:
           | It's not that easy; they are optimized for many metrics.
        
           | beingflo wrote:
           | No, because google's algorithm is not well known publicly.
           | Also, if it was straightforward to detect then google could
           | downrank it as well.
        
             | kbelder wrote:
             | I wonder if you could evaluate a page using your own
             | algorithm, which is probably not gamed as much as Google's
             | (because who cares about your search engine?)
             | 
             | Then, check Google's ranking of the page. If it is much
             | higher than it seems the page should be, assume the page is
             | being SEO hyper-optimized and penalize the page
             | proportionately.
             | 
             | Basically, using the variance between Google's model and
             | your model as an indicator of an SEO spam page.
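
The idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: both ranks are 1-based positions for the same query, obtaining Google's rank is out of scope, and the names (seo_penalty, adjusted_score) and the tolerance, scale and max_penalty values are hypothetical.

```python
def seo_penalty(own_rank, google_rank, tolerance=5, scale=100.0, max_penalty=0.9):
    """Return a penalty in [0, max_penalty] that grows in proportion to how
    much better Google ranks a page than our own model does.

    Ranks are 1-based positions; a lower number is a better rank.
    """
    gap = own_rank - google_rank  # positive when Google likes the page more
    if gap <= tolerance:
        return 0.0  # small disagreements are treated as noise
    return min(max_penalty, (gap - tolerance) / scale)


def adjusted_score(base_score, own_rank, google_rank):
    """Scale our own relevance score down by the divergence penalty."""
    return base_score * (1.0 - seo_penalty(own_rank, google_rank))
```

A page that Google ranks lower than the local model does is left untouched, so the penalty only ever targets pages that look over-promoted.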
        
             | ginko wrote:
             | The point is that SEO would just immediately adapt to
             | Google's changes. If a smaller search engine filtered these
             | out then it would likely stay under the radar.
        
               | thefreeman wrote:
               | you know that legitimate sites perform SEO as well,
               | right?
        
               | marginalia_nu wrote:
               | SEO often seems to be a compensation for the fact that a
               | site doesn't have particularly worthwhile content. So
               | punishing SEO surprisingly does promote higher quality of
               | search results.
        
               | all2 wrote:
               | Yes and no. A lot of those sites are small local
               | businesses trying to get found. A front page listing can
               | be the difference between surviving and going under. Much
               | of the time the blog spam is what floats hours, contact
               | info, and services provided to the first page.
        
               | marginalia_nu wrote:
               | Be that as it may, search ranking is a zero sum game. The
               | unfair advantage SEO gives this particular struggling
               | business means another goes under. I'd rather punish the
               | guy trying to game the system than the one with enough
               | principles not to.
        
               | pessimizer wrote:
               | The difference is far more likely to be in capability or
               | expertise than principles.
        
               | marginalia_nu wrote:
               | Either way, capability for fuckery is not something I'd
               | want to encourage.
        
           | nickpp wrote:
           | I _read_ auto-generated pages almost to the end before
           | realizing it was SEO spam. (I am not a native English speaker
           | though)
           | 
           | With content copying, shuffling and AI generating, I am
           | afraid we are on the cusp of auto content generators passing
           | some restricted Turing test where readers really think it's
           | an actual human that wrote it.
           | 
           | As for me, I learned that for certain "hot topics", simply
           | doing a generic search on Google is not a good idea anymore.
        
         | thisisnotatest wrote:
         | Yes, I think you'd call it a Red Queen Problem:
         | 
         | "Here, you see, it takes all the running you can do to keep in
         | the same place."
         | 
         | -Lewis Carroll's Through the Looking Glass
        
       | willcipriano wrote:
       | I'd like to see a "just search" engine: all it does is search for
       | a specific string, case insensitively, across the entire web. No
       | curation or anything, just sorted in lexicographical order, closest
       | match first, maybe falling back to page age if it has more than
       | one exact match. Perhaps give me some regular expressions as
       | well.
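
A toy sketch of such an engine over an in-memory list of pages, assuming each page is a (url, text, first_seen) tuple. The function name just_search is hypothetical, and the ordering is one possible reading of the comment: oldest page first, with ties broken by URL in lexicographical order.

```python
import re
from datetime import date


def just_search(pages, query, use_regex=False):
    """Return URLs of pages whose text contains the query, ignoring case.

    pages: iterable of (url, text, first_seen) tuples, first_seen a date.
    With use_regex=False the query is matched as a literal string.
    Results are ordered oldest first, then by URL.
    """
    source = query if use_regex else re.escape(query)
    pattern = re.compile(source, re.IGNORECASE)
    hits = [
        (first_seen, url)
        for url, text, first_seen in pages
        if pattern.search(text)
    ]
    return [url for first_seen, url in sorted(hits)]
```

There is no ranking signal beyond the match itself, which is the point of the proposal and also why very common words would return unusably long lists.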
        
         | jeffbee wrote:
         | That would be easily the worst search engine ever deployed.
         | Imagine just returning all docs containing the word "bicycle"
         | in chronological order. Useless.
        
           | willcipriano wrote:
           | For "Bicycle" it would suck but I don't often use search
           | engines that way, for "High Timber ALX 29" you'd probably get
           | something like this:
           | https://www.schwinnbikes.com/products/high-timber-alx-29?var...
           | 
           | I wouldn't use it for everything but sometimes that is the
           | exact behavior that I want. I'd use duck duck go for more
           | general searches.
        
             | jeffbee wrote:
             | That is the top hit on google for that search, so what's
             | your complaint?
        
               | willcipriano wrote:
               | Take a random part number off your car, or a portion of an
               | error message, and try finding that. It's annoying to have
               | to scroll down over a page or two of autogenerated SEO
               | answers to get to something useful. The first result to
               | appear on the internet is less likely to be SEO and more
               | likely to be the manufacturer's documentation or the git
               | commit that spawned your error. It isn't always, but
               | that's why you have more than one search engine.
               | 
               | Secondarily I think a search engine that is very simple
               | in its model and operation is useful for more general
               | free speech purposes. If the major search engines decide
               | they don't like a site like the pirate bay a search for
               | '"Pirate Bay" And "Torrents"' on a search engine that
               | does not curate could still get you there. I guess the
               | point is without curation you have to work harder to find
               | what you want, but nobody is actively preventing you from
               | finding anything. It would help keep everybody honest.
        
         | prox wrote:
         | Maybe a "stability factor" could be calculated. Whereas earlier
         | new content was king, I now value a stable long term source of
         | information. So domain age + page age + content variability +
         | dependency on ads. That might give more honest sources a go.
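
One way the four signals above might be combined, as a sketch only. The field names, the equal weighting and the 10-year age cap are assumptions for illustration, not anything a real search engine is known to use.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    domain_age_years: float
    page_age_years: float
    content_change_rate: float  # fraction of content changed between crawls, 0..1
    ad_fraction: float          # fraction of the page given over to ads, 0..1


def stability_score(signals, age_cap_years=10.0):
    """Average four signals, each normalized to 0..1, into a score in 0..1.

    Older domains and pages score higher; frequent content churn and
    heavy ad load score lower. Ages beyond the cap earn no extra credit.
    """
    domain = min(signals.domain_age_years, age_cap_years) / age_cap_years
    page = min(signals.page_age_years, age_cap_years) / age_cap_years
    steady = 1.0 - signals.content_change_rate
    ad_free = 1.0 - signals.ad_fraction
    return (domain + page + steady + ad_free) / 4.0
```

Capping the age terms keeps a very old domain from outweighing heavy churn or ad load on its own.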
        
           | willcipriano wrote:
           | That's a good idea, I'd make it an option. Do you want newest
           | first, oldest first or by stability?
        
       ___________________________________________________________________
       (page generated 2021-12-02 23:02 UTC)