[HN Gopher] Ask HN: Why doesn't anyone create a search engine co...
___________________________________________________________________
Ask HN: Why doesn't anyone create a search engine comparable to
2005 Google?
I seem to recall that Google consistently produced relevant results
and strictly respected search operators in 2005 (?), unlike the
modern Google. And back then, I think search results were the same
for everyone, rather than being customized for each user.
Author : syedkarim
Score : 233 points
Date : 2021-12-02 15:10 UTC (7 hours ago)
| keddad wrote:
| While I feel that Google has become worse in the last couple of
| years, I'm pretty sure it is still better now than 15 years ago.
| Maybe it is just some kind of nostalgia?
| micromacrofoot wrote:
| the internet has changed, partially due to google's influence
|
| instead of discussion forums and Q&A sites, everyone's on
| facebook/twitter/discord/slack/snapchat/tiktok/etc... none of
| that is really very google friendly
|
| online marketing and SEO is a _much_ larger industry now, so
| with less (by % of total) searchable content generated by
| people (which is on social media) a lot of the high-ranking
| content that appears in search is highly optimized marketing
|
| then you have other kind of weird things like... half of all
| internet traffic being bots
| abhaynayar wrote:
| I'm probably the only person who doesn't think Google search has
| deteriorated. I play security CTFs, so a lot of times I have to
| search for peculiar technical details on various software. Also,
| like any other human being, I also make generic queries. In both
| cases, I feel like I almost always get to the desired webpage
| within the top few results.
| Ellipse0934 wrote:
| It honestly depends on what you are searching for.
|
| Case 1: You just want the name of the website, or an article,
| example "Facebook" -> fb.com, "Gordon Ramsay" -> Wiki/official
| website/Celeb gossip website you are good. Not much competition
| here.
|
| Case 2: You are looking for something technical like "GNU rnano
| CVE-abcde"/"OpenBSD ARM64 Qualcomm Wifi driver not working",
| you are again in fine territory: there is little if any money to
| be made here, so very little competition. There will be the
| official forums, websites, maybe some conference websites in
| this category.
|
| Case 3: "Chicken potpie recipe", "How to be more organised":
| This is the category where people are trying to game the SEO
| algos. How the hell do recipe websites with 27 popups and a 12000
| word essay on the secret family history end up on top? There
| are a huge number of passionately made simple recipe websites
| but they have to be "found" by us. For the second query, about
| being more organized, I think most people are
| looking for some sort of review article which looks at
| various schools of thought regarding discipline, cleanliness
| pointing to further resources and exploring the why and what to
| do for this. Here the search engine needs to determine the
| context of the query which is fairly abstract and then the
| internal heuristics it uses are supposed to drive it to a
| meaningful list of websites. Maybe the average joe would like
| to click cosmopolitan's article but I would never do that.
| Based on my previous click history maybe google should
| determine what kind of links I am looking for. But when they
| figure that out they'd much sooner use this behavioral insight
| for advertisers. A great search engine is basically a primitive
| personal librarian, I'd pay a yearly subscription for one.
|
| The internet is vast and it has stuff that I don't know about.
| How my 7 word abstract query is gonna get me there is the
| question mark. Also, for a lot of queries the top results can
| be plagued by spammy/fraud results which are on top because
| they managed to trick the SEO algos. These bad actors were not
| as prevalent for 2005 google.
| jeffbee wrote:
| Well no, it's you and me and the whole google search quality
| evaluation team and everyone who works on google search and
| like 99% of the general public as well. The meme of falling
| search quality infects only HN. Mostly what people are
| complaining about is that the quality of the web itself is in
| free-fall.
| wodenokoto wrote:
| The web has changed drastically. I'd imagine 2005-google engine
| today would be nothing but abandoned Wordpress blogs with comment
| spam.
| warning26 wrote:
| I suspect this is exactly it--a lot of what made 2005-era-
| Google good wasn't necessarily Google's own doing. It was that
| SEO people hadn't yet fully figured out how to game the system
| yet.
|
| If you took an exact copy of Google circa-2005 and had it crawl
| today's web, you'd probably get mostly "SEO optimized"
| irrelevant blogspam.
| ghaff wrote:
| And even more copy-pasted spam than already exists.
|
| The early Google (and other even earlier search engines) were
| invented for an Internet world which, if not pristine and pure,
| was at least mostly fairly legit content. Today's Internet is
| probably 90% deliberate spammers and scammers.
| simonebrunozzi wrote:
| These guys [0] have built something really close to 2005-Google,
| and possibly slightly better.
|
| The parent company, Tiscali, was a huge hit in the 1990s, as it
| provided internet access to millions of Italians. It went through
| some struggle for several years, but lately the original founder,
| Renato Soru, came back to run the company.
|
| The company is based in Cagliari, the capital of Sardinia, Italy.
|
| [0]: https://www.istella.it/en/
| jakub_g wrote:
| Cliqz wanted to build a new search engine but failed. It's just too
| difficult to operate at that scale and break the existing
| monopoly of big G.
|
| https://www.burda.com/en/news/cliqz-closes-areas-browser-and...
|
| https://news.ycombinator.com/item?id=23031520
|
| https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...
| hunterb123 wrote:
| And then Brave bought them and it succeeded.
|
| Cliqz is now Brave Search, I use it for all my devices, it's
| great.
|
| Works better than DDG and sometimes better than Google.
|
| I only hash bang every 100 searches or so, most of the time
| Google doesn't have it either. It's just to make sure.
|
| http://search.brave.com/
| arthur_sav wrote:
| What if we didn't try to replicate google? Smaller and niche
| search engines would probably work better in this new world of
| vast information.
| vangelis wrote:
| They have, sort of: https://search.marginalia.nu/
| pkamb wrote:
| I would use a search engine that only indexed Reddit, Stack
| Exchange, Wikipedia, and a small number of other sites.
|
| And that specifically blocked Pinterest, Quora, most non-personal
| "blogs", etc.
|
| People suggest DDG ! operators, but I don't want to use a site's
| (bad, single-site) search box. I want a multi-site SERP that only
| displays results from known good sites, which are customizable.
| monkeybutton wrote:
| Too bad they whitelist which bots can access their sitemaps!
| pkamb wrote:
| Even rules such as "if there is a Wikipedia result in the top
| 10, display it first".
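The two ideas above (a customizable list of known-good sites, plus a rule that moves a Wikipedia hit in the top 10 to the front) can be sketched as a post-processing step over an ordinary result list. This is a minimal sketch, not any search engine's actual behavior: the site lists, the `filter_and_pin` name, and the "last two hostname labels" shortcut are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical, user-editable site lists; the entries are illustrative only.
ALLOWED = {"reddit.com", "stackexchange.com", "wikipedia.org"}
BLOCKED = {"pinterest.com", "quora.com"}


def site_of(url):
    """Return the last two hostname labels, e.g. 'en.wikipedia.org' -> 'wikipedia.org'.

    This is a naive shortcut and is wrong for suffixes such as 'co.uk'.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def filter_and_pin(urls, allowed=ALLOWED, blocked=BLOCKED,
                   pin="wikipedia.org", window=10):
    """Keep allowed, non-blocked sites; move the first `pin` hit within
    the first `window` kept results to the front."""
    kept = [u for u in urls
            if site_of(u) in allowed and site_of(u) not in blocked]
    for i, u in enumerate(kept[:window]):
        if site_of(u) == pin:
            return [u] + kept[:i] + kept[i + 1:]
    return kept
```

Note that the window here is applied after filtering, so it covers the top 10 surviving results rather than the top 10 of the original list.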
| all2 wrote:
| If I could add sites I liked to the index that'd be great. Find
| a blogger/hacker I like? Add to the index. Can I share my index
| with others? Can I include their indices in my searches?
|
| Search engine as a social media platform? If I follow you, now
| I can search in your indices?
| vincent_s wrote:
| Some people try:
|
| https://www.mojeek.com/
|
| https://fireball.com/
|
| https://search.brave.com/
| 7373737373 wrote:
| Time for an https://github.com/sindresorhus/awesome search
| engines?
| ColinHayhurst wrote:
| Mojeek founder story here: https://blog.mojeek.com/2021/03/to-
| track-or-not-to-track.htm...
|
| No-tracking and independent from the start. Now at 4.6 billion
| pages with own infrastructure and IP. Went to market in 2020
| with contextual ads and API. Self-disclosure: CEO
| bullen wrote:
| Do you use some sort of PageRank?
| prox wrote:
| Never heard of Mojeek. I will try it for a month and see how
| it works. Currently using DDG 99% of the time.
| snovv_crash wrote:
| HN is wild: 30m after something is mentioned, the CEO chimes
| in.
| ColinHayhurst wrote:
| You might call this a search engine based on the principle of
| Information Neutrality.
|
| "Information Neutrality is the principle to treat all information
| provided (by a service) equally. The information provided, after
| being processed by an information-neutral service, is the same
| for every user requesting it, independent of the user's
| attributes, including, e.g., origin, history or personal
| preferences and independent of the financial or influential
| interest of the service provider, as well as independent of the
| timeliness of information."
|
| I wrote about this in relation to search [0]. We need to be
| allowed more freedom to choose search engines and services. One
| (default or selected) choice for search is unhealthy. We
| shouldn't have to choose between Google or Bing; DuckDuckGo or
| Startpage; Brave or Ecosia; Mojeek or Gigablast ..... Personally
| I use all 8 of these and more, as also explained [0].
|
| [0] https://blog.mojeek.com/2021/09/multiple-choice-in-
| search.ht...
| gorgoiler wrote:
| Random thought, based especially on using DuckDuckGo for two
| years:
|
| Search engine isn't singular, it's plural.
|
| (1) Search engine for something I know exists.
|
| (2) Search engine for finding something new.
|
| There's a market for both, but you don't have to solve both
| problems with the same product.
|
| Sometimes I switch to Google for the former, but the latter works
| well enough for me that I don't care what else Google would've
| shown me.
|
| More often than not, my feeling is Google would only have shown
| me more ads in addition to whatever I could already find
| elsewhere.
| llaolleh wrote:
| Everyone runs in the other direction anytime a search engine is
| mentioned. The thought of competing with Google turns people off.
|
| Even in 2021, despite how bad it's become, it's still miles ahead
| of other competitors.
| prox wrote:
| I disagree. A lot of people I know already switched to
| Duckduckgo. Google's ability to get relevant results is
| dropping like a brick, while the quality of DDG has been
| improving slowly but steadily.
| datenarsch wrote:
| I wish I could agree but from my experience, DDG's search
| results aren't really that great. Often even worse than
| Google's.
|
| And another private company is not the answer I believe. We
| need something more drastic, an open-source search engine
| organized as a genuine non-profit organization. Something
| like that. Otherwise, whatever replaces Google will just turn
| into another Google as soon as it gets any momentum.
| amelius wrote:
| Also, where are the books about writing a search engine?
|
| Knuth's "Searching and Sorting" volume desperately needs an
| update.
| mindcrime wrote:
| I don't even know if anybody has written a book specifically
| about search at "web scale" (no MongoDB jokes here, please).
| But about the closest things I know of would be something like:
|
| https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...
|
| https://www.amazon.com/Information-Retrieval-Implementing-Ev...
|
| https://www.amazon.com/Introduction-Information-Retrieval-Ch...
| indymike wrote:
| Brave's new search engine seems to work pretty well. Have been
| using it as my primary for about 10 days, and so far, I've only
| had to revert to Google once, and when I did the results were
| chock full of spam.
| concinds wrote:
| The nice thing about Brave Search is that they're trying to
| create an index completely independent from Bing/Google, and
| they seem to be trying to innovate on ways to get there as well
| with their Web Discovery Project[0], unlike DuckDuckGo. They've
| announced Brave Search will get ads soon, with a premium
| version without ads, which I think is acceptable given the
| costs of running an independent index sustainably.
|
| [0]: https://brave.com/privacy/browser/#web-discovery-project
| travisgriggs wrote:
| Can echo this. About 30 days going all devices. I'd say about
| once a day I do a !g, and rarely do I actually find something
| there, it usually just ends up being a confirmation search.
| lgrialn wrote:
| What I miss most of all from the Good Old Days was getting as
| many hits back as I could read.
|
| Rather than being told "No, there are only eight pages of results
| on anything in the goddamned world. Really. Would I lie to you?"
| greyman wrote:
| 1) Google is better at AI, for example let's take this sloppy
| search: "some joke where you can't tell if it is serious or joke"
|
| It is called Poe's law, and Google returned it at #4. Bing or
| Duckduckgo don't have a clue...
|
| 2) They have years of user data: for a specific term, they
| see what users clicked most, so they see which results were
| perceived as most relevant. It is hard to catch up if you don't
| have such data.
|
| 3) They developed anti-spamming tools during the years of
| fighting against SEO-spammers.
| wmil wrote:
| > Google is better at AI, for example let's take this sloppy
| search: "some joke where you can't tell if it is serious or
| joke"
|
| My problem there is that I don't expect or want my search
| engine to do that. The counter case is where I remember a quote
| from an article and want to find the article. Old Google would
| help me find matching text and I could quickly find the
| original article. Current Google will try to interpret the text
| and give me some nonsense based on that.
|
| AI has ruined other Google features... the "search by image"
| feature now analyzes the image, returns a generic tag like
| "woman", and shows me the wikipedia article on women as the
| first result.
|
| Old search by image had tineye like functionality and you could
| find the source of images.
| lolpython wrote:
| > 1) Google is better at AI, for example let's take this sloppy
| search: "some joke where you can't tell if it is serious or
| joke"
|
| > It is called Poe's law, and Google returned it at #4. Bing or
| Duckduckgo don't have a clue...
|
| Interesting, I was looking for a good benchmark like this. For
| me Google returned it at #5 with an image/related terms
| carousel before it which places it physically more around #7 on
| the page. Brave Search (never tried it before today) puts Poe's
| Law at #8. So Google is still better.
|
| But the other results are mostly worse (IMO) on Google. Here
| are the first 8 results:
|
| - 175 Bad Jokes That You Can't Help But Laugh At - Reader's
| (rd.com)
|
| - 57 Hilarious, Silly Jokes No One Is Too Old to Laugh At
| (bestlifeonline.com)
|
| - 145 Best Dad Jokes That Will Have the Whole Family Laughing
| (countryliving.com)
|
| - Sarcasm, Self-Deprecation, and Inside Jokes: A User's Guide
| (hbr.org)
|
| - Poe's law - Wikipedia (wikipedia.org)
|
| - Managing Conflict with Humor - HelpGuide.org (helpguide.org)
|
| - 175 Bad Jokes That Are So Cringeworthy, You Can't ... -
| Parade (parade.com)
|
| - Encouraging Your Child's Sense of Humor (for Parents) - Kids
| ... (kidshealth.org)
|
| And here are the first 8 results from Brave Search:
|
| - phrase requests - Is there a word for "pretending to joke
| when ... (english.stackexchange.com)
|
| - Joke - Wikipedia (wikipedia.org)
|
| - "Are you joking or serious?" - The Caffeinated Autistic
| (thecaffeinatedautistic.wordpress.com)
|
| - How do I tell when people are joking or being serious?
| (reddit.com/r/socialskills)
|
| - be a joke | meaning of be a joke in Longman Dictionary of
| (ldoceonline.com)
|
| - Quote by Ricky Gervais: "If you can't joke about the most
| (goodreads.com)
|
| - How can you tell if someone is joking with you or not?
| (quora.com)
|
| - Poe's law - Wikipedia (wikipedia.org)
|
| -----
|
| edit: I did not count to 8 correctly the first time. Fixed
| that.
| pkamb wrote:
| The Brave results though seem to contain "good sites" whereas
| the Google results are content mill blogspam. The exact
| placement of Poe's Law is somewhat less important.
| lolpython wrote:
| I agree. I switched to Brave Search after running this
| test.
| gbmatt wrote:
| Ha, yes, I've done that at https://gigablast.com/ . The biggest
| problems now are the following: 1) Too hard to spider the web.
| Gatekeeper companies like Cloudflare (owned in part by Google)
| and Cloudfront make it really difficult for upstart search
| engines to download web pages. 2) Hardware costs are too high.
| It's much more expensive now to build a large index (50B+ pages)
| to be competitive.
|
| I believe my algorithms are decent, but the biggest problem for
| Gigablast is now the index size. You do a search on Gigablast and
| say, well, why didn't it get this result that Google got. And
| that's because the index isn't big enough because I don't have
| the cash for the hardware. btw, I've been working on this engine
| for over 20 years and have coded probably 1-2M lines of code on
| it.
| easton wrote:
| You can be whitelisted so Cloudflare doesn't slow you down (or
| block you): https://support.cloudflare.com/hc/en-
| us/articles/36003538743...
| fsflover wrote:
| > 2) Hardware costs are too high.
|
| Which is why the next big search engine should be distributed:
| https://yacy.net.
| wruza wrote:
| No way to test it right away, demo peer 502-es.
| indymike wrote:
| I've used Gigablast off and on for a long time (I think I first
| discovered Gigablast in 2006 or so). Would be cool to have a
| registration service for legitimate spiders. I used to run a
| team that scraped jobs and delivered them (by fax, email, us
| mail as required by law) to local veteran's employment staffers
| for compliance. We were contracted by huge companies (at one
| point about 700 of the fortune 1000) to do so, and often our
| spiders would be blocked by the employer's IT department even
| though the HR team was paying us big bucks to do so.
| mrlinx wrote:
| Where did you read that google/alphabet owns part of
| Cloudflare?
| [deleted]
| bloudermilk wrote:
| Assuming OP is referring to Google Venture's participation in
| at least one of Cloudflare's rounds.
|
| https://www.crunchbase.com/funding_round/cloudflare-
| series-d...
| skyde wrote:
| what kind of index is Gigablast using? traditional inverted
| index like Lucene or something more esoteric?
|
| I know Google and Bing both use weird data-structure like
| BitFunnel
|
| https://www.microsoft.com/en-us/research/publication/bitfunn...
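For readers unfamiliar with the term, the "traditional inverted index" mentioned above maps each term to the set of documents containing it, so a query becomes a set intersection. The following is a toy sketch under that textbook definition only; it makes no claim about how Gigablast, Lucene, or BitFunnel are actually implemented, and the function names and whitespace tokenizer are assumptions.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase whitespace-separated token to the set of
    document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query token (boolean AND)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # .get avoids creating empty entries for unseen tokens.
    postings = [index.get(t, set()) for t in tokens]
    return set.intersection(*postings)
```

A real engine would add ranking, compressed postings lists, and on-disk storage; the sketch only shows the lookup structure itself.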
| collin128 wrote:
| Have you ever looked at the Amazon file?
|
| I'll see if I can track down the link but I remember somebody
| sharing a dump with me from Amazon that apparently was a recent
| scrape.
|
| Edit: https://registry.opendata.aws/commoncrawl/
| web007 wrote:
| That's Common Crawl, they do the spidering of some billions
| of webpages but that's still a tiny percentage of the web
| versus Google or Bing.
| visarga wrote:
| Common Crawl is being used to train the likes of GPT-3 and
| mine image-text pairs for CLIP. I wonder how much useful
| content is missing, we're going to use all the web text,
| images and video soon and then what do we do? We run out of
| natural content. No more scaling laws.
| cschmidt wrote:
| Do you have any stats on that? I've always wondered about
| the coverage of Common Crawl, if you include all the
| historical crawl files too.
| DavidCole1 wrote:
| Interesting. I had some interests in building a search engine
| myself (for playing around, of course). I had read a blog post by
| Michael Nielsen [1] which had sparked my interest. Do you have
| any written material about your architecture and stuff like
| that? Would love to read up.
|
| [1]: https://michaelnielsen.org/ddi/how-to-crawl-a-quarter-
| billio...
| gbmatt wrote:
| there's some stuff here : https://github.com/gigablast/open-
| source-search-engine
| DavidCole1 wrote:
| Thank you.
| entropie wrote:
| Holy, that's a huge codebase. Github even shows no
| code/syntax hl for many cpp files because they are so big.
|
| I fiddled around and searched for some not so well known
| sites in germany and the results were surprisingly good.
| But it looks really... aged.
| ramboldio wrote:
| maybe just add small webpages into your index, don't bother to
| execute JS and don't download any images.
|
| The content quality will be higher and it's a lot cheaper.
| woutr_be wrote:
| Out of curiosity, why would not executing JavaScript or not
| downloading images equal higher content quality?
| lloydatkinson wrote:
| With a slightly fresher coat of paint this could be very
| popular. For example, no grey background.
| justinzollars wrote:
| This is great! I found something other engines do not pick up!
| apparently I signed an agile manifesto in 2010
| https://agilemanifesto.org/display/000000190.html
| yumraj wrote:
| > Cloudflare (owned in part by Google)
|
| Please elaborate. Is there a special relationship between
| Cloudflare and Google?
| spullara wrote:
| Google Capital is an investor:
| https://www.forbes.com/sites/katevinton/2015/09/22/google-
| mi...
| yumraj wrote:
| That is not the same as being owned by Google.
| SamBam wrote:
| > You do a search on Gigablast and say, well, why didn't it get
| this result that Google got. And that's because the index isn't
| big enough
|
| I wonder how much this is true, and how much (despite all our
| rhetoric to the contrary) it's because we have actually come to
| expect Google's modern proprietary page ranking, which counts
| more than just inbound links but all sorts of other signals
| (freshness, relevance to our previous queries, etc.).
|
| We dislike the additional signals when it feels like Google is
| trying to second-guess our intentions, but we probably don't
| notice how well they work when they give us the result we
| expect in the first three links.
| JacobThreeThree wrote:
| I think people also have an inflated recollection of how good
| Google actually was back in 2005.
|
| Back then Google was only going up against indexes and link-
| rings, not 2021 Google/Bing/DDG/etc.
| pbhjpbhj wrote:
| 2005? There were loads of other search engines (SE), and
| many meta-SE: hotbot, dogpile, metacrawler, ... (IIRC),
| plenty more.
|
| There were also indexes, which Yahoo, AOL (remember them!)
| had but there was, what was it called, dmoz?, the open web
| directory. When Google started, being in the right web
| directory gave you a boost in SERPs as it was used as a
| domain trust indicator, and the categories were used for
| keywords. Of course it got gamed hard.
|
| Google was good, but I used it as an alt for maybe 6 months
| before it won over my main SE at the time. I've tried but
| can't remember what SE that was, Omni-something??
|
| One of the main things Google had was all the extra
| operators like link: inurl:, etc., but they had Boolean
| logic operators too at one point I think.
| pronik wrote:
| I've been using Altavista at that time, every now and then
| switching to Northern Light. Everything else was abysmal.
| Google blew them out of the water in terms of speed,
| quality, simplicity, unclutteredness and everything else. I
| can't remember ever retraining muscle memory so fast when
| switching to Google. So, no, Google has been great then and
| apart from people actively working against the algorithm is
| still good now, but obviously a completely different beast.
| romwell wrote:
| Well if the result didn't appear in the first 5-10 pages,
| it's probably not in the index.
|
| You can see it with other search engines. I challenge you to
| come up with a Google query for which a first-page result
| won't be seen within the first 10 pages of Bing results for
| the same query.
|
| (Bonus points if that result is relevant).
|
| There's only so much tweaking that personalization and other
| heuristic can do.
|
| But if something is missing from the index, that's it.
| jldugger wrote:
| I assume the author has the ability to search the index to
| see if your preferred Google result is even indexed.
| garaetjjte wrote:
| >Gigablast has teamed up with Imperial Family Companies
|
| Associating with that crank (responsible for recent freenode
| drama) is very off-putting.
| loo wrote:
| Oh no, you see he isn't responsible, it's everyone else! /s
| djbusby wrote:
| I don't get it, what's the fuss here?
| meepmorp wrote:
| The guy who took over Freenode styles himself as the
| crown prince of korea; IFC is his company.
| [deleted]
| [deleted]
| sockaddr wrote:
| Nice.
|
| I'd pay 5-10$/mo for a search engine that didn't just funnel me
| into the revenue-extracting regions of the web like Google
| does.
| RhysU wrote:
| A subscriber-supported search engine sounds cool to me. Any
| precedent?
| samcrawford wrote:
| Kagi.com does this. In closed beta at the moment, but you
| can email and request access.
| gianthockey495 wrote:
| You'll like https://neeva.com/
| xtracto wrote:
| Copernic ( https://copernic.com/ ) had Copernic Agent
| Professional, a for-pay desktop application that had really
| good search features, a while ago . Not sure if they
| discontinued it.
| gompertz wrote:
| Wow blast from the past. I think I was using Copernic all
| the way back in 2003... Forgot all about them. Thanks!
| mirker wrote:
| If you have customers, does that mean the incremental gain from
| an improved index costs too much to store? Or are you talking
| about computational costs?
| gbmatt wrote:
| it's both storage and computational. they go hand in hand.
| Minor49er wrote:
| I really love how the results organize multiple matching pages
| from the same domain. This is really cool.
| afrcnc wrote:
| how recent are your results? 1-2h? 1 day?
| gbmatt wrote:
| it's continually spidering. just not at a high rate.
| actually, back in the day i had real time updates while
| google was doing the 'google dance'. that caused quite a stir
| in the web dev community because people could see their pages
| in the index being updated in real time whereas google took
| up to 30 days to do it.
| 1vuio0pswjnm7 wrote:
| I really like GigaBlast.
|
| I wrote a "meta" search utility for myself that can query
| multiple search engines from the command line.^1 It mixes the
| results into a simplified SERP ("metaSERP"), optimised for a
| text-only browser, with indicators to show which search engine
| each result came from. The key feature is that it allows for
| what I might call "continuation searches". Each metaSERP
| contains timestamps in its page source to indicate when
| searches were executed, as well as preformatted HTTP paths. The
| next search can thus pick up where the previous one left off.
| Thus I can, if desired, build a maximum-sized metaSERP for each
| query.
|
| The reason I wrote this is because search engines (not
| GigaBlast) funded by ads are increasingly trying to keep users
| on page one, where the "top ads" are, and they want to keep the
| number of results small. That's one change from 2005 and
| earlier. With AltaVista I used to dig deep into SERPs and there
| was a feeling of comprehensiveness; leave no stone unturned.
| Google has gradually ruined the ability to perform this type of
| searching with their now secretive and obviously biased behind-
| the-scenes ranking procedures.
|
| Why is there no way to re-order results according to objective
| criteria, e.g., alphabetical; the user must accept the search
| engines' ordering, giving them the ability to "hide" results on
| pages the user will never view or simply not return them. That
| design is more favorable to advertising and less favorable to
| intellectual curiosity.
|
| Each metaSERP, OTOH, is a file and is saved in a search
| directory for future reference; I will often go back to
| previous queries. I can later add more results to a metaSERP if
| desired. I actually like that GigaBlast's results are different
| than other search engines. The variety of results I get from
| different sources arguably improves the quality of the
| metaSERP. And, of course, metaSERPs can be sorted according to
| objective criteria.
|
| This is, AFAIK, a different way of searching. The "meta-search
| engines" of yesteryear did not do "continuations", probably
| because it was not necessary. Nor was there an expectation that
| users would want to save meta-searches to local files. Users
| were not trying to minimise their usage of a website, they were
| not trying to "un-google".
|
| Today's world of web search is different, IMO. There seems to
| be a belief that the operator of a search engine can guess what
| a user is searching for, that a user who sends a query is only
| searching for one specific thing, and that the website has an
| ad to match with that query. At least, those are the only
| searches that really matter for advertising purposes.
| Serendipitous discovery while perusing results is not
| contemplated in the design. By serendipitous discovery I do not
| mean sending a random query, e.g., adding an "I'm feeling
| lucky" button, which to me always seemed like a bad joke.
|
| The only downside so far is I occasionally have to prune
| "one-off" searches that I do not want to save from the search
| directory. I am going to add an indicator at search time that a
| search is to be considered "ephemeral" and not meant to be
| saved. Periodically these ephemeral searches can then be pruned
| from the search directory automatically.
|
| 1. Of course this is not limited to web search engines. I also
| include various individual site search engines, e.g., Github.
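The merging step described in the comment above (results from several engines combined into one list, each tagged with the engines that returned it, then sorted by an objective criterion) can be sketched roughly as follows. This is not the commenter's utility, whose code is not shown in the thread; `merge_serps` and its input shape are hypothetical, and the timestamp-based "continuation" feature is deliberately left out.

```python
def merge_serps(results_by_engine):
    """Merge per-engine URL lists into one list of (url, engines) pairs.

    Duplicate URLs are collapsed, and each URL keeps the list of engines
    that returned it. The output is sorted alphabetically by URL, as one
    example of an ordering that does not depend on any engine's ranking.
    """
    sources = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            engines = sources.setdefault(url, [])
            if engine not in engines:
                engines.append(engine)
    return sorted(sources.items())
```

Sorting by URL is only one choice; any key that ignores the engines' own ranking (domain, page title, fetch date) would serve the same purpose.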
| xwdv wrote:
| How much cash do you need?
| JPKab wrote:
| Regarding the Gatekeeper companies like Cloudflare, it sounds
| like anti-competitive behavior that could potentially be
| targeted with anti-trust legislation, correct?
| adolph wrote:
| At a theoretical level it looks like Cloudflare won't block
| search engine crawlers. The docs are very Google and Bing
| oriented and also oriented towards supporting their
| customers, not random new search engine crawler.
|
| _Cloudflare allows search engine crawlers and bots. If you
| observe crawl issues or Cloudflare challenges presented to
| the search engine crawler or bot, contact Cloudflare support
| with the information you gather when troubleshooting the
| crawl errors via the methods outlined in this guide._
|
| https://support.cloudflare.com/hc/en-us/articles/200169806
| danielmarkbruce wrote:
| No, it is not.
|
| Cloudflare is giving its customers what they want. They
| don't want all kinds of bots claiming to be search engines
| crawling their sites. Cloudflare isn't hurting cloudflare
| competitors by doing this. Cloudflare isn't hurting their
| customers by doing this. To repeat - most websites don't want
| lots and lots of crawlers. They want the 2 or 3 which matter
| and no more, because at some point it's difficult to tell
| what the crawler is doing... (is it a search engine???). They
| aren't obliged to help search engines. Even if Cloudflare
| wasn't offering this, bigger customers would roll their own
| and do.. more or less the same thing.
| shashashasha___ wrote:
| i would assume it's mostly anti scraping protection which is
| mostly for privacy. you don't want to allow everyone to scrape
| your website, pull and use your info. for example from fb,
| ig, LinkedIn, github, .... you can build a really big
| profiling db on people that way. so websites need to know you
| are a legit search engine first
| karmanyaahm wrote:
| people can still be targeted if that information is public.
| anti scraping sounds like security by obscurity
| gbmatt wrote:
| it should be. there should be some sort of 'bots rights' to
| level the playing field. perhaps this is something the FTC
| can look into. but, as it is right now big tech continues to
| keep their iron grip on the web and i don't see that changing
| any time soon. big tech has all the money and controls access
| to all the data and supply chains to prevent anyone else from
| being a competitive threat.
|
| look at linkedin (owned by microsoft, unspiderable by all but
| google/bing). github (now microsoft using this to fuel its AI
| coding buddy, but if you try to spider this at capacity your
| IP is banned) facebook (unspiderable) .. the list goes on and
| on ..
|
| and as you can see, data is required to train advanced AI
| systems, too. So big tech has the advantage there as well.
| especially when they can swoop in and corrupt once non-profit
| companies like openai, and make them [partially] for-profit.
|
| and to rant on (yes, this is what i do :)) it's very difficult
| to buy a computer now. have you tried to buy a raspberry pi
| or even a jetson nano lately? Who is getting preferred access
| to the chip factories? Does anyone know? Is big tech getting
| dibs on all the microchips now too?
| technobabbler wrote:
| Cloudflare functions kinda like a private security company.
| They don't go around blocking sites willy-nilly, site owners
| have to specifically choose to use their service (and maybe
| pay for it), configuring the bot blocking rules themselves.
|
| That's not really Cloudflare's fault. Someone has to do it,
| whether it's them or a competitor or sys admins manually
| making firewall rules. Cloudflare just happens to be good
| enough and darned affordable, so many choose to use them.
|
| Hosting costs for small site owners would be much more
| expensive without Cloudflare shielding and caching.
| foxfluff wrote:
| No, I think it is partially Cloudflare's fault because they
| offer this service and make it easy to deploy. This shit
| has exploded with Cloudflare's popularity.
|
| Nobody _has_ to do it, but a lot of people will do it when
| they notice there's an easy way to do it. Cloudflare is
| very much an enabler of bad behavior here. Now a lot of
| sites just toggle that on without even thinking about
| collateral.
| gbmatt wrote:
| I've dealt extensively with Cloudflare. They have a
| complex whitelisting system that is difficult to get on,
| and they also have an 'AI' system that determines if you
| should be kicked off that whitelist for whatever reason.
|
| Furthermore, they give Google preferred treatment in their
| UIs and backend algos because it is the incumbent and
| nobody cares about other smaller search engines. So there's
| a lot of detail to how they work in this domain.
|
| It's 100% Cloudflare's fault, and it's up to them to give
| everyone a fair shot. They just don't care. Also, you are
| overlooking the fact that Google is a major investor (and
| so are Bing and Baidu). So really this exacerbates the
| issue. Should Google be allowed (either directly or
| indirectly) to block competing crawlers from downloading web
| pages?
| technobabbler wrote:
| These are all great points.
| danielmarkbruce wrote:
| It isn't up to them to give everyone a fair shot. That
| isn't what their customers actually want. Cloudflare
| aren't in the "fair shots for all search engines"
| business. They are in the "stop requests you don't want
| hitting your servers" business.
| AtlasBarfed wrote:
| "targeted with anti-trust legislation"
|
| Um, this is America. Every market is basically a trust,
| cartel, or monopoly.
|
| And I don't know if you can hear that, but there is literally
| laughter in the halls of power. All the show hearings by
| Congress on social media and tech companies only have to do
| with two things:
|
| 1) one political party thinking the other is getting an
| advantage by them
|
| 2) shaking them down for more lobbying and campaign donations
|
| No one in the halls of power gives two shits about
| competition. Larger companies mean larger campaign donations,
| and more powerful people to hobnob with if/when you leave or
| lose your political office.
|
| Of course I think that breaking up the cartels in every major
| sector would lead to massive improvements: more companies is
| more employment, more domestic employment, more people trying
| to innovate in management and product development, more
| product choice, lower prices, more competition, more
| redundancy/tolerance to supply chain disruption, less
| corruption in government and possibly better regulation.
|
| Every large company brazenly commits market abuse right up to
| the point of one and only one limiter: the "bad PR" line. So I
| guess we have that.
| Archelaos wrote:
| I tried out four search words with your search engine, and I am
| not convinced that it is mainly the index size and not the
| algorithm that is to blame for bad search results. There are
| way too many high-ranking false positives. Here is what I
| tried:
|
| a) "Berlin":
|    1. The movie festival "Berlinale"
|    2. The Wikipedia entry about Berlin
|    3. Something about a venue "Little Berlin", but the link
|       resolves to an online gaming site from Singapore
|    4. "Visit Berlin", the official tourism site of Berlin
|    5. The hash tag "#Berlin" on Twitter
|    6. "1011 Now", a local news site for Lincoln, Nebraska
|    7. "Freie Universität Berlin"
|    8. Some random "Berlin" videos on Youtube
|    9. The Berlin Declaration of the Open Access Initiative
|    10. Some random "Berlin" entries on IMDb
|    11. A "Berlin" Nightclub from Chicago
|    12. Some random "Berlin" books on Amazon
|    13. The town of Berlin, Maryland
|    14. Some random "Berlin" entries on Facebook
|    15. The BMW Berlin Marathon
|
| b) "philosophy"
|    1. The Wikipedia entry about philosophy
|    2. "Skin Care, Fragrances, and Bath & Body Gifts" from
|       philosophy.com
|    3. "Unconditional Love Shampoo, Bath & Shower Gel" from
|       philosophy.com
|    4. Definition of Philosophy at Dictionary.com
|    5. The Stanford Encyclopedia of Philosophy
|    6. PhilPapers, an index and bibliography of philosophy
|    7. The University of Science and Philosophy, a rather
|       insignificant institution that happens to use the domain
|       philosophy.org
|    8. "What Can I Do With This Major?" section about philosophy
|    9. Pages on "philosophy" from "Psychology Today". I looked at
|       the first and found it to be too short and eclectic to be
|       useful.
|    10. The Department of philosophy of Tufts University
|
| c) "history"
|    1. Some random pages from history.com
|    2. "Watch Full Episodes of Your Favorite Shows" from
|       history.com
|    3. Some random pages from history.org
|    4. "Battle of Bunker Hill begins" from history.com
|    5. Some random "History" pages from bbc.co.uk
|    6. Some random pages from historyplace.com
|    7. The hash tag "#history" on Twitter
|    8. The Missouri Historical Society (mohistory.com)
|    9. Some random pages from History Channel
|    10. Some random pages from the U.S. Census Bureau
|        (www.census.gov/history/)
|
| d) "Caesar"
|    1. The Wikipedia entry about Caesar
|    2. Little Caesars Pizza
|    3. "CAESAR", a source for body measurement data. But the link
|       is dead and resolves to SAE International, a professional
|       association for engineering
|    4. The Caesar Stiftung, a neuroethology institute
|    5. Some random "Caesar" books on Amazon
|    6. Hotels and Casinos of a Caesars group
|    7. A very short bio of Julius Caesar on livius.org
|    8. Texts on and from Caesar provided by a University of
|       Chicago scholar
|    9. (Extremely short) articles related to Caesar from
|       britannica.com
|    10. "Syria: Stories Behind Photos of Killed Detainees | Human
|        Rights Watch". The photos were by an organization called
|        the Caesar Files Group
|
| So what I can see are some high ranked false positives that are
| somehow using the search term, but not in its basic meaning
| (a3, a11, b2, b3, d2, d3, d4, d6) or not even that (a6). Some
| results are ranking prominently although they are of minor
| importance for the (general) search term (a9, a13, b7, b8 --
| perhaps a15 and d10). Then there are the links to the usual
| suspects such as Wikipedia, Twitter, Amazon, etc. (a2, a5, a8,
| a10, a12, a14, b7, c5, d1, d5); I understand that Wikipedia
| articles are featuring prominently, but for the others I would
| rather go directly to e.g. Amazon when I am interested in
| finding a book (or use a search term like "Caesar amazon" or
| "Caesar books"). Well, and then there are the search results
| that are not completely off, but either contain almost no
| information, at least compared to the corresponding Wikipedia
| article and its summary (b4, b9, d7, d9), or that are too
| specific for the general search term (c1, c2, c3, c4, c6, c9,
| c10).
|
| That leaves me with the following more or less high quality
| results (outside of the Wikipedia pages): a1, a4, a7, b5, b6,
| b10, and d8. The a15 and d10 results I could tolerate if there
| had been more high quality results in front of them; but as a
| fourth and second, respectively, good result they seem to me to
| be too prominent. Also in the case of "Berlin" a4 should have
| been more prominent than a1, and a7 is somewhat arbitrary,
| because Humboldt University and the Technical University of
| Berlin are likewise important; what is completely missing is
| the official Website of the city of Berlin (English version at
| www.berlin.de/en/).
|
| All in all, I would say that your ranking algorithm lacks
| semantic context. It seems the prominence of an entry is mainly
| determined by either just being from the big players like
| Twitter, Youtube, Amazon, Facebook, etc. or by the search term
| appearing in the domain name or the path of the resource,
| regardless of the quality of the content.
| kilburn wrote:
| I don't know about others, but when I think of the "good old
| google days" I'm _not_ expecting the results for your example
| queries to be any good.
|
| In those days querying took some effort but the effort paid
| off. The results for "history" just couldn't matter less in
| this mindset. You search for "USA history" or "house commons
| history" or "lake whatever history" instead. If the results
| come up with unexpected things mixed in, you refine the
| query.
|
| It was almost like a dialog. As a user, you brought in some
| context. The engine showed you its results, with a healthy
| mix of examples of everything it thought was in scope. Then
| you narrowed the scope by adding keywords (or forcing
| keywords out). Rinse and repeat. As a user, you were in
| command and the results reflected that.
|
| The idea that the engine should "understand what you mean" is
| what took us to the current state. Now it feels like queries
| don't matter anymore. Google thinks it knows the semantics
| better than you, and steering it off its chosen path is
| sometimes obnoxiously hard.
| yellowstuff wrote:
| I get what you mean, but part of the whole initial appeal
| of Google was that it gave much more relevant results
| initially than Altavista or the other options. That was why
| Google put in the audacious "I'm feeling lucky" button.
| DarkmSparks wrote:
| This was the result of two things: MapReduce, and using links
| to rank the pages.
|
| Using links to rank the pages is not really possible any
| longer because of seo spam links.
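|
| As a rough illustration of what "using links to rank the pages"
| means, here is a minimal PageRank-style power iteration. The toy
| graph, the damping factor of 0.85, and the function name are
| assumptions made for this sketch; it is not Google's actual
| implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by incoming links using simple power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three real pages plus a link farm boosting "spam".
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "farm3": ["spam"],
    "spam": [],
}
ranks = pagerank(toy_web)
```

| In this toy graph the three farm pages lift "spam" well above
| what a page with no inbound links would get, which is the
| weakness the comment above points at.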
| 1_player wrote:
| Yeah but it's from that same philosophy that Google
| Search is useless as it optimises for the first result.
|
| There is no search engine that searches literally for
| what you asked and nothing else. Search is shit in 2021
| because it tries to be too clever. I'm more clever than
| it, let me do the refining.
| 1024core wrote:
| > The idea that the engine should "understand what you
| mean" is what took us to the current state. Now it feels
| like queries don't matter anymore. Google thinks it knows
| the semantics better than you, and steering it off its
| chosen path is sometimes obnoxiously hard.
|
| Bingo! If you cede control to Google, it _will_ do what
| it's optimized to do, and not what _you_ are looking for.
| dgivney wrote:
| I think you have some great feedback here but for me it also
| highlights how subjective search results can be for
| individuals - for example, these false positives that you
| mention (b2, b3) appear as the top result on Google for me
| for that query.
|
| It makes me think there must be some fairly large segment of
| the population that want that domain returned as a result for
| their query, no?
| 1024core wrote:
| OK, I'll bite. How would _you_ rank the results for each of
| those queries?
| melony wrote:
| What we need is a net neutrality doctrine on the server side.
| Bandwidth is hardly scarce outside of AWS's business model. Ban
| the crawler user-agent dominance by the big search engine
| players. "Good behaviour" should be enforced via rate limiting
| that equally applies to all crawlers, without exemption for
| certain big players.
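|
| A minimal sketch of rate limiting that treats every crawler the
| same, using a token bucket per crawler. The class names, the
| limits, and the idea of keying on a crawler identifier are
| assumptions for illustration; reliably identifying a crawler is
| a separate problem that this sketch does not address.

```python
class TokenBucket:
    """Token bucket: `rate` tokens per second, up to `capacity` stored."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class UniformCrawlerLimiter:
    """Applies the same limit to every crawler, with no exemptions."""

    def __init__(self, rate=1.0, capacity=5):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, crawler_id, now):
        bucket = self.buckets.get(crawler_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[crawler_id] = bucket
        return bucket.allow(now)


limiter = UniformCrawlerLimiter(rate=1.0, capacity=5)
# Both (hypothetical) crawlers send 8 requests at the same instant, t = 0.
big = [limiter.allow("BigSearchBot", 0.0) for _ in range(8)]
small = [limiter.allow("TinyIndexBot", 0.0) for _ in range(8)]
```

| Both crawlers get the same burst allowance and the same refill
| rate; the limiter has no list of exempt names.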
| ColinHayhurst wrote:
| https://knuckleheads.club/
| hdjjhhvvhga wrote:
| Regarding the gatekeeper problem: it's a wild guess but maybe
| if there was a way to involve users by organizing distributed
| scraping just for the sake of building a decent index, I'm sure
| many of them would help.
| gbmatt wrote:
| yes, large proxy networks are potential solutions. but they
| cost money, and you are playing a cat and mouse game with
| turing tests, and some sites require a login. furthermore,
| people have tried to use these to spider linkedin (sometimes
| creating fake accounts to login) only to be sued by microsoft
| who swings the CFAA at them. so you start off with an
| intellectual desire to make a nice search engine and end up
| getting sidetracked into this pit of muck and having
| microsoft try to put you in jail. and, no, i'm not the one
| microsoft was suing.
| 1cvmask wrote:
| Why do you have a user account with a login?
| subsubzero wrote:
| curious how you implemented the index, memory based or disk
| based? Either way you are right, HW costs are extremely
| expensive and you would need a lot of high RAM/high core count
| machines to return such a large index to the endusers in a low
| latency fashion.
| 1_player wrote:
| If you're serious about this, add a paid tier. As long as it's
| free, I don't trust that you will never sell my data to make bank.
| jermaustin1 wrote:
| You are going to pay for a generalized web search when
| DDG/Google/Bing/etc are free?
| Closi wrote:
| I would - the problem with those services is that they
| prioritise the results that generate the search engine the
| most money rather than give me the best results, and then
| index my searches to track and advertise to me throughout
| the web.
|
| A clear pricing transaction sounds much nicer to me. Should
| generate better results too.
| 1_player wrote:
| Yes. I use Brave Search and I hope they add a paid tier,
| which I think they have confirmed they'll add at a later
| date.
|
| If you don't pay, you are the product. Simple as that.
| pythux wrote:
| https://twitter.com/brave/status/1466510541128548362?s=20
| duckmysick wrote:
| > If you don't pay, you are the product.
|
| If not enough people pay, there's no product.
| 1_player wrote:
| If nobody pays, there's even less of it. Not sure what's
| your point.
| Nasrudith wrote:
| Why do people think a paid tier will prevent a company from
| pocketing the fee and selling their data anyway? Aside from
| that, if the company goes bankrupt, the data isn't theirs to
| withhold anymore.
| InfiniteRand wrote:
| Not sure if you're looking for feedback, but the News search
| could use some work, I searched for "Ethiopia" and almost all
| of the articles were unrelated to Ethiopia except for the
| existence of some link somewhere on the page.
|
| Your general web search seems pretty good, although I've just
| given it a casual glance. I think your News search could be
| improved by just filtering the general search results for News-
| related content, since the "Ethiopia" content I get there is
| certainly Ethiopia-related.
|
| In any case, an interesting product, I'll try to keep an eye on
| it.
| smt88 wrote:
| What if you allowed trusted contributors to "donate" their
| browsing to your index?
|
| AltaVista and Yahoo did that with browser plugins in the 90s.
| bullen wrote:
| Do you have some sort of PageRank?
| mrkramer wrote:
| I'm sorry to say but your project is 20 years old and it has had
| no impact at all. You are doing something wrong. Innovation and
| initiative are needed, a la Bitcoin and DeFi, not hobby projects
| which are not picking up in popularity and utility.
| ErrrNoMate wrote:
| Bitcoin and DeFi don't have utility outside of gambling and
| pump and dumps. Not everything (tbh not really anything)
| needs crypto.
| jquery wrote:
| Crypto's biggest achievement is being the financial
| equivalent of the gulf war oil fires. Just massive
| pollution. Think of all the good things that computing
| could be used for... we used to have all kinds of
| interesting collaboration projects. Instead we are setting
| those CPU cycles on fire for short term profit.
| Sohcahtoa82 wrote:
| Imagine if all that processing power was used for
| Folding@Home.
|
| The problem is that cryptocurrencies do not inherently
| need tons of processing power to operate. You could
| theoretically run the entire Bitcoin network on a
| Raspberry Pi. But the PoW algorithm was designed to
| always produce a block every 10 minutes, no matter how
| much hashing power was dedicated to the network. Everyone
| wanted a piece of the block reward pie, so the arms race
| was created.
|
| Proof-of-stake algorithms would eliminate this problem
| entirely, but PoS is a shitty "rich get richer" method.
| Granted, with how expensive mining power is, even PoW
| results in the rich getting richer, but at least PoS
| doesn't result in the wasting of gigawatts of
| electricity.
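|
| The fixed ten-minute interval described above comes from periodic
| difficulty retargeting. A simplified sketch follows, assuming a
| toy model in which the average block time is difficulty divided
| by hash rate. The 2016-block window and the 4x clamp follow
| Bitcoin's rule, but the units and function names are
| illustrative only.

```python
TARGET_BLOCK_SECONDS = 600   # one block every 10 minutes
BLOCKS_PER_RETARGET = 2016   # length of Bitcoin's retarget window
EXPECTED_SPAN = TARGET_BLOCK_SECONDS * BLOCKS_PER_RETARGET


def retarget(difficulty, actual_span_seconds):
    # Clamp so a single adjustment changes difficulty by at most 4x.
    span = max(EXPECTED_SPAN / 4, min(EXPECTED_SPAN * 4, actual_span_seconds))
    # Blocks came too fast -> span is short -> difficulty goes up.
    return difficulty * EXPECTED_SPAN / span


def simulate(hash_rate, difficulty, periods):
    """Toy model: average block time = difficulty / hash_rate seconds."""
    block_times = []
    for _ in range(periods):
        block_time = difficulty / hash_rate
        block_times.append(block_time)
        difficulty = retarget(difficulty, block_time * BLOCKS_PER_RETARGET)
    return block_times


# Hash power 100x larger than the starting difficulty was tuned for.
times = simulate(hash_rate=100.0, difficulty=600.0, periods=6)
```

| However much hash power joins, the block time in this model
| drifts back to ten minutes after a few retarget periods, so the
| extra power buys a larger share of rewards, not more blocks.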
| mrkramer wrote:
| Read the Bitcoin whitepaper. Bitcoin was meant to decentralize
| trust and to eradicate fraud through a transparent
| decentralized database called the blockchain. It is certainly
| more impactful than hobby search engines, considering that
| Bitcoin was also a hobby project, but a really
| revolutionary one.
|
| Go search what Larry Page said 20 years ago: If innovation
| is commercially successful it can have more widespread
| impact.
| _jal wrote:
| The bitcoin brainworms do bad things to people.
|
| I suggest you update your patter some, though. A good
| coin scam needs to sound a lot less dated.
| spiderice wrote:
| So your response to the author saying "I'm trying to be
| commercially successful, but it's really hard for these
| reasons" is "You should try being commercially
| successful"?
|
| Ok...
| mrkramer wrote:
| I respect his effort but the project is 20 years old and
| yet not commercially successful? There must be a reason
| behind it. The project is not good enough. Like I said,
| only innovation can displace Google. Innovation is not
| something new and different; innovation is something
| better.
| ma2rten wrote:
| _It's much more expensive now to build a large index (50B+
| pages)_
|
| Do you have a cost estimate? Also, could you be more selective
| in indexing, e.g. by having users request sites to be crawled?
| ampersandy wrote:
| Requiring users to know what sites they want in advance
| somewhat defeats the purpose of a search engine, no?
| robbomacrae wrote:
| Not at all. You only have to fail the first request. It is
| an approach I took with my own attempt at a search engine
| way back! In fact I know personally that there is at least
| one patent out there that suggests asking first-time
| requesters to provide the appropriate response as an
| efficient way to teach systems for future users.
|
| Obviously failing first requests isn't ideal but for
| popular requests it quickly becomes insignificant.
| Wikipedia might (if they don't already) want to make a
| similar suggestion for users to contribute when finding a
| low content/missing page.
| convolvatron wrote:
| since sites are so desperate to be indexed, doesn't it seem
| better to put the onus on them to announce themselves? it
| would be great if dns registries published public keys ..
| maybe they do in newer schemes?
| thoughtstheseus wrote:
| Perhaps trolling the entire web is not useful today? I'd love a
| search engine where I can whitelist sites or take an existing
| whitelist from trusted curators.
| hawthornio wrote:
| I'm really interested in this as well. I use DDG and whenever
| I'm doing research I tend to add ".edu" because there are so
| many spam sites.
| erhk wrote:
| Trusted curators is a dangerous dependency
| thoughtstheseus wrote:
| It is. The alternative is scooping everything and using
| algos to curate. That seems worse imo.
| sdfjkl wrote:
| Perhaps vote on results like on Reddit posts? Gets the
| junk sites down (and out of the index eventually).
| marginalia_nu wrote:
| Given Reddit is notorious for its problems with
| astroturfing and vote bots, I don't think this is a
| particularly promising approach.
| Retric wrote:
| Any open voting system is going to be under serious SEO
| pressure.
|
| That's the real issue, Google has indirectly infected the
| web with junk sites optimized for it. Any new search
| engine now has a huge hurdle to sort through all the junk
| and if it succeeds the SEO industry is just going to
| target them.
|
| A more robust approach is simply to pay people to evaluate
| websites. Assuming it costs, say, $2 per domain to either
| whitelist or block, that's ~$300 million for the current
| web, and you need to repeat that effort over time. Of
| course it's a clear cost vs accuracy tradeoff. Delist
| sites that have copies and suddenly people will try to
| poison the well to delist competitors etc etc.
| Nasrudith wrote:
| Adding a gatekeeper collecting rent isn't a solution -
| the people using SEO are already spending money to get
| their name up high on the list.
| arein3 wrote:
| Reddit is a community heavily gatekept by the mods with
| regard to specific topics
| 1024core wrote:
| Reddit is an extreme example of group think. Try posting
| something pro-Trump (I mean, surely even that guy has a
| positive thing or two to be said about him) and you'll
| get banned in some subs. Or you may get banned simply
| because the mod doesn't like the fact that you don't toe
| the party line.
| notriddle wrote:
| Also, vote bots
| pessimizer wrote:
| That just means that you have to curate the people
| allowed to vote. Otherwise, it would be rule by the
| obsessed and the search engine optimizers, and the junk
| sites will dominate the index.
|
| I'm not convinced that Google's recursive AI algos aren't
| a functional equivalent. They let you vote by tracking
| your clicks.
| dcow wrote:
| That's why you don't make it a hard dependency and let
| people curate their own list of taste makers. They can
| share and exchange info about who good taste makers are, and
| good ones might even charge for access to exclusive flavors.
| dragonwriter wrote:
| Plus, it scales less well than pure algorithmic search.
| This fight already happened, with a much smaller internet.
| shituonui wrote:
| It works really, really well for libraries. Research
| libraries (and research librarians) are phenomenally
| valuable. I've missed them any time I'm not at a
| university.
|
| Both curators and algorithms are valuable. This goes for
| finding books, for finding facts and figures, for finding
| clothes, for finding dishwashers, and for pretty much
| everything else.
|
| I love the fact that I have search engines and online
| shopping, but that shouldn't displace libraries and
| brick-and-mortar. Curation and the ability to talk to a
| person are complementary to the algorithmic approach.
| dragonwriter wrote:
| > It works really, really well for libraries
|
| It scales extremely poorly. It works very well for
| situations where customers/sponsors are willing
| to spend lots of money for quality, because then the cost
| scaling doesn't matter as much; research libraries,
| LexisNexis, Westlaw, etc. all do this, but it's not
| cheap, and the cost scaling with the size of the corpus
| _sucks_ compared to algorithmic search.
|
| It is among the approaches to internet search that lost
| to more purely algorithmic search, because it scales
| poorly in cost.
| thoughtstheseus wrote:
| +book stores. Curators can use algorithms to help them
| curate... Google's SE is taking signals from poor
| curators imo.
| Zamicol wrote:
| How about just a meritocratic rating? Even here on HN I
| would appreciate some sort of weight on expert/experienced
| opinion. Although in theory I like the idea that every
| thought is judged on its own, the context of the author is
| more relevant the deeper the subject. That's one of the
| reasons I still read https://lobste.rs. It has a niche
| audience with industry experience.
| marginalia_nu wrote:
| > meritocratic rating
|
| That is literally PageRank.
| klankoo wrote:
| Trusted consumers are better. The original page-rank algo
| was organic and bottom-up. But now it's the person not the
| page. Businesses compete for interaction not inbound links.
| So if you can make a modern page-rank that follows
| interaction instead of links and isn't a walled garden then
| I'd invest.
| politician wrote:
| I could make that work, but what do you mean by "walled
| garden" in this context?
| technobabbler wrote:
| That's a great idea.
| GordonS wrote:
| Heh, I guess you mean "trawling" - trolling the entire web is
| something very different :)
| giardini wrote:
| "Trolling" is fine, see e.g.
| https://grammarist.com/usage/trawl-troll/#:~:text=Troll%20fo....
| romwell wrote:
| Well, no, it's not fine.
|
| See e.g. _the source you linked_, which explains the
| difference.
| GrinningFool wrote:
| Not in this context - "trolling" as described there would
| apply to targeted indexing of a specific site; while
| "trawling" would refer to a wide net that attempts to
| catch all the sites.
| hdjjhhvvhga wrote:
| Then again, if you look at today's search results, where
| everything above the fold belongs to Google, maybe we have
| been trolled indeed.
| rodiger wrote:
| Depending on the intended metaphor, trolling could work too
| :) https://en.wikipedia.org/wiki/Trolling_(fishing)
| xwdv wrote:
| What would trolling the entire web look like?
| lolinder wrote:
| The consistent theme every time this comes up is that dealing
| with the sheer weight of the internet is almost impossible today.
| SEO spam is hard to fight and the index gets too heavy. However,
| I wonder if this is a sign that we're looking at the problem
| wrong.
|
| What if instead of even _trying_ to index the entire web, we
| moved one step back towards the curated directories of the early
| web? Give users a search engine and indexer that they control and
| host. Allow them to "follow" domains (or any partial URLs, like
| subreddits) that they trust.
|
| Make it so that you can configure how many hops it is allowed to
| take from those trusted sources, similar to LinkedIn's levels of
| connections. If I'm hosting on my laptop, I might set it at 1
| step removed, but if I've got an S3 bucket for my index I might
| go as far as 3 or 4 steps removed.
|
| There are further optimizations that you could do, such as having
| your instance _not_ index Wikipedia or Stack Overflow or whatever
| (instead using the built-in search and aggregating results).
|
| I'm sure there are technical challenges I'm not thinking of, and
| this would absolutely be a tool that would best serve power users
| and programmers rather than average internet users. Such an
| engine wouldn't ever replace Google, but I'd think it would go a
| long way to making a better search engine for a single user's (or
| a certain subset of users') everyday web experience.
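|
| The configurable hop limit described above can be sketched as a
| breadth-first walk outward from the followed sources. The URLs,
| the in-memory FAKE_WEB graph, and the get_links helper are
| hypothetical stand-ins for real fetching and link extraction.

```python
from collections import deque


def crawl_within_hops(seeds, get_links, max_hops):
    """Breadth-first walk returning {url: hops from nearest trusted seed}."""
    hops = {url: 0 for url in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # do not follow links beyond the configured distance
        for link in get_links(url):
            if link not in hops:
                hops[link] = hops[url] + 1
                queue.append(link)
    return hops


# Hypothetical link graph standing in for real fetched pages.
FAKE_WEB = {
    "https://trusted.example/": ["https://blog.example/post"],
    "https://blog.example/post": ["https://far.example/page"],
    "https://far.example/page": ["https://farther.example/"],
}


def get_links(url):
    # A real indexer would fetch the page and parse its links here.
    return FAKE_WEB.get(url, [])


# Small budget (e.g. a laptop): stay within 1 hop of trusted sources.
laptop_index = crawl_within_hops(["https://trusted.example/"], get_links, 1)
# Larger budget (e.g. remote storage): go out to 3 hops.
bucket_index = crawl_within_hops(["https://trusted.example/"], get_links, 3)
```

| Because the walk is breadth-first, each page is recorded at its
| shortest distance from any followed source, so raising the limit
| only ever adds pages to the index.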
| lawwantsin17 wrote:
| I'm sure the algorithms are making echo chambers worse.
| Curating news opinion sites based on a prediction score of how
| often Chicken Little was right about the sky falling after the
| fact would surface reliable journalists and actual psychics!
| supernovae wrote:
| It's flawed from the get go if reddit is the basis.
| loonster wrote:
| As much as I like to hate on reddit (I'm a permanently
| suspended user), not every sub there is trash. There are some
| great subs there on very specific niche topics.
| djwayne35 wrote:
| I agree, I think we are looking at the problem wrong. And this
| is a very insightful comparison with the linkedin levels of
| connections idea. I am working on something with this. One
| thing to point out is that when we think of searching through
| information, we are searching through an information structure
| aka graph of knowledge. Whatever idea or search term we are
| thinking of is connected to a bunch of other ideas. All those
| connected ideas represent the search space or the knowledge
| graph we are trying to parse. So one way in the past people
| have tried to approach this is they try to make a predefined
| knowledge graph or an ontology around a domain. They try to set
| up the structure of how the information should be and then they
| fill in the data. The goal is to dynamically create an
| ontology. Idk if anyone has really figured this out. But
| Palantir with Foundry does something related. They sorta
| dynamically create an ontology on top of a company's data. This
| lets people find relationships between data and more easily
| search through their data. Check this out to learn more
| https://sudonull.com/post/89367-Dynamic-ontology-As-Palantir...
| Nasrudith wrote:
| The retro idea of curation seems popular here but everybody
| forgets why it lost out in the first place. It just doesn't
| scale in the first place. Not to mention demands - people
| usually want tools which lower mental effort and are intuitive
| as opposed to ones which are precise but in an obtuse metric.
| Most would not find a hardware mouse that consisted of two
| keypads for X and Y coordinates and a left click and right
| click button very useful.
|
| Similarly, everyone maintaining their own index is
| cumbersome overkill in redundancy, processing power, and human
| effort in return for a stunted network graph which is worse for
| all metrics people usually actually care about. In terms of
| catching on even "antipattern search engines" that attempt to
| create an ideological echo chamber would probably catch on
| better.
|
| Short of search engine experiments/start up attempts the only
| other useful application I can see is "rude web-spidering"
| which deliberately disrespects all requests to not index pages
| left publicly accessible as search engines generally try not to
| be good tools for cracker wardriving for PR and liability
| reasons. It would be a good whitehat or greyhat tool as doors
| secured by politeness only are not secure.
| mclightning wrote:
| That's basically what I'm doing with my search
| "site:reddit.com". I wonder if anyone at Google is aware of this
| trend and taking notes.
| a-r-t wrote:
| Reddit is missing a huge opportunity by not improving their
| crappy search functionality.
| copperx wrote:
| I estimate that about half of my searches have either
| site:reddit.com or site:news.ycombinator.com at the end. In
| fact, I have an autocomplete snippet on my Mac so I don't
| have to type all that.
| marksbrown wrote:
| FYI this is exactly what the hashbangs in DDG do!
| skyde wrote:
| what you are suggesting would make the problem of echo chamber
| (bubble) worse than it is today!
| Nasrudith wrote:
| Awkwardly, complaints about echo chambers as a problem tend not
| to refer to feedback dynamics (crudely but unambiguously
| referred to as a circle jerk) so much as to "People disagree
| with me, the nerve of them!". It is not viable to have parties
| A through Z sharing the same world and all having absolute
| control over all others. We see this same complaint every
| time moderation comes up, let alone the fundamentals of
| democracy.
| loonster wrote:
| Bubbles are great if you are on the outside looking in at how
| a specific group thinks. Bubbles are horrible if you are on
| the inside trying to explore your thoughts.
| lessname wrote:
| This might work well in some situations (e.g. research,
| development), however it would also increase the effect of echo
| chambers I think.
| theduder99 wrote:
| echo chambers are what most people want :)
| jonathankoren wrote:
| Part of the thing with echo chambers is that the search terms
| themselves can be indicative of a particular bubble. For
| example, there's a difference in the people that refer to the
| Bureau of Alcohol, Tobacco, and Firearms by the official
| initialism, "ATF", and those that use "BATF". There's a
| strong antigun control bent to the `"BATF" guns` query,
| compared to the `"ATF" guns` query.
|
| If you're indexing forums or social media, the same site is
| going to give back the bubbled responses, possibly without
| the person even being aware they're in a bubble.
|
| https://www.google.com/search?q=%22BATF%22+guns&client=safar.
| ..
|
| https://www.google.com/search?q=%22ATF%22+guns&client=safari.
| ..
| lolinder wrote:
| Possibly, but I'm not convinced.
|
| Google's not exactly working against the echo chamber
| problem, and I think that's because to do so would be to work
| against its own reason for existing. There are two goals here
| that are fundamentally at odds with each other:
|
| 1) Finding what you're looking for.
|
| 2) Finding a new perspective on something.
|
| A search engine's job is to address the first challenge:
| finding something that the user is looking for. The search
| engine might end up serving both needs if they're looking for
| a new perspective on something, but if these two goals ever
| come into conflict with each other the search engine does
| (and I would argue it _should_) choose the first goal.
| Failing to do so will just lead to people ignoring the
| results.
| swframe2 wrote:
| Have a look at gpt-3 if you want to see what the future dominant
| search engine will be. It will not find relevant results, it will
| write it on the fly customized for exactly what you want to read.
| (Maybe products will just ship to your door and be auto paid
| because the future ad targeting AIs will know you so well.)
| marginalia_nu wrote:
| What if you are looking for something written by a human?
| swframe2 wrote:
| You can always go to a library or bookstore.
| marginalia_nu wrote:
| Let's imagine I want to talk to the author of the content.
| How can I do that if it's just a souped up markov chain?
| throwawayffffas wrote:
| The markov chain can also power a chat bot.
| marginalia_nu wrote:
| But then they would need to know that the person sending
| the email is the same person that read a specific
| article.
| baggachipz wrote:
| https://kagi.com/ is a new engine (and Orion Browser) which seems
| like what you're talking about. I've been using it some and like
| it so far. The browser is fantastic.
| drcongo wrote:
| I've been using kagi.com for a month or so now, and it
| consistently beats DDG and Ecosia for result quality. I'd guess
| it beats Google too, since last time I used Google it was nothing
| but ads and spam which is why I stopped.
| freediver wrote:
| Thank you for the vote of confidence! Better than Google is our
| goal, glad you perceive it that way.
| drcongo wrote:
| You're welcome. I'm really impressed with it most of the
| time. Still not made it on to the Orion beta though ;)
| mkbkn wrote:
| I am a non-dev and Ecosia and DuckDuckGo are perfect for me. I
| haven't used Google for more than 3 years now.
| moralestapia wrote:
| Please do it! Google is now complete trash.
|
| Also Gmail used to have the best spam filters out there; now
| it's utter crap. Emails from my Google Analytics account, for
| whatever reason and regardless of how many times I have clicked
| "Not Spam", go to spam, and it's their own service, while
| messages that are textbook spam ("Hi, I just got some inheritance
| ...") go to my inbox.
|
| AI (in its current state) is crap. When is the industry going to
| accept that these are the emperor's new clothes?
| chilling wrote:
| Yesterday there was a discussion[1] about it and someone
| suggested yandex.com. I've been using it since then and really love it.
| It's like going back to 2003 where everything was just plain and
| simple.
|
| [1]: https://news.ycombinator.com/item?id=29393467
| pydry wrote:
| Early 2000s google index ran in a garage. The current google
| index has dedicated power stations.
|
| It's a bit like the car industry - you could run a startup from
| your garage in the early days but you need titanic amounts of
| capital to compete now thanks to vertical integration.
|
| Major governments and billionaires can compete but everybody else
| is locked out of the market (most "startups" use Bing's index).
| R0b0t1 wrote:
| I was thinking about exactly that. If they used a simpler index,
| would they be getting better results? There's not a lot of
| selective pressure so they just keep adding to the index
| algorithm.
| rasengan wrote:
| This is how Private Search [1] works since it decouples the
| search from the user. This means nobody knows both who searched
| and what they searched for. This is a huge leap for privacy in
| search.
|
| [1] https://private.sh
| jaywalk wrote:
| Looks like your comment here caused enough curiosity to take
| the service down.
| snarkypixel wrote:
| Is it a proxy to other search engines or are they building
| their own?
| rasengan wrote:
| It's a multi part partnership with Gigablast. Gigablast sees
| the searches, but not who searches. Private.sh sees who
| searches, but not what they search for.
| gadrev wrote:
| Just tried it and it worked for me.
| gbmatt wrote:
| and i work with rasengan on private.sh so yes there's some
| issue there. one of the back end servers is returning a max
| capacity error of sorts... we are checking into it.
| Lucasoato wrote:
| Tried it but it just says: "Something went wrong. Please try
| again."
| ZetaZero wrote:
| same here
| rasengan wrote:
| It should be working now! Thanks for the heads up. There
| was a traffic issue.
| andrewclunn wrote:
| What about a search engine that only indexed information and
| technology "alternative" sites, specifically to give you the
| results most likely to be purged or demoted from Google's
| results? Would be simple enough in scope and have a built in
| market and use.
| beefman wrote:
| Can you also create a web comparable to the 2005 web?
|
| Well, it's wikipedia. So just create a search engine for that,
| since their search sucks rocks.
| criddell wrote:
| Why don't you want personalized results? If I search for "subaru
| service" I want to find Austin Subaru, not Thorp Subaru in Cape
| Town.
| arthur_sav wrote:
| I pretty much hate "personalized" search recommendations. If
| i'm looking for something it's usually not in relation to me
| but in relation to the world.
|
| If i wanted something more relevant to me, then i would specify
| what aspect of relevance (country, gender, age etc...) i would
| like instead of playing the guessing game.
| criddell wrote:
| > If i'm looking for something it's usually not in relation
| to me but in relation to the world.
|
| If that's true, then I don't think you are a typical search
| engine user.
|
| The personalization should just be used for defaults. You can
| always make a more specific query to focus on aspects you are
| interested in.
| vikingerik wrote:
| Why didn't you just search "austin subaru service"? If you want
| a query narrowed down by location, that's your job to say so.
|
| Sure, it feels great when the engine guesses something like
| that correctly -- but it comes out worse overall for the
| plentiful cases where you have to try to compensate for it
| guessing wrong.
| criddell wrote:
| Why should I have to do all that work? I want the machine to
| do it for me.
|
| I can only think of examples where I want personalization.
| What's an example query where it interferes?
| jeffbee wrote:
| Amazing that the same site that thinks copilot will just
| generate programs for us also thinks it is literally a
| crime for a search engine to infer anything.
| not2b wrote:
| I think you're being nostalgic for something you don't remember
| very well.
|
| In that era, Google would return a match based on words that
| appear in the links to a URL but not in the article itself,
| meaning that it was easy to produce "Googlebombs". For example,
| from 2005-2007 the top hit for "miserable failure" was the
| Wikipedia article for George W. Bush.
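|
| To illustrate the mechanism, here is a toy sketch of an index
| built only from anchor text, so a page can rank for words it
| never contains. This is not Google's algorithm; the function
| names and example.* URLs are invented for the illustration.

```python
from collections import Counter, defaultdict

def build_anchor_index(links):
    """Map each anchor-text word to a count of pages linked with that word.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    """
    index = defaultdict(Counter)
    for _source, target, anchor_text in links:
        for word in set(anchor_text.lower().split()):
            index[word][target] += 1
    return index

def search(index, query):
    """Rank targets by how many inbound links cover every query word."""
    words = query.lower().split()
    if not words:
        return []
    scores = {}
    for target in index.get(words[0], {}):
        score = min(index.get(w, {}).get(target, 0) for w in words)
        if score > 0:
            scores[target] = score
    return sorted(scores, key=lambda t: (-scores[t], t))

# Hypothetical link graph: two blogs coordinate on the same anchor text.
LINKS = [
    ("https://blog-a.example/post", "https://example.org/biography", "miserable failure"),
    ("https://blog-b.example/post", "https://example.org/biography", "miserable failure"),
    ("https://blog-c.example/post", "https://example.net/essay", "a miserable failure of policy"),
]
index = build_anchor_index(LINKS)
print(search(index, "miserable failure"))
```

| The biography page wins purely on inbound link wording, which is
| why a handful of coordinated links was enough.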
|
| See https://www.screamingfrog.co.uk/google-bombs/ for some of the
| "better" ones.
| chrisgoman wrote:
| too many crappy websites, probably needs a "committee" to
| whitelist domains (only good quality ones) but probably too much
| work for not enough money or needs some monetization strategy
| nfriedly wrote:
| I think DuckDuckGo is closer to what you want. Same results for
| everyone, better privacy, and they're proactive about improving
| their results.
|
| https://duckduckgo.com/
|
| Part of the problem is that there's a lot more low-quality
| content to wade through now than there was in 2005. I think the
| Google of 2005 would have trouble delivering quality results
| today also.
| hunterb123 wrote:
| DDG never worked great for me, and it doesn't have its own
| index.
|
| Brave search has been my daily driver and it works wonderfully.
| adolph wrote:
| I'll give it a try, somehow I missed the announcement even
| though I'm a Brave user...
|
| https://brave.com/search/
| MuffinFlavored wrote:
| > I think the Google of 2005 would have trouble delivering
| quality results today also.
|
| What would you attribute their modern 2021 success to, then?
| Just throwing a ton of money at amazing engineers to hone
| their complex algorithm so that it still returns what we
| humans consider "good" results? Especially if they are
| wading through a sea of low-quality content as you say.
| gbmatt wrote:
| both ddg and brave are bing (microsoft) in disguise.
| pythux wrote:
| This is not correct. Brave Search owns its own (growing)
| index and relies on third-parties like Bing for some fraction
| of the requests. Which is not the same thing as relying fully
| on Bing or third-parties for results like so many meta-search
| engines. More detailed answer here:
| https://search.brave.com/help/independence
|
| Edit: Forgot to say that I work on Brave Search.
| gbmatt wrote:
| brave 'falls back' to bing. which in my experience is most
| of the time. in fact, out of all the queries i did a while
| back, they all seemed to come directly from bing. is there
| a way to disable the reliance on bing and get pure 'brave
| only' results? and can you be more specific as to what this
| fraction is? do you blend at all?
| pythux wrote:
| You can check exactly which fraction of the results were
| fetched from Brave's index vs. third parties using the
| "independence score" found in the settings drawer
| (opened with the cog icon at the top right
| of any page on search.brave.com). There is both a global
| and a personalized independence score (aggregated over
| all users and over your queries only, respectively).
|
| Explanation is also found here with screenshots:
| https://search.brave.com/help/independence
| gbmatt wrote:
| So Brave is still dependent on Google and Bing it seems.
| Also is this Brave's CEO:
| https://www.bbc.com/news/technology-26868536
| https://www.nytimes.com/2020/12/22/business/brave-
| brendan-ei... ? "Brendan Eich's opposition to same-sex
| marriage cost him his job at Mozilla." "Covid comments
| get a tech C.E.O. in hot water, again."
| ricardo81 wrote:
| Does DDG have any of its own organic results yet, or is it
| still entirely Bing/Yandex?
| DavideNL wrote:
| > a lot more low-quality content
|
| I wish there was an easy way to filter ALL search results, by
| permanently excluding specific websites, and/or keywords.
|
| Surely there has to be some browser extension that does this...
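|
| The filtering itself is simple once you have the result list. A
| rough sketch, assuming results arrive as dicts with "url" and
| "title" keys (that shape and the function name are my own
| assumptions, not any extension's API):

```python
from urllib.parse import urlparse

def filter_results(results, blocked_domains=(), blocked_keywords=()):
    """Drop results whose host or title matches a personal blocklist."""
    domains = {d.lower() for d in blocked_domains}
    keywords = [k.lower() for k in blocked_keywords]
    kept = []
    for result in results:
        host = (urlparse(result["url"]).hostname or "").lower()
        # Match the domain itself and any of its subdomains.
        domain_blocked = any(host == d or host.endswith("." + d) for d in domains)
        keyword_blocked = any(k in result["title"].lower() for k in keywords)
        if not (domain_blocked or keyword_blocked):
            kept.append(result)
    return kept

results = [
    {"url": "https://www.contentmill.example/top-10", "title": "Top 10 Emulators"},
    {"url": "https://hobby.example/pdp11", "title": "Notes on a PDP-11 emulator"},
    {"url": "https://shop.example/deal", "title": "Coupon codes for emulators"},
]
print(filter_results(results, ["contentmill.example"], ["coupon"]))
```

| The hard part is applying this to every engine's result page,
| which is why it tends to live in a browser extension.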
| BoxOfRain wrote:
| https://news.ycombinator.com/item?id=29404860
|
| Not got round to trying it yet though.
| DavideNL wrote:
| Great and it even supports iOS...!
| MayeulC wrote:
| Excluding, or penalizing for, advertising and trackers could
| do wonders against perverse incentives and SEO, IMO. It would
| also be a better experience for the reader.
| Kiro wrote:
| Try searching for the same thing from your computer and your
| phone and you will get different results. Also, their results
| come from Bing so any improvement happens at Microsoft.
| JohnFen wrote:
| They do use Bing, but not solely Bing. DDG isn't just a
| frontend to a different search engine.
| bla3 wrote:
| It's a bing frontend with a few special cases handled
| differently. For most queries, you get bing results. Easy
| to check by comparing results.
| jpswade wrote:
| DuckDuckGo isn't really a search engine, it's a website that
| uses bing's api.
| cyberbanjo wrote:
| Not just Bing, but nearly every search engine you've ever
| used https://duckduckgo.com/bang?q=
| Sunspark wrote:
| This. DDG is my primary search engine now and has been for
| awhile.
|
| I don't use Google anymore to search unless I really need to.
| The algos they use today are not the same classic ones that
| actually returned results.
| kspacewalk2 wrote:
| And if you really need to, DDG !bangs[0] make a search as
| simple as "!g mother google help me". The keyword thing is
| also available in Firefox as a browser feature, and elsewhere
| I'm sure, but nevertheless, it makes switching to DDG easier.
|
| (Plus I can directly go to the wiki page by using "!w", "!gm"
| for google maps, etc.)
|
| [0] https://duckduckgo.com/bang
| eevilspock wrote:
| The only bang I use is !gvb since DDG doesn't support
| verbatim searches.
| jay3ss wrote:
| Is this the same as enclosing the terms in quotes and
| using the !g bang?
| JohnFen wrote:
| Same. For the sorts of searching I do, anyway, the results I
| get from DDG tend to be better than what I get from Google.
| Google tries to infer what I want rather than take me at my
| word, and is very bad at it.
| dragonwriter wrote:
| > Why doesn't anyone create a search engine comparable to
| 2005-Google?
|
| Because the universe being searched isn't the internet of 2005
| and earlier, and because user expectations have moved on, too.
|
| Plus the index expense.
| axegon_ wrote:
| Two major reasons: costs to build and maintain and manpower
| needed. Both are practically impossible to come by.
| kumarsw wrote:
| I feel like we are at the low point or even losing the battle
| between search engines and SEO spam. Maybe it is time for the
| Yahoo-style curated directory to return? We seem to be getting a
| microcosm of this with the awesome-* GitHub lists and Gemini with
| its near-nonexistent search.
| erpellan wrote:
| Even if Google dusted off their 2005 codebase and ran it on
| today's web it wouldn't come close to the results quality of
| Google in 2005. The SEO industry has been in an arms race with
| the search engines for 16 years. 2005 Google would be like a
| goldfish in a piranha tank.
| WalterBright wrote:
| I'd like to see categories like travel, science, history, art,
| etc. The web pages could pick which categories their page falls
| into using meta tags. The user has the option of selecting which
| category they are interested in searching within.
| mrkramer wrote:
| They do[0] but nobody cares anymore. Google controls web
| distribution through Google Chrome. I think we are at the point
| of no return. There won't be any competition anytime soon no
| matter what US government does. Only innovation can displace
| Google.
|
| [0] https://search.marginalia.nu/
| BbzzbB wrote:
| Marginalia is great to find blog posts, personal sites and
| other long form content, but it's not a replacement for Google
| nor does it intend to be.
| mrkramer wrote:
| But it is a good start and foundation for something bigger
| and better.
| egberts1 wrote:
| Funny. Marginalia has an option for No JavaScript but I
| cannot even do an HTTP "POST" with JavaScript disabled in my
| web browser.
|
| Disclaimer: I study malicious JS.
| marginalia_nu wrote:
| It does operate on a scale and principle fairly similar to
| early 2000s google, so the comparison isn't that far off, but
| yeah, it's quite some way before it's viable for general
| search. Dunno if I'll ever get there, but it does
| consistently seem to get better so who knows.
| BbzzbB wrote:
| Isn't its similarity to early Google a side effect of the
| early Internet being text-heavy sites in the first place,
| rather than a similarity in the search engine? Unless I am
| misunderstanding your site's intent, even if you reach the
| dream engine you are trying to achieve, I won't be using it
| to search answers for coding questions on SO, how-tos for
| car repair, sites to stream movies, governmental page for X
| need, transcript for earnings calls, etc.
|
| In my experience it is better than Google at what it does
| if I'm looking for long-form texts (exception being
| scientific/peer-reviewed articles, Google tends to shoot me
| those for the type of queries I make on Marginalia), but is
| very much complementary rather than a replacement.
| marginalia_nu wrote:
| I guess it depends on what you are looking for on the
| Internet.
|
| Right now the biggest problem with Marginalia is that it
| has a fairly uneven quality level. For some queries it's
| absolutely incredible. For others, it doesn't really
| provide many useful results at all. I do think it's
| possible to even that out a considerable bit, to make it
| more viable for general queries. It's never going to be
| able to answer every query, but it probably could answer
| a lot more than it does.
| freediver wrote:
| We are building one [1] as well as a few other people that I am
| aware of with different approaches and business models.
|
| We also need to be aware that when we remember past times it
| usually carries a romantic, nostalgic note. The web is very different
| than it was 15 years ago and the problem of search has evolved.
|
| What you are looking for is basically 'grep for the web' but it
| is just one facet of search that we use today. 15 years ago you
| would not get an instant answer to a question like you do today
| and many users would not be able to live without that today.
| There are also maps and location-based answers, all sorts of
| widgets like translation etc. Also, the world became more polarized,
| so an objective best search result became more difficult to produce,
| especially for events covered in news, which means bias inevitably
| starts to creep in.
|
| This is not to say that Google is good or bad today; it is what
| it is and they are doing the best they can. Startups like ours see an
| opportunity on the market, in large part to help savvy users find
| what they want.
|
| [1] https://kagi.com
| motohagiography wrote:
| I do like the idea that instead of crawling and indexing, the
| next generation search will likely be more like a federated
| community search app that indexes the stuff members actually
| read. Google search isn't so much a repository as a consensus
| about what's important, hence why it's so politicized to the
| point of becoming unreliable, but also why it too is vulnerable
| to disruption.
|
| Imo, 2005 google got initial traction because of its tech forum
| post indexing, as I remember my switch to it was because it
| became an extension and then replacement for manpages. In that
| sense, what made it good was it reflected the consensus of what
| its incredibly influential userbase thought was important and
| just managed that really well. The demographic impact of the U.S.
| Gen X all using it at once didn't hurt either.
|
| The equivalent today, as a lot of us say, is that blockchains are
| in the 1997 internet phase, and the service that makes the
| content of those as navigable as the 90's internet, will likely
| grow in a similar way.
|
| Search that provides young people with privacy and freedom to
| pursue their true interests will be the dominant strategy. Its
| success will be because it's a product that rides growth, and not
| because it "solved a problem." Imo, we all index too much on the
| privacy pattern because the freedom pattern is too risky.
|
| What's changed since that time are the maturity of things like
| Bloom and other probabilistic filters, Apple's private set
| intersection, differential privacy, zksnarks, and everybody you'd
| ask an opinion from now gets their content through mobile
| devices. Apple's ecosystem is equipped to do this kind of search,
| but they're too exposed politically to get into it. Meta will
| likely go there, but nobody's going to trust them willingly.
|
| A protocol that generated a cryptographically strong anonymous
| index from your browsing - and instead of putting it on google's
| servers, it was on a chain, or the content index information and
| its evolving consensus score was included in something like a DNS
| record - may still unseat these ensconced interests. IPFS and
| other P2P or torrents might do something like that as well.
| Blockchains may be good for that consensus/desire score.
|
| It's not something you architect and design top down that has to
| solve all cases, it will be just another useful product that
| grows while riding a demographic change. It would be on the level
| of inventing HTML/HTTP again, which, when you think about it, was
| just another dude making a thing he needed.
| BbzzbB wrote:
| No mention of DDG in the comments? Is there a reason I'm not
| seeing or it's just not the preferred alt-search on HN? Seems to
| have been working fine for me when I struggle to get past the
| funnels and content mills on Google.
| pantulis wrote:
| I don't find search results to be too relevant (at least for me,
| also Spaniard here). It is my default search engine only for
| the bang commands.
| KennyBlanken wrote:
| For me, DDG results are even worse than Google. It's set as my
| default and I'd say at least half of my searches in DDG
| generate completely useless results...pages of obviously SEO'd
| garbage.
|
| DDG also doesn't support showing a site's basic structure in
| the search results (ie, the card of a company's website with
| Products, Contact Us, Support, etc) and the preview text is
| garbo as well...it reminds me of 1990's era electronic card
| catalog search excerpts.
|
| I look at the first page or two, give up, search google. While
| I have to hunt a bit in the results, I do eventually get what I
| wanted.
| infinitezest wrote:
| Every time this comes up I'll see a few people talk about how
| the results aren't relevant but it has not been my
| experience. I've been using DDG as my main search engine for
| a few years and never have to go beyond the first page. I'm
| really curious why that is.
| JohnFen wrote:
| My experience is like yours -- DDG is legitimately better
| than Google. My hypothesis is that it's related to how you
| construct searches. I expect Google probably does better if
| you learn how to talk to it, since it seems to want to
| interpret your query rather than take it literally.
|
| My searches tend to be keyword-oriented rather than natural
| language. I think DDG does better with those.
| RDaneel0livaw wrote:
| I was looking for this as well! I use it daily and have for
| years. Love it.
| Kiro wrote:
| DDG doesn't have their own index (they're getting their results
| from Bing) so not really relevant to this question.
| BbzzbB wrote:
| I.. didn't know that. However, trying it just now in
| incognito I don't get the same results[0] (some different
| links, and most re-ordered). Is Duck repurposing Bing's
| results? I've tested with "how to get rich", a great bait for
| bad content (try it on Google without an adblocker, if you
| dare).
|
| [0]: https://pastebin.com/xC45hL1i
| Kiro wrote:
| I don't know what DDG is doing but I'm imagining that they
| send in the raw queries while you can't get around Bing's
| personalisation even in incognito. I get very similar
| results for "how to get rich", but only after setting "All
| regions" on DDG.
|
| Bing:
|
| 1. How to Get Rich: 10 Things Wise and Rich People Do
|
| 2. 5 Ways to Get Rich - wikiHow
|
| 3. 16 Proven Ways On How To Get Rich Quick (2021 Edition) -
| TPS
|
| 4. How to Get Rich - NerdWallet
|
| 5. How to Get Rich: Follow our Step by Step Plan to Build
| ...
|
| DDG:
|
| 1. How to Get Rich: 10 Things Wise and Rich People Do
|
| 2. 5 Ways to Get Rich - wikiHow
|
| 3. How to Get Rich - NerdWallet
|
| 4. 16 Proven Ways On How To Get Rich Quick (2021 Edition) -
| TPS
|
| 5. How to Get Rich: 8 Steps to Make Your First Million ...
|
| It's no secret that DDG is using Bing so they're not trying
| to hide it. An easy way to verify it is to search for "what
| is my ip" on DDG and look for results where the IP number
| has been cached in the snippet, e.g.:
|
| www.myipnumber.com What is my IP number - my IP address -
| MyIpNumber.com What is my IP Number? The IP Number of this
| machine is: 157.55.39.192. This number can also be
| represented as a 32-bit decimal number 2637637568, or as a
| 32-bit hexadecimal number 0x9D3727C0 . (Note that if you
| are part of an internal network then this is the IP number
| of your local server, the machine which is connected to the
| external ...
|
| If you do an IP lookup on 157.55.39.192 you will see that
| it's in fact "Microsoft bingbot".
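|
| The decimal and hexadecimal forms quoted in that snippet can be
| checked with Python's standard ipaddress module. The helper name
| is just for illustration, and this only verifies the number
| conversion; confirming who owns the address still needs an
| external IP lookup.

```python
import ipaddress

def ip_forms(dotted):
    """Return the 32-bit integer and hexadecimal forms of an IPv4 address."""
    as_int = int(ipaddress.IPv4Address(dotted))
    return as_int, f"0x{as_int:08X}"

print(ip_forms("157.55.39.192"))  # (2637637568, '0x9D3727C0')
```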
| marksbrown wrote:
| I'd like a way of automatically filtering for websites that :
|
| * Don't use JS
| * Don't use Google analytics
| * Don't weigh more than a few kB per page
| * Don't show any sites with ads
|
| That would be a place to begin.
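|
| A rough sketch of how the first three checks might look on
| fetched HTML, using only the standard library. The 20 kB
| threshold, the helper names and the list of analytics hosts are
| my own assumptions; ad detection is not attempted, and inline
| event handlers such as onclick are not counted as JS here.

```python
from html.parser import HTMLParser

# Hosts commonly used to load Google Analytics scripts (assumed list).
GA_HOSTS = ("google-analytics.com", "googletagmanager.com")

class ScriptScanner(HTMLParser):
    """Collect <script> tags and their src attributes from a page."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.script_srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
            src = dict(attrs).get("src")
            if src:
                self.script_srcs.append(src)

def page_report(html):
    """Summarise JS use, analytics use and page weight for one document."""
    scanner = ScriptScanner()
    scanner.feed(html)
    scanner.close()
    return {
        "uses_js": scanner.script_count > 0,
        "uses_google_analytics": any(
            host in src for src in scanner.script_srcs for host in GA_HOSTS
        ),
        "size_bytes": len(html.encode("utf-8")),
    }

def passes_filter(html, max_bytes=20_000):
    """True if the page has no <script> tags and is under the size limit."""
    report = page_report(html)
    return not report["uses_js"] and report["size_bytes"] <= max_bytes

plain = "<html><body><h1>PDP-11</h1><p>Notes.</p></body></html>"
tracked = (
    '<html><head><script src="https://www.google-analytics.com/analytics.js">'
    "</script></head><body>hi</body></html>"
)
print(passes_filter(plain), passes_filter(tracked))
```

| This only measures the HTML document itself; a real crawler
| would also have to count images and stylesheets toward weight.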
| fnord77 wrote:
| information-dense pages of yore have been replaced by really
| wordy, probably generated SEO optimized blog junk.
| tigerlily wrote:
| Surely there must be some way to have distributed search compute
| a la folding/seti@home or those mersenne prime guys.
|
| I'd gladly pool in some of my CPU time if it helps build a better
| search.
| teddyh wrote:
| https://yacy.net/
| tigerlily wrote:
| Thanks!
| michaelyuan2012 wrote:
| here is Drop Side Trailer information, it should be helpful for
| those logistic company.
|
| https://www.dreamtruegroup.com/drop-side-trailer/
| mrfusion wrote:
| I've always wondered why you can't use SEO optimizations for
| GOOGLE as a negative weight and penalize those pages.
|
| For example if my search term appears in the URL I can almost
| guarantee I don't want that page.
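|
| Something like the following is what I have in mind: a toy
| re-ranker that subtracts a penalty for each query word found in
| the URL path. The base scores, the penalty value and the
| function names are all invented for the example.

```python
import re
from urllib.parse import urlparse

def url_keyword_hits(url, query):
    """Count distinct query words that appear as tokens in the URL path."""
    path_tokens = set(re.findall(r"[a-z0-9]+", urlparse(url).path.lower()))
    return sum(1 for word in set(query.lower().split()) if word in path_tokens)

def rerank(results, query, penalty=0.5):
    """Re-sort (url, base_score) pairs after penalising keyword-stuffed URLs."""
    adjusted = [
        (url, score - penalty * url_keyword_hits(url, query))
        for url, score in results
    ]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

results = [
    ("https://example.com/pdp11-emulator-best-pdp11-emulator", 1.0),
    ("https://example.org/post/4821", 0.8),
]
print(rerank(results, "pdp11 emulator"))
```

| Of course, if this ever became popular, the spam would adapt to
| it like it adapts to everything else.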
| ChrisArchitect wrote:
| related 2 days ago:
|
| _Ask HN: Has Google search become quantitatively worse?_
|
| https://news.ycombinator.com/item?id=29392702
|
| Inviting all the paranoid/speculative/hearsay/personal experience
| responses. Lame Ask HNs!!!!!
| michaelyuan2012 wrote:
| for the people in logistic business area who search in Google,
| Flatbed Container Semi Trailer, it should be good for your
| reference.
|
| https://www.dreamtruegroup.com/flatbed-container-semi-traile...
| ravenstine wrote:
| I think what [some] people actually want isn't the Google of 2005
| but to have a search engine where they don't feel like they're
| being manipulated.
| anotheraccount9 wrote:
| Check out the dead internet theory. If most people browse 1% of
| the web, what's up with popular search engines?
| ab_testing wrote:
| I think a lot of people are ignoring the issue that the web has
| changed considerably since 2005. It is approximately 10 times
| larger in terms of number of websites and web pages. And a lot of
| it is SEO junk that is just designed for search engines to be
| easier to parse and show ads in your face.
|
| Also user preferences have changed in the last decade or so. I
| know millennials and users in their late 30's or early 40's still
| yearn for the old web where they would type a search term and
| correct results would astonish them. However, younger users tend
| to gravitate to videos and that is why a large portion of the
| google results are now video results.
| rovingEngine wrote:
| I think Google was "better" from a user's point of view in 2005
| because it wasn't that good at selling ads yet. I still remember
| the epiphany of the first time I used Google in 1999. It was
| amazing.
|
| I've thought the same about pre-ad Twitter and Facebook.
|
| Early on, startups with free services look a lot like non-profits
| and just maximize user benefit to grow. The problem is they're
| not non-profits, and have to make money at some point. That has
| tended to mean ads.
|
| I'd easily pay, say, $9/mo to have access to an ad-free search
| engine that made me feel the way 1999 Google did.
| mmmmmbop wrote:
| $9/mo is not going to cut it. Google's domestic annual revenue
| per user in 2019 was $256. [0] That's $21.33 per month. Not all
| of Google's revenue is from Ads, of course, but the vast
| majority is. (Let's ignore for now the valid counterpoint that
| Ads are increasingly served on other Google properties than
| Search.)
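|
| Spelling out the arithmetic, using only the $256 annual figure
| above and the $9/mo price floated in the parent comment (the
| variable names are mine):

```python
def monthly_revenue_per_user(annual_revenue_per_user):
    """Convert an annual per-user figure to a monthly one, in dollars."""
    return round(annual_revenue_per_user / 12, 2)

ANNUAL_REVENUE_PER_USER_2019 = 256  # figure quoted in this comment
PROPOSED_MONTHLY_FEE = 9            # price suggested in the parent comment

monthly = monthly_revenue_per_user(ANNUAL_REVENUE_PER_USER_2019)
share = PROPOSED_MONTHLY_FEE * 12 / ANNUAL_REVENUE_PER_USER_2019

print(monthly)         # 21.33
print(f"{share:.0%}")  # 42%
```

| So a $9/mo subscriber would bring in a bit over 40% of that
| per-user figure.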
|
| But even charging users $21.33/mo for an ad-free search
| experience most likely wouldn't be enough. By providing such an
| option, you'd greatly reduce the value of the remaining Ads
| pool.
|
| The optimistic perspective on this is that if you are one of
| the users with disposable income, you're essentially
| subsidizing a great search engine and a suite of other tools
| for the less well-off ones.
|
| [0] https://miro.medium.com/max/6545/0*YTqXb-F5UiVhtlIS
| rovingEngine wrote:
| Let's say ads will always make more money (I have no reason
| to believe they won't), and that's required to be the
| dominant search engine because the web is big and expensive
| to organize.
|
| I'd bet there's some way to characterize what I and others
| liked about the earlier web and create a search engine that
| just worries about that stuff. I'd pay $9/mo for whatever 1/3
| of Google's spend per user would get me. That's not to say
| this thing would "beat" Google, but it could profitably
| exist.
| BitwiseFool wrote:
| Natural Language Processing is a pox on modern search engines. I
| suspect that Google et al. wanted to transform their product
| into an answer engine that powers voice assistants like Siri and
| just assumed everyone would naturally like the new way better. I
| can't stand how Google is always trying to guess what I want,
| rather than simply returning non-personalized results solely
| based on exactly what I typed in the textbox.
|
| While that may be good for most people, there is still a lot of
| power and utility in simple keyword-driven searches. Sadly, it
| seems like every major search engine _has_ to follow Google's
| lead.
| maxlamb wrote:
| What's a pox?
| mattanimation wrote:
| a disease or plague
| rocqua wrote:
| Saying X is a pox on Y means saying X is bad for Y.
|
| It originates from the disease 'the pox'.
| marginalia_nu wrote:
| I think _some_ NLP is strictly beneficial for a search engine.
| You may think "grep for the web" sounds like a good idea, but
| let me tell you, having tried this, manually going through
| every permutation of plural forms of words and manually
| iterating the order of words to find a result is a chore and a
| half.
|
| Like, instead of trying
|
|   PDP11 emulator
|   PDP-11 emulator
|   "PDP 11" emulator
|   PDP11 emulators
|   PDP-11 emulators
|   "PDP 11" emulators
|   PDP11 emulation
|   PDP-11 emulation
|   "PDP 11" emulation
|
| Basic NLP can do that a lot faster without introducing a lot of
| problems.
|
| I do think Google currently goes way overboard with the NLP. It
| often feels like the query parser is an adversary you need to
| outsmart to get to the good results, rather than something
| that's actually helpful. That's not a great vibe. However, I
| think the big problem isn't what they are doing, but how little
| control you have over the process.
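|
| For a sense of how little is needed, here is a toy normalizer
| that collapses all nine PDP-11 variants above into one key. It
| is a sketch, not what any real engine does: the suffix list is a
| deliberately crude stand-in for a proper stemmer, and joining a
| letter to a following digit is an assumption that would also
| merge things like "windows 10".

```python
import re

# Crude suffix list; longest suffixes first so "ors" wins over "s".
SUFFIXES = ("ions", "ion", "ors", "or", "s")

def crude_stem(word):
    """Strip one known suffix, keeping at least four leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def normalize(query):
    """Lowercase, drop quotes, join 'pdp 11' / 'pdp-11' and stem each word."""
    text = query.lower().replace('"', " ")
    text = re.sub(r"([a-z])[\s-]+(\d)", r"\1\2", text)
    return tuple(crude_stem(word) for word in text.split())

VARIANTS = [
    "PDP11 emulator", "PDP-11 emulator", '"PDP 11" emulator',
    "PDP11 emulators", "PDP-11 emulators", '"PDP 11" emulators',
    "PDP11 emulation", "PDP-11 emulation", '"PDP 11" emulation',
]
print({normalize(v) for v in VARIANTS})  # {('pdp11', 'emulat')}
```

| Applying the same function to both the index and the query means
| one search covers every spelling, and a per-query switch to skip
| it would give back the control I'm complaining about above.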
| kenny11 wrote:
| I get that for general-purpose searches this is a good idea,
| but it would be nice if there was an easy way to disable this
| when you know you don't want it - for example, for most
| programming searches, if I type SomeAPINameHere the most
| relevant results will always be those that include my search
| term verbatim. I don't need Google to helpfully suggest "Did
| you mean Some API Name Here?", which will virtually always
| return lower-quality search results.
|
| Early Google was a breath of fresh air compared to the
| stemming that its competitors at the time did, but nowadays
| even putting search terms in quotes doesn't seem to return
| the same quality of results for these types of queries that
| Google used to have.
| thisisnotatest wrote:
| I feel your pain. Two workarounds when Google gets it wrong
| are to put the term in quotation marks, or to enable
| Verbatim mode in the toolbelt. (I know various people have
| come up with ways to add "Google Verbatim" as a search
| engine option in their browser, or use a browser extension
| to make Verbatim enabled by default.)
|
| Disclaimer: I work on Google search.
| Y_Y wrote:
| Both of these options are disappointing, in my
| experience. Verbatim mode seems weirdly broken sometimes
| (maybe it's overly strict), and quoting things is rarely
| enough to convince Google that you really want to search
| for exactly that thing and not some totally different
| thing that it considers to be a synonym.
|
| One porridge is too hot and the other is too cold. I know
| Google could find a happy compromise here if it wanted
| to. In fact, I bet there's some internal-only hacked-
| together version that works this way and actually gives
| an acceptable user experience for the kind of people who
| have shown up to this thread to show their
| dissatisfaction.
| vdqtp3 wrote:
| Try this, go to Google and type in "eggzackly this".
|
| Two results not containing "eggz" at all. Two results
| containing "eggzackly<punctuation>this". Two results
| containing "eggzackly" but missing "this".
|
| Google Search is broken. It no longer does what it's
| directed, it just takes a guess. I suspect part of this
| is because someone decided that "no results found" was
| the worst possible result a search engine could give.
| KennyBlanken wrote:
| Google does go way overboard with "NLP". Starting at least 5
| years ago there was a trend toward "similar" matching and
| search result quality nose-dived.
|
| You can search for, say, "cycling (insert product category
| here)" and get motorcycle related results. Why? Because to
| google "cycling" = "biking" and "motorcycles" are "bikes",
| bob's your uncle, now you're getting hits for motorcycle
| products.
|
| Every time I try to do a very specific search I can see from
| the search results how google tries to "help", especially if
| the topic is esoteric. The pages actually about the esoteric
| thing I'm searching for get drowned in a sea of SEO'd
| bullshit about a word/topic that is 1-2 degrees of separation
| from each other in a thesaurus. I'm sure someone at google is
| very, very proud of this because it increases their measure
| for search user satisfaction X percent.
|
| It does this thesaurus crap even with words in quotes, which
| is especially infuriating.
| marginalia_nu wrote:
| Yeah. It's one of those things where it's invisible where
| it works and enraging when it doesn't. That's generally not
| a failure mode that's desirable. It at least should require
| extremely low failure rates to motivate.
| JohnHaugeland wrote:
| "Basic NLP can do that a lot faster without introducing a lot
| of problems."
|
| This is called "stemming" and is not sensibly approached with
| machine learning.
| marginalia_nu wrote:
| Of course, but stemming is a fairly basic technique in NLP,
| as is POS-tagging. NLP is not machine learning.
| brokensegue wrote:
| Modern NLP basically is machine learning
| marginalia_nu wrote:
| You can still do NLP without machine learning though, and
| a lot of the sorts of computational linguistics a search
| engine needs for keyword extraction and query parsing
| doesn't require particularly fancy algorithms. What it
| needs is fast algorithms, and that's not something you're
| gonna get with ML.
| JohnHaugeland wrote:
| Stemming is not meaningfully a natural language
| processing technique, any more than arithmetic is a
| technique of linear equations.
| necovek wrote:
| At the very least,
| https://en.wikipedia.org/wiki/Natural_language_processing
| seems to disagree.
|
| (So do I: NLP does not have to be machine learning/AI
| based)
| marginalia_nu wrote:
| Is it not the processing of natural language?
| JohnHaugeland wrote:
| Would you call addition a system of linear equations?
|
| No, you don't use the college senior label for the
| highschool freshman topic. You use the smallest label
| that fits.
|
| It's string processing.
|
| NLP is actually understanding the language. Stemming is
| simple string matching.
|
| Playing the technicality game to stretch fields to
| encompass everything you think even marginally related
| isn't being thorough or inclusive; it's being bloated,
| and losing track of the meaning of the term.
|
| Splitting on spaces also isn't NLP.
| marginalia_nu wrote:
| Stemming is a task specific to a natural language. You
| can't run an English stemmer on French and get good
| results, for example.
|
| All NLP is, strictly speaking, more or less elaborate
| string matching.
|
| > Splitting on spaces also isn't NLP.
|
| String splitting can be, but it's a bit borderline. I'll
| argue you're in NLP territory if it doesn't split "That
| FBI guy i.e. J. Edgar Hoover." into four "sentences".
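The sentence-splitting example can be made concrete. The sketch below contrasts a naive split on periods with a splitter that knows a few abbreviations and treats single capital letters as initials. The abbreviation list and function names are assumptions made for illustration, and the approach still fails on cases such as an abbreviation that really does end a sentence.

```python
import re

# Hypothetical, tiny abbreviation list; real splitters use much larger ones.
ABBREVIATIONS = {"i.e.", "e.g.", "mr.", "dr.", "vs."}
INITIAL = re.compile(r"^[A-Z]\.$")  # a single capital letter plus period, e.g. "J."


def naive_split(text):
    """Pure string processing: cut at every period."""
    return [part.strip() for part in text.split(".") if part.strip()]


def split_sentences(text):
    """End a sentence at . ! ? unless the token is an abbreviation or initial."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token[-1] in ".!?"
        is_exception = token.lower() in ABBREVIATIONS or INITIAL.match(token)
        if ends_sentence and not is_exception:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


example = "That FBI guy i.e. J. Edgar Hoover."
print(naive_split(example))      # four fragments
print(split_sentences(example))  # one sentence
```

The second function only works because it encodes knowledge about English, which is the distinction the comment is drawing.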
| necovek wrote:
| > NLP is actually understanding the language.
|
| That's actually not an accepted terminology. There's,
| indeed, this:
| https://en.wikipedia.org/wiki/Natural-language_understanding
|
| Not sure why you are so adamant that yours is the "true
| meaning", when NLP existed long before machine learning
| and AI were used for it. And even if not, every term can
| be defined differently, so it should be normal to have
| different institutions/people define NLP differently.
| JPKab wrote:
| Semantic search requires NLP. So does the Q&A format the OP
| is complaining about. People conflate all things NLP with the
| latter, and forget about the former.
| BitwiseFool wrote:
| Maybe I'm not using the right qualifiers around the term NLP.
| The kind of NLP I was referring to is something like "Hey
| google, what is natural language processing?" and orienting
| the search around people asking questions in standard(ish)
| English like they would to another person.
| marginalia_nu wrote:
| NLP is very heavily integrated into search, so I don't
| think it's really possible to decouple them. But I agree
| the whole BonziBuddy thing they've got going now is
| annoying and it's especially unfortunate how it's replaced
| the search functionality. I'd have a lot more patience with
| it if I could choose this functionality when I wanted to
| ask a question.
| gk1 wrote:
| That's known as Open Domain Question Answering[1] and is
| only a subset of NLP.
|
| [1] https://www.pinecone.io/learn/question-answering/
| wpietri wrote:
| I doubt they assumed it was better. I expect they did a ton of
| user testing and found that it was better for most people. And
| I'm sure it is. HN users are very much a niche audience these
| days.
| gk1 wrote:
| Right. Bing switched to this method as well, as did Facebook,
| Twitter, Amazon, and pretty much every other company that has
| the ML resources to do this. They obviously had a good reason
| to do so, beyond assumptions.
| s1k3s wrote:
| I don't know how Google was in 2005, but in ~2010 I was able to
| pull a website on #1 with 0 cash spent, just by manipulating PR.
| That doesn't seem great to me.
| flipdot wrote:
| Not sure if this is anywhere close to what you're trying to find,
| but there's https://github.com/benbusby/whoogle-search
| nickpp wrote:
| Because we're not having a 2005-Web anymore. More to the point,
| SEO & Google have evolved together. To have barely relevant
| results today you need to be _good_. That takes stellar talent
| which costs huge amounts of money.
|
| Thus, the Google of today, which is optimized to extract that
| money from us.
| Const-me wrote:
| > To have barely relevant results today you need to be good
|
| An easy way to become way better than google -- detect google
| ads on pages, and penalize these pages in the index. For
| obvious reason, google search is incapable of doing so.
| ginko wrote:
| But shouldn't all the blogspam be so hyperoptimized for
| Google's algorithm that it should be straightforward to detect
| and ignore/downrank it?
| marginalia_nu wrote:
| Yeah, I do this with my search engine. Works pretty well. A
| complementary approach that works well is to look at where
| blogs written by humans link. Very few spam blogs get links
| from humans.
| elcomet wrote:
| It's not that easy, they are optimized for many metrics.
| beingflo wrote:
| No, because google's algorithm is not well known publicly.
| Also, if it was straightforward to detect then google could
| downrank it as well.
| kbelder wrote:
| I wonder if you could evaluate a page using your own
| algorithm, which is probably not gamed as much as Google's
| (because who cares about your search engine?)
|
| Then, check Google's ranking of the page. If it is much
| higher than it seems the page should be, assume the page is
| being SEO hyper-optimized and penalize the page
| proportionately.
|
| Basically, using the variance between Google's model and
| your model as an indicator of an SEO spam page.
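One way to read kbelder's proposal, sketched under loud assumptions: ranks are 1-based result positions for the same query, the gap is measured on a log scale, and the penalty weight of 0.5 is arbitrary. The function names are hypothetical, and how one would obtain Google's rank for a page is left open.

```python
import math


def rank_gap(own_rank, google_rank):
    """How much higher Google ranks a page than our own model does.

    Both arguments are 1-based positions (1 = top result). Returns 0.0
    when Google does not rank the page above our estimate.
    """
    if own_rank < 1 or google_rank < 1:
        raise ValueError("ranks are 1-based positions")
    return max(0.0, math.log2(own_rank / google_rank))


def penalized_score(own_score, own_rank, google_rank, weight=0.5):
    """Shrink our own relevance score in proportion to the rank gap."""
    return own_score / (1.0 + weight * rank_gap(own_rank, google_rank))


# A page we place 8th that Google places 2nd: gap of 2 doublings.
print(rank_gap(8, 2))              # 2.0
print(penalized_score(1.0, 8, 2))  # 0.5
# A page Google ranks lower than we do is left untouched.
print(penalized_score(1.0, 2, 5))  # 1.0
```

As thefreeman notes further down, a large gap does not prove spam, since legitimate sites do SEO too. A score like this is better treated as one signal among several than as a verdict.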
| ginko wrote:
| The point is that SEO would just immediately adapt to
| Google's changes. If a smaller search engine filtered these
| out then it would likely stay under the radar.
| thefreeman wrote:
| you know that legitimate sites perform SEO as well,
| right?
| marginalia_nu wrote:
| SEO often seems to be a compensation for the fact that a
| site doesn't have particularly worthwhile content. So
| punishing SEO surprisingly does promote higher quality of
| search results.
| all2 wrote:
| Yes and no. A lot of those sites are small local
| businesses trying to get found. A front page listing can
| be the difference between surviving and going under. Much
| of the time the blog spam is what floats hours, contact
| info, and services provided to the first page.
| marginalia_nu wrote:
| Be that as it may, search ranking is a zero sum game. The
| unfair advantage SEO gives this particular struggling
| business means another goes under. I'd rather punish the
| guy trying to game the system than the one with enough
| principles not to.
| pessimizer wrote:
| The difference is far more likely to be in capability or
| expertise than principles.
| marginalia_nu wrote:
| Either way, capability for fuckery is not something I'd
| want to encourage.
| nickpp wrote:
| I _read_ auto-generated pages almost to the end before
| realizing it was SEO spam. (I am not a native English speaker
| though)
|
| With content copying, shuffling and AI generating, I am
| afraid we are on the cusp of auto content generators passing
| some restricted Turing test where readers really think it's
| an actual human that wrote it.
|
| As for me, I learnt that for certain "hot topics", simply
| doing a generic search on Google is not a good idea anymore.
| thisisnotatest wrote:
| Yes, I think you'd call it a Red Queen Problem:
|
| "Here, you see, it takes all the running you can do to keep in
| the same place."
|
| -Lewis Carroll's Through the Looking Glass
| willcipriano wrote:
| I'd like to see a "just search" engine: all it does is search for
| a specific string, case insensitively, across the entire web. No
| curation or anything, just sorted in lexicographical order, closest
| match first, maybe falling back to page age if it has more than
| one exact match. Perhaps give me some regular expressions as
| well.
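A toy version of the "just search" idea, to show how little logic it needs. Everything here is an assumption for illustration: the `Page` record, the in-memory list standing in for a web-scale index, and the choice to order hits oldest first as the page-age fallback.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Page:
    url: str
    text: str
    first_seen: date  # when the crawler first saw the page


def just_search(pages, phrase):
    """Return pages containing the exact phrase, ignoring case, oldest first."""
    needle = phrase.casefold()
    hits = [page for page in pages if needle in page.text.casefold()]
    return sorted(hits, key=lambda page: page.first_seen)


# Hypothetical index contents.
pages = [
    Page("https://example.com/review", "Our HIGH TIMBER ALX 29 review", date(2021, 6, 1)),
    Page("https://example.com/product", "High Timber ALX 29 specifications", date(2020, 3, 15)),
    Page("https://example.com/other", "High Timber 27.5 specifications", date(2019, 1, 1)),
]
for page in just_search(pages, "high timber alx 29"):
    print(page.first_seen, page.url)
```

The matching is the easy part. Doing a substring scan over billions of pages is not feasible, so a real implementation would need a positional or n-gram index, which this sketch does not attempt.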
| jeffbee wrote:
| That would be easily the worst search engine ever deployed.
| Imagine just returning all docs containing the word "bicycle"
| in chronological order. Useless.
| willcipriano wrote:
| For "Bicycle" it would suck but I don't often use search
| engines that way, for "High Timber ALX 29" you'd probably get
| something like this:
| https://www.schwinnbikes.com/products/high-timber-alx-29?var...
|
| I wouldn't use it for everything but sometimes that is the
| exact behavior that I want. I'd use duck duck go for more
| general searches.
| jeffbee wrote:
| That is the top hit on google for that search, so what's
| your complaint?
| willcipriano wrote:
| Take a random part number off your car, or a portion of an
| error message and try finding that. It's annoying to have
| to scroll down over a page or two of autogenerated SEO
| answers to get to something useful. The first result to
| appear on the internet is less likely to be SEO and more
| likely to be the manufacturer's documentation or the git
| commit that spawned your error. It isn't always, but
| that's why you have more than one search engine.
|
| Secondarily I think a search engine that is very simple
| in its model and operation is useful for more general
| free speech purposes. If the major search engines decide
| they don't like a site like the pirate bay a search for
| '"Pirate Bay" And "Torrents"' on a search engine that
| does not curate could still get you there. I guess the
| point is without curation you have to work harder to find
| what you want, but nobody is actively preventing you from
| finding anything. It would help keep everybody honest.
| prox wrote:
| Maybe a "stability factor" could be calculated. Whereas earlier
| new content was king, I now value a stable long term source of
| information. So domain age + page age + content variability +
| dependency on ads. That might give more honest sources a go.
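prox's "stability factor" names four inputs but no formula, so the following is purely a guess at what one could look like. The field names, the 10-year cap on age, the 0..1 scaling of content variability and ad dependency, and the equal weights are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    domain_age_years: float
    page_age_years: float
    change_rate: float   # 0..1, share of content that changes between crawls
    ad_fraction: float   # 0..1, share of the page given over to ads


def stability(signals, age_cap_years=10.0):
    """Combine the four signals into a 0..1 score; higher means more stable."""
    domain_age = min(signals.domain_age_years / age_cap_years, 1.0)
    page_age = min(signals.page_age_years / age_cap_years, 1.0)
    steadiness = 1.0 - signals.change_rate
    ad_independence = 1.0 - signals.ad_fraction
    # Equal weights, chosen arbitrarily for this sketch.
    return 0.25 * (domain_age + page_age + steadiness + ad_independence)


old_manual = PageSignals(domain_age_years=15, page_age_years=12,
                         change_rate=0.0, ad_fraction=0.0)
fresh_listicle = PageSignals(domain_age_years=1, page_age_years=0.5,
                             change_rate=0.8, ad_fraction=0.6)
print(stability(old_manual), stability(fresh_listicle))
```

A score like this favors old, unchanging pages by construction, so it would bury current news. That is presumably why it would make more sense as a user-selectable sort order than as a default.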
| willcipriano wrote:
| That's a good idea, I'd make it an option. Do you want newest
| first, oldest first or by stability?
___________________________________________________________________
(page generated 2021-12-02 23:02 UTC)