[HN Gopher] Ask HN: Why do search engines not let you blacklist ...
___________________________________________________________________
Ask HN: Why do search engines not let you blacklist spam domains?
Whenever I'm searching for anything even mildly off the beaten
path, it's not uncommon for the top results to be SEO stuffed spam
websites, or maybe even real websites that I can't access (like
paywalls or requiring adblocker exceptions to proceed). Usually
pages from the same domains are top-ranked for other related
searches too. As a user I'd love to be able to tell my search
engine to "Never show me results from this domain" (similar to
blocking an account on Twitter) - but as far as I can tell there is
no way to do this in either Google or DuckDuckGo search. This
seems like such low-hanging fruit to me that I'm wondering if other
people have ever wanted this, and if there's actually a reason not
to do it.
Author : pketh
Score : 160 points
Date : 2022-03-29 13:56 UTC (9 hours ago)
| g105b wrote:
| I wish I could do this to tell DDG that I don't want to see any
| amazon.com products in my listing. I fell out with Amazon years
| ago and have been shopping independently since, but they have a
| stronghold over search engines with their out-of-stock listings.
| timbit42 wrote:
| Try the uBlacklist plugin mentioned in other comments.
| guelo wrote:
| I want a search engine that lets me exclude sites that have ads.
| Or even just exclude javascript.
| timbit42 wrote:
| You can exclude JS with the NoScript addon.
| beamatronic wrote:
| If search is free, then you are the product. Now if there was a
| paid search engine, and you were the customer, you would expect
| customer service, customization, no ads, privacy protection, etc.
| ncann wrote:
| No ads/no privacy/no customer service are fair game for a free
| service, but there's no reason why it should include no
| customization as well.
| birthdaydog wrote:
| I think people are right that the motivation is profit, but I'm
| not convinced it's to manipulate you in any way. I don't think
| driving you towards SEO'd blog spam is really all that profitable
| to google.
|
| My guess is that it's because it's an abusable feature, and that
| means hiring human moderators for it.
| lprubin wrote:
| How is it abusable if you're only blocking the domains for
| yourself?
| ajsnigrutin wrote:
| ...because pinterest would lose all their pageviews.
| shrikant wrote:
| For the same reason that streaming services don't really give you
| the ability to filter by cast/crew or hide stuff you've already
| seen: to gently guide you into avenues that are more profitable
| for them, regardless of what you say you really want.
| rg111 wrote:
| Prime Video has an option where I can hide specific movies and
| seasons of series.
|
| I use this feature a lot, and it works.
| btgeekboy wrote:
| I just tried searching Netflix, HBO, and Prime for Bruce
| Willis; all showed me his movies. Is that not the filtering
| you're expecting?
| lostcolony wrote:
| That's a fuzzy search. If I search on Netflix for "Bruce
| Willis", the first ten or so items are movies with him. Then
| there are miscellaneous thrillers and whatnot; unless I look
| at the details I don't know that they don't have Bruce Willis
| in them.
|
| A filter gives me a binary outcome; it either has Bruce
| Willis, or it doesn't.
|
| All of which is sort of the OP's point, I think; the search
| engine is fuzzy because it wants to show you many things for
| profit reasons. I don't think it's geared toward maximizing
| profitability at the expense of what you're looking for,
| exactly, so much as engagement as a proxy for profitability.
| The more you use the service, the stickier you likely are as
| a paid subscriber, so they'll happily shovel things that
| don't match the search (but are similar!...and that quickly
| jump the shark into being quite dissimilar, but hey, maybe!)
| to try and keep your eyeballs.
|
| A better demonstration of this handling is searching for
| movie titles they don't have. "The Princess Bride" on
| Netflix, for instance, doesn't even tell you "We don't have
| that", but instead "Explore titles related to: The Princess
| Bride". And while the first suggestion, The Neverending
| Story, is kinda a fair suggestion, some slightly later ones,
| like Top Gun and Zoolander, feel like a stretch.
| klibertp wrote:
| I suspect the GP thought about something more complex, like
| "a movie directed by X or Y between year XXXX and YYYY
| starring Z but not V in the genre ... with tags ... but
| excluding ones tagged with ..." You can probably do this on
| IMDB or something similar, but then you can't watch it there.
|
| I'm largely guessing - I don't watch movies, but I read a lot
| of manga online, and the only site that allows similar
| queries and still has UX a bit better than late '90s (ie.
| mangaupdates.com) is a fan-made (ie. pirated) one with
| porn/hentai...
| Joeri wrote:
| I'm not convinced they are targeting profitability, because
| that is hard to connect back to individual user action. They
| target what is easy to measure at the user level, probably
| engagement, and operate on the assumption that higher
| engagement means higher profit.
| x3ro wrote:
| why do you think it's difficult for google to measure which
| clicks to what websites are profitable for them? maybe i'm
| misunderstanding, but I would expect that this would be one
| of the primary things google tries to infer from you..
| asab wrote:
| I see lots of "google is evil" narrative here... in reality they
| could add a feature to collect spam flags and still disregard
| user preferences. It's just that the data will probably not help
| them, since adversaries are much more motivated to manipulate it
| for SEO profit than the average user who is unlikely to repeat
| that search
| rbinv wrote:
| Google used to have that feature:
| https://www.ghacks.net/2011/03/10/google-adds-block-all-doma...
|
| In my opinion, back then, they needed the data as a training set
| for spammy domain detection. Now that SERP spam is no longer a
| serious issue (in Google's eyes anyway), why bother. Google
| always knows what's best for you.
| ncann wrote:
| Wow, I'm surprised that I never remembered that feature
| existed. I wonder why they removed it, just like why they
| removed the insanely helpful "Cached" feature.
| Blackthorn wrote:
| Cached likely went away for the same reason image search was
| made to suck, legal issues.
| naoqj wrote:
| Cached still exists; press the three dots next to the URL.
| Blackthorn wrote:
| Still exists in some urls. It used to be virtually
| universal.
| naikrovek wrote:
| don't blame google when robots.txt directives are to
| blame.
| Blackthorn wrote:
| I didn't...
| schappim wrote:
| Yeah, as a webmaster you can opt out of cache.
| [deleted]
| poglet wrote:
| On the topic of image search, duckduckgo links directly to
| the image file like Google used to do. It's my go to for
| image search.
| Arnavion wrote:
| >Now that SERP spam is no longer a serious issue (in Google's
| eyes anyway)
|
| Given the number of clones of StackOverflow and GitHub that
| show up on the front page, even at the cost of replacing the
| original SO and GH links that they copied from, I can only
| assume either Google's eyes are blind or the search engine devs
| are so good at their job that they never need to Google
| anything themselves.
| tetsusaiga wrote:
| Another fun one-- next time a public figure dies, especially
| a slightly more obscure one, just google them/their death.
|
| Whole first page is just poorly written fiverr style
| articles, all by people for whom English is clearly a second
| language, at best.
|
| Eventually, legit stuff gets on the first page too, but
| there are a few instances where weeks later the first result is
| still one of these garbage seo-torture sites.
|
| Has really made me lose my faith in google lately.
| heavyset_go wrote:
| The more time you spend looking for the result you want, the more
| ads you will see. Some of those spam domains are filled with ads,
| as well.
| jolmg wrote:
| Adding -site:baddomain.com still seems to work in both Google and
| DuckDuckGo. You should be able to include that in the URL
| template used for search so it gets added to all queries. You can
| build your blacklist that way. E.g.
| https://duckduckgo.com/?t=hk&iar=images&q=-site:pinterest.com+-site:flickr.com+%s
|
| As an aside, I haven't used Google in a while, and I find it
| interesting how the first page shows only like 5 results and at
| the very end. The rest of the page is widgets like "top stories"
| and "people also ask".
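The `-site:` URL-template idea above can be sketched in a few lines of Python. This is only an illustration: the function names and the `BLOCKED` list are hypothetical, and the `-site:` operator works only for as long as the search engine keeps honoring it.

```python
from urllib.parse import quote_plus

# Hypothetical personal blacklist; these are the two domains from the comment above.
BLOCKED = ["pinterest.com", "flickr.com"]


def with_exclusions(query, blocked=BLOCKED):
    """Prefix a query with one -site: operator per blocked domain."""
    exclusions = " ".join(f"-site:{domain}" for domain in blocked)
    return f"{exclusions} {query}".strip()


def search_url(query, base="https://duckduckgo.com/?q="):
    """Build a search URL; quote_plus turns spaces into '+' and keeps ':' literal."""
    return base + quote_plus(with_exclusions(query), safe=":")


if __name__ == "__main__":
    print(search_url("rust iterators"))
```

In a browser, the same effect comes from putting the exclusions in front of `%s` in a custom search-engine URL, as the comment describes.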
| baskethead wrote:
| What if we all started clicking on every ad on every spam page?
| Would they eventually get kicked off all the ad platforms for
| fraud?
|
| The only reason why Google and Facebook work is because their
| algorithms believe that we only click on the ads that we're
| interested in. If we click on every single ad, we will completely
| break their system within 3-6 months.
| kelnos wrote:
| I've long wondered the same! I recently started using Kagi, and
| it does have that feature. I've already blocked several domains
| from search results there, and I think the results pages are
| better (for me, anyway) as a result.
| sytelus wrote:
| Implicit signals are far more significant than explicit. Search
| engines already know which links are not getting clicks, which
| links cause re-queries, and which links cause presses of the
| back button.
|
| People do dumb things if you ask them not to show results from a
| domain or even "report spam". For example, Fox News and CNN will
| all be marked as spam millions of times a day even after people are
| finding what they are looking for.
| layer8 wrote:
| Google in particular doesn't care about what features the small
| minority of power users would want. Otherwise they wouldn't have
| removed so many over the years.
| aaomidi wrote:
| The second this is added, the spam domains will just multiply.
| User interventions against spam don't work.
| Retric wrote:
| Domains aren't free the way sending email is. So, forcing more
| domains would be a significant expense that would seriously
| discourage these companies. Also, they would be easier to
| detect.
| charcircuit wrote:
| It practically is. You just add a new record to create a new
| domain.
| krono wrote:
| uBlock Origin static filters to the rescue!
|
| Block results from specific domains on Google or DDG:
| google.*##.g:has(a[href*="thetopsites.com"])
| duckduckgo.*##.results > div:has(a[href*="thetopsites.com"])
|
| And it's even possible to target element content with regex with
| the `:has-text(/regex/)` selector.
| google.*##.g:has(*:has-text(/bye topic of noninterest/i))
| duckduckgo.*##.results > div:has(*:has-text(/bye topic of noninterest/i))
|
| Bonus content: Ever tried getting rid of Medium's obnoxious
| cookie notification? Just nuke it from orbit on all domains:
| *##body>div:has(div:has-text(/To make Medium work.*Privacy Policy.*Cookie Policy/i))
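For a longer blacklist, the per-domain filter lines above could be generated instead of typed by hand. A minimal sketch, assuming the two selector templates from the comment still match the sites' markup (they may not); `TEMPLATES` and `filters_for` are hypothetical names:

```python
# Cosmetic-filter templates copied from the comment above. The selectors
# (.g, .results > div) depend on each site's current HTML and may break.
TEMPLATES = [
    'google.*##.g:has(a[href*="{domain}"])',
    'duckduckgo.*##.results > div:has(a[href*="{domain}"])',
]


def filters_for(domains):
    """Return one filter line per (domain, template) pair, grouped by domain."""
    return [t.format(domain=d) for d in domains for t in TEMPLATES]


if __name__ == "__main__":
    # Paste the printed lines into uBlock Origin's "My filters" pane.
    print("\n".join(filters_for(["thetopsites.com"])))
```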
| Inviz wrote:
| well that leaves blanks, right?
| krono wrote:
| No, these completely remove any matched results
| smusamashah wrote:
| If first page is all filtered links, we will get an empty
| first page.
| Kiro wrote:
| Yes, and creates blanks. Not visually but where it would be
| 10 results there are now 9.
| rg111 wrote:
| I received an invite within half an hour.
|
| So, I guess it varies.
| samcrawford wrote:
| Filtering out the spam results is only half the problem. In my
| experience, a legitimate site's content is cloned by a spam
| site, and that one appears in a Google search and the
| legitimate one does not. The example that keeps hitting me is
| GitHub Issues.
|
| Filtering out the spam only removes the clones; it doesn't get
| the good results back in.
| jccalhoun wrote:
| it isn't even just 'clones' because so many sites will just
| summarize an article from somewhere else and give a link to
| it. Sometimes it is a game of telephone with one site
| summarizing a 2nd site which is a summary of a 3rd and so on.
| I want a search engine to show me the original source not the
| one with the best SEO
| 3np wrote:
| Host a personal (potentially shared with friends) searx (for
| multi-engine) or whoogle (google only) instance. Filter out
| some domains completely, rewrite others. The rewrite part is
| what allows you to substitute spam clone sites for the real
| deal. At least searx does dedupe already.
|
| The time spent (including maintenance) will be paid back
| faster than you might expect.
|
| Optionally rewrite some sites to altfronts like
| nitter/scribe/piped. If you care about spending time on
| privacy and decoupling searches from visits, you can set up
| arbitrary proxying rules.
|
| One benefit among others over browser extensions is that it's
| a one-time setup for all your devices and clients. All you
| need to do on reinstall is to change the default search
| engine.
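The filter, rewrite, and dedupe steps described above can be illustrated generically. This sketch is not searx's or whoogle's actual configuration or API; the domain names are made up, and the rewrite assumes a clone site mirrors the original's URL paths, which will not always hold.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules, for illustration only.
BLOCK = {"spam-clone.example"}                  # drop these hosts entirely
REWRITE = {"gh-mirror.example": "github.com"}   # map a clone host to the original


def clean(urls):
    """Drop blocked hosts, rewrite clone hosts, and remove duplicate URLs."""
    out, seen = [], set()
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname or ""
        if host in BLOCK:
            continue
        if host in REWRITE:
            # Assumes the clone keeps the original site's path structure.
            url = urlunsplit(parts._replace(netloc=REWRITE[host]))
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out
```

Deduplicating after the rewrite means a clone and its original collapse into a single result, while result order is otherwise preserved.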
| krono wrote:
| Sure, but at least it prevents you from accidentally clicking
| those unwanted results - something I kept doing all the time.
|
| Either way, OP's ask was for a way to blacklist results, and
| I'm providing a method to accomplish exactly that. Edit: The
| rest is up to Google.
| james2doyle wrote:
| Nice! I've used HTML attributes as targets as well. They aren't
| random so they are also easy to target. I use this one on
| Twitter:
| twitter.*##div[aria-label="Timeline: Trending now"]
|
| This will hide the trending tweets box. You can see it's
| targeting the `aria-label` attribute on the `div` element
| DanAtC wrote:
| Kagi allows you to weigh domains.
| dharmab wrote:
| Including fully blocking them if desired!
| qwnp wrote:
| Also you can completely block domains, and pin to the top of
| the results.
|
| I just started using Kagi, and they definitely seem to be doing
| a lot of stuff right.
| Gys wrote:
| I switched to Kagi yesterday and I am so far very impressed.
|
| I did this search to answer another comment and later did the
| Google one:
|
| https://kagi.com/search?q=What+does+that+site+make+you+drag+...
|
| https://google.com/search?q=What+does+that+site+make+you+dra...
|
| Kagi shows the exact answer in the first result. Google gives
| no relevant answers on the first five pages (did not go
| further). Probably because Google focusses on 'highlight' too
| much. Kagi seems very aware of the context.
| luguenth wrote:
| Funny side effect of this post: For me the first link in the
| kagi results is now the link to kagi itself with the same
| search term.
| kelnos wrote:
| Oof, I feel like Kagi should be filtering out results that
| reference itself.
| Kye wrote:
| The pricing (in the FAQ) is a non-starter for me, but there
| are probably enough people with the money for it to make a
| nice business out of it. Maybe it'll be enough to apply some
| competitive pressure to Google.
|
| edit: to be clear, it's not unwillingness, I literally do not
| have $10-30/month to spare without compromising my progress
| toward not being perpetually broke. Not everyone here is a
| well-compensated tech worker.
| dharmab wrote:
| I'm hoping they figure out an easy way to let users charge
| this as a business expense (maybe an annual subscription?)
| For programming it's as valuable to me as a good editor
| plugin!
| pcmoney wrote:
| The quintessential be the customer or the product being
| sold dilemma...
| after_care wrote:
| The pricing is actually one of Kagi's most interesting
| features to me. There's an old phrase "If you aren't paying
| for the service then you're the product". I'd be suspicious
| of any price too low, as it would indicate they have
| another revenue stream.
| Buttons840 wrote:
| I think that saying was updated to "even if you're
| paying, you're still probably the product"? No? There's
| some truth to it, but it's not as simple as these sayings
| suggest.
| KerryJones wrote:
| Ooh, interesting, how do you get an invite?
| theIV wrote:
| kagi.com has a "sign-up for beta" link. Didn't take long
| before I received an invite (week or two?).
| I_dev_outdoors wrote:
| I just signed up myself.
|
| I like how they ask you what features you want as you are
| signing up. I mentioned that I want to be able to vote on
| results and have my own results prioritized by the votes
| of me and my peers.
| jtbayly wrote:
| wondering the same.
| warent wrote:
| seems cool, thanks. I've applied for the beta
| notreallyserio wrote:
| I'm really happy with Kagi. Being able to block and boost
| results is a killer feature. They're also more willing to show
| that they couldn't find results for searches. Compare that to
| Google, whose engineers designed it to (seemingly always)
| exclude the most specific term in a search to return irrelevant
| results.
| [deleted]
| shafyy wrote:
| I've also switched to Kagi recently and I think I find myself
| almost never doing !g (compared to DDG). And if I go to Google,
| it's mostly for hyperlocal results.
| Doctor_Fegg wrote:
| So far I've only found two areas where Google beats Kagi. One
| is, as you say, local/POI search. The other is deep diving
| into family history stuff - I've been tracing a 19th century
| composer and Kagi didn't surface most of the interesting
| documents. But for pretty much everything else it seems
| better than Google.
| freediver wrote:
| Thanks for the feedback! (Kagi founder here)
|
| Local results are on our roadmap, we should be launching it
| within a few weeks.
|
| I am wondering if you can report problematic queries to
| https://kagifeedback.org so we can take a look?
| [deleted]
| drunner wrote:
| https://news.ycombinator.com/item?id=30374905 which in turn
| points to https://github.com/quenhus/uBlock-Origin-dev-filter
|
| Doesn't answer your question. I wish Google would use their
| enormous resources to fight this copy spam (and maybe it's so hard
| that they are, but you can't tell).
| lordgilman wrote:
| I've been using these blacklists with uBlacklist to great effect:
|
| https://github.com/arosh/ublacklist-stackoverflow-translatio...
| https://github.com/arosh/ublacklist-github-translation
| throwaway158497 wrote:
| Because no product manager has been able to push it through
| various layers of bureaucracy (yet). Plus, more customization
| means their ML-based personalized SERP has failed to
| understand your intent/needs.
| sharps1 wrote:
| If I had to guess, it's a lot of data they don't want to store.
|
| Current solution is either ublacklist, or add filters to uBlock
| Origin. Both are linked in the thread
| https://news.ycombinator.com/item?id=29546433
|
| https://github.com/h-matsuo/uBlacklist-subscription-for-deve...
| ncann wrote:
| Data storage can't be the reason. No matter how you implement
| it, the storage required is peanuts compared to things like
| Youtube/Drive/Google Photos.
| [deleted]
| elmerfud wrote:
| It's a simple answer: because you don't pay them for that. They
| run a balancing act of providing you useful results while also
| surreptitiously shoving trash, that they actually make money off
| of, at you and hoping you won't notice or complain.
| dharmab wrote:
| Kagi will be paid when it exits beta and has this feature!
| jbverschoor wrote:
| Which means you can never use it in an incognito session
| without logging in
| dharmab wrote:
| An incognito session provides privacy against other people
| who have casual access to your computer. It provides no
| extra security or privacy in the context of search
| providers.
| CWuestefeld wrote:
| Disagree. When the FBI comes to them asking for a list of
| everyone who's done a search for "pipe bomb", your
| anonymity would have been welcome.
| freediver wrote:
| While there is certainly a possibility of the FBI asking for
| this, I am afraid we couldn't respond to such a request.
| The reason is simply that we do not log or associate searches with
| an account.
|
| Our business model is selling subscriptions, not data.
|
| (Kagi Founder here)
| kelnos wrote:
| I've been using Kagi for the past couple months (and love
| it so far!), but whenever I open a private browsing
| window, I revert back to DDG.
|
| Even though you say you don't associate search queries
| with accounts, it's hard for me to get over the "mental
| hump" to truly feel that in my bones and feel comfortable
| entering a "private" query while being signed in. I know
| it's not entirely rational. I mean, maybe it is a little
| bit rational: I only have your word (some random person
| that I don't know) that you don't -- and, critically,
| will never -- log queries associated with an account. But
| regardless, it's hard to _feel_ private when I'm
| entering something into a form on a website where I'm
| signed in as myself.
| dharmab wrote:
| Kagi doesn't log much identifying information, so the
| government would have to compel them to change their
| system with a national security letter LavaBit style.
|
| https://kagi.com/privacy
| rusk wrote:
| They meant anonymously
| PaulHoule wrote:
| If the search results were perfect you'd never look at an ad.
| beamatronic wrote:
| Sometimes I am searching for information, and I want facts.
| Sometimes I am researching a purchase, and I want the sellers
| to _compete_ for my purchase. Perhaps it makes sense to give
| search engines more context, such as, whether your intent is
| to spend money or not.
| lijogdfljk wrote:
| A good argument for some distributed foss search then i
| guess. Wonder if it could ever be made to have search results
| return reasonably fast in a distributed large index?
| baremetal wrote:
| i would wait longer if the results were better.
| fsflover wrote:
| https://yacy.net
| rc_mob wrote:
| kagi does. i love it. whenever they move from beta to paid I'll
| be a paying customer
| throw_m239339 wrote:
| Because ultimately there is a conflict of interest between you as
| a search user, and Google's actual customers, advertisers who are
| often spammers themselves. Someone needs to pay for google search
| and since it isn't you...
|
| Google might have been better in the past, but since there is
| absolutely no serious competition whatsoever from a market
| perspective, Google technically doesn't really need to care about
| the quality of its search results anymore, only maximizing
| profits.
| MiddleEndian wrote:
| Gotta do it on your end. The uBlackList extension works for
| several search engines in Firefox and Chrome.
| fxtentacle wrote:
| In my opinion, most mainstream websites are spam by now. I
| understand why Google won't let me vote on the issue because
| they'll surely not like the results ;)
| ColinHayhurst wrote:
| Because Google and Bing (therefore Duck) are increasingly answer
| engines, keeping you on their page, supporting the SEO ecosystem
| and most importantly their ad revenues and network customers.
|
| We and the other few independent search engines have not made
| enough dent on the market to suffer SEO spam. We'll have a way to
| deal with it (watch this space). Right now you'll certainly get
| results "off the beaten path" and with one click you can try out
| 8 other search options [0].
|
| [0] https://blog.mojeek.com/2022/02/search-choices-enable-freedo...
| mdasen wrote:
| Not only are they becoming answer engines, in Google's case
| there's a decent chance that they're making money off the
| spammy pages.
|
| If the top results are "SEO stuffed spam websites", they're
| probably also loaded with ads. If they're loaded with ads,
| there's a good chance that they're using Google's ads.
|
| If the top results are "SEO stuffed spam websites", there's a
| good chance they're chock-full of affiliate links. If a search
| for "best baby formula" is going to end up costing Amazon an
| affiliate fee via an "SEO stuffed spam website", Amazon might
| as well just buy a Google ad for that keyword and cut out the
| "SEO stuffed spam website". If the results go to pages that are
| just going to cost a seller money anyway, it gives the seller
| more incentive to buy ads since they aren't getting free
| traffic from the search engine anyway.
|
| While being an answer engine keeps you on their page longer,
| feeding you SEO spam also keeps you coming back to their page;
| feeding you SEO spam signals to potential advertisers that they
| won't be getting free traffic from the search engine so they
| might as well pay for the traffic via ads.
|
| I'm not suggesting that it's a conspiracy to send you bad
| results, but it does seem likely that as long as they aren't
| losing traffic to competitors, it might not be something that
| becomes a priority.
| ColinHayhurst wrote:
| > If the top results are "SEO stuffed spam websites", they're
| probably also loaded with ads. If they're loaded with ads,
| there's a good chance that they're using Google's ads.
|
| Bingo. If you can't get clicks on AdWords get them on
| AdSense.
| csmeder wrote:
| My go-to test search term is: "grass-fed beef restaurant in sf"
|
| You do much better than Google (Google will always include "Top
| 10 Best Grass Fed Organic Steak in San Francisco, CA" from Yelp
| and then link to places that don't have Grass-fed beef
| options.)
|
| However, currently, your first ranking option is Pinterest spam
| FYI:
| https://www.mojeek.com/search?q=grass+fed+beef+restaurant+in...
| -
| https://www.mojeek.com/search?q=grass+fed+beef+restaurant+in...
|
| Your second option is the correct kind of result (a blog from a
| local that actually answers the question):
| https://www.grassfedgirl.com/paleo-friendly-restaurants-in-s...
|
| Most of the results on Google, by contrast, are not correct. They are
| mostly articles about Steaks (and a few actual restaurants that
| serve grass-fed beef, so that is good). Actually, FYI your
| results don't include these restaurants. eg. It would be nice
| if this Google result showed up in your results:
|
| "SF / SOMA - Belcampo We source grass-fed and finished,
| pasture-raised meats directly from our own climate positive CA
| farms and seasonal vegetables from local farms."
| ColinHayhurst wrote:
| Thanks, useful feedback.
|
| As you point out #1 organic link on G is Yelp. They currently
| block all but G, Bing and Yahoo! - we'll get in touch with
| them again. https://www.yelp.com/robots.txt
|
| Organic link #3 on G is Belcampo; we have some of that
| indexed so we'll take a look:
| https://www.mojeek.com/search?q=food+site%3Abelcampo.com
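How a robots.txt file can admit some crawlers and shut out the rest, as described above, can be checked with Python's standard `urllib.robotparser`. The rules below are hypothetical and are not a copy of Yelp's or any other site's file; `ExampleBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; NOT taken from any real site.
RULES = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, rules=RULES):
    """Return True if the given crawler may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "ExampleBot"):
        print(bot, allowed(bot, "https://example.com/biz/some-page"))
```

A crawler that honors robots.txt and is not named in an allow group falls through to the `*` group and has to skip the site, which is why an independent index can miss pages that Google shows.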
| stonemetal12 wrote:
| >we'll get in touch with them again.
|
| Feel free not to. Follow the link, it isn't a quality
| result.
| ColinHayhurst wrote:
| Fair point. Some other big sites we are blocked from - we
| don't sweat. It keeps down the noise-to-signal ratio!
___________________________________________________________________
(page generated 2022-03-29 23:01 UTC)