[HN Gopher] Filters to block and remove copycat-websites from Du...
___________________________________________________________________
Filters to block and remove copycat-websites from DuckDuckGo,
Google and other
Author : gleb_the_human
Score : 145 points
Date : 2022-02-17 16:27 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| sebazzz wrote:
| I use Kagi as a search engine and can just block the site from
| the search results.
| wanderingmind wrote:
| Kagi filters are great for programming, but still evolving for
| others. I still see a lot of pinterest results. You can block
| domains by adding them to Kagi blocklist through Settings ->
| Personalized Results -> Blocked Domains.
| lolinder wrote:
| I started using Kagi recently, and so far haven't had to block
| a single site. Their filters are great!
| xarope wrote:
| I've been recently searching for some very specific keyword
| stuff, and bumping into a lot of sites which seemed like just
| reformatted copy-n-paste of stackoverflow and various mailing
| lists, adding zero value and clogging up the top 100 search
| results.
|
| Now that I see the HUGE number of copycat sites in the
| stackoverflow_copycats.txt file, I am beginning to understand
| what's going on.
|
| Thanks!
| Kovah wrote:
| Does anyone have an idea how to make this work in Brave without
| uBlock? I added the block list to custom filters
| (brave://adblock/) but results for those spam sites are still
| shown in DuckDuckGo.
| aunty_helen wrote:
| I've been using uBlacklist, which adds a little "block this site"
| button to the Google search results. Handy.
|
| It is a browser extension and I haven't looked too deeply into it
| so if that's important to you perhaps have a browse over their
| repo etc before installing.
| OGWhales wrote:
| I use that too, great for blocking pinterest from flooding all
| of your google image search results.
| reillyse wrote:
| I really don't like these websites and I smash that back button
| as soon as I realize I've landed on one.
|
| That said I'm amazed they are still showing up at the top of
| google search. My understanding was that that kind of behavior
| (which I think at least some other people do too) combined with
| the fact that they are just copying another much higher page
| ranked website would mean that they are highly unlikely to rank
| above the relevant stack overflow article that they are duping.
| So what is happening here?
| giancarlostoro wrote:
| Reminds me of going to Google Images, and getting sent to
| Pinterest... which is not where the image is sourced out of.
| hughrr wrote:
| Pinterest is one of those sites that really makes me want to
| strangle someone. It's just an abhorrent walled garden of
| other people's property.
| giancarlostoro wrote:
| If I were on Pinterest looking up things, fine. But I'm on
| Google, not trying to find a mirror of what I want, I want
| what I want.
|
| Edit: I have a friend who works there, but not as an
| engineer, haha. I'm pretty sure I've told him my woes with
| pinterest. My wife loves pinterest though. It allows her to
| come up with amazing design ideas and art ideas.
| blacksmith_tb wrote:
| I am not sure how GOOG weighs what happens after you click on a
| result, it would be clever of them to notice how quickly you
| click on another result for the same search and slightly
| downgrade the first link (though, what happens if you open the
| first three links into new tabs before you actually visit them,
| say). My assumption was that they just counted clicks as an
| upvote, so if these scammers can make it into the first page of
| results, they will tend to stay.
| ChefboyOG wrote:
| For a long time now, Google has weighted behavioral signals
| similar to what you describe. "Bounce Rate" is the percentage
| of users who quickly leave your site after clicking. "Dwell
| Time" is the amount of time a user spends on a page.
|
| There's even a cottage industry around gaming these signals.
| See SerpClix and the like.
| ajsnigrutin wrote:
| So how come pinterest is still on top with many searches?
| jonas21 wrote:
| Why not? Pinterest is a popular site, with lots of
| content all linked to each other. Many people probably
| spend a long time there after clicking a result.
| dylan604 wrote:
| The inordinate amount of time to click away all of the
| dark ui login screens just to see the content before
| making the decision it's not what you wanted already
| increased the dwell time to longer than other sites.
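The two signals ChefboyOG defines can be illustrated with a small calculation. This is only a sketch under stated assumptions: the `Visit` record, the 10-second bounce cutoff, and the sample data are hypothetical, and nothing here reflects how Google actually measures or weights these signals.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """One click from a results page to a site (hypothetical log record)."""
    site: str
    seconds_on_page: float


# Assumed cutoff: a visit shorter than this counts as a "bounce".
BOUNCE_THRESHOLD_SECONDS = 10.0


def bounce_rate(visits, threshold=BOUNCE_THRESHOLD_SECONDS):
    """Fraction of visits that ended before `threshold` seconds."""
    if not visits:
        return 0.0
    bounces = sum(1 for v in visits if v.seconds_on_page < threshold)
    return bounces / len(visits)


def mean_dwell_time(visits):
    """Average number of seconds spent on the page per visit."""
    if not visits:
        return 0.0
    return sum(v.seconds_on_page for v in visits) / len(visits)


# Made-up sample: three quick exits and one long stay.
visits = [
    Visit("copycat.example", 3.0),
    Visit("copycat.example", 5.0),
    Visit("copycat.example", 40.0),
    Visit("copycat.example", 2.0),
]
print(bounce_rate(visits))      # 0.75
print(mean_dwell_time(visits))  # 12.5
```

Note how a single long visit pulls the mean dwell time above the bounce cutoff even though most visitors left quickly, which is the kind of distortion dylan604 describes with login screens.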
| xenadu02 wrote:
| This is not even the first or second time Google has rolled
| out changes that allowed SEO spam sites copying Stackoverflow
| or Wikipedia to rank higher than the original.
|
| They did fix this at one point in time by figuring out which
| site posted the content first and penalizing the copycats,
| but it appears the fix is once again broken.
| reillyse wrote:
| I figure they must just be monitoring the original content
| and republishing it before it's indexed by google. The
| searches are so specific and niche that generally ranking
| isn't hard; it's beating the OG that's hard.
|
| I just don't know how they are managing to get indexed
| before the big name established sites. Perhaps they are
| succeeding on some small percentage and that is what we are
| seeing?
|
| Perhaps they have an additional trick to make it look like
| they posted the content first, perhaps internal links or
| something.
| dtech wrote:
| I find that hard to believe, the SO questions are often
| years old, the GH ones months.
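The "who posted it first" idea xenadu02 mentions can be sketched as follows. This is a toy illustration, not Google's method: the `(url, text, first_seen)` tuples, the example URLs, and the exact-match fingerprint are all assumptions, and hashing normalized text only catches verbatim copies, not reformatted or lightly edited ones.

```python
import hashlib
from collections import defaultdict


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def split_originals_and_copies(pages):
    """Group pages by content fingerprint; the earliest-seen page in each
    group is treated as the original, the rest as copies.

    `pages` is an iterable of (url, text, first_seen) tuples, where
    `first_seen` is any comparable timestamp (hypothetical crawl data).
    """
    groups = defaultdict(list)
    for url, text, first_seen in pages:
        groups[fingerprint(text)].append((first_seen, url))

    originals, copies = [], []
    for entries in groups.values():
        entries.sort()  # earliest first_seen comes first
        originals.append(entries[0][1])
        copies.extend(url for _, url in entries[1:])
    return originals, copies


# Made-up example data.
pages = [
    ("https://original.example/q/1", "How do I  reverse a list?", 100),
    ("https://copycat.example/1", "how do i reverse a list?", 250),
    ("https://other.example/a", "Unrelated text", 50),
]
originals, copies = split_originals_and_copies(pages)
print(originals)  # ['https://original.example/q/1', 'https://other.example/a']
print(copies)     # ['https://copycat.example/1']
```

As dtech points out, the originals are often years older than the copies, so a first-seen timestamp would separate them easily if the crawler recorded it reliably.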
| [deleted]
| [deleted]
| dawnerd wrote:
| FYI whoever made this, you can create clickable links to import
| filters. For example:
| https://subscribe.adblockplus.org/?location=https://raw.gith...
|
| Quick edit: I know the domain is ABP but ublock origin picks it
| up.
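A subscribe link of the kind dawnerd describes can be assembled by URL-encoding the list address into the `location` query parameter shown in the link above. A minimal sketch: the list URL below is a hypothetical placeholder, not the real address of this filter list.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SUBSCRIBE_BASE = "https://subscribe.adblockplus.org/"


def subscribe_link(list_url):
    """Build a clickable subscribe link for a filter list URL."""
    return SUBSCRIBE_BASE + "?" + urlencode({"location": list_url})


# Hypothetical list URL, used only for illustration.
list_url = "https://lists.example/filters/all.txt"
link = subscribe_link(list_url)
print(link)

# Round-trip check: decoding the query string gives back the list URL.
decoded = parse_qs(urlsplit(link).query)["location"][0]
print(decoded == list_url)
```

Encoding the inner URL matters once the list address carries its own query string; without it, the inner `?` and `&` would be read as part of the outer link.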
| poulpy123 wrote:
| Great, I will try that soon. These websites are infuriating.
| pajko wrote:
| Another useful extension like this is
| https://iorate.github.io/ublacklist/
| Melatonic wrote:
| Anybody know of a way I could bulk import these into NextDNS?
| cmroanirgo wrote:
| Missing from the title is:
|
| > _Specific to dev websites like StackOverflow or GitHub._
|
| Before I noticed that, I had searched for pinterest and found
| nothing. Even marking the HN title with "dev" would be good.
|
| If this were my list I'd add w3schools because to me, it's low
| quality, especially compared to mozilla.
| hlbjhblbljib wrote:
| > I had searched for pinterest and found nothing
|
| So it's working as intended and blocking low effort spam sites
| oxguy3 wrote:
| Cool idea! I was surprised that Wikipedia mirrors aren't
| included, as I encounter them constantly and they drive me
| bonkers. I opened an issue:
| https://github.com/quenhus/uBlock-Origin-dev-filter/issues/2...
| ummonk wrote:
| Yeah it's really frustrating when I read a poorly sourced
| Wikipedia article and I'm trying to search for other sources on
| the claims in the article but all I get is clones of the
| Wikipedia article.
| nhoughto wrote:
| Making it this obvious how to set this up made me finally do it,
| no more junk results (well less junk..). Thanks!
| willis936 wrote:
| Bless you. This took 30 seconds to put on my phone and laptop and
| has already improved my results so much.
| mcfedr wrote:
| I never understood why Google isn't blocking these crap results;
| it's making my search experience really bad for a lot of my
| searches.
| brimble wrote:
| Do they do a good job at getting clickthroughs on Google ads on
| their site? :-/
|
| Does the rate of ad-clicking on the results page increase if
| most of the "natural" results are crap? :-(
| lumost wrote:
| I've noticed a recent trend where the copycat/adware sites
| are "up-ranked" relative to original content. This would be
| the expected behavior of a search engine optimizing for
| clicks and revenue.
| slig wrote:
| They used to be superb at detecting duplicated content. They
| also were extremely good at detecting spam/ham. Nowadays it
| feels like they don't even care anymore and whatever filters
| they have are either broken or untrained.
| ahelwer wrote:
| Looks great and much better than my piecemeal efforts, although I
| recommend linking to a specific commit of all.txt so you aren't
| opening up your browser's ublock origin filter list to arbitrary
| remote control. Like:
|
| https://raw.githubusercontent.com/quenhus/uBlock-Origin-dev-...
| Quenhus wrote:
| As the author of the filter, I strongly agree with you.
| However, I believe it would be too tedious for most people to
| update the filter "by hand". I think I'm going to add this
| important security information in the README.
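The pinning ahelwer recommends relies on raw.githubusercontent.com URLs taking the form `/{owner}/{repo}/{ref}/{path}`, where the ref can be a commit SHA instead of a branch name. A minimal sketch of building such a URL: the owner, repository, SHA, and path below are hypothetical placeholders, not the real values for this filter list.

```python
import re

RAW_HOST = "https://raw.githubusercontent.com"
FULL_SHA = re.compile(r"[0-9a-f]{40}")


def pinned_raw_url(owner, repo, commit_sha, path):
    """Return a raw-file URL pinned to one commit.

    Rejects anything that is not a full 40-character hex SHA, so a moving
    ref such as a branch name cannot be passed by accident.
    """
    if not FULL_SHA.fullmatch(commit_sha):
        raise ValueError("expected a full 40-character commit SHA")
    return f"{RAW_HOST}/{owner}/{repo}/{commit_sha}/{path.lstrip('/')}"


# Hypothetical values, used only for illustration.
example_sha = "0123456789abcdef0123456789abcdef01234567"
url = pinned_raw_url("example-owner", "example-repo", example_sha, "all.txt")
print(url)
```

The trade-off is the one Quenhus raises: a pinned URL never changes, so new entries only arrive when the subscriber switches to a newer commit by hand.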
| btdmaster wrote:
| To be fair, they are quite nice when the official website is down
| or blocked...
| kipchak wrote:
| Google's cached version of pages can be another useful option,
| if you click on the ellipses to the right of a search result's
| address and then "cached" in the bottom right hand corner of
| the "About this result" box.
| userbinator wrote:
| Came here to post a similar sentiment. I've rescued very useful
| content from mirroring sites that was gone from the original.
| You can filter them out if you want, but don't forget you're
| doing that or you may not find what you're after.
| Jerry2 wrote:
| ... or DMCA'd.
|
| Over the past year, I've noticed that quite a few repos that I
| used to track have disappeared. I keep a local bookmarks list
| now because if a "starred" project is removed or DMCA'd, GitHub
| does not tell you about it and they remove any mention of the
| repo from the "starred" list.
___________________________________________________________________
(page generated 2022-02-17 23:00 UTC)