[HN Gopher] Ask HN: Someone is proxy-mirroring my website, can I...
       ___________________________________________________________________
        
       Ask HN: Someone is proxy-mirroring my website, can I do anything?
        
       Hi Hacker News community,  I'm trying to deal with a very
       interesting (to me) case. Someone is proxy-mirroring all content of
       my website under a different domain name.  - Original:
       https://www.saashub.com  - Abuser/Proxy-mirror:
       https://sukuns.us.to  My ideas of resolution:  1) Block them by IP
       - That doesn't work as they are rotating the IP from which the
       request is coming.  2) Block them by User Agent - They are
       duplicating the user-agent of the person making the request to
       sukuns.us.to  3) Add some JavaScript to redirect to the original
       domain-name - They are stripping all JS.  4) Use absolute URLs
       everywhere - they are rewriting everything www.saashub.com to their
       domain name.  i.e. I'm out of ideas. Any suggestions would be
       highly appreciated.  p.s. What is more, Bing is indexing all of
       SaaSHub's content under sukuns.us.to ¯\_(ツ)_/¯. I've reported a
       copyright infringement, but I have a feeling that it could take
       ages to get resolved.
        
       Author : stanislavb
       Score  : 388 points
       Date   : 2022-12-12 08:00 UTC (15 hours ago)
        
       | lessname wrote:
       | If they copy the Javascript too, you could add code that checks
       | whether the domain matches and, if not, shows a blank page (or
       | something like that)
        
       | frankzander wrote:
       | Had the same problem. They used a scraper which runs on Amazon
       | AWS ... so I blocked all Amazon AWS IPs (google for the list of
       | IPs ... and then for a script which creates the NGINX rules for
       | all the IPs for you). Works quite well.
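A minimal sketch of that rule-generation step, assuming the structure of AWS's published ip-ranges.json (the fetch and the nginx include path are illustrative):

```python
import json

def nginx_deny_rules(ip_ranges):
    """Turn AWS's ip-ranges.json structure ({'prefixes': [{'ip_prefix':
    CIDR, ...}, ...]}) into deduplicated 'deny <cidr>;' lines for nginx."""
    cidrs = sorted({p["ip_prefix"] for p in ip_ranges.get("prefixes", [])})
    return ["deny %s;" % cidr for cidr in cidrs]

# Fetch https://ip-ranges.amazonaws.com/ip-ranges.json, parse it with
# json.load, write the result to a file, and include it from nginx, e.g.:
#   include /etc/nginx/aws-deny.conf;
```

Re-running this on a schedule keeps the list current, since AWS rotates ranges.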
        
       | trinovantes wrote:
       | They are probably using some public cloud service so simply
       | banning all IPs from cloud ASNs [1] will usually be enough.
       | Downside is you're also banning any users using VPNs
       | 
       | [1] https://github.com/brianhama/bad-asn-list
        
         | stanislavb wrote:
         | Thanks, that seems like something I could work on if I can't
         | find a better solution. Cheers.
        
         | gary_0 wrote:
         | Another resource that can be used to check for abusive client
         | IPs is https://github.com/firehol/firehol
        
       | halifaxbeard wrote:
       | Setup Cloudflare on the domain and turn on "bot fight mode".
       | 
       | If the TLS ciphers the client proposes for negotiation don't
       | align with the client's User-Agent, they get a CAPTCHA.
       | 
       | I would suspect that whoever is doing this proxy-mirroring isn't
       | smart enough to ensure the TLS ciphers align with the User-Agent
       | they're passing through.
        
         | strictnein wrote:
         | This is the correct first step.
        
         | nezirus wrote:
         | I would agree with the above, as an easier version of TLS
         | fingerprinting. One could also use nginx/haproxy to extract
         | enough TLS info and detect requests coming through the proxy.
         | Magic search string: JA3 fingerprint
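For reference, a JA3 fingerprint is just an MD5 digest over fields parsed from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats). A sketch of the computation, with illustrative field values:

```python
import hashlib

def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: five comma-separated fields, each list
    joined with '-' (decimal values as observed in the ClientHello)."""
    lists = (ciphers, extensions, curves, point_formats)
    return ",".join([str(tls_version)] +
                    ["-".join(str(v) for v in vals) for vals in lists])

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """The JA3 fingerprint is the MD5 hex digest of that string."""
    s = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(s.encode()).hexdigest()
```

A proxy built on a generic HTTP library will produce one stable JA3 no matter which browser User-Agent it forwards, which is exactly the mismatch to key on.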
        
         | supriyo-biswas wrote:
        
       | Fire-Dragon-DoL wrote:
       | Could you prepend full urls to your website on assets and trigger
       | cors requests on all assets? That would make it really annoying
       | to proxy
        
       | politelemon wrote:
       | Add a link rel="canonical" to your pages as well; it should give
       | search engines a hint that your domain is the legit one.
       | 
       | https://webmasters.stackexchange.com/questions/56326/canonic...
       | 
       | I noticed that the other domain is hotlinking your images. So you
       | can disable image hotlinking, by only allowing certain domains as
       | the referers. If you block hotlinked images then the other domain
       | will not look as good. Remember to do it for SVGs too.
       | 
       | https://ubiq.co/tech-blog/prevent-image-hotlinking-nginx/
       | 
       | Finally I also see they are using a CDN called Statically to host
       | some assets off your domain. You can block their scrapers by user
       | agent listed here:
       | 
       | http://statically.io/docs/whitelisting-statically/
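nginx's valid_referers directive implements the hotlink check natively; the logic amounts to something like this (the allowed host list is an example):

```python
from urllib.parse import urlparse

ALLOWED_REFERER_HOSTS = {"www.saashub.com", "saashub.com"}  # adjust to taste

def allow_image_request(referer_header):
    """Allow image requests with no Referer at all (direct visits, strict
    privacy settings) or with one from our own domain; block the rest."""
    if not referer_header:
        return True
    host = urlparse(referer_header).hostname or ""
    return host in ALLOWED_REFERER_HOSTS
```

Note that allowing empty referers is the usual compromise: blocking them also breaks legitimate visitors whose browsers suppress the header.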
        
         | stanislavb wrote:
         | I think they are replacing all mentions of saashub.com with
         | their domain. Also, I'm not using statically.io, that's
         | something they are prepending in front of all images.
         | Automatically.
        
           | CGamesPlay wrote:
           | But Statically isn't forwarding the User-Agent of the
           | visitor, and they publish the list of User-Agents that they
           | use, which you can block.
        
           | politelemon wrote:
           | It's adding the CDN for some of the images but not all of
           | them, so you'd have to cover both
        
           | matt_heimer wrote:
           | Sometimes the replacement is done with simple pattern
           | matching. Try different forms of encoding you domain to see
           | if you can get through their replacement.
        
       | cph123 wrote:
       | I used to do this to other websites (we won't go into why) - one
       | thing that may help you is to always return your HTML responses
       | gzipped, regardless of whether the client asked for them or not,
       | so ignore Accept-Encoding preferences. This makes it harder for
       | their server to rewrite your URLs on demand, and most clients
       | will accept gzipped responses.
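A framework-agnostic sketch of that idea (header names are standard HTTP; wiring it into your server is left out):

```python
import gzip

def gzip_html_response(html):
    """Always gzip the HTML body and label it as such, ignoring the
    client's Accept-Encoding. Nearly every real browser copes fine; a
    naive rewriting proxy must now decompress, rewrite, and recompress
    every single response to keep working."""
    body = gzip.compress(html.encode("utf-8"))
    headers = {
        "Content-Encoding": "gzip",
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    return headers, body
```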
        
       | btbuildem wrote:
       | Lots of good suggestions here, let me throw one more in the pot
       | -- could you do an equivalent of a "ghost ban"?
       | 
       | Instead of blocking their IPs, detect if the traffic is coming
       | from the abuser's IPs, and serve different content -- blank,
       | irrelevant, offensive, copyright violations, etc.
        
       | SergeAx wrote:
       | Does it hurt you in any way? If not - I would just leave it
       | alone. Google can tell a copy from original. I tried to search
       | some arbitrary text from your website - there is no trace of
       | copycat in Google SERP.
       | 
       | What struck me, though, is that the copycat website is waaaay
       | faster than your original. If I were in your shoes, I would
       | invest my time and effort into speeding up the site. Unlike
       | hunting some script kiddies, that will bring palpable benefits.
        
         | zmmmmm wrote:
         | Bing is directing searches for their service to the fake web
         | site, which is then serving up porn after a few seconds delay.
         | I'd say it is hurting them.
        
       | ycommentator wrote:
       | Lots of great ideas here. A slight variation or emphasis on some:
       | Specifically aim to advertise your own site on the other one.
       | While you can anyway. Free advertising to their (should be your)
       | audience, in return for what they're doing... Seems fair!
        
       | Ra8 wrote:
       | You could maybe put your website behind a captcha? Google's
       | reCAPTCHA works behind the scenes, so it won't affect normal
       | users.
        
       | pera wrote:
       | *.us.to are FreeDNS subdomains, I would contact them.
       | Additionally you could do a whois and contact the ISP.
        
       | someweirdperson wrote:
       | If direct (non-proxied) access from the search engine spiders
       | can be identified, serve the real robots.txt; otherwise serve
       | one that disables crawling. A meta noindex tag can be switched
       | the same way.
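A sketch of that switch, assuming you already have a way to verify that a spider is hitting you directly (the sitemap URL is illustrative):

```python
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
REAL_ROBOTS = ("User-agent: *\nAllow: /\n"
               "Sitemap: https://www.saashub.com/sitemap.xml\n")

def robots_txt(is_verified_spider):
    """Serve the real robots.txt only to crawlers verified as direct
    (e.g. via reverse-DNS checks); everyone else, including the proxy,
    gets a blanket Disallow that the mirror will happily serve under
    its own domain, telling Bing not to index it."""
    return REAL_ROBOTS if is_verified_spider else BLOCK_ALL
```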
        
       | signaru wrote:
       | Some "less technical" suggestions after being inspired by other
       | creative suggestions here:
       | 
       | Put brandings/personalizations/signatures in your pages that are
       | not easy to remove automatically. Include your site URL
       | if possible. The idea is that if a visitor sees these on a
       | different site, it becomes obvious that the content doesn't
       | belong there.
       | 
       | Write an article page about these things happening, specifically
       | mentioning the mirroring site URLs, and see if they will also
       | blindly mirror it.
        
       | [deleted]
        
       | NorwegianDude wrote:
       | Add the worst content imaginable to the page, but don't make it
       | visible by default. If the site strips JS, then use CSS to only
       | show the terrible content when it's shown on that domain. You can
       | use css to check the current domain based on e.g. links.
       | 
       | Extra points if you can cause legal trouble for whoever runs the
       | site. If you're hosting rather large files, then you can also
       | hide content by default that will never be loaded on your site,
       | but will load on the other site. Add a large file to your site,
       | then reference that file a few thousand times with query params
       | to ensure cache busting, and then make the browser load it all
       | using CSS when it detects that it runs on the other site.
        
       | notorandit wrote:
        
       | musabg wrote:
       | 1. Create a fake URL endpoint and visit that endpoint on the
       | adversary's website; when your server gets the request, flag the
       | IP. Do this nonstop with a script.
       | 
       | 2. Create fake html elements and put unique strings inside. And
       | you can search that string in search engines for finding similar
       | fake sites on different domains.
       | 
       | 3. Create fake html element and put all request details in
       | encrypted format. Visit adversary's website and look for that
       | element and flag that ip OR flag the headers.
       | 
       | 4. Buy proxy databases, and when any user requests your webpage,
       | check if it's a proxy.
       | 
       | 5. Instead of banning them, return fake content (fake titles and
       | fake images etc) if proxy is detected OR the ip is flagged.
       | 
       | 6. Don't ban the flagged IPs. They're gonna find another one.
       | Make them and their users angry so they give up on you.
       | 
       | 7. Maybe write some bad words to the user in random places in
       | the HTML when you detect flagged IPs :D The users will leave the
       | site and this will reduce the SEO score of the adversary. They
       | will be downranked.
       | 
       | 8. Enable image hotlinking protection. Increase the cost of
       | proxying for them.
       | 
       | 9. Use @document CSS to hide the stuff when the URL is different.
       | 
       | 10. Send abuse mail request to the hosting site.
       | 
       | 11. Send abuse mail request to the domain provider.
       | 
       | 12. Look for the flagged IPs and try to find the proxy provider.
       | If you find, send mail to them too.
       | 
       | Edit: More ideas sparked in my mind while I was in the toilet:
       | 
       | 1. Create fake big css files (10MB etc). And repeatedly download
       | that from the adversary's website. This should cost them too much
       | money on proxies.
       | 
       | 2. When you detect a proxy, return huge fake HTML files (10 GB,
       | etc). That could crash their server if they load the HTML into
       | memory when parsing.
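Ideas 1 and 5 above can be combined into one flow. A sketch (the honeypot path and TTL are made up, and a real version would persist flags somewhere shared rather than in process memory):

```python
import time

flagged_ips = {}       # ip -> time it was flagged
FLAG_TTL = 24 * 3600   # forget flags after a day

def is_flagged(remote_ip):
    t = flagged_ips.get(remote_ip)
    return t is not None and time.time() - t < FLAG_TTL

def handle_request(path, remote_ip):
    """Honeypot sketch: /totally-normal-page is linked nowhere except in
    the requests we make to the mirror ourselves, so any IP asking for
    it must be the proxy's egress address. Flagged IPs then get fake
    content instead of a ban, so the operator sees nothing obvious."""
    if path == "/totally-normal-page":
        flagged_ips[remote_ip] = time.time()
    if is_flagged(remote_ip):
        return "<html>fake content for the proxy</html>"
    return "<html>real content</html>"
```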
        
         | eloff wrote:
         | Seems like a good use case for a zip bomb. Return some tiny
         | gzipped content that expands to 1gb.
        
           | christophilus wrote:
           | Yeah. Their proxy is parsing the HTML and stripping it /
           | modifying it, so they're obviously unzipping the responses on
           | their servers. Create the honeypot endpoint, and if you get a
           | request from that endpoint, reply with a zip bomb.
           | 
           | Then, write a little script that repeatedly hits that
           | honeypot URL. I quite like this idea.
        
             | eloff wrote:
             | Awesome, do post a follow-up on HN, I want to hear how this
             | war with the proxy asshats plays out.
        
         | jwsteigerwalt wrote:
         | I really like #9, this seems like a simple way to make your
         | site unusable except via the methods you desire.
        
         | mkoryak wrote:
         | I like how you think. These are all great ideas!
         | 
         | Reminds me of a time some real estate website hotlinked a ton
         | of images from my website. After I asked them to stop and they
         | ignored me I added an nginx rewrite rule to send them a bunch
         | of pictures of houses that were on fire.
         | 
         | For some reason they stopped using my website as their image
         | host after that.
        
           | pwdisswordfish9 wrote:
           | #5 and #6 are key. Don't try to block them directly, just get
           | them delisted. When you've worked out a way to identify which
           | requests belong to the scammer, feed them content that the
           | search engines and their ad partners will penalize them for.
        
           | davidrupp wrote:
           | Bummed that I can upvote this only once. Excellent work.
        
           | graderjs wrote:
           | LOL! Thank you for the laugh. This is great.
        
           | smaudet wrote:
           | What was the primary motivator to do this?
           | 
           | I'm curious if they are stealing anything else, e.g. are they
           | selling ads/tracking, do they replace order forms with their
           | own...
        
             | mkoryak wrote:
             | because I asked them to stop doing it, and they didn't.
             | Technically they were stealing my bandwidth.
             | 
             | Also to teach them an important lesson about the internet.
        
               | Firmwarrior wrote:
               | haha, they're just lucky you didn't introduce them to
               | Goatse
        
               | mkoryak wrote:
               | well actually...
               | 
               | there was another time a site hotlinked to a js file.
               | After asking them to stop, i found that they had a
               | contact form with a homebrew captcha which created the
               | letters image like http://evilsite.com/cgi-
               | bin/captcha.jpg?q=ansr
               | 
               | A little while later, their captcha form had a hidden
               | input appended with the correct answer value, and the
               | word to solve was changed to a new 4 letter word from a
               | dictionary of interesting 4 letter words. The form still
               | worked because of the hidden input. I might have changed
               | the name on the "real" input also.
        
           | egberts1 wrote:
           | What a sure-fire way to toast them! Kudos!
        
           | spmurrayzzz wrote:
           | Signal boosting suggestion #1 here. Great idea.
           | 
           | Additionally, if they decide to blackhole the fake/honeypot
           | URL, since you mentioned they pass along the user agent, you
           | could mix a token into a randomized user-agent string that
           | your own scraper uses, so that you can duck-type the request
           | on your end and capture the egress IP.
        
         | tgtweak wrote:
         | Going defcon3 on proxies
         | 
         | You can also write some obfuscated inline JavaScript that
         | checks the current hostname and compares to the expected one
         | and redirects when not aligned.
        
           | aembleton wrote:
           | They are stripping all JS.
        
         | aliswe wrote:
         | Why return big files when you can return small files at
         | excruciatingly slow speeds? modems are hot again!
        
           | luch wrote:
           | that's probably the best advice. Instead of denying the
           | proxy, just make it shitty to use for the end-user.
        
         | scarmig wrote:
         | > Create fake big css files (10MB etc). And repeatedly download
         | that from the adversary's website. This should cost them too
         | much money on proxies.
         | 
         | Doesn't that also cost you an equal amount? You'll be serving
         | them an equal amount that they proxy to the end user.
         | 
         | It's not even necessarily a cost for them; you're assuming that
         | the host is owned and paid for by the abuser. If it's simply
         | been hijacked (quite possible), you're just racking up costs
         | for another victim.
        
         | blantonl wrote:
         | Shadow nefarious techniques are the best. Don't give them clear
         | indications that there is a problem.
         | 
         | For example, I had an app developer start stealing API content,
         | so once I determined points to key from them, instead of
         | blocking them I simply randomized the API content details
         | returned to their user's apps.
         | 
         | Hey, API calls look good, the app looks like it is working, no
         | problem right? Well, the users of the app were pissed and the
         | negative reviews rolled in. It was glorious.
        
           | kokekolo wrote:
           | Serious question -- is there a way to defend from this
           | "stealing the API" thing? E.g. building an authentication of
           | some sort and then including a key with your app?
        
             | supriyo-biswas wrote:
             | Of course HN doesn't like anything that's reminiscent of
             | DRM, but Apple's App Attest and Google's Play integrity API
             | can help dispense online services to valid clients only.
        
         | dspillett wrote:
         | _> Maybe write some bad words to the user on random places in
         | the HTML_
         | 
         |  _> Create fake big css files (10MB etc). And repeatedly
         | download that from the adversary 's website. This should cost
         | them too much money on proxies._
         | 
         | Be careful when doing things like this, including the shock
         | image option mentioned in other comments, as then it could
         | become an arsehole race with them trying to DoS your site in
         | retribution. Then again, going through more official channels
         | could also get the same reaction, so...
         | 
         |  _> When you detect proxy, return too big fake HTML files
         | (10GB) etc. That could crash their server if they load the HTML
         | into the memory when parsing._
         | 
         | Make sure you are set up to always compress outgoing content,
         | so that you can send GBs of mostly single-token content with
         | MBs of bandwidth.
        
         | stanislavb wrote:
         | Oh, I love these. I will use some of them. Many thanks!
        
         | habibur wrote:
         | point no.1 will do. that's the solution.
        
         | rgrieselhuber wrote:
         | Any recommendations on proxy database providers?
        
           | gary_0 wrote:
           | http://iplists.firehol.org/ looks free and very
           | comprehensive. It has whole bunch of sub-lists of IPs that
           | are likely to be sources of abuse, including datacenters and
           | VPNs, and it gets updated frequently. Github:
           | https://github.com/firehol/firehol
        
         | auselen wrote:
         | Fake 10GB html can be a zip bomb?
        
         | [deleted]
        
         | spiffytech wrote:
         | > 5. Instead of banning them, return fake content (fake titles
         | and fake images etc) if proxy is detected OR the ip is flagged.
         | 
         | > 6. Don't ban the flagged IPs. They're gonna find another
         | one. Make them and their users angry so they give up on you.
         | 
         | There's a popular blog that no longer gets linked on HN.
         | 
         | The author didn't like the discussions HN had around his
         | writing, so any visitors with HN as the referer are shown
         | goatse, a notorious upsetting image, instead of the blog
         | content.
        
           | GTP wrote:
           | Out of curiosity, which blog are you talking about?
        
             | ignoramous wrote:
             | https://news.ycombinator.com/from?site=jwz.org
        
             | [deleted]
        
           | mschuster91 wrote:
           | Goatse? I assume you're referring to jwz - that blog shows a
           | testicle in an egg cup if it sees a HN referrer.
        
             | spiffytech wrote:
             | Yeah, jwz. Looks like I got mixed up - goatse has been a
             | popular choice for this kind of thing, but jwz went with a
             | different image.
             | 
             | Fortunately, there are many upsetting images for the OP to
             | choose from!
        
           | someweirdperson wrote:
           | Does anyone not have their referer header suppressed or
           | faked?
        
         | [deleted]
        
         | MadVikingGod wrote:
         | I remember years ago there was a way to DDoS a server by
         | opening the connection and sending data REALLY slowly, like 1
         | byte a second. I wonder if there is a way to do the opposite
         | of that, where every request is handed off to a worker which
         | responds slowly enough to keep the connection alive. I doubt
         | this can scale well, but just a thought.
        
           | macNchz wrote:
           | The "opposite" thing you're describing sounds like a tarpit:
           | https://en.m.wikipedia.org/wiki/Tarpit_(networking)
        
           | ambicapter wrote:
           | Slow loris attack
           | https://en.wikipedia.org/wiki/Slowloris_(computer_security)
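The reverse tarpit described above can be sketched as a streaming response body that dribbles the payload out chunk by chunk (the delay parameter is injectable so the logic can be tested without real waiting):

```python
import time

def tarpit_body(chunks, delay=1.0, sleep=time.sleep):
    """Generator for a streaming HTTP response body: emit one small
    chunk per `delay` seconds, keeping the proxy's connection (and a
    slot in its worker pool) occupied for the whole transfer."""
    for chunk in chunks:
        sleep(delay)
        yield chunk
```

Most frameworks accept a generator as a response body, so this plugs in wherever you already detect flagged traffic. The scaling caveat above is real: each tarpitted request also holds one of your connections open.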
        
         | rich_sasha wrote:
         | I read once a suggestion to serve gzipped responses which,
         | gzipped, are tiny, but un-gzipped are enormous. Like GBs of
         | 0s.
         | 
         | Not sure how you actually do it and if it serves your purpose
         | but sounded neat.
        
           | e1g wrote:
           | It's called a "zip bomb" (popularized by Silicon Valley [1]),
           | and there is a good guide (and pre-generated 42kB .zip file
           | to blow up most web clients) at
           | https://www.bamsoftware.com/hacks/zipbomb/
           | 
           | [1] https://www.youtube.com/watch?v=jnDk8BcqoR0
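The same trick works with plain gzip at the HTTP layer, since long runs of zeros compress at roughly 1000:1 (sizes here are illustrative):

```python
import gzip
import io

def gzip_bomb(uncompressed_size):
    """Compress `uncompressed_size` zero bytes: about 1 GB of zeros fits
    in roughly a megabyte of gzip. Served with Content-Encoding: gzip, a
    client that naively inflates the body into memory eats the full
    expanded size."""
    buf = io.BytesIO()
    chunk = b"\0" * (1 << 20)  # 1 MiB of zeros per write
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
        for _ in range(uncompressed_size // len(chunk)):
            gz.write(chunk)
    return buf.getvalue()
```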
        
         | geocrasher wrote:
         | Passive Aggressive FTW. These are all fantastic ideas.
        
         | DoctorOW wrote:
         | In my search for this I found @document isn't well supported
         | [0]. I suggested something like:
         | 
         |     a[href*="sukuns.us.to"] {
         |       display: none;
         |     }
         | 
         | Then use SRI to enforce that CSS.
         | 
         | [0]: https://caniuse.com/mdn-css_at-rules_document
        
           | ignoramous wrote:
           | If they're rewriting html, I guess sanitizing css won't be
           | beyond them.
        
           | JohnAaronNelson wrote:
           | Seems like it would be fairly easy to use this attribute
           | selector and apply it to every element on the page, making
           | them show up as empty to the user
        
             | DoctorOW wrote:
             | You could add a data attribute to the html tag of the
             | document with the current URL, i.e.
             | 
             |     <html data-path="https://www.saashub.com/about">
             | 
             | then hide the full page with:
             | 
             |     html { display: none; }
             |     html[data-path*="saashub.com"] { display: block; }
        
               | emsixteen wrote:
               | This seems quite elegant and easy. Obviously in addition
               | to other measures, but I like it.
        
               | DoctorOW wrote:
               | Honestly this is my favorite HN post in a while I've had
               | a lot of fun thinking over this challenge.
        
               | asciii wrote:
               | I'm with you, too!
        
           | sublinear wrote:
           | I know this is just a game that never ends, but if they're
           | already rewriting the HTTP requests what's stopping them from
           | rewriting the page contents in the response?
           | 
           | SRI is for the situation where a CDN has been poisoned, not
           | this.
        
             | DoctorOW wrote:
             | It might not explicitly be what SRI is meant for but it'll
             | narrow the proxy's options to:
             | 
             | A. Blank page
             | 
             | B. Let the find and replace update the CSS. Generate new
             | hashes in the HTML.
             | 
             | C. Find someone new to pick on.
             | 
             | B is time and potentially computationally expensive, so it
             | makes C a better option.
        
               | sublinear wrote:
               | A doesn't work because B doesn't prevent the attacker
               | from regexing out the hash altogether and changing the
               | domain name in the tags to their own.
        
           | ChrisMarshallNY wrote:
           | How about something like...
           | 
           |     body[href*="<OFFENDING URL>"] {
           |       background-image: url("http://goatse...");
           |     }
           | 
           | Ala: http://ascii.textfiles.com/archives/1011
        
             | petepete wrote:
             | Or just make the whole page rotate
             | 
             |     body[href*="<OFFENDING URL>"] {
             |       animation: rotation 20s infinite linear;
             |     }
             |     @keyframes rotation {
             |       from { transform: rotate(0deg); }
             |       to { transform: rotate(359deg); }
             |     }
        
             | hbn wrote:
             | We're trying to punish the people running the proxy mirror,
             | not the users who stumble upon them just trying to use the
             | site
        
               | LawTalkingGuy wrote:
               | You could look at it as trying to get them blocked by
               | search engines. Can you detect when they're proxying a
               | search bot as opposed to a user? As for punish, you don't
               | have to make it eye-bleach, just enough to make it firmly
               | NSFW so nobody can get any business value from it, or
               | even use it safely at work.
               | 
               | A little soft NSFW would also greatly accelerate them
               | being added to a block list, especially if you were to
               | submit their site to the blocklists as soon as you
               | started including it. You can include literally anything
               | that won't get you arrested. Terrorist manifestos, the
               | anarchists cookbook, insane hentai porn... Use all those
               | block categories - gore/extreme, terrorist, adult, etc.
        
               | ChrisMarshallNY wrote:
               | In that case, write some JS, that wanders around the
               | Hubble site, randomly downloading full-res TIFF images
               | for the background, or that randomly displays Disney
               | images.
        
         | LinuxBender wrote:
         | These are the best ideas, especially SEO poisoning and
         | alternate images. If their point is to steal content and
         | rankings then poisoning the well should discourage this in the
         | future. I suspect their actual goal is to have a low-effort
         | high SEO site to abuse as a watering hole for phishing attacks.
         | 
         | As a side note, their domain is linked in this thread so they
         | are seeing HN in their access logs and probably reading this.
         | It should make for an interesting arms race. Or red/blue team
         | event.
        
           | IMSAI8080 wrote:
           | They said the attacker was passing through the client's user
           | agent. If they get a user agent that is GoogleBot, they could
           | check if the requesting IP is actually a valid Google data
           | centre (there is a published list of IPs). If the IP is not
           | Google directly, they could return a blank page therefore
           | causing Google to index nothing through the mirrored site.
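Google documents exactly this verification: reverse-resolve the IP, require a googlebot.com or google.com hostname, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch with injectable resolvers so the logic is testable without DNS:

```python
import socket

def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname):
    """Reverse/forward DNS check per Google's published guidance. Any
    request claiming a Googlebot User-Agent that fails this check is a
    proxy or an impostor and can safely be served a blank page."""
    try:
        host = reverse(ip)[0]
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return forward(host) == ip
    except OSError:
        return False
```

Bing publishes an equivalent verification scheme for Bingbot (search.msn.com reverse DNS), which matters here since Bing is the engine indexing the mirror.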
        
             | LinuxBender wrote:
             | This is a good idea, though it may be short lived since the
             | attackers are likely reading this due to the referrers in
             | the logs. They may add an ACL to counter this but it might
             | be interesting to see how long that works.
        
       | cweagans wrote:
       | Just block all of the OVH IP ranges?
       | 
       | https://ipinfo.io/AS16276
        
       | amelius wrote:
       | Generate your pages from Javascript.
        
       | m1sta_ wrote:
       | Dynamically generate all the content in the browser to a canvas
       | element. No HTML to steal.
       | 
       | More simply, you could just make all the HTML links broken
       | unless some obfuscated or server-backed algorithm is run on
       | them. Think Google search results.
        
       | 0xbkt wrote:
       | You may be able to identify those requests by inspecting the
       | TLS cipher. Cloudflare Workers has that value in
       | `request.cf.tlsCipher`[0]. Keep in mind the collateral damage
       | it may have, though.
       | 
       | [0] https://developers.cloudflare.com/workers/runtime-
       | apis/reque...
        
       | adql wrote:
       | When someone did it to us we replaced the served content with ads
       | for our site.
        
         | jansan wrote:
         | That sounds like the most elegant approach so far.
        
       | 3nt0py__ wrote:
       | well... the only reasonable thing to do is to find a host that
       | accepts Monero as payment, rent a baremetal server with IPMI
       | access, encrypt the hard disk with LUKS and VeraCrypt, scan
       | 0.0.0.0/0 for unpatched DNS servers, and start DDoSing the
       | mirror site
        
       | mortehu wrote:
       | Can you use the fact that they're proxying to prove to Bing and
       | Google webmaster tools that you own their domain, and delist it?
       | The verification is done by serving a file provided by
       | Bing/Google.
        
         | shhsshs wrote:
         | If they're proxying /.well-known/acme-challenge/, you should
         | be able to get a TLS certificate in their name through Let's
         | Encrypt.
        
       | khiqxj wrote:
       | I can't see your site. The mirror/proxy is useful and doing its
       | job.
       | 
       | Access denied Error code 1020
       | 
       | You do not have access to www.saashub.com.
       | 
       | The site owner may have set restrictions that prevent you from
       | accessing the site.
       | 
       | Error details Caret icon Was this page helpful?
       | 
       | Performance & security by Cloudflare External link
        
       | than3 wrote:
       | Have you considered enabling HSTS on the webserver, along with
       | dynamic endpoints and rate limiting after detecting a flagged
       | IP, literally to a crawl?
       | 
       | I seem to recall someone doing something similar at one point
       | hosting files and setting up resources that get pulled down only
       | on flagged IPs such as a 300kb gzip encoded file that tries to
       | expand to 100TB.
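The "small file that expands enormously" trick relies on deflate's roughly 1000:1 ceiling for runs of identical bytes. A minimal Python sketch (sizes are illustrative; you would serve the bytes with a `Content-Encoding: gzip` header, and only to flagged IPs):

```python
import gzip
import io

def make_gzip_bomb(uncompressed_mb):
    """Compress a long run of zero bytes; gzip collapses it dramatically,
    so the client burns far more memory/CPU inflating it than we spend
    serving it."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as f:
        chunk = b"\0" * (1024 * 1024)  # 1 MB of zeros per write
        for _ in range(uncompressed_mb):
            f.write(chunk)
    return buf.getvalue()

bomb = make_gzip_bomb(100)  # 100 MB uncompressed, ~100 KB on the wire
```

Genuinely huge payloads (the 100TB figure) use nested or crafted streams rather than one flat gzip, but the flat version is already painful for a naive proxy that inflates responses to rewrite them.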
        
       | nickphx wrote:
       | What about blocking by TLS fingerprinting? Established browsers
       | have known fingerprints derived from how the TLS handshake is
       | made, supported connection options, etc.
        
       | marginalia_nu wrote:
       | Look at your traffic logs and see if you can't fingerprint the
       | scraper. Should be relatively easy since they're mirroring your
       | entire site.
       | 
       | Then instead of blocking the fingerprint, poison the data.
       | Introduce errors that are hard to detect. Maybe corrupt the URLs,
       | or use the incorrect description or category. Be creative, but
       | make it kind of shit.
       | 
       | It's easy to work around blocks. Working around poisoned data is
       | much harder.
        
         | buro9 wrote:
         | This... there are definitely aspects of the proxy that they
         | aren't configuring or are unaware of.
         | 
          | i.e. ssl_cipher, http_x_requested_with, http_accept... and the
          | order of all headers supplied... the casing of all headers
          | supplied... the TLS ClientHello.
         | 
          | It is relatively easy, if you have enough signals, to
          | essentially create a fingerprint whose workings they won't
          | understand, yet which will be effective at blocking them
          | regardless of the IP.
         | 
         | Once you add enough of these together it will be hard for them
         | to get around it without being obvious as they do so.
         | 
         | Super aggressive... those same fingerprints will reveal legit
         | browser traffic and the fingerprints for things like Google-
         | bot... so you could go towards a whitelist rather than
         | blocklist. But this is a place you'd have to actively manage as
         | new variations arise constantly.
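A rough sketch of combining such signals into a fingerprint. The header sequences below are illustrative; in practice you would feed in whatever your server or CDN exposes (raw header order, casing, TLS cipher):

```python
import hashlib

def request_fingerprint(headers, tls_cipher=""):
    """Fingerprint a request by header *order* and *casing*, ignoring
    values. Real browsers emit headers in a stable, canonical order;
    many proxies normalize or reorder them, which shifts the hash."""
    signal = "|".join(name for name, _ in headers) + "#" + tls_cipher
    return hashlib.sha256(signal.encode()).hexdigest()[:16]

# Illustrative: a Chrome-like client vs. a proxy that lowercased
# and reordered everything.
chrome_like = [("Host", ""), ("Connection", ""), ("User-Agent", ""),
               ("Accept", ""), ("Accept-Encoding", ""), ("Accept-Language", "")]
proxy_like = [("host", ""), ("user-agent", ""), ("accept", ""),
              ("accept-language", ""), ("accept-encoding", ""), ("connection", "")]

assert request_fingerprint(chrome_like) != request_fingerprint(proxy_like)
```

You would collect fingerprints of known-good browsers and known crawlers from your logs, then decide per-fingerprint whether to serve, block, or poison.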
        
           | RockRobotRock wrote:
           | This is some really cool anti-scraping inside baseball. Is it
           | safe to say that Cloudflare uses these techniques for weeding
           | out bots?
        
       | heartbeats wrote:
       | What about steganography?
       | 
       | If you change subtle details about spelling, spacing, formatting,
       | etc by the source IP, then you can look at one of their pages and
       | figure out which IP it was scraped from.
       | 
       | Then, just add goatse to all pages requested by that IP.
       | Alternatively, replace every other sentence with GPT-generated
       | nonsense.
       | 
       | EDIT: it should be quite easy to use JS to fingerprint the
       | scraper. The downside is that you will also block all NoScript
       | users.
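One way to sketch the per-IP marking without touching visible spelling is zero-width characters carrying an HMAC tag of the requester's IP (the secret and addresses are placeholders):

```python
import hmac
import hashlib

SECRET = b"rotate-me-often"  # hypothetical server-side key

# Two zero-width characters stand in for bits 0 and 1; invisible in a
# rendered page, but present in the HTML the mirror re-serves.
ZW = {"0": "\u200b", "1": "\u200c"}
ZW_CHARS = set(ZW.values())

def tag_for(ip):
    """32-bit HMAC tag of the client IP, rendered as zero-width chars."""
    digest = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()
    bits = bin(int(digest[:8], 16))[2:].zfill(32)
    return "".join(ZW[b] for b in bits)

def watermark(html, ip):
    # Prepend the invisible tag; it survives naive find-and-replace.
    return tag_for(ip) + html

def identify(page, logged_ips):
    """Extract the embedded tag from a mirrored page and match it
    against candidate IPs from your access logs."""
    embedded = "".join(c for c in page if c in ZW_CHARS)
    return next((ip for ip in logged_ips if tag_for(ip) == embedded), None)

page = watermark("<html>...</html>", "203.0.113.7")
assert identify(page, ["198.51.100.2", "203.0.113.7"]) == "203.0.113.7"
```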
        
       | coding123 wrote:
       | Their domain is likely to be considered hostile, and drop in
       | Google results.
        
       | __oh_es wrote:
       | Facebook had multiple approaches to keep users seeing ads (ie
       | your message) on their site despite ad blockers. Could you mix a
       | message amongst some content mixed up with some elements?
       | Hopefully would not affect rankings too much, but could at least
       | reach users. https://www.bbc.co.uk/news/technology-46508234.amp
       | 
       | Base64 encoding images with watermarks may also be worth a shout.
       | 
       | Love the zip bombing.
       | 
       | Long shot but I wonder if its possible to execute some script on
       | their server.
        
       | shahidkarimi wrote:
       | What an idea. I will do the same with some popular websites.
        
       | gildas wrote:
       | Block all requests having "https://sukuns.us.to" as "Referer"
       | HTTP header.
        
         | pastacacioepepe wrote:
         | Requests are proxied so the proxy can rewrite the Referer HTTP
         | header at will, AFAIK.
        
           | gildas wrote:
           | It looks like they're also downloading images directly from
           | your domain, I see https://www.saashub.com/images/app/service
           | _logos/129/k2q4pxz... for example in my debugger.
           | 
            | Edit: you could maybe add a <meta> tag to define a CSP, but
            | I guess they will remove it [1].
           | 
           | [1] https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP
        
       | sagarpatil wrote:
       | What's the motivation for someone to proxy mirror a site?
       | 
       | Does copied content even rank in Google? How are they driving the
       | traffic to it?
        
         | nigma1337 wrote:
         | Motivation is money, the proxy site is serving their own ads.
         | 
          | According to OP, it ranks pretty well on Bing.
        
       | [deleted]
        
       | batch12 wrote:
       | Have you looked at filtering the traffic by ASN? You may be able
       | to identify the provider your adversary likes to use and apply
       | some of the controls musabg suggested to any traffic sourcing
       | from these networks.
       | 
       | I have a website doing this to one of my domains. I have let it
       | slide for now since I get value out of users that use their site
       | too, but I have thought about packing their content with
       | advertisements to turn the tables a bit.
        
       | acomjean wrote:
       | DMCA takedown?
       | 
       | They're serving your copyrighted content. Seems like what it was
       | made for.
        
       | 55555 wrote:
       | Anyone have any idea what this does? It's embedded in the copycat
       | site's source:
       | earlierindians.com/39faf03aa687eeefffbe787537b56e15/invoke.js
        
         | mhasbini wrote:
         | Deobfuscated
         | https://gist.github.com/mhasbini/97b471911866e7c7016fa6f27e7...
        
           | 55555 wrote:
           | Cool bot detection!
        
       | nstart wrote:
       | Sending a mail to the hosting provider is helpful. Also, if you
       | are looking at blocking IP you can try blocking an entire ASN
       | temporarily to see how that works. It's one thing for someone to
       | destroy a server and reimage it on the same service. It's another
       | thing to destroy it and bring it up on a fresh provider.
       | Currently the attacker is using dedipath for example. Block the
       | ASN while waiting for their abuse team to respond.
        
       | tacone wrote:
       | Some years ago I let my blog domain expire, only to find out that
       | somebody bought it and was serving a mirror of my content plus
       | scam ads. I reported them to their DNS provider and they were
       | gone in 2-3 weeks.
        
       | scarmig wrote:
       | This is a game of cat and mouse; although engineering approaches
       | are fun, it's primarily an organizational/legal challenge, not a
       | red/blue team exercise.
       | 
       | The first line of defense is contacting the relevant authorities.
       | This means search engines, the hosting provider, and the owner of
       | the domain (who may not be the abuser). Be polite and provide
       | relevant evidence. Make it easy for them to act on it. There'll
       | be some turnaround time and it's not always successful, but it's
       | the best way to get a meaningful resolution to the issue.
       | 
       | What about in the meantime? If all the source IPs are from one
       | ASN, just temporarily block all IPs originating from that ASN.
       | There'll be some collateral damage, but most of your users won't
       | be affected.
        
       | dutchbrit wrote:
       | Warning: Don't visit the proxy mirror at work, I was redirected
       | to xcams/adult content.
        
         | geoah wrote:
         | weird, wasn't the case here.
        
         | agotterer wrote:
         | The mirror is injecting their own ads. My guess is it was just
         | a malicious ad forcing the redirect. It could still happen, but
         | it doesn't appear to be the main intention of the mirror site.
        
       | sdrinf wrote:
       | If the host (DediPath) is not respecting DMCA notices, one other
       | thing you can do is to add the requester's IP address to every
       | page, e.g. as a div class. If the responses are live proxied,
       | this will surface the cloner's front-facing IP address, and you
       | can block that (and their ASN) specifically.
        
         | stanislavb wrote:
         | I'm not sure I can understand your advice.
        
           | fendy3002 wrote:
            | That's clever; it took me a while to understand.
            | 
            | Say your website shows the IP of whoever visits it in a text
            | box. When you access it, it shows your IP. When the proxy
            | server accesses it, it shows the proxy server's IP. So when
            | you access the proxy site, the proxy site will fetch your
            | site, get a page with its own IP in the text box, and return
            | that page to you.
            | 
            | The more advanced method is to encrypt the IP and hide it
            | somewhere in the page, so that later you can decrypt it, get
            | the IP, and blacklist them.
        
           | ktpsns wrote:
            | I think it works as follows: assuming the proxy connects
            | from a different IP than its client, by inserting the IP it
            | uses to connect to the origin server into the HTTP reply
            | (HTML/body), that IP can be exposed to the OP. However,
            | since he seems to have access logs and seems to understand
            | the proxy requests pretty well, I wonder how it actually
            | helps.
        
           | pornel wrote:
           | Add a comment (or attribute or JS with a string literal) to
           | your HTML that contains IP address of whoever requested the
           | page. Obscure it somehow so it's not obvious that the HTML
           | contains the IP address. Then check source code of the copy,
           | and you'll see who requested it. You can then go after that
           | IP.
           | 
           | BTW: if they're removing/replacing domain name of your site,
           | try obscuring it with HTML entities. This may dodge simple
           | find'n'replace.
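The entity-obfuscation idea in Python (a sketch; any template filter could do the same):

```python
import html

def entity_encode(text):
    """Encode every character as a decimal HTML entity. Browsers render
    it identically, but a literal find-and-replace on the domain string
    won't match it."""
    return "".join(f"&#{ord(c)};" for c in text)

encoded = entity_encode("www.saashub.com")
assert "saashub" not in encoded                      # naive rewrite misses it
assert html.unescape(encoded) == "www.saashub.com"   # browser sees the original
```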
        
         | MildlySerious wrote:
          | To extend on this: I wouldn't use clear text. Create an HMAC
          | of the IP and add it somewhere in the page; that makes it
          | harder to realize what's happening and for the adversary to
          | work around it.
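A minimal sketch of the HMAC-of-IP marker (the secret, class-name scheme, and addresses are made up):

```python
import hmac
import hashlib

SECRET = b"server-side-secret"  # hypothetical; never shipped to the page

def ip_token(ip):
    """Short keyed hash of the client IP; meaningless without the key."""
    return hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:12]

def render(body, client_ip):
    # An innocuous-looking class name that actually encodes the requester.
    return f'<div class="v-{ip_token(client_ip)}">{body}</div>'

def recover_ip(mirrored_page, logged_ips):
    """Match the token found on the mirror against IPs from your logs."""
    return next((ip for ip in logged_ips if ip_token(ip) in mirrored_page), None)

page = render("hello", "198.51.100.23")
assert recover_ip(page, ["203.0.113.5", "198.51.100.23"]) == "198.51.100.23"
```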
        
         | filleokus wrote:
         | Oh, I like this idea. Would be pretty easy to automate it by
         | setting up some script scraping the IP revealed on their site,
         | adding it to the block list as they rotate around. Clever.
        
           | than3 wrote:
            | Wouldn't they be able to do the same to prevent you from
            | scraping their site? They may have many IPs to work with,
            | but you may not.
        
       | psyfi wrote:
       | Fail2ban might help.
       | 
       | Even with IP rotation, a proxy website would probably generate
       | more traffic than normal from those few IPs. Tweak the fail2ban
       | variables to make it less likely to trigger on false positives
       | (a larger number of requests over a larger window of time), but
       | block the violating IPs for a long period, a few days for
       | example.
       | 
       | I hope it helps.
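A sketch of what that could look like as fail2ban configuration. The filter name, regex, thresholds, and log path are all hypothetical; real values depend on your log format and traffic profile:

```ini
# /etc/fail2ban/filter.d/proxy-mirror.conf  (hypothetical filter)
[Definition]
# Count every GET per client IP; the jail thresholds below decide
# when volume becomes abusive.
failregex = ^<HOST> .* "GET .* HTTP/1\.[01]"

# /etc/fail2ban/jail.d/proxy-mirror.conf
[proxy-mirror]
enabled  = true
port     = http,https
filter   = proxy-mirror
logpath  = /var/log/nginx/access.log
# High threshold over a 10-minute window to avoid false positives,
# but a multi-day ban once triggered, as the comment suggests.
maxretry = 300
findtime = 600
bantime  = 259200
```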
        
       | meltedcapacitor wrote:
       | Use them as a free CDN? A user's page from your domain actually
       | downloads content through them, but with your ads. (Maybe for
       | the continents which are not your primary market.)
       | 
       | (Less economical if they're not caching anything.)
        
       | cheri9 wrote:
        
       | zupa-hu wrote:
       | You can overload their servers as they pass on any URL to you.
       | Just make sure their resource usage is significantly more than
       | yours. E.g. serve a gzipped response that contains millions of
       | copies of the same URL to your site. Ideally, make it not fit
       | in their memory. Or, simply large enough to take really long to
       | compute.
       | 
       | Put it under a URL only you know, then start DoS-ing it.
       | 
       | Of course that requires you to be able to serve a prepared
       | gzipped response; it depends on your stack.
        
       | tekno45 wrote:
       | Block everyone else except them and start hosting disney content.
       | Then give the mouse a ring
        
       | mouzogu wrote:
       | add some CSS to mess with their URLs
       | 
       | a[href*="sukuns"] { font-size: 500px!important; color:
       | LimeGreen!important; }
       | 
       | pretty much destroys the page. i guess eventually they would give
       | up in the specificity battle.
       | 
       | probably more stuff you could do with CSS to mess with them.
        
       | rglover wrote:
       | You might be able to filter on the headers of requests to your
       | backend (https://developer.mozilla.org/en-
       | US/docs/Web/HTTP/Headers/X-...). Check that the x-forwarded-for
       | header is what you expect and, if not, block the request at the
       | middleware level.
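A sketch of that check as WSGI middleware. The header name is standard, but the blocklist and policy here are illustrative:

```python
BAD_IPS = {"203.0.113.5"}  # hypothetical addresses identified from logs

def forwarded_for_filter(app):
    """WSGI middleware sketch: reject the request when any hop in the
    X-Forwarded-For chain matches a known-bad address."""
    def wrapped(environ, start_response):
        chain = environ.get("HTTP_X_FORWARDED_FOR", "")
        hops = {h.strip() for h in chain.split(",") if h.strip()}
        if hops & BAD_IPS:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return wrapped
```

Note that X-Forwarded-For is client-controlled unless set by a trusted proxy in front of you, so this only works behind your own edge (e.g. Cloudflare).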
        
       | werid wrote:
       | .us.to subdomains come from the (dynamic) DNS provider FreeDNS:
       | https://freedns.afraid.org/
        
       | modeless wrote:
       | HN probably won't like this, but if they are stripping all JS you
       | can make all content invisible with CSS and use JS to unhide it
       | before page load finishes. Temporarily of course until these guys
       | go away.
       | 
       | The nice thing about this is it can be made arbitrarily complex.
       | For example you can make the page actually blank and fetch all
       | the normal, real content with JS after validating the user's
       | browser as much as you like on both client and server. That's
       | what Cloudflare's bot shield stuff does. Since JS is Turing
       | complete there is no shortcut that the proxy can take to avoid
       | running your real JS if you obfuscate it enough. They would have
       | to solve the halting problem.
       | 
       | What a determined adversary would do is run your code in a
       | sandbox that spoofs the URL, so then your job becomes detecting
       | the sandbox. But it's unlikely they would escalate to this point
       | when there are so many other sites on the internet to copy.
        
         | SpeedilyDamage wrote:
         | HN contains multitudes, I love this response.
         | 
          | At the very least you collect info about their sophistication
          | level; will they adapt to adversity or will they bail/move on?
        
           | modeless wrote:
           | I say that because I know there are a lot of people on HN who
           | browse with JS off and rail against sites that require it.
           | But sometimes you need it.
        
         | jeroenhd wrote:
         | I very much would leave the website if I'm opening the site for
         | the first time and it doesn't even render partially, but I
         | recognise people like me are the minority of potential
         | visitors.
         | 
         | Unless you have a tiny target audience that includes people who
          | tend to disable Javascript, this seems like a fine solution. I
          | use JS to hide email addresses from the most basic scrapers on
          | most of my sites, and so far it has worked wonders.
         | 
         | However, long term this seems like a solution that will be
         | difficult to maintain. Either the proxy will start stripping
         | CSS as well (s/hidden/visible) or it'll require constantly
         | playing along in a cat-and-mouse game that you don't stand to
         | gain anything in by winning.
         | 
         | I would add random, page-like endpoints that only you know and
         | request them through their proxy (through VPNs/TOR/you name
          | it). Ban the entire /48 (IPv6) or /24 (IPv4) or send them into
          | a tar pit (iptables -A INPUT -p tcp -m tcp --dport 80 -j
          | TARPIT for the IP address you target) to exhaust their
          | resources.
        
         | ww520 wrote:
       | I personally like to use JavaScript on websites and am
       | indifferent to the no-JS movement. I think the assumption that
       | HN is no-JS comes from a vocal few.
        
       | tyingq wrote:
       | Other posts mentioned ways to detect them, like:
       | 
       |   - "bait urls" that other crawlers won't touch
       |   - trigger by request volume and filter out legit crawlers
       |   - find something unique about the headers in their requests
       |     that you can identify them with
       | 
       | One additional suggestion is to not block them, but rather, serve
       | up different content to them. Like have a big pool of fake pages
       | and randomly return that content. If they get a 200/OK and some
       | content they are less likely to check that anything is wrong.
       | 
       | Another idea is to serve them something that you can then report
       | as some type of violation to Google, or something (think
       | SafeSearch) that gets their site filtered.
        
       | satya71 wrote:
       | We had a similar situation, though it was just a snapshot. We
       | found out they were hosting using S3, and filed a DMCA request
       | with AWS. It was taken down and hasn't returned.
        
       | MohammadAZ wrote:
       | I'm not going to focus on the problem here. I just want to say
       | that I like the idea behind your website, a good source of market
       | research. Bookmarked it.
        
       | thenickdude wrote:
       | Maybe they're also proxying URLs like the HTML verification files
       | that search engines have you upload to claim the domain as your
       | own?
       | 
       | You may be able to claim their domain out from under them and
       | then mess with search settings (e.g. In Google Search Console you
       | can remove URLs from search results).
        
       | henryackerman wrote:
       | I wonder what they get out of this. Injecting ads perhaps?
        
         | kjs3 wrote:
         | Landing place for a phishing campaign. See: watering hole
         | attack.
        
         | chillfox wrote:
          | Yep, they inject ads.
        
         | askiiart wrote:
         | I saw another comment saying that the site sometimes redirects
         | to adult content sites.
        
       | wpietri wrote:
       | One strategy tip: don't play cat and mouse. As you've
       | demonstrated, if you change one thing, they will figure it out
       | and change one thing. Not only does that not work, but you are
       | training them that it's worth trying to beat your latest change.
       | 
       | Instead, plot a few different changes and throw them in all at
       | once. Preferably in a way where they will have to solve all of
       | the changes at the same time to figure out what happened and get
       | things working again. Also, favor changes that are harder to
       | detect. E.g., pure IP blocks are easier to detect than tarpitting
       | and returning fake/corrupted content. The longer their feedback
       | loops, the more likely it is that they'll just give up and go be
       | a parasite somewhere else.
        
         | yonixw wrote:
          | I love it. Also add randomness; there is nothing more
          | frustrating than a problem that only reproduces sometimes!
        
           | wpietri wrote:
           | Excellent idea!
        
         | DamnInteresting wrote:
          | > _pure IP blocks are easier to detect than tarpitting and
          | returning fake/corrupted content_
         | 
         | I recently had to employ such a strategy against some extremely
         | aggressive card testers (criminals with lists of stolen credit
         | cards who automate stuffing card info into a donation form to
         | test which cards are still working). Instead of blocking their
         | IPs, I started feeding them randomly generated false responses
         | with a statistically accurate "success" rate. They ran tens of
         | thousands of card tests over many days, and 99% of the data
         | they collected was bogus. It amuses me to know that I polluted
         | their data and wasted so much of their time and effort. Jerks.
        
           | wpietri wrote:
           | This warms my heart and it's a great example of lengthening
           | the feedback loop.
        
       | antonyh wrote:
       | One thing you can do is add a canonical link to each page, which
       | will help solve the Bing/Google issue until they realise it's
       | there. Do it before they add one.
       | 
       | You're already using Cloudflare, you could try talking to their
       | support or just turning up settings to make it more strict for
       | bots.
        
       | simonz05 wrote:
       | Another option would be to use a service like Cloudflare, which
       | offers protections against scraping and other malicious behavior.
       | This can help prevent the proxy-mirror site from being able to
       | access your site's content.
       | 
       | https://blog.cloudflare.com/introducing-scrapeshield-discove...
        
         | stanislavb wrote:
         | I have Cloudflare at the front already. The issue is that they
         | are not actively scraping the content but rather mirroring it
         | on demand.
        
           | DoctorOW wrote:
           | If you're on Cloudflare and they're stripping JS wouldn't a
           | JS challenge be appropriate?
           | 
           | https://developers.cloudflare.com/fundamentals/get-
           | started/c...
        
             | pharmakom wrote:
             | Pretty anti-user though
        
               | DoctorOW wrote:
                | Because it requires JS? How about this then...
                | 
                |   a[href*="sukuns.us.to"] {
                |     display: none;
                |   }
                | 
                | Then use SRI to enforce that CSS.
        
           | maybe_pablo wrote:
            | Just an idea, but maybe you could put a big load on their
            | servers by requesting a large number of URLs in parallel,
            | where you actively serve a massive gzipped HTML file that is
            | full of links to your website.
            | 
            | EDIT: or, building on what user zhouyisu says above, you can
            | generate a perfect-match IP blacklist by calling, via the
            | abusing site, URLs that automatically put any caller onto
            | the blacklist.
        
             | is_true wrote:
              | I'm not sure how easy it would be to serve a "zip bomb"
              | without getting into trouble, but it would be neat.
        
           | fragmede wrote:
           | Then where they're coming from should be exceedingly visible
           | in logs (via splunk or whatever), so deny those requests.
        
           | nixcraft wrote:
           | Enable bot protection in Cloudflare as per your plan
           | https://developers.cloudflare.com/bots/get-started/
        
       | pornel wrote:
       | If they are stripping all JS, you can make the page work only
       | with JS enabled :/
        
       | achillean wrote:
       | The following can be done for free without an API key or Shodan
       | account:
       | 
       | 1. Grab the list of IPs that you've already identified and feed
       | them through nrich (https://gitlab.com/shodan-public/nrich):
       | "nrich bad-ips.txt"
       | 
       | 2. See if all of the offending IPs share a common open port/
       | service/ provider/ hostname/ etc. Your regular visitors probably
       | connect from IPs that don't have any open ports exposed to the
       | Internet (or just 7547).
       | 
       | 3. If the IPs share a fingerprint then you could lazily enrich
       | client IPs using https://internetdb.shodan.io and block them in
       | near real-time. You could also do the IP enrichment before
       | returning content but then you're adding some latency (<40ms) to
       | every page load which isn't ideal.
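A sketch of steps 2 and 3. The InternetDB endpoint returns JSON per IP (e.g. a "ports" list); the heuristic below assumes, as the comment suggests, that residential visitors expose no open ports besides 7547 (the ISP CPE management port), while hosting boxes usually do:

```python
import json
import urllib.request

def fetch_internetdb(ip):
    """Query Shodan's free InternetDB endpoint (no API key needed).
    Requires network access; not exercised in the offline example below."""
    with urllib.request.urlopen(f"https://internetdb.shodan.io/{ip}", timeout=5) as r:
        return json.load(r)

def looks_like_server(info):
    """Block-candidate heuristic: any open port other than 7547
    suggests a hosted machine rather than a residential visitor."""
    ports = set(info.get("ports", []))
    return bool(ports - {7547})

# Offline examples with the JSON shape the endpoint returns:
assert looks_like_server({"ports": [22, 80, 443]}) is True
assert looks_like_server({"ports": [7547]}) is False
```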
        
       | T3RMINATED wrote:
        
       | [deleted]
        
       | someweirdperson wrote:
       | If they are serving all files, that should work for systems that
       | verify ownership by asking you to serve a file in response to a
       | challenge.
       | 
       | The copy is using ZeroSSL, which seems to use a similar mechanism
       | to Let's Encrypt to verify certs. Maybe you could get their
       | certificate by serving the response to their challenge from your
       | server. No idea how to proceed from there, though.
       | 
       | Or activating the google webmaster tools. Maybe there's some
       | setting "remove from index" or "upload sitemap" that could reduce
       | its visibility on google.
        
         | watusername wrote:
         | That's actually a good idea. Apparently it's possible to revoke
         | the certificates via the ACME API, even when you are using
         | another ACME account:
         | https://letsencrypt.org/docs/revoking/#using-a-different-aut...
        
       | naijagoal wrote:
       | Same here - https://www.naijagoal.com
        
       | VanTheBrand wrote:
       | This will change if they switch hosting but here's a list of all
       | the ip prefixes for their current hosting provider.
       | 
       | https://bgp.he.net/AS35913#_prefixes
       | 
       | The IPs they switch between may all be from this pool.
        
       | NicoJuicy wrote:
       | Quick and easy first:
       | 
       | 1) Add a watermark to your images when they proxy to you.
       | 
       | Stolen image from {url}
       | 
       | 2) Add a JS script that, when the URL differs from yours,
       | displays a message and redirects
        
       | [deleted]
        
       | santah wrote:
       | Same thing happened to me and my service (https://next-
       | episode.net) almost 2 years ago.
       | 
       | I wrote a HN post about it as well:
       | https://news.ycombinator.com/item?id=26105890, but to spare you
       | all the irrelevant details and digging in the comments for
       | updates - here is what worked for me - you can block all their
       | IPs, even though they may have A LOT and can change them on each
       | call:
       | 
       | 1) I prepared a fake URL that no legitimate user will ever visit
       | (like website_proxying_mine.com/search?search=proxy_mirroring_hac
       | ker_tag)
       | 
       | 2) I loaded that URL like 30 thousand times
       | 
       | 3) from my logs, I extracted all IPs that searched for
       | "proxy_mirroring_hacker_tag" (which, from memory, was something
       | like 4 or 5k unique IPs)
       | 
       | 4) I blocked all of them
       | 
       | After doing the above, the offending domains were showing errors
       | for 2-3 days and then they switched to something else and left me
       | alone.
       | 
       | I still go back and check them every few months or so ...
       | 
       | P.S. My advice is to remove their URL from your post here. This
       | will not help with search engines picking up their domain and
       | ranking it with your content ...
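Steps 1-4 above can be sketched as a small log scan. This assumes a common/combined access-log format (client IP first, request line in quotes); the bait tag is the one from the comment:

```python
import re

BAIT_TAG = "proxy_mirroring_hacker_tag"  # marker only the bait URL contains

def extract_offending_ips(log_lines):
    """Collect client IPs of every request whose path contains the bait
    tag. Legitimate users never see the bait URL, so every hit is the
    proxy's backend fetching on a visitor's behalf."""
    ips = set()
    for line in log_lines:
        m = re.match(r'(\S+) .*"(?:GET|POST) ([^"]*)"', line)
        if m and BAIT_TAG in m.group(2):
            ips.add(m.group(1))
    return ips

logs = [
    '203.0.113.9 - - [12/Dec/2022] "GET /search?q=proxy_mirroring_hacker_tag HTTP/1.1" 200 512',
    '198.51.100.4 - - [12/Dec/2022] "GET /about HTTP/1.1" 200 1024',
]
assert extract_offending_ips(logs) == {"203.0.113.9"}
```

The resulting set feeds your firewall or CDN block list; rerun it periodically as they rotate IPs.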
        
         | nuccy wrote:
          | Instead of blocking by IP, just check the
          | SERVER_NAME/HTTP_HOST variables in your backend/web server (or
          | even, in the page's JavaScript, check
          | window.location.hostname) and, in case those contain anything
          | but the original hostname, redirect to the original website
          | (or serve different content with a warning to the visitor).
          | With apache2/nginx this can be easily achieved by creating a
          | default virtualhost (which is not your website), and
          | additionally creating your website's virtualhost explicitly.
          | The default virtualhost can then serve a proper redirect for
          | any other hostname.
          | 
          | Those variables are populated from the browser's Host header;
          | unless the proxying server is rewriting them, your web server
          | will be able to detect the imposter and serve a redirect. If
          | rewrites are indeed in place, then check in the frontend.
          | Blocking by IP is the last option if nothing else works.
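The default-virtualhost idea, sketched as nginx configuration (domain and certificate paths are illustrative; the second block would carry the normal site config):

```nginx
# Catch-all: any request whose Host header doesn't match an explicit
# vhost lands here and gets redirected to the canonical domain.
server {
    listen 80 default_server;
    listen 443 ssl default_server;
    server_name _;
    # ssl_certificate ...;  # a default cert is still needed on :443
    return 301 https://www.saashub.com$request_uri;
}

# The real site, matched explicitly by hostname.
server {
    listen 443 ssl;
    server_name www.saashub.com;
    # ... normal site configuration ...
}
```

This only helps if the proxy forwards the visitor's Host header unchanged; a proxy that rewrites Host to the origin's name will hit the real vhost.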
        
           | michaelmior wrote:
            | As the OP mentioned, JS is stripped and URLs are being
            | rewritten, so I doubt either of those approaches will work.
        
             | nuccy wrote:
             | Making js essential is not that hard, right? Just "display:
             | none" on the root element, which is removed by js :)
             | 
             | More sophisticated options can been found in other
             | comments.
        
               | nuccy wrote:
               | The other kind of problem is if the website is not really
               | proxied but rather dumped, patched and re-served. In such
               | case the only option (if JavaScript frontend redirect
               | doesn't work) is blocking by IP the dumping server.
               | 
                | To identify IPs, as pointed out in the root comment of
                | this thread, you can create a one-pixel link to a dummy
                | page, which dumping software would visit but a human
                | wouldn't. Then you will see who visited that specific
                | page and can block those IPs for good.
        
               | michaelmior wrote:
               | I would think you'd want to be careful about search
               | engines with that approach. Assuming the OP wants their
               | site indexed, you could end up unintentionally blocking
               | crawlers.
        
               | tofuahdude wrote:
               | Tail wagging the dog is never a good answer.
        
               | michaelmior wrote:
               | Forcing all users of your website to use JavaScript to
               | get around a scammer is pretty heavy-handed.
        
               | krater23 wrote:
               | Show me one website that today really works without
               | javascript.
        
               | jethro_tell wrote:
               | https://news.ycombinator.com
        
               | c22 wrote:
               | I've been surfing without javascript since 2015. Most
               | websites continue to work fine without it (though some
               | aesthetic breakage is pretty standard). About 25% of
               | sites become unusable, usually due to some poorly
               | implemented cookie consent popup. I don't feel like I'm
               | missing out on anything by simply refusing to patronize
               | these sites. I will selectively turn JS on in some
               | specific cases where dynamic content is required to
               | deliver the value prop.
        
               | cpleppert wrote:
               | Same, I even wrote a chrome extension to enable js on the
               | current domain using a keyboard shortcut; but it has
               | gotten to be more of a pain especially on landing pages.
        
               | michaelmior wrote:
               | > Most websites continue to work fine without it
               | 
               | > About 25% of sites become unusable
               | 
               | These two statements seem pretty contradictory. 75% feels
               | like a low threshold for "most."
        
               | c22 wrote:
               | Most is more than half.
        
               | michaelmior wrote:
               | In casual conversation, I would never interpret most as
               | being solely more than half. However, it seems like
               | perhaps most people agree with you :)
               | 
               | https://english.stackexchange.com/questions/55920/is-
               | most-eq...
        
               | c22 wrote:
               | In my entirely casual understanding of English _most_
               | means the set that has more members than any other. When
               | the comparison is binary (sites that work vs sites that
               | don't) then "more than half" is both necessary and
               | sufficient as a definition.
               | 
               | When comparing more than two options most _could_ be
               | significantly less than half (e.g. if I have two red
               | balls, and one ball each of blue, purple, green, orange,
               | pink, and yellow, then the color I have the most of is
               | red, despite representing only one quarter of the total
               | balls.)
               | 
               | That said, any attribute attaining more than half of the
               | pie _must_ be most.
        
               | crooked-v wrote:
               | Even JS-heavy websites are moving towards being usable
               | without Javascript with server side rendering.
        
               | psychphysic wrote:
               | Just explain why, in a way that vanishes when JS is
               | enabled. Like others have said, it won't need to be in
               | place for long.
        
               | kelnos wrote:
               | Presumably OP would only have to do this for a limited
               | time, until the scammer gives up and moves on to an
               | easier target. It's not the best, but I don't think it's
               | as bad as you say.
        
               | klyrs wrote:
               | It's trivial to strip that "display: none" out, too.
        
               | brazzledazzle wrote:
               | Yeah, if they're already rewriting content to serve ads
               | (likely, since they're probably not doing this for
               | altruistic reasons), you're just putting off the
               | inevitable. While blocking or captcha'ing source IPs is
               | also a cat-and-mouse game, it's much more effective over
               | a longer period of time.
        
             | mnutt wrote:
             | Maybe an html <meta> redirect tag that bounces through a
             | tertiary domain before redirecting to your real one? If
             | they noticed you were doing it they could mitigate it, but
             | they might deem it too much effort and just go away.
             | 
             | You might also start with the hypothesis that they're using
             | regex for JS removal and try various script injection
             | tricks...
        
               | michaelmior wrote:
               | If they're already stripping JS, I can't imagine it would
               | be a lot of work to also remove the <meta> redirect.
        
         | [deleted]
        
         | NullPrefix wrote:
         | >4) I blocked all of them
         | 
         | Don't block them. Show dicks instead
        
         | blinding-streak wrote:
         | Side note: great idea for a website. This could be really
         | helpful. You got a new user here.
        
           | focusedone wrote:
           | Wow, hadn't seen this before. Awesome site!
        
           | santah wrote:
           | Thanks!
        
           | mhlakhani wrote:
           | I have to agree, my SO has been looking for something like
           | this for a long time. Signing up today!
        
         | rexreed wrote:
         | For 2) you mean you loaded it from the adversary's proxy site,
         | just to clarify?
        
           | [deleted]
        
           | santah wrote:
           | Yes, constructed the honeypot URL using the proxy site and
           | called it (thousands of times) so I can get them to fetch it
           | from my server through their IP so I can log it.
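The honeypot-logging step described above can be sketched server-side: scan the access log for hits on the trap URL and collect the client IPs. The trap path and the "combined" log format here are assumptions; adapt both to your stack.

```python
import re

# Hypothetical honeypot path: a unique URL that nothing legitimate links
# to, so any request for it arrived via the mirror.
HONEYPOT_PATH = "/trap-5f2a"

# Minimal matcher for nginx/Apache "combined" log lines (an assumption
# about the log format; tune the regex for yours).
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)')

def proxy_ips(log_lines):
    """Return the set of client IPs that ever requested the honeypot URL."""
    ips = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith(HONEYPOT_PATH):
            ips.add(m.group(1))
    return ips
```

Feeding the resulting set into a firewall or rate limiter is then a separate step.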
        
             | WirelessGigabit wrote:
             | They literally proxy your website? I thought they'd cache
             | it... your statement that you hit their website with a
             | specially formatted URL makes more sense now. Since they
             | pass that through to you, you can filter on it.
             | 
             | Also: since you say 4k-5k IPs... any of them from cloud
             | providers? And specific location?
        
               | santah wrote:
               | No cloud providers as far as I'm aware.
               | 
               | They were all from the same 4-5 ASN networks, all based
               | in Russia.
        
               | adventured wrote:
               | If you happen to use Cloudflare.... Cloudflare ->
               | Firewall rules -> Russia JS Challenge (or block)
        
               | justsomehnguy wrote:
               | Residential proxy botnet.
        
               | tofuahdude wrote:
               | Why do they bother doing this domain proxy stuff in the
               | first place?
        
               | justsomehnguy wrote:
               | High quality content with a good standing in Google =>
               | unique and quality impressions => more revenue from the
               | ads they insert in the content.
        
             | everybodyknows wrote:
             | Also for (2), any worries that your own providers might
             | imagine you're trying to mount some half-baked DOS
             | campaign?
        
               | santah wrote:
               | Wasn't really worried about that.
               | 
               | I didn't do it as a super quick burst, but in a space of
               | multiple hours.
               | 
               | First, because the proxy servers were super slow, and
               | second, I couldn't automate it - their servers had some
               | kind of bot detection which would catch me calling the
               | URLs through a script.
               | 
               | Instead, I installed a browser extension which would
               | automatically reload a browser tab after a specified
               | timeout (I set it to 10 sec or something), opened like
               | 50 tabs of the honeypot URL, and left them there to
               | reload for hours ...
        
         | marklit wrote:
         | As soon as you have a few of their IPs, look them up on
         | ipinfo.io/1.2.3.4 and you'll find they probably belong to a
         | handful of hosting firms. You can get each firm's entire IP
         | list on that page and add all of those CIDRs to your block
         | list. Saves you needing to make 30K web requests.
         | 
         | In most countries in the western world, there are 3-4 major
         | ISPs and this is where 99% of your legit traffic comes from.
         | Regular people don't browse the web proxying via hosting
         | centres as Cloudflare will treat them with suspicion on all the
         | websites they protect.
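A sketch of the resulting blocklist check using the stdlib `ipaddress` module; the CIDRs below are documentation ranges standing in for whatever ipinfo.io reports for the hosting firms:

```python
import ipaddress

# Placeholder ranges: substitute the hosting firms' real CIDR lists.
BLOCKED = [ipaddress.ip_network(c)
           for c in ("203.0.113.0/24", "198.51.100.0/22")]

def is_blocked(ip):
    """True if the client IP falls inside any blocked hosting range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)
```

A handful of network objects is plenty here; with thousands of CIDRs you would want a radix-tree lookup instead of a linear scan.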
        
           | reincoder wrote:
           | The site seems to be hosted on OVH cloud. OP should report
           | this to them.
           | 
           | https://www.ovh.com/abuse/
           | 
           | Found the hosting information from here:
           | https://host.io/us.to
        
             | KomoD wrote:
             | Consider reaching out to Afraid.org first,
             | https://freedns.afraid.org/contact/
             | 
             | They are the ones providing the subdomain
        
             | ElijahLynn wrote:
             | THIS ^^
        
         | stanislavb wrote:
         | Thanks for the advice. I will give some of these a go. p.s. I
         | can't remove the URL as the post is not editable anymore. I'm
         | just waking up... in Australia.
        
           | DoreenMichele wrote:
           | The mod can though, if you email him at hn@ycombinator.com.
        
         | chris_wot wrote:
         | Makes me wonder if you could switch the content you serve
         | based on the URL, so that their visitors get redirected back
         | to your website. Or display images marked as copyrighted.
        
           | santah wrote:
           | I tried but couldn't redirect back to my website as they
           | stripped / rewrote all JS.
        
             | rot13xor wrote:
             | You could have a "stolen content" pure HTML/CSS banner that
             | gets removed by Javascript. Only proxy site visitors will
             | see the banner because the proxy deleted the Javascript.
        
               | dorgo wrote:
               | Some people, like me, will see the "stolen content"
               | banner on the original website. And the attackers can
               | trivially remove it as soon as they become aware of it.
        
             | t0suj4 wrote:
             | Would it be possible to hide a hash/encoded URL somewhere
             | in JS and delete the site/redirect if the hash/encoded URL
             | contained something unexpected?
        
         | khiqxj wrote:
         | 8chan, like every forum ever, has dumb moderators who don't
         | know how to do their job / overextend their hand (and the
         | moderation position on web forums seems to attract people with
         | certain mental disorders that make them seek out perceived
         | microinjustices, the definition of which changes from day to
         | day)
         | 
         | there were a bunch of sites mirroring 8chan to steal content
         | 
         | these were useful because they had both a simpler / lighter /
         | better user interface (aside from images being missing), and
         | posts / threads that were deleted would stay on the mirrors.
         | being able to see deleted posts / threads was highly useful as
         | the moderation on such sites tends to be utterly useless and
         | the output of a random number generator. it was hilarious
         | reading "zigforum" instead of "8chan" in all the posts as the
         | mirror replaced certain words to thinly veil their operation.
         | they even had a reply button that didn't seem to work or was
         | just fake.
         | 
         | tl;dr: the web is broken and is only good when "abused" by
         | proxy/mirrors
        
         | otikik wrote:
         | Once you have their IP addresses you can make them serve
         | anything you want. Set your imagination free.
         | 
         | For starters: copyright-infringing material.
        
           | layer8 wrote:
           | Unless you hold the necessary rights to the copyrighted
           | material, that would make you a copyright infringer yourself.
        
         | bvinc wrote:
         | Might I suggest a spin on this: instead of blocking the IPs,
         | consider serving up different content to those IPs.
         | 
         | You could make a page that shames their domain name for
         | stealing content. You could make a redirect page that redirects
         | people to your website. Or you could make a page with
         | absolutely disgusting content. I think it would discourage them
         | from playing the cat-and-mouse game with you and working
         | around your blocks by getting new IPs.
        
           | nomel wrote:
           | > Or you could make a page with absolutely disgusting
           | content.
           | 
           | Not if you value the people who might move to the real
           | domain.
        
             | Mikealcl wrote:
             | You could do this without affecting normal traffic,
             | depending on the uniqueness of the IPs doing the scraping.
             | 
             | Love the idea.
        
               | swsieber wrote:
               | I think you missed the point: if people show up at
               | $PROXY expecting nice stuff but see junk, then they
               | won't move over to $REAL and will instead blame $REAL.
               | 
               | E.g. you'd like some way to redirect people from $PROXY
               | site to $REAL site, and disgusting content on $PROXY
               | won't do that - it'll reflect poorly on $REAL
        
               | heelix wrote:
               | If you can identify the crawler - you can provide
               | 'dynamic' content for that specific user context.
        
               | nomel wrote:
               | It's a proxy, so there's no "crawler". It's just an agent
               | relaying to the user. Passing something to this proxy
               | agent just passes it directly to the user.
        
           | sprior wrote:
           | "Or you could make a page with absolutely disgusting
           | content." You've never heard of Rule 34, have you...
        
             | walrus01 wrote:
             | obviously somebody too young to have seen the method of
             | using an http redirect to the goatse hello.jpg for unwanted
             | requests
             | 
             | edit: or when somebody embed-links your image inside some
             | forum, replace the original filename with the contents of
             | hello.jpg
        
           | hedora wrote:
           | One possibility: Serve different content, but only if the
           | user agent is a search engine scraper. Wait a bit to poison
           | their search rankings, then block them.
        
             | zhengyi13 wrote:
             | ... be careful with this.
             | 
             | Assuming you've monetized your content with ads, depending
             | on your ads provider, this may have deleterious effects on
             | your account with that provider, as they may then assume
             | you're trying to game ads revenue.
        
               | chipsa wrote:
               | The mirror is almost certainly running their own ads,
               | given they strip the JavaScript out.
        
       | pornel wrote:
       | Don't block their IPs, but rather return them subtly wrong
       | content that isn't broken at the first glance. Insert typos,
       | replace important terms, inject nonsense technobabble, make URLs
       | point to wrong pages, inject off-topic SEO-spammy keywords that
       | search engines will see as the SEO spam they are.
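The poisoning idea can be as simple as a deterministic typo pass applied only to responses for flagged IPs. In this sketch the 20% rate and the letter-swap rule are invented; the RNG is seeded so a poisoned page stays stable between fetches instead of looking obviously randomized:

```python
import random

def poison(text, seed=0):
    """Swap two adjacent letters in roughly 1 in 5 longer words.

    Deterministic for a given seed (e.g. a hash of the URL), so the
    corruption looks like sloppy writing rather than noise.
    """
    rng = random.Random(seed)
    out = []
    for w in text.split(" "):
        if len(w) > 3 and rng.random() < 0.2:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        out.append(w)
    return " ".join(out)
```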
        
       | ycommentator wrote:
       | My networking knowledge isn't great, so apologies if this is
       | wrong. But if it's not wrong, it could help.
       | 
       | FIND THE IP FOR THE DOMAIN
       | 
       |     PS > ping sukuns.us.to
       |     Pinging sukuns.us.to [45.86.61.166] with 32 bytes of data:
       |     Reply from 45.86.61.166: bytes=32 time=319ms TTL=39
       |     ...
       | 
       | REVERSE DNS TO FIND HOST
       | https://dnschecker.org/ip-whois-lookup.php?query=45.86.61.166
       | 
       | Apparently it's "Dedipath".
       | 
       | And that WHOIS lookup gives an abuse email address:
       | "Abuse contact for '45.86.60.0 - 45.86.61.255' is
       | 'abuse@dedipath.com'"
       | 
       | So you could try emailing that address. They may take the site
       | down, or hopefully more than that...
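Extracting the abuse contact from a WHOIS reply can be scripted once you have the text; field names vary by registry, so this sketch only covers the common RIPE (`abuse-mailbox:`) and ARIN (`OrgAbuseEmail:`) forms plus the remark format quoted above:

```python
import re

# Matches either a registry abuse field or the RIPE-style remark
# "Abuse contact for '...' is 'someone@example.com'".
ABUSE_RE = re.compile(
    r"(?:abuse-mailbox:|OrgAbuseEmail:)\s*(\S+@\S+)"
    r"|is '([^']+@[^']+)'"
)

def abuse_contact(whois_text):
    """Return the first abuse email found in WHOIS output, or None."""
    m = ABUSE_RE.search(whois_text)
    if not m:
        return None
    return m.group(1) or m.group(2)
```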
        
         | mxuribe wrote:
         | This is not a bad idea, though I would guess that if these
         | guys change IPs, then it will be annoying to spend your time
         | sending emails, etc. But then I thought: why not automate this
         | with some simple scripts? You have already outlined your
         | recipe, so simply automate the steps... But the more I thought
         | about the automation around this, you need to be careful not
         | to turn into a "spammer" of sorts, constantly sending
         | emails... Certainly, you would be sending legitimate emails,
         | but if they change their IPs more often, that might trigger
         | your automation more often, somewhat turning you into a mild
         | "spammer", right? :-) I'm not suggesting you abandon your
         | approach, but simply to remember not to overdo it with the
         | scale of emails sent out. ;-)
        
         | mmcgaha wrote:
         | Block all of the prefixes that their AS announces too:
         | https://bgp.tools/as/35913#prefixes
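Turning a prefix list like that into server config is mechanical; a sketch that validates each prefix and emits nginx `deny` directives (the prefix below is the range quoted elsewhere in the thread):

```python
import ipaddress

def nginx_deny_rules(prefixes):
    """Validate each announced prefix and emit nginx deny directives."""
    nets = [ipaddress.ip_network(p) for p in prefixes]  # raises on bad input
    return "\n".join(f"deny {n};" for n in nets)
```

The output drops straight into an nginx `server` or `http` block.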
        
         | RockRobotRock wrote:
         | Abuse contacts never work. I've never had any success hounding
         | them about malicious sites they host.
        
           | mobilio wrote:
           | Actually, it works very well when combined with a DMCA
           | takedown request.
        
       | zhouyisu wrote:
       | Maybe a logic honeypot would be good, such as an infinite
       | content paging list with some random trigger hidden at pages
       | with nonsense titles. When an IP hits these triggers, it is
       | automatically banned.
       | 
       | Bots will trigger it by walking through all the pages, but a
       | real human would not click in, since the paging is nonsense and
       | the titles are nonsense.
        
         | stanislavb wrote:
         | Yeah, but I don't want to ban bots. Also, they are not actively
         | crawling anything, but rather mirroring the content on demand.
         | At least that's my observation so far... thanks anyways.
        
       | jamal-kumar wrote:
       | Believe it or not ICANN actually takes abuse reports seriously:
       | 
       | https://www.icann.org/resources/pages/abuse-2014-01-29-en
        
       | alexfromapex wrote:
       | There are ways you can fix this yourself but like all things it's
       | way easier to just get a managed solution. CloudFlare or similar
       | should give the necessary tools to block these types of sites.
        
       | amurf wrote:
       | Instead of rendering server side, render client side. If they
       | strip the JS, their visitors get nothing. In the JS, check the
       | hostname, and if it matches their hostname, don't render
       | anything.
       | 
       | Potential downside: SEO.
        
       | colesantiago wrote:
       | If you have a trademark, domain takedown always works.
        
         | 2000UltraDeluxe wrote:
         | That isn't necessarily true. While I definitely agree that
         | the OP should pursue that path, many providers are in other
         | jurisdictions.
        
       | reaktivo wrote:
       | Redirect via meta tag
       | 
       | <meta http-equiv="refresh" content="time; URL=new_url" />
        
         | Benanov wrote:
         | won't survive a grep
        
           | chris_wot wrote:
           | Still, it forces them to do a small amount of work. A shot
           | across the bow, so to speak.
        
       | jFriedensreich wrote:
       | There are infinite mitigations, and it will always boil down to
       | how much they want to do this vs how much you want to prevent
       | them. In the end they could render in a remote-controlled
       | browser and use CDN or AWS IP addresses en masse. I would
       | consider hijacking their users in subtle ways, like replacing
       | pictures or text with obscenities or legal disclaimers.
       | Unfortunately their motivation is selling ads to other dodgy
       | companies, so it's unlikely you can mitigate that way. I would
       | also invest in getting the SEO in order and having them removed
       | from Google if possible. Lastly, there are solutions like
       | Cloudflare Turnstile that don't impact normal users as much as
       | in the days of captchas.
        
         | highwaylights wrote:
         | Maybe OP only needs to do enough to undermine their mirror,
         | rather than drive them away.
         | 
         | it's possible the combination of blocked image hotlinks,
         | watermarking the domain inside the images, and CSS trickery
         | that messes up the page on the proxy (along with whatever other
         | steps that can be thought of to make it look wrong or erroneous
         | on the proxied site) could get op bumped to #1 on search on
         | enough links that it no longer matters.
         | 
         | Given the other site isn't generating original content it's
         | unlikely to ever get its google juice back.
         | 
         | On a side note - does Google have an option for this? I'm sure
         | they must have encountered this before and it helps the quality
         | of their results too to block obviously fudged content.
        
           | jFriedensreich wrote:
           | I would try Google's phishing report (as others here have
           | already done):
           | https://safebrowsing.google.com/safebrowsing/report_phish/
           | even though the example here is not targeted at stealing
           | user data per se.
        
       | oliv__ wrote:
       | This made me wonder whether something similar was happening to my
       | domain?
       | 
       | How would one go about finding out?
        
       | mkoryak wrote:
       | > They are stripping all JS
       | 
       | Are they now?
       | 
       | Add a `visibility: hidden` to random elements on the page, and
       | show them with javascript.
       | 
       | OR
       | 
       | Are they removing _all_ js? Have you checked whether they remove
       | `<body onclick="some javascript code that injects some other
       | code">` ?
       | 
       | You can try to do script injection _into your own site_ to see if
       | their mirroring software is smart enough to deal with all the
       | different xss vectors.
       | 
       | Bonus points: if they remove your `<body onmouseover=>`
       | attribute, add a style like
       | 
       | body { display: none } body[onmouseover='the js code that they
       | will remove'] { display: block }
        
         | SeriousM wrote:
         | Just try some of the polymorphic XSS tricks hackers use to
         | get JavaScript into a page. PortSwigger has a wonderful,
         | extensive XSS cheat sheet.
        
       | balls187 wrote:
       | Um, so I clicked the second link, and was redirected to a not-
       | safe-for-work website.
       | 
       | Luckily, I am at home, and my children are at school.
       | 
       | I have no idea what happened, or why I got redirected, but I can
       | certainly suggest not taking up the idea to serve disgusting
       | content (given I clicked a link that someone on HN posted, I
       | shouldn't be subjected to that).
        
         | [deleted]
        
       | napolux wrote:
       | Want to have some fun?
       | 
       | Happened to me back in the days of blogging.
       | 
       | Posted an image of me mocking them on my blog. Sure enough they
       | published it and they didn't notice for a while. They stopped it
       | soon after :)
        
       | LinuxBender wrote:
       | I tried to look up their site then realized I block "us.to"
       | locally. Since you have their site linked in this thread they are
       | likely seeing the HN thread as a referrer in their access logs
       | and reading this. I expect this to turn into an ongoing battle as
       | a result, but maybe this could be a fun learning exercise for
       | everyone here.
       | 
       | The current IP 45.86.61.166 is likely a compromised host [1]
       | which tells me you are dealing with one of the gangs that create
       | watering holes for phishing attacks and plan to use your content
       | to lure people in. They probably have several thousand
       | compromised hosts to play with. Since others mentioned you could
       | change the content on your site, I would suggest adding the EICAR
       | string [2] throughout the proxied content as well so that people
       | using anti-malware software might block it. They are probably
       | parking multiple phishing sites on the same compromised hosts
       | [3].
       | 
       | This would also be a game of whack-a-mole but if you can find a
       | bunch of their watering hole sites and get the certificate
       | fingerprints and domains into a text file, give them to ZeroSSL
       | and see if they can mass revoke them. Not many browsers validate
       | this but it might get another set of eyes on the gang abusing
       | their free certs.
       | 
       | If you have a lot of spare time on your hands, you could automate
       | scripting the gathering of the compromised proxy hosts they are
       | using and submit the IP, server name, domain name to the hosting
       | provider with the subject "Host: ${IP}, ${Hostname}, compromised
       | for phishing watering hole attacks". Only do this if you can
       | automate it as many server providers have so many of these
       | complaints they end up in a low priority bucket. Use the abuse@,
       | legal@ and security@ aliases for the hosting company along with
       | whatever they have on their abuse contact page. Send these emails
       | from a domain you do not care about as it will get flagged as
       | spam.
       | 
       | Another option would be to draft a very easy to understand email
       | that explains what is occurring and give that to Google and Bing.
       | Even better would be if we could get the eyes of Tavis Ormandy
       | from Google's vulnerability research team to think of ways to
       | break this type of plagiarized content. Perhaps ping him on
       | Twitter and see if he is up to the challenge of solving this in a
       | generalized way to defeat the watering holes.
       | 
       | I can think of a few other things that would trip up their
       | proxies but no point in mentioning it here since the attackers
       | are reading this.
       | 
       | [1] - https://www.shodan.io/host/45.86.61.166
       | 
       | [2] - https://www.eicar.org/download-anti-malware-testfile/
       | 
       | [3] -
       | https://urlscan.io/result/af93fb90-f676-4300-838f-adc5d16b47...
        
         | hedora wrote:
         | Lol:
         | 
         | > _How to delete the test file from your PC_
         | 
         | > _We understand (from the many emails we receive) that it
         | might be difficult for you to delete the test file from your
         | PC. After all, your scanner believes it is a virus infected
         | file and does not allow you to access it anymore. At this point
         | we must refer to our standard answer concerning support for the
         | test file. We are sorry to tell you that EICAR cannot and will
         | not provide AV scanner specific support. The best source to get
         | such information from is the vendor of the tool which you
         | purchased._
         | 
         | > _Please contact the support people of your vendor. They have
         | the required expertise to help you in the usage of the tool.
         | Needless to say that you should have read the user's manual
         | first before contacting them._
        
       | JW_00000 wrote:
       | What about a slightly alternative approach, where instead of
       | trying to block the abuser, you try to make it clear to end users
       | what the real website is? E.g. in your logo image, include the
       | real domain name "saashub.com". Have some introduction text on
       | your home page "Here at saashub.com, we compare SaaS products
       | ...." When your images are hotlinked, replace them with text like
       | "This is a fraudulent website, find us at saashub.com". Anything
       | that can make it obvious to end users that they're on the wrong
       | website when they visit the abuser's URL.
       | 
       | By the way, I've also reported the abuser as a phishing/fraud
       | website through
       | https://safebrowsing.google.com/safebrowsing/report_phish/?u...
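The hotlink swap hinges on a Referer check; a sketch assuming the proxy forwards the visitor's original Referer header (it may rewrite or strip it, as the point about URL rewriting suggests):

```python
from urllib.parse import urlparse

# Hosts allowed to embed the real images. The empty string means no
# Referer header at all, which we allow to avoid breaking normal users.
ALLOWED_HOSTS = {"www.saashub.com", "saashub.com", ""}

def serve_real_image(referer):
    """Return True to serve the real image, False to serve a
    'this is a fraudulent website, find us at saashub.com' placeholder."""
    host = urlparse(referer or "").netloc.lower()
    return host in ALLOWED_HOSTS
```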
        
         | riz_ wrote:
         | Not sure if this would help since:
         | 
         | > 4) Use absolute URLs everywhere - they are rewriting
         | everything www.saashub.com to their domain name.
        
           | lukevp wrote:
           | Embed the welcome text in an image then!
        
       | arnorhs wrote:
       | I would just contact cloudflare via discord. They will know what
       | your best recourse is
        
       | MR4D wrote:
       | I just checked the offending site - it's full of malware. I think
       | if you report that aspect then you might get faster resolution
       | from the search engines.
        
       ___________________________________________________________________
       (page generated 2022-12-12 23:01 UTC)