https://github.com/niespodd/browser-fingerprinting

niespodd / browser-fingerprinting Public

Analysis of Bot Protection systems with available countermeasures.
How to defeat anti-bot systems and get around browser fingerprinting
scripts when scraping the web?

niespodd.github.io/browser-fingerprinting/

 Avoiding bot detection: How to scrape the web without getting
blocked? 

Whether you're just starting to build a web scraper from scratch and
wondering why your solution isn't working, or you've been working
with crawlers for a while and are stuck on a page that says you're a
bot and won't let you go any further, keep reading.

Anti-bot solutions have evolved in recent years. More and more
websites are introducing security measures: from simple ones, such as
filtering IP addresses according to their geolocation, to advanced
ones based on in-depth analysis of browser parameters and user
behavior. All this makes scraping web content more difficult and
costly than it was a few years ago. Nevertheless, it is still
possible. Here I highlight a few tips that you may find helpful.

 Where to begin building an undetectable bot?

Below you can find a curated list of services that I have used to get
around different anti-bot protections. Depending on your use case you
may need one of the following:

Scenario/        Solution                       Example
use-case

Short-lived      Pool of rotating IP addresses  Handy when you scrape
sessions without                                websites like Amazon,
auth                                            Walmart or public
                                                LinkedIn pages, that
                                                is, any website where
                                                no sign-in is
                                                required. You plan to
                                                make a high number of
                                                short-lived sessions
                                                and can afford being
                                                blocked every now and
                                                then.

Geographically   Region-specific pool of IP     Useful when the
restricted       addresses                      website uses a
websites                                        firewall similar to
                                                the one from
                                                Cloudflare to block an
                                                entire geography from
                                                accessing it.

Long-lived       Repeatable pool of IP          The most common
sessions after   addresses and stable set of    scenario here is
sign-in          browser fingerprints           social media
                                                automation, e.g. you
                                                build a tool to
                                                automate social media
                                                accounts to manage
                                                ads more efficiently.

JavaScript-based Use of popular evasion         A number of websites
detection        libraries, similar to          use FingerprintJS,
                 puppeteer-extra-plugin-stealth which can be easily
                                                bypassed when you
                                                employ open-source
                                                plugins such as the
                                                aforementioned
                                                puppeteer stealth
                                                plugin to work with
                                                your existing
                                                software.

Detection with   Natural-looking browser        These are among the
browser          fingerprints. That is, having  most advanced cases.
fingerprinting   covered the whole surface that Mainstream examples
techniques       is being validated by the      are credit card
                 JavaScript solution installed  processors such as
                 on the target website.         Adyen or Stripe. A
                                                very sophisticated
                                                browser fingerprint
                                                is created to detect
                                                credit fraud or
                                                prompt additional
                                                authorization from
                                                the user.

Unique set of    Specialized bot software that  Good examples are
detection        targets the unique detection   sneaker marketplace
techniques       surface of the target website. websites and
                                                e-commerce shops,
                                                reportedly under
                                                heavy attack from
                                                custom-made bot
                                                software.

Simple           Before diving into any of the  -
custom-made      above, if you are targeting a
detection        smaller website, it is very
techniques       likely that all you need is a
                 Scrapy script with tweaks and
                 a cheap data-center proxy.
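
The first and third rows of the table differ mainly in how a proxy is
picked. Here is a minimal sketch of both strategies, assuming you
already have a list of proxy URLs from a provider. The `ProxyPool`
class, its method names and the example.com endpoints are hypothetical
and only illustrate the idea.

```javascript
// Hypothetical proxy endpoints; substitute the ones issued by your provider.
const EXAMPLE_PROXIES = [
  "http://proxy-1.example.com:8000",
  "http://proxy-2.example.com:8000",
  "http://proxy-3.example.com:8000",
];

class ProxyPool {
  constructor(proxies) {
    if (!Array.isArray(proxies) || proxies.length === 0) {
      throw new Error("ProxyPool needs at least one proxy URL");
    }
    this.proxies = [...proxies];
    this.cursor = 0;
    this.stickyBySession = new Map(); // session id -> pinned proxy
  }

  // Short-lived sessions: hand out proxies in round-robin order.
  next() {
    const proxy = this.proxies[this.cursor];
    this.cursor = (this.cursor + 1) % this.proxies.length;
    return proxy;
  }

  // Long-lived sessions: the same session id always maps to the same proxy.
  forSession(sessionId) {
    if (!this.stickyBySession.has(sessionId)) {
      this.stickyBySession.set(sessionId, this.next());
    }
    return this.stickyBySession.get(sessionId);
  }
}

const pool = new ProxyPool(EXAMPLE_PROXIES);
console.log(pool.next());                // first proxy in the list
console.log(pool.forSession("acct-42")); // pinned on first use
console.log(pool.forSession("acct-42")); // same proxy as the line above
```

For signed-in accounts you would persist the session-to-proxy mapping
(and the browser profile that goes with it) between runs, so that the
account keeps showing up from the same address.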

Once you have decided what type of evasion your project needs, you
can use the list below to pick the best provider:

 Recommended services

Type       Service                           Note

Proxy      BrightData (formerly Luminati     One of the most
           Networks)                         reliable, stable and
                                             recommended proxy
                                             providers. Best to begin
                                             there and, if it turns
                                             out to be too pricey,
                                             move to cheaper
                                             alternatives.

Proxy      Global Peer to Business Proxy     An alternative to
           Network - infatica.io             BrightData that is three
                                             times cheaper; however,
                                             do mind their terms of
                                             use.

Proxy      Oxylabs                           Competitor to BrightData
                                             with a very similar
                                             pricing model. Rumor has
                                             it that they have a
                                             better TCP
                                             fingerprint masking
                                             mechanism in place.

Scraping   ScrapingBee                       One of the most advanced
as a                                         stealthy scraping
service                                      services. At times it
                                             may be cheaper than
                                             building a dedicated
                                             scraping solution, as
                                             they do not charge for
                                             the amount of traffic
                                             used.

Scraping   Apify.io                          Handy when your project
as a                                         is about one-off
service                                      scraping. Their data
                                             understanding algorithm
                                             makes extracting data a
                                             breeze.

De-captcha Anti Captcha: Captcha Solving     Self-explanatory.
as a       Service. Bypass reCAPTCHA,        Bitcoin accepted.
service    FunCaptcha (...)

 List of anti-bot software providers

This is a non-exhaustive list of companies that provide the most
advanced anti-bot solutions for businesses ranging from smaller
e-commerce sites to Fortune 500 companies:

  * Akamai Bot Manager by Akamai
  * Advanced Bot Protection by Imperva (formerly Distil Networks)
  * DataDome Bot Protection
  * PerimeterX
  * Shape Security
  * Cloudflare Bot Management
  * Barracuda Advanced Bot Protection
  * HUMAN
  * Kasada
  * Alibaba Cloud Anti-Bot Service
  * Travatar

 How do you know who is getting you blocked?

Join extra.community. An automated tester called Botty McBotface runs
there and uses several complicated techniques to determine exactly
which protection a tested website uses (credits to berstend and
others from #insiders).

 Available stealth browsers with automation features

Important: You use this software at your own risk. Some of these
browsers contain malware. I do not recommend using them.

Stealth Browser Puppeteer Selenium Evasions SDK/Tooling Origin
GoLogin         [?]        [?]                           + 
Incogniton      [?]        [?]               [?]          
ClonBrowser     [?]        [?]               [?]          
MultiLogin      [?]        [?]               [?]           + 
Indigo Browser  [?]        [?]               [?]          
GhostBrowser                                        
Kameleo         [?]        [?]               [?]          
AntBrowser                                          
CheBrowser                       /[?]                

Legend:  - Evasion based on noise.  - No. [?] - Acceptable (with
support libraries or not).  - Very nice.

---------------------------------------------------------------------

A star on this repo will be appreciated!

---------------------------------------------------------------------

 Technical insights into bypassing bot detection

Here I study various aspects of evasion techniques used to get around
bot detection systems used by major online websites. I cover both
technical and non-technical matters, including recommendations,
references to scientific papers and more.

The technical findings that I am sharing below are based on
observations of running web scraping scripts for a few months against
websites protected by the major anti-bot solution vendors.

I constantly add material to this section. Over time I will try to
give it a more structured look and feel.

 Random, maybe useful

  * Cap FPS for Chromium with software rendering
    (--use-gl=swiftshader) - Limits CPU usage from SwiftShader by
    reducing the redraw frequency of Chromium in an AVD
  * Contrary to some public comments on the matter, the Chrome
    DevTools Protocol actually works on AVDs with Puppeteer
  * Abusing GPU cache to create persistent tracking identifiers

 puppeteer-extra-plugin-stealth 

[?] Win /  Fail /  Tie :

  * [?] Client Hints - Shipped recently. In line with Chromium cpp
    implementation.
  * [?] General navigator and window properties
  * [?] Chrome plugins and native extensions - This includes both
    Widevine DRM extension, as well as Google Hangouts, safe-browsing
    etc.
  *  p0f - detect host OS from TCP struct - Not possible to fix via
    Puppeteer APIs. Used in Akamai Bot Manager to match against JS
    and browser headers (Client Hints and User-Agent). There is a
    detailed explanation of the issue. The most reliable evasion
    seems to be not spoofing host OS at all, or using OSfooler-ng.
  *  Browser dimensions - Although stealth plugin provides
    window.outerdimensions evasion, it won't work without correct
    config on non-default OS in headless mode; almost always fails
    when viewport size >= screen resolution (low screen resolution
    display on the host).
  *  core-estimator - This can detect mismatch between
    navigator.hardwareConcurrency and SW/WW execution profile. Not
    possible to limit/bump the ServiceWorker/WebWorker thread limit
    via existing Puppeteer APIs.
  *  WebGL extensions profiling - desc. tbd
  *  RTCPeerConnection when behind a proxy - Applies to both SOCKS
    and HTTP(S) proxies.
  *  Performance.now - desc. tbd (red pill)
  *  WebGL profiling - desc. tbd
  *  Behavior Detection - desc. tbd (events, params, ML+AI buzz)
  *  Font fingerprinting - desc. tbd (list+version+renderer via HTML
    &canvas)
  *  Network Latency - desc. tbd (integrity check: proxy det., JS
    networkinfo, dns resolv profiling&timing)
  *  Battery API - desc. tbd
  *  Gyroscope and other (mostly mobile) device sensors - desc. tbd
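
To make the p0f item more concrete, here is a minimal sketch of the
kind of cross-check a detection system could run: compare the OS
claimed by the User-Agent header, the OS claimed by the
Sec-CH-UA-Platform Client Hint, and the OS guessed from the TCP
fingerprint. This is my own illustration, not the logic of Akamai or
any other vendor. The function names are hypothetical, and `tcpOs` is
assumed to be a label already produced by a p0f-like tool.

```javascript
// Map a User-Agent string to a coarse OS family. Order matters: iOS and
// Android user agents also contain "Mac OS X" and "Linux" respectively.
function osFromUserAgent(userAgent) {
  if (/iPhone|iPad|iPod/.test(userAgent)) return "iOS";
  if (/Android/.test(userAgent)) return "Android";
  if (/Windows NT/.test(userAgent)) return "Windows";
  if (/CrOS/.test(userAgent)) return "Chrome OS";
  if (/Mac OS X/.test(userAgent)) return "macOS";
  if (/Linux/.test(userAgent)) return "Linux";
  return "unknown";
}

// Sec-CH-UA-Platform arrives as a quoted string, e.g. "Windows".
function osFromClientHint(secChUaPlatform) {
  return secChUaPlatform ? secChUaPlatform.replace(/^"|"$/g, "") : "unknown";
}

// tcpOs is a hypothetical, already-normalized label from a p0f-like tool.
function findOsMismatches({ userAgent, secChUaPlatform, tcpOs }) {
  const claims = {
    "User-Agent": osFromUserAgent(userAgent),
    "Sec-CH-UA-Platform": osFromClientHint(secChUaPlatform),
    "TCP fingerprint": tcpOs || "unknown",
  };
  // Only compare sources that actually produced an answer.
  const known = Object.entries(claims).filter(([, os]) => os !== "unknown");
  const mismatches = [];
  for (let i = 0; i < known.length; i++) {
    for (let j = i + 1; j < known.length; j++) {
      if (known[i][1] !== known[j][1]) {
        mismatches.push(
          `${known[i][0]} says ${known[i][1]}, ${known[j][0]} says ${known[j][1]}`
        );
      }
    }
  }
  return mismatches;
}

// A Linux host pretending to be Chrome on Windows.
console.log(
  findOsMismatches({
    userAgent:
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
      "(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
    secChUaPlatform: '"Windows"',
    tcpOs: "Linux",
  })
);
```

Both headers agree with each other here, yet each one disagrees with
the TCP layer, which is why spoofing only at the JavaScript and header
level is not enough. A real check would also need to account for
platforms whose TCP stack looks like another OS family (Android looks
like Linux, for example).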

 Multilogin, Kameleo and others 

  *  General navigator and window properties - As per the Multilogin
    documentation, custom browser builds typically lag behind the
    latest additions made by browser vendors. In this case a modified
    Chromium M7X is used (almost 10 versions behind at the time of
    writing).
  *  Font masking - Font fingerprinting still leaks host OS due to
    use of different font rendering backends on Win/Lin/Mac. However,
    the basic "font whitelisting" technique can help to slightly
    rotate browser fingerprint.
  *  Inconsistencies - Profile misconfiguration leads to early
    property/behavior inconsistency detection.
  *  Native extensions - Unlike puppeteer-extra-plugin-stealth
    custom Chromium builds such as ML and Kameleo provide at most an
    override for native plugins and extensions shipped with Google
    Chrome.
  *  AudioContext APIs and WebGL property override - Manipulation of
    original canvas and audio waveform can be detected with custom
    JS.
  * [?] Audio and GL noise
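
One way custom JS can notice noise-based canvas manipulation is a
round-trip test: write known pixels, read them back, and count the
differences. Here is a minimal sketch of the comparison step only. In
a browser the two buffers would come from putImageData and
getImageData; here the noisy readback is simulated so the snippet runs
on its own. All function names are hypothetical, and I am assuming
that fully opaque pixels survive the round trip unchanged in an
unmodified browser.

```javascript
// Build a deterministic, fully opaque RGBA test pattern. Opaque pixels are
// used because alpha premultiplication can legitimately alter translucent ones.
function makeOpaquePattern(pixelCount) {
  const data = new Uint8ClampedArray(pixelCount * 4);
  for (let p = 0; p < pixelCount; p++) {
    data[p * 4] = (p * 37) % 256;     // red
    data[p * 4 + 1] = (p * 91) % 256; // green
    data[p * 4 + 2] = (p * 53) % 256; // blue
    data[p * 4 + 3] = 255;            // alpha
  }
  return data;
}

// Count bytes that differ between what was written and what was read back.
function countAlteredBytes(written, readBack) {
  if (written.length !== readBack.length) {
    throw new Error("pixel buffers differ in length");
  }
  let altered = 0;
  for (let i = 0; i < written.length; i++) {
    if (written[i] !== readBack[i]) altered++;
  }
  return altered;
}

// Stand-in for a noise-based evasion: flips the lowest bit of some color bytes.
function simulateNoisyReadback(data) {
  const copy = Uint8ClampedArray.from(data);
  for (let i = 0; i < copy.length; i += 7) {
    if (i % 4 !== 3) copy[i] ^= 1; // leave the alpha channel untouched
  }
  return copy;
}

const pattern = makeOpaquePattern(16);
// Honest readback: nothing altered, prints 0.
console.log(countAlteredBytes(pattern, Uint8ClampedArray.from(pattern)));
// Noisy readback: prints a number greater than 0.
console.log(countAlteredBytes(pattern, simulateNoisyReadback(pattern)));
```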

tbd (if you have an active subscription to any of these services and
don't mind sharing an account, drop me an email)

 Fingerprint test pages

These websites may be useful for testing fingerprinting techniques
against web scraping software:

  * https://pixelscan.net/ - Not 100% reliable, as it often reports
    Chrome as "inconsistent" after a new update, but worth checking
    as the author adds interesting new detection features every now
    and then
  * https://browserleaks.com/ - Needs no introduction
  * https://f.vision/ - Good quality test page
  * https://www.ipqualityscore.com/ip-reputation-check - Commercial
    service with a free reputation check against popular blacklists
  * https://antcpt.com/eng/information/demo-form/recaptcha-3-test-score.html
    - ReCaptcha score as well as some interesting notes on how to
    optimize captcha solving costs
  * https://ja3er.com/ - SSL/TLS fingerprint
  * https://fingerprintjs.com/demo/ - Good for basic tests, from
    people who believe and claim they can create unique fingerprints
    "99.5%" of the time
  * https://coveryourtracks.eff.org/
  * https://www.deviceinfo.me/
  * https://amiunique.org/
  * http://uniquemachine.org/
  * http://dnscookie.com/
  * https://whatleaks.com/
  * https://kitchensink.ssl.fun/vendor/shape/fp

 Non-technical notes

I need to make a general remark to people who are evaluating or
planning to introduce anti-bot software on their websites. Anti-bot
software is nonsense. It's snake oil sold to people without technical
knowledge for big bucks.

Blocking bot traffic is based on the premise that you (or your
technology provider) can distinguish bots from real users. To make
this happen, various privacy-invasive techniques are applied. To date,
none of them has proven successful against specialized web scraping
tools. Anti-bot software is all about reducing cheap bot traffic. It
makes scraping more expensive and complicated, but does not make it
impossible.

Anti-bot software vendors use detection techniques that fall into one
of these two categories:

 Binary detection

In this case no specialized web scraping software is used. The vendor
can detect the bad traffic based on information openly disclosed by
the scraper, e.g. the User-Agent header, connection parameters, etc.

As a result, only bots that are not built to scrape a specific
website are blocked. This will make most managers happy, because the
overall amount of bad traffic goes down and it may almost look like
there is no more bot traffic on the website. Wrong.
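
A minimal sketch of what binary detection amounts to, assuming the
only input is the request headers. The token list contains substrings
that common HTTP clients and headless Chrome put in their default
User-Agent; the function name and the shape of the result are
hypothetical.

```javascript
// Substrings found in the default User-Agent of common HTTP clients and
// of Chrome running in headless mode.
const DECLARED_AUTOMATION_TOKENS = [
  "HeadlessChrome",
  "python-requests",
  "Scrapy",
  "curl/",
  "Go-http-client",
];

// headers: plain object with lower-cased header names (an assumption here).
function classifyByDeclaredIdentity(headers) {
  const userAgent = headers["user-agent"] || "";
  if (userAgent === "") {
    return { blocked: true, reason: "missing User-Agent" };
  }
  const token = DECLARED_AUTOMATION_TOKENS.find((t) => userAgent.includes(t));
  if (token) {
    return { blocked: true, reason: `User-Agent contains ${token}` };
  }
  return { blocked: false, reason: "nothing declared" };
}

console.log(classifyByDeclaredIdentity({ "user-agent": "python-requests/2.26.0" }));
console.log(
  classifyByDeclaredIdentity({
    "user-agent":
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " +
      "(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
  })
);
```

The second request passes simply because it sets a browser-like
header, which is exactly why this category only stops bots that make
no effort to hide.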

 Traffic clustering

More advanced web scrapers make use of residential proxies and
implement complex evasion techniques to fool anti-bot software into
thinking that the web scraper is a real user. No detection mechanism
exists to get around this due to technical limitations of web
browsers.

In this case, most of the time the vendor will be only able to
cluster the bad traffic by finding patterns in bot traffic and
behavior. This is where browser fingerprinting comes into play. The
problem with banning the traffic here is that it may turn out to be a
risky operation when bots are successfully mimicking real users.
There is a chance that by blocking bots the website will become
unavailable to real visitors.
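
A minimal sketch of the clustering idea, assuming each logged request
carries an IP address and some precomputed fingerprint string. The
field names, the `minDistinctIps` threshold and both function names
are hypothetical; real vendors use far richer features.

```javascript
// Group requests by fingerprint and track how many distinct IPs share it.
function clusterByFingerprint(requests) {
  const clusters = new Map();
  for (const { ip, fingerprint } of requests) {
    if (!clusters.has(fingerprint)) {
      clusters.set(fingerprint, { hits: 0, ips: new Set() });
    }
    const cluster = clusters.get(fingerprint);
    cluster.hits += 1;
    cluster.ips.add(ip);
  }
  return clusters;
}

// Report fingerprints seen from at least minDistinctIps addresses, a
// pattern that a rotating proxy pool with one browser profile would produce.
function suspiciousClusters(clusters, minDistinctIps) {
  return [...clusters.entries()]
    .filter(([, c]) => c.ips.size >= minDistinctIps)
    .map(([fingerprint, c]) => ({
      fingerprint,
      hits: c.hits,
      distinctIps: c.ips.size,
    }));
}

// Addresses below come from ranges reserved for documentation.
const log = [
  { ip: "203.0.113.10", fingerprint: "fp-A" },
  { ip: "203.0.113.11", fingerprint: "fp-A" },
  { ip: "198.51.100.7", fingerprint: "fp-A" },
  { ip: "203.0.113.10", fingerprint: "fp-A" },
  { ip: "198.51.100.9", fingerprint: "fp-B" },
];
console.log(suspiciousClusters(clusterByFingerprint(log), 3));
```

Note that a popular phone model on a mobile carrier produces the same
picture, one fingerprint across many addresses, so banning the whole
cluster can lock out real visitors.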

 Gateways, captchas & co

If you think this is the way to go, google "captcha resolve api".

 Support

If you have problems scraping a specific website, write me a short
email at dniespodziany@gmail.com. Let's have a quick tete-a-tete
consultation via Skype.

Have I mentioned a star would be appreciated? :-)

Ethereum address 0x380a4b41fB5e0e1EB8c616eBD56f62f8F934Bab6
