https://github.com/niespodd/browser-fingerprinting

niespodd / browser-fingerprinting (Public)

Analysis of Bot Protection systems with available countermeasures. How to defeat anti-bot system and get around browser fingerprinting scripts when scraping the web?

niespodd.github.io/browser-fingerprinting/

276 stars, 17 forks, 51 commits. Latest commit: dc9fbb0, Oct 20, 2021.
# Avoiding bot detection: How to scrape the web without getting blocked?

Whether you're just starting to build a web scraper from scratch and wondering why your solution isn't working, or you've been working with crawlers for a while and are stuck on a page that says you're a bot and won't let you go any further, keep reading.

Anti-bot solutions have evolved in recent years. More and more websites are introducing security measures: from simple ones, such as filtering IP addresses according to their geolocation, to advanced ones based on in-depth analysis of browser parameters and behavioral analysis. All this makes scraping web content more difficult and costly than a few years ago. Nevertheless, it is still possible. Here I highlight a few tips that you may find helpful.

## Where to begin building an undetectable bot?

Below you can find a curated list of services that I used to get around different anti-bot protections.
Depending on your use-case, you may need one of the following:

| Scenario | Solution | Example use-case |
|---|---|---|
| Short-lived sessions without auth | Pool of rotating IP addresses | This comes in handy when you scrape websites like Amazon, Walmart or public LinkedIn pages, that is, any website where no sign-in is required. You plan to make a high number of short-lived sessions and can afford being blocked every now and then. |
| Geographically restricted websites | Region-specific pool of IP addresses | This is useful when the website uses a firewall similar to the one from Cloudflare to block an entire geography from accessing it. |
| Long-lived sessions after sign-in | Repeatable pool of IP addresses and stable set of browser fingerprints | The most common scenario here is social media automation, e.g. you build a tool to automate social media accounts to manage ads more efficiently. |
| Javascript-based detection | Use of popular evasion libraries, similar to puppeteer-extra-plugin-stealth | There are a number of websites utilizing FingerprintJS that can be easily bypassed when you employ open-source plugins, such as the aforementioned puppeteer stealth plugin, to work with your existing software. |
| Detection with browser fingerprinting techniques | Natural looking browser fingerprints. That is, having covered the whole surface that is being validated by the installed Javascript solution on the target website. | These are some of the most advanced cases. Mainstream examples are credit card processors such as Adyen or Stripe. A very sophisticated browser fingerprint is being created to detect credit fraud, or prompt additional authorization from the user. |
| Unique set of detection techniques | Specialized bot software that targets the unique detection surface of the target website. | Good examples are sneakers marketplace websites and e-commerce shops, reportedly being under heavy attack from custom made bot software. |
| Simple custom-made detection techniques | Before diving into any of the above, if you are targeting a smaller website, it is very likely that all you need is a Scrapy script with tweaks and a cheap data-center proxy, and you are good to go. | - |

Once you have decided what type of evasion your project needs, you can use the list below to pick the best provider:

### Recommended services

| Type | Service | Note |
|---|---|---|
| Proxy | BrightData (formerly Luminati Networks) | One of the most reliable, stable and recommended proxy providers. Best to begin there and, if it turns out to be too pricey, move to cheaper alternatives. |
| Proxy | Global Peer to Business Proxy Network - infatica.io | An alternative to BrightData that is three times cheaper, but do mind their terms of use. |
| Proxy | Oxylabs | Competitor to BrightData with a very similar pricing model. Rumor has it that they have a better TCP fingerprinting masking mechanism in place. |
| Scraping as a service | ScrapingBee | One of the most advanced stealthy scraping-as-a-service offerings. At times it may be cheaper than building a dedicated scraping solution - they do not charge for the amount of traffic used. |
| Scraping as a service | Apify.io | Handy when your project is about one-off scraping. Their data understanding algorithm makes extracting data a breeze. |
| De-captcha as a service | Anti Captcha: Captcha Solving Service. Bypass reCAPTCHA, FunCaptcha (...) Bitcoin accepted. | Self-explanatory. |
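The short-lived and long-lived scenarios above differ mostly in how a proxy is picked for each session. Here is a minimal JavaScript sketch of both strategies. Everything in it is an assumption made for illustration: the `*.example` proxy URLs are placeholders rather than real provider endpoints, and `createRotator` and `stickyProxy` are made-up helper names, not part of any library mentioned in this document.

```javascript
// Hypothetical proxy endpoints; replace with the gateway URLs from your provider.
const PROXIES = [
  "http://proxy-a.example:8000",
  "http://proxy-b.example:8000",
  "http://proxy-c.example:8000",
];

// Short-lived sessions without auth: hand out the next proxy in the pool
// for every new session (simple round-robin rotation).
function createRotator(pool) {
  let next = 0;
  return () => {
    const proxy = pool[next];
    next = (next + 1) % pool.length;
    return proxy;
  };
}

// Long-lived sessions after sign-in: the same account should keep the same IP,
// so map the account id to a proxy deterministically (FNV-1a hash).
function stickyProxy(pool, accountId) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < accountId.length; i++) {
    hash ^= accountId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return pool[hash % pool.length];
}

const nextProxy = createRotator(PROXIES);
const firstSession = nextProxy();   // first proxy in the pool
const secondSession = nextProxy();  // second proxy in the pool
const aliceProxy = stickyProxy(PROXIES, "alice@example.com");
console.log(firstSession, secondSession, aliceProxy);
```

With Chromium-based tools the chosen URL would typically be passed through the `--proxy-server` launch argument. Note that `stickyProxy` remaps accounts whenever the pool size changes, so a real setup would probably persist the account-to-proxy assignment (together with the browser fingerprint) instead of recomputing it.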
### List of anti-bot software providers

This is a non-exhaustive list of companies that provide the most advanced anti-bot solutions for businesses ranging from smaller e-commerce sites to Fortune 500 companies:

* Akamai Bot Manager by Akamai
* Advanced Bot Protection by Imperva (former Distil Networks)
* DataDome Bot Protection
* PerimeterX
* Shape Security
* Cloudflare Bot Management
* Barracuda Advanced Bot Protection
* HUMAN
* Kasada
* Alibaba Cloud Anti-Bot Service
* Travatar

### How do you know who is getting you blocked?

Join extra.community. An automated tester there, Botty McBotface, uses several complicated techniques to determine exactly which protection a tested website uses (credits to berstend and others from #insiders).

### Available stealth browsers with automation features

**Important:** You use this software at your own risk. Some of these browsers contain malware, just FYI. I do not recommend using them.

Stealth Browser | Puppeteer | Selenium | Evasions | SDK/Tooling | Origin

GoLogin [?] [?] + Incogniton [?] [?] [?] ClonBrowser [?] [?] [?] MultiLogin [?] [?] [?] + Indigo Browser [?] [?] [?] GhostBrowser Kameleo [?] [?] [?] AntBrowser CheBrowser /[?]

Legend: - Evasion based on noise. - No. [?] - Acceptable (with support libraries or not). - Very nice.

A star on this repo will be appreciated!

## Technical insights into bypassing bot detection

Here I study various aspects of evasion techniques used to get around bot detection systems deployed by major websites. I cover both technical and non-technical matters, including recommendations, references to scientific papers and more.

The technical findings that I am sharing below are based on observations from running web scraping scripts for a few months against websites protected by the major anti-bot solution vendors. I constantly add stuff to this section.
Over time I will try to make it look and feel more structured.

### Random, maybe useful

* Cap FPS for Chromium with software rendering --use-gl=swiftshader - Limit CPU usage from SwiftShader by redraw freq. of Chromium in AVD
* Unlike some public comments on that matter, Chrome DevTools Protocol actually works on AVDs with puppeteer
* Abusing GPU cache to create persistent tracking identifiers

### puppeteer-extra-plugin-stealth

[?] Win / Fail / Tie:

* [?] Client Hints - Shipped recently. In line with Chromium cpp implementation.
* [?] General navigator and window properties
* [?] Chrome plugins and native extensions - This includes the Widevine DRM extension as well as Google Hangouts, safe-browsing etc.
* p0f - detect host OS from TCP struct - Not possible to fix via Puppeteer APIs. Used in Akamai Bot Manager to match against JS and browser headers (Client Hints and User-Agent). There is a detailed explanation of the issue. The most reliable evasion seems to be not spoofing the host OS at all, or using OSfooler-ng.
* Browser dimensions - Although the stealth plugin provides a window.outerdimensions evasion, it won't work without correct config on a non-default OS in headless mode; it almost always fails when viewport size >= screen resolution (low screen resolution display on the host).
* core-estimator - This can detect a mismatch between navigator.hardwareConcurrency and the SW/WW execution profile. Not possible to limit/bump the ServiceWorker/WebWorker thread limit via existing Puppeteer APIs.
* WebGL extensions profiling - desc. tbd
* RTCPeerConnection when behind a proxy - Applies to both SOCKS and HTTP(S) proxies.
* Performance.now - desc. tbd (red pill)
* WebGL profiling - desc. tbd
* Behavior Detection - desc. tbd (events, params, ML+AI buzz)
* Font fingerprinting - desc. tbd (list+version+renderer via HTML & canvas)
* Network Latency - desc. tbd (integrity check: proxy det., JS networkinfo, dns resolv profiling & timing)
* Battery API - desc.
tbd
* Gyroscope and other (mostly mobile) device sensors - desc. tbd

### Multilogin, Kameleo and others

* General navigator and window properties - As per Multilogin documentation, custom browser builds typically lag behind the latest additions by browser vendors. In this case a modified Chromium M7X is used (almost 10 versions behind at the time of writing).
* Font masking - Font fingerprinting still leaks the host OS due to the use of different font rendering backends on Win/Lin/Mac. However, the basic "font whitelisting" technique can help to slightly rotate the browser fingerprint.
* Inconsistencies - Profile misconfiguration leads to early detection of property/behavior inconsistencies.
* Native extensions - Unlike puppeteer-extra-plugin-stealth, custom Chromium builds such as ML and Kameleo provide at most an override for native plugins and extensions shipped with Google Chrome.
* AudioContext APIs and WebGL property override - Manipulation of the original canvas and audio waveform can be detected with custom JS.
* [?]
Audio and GL noise tbd (if you have an active subscription to any of these services and don't mind sharing an account, drop me an email)

### Fingerprint test pages

These websites may be useful for testing fingerprinting techniques against web scraping software.

| Test page | Notes |
|---|---|
| https://pixelscan.net/ | Not 100% reliable, as it often displays "inconsistent" to Chrome after a new update, but worth checking as the author adds new interesting detection features every now and then |
| https://browserleaks.com/ | Doesn't need introduction |
| https://f.vision/ | Good quality test page from some guys |
| https://www.ipqualityscore.com/ip-reputation-check | Commercial service with free reputation check against popular blacklists |
| https://antcpt.com/eng/information/demo-form/recaptcha-3-test-score.html | ReCaptcha score as well as some interesting notes on how to optimize captcha solving costs |
| https://ja3er.com/ | SSL/TLS fingerprint |
| https://fingerprintjs.com/demo/ | Good for basic tests - from people who believe and claim they can create unique fingerprints "99.5%" of the time |
| https://coveryourtracks.eff.org/ | - |
| https://www.deviceinfo.me/ | - |
| https://amiunique.org/ | - |
| http://uniquemachine.org/ | - |
| http://dnscookie.com/ | - |
| https://whatleaks.com/ | - |
| https://kitchensink.ssl.fun/vendor/shape/fp | - |

## Non-technical notes

I need to make a general remark to people who are evaluating and/or planning to introduce anti-bot software on their websites. Anti-bot software is nonsense. It's snake oil sold to people without technical knowledge for heavy bucks.

Blocking bot traffic is based on the premise that you (or your technology provider) can distinguish bots from real users. To make this happen, various privacy-invasive techniques are applied. To date, none of them has been proven successful against specialized web scraping tools. Anti-bot software is all about reducing cheap bot traffic. It makes the process of scraping more expensive and complicated, but does not make it impossible.
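To make the "distinguish bots from real users" premise concrete, here is a minimal JavaScript sketch of the kind of cross-check mentioned in the technical notes: comparing the OS implied by the User-Agent with the Client Hints platform and with an OS guess from the TCP handshake (the p0f item), plus a sanity check of viewport size against screen size. This is only an illustration under stated assumptions, not how any particular vendor works. The `profile` object shape and both function names are invented for this example, and the viewport rule is a simplified version of the "viewport size >= screen resolution" failure described earlier.

```javascript
// Coarse OS family from a User-Agent string. Order matters: Android UAs also
// contain "Linux", and iPhone/iPad UAs contain "like Mac OS X".
function osFromUserAgent(ua) {
  if (/Windows NT/.test(ua)) return "Windows";
  if (/Android/.test(ua)) return "Android";
  if (/iPhone|iPad/.test(ua)) return "iOS";
  if (/Mac OS X/.test(ua)) return "macOS";
  if (/Linux/.test(ua)) return "Linux";
  return "unknown";
}

// `profile` is a hypothetical shape invented for this sketch:
//   userAgent         - User-Agent header
//   chPlatform        - Sec-CH-UA-Platform value with quotes removed (optional)
//   tcpOs             - OS family guessed from the TCP handshake, p0f-style (optional)
//   viewport / screen - { width, height } as reported by JavaScript
function findInconsistencies(profile) {
  const issues = [];
  const uaOs = osFromUserAgent(profile.userAgent);

  if (profile.chPlatform && profile.chPlatform !== uaOs) {
    issues.push(`Client Hints platform "${profile.chPlatform}" does not match User-Agent OS "${uaOs}"`);
  }
  if (profile.tcpOs && profile.tcpOs !== uaOs) {
    issues.push(`TCP-level OS "${profile.tcpOs}" does not match User-Agent OS "${uaOs}"`);
  }
  const { viewport, screen } = profile;
  if (viewport.width > screen.width || viewport.height > screen.height) {
    issues.push("viewport is larger than the screen");
  }
  return issues;
}

// A profile that does not spoof anything.
const honest = {
  userAgent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
  chPlatform: "Linux",
  tcpOs: "Linux",
  viewport: { width: 1280, height: 720 },
  screen: { width: 1920, height: 1080 },
};

// A Linux host pretending to be Windows, with a viewport bigger than its display.
const spoofed = {
  userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
  chPlatform: "Windows",
  tcpOs: "Linux",
  viewport: { width: 1920, height: 1080 },
  screen: { width: 1366, height: 768 },
};

console.log(findInconsistencies(honest));
console.log(findInconsistencies(spoofed));
```

The spoofed profile passes the header-level check because the User-Agent and Client Hints agree with each other, yet it still trips on the layers that cannot be patched from JavaScript. That is the reason the notes above suggest not spoofing the host OS at all.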
Anti-bot software vendors use detection techniques that fall into one of these two categories:

### Binary detection

No specialized web scraping software is used. The vendor can detect the bad traffic based on information openly disclosed by the scraper, e.g. User-Agent header, connection parameters etc. As a result, only bots that are not built to scrape a specific website are blocked. This will make most managers happy, because the overall amount of bad traffic goes down and it may almost look like there is no more bot traffic on the website. Wrong.

### Traffic clustering

More advanced web scrapers make use of residential proxies and implement complex evasion techniques to fool anti-bot software into thinking that the web scraper is a real user. No detection mechanism exists to get around this due to technical limitations of web browsers.

In this case, most of the time the vendor will only be able to cluster the bad traffic by finding patterns in bot traffic and behavior. This is where browser fingerprinting comes into play. The problem with banning the traffic here is that it is risky when bots successfully mimic real users: there is a chance that by blocking bots the website will become unavailable to real visitors.

### Gateways, captchas & co

If you think this is the way to go, google "captcha resolve api".

## Support

If you have problems with scraping a specific website, write me a short email at dniespodziany@gmail.com. Let's have a quick tete-a-tete consultation via Skype.

Have I mentioned a star would be appreciated? :-)

Ethereum address 0x380a4b41fB5e0e1EB8c616eBD56f62f8F934Bab6