https://github.com/reanalytics-databoutique/webscraping-open-project

reanalytics-databoutique / webscraping-open-project: Repository of open knowledge about web scraping in Python. Latest commit: pigivinci, "Update PerimeterX.md", c9430bb, May 27, 2022.
# Web scraping with Python open knowledge

During the past several years at Re Analytics we have spent a lot of time working out best practices for web scraping, to make it scalable and efficient to maintain. It is a cat-and-mouse game: you need to stay up to date on the latest developments, yet the information you need is scattered thinly across the net. For this reason, we started to centralize all the information we collected and the best practices we developed, to build a point of reference for the Python web scraping community.

Feel free to add your contributions to this repository: sharing each other's knowledge will boost its value for everyone.

## Why Use Some Best Practices

Our goal is to scrape as many sites as we can, so we have always looked for the following key elements of a successful large-scale web scraping project. At the moment they are focused on scraping e-commerce websites, because that is what we have done for years, but we are open to integrating best practices derived from other industries.
* Resilient execution: we want the code to be as low-maintenance as possible.
* Faster maintenance: we work smarter if we find standard solutions and do not have to decode creative creations every time.
* Regulatory compliance: web scraping is a serious matter, so we need to know exactly which tools are used.

These practices are always evolving, so feel free to suggest yours.

## 1. Preliminary Study

### 1.1. Technology Stack

Evaluate the target website's technology stack with the Wappalyzer Chrome extension, paying particular attention to the "Security" block. When a technology is detected under "Security", check whether the list of technologies in this repository has a specific solution for it.

### 1.2. API search

Does the website expose internal or public APIs for fetching price or product data? If so, this is the best available scenario, and we should use them to gather the data.

### 1.3. JSON in HTML Search

Websites sometimes embed JSON in their HTML, so JSON is not only found behind an API. Extracting data from it makes the scraper more stable.

### 1.4. Pagination

How does the website handle pagination of the product catalogue? Internal services that return just the HTML of the catalogue are preferable to loading the full page.

## 2. Code Best Practices

### 2.1. JSON

Use JSON if available (in the HTML of the page or from an API). It is less prone to change.

### 2.2. XPaths

Use XPath expressions rather than CSS selectors, for clearer code.

### 2.3. Indent using tabs

Use tabs for indentation instead of spaces: the code weighs less and badly indented structures are easier to detect.

### 2.4. No formatting rules in numeric fields

Do not add rules for cleaning prices or other numeric fields: formats vary between countries and are not standardized, so leave this task to the post-scraping phase in the databases.

### 2.5. Product List Page wins over Single Product Page

Load as few pages as you can. Check whether all the fields you need are available from the product catalogue pages, and try to avoid entering the single product pages.

## 3. Tools

### 3.1. Headless Python scrapers

* Scrapy
* scrapy_splash

### 3.2. Python scrapers with fully rendered browsers

* Playwright
* playwright_stealth

### 3.3. Non-Python scrapers with fully rendered browsers

* Puppeteer

## 4. Common anti-bot software & techniques

### 4.1. Anti-bot software

* Akamai
* Cloudflare
* Datadome
* PerimeterX
* Forter
* Riskified

### 4.2. Anti-bot techniques

* Canvas Fingerprinting
* WebGL
* Browser Fingerprinting

## 5. Test websites

Here is a list of websites where you can test your scraper and find out how many checks it passes:

* https://bot.incolumitas.com/: one of the most complete sets of tests for your scrapers
* https://pixelscan.net/: checks your IP and your machine
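
Looking back at the practices on JSON embedded in HTML and on leaving numeric fields unformatted, the following is a minimal sketch of how they might combine, not a reference implementation. It assumes the page embeds product data as schema.org JSON-LD in a `<script type="application/ld+json">` tag, which is only one of several ways sites embed JSON. The names `JsonLdExtractor`, `extract_products` and `SAMPLE_HTML` are hypothetical, and the sample page is invented for illustration. Only the Python standard library is used.

```python
import json
from html.parser import HTMLParser

# Invented sample page; real sites may embed JSON differently.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Example shoe",
 "offers": {"price": "1.299,00", "priceCurrency": "EUR"}}
</script>
</head><body><div class="price">1.299,00 EUR</div></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append("".join(self._buffer))


def extract_products(html):
    """Return name, price and currency for each JSON-LD Product in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()

    products = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict) or item.get("@type") != "Product":
                continue
            offers = item.get("offers") or {}
            if isinstance(offers, list):
                offers = offers[0] if offers else {}
            if not isinstance(offers, dict):
                offers = {}
            products.append({
                "name": item.get("name"),
                # Kept exactly as found: no price cleaning at scraping time.
                "price": offers.get("price"),
                "currency": offers.get("priceCurrency"),
            })
    return products


if __name__ == "__main__":
    print(extract_products(SAMPLE_HTML))
```

The price is stored as the raw string from the page, so any locale-specific parsing can happen later in the database, and the visible `<div class="price">` markup is never touched, which is what makes the JSON route less sensitive to layout changes.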