https://github.com/reanalytics-databoutique/webscraping-open-project


reanalytics-databoutique / webscraping-open-project Public


Repository of open knowledge about web scraping in Python


Latest commit: pigivinci, "Update PerimeterX.md" (c9430bb), May 27, 2022

 Web scraping with Python open knowledge

Over the past several years at Re Analytics we have spent a lot of
time working out best practices for web scraping, to make it scalable
and efficient to maintain. It is a cat-and-mouse game: you always
need to be up to date on the latest developments, yet the information
you need is scattered across the net. For this reason we started
centralizing the information we have collected and the best practices
we have developed, to build a point of reference for the Python web
scraping community. Feel free to add your contributions: sharing
knowledge increases the value of this repository for everyone.

 Why Use Best Practices

Our goal is to scrape as many sites as we can, so we have always
looked for the key elements below, which make a large-scale web
scraping project successful. At the moment these practices focus on
scraping e-commerce websites, because that is what we have done for
years, but we are open to integrating best practices from other
industries.

  * Resilient execution: we want the code to need as little
    maintenance as possible.
  * Faster maintenance: we work smarter when we use standard
    solutions and do not have to decode a creative one-off every
    time.
  * Regulatory compliance: web scraping is a serious matter, so we
    need to know exactly which tools are used.

The following practices are always evolving, so feel free to suggest
yours.

 1. Preliminary Study

 1.1. Technology Stack

Evaluate the technology stack of the target website using the
Wappalyzer Chrome extension, paying particular attention to the
"Security" block. When a technology is detected under "Security",
check whether the list of technologies in this repository (see
section 4.1) includes a specific solution for it.

 1.2. API search

Does the website have internal or public APIs for fetching price and
product data? If so, this is the best possible scenario, and we
should use them to gather the data.
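
As a minimal sketch of what consuming such an API could look like with the standard library: the endpoint URL and the `products`, `name` and `price` keys below are hypothetical, since every site uses its own paths and payload shape (usually discoverable in the browser's network tab). The sample payload stands in for a real response so the example runs offline.

```python
import json
import urllib.request


def fetch_json(url, timeout=10.0):
    """Download a URL and decode the response body as JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_products(payload):
    """Keep only the fields we need, leaving values exactly as the API returns them."""
    return [
        {"name": product.get("name"), "price": product.get("price")}
        for product in payload.get("products", [])
    ]


# Hypothetical response body, standing in for
# fetch_json("https://example.com/api/products?page=1").
sample_payload = json.loads(
    '{"products": [{"name": "Shoe", "price": "59.90", "sku": "A1"}]}'
)
products = extract_products(sample_payload)
```

Keeping the download and the field extraction in separate functions makes the extraction easy to check against saved responses.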

 1.3. JSON in HTML Search

Websites sometimes embed JSON in their HTML, whether or not they
expose an API. Reading the data from it makes the scraper more
stable.
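
One common form of embedded JSON is a `<script type="application/ld+json">` tag. The sketch below assumes that form (sites also embed JSON in other script tags or attributes) and uses a made-up HTML snippet; the class name `JsonLdExtractor` is only illustrative.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.documents.append(json.loads("".join(self._buffer)))


# Hypothetical page source; a real page would come from your HTTP client.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Shoe", "offers": {"price": "59.90", "priceCurrency": "EUR"}}
</script>
</head><body><h1>Shoe</h1></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(html)
product = extractor.documents[0]
```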

 1.4. Pagination

How does the website handle pagination of the product catalogue?
Internal services that return the HTML of the catalogue are
preferred over loading the full page.
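
A rough sketch of paging through such a service follows. It assumes the service takes a page number and returns an empty result after the last page, which will not hold for every site; `fetch_page` is a hypothetical callable that you would implement with your HTTP client, replaced here by an in-memory stand-in.

```python
def scrape_catalogue(fetch_page, max_pages=1000):
    """Yield items page by page until the service returns an empty page."""
    for page_number in range(1, max_pages + 1):
        items = fetch_page(page_number)
        if not items:
            break
        yield from items


# Stand-in for HTTP calls to the internal pagination service (made-up data).
fake_pages = {1: ["item-1", "item-2"], 2: ["item-3"]}


def fake_fetch_page(page_number):
    return fake_pages.get(page_number, [])


all_items = list(scrape_catalogue(fake_fetch_page))
```

The `max_pages` cap is a safety net against services that never return an empty page.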

 2. Code Best Practices

 2.1. JSON

Use JSON if it is available, either embedded in the page's HTML or
from an API. It is less prone to change than the page markup.

 2.2. XPATHS

Use XPath expressions rather than CSS selectors, for clearer code.
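
To illustrate, here is a small sketch using the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup; in a real scraper you would more likely use the XPath support of Scrapy selectors or lxml. The markup and class names are made up.

```python
import xml.etree.ElementTree as ET

# Well-formed snippet standing in for a product list page (hypothetical markup).
page = """
<html><body>
  <div class="product"><a href="/p/1">Shoe</a><span class="price">59,90 EUR</span></div>
  <div class="product"><a href="/p/2">Boot</a><span class="price">129,00 EUR</span></div>
</body></html>
"""

root = ET.fromstring(page)

items = []
# One XPath selects every product block; relative XPaths pick the fields inside it.
for product in root.findall(".//div[@class='product']"):
    items.append({
        "name": product.findtext("./a"),
        "url": product.find("./a").get("href"),
        "price": product.findtext("./span[@class='price']"),
    })
```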

 2.3. Indent using TABS

Use tabs instead of spaces for indentation: the files are smaller and
a badly indented structure is easier to spot.

 2.4. No formatting rules in numeric fields

Do not add rules for cleaning prices or other numeric fields: formats
differ between countries and are not standardized. Leave this task to
the post-scraping phases in the database.
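
A short illustration of the risk, with made-up values: a cleaning rule written for one country's format silently produces a wrong number for another. The function name and the item fields are hypothetical.

```python
def naive_price_to_float(raw_price):
    """A US-style cleaning rule: drop thousands commas, then parse."""
    return float(raw_price.replace(",", ""))


us_price = "1,299.00"  # one thousand two hundred ninety-nine, US formatting
it_price = "1.299,00"  # the same amount, Italian formatting

us_value = naive_price_to_float(us_price)  # parsed as intended
it_value = naive_price_to_float(it_price)  # silently wrong: parsed as 1.299

# What the scraper can emit instead: the raw string plus the context
# needed to clean it later, outside the scraper.
item = {"name": "Shoe", "price_raw": it_price, "country": "IT"}
```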

 2.5. Product List Page wins over Single Product Page

Load as few pages as you can. Check whether all the fields you need
are available on the product catalogue pages, and avoid entering the
single product page where possible.
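
One possible way to apply this, sketched with hypothetical field names and data: check which required fields the list page already provides, and request the single product page only for items that are still incomplete.

```python
REQUIRED_FIELDS = ("name", "price", "url")  # hypothetical field list


def missing_fields(item, required=REQUIRED_FIELDS):
    """Return the required fields that the list page did not provide."""
    return [field for field in required if not item.get(field)]


# Items as they might come out of a product list page (made-up data).
list_page_items = [
    {"name": "Shoe", "price": "59,90 EUR", "url": "/p/1"},
    {"name": "Boot", "price": None, "url": "/p/2"},
]

# Only these URLs would need a request to the single product page.
urls_to_visit = [item["url"] for item in list_page_items if missing_fields(item)]
```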

 3. Tools

 3.1. Headless Python scrapers

  * Scrapy
  * scrapy_splash

 3.2. Python scrapers with fully rendered browsers

  * Playwright
  * playwright_stealth

 3.3. Non-Python scrapers with fully rendered browsers

  * Puppeteer

 4. Common anti-bot software & techniques

 4.1. Anti-bot Software

  * Akamai
  * Cloudflare
  * DataDome
  * PerimeterX
  * Forter
  * Riskified

 4.2. Anti-bot Techniques

  * Canvas Fingerprinting
  * WebGL
  * Browser Fingerprinting

 5. Test websites

Here is a list of websites where you can test your scraper and find
out how many checks it passes:

  * https://bot.incolumitas.com/: one of the most complete sets of
    tests for your scrapers
  * https://pixelscan.net/: checks your IP and your machine
