hngopher.com

       [HN Gopher] Show HN: PyDoll - Async Python scraping engine with ...
       ___________________________________________________________________
        
       Show HN: PyDoll - Async Python scraping engine with native CAPTCHA
       bypass
        
       Author : thalissonvs
       Score  : 107 points
       Date   : 2025-06-10 14:01 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | hk1337 wrote:
       | > Say goodbye to webdriver compatibility nightmares
       | 
       | That's cool but Chrome is the only browser I have had these
       | issues with. We have a cron process that uses selenium, initially
       | with Chrome, and every time there was a chrome browser update we
       | had to update the web driver. I switched it to Firefox and
       | haven't had to update the web driver since.
       | 
       | I like the async portion of this but this seems like
       | MechanicalSoup?
       | 
       | *EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.
        
         | VladVladikoff wrote:
         | I had the same problem and just added a few lines of code which
         | check the version and update it if required.
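
A minimal sketch of the kind of version check described above, assuming the inputs are the text printed by `chrome --version` and `chromedriver --version`. The function names are hypothetical, and the download/update step is deliberately left out.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'Google Chrome 125.0.6422.60'."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def driver_needs_update(browser_output: str, driver_output: str) -> bool:
    """Return True when the browser and driver major versions differ."""
    return major_version(browser_output) != major_version(driver_output)
```

A cron job could call `driver_needs_update` before each run and fetch a matching driver only when it returns True.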
        
         | thalissonvs wrote:
         | I don't think it's similar. The library has many other features
         | that Selenium doesn't have. It has few dependencies, which
         | makes installation faster, allows scraping multiple tabs
         | simultaneously because it's async, and has a much simpler
         | syntax and element searching, without all the verbosity of
         | Selenium. Even for cases that don't involve captchas, I still
         | believe it's definitely worth using.
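
The point about scraping several tabs at once can be illustrated with plain `asyncio`. This sketch does not use PyDoll's API: `scrape_tab` is a hypothetical stand-in that only sleeps, to show how concurrent coroutines overlap their waiting time.

```python
import asyncio


async def scrape_tab(url: str, delay: float) -> str:
    # Stand-in for "open a tab, load the page, extract data".
    await asyncio.sleep(delay)
    return f"scraped {url}"


async def scrape_all(urls: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and keeps results in input order.
    return await asyncio.gather(*(scrape_tab(url, 0.05) for url in urls))
```

With this shape, total wall time is roughly that of the slowest tab rather than the sum of all tabs.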
        
           | hk1337 wrote:
           | Similar to MechanicalSoup is what I meant, which uses
           | BeautifulSoup as well.
           | 
           | > without all the verbosity of Selenium
           | 
           | It's definitely verbose, but in my experience a lot of the
           | verbosity comes from developers searching from the root
           | every time, instead of finding an element once, keeping the
           | WebElement that Selenium returns, and searching within it.
        
         | at0mic22 wrote:
         | This one is not using WebDriver, but the raw Chrome debugging
         | protocol.
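
For readers unfamiliar with the protocol: a Chrome DevTools Protocol command is a JSON object with an `id`, a `method`, and `params`, sent over the browser's debugging WebSocket. The sketch below only builds such messages. `cdp_command` is a hypothetical helper, not PyDoll's code, while `Page.navigate` and `Runtime.evaluate` are standard CDP methods.

```python
import itertools
import json

# Each command needs a unique id so the reply can be matched to it.
_ids = itertools.count(1)


def cdp_command(method: str, **params) -> str:
    """Serialize one CDP command as the JSON text sent over the WebSocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


navigate = cdp_command("Page.navigate", url="https://example.com")
evaluate = cdp_command("Runtime.evaluate", expression="document.title")
```

Sending these strings requires a WebSocket connection to a Chrome instance started with remote debugging enabled, which is omitted here.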
        
       | jdnier wrote:
       | Hi, just wondering what you're thinking about how your tool might
       | be abused.
        
         | thalissonvs wrote:
         | Well, it really depends on the user; there are many cases where
         | this can be useful. Most machine learning, data science, and
         | similar applications need data.
        
           | wang_li wrote:
           | >Most machine learning, data science, and similar
           | applications need data.
           | 
           | So. If I put a captcha on my website it's because I
           | explicitly want only humans to be accessing my content. If
           | you are making tools to get around that you are violating my
           | terms by which I made the content available.
           | 
           | No one should need a captcha. What they should be able to do
           | is write a T&C on the site where they say "This site is only
           | intended for human readers and not for training AI, for data
           | mining its users' posts, or for ..... and if you do use it
           | for any of these you agree to pay me $100,000,000,000." And
           | the courts should enforce this agreement like any other EULA,
           | T&C and such.
        
             | elbear wrote:
             | From what I remember a court in the US ruled that scraping
             | is legitimate use. I don't know the specifics, I just
             | remember reading this.
        
               | kej wrote:
               | It's far more nuanced than the headlines from that case
               | made it seem. Here is a good overview:
               | https://mccarthylg.com/is-web-scraping-legal-a-2025-breakdow...
        
           | mrweasel wrote:
           | You know that the captcha is there to prevent you from doing
           | e.g. automated data mining, depending on the site obviously.
           | In any case, you actively seek to bypass a feature put there
           | by the website to prevent you from doing what you're doing,
           | and I think you know that. Does that not give you any moral
           | concerns?
           | 
           | If you really want/need the data, why not contact the site
           | owner and make some sort of arrangement? We hosted a number of
           | product images, many of which we took ourselves, something
           | that other sites wanted. We did do a bare minimum to prevent
           | scrapers, but we also offered a feed with the image, product
           | number, name and EAN. We charged a small fee, but you then
           | got either an XML feed or a CSV and you could just pick out
           | the new additions and download those.
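
Picking out the new additions from such a CSV feed could look roughly like this. The column names (`product_number`, `name`, `ean`, `image_url`) are assumptions for illustration; the actual feed layout is not given in the thread.

```python
import csv
import io


def new_products(feed_csv: str, known_numbers: set[str]) -> list[dict[str, str]]:
    """Return feed rows whose product number has not been seen before."""
    reader = csv.DictReader(io.StringIO(feed_csv))
    return [row for row in reader if row["product_number"] not in known_numbers]
```

A consumer would then download only the `image_url` of each returned row and add its product number to the known set.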
        
             | lazyasciiart wrote:
             | Because Facebook isn't open to making arrangements
        
             | thalissonvs wrote:
             | I'm not actually bypassing the captcha with reverse
             | engineering or anything like that, much less integrating
             | with external services. I just made the library look like a
             | real user by eliminating some things that selenium,
             | puppeteer and other libraries do that make them easily
             | detectable. You can still do different types of blocking,
             | such as blocking based on IP address, rate limiting, or
             | even using a captcha that requires a challenge, such as
             | reCAPTCHA v2.
        
         | Galanwe wrote:
         | Well it can be abused of course, but captchas are used abusively
         | as well, so I would say it's fair game.
         | 
         | Lots of use cases for scraping are not DoS or information
         | stealing, but mere automation.
         | 
         | Proof of work should be used in these cases: it deters massive
         | scraping abuse by making it too expensive at scale, while
         | allowing legitimate small scale automation.
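
A hashcash-style sketch of the proof-of-work idea, assuming SHA-256 and a difficulty expressed as leading zero bits. The `challenge:nonce` encoding and the function names are illustrative choices, not any particular site's scheme.

```python
import hashlib
import itertools


def _meets_target(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    # Requiring the hash to be below 2**(256 - bits) means `bits` leading zero bits.
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce; expected cost doubles with each extra bit."""
    for nonce in itertools.count():
        if _meets_target(challenge, nonce, difficulty_bits):
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash, so checking stays cheap."""
    return _meets_target(challenge, nonce, difficulty_bits)
```

The asymmetry is the point: one request costs the client a few thousand hashes at 12 bits, which is negligible once but adds up across millions of requests.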
        
         | voidmain0001 wrote:
         | I will be using Pydoll for the following legitimate use case: a
         | franchisee is given access to their data as controlled by the
         | franchise through a web site. The franchisee uses browser
         | automation to retrieve its data but now the franchise has
         | deployed a WAF that blocks Chrome webdriver. This is not a
         | public web site and the data is not public so it frustrates the
         | franchisee because it just wants its data which is paid for by
         | its franchisee fees.
        
         | mannyv wrote:
         | Gee, I have this computer thing. How can it be abused?
        
           | e9a8a0b3aded wrote:
           | oi_oi_oi_got_a_licence_chum.jpg
        
         | bobajeff wrote:
         | Hi, as a non-webdev I'd like to know: wouldn't rate limiting
         | make this a non-concern?
        
           | mrweasel wrote:
           | I still don't want you to create 1000 nonsense accounts,
           | even if you can only create 100 per hour.
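
To make the exchange concrete, here is a sliding-window rate limiter sketch using the 100-per-hour figure from the comment above. The class name is hypothetical and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` events within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self.events = deque()  # timestamps of accepted events, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

This caps the rate, not the total: a patient client still gets 1000 signups through in ten hours, which is the objection being raised.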
        
             | overfeed wrote:
             | Then you need to level up & have defense in depth instead
             | of relying on security through obscurity.
             | 
             | On the public internet, web clients are _user_ agents, and
             | not all users are benign. This is an arms race: asking the
             | other side to unilaterally disarm is unlikely to work, so
             | you change what you can control.
        
         | wesselbindt wrote:
         | I am also wondering about this, and in case you have a chef's
         | knife in your kitchen, I would also like to hear if you have
         | any comment on how that may be abused.
        
       | bobbyraduloff wrote:
       | Is there a write up on how you deal with the captchas?
        
         | thalissonvs wrote:
         | You can check the official documentation; there's a 'Deep
         | Dive' section.
        
       | whall6 wrote:
       | The web scraping arms race continues.
        
       | renegat0x0 wrote:
       | I think I will add this to my AIO package. My project crawls
       | pages: it provides a barebones page, and scraping results are
       | passed as JSON.
       | 
       | This has been very useful for me, since I don't have to set up
       | Selenium for the nth time. I just use one crawling server for
       | my projects.
       | 
       | Link:
       | 
       | https://github.com/rumca-js/crawler-buddy
        
         | thalissonvs wrote:
         | cool, left a star :)
        
       | nickspacek wrote:
       | As someone who uses ISPs and browser configurations that seem to
       | frustrate Cloudflare/reCAPTCHA to the point of frequently having
       | to solve them during day-to-day browsing, it would be interesting
       | to develop a proxy server that could automatically/transparently
       | solve captchas for me.
        
         | at0mic22 wrote:
         | The Cloudflare captcha can be easily passed with a browser
         | extension, not much different from the suggested bypass.
        
       | mfrye0 wrote:
       | Checking it out and I see you're using CDP.
       | 
       | It's been a bit, but I'm pretty sure use of CDP can be detected.
       | Has anything changed on that front, or are you aware and you're
       | just bypassing with automated captcha handling?
        
         | thalissonvs wrote:
         | CDP itself is not detectable. It turns out that other libraries
         | like puppeteer and playwright often leave obvious traces, like
         | creating contexts with common prefixes or defining attributes
         | on the navigator property.
         | 
         | I did a clean implementation on top of the CDP, without many
         | signals for tracking. I added realistic interactions, among
         | other measures.
        
       ___________________________________________________________________
       (page generated 2025-06-10 23:00 UTC)