[HN Gopher] Show HN: PyDoll - Async Python scraping engine with ...
___________________________________________________________________
Show HN: PyDoll - Async Python scraping engine with native CAPTCHA
bypass
Author : thalissonvs
Score : 107 points
Date : 2025-06-10 14:01 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| hk1337 wrote:
| > Say goodbye to webdriver compatibility nightmares
|
| That's cool but Chrome is the only browser I have had these
| issues with. We have a cron process that uses selenium, initially
| with Chrome, and every time there was a chrome browser update we
| had to update the web driver. I switched it to Firefox and
| haven't had to update the web driver since.
|
| I like the async portion of this but this seems like
| MechanicalSoup?
|
| *EDIT* MechanicalSoup doesn't necessarily have async, AFAIK.
| VladVladikoff wrote:
| I had the same problem and just added a few lines of code which
| check the version and update it if required.
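|
| A minimal sketch of such a check, assuming the driver only has to
| match the browser's major version. The function names and version
| strings are illustrative, not taken from the commenter's script:

```python
def major_version(version: str) -> int:
    """Return the leading major number of a dotted version string."""
    return int(version.strip().split(".")[0])


def driver_needs_update(browser_version: str, driver_version: str) -> bool:
    """True when the driver's major version differs from the browser's."""
    return major_version(browser_version) != major_version(driver_version)


if __name__ == "__main__":
    # Example strings only; a real script would read both versions from
    # the installed browser and driver binaries before deciding.
    print(driver_needs_update("125.0.6422.60", "124.0.6367.91"))
```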
| thalissonvs wrote:
| I don't think it's similar. The library has many other features
| that Selenium doesn't have. It has few dependencies, which
| makes installation faster, allows scraping multiple tabs
| simultaneously because it's async, and has a much simpler
| syntax and element searching, without all the verbosity of
| Selenium. Even for cases that don't involve captchas, I still
| believe it's definitely worth using.
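|
| The "multiple tabs simultaneously" point can be sketched with plain
| asyncio. This is not PyDoll's API: scrape_tab is a hypothetical
| stand-in that sleeps instead of driving a browser tab.

```python
import asyncio


async def scrape_tab(url: str, delay: float) -> str:
    """Hypothetical stand-in for one tab; sleeps instead of browsing."""
    await asyncio.sleep(delay)
    return f"scraped {url}"


async def scrape_all(urls: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(scrape_tab(u, 0.05) for u in urls))


if __name__ == "__main__":
    pages = ["https://example.com/a", "https://example.com/b"]
    print(asyncio.run(scrape_all(pages)))
```

| Because the waits overlap, total time stays close to the slowest
| tab rather than the sum of all of them.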
| hk1337 wrote:
| Similar to MechanicalSoup is what I meant, which uses
| BeautifulSoup as well.
|
| > without all the verbosity of Selenium
|
| It's definitely verbose, but in my experience a lot of the
| verbosity comes from developers always looking for elements from
| the root every time, instead of finding an element once (Selenium
| returns that WebElement) and then searching within that element.
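|
| The scoped-search pattern can be illustrated without a browser.
| This sketch uses the standard library's ElementTree as a stand-in
| for Selenium; the markup and ids are made up for the example.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <div id="nav"><a>Home</a><a>About</a></div>
  <div id="results"><a>First hit</a><a>Second hit</a></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Locate the container once...
results = root.find(".//div[@id='results']")

# ...then search relative to it instead of starting from the root again.
hits = [a.text for a in results.findall("./a")]
print(hits)
```

| In Selenium the same idea is calling find_element on the returned
| WebElement rather than on the driver.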
| at0mic22 wrote:
| This one is not using WebDriver, but the raw Chrome debugging
| protocol.
| jdnier wrote:
| Hi, just wondering what you're thinking about how your tool might
| be abused.
| thalissonvs wrote:
| Well, it really depends on the user; there are many cases where
| this can be useful. Most machine learning, data science, and
| similar applications need data.
| wang_li wrote:
| >Most machine learning, data science, and similar
| applications need data.
|
| So. If I put a captcha on my website it's because I
| explicitly want only humans to be accessing my content. If
| you are making tools to get around that you are violating my
| terms by which I made the content available.
|
| No one should need a captcha. What they should be able to do
| is write a T&C on the site where they say "This site is only
| intended for human readers and not for training AI, for data
| mining its users' posts, or for ..... and if you do use it
| for any of these you agree to pay me $100,000,000,000." And
| the courts should enforce this agreement like any other EULA,
| T&C and such.
| elbear wrote:
| From what I remember a court in the US ruled that scraping
| is legitimate use. I don't know the specifics, I just
| remember reading this.
| kej wrote:
| It's far more nuanced than the headlines from that case
| made it seem. Here is a good overview:
| https://mccarthylg.com/is-web-scraping-legal-a-2025-breakdow...
| mrweasel wrote:
| You know that the captcha is there to prevent you from doing
| e.g. automated data mining, depending on the site obviously. In
| any case you actively seek to bypass a feature put there by the
| website to prevent you from doing what you're doing, and I
| think you know that. Does that not give you any moral
| concerns?
|
| If you really want/need the data, why not contact the site
| owner and make some sort of arrangement? We hosted a number of
| product images, many of which we took ourselves, something
| that other sites wanted. We did a bare minimum to prevent
| scrapers, but we also offered a feed with the image, product
| number, name and EAN. We charged a small fee, but you then
| got either an XML feed or a CSV and you could just pick out
| the new additions and download those.
| lazyasciiart wrote:
| Because Facebook isn't open to making arrangements
| thalissonvs wrote:
| I'm not actually bypassing the captcha with reverse
| engineering or anything like that, much less integrating
| with external services. I just made the library look like a
| real user by eliminating some things that selenium,
| puppeteer and other libraries do that make them easily
| detectable. You can still do different types of blocking,
| such as blocking based on IP address, rate limiting, or
| even using a captcha that requires a challenge, such as
| reCAPTCHA v2.
| Galanwe wrote:
| Well it can be abused of course, but captchas are used abusively
| as well, so I would say it's fair game.
|
| Lots of use cases for scraping are not DoS or information
| stealing, but mere automation.
|
| Proof of work should be used in these cases: it deters massive
| scraping abuse by making it too expensive at scale, while
| allowing legitimate small scale automation.
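|
| A hashcash-style sketch of that idea, assuming SHA-256 and a
| "leading zero hex digits" difficulty rule. The names and the
| challenge string are illustrative, not any specific site's scheme:

```python
import hashlib
from itertools import count


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap check: the digest must start with `difficulty` zero digits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Expensive search: try nonces in order until one verifies."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    # The server would issue the challenge; the client pays the CPU cost.
    nonce = solve("session-token-123", difficulty=4)
    print(nonce, verify("session-token-123", nonce, 4))
```

| Solving costs about 16**difficulty hashes on average while
| verifying costs one, so the price is negligible for a single
| request and adds up quickly across millions of them.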
| voidmain0001 wrote:
| I will be using Pydoll for the following legitimate use case: a
| franchisee is given access to their data as controlled by the
| franchise through a web site. The franchisee uses browser
| automation to retrieve its data but now the franchise has
| deployed a WAF that blocks Chrome webdriver. This is not a
| public web site and the data is not public so it frustrates the
| franchisee because it just wants its data which is paid for by
| its franchisee fees.
| mannyv wrote:
| Gee, I have this computer thing. How can it be abused?
| e9a8a0b3aded wrote:
| oi_oi_oi_got_a_licence_chum.jpg
| bobajeff wrote:
| Hi, as a non-webdev I want to know if rate limiting wouldn't
| make this a non-concern?
| mrweasel wrote:
| I still don't want you to create 1000 nonsense accounts,
| even if you can only create 100 per hour.
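|
| For reference, a per-client limit like "100 per hour" can be
| sketched as a sliding window. The class name and the simulated
| timestamps are assumptions made for this example:

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` events per `window` seconds for one client."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.events = deque()  # timestamps of accepted events, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(limit=100, window=3600)
    # Simulate a bot attempting one sign-up every 10 seconds.
    accepted = sum(limiter.allow(now=10.0 * i) for i in range(1000))
    print(accepted)
```

| The limiter slows the bot down but still lets some attempts
| through every hour, which is the concern raised above.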
| overfeed wrote:
| Then you need to level up & have defense in depth instead
| of relying on security through obscurity.
|
| On the public internet, web clients are _user_ agents, and
| not all users are benign. This is an arms race: asking the
| other side to unilaterally disarm is unlikely to work, so
| you change what you can control.
| wesselbindt wrote:
| I am also wondering about this, and in case you have a chef's
| knife in your kitchen, I would also like to hear if you have
| any comment on how that may be abused.
| bobbyraduloff wrote:
| Is there a write up on how you deal with the captchas?
| thalissonvs wrote:
| you can check the official documentation, there's a section
| 'Deep Dive'
| whall6 wrote:
| The web scraping arms race continues.
| renegat0x0 wrote:
| I think I will add this to my AIO package. My project crawls
| pages: it provides a barebones page, and scraping results are
| passed as JSON.
|
| This was very useful for me, so that I don't have to set up
| Selenium for the nth time. I just use one crawling server for my
| projects.
|
| Link:
|
| https://github.com/rumca-js/crawler-buddy
| thalissonvs wrote:
| cool, left a star :)
| nickspacek wrote:
| As someone who uses ISPs and browser configurations that seem to
| frustrate CloudFlare/reCaptcha to the point of frequently having
| to solve them during day-to-day browsing, it would be interesting
| to develop a proxy server that could automatically/transparently
| solve captchas for me.
| at0mic22 wrote:
| Cloudflare captcha can be easily passed with a browser extension,
| not much different from the suggested bypass
| mfrye0 wrote:
| Checking it out and I see you're using CDP.
|
| It's been a bit, but I'm pretty sure use of CDP can be detected.
| Has anything changed on that front, or are you aware and you're
| just bypassing with automated captcha handling?
| thalissonvs wrote:
| CDP itself is not detectable. It turns out that other libraries
| like puppeteer and playwright often leave obvious traces, such as
| creating contexts with common prefixes or defining attributes on
| the navigator property.
|
| I did a clean implementation on top of the CDP, without many
| signals for tracking. I added realistic interactions, among
| other measures.
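|
| For context, CDP commands are JSON messages with an id, a method
| and params, sent over a WebSocket to the browser. A minimal sketch
| that only builds such a message; the cdp_command helper is
| hypothetical and is not PyDoll's implementation:

```python
import itertools
import json

_ids = itertools.count(1)  # each command carries a unique integer id


def cdp_command(method: str, **params) -> str:
    """Serialize one CDP command as the JSON text sent over the socket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


if __name__ == "__main__":
    # Page.navigate is a standard CDP method; nothing is sent here.
    print(cdp_command("Page.navigate", url="https://example.com"))
```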
___________________________________________________________________
(page generated 2025-06-10 23:00 UTC)