hngopher.com

       [HN Gopher] SeleniumBase: Python APIs for web automation and byp...
       ___________________________________________________________________
        
       SeleniumBase: Python APIs for web automation and bypassing bot-
       detection
        
       Author : seleniumbase
       Score  : 149 points
       Date   : 2024-12-16 17:34 UTC (2 days ago)
        
 (HTM) web link (github.com)
        
       | michael_j_x wrote:
       | I've been working with scrapers quite a lot. I started with
       | python requests, then to scrapy, then selenium, then selenium via
       | undetected_chromedriver, and once that started being detected
       | during a chrome update about a year ago, I've switched over to
       | seleniumbase. It got by undetected, but to get it working with
       | pre-downloaded drivers, I had to look into the code. I have
       | never, and I mean never, in all my python years, seen such a
       | horrible mess of code. We are talking 1000-line methods,
       | with 20-30 different flags and branches. Just horrible. I have
       | since switched to Playwright, which seems to be also undetected,
       | and offers a much saner interface.
        
         | seleniumbase wrote:
         | SeleniumBase modifies the webdriver so that it doesn't get
         | detected when used alongside the CDP stealth mode and methods.
         | It'll download chromedriver for you. Not sure what you mean by
         | the multiple branches, as there's just the primary one. What
         | 1000-line methods are you referring to? By "flags", do you mean
         | the different command-line options available? As for
         | Playwright, they aren't undetected: See
         | https://github.com/microsoft/playwright/issues/23884#issueco...
         | - "Playwright is an end-to-end testing framework, where we
         | expect you test on your own environments. Bypassing any form of
         | bot protection is not something we can act on. Thanks for your
         | understanding." On the contrary, SeleniumBase is OK with
         | bypassing bot detection:
         | https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
        
           | cyanmagenta wrote:
           | Not the commenter, but "multiple branches" in this context is
           | referring to if/else statements in the code, not source-
           | control branches. Similarly, "flags" is referring to function
           | arguments like a boolean "is_original." More generally, they
           | are just saying that the code has long, complicated, bug-
           | prone functions.
           | 
           | That said, I just spent a few minutes browsing the
           | SeleniumBase repo, and honestly it didn't seem that unusual
           | to me. Would be interested in seeing a specific example of
           | what the commenter had in mind.
        
         | mdaniel wrote:
         | rather than point-by-point rebuttal as the sibling requests, I
         | think this sums up the coding style pretty well:
         | https://github.com/seleniumbase/SeleniumBase/blob/v4.33.11/s...
        
           | seleniumbase wrote:
           | That method came from code that I accepted in a PR from
           | December 31, 2019:
           | https://github.com/seleniumbase/SeleniumBase/pull/459 Not a
           | true representation of most of the code today.
        
             | parineum wrote:
             | The code is in the code base. Presumably, it still gets
             | run. It doesn't make a difference if new code doesn't look
             | like that.
        
               | seleniumbase wrote:
               | Call it "legacy code" if you'd like. That specific part
               | is from a less common feature for setting options when
               | running on a Selenium Grid. The new CDP Mode isn't
               | compatible with The Grid (since CDP Mode makes direct CDP
               | API calls without making Selenium API calls).
        
               | MstWntd wrote:
               | it's always easier for people today to look at the work
               | of other people in the past and draw stupid conclusions..
               | don't mind them..
        
               | wisty wrote:
               | Bad old code has been battle tested. Bad new code has
               | not, and is more likely to have the show stopper bugs you
               | want to avoid.
        
               | seleniumbase wrote:
               | There are actually a lot of examples being used for testing
               | (https://github.com/seleniumbase/SeleniumBase/tree/master/exa...),
               | which are run regularly (locally and in GitHub Actions). Plus,
               | a lot of major companies are using SeleniumBase:
               | https://github.com/seleniumbase/SeleniumBase/blob/master/hel...
               | (if something breaks, I find out quickly)
        
           | harrall wrote:
           | That's not amazing code but that's not that bad. In the grand
           | scheme of things, that's not code debt that would ever
           | seriously make my life any harder.
        
             | TeMPOraL wrote:
             | Yup. At least it's self-contained and easy to step through
             | and modify if something breaks or needs to be changed.
             | 
             | And, as my previous PM would point out, even the
             | copy-pasting and verifying no mistakes were made was a solution
             | that took a fraction of the time a modern "clean" approach
             | would. She had a point; as much as I'm against writing
             | _this_ simple code in the general case, plenty of devs tend
             | to err towards overcomplicating solutions when given a
             | chance.
             | 
             | I mean, the modern, proper, Clean Code(tm) solution would
             | have this split into multiple files (not counting _tests_),
             | and across two or three abstraction levels. I've seen
             | this happen enough that I can tell I'd much prefer working
             | with code like this capabilities parser (and hell, it can
             | be beaten into near-perfection in an hour or three).
        
         | edm0nd wrote:
         | Not sure if you have explored rolling captcha solving services
         | into your code. It's easy as fuck and you can do it in a few
         | lines of code. Check out DeathByCaptcha or AntiCaptcha. It's
         | like $2.99 per 1,000 successfully solved captchas.
         | 
         | I guess my point is, you don't always have to be undetected or
         | write 1000 lines of code to scrape or do whatever you need to
         | do. It saved me a ton of headaches and time when captchas
         | are involved.
        
           | mintzworld wrote:
           | SeleniumBase is free, open-source, can bypass CAPTCHAs with a
           | few lines of code, and it works from the free tier of GitHub
           | Actions.
        
             | edm0nd wrote:
             | It can't bypass all captchas, and that's what I'm talking
             | about.
        
               | mintzworld wrote:
               | According to live demos seen in
               | https://www.youtube.com/watch?v=Mr90iQmNsKM, it'll bypass
               | Cloudflare, Akamai, Shape Security, DataDome, Incapsula,
               | Kasada, and PerimeterX.
        
               | edm0nd wrote:
               | Okay, and? DeathByCaptcha can bypass all of those + all
               | other captchas.
               | 
               | Write a ton of code or just roll in a solving service
               | API. Ez decision and save a ton of time + get to scraping
               | faster.
        
               | seleniumbase wrote:
               | With SeleniumBase, you can bypass CAPTCHAs with one line
               | of code: `sb.uc_gui_click_captcha()`
        
               | parineum wrote:
               | It's like you're not even reading what he wrote.
        
               | edm0nd wrote:
               | Okay, but it doesn't solve all captchas, whereas a solving
               | service does with a few more lines of code.
               | 
               | Can your script even do Google reCAPTCHA and hCaptcha? What
               | about the captcha from Dread? (Ain't no way it can.)
               | 
               | There is no need to bypass them when you can just solve
               | them.
        
               | seleniumbase wrote:
               | There's a reCAPTCHA on the Pokemon website. This
               | SeleniumBase example bypassed it:
               | https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
        
               | Funnnny wrote:
               | > There is no need to bypass them when you can just solve
               | them.
               | 
               | There is no need to solve them when you can just bypass
               | them.
        
               | edm0nd wrote:
               | The point is you can't bypass them all, but you CAN solve
               | them all.
        
               | mintzworld wrote:
               | Why pay to solve CAPTCHAs when SeleniumBase can bypass
               | them for free? SeleniumBase can also "solve" CAPTCHAs
               | (such as Cloudflare via click).
        
               | windexh8er wrote:
               | I feel like what you're saying is you have a vested
               | interest in the services you mentioned with all of this
               | scope creep to your OG argument.
        
         | pryelluw wrote:
         | Enterprise Python code. Somehow ends up being worse than Java
         | enterprise code. I'm too used to it at this point.
        
           | seleniumbase wrote:
           | The "Python vs Java" debate is probably one for a different
           | Hacker News post. :)
        
             | pryelluw wrote:
             | I meant that some of the code reminds me of enterprise
             | python. The kicker is that code that works > pretty code.
             | People here act as if ugly code is somehow lesser just
             | because it's ugly. Meanwhile there's a lot of ugly code
             | making millions of dollars.
             | 
             | Didn't mean to bash your project. Sorry if it came across
             | that way.
        
               | seleniumbase wrote:
               | It's OK. No offense was taken. It almost looked like the
               | conversation was expanding into a "Python vs Java"
               | debate, but (thankfully) it did not. I've seen both
               | worlds. I've seen advantages to both. I decided to stay
               | in the Python world.
        
         | bryanrasmussen wrote:
         | Maybe I am just a cynic but I would expect Playwright to be
         | detected when using Chrome, I mean I would expect it was to the
         | benefit of Google to make that happen for the sake of making
         | reCaptcha detect bots better.
         | 
         | That's actually why I've been scrapping my Playwright
         | automation (because I expect I will encounter problems even if
         | it hasn't happened yet, cynical and paranoid) and moving towards
         | writing a browser extension to automate Firefox.
         | 
         | Basically my use case is automating tedious things for myself
         | not running bots at scale, so that's why it is imperative not
         | to get caught being "not human", because then I risk account
         | problems.
        
           | robertlagrant wrote:
           | How can Google make that happen? Playwright's made by
           | Microsoft. It can use Firefox as a browser as well as Chrome.
        
       | lyu07282 wrote:
       | https://github.com/seleniumbase/SeleniumBase/blob/master/sel...
       | 
       | That's.. that works?? :D
        
         | seleniumbase wrote:
         | That patches chromedriver (which gets renamed to uc_driver),
         | but patching by itself isn't enough to bypass bot-detection.
         | SeleniumBase also sets specific Chrome options and modifies
         | methods to use the Chrome DevTools Protocol.
        
           | lyu07282 wrote:
           | I was more astonished that you could just search and replace
           | a string in a PE/ELF binary without breaking everything, but
           | I'd take your solution over recompiling Chrome any time. Awesome
           | job, very well done!
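
A minimal sketch of why an in-place string swap can leave a compiled binary usable: if the replacement has exactly the same length as the original, no byte after it moves, so offsets and section sizes stay valid. This is not SeleniumBase's actual patch code; `patch_bytes`, the fake binary, and the marker strings are hypothetical names chosen for illustration.

```python
def patch_bytes(blob: bytes, old: bytes, new: bytes) -> bytes:
    """Replace every occurrence of `old` with `new` without shifting offsets."""
    # A different-length replacement would move everything after it,
    # which is what would normally corrupt a PE/ELF file.
    if len(old) != len(new):
        raise ValueError("replacement must be the same length as the original")
    return blob.replace(old, new)


# Hypothetical stand-in for a compiled binary: a header, an embedded
# string, and trailing data whose offset must not move.
fake_binary = b"\x7fELF\x00\x00marker_abc\x00\x00tail"
patched = patch_bytes(fake_binary, b"marker_abc", b"marker_xyz")
```

Whether a real binary still runs after such a patch also depends on things this sketch ignores, such as code signing or integrity checks.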
        
             | seleniumbase wrote:
             | Thank you!
        
       | mintzworld wrote:
       | The "CDP Mode" used for bypassing CAPTCHAs and bot-detection has
       | its own ReadMe within SeleniumBase:
       | https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
       | And there's a recent YouTube video with live demos:
       | https://www.youtube.com/watch?v=Mr90iQmNsKM
        
       | cruffle_duffle wrote:
       | As somebody who is now on the "need to scrape a website to get my
       | customers' data for them" side of the fence... I get the reason
       | bot detection exists. If you want people to not scrape, offer
       | APIs that allow customers or their software to log in using
       | OAuth and let their software / LLM agent grab their data for
       | them.
        
         | chii wrote:
         | > If you want people to not scrape, offer APIs
         | 
         | Many sites want to prevent scrapers because they don't want
         | their information aggregated - things like price lists and
         | product availability etc.
         | 
         | I know grocery sites do this, to prevent customers from
         | knowing price histories of products. They want to raise prices,
         | then offer a discount to make it seem like the discount is
         | legitimate.
        
           | mintzworld wrote:
           | On the topic of scraping grocery sites, here's an example of
           | bypassing bot-detection on Albertsons:
           | https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
           | (A demo of that is in https://www.youtube.com/watch?v=Mr90iQmNsKM)
        
             | bryanrasmussen wrote:
             | It seems weird to me that works - when I do scroll into
             | views and similar behaviors in other code I do a random
             | scroll speed to simulate human behavior, but SeleniumBase
             | evidently doesn't.
             | 
             | Maybe I am just too paranoid.
        
               | seleniumbase wrote:
               | SeleniumBase CDP Mode uses `DOM.scrollIntoViewIfNeeded`
               | (https://chromedevtools.github.io/devtools-protocol/tot/DOM/#...),
               | so it only scrolls when elements
               | are offscreen, rather than always scrolling. This reduces
               | the number of scrolls needed. Also, it seems that most
               | anti-bot services are not looking at scrolling as a way
               | of identifying users.
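
The "scroll only when offscreen" idea can be sketched as a small decision function. This is an illustration of the concept only, not Chrome's or SeleniumBase's implementation: `Rect`, `scroll_delta_if_needed`, and the choice to centre the element when a scroll is needed are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Vertical extent of a box in page coordinates (hypothetical helper)."""
    top: float
    bottom: float


def scroll_delta_if_needed(element: Rect, viewport: Rect) -> float:
    """Return how far to scroll, or 0.0 if the element is already fully visible."""
    if element.top >= viewport.top and element.bottom <= viewport.bottom:
        return 0.0  # already on screen: no scroll event is generated at all
    # Assumption for this sketch: bring the element's midpoint to the
    # viewport's midpoint when a scroll is required.
    element_mid = (element.top + element.bottom) / 2
    viewport_mid = (viewport.top + viewport.bottom) / 2
    return element_mid - viewport_mid


viewport = Rect(top=0, bottom=800)
visible = scroll_delta_if_needed(Rect(top=100, bottom=200), viewport)
offscreen = scroll_delta_if_needed(Rect(top=1000, bottom=1100), viewport)
```

In this sketch the visible element produces no scroll, while the offscreen one produces a single jump, which matches the point that fewer scrolls happen overall.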
        
       | theanonymousone wrote:
       | Is it demonstrably better than Playwright in bypassing Cloudflare
       | measures? I have some scraping projects and the "cat and mouse
       | game" (what's the right expression here?) took so much energy that
       | I finally went with an external dedicated scraping service. It
       | doesn't feel right that some scrapers are considered friendly
       | (e.g. Google), while smaller ones are vilified..
        
         | seleniumbase wrote:
         | There are live demos on YouTube of SeleniumBase bypassing
         | CAPTCHAs (if you want to see first):
         | https://www.youtube.com/watch?v=Mr90iQmNsKM
        
       | coppsilgold wrote:
       | Is there a reason why the crawling and browser automation people
       | don't just patch the browser to be controlled with no possibility
       | of detection?
       | 
       | The web page is heavily restricted in what it can access through
       | various interfaces and you can feed it anything you want by
       | patching the browser. Once you do that the problem becomes just
       | simulating a legitimate user to a sufficient degree.
       | 
       | I wonder if that's what's already happening with CDP and
       | ReCAPTCHA and hCaptcha - the two services mentioned that are
       | strong and a problem. Are they detecting the "Stealth" or is it
       | just the lack of user activity and reputation? Is CDP by itself
       | detectable by some means?
        
         | seleniumbase wrote:
         | Patching chromedriver is a lot easier than patching the
         | browser. Plus, if you're just using a regular Chrome browser
         | for the automation, then there's nothing to patch. Automated
         | CDP calls aren't detectable if they don't leave any trace of
         | automation activity. However, since Google created CDP, they
         | might have ways of detecting automated CDP in ways that other
         | services cannot.
        
           | coppsilgold wrote:
           | What about faking mouse movement from inside the browser?
           | PyAutoGUI is not the right way to be doing this for
           | interacting with JavaScript that has no hope of interrogating
           | user operating system GUI interactions.
           | 
           | And it seems like it would be important to try and adopt
           | user-like mouse movement since JavaScript has access to this
           | information.
        
             | mintzworld wrote:
             | PyAutoGUI is the optimal tool for clicking things inside of
             | closed shadow-root elements, which are hidden to
             | JavaScript. Can use CDP for clicking other elements.
        
         | ghxst wrote:
         | The reason in my experience is that there's a high barrier of
         | entry for most devs when it comes to setting up an environment
         | for Chromium and a workflow for patches that still allows you
         | to quickly and easily pull in and apply upstream changes
         | whenever a new Chromium version releases.
         | 
         | In reality, if you know how to use CDP correctly and you have
         | control over the environment that you run the browser in, you
         | have to make very few browser patches.
         | 
         | What I mean by using CDP correctly is that, yes, it is
         | detectable to a certain extent, but it comes down to things like
         | enabling the Runtime domain, for example, which you can easily
         | mitigate in your own solution but is something that libraries
         | like puppeteer / playwright often do out of the box (this is
         | where the "stealth" versions of these libraries come in, they
         | will either mitigate by disabling features or use some hacky
         | approaches to instrument the JS that runs on the pages).
         | 
         | Then when you move into an environment that is a lot more
         | stripped down (let's say from your home machine to docker) now
         | you run into A LOT of issues that you definitely are better off
         | fixing with browser patches, however figuring out what those
         | issues are and how to fix them is a huge feat in itself and
         | often will require you to have the ability to reverse engineer
         | things like Cloudflare, Akamai and other anti bot vendors just
         | to know what leaks you still have to patch.
         | 
         | It doesn't help that there is no end to misinformed articles on
         | things like "browser fingerprinting" that you encounter when
         | you try to solve your issues the first time you encounter them,
         | a lot of articles based on nothing but superstition, articles
         | that basically say "proxies are never good enough", "captchas
         | are getting out of hand" that get things wrong and will just
         | eat away at your sanity while trying to debug issues.
         | 
         | This is long enough of a rant already but maybe offers you some
         | insight, if you have any specific questions feel free to ask.
        
           | seleniumbase wrote:
           | The biggest issue with going from a home machine to a server
           | is that you may lose having a "residential IP address", which
           | is something that you'll want to have in order to prevent
           | automation from being blocked outright. Hence the popularity
           | of residential proxies. However, some servers live in a
           | residential IP space, which makes them optimal for running
           | web automation in. As was partially covered in
           | https://www.youtube.com/watch?v=Mr90iQmNsKM, GitHub Actions
           | appears to live in a "Residential IP space", which makes it a
           | good server choice for web automation.
        
             | ghxst wrote:
             | IP is definitely not the biggest issue in my experience, as
             | proxies are required at scale regardless, unless you get
             | into more theoretical areas like p0f.
             | 
             | The biggest issues are the ones that aren't obvious or
             | easily tested for like missing a particular font, being on
             | an abnormal gfx driver that produces an unidentified hash
             | for particular fingerprint methods, not having certain APIs
             | available that require browser patches, and then these
             | aspects will differ between anti bot vendors and the data
             | sets that they have.
             | 
             | The reason they can be hard to test for is that everything
             | is based on a trust score, which is potentially influenced
             | by anything from website load to things tied to your
             | personal session and for some vendors optionally even input
             | data.
        
           | coppsilgold wrote:
           | Why not create a library that you inject into the Chrome
           | process though?
           | 
           | It seems to me that playing a cat and mouse game with these
           | anti-bot systems is unnecessary. Design a system which mimics
           | a legitimate user to such a degree that it's either
           | indistinguishable from an actual user or would produce an
           | unacceptable level of false positives for the detection
           | system. This is not an even playing field, the bot has all
           | the advantages.
           | 
           | For example:
           | 
           | - Enumerate all the possible ways in which the webpage can
           | glean insight into user input/activity.
           | 
           | - Hook all these functions by injecting code into the
           | browser.
           | 
           | - Create functions that mimic user activities (mouse pathing,
           | aimless mouse wandering, random scrolls, clicks, text
           | selections, etc)
           | 
           | - Feed the outputs of these functions into the functions that
           | you hooked.
           | 
           | - Rip out whatever information you want from the Chrome data
           | structures in memory.
           | 
           | After all this, the only challenge that would remain is to
           | perfect the input functions that are supposed to mimic a
           | legitimate user. Perhaps also depending on how sophisticated
           | these anti-bot systems get, you may additionally need to
           | cultivate user browsing habit profiles to enter
           | advertising/spying databases as a real human.
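
As one illustration of the "mouse pathing" step in the list above, here is a rough sketch that generates a curved pointer path instead of a straight line. It makes no claim about defeating any particular anti-bot system; `mouse_path`, its parameters, and the use of a quadratic Bezier curve with a randomly offset control point are assumptions chosen for the example.

```python
import random


def mouse_path(start, end, steps=20, jitter=40.0, rng=None):
    """Return steps + 1 points along a randomly bent curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x1, y1) = start, end
    # A control point offset from the midpoint bends the path, so
    # repeated calls between the same two points differ.
    cx = (x0 + x1) / 2 + rng.uniform(-jitter, jitter)
    cy = (y0 + y1) / 2 + rng.uniform(-jitter, jitter)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation between start, control, and end.
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        points.append((x, y))
    return points


path = mouse_path((0, 0), (100, 50), steps=10, rng=random.Random(1))
```

The resulting points would still have to be fed to whatever input mechanism is in use, and timing between points (not modelled here) is likely to matter just as much as the shape.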
        
       | nullday wrote:
       | Nice, but I still use puppeteer these days with just rebrowser-
       | patches - https://github.com/rebrowser/rebrowser-patches - works
       | just fine with vanilla chrome
        
       ___________________________________________________________________
       (page generated 2024-12-18 23:02 UTC)