[HN Gopher] SeleniumBase: Python APIs for web automation and byp...
___________________________________________________________________
SeleniumBase: Python APIs for web automation and bypassing bot-
detection
Author : seleniumbase
Score : 149 points
Date : 2024-12-16 17:34 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| michael_j_x wrote:
| I've been working with scrapers quite a lot. I started with
| python requests, then to scrapy, then selenium, then selenium via
| undetected_chromedriver, and once that started being detected
| during a chrome update about a year ago, I've switched over to
| seleniumbase. It got by undetected, but to get it working with
| pre-downloaded drivers, I had to look into the code. I have
| never, and I mean never, in all my python years, seen such a
| horrible mess of code. We are talking 1000-line-long methods,
| with 20-30 different flags and branches. Just horrible. I have
| since switched to Playwright, which seems to be also undetected,
| and offers a much saner interface.
| seleniumbase wrote:
| SeleniumBase modifies the webdriver so that it doesn't get
| detected when used alongside the CDP stealth mode and methods.
| It'll download chromedriver for you. Not sure what you mean by
| the multiple branches, as there's just the primary one. What
| 1000-line methods are you referring to? By "flags", do you mean
| the different command-line options available? As for
| Playwright, they aren't undetected: See
| https://github.com/microsoft/playwright/issues/23884#issueco...
| - "Playwright is an end-to-end testing framework, where we
| expect you test on your own environments. Bypassing any form of
| bot protection is not something we can act on. Thanks for your
| understanding." On the contrary, SeleniumBase is OK with
| bypassing bot detection:
| https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
| cyanmagenta wrote:
| Not the commenter, but "multiple branches" in this context is
| referring to if/else statements in the code, not source-
| control branches. Similarly, "flags" is referring to function
| arguments like a boolean "is_original." More generally, they
| are just saying that the code has long, complicated, bug-
| prone functions.
|
| That said, I just spent a few minutes browsing the
| SeleniumBase repo, and honestly it didn't seem that unusual
| to me. Would be interested in seeing a specific example of
| what the commenter had in mind.
| mdaniel wrote:
| Rather than a point-by-point rebuttal as the sibling requests, I
| think this sums up the coding style pretty well:
| https://github.com/seleniumbase/SeleniumBase/blob/v4.33.11/s...
| seleniumbase wrote:
| That method came from code that I accepted in a PR from
| December 31, 2019:
| https://github.com/seleniumbase/SeleniumBase/pull/459 Not a
| true representation of most of the code today.
| parineum wrote:
| The code is in the code base. Presumably, it still gets
| run. It doesn't make a difference if new code doesn't look
| like that.
| seleniumbase wrote:
| Call it "legacy code" if you'd like. That specific part
| is from a less common feature for setting options when
| running on a Selenium Grid. The new CDP Mode isn't
| compatible with The Grid (since CDP Mode makes direct CDP
| API calls without making Selenium API calls).
| MstWntd wrote:
| it's always easier for people today to look at the work
| of other people in the past and draw stupid conclusions..
| don't mind them..
| wisty wrote:
| Bad old code has been battle tested. Bad new code has
| not, and is more likely to have the show stopper bugs you
| want to avoid.
| seleniumbase wrote:
| There are actually a lot of examples being used for testing
| (https://github.com/seleniumbase/SeleniumBase/tree/master/exa...),
| which are run regularly (locally and in GitHub Actions). Plus,
| a lot of major companies are using SeleniumBase:
| https://github.com/seleniumbase/SeleniumBase/blob/master/hel...
| (if something breaks, I find out quickly)
| harrall wrote:
| That's not amazing code but that's not that bad. In the grand
| scheme of things, that's not code debt that would ever
| seriously make my life any harder.
| TeMPOraL wrote:
| Yup. At least it's self-contained and easy to step through
| and modify if something breaks or needs to be changed.
|
| And, as my previous PM would point out, even the
| copy-pasting and verifying no mistakes were made was a solution
| that took a fraction of the time a modern "clean" approach
| would. She had a point; as much as I'm against writing
| _this_ simple code in the general case, plenty of devs tend
| to err towards overcomplicating solutions when given a
| chance.
|
| I mean, the modern, proper, Clean Code(tm) solution would
| have this split into multiple files (not counting _tests_),
| and across two or three abstraction levels. I've seen
| this happen enough that I can tell I'd much prefer working
| with code like this capabilities parser (and hell, it can
| be beaten into near-perfection in an hour or three).
| edm0nd wrote:
| Not sure if you have explored rolling captcha solving services
| into your code. It's easy as fuck and you can do it in a few
| lines of code. Check out DeathByCaptcha or AntiCaptcha. It's
| like $2.99 per 1,000 successfully solved captchas.
|
| I guess my point is, you don't always have to be undetected or
| write 1000 lines of code to scrape or do whatever you need to
| do. It saved me a ton of headaches and time when captchas are
| involved.
| mintzworld wrote:
| SeleniumBase is free, open-source, can bypass CAPTCHAs with a
| few lines of code, and it works from the free tier of GitHub
| Actions.
| edm0nd wrote:
| It can't bypass all captchas, and that's what I'm talking
| about.
| mintzworld wrote:
| According to live demos seen in
| https://www.youtube.com/watch?v=Mr90iQmNsKM, it'll bypass
| Cloudflare, Akamai, Shape Security, DataDome, Incapsula,
| Kasada, and PerimeterX.
| edm0nd wrote:
| Okay, and? DeathByCaptcha can bypass all of those + all
| other captchas.
|
| Write a ton of code or just roll in a solving service
| API. Ez decision and save a ton of time + get to scraping
| faster.
| seleniumbase wrote:
| With SeleniumBase, you can bypass CAPTCHAs with one line
| of code: `sb.uc_gui_click_captcha()`
| parineum wrote:
| It's like you're not even reading what he wrote.
| edm0nd wrote:
| Okay, but it doesn't solve all captchas, whereas a solving
| service does with a few more lines of code.
|
| Can your script even do Google CAPTCHA and hCaptcha? What
| about the captcha from Dread? (ain't no way it can)
|
| There is no need to bypass them when you can just solve
| them.
| seleniumbase wrote:
| There's a reCAPTCHA on the Pokemon website. This SeleniumBase
| example bypassed it:
| https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
| Funnnny wrote:
| > There is no need to bypass them when you can just solve
| them.
|
| There is no need to solve them when you can just bypass
| them.
| edm0nd wrote:
| The point is you can't bypass them all, but you CAN solve
| them all.
| mintzworld wrote:
| Why pay to solve CAPTCHAs when SeleniumBase can bypass
| them for free? SeleniumBase can also "solve" CAPTCHAs
| (such as Cloudflare via click).
| windexh8er wrote:
| I feel like what you're saying is you have a vested
| interest in the services you mentioned with all of this
| scope creep to your OG argument.
| pryelluw wrote:
| Enterprise Python code. Somehow ends up being worse than Java
| enterprise code. I'm too used to it at this point.
| seleniumbase wrote:
| The "Python vs Java" debate is probably one for a different
| Hacker News post. :)
| pryelluw wrote:
| I meant that some of the code reminds me of enterprise
| python. The kicker is that code that works > pretty code.
| People here act as if ugly code is somehow lesser just
| because it's ugly. Meanwhile there's a lot of ugly code
| making millions of dollars.
|
| Didn't mean to bash your project. Sorry if it came across
| that way.
| seleniumbase wrote:
| It's OK. No offense was taken. It almost looked like the
| conversation was expanding into a "Python vs Java"
| debate, but (thankfully) it did not. I've seen both
| worlds. I've seen advantages to both. I decided to stay
| in the Python world.
| bryanrasmussen wrote:
| Maybe I am just a cynic but I would expect Playwright to be
| detected when using Chrome, I mean I would expect it was to the
| benefit of Google to make that happen for the sake of making
| reCaptcha detect bots better.
|
| That's actually why I've been scrapping my Playwright
| automation (because I expect I will encounter problems even if
| it hasn't happened yet, cynical and paranoid) and moving towards
| writing a browser extension to automate Firefox.
|
| Basically my use case is automating tedious things for myself
| not running bots at scale, so that's why it is imperative not
| to get caught being "not human", because then I risk account
| problems.
| robertlagrant wrote:
| How can Google make that happen? Playwright's made by
| Microsoft. It can use Firefox as a browser as well as Chrome.
| lyu07282 wrote:
| https://github.com/seleniumbase/SeleniumBase/blob/master/sel...
|
| That's.. that works?? :D
| seleniumbase wrote:
| That patches chromedriver (which gets renamed to uc_driver),
| but patching by itself isn't enough to bypass bot-detection.
| SeleniumBase also sets specific Chrome options and modifies
| methods to use the Chrome DevTools Protocol.
| lyu07282 wrote:
| I was more astonished that you could just search and replace
| a string in a PE/ELF binary without breaking everything, but
| I take your solution over recompiling chrome anytime. Awesome
| job, very well done!
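|
| For readers wondering why a search-and-replace on a compiled
| binary can work at all: a minimal sketch, assuming the replacement
| is exactly the same length as the original string, so that no
| byte offsets in the file shift. The marker, the replacement, and
| the fake "binary" below are hypothetical stand-ins, not the actual
| strings or code that SeleniumBase uses.

```python
def patch_marker(data: bytes, marker: bytes, replacement: bytes) -> bytes:
    """Replace every occurrence of marker with a same-length replacement."""
    # A different length would shift everything after the string and
    # invalidate offsets recorded elsewhere in the binary.
    if len(marker) != len(replacement):
        raise ValueError("replacement must be the same length as the marker")
    return data.replace(marker, replacement)


# Hypothetical stand-in for a driver binary: header bytes, an embedded
# string, and trailing bytes whose position must not move.
fake_binary = b"\x7fELF\x00\x01" + b"var bot_marker_xyz = 1;" + b"\x00TAIL"
patched = patch_marker(fake_binary, b"bot_marker_xyz", b"abcdefghijklmn")
```

| The file size and the position of everything after the string
| are unchanged, which is why nothing else in the file has to be
| fixed up.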
| seleniumbase wrote:
| Thank you!
| mintzworld wrote:
| The "CDP Mode" used for bypassing CAPTCHAs and bot-detection has
| its own ReadMe within SeleniumBase:
| https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
| And there's a recent YouTube video with live demos:
| https://www.youtube.com/watch?v=Mr90iQmNsKM
| cruffle_duffle wrote:
| As somebody who is now on the "need to scrape a website to get my
| customers' data for them" side of the fence... I get the reason
| bot detection exists. If you want people to not scrape, offer
| APIs that allow customers or their software to log in using
| oauth and let their software / LLM agent grab their data for
| them.
| chii wrote:
| > If you want people to not scrape, offer APIs
|
| many sites want to prevent scrapers because they don't want
| their information aggregated - things like price lists and
| product availability etc.
|
| I know grocery sites do this, to prevent customers from
| knowing price histories of products. They want to raise prices,
| then offer a discount to make it seem like the discount is
| legitimate.
| mintzworld wrote:
| On the topic of scraping grocery sites, here's an example of
| bypassing bot-detection on Albertsons:
| https://github.com/seleniumbase/SeleniumBase/blob/master/exa...
| (A demo of that is in https://www.youtube.com/watch?v=Mr90iQmNsKM)
| bryanrasmussen wrote:
| It seems weird to me that this works. When I do scroll-into-view
| and similar behaviors in other code, I use a random scroll
| speed to simulate human behavior, but SeleniumBase evidently
| doesn't.
|
| Maybe I am just too paranoid.
| seleniumbase wrote:
| SeleniumBase CDP Mode uses `DOM.scrollIntoViewIfNeeded`
| (https://chromedevtools.github.io/devtools-protocol/tot/DOM/#...),
| so it only scrolls when elements are offscreen, rather than
| always scrolling. This reduces
| the number of scrolls needed. Also, it seems that most
| anti-bot services are not looking at scrolling as a way
| of identifying users.
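|
| To make the CDP call above concrete: a rough sketch of what such
| a command looks like as a JSON message sent to the browser. The
| `cdp_command` helper and the node id of 42 are hypothetical; this
| only builds the message and is not SeleniumBase's own code.

```python
import itertools
import json

# CDP commands carry a unique integer id so replies can be matched up.
_ids = itertools.count(1)


def cdp_command(method: str, **params) -> str:
    """Serialize one Chrome DevTools Protocol command as JSON."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


# nodeId=42 is a placeholder; a real id would come from an earlier DOM query.
message = cdp_command("DOM.scrollIntoViewIfNeeded", nodeId=42)
```

| The browser decides whether any scrolling is needed, so the
| client sends the same command whether or not the element is
| already visible.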
| theanonymousone wrote:
| Is it demonstrably better than Playwright in bypassing Cloudflare
| measures? I have some scraping projects and the "cat and mouse
| game" (what's the right expression here?) took so much energy that
| I finally went with an external dedicated scraping service. It
| doesn't feel right that some scrapers are considered friendly
| (e.g. Google), while smaller ones are vilified..
| seleniumbase wrote:
| There are live demos on YouTube of SeleniumBase bypassing
| CAPTCHAs (if you want to see first):
| https://www.youtube.com/watch?v=Mr90iQmNsKM
| coppsilgold wrote:
| Is there a reason why the crawling and browser automation people
| don't just patch the browser to be controlled with no possibility
| of detection?
|
| The web page is heavily restricted in what it can access through
| various interfaces and you can feed it anything you want by
| patching the browser. Once you do that the problem becomes just
| simulating a legitimate user to a sufficient degree.
|
| I wonder if that's what's already happening with CDP and
| reCAPTCHA and hCaptcha - the two services mentioned that are
| strong and a problem. Are they detecting the "Stealth" or is it
| just the lack of user activity and reputation? Is CDP by itself
| detectable by some means?
| seleniumbase wrote:
| Patching chromedriver is a lot easier than patching the
| browser. Plus, if you're just using a regular Chrome browser
| for the automation, then there's nothing to patch. Automated
| CDP calls aren't detectable if they don't leave any trace of
| automation activity. However, since Google created CDP, they
| might have ways of detecting automated CDP in ways that other
| services cannot.
| coppsilgold wrote:
| What about faking mouse movement from inside the browser?
| PyAutoGUI is not the right way to be doing this for
| interacting with JavaScript that has no hope of interrogating
| user operating system GUI interactions.
|
| And it seems like it would be important to try and adopt
| user-like mouse movement since JavaScript has access to this
| information.
| mintzworld wrote:
| PyAutoGUI is the optimal tool for clicking things inside of
| closed shadow-root elements, which are hidden to
| JavaScript. You can use CDP for clicking other elements.
| ghxst wrote:
| The reason in my experience is that there's a high barrier of
| entry for most devs when it comes to setting up an environment
| for Chromium and a workflow for patches that still allows you
| to quickly and easily pull in and apply upstream changes
| whenever a new Chromium version releases.
|
| In reality, if you know how to use CDP correctly and you have
| control over the environment that you run the browser in, you
| have to make very few browser patches.
|
| What I mean by using CDP correctly is that, yes, it is
| detectable to a certain extent, but it comes down to things like
| enabling the Runtime domain, for example, which you can easily
| mitigate in your own solution but which libraries like
| puppeteer / playwright often do out of the box (this is
| where the "stealth" versions of these libraries come in: they
| will either mitigate by disabling features or use some hacky
| approaches to instrument the JS that runs on the pages).
|
| Then when you move into an environment that is a lot more
| stripped down (let's say from your home machine to docker) now
| you run into A LOT of issues that you definitely are better off
| fixing with browser patches, however figuring out what those
| issues are and how to fix them is a huge feat in itself and
| often will require you to have the ability to reverse engineer
| things like Cloudflare, Akamai and other anti bot vendors just
| to know what leaks you still have to patch.
|
| It doesn't help that there is no end to misinformed articles on
| things like "browser fingerprinting" that you encounter when
| you try to solve your issues the first time you encounter them,
| a lot of articles based on nothing but superstition, articles
| that basically say "proxies are never good enough", "captchas
| are getting out of hand" that get things wrong and will just
| eat away at your sanity while trying to debug issues.
|
| This is long enough of a rant already but maybe offers you some
| insight, if you have any specific questions feel free to ask.
| seleniumbase wrote:
| The biggest issue with going from a home machine to a server
| is that you may lose having a "residential IP address", which
| is something that you'll want to have in order to prevent
| automation from being blocked outright. Hence the popularity
| of residential proxies. However, some servers live in a
| residential IP space, which makes them optimal for running
| web automation in. As was partially covered in
| https://www.youtube.com/watch?v=Mr90iQmNsKM, GitHub Actions
| appears to live in a "Residential IP space", which makes it a
| good server choice for web automation.
| ghxst wrote:
| IP is definitely not the biggest issue in my experience, as
| proxies are required at scale regardless, unless you get
| into more theoretical areas like p0f.
|
| The biggest issues are the ones that aren't obvious or
| easily tested for like missing a particular font, being on
| an abnormal gfx driver that produces an unidentified hash
| for particular fingerprint methods, not having certain APIs
| available that require browser patches, and then these
| aspects will differ between anti bot vendors and the data
| sets that they have.
|
| The reason they can be hard to test for is that everything
| is based on a trust score, which is potentially influenced
| by anything from website load to things tied to your
| personal session and for some vendors optionally even input
| data.
| coppsilgold wrote:
| Why not create a library that you inject into the Chrome
| process though?
|
| It seems to me that playing a cat and mouse game with these
| anti-bot systems is unnecessary. Design a system which mimics
| a legitimate user to such a degree that it's either
| indistinguishable from an actual user or would produce an
| unacceptable level of false positives for the detection
| system. This is not an even playing field, the bot has all
| the advantages.
|
| For example:
|
| - Enumerate all the possible ways in which the webpage can
| glean insight into user input/activity.
|
| - Hook all these functions by injecting code into the
| browser.
|
| - Create functions that mimic user activities (mouse pathing,
| aimless mouse wandering, random scrolls, clicks, text
| selections, etc.)
|
| - Feed the outputs of these functions into the functions that
| you hooked.
|
| - Rip out whatever information you want from the Chrome data
| structures in memory.
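|
| As a rough illustration of the "mouse pathing" step in the list
| above: a minimal sketch that generates a curved, slightly jittered
| path between two points using a quadratic Bezier curve. The
| function name, the jitter size, and the control-point range are
| all assumptions chosen for illustration; real detection systems
| may also look at timing and acceleration, which this ignores.

```python
import random


def human_like_path(start, end, steps=25, jitter=1.5, rng=None):
    """Return a list of (x, y) points curving from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x1, y1) = start, end
    # A random control point bends the path so it is not a straight line.
    cx = (x0 + x1) / 2 + rng.uniform(-100, 100)
    cy = (y0 + y1) / 2 + rng.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation between start, control, and end.
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        # Add small noise to intermediate points only, so the path
        # still starts and ends exactly where requested.
        if 0 < i < steps:
            x += rng.uniform(-jitter, jitter)
            y += rng.uniform(-jitter, jitter)
        points.append((round(x), round(y)))
    return points


# A seeded generator makes the example reproducible.
path = human_like_path((10, 10), (400, 300), rng=random.Random(0))
```

| The resulting points would then be fed, one at a time, into
| whatever input hook is being used.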
|
| After all this, the only challenge that would remain is to
| perfect the input functions that are supposed to mimic a
| legitimate user. Perhaps also depending on how sophisticated
| these anti-bot systems get, you may additionally need to
| cultivate user browsing habit profiles to enter
| advertising/spying databases as a real human.
| nullday wrote:
| Nice, but I still use puppeteer these days with just rebrowser-
| patches - https://github.com/rebrowser/rebrowser-patches - works
| just fine with vanilla chrome
___________________________________________________________________
(page generated 2024-12-18 23:02 UTC)