[HN Gopher] Web scraping with your web browser: Why not?
___________________________________________________________________
Web scraping with your web browser: Why not?
Includes working code. First article in a planned series.
Author : 8chanAnon
Score : 127 points
Date : 2024-10-01 17:55 UTC (1 days ago)
(HTM) web link (8chananon.github.io)
(TXT) w3m dump (8chananon.github.io)
| dewey wrote:
| I've read through that (hard to read, because of the bad
| formatting) but I still don't understand why you would do that
| instead of Playwright, Puppeteer etc. - The only reason seems to
| be "This technique certainly has its limits.".
| bdcravens wrote:
| Solutions that want to automate in the context of their
| customers' browser. For example, ListPerfectly, a solution for
| cross-listing to eBay, Poshmark, etc, does this in their
| browser extension.
| dewey wrote:
| This is just regular browser extension behavior though, a bit
| different than the use case talked about in this post.
| 8chanAnon wrote:
| >bad formatting
|
| If you can elaborate, I would very much appreciate it. I'm
| always interested in doing better.
|
| Why use Puppeteer etc. when you don't have to? What is the
| argument for using these additional tools versus not using
| them?
| gabrielsroka wrote:
| The red text on a yellow background is not great. Neither is
| a serif font. Also it should be JavaScript with a capital S.
| 8chanAnon wrote:
| Red text on yellow - you mean the website? Would you like
| the text to be darker?
|
| And Beautiful Soup should be BeautifulSoup. Who makes the
| rules?
| gabrielsroka wrote:
| Yes, the website. Better contrast is good. Black on white
| works great.
|
| Margins would also be nice on the left and right.
|
| Beautiful Soup is two words. Just look at their website.
| bityard wrote:
| Too much contrast makes it hard to read too. (With bonus
| eye strain.)
| 8chanAnon wrote:
| Is there a happy in-between? Maybe not. What looks
| perfect to one user might appear atrocious to another.
| What is a poor website operator to do???
|
| I dislike black-on-white and don't understand gray-on-
| black which seems to be popular now due to gamma settings
| being cranked up to 11 or something. I try to use some
| color as an in-between but that may take some time to
| "perfect".
| gabrielsroka wrote:
| Even Google Chrome's Lighthouse said that your background
| and foreground colors do not have a sufficient contrast
| ratio.
| 8chanAnon wrote:
| Every dark mode site in existence should fail that test.
| dewey wrote:
| No, because the contrast can still be good as the colors
| are just reversed?
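|
| The contrast check behind that kind of audit can be sketched as
| below. This assumes the WCAG 2.x relative-luminance formula and
| the 4.5:1 AA threshold for normal-size text. The hex colors are
| illustrative stand-ins, not the article's actual palette, and
| the helper names are made up for this example.

```javascript
// Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition).
function channelToLinear(c8) {
  const c = c8 / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a 6-digit hex color such as "#ffcc00".
function relativeLuminance(hex) {
  const n = parseInt(hex.replace('#', ''), 16);
  const r = (n >> 16) & 255;
  const g = (n >> 8) & 255;
  const b = n & 255;
  return (
    0.2126 * channelToLinear(r) +
    0.7152 * channelToLinear(g) +
    0.0722 * channelToLinear(b)
  );
}

// Contrast ratio: lighter luminance over darker, each offset by 0.05.
function contrastRatio(hexA, hexB) {
  const a = relativeLuminance(hexA);
  const b = relativeLuminance(hexB);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

console.log(contrastRatio('#000000', '#ffffff')); // about 21 (the maximum)
console.log(contrastRatio('#ff0000', '#ffff00')); // about 3.72, below 4.5
console.log(contrastRatio('#e0e0e0', '#121212') > 4.5); // a dark theme that passes
```

| The ratio always divides the lighter luminance by the darker
| one, so swapping foreground and background gives the same
| number. That is why a dark theme can pass the same test.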
| pwg wrote:
| > Is there a happy in-between?
|
| Since browsers allow users to configure a default font
| and background color then one possible "happy in-between"
| would be to set no background color, and set no font
| color, thereby allowing each user agent (i.e., browser)
| to display the site with that user's default background
| and font colors.
|
| In that case, each viewer should get their preferred
| colors, all without you doing anything.
| dewey wrote:
| You don't have max-width set on the text, so unless you have
| your browser window resized to a very small size the paragraphs
| will span your whole screen.
| 8chanAnon wrote:
| I see. I don't have a wide-screen monitor (still using an
| old tube type until it finally expires but it's taking a
| few decades, lol). I've wondered whether people actually
| like reading websites on wide-screen. Some do and some
| don't. What would you suggest for a max width?
|
| You could also try zooming in. My apps don't expand to full
| width because of the video box but you can zoom.
| dewey wrote:
| There's a lot of info about this but usually 500-700px or
| ~80 characters will be much easier to read:
| https://ux.stackexchange.com/questions/108801/what-is-the-be...
| pwg wrote:
| Which places you, the reader, fully in control of the width
| of lines you prefer to see. Adjust your browser window
| width, or apply a user style sheet to tell your browser to
| format the text the way you want to see it formatted.
| dewey wrote:
| I knew this reply would come... How many people do you
| think use custom browser stylesheets? It's probably
| smaller than 0.1% of the internet population and everyone
| moved on to formatting text so everyone can enjoy good
| readability. Also not all devices have the luxury of
| supporting custom stylesheets.
|
| Of course it's always up to the site owner, but most
| people want people to read what they share.
| nsonha wrote:
| > Adjust your browser window width, or apply a user style
| sheet
|
| very funny, both jokes
| bityard wrote:
| I have the opposite problem. I typically have many browser
| windows open at the same time but only two screens, and
| many sites that I use are designed to assume that everyone
| has full-screen browsers.
| micahdeath wrote:
| Similar problem here. My resolution is 1440x900 (27in
| Monitor) paired w/ 1280x720 (32in TV) and I keep 3
| Browsers Open (Edge, Opera, FireFox - each has intent).
| Each is at 3/4 width and 1/2 height and offset so I can
| see each partially.
|
| With this setup, many sites work, but a few... a few have
| a top ad banner, a side banner and a footer of 'cookie
| acceptance'... then add in a 'subscribe to our email' and
| a google login prompt.... (Game Wiki's.. I game in
| smaller windows too -- what good is a multi tasking
| computer if you don't use it?)
| chaosharmonic wrote:
| > You can find plenty of tutorials on the Internet about the art
| of web scraping... and the first things you will learn about are
| Python and Beautiful Soup. There is no tutorial on web scraping
| with Javascript in a web browser...
|
| Um... [0]
|
| [0] https://bhmt.dev/blog/scraping
| 8chanAnon wrote:
| Rather long so I'll read it later. Thanks for the tip. Got more
| or is that it?
| chaosharmonic wrote:
| Not that I _know_ of, I'm just quipping a bit about my own
| work lol
| simlan wrote:
| I also did something similar for my spring project. The idea was
| to buy a used car and I was frustrated with the BS the listing
| sites claimed as fair price etc..
|
| I went the browser extension route and used Greasemonkey to
| inject custom JavaScript. I patched window.fetch, and because
| it was a React page it did most of the work for me, providing me
| with a slightly convoluted JSON doc every time I scrolled. Getting
| the data extracted was only a question of getting a Flask API
| with correct CORS settings running.
|
| Thanks for posting. Using a local proxy for even more control
| could be helpful in the future.
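|
| A minimal sketch of the fetch-patching idea, not the commenter's
| actual code. The wrapFetch helper and the captured array are
| hypothetical names, and it assumes the page's API responses carry
| an application/json content type.

```javascript
// Wrap a fetch-like function so that JSON responses are also handed
// to a callback, without disturbing the page's own use of the response.
function wrapFetch(originalFetch, onJson) {
  return async function patchedFetch(...args) {
    const response = await originalFetch.apply(this, args);
    const type = response.headers.get('content-type') || '';
    if (type.includes('application/json')) {
      // clone() leaves the original body unread for the page itself.
      response
        .clone()
        .json()
        .then((data) => onJson(response.url, data))
        .catch(() => {}); // ignore bodies that fail to parse
    }
    return response;
  };
}

// Browser only: install the wrapper from a userscript or the console.
if (typeof window !== 'undefined' && typeof window.fetch === 'function') {
  const captured = [];
  window.fetch = wrapFetch(window.fetch, (url, data) => {
    captured.push({ url, data });
    console.log('captured JSON from', url);
  });
}
```

| From there, the callback could POST each captured document to a
| local server instead of collecting it in memory.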
| throwaway48476 wrote:
| Apparently there is no web extension API to inspect the body of
| a fetch response and you have to override window.fetch
|
| Seems like an omission in the spec.
| gabrielsroka wrote:
| Why do you need a proxy or to worry about CORS? Why not just
| point your browser to rumble.com and start from there?
|
| I've posted here about scraping for example HN with JavaScript.
| It's certainly not a new idea.
|
| 2020: https://news.ycombinator.com/item?id=22788236
| CharlieDigital wrote:
| > Why do you need a proxy or to worry about CORS?
|
| Not sure about OP, but you might want to point to a proxy
| depending on the site/content you are scraping and your
| location. For example, if you are in Canada but you want to
| scrape in USD, you might need to use a proxy located in the US
| to get US prices.
|
| > Why not just point your browser to rumble.com and start from
| there?
|
| Some endpoints use simple web application firewall rules that
| will block IPs. In this case, a rotating proxy can help evade
| the blocks (and prevent your legitimate traffic from being
| blocked). Some domains use more sophisticated WAFs like Imperva
| and will do browser fingerprinting so you'll need even more
| advanced techniques to scrape successfully.
|
| Source: work at a startup that does a lot of scraping and these
| are issues we've run into. Our entire office network is blocked
| from some sites due to some early testing without a proxy.
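|
| A rotating proxy can be as simple as round-robin selection over a
| list. The sketch below shows only that selection step. The
| makeProxyRotator name and the proxy URLs are hypothetical, and how
| a request is actually routed through the chosen proxy depends on
| the HTTP client in use.

```javascript
// Return an object whose next() cycles through the proxies in order.
function makeProxyRotator(proxies) {
  if (proxies.length === 0) {
    throw new Error('need at least one proxy');
  }
  let index = 0;
  return {
    next() {
      const proxy = proxies[index];
      index = (index + 1) % proxies.length; // wrap around at the end
      return proxy;
    },
  };
}

// Hypothetical proxy addresses, for illustration only.
const rotator = makeProxyRotator([
  'http://proxy-a.example:8080',
  'http://proxy-b.example:8080',
  'http://proxy-c.example:8080',
]);

console.log(rotator.next()); // proxy-a
console.log(rotator.next()); // proxy-b
```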
| datadrivenangel wrote:
| I've been playing around with this idea lately as well! There are
| a lot of web interfaces that are hostile to scraping, and I see
| no reason why we shouldn't be able to use the data we have access
| to for our own purposes. CUSTOMIZE YOUR INTERFACES
| joshdavham wrote:
| > can you write a web scraper in your browser? The answer is:
| YES, you can! So why is nobody doing it?
|
| Completely agree with this sentiment.
|
| I just spent the last couple of months developing a chrome
| extension, but recently also did an unrelated web scraping
| project where I looked into all the common tools like Beautiful
| Soup, Selenium, Playwright, Puppeteer, etc, etc.
|
| All of these tools were needlessly complicated and I was having a
| ton of trouble with sites that required authentication. I then
| realized it would be way easier to write some javascript and
| paste it in my browser to do the scraping. Worked like a charm!
| metadat wrote:
| You might like Tampermonkey. You can add a button to kick it
| off or whatever your heart desires.
|
| Tampermonkey also works around CORS issues with relative ease.
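|
| A hedged sketch of what such a userscript might look like. It
| relies on the GM_xmlhttpRequest function that userscript managers
| provide for cross-origin requests. The script name, the matched
| site, the target URL and the buildRequest helper are all
| placeholders.

```javascript
// ==UserScript==
// @name         Scrape button (hypothetical example)
// @match        https://example.com/*
// @grant        GM_xmlhttpRequest
// @connect      example.org
// ==/UserScript==

// Build the details object passed to GM_xmlhttpRequest.
// onText receives the response body, or null if the request failed.
function buildRequest(url, onText) {
  return {
    method: 'GET',
    url,
    onload: (response) => onText(response.responseText),
    onerror: () => onText(null),
  };
}

// Only runs inside a userscript manager that grants GM_xmlhttpRequest.
if (typeof document !== 'undefined' && typeof GM_xmlhttpRequest === 'function') {
  const button = document.createElement('button');
  button.textContent = 'Scrape';
  button.style.cssText = 'position:fixed;top:8px;right:8px;z-index:99999';
  button.addEventListener('click', () => {
    // Cross-origin request that the page's own fetch could not make.
    GM_xmlhttpRequest(
      buildRequest('https://example.org/data', (text) => {
        console.log(text === null ? 'request failed' : text.slice(0, 200));
      })
    );
  });
  document.body.appendChild(button);
}
```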
| crtasm wrote:
| Userscripts for the win. Consider Violentmonkey over
| Tampermonkey though.
| metadat wrote:
| > Consider Violentmonkey over Tampermonkey though.
|
| Why?
| MC995 wrote:
| Probably because it looks to be closed source these days.
|
| https://github.com/Tampermonkey/tampermonkey
|
| > This repository contains the source of the Tampermonkey
| extension up to version 2.9. All newer versions are
| distributed under a proprietary license.
| panphora wrote:
| Open source
| joshdavham wrote:
| I love Tampermonkey! I've never made my own scripts though -
| just used other people's.
| moritzwarhier wrote:
| Is Playwright really that complicated?
|
| I feel that when it has been set up, it's very straightforward
| to use.
|
| Maybe in contrast to other solutions you posted? Not sure about
| that though; having only brief experiences with both,
| Playwright seems like an improved Cypress to me.
| ljw1004 wrote:
| In my web-scraping I've gravitated towards the "cheerio" library
| for javascript.
|
| I kind of don't want to use DOMParser because it's browser-
| only... my web-scrapers have to evolve every few years as the
| underlying web pages change, so I really want CI tests, so it's
| easiest to have something that works in node.
| welder wrote:
| Neo already did that in the Matrix:
|
| https://www.youtube.com/watch?v=sjoad6gcRzs
| deisteve wrote:
| is there anything that runs on WASM for scraping? the issue is
| that you need to enable flags and turn off other security
| features to scrape on your web browser and this is why it's not
| popular but with WASM that might change
| 8chanAnon wrote:
| WASM runs in a sandbox. It can only talk to the outside world
| via Javascript so you can forget the idea that it might be a
| way to crack through browser security.
|
| Maybe somebody will make a web browser with all of the security
| locks disabled. Sort of like the Russian commander in "The Hunt
| for Red October" who disabled his missiles' security features
| in order to more effectively target the American sub but then
| got blown up by his own missile.
| spullara wrote:
| I love it when something like this reminds me of a project from
| forever ago...
|
| https://github.com/spullara/browsercrawler
| linsomniac wrote:
| There is an extension called "Amazon Order History Reporter" that
| will scrape Amazon to download your order history. I've used it a
| couple times and it works brilliantly.
| smallerfish wrote:
| I wrote a prototype of a browser extension that scraped your
| bookmarks + 1 degree, and indexed everything into an in-memory
| search index (which gets persisted in localStorage). I took over
| the new tab page with a simple search UI, with instant type-ahead
| search.
|
| Rough aspects:
|
| a) It requires a _lot_ of browser permissions to install the
| extension, and I figured the audience who might be interested in
| their own search index would likely be put off by intrusive
| perms.
|
| b) Loading the search index from localStorage on browser startup
| took 10-15s with a moderate number of sites; not great. Maybe
| would be a fit for PouchDB or something else that makes IndexedDB
| tolerable. (Or wasm SQLite, if it's mature enough.)
|
| c) A lot of sites didn't like being scraped (even with rate
| limiting and back-off), and I ended up being served an annoying
| number of captchas in my regular everyday browsing.
|
| d) Some walled garden sites seem completely unscrapable (even in
| the browser) - e.g. Linkedin.
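|
| The back-off mentioned in (c) is commonly done as capped
| exponential back-off. The sketch below shows that delay
| calculation only. It is not the extension's actual logic, and the
| backoffDelayMs name and default values are assumptions.

```javascript
// Delay before retry number `attempt` (0-based): doubles each time,
// starting at baseMs and never exceeding capMs.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Delays for the first six retries, in milliseconds.
for (let attempt = 0; attempt < 6; attempt++) {
  console.log(attempt, backoffDelayMs(attempt));
}
```

| Adding a random offset to each delay is a common refinement so
| that queued requests do not all retry at the same moment.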
| 8chanAnon wrote:
| >Some walled garden sites seem completely unscrapable
|
| Any examples besides Linkedin? Tell me what sites you're trying
| to target and I'll have a look to see what can be done with
| them. It takes some pretty evil Javascript obfuscation to block
| me and only one site has been able to do that. I doubt that the
| sites you're hitting are anywhere near that evil, lol. I would
| appreciate it if you have a good example that I could use in a
| future article.
| smallerfish wrote:
| It's been ~18 months so I'm fuzzy on details. I remember
| gmail being tricky also.
|
| IIRC I ended up building an iframe based scraper for sites
| that didn't yield any content with just a fetch - and I think
| built a fallback mechanism so that if fetch didn't work, I'd
| queue it up in the iframe scraper. The problem with that is
| that there are various heavily used security headers that
| prohibit loading in an iframe. (And the reason for iframe vs
| just loading in a tab and injecting my extension's script is
| that I wanted it to be able to run "in the background"
| without being super distracting for the user - the tab
| changing favicon every second or two was pretty annoying.)
| changing1999 wrote:
| In my experience building a browser-based scraper I preferred
| scraping pages by a direct in-browser visit rather than a fetch
| request. A direct visit from a real browser is basically
| undetectable by anti-bot software (unless you try to do
| something funny like automated deep crawling and scraping). So
| applied to your use case it would have to go through every
| bookmark + 1 degree to index it. Maybe even in an offscreen
| canvas (haven't tried that though, could be detectable).
| paulryanrogers wrote:
| How often did it crawl? Once per day shouldn't trigger any
| blockers.
| changing1999 wrote:
| > can you write a web scraper in your browser? The answer is:
| YES, you can! So why is nobody doing it?
|
| My guess would be that some companies are doing it (I worked at a
| major tech company that is/was), just not publicizing this fact
| as crawling/scraping is such a gray legal area.
| micahdeath wrote:
| Excel/Word Macro using a WebBrowser object in a Form (old IE did
| this nicely; Haven't done that since Edge came out.)
| flashgordon wrote:
| Ah I remember doing this almost 20 years ago and even rotating
| through 1500 proxies to not get tripped up by ddos detectors :).
| A plugin is one of the ways to scrape as it also looks like a
| human (ie more js run, more divs loaded and so on).
| turingfeel wrote:
| If you want to get your personal IP and fingerprint blacklisted
| across major providers and large ranges, unfortunately this is
| how you do it. Just keep the rates low.
| zarzavat wrote:
| Obviously anyone scraping on their home IP is being foolish.
| Getting blacklisted is the least bad thing that can happen.
|
| As for fingerprinting, you can just use a different computer.
| Most people probably have a bunch of old computers lying
| around, right? If not, computers are cheap.
| squigz wrote:
| This horrendous color scheme makes it impossible for me to read
| this.
| ggorlen wrote:
| I wrote a similar post on in-browser scraping:
| https://serpapi.com/blog/dynamic-scraping-without-libraries/
|
| My approach is a step or two more automated (optionally using a
| userscript and a backend) and runs in the console on the site
| under automation rather than cross-origin, as shown in OP.
|
| In addition to being simple for one-off scripts and avoiding the
| learning curve of a Selenium, Playwright or Puppeteer, scraping
| in-browser avoids a good deal of potential bot detection issues,
| and is useful for constantly polling a site to wait for something
| to happen (for example, a specific message or article to appear).
|
| You can still use a backend and write to file, trigger an email
| or SMS, etc. Just have your userscript make requests to a server
| you're running.
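|
| A minimal sketch of the polling pattern, assuming a check
| function that returns something truthy once the awaited content
| appears. The pollUntil name, the selector and the local endpoint
| in the commented usage are all hypothetical.

```javascript
// Call check() repeatedly until it returns a truthy value or the
// attempt budget runs out. Resolves to that value, or null on timeout.
async function pollUntil(check, { intervalMs = 5000, maxAttempts = 100 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await check();
    if (result) {
      return result;
    }
    if (attempt < maxAttempts) {
      // Wait before the next check.
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return null;
}

// Browser usage (commented out; selector and endpoint are hypothetical):
// pollUntil(() => document.querySelector('#new-article')?.textContent)
//   .then((text) => {
//     if (text) {
//       return fetch('http://localhost:5000/notify', { method: 'POST', body: text });
//     }
//   });
```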
| gmac wrote:
| Yes: I find it surprising that this isn't a more widespread
| approach. It's how I've taught web scraping to my PhD students
| for some years.
|
| https://github.com/jawj/web-scraping-for-researchers
| hombre_fatal wrote:
| It's not widespread because it's much more complicated than
| making an http request and reading the results from the body.
| You don't spin up a browser, much less the full GUI, unless
| it's a last resort.
| nsonha wrote:
| sorry the format of this site is just too annoying for me to
| bother to read it. If this is about the shocking revelation that
| you can paste some code into the browser console, aka manually
| extracting information, then manually put that into whatever
| workflow that you need that information for, then I don't think
| that is called web scraping, it's just browsing the web with
| code.
| ttshaw1 wrote:
| How is this different from scraping in, say, Selenium in non-
| headless mode?
| twelve40 wrote:
| I think Selenium's killer use case is (aside from
| legacy/inertia) cross-browser and cross-language. In exchange,
| it comes with a ton of its own baggage, since it's an
| additional layer in between you and your task, with its own
| Selenium-specific bugs, behavior limitations and edge cases.
|
| If you don't need cross-browser and Chrome is all you need,
| then something like a simple Chrome extension and/or Chrome
| DevTools Protocol cuts out a lot of middle-man baggage and at
| least you will be wrangling the browser behavior directly,
| without any extra idiosyncrasies of middle layers.
| ttshaw1 wrote:
| Sorry, I dropped the context when I decided to make this a
| top-level comment. Does scraping in a browser circumvent
| scraping protections that Selenium non-headless would get
| caught by?
| 8chanAnon wrote:
| My next article will be on the topic of bypassing the
| Cloudflare bot protection. You can then compare with how
| Selenium handles this problem (if at all).
| pimlottc wrote:
| When I have to do some really quick ad-hoc webscraping, I often
| just select all text on the page, copy it, and then switch to a
| terminal window where I build a pipeline that extracts the part I
| need (using pbpaste to access the clipboard). Very quick and
| dirty for when you just need to hit a few pages.
| seanwilson wrote:
| > So the question is: can you write a web scraper in your
| browser? The answer is: YES, you can! So why is nobody doing it?
|
| > One of the issues is what is called CORS (Cross-Origin Resource
| Sharing) which is a set of protocols which may forbid or allow
| access to a web resource by Javascript. There are two possible
| workarounds: a browser extension or a proxy server. The first
| choice is fairly limited since some security restrictions still
| apply.
|
| I'm doing this for a browser extension that crawls a website from
| page to page checking for SEO/speed/security problems
| (https://www.checkbot.io/). It's been flexible enough, and it's
| nice not to have to maintain and scale servers for the web
| crawling. https://browserflow.app/ is another extension I know of
| that does scraping within the browser I think, and other
| automation.
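|
| One small piece of a page-to-page crawler is deciding which links
| to follow. The sketch below keeps only same-origin links and drops
| fragments and duplicates. It illustrates the general idea and is
| not how either extension works. The sameOriginLinks name is made
| up.

```javascript
// Resolve hrefs against the page URL and keep unique same-origin ones.
function sameOriginLinks(pageUrl, hrefs) {
  const origin = new URL(pageUrl).origin;
  const found = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, pageUrl); // handles relative and absolute hrefs
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (url.origin !== origin) {
      continue; // other sites, mailto:, javascript:, etc.
    }
    url.hash = ''; // "#section" variants are the same page
    found.add(url.href);
  }
  return [...found];
}

// In a browser the hrefs could come from the current page:
// const hrefs = [...document.querySelectorAll('a[href]')]
//   .map((a) => a.getAttribute('href'));
// const queue = sameOriginLinks(location.href, hrefs);
```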
___________________________________________________________________
(page generated 2024-10-02 23:02 UTC)