       [HN Gopher] Web scraping with your web browser: Why not?
       ___________________________________________________________________
        
       Web scraping with your web browser: Why not?
        
       Includes working code. First article in a planned series.
        
       Author : 8chanAnon
       Score  : 127 points
       Date   : 2024-10-01 17:55 UTC (1 day ago)
        
 (HTM) web link (8chananon.github.io)
        
       | dewey wrote:
       | I've read through that (hard to read, because of the bad
       | formatting) but I still don't understand why you would do that
       | instead of Playwright, Puppeteer etc. - The only reason seems to
       | be "This technique certainly has its limits.".
        
         | bdcravens wrote:
         | Solutions that want to automate in the context of their
         | customers' browser. For example, ListPerfectly, a solution for
         | cross-listing to eBay, Poshmark, etc, does this in their
         | browser extension.
        
           | dewey wrote:
           | This is just regular browser extension behavior though, a bit
           | different than the use case talked about in this post.
        
         | 8chanAnon wrote:
         | >bad formatting
         | 
         | If you can elaborate, I would very much appreciate it. I'm
         | always interested in doing better.
         | 
         | Why use Puppeteer etc. when you don't have to? What is the
         | argument for using these additional tools versus not using
         | them?
        
           | gabrielsroka wrote:
           | The red text on a yellow background is not great. Neither is
           | a serif font. Also it should be JavaScript with a capital S.
        
             | 8chanAnon wrote:
             | Red text on yellow - you mean the website? Would you like
             | the text to be darker?
             | 
             | And Beautiful Soup should be BeautifulSoup. Who makes the
             | rules?
        
               | gabrielsroka wrote:
               | Yes, the website. Better contrast is good. Black on white
               | works great.
               | 
               | Margins would also be nice on the left and right.
               | 
               | Beautiful Soup is two words. Just look at their website.
        
               | bityard wrote:
               | Too much contrast makes it hard to read too. (With bonus
               | eye strain.)
        
               | 8chanAnon wrote:
               | Is there a happy in-between? Maybe not. What looks
               | perfect to one user might appear atrocious to another.
               | What is a poor website operator to do???
               | 
               | I dislike black-on-white and don't understand gray-on-
               | black which seems to be popular now due to gamma settings
               | being cranked up to 11 or something. I try to use some
               | color as an in-between but that may take some time to
               | "perfect".
        
               | gabrielsroka wrote:
               | Even Google Chrome's Lighthouse said that your background
               | and foreground colors do not have a sufficient contrast
               | ratio.
        
               | 8chanAnon wrote:
               | Every dark mode site in existence should fail that test.
        
               | dewey wrote:
               | No, because the contrast can still be good as the colors
               | are just reversed?
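
       The point about reversed colors can be checked with the WCAG
       contrast-ratio formula, which is the kind of check Lighthouse
       reports on. The sketch below is only an illustration: pure red
       on pure yellow is an assumed stand-in for the article's colors,
       which are not given in this thread.

```javascript
// WCAG 2.x relative luminance of an sRGB color given as [r, g, b] in 0..255.
function relativeLuminance([r, g, b]) {
  const linear = [r, g, b].map((channel) => {
    const c = channel / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * linear[0] + 0.7152 * linear[1] + 0.0722 * linear[2];
}

// Contrast ratio is (lighter + 0.05) / (darker + 0.05), so swapping
// foreground and background gives the same number.
function contrastRatio(colorA, colorB) {
  const a = relativeLuminance(colorA);
  const b = relativeLuminance(colorB);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

const black = [0, 0, 0];
const white = [255, 255, 255];
const red = [255, 0, 0];      // assumption: stand-in for the article's text color
const yellow = [255, 255, 0]; // assumption: stand-in for the article's background

console.log(contrastRatio(black, white).toFixed(2)); // black text on white
console.log(contrastRatio(white, black).toFixed(2)); // same pair as a dark theme
console.log(contrastRatio(red, yellow).toFixed(2));
```

       Black on white and white on black both come out at 21:1, while
       pure red on pure yellow is about 3.72:1, under the 4.5:1 that
       WCAG AA asks of normal-size text. A dark theme fails only if
       its particular color pair is too close in luminance.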
        
               | pwg wrote:
               | > Is there a happy in-between?
               | 
               | Since browsers allow users to configure a default font
               | and background color then one possible "happy in-between"
               | would be to set no background color, and set no font
               | color, thereby allowing each user agent (i.e., browser)
               | to display the site with that user's default background
               | and font colors.
               | 
               | In that case, each viewer should get their preferred
               | colors, all without you doing anything.
        
           | dewey wrote:
           | You don't have max-width set on the text, so unless you have
           | your browser window resized to a very small size the
           | paragraphs will span your whole screen.
        
             | 8chanAnon wrote:
             | I see. I don't have a wide-screen monitor (still using an
             | old tube type until it finally expires but it's taking a
             | few decades, lol). I've wondered whether people actually
             | like reading websites on wide-screen. Some do and some
             | don't. What would you suggest for a max width?
             | 
             | You could also try zooming in. My apps don't expand to full
             | width because of the video box but you can zoom.
        
               | dewey wrote:
               | There's a lot of info about this but usually 500-700px or
               | ~80 characters will be much easier to read:
               | https://ux.stackexchange.com/questions/108801/what-is-
               | the-be...
        
             | pwg wrote:
             | Which places you, the reader, fully in control of the width
             | of lines you prefer to see. Adjust your browser window
             | width, or apply a user style sheet to tell your browser to
             | format the text the way you want to see it formatted.
        
               | dewey wrote:
               | I knew this reply would come... How many people do you
               | think use custom browser stylesheets? It's probably
               | smaller than 0.1% of the internet population and everyone
               | moved on to formatting text so everyone can enjoy good
               | readability. Also not all devices have the luxury of
               | supporting custom stylesheets.
               | 
               | Of course it's always up to the site owner, but most
               | people want people to read what they share.
        
               | nsonha wrote:
               | > Adjust your browser window width, or apply a user style
               | sheet
               | 
               | very funny, both jokes
        
             | bityard wrote:
             | I have the opposite problem. I typically have many browser
             | windows open at the same time but only two screens, and
             | many sites that I use are designed to assume that everyone
             | has full-screen browsers.
        
               | micahdeath wrote:
               | Similar problem here. My resolution is 1440x900 (27in
               | Monitor) paired w/ 1280x720 (32in TV) and I keep 3
               | Browsers Open (Edge, Opera, FireFox - each has intent).
               | Each is at 3/4 width and 1/2 height and offset so I can
               | see each partially.
               | 
               | With this setup, many sites work, but a few... a few have
               | a top ad banner, a side banner and a footer of 'cookie
               | acceptance'... then add in a 'subscribe to our email' and
               | a Google login prompt.... (Game wikis... I game in
               | smaller windows too -- what good is a multi tasking
               | computer if you don't use it?)
        
       | chaosharmonic wrote:
       | > You can find plenty of tutorials on the Internet about the art
       | of web scraping... and the first things you will learn about are
       | Python and Beautiful Soup. There is no tutorial on web scraping
       | with Javascript in a web browser...
       | 
       | Um... [0]
       | 
       | [0] https://bhmt.dev/blog/scraping
        
         | 8chanAnon wrote:
         | Rather long so I'll read it later. Thanks for the tip. Got more
         | or is that it?
        
           | chaosharmonic wrote:
           | Not that I _know_ of, I'm just quipping a bit about my own
           | work lol
        
       | simlan wrote:
       | I also did something similar for my spring project. The idea was
       | to buy a used car and I was frustrated with the BS the listing
       | sites claimed as fair price etc..
       | 
       | I went the browser extension route and used Greasemonkey to
       | inject custom JavaScript. I patched window.fetch, and because
       | it was a React page it did most of the work for me, providing
       | me with a slightly convoluted JSON doc every time I scrolled.
       | Getting the data extracted was only a question of getting a
       | Flask API with correct CORS settings running.
       | 
       | Thanks for posting. Using a local proxy for even more control
       | could be helpful in the future.
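
       A minimal sketch of the window.fetch patch described above,
       not the commenter's actual script. The names patchFetch and
       onJson are hypothetical, and the wrapper assumes the page
       loads its data as JSON through fetch.

```javascript
// Wrap `target.fetch` so every JSON response is also handed to `onJson`.
// In a userscript `target` would be the page's window; it is a parameter
// here so the wrapper can be exercised outside a browser.
function patchFetch(target, onJson) {
  const originalFetch = target.fetch;

  target.fetch = async function patchedFetch(...args) {
    const response = await originalFetch.apply(target, args);
    const contentType = response.headers.get("content-type") || "";

    if (contentType.includes("application/json")) {
      // A body can only be read once, so inspect a clone and leave the
      // original untouched for the page's own code.
      response
        .clone()
        .json()
        .then((data) => onJson(data, response.url))
        .catch(() => {}); // ignore bodies that fail to parse
    }
    return response;
  };

  // Return a function that undoes the patch.
  return () => {
    target.fetch = originalFetch;
  };
}
```

       In the browser this could be started with something like
       `const restore = patchFetch(window, (data, url) =>
       console.log(url, data));`. Depending on the userscript manager,
       the page's real window may be exposed as `unsafeWindow` rather
       than `window`.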
        
         | throwaway48476 wrote:
         | Apparently there is no web extension API to inspect the body of
         | a fetch response and you have to override window.fetch.
         | 
         | Seems like an omission in the spec.
        
       | gabrielsroka wrote:
       | Why do you need a proxy or to worry about CORS? Why not just
       | point your browser to rumble.com and start from there?
       | 
       | I've posted here about scraping for example HN with JavaScript.
       | It's certainly not a new idea.
       | 
       | 2020: https://news.ycombinator.com/item?id=22788236
        
         | CharlieDigital wrote:
         | > Why do you need a proxy or to worry about CORS?
         | 
         | Not sure about OP, but you might want to point to a proxy
         | depending on the site/content you are scraping and your
         | location. For example, if you are in Canada but you want to
         | scrape in USD, you might need to use a proxy located in the US
         | to get US prices.
         | 
         | > Why not just point your browser to rumble.com and start from
         | there?
         | 
         | Some endpoints use simple web application firewall rules that
         | will block IPs. In this case, a rotating proxy can help evade
         | the blocks (and prevent your legitimate traffic from being
         | blocked). Some domains use more sophisticated WAFs like Imperva
         | and will do browser fingerprinting so you'll need even more
         | advanced techniques to scrape successfully.
         | 
         | Source: work at a startup that does a lot of scraping and these
         | are issues we've run into. Our entire office network is blocked
         | from some sites due to some early testing without a proxy.
        
       | datadrivenangel wrote:
       | I've been playing around with this idea lately as well! There are
       | a lot of web interfaces that are hostile to scraping, and I see
       | no reason why we shouldn't be able to use the data we have access
       | to for our own purposes. CUSTOMIZE YOUR INTERFACES
        
       | joshdavham wrote:
       | > can you write a web scraper in your browser? The answer is:
       | YES, you can! So why is nobody doing it?
       | 
       | Completely agree with this sentiment.
       | 
       | I just spent the last couple of months developing a Chrome
       | extension, but recently also did an unrelated web scraping
       | project where I looked into all the common tools like Beautiful
       | Soup, Selenium, Playwright, Puppeteer, etc.
       | 
       | All of these tools were needlessly complicated and I was having a
       | ton of trouble with sites that required authentication. I then
       | realized it would be way easier to write some JavaScript and
       | paste it in my browser to do the scraping. Worked like a charm!
        
         | metadat wrote:
         | You might like Tampermonkey. You can add a button to kick it
         | off or whatever your heart desires.
         | 
         | Tampermonkey also works around CORS issues with relative ease.
        
           | crtasm wrote:
           | Userscripts for the win. Consider Violentmonkey over
           | Tampermonkey though.
        
             | metadat wrote:
             | > Consider Violentmonkey over Tampermonkey though.
             | 
             | Why?
        
               | MC995 wrote:
               | Probably because Tampermonkey looks to be closed source
               | these days.
               | 
               | https://github.com/Tampermonkey/tampermonkey
               | 
               | > This repository contains the source of the Tampermonkey
               | extension up to version 2.9. All newer versions are
               | distributed under a proprietary license.
        
               | panphora wrote:
               | Open source
        
           | joshdavham wrote:
           | I love Tampermonkey! I've never made my own scripts though -
           | just used other people's.
        
         | moritzwarhier wrote:
         | Is Playwright really that complicated?
         | 
         | I feel that when it has been set up, it's very straightforward
         | to use.
         | 
         | Maybe in contrast to other solutions you posted? Not sure about
         | that though; having only brief experiences with both,
         | Playwright seems like an improved Cypress to me.
        
       | ljw1004 wrote:
       | In my web-scraping I've gravitated towards the "cheerio" library
       | for JavaScript.
       | 
       | I kind of don't want to use DOMParser because it's browser-
       | only... my web-scrapers have to evolve every few years as the
       | underlying web pages change, so I really want CI tests, so it's
       | easiest to have something that works in Node.
        
       | welder wrote:
       | Neo already did that in the Matrix:
       | 
       | https://www.youtube.com/watch?v=sjoad6gcRzs
        
       | deisteve wrote:
       | Is there anything that runs on WASM for scraping? The issue is
       | that you need to enable flags and turn off other security
       | features to scrape in your web browser, and this is why it's not
       | popular, but with WASM that might change.
        
         | 8chanAnon wrote:
         | WASM runs in a sandbox. It can only talk to the outside world
         | via Javascript so you can forget the idea that it might be a
         | way to crack through browser security.
         | 
         | Maybe somebody will make a web browser with all of the security
         | locks disabled. Sort of like the Russian commander in "The Hunt
         | for Red October" who disabled his missiles' security features
         | in order to more effectively target the American sub but then
         | got blown up by his own missile.
        
       | spullara wrote:
       | I love it when something like this reminds me of a project from
       | forever ago...
       | 
       | https://github.com/spullara/browsercrawler
        
       | linsomniac wrote:
       | There is an extension called "Amazon Order History Reporter" that
       | will scrape Amazon to download your order history. I've used it a
       | couple times and it works brilliantly.
        
       | smallerfish wrote:
       | I wrote a prototype of a browser extension that scraped your
       | bookmarks + 1 degree, and indexed everything into an in-memory
       | search index (which gets persisted in localStorage). I took over
       | the new tab page with a simple search UI, with instant type-ahead
       | search.
       | 
       | Rough aspects:
       | 
       | a) It requires a _lot_ of browser permissions to install the
       | extension, and I figured the audience who might be interested in
       | their own search index would likely be put off by intrusive
       | perms.
       | 
       | b) Loading the search index from localStorage on browser startup
       | took 10-15s with a moderate number of sites; not great. Maybe
       | would be a fit for PouchDB or something else that makes IndexedDB
       | tolerable. (Or Wasm SQLite, if it's mature enough.)
       | 
       | c) A lot of sites didn't like being scraped (even with rate
       | limiting and back-off), and I ended up being served an annoying
       | number of captchas in my regular everyday browsing.
       | 
       | d) Some walled garden sites seem completely unscrapable (even in
       | the browser) - e.g. LinkedIn.
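
       For readers unfamiliar with the "rate limiting and back-off"
       mentioned in (c), a minimal sketch of one common form,
       exponential back-off on throttling responses, follows. It is
       not the extension's code: politeFetch, backoffDelay and the
       default timings are assumptions, and, as the comment notes,
       this alone did not stop the captchas.

```javascript
// Delay before retry number `attempt` (0-based): baseMs, 2*baseMs,
// 4*baseMs, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Fetch `url`, waiting longer after each 429 (Too Many Requests) or
// 503 (Service Unavailable) answer. After `retries` retries the last
// response is returned as-is. `fetchImpl` and `sleep` are injectable so
// the logic can be tried without a network.
async function politeFetch(url, {
  retries = 4,
  fetchImpl = fetch,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  for (let attempt = 0; ; attempt++) {
    const response = await fetchImpl(url);
    const throttled = response.status === 429 || response.status === 503;
    if (!throttled || attempt >= retries) return response;
    await sleep(backoffDelay(attempt));
  }
}
```

       With the defaults, a site that keeps answering 429 is retried
       after 1, 2, 4 and 8 seconds before the caller gets the last
       response back.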
        
         | 8chanAnon wrote:
         | >Some walled garden sites seem completely unscrapable
         | 
         | Any examples besides LinkedIn? Tell me what sites you're trying
         | to target and I'll have a look to see what can be done with
         | them. It takes some pretty evil Javascript obfuscation to block
         | me and only one site has been able to do that. I doubt that the
         | sites you're hitting are anywhere near that evil, lol. I would
         | appreciate it if you have a good example that I could use in a
         | future article.
        
           | smallerfish wrote:
           | It's been ~18 months so I'm fuzzy on details. I remember
           | Gmail being tricky also.
           | 
           | IIRC I ended up building an iframe based scraper for sites
           | that didn't yield any content with just a fetch - and I think
           | built a fallback mechanism so that if fetch didn't work, I'd
           | queue it up in the iframe scraper. The problem with that is
           | that there are various heavily used security headers that
           | prohibit loading in an iframe. (And the reason for iframe vs
           | just loading in a tab and injecting my extension's script is
           | that I wanted it to be able to run "in the background"
           | without being super distracting for the user - the tab
           | changing favicon every second or two was pretty annoying.)
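
       The fetch-first, iframe-as-fallback arrangement recalled above
       could be sketched roughly as follows. This is a reconstruction
       under assumptions, not the original extension: fetchPage,
       renderPage and hasContent are hypothetical hooks supplied by
       the caller.

```javascript
// Try a plain fetch first; if it fails or yields no usable content, park
// the URL in a queue for a slower renderer (an iframe, in the comment above).
async function scrapeWithFallback(
  url,
  fetchPage,
  fallbackQueue,
  hasContent = (html) => html.trim().length > 0
) {
  try {
    const html = await fetchPage(url);
    if (hasContent(html)) return { url, html, via: "fetch" };
  } catch (error) {
    // Network or CORS failure: fall through to the queue.
  }
  fallbackQueue.push(url);
  return { url, html: null, via: "queued" };
}

// Work through queued URLs one at a time so only one background render
// runs at any moment.
async function drainQueue(fallbackQueue, renderPage) {
  const results = [];
  while (fallbackQueue.length > 0) {
    const url = fallbackQueue.shift();
    results.push({ url, html: await renderPage(url), via: "fallback" });
  }
  return results;
}
```

       In a browser, renderPage might load the URL in a hidden iframe
       and read the document once it has loaded. The security headers
       the comment refers to are presumably X-Frame-Options and the
       CSP frame-ancestors directive, either of which makes the
       browser refuse to display a page inside a frame.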
        
         | changing1999 wrote:
         | In my experience building a browser-based scraper I preferred
         | scraping pages by a direct in-browser visit rather than a fetch
         | request. A direct visit from a real browser is basically
         | undetectable by anti-bot software (unless you try to do
         | something funny like automated deep crawling and scraping). So
         | applied to your use case it would have to go through every
         | bookmark + 1 degree to index it. Maybe even in an offscreen
         | canvas (haven't tried that though, could be detectable).
        
         | paulryanrogers wrote:
         | How often did it crawl? Once per day shouldn't trigger any
         | blockers.
        
       | changing1999 wrote:
       | > can you write a web scraper in your browser? The answer is:
       | YES, you can! So why is nobody doing it?
       | 
       | My guess would be that some companies are doing it (I worked at a
       | major tech company that is/was), just not publicizing this fact
       | as crawling/scraping is such a gray legal area.
        
       | micahdeath wrote:
       | Excel/Word Macro using a WebBrowser object in a Form (old IE did
       | this nicely; Haven't done that since Edge came out.)
        
       | flashgordon wrote:
       | Ah I remember doing this almost 20 years ago and even rotating
       | through 1500 proxies to not get tripped up by DDoS detectors :).
       | A plugin is one of the ways to scrape as it also looks like a
       | human (i.e. more JS run, more divs loaded and so on).
        
       | turingfeel wrote:
       | If you want to get your personal IP and fingerprint blacklisted
       | across major providers and large ranges, unfortunately this is
       | how you do it. Just keep the rates low.
        
         | zarzavat wrote:
         | Obviously anyone scraping on their home IP is being foolish.
         | Getting blacklisted is the least bad thing that can happen.
         | 
         | As for fingerprinting, you can just use a different computer.
         | Most people probably have a bunch of old computers lying
         | around, right? If not, computers are cheap.
        
       | squigz wrote:
       | This horrendous color scheme makes it impossible for me to read
       | this.
        
       | ggorlen wrote:
       | I wrote a similar post on in-browser scraping:
       | https://serpapi.com/blog/dynamic-scraping-without-libraries/
       | 
       | My approach is a step or two more automated (optionally using a
       | userscript and a backend) and runs in the console on the site
       | under automation rather than cross-origin, as shown in OP.
       | 
       | In addition to being simple for one-off scripts and avoiding the
       | learning curve of Selenium, Playwright or Puppeteer, scraping
       | in-browser avoids a good deal of potential bot detection issues,
       | and is useful for constantly polling a site to wait for something
       | to happen (for example, a specific message or article to appear).
       | 
       | You can still use a backend and write to file, trigger an email
       | or SMS, etc. Just have your userscript make requests to a server
       | you're running.
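
       A server that a userscript can post to has to answer the
       browser's CORS checks, the same requirement as the "correct
       CORS settings" mentioned earlier in the thread. A minimal Node
       sketch follows, assuming a local experiment: makeHandler,
       startReceiver, port 8000 and the wildcard origin are all
       illustrative choices, not anything taken from the linked post.

```javascript
// Headers that let a page on another origin POST JSON to this server.
// "*" is an assumption for a local experiment; name the scraped site's
// origin instead if anyone else can reach the server.
const CORS_HEADERS = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Methods": "POST, OPTIONS",
  "Access-Control-Allow-Headers": "Content-Type",
};

// Request handler: answers the CORS preflight and stores each POSTed
// JSON record in `records`.
function makeHandler(records) {
  return function handle(req, res) {
    if (req.method === "OPTIONS") {
      res.writeHead(204, CORS_HEADERS);
      res.end();
      return;
    }
    if (req.method !== "POST") {
      res.writeHead(405, CORS_HEADERS);
      res.end();
      return;
    }
    let body = "";
    req.on("data", (chunk) => { body += chunk; });
    req.on("end", () => {
      try {
        records.push(JSON.parse(body));
        res.writeHead(201, CORS_HEADERS);
      } catch (error) {
        res.writeHead(400, CORS_HEADERS); // body was not valid JSON
      }
      res.end();
    });
  };
}

// Start listening on localhost. `require` is called here so the handler
// above can be loaded and tested without opening a port.
function startReceiver(port = 8000, records = []) {
  const http = require("http");
  return http.createServer(makeHandler(records)).listen(port, "127.0.0.1");
}
```

       After `startReceiver()`, the userscript side could send a
       record with `fetch("http://127.0.0.1:8000/", { method: "POST",
       headers: { "Content-Type": "application/json" }, body:
       JSON.stringify(record) })`. From there the server can write to
       a file or send a notification. A page's own
       Content-Security-Policy can still block such a request.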
        
       | gmac wrote:
       | Yes: I find it surprising that this isn't a more widespread
       | approach. It's how I've taught web scraping to my PhD students
       | for some years.
       | 
       | https://github.com/jawj/web-scraping-for-researchers
        
         | hombre_fatal wrote:
         | It's not widespread because it's much more complicated than
         | making an http request and reading the results from the body.
         | You don't spin up a browser, much less the full GUI, unless
         | it's a last resort.
        
       | nsonha wrote:
       | Sorry, the format of this site is just too annoying for me to
       | bother to read it. If this is about the shocking revelation that
       | you can paste some code into the browser console, aka manually
       | extracting information, then manually putting that into whatever
       | workflow you need that information for, then I don't think
       | that is called web scraping, it's just browsing the web with
       | code.
        
       | ttshaw1 wrote:
       | How is this different from scraping in, say, Selenium in non-
       | headless mode?
        
         | twelve40 wrote:
         | I think Selenium's killer use case is (aside from
         | legacy/inertia) cross-browser and cross-language. In exchange,
         | it comes with a ton of its own baggage, since it's an
         | additional layer in between you and your task, with its own
         | Selenium-specific bugs, behavior limitations and edge cases.
         | 
         | If you don't need cross-browser and Chrome is all you need,
         | then something like a simple Chrome extension and/or Chrome
         | DevTools Protocol cuts out a lot of middle-man baggage and at
         | least you will be wrangling the browser behavior directly,
         | without any extra idiosyncrasies of middle layers.
        
           | ttshaw1 wrote:
           | Sorry, I dropped the context when I decided to make this a
           | top-level comment. Does scraping in a browser circumvent
           | scraping protections that Selenium non-headless would get
           | caught by?
        
             | 8chanAnon wrote:
             | My next article will be on the topic of bypassing the
             | Cloudflare bot protection. You can then compare with how
             | Selenium handles this problem (if at all).
        
       | pimlottc wrote:
       | When I have to do some really quick ad-hoc webscraping, I often
       | just select all text on the page, copy it, and then switch to a
       | terminal window where I build a pipeline that extracts the part I
       | need (using pbpaste to access the clipboard). Very quick and
       | dirty for when you just need to hit a few pages.
        
       | seanwilson wrote:
       | > So the question is: can you write a web scraper in your
       | browser? The answer is: YES, you can! So why is nobody doing it?
       | 
       | > One of the issues is what is called CORS (Cross-Origin Resource
       | Sharing) which is a set of protocols which may forbid or allow
       | access to a web resource by Javascript. There are two possible
       | workarounds: a browser extension or a proxy server. The first
       | choice is fairly limited since some security restrictions still
       | apply.
       | 
       | I'm doing this for a browser extension that crawls a website from
       | page to page checking for SEO/speed/security problems
       | (https://www.checkbot.io/). It's been flexible enough, and it's
       | nice not to have to maintain and scale servers for the web
       | crawling. https://browserflow.app/ is another extension I know of
       | that does scraping within the browser I think, and other
       | automation.
        
       ___________________________________________________________________
       (page generated 2024-10-02 23:02 UTC)