[HN Gopher] Show HN: Web Scraping with Your Web Browser: Why Not?
       ___________________________________________________________________
        
       Show HN: Web Scraping with Your Web Browser: Why Not?
        
       Includes working code. First article in a planned series.
        
       Author : 8chanAnon
       Score  : 48 points
       Date   : 2024-10-01 17:55 UTC (5 hours ago)
        
 (HTM) web link (8chananon.github.io)
 (TXT) w3m dump (8chananon.github.io)
        
       | dewey wrote:
       | I've read through that (hard to read, because of the bad
       | formatting) but I still don't understand why you would do that
       | instead of Playwright, Puppeteer etc. - The only reason seems to
       | be "This technique certainly has its limits.".
        
         | bdcravens wrote:
         | Solutions that want to automate in the context of their
         | customers' browser. For example, ListPerfectly, a solution for
         | cross-listing to eBay, Poshmark, etc, does this in their
         | browser extension.
        
           | dewey wrote:
           | This is just regular browser extension behavior though, a bit
           | different than the use case talked about in this post.
        
         | 8chanAnon wrote:
         | >bad formatting
         | 
         | If you can elaborate, I would very much appreciate it. I'm
         | always interested in doing better.
         | 
         | Why use Puppeteer etc. when you don't have to? What is the
         | argument for using these additional tools versus not using
         | them?
        
           | gabrielsroka wrote:
           | The red text on a yellow background is not great. Neither is
           | a serif font. Also it should be JavaScript with a capital S.
        
             | 8chanAnon wrote:
             | Red text on yellow - you mean the website? Would you like
             | the text to be darker?
             | 
             | And Beautiful Soup should be BeautifulSoup. Who makes the
             | rules?
        
               | gabrielsroka wrote:
               | Yes, the website. Better contrast is good. Black on white
               | works great.
               | 
               | Margins would also be nice on the left and right.
               | 
               | Beautiful Soup is two words. Just look at their website.
        
               | bityard wrote:
               | Too much contrast makes it hard to read too. (With bonus
               | eye strain.)
        
               | 8chanAnon wrote:
               | Is there a happy in-between? Maybe not. What looks
               | perfect to one user might appear atrocious to another.
               | What is a poor website operator to do???
               | 
               | I dislike black-on-white and don't understand gray-on-
               | black which seems to be popular now due to gamma settings
               | being cranked up to 11 or something. I try to use some
               | color as an in-between but that may take some time to
               | "perfect".
        
           | dewey wrote:
           | You don't have max-width set on the text, so unless you have
           | your browser window resized to very small size the paragraphs
           | will span your whole screen.
        
             | 8chanAnon wrote:
             | I see. I don't have a wide-screen monitor (still using an
             | old tube type until it finally expires but it's taking a
             | few decades, lol). I've wondered whether people actually
             | like reading websites on wide-screen. Some do and some
             | don't. What would you suggest for a max width?
             | 
             | You could also try zooming in. My apps don't expand to full
             | width because of the video box but you can zoom.
        
               | dewey wrote:
               | There's a lot of info about this but usually 500-700px or
               | ~80 characters will be much easier to read:
               | https://ux.stackexchange.com/questions/108801/what-is-
               | the-be...
        
             | pwg wrote:
             | Which places you, the reader, fully in control of the width
             | of lines you prefer to see. Adjust your browser window
             | width, or apply a user style sheet to tell your browser to
             | format the text the way you want to see it formatted.
        
               | dewey wrote:
               | I knew this reply would come... How many people do you
               | think use custom browser stylesheets? It's probably
               | smaller than 0.1% of the internet population and everyone
               | moved on to formatting text so everyone can enjoy good
               | readability. Also not all devices have the luxury of
               | supporting custom stylesheets.
               | 
               | Of course it's always up to the site owner, but most
               | people want people to read what they share.
        
             | bityard wrote:
             | I have the opposite problem. I typically have many browser
             | windows open at the same time but only two screens, and
             | many sites that I use are designed to assume that everyone
             | has full-screen browsers.
        
       | chaosharmonic wrote:
       | > You can find plenty of tutorials on the Internet about the art
       | of web scraping... and the first things you will learn about are
       | Python and Beautiful Soup. There is no tutorial on web scraping
       | with Javascript in a web browser...
       | 
       | Um... [0]
       | 
       | [0] https://bhmt.dev/blog/scraping
        
         | 8chanAnon wrote:
         | Rather long so I'll read it later. Thanks for the tip. Got more
         | or is that it?
        
           | chaosharmonic wrote:
           | Not that I _know_ of, I'm just quipping a bit about my own
           | work lol
        
       | simlan wrote:
       | I also did something similar for my spring project. The idea was
       | to buy a used car and I was frustrated with the BS the listing
       | sites claimed as fair price etc.
       | 
       | I went the browser extension route and used Greasemonkey to
       | inject custom JavaScript. I patched window.fetch, and because it
       | was a React page it did most of the work for me, providing me
       | with a slightly convoluted JSON doc every time I scrolled.
       | Getting the data extracted was only a question of getting a
       | Flask API with correct CORS settings running.
       | 
       | Thanks for posting. Using a local proxy for even more control
       | could be helpful in the future.
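
A minimal sketch of the fetch-patching idea described in the comment above, not the commenter's actual code. The names `patchFetch` and `onJson` are hypothetical, and the sketch assumes the page loads its data through `window.fetch` and labels it with a JSON content type.

```javascript
// Wrap the fetch function on `target` (window in a userscript) so that every
// JSON response is also handed to a callback. `patchFetch` and `onJson` are
// illustrative names, not part of any library.
function patchFetch(target, onJson) {
  const originalFetch = target.fetch;

  target.fetch = async function (...args) {
    const response = await originalFetch.call(target, ...args);
    const type = response.headers.get("content-type") || "";

    if (type.includes("application/json")) {
      try {
        // Read from a clone so the page can still consume the original body.
        const data = await response.clone().json();
        onJson(args[0], data); // args[0] may be a URL string or a Request
      } catch (err) {
        // Ignore bodies that fail to parse; the page still gets its response.
      }
    }
    return response;
  };

  // Return a function that undoes the patch.
  return () => {
    target.fetch = originalFetch;
  };
}

// In a userscript this might be used as:
//   patchFetch(window, (url, data) => console.log(url, data));
```

The callback is where the captured data could be forwarded to a local collector, which is the point at which the collector's CORS settings matter.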
        
       | gabrielsroka wrote:
       | Why do you need a proxy or to worry about CORS? Why not just
       | point your browser to rumble.com and start from there?
       | 
       | I've posted here about scraping for example HN with JavaScript.
       | It's certainly not a new idea.
       | 
       | 2020: https://news.ycombinator.com/item?id=22788236
        
         | CharlieDigital wrote:
         | > Why do you need a proxy or to worry about CORS?
         | 
         | Not sure about OP, but you might want to point to a proxy
         | depending on the site/content you are scraping and your
         | location. For example, if you are in Canada but you want to
         | scrape in USD, you might need to use a proxy located in the US
         | to get US prices.
         | 
         | > Why not just point your browser to rumble.com and start from
         | there?
         | 
         | Some endpoints use simple web application firewall rules that
         | will block IPs. In this case, a rotating proxy can help evade
         | the blocks (and prevent your legitimate traffic from being
         | blocked). Some domains use more sophisticated WAFs like Imperva
         | and will do browser fingerprinting so you'll need even more
         | advanced techniques to scrape successfully.
         | 
         | Source: work at a startup that does a lot of scraping and these
         | are issues we've run into. Our entire office network is blocked
         | from some sites due to some early testing without a proxy.
        
       | datadrivenangel wrote:
       | I've been playing around with this idea lately as well! There are
       | a lot of web interfaces that are hostile to scraping, and I see
       | no reason why we shouldn't be able to use the data we have access
       | to for our own purposes. CUSTOMIZE YOUR INTERFACES
        
       | joshdavham wrote:
       | > can you write a web scraper in your browser? The answer is:
       | YES, you can! So why is nobody doing it?
       | 
       | Completely agree with this sentiment.
       | 
       | I just spent the last couple of months developing a Chrome
       | extension, but recently also did an unrelated web scraping
       | project where I looked into all the common tools like Beautiful
       | Soup, Selenium, Playwright, Puppeteer, etc.
       | 
       | All of these tools were needlessly complicated and I was having a
       | ton of trouble with sites that required authentication. I then
       | realized it would be way easier to write some javascript and
       | paste it in my browser to do the scraping. Worked like a charm!
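
A minimal sketch of the kind of snippet that can be pasted into the DevTools console, as the comment above describes. The function name `scrapeLinks` and the default selector are assumptions for illustration. Because it runs inside the already logged-in page, it sees whatever the session can see.

```javascript
// Collect link text and URLs from a document-like object.
// When pasted into the console, `doc` defaults to the page's own `document`.
// `scrapeLinks` is a hypothetical helper name.
function scrapeLinks(doc = document, selector = "a[href]") {
  return Array.from(doc.querySelectorAll(selector), (a) => ({
    text: a.textContent.trim(),
    href: a.href,
  }));
}

// In the console, for example:
//   console.table(scrapeLinks());
```

Passing the document in as a parameter keeps the extraction logic separate from the page, so it can be exercised against a stand-in object as well.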
        
         | metadat wrote:
         | You might like Tampermonkey. You can add a button to kick it
         | off or whatever your heart desires.
         | 
         | Tampermonkey also works around CORS issues with relative ease.
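
A rough sketch of how a userscript can sidestep CORS, assuming the script is granted `GM_xmlhttpRequest`. The script name, the `@match` and `@connect` hosts, and the helper `gmGet` are placeholders, not taken from the thread.

```javascript
// ==UserScript==
// @name     Cross-origin fetch sketch (hypothetical)
// @match    https://example.com/*
// @grant    GM_xmlhttpRequest
// @connect  api.example.org
// ==/UserScript==

// Promise wrapper around a GM_xmlhttpRequest-style function. The request is
// made by the userscript manager rather than the page, so the page's
// same-origin restrictions do not apply to it.
// `request` is a parameter so the wrapper can be tried with a stand-in.
function gmGet(url, request = GM_xmlhttpRequest) {
  return new Promise((resolve, reject) => {
    request({
      method: "GET",
      url,
      onload: (res) =>
        res.status >= 200 && res.status < 300
          ? resolve(res.responseText)
          : reject(new Error(`HTTP ${res.status} for ${url}`)),
      onerror: () => reject(new Error(`Network error for ${url}`)),
    });
  });
}

// Inside the userscript, for example:
//   gmGet("https://api.example.org/page").then((html) => console.log(html.length));
```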
        
           | crtasm wrote:
           | Userscripts for the win. Consider Violentmonkey over
           | Tampermonkey though.
        
       | ljw1004 wrote:
       | In my web-scraping I've gravitated towards the "cheerio" library
       | for javascript.
       | 
       | I kind of don't want to use DOMParser because it's browser-
       | only... my web-scrapers have to evolve every few years as the
       | underlying web pages change, so I really want CI tests, so it's
       | easiest to have something that works in node.
        
       | welder wrote:
       | Neo already did that in the Matrix:
       | 
       | https://www.youtube.com/watch?v=sjoad6gcRzs
        
       | deisteve wrote:
       | Is there anything that runs on WASM for scraping? The issue is
       | that you need to enable flags and turn off other security
       | features to scrape in your web browser, and this is why it's not
       | popular, but with WASM that might change.
        
         | 8chanAnon wrote:
         | WASM runs in a sandbox. It can only talk to the outside world
         | via Javascript so you can forget the idea that it might be a
         | way to crack through browser security.
         | 
         | Maybe somebody will make a web browser with all of the security
         | locks disabled. Sort of like the Russian commander in "The Hunt
         | for Red October" who disabled his missiles' security features
         | in order to more effectively target the American sub but then
         | got blown up by his own missile.
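
A small sketch of the sandbox point made above: a WebAssembly module can only call what the JavaScript host passes in as imports. The module below is hand-assembled for illustration, and the import name `env.notify` is an assumption. Any network access would have to be one of those imported functions, so it would still go through the browser's normal `fetch` rules.

```javascript
// Binary form of this WebAssembly text module:
//   (module
//     (import "env" "notify" (func (param i32)))
//     (func (export "run") i32.const 42 call 0))
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version 1
  0x01, 0x08, 0x02, 0x60, 0x01, 0x7f, 0x00, 0x60, 0x00, 0x00, // types: (i32)->(), ()->()
  0x02, 0x0e, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import module "env"
  0x06, 0x6e, 0x6f, 0x74, 0x69, 0x66, 0x79, 0x00, 0x00,       // field "notify", func type 0
  0x03, 0x02, 0x01, 0x01,                                     // one local function, type 1
  0x07, 0x07, 0x01, 0x03, 0x72, 0x75, 0x6e, 0x00, 0x01,       // export "run" = function 1
  0x0a, 0x08, 0x01, 0x06, 0x00, 0x41, 0x2a, 0x10, 0x00, 0x0b, // body: i32.const 42; call 0
]);

// The host decides what the module may call; here it only records the value.
const received = [];
const imports = { env: { notify: (value) => received.push(value) } };

const wasmModule = new WebAssembly.Module(wasmBytes);
const instance = new WebAssembly.Instance(wasmModule, imports);
instance.exports.run(); // the module reaches the outside only via env.notify
```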
        
       | spullara wrote:
       | I love it when something like this reminds me of a project from
       | forever ago...
       | 
       | https://github.com/spullara/browsercrawler
        
       ___________________________________________________________________
       (page generated 2024-10-01 23:00 UTC)