[HN Gopher] Show HN: Web Scraping with Your Web Browser: Why Not?
___________________________________________________________________
Show HN: Web Scraping with Your Web Browser: Why Not?
Includes working code. First article in a planned series.
Author : 8chanAnon
Score : 48 points
Date : 2024-10-01 17:55 UTC (5 hours ago)
(HTM) web link (8chananon.github.io)
(TXT) w3m dump (8chananon.github.io)
| dewey wrote:
| I've read through that (hard to read, because of the bad
| formatting) but I still don't understand why you would do that
| instead of Playwright, Puppeteer etc. - The only reason seems to
| be "This technique certainly has its limits.".
| bdcravens wrote:
| Solutions that want to automate in the context of their
| customers' browser. For example, ListPerfectly, a solution for
| cross-listing to eBay, Poshmark, etc, does this in their
| browser extension.
| dewey wrote:
| This is just regular browser extension behavior though, a bit
| different than the use case talked about in this post.
| 8chanAnon wrote:
| >bad formatting
|
| If you can elaborate, I would very much appreciate it. I'm
| always interested in doing better.
|
| Why use Puppeteer etc. when you don't have to? What is the
| argument for using these additional tools versus not using
| them?
| gabrielsroka wrote:
| The red text on a yellow background is not great. Neither is
| a serif font. Also it should be JavaScript with a capital S.
| 8chanAnon wrote:
| Red text on yellow - you mean the website? Would you like
| the text to be darker?
|
| And Beautiful Soup should be BeautifulSoup. Who makes the
| rules?
| gabrielsroka wrote:
| Yes, the website. Better contrast is good. Black on white
| works great.
|
| Margins would also be nice on the left and right.
|
| Beautiful Soup is two words. Just look at their website.
| bityard wrote:
| Too much contrast makes it hard to read too. (With bonus
| eye strain.)
| 8chanAnon wrote:
| Is there a happy in-between? Maybe not. What looks
| perfect to one user might appear atrocious to another.
| What is a poor website operator to do???
|
| I dislike black-on-white and don't understand gray-on-
| black which seems to be popular now due to gamma settings
| being cranked up to 11 or something. I try to use some
| color as an in-between but that may take some time to
| "perfect".
| dewey wrote:
| You don't have max-width set on the text, so unless you have
| your browser window resized to a very small size the paragraphs
| will span your whole screen.
| 8chanAnon wrote:
| I see. I don't have a wide-screen monitor (still using an
| old tube type until it finally expires but it's taking a
| few decades, lol). I've wondered whether people actually
| like reading websites on wide-screen. Some do and some
| don't. What would you suggest for a max width?
|
| You could also try zooming in. My apps don't expand to full
| width because of the video box but you can zoom.
| dewey wrote:
| There's a lot of info about this but usually 500-700px or
| ~80 characters will be much easier to read:
| https://ux.stackexchange.com/questions/108801/what-is-
| the-be...
| pwg wrote:
| Which places you, the reader, fully in control of the width
| of lines you prefer to see. Adjust your browser window
| width, or apply a user style sheet to tell your browser to
| format the text the way you want to see it formatted.
| dewey wrote:
| I knew this reply would come... How many people do you
| think use custom browser stylesheets? It's probably
| smaller than 0.1% of the internet population and everyone
| moved on to formatting text so everyone can enjoy good
| readability. Also not all devices have the luxury of
| supporting custom stylesheets.
|
| Of course it's always up to the site owner, but most
| people want people to read what they share.
| bityard wrote:
| I have the opposite problem. I typically have many browser
| windows open at the same time but only two screens, and
| many sites that I use are designed to assume that everyone
| has full-screen browsers.
| chaosharmonic wrote:
| > You can find plenty of tutorials on the Internet about the art
| of web scraping... and the first things you will learn about are
| Python and Beautiful Soup. There is no tutorial on web scraping
| with Javascript in a web browser...
|
| Um... [0]
|
| [0] https://bhmt.dev/blog/scraping
| 8chanAnon wrote:
| Rather long so I'll read it later. Thanks for the tip. Got more
| or is that it?
| chaosharmonic wrote:
| Not that I _know_ of, I'm just quipping a bit about my own
| work lol
| simlan wrote:
| I also did something similar for my spring project. The idea was
| to buy a used car and I was frustrated with the BS the listing
| sites claimed as fair price etc..
|
| I went the browser extension route and used Greasemonkey to
| inject custom JavaScript. I patched window.fetch, and because
| it was a React page it did most of the work for me, providing me
| with a slightly convoluted JSON doc every time I scrolled. Getting
| the data extracted was only a question of getting a Flask API
| with correct CORS settings running.
|
| Thanks for posting. Using a local proxy for even more control
| could be helpful in the future.
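
A minimal sketch of the receiving side described above: a local endpoint that accepts JSON posted from a page script and answers with CORS headers so the browser allows the cross-origin request. It is not simlan's code. It swaps Flask for Python's standard-library http.server so it has no dependencies, and the origin `https://listings.example`, the port, and the names `cors_headers`, `CollectorHandler`, and `make_server` are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical origin of the page the userscript runs on.
ALLOWED_ORIGINS = {"https://listings.example"}

# Scraped JSON documents are collected in memory for this sketch.
RECEIVED = []


def cors_headers(origin):
    """Return the CORS response headers for a request from `origin`.

    Unknown or missing origins get no CORS headers, so the browser
    will refuse to expose the response to the calling script.
    """
    if origin not in ALLOWED_ORIGINS:
        return {}
    return {
        "Access-Control-Allow-Origin": origin,
        "Access-Control-Allow-Methods": "POST, OPTIONS",
        "Access-Control-Allow-Headers": "Content-Type",
        "Vary": "Origin",
    }


class CollectorHandler(BaseHTTPRequestHandler):
    def _send(self, status, body=b""):
        self.send_response(status)
        for name, value in cors_headers(self.headers.get("Origin")).items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_OPTIONS(self):
        # Preflight request sent by the browser before a JSON POST.
        self._send(204)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            RECEIVED.append(json.loads(self.rfile.read(length)))
        except ValueError:
            # Covers both malformed JSON and undecodable bytes.
            self._send(400, b"invalid JSON")
            return
        self._send(200, b"ok")


def make_server(port=8000):
    """Build a server bound to localhost only; the caller starts it."""
    return HTTPServer(("127.0.0.1", port), CollectorHandler)
```

Running `make_server().serve_forever()` starts the collector. The browser side (the patched window.fetch that forwards each JSON response) is not shown here.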
| gabrielsroka wrote:
| Why do you need a proxy or to worry about CORS? Why not just
| point your browser to rumble.com and start from there?
|
| I've posted here about scraping for example HN with JavaScript.
| It's certainly not a new idea.
|
| 2020: https://news.ycombinator.com/item?id=22788236
| CharlieDigital wrote:
| > Why do you need a proxy or to worry about CORS?
|
| Not sure about OP, but you might want to point to a proxy
| depending on the site/content you are scraping and your
| location. For example, if you are in Canada but you want to
| scrape in USD, you might need to use a proxy located in the US
| to get US prices.
|
| > Why not just point your browser to rumble.com and start from
| there?
|
| Some endpoints use simple web application firewall rules that
| will block IPs. In this case, a rotating proxy can help evade
| the blocks (and prevent your legitimate traffic from being
| blocked). Some domains use more sophisticated WAFs like Imperva
| and will do browser fingerprinting so you'll need even more
| advanced techniques to scrape successfully.
|
| Source: work at a startup that does a lot of scraping and these
| are issues we've run into. Our entire office network is blocked
| from some sites due to some early testing without a proxy.
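
A rough sketch of the rotating-proxy idea mentioned above, using only Python's standard library. The proxy URLs are placeholders, `ProxyRotator` and `fetch` are hypothetical names, and this is not how CharlieDigital's startup does it. It only shows the basic mechanism of sending each request through the next proxy in a list, and does nothing about browser fingerprinting.

```python
import itertools
import urllib.request

# Placeholder proxy endpoints; substitute proxies you are allowed to use.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


class ProxyRotator:
    """Hands out one opener per request, cycling through the proxy list."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_opener(self):
        proxy = next(self._cycle)
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


def fetch(rotator, url, timeout=10.0):
    """Fetch `url` through the next proxy in the rotation."""
    _proxy, opener = rotator.next_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

A caller would create one `ProxyRotator(PROXIES)` and pass it to `fetch` for every request, so consecutive requests leave from different addresses.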
| datadrivenangel wrote:
| I've been playing around with this idea lately as well! There are
| a lot of web interfaces that are hostile to scraping, and I see
| no reason why we shouldn't be able to use the data we have access
| to for our own purposes. CUSTOMIZE YOUR INTERFACES
| joshdavham wrote:
| > can you write a web scraper in your browser? The answer is:
| YES, you can! So why is nobody doing it?
|
| Completely agree with this sentiment.
|
| I just spent the last couple of months developing a Chrome
| extension, but recently also did an unrelated web scraping
| project where I looked into all the common tools like Beautiful
| Soup, Selenium, Playwright, Puppeteer, etc, etc.
|
| All of these tools were needlessly complicated and I was having a
| ton of trouble with sites that required authentication. I then
| realized it would be way easier to write some JavaScript and
| paste it in my browser to do the scraping. Worked like a charm!
| metadat wrote:
| You might like Tampermonkey. You can add a button to kick it
| off or whatever your heart desires.
|
| Tampermonkey also works around CORS issues with relative ease.
| crtasm wrote:
| Userscripts for the win. Consider Violentmonkey over
| Tampermonkey though.
| ljw1004 wrote:
| In my web-scraping I've gravitated towards the "cheerio" library
| for javascript.
|
| I kind of don't want to use DOMParser because it's browser-
| only... my web-scrapers have to evolve every few years as the
| underlying web pages change, so I really want CI tests, so it's
| easiest to have something that works in node.
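
The testing argument above can be illustrated with a small sketch: keep the parsing in a pure function that takes an HTML string, so a CI job can run it against a saved copy of the page with no browser involved. This is not ljw1004's setup. It swaps cheerio and Node for Python's standard-library html.parser, and `LinkExtractor`, `extract_links`, and the fixture markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only gather text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Pure function: HTML string in, list of (href, text) out."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# A saved snippet standing in for a page captured from the target site.
FIXTURE = """
<ul>
  <li><a href="/item/1">First item</a></li>
  <li><a href="/item/2">Second <b>item</b></a></li>
  <li><a name="no-href">Anchor without href</a></li>
</ul>
"""


def test_extract_links():
    assert extract_links(FIXTURE) == [
        ("/item/1", "First item"),
        ("/item/2", "Second item"),
    ]
```

When the site changes its markup, the fixture is refreshed and the failing test points at the selector logic that needs updating.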
| welder wrote:
| Neo already did that in the Matrix:
|
| https://www.youtube.com/watch?v=sjoad6gcRzs
| deisteve wrote:
| Is there anything that runs on WASM for scraping? The issue is
| that you need to enable flags and turn off other security
| features to scrape in your web browser, and this is why it's not
| popular, but with WASM that might change.
| 8chanAnon wrote:
| WASM runs in a sandbox. It can only talk to the outside world
| via Javascript so you can forget the idea that it might be a
| way to crack through browser security.
|
| Maybe somebody will make a web browser with all of the security
| locks disabled. Sort of like the Russian commander in "The Hunt
| for Red October" who disabled his missiles' security features
| in order to more effectively target the American sub but then
| got blown up by his own missile.
| spullara wrote:
| I love it when something like this reminds me of a project from
| forever ago...
|
| https://github.com/spullara/browsercrawler
___________________________________________________________________
(page generated 2024-10-01 23:00 UTC)