[HN Gopher] Web scraping via JavaScript runtime heap snapshots
___________________________________________________________________
Web scraping via JavaScript runtime heap snapshots
Author : adriancooney
Score : 196 points
Date : 2022-04-29 13:56 UTC (9 hours ago)
(HTM) web link (www.adriancooney.ie)
(TXT) w3m dump (www.adriancooney.ie)
| superasn wrote:
 | Awesome. I wonder if there's a way to create a Chrome extension
 | that works like 'Vue devtools' and shows the heap and its
 | changes in real time, maybe even allowing editing. That would
 | be amazing for learning / debugging.
|
| > We use the --no-headless argument to boot a windowed Chrome
| instance (i.e. not headless) because Google can detect and thwart
| headless Chrome - but that's a story for another time.
|
| Use `puppeteer-extra-plugin-stealth`(1) for such sites. It
| defeats a lot of bot identification including recaptcha v3.
|
| (1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
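 |
 | A minimal sketch of wiring it up (standard puppeteer-extra
 | usage; the target URL is just a placeholder):
 |
 |     // npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
 |     const puppeteer = require("puppeteer-extra");
 |     const StealthPlugin = require("puppeteer-extra-plugin-stealth");
 |
 |     // Register the stealth plugin before launching; it patches the
 |     // usual headless fingerprints (navigator.webdriver, etc.).
 |     puppeteer.use(StealthPlugin());
 |
 |     (async () => {
 |       const browser = await puppeteer.launch({ headless: true });
 |       const page = await browser.newPage();
 |       await page.goto("https://example.com");
 |       await browser.close();
 |     })();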
| acemarke wrote:
| Not _quite_ what you're describing, but Replay [0], the company
| I work for, _is_ building a true "time-traveling debugger" for
| JS. It works by recording the OS-level interactions with the
| browser process, then re-running those in the cloud. From the
| user's perspective in our debugging client UI, they can jump to
| any point in a timeline and do typical step debugging. However,
| you can also see how many times any line of code ran, and also
| add print statements to any line that will print out the
| results from _every time that line got executed_.
|
| So, no heap analysis per se, but you can definitely inspect the
| variables and stack from anywhere in the recording.
|
| Right now our debugging client is just scratching the surface
| of the info we have available from our backend. We recently put
| together a couple small examples that use the Replay backend
| API to extract data from recordings and do other analysis, like
| generating code coverage reports and introspecting React's
| internals to determine whether a given component was mounting
| or re-rendering.
|
| Given that capability, we hope to add the ability to do "React
| component stack" debugging in the not-too-distant future, such
| as a button that would let you "Step Back to Parent Component".
| We're also working on adding Redux DevTools integration now
| (like, I filed an initial PR for this today! [2]), and hope to
| add integration with other frameworks down the road.
|
| [0] https://replay.io
|
| [1] https://github.com/RecordReplay/replay-protocol-examples
|
| [2] https://github.com/RecordReplay/devtools/pull/6601
| invalidname wrote:
| Scraping is inherently fragile due to all the small changes that
| can happen to the data model as a website evolves. The important
| thing is to fix these things quickly. This article discusses a
| related approach of debugging such failures directly on the
| server: https://talktotheduck.dev/debugging-jsoup-java-code-in-
| produ...
|
 | It's in Java (using JSoup) but the approach will work for Node,
 | Python, Kotlin, etc. The core concept is to discover the cause
 | of a regression instantly on the server and deploy a fix fast.
 | There are also user-specific regressions in scraping that are,
 | again, very hard to debug.
| dymk wrote:
| Would this method work if the website obfuscated its HTML as per
| the usual techniques, but also rendered everything server side?
| adriancooney wrote:
 | If it's rendered server-side - no. The data likely won't be
 | loaded into the JS heap (the DOM isn't included in the heap
 | snapshots) when you visit the page. You might be in luck,
 | however, if the website executes JavaScript to augment the
 | server-rendered page. If it does, your data may be loaded into
 | memory in a form you can extract.
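 |
 | For cases where the data does get loaded into the heap, the
 | flow looks roughly like this (a sketch based on my reading of
 | the puppeteer-heap-snapshot README; the property names are just
 | an example):
 |
 |     import puppeteer from "puppeteer";
 |     import {
 |       captureHeapSnapshot,
 |       findObjectsWithProperties,
 |     } from "puppeteer-heap-snapshot";
 |
 |     const browser = await puppeteer.launch({ headless: false });
 |     const page = await browser.newPage();
 |     await page.goto("https://www.youtube.com/watch?v=L_o_O7v1ews", {
 |       waitUntil: "networkidle2",
 |     });
 |
 |     // Capture the V8 heap and pull out any objects carrying all of
 |     // the listed properties, wherever the page's JS put them.
 |     const snapshot = await captureHeapSnapshot(page.target());
 |     console.log(
 |       findObjectsWithProperties(snapshot, ["channelId", "viewCount"])
 |     );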
| marwis wrote:
 | This sadly doesn't help if the JS code is minified/obfuscated
 | and data is exchanged using a binary or binary-like protocol
 | such as gRPC. Unfortunately this is increasingly common.
 |
 | The only long-term way is to parse the visible text.
| mdaniel wrote:
| I've never seen grpc from a browser on a consumer-facing site;
| do you have an example I could see?
|
| That said, for this approach something like grpc would be a
| benefit since AIUI grpc is designed to be versioned so one
| could identify structural changes in the payload fairly quickly
| versus the json-y way of "I dunno, are there suddenly new
| fields?"
| marwis wrote:
 | I'm not aware of any actual gRPC websites, but given that
 | grpc-web has 6.5k stars on GitHub, something must be out there.
 |
 | Google's websites frequently use binary-like formats where the
 | JSON is just an array of values with no property names, and
 | most of those values are numbers. See for example Gmail.
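 |
 | To illustrate (hypothetical payloads), here is the same record
 | as keyed JSON versus the positional-array style:
 |
 |     // Keyed JSON: property names survive, so a heap query like
 |     // "objects with [subject, timestamp]" has something to match.
 |     const keyed = { subject: "Hello", unread: true, timestamp: 1651240000 };
 |
 |     // Positional style: only numeric indices, nothing to query by name.
 |     const positional = ["Hello", 1, 1651240000];
 |
 |     console.log(Object.keys(keyed));      // ["subject", "unread", "timestamp"]
 |     console.log(Object.keys(positional)); // ["0", "1", "2"]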
| toraway1234 wrote:
| swsieber wrote:
| Yeah, found when it happened -
| https://news.ycombinator.com/item?id=30275804
| d4a wrote:
| You're banned, which means everything you post will be marked
| as "Dead". Only those with "showdead" enabled in their profile
| will be able to see your comments and posts. Others can "vouch"
| for your post to make it not "dead" so it can be replied to
| (which is what I have done)
|
| As for why you were banned:
| https://news.ycombinator.com/item?id=30275804
| BbzzbB wrote:
| Odd, I see his comment and I'm playing HN with undeads off
| ("showdead" is "no").
| d4a wrote:
| I had to vouch for the comment (make it not dead) to reply
| to it
| BbzzbB wrote:
| Oh I see, you're playing medic.
|
| Thanks, I always wondered what "showdead" meant (tho not
| enough to Google it I guess).
| d4a wrote:
| https://github.com/minimaxir/hacker-news-undocumented
| [deleted]
| kvathupo wrote:
| The article brings up two interesting points for web
| preservation:
|
| 1. The reliance on externally hosted APIs
|
| 2. Source code obfuscation
|
| For 1, in order to fully preserve a webpage, you'd have to go
| down the rabbit hole of externally hosted APIs, and preserve
| those as well. For example, sometimes a webpage won't render
 | LaTeX notation because a MathJax endpoint can't be reached. Were
 | we to save this webpage, we would need a copy of the MathJax JS
| too.
|
 | For 2, I think WASM makes things more interesting. With
 | WebAssembly, I'd imagine it's much easier to obfuscate source
 | code: a preservationist would need a WASM decompiler for
 | whatever source language was used.
| flockonus wrote:
 | Awesome experimentation! I'd be curious how you navigate the
 | heap dump on some real-world websites.
| mwcampbell wrote:
| > Developers no longer need to label their data with class-names
| or ids - it's only a courtesy to screen readers now.
|
| In general, screen readers don't use class names or IDs. In
| principle they can, to enable site-specific workarounds for
| accessibility problems. But of course, that's as fragile as
| scraping. Perhaps you were thinking of semantic HTML tag names
| and ARIA roles.
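 |
 | For example (illustrative selectors, not from any particular
 | site):
 |
 |     // Selecting by semantics rather than by generated class names:
 |     document.querySelector("article h1");          // semantic tag names
 |     document.querySelector('[role="navigation"]'); // ARIA role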
| ComputerGuru wrote:
 | Anything relying on id/class names has been broken since the
 | advent of machine-generated names, which come part and parcel
 | with the most popular SPA frameworks. They're all gobbledygook
 | now, which makes writing custom ad-block cosmetic filters a
 | real PITA.
| jchw wrote:
| React doesn't do that. You may still find gibberish on
| hostile sites like Twitter which intentionally obfuscate
| class names, using something like React Armor.
| mdaniel wrote:
| That's an exceedingly clever idea, thanks for sharing it!
|
| Please consider adding an actual license text file to your repo,
| since (a) I don't think GitHub's licensee looks inside
| package.json (b) I bet _most_ of the "license" properties of
| package.json files are "yeah, yeah, whatever" versus an
| intentional choice: https://github.com/adriancooney/puppeteer-
| heap-snapshot/blob... I'm not saying that applies to you, but an
| explicit license file in the repo would make your wishes clearer
| adriancooney wrote:
| Ah thank you for the reminder. Added it now!
| marmada wrote:
 | Wow, this is brilliant. I've sometimes tried to reverse-engineer
| APIs in the past, but this is definitely the next level.
|
| I used to think ML models could be good for scraping too, but
| this seems better.
|
| I think this + a network request interception tool (to get data
| that is embedded into HTML) could be the future.
| lemax wrote:
 | I've used a similar technique on some web pages that come back
 | from the server with an intact Redux state object just sitting
 | in a <script> tag. Instead of parsing the HTML, I just pull out
 | the state object. Super simple.
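 |
 | Something like this (a sketch; window.__PRELOADED_STATE__ is a
 | common Redux convention, but the variable name and page URL are
 | assumptions):
 |
 |     // Fetch the raw HTML and lift the embedded state object out of
 |     // the <script> tag, skipping HTML parsing entirely.
 |     const res = await fetch("https://example.com/some-page");
 |     const html = await res.text();
 |     const match = html.match(/window\.__PRELOADED_STATE__\s*=\s*(\{.*?\});/s);
 |     if (match) {
 |       // Works when the state is serialized as valid JSON (as the
 |       // Redux docs recommend); otherwise you'd need a JS parser.
 |       const state = JSON.parse(match[1]);
 |       console.log(state);
 |     }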
| 1vuio0pswjnm7 wrote:
| "We can see the response is JSON! A clean, well-formed data
| structure extracted magically from the webpage."
|
 | Honest question: why are heap snapshots required?
 |
 | Why not just request the page and then extract the JSON?
 |
 |     tnftp -4o 1.htm https://www.youtube.com/watch?v=L_o_O7v1ews
 |     yy059 < 1.htm > 1.json
 |     less 1.json
|
| yy059 is a quick and dirty, small, simple, portable C program to
| extract and reformat JSON to make reading JSON easier for me.
|
| For me, it works. There is no "magic".
|
| "Tech" companies, the ones manipulating access to public
| information to sell online advertising services, can change their
| free browsers at any time. In the future, they could cripple or
| remove the heap snapshot feature. Utilising the feature to
| extract data is interesting but how is this technique more
| "future-proof" than requesting a web page using any client, not
| necessarily a "tech" company's web browser, and looking at what
| it contains.
| quickthrower2 wrote:
 | Because Cloudflare, reCAPTCHA, etc. mean this is not possible
 | in general. You need to quack like a normal user for it to
 | work. If a site is really against scraping, they could probably
 | make it completely uneconomical by tracking user footprints and
 | detecting unexpected patterns of usage.
| datalopers wrote:
 | They detect and block headless browsers just as easily.
| Jiger104 wrote:
| Really cool approach, great work
| elbajo wrote:
| Love this approach, thanks for sharing!
|
 | I am trying this on a website that Puppeteer has trouble
 | loading, so I took a heap snapshot directly in Chrome. I was
 | trying to search for relevant objects directly in the Chrome
 | heap viewer, but I don't think the search looks inside objects.
|
| I think your tool would work: "puppeteer-heap-snapshot query -f
| /tmp/file.heapsnapshot -p property1" or really any JSON parser
| but it requires extra steps. Would you say this is the easiest
| way to view/debug a heap snapshot?
| BbzzbB wrote:
| This is great, thanks a lot.
|
| It's my understanding that Playwright is the "new Puppeteer"
| (even with core devs migrating). I presume this sort of technique
| would be feasible on Playwright too? Do you think there's any
| advantage or disadvantage of using one over the other for this
| use case, or it's basically the same (or I'm off base and they're
| not so interchangeable)?
|
| I'm basing my personal "scraping toolbox" off Scrapy which I
| think has decent Playwright integration, hence the question if I
| try to reproduce this strategy in Playwright.
| mdaniel wrote:
| My understanding of Playwright is that it's trying to be the
| new Selenium, in that it's a programming language orchestrating
| the WebDriver protocol
|
| That means that if you are running against Chromium, this will
| likely work, but unless Firefox has a similar heapdump
| function, it is unlikely to work[1]. And almost certainly not
| Safari, based on my experience. All of that is also qualified
| by whether Playwright exposes that behavior, or otherwise
| allows one to "get into the weeds" to invoke the function under
| the hood
|
| 1 = as an update, I checked and Firefox does have a memory
| snapshot feature, but the file it saved is some kind of binary
| encoded thing without any obvious strings in it
|
| I didn't see any such thing in Safari
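 |
 | For what it's worth, Playwright does expose a raw CDP session
 | on Chromium, so something like this should work (a rough
 | sketch, untested; Chromium-only):
 |
 |     const { chromium } = require("playwright");
 |
 |     (async () => {
 |       const browser = await chromium.launch();
 |       const page = await browser.newPage();
 |       await page.goto("https://example.com");
 |
 |       // Chromium-only escape hatch: a raw Chrome DevTools Protocol session
 |       const cdp = await page.context().newCDPSession(page);
 |       const chunks = [];
 |       cdp.on("HeapProfiler.addHeapSnapshotChunk", ({ chunk }) => chunks.push(chunk));
 |       await cdp.send("HeapProfiler.enable");
 |       await cdp.send("HeapProfiler.takeHeapSnapshot", { reportProgress: false });
 |
 |       // The joined chunks are the same JSON a .heapsnapshot file contains
 |       const snapshot = chunks.join("");
 |       console.log(snapshot.length);
 |       await browser.close();
 |     })();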
| asabla wrote:
 | Well, kind of, for Firefox: there is this (semi-built-in)
 | profiling tool you could use:
 |
 | https://github.com/firefox-devtools/profiler. It lets you save
 | a report in json.gz format.
| anyfactor wrote:
| Very interesting. Can't wait to give it a shot.
|
 | I personally use a combination of XPath, basic math, and regex,
 | so this class/id "security" measure isn't a major deterrent. A
 | couple of times I did find it a hassle to scrape data embedded
 | in iframes, and I can see that heap snapshots treat iframes
 | differently.
 |
 | Also, if a website takes extra steps to block web scrapers,
 | identifying elements is never the main problem. It is always IP
 | bans and other security measures.
 |
 | After all that, I do look forward to using something like this
 | and making a switch to a Node.js-based solution soon. But if
 | you are doing web scraping at scale, reverse engineering should
 | always be your first choice. Not only does it give you a faster
 | solution, it is more ethical (IMO) because you minimize your
 | impact on the site's resources. Rendering full website
 | resources is always my last choice.
| rvnx wrote:
 | Nice, this won't work anymore then.
| benbristow wrote:
 | Exactly my thoughts. The author is using it 'in production';
 | announcing that on a forum where Facebook/Meta employees (and
 | other Silicon Valley folk) are definitely watching is a rookie
 | mistake.
| chrismeller wrote:
| A neat idea for sure, I just wanted to point out that this is why
| I prefer XPath over CSS selectors.
|
 | We all know the display of a page and the structure of a page
 | should be kept separate, so why would you base your selectors
 | on display? Particularly on a semantically designed page: why
 | would I look for .article, a class that may disappear with the
 | next redesign, when they're unlikely to stop using the
 | <article> HTML tag?
| goldenkey wrote:
| CSS selectors don't have to select purely by classes. They can
| be something like:
|
| div > div > * > *:nth-child(7)
|
 | XPath doesn't have any additional abilities; it's just verbose
 | and difficult to write. It's a lemon.
| tommica wrote:
 | I might be wrong, but XPath has contains(), which lets you
 | match on the text content inside an element, something I don't
 | think CSS can do.
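 |
 | For example (a sketch using the browser's built-in XPath
 | evaluator; the link text is made up):
 |
 |     // XPath can match on text content, which CSS selectors can't do.
 |     const result = document.evaluate(
 |       '//a[contains(text(), "Next page")]',
 |       document,
 |       null,
 |       XPathResult.FIRST_ORDERED_NODE_TYPE,
 |       null
 |     );
 |     const link = result.singleNodeValue; // first matching <a>, or null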
| mdaniel wrote:
 | Yeah, for sure XPath is the more powerful of the two, so
 | much so that Scrapy's parsel library parses CSS selectors
 | and transforms them into the equivalent XPath for execution.
 |
 | To the best of my knowledge, CSS selectors care only about
 | the _structure_ of the page, lightly dipping their toes into
 | the content only for attributes and things like ::first-letter
 | and ::first-line
| chrismeller wrote:
 | Well, that is 100% originally an XPath selector (:nth-child),
 | so kudos if CSS selectors support it now.
 |
 | Still, using // instead of multiple *'s (and the two divs)
 | seems better for longer-term scraping.
___________________________________________________________________
(page generated 2022-04-29 23:00 UTC)