[HN Gopher] Pup: Parsing HTML at the command line
___________________________________________________________________
Pup: Parsing HTML at the command line
Author : tosh
Score : 67 points
Date : 2022-11-30 18:55 UTC (4 hours ago)
(HTM) web link (github.com)
| bijoo wrote:
| It looks like the project became inactive for a bit, and there are
| alternatives such as htmlq.
| https://github.com/ericchiang/pup/issues/150
| mananaysiempre wrote:
| From the looks of it, htmlq doesn't have anything comparable to
| pup's JSON output. That JSON is cumbersome to work with, but
| combined with jq it allows one to extend the shell hackery just
| a little bit beyond what CSS can do.
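|
| A rough sketch of that idea, in Python rather than pup + jq: turn the
| HTML into a JSON-like tree, then filter it with ordinary code. The
| node schema (tag/attrs/text/children) and the helper names are
| made up for illustration and are not pup's actual JSON output.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Build a nested dict tree from HTML (illustrative schema, not pup's)."""

    # Elements that never have a closing tag, so they are never pushed.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "text": "", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "text": "", "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1]["text"] += data.strip()


def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node["children"]:
        yield from walk(child)


def links_with_long_text(html, min_len):
    """Return hrefs of <a> elements whose own text has at least min_len chars.

    A length comparison like this is something a plain CSS selector
    cannot express, which is where post-processing the tree helps.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return [
        n["attrs"].get("href")
        for n in walk(builder.root)
        if n["tag"] == "a" and len(n["text"]) >= min_len
    ]
```

| Calling links_with_long_text(page, 5) keeps only links with at least
| five characters of link text; any other predicate could be swapped in.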
| dang wrote:
| Related:
|
| _Pup - Like Jq, but for HTML_ -
| https://news.ycombinator.com/item?id=24797697 - Oct 2020 (2
| comments)
|
| _Show HN: Pup - A command-line HTML parser_ -
| https://news.ycombinator.com/item?id=8312249 - Sept 2014 (27
| comments)
|
| Random bit of history: that Show HN was a very early choice for
| what is now called the second-chance pool:
|
| _Ask HN: Why did three HN stories jump 100 ranking points in 5
| mins?_ - https://news.ycombinator.com/item?id=8313505 - Sept 2014
| (6 comments)
| John23832 wrote:
| While this is nice, it's three years old.
|
| Direct installation of brew scripts isn't supported anymore. `go
| get` installs aren't either.
|
| It needs an update.
| JodieBenitez wrote:
| It may need an update, but not for installation:
|
| go install github.com/ericchiang/pup@latest
| thangalin wrote:
| https://www.w3.org/Tools/HTML-XML-utils/
| bsnnkv wrote:
| I use this extensively in bash scripts where I need to scrape
| HTML reliably and consistently. Cannot recommend it highly
| enough.
|
| In fact, all of the source data for a project of mine, Baytyab[1]
| (couplet-finder) was scraped using bash + pup.
|
| [1]: https://baytyab.com
| pipeline_peak wrote:
| Would you recommend it over BeautifulSoup?
| turtlebits wrote:
| Will check it out, but would have preferred XPath selectors
| instead of CSS.
| undume wrote:
| xmllint can do that:
|
| curl example.org | xmllint --html --xpath '//some/xpath/selector' -
| natrys wrote:
| Also: https://github.com/benibela/xidel
| curben wrote:
| I'm using xmlstarlet in Alpine as a bare-minimum way to scrape a
| webpage in a CI pipeline.
| BossHogg wrote:
| There's also https://github.com/charmparticle/xpe
| wmichelin wrote:
| Why? CSS selectors are the normal web developer way to select
| content from a document. Even JavaScript adopted the approach.
| tuukkah wrote:
| XPath supports more complex queries. In JavaScript, XPath is
| available as document.evaluate
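|
| One example of a query that standard CSS selectors cannot express is
| filtering on text content. A minimal sketch using Python's
| xml.etree.ElementTree, which implements only a subset of XPath and
| expects well-formed XML rather than arbitrary HTML. The sample
| markup and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
DOC = """
<html><body>
  <div class="section"><h2>Downloads</h2><a href="/pup.zip">pup</a></div>
  <div class="section"><h2>Docs</h2><a href="/readme">readme</a></div>
</body></html>
"""


def hrefs_under_heading(markup, heading):
    """Return hrefs of links inside the div whose h2 text equals heading."""
    root = ET.fromstring(markup)
    # [h2='...'] keeps only divs that have an h2 child with exactly this text.
    # Note: a heading containing a single quote would break this simple path.
    path = f".//div[h2='{heading}']/a"
    return [a.get("href") for a in root.findall(path)]
```

| Here hrefs_under_heading(DOC, "Downloads") selects the link by the
| text of a sibling heading instead of by class or position.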
| ducktective wrote:
| I'm looking for something like this, but for scraping SPAs and
| JS-rich web content: a _single static binary_ like this, not
| ChromeDriver, Selenium, etc.
___________________________________________________________________
(page generated 2022-11-30 23:00 UTC)