hngopher.com

       [HN Gopher] Pup: Parsing HTML at the command line
       ___________________________________________________________________
        
       Pup: Parsing HTML at the command line
        
       Author : tosh
       Score  : 67 points
       Date   : 2022-11-30 18:55 UTC (4 hours ago)
        
        
       | bijoo wrote:
       | It looks like the project became inactive for a while, and there
       | are alternatives such as htmlq.
       | https://github.com/ericchiang/pup/issues/150
        
         | mananaysiempre wrote:
         | From the looks of it, htmlq doesn't have anything comparable to
         | pup's JSON output. That JSON is cumbersome to work with, but
         | combined with jq it allows one to extend the shell hackery just
         | a little bit beyond what CSS can do.
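
A minimal sketch of the kind of post-processing described above, written in Python rather than jq. It assumes that pup's `json{}` output is a list of objects carrying `tag`, `text`, attribute keys such as `href` and `class`, and nested `children`. The sample data and the helper name `links_with_min_score` are hypothetical. The filter compares a number, which a selector alone cannot do.

```python
import json

# Hypothetical sample shaped like the output of `pup 'tr json{}'`.
# The shape (tag/text/attribute keys/children) is an assumption, and the
# data is made up for illustration.
SAMPLE = """
[
  {"tag": "tr", "children": [
    {"tag": "a", "href": "https://example.org/a", "text": "First story"},
    {"tag": "span", "class": "score", "text": "67 points"}
  ]},
  {"tag": "tr", "children": [
    {"tag": "a", "href": "https://example.org/b", "text": "Second story"},
    {"tag": "span", "class": "score", "text": "12 points"}
  ]}
]
"""


def links_with_min_score(rows, minimum):
    """Return hrefs from rows whose score span is at least `minimum`."""
    hrefs = []
    for row in rows:
        children = row.get("children", [])
        # Read the leading integer from each score span, e.g. "67 points".
        scores = [
            int(child["text"].split()[0])
            for child in children
            if child.get("class") == "score" and child.get("text", "").split()
        ]
        if scores and max(scores) >= minimum:
            hrefs.extend(
                child["href"]
                for child in children
                if child.get("tag") == "a" and "href" in child
            )
    return hrefs


rows = json.loads(SAMPLE)
print(links_with_min_score(rows, 50))
```

In a real pipeline the JSON would presumably come from standard input (`json.load(sys.stdin)`) instead of the embedded sample.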
        
       | dang wrote:
       | Related:
       | 
       |  _Pup - Like Jq, but for HTML_ -
       | https://news.ycombinator.com/item?id=24797697 - Oct 2020 (2
       | comments)
       | 
       |  _Show HN: Pup - A command-line HTML parser_ -
       | https://news.ycombinator.com/item?id=8312249 - Sept 2014 (27
       | comments)
       | 
       | Random bit of history: that Show HN was a very early choice for
       | what is now called the second-chance pool:
       | 
       |  _Ask HN: Why did three HN stories jump 100 ranking points in 5
       | mins?_ - https://news.ycombinator.com/item?id=8313505 - Sept 2014
       | (6 comments)
        
       | John23832 wrote:
       | While this is nice, it's three years old.
       | 
       | Direct installation of brew scripts isn't supported anymore. `go
       | get` installs aren't either.
       | 
       | It needs an update.
        
         | JodieBenitez wrote:
         | It may need an update, but not for installation:
         | 
         | go install github.com/ericchiang/pup@latest
        
       | thangalin wrote:
       | https://www.w3.org/Tools/HTML-XML-utils/
        
       | bsnnkv wrote:
       | I use this extensively in bash scripts where I need to scrape
       | HTML reliably and consistently. Cannot recommend it highly
       | enough.
       | 
       | In fact, all of the source data for a project of mine, Baytyab[1]
       | (couplet-finder), was scraped using bash + pup.
       | 
       | [1]: https://baytyab.com
        
         | pipeline_peak wrote:
         | Would you recommend it over BeautifulSoup?
        
       | turtlebits wrote:
       | Will check it out, but would have preferred XPath selectors
       | instead of CSS.
        
         | undume wrote:
         | xmllint can do that:
         | 
         |     curl example.org | xmllint --html --xpath '//some/xpath/selector' -
        
         | natrys wrote:
         | Also: https://github.com/benibela/xidel
        
         | curben wrote:
         | I'm using xmlstarlet in Alpine as a bare-minimum way to scrape a
         | webpage in a CI pipeline.
        
         | BossHogg wrote:
         | There's also https://github.com/charmparticle/xpe
        
         | wmichelin wrote:
         | Why? CSS selectors are the normal web developer way to select
         | content from a document. Even JavaScript adopted the approach.
        
           | tuukkah wrote:
           | XPath supports more complex queries. In JavaScript, XPath is
           | available as document.evaluate
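
As a rough illustration of the "more complex queries" point, here is a small sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree`. It selects a link based on the text of a sibling heading, a text-matching predicate that standard CSS selectors do not have. The document is a hypothetical, well-formed sample; real-world HTML would usually need an HTML-aware parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (made up for illustration).
DOC = """
<html><body>
  <div class="card"><h2>Pup</h2><a href="/pup">repo</a></div>
  <div class="card"><h2>Xidel</h2><a href="/xidel">repo</a></div>
</body></html>
"""

root = ET.fromstring(DOC)

# Pick the <a> inside whichever <div> has an <h2> child whose text is "Xidel".
links = root.findall(".//div[h2='Xidel']/a")
hrefs = [a.get("href") for a in links]
print(hrefs)
```

The same path expression should also work with full XPath engines such as the one behind `document.evaluate`, since it only uses a child-text predicate and a child step.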
        
       | ducktective wrote:
       | I'm looking for something like this, but for scraping SPAs and
       | JS-rich web content: a _single static binary_ like this, not
       | ChromeDriver, Selenium, etc.
        
       ___________________________________________________________________
       (page generated 2022-11-30 23:00 UTC)