https://github.com/ericchiang/pup
# pup

pup is a command line tool for processing HTML.
It reads from stdin, prints to stdout, and allows the user to filter parts of the page using CSS selectors.

Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

## Install

Direct downloads are available through the releases page.

If you have Go installed on your computer, just run `go get`.

    go get github.com/ericchiang/pup

If you're on OS X, use Homebrew to install (no Go required).

    brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

## Quick start

    $ curl -s https://news.ycombinator.com/

Ew, HTML. Let's run that through some pup selectors:

    $ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a'

Okay, how about only the links?

    $ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a attr{href}'

Even better, let's grab the titles too:

    $ curl -s https://news.ycombinator.com/ | pup 'table table tr:nth-last-of-type(n+2) td.title a json{}'

## Basic Usage

    $ cat index.html | pup [flags] '[selectors] [display function]'

## Examples

Download a webpage with wget.

    $ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

### Clean and indent

By default pup will fill in missing tags and properly indent the page.

    $ cat robots.html
    # nasty looking HTML
    $ cat robots.html | pup --color
    # cleaned, indented, and colorful HTML

### Filter by tag

    $ cat robots.html | pup 'title'
    Robots exclusion standard - Wikipedia, the free encyclopedia

### Filter by id

    $ cat robots.html | pup 'span#See_also'
    See also

### Filter by attribute

    $ cat robots.html | pup 'th[scope="row"]'
    Exclusion standards
    Related marketing topics
    Search marketing related topics
    Search engine spam
    Linking
    People
    Other

### Pseudo Classes

CSS selectors have a group of specifiers called "pseudo classes" which are pretty cool. pup implements a majority of the relevant ones.

Here are some examples.

    $ cat robots.html | pup 'a[rel]:empty'

    $ cat robots.html | pup ':contains("History")'
    History
    History

    $ cat robots.html | pup ':parent-of([action="edit"])'
    Edit links

For a complete list, view the implemented selectors section.

### +, >, and ,

These are intermediate characters that declare special instructions. For instance, a comma `,` lets you give pup multiple groups of selectors.

    $ cat robots.html | pup 'title, h1 span[dir="auto"]'
    Robots exclusion standard - Wikipedia, the free encyclopedia
    Robots exclusion standard

### Chain selectors together

When combining selectors, the HTML nodes selected by the previous selector will be passed to the next ones.

    $ cat robots.html | pup 'h1#firstHeading'
    Robots exclusion standard

    $ cat robots.html | pup 'h1#firstHeading span'
    Robots exclusion standard

## Implemented Selectors

For further examples of these selectors head over to MDN.

    pup '.class'
    pup '#id'
    pup 'element'
    pup 'selector + selector'
    pup 'selector > selector'
    pup '[attribute]'
    pup '[attribute="value"]'
    pup '[attribute*="value"]'
    pup '[attribute~="value"]'
    pup '[attribute^="value"]'
    pup '[attribute$="value"]'
    pup ':empty'
    pup ':first-child'
    pup ':first-of-type'
    pup ':last-child'
    pup ':last-of-type'
    pup ':only-child'
    pup ':only-of-type'
    pup ':contains("text")'
    pup ':nth-child(n)'
    pup ':nth-of-type(n)'
    pup ':nth-last-child(n)'
    pup ':nth-last-of-type(n)'
    pup ':not(selector)'
    pup ':parent-of(selector)'

You can mix and match selectors as you wish.

    cat index.html | pup 'element#id[attribute="value"]:first-of-type'

## Display Functions

Non-HTML selectors which affect the output type are implemented as functions which can be provided as a final argument.

### text{}

Print all text from selected nodes and children in depth first order.

    $ cat robots.html | pup '.mw-headline text{}'
    History
    About the standard
    Disadvantages
    Alternatives
    Examples
    Nonstandard extensions
    Crawl-delay directive
    Allow directive
    Sitemap
    Host
    Universal "*" match
    Meta tags and headers
    See also
    References
    External links

### attr{attrkey}

Print the values of all attributes with a given key from all selected nodes.

    $ cat robots.html | pup '.catlinks div attr{id}'
    mw-normal-catlinks
    mw-hidden-catlinks

### json{}

Print HTML as JSON.

    $ cat robots.html | pup 'div#p-namespaces a'
    Article
    Talk

    $ cat robots.html | pup 'div#p-namespaces a json{}'
    [
     {
      "accesskey": "c",
      "href": "/wiki/Robots_exclusion_standard",
      "tag": "a",
      "text": "Article",
      "title": "View the content page [c]"
     },
     {
      "accesskey": "t",
      "href": "/wiki/Talk:Robots_exclusion_standard",
      "tag": "a",
      "text": "Talk",
      "title": "Discussion about the content page [t]"
     }
    ]

Use the `-i` / `--indent` flag to control the indent level.

    $ cat robots.html | pup -i 4 'div#p-namespaces a json{}'
    [
        {
            "accesskey": "c",
            "href": "/wiki/Robots_exclusion_standard",
            "tag": "a",
            "text": "Article",
            "title": "View the content page [c]"
        },
        {
            "accesskey": "t",
            "href": "/wiki/Talk:Robots_exclusion_standard",
            "tag": "a",
            "text": "Talk",
            "title": "Discussion about the content page [t]"
        }
    ]

If the selectors only return one element the results will be printed as a JSON object, not a list.

    $ cat robots.html | pup --indent 4 'title json{}'
    {
        "tag": "title",
        "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
    }

Because there is no universal standard for converting HTML/XML to JSON, a method has been chosen which hopefully fits. The goal is simply to get the output of pup into a more consumable format.

## Flags

Run `pup --help` for a list of further options.
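
The `json{}` section states that a single match is printed as a JSON object while several matches are printed as a list, so a script that consumes the output has to cope with both shapes. The following is a minimal Python sketch of one way to do that. It is not part of pup: the helper name `parse_pup_json` is hypothetical, and the sample strings are copied from the README examples rather than produced by running pup here.

```python
import json


def parse_pup_json(raw):
    """Parse pup json{} output and always return a list of node dicts.

    Assumption (from the README): one match is printed as a JSON object,
    several matches as a JSON list. Accepting both shapes keeps the caller
    simple whichever one a given pup version emits.
    """
    data = json.loads(raw)
    if isinstance(data, dict):
        return [data]
    return data


# Sample output copied from the README's `pup 'title json{}'` example.
single = """
{
    "tag": "title",
    "text": "Robots exclusion standard - Wikipedia, the free encyclopedia"
}
"""

# Sample output copied from the README's `pup 'div#p-namespaces a json{}'` example.
multiple = """
[
    {
        "accesskey": "c",
        "href": "/wiki/Robots_exclusion_standard",
        "tag": "a",
        "text": "Article",
        "title": "View the content page [c]"
    },
    {
        "accesskey": "t",
        "href": "/wiki/Talk:Robots_exclusion_standard",
        "tag": "a",
        "text": "Talk",
        "title": "Discussion about the content page [t]"
    }
]
"""

titles = parse_pup_json(single)
links = parse_pup_json(multiple)

# Collect the href of every node that has one.
hrefs = [node["href"] for node in links if "href" in node]
print(hrefs)
```

In a real pipeline the input string would come from the standard output of a command such as `pup 'div#p-namespaces a json{}'`; the sketch uses literal strings so it can run without pup installed.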