codemadness.org

       extractjson.1 - extractjson - extract embedded JSON metadata from HTML pages
       git clone git://git.codemadness.org/extractjson
       ---
       extractjson.1 (862B)
       ---
.Dd August 14, 2022
.Dt EXTRACTJSON 1
.Os
.Sh NAME
.Nm extractjson
.Nd extracts embedded JSON metadata from HTML pages
.Sh SYNOPSIS
.Nm
.Sh DESCRIPTION
.Nm
extracts embedded JSON metadata from HTML pages, such as data in the tags:
<script type="application/ld+json">
.Pp
It reads HTML from stdin and writes one JSON fragment per line to stdout.
.Sh EXIT STATUS
.Ex -std
.Sh EXAMPLES
.Bd -literal
curl -s https://www.imdb.com/title/tt0107048/ | extractjson | sed 1q | json2tsv
.Ed
.Pp
This extracts the JSON metadata from the IMDb page of the movie "Groundhog Day".
The pipeline keeps only the first embedded JSON fragment (sed 1q) and pipes it
to json2tsv.
The resulting output can then be processed further with awk to get the relevant
data.
.Pp
.Nm
can also be useful for extracting video streams from webpages.
.Sh SEE ALSO
.Xr curl 1 ,
.Xr json2tsv 1
.Sh AUTHORS
.An Hiltjo Posthuma Aq Mt hiltjo@codemadness.org
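
The EXAMPLES section mentions further processing with awk but does not show it. The following is a minimal sketch of that step, not part of the manual page itself. It assumes that json2tsv writes one tab-separated record per JSON node (node name, type, value); the helper names `sample` and `getfield`, as well as the sample node names and values, are hypothetical and stand in for real page data so the script runs without network access.

```shell
#!/bin/sh
# Hypothetical sample of json2tsv-style output: node name, type and value,
# separated by tabs. The names and values are illustrative only and are
# not taken from a real page.
sample() {
	printf '%s\t%s\t%s\n' \
		'.@type' 's' 'Movie' \
		'.name' 's' 'Example Title' \
		'.aggregateRating.ratingValue' 'n' '8.0'
}

# Print the value (third field) of the first record whose node name
# (first field) matches the argument exactly.
getfield() {
	awk -F '\t' -v name="$1" '$1 == name { print $3; exit }'
}

title=$(sample | getfield '.name')
rating=$(sample | getfield '.aggregateRating.ratingValue')
printf '%s (%s)\n' "$title" "$rating"
```

Under the same assumption about the output format, `sample` could be swapped for the pipeline from EXAMPLES, so that `getfield` reads the output of json2tsv instead of the canned records.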