json2tsv.md - www.codemadness.org - www.codemadness.org saait content files
            1 Convert JSON to TSV or separated output.
            2 
            3 json2tsv reads JSON data from stdin.  By default it outputs each JSON node
            4 as one line in a TAB-Separated Value format.
            5 
            6 
            7 ## TAB-Separated Value format
            8 
            9 The output format per line is:
           10 
           11         nodename<TAB>type<TAB>value<LF>
           12 
           13 Control-characters such as a newline, TAB and backslash (\n, \t and \\) are
           14 escaped in the nodename and value fields.  Other control-characters are
           15 removed.
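
A program that needs the original text back has to reverse this escaping. The following is a minimal sketch in sh and awk, not something shipped with json2tsv: the helper name `unescape_values` and the sample record are made up for illustration, and it assumes that only the three escapes listed above occur in the value field.

```shell
#!/bin/sh
# Sketch: undo the \n, \t and \\ escapes in the value field (3rd field).
# unescape_values is a hypothetical helper name, not part of json2tsv.
unescape_values() {
    awk -F '\t' '
    function unescape(s,    out, i, c, n) {
        out = "";
        n = length(s);
        for (i = 1; i <= n; i++) {
            c = substr(s, i, 1);
            if (c == "\\" && i < n) {
                # take the character that follows the backslash
                c = substr(s, ++i, 1);
                if (c == "n")
                    c = "\n";
                else if (c == "t")
                    c = "\t";
                # an escaped backslash needs no mapping: c is already "\"
            }
            out = out c;
        }
        return out;
    }
    { print unescape($3); }'
}

# Hand-written sample record in the nodename<TAB>type<TAB>value format.
# The value holds an escaped TAB, then an escaped backslash and a plain "n".
printf '.k\ts\t%s\n' 'a\tb\\n' | unescape_values
```

Scanning character by character, rather than chaining gsub() calls, keeps an escaped backslash followed by "n" from being mistaken for an escaped newline.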
           16 
           17 The type field is a single byte and can be:
           18 
           19 * a for array
           20 * b for bool
           21 * n for number
           22 * o for object
           23 * s for string
           24 * ? for null
           25 
           26 Filtering on the first field "nodename" is easy using awk for example.
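
As a rough illustration of such filtering, the sketch below selects values by exact nodename and type. The helper names `values_of` and `sample` are hypothetical, and the records are hand-written in the format described above rather than captured from a real json2tsv run.

```shell
#!/bin/sh
# Sketch: select values by nodename (1st field) and type (2nd field).
# values_of is a hypothetical helper name, not part of json2tsv.
values_of() {
    # $1: exact nodename to match, $2: type byte (a, b, n, o, s or ?)
    awk -F '\t' -v name="$1" -v t="$2" '
    $1 == name && $2 == t { print $3; }'
}

# Hand-written sample records in the nodename<TAB>type<TAB>value format,
# standing in for real json2tsv output.
sample() {
    printf '.name\ts\tjson2tsv\n'
    printf '.tags\ta\t\n'
    printf '.tags[]\ts\tjson\n'
    printf '.tags[]\ts\ttsv\n'
    printf '.tags[]\tn\t42\n'
}

# Print only the string values stored under .tags[]
sample | values_of '.tags[]' s
```

With the tool installed, the same function could presumably be fed by `json2tsv < file.json` instead of `sample`.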
           27 
           28 
           29 ## Features
           30 
           31 * Accepts all **valid** JSON.
           32 * Designed to work well with existing UNIX programs like awk and grep.
           33 * Straightforward and not many lines of code: about 475 lines of C.
           34 * Few dependencies: C compiler (C99), libc.
           35 * No need to learn a new (meta-)language for processing data.
           36 * The parser supports code point decoding and UTF-16 surrogates to UTF-8.
           37 * It does not output control-characters to the terminal for security reasons by
           38   default (but it has a -r option if needed).
           39 * On OpenBSD it supports [pledge(2)](https://man.openbsd.org/pledge) for syscall restriction:
           40   pledge("stdio", NULL).
           41 * Supports setting a different field separator and record separator with the -F
           42   and -R option.
           43 
           44 
           45 ## Cons
           46 
           47 * For the tool there is additional overhead by processing and filtering data
           48   from stdin after parsing.
           49 * The parser does not do complete validation on numbers.
           50 * The parser accepts some bad input such as invalid UTF-8
           51   (see [RFC8259 - 8.1. Character Encoding](https://tools.ietf.org/html/rfc8259#section-8.1)).
           52   json2tsv reads from stdin and does not make assumptions about a "closed
           53   ecosystem" as described in the RFC.
           54 * The parser accepts some bad JSON input and "extensions"
           55   (see [RFC8259 - 9. Parsers](https://tools.ietf.org/html/rfc8259#section-9)).
           56 * Encoded NUL bytes (\u0000) in strings are ignored
           57   (see [RFC8259 - 9. Parsers](https://tools.ietf.org/html/rfc8259#section-9)):
           58   "An implementation may set limits on the length and character contents of
           59   strings."
           60 * The parser is not the fastest possible JSON parser (but also not the
           61   slowest).  For example: for ease of use, at the cost of performance, all
           62   strings are decoded, even though they may be unused.
           63 
           64 
           65 ## Why Yet Another JSON parser?
           66 
           67 I wanted a tool that makes parsing JSON easier and works well from the shell,
           68 similar to [jq](https://stedolan.github.io/jq/).
           69 
           70 sed and grep often work well enough for matching some value using some regex
           71 pattern, but they are not good enough to parse JSON correctly or to extract all
           72 information: just like parsing HTML/XML using some regex is not good (enough)
           73 or a good idea :P.
           74 
           75 I didn't want to learn a new specific [meta-language](https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions) which jq has and wanted
           76 something simpler.
           77 
           78 While it is more efficient to embed this query language for data aggregation,
           79 it is also less simple. In my opinion it is simpler to separate this and use
           80 pattern-processing by awk or another filtering/aggregating program.
           81 
           82 For the parser, there are many JSON parsers out there, like the efficient
           83 [jsmn parser](https://github.com/zserge/jsmn). However, a few of its behaviours are not what I want:
           84 
           85 * jsmn buffers data as tokens, which is very efficient, but also a bit
           86   annoying as an API as it requires another layer of code to interpret the
           87   tokens.
           88 * jsmn does not handle decoding strings by default, which is very efficient
           89   if you don't need parts of the data.
           90 * jsmn does not keep context of nested structures by default, so may require
           91   writing custom utility functions for nested data.
           92 
           93 This is why I went for a parser design that uses a single callback per "node"
           94 type and keeps track of the current nested structure in a single array and
           95 emits that.
           96 
           97 
           98 ## Clone
           99 
          100         git clone git://git.codemadness.org/json2tsv
          101 
          102 
          103 ## Browse
          104 
          105 You can browse the source-code at:
          106 
          107 * <https://git.codemadness.org/json2tsv/>
          108 * <gopher://codemadness.org/1/git/json2tsv>
          109 
          110 
          111 ## Download releases
          112 
          113 Releases are available at:
          114 
          115 * <https://codemadness.org/releases/json2tsv/>
          116 * <gopher://codemadness.org/1/releases/json2tsv>
          117 
          118 
          119 ## Build and install
          120 
          121         $ make
          122         # make install
          123 
          124 
          125 ## Examples
          126 
          127 A usage example to parse posts of the JSON API of [reddit.com](https://www.reddit.com/) and format them
          128 to a plain-text list using awk:
          129 
          130         #!/bin/sh
          131         curl -s -H 'User-Agent:' 'https://old.reddit.com/.json?raw_json=1&limit=100' | \
          132         json2tsv | \
          133         awk -F '\t' '
          134         function show() {
          135                 if (length(o["title"]) == 0)
          136                         return;
          137                 print n ". " o["title"] " by " o["author"] " in r/" o["subreddit"];
          138                 print o["url"];
          139                 print "";
          140         }
          141         $1 == ".data.children[].data" {
          142                 show();
          143                 n++;
          144                 delete o;
          145         }
          146         $1 ~ /^\.data\.children\[\]\.data\.[a-zA-Z0-9_]*$/ {
          147                 o[substr($1, 23)] = $3;
          148         }
          149         END {
          150                 show();
          151         }'
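
The grouping pattern in the script above can be tried without network access. The sketch below applies the same idea to hand-written records: the `.items[]` node names, the helper name `group_items` and the sample data are assumptions for illustration, not real API output. The prefix `.items[].` is 9 characters long, so `substr($1, 10)` yields the key, in the same way that `substr($1, 23)` strips the 22-character prefix `.data.children[].data.` above.

```shell
#!/bin/sh
# Sketch: collect the fields of each object, then print one line per object.
# group_items is a hypothetical helper name, not part of json2tsv.
group_items() {
    awk -F '\t' '
    function show() {
        # nothing collected yet (or no title): print nothing
        if (length(o["title"]) == 0)
            return;
        print n ". " o["title"] " by " o["author"];
    }
    # a new object starts: flush the previous one and reset the fields
    $1 == ".items[]" {
        show();
        n++;
        delete o;
    }
    # direct members of the current object: store them by key
    $1 ~ /^\.items\[\]\.[a-zA-Z0-9_]*$/ {
        o[substr($1, 10)] = $3;
    }
    # flush the last object
    END {
        show();
    }'
}

# Hand-written sample records standing in for real json2tsv output.
printf '.items\ta\t\n.items[]\to\t\n.items[].title\ts\tFirst\n.items[].author\ts\talice\n.items[]\to\t\n.items[].title\ts\tSecond\n.items[].author\ts\tbob\n' |
group_items
```

Because a record for the object node itself precedes the records of its members, the previous object is printed when the next one starts, and the END block prints the last one.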
          152 
          153 
          154 ## References
          155 
          156 * Sites:
          157   * [seriot.ch - Parsing JSON is a Minefield](http://seriot.ch/parsing_json.php)
          158   * [A comprehensive test suite for RFC 8259 compliant JSON parsers](https://github.com/nst/JSONTestSuite)
          159   * [json.org](https://json.org/)
          160 * Current standard:
          161   * [RFC8259 - The JavaScript Object Notation (JSON) Data Interchange Format](https://tools.ietf.org/html/rfc8259)
          162   * [Standard ECMA-404 - The JSON Data Interchange Syntax (2nd edition, December 2017)](https://www.ecma-international.org/publications/standards/Ecma-404.htm)
          163 * Historic standards:
          164   * [RFC7159 - The JavaScript Object Notation (JSON) Data Interchange Format (obsolete)](https://tools.ietf.org/html/rfc7159)
          165   * [RFC7158 - The JavaScript Object Notation (JSON) Data Interchange Format (obsolete)](https://tools.ietf.org/html/rfc7158)
          166   * [RFC4627 - The JavaScript Object Notation (JSON) Data Interchange Format (obsolete, original)](https://tools.ietf.org/html/rfc4627)