http://www.oilshell.org/release/latest/doc/qsn.html (version 0.9.8)

QSN: A Familiar String Interchange Format

QSN ("quoted string notation") is a data format for byte strings. Examples:

    ''                           # empty string
    'my favorite song.mp3'
    'bob\t1.0\ncarol\t2.0\n'     # tabs and newlines
    'BEL = \x07'                 # byte escape
    'mu = \u{03bc}'              # char escapes are encoded in UTF-8
    'mu = μ'                     # represented literally, not escaped

It's an adaptation of Rust's string literal syntax with a few use cases:

* To print filenames to a terminal. Printing arbitrary bytes to a terminal is bad, so programs like coreutils already have informal QSN-like formats.
* To exchange data between different programs, like JSON or UTF-8. Note that JSON can't express arbitrary byte strings.
* To solve the "framing problem" over pipes. QSN represents newlines as \n, so literal newlines can be used to delimit records.

Oil uses QSN because it's well-defined and parsable. It's both human- and machine-readable.

Any programming language or tool that understands JSON should also understand QSN.

Table of Contents

    Important Properties
    More QSN Use Cases
    Specification
    A Short Description
    An Analogy
    Full Spec
    Advantages Over JSON Strings
    Implementation Issues
    How Does a QSN Encoder Deal with Unicode?
    Which Bytes Should Be Hex-Escaped?
    List of Syntax Errors
    Reference Implementation in Oil
    Appendices
    Design Notes
    Related Links
    set -x example

Important Properties

* QSN can represent any byte sequence.
* Given a QSN-encoded string, any 2 decoders must produce the same byte string. (On the other hand, encoders have flexibility with regard to escaping.)
* An encoded string always fits on a single line. Newlines must be encoded as \n, not literal.
* An encoded string always fits in a TSV cell. Tabs must be encoded as \t, not literal.
* An encoded string can itself be valid UTF-8.
  + Example: 'μ \xff' is valid UTF-8, even though the decoded string is not.
* An encoded string can itself be valid ASCII.
  + Example: '\xce\xbc' is valid ASCII, even though the decoded string is not.

More QSN Use Cases

* To pack arbitrary bytes on a single line, e.g. for line-based tools like grep, awk, and xargs. QSN strings never contain literal newlines or tabs.
* For set -x in shell. Like filenames, Unix argv arrays may contain arbitrary bytes. There's an example in the appendix.
  + ps has to display untrusted argv arrays.
  + ls has to display untrusted filenames.
  + env has to display untrusted byte strings. (Most versions of env don't handle newlines well.)
* As a building block for larger specifications, like QTT.
* To transmit arbitrary bytes over channels that can only represent ASCII or UTF-8 (e.g. e-mail, Twitter).

Specification

A Short Description

1. Start with Rust String Literal Syntax.
2. Use single quotes instead of double quotes to surround the string. This is mainly to avoid confusion with JSON.

An Analogy

    JavaScript Object Literals are to JSON
    as Rust String Literals are to QSN

But QSN is not tied to either Rust or shell, just like JSON isn't tied to JavaScript.

It's a language-independent format like UTF-8 or HTML. We're only borrowing a design, so that it's well-specified and familiar.

Full Spec

TODO: The short description above should be sufficient, but we might want to write it out.

* Special escapes:
  + \t \r \n
  + \' \"
  + \\
  + \0
* Byte escapes: \x7F
* Character escapes: \u{03bc} or \u{0003bc}. These are encoded as UTF-8.

Advantages Over JSON Strings

* QSN can represent any byte string, like '\x00\xff\x00'. JSON can't represent binary data directly.
* QSN can represent any code point, like '\u{01f600}' for 😀. JSON needs awkward surrogate pairs to represent this code point.

Implementation Issues

How Does a QSN Encoder Deal with Unicode?

The input to a QSN encoder is a raw byte string. However, the string may have additional structure, like being UTF-8 encoded.
The encoder has three options to deal with this structure:

1. Don't decode UTF-8. Walk through bytes one-by-one, showing unprintable ones with escapes like \xce\xbc. Never emit escapes like \u{3bc} or literals like μ. This option is OK for machines, but isn't friendly to humans who can read Unicode characters.

Or speculatively decode UTF-8. After decoding a valid UTF-8 sequence, there are two options:

2. Show escaped code points, like \u{3bc}. The encoded string is limited to the ASCII subset, which is useful in some contexts.
3. Show them literally, like μ.

QSN encoding should never fail; it should only fall back to byte escapes like \xff.

TODO: Show the state machine for detecting and decoding UTF-8.

Note: The output of strategies 2 and 3 indicates whether the input is valid UTF-8, because bytes that aren't part of a valid sequence appear as byte escapes.

Which Bytes Should Be Hex-Escaped?

The reference implementation has two functions:

* IsUnprintableLow: any byte below an ASCII space ' ' is escaped.
* IsUnprintableHigh: the byte \x7f and all bytes above are escaped, unless they're part of a valid UTF-8 sequence.

In theory, only escapes like \' \n \\ are strictly necessary, and no bytes need to be hex-escaped. But that strategy would defeat the purpose of QSN for many applications, like printing filenames in a terminal.

List of Syntax Errors

QSN decoders must reject (at least) these syntax errors:

* Literal newline or tab in a string. Should be \t or \n. (The lack of literal tabs and newlines is essential for QTT.)
* Invalid character escape, e.g. \z
* Invalid hex escape, e.g. \xgg
* Invalid unicode escape, e.g. \u{123 (incomplete)

Separate messages aren't required for each error; the only requirement is that decoders not accept these sequences.

Reference Implementation in Oil

* Oil's encoder is in qsn_/qsn.py, including the state machine for the UTF-8 strategies.
* The decoder has a lexer in frontend/lexer_def.py, and a "parser" / validator in qsn_/qsn_native.py. (Note that QSN is a regular language.)
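Encoder strategy 1 and the syntax errors listed above can be sketched in Python as follows. This is only an illustration, not Oil's code in qsn_/qsn.py: the names qsn_encode and qsn_decode are hypothetical, the encoder does no UTF-8 detection (every byte at or above \x7f is hex-escaped), and the limit of 1 to 6 hex digits inside \u{...} is an assumption carried over from Rust's literal syntax.

```python
import re

# Special escapes this sketch's encoder emits (assumption: a subset of the spec's list).
_ENCODE_SPECIAL = {0x09: "\\t", 0x0A: "\\n", 0x0D: "\\r", 0x27: "\\'", 0x5C: "\\\\"}

# Special escapes the decoder accepts: \t \r \n \' \" \\ \0
_DECODE_SPECIAL = {
    "t": b"\t", "r": b"\r", "n": b"\n",
    "'": b"'", '"': b'"', "\\": b"\\", "0": b"\x00",
}

# One escape: \xHH, \u{H...} (1-6 hex digits, assumed from Rust), or a special escape.
_ESCAPE = re.compile(r"""\\(?:x([0-9a-fA-F]{2})|u\{([0-9a-fA-F]{1,6})\}|([trn'"\\0]))""")


def qsn_encode(data: bytes) -> str:
    """Strategy 1: walk the bytes one by one, never decoding UTF-8."""
    parts = ["'"]
    for b in data:
        if b in _ENCODE_SPECIAL:
            parts.append(_ENCODE_SPECIAL[b])
        elif b < 0x20 or b >= 0x7F:  # unprintable low / high bytes
            parts.append("\\x%02x" % b)
        else:
            parts.append(chr(b))
    parts.append("'")
    return "".join(parts)


def qsn_decode(s: str) -> bytes:
    """Decode a QSN string, raising ValueError on the listed syntax errors."""
    if len(s) < 2 or s[0] != "'" or s[-1] != "'":
        raise ValueError("QSN string must be wrapped in single quotes")
    body = s[1:-1]
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c in "\t\n":
            raise ValueError("literal tab or newline at offset %d" % i)
        if c == "'":
            raise ValueError("unescaped single quote at offset %d" % i)
        if c != "\\":
            out += c.encode("utf-8")  # literal characters are stored as UTF-8
            i += 1
            continue
        m = _ESCAPE.match(body, i)
        if m is None:
            raise ValueError("invalid escape at offset %d" % i)
        if m.group(1) is not None:    # byte escape
            out.append(int(m.group(1), 16))
        elif m.group(2) is not None:  # char escape, encoded as UTF-8
            out += chr(int(m.group(2), 16)).encode("utf-8")
        else:                         # special escape
            out += _DECODE_SPECIAL[m.group(3)]
        i = m.end()
    return bytes(out)
```

With these definitions, qsn_encode(b"bob\t1.0\n") gives 'bob\t1.0\n' with the tab and newline escaped, and qsn_decode applied to 'mu = \u{03bc}' gives the bytes b'mu = \xce\xbc'. Because the encoder's output is pure ASCII, decoding it always returns the original bytes.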
The encoder has options to emit shell-compatible strings, which you probably don't need. That is, C-escaped strings in bash look $'like this\n'.

A subset of QSN is compatible with this syntax. Example:

    $'\x01\n'  # A valid bash string. Removing $ makes it valid QSN.

Something like $'\0065' is never emitted, because QSN doesn't contain octal escapes. It can be encoded with hex or character escapes.

Appendices

Design Notes

The general idea: Rust string literals are like C and JavaScript string literals, without cruft like octal (\755 or \0755 -- which is it?) and vertical tabs (\v).

Comparison with shell strings:

* 'Single quoted strings' in shell can't represent arbitrary byte strings.
* $'C-style shell strings\n' are similar to QSN, but have cruft like octal and \v.
* "Double quoted strings" have unneeded features like $var and $(command sub).

Comparison with Python's repr():

* A single quote in Python is "'", whereas it's '\'' in QSN.
* Python has both \uxxxx and \Uxxxxxxxx, whereas QSN has the more natural \u{xxxxxx}.

Related Links

* GNU Coreutils - Quoting File names. Starting with GNU coreutils version 8.25 (released Jan. 2016), ls's default output quotes filenames with special characters.
* In-band signaling is the fundamental problem with filenames and terminals. Code (control codes) and data are intermingled.
* QTT is a cleanup of CSV/TSV, built on top of QSN.

set -x example

When arguments don't have any spaces, there's no ambiguity:

    $ set -x
    $ echo two args
    + echo two args

Here we need quotes to show that the argv array has 3 elements:

    $ set -x
    $ x='a b'
    $ echo "$x" c
    + echo 'a b' c

And we want the trace to fit on a single line, so we print a QSN string with \n:

    $ set -x
    $ x=$'a\nb'
    $ echo "$x" c
    + echo $'a\nb' c

Here's an example with unprintable characters:

    $ set -x
    $ x=$'\e\001'
    $ echo "$x"
    + echo $'\x1b\x01'

---------------------------------------------------------------------
Generated on Sat Feb 19 18:43:17 EST 2022