[HN Gopher] Show HN: Comma Separated Values (CSV) to Unicode Sep...
       ___________________________________________________________________
        
       Show HN: Comma Separated Values (CSV) to Unicode Separated Values
       (USV)
        
       Author : jph
       Score  : 154 points
       Date   : 2024-03-12 13:43 UTC (9 hours ago)
        
 (HTM) web link (crates.io)
 (TXT) w3m dump (crates.io)
        
       | jiehong wrote:
       | For those wondering what USV is, like myself:
       | 
       | > Unicode separated values (USV) is a data format that uses
       | Unicode symbol characters between data parts. USV competes with
       | comma separated values (CSV), tab separated values (TSV), ASCII
       | separated values (ASV), and similar systems. USV offers more
       | capabilities and standards-track syntax.
       | 
       | > Separators:
       | 
       | >
       | 
       | >  U+241F Symbol for Unit Separator (US)
       | 
       | >
       | 
       | >  U+241E Symbol for Record Separator (RS)
       | 
       | >
       | 
       | >  U+241D Symbol for Group Separator (GS)
       | 
       | >
       | 
       | >  U+241C Symbol for File Separator (FS)
       | 
       | >
       | 
       | > Modifiers:
       | 
       | >
       | 
       | >  U+241B Symbol for Escape (ESC)
       | 
       | >
       | 
       | >  U+2417 Symbol for End of Transmission Block (ETB)
       | 
       | >
       | 
       | >  U+2416 Symbol For Synchronous Idle (SYN)
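
A minimal Python sketch of reading the separators listed above. It assumes each unit ends with US and each record ends with RS (a trailing separator is treated as a terminator), and it ignores ESC, ETB, SYN, GS and FS entirely. `parse_usv_records` is a hypothetical helper name, not part of any USV library.

```python
# Separator code points are taken from the quoted description;
# everything else in this sketch is illustrative.
US = "\u241F"  # Symbol for Unit Separator
RS = "\u241E"  # Symbol for Record Separator


def parse_usv_records(text):
    """Split text into records, each a list of unit strings.

    Simplifications: no escape handling, no group/file levels,
    and empty records are dropped.
    """
    records = []
    for record in text.split(RS):
        units = record.split(US)
        # Assumption: a trailing US terminates the last unit.
        if units and units[-1] == "":
            units.pop()
        if units:
            records.append(units)
    return records


sample = "a" + US + "b" + US + RS + "c" + US + "d" + US + RS
rows = parse_usv_records(sample)
```

Here `rows` holds two records of two units each. A real parser would also need the escape rules that come up further down the thread.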
        
         | calvinmorrison wrote:
         | I always wonder why we don't use this.
        
           | curtisblaine wrote:
           | Not easily readable / editable using a regular text editor.
        
             | timmg wrote:
             | Do those characters map to something visually useful in
             | (typical) unicode fonts?
             | 
             | That would be neat :)
             | 
             | Edit: Apparently, kinda (e.g.
             | https://www.compart.com/en/unicode/U+241E )
             | 
             | Not the most creative....
        
             | bryanlarsen wrote:
             | According to their GitHub README:
             | 
             | ```USV works with many kinds of editors. Any editor that
             | can render the USV characters will work. We use vi, emacs,
             | Coda, Notepad++, TextMate, Sublime, VS Code, etc.```
             | 
             | I loaded an example in my fairly generic Emacs and it
             | worked out of the box. The separators were pretty small so
             | I had to increase my font size to distinguish US from RS.
             | And of course I have no idea how to enter those characters.
             | I'm sure there is a way, but cut & paste worked.
        
             | Solvency wrote:
             | Who here uses a "regular" text editor, let's be real
        
             | ahofmann wrote:
             | I'm fascinated that a lot of posters in this thread are not
             | understanding the ideas and experiences, that the inventors
             | of this file format had or made. They invented this format
             | because it works for machines as well as for humans. Text
             | editors can handle the proposed Unicode characters just fine.
             | Humans can see them. The only challenge is that it is
             | cumbersome to type the delimiters. And that the format is
             | not used in any relevant software (like Excel). Both are
             | reason enough, that USV will not be used anywhere. But I
             | can see why they went this way on their file format.
        
               | wtetzner wrote:
               | I don't really see what benefit it provides over CSV
               | other than needing to escape less frequently. That hardly
               | seems like it's worth it.
        
               | vidarh wrote:
               | We might be able to see them, but for me they're just a
               | blur unless I zoom in significantly, so I'll need editor
               | accommodations just as much for these characters as if
               | they used the already existing RS/FS/US/GS characters.
               | 
               | It feels like instead of fixing it properly, they went
               | with an option that will still need tool improvements,
               | will be controversial, and adds unnecessary details (e.g.
               | the SYN they've added will be an active nuisance and I'd
               | be willing to bet will get ignored by enough tools to
               | become a hazard to data integrity).
               | 
               | I quite like an initiative to make use of proper record
               | and unit separators, but this feels poorly thought
               | through in several respects (e.g. their quirky escape
               | character that acts differently depending on the class
               | of the following character will be a 'fun' source of
               | bugs; that splitting records on LF requires three
               | characters almost certainly will mean a number of tools
               | will incorrectly treat those three characters as a unit,
               | etc. -- these assumptions are based on how slapdash a lot
               | of CSV parsing and generation is; if you want to compete
               | with CSV you ought to learn those lessons)
        
               | dahart wrote:
               | CSV works for machines as well as humans, why do you
               | assume or imply otherwise? Making the separator hard to
               | type makes this 'invention' hard for humans to use. Using
               | the glyphs instead of the semantic Unicode separators
               | might also make this harder to use, even if you can
               | understand why they did it, and to some degree it
               | subverts the intent of the Unicode standard's separator
               | and glyph characters.
        
               | theamk wrote:
               | We don't need a new format which works for machines as
               | well as for humans, because there are tons of
               | existing ones. You have CSV or TSV for wide support;
               | JSONlines if you want very easy edit-ability and
               | structure; and if those don't work for some reason,
               | pretty much any other delimiter/escape would work better
               | (example: newline for records, "^^" for fields, "^"-style
               | character escaping; or JS-style "\"-escaping with field
               | separator being "\N")
        
           | theamk wrote:
           | All downsides, no upsides.
           | 
           | You cannot edit it in regular editor, like csv/tsv/jsonlines.
           | 
           | There is no schema or efficient storage, like binary formats.
           | 
           | There is no wide library support.
           | 
           | Not all data is representable.
        
             | mavhc wrote:
             | 1. Editors can be improved.
             | 2. Same as CSV etc. then.
             | 3. Libraries can be improved.
             | 4. Character escaping exists.
             | 
             | ASCII 1963 had 8 separators, 1965 reduced it to 4, and
             | named them. See 6.3.12 of
             | https://dl.acm.org/doi/pdf/10.1145/363831.363839
        
               | cxr wrote:
               | The task here is to explain why one should use this over
               | CSV. By your own admissions, there is no reason to prefer
               | this over CSV.
        
             | ixwt wrote:
             | > You cannot edit it in regular editor, like
             | csv/tsv/jsonlines.
             | 
             | If only there were shortcuts on modern operating systems to
             | allow us to do things that aren't readily on our keyboards.
             | Like upper case characters. Or copy and paste. Or close
             | windows. Our lives would be so much better.
             | 
             | If ASV had caught on, there could be common shared
             | shortcuts to type them, and fonts would regularly display
             | them (just like the unicode characters proposed). But CSV
             | was simple enough and readily type-able.
             | 
             | > There is no schema or efficient storage, like binary
             | formats.
             | 
             | I'm not quite certain where you're trying to go with this.
             | Binary formats aren't really meant to be human readable in
             | an average text editor. It doesn't know to differentiate 1,
             | 2, 4, or 8 bytes as an integer or a float. Even current hex
             | editors to make it easier to navigate these formats don't
             | really know unless you are able to tell it somehow.
             | 
             | > There is no wide library support.
             | 
             | It's a critical mass problem. Not enough people are using
             | them, so no libraries are being made.
             | 
             | > Not all data is representable.
             | 
             | I'm not quite certain what data couldn't be represented. If
             | you can represent your data in CSV, you can represent it in
             | ASV. It's all plain text that gets interpreted based on
             | what you need. They're nearly a 1:1 replacement. Commas get
             | replaced by unit separators, new lines get replaced by
             | record separators. Then you have group and file separators
             | to work with for further levels of abstraction if you need.
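
The near 1:1 mapping can be sketched in Python with the standard `csv` module. This assumes the usual reading of the ASCII names, with US (0x1F) between fields and RS (0x1E) after each row; ASV has no single agreed spec, so treat that as an assumption. `csv_to_asv` is a hypothetical name.

```python
import csv
import io

US = "\x1f"  # ASCII unit separator, used here between fields
RS = "\x1e"  # ASCII record separator, used here after each row


def csv_to_asv(csv_text):
    """Re-encode CSV text as ASV, letting the csv module undo quoting."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    # Assumption: no field contains a literal US or RS character.
    return "".join(US.join(row) + RS for row in reader)


asv = csv_to_asv('a,"b,c"\r\n1,2\r\n')
```

The quoted field `"b,c"` comes out as a plain `b,c` unit, because the comma no longer needs protecting.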
        
               | theamk wrote:
               | Re editors: The problem with USV is not that it's hard to
               | type the characters, but rather that the newlines are
               | completely optional. Which means that in general case,
               | most line-based tools are not going to work with USV.
               | 
               | Now, the readme actually has that optional newline
               | separator thing, but the optionality of it makes it
               | completely useless, it seems like an after-thought. For
               | example the first "real" USV writer I found, the "csv-to-
               | usv", does not put them [0] and thus makes uneditable
               | files.
               | 
               | And if we are going to end up with uneditable files,
               | might as well go with something schema-full, like parquet
               | or avro. You are going to have the same "critical mass
               | problem", but at least the tooling is much better and you
               | have neat features like schemas.
               | 
               | [0] https://github.com/SixArm/csv-to-usv-rust-
               | crate/blob/30a0324...
        
           | edent wrote:
           | I've sketched out a replacement for JSON which would use
           | these characters - https://shkspr.mobi/blog/2017/03/kyli-
           | because-it-is-superior...
        
           | michaelt wrote:
           | CSV is the javascript of the tabular data world.
           | 
           | Everyone thinks they can do better, but nothing's more widely
           | supported (for a sufficiently generous definition of
           | 'supported')
        
             | hermitcrab wrote:
             | Unfortunately CSVs vary a lot in the wild. Some people use
             | commas as a delimiter, some use semi-colons. Escaping rules
             | vary. And the text encoding is not specified.
             | 
             | I randomly generated some CSVs and fed them into Excel and
             | Numbers and they were differently interpreted.
        
               | mst wrote:
               | This is why I tend to use the Pg COPY version of TSV -
               | works beautifully with 'cut' and friends, loads trivially
               | into most databases, and the 'vary a lot' problem is
               | (ish) avoided by specifying COPY escaping which is
               | clearly documented and something people often already
               | recognise.
               | 
               | Generally my only interaction with CSV itself is to fling
               | it through https://p3rl.org/Text::CSV since that seems to
               | be able to get a pretty decent parse of every sort of CSV
               | I've yet had to deal with in the wild.
        
               | Tijdreiziger wrote:
               | Countries that use . as the thousands separator (e.g.
               | 1.000) use , as the CSV separator.
               | 
               | Countries that use , as the thousands separator (e.g.
               | 1,000) use ; as the CSV separator.
               | 
               | Why? Because that's how Excel does it.
        
             | dimask wrote:
             | Funny thing, excel, which is the most common spreadsheet
             | editor, does not practically support CSV files if you
             | happen to live in countries where the default official
             | convention is using commas for decimal points in numbers.
             | Unless you go around and manually set stuff in how it
             | imports it or you change your default settings. It has
             | reached meme levels at my work.
             | 
             | Tab separated files are much better imo in not getting
             | confused with the delimiter for a sufficiently sane tsv
             | file.
        
               | Tijdreiziger wrote:
               | Yes it does, but then it uses ; as a separator.
        
           | chasil wrote:
           | In a POSIX shell, I actually prefer to use the bell character
           | for IFS.
           | 
           |     while IFS="$(printf \\a)" read -r field1 field2...
           |     do ...
           |     done
           | 
           | This works just as well as anything outside the range of
           | printing characters.
        
           | pavon wrote:
           | Back when rolling your own application level protocol on top
           | of TCP was common (as opposed to using http, zeromq, etc) I
           | frequently used file/record/group/unit separators for
           | delimiters, and considered them an underrated gem, especially
           | for plain-text data where they were prohibited to occur in
           | the message body so you didn't have to escape them (still
           | good to scan and reject messages containing them). As a
           | modern example they (and most other ASCII control characters)
           | are disallowed in json strings.
        
             | gkbrk wrote:
             | You can put control characters in JSON strings, you just
             | need to escape them.
        
               | pavon wrote:
               | The way I read the json standard, the only way to include
               | control characters is to encode them as hex. For example
               | BEL can be encoded as "\u0007", but escaping it by using
               | a backslash followed by a literal BEL character is not
               | allowed. So literal control characters should never be in
               | json text.
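
This reading can be checked against one implementation. The sketch below uses Python's standard `json` module in its default strict mode; other parsers may be more lenient, so it shows one parser's behaviour and is not a statement about every JSON library.

```python
import json

BEL = "\x07"

# Serializing: the control character is written as a \u escape.
encoded = json.dumps(BEL)

# Parsing the escaped form gives the control character back.
decoded = json.loads('"\\u0007"')

# Parsing a literal control character inside a string is rejected
# by Python's parser when strict mode is on (the default).
try:
    json.loads('"' + BEL + '"')
    literal_accepted = True
except json.JSONDecodeError:
    literal_accepted = False
```

So the escaped form round-trips, while the raw BEL inside the quotes raises `JSONDecodeError`.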
        
           | queuebert wrote:
           | CSV is honestly not that problematic. Figuring out if a
           | field contains a comma and then properly quoting it is
           | trivial. And fields without commas don't need quoting.
           | Sometimes your application even guarantees no commas,
           | especially if CSV is designed into it from the beginning.
        
             | hermitcrab wrote:
             | I'm guessing you haven't worked in customer support where
             | people send you their "CSV" files. Even the field delimiter
             | varies (many Europeans use semi-colons).
        
               | queuebert wrote:
               | No, I have. I don't consider abuse of the format a
               | problem with the format. Though I can see how having to
               | delimit with special characters will help the type of
               | person who writes print(','.join(stuff)).
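
For comparison, a short Python sketch of the difference between a bare `','.join(...)` and the standard `csv` writer. The sample row is made up for illustration.

```python
import csv
import io

row = ["Smith, John", 'said "hi"', "plain"]

# Naive approach: no quoting, so the embedded comma splits a field.
naive = ",".join(row)

# csv.writer quotes fields that contain the delimiter or quote character.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(row)
quoted = buf.getvalue()

# Read both back with the same parser.
parsed_naive = next(csv.reader([naive]))
parsed_quoted = next(csv.reader([quoted]))
```

`parsed_quoted` matches the original three fields, while `parsed_naive` comes back with four.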
        
               | hermitcrab wrote:
               | >I don't consider abuse of the format a problem with the
               | format.
               | 
               | That's a fair point. But you could argue that when the
               | abuse is so widespread, it becomes a de facto part of the
               | format (even if it isn't in the RFC).
        
         | teddyh wrote:
         | Using Unicode _graphic_ characters as metasyntactic escape
         | characters is fundamentally wrong. Those Unicode characters are
         | for _displaying the symbols_ for Unit Separator, Record
         | Separator, etc. and _not_ for actually _being_ separators!
         | ASCII _already has those! Included in Unicode!_
        
           | layer8 wrote:
           | To be fair, I don't quite get those graphic characters,
           | because the original characters should already be displayed
           | that way, shouldn't they? Now when I see such a character, I
           | have no idea if it's the real character or just its
           | graphic-character counterpart.
        
         | eadmund wrote:
         | Wait a second ... he's not proposing using
         | unit/record/group/file separators as separators, he's proposing
         | using the _symbols for those separators_ as separators! Why not
         | just use the separators themselves!?
         | 
         | Yes, rather than using U+001F (the ASCII and Unicode unit
         | separator), he proposes using U+241F (the Unicode symbol _for_
         | the unit separator). I almost feel like this must be an early
         | April Fool's joke?
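
The distinction between the two code points can be inspected with Python's standard `unicodedata` module. `describe` is a hypothetical helper for this sketch.

```python
import unicodedata

CONTROL_US = "\x1f"    # the actual unit separator control character
PICTURE_US = "\u241f"  # the control picture that depicts it


def describe(ch):
    """Return (code point, general category, name) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        # Control characters have no formal name, so supply a default.
        unicodedata.name(ch, "<no name>"),
    )


control_info = describe(CONTROL_US)
picture_info = describe(PICTURE_US)
```

The control character is in category `Cc` (control), while the picture is in `So` (other symbol), which is the gap the comment is pointing at.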
         | 
         | Also, he writes 'comprised of' rather than 'composed of' or
         | 'comprises' throughout his RFC.
        
           | ahofmann wrote:
           | They explain in the FAQ that this approach works with most
           | text editors and copy-paste situations.
        
             | philipwhiuk wrote:
             | It doesn't "work" because I can't read the darn things at a
             | sane zoom level.
        
           | bryanlarsen wrote:
           | Using a visible character rather than an invisible one makes
           | editing in an editor a lot easier.
        
             | tgv wrote:
             | It won't wrap at the record separator, so you'll get a very
             | long line.
        
               | bryanlarsen wrote:
               | The example seems to use `␞\n` as a separator rather than
               | just `␞`. I assume their proposed standard is more
               | definitive.
        
               | vidarh wrote:
               | Their ABNF uses RS, defined as U+241E, not U+241E + '\n'
               | as the record separator. They seem to add an "USV escape"
               | in front of the linefeeds.
               | 
               | My bet is that this _will_ lead to implementations that
               | wrongly treat "␞␛\n" (RS ESC \n) as the real record
               | separator, the same way lots of "CSV" implementations
               | just split on comma and LF.
               | 
               | Seems to me if you're going to add support for something
               | like that you should just bite the bullet and declare an
               | LF immediately following an RS as part of the record
               | separator, or you're falling in the same trap as CSV of
               | being "close enough" to naively splittable that people
               | will do it because it works often enough.
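
A Python sketch of the failure mode and of the suggested alternative. This implements the commenter's proposal (an LF right after RS counts as part of the separator), not what the USV draft specifies; `split_records` is a hypothetical name and escape handling is left out.

```python
import re

US = "\u241F"
RS = "\u241E"

# RS optionally followed by one line break is treated as one separator.
RECORD_BREAK = re.compile(RS + r"(?:\r?\n)?")


def split_records(text):
    """Split on RS, swallowing a line break that directly follows it."""
    # Simplification: empty records are dropped.
    return [part for part in RECORD_BREAK.split(text) if part != ""]


with_newlines = "a" + US + "b" + RS + "\n" + "c" + US + "d" + RS + "\n"
without_newlines = "a" + US + "b" + RS + "c" + US + "d" + RS

# The naive approach: assume RS is always followed by a newline.
naive = without_newlines.split(RS + "\n")

tolerant_a = split_records(with_newlines)
tolerant_b = split_records(without_newlines)
```

The naive split returns the whole input as a single chunk when the newlines are absent, while the tolerant version yields the same two records either way.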
        
               | shawnz wrote:
               | The escape symbol lets you ignore any non-special
               | character, not just newlines:
               | https://github.com/sixarm/usv?tab=readme-ov-file#escape-
               | esc
        
               | vidarh wrote:
               | I'm aware. I don't think that serves a useful purpose - I
               | think the way they've done it is likely to make people
               | more likely to get the parsing wrong for pretty much zero
               | benefit. My guess is you'll end up seeing a lot of
               | "pseudo-USV" parsers the same way we have a ton of
               | "pseudo-CSV" parsers that break on escapes or quoted
               | strings with commas, and so I think they fundamentally
               | failed to learn the lessons of CSV.
        
               | theamk wrote:
               | that's a lie as far as I can see, the csv-to-usv tool
               | does not add any newlines:
               | 
               | [0] https://github.com/SixArm/csv-to-usv-rust-
               | crate/blob/30a0324...
        
               | bryanlarsen wrote:
               | The examples here have them:
               | https://github.com/SixArm/usv/tree/main/examples
        
               | theamk wrote:
               | the submitted tool does not produce them, check out
               | the tests - note there is no \n anywhere in USV
               | 
               | https://github.com/SixArm/csv-to-usv-rust-
               | crate/blob/30a0324...
        
             | NoMoreNicksLeft wrote:
             | If you're doing spreadsheets, then it should show in a
             | spreadsheet and not in an editor. It's like complaining
             | that he can't edit jpegs in Sublime or something... there's
             | a reason that's working poorly.
             | 
             | Speaking of which, last time I had a control code heavy
             | file open in Sublime, it actually did show the control
             | codes as special characters, and it was possible to
             | copy/paste those. This proposal is so bad I suspect it will
             | become a standard.
        
               | dimask wrote:
               | There are a lot of cases where I would rather
               | inspect/quickfix a csv file in a text editor rather than
               | open it as a spreadsheet. Especially cases where
               | something is wrong in the format, and it will just not
               | open as a spreadsheet at all. Adding unnecessary levels
               | of obfuscation to your data should never be considered a
               | good idea imo.
        
             | eadmund wrote:
             | The ASCII separators are visible in my editor. If something
             | doesn't support ASCII text, that sounds like a bug which
             | should be fixed, not a reason to misuse graphical
             | characters for something other than their purpose.
        
           | SigmundurM wrote:
           | They cover the reasoning for using the control picture
           | characters instead of the control characters in the FAQ:
           | 
           | "We tried using the control characters, and also tried
           | configuring various editors to show the control characters by
           | rendering the control picture characters.
           | 
           | First, we encountered many difficulties with editor
           | configurations, attempting to make each editor treat the
           | invisible zero-width characters by rendering with the visible
           | letter-width characters.
           | 
           | Second, we encountered problems with copy/paste
           | functionality, where it often didn't work because the editor
           | implementations and terminal implementations copied visible
           | letter-width characters, not the underlying invisible zero-
           | width characters.
           | 
           | Third, users were unable to distinguish between the rendered
           | control picture characters (e.g. the editor saw ASCII 31 and
           | rendered Unicode Unit Separator) versus the control picture
           | characters being in the data content (e.g. someone actually
           | typed Unicode Unit Separator into the data content)."
           | 
           | - https://github.com/SixArm/usv/tree/main/doc/faq#why-use-
           | cont...
        
             | nostrademons wrote:
             | https://xkcd.com/927/
        
               | pie_flavor wrote:
               | 'Too many competing standards' is not one of the quoted
               | reasons.
        
               | ascorbic wrote:
               | https://github.com/SixArm/usv/blob/main/doc/criticisms/in
               | dex...
        
             | vidarh wrote:
             | I can't read those characters at the size I can/prefer to
             | read the text at, so I need the tooling to support and
             | render these differently anyway... This feels like solving
             | the wrong problem in a way that will still end up with the
             | same amount of work.
        
           | ape4 wrote:
           | An issue with CSV is that commas need to be escaped. Are the
           | U+241F characters escaped in this USV format?
        
           | hermitcrab wrote:
           | I don't see any real advantage over using ASCII unit and
           | record separators (.asv).
           | 
           | Also I am not convinced about the need for an escape
           | character. If you really need to use ASCII unit or record
           | separators as data - tough, use a different format.
           | 
           | If only editors would display the ASCII unit separator
           | (Notepad++ does) and treat the ASCII record separator as a
           | carriage return (Notepad++ doesn't) then .asv format would be a huge
           | improvement on CSV.
        
           | skirmish wrote:
           | 'comprised of' is standard verbiage in US patents, and I am
           | guessing he is trying to sound formal and official. Also see
           | https://en.wikipedia.org/wiki/Comprised_of .
        
       | evrimoztamur wrote:
       | First time hearing about USV, nifty! However, I think the
       | adoptability challenge remains here to be Excel support (very
       | tough).
        
         | croes wrote:
         | Excel can't even handle CSV correctly without using the import
         | function.
        
           | ahofmann wrote:
           | Well, CSV would be much harder to import, than something like
           | USV, because the delimiters are well-defined in USV and there
           | is no need for quoting strings.
        
             | croes wrote:
             | How to put a USV example into one column of a USV without
             | qualifier?
        
       | yewenjie wrote:
       | I'm still confused whether this is a joke or not.
        
         | jefftk wrote:
         | I don't think it's a joke; at https://github.com/sixarm/usv
         | they discuss how they're working on IANA standardization.
        
         | romeoblade wrote:
         | Apparently it is not. They have submitted it to the ietf. I
         | will have to watch closely to see if librecalc/excel and
         | languages/libraries adopt support. Seems like it does solve
         | some common problems with CSV.
         | 
         | https://www.ietf.org/archive/id/draft-unicode-separated-valu...
         | 
         | https://datatracker.ietf.org/doc/draft-unicode-separated-val...
        
         | usrusr wrote:
         | I certainly hope that anyone proposing a Unicode CSV variant as
         | a joke would pick some raised hand emoji as the separator and
         | the victory gesture (0xe011, also popular as an approximation
         | of what an air quote emoji would look like) as the quote
         | character.
         | 
         | But we already keep stumbling over missing support for the on-
         | demand quote character even with separators like comma and tab,
         | using more exotic characters as the separator will only make it
         | worse. The value of less escaping is negative.
        
         | knallfrosch wrote:
         | Completely unreadable. Then again, Germans know the pain of
         | decimal points.
         | 
         | We write 3.000,00 for exactly three thousand, instead of
         | 3,000.00
         | 
         | Now imagine how often parsing breaks.
        
           | alwyn wrote:
           | In my head 3.000,00 is correct and I always get confused
           | because it seems most(?) people use the other method.
        
           | Ekaros wrote:
           | Finland uses 3 000,00 which is also kinda pain to parse.
           | 
           | I think rarely used ' to group thousands is actually most
           | sensible solution.
        
             | euroderf wrote:
             | And now and then you encounter a web form in the .fi domain
             | that rejects "," and expects ".", but does not tell you
             | that that is the reason for rejecting your input. The web
             | "designers" that deploy such crap in .fi should be sent to
             | Siberia.
        
       | michaelmior wrote:
       | If I understand the API correctly from my brief glance, the crate
       | returns a triply-nested vector with the outermost vector being
       | the equivalent of CSV rows, then CSV columns, then "units" which
       | don't have a direct CSV equivalent. It would be helpful if there
       | was an API method that returned results without this final level
       | of nesting, perhaps panicking if there is more than one unit.
       | This would make it easier to deal with the common case (in CSV at
       | least) where each column only has a single value.
        
         | hiccuphippo wrote:
         | I think the units are the csv fields, records are rows, groups
         | would be multiple CSV files (or multiple sheets in an excel
         | file) and file separator... a zip with multiple CSV files? (or
         | multiple excel files).
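
That correspondence (file > group > record > unit) can be sketched as nested lists in Python. This assumes every level ends with its separator, acting as a terminator, and skips escaping; `parse_nested` and `split_terminated` are hypothetical names, not the crate's API.

```python
FS, GS, RS, US = "\u241C", "\u241D", "\u241E", "\u241F"


def split_terminated(text, sep):
    """Split on sep, treating a trailing sep as a terminator."""
    parts = text.split(sep)
    if parts and parts[-1] == "":
        parts.pop()
    return parts


def parse_nested(text):
    """Return files -> groups -> records -> units as nested lists."""
    return [
        [
            [split_terminated(record, US)
             for record in split_terminated(group, RS)]
            for group in split_terminated(file_, GS)
        ]
        for file_ in split_terminated(text, FS)
    ]


# One file with two groups: a 2x2 "sheet" and a single-cell "sheet".
sample = (
    "a" + US + "b" + US + RS + "c" + US + "d" + US + RS + GS
    + "e" + US + RS + GS
    + FS
)
tree = parse_nested(sample)
```

In this picture `tree[0]` is the file, `tree[0][0]` is the first group (a sheet), and `tree[0][0][1]` is its second row.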
        
           | michaelmior wrote:
           | My mistake then about the correspondence :)
        
       | eli wrote:
       | Not sure I understand the advantage over ASCII Separated Values
       | (ASV) which use ASCII control characters 0x1E and 0x1F
        
         | p_l wrote:
         | Surprisingly, they actually did write a FAQ entry on it (I'm
         | honestly surprised):
         | 
         | https://github.com/SixArm/usv/tree/main/doc/comparisons#asci...
        
         | jdeisenberg wrote:
         | Addressed in the FAQ:
         | https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...
         | 
         | Main point: "USV provides typically-visible letter-width
         | characters (such as Unicode 241F), whereas ASV provides
         | typically-invisible zero-width characters (such as ASCII 31)."
        
           | AdamH12113 wrote:
           | USV would have the disadvantage of using multi-byte
           | characters as delimiters, so you have to decode the file in
           | order to separate records. And you still can't type the
           | characters directly or be guaranteed to display them without
           | font support. This honestly seems like cleverness for
           | cleverness's sake.
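
The byte-length difference is easy to check in Python. One caveat, stated as a property of UTF-8 and not of USV itself: UTF-8 is self-synchronizing, so for UTF-8 input a byte-level split on the full three-byte sequence also works without decoding. That would not hold for other encodings such as UTF-16.

```python
US_PICTURE = "\u241f"  # separator used by USV
US_CONTROL = "\x1f"    # separator used by ASV

encoded_picture = US_PICTURE.encode("utf-8")
encoded_control = US_CONTROL.encode("utf-8")

# Splitting raw UTF-8 bytes on the encoded three-byte separator.
raw = ("a" + US_PICTURE + "b" + US_PICTURE + "c").encode("utf-8")
fields = raw.split(encoded_picture)
```

The picture character takes three bytes against one for the control character, so the overhead is two extra bytes per delimiter.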
        
           | eli wrote:
           | Ah fair enough. Of course you _could_ configure your
           | shell/editor/whatever to make control characters visible. Seems
           | like if you were going to edit USV or ASV by hand you'd
           | probably want a customized editor anyway.
        
           | a-priori wrote:
           | The way I would have gone would be to define the standard to
           | support both, such that the two sets of codes MUST be
           | considered semantically equivalent, but that generation tools
           | SHOULD prefer to generate the control codes for new files.
           | 
           | This way people can initially use the visible glyphs while
           | editors don't support the format, and this will always be
           | supported. But, as editors add support and start to generate
           | the files via tools or manually in tabular interfaces where
           | the codes themselves disappear, usage will automatically
           | transition over to the control codes.
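A rough sketch of the "semantically equivalent" idea above: normalize the visible control pictures to the ASCII control codes before splitting. The names `EQUIVALENT` and `parse_records` are made up for illustration, and the sketch ignores escaping, groups, and files entirely.

```python
# Map each visible control picture to the control code it depicts.
EQUIVALENT = str.maketrans({
    "\u241f": "\x1f",  # unit separator
    "\u241e": "\x1e",  # record separator
    "\u241d": "\x1d",  # group separator
    "\u241c": "\x1c",  # file separator
})


def parse_records(text: str) -> list[list[str]]:
    """Parse records and units, accepting either form of separator."""
    normalized = text.translate(EQUIVALENT)
    # Assumption: a trailing record separator is a terminator, so empty
    # chunks are dropped. This also drops genuinely empty records.
    records = [chunk for chunk in normalized.split("\x1e") if chunk]
    return [record.split("\x1f") for record in records]
```

With this, a file that mixes both forms, such as `"a␟b␞c\x1fd\x1e"`, parses to the same two records either way.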
        
           | layer8 wrote:
           | This is so weird, since the purpose of the former characters
           | is displaying the latter characters. If they are actually
           | used for display, then you can't tell which is which.
        
       | jefftk wrote:
       | Description of USV: https://github.com/sixarm/usv
        
       | tambourine_man wrote:
       | ASCII has a field delimiter character. The fact that we chose
       | comma and tabs because a field delimiter character is hard to
       | type or see is one of those things that saddens me in computing.
       | 
       | Imagine the amount of pain that could have been spared if we had
       | done it right from the start some 50 years ago.
        
         | g4zj wrote:
         | Interesting. Are you referring to the unit separator (1F)?
         | 
         | https://www.ascii-code.com/31
        
           | tambourine_man wrote:
           | Yes, we have unit, record, group and file separators. And we
           | chose never to use them.
        
             | g4zj wrote:
             | It seems as though one could easily build a file format far
             | more useful than CSV simply by utilizing these separators,
             | and I'm sure it's been done countless times.
             | 
             | Perhaps this would make an interesting personal project.
             | Are you aware of any hurdles, missing key features, etc.
             | that previous attempts at creating such a format have run
             | into (other than adoption, obviously)?
        
               | hermitcrab wrote:
               | The ASCII unit separator and record separator characters
               | are not well supported by editors. That is why people
               | stick to the (horrible and inconsistent) CSV format.
        
               | tambourine_man wrote:
               | People don't like invisible, hard-to-type characters. They
               | prefer suffering quoting, escaping, escaping quotes and
               | all that fun stuff
        
               | t-3 wrote:
               | Are people actually typing up *SV files by hand? It's
               | trivial to support editing in an IDE and exporting from
               | data-producing applications.
        
               | andyferris wrote:
               | Yes, sometimes, of course. It's a bit like JSON.
               | Sometimes it's easiest to inject a small piece of hand-
               | written data into a test or whatever.
               | 
               | (That said every text editor since ever should have had a
               | "table mode" that uses the ASCII field/record separators
               | (or whatever you choose), I was always confused why this
               | isn't common. Maybe vim and emacs do?)
        
               | EvanAnderson wrote:
               | I've done ETL work with systems that used the ASCII
               | separators. It was very pleasant work. Not having to
               | worry about escaping things (because the ASCII separators
               | weren't permitted to be in valid source data to begin
               | with) was very, very nice.
               | 
               | I'm a Notepad++ person. When I needed to mock-up data
               | typing the characters was easy-- just ALT and the ASCII
               | code on the numeric pad. It took a bit to memorize the
               | codes I needed to use. Their visual representation is
               | just inverse text and initials.
        
             | eirikbakke wrote:
             | Dedicated separator characters don't solve the problem--
             | you'd still need to escape them. Or validate that the data
             | (which may come from untrusted web forms etc.) does not
             | contain them, which means you have another error condition
             | to handle.
        
               | hermitcrab wrote:
               | Or specify that the data can't contain this data. If it
               | does, you have to use a different format. This keeps
               | everything super simple. And how often are ASCII US and
               | RS characters used in data? I don't think I have ever
               | seen one in the wild, apart from in a .asv file.
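The "just forbid the separators in the data" approach can be sketched in a few lines. This is a minimal illustration, not a reference ASV implementation: `dump_asv` and `load_asv` are hypothetical names, and the choice to raise `ValueError` is an assumption about how the extra error condition might be surfaced.

```python
US = "\x1f"  # ASCII unit separator, between fields
RS = "\x1e"  # ASCII record separator, between records


def dump_asv(rows: list[list[str]]) -> str:
    """Serialize rows, rejecting any field that contains a separator."""
    records = []
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError(f"field contains a separator: {field!r}")
        records.append(US.join(row))
    return RS.join(records)


def load_asv(text: str) -> list[list[str]]:
    """Parse text produced by dump_asv; no quoting or escaping is needed."""
    return [record.split(US) for record in text.split(RS)]
```

Commas, quotes, and newlines inside fields pass through untouched, which is the appeal; the cost is the validation step mentioned upthread.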
        
               | g4zj wrote:
               | I'm no expert on character encodings or Unicode itself,
               | but would this be as simple as checking for the byte 1F
               | in the data? Assuming the file is ASCII or UTF-8 encoded
               | (or attempting to confirm this as much as possible as
               | well), it seems like that check would suffice to validate
               | the absence of the code point in the data, but I imagine
               | it's not quite so simple.
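For ASCII or UTF-8 text, a plain byte check is enough as far as I know: in UTF-8, bytes below 0x80 only ever stand for the ASCII character itself, and every byte of a multi-byte sequence is 0x80 or above. A small sketch, with `contains_unit_separator` as a hypothetical helper name:

```python
def contains_unit_separator(raw: bytes) -> bool:
    """Report whether UTF-8 (or ASCII) encoded text contains U+001F.

    Decoding first confirms the input really is valid UTF-8; it raises
    UnicodeDecodeError otherwise. The check does not carry over to
    encodings such as UTF-16, or to arbitrary binary data, where the
    byte 0x1F can be part of some other value.
    """
    raw.decode("utf-8")
    return b"\x1f" in raw
```

Note that the visible USV symbol U+241F encodes as `E2 90 9F`, so it does not trigger this check.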
        
               | rhelz wrote:
               | For text data, it would work fine, but you'd have to do
               | some finagling with binary data; $1F is a perfectly valid
               | byte to have in, say, a 4-byte integer.
        
               | tambourine_man wrote:
               | The "problem" I'm referring to is that we chose a widely
               | used character as a field separator. Of course you still
               | have to write a parser, etc, it's just a lot easier if
               | you choose a dedicated character.
        
               | AdamH12113 wrote:
               | There's an ASCII character for escaping, too, if you need
               | it.
               | 
               | The advantage of ASV is not that you can't have invalid
               | or insecure data, it's that valid data will almost never
               | contain ASCII control characters in the record fields
               | themselves. Commas, quotation marks, and backslashes,
               | meanwhile, are everywhere.
        
             | mechanicalpulse wrote:
             | I often use them in compound keys (e.g., in a flat key
             | space as might be used by a cache or similar simple
             | key/value store). IMHO, they are superior to other common
             | separators like colons, dashes, etc. because they are (1)
             | semantically appropriate and (2) less likely to be present
             | in the constituent pieces of data, especially if the data
             | in question is already limited to a subset of characters
             | that do not include the separators, which it often is
             | (e.g., a URL).
        
               | layer8 wrote:
               | "Less likely" doesn't help if you may get arbitrary
               | (user) input. If you can use a byte sequence as the key,
               | a better strategy is to UTF-8-encode the pieces and use
               | 0xFF as the separator byte, which can never occur in
               | UTF-8.
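A minimal sketch of the 0xFF approach for compound keys. `make_key` and `split_key` are hypothetical names; the only fact relied on is that the byte 0xFF never appears in valid UTF-8.

```python
KEY_SEPARATOR = b"\xff"  # never occurs in valid UTF-8 output


def make_key(*parts: str) -> bytes:
    """Build a compound key from arbitrary strings."""
    return KEY_SEPARATOR.join(part.encode("utf-8") for part in parts)


def split_key(key: bytes) -> list[str]:
    """Recover the original parts from a compound key."""
    return [part.decode("utf-8") for part in key.split(KEY_SEPARATOR)]
```

Even parts that contain colons, dashes, or ASCII control characters round-trip, because none of them can produce the separator byte.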
        
             | TheRealPomax wrote:
             | Because they're zero-width. If you can't see them when you
             | print your data, it's a machine-only separator, which makes
             | it a bad separator for data that humans need to look at and
             | work with.
             | 
             | (Because CSV is a terrible data exchange format in terms of
             | information per byte. But that makes sense, because it's an
             | intentionally _human readable_ data exchange format, not a
             | machine format)
             | 
             | Hence
             | https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...
        
         | atrus wrote:
         | Yeah it's really interesting to me how much of what we use/do
         | is shaped by our input devices. Macropads are a start, but I'd
         | love a keyboard with screens on each key, that's not absurdly
         | expensive and can be layered easily.
        
           | benjijay wrote:
           | Something like the Optimus Maximus?
           | 
           | https://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
           | 
           | (It's been almost 20 years and you still can't get one...)
        
         | NelsonMinar wrote:
         | I've used the ASCII delimiters in a webapp once; Javascript in
         | the browser formatted data with them and sent it to my server
         | via HTTP POSTs. I was a bit nervous that something in the path
         | would break the data but happily it all just worked fine.
        
           | adammarples wrote:
           | Currently saving the day in a data pipeline project which
           | depends on a tool which only exports unescaped csvs. They
           | work very well through the pipeline, Unix split, awk, and
           | then snowflake all support them nicely. One annoying thing is
           | that they are annoying to type and you never quite know if
           | you need to refer to them using octal, hex or what, and what
           | special shell escaping might be used.
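On the octal/hex confusion: the spellings below all denote the same character. This sketch uses Python string escapes only; how a particular shell, awk, or database expects the character to be written is a separate question and not covered here.

```python
# Equivalent spellings of the ASCII unit separator (decimal 31).
UNIT_SEPARATOR = chr(31)

spellings = {
    "hex escape": "\x1f",
    "octal escape": "\037",
    "unicode escape": "\u001f",
    "chr() with octal literal": chr(0o37),
}

# True when every spelling is the same one-character string.
all_same = all(value == UNIT_SEPARATOR for value in spellings.values())

RECORD_SEPARATOR = "\x1e"  # octal \036, decimal 30
```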
        
         | littlestymaar wrote:
         | > ASCII has a field delimiter character.
         | 
         | Where's the key on my keyboard to make one?
         | 
         | The point of text-based formats is that you can edit them in a
         | text editor by hand trivially, if typing the character is
         | nontrivial, then it entirely defeats the point (that's also why
         | USV adds very little value IMHO).
        
           | tambourine_man wrote:
           | What's the key to enter the euro symbol? That means you can't
           | use it in a text editor?
           | 
           | There is no perfect solution, but I'd rather open a text file
           | in a decent editor than having to deal with the escaping hell
           | that is CSV.
           | 
           | They could have chosen the pipe character "|" at least, but
           | the comma is the thousand separator in many languages (number
           | formatting is kind of important for tabular data, if you ask
           | me) and also, you know, general prose.
        
             | MaBu wrote:
             | >What's the key to enter the euro symbol? That means you
             | can't use it in a text editor?
             | 
             | Alt gr+E? Like it's shown on the keyboard.
        
               | tambourine_man wrote:
               | Not on a US keyboard layout. The point is that we insert
               | characters that aren't written on the keyboard keys with
               | some regularity, like ©, ®, ™, etc
        
               | couchand wrote:
               | Well some of us do. There's this interesting effect where
               | many people perceive the limitations on their current
               | tools to be equivalent to limitations on their abstract
               | abilities. If they don't know how to do it, it's
               | impossible.
        
               | shawnz wrote:
               | I think that's exactly the point that the parent poster
               | is trying to make by example? Just because we don't have
               | good tooling today for using ASCII delimiter characters,
               | doesn't mean it's impossible -- just like typing the euro
               | symbol on an american keyboard
        
               | couchand wrote:
               | Oh yes certainly. And I think that when you're deep into
               | creation it can be really really hard to remember that
               | experience, and so recently I'm trying to find ways to
               | help pull back the curtain for folks.
        
               | littlestymaar wrote:
               | It doesn't mean it's impossible, but it's definitely
               | cumbersome. Any non-English speaker who has had to type in
               | their native language from an american keyboard can tell
               | you.
        
             | littlestymaar wrote:
             | > What's the key to enter the euro symbol?
             | 
             | There's one on French keyboards actually!
             | 
             | And it was there even before we got euro coins in our hands
             | (I know this because I'm still using my first (mechanical)
             | keyboard that I got with my first own PC in 2001: and there
             | is a "€" symbol on it)
        
               | wiml wrote:
               | There's also the generic currency symbol, ¤, which I
               | think is on some keyboard layouts pre-Euro.
        
               | andyferris wrote:
               | Ooh is that what that is? TIL
        
               | mongol wrote:
               | That is an underappreciated gem. It should find more use!
        
           | zzo38computer wrote:
           | Control underscore is the unit separator character. (Some
           | editors may require you to escape that character, though.)
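The Control-underscore mapping follows the usual terminal convention that Ctrl plus a key sends the key's code with the upper bits cleared. A small sketch of that arithmetic; `ctrl` is a hypothetical helper, and whether a given editor or terminal passes the character through is not guaranteed.

```python
def ctrl(key: str) -> str:
    """Return the control character conventionally sent for Ctrl+key."""
    return chr(ord(key) & 0x1F)


# The four ASCII separators and the keys that conventionally produce them.
separators = {
    "FS": ctrl("\\"),  # file separator, 0x1C
    "GS": ctrl("]"),   # group separator, 0x1D
    "RS": ctrl("^"),   # record separator, 0x1E
    "US": ctrl("_"),   # unit separator, 0x1F
}
```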
        
         | eirikbakke wrote:
         | The great thing about comma as a field separator is (1) the
         | character is visible and (2) the character is common, so if
         | there are escaping bugs in either the generator or the parser,
         | they will quickly become apparent. Much better to fail fast
         | than having a parse error at line 28357283 because a more
         | uncommon separator character somehow still made its way into
         | the data.
        
           | tambourine_man wrote:
           | We have editors that can work with invisible characters. It's
           | not hard. I do that all the time in Vim with tabs and CR/LF
           | anyway.
           | 
           | Unfortunately that ship has sailed. We have standards for
           | escaping commas, escaping quotes, it's escaping all the way
           | down
        
       | remram wrote:
       | Why not use parquet at this point? (or a row-oriented equivalent
       | like Avro or SQLite)
       | 
       | If you don't have a human-readable file, might as well be
       | compressible, queriable, and metadata-enabled I think.
        
       | codeulike wrote:
       | CSV is like an invasive plant species, or perhaps a curse; you're
       | never going to be able to root it out even though there are a
       | billion better data formats.
        
         | croes wrote:
         | For its use case a good and simple format with just three
         | simple rules and three special purpose characters.
        
           | codeulike wrote:
           | True but there's so much scope for people to do naive
           | implementations with join() or split() functions and then you
           | end up with nothing escaped properly and a big mess
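To make the join()/split() failure concrete, here is a short comparison using Python's standard `csv` module. The sample row is invented for illustration.

```python
import csv
import io

row = ["Smith, John", 'He said "hi"', "42"]

# Naive approach: the comma inside the first field becomes a delimiter.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")  # four fields instead of three

# Using the csv module: fields are quoted and quotes are doubled as needed.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
encoded = buffer.getvalue()
decoded = next(csv.reader(io.StringIO(encoded)))
```

The naive round trip silently changes the column count, while the `csv` round trip returns the original three fields.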
        
         | HideousKojima wrote:
         | CSV can be manually read/edited by non-technical/non-developer
         | humans using commonly available tools like Excel and Notepad.
         | Not many of the better data formats meet those criteria.
        
           | otherme123 wrote:
           | Notepad, I agree. Excel... not so much: it tends to change
           | data silently unless you are very cautious with your
           | environment (e.g. dates transformed to number of days since
           | 1900, and some strings to dates)
        
             | HideousKojima wrote:
             | Actually Excel finally added a "stop &$*@ing up my data"
             | option recently:
             | https://mashable.com/article/microsoft-excel-disable-setting...
        
               | otherme123 wrote:
               | That helps, no doubt. But last week one of my coworkers
               | touched a CSV with Excel, and all dates went from ISO8601
               | to MDY. We are based in Europe (i.e. we use DMY at
               | minimum). In my experience, a CSV touched by Excel is not
               | trustworthy for further analysis.
        
       | teddyh wrote:
       | This is needlessly adding yet another standard[1] to the mix. If
       | you are in a position to choose what standard you use, just use:
       | 
       | * Whatever is best for the data model and/or languages you use.
       | JSON is a common modern choice, suitable for most things.
       | 
       | * If you want something more tabular, closer to CSV (which is a
       | valid choice for bulk data), use strict RFC 4180 compliant data.
       | 
       | * If you want to specify your own binary super-compact data, use
       | ASN.1. I am also given to understand that Protobuf is a popular
       | modern choice.
       | 
       | If you _aren't_ in a position to choose your standards, just do
       | whatever you need to do to parse whatever junk you are given, and
       | emit as standards-compliant data as possible as output; again,
       | RFC 4180 is a great way to standardize your own CSV output, as
       | long as you stick to a subset which the receiving party can
       | parse.
       | 
       | Nobody needs "USV", and nobody should use it.
       | 
       | 1. <https://xkcd.com/927/>
        
       | ochrist wrote:
       | If you live in a place where comma is the decimal separator, your
       | CSV files will often use semicolon as the separator instead of
       | comma. Will this tool cater for that?
        
         | wodenokoto wrote:
         | What do you mean cater to that? The point is you separate with
         | a value that is not used within the fields. So decimal your
         | numbers however you want.
        
           | greenshackle2 wrote:
           | This is (nominally) a discussion about the csv-to-usv tool.
           | They are asking if the csv-to-usv tool also accepts semi-
           | colon delimited files as input.
           | 
           | Have you maybe lost track of what post you're commenting
           | under?
           | 
           | (I believe the answer is no BTW, the tool only supports , as
           | delimiter in its input.)
        
             | ochrist wrote:
             | Yes, this. Thank you.
             | 
             | If I work with CSV files they are most often not comma-
             | separated but semicolon-separated because of the numbers.
             | An Excel installation localized for decimal comma would not
             | read 'real' CSV files correctly.
             | 
             | If csv-to-usv cannot cater for this type of CSV files, it
             | would not be usable in a large part of the world.
        
               | greenshackle2 wrote:
               | Yeah they should add it. The tool is like 20 lines of
               | Rust code. It's a thin wrapper around the csv Rust crate,
               | which does support specifying alternative delimiters.
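For illustration only, here is roughly what semicolon support could look like, written with Python's `csv` module rather than the Rust crate the tool actually wraps. `semicolon_csv_to_usv` is a hypothetical function, not the csv-to-usv tool's API. The output layout (units joined by U+241F, each record terminated by U+241E, no escaping) is an assumption made for this sketch and is not taken from the USV spec.

```python
import csv
import io

US = "\u241f"  # Symbol for Unit Separator
RS = "\u241e"  # Symbol for Record Separator


def semicolon_csv_to_usv(text: str) -> str:
    """Convert semicolon-delimited CSV text to a USV-like string.

    Assumption: units are joined by US and every record ends with RS.
    Fields that already contain US or RS are not escaped in this sketch.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    return "".join(US.join(row) + RS for row in reader)
```

A decimal-comma value such as `3,50` survives unchanged because the comma is no longer a delimiter on input.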
        
       | code-faster wrote:
       | CSV is great because excel can import it, but it can't import
       | USV, so at that point, why use USV when you can use JSON?
       | 
       | https://github.com/tyleradams/json-toolkit/
        
         | hiccuphippo wrote:
         | Maybe their objective in submitting to the IETF is to get
         | programs like Excel to start supporting it.
        
           | layer8 wrote:
           | That's... not how Excel/Microsoft works.
        
         | extraduder_ire wrote:
         | Can you not customize the separators used when importing csv-
         | likes into excel? Libreoffice has a neat little window for it
         | that even shows a preview of what values go into which cells.
        
           | da_chicken wrote:
           | Sure if you want to stop and fiddle with Excel.
           | 
           | If you want to just double click and get to work, no.
        
       | forgetfulness wrote:
       | Seems complex enough that you'd only manipulate files in this
       | format by serializing through a tool, and by then it's competing
       | with established binary formats rather than CSV.
        
       | jonathaneunice wrote:
       | Fascinated this uses the Unicode glyphs / symbols for unit and
       | record separator rather than the unit and record separators
       | themselves (ASCII US and RS).
       | 
       | Perfect deployment of David Wheeler's aphorism:
       | 
       | > All problems in computer science can be solved by adding
       | another level of indirection.
       | 
       | https://en.wikipedia.org/wiki/David_Wheeler_(computer_scient...
        
         | ale42 wrote:
         | Indeed... I didn't read the standard in detail to check whether
         | escaping is allowed/taken into account, but what if my data
         | contains those symbols? I mean, they are perfectly legal
         | Unicode printable characters, unlike the ASCII ones.
        
           | BugsJustFindMe wrote:
           | There's an escape.
        
             | theamk wrote:
             | I thought the point is you don't need escapes?
             | 
             | If you still need to implement escape mechanism, might as
             | well do CSV/TSV.
        
               | BugsJustFindMe wrote:
               | The point is ASCII DSV, which gives innately better
               | hierarchy than CSV, but with visible tokens and stream
               | accommodation. You should read the github readme. It's
               | not that long.
               | 
               | https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...
               | 
               | As for still needing escapes, using obscure symbols
               | instead of ones that are extremely common in writing
               | inherently means needing far far faaaaaaar fewer of them.
        
               | theamk wrote:
               | What's the point of visible tokens if it's all squished
               | in one line? You are not going to be editing this in
               | regular editor once you have non-trivial amount of data.
               | 
               | And yes, I read README and source code, so I know that
               | newlines are optional, existing tools don't generate
               | them, and multi-line examples are basically fake.
        
               | BugsJustFindMe wrote:
               | > _What's the point of visible tokens if it's all
               | squished in one line?_
               | 
               | It doesn't have to be all squished in one line, it just
               | doesn't hurt anything. Visually splitting squished lines
               | for presentation or perusal is trivial because of the
               | record separator.
               | 
               | > _You are not going to be editing this in regular
               | editor_
               | 
               | I know (or at least I think) that you meant this in
               | relation to squished lines getting very long, but maybe
               | we can talk about it in a broader context, since record
               | splitting is trivial...
               | 
               | One could easily say these same words about documents
               | written in right-to-left languages. But people in Israel
               | manage to create files too somehow, so that's clearly not
               | an insurmountable barrier.
        
               | couchand wrote:
               | Editors generally support composing right-to-left
               | languages that way? So I suppose the metaphor suggests
               | that all editors should directly support the visible
               | glyphs semantically?
               | 
               | And yet, that's explicitly not the semantic purpose of
               | those glyphs. The actual delimiters already exist at a
               | lower code point. If we're asking editors to semantically
               | support delimiters we should be asking them to support
               | the semantic delimiters.
        
           | 6510 wrote:
           | I once attempted to write a blog post about escaping
           | stuff in rss feeds, while technically correct nothing could
           | parse the rss feed for the blog.
        
         | red_admiral wrote:
         | Indeed, if the result is to be encoded with UTF-8, using 1-byte
         | separators vs the multi-byte encoding of (241F) would make
         | sense to me.
         | 
         | I'd also prefer if escapes were done in the "traditional"
         | manner of, for example, "\t" for a tab because you can then
         | read in stuff with something like
         | input.split("\t").map(unescape); you know any actual tab
         | character in the input is a field separator, and then you can
         | go through the fields to put back the escaped ones.
        
           | eadmund wrote:
           | > you can then read in stuff with something like
           | input.split("\t").map(unescape)
           | 
           | What about input lines like 'asdf\\\thjkl\tzxcvb'? That
           | should be two fields, one the string 'asdf\thjkl' and the
           | other the string 'zxcvb.'
           | 
           | I think that your way is a bit like trying to match context-
           | free grammars with a regular expression. The right way is to
           | parse the input character by character.
        
             | cxr wrote:
             | Although matching up nested pairs of brackets requires
             | something at least as powerful as a pushdown automaton (CFG
             | matcher), discriminating between an arbitrary number of
             | escaped backslashes followed by an unescaped 't' versus an
             | arbitrary number of escaped backslashes followed by the
             | '\t' escape sequence doesn't require anything more powerful
             | than a finite state machine.
        
             | qzzi wrote:
             | > you know any actual tab character in the input is a field
             | separator, and then you can go through the fields to put
             | back the escaped ones
             | 
             | The "\t" in "split" is not a "slash-tee" but an actual tab
             | character and then escape sequences in fields are handled
             | by the "unescape" function.
        
             | fiddlerwoaroof wrote:
             | I think the suggestion is that the field separator is an
             | actual tab character (ascii code 9) but tabs inside the
             | field are `\t`. So, splitting on the tab character always
             | works because fields cannot contain ascii code 9 but must
             | use the two character escape instead.
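The split-then-unescape scheme described in this subthread can be sketched as follows. The escape set (`\\`, `\t`, `\n`) and the function names are assumptions for illustration. The key detail is that `unescape` scans left to right, so an escaped backslash followed by a literal `t` is not mistaken for an escaped tab; a single regular expression pass is enough for that.

```python
import re

# Assumed escape set: backslash, tab, newline.
_ESCAPES = {"\\": "\\", "t": "\t", "n": "\n"}


def escape(field: str) -> str:
    """Escape backslashes first, then raw tabs and newlines."""
    return (field.replace("\\", "\\\\")
                 .replace("\t", "\\t")
                 .replace("\n", "\\n"))


def unescape(field: str) -> str:
    """Undo escape() in one left-to-right pass over backslash pairs."""
    return re.sub(r"\\(.)",
                  lambda m: _ESCAPES.get(m.group(1), m.group(0)),
                  field)


def parse_line(line: str) -> list[str]:
    """Any raw tab is a field separator; escapes are resolved afterwards."""
    return [unescape(field) for field in line.split("\t")]
```

For the tricky input from upthread (three backslashes, then `t`, with a real tab before `zxcvb`), this yields two fields, the first containing one backslash followed by a tab.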
        
         | ajdude wrote:
         | This makes me sad; such a missed opportunity.
        
         | default-kramer wrote:
         | Two links away is the answer:
         | https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
        
           | divbzero wrote:
           | The answer makes sense to me, but I wish we could fix editors
           | to properly handle the ASCII separators (1C, 1D, 1E, 1F)
           | instead of resorting to Unicode control picture characters
           | (241C, 241D, 241E, 241F).
           | 
           | Maybe if editors are fixed up we could adopt ASCII Separated
           | Values (ASV) as the new standard.
        
           | marwis wrote:
           | Why not combine zero width character with visible character,
           | i.e. use 2 characters for separators?
           | 
           | ,<FS> for fields \n<RS> for records
           | 
           | This removes ambiguity in parsing and remains user readable.
           | It's also relatively easy to auto-fix files edited by users
           | in normal editors.
           | 
           | It also mostly removes need for escaping.
           | 
           | It's also smaller or same size as unicode multibyte
           | characters (haven't checked).
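On the size question: a quick check suggests the two-character separators are smaller in UTF-8 than a single control picture. The sketch also shows a parser for the paired form. It assumes the field marker is the ASCII unit separator (0x1F), since "<FS>" above could be read either way, and `parse_paired` is a hypothetical name.

```python
FIELD_SEP = ",\x1f"    # visible comma plus invisible unit separator (assumed)
RECORD_SEP = "\n\x1e"  # newline plus invisible record separator

paired_size = len(FIELD_SEP.encode("utf-8"))  # bytes per field separator
picture_size = len("\u241f".encode("utf-8"))  # bytes per USV control picture


def parse_paired(text: str) -> list[list[str]]:
    """Split on the two-character separators; bare commas stay in the data."""
    records = [chunk for chunk in text.split(RECORD_SEP) if chunk]
    return [record.split(FIELD_SEP) for record in records]
```

A comma or newline typed inside a field is not followed by the control character, so it is left alone.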
        
         | HL33tibCe7 wrote:
         | Perfect deployment of HL33tibCe7's aphorism:
         | 
         | > For every interesting HN post, there's at least one smug
         | commenter who thinks he knows better, but actually doesn't
         | 
         | https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
        
           | ok_dad wrote:
           | The OP was probably assuming no human would want to actually
           | read a CSV raw, and so was probably correct from their POV.
           | Your POV is probably from someone who reads CSVs raw. You
           | don't have to be so rude about it, you're being even more
           | smug than the OP, probably.
        
             | groby_b wrote:
             | One of the two likely works with CSVs for a living, and
             | it's definitely not the person suggesting "What if it just
             | was hard to eyeball/edit".
             | 
             | If you don't understand why something is the way it is, it
             | might be better to start with a question than with a
             | statement implying the tech misses existing tech.
             | Chesterton's fence still applies, and ignoring it means
             | you're outsourcing your work to others. RTFM is a perfectly
             | valid answer at that point.
        
               | ok_dad wrote:
               | I use CSVs for a living but I rarely read them manually.
               | I'd rather have ASCII than Unicode in my CSVs.
               | 
               | My point above, though, is that everyone has opinions and
               | you don't have to be a dickhead about "correcting" them.
        
         | 1vuio0pswjnm7 wrote:
         | (For text processing, I use octal \034 all the time.)
         | 
         | Perhaps there is a software developer version of "Needs more
         | cowbell" called "Needs more complexity"
         | 
         | Computer languages generally use the Latin alphabet. And even
         | in a case like APL, which some HN commenters call
         | "hieroglyphics", the number of symbols is limited and each is
         | precisely defined (cf. "emojis" that are open to
         | interpretation).
        
           | Tijdreiziger wrote:
           | Well, yeah, not every language uses the Latin alphabet.
        
       | _obviously wrote:
       | Unicode is Turing complete which makes it an attack vector.
        
         | hermitcrab wrote:
         | It is a set of glyphs and their encodings. How is that 'Turing
         | complete'?
        
           | zzo38computer wrote:
           | Unicode involves more than the set of glyphs and their
           | encodings; it also involves properties, etc. However, it can
           | be an attack vector even ignoring that stuff; it does not
           | have to be Turing-complete to be an attack vector. But, the
           | specific kind of attacks depends on the application.
           | 
           | Different kind of character sets and character encodings will
           | be good for different purposes. Unicode is "equally bad" for
           | many uses.
        
       | otabdeveloper4 wrote:
       | Absolutely terrible documentation. The RFC doesn't even explain
       | the purpose of the "End of Transmission Block" token.
        
       | bombledmonk wrote:
       | I've actually been employing Emoji Separated Values (ESV), often
       | , here and there when doing some of this kind of work. Granted,
       | it's not standard, but it's been really useful when I've needed
       | it.
       | 
       | *edit Apparently emojis don't fly here, but it was an index
       | finger pointing right.
        
         | gausswho wrote:
         | My delimiter over several projects over the years has been:
         | 
         | Only a matter of time before something breaks catastrophically
         | but it hasn't happened yet.
        
         | philipwhiuk wrote:
         | The benefit of this is that you can use different emojis to
         | denote content type.
         | 
         | e.g. if it's a frowny face you know it's an invoice.
        
       | pquki4 wrote:
       | The usv github repository says it is "the standard for data
       | markup of ...", has 66 stars, and is _currently_ applying for
       | "text/usv" MIME type. That's all about it.
       | 
       | Maybe I'll consider it when it does not belong to a company, has
       | two more zeros in the number of stars, and has RFC/ISO attached
       | to it. Because right now it is not much more of a "standard" than
       | a hobby project I create on a whim.
        
         | netsharc wrote:
         | Yeah, I can't imagine the ego one needs to basically go "Hey
         | everyone, I've invented a new standard!"...
        
           | renewiltord wrote:
           | About the most annoying thing about the modern Internet is
           | this kind of chip-on-the-shoulder comment about "oh he has
           | such a big ego" and nonsense like that.
           | 
           | Man, I preferred it when people could just write up and
           | propose things. The insufferable "is that professional?",
           | "What about consensus?", "Wow the ego to propose something".
           | 
           | Time to return to monke.
        
           | FrustratedMonky wrote:
           | What a Pedantic take on what constitutes a 'standard'.
        
         | paulddraper wrote:
         | We'll be waiting with bated breath
        
       | MrOxiMoron wrote:
       | /me looks at calendar, nope not April 1st yet.
        
       | vidarh wrote:
       | Their examples if anything convinced me not to use this for a
       | long time.
       | 
       | I need to zoom to be able to tell these apart, so I'll need
       | editor support for it to be convenient to work with these anyway.
       | And then clicking through to the comparisons, it demonstrates the
       | difference _existing support for CSV "everywhere"_ makes - Github
       | renders the CSV examples nicely as tables, while again I need to
       | zoom in to see which separator is which for USV.
       | 
       | Maybe once there is widespread editor support. But if you need
       | editor support for it to be comfortable anyway, then the main
       | benefit vs. using the old-school actual separator characters goes
       | out the window.
        
         | strunz wrote:
         | csvkit makes displaying CSV in a terminal trivial and has all
         | the tools to manipulate/filter data I've ever needed -
         | https://csvkit.readthedocs.io/en/latest/
         | 
         | I don't really get this project at all.
        
           | vidarh wrote:
           | The thing is, while I'll probably just stick with CSV too,
           | I'm sympathetic to the intent, but given I expect it'll need
           | tooling anyway I'm less sympathetic to them not picking the
           | existing separator.
           | 
           | I also think there are failed lessons here that reduce the
           | incentive for switching.
           | 
           | E.g. If you're going to improve on CSV, a key improvement
           | would be to aim to make the format trivially splittable,
           | because the lesson from CSV is that when a format _looks this
           | trivial_ people will assume they can just split on a fixed
           | string or trivial regex, and so the more you can reduce the
           | harm of that the better.
           | 
           | As such, I'd avoid most of the escaping they show,
           | _especially for line endings_ , and just make RS '\n' the
           | record separator, or possibly RS '\n'*. Optionally do the
           | same for US. Require escaping LF immediately after RS/US, and
           | _only_ allow escaping RS, so unescaping can be done with a
           | trivial fixed replace per field if you have a reason to
           | assume your data might have leading linefeeds in fields - a
           | lot of apps will get away with just ignoring that.
           | 
           | Then parsing is reduced to something like
           | `data.split(RS).map{|row| row.split(US).map{|col|
           | col.gsub(ESCAPE,"\n") } }` (assuming RS, US, and ESCAPE are
           | regexps that include the optional trailing linefeeds and
           | escapes leading linefeeds respectively). Being able to copy a
           | correct one-liner from Stackoverflow ought to avoid most of
           | the problems with broken CSV/TSV parsing.
           | 
           | I'm also not convinced adding GS, FS, ETB is a good idea,
           | partly for that reason, partly because a lot of the tools
           | people will want to load data into will not handle more than
           | one set of records, and so you'll end up splitting files
           | anyway, in which case I'd just use a proper archive format...
           | Those characters feel like they're trying to do too much
           | given they're "competing" primarily with CSV/TSV.
           | 
           | Their spec also needs to talk about encoding, because unless
           | I've missed something, they only talk about codepoints, and
           | they're likely to e.g. get people splitting on the UTF8
           | sequence etc. This to me is another reason for using the
           | ASCII values - they encode the same in ASCII-based character
           | sets and UTF8, and so it feels likely to be more robust
           | against the horrors of people doing naive split-based
           | parsing.
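A Python sketch of the split-based parsing proposed above, as an alternative to the Ruby one-liner. It implements a simplified variant, not the USV spec: separators sit between items, each separator may be followed by one cosmetic linefeed, and the only escape is ESC in front of a linefeed that really is the first character of a field. It also assumes fields never contain the separator symbols themselves. The names `parse`, `dump`, and `unescape` are hypothetical.

```python
import re

US = "\u241f"   # U+241F SYMBOL FOR UNIT SEPARATOR
RS = "\u241e"   # U+241E SYMBOL FOR RECORD SEPARATOR
ESC = "\u241b"  # U+241B SYMBOL FOR ESCAPE

# A separator may be followed by one cosmetic linefeed.
RS_SPLIT = re.compile(RS + "\n?")
US_SPLIT = re.compile(US + "\n?")


def unescape(field):
    # ESC + LF at the start of a field stands for a real leading linefeed.
    if field.startswith(ESC + "\n"):
        return field[1:]
    return field


def parse(data):
    rows = RS_SPLIT.split(data)
    if rows and rows[-1] == "":
        rows.pop()  # drop the empty tail after the final RS
    return [[unescape(col) for col in US_SPLIT.split(row)] for row in rows]


def dump(rows):
    def escape(field):
        return ESC + field if field.startswith("\n") else field
    # One record per line: units joined by US, record ended by RS + LF.
    return "".join(US.join(escape(c) for c in row) + RS + "\n" for row in rows)
```

Because the optional linefeed is part of the split pattern, a file written one record per line and a file written with no linefeeds at all parse to the same rows.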
        
             | strunz wrote:
             | CSV isn't even restricted to comma as the separator. You
             | can use any character you like (pipe | is a common one) and
             | csvkit will happily still work with a simple CLI flag. Pretty
             | much all Unix tools have a similar flag. I've always been
             | able to find an ASCII character that my data doesn't use,
             | though maybe there are exceptions I haven't hit.
        
           | bdzr wrote:
           | I love csvkit, particularly csvstat. I just wish it were
           | quicker on larger files. The types I deal with routinely take
           | 5-20 minutes to run and those are usually the ones I want the
           | csvstat output for the most.
        
         | pie_flavor wrote:
         | It's all down to font differences. You would use the file with
         | a font that uses larger letters diagonally, for control
         | pictures, instead of tiny letters horizontally. And the main
         | benefit isn't anything to do with the editor, I have no idea
         | what you meant by that. The main benefit is that commas show up
         | a lot more often in normal text than control pictures do.
        
           | vidarh wrote:
           | There's no space for larger letters diagonally unless I waste
           | screen estate by increasing the font size, which I
           | categorically will not do. So I'd need to replace a font I'm
           | happy with and find one with _other symbols_ that are
           | readable enough. In which case it's _just as easy_ and less
           | invasive for me to adjust my editor to display them using
           | different glyphs. In which case I can just as well do that
           | with the actual ASCII control characters.
           | 
           | The point is that their stated "advantage" does not exist for
           | me. I still need to make changes to my setup to handle them.
           | In which case why should I pick _this_ option? (as you can
           | see elsewhere, especially as this isn't the only issue I
           | have with their format choices).
           | 
           | > And the main benefit isn't anything to do with the editor,
           | I have no idea what you meant by that.
           | 
           | The main benefit _relative to using the actual control
           | characters_ is only the tool support. But this does not work
           | for me without changing how the symbols are displayed anyway.
           | Hence that "advantage" does not actually buy me anything.
        
         | derbOac wrote:
         | I think you're articulating something about this proposal that
         | bothers me.
         | 
         | The thing about the _actual_ separators is that an editor could
         | and should probably display them as they were intended, as data
         | separators. It should be a setting in an editor you control,
         | sort of like how you control tab width and things like that.
         | 
         | Just because a glyph is "invisible" doesn't mean it has to
         | actually be invisible.
         | 
         | The symbols for the separators are hard to read, like you're
         | pointing out, which means someone would eventually replace them
         | with some other graphical display, in which case you were just
         | as well off with the actual separators themselves.
         | 
         | They would have been better off advocating for editor support
         | for actual separator display.
        
       | nilslice wrote:
       | If you would like to run csv-to-usv from 15+ languages (not only
       | rust!) then check out this demo I made, converting the library to
       | an Extism plugin function:
       | https://github.com/extism/extism-csv-to-usv
       | 
       | Here's a snippet that runs it in your browser:
       | 
       |     // Simple example to run this in your browser! But will work in
       |     // Go, PHP, Ruby, Java, Python, etc...
       |     const extism = await import("https://esm.sh/@extism/extism");
       |     const plugin = await extism.createPlugin(
       |       "https://cdn.modsurfer.dylibso.com/api/v1/module/a28e7322a6fde92cc27344584b5e86c211dbd5a345fe6ec95f1389733c325541.wasm",
       |       { useWasi: false }
       |     );
       |     let out = await plugin.call("csv_to_usv", "a,b,c");
       |     console.log(out.text());
        
         | greenshackle2 wrote:
         | I'm sorry but.. why? The library is a single function
         | consisting of 10 lines of Rust code. And would be about 10 LOCs
         | to re-implement in any language that has native csv libs. It
         | seems a little bit unnecessary to load a WASM runtime for that.
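To illustrate the "about 10 LOCs" claim, here is a rough Python sketch built on the standard `csv` module. It is not the crate's code. The function name `csv_to_usv` is borrowed from the demo above, and the output layout (units joined by U+241F, every record ended by U+241E, no escaping of separators inside the data) is an assumption that may differ from what the real library emits.

```python
import csv
import io

US = "\u241f"  # U+241F SYMBOL FOR UNIT SEPARATOR
RS = "\u241e"  # U+241E SYMBOL FOR RECORD SEPARATOR


def csv_to_usv(text):
    """Convert CSV text to an assumed USV layout (no escaping handled)."""
    # newline="" lets the csv module handle embedded line breaks itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    # Join units with US and end every record with RS.
    return "".join(US.join(row) + RS for row in reader)
```

Quoted CSV fields such as `"x,y"` come out as plain `x,y`, because the comma no longer needs protection once the separator is a different character.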
        
           | theamk wrote:
           | But without WASM, how are you going to get 500ms+ startup
           | time and a 3rd-party server dependency in your critical
           | path?
        
             | philipwhiuk wrote:
             | And two domains you're blindly trusting not to be hijacked.
        
               | nilslice wrote:
               | Sorry do you know what "demo" means?
        
           | nilslice wrote:
           | for sure -- do it!
        
             | greenshackle2 wrote:
             | I'm good I'm just here to chat, not to promote anything ;)
        
         | philipwhiuk wrote:
         | > esm.sh
         | 
         | > cdn.modsurfer.dylibso.com
         | 
         | Do people routinely do this - just run random code from
         | arbitrary endpoints?
         | 
         | Yikes
        
       | SuperHeavy256 wrote:
       | I've long wanted a successor to CSV, but this is kinda stupid.
       | People like CSVs because they look good, feel natural even in
       | plaintext. This is the same reason that Markdown is successful.
       | 
       | As for including commas in your data, it could just have been
       | managed with a simple escape character like a \, for when there's
       | actually a comma in your data. That's it.
        
         | hermitcrab wrote:
         | >As for including commas in your data, it could just have been
         | managed with a simple escape character like a \, for when
         | there's actually a comma in your data. That's it.
         | 
         | Not quite. What if there is a \ in your data? Then you have to
         | escape that.
        
           | lelanthran wrote:
           | > Not quite. What if there is a \ in your data? Then you have
           | to escape that.
           | 
           | No problem, _any_ character following a `\` is a literal
           | character. `\\` => literal `\`. `\,` => literal comma.
           | `\a` => literal `a`, etc.
           | 
           | Parsing this is easy, generating it is easy, and there is
           | only _one rule_ to remember for humans reading or generating
           | it.
           | 
           | Each rule added for parsing is one more added complexity and
           | point of failure.
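A small Python sketch of the single-rule escape described above: whatever follows a backslash is taken literally, and an unescaped comma ends the field. The function names `split_escaped` and `join_escaped` are made up for illustration, and the sketch handles one line at a time, so line breaks inside fields are out of scope.

```python
def split_escaped(line, sep=",", esc="\\"):
    """Split one line into fields, honoring the one escape rule."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == esc:
            # The character after the escape is always literal.
            # A trailing lone escape contributes nothing.
            current.append(next(chars, ""))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def join_escaped(fields, sep=",", esc="\\"):
    """Inverse of split_escaped: escape the escape first, then the separator."""
    def escape(field):
        return field.replace(esc, esc + esc).replace(sep, esc + sep)
    return sep.join(escape(f) for f in fields)
```

Empty fields survive because two adjacent unescaped commas still produce an empty string between them, which is what the doubled-comma proposal further down the thread cannot offer.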
        
           | philipwhiuk wrote:
           | You still have to solve this in USV.
        
         | euroderf wrote:
         | Or two commas in a row can be the escape, without overloading
         | backslash.
        
           | estebank wrote:
           | That wouldn't allow for empty fields.
        
             | euroderf wrote:
             | Four commas in a row.
        
               | greenshackle2 wrote:
               | Can you parse this 2-row CSV:
               | 
               | SomeCommas,MoreCommas,OnlyOneComma,ALotOfCommas
               | 
               | ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
               | ,,,,,,,,,,,,,,
        
               | euroderf wrote:
               | Pathological cases are never difficult to find for any
               | app.
        
               | greenshackle2 wrote:
               | Wrong. Consider the standard backslash escape: represent
               | literal comma as "\,", and literal backslash as "\\".
               | Backslashes are otherwise forbidden.
               | 
               | It will be difficult to find pathological cases for this
               | grammar, because they don't exist.
        
         | blackbeans wrote:
         | I don't see this as a perfect solution, but CSV is not great
         | either. A comma is super common in both text and numbers. Here
         | in Europe we often use commas as decimal separator and use a
         | semicolon as value separator.
         | 
         | As a result spreadsheets almost always fail to automatically
         | parse a CSV.
         | 
         | I do like the idea of having a dedicated separator character,
         | that would work right worldwide. And then just standardize the
         | use of a dot as decimal separator in these files.
        
       | crq-yml wrote:
       | It's sensible in principle:
       | 
       | * Editors will play nicely with the graphical representation. If
       | you need better graphics, it's done with font customization,
       | which everyone already supports.
       | 
       | * It announces that the data is source text, vs transmitted
       | bytes. The type/token distinction is not easy to overcome.
       | 
       | * It sits way out in Unicode's space where a collision is
       | unlikely. The whole reason why CSV-type formats create
       | frustration is because the tooling is ad-hoc, never does the
       | right thing and uses the lower byte spaces where people stuff all
       | kinds of random junk. This is the "fuck it, you get the same
       | treatment as a Youtube video id" kind of solution.
       | 
       | That said, if used, someone will attack it by printing those
       | characters as input.
        
       | evnix wrote:
       | There are some well researched alternatives to CSV,
       | 
       | Off the top of my head, I can highly recommend SML
       | 
       | https://dev.stenway.com/SML/SimpleML.html
       | 
       | I'd recommend watching the 'stop using CSV' video too
       | 
       | https://youtu.be/mGUlW6YgHjE?si=zDG_9Jv8LSy-ttP4
        
       | pimlottc wrote:
       | > Is USV aiming to become a standard?
       | >
       | > Yes and we've submitted the first draft of the USV standard to
       | > the IETF: link.
       | 
       | This is a nice idea, and all, but seems unlikely to become a
       | meaningful standard without some major backing behind that "we".
        
       | justtinker wrote:
       | This is the XKCD comic in action. https://xkcd.com/927/
       | 
       | Someone should write a family of filters of the form CSV2ASV,
       | CSV2USV, CSV2JSON, USV2XML, TOML2USV, USV2Cuneiform.......
        
       | hermitcrab wrote:
       | Alternatives to CSV are also covered at length at:
       | 
       | https://news.ycombinator.com/item?id=31220841
        
       | isoprophlex wrote:
       | This is just ESV files with extra complexity!
       | 
       | ESV: eggplant-separated values. Because who is ever going to put
       | AUBERGINE (U+1F346) into a dataset? It's the perfect record
       | separator!
        
       | difer7 wrote:
       | Does USV support nested fields? While reading the USV GitHub's
       | README I did not clearly understand the purpose of the "group
       | separator"
        
         | philsnow wrote:
         | In the same way that CSV supports fields that contain nested
         | CSV documents: cumbersomely / painfully, with lots of escaping
         | of the delimiter characters.
        
       | tamimio wrote:
       | I am uncertain, but this is likely to reintroduce the issue of
       | Unicode buffer overflow into the mainstream. What are your
       | proposed solutions, considering it is expected to become
       | standardized?
        
       | nayuki wrote:
       | Nope, this isn't a good approach. I prefer tab-separated values
       | (TSV) and use it as much as possible.
        
       | Fileformat wrote:
       | A similar concept that is (IMHO) much nicer: RSV
       | 
       | It doesn't need any escaping or quoting: a field just has to be
       | valid UTF-8.
       | 
       | The trick is that the delimiters are bytes that are invalid
       | UTF-8.
       | 
       | The spec fits on a napkin, parsing is trivial, you can jump to
       | the middle of a doc and find the nearest row, etc.
       | 
       | Main downside is you need an editor/viewer that can handle it.
       | 
       | https://github.com/Stenway/RSV-Specification
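A rough Python sketch of the idea that delimiters can be bytes that never occur in valid UTF-8, so no escaping is needed. The specific byte values (0xFF after each value, 0xFD after each row) are an assumption recalled from memory and should be checked against the linked spec. Null values are left out, and the function names are made up.

```python
VALUE_END = b"\xff"  # assumed value terminator; never valid in UTF-8
ROW_END = b"\xfd"    # assumed row terminator; never valid in UTF-8


def encode_rows(rows):
    """Encode a list of rows (lists of str) into delimiter-terminated bytes."""
    out = bytearray()
    for row in rows:
        for value in row:
            out += value.encode("utf-8") + VALUE_END
        out += ROW_END
    return bytes(out)


def decode_rows(data):
    """Decode bytes produced by encode_rows back into rows of str."""
    rows = []
    # Every row ends with ROW_END, so the piece after the last one is empty.
    for raw_row in data.split(ROW_END)[:-1]:
        # Likewise every value ends with VALUE_END.
        values = raw_row.split(VALUE_END)[:-1]
        rows.append([v.decode("utf-8") for v in values])
    return rows
```

Since every value is terminated rather than separated, an empty row and a row holding one empty string stay distinguishable.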
        
       | zzo38computer wrote:
       | I have seen Unicode Separated Values. I don't like Unicode and I
       | even more don't like USV. I like ASCII Separated Values, which
       | can encode each separator as a single byte, and can be used with
       | character encodings other than Unicode (and, even if you do use
       | it with Unicode, does not prevent you from using the Unicode
       | control pictures in your data; USV does prevent you from using
       | those characters in your data even though the data is (allegedly)
       | Unicode).
       | 
       | What they say about display and input really depends on the
       | specific editors and viewers that you are using (and perhaps on
       | the fonts as well). When I use vi, I have no difficulty entering
       | ASCII control characters in the text. However, there is also the
       | problem with line breaking, with ASV and with USV, anyways; and
       | they do mention this in the issues anyways.
       | 
       | Fortunately, I can write a program to convert these formats
       | without too much difficulty, even without implementing Unicode
       | (since it is a fixed sequence of bytes that will need to be
       | replaced; however, it does mean that it will need to read
       | multiple bytes to figure out whether or not it is a record
       | separator, which is not as simple as ASV).
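The byte-level conversion described above can be sketched in Python as a fixed sequence replacement, with no Unicode-aware parsing. The sketch assumes UTF-8 input and ignores the escape rules of either format, so data that already contains the target separators is not handled. The function names are hypothetical.

```python
# ASCII control byte paired with the UTF-8 bytes of its control picture.
PAIRS = [
    (b"\x1f", "\u241f".encode("utf-8")),  # unit separator
    (b"\x1e", "\u241e".encode("utf-8")),  # record separator
    (b"\x1d", "\u241d".encode("utf-8")),  # group separator
    (b"\x1c", "\u241c".encode("utf-8")),  # file separator
]


def asv_to_usv(data: bytes) -> bytes:
    """Replace each single ASCII separator byte with its 3-byte symbol."""
    for control, symbol in PAIRS:
        data = data.replace(control, symbol)
    return data


def usv_to_asv(data: bytes) -> bytes:
    """Replace each 3-byte symbol sequence with its single ASCII byte."""
    for control, symbol in PAIRS:
        data = data.replace(symbol, control)
    return data
```

The size difference the comment mentions is visible here: each ASV separator is one byte, while each USV separator is three bytes in UTF-8.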
        
       | philsnow wrote:
       | > The Synchronous Idle (SYN) symbol is a heartbeat, and is
       | > especially useful for streaming data, such as to keep a
       | > connection alive.
       | >
       | > SYN tells the data reader that data streaming is still in
       | > progress.
       | >
       | > SYN has no effect on the output content.
       | >
       | > Example of a unit that contains a Synchronous Idle:
       | >
       | > ab
       | 
       | Why would this go in-band inside a document format? Just why? If
       | you want keep-alives, use a kind of connection that supports out-
       | of-band keepalives.
       | 
       | If you download the same document twice, and the second time the
       | server is heavily loaded (or it's waiting on some dependency, or
       | whatever), presumably the server will helpfully generate some
       | SYNs in the middle of the document to keep the connection alive
       | (?), but now you've got the same document "spelled" two different
       | ways, that won't checksum alike.
       | 
       | SYN along with the weirdness of
       | 
       | > Escape + [non-USV-special] character: the character is ignored
       | 
       | means that you have arbitrarily many ways of writing
       | semantically-same documents.
        
         | paulddraper wrote:
         | This entire thing is a solution in search of a problem, and
         | this is the most obvious one.
         | 
         | Why does a file format need a transport protocol?
         | 
         | ---
         | 
         | Existing transport protocols (TCP, QUIC) already provide this.
        
       ___________________________________________________________________
       (page generated 2024-03-12 23:00 UTC)