[HN Gopher] Show HN: Comma Separated Values (CSV) to Unicode Sep...
___________________________________________________________________
Show HN: Comma Separated Values (CSV) to Unicode Separated Values
(USV)
Author : jph
Score : 154 points
Date : 2024-03-12 13:43 UTC (9 hours ago)
(HTM) web link (crates.io)
(TXT) w3m dump (crates.io)
| jiehong wrote:
| For those wondering what USV is, like myself:
|
| > Unicode separated values (USV) is a data format that uses
| Unicode symbol characters between data parts. USV competes with
| comma separated values (CSV), tab separated values (TSV), ASCII
| separated values (ASV), and similar systems. USV offers more
| capabilities and standards-track syntax.
|
| > Separators:
| >
| > ␟ U+241F Symbol for Unit Separator (US)
| > ␞ U+241E Symbol for Record Separator (RS)
| > ␝ U+241D Symbol for Group Separator (GS)
| > ␜ U+241C Symbol for File Separator (FS)
| >
| > Modifiers:
| >
| > ␛ U+241B Symbol for Escape (ESC)
| > ␗ U+2417 Symbol for End of Transmission Block (ETB)
| > ␖ U+2416 Symbol for Synchronous Idle (SYN)
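|
| A minimal sketch of how the two most common separators could be
| read, assuming each unit and record is terminated by its separator
| and ignoring the ESC, ETB and SYN modifiers entirely. The names
| `split_terminated` and `parse_usv_naive` are hypothetical and are
| not taken from the crate.

```python
# Illustrative sketch only: a naive reader that ignores ESC, ETB and SYN.
US = "\u241f"  # SYMBOL FOR UNIT SEPARATOR
RS = "\u241e"  # SYMBOL FOR RECORD SEPARATOR


def split_terminated(text, sep):
    """Split on sep and drop the empty piece left by a trailing terminator."""
    parts = text.split(sep)
    if parts and parts[-1] == "":
        parts.pop()
    return parts


def parse_usv_naive(text):
    """Return a list of records, each a list of unit strings."""
    return [split_terminated(record, US) for record in split_terminated(text, RS)]


sample = "a" + US + "b" + US + RS + "c" + US + "d" + US + RS
print(parse_usv_naive(sample))  # [['a', 'b'], ['c', 'd']]
```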
| calvinmorrison wrote:
| I always wonder why we don't use this.
| curtisblaine wrote:
| Not easily readable / editable using a regular text editor.
| timmg wrote:
| Do those characters map to something visually useful in
| (typical) unicode fonts?
|
| That would be neat :)
|
| Edit: Apparently, kinda (e.g.
| https://www.compart.com/en/unicode/U+241E )
|
| Not the most creative....
| bryanlarsen wrote:
| According to their GitHub README:
|
| ```USV works with many kinds of editors. Any editor that
| can render the USV characters will work. We use vi, emacs,
| Coda, Notepad++, TextMate, Sublime, VS Code, etc.```
|
| I loaded an example in my fairly generic Emacs and it
| worked out of the box. The separators were pretty small so
| I had to increase my font size to distinguish US from RS.
| And of course I have no idea how to enter those characters.
| I'm sure there is a way, but cut & paste worked.
| Solvency wrote:
| Who here uses a "regular" text editor, let's be real
| ahofmann wrote:
| I'm fascinated that a lot of posters in this thread don't
| understand the ideas and experiences that led the inventors
| of this file format to it. They invented this format
| because it works for machines as well as for humans. Text
| editors can handle the proposed Unicode characters just fine.
| Humans can see them. The only challenges are that it is
| cumbersome to type the delimiters, and that the format is
| not used in any relevant software (like Excel). Both are
| reason enough that USV will not be used anywhere. But I
| can see why they went this way with their file format.
| wtetzner wrote:
| I don't really see what benefit it provides over CSV
| other than needing to escape less frequently. That hardly
| seems like it's worth it.
| vidarh wrote:
| We might be able to see them, but for me they're just a
| blur unless I zoom in significantly, so I'll need editor
| accommodations just as much for these characters as if
| they used the already existing RS/FS/US/GS characters.
|
| It feels like instead of fixing it properly, they went
| with an option that will still need tool improvements,
| will be controversial, and adds unnecessary details (e.g.
| the SYN they've added will be an active nuisance and I'd
| be willing to bet will get ignored by enough tools to
| become a hazard to data integrity).
|
| I quite like an initiative to make use of proper record
| and unit separators, but this feels poorly thought
| through in several respects (e.g. their quirky escape
| character that acts differently depending on the class
| of the following character will be a 'fun' source of
| bugs; that splitting records on LF requires three
| characters almost certainly will mean a number of tools
| will incorrectly treat those three characters as a unit,
| etc. -- these assumptions are based on how slapdash a lot
| of CSV parsing and generation is; if you want to compete
| with CSV you ought to learn those lessons)
| dahart wrote:
| CSV works for machines as well as humans, why do you
| assume or imply otherwise? Making the separator hard to
| type makes this 'invention' hard for humans to use. Using
| the glyphs instead of the semantic Unicode separators
| might also make this harder to use, even if you can
| understand why they did it, and to some degree it
| subverts the intent of the Unicode standard's separator
| and glyph characters.
| theamk wrote:
| We don't need a new format which works for machines as
| well as for humans, because there are tons of
| existing ones. You have CSV or TSV for wide support;
| JSONlines if you want very easy edit-ability and
| structure; and if those don't work for some reason,
| pretty much any other delimiter/escape would work better
| (example: newline for records, "^^" for fields, "^"-style
| character escaping; or JS-style "\"-escaping with field
| separator being "\N")
| theamk wrote:
| All downsides, no upsides.
|
| You cannot edit it in regular editor, like csv/tsv/jsonlines.
|
| There is no schema or efficient storage, like binary formats.
|
| There is no wide library support.
|
| Not all data is representable.
| mavhc wrote:
| 1. Editors can be improved.
| 2. Same as CSV etc. then.
| 3. Libraries can be improved.
| 4. Escaping characters exist.
|
| ASCII 1963 had 8 separators, 1965 reduced it to 4, and
| named them. See 6.3.12 of
| https://dl.acm.org/doi/pdf/10.1145/363831.363839
| cxr wrote:
| The task here is to explain why one should use this over
| CSV. By your own admissions, there is no reason to prefer
| this over CSV.
| ixwt wrote:
| > You cannot edit it in regular editor, like
| csv/tsv/jsonlines.
|
| If only there were shortcuts on modern operating systems to
| allow us to do things that aren't readily on our keyboards.
| Like upper case characters. Or copy and paste. Or close
| windows. Our lives would be so much better.
|
| If ASV had caught on, there could be common shared
| shortcuts to type them, and fonts would regularly display
| them (just like the unicode characters proposed). But CSV
| was simple enough and readily type-able.
|
| > There is no schema or efficient storage, like binary
| formats.
|
| I'm not quite certain where you're trying to go with this.
| Binary formats aren't really meant to be human readable in
| an average text editor. It doesn't know to differentiate 1,
| 2, 4, or 8 bytes as an integer or a float. Even current hex
| editors to make it easier to navigate these formats don't
| really know unless you are able to tell it somehow.
|
| > There is no wide library support.
|
| It's a critical mass problem. Not enough people are using
| them, so no libraries are being made.
|
| > Not all data is representable.
|
| I'm not quite certain what data couldn't be represented. If
| you can represent your data in CSV, you can represent it in
| ASV. It's all plain text that gets interpreted based on
| what you need. They're nearly a 1:1 replacement. Commas get
| replaced by unit separators, new lines get replaced by
| record separators. Then you have group and file separators
| to work with for further levels of abstraction if you need.
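|
| A rough sketch of that 1:1 replacement, assuming the usual ASV
| convention of the unit separator (0x1F) between fields and the
| record separator (0x1E) between rows. `asv_dumps` and `asv_loads`
| are made-up names, and rejecting separators inside fields is one
| possible policy, not a rule from any standard.

```python
US = "\x1f"  # ASCII unit separator: takes the place of the comma
RS = "\x1e"  # ASCII record separator: takes the place of the newline


def asv_dumps(rows):
    """Join fields with US and rows with RS; refuse fields containing either."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def asv_loads(text):
    """Inverse of asv_dumps for non-empty input."""
    return [record.split(US) for record in text.split(RS)] if text else []


# Commas, quotes and newlines need no quoting or escaping here.
rows = [["name", "note"], ["Ada", 'likes commas, "quotes"\nand newlines']]
print(asv_loads(asv_dumps(rows)) == rows)  # True
```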
| theamk wrote:
| Re editors: The problem with USV is not that it's hard to
| type the characters, but rather that the newlines are
| completely optional. Which means that in general case,
| most line-based tools are not going to work with USV.
|
| Now, the readme actually has that optional newline
| separator thing, but the optionality of it makes it
| completely useless, it seems like an after-thought. For
| example the first "real" USV writer I found, "csv-to-usv",
| does not emit them [0] and thus produces uneditable
| files.
|
| And if we are going to end up with uneditable files,
| might as well go with something schema-full, like parquet
| or avro. You are going to have the same "critical mass
| problem", but at least the tooling is much better and you
| have neat features like schemas.
|
| [0] https://github.com/SixArm/csv-to-usv-rust-
| crate/blob/30a0324...
| edent wrote:
| I've sketched out a replacement for JSON which would use
| these characters - https://shkspr.mobi/blog/2017/03/kyli-
| because-it-is-superior...
| michaelt wrote:
| CSV is the javascript of the tabular data world.
|
| Everyone thinks they can do better, but nothing's more widely
| supported (for a sufficiently generous definition of
| 'supported')
| hermitcrab wrote:
| Unfortunately CSVs vary a lot in the wild. Some people use
| commas as a delimiter, some use semi-colons. Escaping rules
| vary. And the text encoding is not specified.
|
| I randomly generated some CSVs and fed them into Excel and
| Numbers and they were differently interpreted.
| mst wrote:
| This is why I tend to use the Pg COPY version of TSV -
| works beautifully with 'cut' and friends, loads trivially
| into most databases, and the 'vary a lot' problem is
| (ish) avoided by specifying COPY escaping which is
| clearly documented and something people often already
| recognise.
|
| Generally my only interaction with CSV itself is to fling
| it through https://p3rl.org/Text::CSV since that seems to
| be able to get a pretty decent parse of every sort of CSV
| I've yet had to deal with in the wild.
| Tijdreiziger wrote:
| Countries that use . as the thousands separator (e.g.
| 1.000) use , as the CSV separator.
|
| Countries that use , as the thousands separator (e.g.
| 1,000) use ; as the CSV separator.
|
| Why? Because that's how Excel does it.
| dimask wrote:
| Funny thing, excel, which is the most common spreadsheet
| editor, does not practically support CSV files if you
| happen to live in countries where the default official
| convention is using commas for decimal points in numbers.
| Unless you go around and manually set stuff in how it
| imports it or you change your default settings. It has
| reached meme levels at my work.
|
| Tab separated files are much better imo in not getting
| confused with the delimiter for a sufficiently sane tsv
| file.
| Tijdreiziger wrote:
| Yes it does, but then it uses ; as a separator.
| chasil wrote:
| In a POSIX shell, I actually prefer to use the bell character
| for IFS.
|
|     while IFS="$(printf \\a)" read -r field1 field2...
|     do ...
|     done
|
| This works just as well as anything outside the range of
| printing characters.
| pavon wrote:
| Back when rolling your own application level protocol on top
| of TCP was common (as opposed to using http, zeromq, etc) I
| frequently used file/record/group/unit separators for
| delimiters, and considered them an underrated gem, especially
| for plain-text data where they were prohibited to occur in
| the message body so you didn't have to escape them (still
| good to scan and reject messages containing them). As a
| modern example they (and most other ASCII control characters)
| are disallowed in json strings.
| gkbrk wrote:
| You can put control characters in JSON strings, you just
| need to escape them.
| pavon wrote:
| The way I read the json standard, the only way to include
| control characters is to encode them as hex. For example
| BEL can be encoded as "\u0007", but escaping it by using
| a backslash followed by a literal BEL character is not
| allowed. So literal control characters should never be in
| json text.
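|
| This reading can be checked against one parser. The sketch below
| uses Python's standard `json` module, which is only one
| implementation and not the standard itself; the helper
| `rejects_literal_control` is a hypothetical name.

```python
import json

BEL = "\x07"

# Serialising a string that contains BEL produces the \u escape form.
encoded = json.dumps(BEL)
print(encoded)  # "\u0007"

# The escaped form round-trips back to the control character.
assert json.loads(encoded) == BEL


def rejects_literal_control(ch):
    """Return True if the parser refuses a raw control character in a string."""
    try:
        json.loads('"' + ch + '"')
    except json.JSONDecodeError:
        return True
    return False


print(rejects_literal_control(BEL))  # True
```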
| queuebert wrote:
| CSV is honestly not that problematic. Figuring out if a
| field contains a comma and then properly quoting it is
| trivial. And fields without commas don't need quoting.
| Sometimes your application even guarantees no commas,
| especially if CSV is built into it from the beginning.
| hermitcrab wrote:
| I'm guessing you haven't worked in customer support where
| people send you their "CSV" files. Even the field delimiter
| varies (many Europeans use semi-colons).
| queuebert wrote:
| No, I have. I don't consider abuse of the format a
| problem with the format. Though I can see how having to
| delimit with special characters will help the type of
| person who writes print(','.join(stuff)).
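|
| For illustration, here is the difference between the naive join
| and a writer that quotes, using Python's standard `csv` module.
| The sample row and the `parse` helper are invented for this
| sketch.

```python
import csv
import io

row = ["Smith, John", 'said "hi"', "plain"]

# Naive approach: no quoting, so the embedded comma splits the first field.
naive = ",".join(row)

# csv.writer quotes fields that contain the delimiter or quote characters.
buf = io.StringIO()
csv.writer(buf).writerow(row)
quoted = buf.getvalue()


def parse(line):
    """Parse a single CSV line with the default dialect."""
    return next(csv.reader(io.StringIO(line)))


print(len(parse(naive)))     # 4
print(parse(quoted) == row)  # True
```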
| hermitcrab wrote:
| >I don't consider abuse of the format a problem with the
| format.
|
| That's a fair point. But you could argue that when the
| abuse is so widespread, it becomes a de facto part of the
| format (even if it isn't in the RFC).
| teddyh wrote:
| Using Unicode _graphic_ characters as metasyntactic escape
| characters is fundamentally wrong. Those Unicode characters are
| for _displaying the symbols_ for Unit Separator, Record
| Separator, etc. and _not_ for actually _being_ separators!
| ASCII _already has those! Included in Unicode!_
| layer8 wrote:
| To be fair, I don't quite get those graphic characters,
| because the original characters should already be displayed
| that way, shouldn't they? Now when I see such a character, I
| have no idea if it's the real character or just its
| graphic-character counterpart.
| eadmund wrote:
| Wait a second ... he's not proposing using
| unit/record/group/file separators as separators, he's proposing
| using the _symbols for those separators_ as separators! Why not
| just use the separators themselves!?
|
| Yes, rather than using U+001F (the ASCII and Unicode unit
| separator), he proposes using U+241F (the Unicode symbol _for_
| the unit separator). I almost feel like this must be an early
| April Fool's joke?
|
| Also, he writes 'comprised of' rather than 'composed of' or
| 'comprises' throughout his RFC.
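|
| The distinction between the two code points can be inspected with
| Python's standard `unicodedata` module. This is only a quick
| illustration; the variable names are arbitrary.

```python
import unicodedata

control = "\u001f"  # the actual unit separator (a control character)
picture = "\u241f"  # the graphic symbol that depicts it

print(unicodedata.category(control))  # Cc
print(unicodedata.category(picture))  # So
print(unicodedata.name(picture))      # SYMBOL FOR UNIT SEPARATOR

# The control character is one byte in UTF-8; the symbol takes three.
print(len(control.encode("utf-8")), len(picture.encode("utf-8")))  # 1 3
```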
| ahofmann wrote:
| They explain in the FAQ that this approach works with most
| text editors and copy-paste situations.
| philipwhiuk wrote:
| It doesn't "work" because I can't read the darn things at a
| sane zoom level.
| bryanlarsen wrote:
| Using a visible character rather than an invisible one makes
| editing in an editor a lot easier.
| tgv wrote:
| It won't wrap at the record separator, so you'll get a very
| long line.
| bryanlarsen wrote:
| The example seems to use `␞\n` as a separator rather than
| just `␞`. I assume their proposed standard is more
| definitive.
| vidarh wrote:
| Their ABNF uses RS, defined as U+241E, not U+241E + '\n'
| as the record separator. They seem to add a "USV escape"
| in front of the linefeeds.
|
| My bet is that this _will_ lead to implementations that
| wrongly treat "␞␛\n" (RS ESC \n) as the real record
| separator, the same way lots of "CSV" implementations
| just split on comma and LF.
|
| Seems to me if you're going to add support for something
| like that you should just bite the bullet and declare an
| LF immediately following an RS as part of the record
| separator, or you're falling in the same trap as CSV of
| being "close enough" to naively splittable that people
| will do it because it works often enough.
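|
| A sketch of the rule suggested here, where an LF immediately
| following an RS counts as part of the record separator. This is
| the commenter's proposal, not what the USV draft specifies, and
| `split_records` is a hypothetical name.

```python
import re

RS = "\u241e"  # SYMBOL FOR RECORD SEPARATOR
US = "\u241f"  # SYMBOL FOR UNIT SEPARATOR

# An optional LF right after RS is consumed as part of the separator.
RECORD_END = re.compile(RS + r"\n?")


def split_records(text):
    """Split on RS or RS+LF, dropping the empty piece after the last one."""
    records = RECORD_END.split(text)
    if records and records[-1] == "":
        records.pop()
    return records


with_newlines = "a" + US + "b" + US + RS + "\n" + "c" + US + "d" + US + RS + "\n"
without_newlines = with_newlines.replace("\n", "")

# Both layouts yield the same two records.
print(split_records(with_newlines) == split_records(without_newlines))  # True
```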
| shawnz wrote:
| The escape symbol lets you ignore any non-special
| character, not just newlines:
| https://github.com/sixarm/usv?tab=readme-ov-file#escape-esc
| vidarh wrote:
| I'm aware. I don't think that serves a useful purpose - I
| think the way they've done it is likely to make people
| more likely to get the parsing wrong for pretty much zero
| benefit. My guess is you'll end up seeing a lot of
| "pseudo-USV" parsers the same way we have a ton of
| "pseudo-CSV" parsers that break on escapes or quoted
| strings with commas, and so I think they fundamentally
| failed to learn the lessons of CSV.
| theamk wrote:
| that's a lie as far as I can see, the csv-to-usv tool
| does not add any newlines:
|
| [0] https://github.com/SixArm/csv-to-usv-rust-
| crate/blob/30a0324...
| bryanlarsen wrote:
| The examples here have them:
| https://github.com/SixArm/usv/tree/main/examples
| theamk wrote:
| the submitted tool does not produce them, check out
| the tests - note there is no \n anywhere in USV
|
| https://github.com/SixArm/csv-to-usv-rust-
| crate/blob/30a0324...
| NoMoreNicksLeft wrote:
| If you're doing spreadsheets, then it should show in a
| spreadsheet and not in an editor. It's like complaining
| that he can't edit jpegs in Sublime or something... there's
| a reason that's working poorly.
|
| Speaking of which, last time I had a control code heavy
| file open in Sublime, it actually did show the control
| codes as special characters, and it was possible to
| copy/paste those. This proposal is so bad I suspect it will
| become a standard.
| dimask wrote:
| There are a lot of cases where I would rather
| inspect/quickfix a csv file in a text editor rather than
| open it as a spreadsheet. Especially cases where
| something is wrong in the format, and it will just not
| open as a spreadsheet at all. Adding unnecessary levels
| of obfuscation to your data should never be considered a
| good idea imo.
| eadmund wrote:
| The ASCII separators are visible in my editor. If something
| doesn't support ASCII text, that sounds like a bug which
| should be fixed, not a reason to misuse graphical
| characters for something other than their purpose.
| SigmundurM wrote:
| They cover the reasoning for using the control picture
| characters instead of the control characters in the FAQ:
|
| "We tried using the control characters, and also tried
| configuring various editors to show the control characters by
| rendering the control picture characters.
|
| First, we encountered many difficulties with editor
| configurations, attempting to make each editor treat the
| invisible zero-width characters by rendering with the visible
| letter-width characters.
|
| Second, we encountered problems with copy/paste
| functionality, where it often didn't work because the editor
| implementations and terminal implementations copied visible
| letter-width characters, not the underlying invisible zero-
| width characters.
|
| Third, users were unable to distinguish between the rendered
| control picture characters (e.g. the editor saw ASCII 31 and
| rendered Unicode Unit Separator) versus the control picture
| characters being in the data content (e.g. someone actually
| typed Unicode Unit Separator into the data content)."
|
| - https://github.com/SixArm/usv/tree/main/doc/faq#why-use-
| cont...
| nostrademons wrote:
| https://xkcd.com/927/
| pie_flavor wrote:
| 'Too many competing standards' is not one of the quoted
| reasons.
| ascorbic wrote:
| https://github.com/SixArm/usv/blob/main/doc/criticisms/index...
| vidarh wrote:
| I can't read those characters at the size I can/prefer to
| read the text at, so I need the tooling to support and
| render these differently anyway... This feels like solving
| the wrong problem in a way that will still end up with the
| same amount of work.
| ape4 wrote:
| An issue with CSV is that commas need to be escaped. Are the
| U+241F characters escaped in this USV format?
| hermitcrab wrote:
| I don't see any real advantage over using ASCII unit and
| record separators (.asv).
|
| Also I am not convinced about the need for an escape
| character. If you really need to use ASCII unit or record
| separators as data - tough, use a different format.
|
| If only editors would display the ASCII unit separator
| (Notepad++ does) and treat the ASCII record separator as a
| carriage return (Notepad++ doesn't) then .asv format would be
| a huge improvement on CSV.
| skirmish wrote:
| 'comprised of' is standard verbiage in US patents, and I am
| guessing he is trying to sound formal and official. Also see
| https://en.wikipedia.org/wiki/Comprised_of .
| evrimoztamur wrote:
| First time hearing about USV, nifty! However, I think the
| adoptability challenge remains here to be Excel support (very
| tough).
| croes wrote:
| Excel can't even handle CSV correctly without using the import
| function.
| ahofmann wrote:
| Well, CSV would be much harder to import than something like
| USV, because the delimiters are well-defined in USV and there
| is no need for quoting strings.
| croes wrote:
| How to put a USV example into one column of a USV without
| qualifier?
| yewenjie wrote:
| I'm still confused whether this is a joke or not.
| jefftk wrote:
| I don't think it's a joke; at https://github.com/sixarm/usv
| they discuss how they're working on IANA standardization.
| romeoblade wrote:
| Apparently it is not. They have submitted it to the ietf. I
| will have to watch closely to see if librecalc/excel and
| languages/libraries adopt support. Seems like it does solve
| some common problems with CSV.
|
| https://www.ietf.org/archive/id/draft-unicode-separated-valu...
|
| https://datatracker.ietf.org/doc/draft-unicode-separated-val...
| usrusr wrote:
| I certainly hope that anyone proposing a Unicode CSV variant as
| a joke would pick some raised hand emoji as the separator and
| the victory gesture (0xe011, also popular as an approximation
| of what an air quote emoji would look like) as the quote
| character.
|
| But we already keep stumbling over missing support for the on-
| demand quote character even with separators like comma and tab,
| using more exotic characters as the separator will only make it
| worse. The value of less escaping is negative.
| knallfrosch wrote:
| Completely unreadable. Then again, Germans know the pain of
| decimal points.
|
| We write 3.000,00 for exactly three thousand, instead of
| 3,000.00
|
| Now imagine how often parsing breaks.
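|
| To make the breakage concrete, a small sketch with Python's
| `decimal` module. `parse_decimal` is a hypothetical helper that
| must be told which convention the text uses, because the text
| alone does not say.

```python
from decimal import Decimal


def parse_decimal(text, decimal_sep, group_sep):
    """Drop the grouping separator, then normalise the decimal separator to '.'."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


print(parse_decimal("3.000,00", decimal_sep=",", group_sep="."))  # 3000.00
print(parse_decimal("3,000.00", decimal_sep=".", group_sep=","))  # 3000.00
print(parse_decimal("3 000,00", decimal_sep=",", group_sep=" "))  # 3000.00

# The same characters mean different numbers under different conventions.
print(parse_decimal("3.000", decimal_sep=",", group_sep="."))  # 3000
print(parse_decimal("3.000", decimal_sep=".", group_sep=","))  # 3.000
```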
| alwyn wrote:
| In my head 3.000,00 is correct and I always get confused
| because it seems most(?) people use the other method.
| Ekaros wrote:
| Finland uses 3 000,00 which is also kinda pain to parse.
|
| I think rarely used ' to group thousands is actually most
| sensible solution.
| euroderf wrote:
| And now and then you encounter a web form in the .fi domain
| that rejects "," and expects ".", but does not tell you
| that that is the reason for rejecting your input. The web
| "designers" that deploy such crap in .fi should be sent to
| Siberia.
| michaelmior wrote:
| If I understand the API correctly from my brief glance, the crate
| returns a triply-nested vector with the outermost vector being
| the equivalent of CSV rows, then CSV columns, then "units" which
| don't have a direct CSV equivalent. It would be helpful if there
| was an API method that returned results without this final level
| of nesting, perhaps panicking if there is more than one unit.
| This would make it easier to deal with the common case (in CSV at
| least) where each column only has a single value.
| hiccuphippo wrote:
| I think the units are the csv fields, records are rows, groups
| would be multiple CSV files (or multiple sheets in an excel
| file) and file separator... a zip with multiple CSV files? (or
| multiple excel files).
| michaelmior wrote:
| My mistake then about the correspondence :)
| eli wrote:
| Not sure I understand the advantage over ASCII Separated Values
| (ASV) which use ASCII control characters 0x1E and 0x1F
| p_l wrote:
| Surprisingly, they actually did write a FAQ entry on it (I'm
| honestly surprised):
|
| https://github.com/SixArm/usv/tree/main/doc/comparisons#asci...
| jdeisenberg wrote:
| Addressed in the FAQ:
| https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...
|
| Main point: "USV provides typically-visible letter-width
| characters (such as Unicode 241F), whereas ASV provides
| typically-invisible zero-width characters (such as ASCII 31)."
| AdamH12113 wrote:
| USV would have the disadvantage of using multi-byte
| characters as delimiters, so you have to decode the file in
| order to separate records. And you still can't type the
| characters directly or be guaranteed to display them without
| font support. This honestly seems like cleverness for
| cleverness's sake.
| eli wrote:
| Ah fair enough. Of course you _could_ configure your
| shell/editor/whatever to make control characters visible. Seems
| like if you were going to edit USV or ASV by hand you'd
| probably want a customized editor anyway.
| a-priori wrote:
| The way I would have gone would be to define the standard to
| support both, such that the two sets of codes MUST be
| considered semantically equivalent, but that generation tools
| SHOULD prefer to generate the control codes for new files.
|
| This way people can initially use the visible glyphs while
| editors don't support the format, and this will always be
| supported. But, as editors add support and start to generate
| the files via tools or manually in tabular interfaces where
| the codes themselves disappear, usage will automatically
| transition over to the control codes.
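|
| One way a tool could treat the two sets as equivalent is to
| normalise the visible symbols to the control codes on read. A
| sketch, with `to_control_codes` as a made-up name; note that a
| symbol meant as literal data would be converted as well.

```python
# Map each control picture to the control character it depicts.
PICTURE_TO_CONTROL = str.maketrans({
    "\u241f": "\x1f",  # unit separator
    "\u241e": "\x1e",  # record separator
    "\u241d": "\x1d",  # group separator
    "\u241c": "\x1c",  # file separator
})


def to_control_codes(text):
    """Replace the visible separator symbols with the real control characters."""
    return text.translate(PICTURE_TO_CONTROL)


visible = "a\u241fb\u241e"
print(to_control_codes(visible) == "a\x1fb\x1e")  # True
```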
| layer8 wrote:
| This is so weird, since the purpose of the former characters
| is displaying the latter characters. If they are actually
| used for display, then you can't tell which is which.
| jefftk wrote:
| Description of USV: https://github.com/sixarm/usv
| tambourine_man wrote:
| ASCII has a field delimiter character. The fact that we chose
| comma and tabs because a field delimiter character is hard to
| type or see is one of those things that saddens me in computing.
|
| Imagine the amount of pain that could have been spared if we had
| done it right from the start some 50 years ago.
| g4zj wrote:
| Interesting. Are you referring to the unit separator (1F)?
|
| https://www.ascii-code.com/31
| tambourine_man wrote:
| Yes, we have unit, record, group and file separators. And we
| chose never to use them.
| g4zj wrote:
| It seems as though one could easily build a file format far
| more useful than CSV simply by utilizing these separators,
| and I'm sure it's been done countless times.
|
| Perhaps this would make an interesting personal project.
| Are you aware of any hurdles, missing key features, etc.
| that previous attempts at creating such a format have run
| into (other than adoption, obviously)?
| hermitcrab wrote:
| The ASCII unit separator and record separator characters
| are not well supported by editors. That is why people
| stick to the (horrible and inconsistent) CSV format.
| tambourine_man wrote:
| People don't like invisible, hard-to-type characters. They
| prefer suffering quoting, escaping, escaping quotes and
| all that fun stuff
| t-3 wrote:
| Are people actually typing up *SV files by hand? It's
| trivial to support editing in an IDE and exporting from
| data-producing applications.
| andyferris wrote:
| Yes, sometimes, of course. It's a bit like JSON.
| Sometimes it's easiest to inject a small piece of hand-
| written data into a test or whatever.
|
| (That said every text editor since ever should have had a
| "table mode" that uses the ASCII field/record separators
| (or whatever you choose), I was always confused why this
| isn't common. Maybe vim and emacs do?)
| EvanAnderson wrote:
| I've done ETL work with systems that used the ASCII
| separators. It was very pleasant work. Not having to
| worry about escaping things (because the ASCII separators
| weren't permitted to be in valid source data to begin
| with) was very, very nice.
|
| I'm a Notepad++ person. When I needed to mock-up data
| typing the characters was easy-- just ALT and the ASCII
| code on the numeric pad. It took a bit to memorize the
| codes I needed to use. Their visual representation is
| just inverse text and initials.
| eirikbakke wrote:
| Dedicated separator characters don't solve the problem--
| you'd still need to escape them. Or validate that the data
| (which may come from untrusted web forms etc.) does not
| contain them, which means you have another error condition
| to handle.
| hermitcrab wrote:
| Or specify that the data can't contain these characters. If it
| does, you have to use a different format. This keeps
| everything super simple. And how often are ASCII US and
| RS characters used in data? I don't think I have ever
| seen one in the wild, apart from in a .asv file.
| g4zj wrote:
| I'm no expert on character encodings or Unicode itself,
| but would this be as simple as checking for the byte 1F
| in the data? Assuming the file is ASCII or UTF-8 encoded
| (or attempting to confirm this as much as possible as
| well), it seems like that check would suffice to validate
| the absence of the code point in the data, but I imagine
| it's not quite so simple.
| rhelz wrote:
| For text data, it would work fine, but you'd have to do
| some finagling with binary data; $1F is a perfectly valid
| byte to have in, say, a 4-byte integer.
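|
| A sketch of both cases. In UTF-8 every byte of a multi-byte
| sequence is 0x80 or above, so a 0x1F byte can only be the unit
| separator itself; packed binary has no such guarantee.
| `contains_unit_separator` is a hypothetical helper.

```python
import struct

US_BYTE = b"\x1f"


def contains_unit_separator(data):
    """Byte-level scan; reliable for ASCII or UTF-8 text, not for raw binary."""
    return US_BYTE in data


# UTF-8 text with non-ASCII characters, including the U+241F symbol.
text = "na\u00efve \u241f text".encode("utf-8")
print(contains_unit_separator(text))    # False

# A little-endian 4-byte integer with value 31 contains the byte 0x1F.
packed = struct.pack("<I", 31)
print(contains_unit_separator(packed))  # True
```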
| tambourine_man wrote:
| The "problem" I'm referring to is that we chose a widely
| used character as a field separator. Of course you still
| have to write a parser, etc, it's just a lot easier if
| you choose a dedicated character.
| AdamH12113 wrote:
| There's an ASCII character for escaping, too, if you need
| it.
|
| The advantage of ASV is not that you can't have invalid
| or insecure data, it's that valid data will almost never
| contain ASCII control characters in the record fields
| themselves. Commas, quotation marks, and backslashes,
| meanwhile, are everywhere.
| mechanicalpulse wrote:
| I often use them in compound keys (e.g., in a flat key
| space as might be used by a cache or similar simple
| key/value store). IMHO, they are superior to other common
| separators like colons, dashes, etc. because they are (1)
| semantically appropriate and (2) less likely to be present
| in the constituent pieces of data, especially if the data
| in question is already limited to a subset of characters
| that do not include the separators, which it often is
| (e.g., a URL).
| layer8 wrote:
| "Less likely" doesn't help if you may get arbitrary
| (user) input. If you can use a byte sequence as the key,
| a better strategy is to UTF-8-encode the pieces and use
| 0xFF as the separator byte, which can never occur in
| UTF-8.
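|
| A minimal sketch of that strategy, assuming the store accepts
| arbitrary bytes as keys. `make_key` and `split_key` are
| hypothetical names and the sample values are invented.

```python
SEP = b"\xff"  # never appears in valid UTF-8, so it cannot collide with a part


def make_key(*parts):
    """UTF-8-encode each part and join the pieces with the 0xFF byte."""
    return SEP.join(part.encode("utf-8") for part in parts)


def split_key(key):
    """Recover the original parts from a compound key."""
    return [piece.decode("utf-8") for piece in key.split(SEP)]


parts = ["user", "42", "https://example.com/a:b-c\x1f"]
key = make_key(*parts)
print(split_key(key) == parts)  # True
```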
| TheRealPomax wrote:
| Because they're zero-width. If you can't see them when you
| print your data, it's a machine-only separator, which makes
| it a bad separator for data that humans need to look at and
| work with.
|
| (Because CSV is a terrible data exchange format in terms of
| information per byte. But that makes sense, because it's an
| intentionally _human readable_ data exchange format, not a
| machine format)
|
| Hence https://github.com/SixArm/usv/tree/main/doc/faq#why-
| choose-u...
| atrus wrote:
| Yeah it's really interesting to me how much of what we use/do
| is shaped by our input devices. Macropads are a start, but I'd
| love a keyboard with screens on each key, that's not absurdly
| expensive and can be layered easily.
| benjijay wrote:
| Something like the Optimus Maximus?
|
| https://en.wikipedia.org/wiki/Optimus_Maximus_keyboard
|
| (It's been almost 20 years and you still can't get one...)
| NelsonMinar wrote:
| I've used the ASCII delimiters in a webapp once; Javascript in
| the browser formatted data with them and sent it to my server
| via HTTP POSTs. I was a bit nervous that something in the path
| would break the data but happily it all just worked fine.
| adammarples wrote:
| Currently saving the day in a data pipeline project which
| depends on a tool which only exports unescaped csvs. They
| work very well through the pipeline, Unix split, awk, and
| then snowflake all support them nicely. One annoying thing is
| that they are annoying to type and you never quite know if
| you need to refer to them using octal, hex or what, and what
| special shell escaping might be used.
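|
| For reference, the decimal, hex, octal and Unicode spellings all
| name the same characters. Python is used below only as a
| calculator; each tool has its own escape syntax.

```python
# Unit separator: decimal 31, hex 1F, octal 037.
US_DECIMAL = chr(31)
US_HEX = "\x1f"
US_OCTAL = "\037"
US_UNICODE = "\u001f"
print(US_DECIMAL == US_HEX == US_OCTAL == US_UNICODE)  # True

# Record separator: decimal 30, hex 1E, octal 036.
RS_DECIMAL = chr(30)
RS_HEX = "\x1e"
RS_OCTAL = "\036"
print(RS_DECIMAL == RS_HEX == RS_OCTAL)  # True

print(hex(31), oct(31))  # 0x1f 0o37
print(hex(30), oct(30))  # 0x1e 0o36
```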
| littlestymaar wrote:
| > ASCII has a field delimiter character.
|
| Where's the key on my keyboard to make one?
|
| The point of text-based formats is that you can edit them in a
| text editor by hand trivially, if typing the character is
| nontrivial, then it entirely defeats the point (that's also why
| USV adds very little value IMHO).
| tambourine_man wrote:
| What's the key to enter the euro symbol? That means you can't
| use it in a text editor?
|
| There is no perfect solution, but I'd rather open a text file
| in a decent editor than having to deal with the escaping hell
| that is CSV.
|
| They could have chosen the pipe character "|" at least, but
| the comma is the thousand separator in many languages (number
| formatting is kind of important for tabular data, if you ask
| me) and also, you know, general prose.
| MaBu wrote:
| >What's the key to enter the euro symbol? That means you
| can't use it in a text editor?
|
| Alt gr+E? Like it's shown on the keyboard.
| tambourine_man wrote:
| Not on a US keyboard layout. The point is that we insert
| characters that aren't written on the keyboard keys with
| some regularity, like (c), (r), (tm), etc
| couchand wrote:
| Well some of us do. There's this interesting effect where
| many people perceive the limitations on their current
| tools to be equivalent to limitations on their abstract
| abilities. If they don't know how to do it, it's
| impossible.
| shawnz wrote:
| I think that's exactly the point that the parent poster
| is trying to make by example? Just because we don't have
| good tooling today for using ASCII delimiter characters,
| doesn't mean it's impossible -- just like typing the euro
| symbol on an american keyboard
| couchand wrote:
| Oh yes certainly. And I think that when you're deep into
| creation it can be really really hard to remember that
| experience, and so recently I'm trying to find ways to
| help pull back the curtain for folks.
| littlestymaar wrote:
| It doesn't mean it's impossible, but it's definitely
| cumbersome. Any non-English speaker who has had to type in
| their native language on an American keyboard can tell
| you.
| littlestymaar wrote:
| > What's the key to enter the euro symbol?
|
| There's one on French keyboards actually!
|
| And it was there even before we got euro coins in our hands
| (I know this because I'm still using my first (mechanical)
| keyboard that I got with my first own PC in 2001: and there
| is a "EUR" symbol on it)
| wiml wrote:
| There's also the generic currency symbol, ¤, which I
| think is on some keyboard layouts pre-Euro.
| andyferris wrote:
| Ooh is that what that is? TIL
| mongol wrote:
| That is an underappreciated gem. It should find more use!
| zzo38computer wrote:
| Control underscore is the unit separator character. (Some
| editors may require you to escape that character, though.)
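The Control-underscore remark follows the traditional terminal convention that Ctrl plus a key sends the key's code with the upper bits cleared. A small sketch of that arithmetic, assuming the classic masking rule (whether a particular terminal or editor passes these characters through unchanged is not guaranteed):

```python
def ctrl(key: str) -> str:
    """Return the control character conventionally sent for Ctrl+<key>."""
    return chr(ord(key) & 0x1F)

# The four ASCII separators and the keys that conventionally produce them.
separators = {
    name: ctrl(key)
    for name, key in [("FS", "\\"), ("GS", "]"), ("RS", "^"), ("US", "_")]
}
```

Under that rule Ctrl+\ gives FS (0x1C), Ctrl+] gives GS (0x1D), Ctrl+^ gives RS (0x1E), and Ctrl+_ gives US (0x1F).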
| eirikbakke wrote:
| The great thing about comma as a field separator is (1) the
| character is visible and (2) the character is common, so if
| there are escaping bugs in either the generator or the parser,
| they will quickly become apparent. Much better to fail fast
| than having a parse error at line 28357283 because a more
| uncommon separator character somehow still made its way into
| the data.
| tambourine_man wrote:
| We have editors that can work with invisible characters. It's
| not hard. I do that all the time in Vim with tabs and CR/LF
| anyway.
|
| Unfortunately that ship has sailed. We have standards for
| escaping commas, escaping quotes, it's escaping all the way
| down
| remram wrote:
| Why not use parquet at this point? (or a row-oriented equivalent
| like Avro or SQLite)
|
| If you don't have a human-readable file, might as well be
| compressible, queriable, and metadata-enabled I think.
| codeulike wrote:
| CSV is like an invasive plant species, or perhaps a curse; you're
| never going to be able to root it out even though there are a
| billion better data formats.
| croes wrote:
| For its use case a good and simple format with just three
| simple rules and three special purpose characters.
| codeulike wrote:
| True but there's so much scope for people to do naive
| implementations with join() or split() functions and then you
| end up with nothing escaped properly and a big mess
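The join()/split() trap described above is easy to reproduce. A short Python sketch (the sample line is made up) comparing a naive split with the standard library's `csv` module:

```python
import csv
import io

line = 'widget,"1,5 kg","say ""hi"""'

# Naive approach: splits inside the quoted field and keeps the quote marks.
naive = line.split(",")

# csv.reader applies the usual quoting rules (doubled quotes, quoted commas).
parsed = next(csv.reader(io.StringIO(line)))

# Writing has the same trap: str.join() does no quoting, csv.writer does.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow(parsed)
round_trip = buf.getvalue().rstrip("\n")
```

The naive split yields four mangled pieces for what is really three fields, while the reader/writer pair reproduces the original line.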
| HideousKojima wrote:
| CSV can be manually read/edited by non-technical/non-developer
| humans using commonly available tools like Excel and Notepad.
| Not many of the better data formats match that criteria.
| otherme123 wrote:
| Notepad, I agree. Excel... not so much: it tends to change
| data silently unless you are very cautious with your
| environment (e.g. dates transformed to number of days since
| 1900, and some strings to dates)
| HideousKojima wrote:
| Actually Excel finally added a "stop &$*@ing up my data"
| option recently:
| https://mashable.com/article/microsoft-excel-disable-setting...
| otherme123 wrote:
| That helps, no doubt. But last week one of my coworkers
| touched a Csv with Excel, and all dates went from ISO8601
| to MDY. We are based in Europe (i.e. we use DMY at
| minimum). In my experience, a Csv touched by Excel is not
| trustable for further analysis.
| teddyh wrote:
| This is needlessly adding yet another standard[1] to the mix. If
| you are in a position to choose what standard you use, just use:
|
| * Whatever is best for the data model and/or languages you use.
| JSON is a common modern choice, suitable for most things.
|
| * If you want something more tabular, closer to CSV (which is a
| valid choice for bulk data), use strict RFC 4180 compliant data.
|
| * If you want to specify your own binary super-compact data, use
| ASN.1. I am also given to understand that Protobuf is a popular
| modern choice.
|
| If you _aren't_ in a position to choose your standards, just do
| whatever you need to do to parse whatever junk you are given, and
| emit as standards-compliant data as possible as output; again,
| RFC 4180 is a great way to standardize your own CSV output, as
| long as you stick to a subset which the receiving party can
| parse.
|
| Nobody needs "USV", and nobody should use it.
|
| 1. <https://xkcd.com/927/>
| ochrist wrote:
| If you live in a place where comma is the decimal separator, your
| CSV files will often use semicolon as the separator instead of
| comma. Will this tool cater for that?
| wodenokoto wrote:
| What do you mean cater to that? The point is you separate with
| a value that is not used within the fields. So decimal your
| numbers however you want.
| greenshackle2 wrote:
| This is (nominally) a discussion about the csv-to-usv tool.
| They are asking if the csv-to-usv tool also accepts semi-
| colon delimited files as input.
|
| Have you maybe lost track of what post you're commenting
| under?
|
| (I believe the answer is no BTW, the tool only supports , as
| delimiter in its input.)
| ochrist wrote:
| Yes, this. Thank you.
|
| If I work with CSV files they are most often not comma-
| separated but semicolon-separated because of the numbers.
| An Excel installation localized for decimal comma would not
| read 'real' CSV files correctly.
|
| If csv-to-usv cannot cater for this type of CSV files, it
| would not be usable in a large part of the world.
| greenshackle2 wrote:
| Yeah they should add it. The tool is like 20 lines of
| Rust code. It's a thin wrapper around the csv Rust crate,
| which does support specifying alternative delimiters.
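To illustrate what delimiter support could look like, here is a sketch in Python rather than Rust. It is not the csv-to-usv implementation; `delimited_to_usv` is a hypothetical helper, and it assumes every unit is followed by the unit separator symbol and every record by the record separator symbol, with no escaping of those symbols inside the data:

```python
import csv
import io

US = "\u241f"  # SYMBOL FOR UNIT SEPARATOR
RS = "\u241e"  # SYMBOL FOR RECORD SEPARATOR

def delimited_to_usv(text: str, delimiter: str = ",") -> str:
    """Convert delimited text to USV-style output (hypothetical helper).

    Assumption: each unit is followed by US and each record by RS.
    Separator symbols that occur inside the data are not escaped.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return "".join("".join(unit + US for unit in row) + RS for row in reader)

# Semicolon-separated input with decimal commas, as produced in many locales.
semicolon_csv = 'name;price\r\n"Brie; aged";12,50\r\n'
usv = delimited_to_usv(semicolon_csv, delimiter=";")
```

The only change needed for semicolon input is the `delimiter` argument; the decimal comma in `12,50` passes through untouched.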
| code-faster wrote:
| CSV is great because excel can import it, but it can't import
| USV, so at that point, why use USV when you can use JSON?
|
| https://github.com/tyleradams/json-toolkit/
| hiccuphippo wrote:
| Maybe their objective in submitting to the IETF is to get
| programs like Excel to start supporting it.
| layer8 wrote:
| That's... not how Excel/Microsoft works.
| extraduder_ire wrote:
| Can you not customize the separators used when importing csv-
| likes into excel? Libreoffice has a neat little window for it
| that even shows a preview of what values go into which cells.
| da_chicken wrote:
| Sure if you want to stop and fiddle with Excel.
|
| If you want to just double click and get to work, no.
| forgetfulness wrote:
| Seems complex enough that you'd only manipulate files in this
| format by serializing through a tool, and by then it's competing
| with established binary formats rather than CSV.
| jonathaneunice wrote:
| Fascinated this uses the Unicode glyphs / symbols for unit and
| record separator rather than the unit and record separators
| themselves (ASCII US and RS).
|
| Perfect deployment of David Wheeler's aphorism:
|
| > All problems in computer science can be solved by adding
| another level of indirection.
|
| https://en.wikipedia.org/wiki/David_Wheeler_(computer_scient...
| ale42 wrote:
| Indeed... I didn't read the standard in detail to check whether
| escaping is allowed/taken into account, but what if my data
| contains those symbols? I mean, they are perfectly legal
| Unicode printable characters, unlike the ASCII ones.
| BugsJustFindMe wrote:
| There's an escape.
| theamk wrote:
| I thought the point is you don't need escapes?
|
| If you still need to implement escape mechanism, might as
| well do CSV/TSV.
| BugsJustFindMe wrote:
| The point is ASCII DSV, which gives innately better
| hierarchy than CSV, but with visible tokens and stream
| accommodation. You should read the github readme. It's
| not that long.
|
| https://github.com/SixArm/usv/tree/main/doc/faq#why-choose-u...
|
| As for still needing escapes, using obscure symbols
| instead of ones that are extremely common in writing
| inherently means needing far far faaaaaaar fewer of them.
| theamk wrote:
| What's the point of visible tokens if it's all squished
| in one line? You are not going to be editing this in
| regular editor once you have non-trivial amount of data.
|
| And yes, I read README and source code, so I know that
| newlines are optional, existing tools don't generate
| them, and multi-line examples are basically fake.
| BugsJustFindMe wrote:
| > _What's the point of visible tokens if it's all
| squished in one line?_
|
| It doesn't have to be all squished in one line, it just
| doesn't hurt anything. Visually splitting squished lines
| for presentation or perusal is trivial because of the
| record separator.
|
| > _You are not going to be editing this in regular
| editor_
|
| I know (or at least I think) that you meant this in
| relation to squished lines getting very long, but maybe
| we can talk about it in a broader context, since record
| splitting is trivial...
|
| One could easily say these same words about documents
| written in right-to-left languages. But people in Israel
| manage to create files too somehow, so that's clearly not
| an insurmountable barrier.
| couchand wrote:
| Editors generally support composing right-to-left
| languages that way? So I suppose the metaphor suggests
| that all editors should directly support the visible
| glyphs semantically?
|
| And yet, that's explicitly not the semantic purpose of
| those glyphs. The actual delimiters already exist at a
| lower code point. If we're asking editors to semantically
| support delimiters we should be asking them to support
| the semantic delimiters.
| 6510 wrote:
| I once attempted to write a blog post about escaping
| stuff in RSS feeds; while it was technically correct, nothing
| could parse the RSS feed for the blog.
| red_admiral wrote:
| Indeed, if the result is to be encoded with UTF-8, using 1-byte
| separators vs the multi-byte encoding of (241F) would make
| sense to me.
|
| I'd also prefer if escapes were done in the "traditional"
| manner of, for example, "\t" for a tab because you can then
| read in stuff with something like
| input.split("\t").map(unescape); you know any actual tab
| character in the input is a field separator, and then you can
| go through the fields to put back the escaped ones.
| eadmund wrote:
| > you can then read in stuff with something like
| input.split("\t").map(unescape)
|
| What about input lines like 'asdf\\\thjkl\tzxcvb'? That
| should be two fields, one the string 'asdf\thjkl' and the
| other the string 'zxcvb.'
|
| I think that your way is a bit like trying to match context-
| free grammars with a regular expression. The right way is to
| parse the input character by character.
| cxr wrote:
| Although matching up nested pairs of brackets requires
| something at least as powerful as a pushdown automaton (CFG
| matcher), discriminating between an arbitrary number of
| escaped backslashes followed by an unescaped 't' versus an
| arbitrary number of escaped backslashes followed by the
| '\t' escape sequence doesn't require anything more powerful
| than a finite state machine.
| qzzi wrote:
| > you know any actual tab character in the input is a field
| separator, and then you can go through the fields to put
| back the escaped ones
|
| The "\t" in "split" is not a "slash-tee" but an actual tab
| character and then escape sequences in fields are handled
| by the "unescape" function.
| fiddlerwoaroof wrote:
| I think the suggestion is that the field separator is an
| actual tab character (ascii code 9) but tabs inside the
| field are `\t`. So, splitting on the tab character always
| works because fields cannot contain ascii code 9 but must
| use the two character escape instead.
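The split-then-unescape reading described above can be sketched as follows. The escape table (`\t`, `\n`, `\\`) is an assumed convention for illustration, not something the thread specifies:

```python
import re

# Assumed escape convention: \t = tab, \n = newline, \\ = backslash.
_UNESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}
_ESCAPES = {"\t": "\\t", "\n": "\\n", "\\": "\\\\"}

def escape(field: str) -> str:
    """Replace tab, newline and backslash with two-character escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in field)

def unescape(field: str) -> str:
    """Undo escape(); scanning left to right keeps an escaped backslash
    followed by 't' from being mistaken for an escaped tab."""
    return re.sub(
        r"\\(.)",
        lambda m: _UNESCAPES.get(m.group(1), m.group(1)),
        field,
        flags=re.S,
    )

def encode_row(fields):
    return "\t".join(escape(f) for f in fields)

def decode_row(line):
    # Every real tab in the line is a separator, so a plain split is safe.
    return [unescape(f) for f in line.split("\t")]

row = ["asdf\\", "a\tb", "zxcvb"]
line = encode_row(row)
```

Because escaped fields never contain a raw tab, the encoded line has exactly two tab characters for three fields, and decoding needs no quote tracking.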
| ajdude wrote:
| This makes me sad; such a missed opportunity.
| default-kramer wrote:
| Two links away is the answer:
| https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
| divbzero wrote:
| The answer makes sense to me, but I wish we could fix editors
| to properly handle the ASCII separators (1C, 1D, 1E, 1F)
| instead of resorting to Unicode control picture characters
| (241C, 241D, 241E, 241F).
|
| Maybe if editors are fixed up we could adopt ASCII Separated
| Values (ASV) as the new standard.
| marwis wrote:
| Why not combine zero width character with visible character,
| i.e. use 2 characters for separators?
|
| ,<FS> for fields \n<RS> for records
|
| This removes ambiguity in parsing and remains user readable.
| It's also relatively easy to auto-fix files edited by users
| in normal editors.
|
| It also mostly removes need for escaping.
|
| It's also smaller or same size as unicode multibyte
| characters (haven't checked).
| HL33tibCe7 wrote:
| Perfect deployment of HL33tibCe7's aphorism:
|
| > For every interesting HN post, there's at least one smug
| commenter who thinks he knows better, but actually doesn't
|
| https://github.com/SixArm/usv/tree/main/doc/faq#why-use-cont...
| ok_dad wrote:
| The OP was probably assuming no human would want to actually
| read a CSV raw, and so was probably correct from their POV.
| Your POV is probably from someone who reads CSVs raw. You
| don't have to be so rude about it, you're being even more
| smug than the OP, probably.
| groby_b wrote:
| One of the two likely works with CSVs for a living, and
| it's definitely not the person suggesting "What if it just
| was hard to eyeball/edit".
|
| If you don't understand why something is the way it is, it
| might be better to start with a question than with a
| statement implying the tech misses existing tech.
| Chesterton's fence still applies, and ignoring it means
| you're outsourcing your work to others. RTFM is a perfectly
| valid answer at that point.
| ok_dad wrote:
| I use CSVs for a living but I rarely read them manually.
| I'd rather have ASCII than Unicode in my CSVs.
|
| My point above, though, is that everyone has opinions and
| you don't have to be a dickhead about "correcting" them.
| 1vuio0pswjnm7 wrote:
| (For text processing, I use octal \034 all the time.)
|
| Perhaps there is a software developer version of "Needs more
| cowbell" called "Needs more complexity"
|
| Computer languages generally use the Latin alphabet. And even
| in a case like APL, which some HN commenters call
| "hieroglyphics", the number of symbols is limited and each is
| precisely defined (cf. "emojis" that are open to
| interpretation).
| Tijdreiziger wrote:
| Well, yeah, not every language uses the Latin alphabet.
| _obviously wrote:
| Unicode is Turing complete which makes it an attack vector.
| hermitcrab wrote:
| It is a set of glyphs and their encodings. How is that 'Turing
| complete'?
| zzo38computer wrote:
| Unicode involves more than the set of glyphs and their
| encodings; it also involves properties, etc. However, it can
| be an attack vector even ignoring that stuff; it does not
| have to be Turing-complete to be an attack vector. But, the
| specific kind of attacks depends on the application.
|
| Different kinds of character sets and character encodings will
| be good for different purposes. Unicode is "equally bad" for
| many uses.
| otabdeveloper4 wrote:
| Absolutely terrible documentation. The RFC doesn't even explain
| the purpose of the "End of Transmission Block" token.
| bombledmonk wrote:
| I've actually been employing Emoji Separated Values (ESV), often
| , here and there when doing some of this kind of work. Granted,
| it's not standard, but it's been really useful when I've needed
| it.
|
| *edit Apparently emojis don't fly here, but it was an index
| finger pointing right.
| gausswho wrote:
| My delimiter over several projects over the years has been:
|
| Only a matter of time before something breaks catastrophically
| but it hasn't happened yet.
| philipwhiuk wrote:
| The benefit of this is that you can use different emojis to
| denote content type.
|
| e.g. if it's a frowny face you know it's an invoice.
| pquki4 wrote:
| The usv github repository says it is "the standard for data
| markup of ...", has 66 stars, and is _currently_ applying for
| "text/usv" MIME type. That's all about it.
|
| Maybe I'll consider it when it does not belong to a company, has
| two more zeros in the number of stars, and has RFC/ISO attached
| to it. Because right now it is not much more of a "standard" than
| a hobby project I create on a whim.
| netsharc wrote:
| Yeah, I can't imagine the ego one needs to basically go "Hey
| everyone, I've invented a new standard!"...
| renewiltord wrote:
| About the most annoying thing about the modern Internet is
| this kind of chip-on-the-shoulder comment about "oh he has
| such a big ego" and nonsense like that.
|
| Man, I preferred it when people could just write up and
| propose things. The insufferable "is that professional?",
| "What about consensus?", "Wow the ego to propose something".
|
| Time to return to monke.
| FrustratedMonky wrote:
| What a Pedantic take on what constitutes a 'standard'.
| paulddraper wrote:
| We'll be waiting with bated breath
| MrOxiMoron wrote:
| /me looks at calendar, nope not April 1st yet.
| vidarh wrote:
| Their examples if anything convinced me not to use this for a
| long time.
|
| I need to zoom to be able to tell these apart, so I'll need
| editor support for it to be convenient to work with these anyway.
| And then clicking through to the comparisons, it demonstrates the
| difference _existing support for CSV "everywhere"_ makes - Github
| renders the CSV examples nicely as tables, while again I need to
| zoom in to see which separator is which for USV.
|
| Maybe once there is widespread editor support. But if you need
| editor support for it to be comfortable anyway, then the main
| benefit vs. using the old-school actual separator characters goes
| out the window.
| strunz wrote:
| csvkit makes displaying CSV in a terminal trivial and has all
| the tools to manipulate/filter data I've ever needed -
| https://csvkit.readthedocs.io/en/latest/
|
| I don't really get this project at all.
| vidarh wrote:
| The thing is, while I'll probably just stick with CSV too,
| I'm sympathetic to the intent, but given I expect it'll need
| tooling anyway I'm less sympathetic to them not picking the
| existing separator.
|
| I also think there are failed lessons here that reduce the
| incentive for switching.
|
| E.g. If you're going to improve on CSV, a key improvement
| would be to aim to make the format trivially splittable,
| because the lesson from CSV is that when a format _looks this
| trivial_ people will assume they can just split on a fixed
| string or trivial regex, and so the more you can reduce the
| harm of that the better.
|
| As such, I'd avoid most of the escaping they show,
| _especially for line endings_ , and just make RS '\n' the
| record separator, or possibly RS '\n'*. Optionally do the
| same for US. Require escaping LF immediately after RS/US, and
| _only_ allow escaping RS, so unescaping can be done with a
| trivial fixed replace per field if you have a reason to
| assume your data might have leading linefeeds in fields - a
| lot of apps will get away with just ignoring that.
|
| Then parsing is reduced to something like
| `data.split(RS).map{|row| row.split(US).map{|col|
| col.gsub(ESCAPE,"\n") } }` (assuming RS, US, and ESCAPE are
| regexps that include the optional trailing linefeeds and
| escapes leading linefeeds respectively). Being able to copy a
| correct one-liner from Stackoverflow ought to avoid most of
| the problems with broken CSV/TSV parsing.
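A rough Python rendering of the Ruby one-liner above, under loud assumptions: the ASCII control characters are used as separators, each may be followed by one optional linefeed, and a field that genuinely starts with a linefeed is written with a leading ESC (0x1B). None of this is taken from the USV spec; it only illustrates the proposal in this comment:

```python
import re

RS = re.compile("\x1e\n?")          # record separator, optional trailing LF
US = re.compile("\x1f\n?")          # unit separator, optional trailing LF
ESCAPED_LF = re.compile("^\x1b\n")  # assumed: ESC marks a real leading LF

def parse(data: str):
    """Split into rows and columns, then undo the single escape per field."""
    return [
        [ESCAPED_LF.sub("\n", col) for col in US.split(row)]
        for row in RS.split(data)
    ]

# Two records; the last field really begins with a linefeed.
sample = "a\x1fb\x1e\n" + "c\x1f\x1b\nd"
table = parse(sample)
```

The whole parser is two splits and one fixed replacement, which is the property the comment is arguing for.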
|
| I'm also not convinced adding GS, FS, ETB is a good idea,
| partly for that reason, partly because a lot of the tools
| people will want to load data into will not handle more than
| one set of records, and so you'll end up splitting files
| anyway, in which case I'd just use a proper archive format...
| Those characters feel like they're trying to do too much
| given they're "competing" primarily with CSV/TSV.
|
| Their spec also needs to talk about encoding, because unless
| I've missed something, they only talk about codepoints, and
| they're likely to e.g. get people splitting on the UTF8
| sequence etc. This to me is another reason for using the
| ASCII values - they encode the same in ASCII-based character
| sets and UTF8, and so it feels likely to be more robust
| against the horrors of people doing naive split-based
| parsing.
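The encoding point can be checked directly. A short Python sketch comparing the ASCII unit separator with the Unicode control picture that USV uses:

```python
ASCII_US = "\x1f"     # ASCII unit separator (a control character)
SYMBOL_US = "\u241f"  # SYMBOL FOR UNIT SEPARATOR (a control picture)

# Encoded length in bytes of (ASCII separator, symbol) per encoding.
sizes = {
    enc: (len(ASCII_US.encode(enc)), len(SYMBOL_US.encode(enc)))
    for enc in ("utf-8", "utf-16-le")
}

# The ASCII separator is the same single byte in ASCII, Latin-1 and UTF-8.
same_byte = {ASCII_US.encode(enc) for enc in ("ascii", "latin-1", "utf-8")}

# Splitting raw UTF-8 bytes on the symbol means matching a 3-byte sequence.
symbol_bytes = SYMBOL_US.encode("utf-8")
```

A byte-oriented tool therefore has to know the text encoding before it can split on the symbol, whereas the control character is encoding-agnostic across the ASCII-compatible encodings.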
| strunz wrote:
| CSV isn't even restricted to comma as the separator. You
| can use any character you like (pipe | is a common one) and
| csvkit will happily still work with a simple CLI flag. Pretty
| much all Unix tools have a similar flag. I've always been
| able to find an ASCII character that my data doesn't use,
| though maybe there are exceptions I haven't hit.
| bdzr wrote:
| I love csvkit, particularly csvstat. I just wish it were
| quicker on larger files. The types I deal with routinely take
| 5-20 minutes to run and those are usually the ones I want the
| csvstat output for the most.
| pie_flavor wrote:
| It's all down to font differences. You would use the file with
| a font that uses larger letters diagonally, for control
| pictures, instead of tiny letters horizontally. And the main
| benefit isn't anything to do with the editor, I have no idea
| what you meant by that. The main benefit is that commas show up
| a lot more often in normal text than control pictures do.
| vidarh wrote:
| There's no space for larger letters diagonally unless I waste
| screen estate by increasing the font size, which I
| categorically will not do. So I'd need to replace a font I'm
| happy with and find one with _other symbols_ that are
| readable enough. In which case it's _just as easy_ and less
| invasive for me to adjust my editor to display them using
| different glyphs. In which case I can just as well do that
| with the actual ASCII control characters.
|
| The point is that their stated "advantage" does not exist for
| me. I still need to make changes to my setup to handle them.
| In which case why should I pick _this_ option? (as you can
| see elsewhere, especially as this isn't the only issue I
| have with their format choices).
|
| > And the main benefit isn't anything to do with the editor,
| I have no idea what you meant by that.
|
| The main benefit _relative to using the actual control
| characters_ is only the tool support, which does not
| work for me without making changes to how the symbols
| are displayed anyway. Hence that "advantage" does not
| actually buy me anything.
| derbOac wrote:
| I think you're articulating something about this proposal that
| bothers me.
|
| The thing about the _actual_ separators is that an editor could
| and should probably display them as they were intended, as data
| separators. It should be a setting in an editor you control,
| sort of like how you control tab width and things like that.
|
| Just because a glyph is "invisible" doesn't mean it has to
| actually be invisible.
|
| The symbols for the separators are hard to read, like you're
| pointing out, which means someone would eventually replace them
| with some other graphical display, in which case you were just
| as well off with the actual separators themselves.
|
| They would have been better off advocating for editor support
| for actual separator display.
| nilslice wrote:
| If you would like to run csv-to-usv from 15+ languages (not only
| rust!) then check out this demo I made, converting the library to
| an Extism plugin function:
| https://github.com/extism/extism-csv-to-usv
|
| Here's a snippet that runs it in your browser:
|     // Simple example to run this in your browser! But will work in
|     // Go, PHP, Ruby, Java, Python, etc...
|     const extism = await import("https://esm.sh/@extism/extism");
|     const plugin = await extism.createPlugin(
|       "https://cdn.modsurfer.dylibso.com/api/v1/module/a28e7322a6fde92cc27344584b5e86c211dbd5a345fe6ec95f1389733c325541.wasm",
|       { useWasi: false }
|     );
|     let out = await plugin.call("csv_to_usv", "a,b,c");
|     console.log(out.text());
| greenshackle2 wrote:
| I'm sorry but.. why? The library is a single function
| consisting of 10 lines of Rust code. And would be about 10 LOCs
| to re-implement in any language that has native csv libs. It
| seems a little bit unnecessary to load a WASM runtime for that.
| theamk wrote:
| But without WASM, how are you going to get 500ms+ startup
| time and a 3rd party server dependency in your critical
| path?
| philipwhiuk wrote:
| And two domains you're blindly trusting not to be hijacked.
| nilslice wrote:
| Sorry do you know what "demo" means?
| nilslice wrote:
| for sure -- do it!
| greenshackle2 wrote:
| I'm good I'm just here to chat, not to promote anything ;)
| philipwhiuk wrote:
| > esm.sh
|
| > cdn.modsurfer.dylibso.com
|
| Do people routinely do this - just run random code from
| arbitrary endpoints?
|
| Yikes
| SuperHeavy256 wrote:
| I've long wanted a successor to CSV, but this is kinda stupid.
| People like CSVs because they look good, feel natural even in
| plaintext. This is the same reason that Markdown is successful.
|
| As for including commas in your data, it could just have been
| managed with a simple escape character like a \, for when there's
| actually a comma in your data. That's it.
| hermitcrab wrote:
| >As for including commas in your data, it could just have been
| managed with a simple escape character like a \, for when
| there's actually a comma in your data. That's it.
|
| Not quite. What if there is a \ in your data? Then you have to
| escape that.
| lelanthran wrote:
| > Not quite. What if there is a \ in your data? Then you have
| to escape that.
|
| No problem, _any_ character following a `\` is a literal
| character. `\\` => literal `\`. `\,` => literal comma.
| `\a` => literal `a`, etc.
|
| Parsing this is easy, generating it is easy, and there is
| only _one rule_ to remember for humans reading or generating
| it.
|
| Each rule added for parsing is one more added complexity and
| point of failure.
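The one-rule scheme above maps onto a two-state scanner. A minimal Python sketch; the function names are hypothetical, records are assumed to arrive one per line, and a trailing lone escape character is silently dropped:

```python
def parse_line(line: str, sep: str = ",", esc: str = "\\") -> list[str]:
    """Split one record using a single rule: esc makes the next char literal."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)  # whatever follows the escape is literal
            escaped = False
        elif ch == esc:
            escaped = True
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

def format_line(fields, sep: str = ",", esc: str = "\\") -> str:
    """Escape only the separator and the escape character itself."""
    return sep.join(
        "".join(esc + ch if ch in (sep, esc) else ch for ch in field)
        for field in fields
    )
```

For example, `parse_line(r"a\,b,c\\,\d")` returns `["a,b", "c\\", "d"]`, and `format_line` followed by `parse_line` round-trips any list of single-line fields.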
| philipwhiuk wrote:
| You still have to solve this in USV.
| euroderf wrote:
| Or two commas in a row can be the escape, without overloading
| backslash.
| estebank wrote:
| That wouldn't allow for empty fields.
| euroderf wrote:
| Four commas in a row.
| greenshackle2 wrote:
| Can you parse this 2-row CSV:
|
| SomeCommas,MoreCommas,OnlyOneComma,ALotOfCommas
|
| ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
| ,,,,,,,,,,,,,,
| euroderf wrote:
| Pathological cases are never difficult to find for any
| app.
| greenshackle2 wrote:
| Wrong. Consider the standard backslash escape: represent
| literal comma as "\," and literal backslash as "\\".
| Backslashes are otherwise forbidden.
|
| It will be difficult to find pathological cases for this
| grammar, because they don't exist.
| blackbeans wrote:
| I don't see this as a perfect solution, but CSV is not great
| either. A comma is super common in both text and numbers. Here
| in Europe we often use commas as decimal separator and use a
| semicolon as value separator.
|
| As a result spreadsheets almost always fail to automatically
| parse a CSV.
|
| I do like the idea of having a dedicated separator character,
| that would work right worldwide. And then just standardize the
| use of a dot as decimal separator in these files.
| crq-yml wrote:
| It's sensible in principle:
|
| * Editors will play nicely with the graphical representation. If
| you need better graphics, it's done with font customization,
| which everyone already supports.
|
| * It announces that the data is source text, vs transmitted
| bytes. The type/token distinction is not easy to overcome.
|
| * It sits way out in Unicode's space where a collision is
| unlikely. The whole reason why CSV-type formats create
| frustration is because the tooling is ad-hoc, never does the
| right thing and uses the lower byte spaces where people stuff all
| kinds of random junk. This is the "fuck it, you get the same
| treatment as a Youtube video id" kind of solution.
|
| That said, if used, someone will attack it by printing those
| characters as input.
| evnix wrote:
| There are some well researched alternatives to CSV,
|
| Off the top of my head, I can highly recommend SML
|
| https://dev.stenway.com/SML/SimpleML.html
|
| Recommend watching the 'stop using CSV' video too
|
| https://youtu.be/mGUlW6YgHjE?si=zDG_9Jv8LSy-ttP4
| pimlottc wrote:
| > Is USV aiming to become a standard?
| >
| > Yes and we've submitted the first draft of the USV standard to
| > the IETF: link.
|
| This is a nice idea, and all, but seems unlikely to become a
| meaningful standard without some major backing behind that "we".
| justtinker wrote:
| This is the XKCD comic in action. https://xkcd.com/927/
|
| Someone should write a family of filters of the form CSV2ASV,
| CSV2USV, CSV2JSON ,USV2XML , TOML2USV, USV2Cuneiform.......
| hermitcrab wrote:
| Alternatives to CSV are also covered at length at:
|
| https://news.ycombinator.com/item?id=31220841
| isoprophlex wrote:
| This is just ESV files with extra complexity!
|
| ESV: eggplant-separated values. Because who is ever going to put
| AUBERGINE (U+1F346) into a dataset? It's the perfect record
| separator!
| difer7 wrote:
| Does USV support nested fields? While reading the USV GitHub's
| README I did not clearly understand the purpose of the "group
| separator"
| philsnow wrote:
| In the same way that CSV supports fields that contain nested
| CSV documents: cumbersomely / painfully, with lots of escaping
| of the delimiter characters.
| tamimio wrote:
| I am uncertain, but this is likely to reintroduce the issue of
| Unicode buffer overflow into the mainstream. What are your
| proposed solutions, considering it is expected to become
| standardized?
| nayuki wrote:
| Nope, this isn't a good approach. I prefer tab-separated values
| (TSV) and use it as much as possible.
| Fileformat wrote:
| A similar concept that is (IMHO) much nicer: RSV
|
| It doesn't need any escaping or quoting: a field just has to be
| valid UTF-8.
|
| The trick is that the delimiters are bytes that are invalid
| UTF-8.
|
| The spec fits on a napkin, parsing is trivial, you can jump to
| the middle of a doc and find the nearest row, etc.
|
| Main downside is you need an editor/viewer that can handle it.
|
| https://github.com/Stenway/RSV-Specification
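A sketch of the idea in Python. The three byte values below (0xFF ends a value, 0xFE marks a null, 0xFD ends a row) are my reading of the linked spec and should be treated as assumptions to verify there; what matters for the trick is that none of them can occur in valid UTF-8:

```python
# Assumed marker bytes; all are invalid anywhere in well-formed UTF-8.
VALUE_END, NULL, ROW_END = b"\xff", b"\xfe", b"\xfd"

def encode_rsv(rows):
    """Encode rows of str-or-None values; no escaping or quoting needed."""
    out = bytearray()
    for row in rows:
        for value in row:
            out += (NULL if value is None else value.encode("utf-8")) + VALUE_END
        out += ROW_END
    return bytes(out)

def decode_rsv(data: bytes):
    """Decode bytes produced by encode_rsv back into rows."""
    rows = []
    for raw_row in data.split(ROW_END)[:-1]:    # last piece is empty
        values = raw_row.split(VALUE_END)[:-1]  # every value is terminated
        rows.append([None if v == NULL else v.decode("utf-8") for v in values])
    return rows
```

Since field bytes are valid UTF-8 and the markers are not, commas, quotes, tabs, and newlines inside values need no special handling.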
| zzo38computer wrote:
| I have seen Unicode Separated Values. I don't like Unicode and I
| like USV even less. I like ASCII Separated Values, which
| can encode each separator as a single byte, and can be used with
| character encodings other than Unicode (and, even if you do use
| it with Unicode, does not prevent you from using the Unicode
| control pictures in your data; USV does prevent you from using
| those characters in your data even though the data is (allegedly)
| Unicode).
|
| What they say about display and input really depends on the
| specific editors and viewers that you are using (and perhaps on
| the fonts as well). When I use vi, I have no difficulty entering
| ASCII control characters in the text. However, there is also the
| problem with line breaking, with ASV and with USV, anyways; and
| they do mention this in the issues anyways.
|
| Fortunately, I can write a program to convert these formats
| without too much difficulty, even without implementing Unicode
| (since it is a fixed sequence of bytes that will need to be
| replaced; however, it does mean that it will need to read
| multiple bytes to figure out whether or not it is a record
| separator, which is not as simple as ASV).
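The byte-level conversion described above might look like this in Python. It is a sketch that assumes UTF-8 input and ignores USV escapes; it also does not check whether the data already contains ASCII control characters:

```python
# Fixed UTF-8 byte sequences of the USV symbols mapped to ASCII separators.
USV_TO_ASV = {
    "\u241f".encode("utf-8"): b"\x1f",  # unit separator
    "\u241e".encode("utf-8"): b"\x1e",  # record separator
    "\u241d".encode("utf-8"): b"\x1d",  # group separator
    "\u241c".encode("utf-8"): b"\x1c",  # file separator
}

def usv_to_asv(data: bytes) -> bytes:
    """Replace each three-byte symbol with its one-byte control character."""
    for symbol, control in USV_TO_ASV.items():
        data = data.replace(symbol, control)
    return data

converted = usv_to_asv("a\u241fb\u241f\u241e".encode("utf-8"))
```

Plain byte replacement is safe here because UTF-8 is self-synchronizing: in valid input, a symbol's three-byte sequence cannot appear as part of any other character.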
| philsnow wrote:
| > The Synchronous Idle (SYN) symbol is a heartbeat, and is
| > especially useful for streaming data, such as to keep a
| > connection alive.
| >
| > SYN tells the data reader that data streaming is still in
| > progress.
| >
| > SYN has no effect on the output content.
| >
| > Example of a unit that contains a Synchronous Idle:
| >
| > ab
|
| Why would this go in-band inside a document format? Just why? If
| you want keep-alives, use a kind of connection that supports out-
| of-band keepalives.
|
| If you download the same document twice, and the second time the
| server is heavily loaded (or it's waiting on some dependency, or
| whatever), presumably the server will helpfully generate some
| SYNs in the middle of the document to keep the connection alive
| (?), but now you've got the same document "spelled" two different
| ways, that won't checksum alike.
|
| SYN along with the weirdness of
|
| > Escape + [non-USV-special] character: the character is ignored
|
| means that you have arbitrarily many ways of writing
| semantically-same documents.
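The checksum concern can be demonstrated with a few lines of Python. The two documents below are invented, and `canonical` is a hypothetical normalisation that simply drops SYN symbols; it ignores escaping:

```python
import hashlib

SYN = "\u2416"  # SYMBOL FOR SYNCHRONOUS IDLE
US, RS = "\u241f", "\u241e"

# The same logical content, once without and once with an in-band heartbeat.
doc_plain = f"a{US}b{US}{RS}"
doc_with_heartbeat = f"a{US}{SYN}b{US}{RS}"

def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def canonical(text: str) -> str:
    """Assumed normalisation: strip SYN, which is said not to affect content."""
    return text.replace(SYN, "")
```

The raw digests differ even though the content is meant to be identical, so comparing or deduplicating such files requires a canonicalisation pass first.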
| paulddraper wrote:
| This entire thing is a solution in search of a problem, and
| this is the most obvious one.
|
| Why does a file format need a transport protocol?
|
| ---
|
| Existing transport protocols (TCP, QUIC) already provide this.
___________________________________________________________________
(page generated 2024-03-12 23:00 UTC)