hngopher.com

       [HN Gopher] RFC 4180: Common Format and MIME Type for CSV Files ...
       ___________________________________________________________________
        
       RFC 4180: Common Format and MIME Type for CSV Files (2005)
        
       Author : tosh
       Score  : 43 points
       Date   : 2021-06-03 10:02 UTC (12 hours ago)
        
 (HTM) web link (datatracker.ietf.org)
 (TXT) w3m dump (datatracker.ietf.org)
        
       | edoceo wrote:
       | Oh heck. I'm cited reference #5 (almost last) but the link to my
       | site is broken. I'll have to put a 301 on there.
        
       | aliasEli wrote:
       | Anyone who has written software to import CSV files knows that in
       | general it's a complete mess.
       | 
       | Also, geneticists have been forced to rename certain gene names
       | because Excel messed them up[0].
       | 
       | It would be nice if this standard would be generally accepted.
       | This standard is from 2005 and it is pretty clear that it is not
       | universally used.
       | 
       | [0] https://www.theregister.com/2020/08/06/excel_gene_names/
        
         | FranOntanaya wrote:
         | By that logic though, all text formats are a mess because text
         | processors will autocorrect and break the formatting.
        
         | dspillett wrote:
         | We've had similar issues with clients using 1-2-1 as a meeting
         | type name. When data about those records are exported to CSV
         | and loaded by Excel, it gets seen as the date 2001-02-01 (or,
         | presumably, 2001-01-02 for those using a US locale in their
         | setup) _even if the value is quoted_ (which we hoped would be
         | interpreted as  "this value is definitely nothing fancier than
         | a basic string", but it was not).
         | 
         |  _> This standard is from 2005 and it is pretty clear that it
         | is not universally used._
         | 
         | Unfortunately a lot of non-conforming software can't change or
         | they risk compatibility problems with old versions of
         | themselves. 2005 is relatively recent in the history of CSV
         | files.
        
           | viraptor wrote:
           | Opening CSV in excel rather than importing is pretty much
           | broken by design. Import data from text and you can set the
           | column types. Open by default and suffer consequences.
        
             | dspillett wrote:
             | You are not at all wrong, but unfortunately "you are
             | opening it wrong" did not fly with the client for us in
             | this case.
        
         | olau wrote:
         | Excel is the ultimate bane of CSV files. If Microsoft would fix
         | Excel and distribute the fixes to old Excels, then life would
         | be good again.
        
           | aliasEli wrote:
           | It is not just Excel, there are a lot of systems that export
           | CSV. Many of them are broken.
           | 
           | Examples:
           | 
           | - double quotes in text fields
           | 
           | - using the wrong decimal separator, should it be "," or "."
           | 
           | - using the wrong column separator, should be "," or ";"
        
           | chrismorgan wrote:
           | But they will never do that because that would break existing
           | systems that depend on the old behaviour.
        
             | wolrah wrote:
             | > But they will never do that because that would break
             | existing systems that depend on the old behaviour.
             | 
             | Which is an inevitability for anything that's been a
             | dependency for long enough.
             | 
             | Either you admit that things have to change and sometimes
             | shit that's been secretly broken for years will have to
             | change too, or you give in to the shit and admit that you
             | can never solve any problems that made it to production.
             | 
             | Both answers suck, but one is a lot more suck so I choose
             | the "fix it anyways and if something else turns out to have
             | depended on the brokeness then add that to the list of shit
             | to fix" approach whenever possible.
        
       | mattowen_uk wrote:
       | Ok, so now I know that text/csv is proper mime type, I have to
       | ask, why don't browsers offer a native way to view CSVs as a
       | [html] table complete with sorting and filtering?
        
       | gardaani wrote:
       | The reason I don't like CSV is that CSV files can use any text
       | encoding: ASCII, ISO-8859, UTF-8.. It's not nice to try to guess
       | the encoding when reading files.
       | 
       | RFC 4180 also fails to define the default text encoding. It seems
       | that text encoding can be given only when MIME type is given for
       | the file.
       | 
       | JSON is so much easier.
        
         | formerly_proven wrote:
         | Excel generates RFC-compliant CSV files _if_ you are using the
         | en-us /en-uk locales, but not for many other locales. Of
         | course, this is a system-wide setting, and Excel has no option
         | to generate a standard CSV independent of locale.
        
           | dspillett wrote:
           | _> the en-us /en-uk locales_
           | 
           | Though be careful if you communicate with people using both
           | locals. Things interpreted as dates will get corrupted in fun
           | ways.
        
           | shellac wrote:
           | I used Excel recently and it added a BOM, which is pretty
           | awful. Is there any way to stop that nonsense?
        
             | cpach wrote:
             | AFAIK: Nope. Microsoft loves that BOM, for reasons I cannot
             | comprehend.
        
               | andylynch wrote:
               | Almost certainly for backwards compatibility. Excel in
               | particular uses the BOM as a single to open a file in a
               | Unicode encoding, rather than cp1252 etc.
        
             | dbsmith83 wrote:
             | I use a script to strip it from the file. Definitely not
             | ideal.
        
         | yakovsh2 wrote:
         | Default encoding is now UTF-8, due to the way the RFC process
         | works it is not in the original document:
         | https://www.iana.org/assignments/media-types/text/csv
        
       | throw0101a wrote:
       | Sadly "ASCII delimited" text, using ASCI 28 to 31, doesn't seem
       | to be a thing:
       | 
       | * https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text
       | 
       | * http://en.wikipedia.org/wiki/Unit_separator
        
         | Nullabillity wrote:
         | I don't really see the point... Using special characters makes
         | it harder to manipulate by hand, and you still don't solve
         | conflicts (where content is able to corrupt the container
         | structure). Ultimately, the only safe way is to use Pascal
         | string-style length prefixes.
         | 
         | Escaping is too easy to ignore, until it comes back to bite you
         | in the ass. Balanced delimiters break as soon as the content
         | contains invalid data (or is delimited according to a different
         | convention than the container).
        
           | throw0101a wrote:
           | > Using special characters makes it harder to manipulate by
           | hand
           | 
           | I generally use CSV as an import/export format, and would
           | like for software to at least support the option (even if
           | CSV/TSV remains the default).
        
           | jrochkind1 wrote:
           | It's much less likely that content is going to include ASCII
           | 28 to 31 than that it will include a comma or a tab. But
           | you're right, certainly not impossible.
           | 
           | I think the human-editability you mention may be a major
           | reason why comma-separated has been such a practical winner?
           | 
           | Since ASCII 28 to 31 are invisible or indistinguishable in
           | most editors, I suspect a format relying on them would be
           | subject to routine corruption.
           | 
           | (I have some experience working with a "legacy" "binary"
           | format which DOES use those ASCII separator values...
           | https://en.wikipedia.org/wiki/MARC_standards)
        
             | Nullabillity wrote:
             | > It's much less likely that content is going to include
             | ASCII 28 to 31 than that it will include a comma or a tab.
             | But you're right, certainly not impossible.
             | 
             | Yeah, that's the same problem as UTF-8's almost-
             | compatibility with ASCII. Things work, until they break
             | spectacularly.
             | 
             | Ultimately, you either have to solve the issue properly, or
             | keep it visible enough that people are forced to handle it.
        
             | slver wrote:
             | In machine encoding formats, "much less likely" is the same
             | as "very likely". You're either OK for something to happen
             | or you categorically make it impossible to happen.
             | 
             | Invisible characters also mean you can no longer call a
             | data format "human readable".
             | 
             | The special ASCII characters you're referring to are mostly
             | a historical oddity today.
        
               | jrochkind1 wrote:
               | How much have you worked with actually existing CSV data?
               | If it was as easy as "making it categorically impossible
               | to happen", working with CSV wouldn't be such a mess! You
               | can say whatever you want in whatever standard you want,
               | data bugs still happen.
               | 
               | Any CSV format description anywhere, will of course you
               | need to use various internal escaping mechanisms for a
               | literal comma, or forbid them entirely. Does that
               | "categorically make it impossible to happen"? Obviously
               | not, words in a standard can't make something
               | categorically impossible. How likely a protocol/standard
               | violation is to happen of course matters to actual
               | practical pain.
               | 
               | Yes, of course the ascii separator values are a
               | historical path not taken. This thread, not started by
               | me, was musing about why, and what-if, by people who,
               | believe it or not, have plenty of experience dealing with
               | data, and bad data.
        
           | Pet_Ant wrote:
           | Manipulating by hand is easily solved with alt sequences and
           | adding support to distinctly display those characters to text
           | editors. It's easier than figuring out the corner cases of
           | YAML.
        
           | slver wrote:
           | > Escaping is too easy to ignore
           | 
           | This is why we have editors that do syntax checking as we
           | type. Editing CSV by hand is just not important enough for
           | such editors to be commonplace. CSV files are exported and
           | imported.
           | 
           | And since we have JSON, whose spec is stable, known and at
           | this point consistently implemented, CSV should really just
           | be sun set rather than "fixed".
           | 
           | Also, the simplest escaping format is wrap in quotes and
           | represent quotes by typing quotes twice. That's easy to
           | remember and relatively hard to screw up even when editing by
           | hand.
        
           | chriswarbo wrote:
           | > Ultimately, the only safe way is to use Pascal string-style
           | length prefixes.
           | 
           | Note that length-prefixes aren't necessarily safer when
           | _nested_. For example, `line(10 bytes, [cell(1000 bytes,
           | "xxxxxxx...`. This can lead to vulnerabilities like buffer
           | overflows (naively allocating 10 bytes for the line, then
           | trying to put 1000 bytes into it)
        
             | Nullabillity wrote:
             | Sure, without bounds-checking all hope is lost, but then
             | you're in a hopeless place _anyway_.
        
           | simongr3dal wrote:
           | If you're editing by hand then length prefixes are also
           | broken easily and counting the length of strings is
           | cumbersome unless you're using a "technical" text editor like
           | VS Code in which case its just as easy to write a delimiter
           | like \x1F (unit separator) as it is to highlight the text and
           | look at character count.
           | 
           | Unless "editing by hand" means using something like excel?
        
             | Nullabillity wrote:
             | There's a difference between "take on this inconvenience
             | and gain nothing" and "take on this inconvenience in
             | exchange for solving this problem".
        
       | sonjaqql wrote:
       | This is timely for me - I was just sharing this 2014 article with
       | my team.
       | 
       | http://thomasburette.com/blog/2014/05/25/so-you-want-to-writ...
       | 
       | It made an impact on my approach to CSV handling, and helped me
       | understand, "just because I can, doesn't mean I should".
        
       | temp8964 wrote:
       | I recommend CSVY (CSV with YAML frontmatter, https://csvy.org/),
       | which solves several problems in CSV.
       | 
       | Currently R data.table supports read and write csvy format.
       | https://www.rdocumentation.org/packages/data.table/versions/...
        
         | slver wrote:
         | CSV in the abstract has no problems. The problem is two tools
         | attempting to interact by using subtly incompatible versions of
         | CSV. And CSVY just adds one more of those.
        
       | yakovsh2 wrote:
       | (I'm the author of the RFC)
       | 
       | Not sure why this is trending now, but this RFC is being revised
       | now: https://datatracker.ietf.org/doc/html/draft-shafranovich-
       | rfc...
       | 
       | Suggestions and comments are welcome here:
       | https://github.com/nightwatchcybersecurity/rfc4180-bis
        
         | billpg wrote:
         | Part of me thinks that a new CSV standard with technical
         | breaking changes (lines starting with #) is not needed because
         | we have JSON array-of-arrays files.
         | 
         | The other part thinks this is quite cool and wishes your
         | efforts well.
        
         | jtvjan wrote:
         | I think we should just stop using commas and newlines and start
         | using the ASCII unit separator and record separator. It would
         | alleviate most quoting and escaping issues because those two
         | rarely appear in actual text.
        
           | teknopaul wrote:
           | significant whitespace is nasty, by definition its hard to
           | see errors. Not convenient if you can't edit the format with
           | a normal keyboard.
        
           | teddyh wrote:
           | Sure, maybe, but that won't be CSV.
        
             | Jiocus wrote:
             | Maybe that would be an "SSV"? Separator Separated Values
             | file.
        
           | SahAssar wrote:
           | That would just make it easier to implement the standard in a
           | way that works most of the time but breaks on valid input.
           | Why would that be desirable?
        
           | jpalomaki wrote:
           | On the other hand, if the characters still technically can
           | exist in the content, the rareness just makes the problems
           | harder to spot.
           | 
           | In this sense sticking to a relatively common separators is
           | good, because they encourage you to do the right thing from
           | the start.
        
           | colejohnson66 wrote:
           | Until I can type those ASCII separators on my keyboard,
           | they're not going to happen. That's why CSV won out: the
           | comma key already existed.
        
         | ape4 wrote:
         | The new RFC allows for names for the fields: "3. The first
         | record in the file MAY be an optional header with the same
         | format as normal records. This header will contain names
         | corresponding to the fields in the file" Is there a reliable
         | way to tell if this first record has names or actual data?
        
           | gfody wrote:
           | iirc text/csv; header=present
        
             | ape4 wrote:
             | Oh that's new to me! But not available after downloading of
             | course.
        
         | contravariant wrote:
         | Oh interesting it includes a specification for comments. That
         | is likely going to be the most controversial part.
         | 
         | Does any implementation support/generate comments that way? The
         | most I've seen so far is an oversized multiline header.
        
           | chriswarbo wrote:
           | Wadler's Law strikes again https://wiki.c2.com/?WadlersLaw
        
         | [deleted]
        
         | slver wrote:
         | I feel the simple recommendation for all CSV usecases should be
         | "use JSON".
        
         | Grollicus wrote:
         | Is Microsoft somehow on board for this? In my experience there
         | are CSV files and there are CSV files that excel understands
         | and as long as you don't get your file format into excel that
         | will seriously hinder adoption.
        
           | miffe wrote:
           | In my experience excel doesn't even understand its own csv
           | files. If you save one using the Swedish locale it uses
           | semicolons as field separators since comma is used as decimal
           | points. Trying to open the resulting file using a English
           | locale results in garbage.
        
             | sheetjs wrote:
             | When saving as CSV, Excel will use the regional "List
             | separator" setting. You can override this in Windows 7 with
             | Region and Language > Additional Settings > List separator.
             | 
             | If you are trying to generate a file that plays nice with
             | Excel, there is a way to force a specific delimiter with
             | the "sep" pragma:                   sep=|         a|b|c
             | 1|2|3
        
               | teknopaul wrote:
               | Please give up on that and make csv itself nice and
               | simple. :) Let's have it err comma separated.
               | 
               | Simple is good.
               | 
               | If it's really simple even Microsoft will be able to
               | implement it.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2021-06-03 23:02 UTC)