[HN Gopher] RFC 4180: Common Format and MIME Type for CSV Files ...
___________________________________________________________________
RFC 4180: Common Format and MIME Type for CSV Files (2005)
Author : tosh
Score : 43 points
Date : 2021-06-03 10:02 UTC (12 hours ago)
(HTM) web link (datatracker.ietf.org)
(TXT) w3m dump (datatracker.ietf.org)
| edoceo wrote:
| Oh heck. I'm cited reference #5 (almost last) but the link to my
| site is broken. I'll have to put a 301 on there.
| aliasEli wrote:
| Anyone who has written software to import CSV files knows that in
| general it's a complete mess.
|
| Also, geneticists have been forced to rename certain gene names
| because Excel messed them up[0].
|
| It would be nice if this standard would be generally accepted.
| This standard is from 2005 and it is pretty clear that it is not
| universally used.
|
| [0] https://www.theregister.com/2020/08/06/excel_gene_names/
| FranOntanaya wrote:
| By that logic though, all text formats are a mess because text
| processors will autocorrect and break the formatting.
| dspillett wrote:
| We've had similar issues with clients using 1-2-1 as a meeting
| type name. When data about those records are exported to CSV
| and loaded by Excel, it gets seen as the date 2001-02-01 (or,
| presumably, 2001-01-02 for those using a US locale in their
| setup) _even if the value is quoted_ (which we hoped would be
| interpreted as "this value is definitely nothing fancier than
| a basic string", but it was not).
|
| _> This standard is from 2005 and it is pretty clear that it
| is not universally used._
|
| Unfortunately a lot of non-conforming software can't change or
| they risk compatibility problems with old versions of
| themselves. 2005 is relatively recent in the history of CSV
| files.
| viraptor wrote:
| Opening CSV in excel rather than importing is pretty much
| broken by design. Import data from text and you can set the
| column types. Open by default and suffer consequences.
| dspillett wrote:
| You are not at all wrong, but unfortunately "you are
| opening it wrong" did not fly with the client for us in
| this case.
| olau wrote:
| Excel is the ultimate bane of CSV files. If Microsoft would fix
| Excel and distribute the fixes to old Excels, then life would
| be good again.
| aliasEli wrote:
| It is not just Excel, there are a lot of systems that export
| CSV. Many of them are broken.
|
| Examples:
|
| - double quotes in text fields
|
| - using the wrong decimal separator, should it be "," or "."
|
| - using the wrong column separator, should be "," or ";"
| chrismorgan wrote:
| But they will never do that because that would break existing
| systems that depend on the old behaviour.
| wolrah wrote:
| > But they will never do that because that would break
| existing systems that depend on the old behaviour.
|
| Which is an inevitability for anything that's been a
| dependency for long enough.
|
| Either you admit that things have to change and sometimes
| shit that's been secretly broken for years will have to
| change too, or you give in to the shit and admit that you
| can never solve any problems that made it to production.
|
| Both answers suck, but one is a lot more suck so I choose
| the "fix it anyways and if something else turns out to have
| depended on the brokeness then add that to the list of shit
| to fix" approach whenever possible.
| mattowen_uk wrote:
| Ok, so now I know that text/csv is proper mime type, I have to
| ask, why don't browsers offer a native way to view CSVs as a
| [html] table complete with sorting and filtering?
| gardaani wrote:
| The reason I don't like CSV is that CSV files can use any text
| encoding: ASCII, ISO-8859, UTF-8.. It's not nice to try to guess
| the encoding when reading files.
|
| RFC 4180 also fails to define the default text encoding. It seems
| that text encoding can be given only when MIME type is given for
| the file.
|
| JSON is so much easier.
| formerly_proven wrote:
| Excel generates RFC-compliant CSV files _if_ you are using the
| en-us /en-uk locales, but not for many other locales. Of
| course, this is a system-wide setting, and Excel has no option
| to generate a standard CSV independent of locale.
| dspillett wrote:
| _> the en-us /en-uk locales_
|
| Though be careful if you communicate with people using both
| locals. Things interpreted as dates will get corrupted in fun
| ways.
| shellac wrote:
| I used Excel recently and it added a BOM, which is pretty
| awful. Is there any way to stop that nonsense?
| cpach wrote:
| AFAIK: Nope. Microsoft loves that BOM, for reasons I cannot
| comprehend.
| andylynch wrote:
| Almost certainly for backwards compatibility. Excel in
| particular uses the BOM as a single to open a file in a
| Unicode encoding, rather than cp1252 etc.
| dbsmith83 wrote:
| I use a script to strip it from the file. Definitely not
| ideal.
| yakovsh2 wrote:
| Default encoding is now UTF-8, due to the way the RFC process
| works it is not in the original document:
| https://www.iana.org/assignments/media-types/text/csv
| throw0101a wrote:
| Sadly "ASCII delimited" text, using ASCI 28 to 31, doesn't seem
| to be a thing:
|
| * https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text
|
| * http://en.wikipedia.org/wiki/Unit_separator
| Nullabillity wrote:
| I don't really see the point... Using special characters makes
| it harder to manipulate by hand, and you still don't solve
| conflicts (where content is able to corrupt the container
| structure). Ultimately, the only safe way is to use Pascal
| string-style length prefixes.
|
| Escaping is too easy to ignore, until it comes back to bite you
| in the ass. Balanced delimiters break as soon as the content
| contains invalid data (or is delimited according to a different
| convention than the container).
| throw0101a wrote:
| > Using special characters makes it harder to manipulate by
| hand
|
| I generally use CSV as an import/export format, and would
| like for software to at least support the option (even if
| CSV/TSV remains the default).
| jrochkind1 wrote:
| It's much less likely that content is going to include ASCII
| 28 to 31 than that it will include a comma or a tab. But
| you're right, certainly not impossible.
|
| I think the human-editability you mention may be a major
| reason why comma-separated has been such a practical winner?
|
| Since ASCII 28 to 31 are invisible or indistinguishable in
| most editors, I suspect a format relying on them would be
| subject to routine corruption.
|
| (I have some experience working with a "legacy" "binary"
| format which DOES use those ASCII separator values...
| https://en.wikipedia.org/wiki/MARC_standards)
| Nullabillity wrote:
| > It's much less likely that content is going to include
| ASCII 28 to 31 than that it will include a comma or a tab.
| But you're right, certainly not impossible.
|
| Yeah, that's the same problem as UTF-8's almost-
| compatibility with ASCII. Things work, until they break
| spectacularly.
|
| Ultimately, you either have to solve the issue properly, or
| keep it visible enough that people are forced to handle it.
| slver wrote:
| In machine encoding formats, "much less likely" is the same
| as "very likely". You're either OK for something to happen
| or you categorically make it impossible to happen.
|
| Invisible characters also mean you can no longer call a
| data format "human readable".
|
| The special ASCII characters you're referring to are mostly
| a historical oddity today.
| jrochkind1 wrote:
| How much have you worked with actually existing CSV data?
| If it was as easy as "making it categorically impossible
| to happen", working with CSV wouldn't be such a mess! You
| can say whatever you want in whatever standard you want,
| data bugs still happen.
|
| Any CSV format description anywhere, will of course you
| need to use various internal escaping mechanisms for a
| literal comma, or forbid them entirely. Does that
| "categorically make it impossible to happen"? Obviously
| not, words in a standard can't make something
| categorically impossible. How likely a protocol/standard
| violation is to happen of course matters to actual
| practical pain.
|
| Yes, of course the ascii separator values are a
| historical path not taken. This thread, not started by
| me, was musing about why, and what-if, by people who,
| believe it or not, have plenty of experience dealing with
| data, and bad data.
| Pet_Ant wrote:
| Manipulating by hand is easily solved with alt sequences and
| adding support to distinctly display those characters to text
| editors. It's easier than figuring out the corner cases of
| YAML.
| slver wrote:
| > Escaping is too easy to ignore
|
| This is why we have editors that do syntax checking as we
| type. Editing CSV by hand is just not important enough for
| such editors to be commonplace. CSV files are exported and
| imported.
|
| And since we have JSON, whose spec is stable, known and at
| this point consistently implemented, CSV should really just
| be sun set rather than "fixed".
|
| Also, the simplest escaping format is wrap in quotes and
| represent quotes by typing quotes twice. That's easy to
| remember and relatively hard to screw up even when editing by
| hand.
| chriswarbo wrote:
| > Ultimately, the only safe way is to use Pascal string-style
| length prefixes.
|
| Note that length-prefixes aren't necessarily safer when
| _nested_. For example, `line(10 bytes, [cell(1000 bytes,
| "xxxxxxx...`. This can lead to vulnerabilities like buffer
| overflows (naively allocating 10 bytes for the line, then
| trying to put 1000 bytes into it)
| Nullabillity wrote:
| Sure, without bounds-checking all hope is lost, but then
| you're in a hopeless place _anyway_.
| simongr3dal wrote:
| If you're editing by hand then length prefixes are also
| broken easily and counting the length of strings is
| cumbersome unless you're using a "technical" text editor like
| VS Code in which case its just as easy to write a delimiter
| like \x1F (unit separator) as it is to highlight the text and
| look at character count.
|
| Unless "editing by hand" means using something like excel?
| Nullabillity wrote:
| There's a difference between "take on this inconvenience
| and gain nothing" and "take on this inconvenience in
| exchange for solving this problem".
| sonjaqql wrote:
| This is timely for me - I was just sharing this 2014 article with
| my team.
|
| http://thomasburette.com/blog/2014/05/25/so-you-want-to-writ...
|
| It made an impact on my approach to CSV handling, and helped me
| understand, "just because I can, doesn't mean I should".
| temp8964 wrote:
| I recommend CSVY (CSV with YAML frontmatter, https://csvy.org/),
| which solves several problems in CSV.
|
| Currently R data.table supports read and write csvy format.
| https://www.rdocumentation.org/packages/data.table/versions/...
| slver wrote:
| CSV in the abstract has no problems. The problem is two tools
| attempting to interact by using subtly incompatible versions of
| CSV. And CSVY just adds one more of those.
| yakovsh2 wrote:
| (I'm the author of the RFC)
|
| Not sure why this is trending now, but this RFC is being revised
| now: https://datatracker.ietf.org/doc/html/draft-shafranovich-
| rfc...
|
| Suggestions and comments are welcome here:
| https://github.com/nightwatchcybersecurity/rfc4180-bis
| billpg wrote:
| Part of me thinks that a new CSV standard with technical
| breaking changes (lines starting with #) is not needed because
| we have JSON array-of-arrays files.
|
| The other part thinks this is quite cool and wishes your
| efforts well.
| jtvjan wrote:
| I think we should just stop using commas and newlines and start
| using the ASCII unit separator and record separator. It would
| alleviate most quoting and escaping issues because those two
| rarely appear in actual text.
| teknopaul wrote:
| significant whitespace is nasty, by definition its hard to
| see errors. Not convenient if you can't edit the format with
| a normal keyboard.
| teddyh wrote:
| Sure, maybe, but that won't be CSV.
| Jiocus wrote:
| Maybe that would be an "SSV"? Separator Separated Values
| file.
| SahAssar wrote:
| That would just make it easier to implement the standard in a
| way that works most of the time but breaks on valid input.
| Why would that be desirable?
| jpalomaki wrote:
| On the other hand, if the characters still technically can
| exist in the content, the rareness just makes the problems
| harder to spot.
|
| In this sense sticking to a relatively common separators is
| good, because they encourage you to do the right thing from
| the start.
| colejohnson66 wrote:
| Until I can type those ASCII separators on my keyboard,
| they're not going to happen. That's why CSV won out: the
| comma key already existed.
| ape4 wrote:
| The new RFC allows for names for the fields: "3. The first
| record in the file MAY be an optional header with the same
| format as normal records. This header will contain names
| corresponding to the fields in the file" Is there a reliable
| way to tell if this first record has names or actual data?
| gfody wrote:
| iirc text/csv; header=present
| ape4 wrote:
| Oh that's new to me! But not available after downloading of
| course.
| contravariant wrote:
| Oh interesting it includes a specification for comments. That
| is likely going to be the most controversial part.
|
| Does any implementation support/generate comments that way? The
| most I've seen so far is an oversized multiline header.
| chriswarbo wrote:
| Wadler's Law strikes again https://wiki.c2.com/?WadlersLaw
| [deleted]
| slver wrote:
| I feel the simple recommendation for all CSV usecases should be
| "use JSON".
| Grollicus wrote:
| Is Microsoft somehow on board for this? In my experience there
| are CSV files and there are CSV files that excel understands
| and as long as you don't get your file format into excel that
| will seriously hinder adoption.
| miffe wrote:
| In my experience excel doesn't even understand its own csv
| files. If you save one using the Swedish locale it uses
| semicolons as field separators since comma is used as decimal
| points. Trying to open the resulting file using a English
| locale results in garbage.
| sheetjs wrote:
| When saving as CSV, Excel will use the regional "List
| separator" setting. You can override this in Windows 7 with
| Region and Language > Additional Settings > List separator.
|
| If you are trying to generate a file that plays nice with
| Excel, there is a way to force a specific delimiter with
| the "sep" pragma: sep=| a|b|c
| 1|2|3
| teknopaul wrote:
| Please give up on that and make csv itself nice and
| simple. :) Let's have it err comma separated.
|
| Simple is good.
|
| If it's really simple even Microsoft will be able to
| implement it.
| [deleted]
___________________________________________________________________
(page generated 2021-06-03 23:02 UTC)