[HN Gopher] A straightforward way to extend CSV with metadata
       ___________________________________________________________________
        
       A straightforward way to extend CSV with metadata
        
       Author : todsacerdoti
       Score  : 38 points
       Date   : 2021-08-19 16:02 UTC (6 hours ago)
        
 (HTM) web link (gist.github.com)
 (TXT) w3m dump (gist.github.com)
        
       | [deleted]
        
       | prpl wrote:
       | There is CSVW
        
         | lolive wrote:
          | Thanks for showing that a little semantics can go a long way.
        
       | a9h74j wrote:
        | Could this be too general a problem to "solve" without
        | narrowing down major use-cases and industries, and likely
        | first-adopters? Will there need to be different
        | evolutionary/adoption paths? Will Postel's law need to play a
        | role?
       | 
       | A priori, the design space I'd want to look at is something
       | between protobuf and csv. Perhaps optionally at the end of each
       | line [somewhere non-breaking] you add metadata _for that line_ ,
       | including specifying a parser implementation you know can perform
       | perfectly.
       | 
        | The OP's suggestion might address this if metadata could
        | apply to particular ranges of lines. It would be more fragile
        | to the extent of being in a different file.
        
       | newbamboo wrote:
       | I thought the "time to retire csv?" thread was tongue in cheek
       | satire, but it seems real people think it can happen. Maybe my
       | thinking is too static, but I feel like as long as a delimited
       | string is a thing, csv will remain ubiquitous. Everything is meta
       | data and also not meta data.
        
       | karmakaze wrote:
       | Anything that requires multiple files to be passed around
       | together is a non-starter. Even with just one file it's so hard
       | for people to keep track of the 'current' one.
        
         | bachmeier wrote:
         | I put any metadata in the first lines of the file. You can
         | trivially have R skip those lines.
        
           | munk-a wrote:
           | Even better, you can just bake that metadata into the column
            | headers. CSVs have no mechanism or widely accepted
            | convention for skipping over entries - it's safer and
            | more portable to bake
           | any control data you need into the place reserved for it -
           | the header line.
        
         | larrydag wrote:
          | Thinking the same thing. CSV is already a text file. Why
          | not just put the metadata in the file itself? Call it a CSF
          | file or something like that.
        
         | [deleted]
        
         | sigzero wrote:
          | Yeah, if you are going to change the spec, just put the
          | format in the first lines of the file:
          |
          |     %csvddfVersion: 1.2
          |     %delimiter: ","
          |     %doubleQuote: true
          |     %lineTerminator: "\r\n"
          |     %quoteChar: "\""
          |     %skipInitialSpace: true
          |     %header: true
          |     %commentChar: "#"
          |     %mandatory: address name
          |     %type: address:string
          |     %type: name:string
          |     %type: birthday:date
          |     %type: telephone:number
          |
          |     address,name,birthday,telephone
          |     177A Bleecker Street,Stephen Strange,1963-01-01
         | 
         | Or something like that, where everything up to the first blank
         | line is the schema.
        
         | swader999 wrote:
          | Agree, but there are whole industries that work this way.
          | Looking at you, banking and finance.
        
         | [deleted]
        
         | hinkley wrote:
         | I think that's why the archive is part of the format. But then
         | you have a process change where you expect people to work with
         | the _archive_ and not the contents. That can be done but it
         | requires some institutional will.
         | 
         | For instance in a code signing situation it was socialized that
         | only the archive was 'safe' and you used one of several tools
         | to build them or crack them open. However that company was
         | already used to thinking about physical shipping manifests and
         | so learning by analogy worked, after a fashion.
        
           | karmakaze wrote:
            | That's a distinction without a difference. Either it's a
            | binary format - not a CSV - or it's a CSV file plus
            | another file. You can't have it both ways.
        
       | delusional wrote:
       | Amazing, a standard to define the characters we couldn't
       | standardize on.
       | 
        | If you can dictate this file format, couldn't you just
        | normalize the CSV file instead?
        
         | barbazoo wrote:
         | I think they're simply acknowledging the fact that there are
         | many permutations of formats out there and what they propose is
         | a way to communicate what your particular format is without
         | abandoning CSV completely.
         | 
         | I'm assuming it's in response to the post yesterday that
         | outlined all the things that are wrong with CSV and how other
         | formats like Parquet are better.
        
           | hinkley wrote:
           | The chaos of CSV files is older than XML. And while many of
           | the same issues have played out in HTML, I have a strong
           | suspicion that the particular levels of hell familiar to
           | anyone using CSV files professionally informed some of the
           | opinionated nature of XML. Especially since XML delves into
           | data transport instead of just typesetting of text meant for
           | human consumption.
           | 
           | I know I had specific conversations about CSV versus XML and
           | those referred to a substantial body of literature on the
           | topic.
        
       | thebelal wrote:
        | An earlier attempt at this is http://csvy.org/, but AFAIK it
       | never really caught on.
        
       | dec0dedab0de wrote:
       | If we're going to change the format, why not just use the record
       | separator, and field separator characters. They've been around
       | since ascii.
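For what it's worth, a sketch of that approach: ASCII 0x1E (Record Separator) and 0x1F (Unit Separator) have been reserved for exactly this since the 1960s, and because they can't appear in ordinary text fields, no quoting or escaping rules are needed at all.

```python
# ASCII 0x1E is the Record Separator, 0x1F the Unit Separator.
RS, US = "\x1e", "\x1f"

def dumps(rows):
    # One record per RS, one field per US; no quoting rules at all.
    return RS.join(US.join(fields) for fields in rows)

def loads(data):
    return [record.split(US) for record in data.split(RS)]

rows = [["name", "note"], ["Miami", 'has, commas and "quotes"']]
assert loads(dumps(rows)) == rows  # round-trips without escaping
```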
        
         | floatingatoll wrote:
         | People don't know how to type them on a US keyboard, so it'll
         | never pass the "if I can't hand-write this in ed(1), it isn't a
         | valid file format" hacker adoption barrier without someone of
         | Google's size to force us all to adopt it (e.g. protobuf).
        
           | dec0dedab0de wrote:
           | Fair enough, but imagine how much easier things would have
           | been if we had those two extra keys on our keyboards for the
           | last 40 years.
        
             | floatingatoll wrote:
             | I believe you could type it in DOS with the same trick that
             | works today: Alt-028, though nowadays you have to say
             | Alt-001E due to Unicode. (Actual ascii/uni codes may vary.)
             | 
             | Perl6 tried incorporating non-US-keyboard characters into
             | the language and that went very badly. I'm sure it works
             | fine for Perl6 people, but beyond that boundary, I still
              | encounter people who can't type é on a Mac with the
             | keyboard alone today, much less handle Alt-001E. So I am
             | extremely pessimistic.
        
         | [deleted]
        
         | Zizizizz wrote:
          | We actually use these as part of bulk loading CSVs into SQL
          | Server 2016, because BULK INSERT can't distinguish commas
          | inside a field from the comma used as the delimiter. A pain
          | in the ass to debug with, though.
        
         | timwis wrote:
         | Interesting, hadn't heard of these! My guess is the complexity
         | of parsing csv consistently extends a bit beyond that (e.g.
         | handling line breaks), but the bigger problem is getting enough
         | traction behind any solution, no matter how spot on it is, and
         | outpacing csv. Not to mention csv is a far catchier acronym
         | than "rssv" :P
        
       | [deleted]
        
       | roberto wrote:
        | Why JSON? I don't see any benefit in storing `format.txt` as
        | JSON (especially in a file with the .txt extension). It just
        | makes editing harder.
       | 
        | This would be better, IMHO:
        |
        |     csvddfVersion: 1.2
        |     delimiter: ";"
        |     doubleQuote: true
        |     lineTerminator: "\r\n"
        |     quoteChar: "\""
        |     skipInitialSpace: true
        |     header: true
        |     commentChar: "#"
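Parsing such a flat "key: value" dialect file takes only a few lines. A sketch (the key names follow the CSV Dialect spec referenced in the thread; the parsing code itself is an assumption):

```python
import json

def parse_dialect(text):
    """Read 'key: value' lines; values reuse JSON literals so quoted
    strings like "\r\n" and booleans like true parse naturally."""
    dialect = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, raw = line.partition(":")
        try:
            dialect[key.strip()] = json.loads(raw.strip())
        except ValueError:
            dialect[key.strip()] = raw.strip()
    return dialect

d = parse_dialect('csvddfVersion: 1.2\ndelimiter: ";"\ndoubleQuote: true')
```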
        
         | mtVessel wrote:
         | Because what better way to configure a parser than to require
         | it to use a completely different parser for configuration? /s
        
         | VenTatsu wrote:
         | I thought the same thing at first glance, but it looks like the
         | author is just reusing an existing spec
         | https://specs.frictionlessdata.io/csv-dialect/
        
       | Hackbraten wrote:
       | Can we please have a third file in the zip?
       | 
       | * format.txt
       | 
       | * mydata.csv
       | 
       | * .DS_Store
       | 
       | It will be there if the zip was created on a Mac so might as well
       | include it in the standard.
        
         | munk-a wrote:
          | No, we can't - since people who aren't on Macs may
          | potentially need to use this standard, and that metadata
          | descriptor file has failed to gain any adoption traction
          | outside of Apple.
          |
          | That said - all sorts of programs that bundle files inject
          | random metadata files, and if there were a codified
          | standard it should be flexible enough to ignore these files
          | (possibly agreeing to never require any file that leads
          | with a `.`, since most of the random junk that gets thrown
          | into zip files at least consistently follows that habit).
        
         | kaeruct wrote:
         | I cannot tell if this is a joke or not.
         | 
         | If you are serious, then what about just ignoring any files in
         | the zip that are not specified in the standard?
        
           | teknopaul wrote:
            | The point is to proliferate any decisions taken by Apple,
            | because Apple = God and therefore such decisions are
            | beyond human understanding.
            |
            | You can't reason your way out of Apple being right
            | _always_.
        
       | fitba72 wrote:
       | Interesting. Bridges the gap between old school and new school.
       | Us old school folks have to walk the fine line between extinction
       | or having some legacy in the DNA :)
        
       | bitwize wrote:
       | Single file with an open standard binary format based on Protobuf
       | or CapnProto. This idea of using text files by default is ancient
       | Unix cruft. Use binary formats unless you have a damn good reason
       | not to, always, and release open-source tools for reading and
       | processing files in your format. Then you can do some semblance
       | of enforcing structure and types on your data.
        
       | swader999 wrote:
       | Some notion of control data should be added to this spec. Total
       | lines to parse and so on could be included. Perhaps a hash...
        
       | omarhaneef wrote:
       | I see a similar pattern in electronic health records startups,
       | javascript frameworks, backends for APIs and many other domains:
       | 
       | 1. There are way too many standards, we have n standards
       | 
       | 2. The ecosystem is fragmented and we have to let them talk to
       | each other
       | 
       | 3. I think we should solve it by introducing our new standard
       | 
       | 4. We now have n+1 standards
       | 
       | And I am not saying I have solution.
        
       | teknopaul wrote:
        | Metadata in a file is as annoying as a BOM. Make decisions
        | about how you plan to interpret woolly specs outside the
        | file, and do this for all files. Problem solved, with no
        | code.
        
       | slim wrote:
        | There's already a common way to add metadata to CSV: use the
        | first row. It is used for column names, but we can extend it
        | to add any metadata. It would be easy to implement in current
        | parsers, and it would fail in a predictable, if not graceful,
        | way that is probably already accounted for (i.e. garbage in
        | the first row).
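A sketch of what that could look like: a first row of key=value cells that an unaware parser simply reads as an odd header row. The key=value convention here is hypothetical.

```python
import csv
import io

text = ("encoding=utf-8,delimiter=comma\n"
        "address,name\n"
        "177A Bleecker Street,Stephen Strange\n")

rows = list(csv.reader(io.StringIO(text)))
# A metadata-aware reader peels off the first row when every cell
# looks like key=value; anything else is treated as a normal header.
if all("=" in cell for cell in rows[0]):
    meta = dict(cell.split("=", 1) for cell in rows[0])
    header, data = rows[1], rows[2:]
else:
    meta, header, data = {}, rows[0], rows[1:]
```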
        
       | peanut_worm wrote:
        | This seems kind of silly. So you need two parsers now and two
        | files; you might as well just put your data in JSON.
        
       | dreyfan wrote:
        | Drop the notion of packaging them together and you have a
        | winning idea: an optional metadata file that may or may not
        | be present alongside the exact same CSV we've always had.
        | Systems that implement the metadata can take advantage of it.
       | 
       | We do not need a new format. There are numerous superior options
       | if we're just going to provide a different format (parquet, avro,
       | orc, etc).
        
       | 41209 wrote:
        | Am I insane here, or does Excel generally do a very good job
        | of handling almost all CSV files?
       | 
        | I don't really see a need for a metadata file, nor would I
        | ever see Excel or other tools accepting it. The main problem
        | is adoption; CSV isn't perfect, but it's what we have. Now if
        | you
       | wrote this as a member of the Excel team at Microsoft, and then
       | Excel had the option of exporting CSV files with a metadata file,
       | then I'd be a bit more excited.
        
         | pbreit wrote:
         | You are insane :-).
         | 
          | Excel has such inexplicably horrific handling of long
          | number strings that it's borderline unusable.
         | 
         | https://excel.uservoice.com/forums/304921-excel-for-windows-...
        
         | Tagbert wrote:
         | Excel has some major problems with CSV ingestion though I don't
         | think that this proposal will address those problems.
         | 
         | Here are a couple of cases that I run into frequently:
         | 
          | * Excel is very aggressive about forcing type conversions
          | based on its own assumptions. It will convert strings to
          | dates or numbers, even if data is lost in the process. It
          | will ignore quotes to convert long numeric IDs into
          | scientific notation, which truncates the ID unrecoverably.
         | 
         | * Excel cannot deal with quoted strings containing line breaks.
         | It treats them as separate records and you get truncated
         | records and partial records on separate rows.
        
         | VenTatsu wrote:
         | Excel does a fairly bad job when moving data between two
         | computers that aren't configured the same way, which kind of
         | defeats the purpose of using a data interchange format.
         | 
          | Where I work we have offices in the US and in Europe, where
          | installing a localized version of Windows will swap ',' and
          | '.' as the group and decimal separators. When loading the
          | value 100,002, Excel in the US will see one hundred
          | thousand and two; in some parts of Europe it will see one
          | hundred and two thousandths.
         | 
          | Character set handling can be just as bad: there is no good
          | way to get Excel to auto-open a CSV file as UTF-8 that
          | won't break every other CSV parser in existence. The only
          | cross-platform option is ASCII. Excel will happily load
          | your local OS
         | encoding, likely some variant of ISO-8859, but any other
         | encoding requires jumping through hoops.
        
         | IanCal wrote:
         | Excel does an atrocious job of handling CSV files. It regularly
         | alters data, messes up encodings and either can't or couldn't
         | (I haven't checked in a few years) open CSV files that start
         | with a capital letter I.
         | 
         | Source: dealing with CSV files people exported from Excel and
         | the horrors that flowed from there.
        
           | swader999 wrote:
            | I've donated hours of my life to resolving
            | Excel-corrupted CSV files. They haven't been satisfying
            | hours.
        
             | IanCal wrote:
             | Same. If it helps, the python package ftfy can fix a lot of
             | encoding issues.
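The classic repair that ftfy automates is reversing a Latin-1/UTF-8 mix-up. With the stdlib alone (assuming that particular mis-decoding, which is a common one in Excel exports), the round-trip fix looks like this:

```python
# UTF-8 bytes were mistakenly decoded as Latin-1 somewhere upstream,
# producing mojibake; re-encoding and decoding correctly reverses it.
mojibake = "CrÃ¨me brÃ»lÃ©e"          # what you received
fixed = mojibake.encode("latin-1").decode("utf-8")
assert fixed == "Crème brûlée"
```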
        
               | swader999 wrote:
               | It's tempting to consider different languages...
        
           | bitwize wrote:
           | > either can't or couldn't (I haven't checked in a few years)
           | open CSV files that start with a capital letter I.
           | 
           | It can't do this because it confuses such files with files in
           | SYLK format, which was YET ANOTHER attempt to standardize
           | spreadsheet data interchange, dating from the 80s.
        
             | IanCal wrote:
             | What I find incredible about this is that it decides that
             | the file ending ".CSV" must not be a CSV file but SYLK.
             | Then loading it as SYLK fails, and it doesn't then try and
             | load it as a CSV file instead.
        
           | 41209 wrote:
            | Does any widely used application do it better?
           | 
            | I always view CSV as a lowest common denominator; of
            | course more precise formats exist, but not everyone can
            | use those. CSVs normally get the job done, but like
            | anything else you need to know their limitations.
            | Something like a basic phone book should work; your
            | scientific data, with dozens upon dozens of floating
            | point numbers, may not.
        
             | IanCal wrote:
             | These are not limitations of CSV files, that's excel. Not
             | being able to open a csv file starting with the letter I
             | has nothing to do with CSV as a format.
             | 
             | There just aren't that many widely used applications that
             | deal with things like this, but Google sheets seems to be
             | the obvious one.
        
             | pbreit wrote:
             | Google Sheets handles CSVs MUCH better.
        
             | capeterson wrote:
             | LibreOffice is generally a little less aggressive about
             | formatting, but still has issues similar to Excel.
             | 
             | For me personally, I usually view things in Excel/LO but
             | don't save, and if I need to modify anything I'll use a
             | text editor for one-off changes, or I'll use something like
             | Python with the pandas library for more programmatic
             | changes. Pandas does have issues forcing timestamps to
             | convert sometimes, but that can be easily configured.
             | Otherwise, no problems with this method.
             | 
             | There's definitely a lot of friction if you want to edit
             | .CSV's without your formatting/encoding being altered,
             | unfortunately.
        
               | munk-a wrote:
                | I much prefer ingesting CSVs into LibreOffice since
                | it's a lot more upfront about what encoding it's
                | going to try to use and the delimiter it has chosen -
                | and those choices are extremely easy to modify. Excel
                | also has a habit of utterly clobbering character
                | encoding on export - once you start seeing stray
                | umlauts and cedillas in your text you'll notice that
                | Excel has a really bad habit of choosing latin-1.
        
         | awild wrote:
          | Excel will strip leading zeros from your data even when
          | you're escaping it, and it will assume that a single cell
          | in a column of dot-separated floats is in fact a date.
          | Excel is in fact so good at detecting formats that even if
          | you construct an xls, fill it with IDs that partially start
          | with zeros, and properly mark them as strings, it will
          | still nag you at every single cell that there is something
          | fishy about the file.
        
           | pbreit wrote:
           | The stripping of leading zeroes is dreadful when working with
           | zip codes and SSNs.
        
         | jpeloquin wrote:
         | > Am I insane here, or does Excel generally do a very good job
          | of handling almost all CSV files?
         | 
         | If the user does Data > From Text/CSV > Transform > use
         | PowerQuery to set the first row as the header, I think it's
         | true that Excel does a good job. It provides the basics at
         | least: configurable charset and column type detection. When the
         | user re-saves to CSV, I'm not aware of any way to configure the
         | output (e.g., force quotation of text content), but that's sort
         | of ok.
         | 
         | The easy path--double clicking on a CSV file or using File >
         | Open--is where all the weird auto-conversion of values happens.
         | But other posts have covered that part.
        
         | isoprophlex wrote:
         | No, excel does about the worst possible job of handling CSVs.
         | 
         | You can't even hope to keep a file intact upon opening...
        
       | avnigo wrote:
       | There was an interesting talk on CSV-LD (CSV linked data and
       | metadata) at csv,conf,v6 [0],[1].
       | 
       | [0]: https://raw.githubusercontent.com/csv-
       | ld/ns/main/2021-05-csv...
       | 
       | [1]: https://www.youtube.com/watch?v=qIXcnSCOyZs
        
       | kuroguro wrote:
       | We could also...
       | 
       | 1. select sane defaults on all of those options and
       | 
       | 2. create a new file extension (.scsv for strict csv or sth)
       | 
       | and call it a day
       | 
       | _edit_
       | 
       | Oh, someone already did it:
       | https://github.com/code4fukui/StrictCSV
        
         | munk-a wrote:
         | Oh neat - they actually decided to support the technically
         | correct quote escaping approach for CSV files instead of
         | defaulting to the c-string approach!
        
       | dang wrote:
       | Recent and related:
       | 
       |  _Time to retire the CSV?_ -
       | https://news.ycombinator.com/item?id=28221654 - Aug 2021 (544
       | comments)
        
       | iask wrote:
        | This just makes it more complex... and then what? We end up
        | with multiple specifications, like an X12 document? No, it's
        | not time to retire it. It's just another option, like any
        | other delimited file format. Remember .INI files? They're
        | still being used when there is a need for simplicity.
        |
        | If you think CSV is too complicated for your app's
        | requirements, choose a different delimiter like pipe, or else
        | look at other alternatives. Simple as that.
        |
        | I've spent years building parsers for different documents in
        | the retail juggernaut businesses.
        
       | asdff wrote:
       | You could also add a header line(s) marked with a leading # and
       | skip these lines when using the file for whatever task. It's used
       | in some bioinformatics file formats to specify metadata and is
       | simple enough to work with.
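A sketch of reading such a file, filtering the "#" lines before they reach the CSV parser (the metadata convention is the one described above; the code is illustrative):

```python
import csv

def read_with_comments(lines):
    """Collect leading-'#' metadata lines and parse the rest as CSV."""
    meta = [l.lstrip("# ").rstrip("\n") for l in lines if l.startswith("#")]
    rows = list(csv.reader(l for l in lines if not l.startswith("#")))
    return meta, rows

meta, rows = read_with_comments([
    "# source: sequencing run 42\n",
    "sample,reads\n",
    "A1,1000\n",
])
```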
        
       | mamcx wrote:
        | Nope, the straightforward way is like this:
        |
        |     name:String, date:Date, value:Int
        |     "Miami", 2021-08-19 11:54:19.721376-05, 2
       | 
        | And making the header mandatory.
        |
        | I'd also force quoting of strings, "," for separation, and
        | "\n" for lines. Dates are ISO; decimals use ".".
        |
        | That is all.
        |
        | P.D.: This is similar to how I did it for my mini-lang
        | https://tablam.org/syntax. Tabular data is very simple; the
        | only problem with CSV is that it is too flexible and lets
        | everyone change it, but putting the "schema" outside the CSV
        | is against its purpose.
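A sketch of reading such a typed header (the name:Type vocabulary here is an assumption, not a spec):

```python
import csv
import io
from datetime import date

# Hypothetical type vocabulary mapping header annotations to casts.
CASTS = {"String": str, "Int": int, "Date": date.fromisoformat}

def read_typed(text):
    """Parse a 'name:Type' header row, then cast each column."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [cell.split(":") for cell in next(rows)]
    names = [name for name, _ in header]
    casts = [CASTS[typ] for _, typ in header]
    return names, [[c(v) for c, v in zip(casts, row)] for row in rows]

names, data = read_typed(
    'name:String, date:Date, value:Int\n'
    '"Miami", 2021-08-19, 2\n'
)
```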
        
         | zzleeper wrote:
          | Why do you prefer commas as delimiters? I find tabs are less
         | likely to appear within the strings (so less need to quote the
         | strings if a delimiter is found, and thus the files are
         | smaller). And both can be typed with a single keystroke.
         | 
         | The only advantage I would see for commas is if you are editing
         | the file in a text editor that doesn't highlight strings (like
          | notepad), so it's hard to distinguish them from spaces.
        
           | munk-a wrote:
           | The advantage you mentioned is the one I've seen that has the
           | most benefit. If users are ever modifying the file raw
           | they're going to bork the tab characters - nevermind some
           | developer opening the file who has his editor configured to
           | aggressively convert tabs to spaces.
           | 
            | Tabs are trivial to actually type (so they're much better
            | off than, say, form feed) but they're difficult to
            | visually distinguish.
           | I also think it's generally a good habit to just force
           | quoting of all fields in CSV to dodge any impending user
           | errors.
           | 
           | Human readable file formats have surrendered space efficiency
           | for legibility and the main attribute they need to focus on,
           | IMO, is defense against human errors.
        
           | FroshKiller wrote:
           | If the delimiter isn't a comma, it isn't a CSV. I prefer tabs
           | as well, but that's why I ask for a "tab-delimited text file"
           | instead of a CSV.
        
         | riedel wrote:
         | This addresses part of the problem. Don't forget the character
         | encoding, null encodings, separator, quote and escape chars,...
         | 
          | No doubt, you can use CSV in a well-defined way if there is
          | agreement and specs. The problem is there simply isn't any
          | de facto standard, and it won't be possible to establish
          | one in this domain...
         | 
         | Btw. : The python intake package allows you to specify
         | metadata.
        
           | mamcx wrote:
            | But it is like with JSON: we are not debating those
            | details because it is a bit better defined.
            |
            | For an improved CSV, that minutiae must be defined from
            | the start, but it must also take into account how CSV is
            | used, and keep its spirit.
            |
            | P.D.: An example of something like this is
            | https://commonmark.org, which codifies the rules for
            | Markdown in a better-defined spec.
        
           | munk-a wrote:
            | Null encoding is entirely uncovered by the CSV format -
            | if you want to communicate nulls you need an internal
            | convention, some sort of reserved string. The quoting
            | character is pretty commonly agreed to be " and escape
            | characters are technically unnecessary. Deciding to
            | support \\ and \n is a pretty nice thing to do, since
            | some producers and parsers will erroneously choke on
            | mid-string newlines, but the only character that actually
            | needs to be escaped (the double quote) can just be
            | doubled up as per RFC 4180 [1]. The indecision on
            | escaping means that the literal character sequence "\n"
            | is very hard to represent in CSV consistently - an
            | unfortunate legacy of C programmers getting all up in the
            | standard.
            |
            | RFC 4180 is an after-the-fact RFC that tries to codify
            | common practice in the industry, so some tools may refuse
            | to conform to it - but it's really clearly written and,
            | like, _super_ short. If you've never read it, just go
            | ahead and do so; it lays things out with exceeding care.
           | 
           | 1. https://datatracker.ietf.org/doc/html/rfc4180#section-2
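Python's csv module follows the RFC 4180 quote-doubling convention by default (`doublequote=True`), for example:

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(['she said "hi"', "a,b"])
line = buf.getvalue().strip()
# Embedded quotes are doubled, not backslash-escaped:
assert line == '"she said ""hi""","a,b"'
# And the output round-trips cleanly:
assert next(csv.reader(io.StringIO(line))) == ['she said "hi"', "a,b"]
```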
        
             | [deleted]
        
         | tmountain wrote:
          | Simple, elegant, and backward compatible (albeit with
          | degradation) with the existing "standard". I like this
          | approach.
        
           | teknopaul wrote:
           | IMHO the point about .csv is that writing a parser is trivial
           | so none of this matters. Many writers bork on certain types
            | of data. No biggie. Yes there are variants; no, they are
            | not complicated. Trying to support _all_ variants of
            | something that is not well-defined is a mug's game. Most
            | of the differences are by mistake. Some of the
            | differences can't be fixed in the parser.
           | 
           | If you need csv to work, pick any one interpretation that
           | does work.
        
       | timwis wrote:
       | Sounds a bit like a data package, a spec defined by the same
       | people who wrote the csv spec referenced in the article:
       | https://specs.frictionlessdata.io/data-package/
        
       | comeonseriously wrote:
       | Or just use some other text format instead? CSV is supposed to be
       | simple.
        
       ___________________________________________________________________
       (page generated 2021-08-19 23:01 UTC)