[HN Gopher] A straightforward way to extend CSV with metadata
___________________________________________________________________
A straightforward way to extend CSV with metadata
Author : todsacerdoti
Score : 38 points
Date : 2021-08-19 16:02 UTC (6 hours ago)
(HTM) web link (gist.github.com)
(TXT) w3m dump (gist.github.com)
| [deleted]
| prpl wrote:
| There is CSVW
| lolive wrote:
| Thanks for showing that a little semantics can go a long way.
| a9h74j wrote:
| Could this be too general a problem to "solve" without narrowing
| down major use cases, industries, and likely first-adopters? Will
| there need to be different evolutionary/adoption paths? Will
| Postel's law need to play a role?
|
| A priori, the design space I'd want to look at is something
| between protobuf and CSV. Perhaps, optionally at the end of each
| line [somewhere non-breaking], you add metadata _for that line_,
| including specifying a parser implementation you know can parse
| it perfectly.
|
| The OP's suggestion might address this if metadata could apply
| to particular ranges of lines. It would be more fragile to the
| extent that it lives in a different file.
| newbamboo wrote:
| I thought the "time to retire CSV?" thread was tongue-in-cheek
| satire, but it seems real people think it can happen. Maybe my
| thinking is too static, but I feel like as long as a delimited
| string is a thing, CSV will remain ubiquitous. Everything is
| metadata and also not metadata.
| karmakaze wrote:
| Anything that requires multiple files to be passed around
| together is a non-starter. Even with just one file it's so hard
| for people to keep track of the 'current' one.
| bachmeier wrote:
| I put any metadata in the first lines of the file. You can
| trivially have R skip those lines.
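| A minimal sketch of the same idea in Python with pandas (the
| filename and the number of metadata lines are just assumptions):
|
|         import pandas as pd
|
|         # The first three lines of mydata.csv hold free-form
|         # metadata; skiprows starts parsing at the real header row.
|         df = pd.read_csv("mydata.csv", skiprows=3)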
| munk-a wrote:
| Even better, you can just bake that metadata into the column
| headers. CSVs have no mechanism or widely accepted habit for
| skipping over entries - it's safer and more portable to bake
| any control data you need into the place reserved for it -
| the header line.
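| A sketch of that idea, assuming a hypothetical "name|type"
| convention for the header cells (not an existing standard):
|
|         import csv
|
|         with open("mydata.csv", newline="") as f:
|             reader = csv.reader(f)
|             # e.g. header cells like "address|string", "birthday|date"
|             columns = [cell.split("|", 1) for cell in next(reader)]
|             rows = list(reader)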
| larrydag wrote:
| Thinking the same thing. CSV is already a text file. Why not
| just put the metadata in the file itself? Call it a CSF file or
| something like that.
| [deleted]
| sigzero wrote:
| Yeah, if you are going to change the spec just put the format
| in the first lines of the file.
|
|         %csvddfVersion: 1.2
|         %delimiter: ","
|         %doubleQuote: true
|         %lineTerminator: "\r\n"
|         %quoteChar: "\""
|         %skipInitialSpace: true
|         %header: true
|         %commentChar: "#"
|         %mandatory: address name
|         %type: address:string
|         %type: name:string
|         %type: birthday:date
|         %type: telephone:number
|
|         address,name,birthday,telephone
|         177A Bleecker Street,Stephen Strange,1963-01-01
|
| Or something like that, where everything up to the first blank
| line is the schema.
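| A rough sketch of a reader for that layout in Python (the
| %key: value preamble is, of course, a hypothetical format):
|
|         import csv
|
|         def read_with_preamble(path):
|             meta = {}
|             with open(path, newline="") as f:
|                 for line in f:
|                     if not line.strip():   # blank line ends the schema
|                         break
|                     key, _, value = line.lstrip("%").partition(":")
|                     meta[key.strip()] = value.strip().strip('"')
|                 rows = list(csv.reader(f, delimiter=meta.get("delimiter", ",")))
|             return meta, rows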
| swader999 wrote:
| Agree, but there are whole industries that work this way.
| Looking at you, banking and finance.
| [deleted]
| hinkley wrote:
| I think that's why the archive is part of the format. But then
| you have a process change where you expect people to work with
| the _archive_ and not the contents. That can be done but it
| requires some institutional will.
|
| For instance in a code signing situation it was socialized that
| only the archive was 'safe' and you used one of several tools
| to build them or crack them open. However that company was
| already used to thinking about physical shipping manifests and
| so learning by analogy worked, after a fashion.
| karmakaze wrote:
| That's a distinction without a difference. It's either a
| binary format--not a CSV--or it's a CSV file plus another file.
| You can't have it both ways.
| delusional wrote:
| Amazing, a standard to define the characters we couldn't
| standardize on.
|
| If you can decide on this file format, couldn't you just
| normalize the CSV file instead?
| barbazoo wrote:
| I think they're simply acknowledging the fact that there are
| many permutations of formats out there and what they propose is
| a way to communicate what your particular format is without
| abandoning CSV completely.
|
| I'm assuming it's in response to the post yesterday that
| outlined all the things that are wrong with CSV and how other
| formats like Parquet are better.
| hinkley wrote:
| The chaos of CSV files is older than XML. And while many of
| the same issues have played out in HTML, I have a strong
| suspicion that the particular levels of hell familiar to
| anyone using CSV files professionally informed some of the
| opinionated nature of XML. Especially since XML delves into
| data transport instead of just typesetting of text meant for
| human consumption.
|
| I know I had specific conversations about CSV versus XML and
| those referred to a substantial body of literature on the
| topic.
| thebelal wrote:
| An earlier attempt at this is http://csvy.org/, but AFAIK it has
| never really caught on.
| dec0dedab0de wrote:
| If we're going to change the format, why not just use the record
| separator and field separator characters? They've been around
| since ASCII.
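| A quick illustration in Python (the records here are made up):
|
|         US, RS = "\x1f", "\x1e"   # ASCII unit and record separators
|
|         records = [["177A Bleecker Street", "Stephen Strange"],
|                    ["221B Baker Street", "Sherlock Holmes"]]
|         blob = RS.join(US.join(fields) for fields in records)
|
|         parsed = [rec.split(US) for rec in blob.split(RS)]
|         assert parsed == records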
| floatingatoll wrote:
| People don't know how to type them on a US keyboard, so it'll
| never pass the "if I can't hand-write this in ed(1), it isn't a
| valid file format" hacker adoption barrier without someone of
| Google's size to force us all to adopt it (e.g. protobuf).
| dec0dedab0de wrote:
| Fair enough, but imagine how much easier things would have
| been if we had those two extra keys on our keyboards for the
| last 40 years.
| floatingatoll wrote:
| I believe you could type it in DOS with the same trick that
| works today: Alt-028, though nowadays you have to say
| Alt-001E due to Unicode. (Actual ascii/uni codes may vary.)
|
| Perl6 tried incorporating non-US-keyboard characters into
| the language and that went very badly. I'm sure it works
| fine for Perl6 people, but beyond that boundary, I still
| encounter people who can't type é on a Mac with the
| keyboard alone today, much less handle Alt-001E. So I am
| extremely pessimistic.
| [deleted]
| Zizizizz wrote:
| We actually use these as part of bulk loading CSVs into SQL
| Server 2016, because BULK INSERT can't distinguish commas within
| a field from ones used as delimiters. Pain in the ass to debug
| with, though.
| timwis wrote:
| Interesting, hadn't heard of these! My guess is the complexity
| of parsing csv consistently extends a bit beyond that (e.g.
| handling line breaks), but the bigger problem is getting enough
| traction behind any solution, no matter how spot on it is, and
| outpacing csv. Not to mention csv is a far catchier acronym
| than "rssv" :P
| [deleted]
| roberto wrote:
| Why JSON? I don't see any benefit in storing `format.txt` as
| JSON (especially in a file with the .txt extension). It just
| makes editing harder.
|
| This would be better, IMHO:
|
|         csvddfVersion: 1.2
|         delimiter: ";"
|         doubleQuote: true
|         lineTerminator: "\r\n"
|         quoteChar: "\""
|         skipInitialSpace: true
|         header: true
|         commentChar: "#"
| mtVessel wrote:
| Because what better way to configure a parser than to require
| it to use a completely different parser for configuration? /s
| VenTatsu wrote:
| I thought the same thing at first glance, but it looks like the
| author is just reusing an existing spec
| https://specs.frictionlessdata.io/csv-dialect/
| Hackbraten wrote:
| Can we please have a third file in the zip?
|
| * format.txt
|
| * mydata.csv
|
| * .DS_Store
|
| It will be there if the zip was created on a Mac so might as well
| include it in the standard.
| munk-a wrote:
| No, we can't - since people who aren't on Macs may potentially
| need to use this standard and that metadata descriptor file has
| failed to gain any adoption traction outside of Apple.
|
| That said - all sorts of programs that bundle files inject
| rando metadata files, and if there were a codified standard it
| should be flexible enough to ignore them (possibly agreeing that
| no file whose name leads with a `.` may ever be required for the
| format to function, since most of the random junk that gets
| thrown into zip files tends to at least consistently follow that
| habit).
| kaeruct wrote:
| I cannot tell if this is a joke or not.
|
| If you are serious, then what about just ignoring any files in
| the zip that are not specified in the standard?
| teknopaul wrote:
| The point is to proliferate any decisions taken by Apple,
| because Apple = God and therefore such decisions are beyond
| human understanding.
|
| You can't reason your way out of Apple being right _always_.
| fitba72 wrote:
| Interesting. Bridges the gap between old school and new school.
| Us old-school folks have to walk the fine line between extinction
| and having some legacy in the DNA :)
| bitwize wrote:
| Single file with an open standard binary format based on Protobuf
| or CapnProto. This idea of using text files by default is ancient
| Unix cruft. Use binary formats unless you have a damn good reason
| not to, always, and release open-source tools for reading and
| processing files in your format. Then you can do some semblance
| of enforcing structure and types on your data.
| swader999 wrote:
| Some notion of control data should be added to this spec. Total
| lines to parse and so on could be included. Perhaps a hash...
| omarhaneef wrote:
| I see a similar pattern in electronic health records startups,
| javascript frameworks, backends for APIs and many other domains:
|
| 1. There are way too many standards, we have n standards
|
| 2. The ecosystem is fragmented and we have to let them talk to
| each other
|
| 3. I think we should solve it by introducing our new standard
|
| 4. We now have n+1 standards
|
| And I am not saying I have a solution.
| teknopaul wrote:
| Metadata in a file is as annoying as a BOM. Make decisions about
| how you plan to interpret woolly specs outside the file and do
| this for all files. Problem solved: with no code.
| slim wrote:
| There's already a common way to add metadata to CSV: use the
| first row. It is used for column names, but we can extend it to
| carry any metadata. It would be easy to implement in current
| parsers, and it would fail in a predictable, if not graceful, way
| that is probably already accounted for (i.e. garbage in the first
| row).
| peanut_worm wrote:
| This seems kind of silly. So you need two parsers now and two
| files; you might as well just put your data in JSON.
| dreyfan wrote:
| Drop the notion of packaging them together and you have a winning
| idea. There's an optional metadata file that may or may not be
| present alongside the exact same CSV that we've always had.
| Systems that implement the metadata can take advantage of it.
|
| We do not need a new format. There are numerous superior options
| if we're just going to provide a different format (parquet, avro,
| orc, etc).
| 41209 wrote:
| Am I insane here, or does Excel generally do a very good job of
| handling almost all CSV files?
|
| I don't really see a need for a metadata file, nor would I ever
| see Excel or other tools accepting it. The main problem is
| adoption; CSV isn't perfect, but it's what we have. Now if you
| wrote this as a member of the Excel team at Microsoft, and then
| Excel had the option of exporting CSV files with a metadata file,
| then I'd be a bit more excited.
| pbreit wrote:
| You are insane :-).
|
| Excel's handling of long number strings is so inexplicably
| horrific that it's borderline unusable.
|
| https://excel.uservoice.com/forums/304921-excel-for-windows-...
| Tagbert wrote:
| Excel has some major problems with CSV ingestion though I don't
| think that this proposal will address those problems.
|
| Here are a couple of cases that I run into frequently:
|
| * Excel is very aggressive about forcing type conversion based
| on its own assumptions. It will convert strings to dates or
| numbers, even if data is lost in the process. It will ignore
| quotes and convert long numeric IDs into scientific notation,
| which truncates the ID unrecoverably.
|
| * Excel cannot deal with quoted strings containing line breaks.
| It treats them as separate records and you get truncated
| records and partial records on separate rows.
| VenTatsu wrote:
| Excel does a fairly bad job when moving data between two
| computers that aren't configured the same way, which kind of
| defeats the purpose of using a data interchange format.
|
| Where I work we have offices in the US and in Europe, where
| installing a localized version of Windows swaps ',' and '.' as
| the group and decimal separators. Loading the value 100,002,
| Excel in the US will see one hundred thousand and two; in some
| parts of Europe it will see one hundred and two thousandths.
|
| Character set handling can be just as bad; there is no good way
| to get Excel to auto-open a CSV file as UTF-8 that won't break
| every other CSV parser in existence. The only cross-platform
| option is ASCII. Excel will happily load your local OS
| encoding, likely some variant of ISO-8859, but any other
| encoding requires jumping through hoops.
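| That ambiguity is exactly what a sidecar spec could pin down;
| with pandas, for instance, you end up spelling it out by hand
| (the separator, decimal mark and encoding here are just an
| example):
|
|         import pandas as pd
|
|         # European-style file: ';' between fields, ',' as decimal mark
|         df = pd.read_csv("mydata.csv", sep=";", decimal=",",
|                          encoding="utf-8")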
| IanCal wrote:
| Excel does an atrocious job of handling CSV files. It regularly
| alters data, messes up encodings and either can't or couldn't
| (I haven't checked in a few years) open CSV files that start
| with a capital letter I.
|
| Source: dealing with CSV files people exported from Excel and
| the horrors that flowed from there.
| swader999 wrote:
| I've donated hours of my life to resolving Excel-corrupted
| CSV files. They haven't been satisfying hours.
| IanCal wrote:
| Same. If it helps, the python package ftfy can fix a lot of
| encoding issues.
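| For instance (the filename is hypothetical; ftfy fixes mojibake,
| not arbitrary corruption):
|
|         import ftfy
|
|         with open("export.csv", encoding="utf-8", errors="replace") as f:
|             fixed = ftfy.fix_text(f.read())   # e.g. "MÃ¼nchen" -> "München"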
| swader999 wrote:
| It's tempting to consider different languages...
| bitwize wrote:
| > either can't or couldn't (I haven't checked in a few years)
| open CSV files that start with a capital letter I.
|
| It can't do this because it confuses such files with files in
| SYLK format, which was YET ANOTHER attempt to standardize
| spreadsheet data interchange, dating from the 80s.
| IanCal wrote:
| What I find incredible about this is that it decides that
| the file ending ".CSV" must not be a CSV file but SYLK.
| Then loading it as SYLK fails, and it doesn't then try and
| load it as a CSV file instead.
| 41209 wrote:
| Does any widely used application do it better?
|
| I always view CSV as a lowest common denominator; of course
| more precise formats exist, but not everyone can use those.
| CSVs normally get the job done, but like anything else you
| need to know their limitations. Something like a basic phone
| book should work; your scientific data, with dozens upon
| dozens of floating-point numbers, may not.
| IanCal wrote:
| These are not limitations of CSV files; that's Excel. Not
| being able to open a CSV file starting with the letter I
| has nothing to do with CSV as a format.
|
| There just aren't that many widely used applications that
| deal with things like this, but Google Sheets seems to be
| the obvious one.
| pbreit wrote:
| Google Sheets handles CSVs MUCH better.
| capeterson wrote:
| LibreOffice is generally a little less aggressive about
| formatting, but still has issues similar to Excel.
|
| For me personally, I usually view things in Excel/LO but
| don't save, and if I need to modify anything I'll use a
| text editor for one-off changes, or I'll use something like
| Python with the pandas library for more programmatic
| changes. Pandas does have issues forcing timestamps to
| convert sometimes, but that can be easily configured.
| Otherwise, no problems with this method.
|
| There's definitely a lot of friction if you want to edit
| CSVs without your formatting/encoding being altered,
| unfortunately.
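| One way to avoid the forced conversions entirely, for what it's
| worth, is to read everything as text (a sketch; column types are
| then opted into explicitly afterwards):
|
|         import pandas as pd
|
|         # Every column stays a string; nothing is coerced to dates
|         # or floats, and leading zeros survive.
|         df = pd.read_csv("mydata.csv", dtype=str, keep_default_na=False)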
| munk-a wrote:
| I much prefer ingesting CSVs into LibreOffice since it's
| a lot more upfront about what encoding it's going to try
| and use and the delimiter it has chosen - and those
| choices are extremely easy to modify and flexible. Excel
| also has a habit of utterly clobbering character encoding
| on export - once you start seeing umlauts and cedillas in
| your text you'll notice that Excel has a really bad habit
| of choosing Latin-1.
| awild wrote:
| Excel will strip leading zeros from your data, even when you're
| escaping it. It will also assume that a single cell in a column
| of dot-separated floats is in fact a date. Excel is in fact so
| good at detecting formats that even if you construct an xls,
| fill it with IDs that partially start with zeros, and properly
| mark them as strings, it will still nag you at every single
| cell of your file that there is something fishy about it.
| pbreit wrote:
| The stripping of leading zeroes is dreadful when working with
| zip codes and SSNs.
| jpeloquin wrote:
| > Am I insane here, or does Excel generally do a very good job
| of handling almost all CSV files?
|
| If the user does Data > From Text/CSV > Transform > use
| PowerQuery to set the first row as the header, I think it's
| true that Excel does a good job. It provides the basics at
| least: configurable charset and column type detection. When the
| user re-saves to CSV, I'm not aware of any way to configure the
| output (e.g., force quotation of text content), but that's sort
| of ok.
|
| The easy path--double clicking on a CSV file or using File >
| Open--is where all the weird auto-conversion of values happens.
| But other posts have covered that part.
| isoprophlex wrote:
| No, Excel does about the worst possible job of handling CSVs.
|
| You can't even hope to keep a file intact upon opening...
| avnigo wrote:
| There was an interesting talk on CSV-LD (CSV linked data and
| metadata) at csv,conf,v6 [0],[1].
|
| [0]: https://raw.githubusercontent.com/csv-ld/ns/main/2021-05-csv...
|
| [1]: https://www.youtube.com/watch?v=qIXcnSCOyZs
| kuroguro wrote:
| We could also...
|
| 1. select sane defaults on all of those options and
|
| 2. create a new file extension (.scsv for strict csv or sth)
|
| and call it a day
|
| _edit_
|
| Oh, someone already did it:
| https://github.com/code4fukui/StrictCSV
| munk-a wrote:
| Oh neat - they actually decided to support the technically
| correct quote escaping approach for CSV files instead of
| defaulting to the c-string approach!
| dang wrote:
| Recent and related:
|
| _Time to retire the CSV?_ -
| https://news.ycombinator.com/item?id=28221654 - Aug 2021 (544
| comments)
| iask wrote:
| This just makes it more complex... and then what? We end up with
| multiple specifications, like an X12 document? No, it's not time
| to retire it. It's just another option, just like any other
| delimited file format. Remember .INI files? They're still being
| used when there is a need for simplicity.
|
| If you think CSV is complicated for your app's requirements,
| choose a different delimiter like pipe, or else look at other
| alternatives. Simple as that.
|
| I've spent years building parsers for different documents in
| the retail juggernaut businesses.
| asdff wrote:
| You could also add header lines marked with a leading # and
| skip those lines when using the file for whatever task. This is
| used in some bioinformatics file formats to specify metadata and
| is simple enough to work with.
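| pandas, for one, already supports that convention via the comment
| parameter (the filename here is hypothetical):
|
|         import pandas as pd
|
|         # Lines starting with '#' (the metadata header) are ignored.
|         df = pd.read_csv("variants.csv", comment="#")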
| mamcx wrote:
| Nope, the straightforward way is like this:
|
|         name:String, date:Date, value:Int
|         "Miami", 2021-08-19 11:54:19.721376-05, 2
|
| And making the header mandatory.
|
| I think forcing quoting of strings, forcing "," for separation
| and "\n" for lines. Dates are ISO, decimals use ".".
|
| That is all.
|
| P.D.: This is similar to how I did it for my mini-lang
| https://tablam.org/syntax. Tabular data is very simple and the
| only problem with CSV is that it is too flexible and lets
| everyone change it, but putting the "schema" outside the CSV is
| against its purpose.
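| A minimal sketch of parsing such a typed header in Python (the
| type names and casting rules are assumptions, not any spec):
|
|         import csv
|
|         # Hypothetical mapping from header type names to casts; a
|         # real implementation would also need a proper date parser.
|         CASTS = {"String": str, "Int": int, "Date": str}
|
|         with open("mydata.csv", newline="") as f:
|             reader = csv.reader(f, skipinitialspace=True)
|             header = [cell.split(":") for cell in next(reader)]
|             rows = [[CASTS.get(typ, str)(value)
|                      for (name, typ), value in zip(header, row)]
|                     for row in reader]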
| zzleeper wrote:
| Why do you prefer commas as delimiters? I find tabs are less
| likely to appear within the strings (so there's less need to
| quote the strings when a delimiter is found, and thus the files
| are smaller). And both can be typed with a single keystroke.
|
| The only advantage I would see for commas is if you are editing
| the file in a text editor that doesn't highlight strings (like
| Notepad), so it's hard to distinguish tabs from spaces.
| munk-a wrote:
| The advantage you mentioned is the one I've seen that has the
| most benefit. If users are ever modifying the file raw
| they're going to bork the tab characters - nevermind some
| developer opening the file who has his editor configured to
| aggressively convert tabs to spaces.
|
| Tabs are trivial to actually type (so they're much better off
| than, say, FormFeed) but they're difficult to visually
| distinguish. I also think it's generally a good habit to just
| force quoting of all fields in CSV to dodge any impending user
| errors.
|
| Human readable file formats have surrendered space efficiency
| for legibility and the main attribute they need to focus on,
| IMO, is defense against human errors.
| FroshKiller wrote:
| If the delimiter isn't a comma, it isn't a CSV. I prefer tabs
| as well, but that's why I ask for a "tab-delimited text file"
| instead of a CSV.
| riedel wrote:
| This addresses part of the problem. Don't forget the character
| encoding, null encodings, separator, quote and escape chars,...
|
| No doubt, you can use CSV in a well-defined way if there is
| agreement and a spec. The problem is there simply isn't any de
| facto standard, and this won't be able to establish one in this
| domain...
|
| Btw.: The Python intake package allows you to specify
| metadata.
| mamcx wrote:
| But it's like with JSON: we are not debating those details
| because it is a bit better defined.
|
| For an improved CSV, those minutiae must be defined from the
| start, but it must also take into account how CSV is used, and
| keep its spirit.
|
| P.D.: An example of something like this is
| https://commonmark.org, which codifies the rules for Markdown in
| a better-defined spec.
| munk-a wrote:
| Null encoding is entirely uncovered by the CSV format - if
| you want to communicate nulls you need to have an internal
| convention - some sort of reserved string. The quoting character
| is pretty commonly agreed to be the double quote (") and escape
| characters are technically unnecessary. Deciding to support
| escapes like \\ and \n is a pretty nice thing to do, since some
| producers and parsers will erroneously choke on mid-string
| newlines, but the only character that actually needs to be
| escaped (the double quote) can just be doubled up as per
| RFC 4180 [1]. The indecision on escaping means that the literal
| character sequence "\n" is very hard to represent in CSV
| consistently; this is an unfortunate malus of C programmers
| getting all up in the standard.
|
| RFC 4180 is an after-the-fact RFC that tries to codify common
| practice in the industry, so some tools may refuse to conform
| to it - but it's really clearly written and like... _super_
| short. If you've never read it, just go ahead and do so, since
| it lays things out with exceeding care.
|
| 1. https://datatracker.ietf.org/doc/html/rfc4180#section-2
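| Python's csv module follows that convention out of the box, for
| what it's worth:
|
|         import csv, io
|
|         buf = io.StringIO()
|         csv.writer(buf).writerow(['say "hi"', 'line\nbreak'])
|         print(buf.getvalue())
|         # "say ""hi""","line
|         # break"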
| [deleted]
| tmountain wrote:
| Simple, elegant, and backward compatible (albeit with
| degradation) with the existing "standard". I like this
| approach.
| teknopaul wrote:
| IMHO the point of .csv is that writing a parser is trivial,
| so none of this matters. Many writers bork on certain types
| of data. No biggie. Yes, there are variants; no, they are not
| complicated. Trying to support _all_ variants of something
| that is not well-defined is a mug's game. Most of the
| differences are by mistake. Some of the differences can't be
| fixed in the parser.
|
| If you need CSV to work, pick any one interpretation that
| does work.
| timwis wrote:
| Sounds a bit like a data package, a spec defined by the same
| people who wrote the csv spec referenced in the article:
| https://specs.frictionlessdata.io/data-package/
| comeonseriously wrote:
| Or just use some other text format instead? CSV is supposed to be
| simple.
___________________________________________________________________
(page generated 2021-08-19 23:01 UTC)