[HN Gopher] Why isn't there a decent file format for tabular data?
___________________________________________________________________
Why isn't there a decent file format for tabular data?
Author : LeonB
Score : 45 points
Date : 2022-05-01 00:20 UTC (2 days ago)
(HTM) web link (successfulsoftware.net)
(TXT) w3m dump (successfulsoftware.net)
| prepend wrote:
| I think it's because csv is good enough.
|
| All the standards I've seen haven't been worth the effort to
| implement. So since csv, with all its flaws, is good enough it
| crowds out other open standards.
|
| People complain about it, but it's not really much of a challenge
| to use csv. I'd also prefer it over the crap (rdf, xml, even
| schema'd json) proposed by people who value more structure. It's
| easier for me to just make clean production and consumption
| programs than to spend time on a really structured table format.
|
| Although I would love a simple, tabular format so I'd never have
| to use csv again.
| quesera wrote:
| I think CSV is crappy because commas are so common in real
| data.
|
| For _almost all_ scenarios I've had to work with, I'd have
| been perfectly happy with TSV where literal Tab was a
| disallowed character. No escaping histrionics required.
| hermitcrab wrote:
| Not being able to include Tabs and Carriage Returns in your
| data can be a problem though.
| [deleted]
| prepend wrote:
| I prefer commas because I can see them, unlike tabs. I spend
| zero time escaping commas because the libraries I read and
| write with (usually pandas, but pretty much everything) escape
| for me. So unless I'm manually building csvs myself it's a
| non-issue and certainly not histrionics.
| kall wrote:
| Yeah I agree. Just the combination of opening in excel on
| double click and being dead simple to spit out from any
| software makes it a winner. It helps that it's usable in plain
| text too.
|
| If it had used ANYTHING other than half the world's decimal
| separator as a value separator, it would be a no-brainer. If
| it had been .psv (pipe separated values) the edge cases would
| be so few that we could just ignore them.
| hedora wrote:
| CSV is fine. If you care about edge cases, implement RFC 4180:
|
| https://www.rfc-archive.org/getrfc.php?rfc=4180
|
| If you don't, then split each line on ",". Problem solved.
|
| If you find tab delimited easier to read (as I do), then check
| out the IANA spec for TSV files:
|
| https://www.iana.org/assignments/media-types/text/tab-separa...
|
| It's easier to parse than CSV. Unfortunately, you have to
| decide how to handle newlines and tabs yourself (probably as
| \n, \t, with \\ for backslashes).
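|
| A minimal sketch of both options in Python (the stdlib csv
| module implements RFC 4180-style quoting; "data.csv" is just a
| placeholder name):
|
|     import csv
|
|     # RFC 4180 path: embedded commas, quotes and newlines in
|     # quoted fields round-trip correctly.
|     with open("data.csv", newline="") as f:
|         rows = list(csv.reader(f))
|
|     # Naive path, for files known to contain no quoted fields:
|     # rows = [line.split(",") for line in text.splitlines()]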
| [deleted]
| Pxtl wrote:
| > Why can't we have a format where
|
| Does Excel support it? No? Then that's the end of that.
|
| Excel is tabular data to all non-developers. The formats
| supported by Excel are the whole thing.
|
| And if we're inventing a CSV-like format that uses a more
| convenient character than quotes and commas, maybe jumping to a
| non-displayable non-typeable character isn't the best?
|
| Honestly, if I were inventing a table format, I'd use a strict
| subset of HTML5 tables. Strip it down as hard as possible to the
| bare minimal elements - no unneeded close tags. The only real
| flaw is that its encoding of whitespace and escaped chars is
| verbose and ugly.
|
|     <!DOCTYPE html>
|     <html>
|     <head>
|     <title>Example workbook</title>
|     <meta charset="utf-8">
|     <meta name="subtype" content="tabular data">
|     </head>
|     <body>
|     <table><caption>Persons</caption>
|     <tr><th>Name<th>Address
|     <tr><td>Alice<td>123 Fake St, Faketon FA 12345
|     <tr><td>Bob<td>789 Any Pl, Anytown AY 54321
|     </table>
|     <table><caption>Cars</caption>
|     <tr><th>Make<th>Model
|     <tr><td>Buick<td>LeSabre
|     <tr><td>Pontiac<td>Vibe
|     </table>
| adgjlsfhk1 wrote:
| it will also be at least 2x bigger on disk for no reason.
| Pxtl wrote:
| Your data must be very sparse or include a lot of escape
| chars for that.
|
| I've seen a lot of CSV where everything is quoted, meaning
| that the cell separator is effectively ","
|
| which is only one character less than <td>
|
| and still beats the pants off of JSON or XML. imho, it would
| be a good compromise, in that there's already partial tooling
| and GUI support.
| prepend wrote:
| But usually it will be 1 character vs 4. So that adds a lot
| of space that doesn't add much value.
|
| It's also harder to read.
| Pxtl wrote:
| I would say invisible control characters would be even
| harder to read.
| dtech wrote:
| Excel actually opens HTML tables
| briHass wrote:
| If it's tabular, I want a schema for the columns. Is this
| column a 'number' or a string? Even better, is there a max
| length or precision known? Can the cell be null, and how is
| that represented? How are dates formatted? Are they UTC or a
| specific TZ, etc.?
|
| Most of my complaints about CSV relate to trying to determine the
| types used to parse or import as, not how commas are escaped.
| Excel, for example, actually warns you about this if you try to
| save a native Excel file as CSV: you are effectively throwing
| away information.
| Pxtl wrote:
| Mostly the problem comes from how excel is apocalyptically
| shitty at inferring datatypes, incorrectly assuming non-dates
| are dates and ISO8601 dates are god knows what, when a sane
| format would default to text if it didn't know better.
| hirundo wrote:
| > Most tabular data currently gets exchanged as: CSV, Tab
| separated, XML, JSON or Excel. And they are all highly sub-
| optimal for the job.
|
| > CSV is a mess. One quote in the wrong place and the file is
| invalid.
|
| That breaks the other formats too, why pick on CSV? I can imagine
| a format designed to be friendly to syntax errors, but contra
| Postel's Law I'm not sure it would be an improvement over a
| strict, fail-fast syntax.
|
| That's CSV/TSV's real shortcoming: about the only generic
| validation they allow is to make sure the column count is the
| same for all rows.
| zaidw wrote:
| We can blame CSV, or we can blame the way people use CSV.
| Either way CSV is so unreliable that I try to "fail-fast" as
| soon as possible in an automated pipeline.
|
| At work, we explicitly define a data structuring process,
| converting CSV to Parquet with a strict schema and
| technical/structural validation. We assign interns and new
| grad engineers to this, which is nicely within their
| capabilities with minimal training.
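|
| A minimal sketch of that CSV-to-Parquet step with pyarrow (the
| column names and types are invented for illustration):
|
|     import pyarrow as pa
|     import pyarrow.csv as pacsv
|     import pyarrow.parquet as pq
|
|     # Strict schema: read_csv fails fast if a value does not
|     # convert, instead of silently inferring types.
|     schema = pa.schema([
|         ("id", pa.int64()),
|         ("name", pa.string()),
|         ("created", pa.date32()),
|     ])
|     table = pacsv.read_csv(
|         "input.csv",
|         convert_options=pacsv.ConvertOptions(column_types=schema),
|     )
|     pq.write_table(table, "output.parquet")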
| hermitcrab wrote:
| if you add an extra comma in a CSV (outside of quoting) then
| the rest of the cells in that row are off by 1. Which is not
| good obviously. But if you add an extra quote, then the entire
| rest of the file is garbage.
| dossy wrote:
| > That's CSV/TSV's real shortcoming: about the only generic
| validation they allow is to make sure the column count is the
| same for all rows.
|
| Once upon a time, when I was doing a lot of data interchange
| between a wide variety of systems (OS'es, applications, etc.) I
| considered proposing an "enhanced CSV" (ECSV) where the values
| did not start on the second row in the file, but instead the
| second row would be regular expressions that could be used to
| validate the contents of the columns that followed, and data
| would start on row 3.
|
| In other words, you might have:
|
|     ID,NAME,DATE
|     "/^\d+$/","//","/^\d{4}-\d{2}-\d{2}$/"
|     867,Alice,1984-01-09
|     5309,Bob,1981-11-16
|
| (Newlines added because HN doesn't speak Markdown, sigh.)
|
| In the end, I think the solution was far simpler: we just
| exchanged a descriptor file that had column names and their
| corresponding regexp patterns for validation as a separate
| (versioned) file, in order to save a few bytes inside each
| file transmission, which was a real savings when you paid
| per-byte over an EDI network.
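|
| A sketch of what validating such an "enhanced CSV" could look
| like in Python (read_ecsv and its error handling are invented
| for illustration):
|
|     import csv, re, sys
|
|     def read_ecsv(path):
|         """Row 1: headers. Row 2: /regex/ validators.
|         Data starts on row 3."""
|         with open(path, newline="") as f:
|             rows = csv.reader(f)
|             headers = next(rows)
|             patterns = [re.compile(p.strip("/"))
|                         for p in next(rows)]
|             for lineno, row in enumerate(rows, start=3):
|                 for name, pat, value in zip(headers, patterns, row):
|                     if not pat.search(value):
|                         sys.exit(f"line {lineno}, {name}: "
|                                  f"{value!r} fails /{pat.pattern}/")
|                 yield dict(zip(headers, row))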
| syntheweave wrote:
| I spent a while making a binary format for tabularish
| documents, and even started on an editor for it. What I
| decided on after some long months of gradual iteration was to
| give each cell its own header that could contain various
| forms of type info, flags, and modes, and to define a cell
| type that described forms of break (space, line, page, etc. -
| a 16-bit range of break types could be encoded). The document
| header also described a dictionary mapping for the data so
| that it could immediately be presented to the editor in a
| readable form.
|
| But now I just use plain old spreadsheets to do things - I
| obsoleted my own tech, although I like certain things about
| it. The editing and storage encoding isn't really the problem
| so much as the integrity of the literals, which a solution
| like the regex idea could accommodate.
|
| I do think that CSV would benefit by having a header area
| that described the encoding of breaks in cells and lines.
| Maybe that's the only thing that really needs fixing in it.
| And if it included arbitrary break levels like my thing and
| dropped the rectangular row-column shape, it would cover a
| huge number of documents.
| sundarurfriend wrote:
| >> CSV is a mess. One quote in the wrong place and the file is
| invalid.
|
| > That breaks the other formats too, why pick on CSV?
|
| I think it's perhaps badly worded, but the implied (and more
| important) criticism seems to me to be that CSV makes this kind
| of error much more likely, with its handling of quotes. Having
| worked with CSV files that had commas in the data (and
| sometimes quotes too), I quickly learned that I should `set
| readonly` on my editor and only interact with the file through
| programmatic tools, and give up any notion of it being a
| plaintext hand-editable data format.
| dossy wrote:
| > Columns are separated by \u001F (ASCII unit separator)
|
| > Rows are separated by \u001E (ASCII record separator)
|
| Or, how about columns separated by \u002C, and rows separated by
| \u000A. And, for bonus points, we can even define unambiguous
| ways of escaping those two characters so that they CAN appear
| within column values, if we wanted to, and not tell people that
| our encoding format is totally stupid and that they need to use a
| different format.
|
| OP's proposal is equally "highly sub-optimal for the job" for
| exactly the same imaginary reasons they dislike the currently
| available encoding formats, but they don't seem to realize it.
| LeonB wrote:
| > OP's proposal is equally "highly sub-optimal for the job" for
| exactly the same imaginary reasons they dislike the currently
| available encoding formats, but they don't seem to realize it.
|
| This is a really unfair appraisal in a bunch of different ways.
|
| Removing the ability to embed record delimiters, for example,
| means you can process the records in parallel. That's a massive
| improvement all by itself.
|
| Stating that their reasons are "imaginary" is just a needless
| insult, apart from being wrong. Why be like that?
| benmmurphy wrote:
| if you are ok with a binary format there is apache parquet or
| apache feather or 'jay'
| (https://datatable.readthedocs.io/en/latest/api/frame/to_jay....).
| hermitcrab wrote:
| No doubt binary formats like Parquet are the way to go for high
| performance with multi-GB datasets. Seems like total overkill
| if you have a few hundred or thousand rows of data though.
| Being able to create/edit/view stuff in a text editor and
| easily version it is very useful.
| twobitshifter wrote:
| It's strange that we pretend text is not binary. The truth is
| many of our tools are set up to handle binary text data, and
| these tools are not set up for alternate encodings.
|
| If you grab a text file from a Windows machine and bring it
| to a mac, you'll see that txt is far from perfect.
|
| This is a long way of saying that if we develop both the
| format and the tooling then the distinction of text vs
| "binary" tabular data goes away.
| hermitcrab wrote:
| Fair point. But the vast majority of programming tools do
| already handle UTF-8 text (although perhaps don't do a great
| job with US/RS characters).
|
| Also, you could write an efficient parser for US/RS-separated
| data in 5 minutes. For Parquet, you would have to integrate
| with a library (with all its dependencies and design choices)
| or spend days writing your own parser.
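|
| A minimal sketch of such a parser in Python (no escaping, per
| the article's proposal, so it is just two nested splits):
|
|     def parse_usv(text):
|         """Rows split on the ASCII record separator (0x1E),
|         fields on the unit separator (0x1F)."""
|         US, RS = "\x1f", "\x1e"
|         return [row.split(US) for row in text.split(RS) if row]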
| habitue wrote:
| Do people really edit csvs in a text editor? It's horrific:
| the columns don't line up at all, and empty cells are
| represented by a run of commas (are you supposed to count all
| the commas?).
|
| And in terms of versioning, I have seen people commit diffs
| of csvs before, and they're equally unreadable.
|
| CSV is a plain text format, but that basically buys you
| nothing. As long as you're going to be loading it into excel
| or whatever anyway, might as well just pick a good binary
| format like parquet.
| hermitcrab wrote:
| I create and edit CSVs by hand daily. To create/modify
| simple examples to provide technical support for a data
| transformation tool.
| macintux wrote:
| I frequently use the table syntax in org-mode. No, I
| wouldn't edit CSVs because they are miserable, but a
| friendly syntax with some basic text editor support,
| absolutely.
| habitue wrote:
| There is a decent file format for tabular data, and the author
| dismisses it: parquet.
|
| It's compact, encodes all the common data types well, does
| int/float distinction (thanks for teaching us how important
| that is, JSON), stores null records with a mask instead of a
| special value, is stored in column-major order, has
| compression, speedy random access... it has it all. And it
| isn't bogged down with legacy cruft (yet).
|
| Since you need to look at tabular data outside of a text editor
| anyway[0], I don't see a ton of benefit to making it a plaintext
| format. Especially not with the author's suggestion of un-
| typeable ascii delimiters. If I can't type it on my keyboard, I
| may as well be looking at a hex dump of a binary file because I
| can't really edit it.
|
| [0] Who among us hasn't experienced the joy of a pull request
| updating a checked in csv file? A mess of
| ,,,,,,,"Birmingham",,,AL, etc.
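|
| A small illustration of the null handling with pandas (needs
| pyarrow or fastparquet installed; the data is made up):
|
|     import pandas as pd
|
|     df = pd.DataFrame({"city": ["Birmingham", None],
|                        "pop": [200000, None]})
|     df.to_parquet("cities.parquet")
|     # None comes back as a real null, not a sentinel string.
|     print(pd.read_parquet("cities.parquet"))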
| emef wrote:
| parquet is great but it's not particularly easy to read or
| write. the libraries that work with it are few and far
| between, and those that do exist either have a hundred
| dependencies or depend on native code (e.g. libarrow).
| certainly an important dimension in an ideal file format
| should be the ease of parsing/writing it, and parquet gets an
| extremely low score on that front imo
| hermitcrab wrote:
| Editors such as Notepad++ make it fairly easy to insert US and
| RS ASCII codes. But not quite as simple as typing a comma or
| return, obviously.
| aimor wrote:
| I've done exactly what the author suggests for the exact same
| reasons. CSV headaches got me searching for other delimiters and
| I saw there were already unit and record separators, "WTF!?".
|
| As long as you're dealing with text, and you're writing and
| parsing the data yourself, and you never need to use tools other
| than things that can be modified to set the item and line
| delimiters, then it's great. I haven't used the approach since
| then.
|
| Mostly I haven't used it because half my use cases need binary
| numerical data (floats) and the others can be worked with
| entirely in Excel or as CSVs. But I like the idea, even if the
| support just isn't there.
| webmaven wrote:
| The proposed format is reasonably sane, but you really want to
| prevent people from writing these files by hand, and to add a
| bit of metadata describing the column data types at a minimum,
| and ideally more information such as allowed values,
| semantics, etc.
|
| To that end, I suggest putting the tabular data file, along
| with a metadata descriptor file, inside an archive format
| (zip, tarball, etc.); that would add just the right size of
| speed bump to encourage accessing the data through tools and
| libraries (though if someone is determined, reading and
| editing the contained data directly isn't actually impossible
| or forbidden).
|
| All that said, if you want a better tabular data interchange
| format badly enough that you're considering devising one, you
| should probably look at using something even more featureful,
| like SQLite:
|
| https://www.sqlite.org/appfileformat.html
| hermitcrab wrote:
| Being able to create small tabular datasets by hand is
| incredibly useful to me (doing support for data wrangling
| software).
|
| Having an optional associated meta data file would be useful
| though.
| LeonB wrote:
| I put some work into creating a standard, csvz, for putting
| csv files and their metadata into a zip file.
|
| https://github.com/secretGeek/csvz
|
| It's a pretty powerful concept.
|
| SimonW's preferred technique of using sqlite as the means of
| exchange is also very powerful. Particularly when combined with
| all of the utils he maintains.
| bitmedley wrote:
| > Tab separated is a bit better than CSV. But can't store tabs
| and still has issues with line endings, encodings etc.
|
| There are three main variations of tab-separated-values files:
|
| 1) Those that don't allow tabs or line endings in values.
|
| 2) Those that replace tab and newline characters with escaped
| values (\n for newline, \t for tab, \r for carriage return,
| \\ for backslash).
|
| 3) Those that follow the CSV convention of quoting fields as
| defined in RFC 4180.
|
| The third option is by far the best and is what Microsoft
| Excel uses. Microsoft Excel has a save file type called
| "Unicode Text (*.txt)" which saves the data as a tab-separated
| file using RFC 4180 quoting/escaping with the UTF-16LE
| character encoding. In older versions of Excel, "Unicode Text
| (*.txt)" was the only way to export any values containing
| Unicode characters, since the "CSV (Comma delimited) (*.csv)"
| export format uses the ANSI encoding (Windows-1252 on
| Western/US computers), corrupting any characters not contained
| in the ANSI character set. In late 2016, Microsoft finally
| added the "CSV UTF-8 (Comma delimited) (*.csv)" option for
| exporting a CSV file containing Unicode characters.
|
| https://en.wikipedia.org/wiki/Tab-separated_values
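|
| A minimal sketch of variation 2's escaping in Python (the
| helper names are invented; note the backslash must be escaped
| first):
|
|     def tsv_escape(field: str) -> str:
|         return (field.replace("\\", "\\\\")
|                      .replace("\t", "\\t")
|                      .replace("\n", "\\n")
|                      .replace("\r", "\\r"))
|
|     def write_tsv_row(fields) -> str:
|         return "\t".join(tsv_escape(f) for f in fields)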
| Justsignedup wrote:
| I literally don't get why JSON is bad:
|
| [{row1}, {row2}, {row3}]
|
| The fact that it can do more is in no way a negative. You can
| even make a limited JSON parser with reduced capabilities. And
| with JSON you can do more definitions, like header names vs
| column names vs just arrays of arrays.
| tshaddox wrote:
| I think the problem is that there is no clear standard or
| convention for how to do this. Indeed there are many ways to
| represent tabular data. Most large software projects probably
| have at least one bespoke file format that they use internally.
| The trouble is the lack of such a format that everyone agrees
| on.
| prepend wrote:
| Doesn't open in Excel.
|
| And since it doesn't require one record per line it can be a
| hassle to read without having to parse it.
|
| It's really nice to be able to do "head -n 5"
| saulpw wrote:
| jsonlines (jsonlines.org) is one record per line.
| bumper_crop wrote:
| This table has one column, one row, and thus one cell. What is
| its value?
|
| [{"col1":"val1", "col1":"val2"}]
| delusional wrote:
| Not being able to do more is exactly the point. By restricting
| the space of the file format, we can free the mind to think
| about how to fit the data to the format.
|
| If you can do anything, it becomes hard to do the right thing.
| If you can only do the right thing, it becomes trivial to
| decide.
| Pxtl wrote:
| The repetition of the keys seems like the sore point here.
|
| I could see a specialized form of JSON using jagged arrays:
|
|     {
|       "Header": ["Name", "Street Address", "City"],
|       "Body": [
|         ["Alice", "123 Fake St", "Faketon"],
|         ["Bob", "987 Any Pl", "Anytown"]
|       ]
|     }
|
| in that way the keys aren't repeated. It wouldn't deserialize
| into useful JS objects directly, but it would be trivial to
| convert such a table object into a true array of Javascript
| objects.
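|
| A sketch of that conversion in Python (one zip per row):
|
|     import json
|
|     doc = json.loads('''
|     {"Header": ["Name", "Street Address", "City"],
|      "Body": [["Alice", "123 Fake St", "Faketon"],
|               ["Bob", "987 Any Pl", "Anytown"]]}''')
|     rows = [dict(zip(doc["Header"], r)) for r in doc["Body"]]
|     # rows[0]["Name"] == "Alice"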
| orasis wrote:
| A JSON array doesn't allow easily appending a row. JSONLines is
| a bit better.
| nescioquid wrote:
| From TFA:
|
| >> XML and Javascript are tree structures and not suitable for
| efficiently storing tabular data (plus other issues).
| javajosh wrote:
| What does the acronym TFA mean here?
| tsimionescu wrote:
| "The Fine Article" (though more traditionally, and/or
| depending on tone, "The Fucking Article").
| barrkel wrote:
| Doesn't have a great int64 number story, no native dates /
| date-times. If you want named tuples, then the names need to go
| everywhere, otherwise it's a serialization mechanism on top of
| JSON.
| the_gipsy wrote:
| Number size is out of JSON scope, int64 is a problem of
| JavaScript no matter what format you use.
| hedora wrote:
| It also has opinions about floating point numbers.
| nottorp wrote:
| > No escaping. If you want to put \u001F or \u001E in your
| > data - tough you can't. Use a different format.
|
| > It would be reasonably compact, efficient to parse and easy
| > to manually edit (Notepad++ shows the unit separator as a
| > 'US' symbol).
|
| Is it just me, or won't it be human readable because of the
| lack of newlines?
| hermitcrab wrote:
| Yes, that is an issue with the suggested approach. Unless we
| can persuade all the editor developers to break lines at 'RS'
| characters (which is not going to happen).
| shadowofneptune wrote:
| I rarely want to see tabular data in a human-readable format.
| It is always the most tedious way to approach it. My go-to is
| Excel/LibreOffice Calc. This approach is at least tolerable to
| edit in a text editor, while something like the OpenDocument
| Spreadsheet format or the Excel format is impenetrable.
| prepend wrote:
| I rarely do it, but it's nice to be able to Human read when I
| need to. Also being able to use all the command line text
| tools is super convenient.
|
| I think it's a thing where having the option for the 0.1% of
| times when you need it keeps me using it.
| nottorp wrote:
| > I rarely do it, but it's nice to be able to Human read
| when I need to. Also being able to use all the command line
| text tools is super convenient.
|
| Sometimes it helps a lot to eyeball what you have before
| writing one-off scripts to filter/massage your data. I had a
| recent case where the path of least resistance was database to
| csv to one-off python to statistics tools with a gui and
| tabular display.
|
| I could probably have done some enterprise-looking export
| infrastructure, but it was a one-off and not worth it.
| AdamH12113 wrote:
| I never understood why the ASCII separator characters aren't used
| more. It seems like we're one simple text editor feature away
| from having easy display and modification. Is there some
| historical reason for not doing that?
| quesera wrote:
| ASCII separators (mnemonics FS, GS, RS, US) are difficult for
| most users to type, and have no obvious/standardized visual
| representation.
| hermitcrab wrote:
| Notepad++ shows RS and US as little black boxes with 'RS' and
| 'US' in them. It works well enough.
| johnny22 wrote:
| I like one of the json lines formats, although I'm pretty sure
| you'd wanna treat floats as strings there, so they don't get
| mangled.
|
| Each line is a separate object there.
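|
| A small sketch of one way to avoid the mangling when reading
| JSON Lines in Python, without resorting to strings:
|
|     import json
|     from decimal import Decimal
|
|     def read_jsonl(path):
|         with open(path) as f:
|             for line in f:
|                 if line.strip():
|                     # parse_float keeps exact decimal values
|                     yield json.loads(line, parse_float=Decimal)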
| synack wrote:
| HDF5 is often used in scientific computing for this.
|
| https://en.wikipedia.org/wiki/Hierarchical_Data_Format
| sundarurfriend wrote:
| > This results in a truly hierarchical, filesystem-like data
| format. In fact, resources in an HDF5 file can be accessed
| using the POSIX-like syntax /path/to/resource.
|
| That seems a whole higher level of complexity compared to CSV
| or the other options listed in TFA (perhaps comparable to
| Excel).
| brandmeyer wrote:
| NetCDF4 (built on top of HDF5, largely through sub-setting) is
| considerably more powerful than excel/libreoffice. It's also
| easy to access through widely-available libraries. I
| frequently use the Python `netCDF4` (yes, it really is
| capitalized that way) library for exploratory work.
| out_of_protocol wrote:
| You can just use sqlite then. Very compact, highly popular (in
| a different role, though). I've seen it used for large
| datasets - map tiles (millions of jpeg files). Much smaller
| than a zip or tar archive, indexed, fast.
|
| P.S.
|
|     sqlite> .mode csv
|     sqlite> .import city.csv cities
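|
| The same import from Python, as a rough sketch (assumes
| city.csv has a header row with SQL-safe column names):
|
|     import csv, sqlite3
|
|     con = sqlite3.connect("cities.db")
|     with open("city.csv", newline="") as f:
|         rows = csv.reader(f)
|         headers = next(rows)
|         cols = ", ".join(headers)
|         marks = ", ".join("?" * len(headers))
|         con.execute(f"CREATE TABLE IF NOT EXISTS cities ({cols})")
|         con.executemany(
|             f"INSERT INTO cities ({cols}) VALUES ({marks})", rows)
|     con.commit()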
| nh23423fefe wrote:
| I don't think anyone wants tabular data. They want to ingest
| it into a system so they can query it, or join it, or
| aggregate it. They want to get rid of tabular data as quickly
| as possible.
|
| Data at the boundary must be validated, and a file format
| can't do that for you. Semantics is harder than syntax, and
| you can't push every problem down to that level.
| barrkel wrote:
| We've had XML for decades and the default interop is still
| tabular data - probably in large part because unless you go
| into non-standard extensions, tabular data is the easiest thing
| to get from a database query.
|
| Spreadsheets haven't exactly gone away either.
| Sharlin wrote:
| And the data in the system is... almost always tabular? Either
| Excel (which runs half the world) or RDB/SQL (which runs the
| other half).
| hirundo wrote:
| The finance world runs on tables. They want to ingest it, query
| it, join it, aggregate it ... and look at the result in tables.
| Tables are everywhere, and a generic tabular format is
| constantly useful for a la carte, one-off, spontaneous data
| processing. I love strict schema validation as much as the
| next guy, but sometimes you just want to cut and paste a
| table.
| hermitcrab wrote:
| It's tables all the way down...
| ACow_Adonis wrote:
| yeah, um, my life is in tables (stats, finance, etc).
|
| while csv is less than optimal, it's always worked out in
| practice for me (because we still have to validate anything of
| significant size and nothing finance or city/nation sized
| should generally be edited by hand anyway, and errors are
| everywhere independent of format).
|
| honestly, my bigger nightmare has been people trying to put
| things into something like xml...
|
| edit: not that there aren't huge problems with the
| csv/excel/database software and interaction: generally I just
| end up having to assume everything is text and defining type
| information on my end afterwards, on successful parsing
| human_person wrote:
| Have you seen tiledb? https://tiledb.com/data-types/dataframes My
| team is currently transitioning from HDF5 to tiledb for genomics
| data.
| simonw wrote:
| SQLite is one of the few file formats that's recommended by the
| US Library of Congress for archival storage:
| https://www.loc.gov/preservation/digital/formats/fdd/fdd0004...
|
| See also this page on the SQLite website (they're understandably
| very proud of this): https://www.sqlite.org/locrsf.html
|
| I think it's a fantastic format for archiving and distributing
| data.
| rileytg wrote:
| sqlite also has roots in the us military (navy iirc). this
| could explain the US gov's willingness to adopt it.
___________________________________________________________________
(page generated 2022-05-03 23:00 UTC)