[HN Gopher] Show HN: ZSV (Zip Separated Values) columnar data fo...
___________________________________________________________________
Show HN: ZSV (Zip Separated Values) columnar data format
A columnar data format built using simple, mature technologies.
Author : hafthor
Score : 48 points
Date : 2024-04-13 19:52 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| modulus1 wrote:
| Can't store tabs or newlines, odd choice.
| hafthor wrote:
| yeah, this is a limitation inherited from the TSV format
| this is based on - there is an extension to the format
| that supports storing binary blobs - ref:
| https://github.com/Hafthor/zsvutil?tab=readme-ov-file#nested...
| alexandreyc wrote:
| Basically it has the same limitations as CSV.
|
| At least you could use something less likely to appear in
| the data as the record separator (like 0x1E).
|
| Otherwise it's an interesting idea!
| tboerstad wrote:
| 0x1E is the record separator in ASCII, defined precisely
| for this purpose. Too bad it's not popular; instead we're
| stuck with inferior TSV/CSV.
| orf wrote:
| Strings can contain 0x1E, so it has exactly the same
| issues as a tab character, but with the added downside of
| not being an easy, "simple" character.
| mattnewton wrote:
| I can't easily type that out - and once the format can't
| be read / edited in a simple text editor, I start to lean
| towards a nice binary format like protobuf.
| bobbylarrybobby wrote:
| As far as I know, thanks to quoting it is possible to put
| basically any data you want in a CSV.
| inimino wrote:
| (Offtopic, but just FYI) it's tenet (principle) not tenant
| (building resident).
| hafthor wrote:
| doh. fixed. thanks!
| nextaccountic wrote:
| can't you just do quoting?
| olejorgenb wrote:
| https://github.com/Hafthor/zsvutil?tab=readme-ov-
| file#what-a...
|
| > Any escaping or encoding of these characters would make
| the format less human-readable, harder to parse and could
| introduce ambiguity and consistency problems.
|
| I found the wording of "could introduce ambiguity and
| consistency problems" a bit odd, but I guess they mean
| that even if things are specified precisely (so there's
| no ambiguity), not everyone would follow the rules? They
| also want to play nicely with other tools following the
| TSV "standard".
| 8n4vidtmkvmk wrote:
| Please. I wrote a CSV parser a couple of weeks ago in an
| hour or two. It's not that hard to handle the quoting and
| edge cases. Yes, maybe different parsers will handle them
| differently, but just document your choices and that's
| that. How is ambiguity better than completely disallowing
| certain chars? That's a non-starter.
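To give a sense of scale for "not that hard": a minimal parser for one CSV record, handling quoted commas and doubled-quote escapes (embedded newlines would additionally require the reader to keep pulling lines while inside quotes). A sketch, not a full RFC 4180 implementation:

```python
def parse_csv_line(line):
    """Parse one CSV record; '""' inside a quoted field unescapes to '"'."""
    fields, field, in_quotes, i = [], [], False, 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    field.append('"')  # escaped quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(c)
        elif c == '"':
            in_quotes = True
        elif c == ',':
            fields.append(''.join(field))  # field boundary
            field = []
        else:
            field.append(c)
        i += 1
    fields.append(''.join(field))
    return fields
```

For example, `parse_csv_line('a,"b,""c""",d')` yields `['a', 'b,"c"', 'd']`.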
| romanows wrote:
| Reading quickly, it's because the tab is used to indicate
| nested tabular data in a column. I wonder why not just put
| a zsv inside the zsv?
| jiggawatts wrote:
| This copied the superficial data layout without the key benefit
| of modern columnar formats: segment elimination.
|
| Most such formats support efficient querying by skipping the disk
| read step entirely when a chunk of data is not relevant to a
| query. This is done by splitting the data into segments of about
| 100K rows, and then calculating the min/max range for each
| column. That is stored separately in a header or small metadata
| file. This allows huge chunks of the data to be skipped
| entirely when they fall outside the range of a query
| predicate.
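The segment-elimination idea described above can be sketched in a few lines. This is a hypothetical illustration (the segment size and function names are made up, not from any particular format):

```python
SEGMENT_ROWS = 100_000

def build_metadata(values):
    """Record (min, max) for each segment of a column."""
    meta = []
    for start in range(0, len(values), SEGMENT_ROWS):
        seg = values[start:start + SEGMENT_ROWS]
        meta.append((min(seg), max(seg)))
    return meta

def segments_to_read(meta, lo, hi):
    """Keep only segments whose [min, max] overlaps the predicate range."""
    return [i for i, (seg_min, seg_max) in enumerate(meta)
            if seg_max >= lo and seg_min <= hi]

# A query like "WHERE x BETWEEN 150000 AND 160000" over 250k sorted rows
# touches only the middle segment; the other two are never read.
meta = build_metadata(list(range(250_000)))
assert segments_to_read(meta, 150_000, 160_000) == [1]
```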
|
| PS: the same compression ratio advantages could be achieved by
| compressing columns stored as JSON arrays, but such a format
| could encode all Unicode characters and has a readily available
| decoder in all mainstream programming languages.
| orthoxerox wrote:
| It is simple, but how do you access the price in row #1234567890?
| If your data doesn't have this many records and can fit into RAM,
| a basic NLJSON or CSV will work just as well.
| CapitalistCartr wrote:
| What is NLJSON?
| svieira wrote:
| New Line delimited JSON
| chuckadams wrote:
| Also known as JSONL, or JSON Lines. Basically a file of JSON
| objects separated by newlines. Popular format for logs these
| days for obvious reasons.
| rzzzt wrote:
| NDJSON is the shorthand I've seen:
| https://github.com/ndjson/ndjson-spec
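Whatever the name, the format is trivial to produce and consume, which is much of its appeal for logs. A sketch in Python:

```python
import json

records = [{"event": "login", "user": "a"},
           {"event": "logout", "user": "a"}]

# Write: one JSON object per line, newline-delimited.
lines = "\n".join(json.dumps(r) for r in records)

# Read: parse each non-empty line independently -- this streams
# well, since a consumer never needs the whole file in memory.
parsed = [json.loads(line) for line in lines.splitlines() if line]
assert parsed == records
```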
| cm2187 wrote:
| Like Parquet, this isn't really meant for an RDBMS type of
| database; it's more for analytics over large datasets. I
| work in an environment where we typically have tables with
| over 300 columns and tens if not hundreds of millions of
| rows daily. When you want to do a simple sum/group by
| involving 2 or 3 columns, it is great to have a column
| store file format, where you only read the columns you
| need, and those are compressed.
|
| The price you pay is that it is inefficient for single
| record access, or for "select *" kinds of queries.
| orthoxerox wrote:
| I _was_ comparing it with Parquet, which is much more
| complex, but has features that help you access the data in
| less than O(n), like row groups and pages.
| psanford wrote:
| The only benefit this format provides is the ability to read some
| columns without needing to read all columns. Unfortunately it is
| not a seekable format. That's a pretty big miss.
|
| It also wouldn't be that hard to make it seekable. All you would
| have to do is make each TSV file two columns: record-id, value.
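One way the suggested record-id column could enable sub-linear access: if record-ids are stored sorted, a single record can be found by binary search rather than a full scan. A hypothetical sketch (the in-memory lists stand in for the two columns of a file):

```python
import bisect

# Stand-ins for the two columns of one "record-id, value" file.
record_ids = [3, 17, 42, 99, 105]
prices = [9.99, 4.50, 1.25, 7.00, 3.10]

def lookup(rid):
    """Binary-search the sorted record-id column for one record."""
    i = bisect.bisect_left(record_ids, rid)
    if i < len(record_ids) and record_ids[i] == rid:
        return prices[i]
    return None  # record absent from this column

assert lookup(42) == 1.25
assert lookup(5) is None
```

In an on-disk format the same idea would translate to seeking by file offset rather than list index, which is the part the comment says the current format misses.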
| selimnairb wrote:
| We need to have a come to Jesus meeting about these columnar
| formats.
| andenacitelli wrote:
| Can we just all converge on Parquet + Arrow and call it a day
| please? Too much effort being put into 1..N ways to solve a
| problem that would be better put towards a single standard.
|
| We work with Parquet + Arrow every day at $DAYJOB in a ML and Big
| Data context and it's been great. We don't even think we're using
| it to its fullest potential, but it's never been the bottleneck
| for us.
| pdimitar wrote:
| How is the data schema description language btw? I haven't used
| either yet.
| btbuildem wrote:
| Colour me out of the loop, but what is the utility of this type
| of approach? I can't seem to grok this.
| warthog wrote:
| This is very promising for nested datapoints
___________________________________________________________________
(page generated 2024-04-13 23:00 UTC)