[HN Gopher] Show HN: ZSV (Zip Separated Values) columnar data fo...
       ___________________________________________________________________
        
       Show HN: ZSV (Zip Separated Values) columnar data format
        
       A columnar data format built using simple, mature technologies.
        
       Author : hafthor
       Score  : 48 points
       Date   : 2024-04-13 19:52 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | modulus1 wrote:
       | Can't store tabs or newlines, odd choice.
        
         | hafthor wrote:
          | Yeah, this is a limitation inherited from the TSV format
          | this is based on. There is an extension to the format that
          | supports storing binary blobs - ref:
          | https://github.com/Hafthor/zsvutil?tab=readme-ov-file#nested...
        
           | alexandreyc wrote:
            | Basically it has the same limitations as CSV.
            | 
            | At least you could use something less likely to appear in
            | the data as a record separator (like 0x1E).
           | 
           | Otherwise it's an interesting idea!
        
             | tboerstad wrote:
              | 0x1E is the record separator in ASCII, put there
              | precisely for this purpose. Too bad it's not popular;
              | instead we're stuck with the inferior TSV/CSV.
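              | 
              | A minimal sketch of what that looks like (plain Python;
              | 0x1F is the companion unit separator for fields):
              | 
              |   RS = "\x1e"  # ASCII record separator (between rows)
              |   US = "\x1f"  # ASCII unit separator (between fields)
              | 
              |   rows = [["id", "note"], ["1", "tabs\tok\nhere"]]
              | 
              |   # Encode: fields joined by US, records joined by RS.
              |   encoded = RS.join(US.join(r) for r in rows)
              | 
              |   # Decode: plain splits, no quoting or escaping needed
              |   # (as long as the data never contains 0x1E/0x1F).
              |   decoded = [r.split(US) for r in encoded.split(RS)]
              |   assert decoded == rows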
        
               | orf wrote:
                | Strings can contain 0x1E, so it has exactly the same
                | issues as a tab character, plus the downside of not
                | being an easy, "simple" character to type or display.
        
               | mattnewton wrote:
                | I can't easily type that out - and once the format
                | can't be read or edited in a simple text editor, I'm
                | starting to lean towards a nice binary format like
                | protobuf.
        
             | bobbylarrybobby wrote:
             | As far as I know, thanks to quoting it is possible to put
             | basically any data you want in a CSV.
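              | 
              | For example, Python's csv module round-trips tabs,
              | newlines and embedded quotes without trouble:
              | 
              |   import csv, io
              | 
              |   row = ["a\tb", "line1\nline2", 'say "hi"']
              | 
              |   buf = io.StringIO()
              |   # The default QUOTE_MINIMAL mode quotes any field
              |   # that needs it and doubles embedded quotes.
              |   csv.writer(buf).writerow(row)
              | 
              |   buf.seek(0)
              |   assert next(csv.reader(buf)) == row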
        
           | inimino wrote:
           | (Offtopic, but just FYI) it's tenet (principle) not tenant
           | (building resident).
        
             | hafthor wrote:
             | doh. fixed. thanks!
        
           | nextaccountic wrote:
           | can't you just do quoting?
        
             | olejorgenb wrote:
             | https://github.com/Hafthor/zsvutil?tab=readme-ov-
             | file#what-a...
             | 
             | > Any escaping or encoding of these characters would make
             | the format less human-readable, harder to parse and could
             | introduce ambiguity and consistency problems.
             | 
              | I found the wording of "could introduce ambiguity and
              | consistency problems" a bit odd, but I guess they mean
              | that even if things are specified precisely (so there's
              | no ambiguity), not everyone would follow the rules? And
              | they want to play nicely with other tools that follow
              | the TSV "standard".
        
               | 8n4vidtmkvmk wrote:
                | Please. I wrote a CSV parser a couple of weeks ago in
                | an hour or two. It's not that hard to handle the
                | quoting and edge cases. Yes, maybe different parsers
                | will handle them differently, but just document your
                | choices and that's that. How is a little ambiguity
                | worse than completely disallowing certain chars? The
                | latter is a non-starter.
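                | 
                | The core really is just a small state machine
                | (simplified sketch, not the parser I wrote):
                | 
                |   def parse_csv(text):
                |       # Quoted fields may contain commas, newlines
                |       # and doubled ("") quotes.
                |       rows, row, field = [], [], []
                |       in_q, i, n = False, 0, len(text)
                |       while i < n:
                |           c = text[i]
                |           if in_q:
                |               if c != '"':
                |                   field.append(c)
                |               elif i + 1 < n and text[i + 1] == '"':
                |                   field.append('"'); i += 1
                |               else:
                |                   in_q = False
                |           elif c == '"':
                |               in_q = True
                |           elif c == ',':
                |               row.append(''.join(field)); field = []
                |           elif c == '\n':
                |               row.append(''.join(field)); field = []
                |               rows.append(row); row = []
                |           elif c != '\r':  # tolerate \r\n endings
                |               field.append(c)
                |           i += 1
                |       if field or row:
                |           row.append(''.join(field))
                |           rows.append(row)
                |       return rows
                | 
                |   assert parse_csv('a,"b,c"\n"say ""hi""","x\ny"\n') \
                |       == [['a', 'b,c'], ['say "hi"', 'x\ny']]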
        
         | romanows wrote:
          | Reading quickly: it's because the tab is used to indicate
          | nested tabular data in a column. I wonder why not just put
          | a ZSV inside the ZSV?
        
       | jiggawatts wrote:
       | This copied the superficial data layout without the key benefit
       | of modern columnar formats: segment elimination.
       | 
        | Most such formats support efficient querying by skipping the
        | disk read step entirely when a chunk of data is not relevant
        | to a query. This is done by splitting the data into segments
        | of about 100K rows and calculating the min/max range of each
        | column per segment. Those ranges are stored separately in a
        | header or small metadata file, which allows huge chunks of
        | the data to be skipped entirely when they fall outside the
        | range of a query predicate.
       | 
        | PS: The same compression-ratio advantages could be achieved
        | by compressing columns stored as JSON arrays, but such a
        | format could encode all Unicode characters and has a readily
        | available decoder in every mainstream programming language.
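        | 
        | An in-memory sketch of the idea (a real format would keep
        | each segment as a separate chunk on disk and the min/max
        | pairs in a footer):
        | 
        |   SEG = 100_000  # rows per segment
        | 
        |   def zone_map(col):
        |       # One (min, max) pair per segment: the only thing a
        |       # query has to read up front.
        |       return [(min(col[i:i+SEG]), max(col[i:i+SEG]))
        |               for i in range(0, len(col), SEG)]
        | 
        |   def scan_gt(col, zones, threshold):
        |       hits = []
        |       for s, (lo, hi) in enumerate(zones):
        |           if hi <= threshold:
        |               continue  # segment eliminated, never read
        |           hits += [v for v in col[s*SEG:(s+1)*SEG]
        |                    if v > threshold]
        |       return hits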
        
       | orthoxerox wrote:
       | It is simple, but how do you access the price in row #1234567890?
       | If your data doesn't have this many records and can fit into RAM,
       | a basic NLJSON or CSV will work just as well.
        
         | CapitalistCartr wrote:
         | What is NLJSON?
        
           | svieira wrote:
            | Newline-delimited JSON
        
           | chuckadams wrote:
           | Also known as JSONL, or JSON Lines. Basically a file of JSON
           | objects separated by newlines. Popular format for logs these
           | days for obvious reasons.
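            | 
            | A quick sketch (made-up file name):
            | 
            |   import json
            | 
            |   # Write: one JSON object per line.
            |   with open("events.jsonl", "w") as f:
            |       for e in [{"msg": "start"}, {"msg": "done"}]:
            |           f.write(json.dumps(e) + "\n")
            | 
            |   # Read: stream it line by line, no need to load the
            |   # whole file -- handy for logs.
            |   with open("events.jsonl") as f:
            |       events = [json.loads(line) for line in f]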
        
             | rzzzt wrote:
             | NDJSON is the shorthand I've seen:
             | https://github.com/ndjson/ndjson-spec
        
         | cm2187 wrote:
          | Like Parquet, this isn't really meant for an RDBMS type of
          | database; it's more for analytics over large datasets. I
          | work in an environment where we typically have tables with
          | over 300 columns and tens if not hundreds of millions of
          | rows daily. When you want to do a simple sum/group-by over
          | 2 or 3 columns, it is great to have a column-store file
          | format where you only read the columns you need, and those
          | are compressed.
          | 
          | The price you pay is that it is inefficient for
          | single-record access, or for "select *" kinds of queries.
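          | 
          | A sketch of what that selective read looks like against a
          | ZSV-style archive (one file per column inside the zip; the
          | entry names here are made up):
          | 
          |   import zipfile
          |   from collections import defaultdict
          | 
          |   totals = defaultdict(float)
          |   with zipfile.ZipFile("trades.zsv") as z:
          |       # Only these two entries get decompressed; the
          |       # other ~300 columns are never touched.
          |       regions = z.read("region.tsv").decode().splitlines()
          |       amounts = z.read("amount.tsv").decode().splitlines()
          | 
          |   for region, amount in zip(regions, amounts):
          |       totals[region] += float(amount)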
        
           | orthoxerox wrote:
            | I _was_ comparing it with Parquet, which is much more
            | complex but has features that help you access the data in
            | less than O(n), like row groups and pages.
        
       | psanford wrote:
       | The only benefit this format provides is the ability to read some
       | columns without needing to read all columns. Unfortunately it is
       | not a seekable format. That's a pretty big miss.
       | 
        | It also wouldn't be that hard to make it seekable. All you
        | would have to do is give each TSV file two columns: record-id
        | and value.
        
       | selimnairb wrote:
        | We need to have a come-to-Jesus meeting about these columnar
        | formats.
        
       | andenacitelli wrote:
        | Can we all just converge on Parquet + Arrow and call it a
        | day, please? Too much effort is being put into 1..N ways to
        | solve a problem that would be better spent on a single
        | standard.
        | 
        | We work with Parquet + Arrow every day at $DAYJOB in an ML
        | and Big Data context and it's been great. We don't even think
        | we're using it to its fullest potential, but it's never been
        | the bottleneck for us.
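        | 
        | For anyone who hasn't tried it, the column-pruned read is a
        | one-liner with pyarrow (made-up file name):
        | 
        |   import pyarrow.parquet as pq
        | 
        |   table = pq.read_table(
        |       "trades.parquet",
        |       columns=["region", "amount"],     # column pruning
        |       filters=[("amount", ">", 100.0)], # row-group pruning
        |   )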
        
         | pdimitar wrote:
         | How is the data schema description language btw? I haven't used
         | either yet.
        
       | btbuildem wrote:
       | Colour me out of the loop, but what is the utility of this type
       | of approach? I can't seem to grok this.
        
       | warthog wrote:
        | This is very promising for nested data points.
        
       ___________________________________________________________________
       (page generated 2024-04-13 23:00 UTC)