https://github.com/Hafthor/zsvutil Skip to content Toggle navigation Sign in * Product + Actions Automate any workflow + Packages Host and manage packages + Security Find and fix vulnerabilities + Codespaces Instant dev environments + Copilot Write better code with AI + Code review Manage code changes + Issues Plan and track work + Discussions Collaborate outside of code Explore + All features + Documentation + GitHub Skills + Blog * Solutions For + Enterprise + Teams + Startups + Education By Solution + CI/CD & Automation + DevOps + DevSecOps Resources + Learning Pathways + White papers, Ebooks, Webinars + Customer Stories + Partners * Open Source + GitHub Sponsors Fund open source developers + The ReadME Project GitHub community articles Repositories + Topics + Trending + Collections * Pricing Search or jump to... Search code, repositories, users, issues, pull requests... Search [ ] Clear Search syntax tips Provide feedback We read every piece of feedback, and take your input very seriously. [ ] [ ] Include my email address so I can be contacted Cancel Submit feedback Saved searches Use saved searches to filter your results more quickly Name [ ] Query [ ] To see all available qualifiers, see our documentation. Cancel Create saved search Sign in Sign up You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session. Dismiss alert {{ message }} Hafthor / zsvutil Public * Notifications * Fork 0 * Star 17 * ZSV Utility for converting csv/tsv to/from zip-separated-values License MIT license 17 stars 0 forks Branches Tags Activity Star Notifications * Code * Issues 0 * Pull requests 0 * Actions * Projects 0 * Security * Insights Additional navigation options * Code * Issues * Pull requests * Actions * Projects * Security * Insights Hafthor/zsvutil This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository. main BranchesTags Go to file Code Folders and files Name Name Last commit message Last commit date Latest commit History 7 Commits src/com/hafthor src/com/hafthor .gitignore .gitignore LICENSE LICENSE README.md README.md View all files Repository files navigation * README * MIT license zsvutil A utility for converting TSV files from/to ZSV files. Introducing ZSV - ZIP Separated Values TL;DR ZSV (ZIP Separated Values) is a columnar data storage format with features similar to Parquet, Orc or Avro, however, it is built upon the simple technologies of TSV (tab separated values) and ZIP, making it easy to understand, create and consume, but still provide the query performance characteristics of a modern columnar store format. Tenets * Be simple * Prefer mature, widely available technologies * Favor human readability * Be easy to parse and generate * Be efficient for simple tabular data * Prefer longevity over novelty Description Given an original source, products.tsv, zsvutil import creates a products.zsv file that is just a .zip file with a file inside for each column, for example, SKU, Description and Price. Inside those files is just the TSV for that column, compressed. FAQ Why is ZSV built on ZIP file format? Why not use .targz? ZIP is a widely available, mature technology that is easy to use and has built-in support in many languages and platforms. .targz is a single gzip stream of a tar file, which is a collection of files. This makes it effectively impossible to seek to a specific file without reading and decompressing the whole stream up to that file. ZIP files are a collection of individually compressed files, with a directory as a footer to the file, which makes it easy to seek to a specific file without reading the whole file. Why is ZSV built on TSV format? Why not CSV? Why not JSON? TSV is a simple, human-readable format that is easy to understand and manipulate. It is also trivial to parse and generate. CSV is also a good choice, but it is more complex than TSV, with quoting and escaping rules that can be confusing, ambiguous and inconsistent. JSON is a good format for nested data, but it is not as easy to read or write as TSV. JSON is also not as efficient as TSV for simple tabular data. What are some key shortcomings of ZSV? ZSV is not a good choice for binary or unstructured textual data. The main limitation is that the data in the columns must not include the tab character - or newline character . This is a limitation of the TSV format. Any escaping or encoding of these characters would make the format less human-readable, harder to parse and could introduce ambiguity and consistency problems. How well is ZSV supported by tools and platforms? Today ZSV is not widely supported by tools and platforms, but it is easy to convert between TSV and ZSV using zsvutil. It should be relatively easy to add support for ZSV to any tool that supports columnar data formats. What is an ideal use case for ZSV? If you are currently using TSV files and want to improve query performance without changing your data format, ZSV is a good choice. Simple Columnar Storage Example Given products.tsv with a header line SKU- Description- Price- Region AA- Item AA- 111.11- US BB- Item BB- 222.22- US CC- Item CC- 333.33- US we would have a ZIP file products.zsv with the files SKU, Description and Price inside. Each file would have just that column's data. Note that column names MUST be allowed by .zip format as entry names. Also, the tab character - MUST NOT be used in the name. products.zsv * SKU AABBCC * Description Item AAItem BBItem CC * Price 111.11222.22333.33 * Region USUSUS Note the number of rows in each column MUST be the same, except for Constant Columns (see below). The nature of .zip files makes it possible to seek and read just the columns required without having to read/decode the other columns. Note that newline MUST NOT appear in the actual column data since it is used to separate rows. Note that each column row MUST end with a including the last one. Additional features These are features that are not required, but may be useful in some cases. They are somewhat counter to our tenet of being simple, but they may be useful enough to warrant the additional complexity. These features are mostly independent of each other, so you can use one or more of them without using the others. Constant Columns Constant Columns allow us to add an invariant column, which is useful for partition keys. Note that the field has no trailing newline . products.zsv * SKU AABBCC * Description Item AAItem BBItem CC * Price 111.11222.22333.33 * Region US Compound Columns If a collection of columns are always accessed together, it may make sense to combine them, for example if SKU and Description were never accessed independently, we could make products.zsv look like this: products.zsv * SKU AABBCC * Description-Price Item AA-111.11Item BB-222.22Item CC-333.33 * Region US Note that Constant Columns MUST NOT participate in Compound Columns. Note that along with newline , the tab - character MUST NOT appear in the column data in a Compound Column. Each row in any column MUST include the same number of columns as its entry name. Repeated Columns Data may be repeated using Compound Columns, if desired, for example: products.zsv * SKU AABBCC * Description-Price Item AA-111.11Item BB-222.22Item CC-333.33 * Price 111.11222.22333.33 * Region US It is up to the reader to decide the optimal combination of ZIP entries to read to meet the requirements and avoid reading unnecessary data. The same combination of columns may appear in a different order, especially when the data is sorted. Nested/Binary Data Data may be nested by storing a ZIP of compressed row blob files inside the ZSV. products.zsv * SKU AABBCC * Description-Price Item AA-111.11Item BB-222.22Item CC-333.33 * -Images (inner stored ZIP) + 0 <> + 1 <> + 2 <> Data stored inside, Image data for BB, for example, is directly seekable and fetchable without reading through any of the other data. The image data itself may be compressed, but Images ZIP itself would not be compressed inside products.zsv. Note the column name is prefixed with a tab - character to indicate to the reader that this is a nested column. Row Groups Row Groups may be used to split up longer data sets inside a bigger .zsv. This is done by repeating the column file names followed by a double tab -- and a unique number for each rowgroup. products.zsv * SKU--0 AABB * Description-Price--0 Item AA-111.11Item BB-222.22 * Region--0 US * SKU--1 CC * Description-Price--1 Item CC-333.33 * Region--1 US Note the number of rows in each column of the row group MUST be equal. The columns referenced in each row group MUST be equal. Columns referenced in each row group SHOULD be in the same order and grouped together, however, this is not a strict requirement and readers MUST NOT assume an order of files. Constant columns may be different in each row group when named with the double tab or there can be a single constant column as though there were no rowgroups. products.zsv * SKU--0 AABB * Description-Price--0 Item AA-111.11Item BB-222.22 * SKU--1 CC * Description-Price--1 Item CC-333.33 * Region US Metadata ZIP files support having comments on file entries inside. This may be used to hold metadata about the contents that are otherwise unavailable, such as row counts, partition information, sorting, distinct values, min/max text or values, all in a bare keyname JSON format. products.zsv * SKU--0 {rows:2, distinct:2, maxlength:2, min:"AA", max:"BB"} AABB * Description--0 {rows:2, distinct:2, maxlength:7} Item AAItem BB * Price--0 {rows:2, distinct:2, minvalue:111.11, maxvalue:222.22} 111.11222.22 * SKU--1 {rows:1, distinct:1, maxlength:2, min:"CC", max:"CC"} CC * Description--1 {rows:1, distinct:1, maxlength:7} Item CC * Price--1 {rows:1, distinct:1, minvalue:333.33, maxvalue:333.33} 333.33 * Region {} US Alternative CSV Inner Format While TSV is the preferred inner format for ZSV, a form using CSV is also possible. Each line has comma separated values and each value is either a quoted string with JSON escapes possible, a JSON number, or a bare string, but with no escapes and with forbidden characters. products.zsv * SKU "AA""BB""CC" * Description-Price "Item AA",111.11"Item BB",222.22"Item CC",333.33 * Price 111.11 222.22 333.33 * Region "US" About ZSV Utility for converting csv/tsv to/from zip-separated-values Resources Readme License MIT license Activity Stars 17 stars Watchers 1 watching Forks 0 forks Report repository Releases No releases published Packages 0 No packages published Languages * Java 100.0% Footer (c) 2024 GitHub, Inc. Footer navigation * Terms * Privacy * Security * Status * Docs * Contact * Manage cookies * Do not share my personal information You can't perform that action at this time.