https://dl.acm.org/doi/10.1145/3662010.3663452

NULLS!: Revisiting Null Representation in Modern Columnar Formats

Research article, open access

Authors: Xinyu Zeng, Ruijun Meng, Andrew Pavlo, Wes McKinney, Huanchen Zhang

DaMoN '24: Proceedings of the 20th International Workshop on Data Management on New Hardware
Article No.: 10, Pages 1-10
https://doi.org/10.1145/3662010.3663452
Published: 09 June 2024
Publication Information

Published in: DaMoN '24 proceedings (June 2024, 123 pages)
ISBN: 9798400706677
DOI: 10.1145/3662010
Editors: Carsten Binnig (TU Darmstadt, Germany), Nesime Tatbul (Intel Labs and MIT, USA)
Copyright (c) 2024 Owner/Author. This work is licensed under a Creative Commons Attribution-NonCommercial International 4.0 License.
Sponsor: SIGMOD: ACM Special Interest Group on Management of Data
Publisher: Association for Computing Machinery, New York, NY, United States
Qualifiers: Research article, refereed limited
Conference: SIGMOD/PODS '24: International Conference on Management of Data, June 10, 2024, Santiago, Chile
Acceptance rates: DaMoN '24 paper acceptance rate 14 of 25 submissions, 56%; overall acceptance rate 94 of 127 submissions, 74%

Abstract

Nulls are common in real-world data sets, yet recent research on columnar formats and encodings rarely addresses Null representations. Popular file formats like Parquet and ORC follow the same design as C-Store from nearly 20 years ago, which stores only the non-Null values contiguously. But recent formats store both non-Null and Null values, with Nulls being set to a placeholder value. In this work, we analyze each approach's pros and cons under different data distributions, encoding schemes (each with a different best SIMD ISA), and implementations. We optimize the bottlenecks in the traditional approach using AVX512. We also propose a Null-filling strategy called SmartNull, which can determine the Null values best for compression ratio at encoding time. From our micro-benchmarks, we argue that the optimal Null compression depends on several factors: decoding speed, data distribution, and Null ratio. Our analysis shows that the Compact layout performs better when the Null ratio is high and the Placeholder layout is better when the Null ratio is low or the data is serially correlated.
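As a rough illustration of the two layouts the abstract contrasts, the Python sketch below encodes the same column both ways. The function names, the list-of-bool validity mask, and the default fill value of 0 are assumptions made for illustration only; they are not taken from the paper, and the sketch does not implement SmartNull or any SIMD optimization.

```python
from typing import List, Optional, Tuple

Column = List[Optional[int]]


def encode_compact(values: Column) -> Tuple[List[bool], List[int]]:
    """Compact layout: a validity mask plus only the non-Null values, stored contiguously."""
    validity = [v is not None for v in values]
    data = [v for v in values if v is not None]
    return validity, data


def decode_compact(validity: List[bool], data: List[int]) -> Column:
    """Re-expand compact data by walking the mask and consuming one value per valid slot."""
    it = iter(data)
    return [next(it) if ok else None for ok in validity]


def encode_placeholder(values: Column, fill: int = 0) -> Tuple[List[bool], List[int]]:
    """Placeholder layout: every slot is stored; Null slots hold a fill value (hypothetical default 0)."""
    validity = [v is not None for v in values]
    data = [fill if v is None else v for v in values]
    return validity, data


def decode_placeholder(validity: List[bool], data: List[int]) -> Column:
    """Positions already line up with rows, so decoding only masks out the Null slots."""
    return [v if ok else None for ok, v in zip(validity, data)]


if __name__ == "__main__":
    column: Column = [7, None, None, 9, 4, None]
    print(encode_compact(column))
    print(encode_placeholder(column))
```

In this sketch the compact form stores fewer values but needs the mask to recover row positions, while the placeholder form keeps positions aligned at the cost of storing a value for every Null slot. That is the kind of trade-off the abstract says depends on the Null ratio and the data distribution.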
Affiliations

Xinyu Zeng, Tsinghua University, https://orcid.org/0009-0002-6858-1457
Ruijun Meng, Tsinghua University, https://orcid.org/0000-0003-2311-4476
Andrew Pavlo, Carnegie Mellon University, https://orcid.org/0000-0001-6040-6991
Wes McKinney, Posit PBC, https://orcid.org/0000-0003-4028-1639
Huanchen Zhang, Tsinghua University, https://orcid.org/0009-0001-4821-1558