https://www.nayuki.io/page/binary-array-set Project Nayuki Navigation * Recent * Programming * Math * Random * About/Contact * GitHub --------------------------------------------------------------------- [binary-arr] Binary array set Introduction The binary array set^[0] is a very space-efficient data structure that supports adding elements and testing membership reasonably quickly. It basically works as a collection of sorted arrays with power-of-2 sizes. Adding one element to a BAS takes amortized Th(log n) time (worst-case Th(n)), searching/testing for an element takes worst-case Th((log n)^2) time, and removing an element is not supported at all (although the entire set can be cleared easily). Despite the lack of deletion functionality, the data structure is still useful in applications that only add and test but don't delete - for example, breadth-first search maintains an ever-growing set of visited nodes that shouldn't be revisited. To compare time complexities with a popular alternative, a balanced binary search tree takes worst-case Th(log n) time alike for adding, testing, or removing one element. The binary array set data structure is succinct because it only uses Th(log n) extra space for overhead (i.e. the space not used for storing the element values themselves). For modest sizes of n (e.g. over 1000), the amount of overhead space used by a BAS is practically negligible. The data structure that best minimizes overhead space would be the single flat array (which uses only Th(1) extra space), but it doesn't support both fast insertion and testing simultaneously. Note that despite the word "array" in the data structure's name, the set is in fact orderless and can present elements in any order. By comparison, a balanced BST uses Th(n) extra space for pointers to all the tree nodes. (B-trees also use Th(n) extra space, but the scaling constant approaches zero as the degree increases.) This extra space usage for node pointers would be problematic in the situation where the array of raw data values (without pointers and other overheads) barely fits in system memory. This disadvantage of node-based data structures is especially painful in languages like Java, where each object (i.e. the tree nodes) inherently has tens of bytes of overhead. Internal details How this data structure works is that it has a sequence of [?]log[2] n[?] + 1 slots, where the slot at index k (0-based counting) is either null or an ascending-sorted array of length 2^k. For example, the abstract conceptual set {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13} could be represented as a binary array set in one way like this: Number of slots: [?]log[2] 13[?] + 1 = 4. Slot 0: [5]. (length 1) Slot 1: Null. Slot 2: [2, 3, 9, 13]. (length 4) Slot 3: [1, 4, 6, 7, 8, 10, 11, 12]. (length 8) Note that the size of the set is 13, whose binary representation is 1101[2]. The binary representation corresponds with which slots are filled. Also note that the union of all the arrays in the binary array set is exactly equal to the set of elements being represented, no more, no less, without duplicates. To test if a given value is a member of the set, we simply perform the standard binary search in each array (in the non-null slots) and return whether a result was found. To add a new element, we wrap the element in a new array of length 1. If slot 0 is empty, we simply store the new array there. Otherwise we merge the new array with the array already at slot 0, which forms a new sorted array of length 2. Next we check if slot 1 is empty; we repeat this merge/check process on increasing slot indexes until we find an empty slot. Interactive visualization [ ] Add value Clear set Slot # Values Source code [download-i] Java (SE 7+) + BinaryArraySet.java + BinaryArraySetTest.java (JUnit) The class implements the java.util.Set interface, so it can be used as a drop-in replacement of HashSet or TreeSet as long as remove() is not called. The iterator is not fail-fast on concurrent modifications. The maximum set size is Integer.MAX_VALUE (i.e. 2^31 - 1). The implemented iterator is not guaranteed to output the elements in sorted order. The benefits of this unsorted iterator are short code, Th(1) time complexity per element^[1], and Th(1) storage space at all times. It is indeed possible to design an iterator that yields elements in sorted order; this would require more code and a priority queue, Th(log log n) time per element, and Th (log n) space. Python + binaryarrayset.py + binaryarrayset-test.py This implementation only supports the methods {add(), clear()}, built-in functions {len(), iter()}, and operator in - so it only supports a fraction of the methods that the built-in set type has. C++ (C++11 and above) + BinaryArraySet.hpp + BinaryArraySetTest.cpp The class is modeled after the Java version, supporting the methods size(), add(), contains(), empty(), and clear(). It has far fewer features than std::set. Currently no iterator is implemented. Rust + binaryarrayset.rs + binaryarrayset-test.rs This struct and its methods are modeled after std::collections:: BTreeSet, but has fewer features. A move iterator and an immutable reference iterator are supported. License: MIT (open source) Notes * This data structure is described in Introduction to Algorithms ("CLRS") 2nd and 3rd editions, chapter 17 "Amortized Analysis", problem 17-2 "Making binary search dynamic" (page 473 in the 3rd edition). It is also adapted for use in log-structured merge-trees. * The binary-merge behavior of this data structure is like a simplified version of binomial heaps with flat arrays instead of binary tree nodes. * The binary array set can be readily adapted to a map/dictionary if needed. * [0]: The name "binary array set" for this data structure was made up by me. I had asked a question on Stack Overflow to see if there was an existing name for this structure, but the result was negative. Thus it should be impossible to search for existing literature using the BAS name. * [1]: Strictly speaking, the iterator needs worst-case Th(log n) time, amortized Th(1) time to process each element, because it needs to skip over empty slots that have no array. However, the number of slots is at most log[2] n + 1, which is less than 64 on real computers, which is a rather small number. Moreover, it's possible to use a linked list of arrays (with no empty elements) instead of an array of arrays (with empty slots), which avoids the Th(log n) time cost of skipping empty slots. Hence we can treat the act of skipping empty slots to be effectively a constant-time operation. More info * Pomona College: CSCI 140 - Introduction to Algorithms - Amortized Analysis Categories: Programming, Java, Python, C++ Last updated: 2024-02-25 --------------------------------------------------------------------- Related / browse * AA tree set * B-tree set * Compact hash map (Java) * Binomial heap * Transcript of "New Money (Filling the Void)" --------------------------------------------------------------------- Sidebar Recent [binary-arr] Binary array set [meanings-o] Meanings of home equity [drinking-d] Drinking distilled water [sorting-al] Sorting algorithms demo (Java) [poor-feedb] Poor feedback from readers (more) Random [the-versat] The versatile sieve of Eratosthenes [fast-md5-h] Fast MD5 hash implementation in x86 assembly [dvorak-key] Dvorak keyboard in use warning sign (more) RSS feed [rss-feed] Subscribe for updates --------------------------------------------------------------------- Feedback: Question/comment? Contact me Copyright (c) 2024 Project Nayuki