Let's talk SkipList

Posted Aug 6, 2022

Background

SkipLists often come up when discussing "obscure" data structures, but in reality they are not that obscure; in fact, plenty of production-grade software actively uses them. In this post I'll try to dig into SkipLists by describing how to build a toy implementation, potential optimizations, and real-world use cases.

So what are SkipLists anyway? SkipLists are referential data structures inspired by linked lists and binary trees. They are a collection of sorted linked lists arranged in levels, where the levels are designed to allow skipping over nodes, giving logarithmic complexity when searching for keys later. They are an alternative to binary trees and, in some cases, even the legendary B-Tree; in fact, the bottom level of a SkipList looks somewhat like the leaf level of a B+Tree. SkipLists perform very well with random, write-heavy workloads where ordered storage or fast lookup is required. Construction of the levels is probabilistic in nature, so these lists can become unbalanced if the underlying random number generator isn't uniform enough. They are also typically a lot easier to implement than HashMaps and trees, which makes them an attractive alternative, especially if only in-memory lookup is required.

Let's take a peek at a SkipList with a diagram.

[Diagram: SkipList]

This diagram represents a collection of linked lists organized in sorted order of the keys. There are different levels, and each level contains a subset of the elements of the level below it, in sorted order. This may depend on the implementation, but usually there are no multiple copies of the same node; the diagram is just a way of showing the same node across different levels. For example, Node 5, which is present in all 3 levels, connects to Node 24, Node 15 and Node 7. It may appear to have multiple copies, but it's the same node with one connection per level.

Toy Implementation

For the implementation I'll be using Go with generics, because I have been wanting to use them for a while now and there's no better use case than a generic, type-safe collection. Using generics often requires constraints, or traits as rustaceans say, and I couldn't find a constraint to enforce ordering in the standard library, so I used the golang.org/x/exp/constraints package. It gives us the Ordered constraint, which provides a way to compare keys and keep them sorted in our SkipList.

Structures

Let's define a struct to store the key and value for an individual node:

    type Record[K constraints.Ordered, V any] struct {
        Key   K
        Value V
    }

Then the node itself:

    type SkipNode[K constraints.Ordered, V any] struct {
        record  *Record[K, V]
        forward []*SkipNode[K, V]
    }

Note that forward[i] represents the next member of the list at the i-th level. In the diagram above, for example, forward[0] for Node 5 points to Node 7, forward[1] to Node 15 and forward[2] to Node 24.
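To make the forward slice concrete, here is a tiny hand-wired sketch of Node 5 from the diagram. It is purely illustrative: it assumes the snippet lives inside some function in the same package as the structs above, it only wires Node 5's outgoing pointers, and the values are made up.

    // Node 7 lives only on level 0, Node 15 on levels 0-1, Node 24 on levels 0-2.
    n7 := &SkipNode[int, string]{record: &Record[int, string]{Key: 7}, forward: make([]*SkipNode[int, string], 1)}
    n15 := &SkipNode[int, string]{record: &Record[int, string]{Key: 15}, forward: make([]*SkipNode[int, string], 2)}
    n24 := &SkipNode[int, string]{record: &Record[int, string]{Key: 24}, forward: make([]*SkipNode[int, string], 3)}

    // A single Node 5 with one outgoing pointer per level, not one copy per level.
    n5 := &SkipNode[int, string]{record: &Record[int, string]{Key: 5}, forward: make([]*SkipNode[int, string], 3)}
    n5.forward[0] = n7  // level 0: next node is 7
    n5.forward[1] = n15 // level 1: skips over 7
    n5.forward[2] = n24 // level 2: skips over 7 and 15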
Once we have these structs, we can define constructor functions to help us quickly build them:

    func NewRecord[K constraints.Ordered, V any](key K, value V) *Record[K, V] {
        return &Record[K, V]{
            Key:   key,
            Value: value,
        }
    }

    func NewSkipNode[K constraints.Ordered, V any](key K, value V, level int) *SkipNode[K, V] {
        return &SkipNode[K, V]{
            record:  NewRecord(key, value),
            forward: make([]*SkipNode[K, V], level+1),
        }
    }

Finally, we can write the SkipList struct as follows:

    type SkipList[K constraints.Ordered, V any] struct {
        head  *SkipNode[K, V]
        level int
        size  int
    }

We need a head node, which is basically a "dummy node" that maintains the connections into the rest of the SkipList; size is the number of elements in the SkipList; and level is the number of levels in the current SkipList, which will be used later for the Insert and Find operations. To construct the SkipList, we can write:

    func NewSkipList[K constraints.Ordered, V any]() *SkipList[K, V] {
        return &SkipList[K, V]{
            head:  NewSkipNode[K, V](*new(K), *new(V), 0),
            level: -1,
            size:  0,
        }
    }

Find Operation

The Find operation is the core algorithm behind almost every other operation on a SkipList. It's pretty much a binary search over linked lists: whenever we hit a "roadblock" in our search, we drop down a level, skipping a bunch of nodes and taking advantage of the ordered, leveled structure of the SkipList. The algorithm can be summarized as follows:

* Start looking for your key from the top level of the SkipList.
* Keep moving forward in the same level as long as the next key is smaller than the one you're looking for.
* If you happen to find the key, return its value.
* If the next key is bigger than your target key, start looking in the level below.

    func (s *SkipList[K, V]) Find(key K) (V, bool) {
        x := s.head
        for i := s.level; i >= 0; i-- {
            for {
                if x.forward[i] == nil || x.forward[i].record.Key > key {
                    break
                } else if x.forward[i].record.Key == key {
                    return x.forward[i].record.Value, true
                } else {
                    x = x.forward[i]
                }
            }
        }
        return *new(V), false
    }

Visually, we can imagine something like the following happening while looking for a key.

[Diagram: find in skiplist]

Insert Operation

To perform an insert into the SkipList, we need a couple of helper methods. The first one is getRandomLevel, which figures out the level for the new node. In our implementation we use a naive probabilistic approach, using a random function to choose levels so that the probability of a node ending up at level L is 2^-(L+1): level 0 with probability 1/2, level 1 with probability 1/4, and so on. This works for the most part and is very simple to implement, but its efficiency depends on how good the random function is; it may very well always choose level 0. Ideally we want each level of the SkipList to have about half as many nodes as the level below it, to take full advantage of the binary-search-like skipping.

    func (s *SkipList[K, V]) getRandomLevel() int {
        level := 0
        for rand.Int31()%2 == 0 {
            level += 1
        }
        return level
    }

The second helper is adjustLevel. This method takes care of growing the head node's forward pointer array so it can hold pointers for new levels, in case the new node's level is greater than what the SkipList currently contains. We don't really care what key and value we store in the head node's record, which is why I just store the zero values there via *new(K), *new(V). I could also store nil pointers, but this will do.

    func (s *SkipList[K, V]) adjustLevel(level int) {
        temp := s.head.forward
        s.head = NewSkipNode(*new(K), *new(V), level)
        s.level = level
        copy(s.head.forward, temp)
    }
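As a quick aside, one way to convince yourself that getRandomLevel produces the intended geometric distribution is to sample it a lot and eyeball the counts. The throwaway program below is only a sketch; it assumes the SkipList code above sits in a package main with fmt and math/rand imported.

    func main() {
        var s SkipList[int, int] // zero value is fine; getRandomLevel ignores the receiver's fields
        counts := make(map[int]int)
        for i := 0; i < 1_000_000; i++ {
            counts[s.getRandomLevel()]++
        }
        // Expect roughly 500k samples at level 0, 250k at level 1, 125k at level 2, ...
        for level := 0; level <= 5; level++ {
            fmt.Printf("level %d: %d\n", level, counts[level])
        }
    }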
Insertion builds on the helper methods and the Find operation described above. It can be summarized in the following steps:

* Assign a level to the new node.
* Resize the SkipList's head so it can store pointers for the assigned level.
* Look for the appropriate place for the new node at each level by comparing keys, and at the same time record the rightmost node visited at each level, since those are the nodes whose pointers will need updating. Do this as you travel down the levels.
* Update the pointers to splice the new node in at the correct position.

    func (s *SkipList[K, V]) Insert(key K, value V) {
        newLevel := s.getRandomLevel()
        if newLevel > s.level {
            s.adjustLevel(newLevel)
        }

        newNode := NewSkipNode[K, V](key, value, newLevel)
        // sized to the full height of the list (not just newLevel) so the search
        // below can record the rightmost node visited at every level
        updates := make([]*SkipNode[K, V], s.level+1)

        x := s.head
        for i := s.level; i >= 0; i-- {
            for x.forward[i] != nil && x.forward[i].record.Key < key {
                x = x.forward[i]
            }
            updates[i] = x
        }

        for i := 0; i <= newLevel; i++ {
            newNode.forward[i] = updates[i].forward[i]
            updates[i].forward[i] = newNode
        }
        s.size += 1
    }

Visually, we can imagine something like the following happening while inserting a key.

[Diagram: Insert into SkipList]

Delete Operation

The Delete operation is very similar to Find: you just keep removing references to the node as you find your key across the different levels.

    func (s *SkipList[K, V]) Delete(key K) {
        x := s.head
        for i := s.level; i >= 0; i-- {
            for {
                if x.forward[i] == nil || x.forward[i].record.Key > key {
                    break
                } else if x.forward[i].record.Key == key {
                    if i == 0 {
                        // the bottom level contains every node, so decrement size exactly once here
                        s.size -= 1
                    }
                    x.forward[i] = x.forward[i].forward[i]
                } else {
                    x = x.forward[i]
                }
            }
        }
    }

Visually, we can imagine deletion as follows.

[Diagram: Delete from SkipList]
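With Insert, Find and Delete in place, here's a small usage sketch. The keys and values are arbitrary, and as with the sampling snippet earlier it assumes everything lives in one package main with the right imports.

    func main() {
        list := NewSkipList[int, string]()
        list.Insert(5, "five")
        list.Insert(24, "twenty-four")
        list.Insert(7, "seven")
        list.Insert(15, "fifteen")

        if value, ok := list.Find(15); ok {
            fmt.Println("found:", value) // found: fifteen
        }

        list.Delete(7)
        if _, ok := list.Find(7); !ok {
            fmt.Println("7 was deleted")
        }
    }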
Improving performance

Memory Access

SkipLists, like linked lists or binary trees, are not cache friendly. Modern CPUs try to predict which memory is going to be used next. The most common heuristic assumes the program will ask for a big chunk of contiguous data (like an array), so contiguous memory blocks are prefetched into the cache lines. Linked-list nodes, however, are not necessarily contiguous, which causes thrashing in the CPU cache: only a few of the loaded blocks are relevant, the rest of the space is wasted, and the next node can live in a completely different chunk of memory. Arrays fit nicely in the L1 cache, but with a linked list there's no guarantee the next node will fit into L1, or even L2 or L3; it depends on how good the memory allocator is. The latency numbers in the following diagram give a clear picture of how bad performance can get when cache misses are too frequent: accessing the next node in a list can be on the order of 100-200x slower than accessing the next element of an array.

[Diagram: Latency]

Using a shared memory allocator

The bane of SkipLists, and linked lists in general, is their cache-unfriendly memory access pattern. Let's take a look at how more careful allocation can help improve performance.

[Diagram: Allocation]

Memory allocation can take the three forms shown in the diagram above:

(1) The allocator assigns randomly spread blocks of memory.
(2) The allocator assigns blocks of memory which are contiguous, but not necessarily in the order the list is traversed.
(3) The allocator assigns blocks of memory which are contiguous and in line with the list access pattern.

There's a good chance that the memory allocated by a naive implementation looks something like (1), chunks of memory spread all over the place, so any access to the next node of the list is going to be cache unfriendly. If we use a memory pool or a shared allocator we can get something like (2): all the memory for the nodes is carved out of a dedicated block of memory, which increases both the data cache hit rate and the TLB hit rate. (3) is what we want in the ideal situation, but the implementation can get very complicated, especially around the delete operation. Some programming languages, like C++ and Rust, make it easy to replace the memory allocator. This is trickier with Go, but it can be done with manual memory management and a custom memory pool.

Using unrolled SkipList

[Diagram: unrolled]

An unrolled SkipList is one which stores multiple elements in each node. Unrolling is meant to give better cache performance, and how much it helps depends on the size of the objects being accessed. These variants can "waste" extra space, since each node allocates an array that may not be fully occupied all the time, and the delete operation can be tricky because it creates "gaps" in the node's array. Searching is very similar to a plain SkipList; once a candidate node is found, its array is scanned with a linear search. To insert, you push into a node's array, and if no spare space is available the insertion happens by adding a new SkipList node.
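To make that a bit more concrete, a node for an unrolled variant could look roughly like the sketch below. The type name, field names and the fixed capacity of 16 are all made up for illustration rather than taken from any particular implementation.

    // Hypothetical node layout for an unrolled SkipList: each node holds a small
    // sorted array of records, so a search does one linear scan inside the node
    // instead of chasing a pointer per element.
    const unrolledCapacity = 16 // illustrative; in practice tuned to cache-line and object size

    type UnrolledSkipNode[K constraints.Ordered, V any] struct {
        records [unrolledCapacity]Record[K, V] // kept sorted; only the first `used` slots are valid
        used    int                            // occupied slots; deletes can leave "gaps" to compact later
        forward []*UnrolledSkipNode[K, V]      // one next-pointer per level, exactly as before
    }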
Keeping track of elements at each level

I talked briefly about this while describing one of the helper methods. In my implementation, level assignment is non-deterministic and based on probabilities; the problem is that with a bad random function there's a good chance we end up with an unbalanced SkipList, meaning one of the levels has far more nodes than it ideally should. To keep search performance as fast as possible, a smarter implementation could track the number of elements at each level and ensure that a level has only about half the elements of the level below it, which allows the best "skipping" while looking up a key. Of course, actively rebalancing can degrade insert and delete performance, so there is a trade-off to manage.

SkipList in the wild

SkipLists are not very common compared to, say, balanced trees or hash tables, but once in a while you do run into them. Personally I have seen them only in the following places, but this is anecdotal; if you look, you can find them in more places.

RocksDB

RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also serve as the foundation for a client-server database; one of the most popular databases using it these days is CockroachDB. RocksDB builds on LevelDB to be scalable on servers with multicore CPUs, to use fast storage efficiently, and to support IO-bound, in-memory and write-once workloads. Internally, RocksDB uses LSM Trees for storage, and that design needs a Memtable to hold sorted keys in memory before they are flushed to disk in the form of SST files. One of the Memtable implementations in RocksDB is based on a SkipList. New writes always insert data into the memtable, and reads have to query the memtable before reading from SST files, because the data in the memtable is newer. The skiplist-based memtable provides generally good performance for both reads and writes, random access and sequential scans. Besides that, it provides some useful features that other memtable implementations don't currently support, like Concurrent Insert and Insert with Hint.

Redis SortedSet

Redis Sorted Sets are similar to Sets, with the addition that every member is stored with a user-defined score. To quote antirez, skip lists were chosen over balanced trees because:

1. They are not very memory intensive. It's up to you basically. Changing parameters about the probability of a node to have a given number of levels will make them less memory intensive than btrees.
2. A sorted set is often target of many ZRANGE or ZREVRANGE operations, that is, traversing the skip list as a linked list. With this operation the cache locality of skip lists is at least as good as with other kind of balanced trees.
3. They are simpler to implement, debug, and so forth. For instance thanks to the skip list simplicity I received a patch (already in Redis master) with augmented skip lists implementing ZRANK in O(log(N)). It required little changes to the code.

MuQSS Linux Scheduler

Con Kolivas maintains a series of scheduler patch sets that he has tuned considerably over the years for his own use, focusing primarily on latency reduction for a better desktop experience. In early October 2016, Kolivas updated the design of his popular desktop scheduler patch set and renamed it MuQSS. MuQSS is a CPU scheduler with multiple run queues, one per CPU. Instead of linked lists, the queues are implemented as skip lists; Kolivas wrote a custom skip list implementation for his scheduler.

References

* https://johnysswlab.com/the-quest-for-the-fastest-linked-list/
* https://dgraph.io/blog/post/manual-memory-management-golang-jemalloc/
* https://github.com/facebook/rocksdb/wiki/MemTable#concurrent-insert
* https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
* https://www.cockroachlabs.com/docs/stable/architecture/storage-layer.html
* https://vldb.org/pvldb/vol9/p1389-srinivasan.pdf
* https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.170.719&rep=rep1&type=pdf
* https://15721.courses.cs.cmu.edu/spring2018/papers/08-oltpindexes1/skiplists-done-right2016.pdf
* https://15721.courses.cs.cmu.edu/spring2018/papers/08-oltpindexes1/pugh-skiplists-cacm1990.pdf
* https://github.com/dai-shi/excalidraw-claymate
* https://lwn.net/Articles/720227/