Build Your Own Fast, Persistent KV Store

https://dinesh.wiki/posts/build-your-own-persistent-kv-store/
https://github.com/dineshgowda24/bitcask-rb
2023.2.13 · databases

Contents

1. Bitcask
   1. Goals
2. Data Format
   1. What's the essential thing that we need to store?
   2. Given a cursor at the location of the data in the file, can you read the data?
   3. Storing primitive data types
   4. Auditing & Security
3. Number System
4. Largest Key & Value Size
5. Stitching Everything
   1. Serialize & Deserialize
   2. DiskStore
6. Closing thoughts

Databases have always fascinated me, and I have long dreamt of building a simple hobby database for fun. I have read several blog posts about building your own redis, git, compiler, and interpreter, but none about building a database. This post is the outcome of a long and treacherous search for articles on making a database, which eventually led me to the Bitcask paper.

Bitcask

Bitcask is a local key/value store that persists data in a file. Only one operating system process opens the file for writing at any time. When the file reaches a configured size threshold, we close it and open a new one.

Goals

1. Low latency per item read or written.
2. High throughput, especially when writing an incoming stream of random items.
3. Ability to handle datasets much larger than RAM without degradation.
4. Crash friendliness, both in terms of fast recovery and not losing data.
5. Ease of backup and restore.
6. A relatively simple, understandable code structure and data format.

Data Format

Let's start breaking it down.

What's the essential thing that we need to store?

1. Key
2. Value

[Figure: a record holding just the key and the value]

Given a cursor at the location of the data in the file, can you read the data?

The answer is no. Let me explain: since both the key and the value are variable length, we cannot tell how many bytes to read.
So we need to store the size of the key and the size of the value as well.

[Figure: a record holding keysize, valuesize, key and value]

Now that the key and value sizes are written in the file, given a file cursor pointing to the location of the data, we know the first 8 bytes represent the key size and the value size. Once we read those, we know exactly how many bytes the actual key and value occupy.

Storing primitive data types

Our kv database is almost MVP-ready, but suppose we want to store data types like integer, float and string. Once the data is written to the file, the type is lost and everything is read back as a string, because we are not storing any type information. To preserve it, let's store two more fields, keytype and valuetype, representing the type of the data stored.

[Figure: a record holding keysize, valuesize, keytype, valuetype, key and value]

Auditing & Security

For auditing and security, Bitcask suggests storing a 32-bit epoch timestamp and a CRC (Cyclic Redundancy Check), respectively. These values are generated when data is written to the file. The final data format looks like the one below; we store 20 bytes of metadata for every key and value.

* The first 4 bytes are a 32-bit unsigned integer representing the CRC.
* The next 4 bytes are a 32-bit unsigned integer representing the epoch timestamp.
* The next 8 bytes are two 32-bit unsigned integers representing keysize and valuesize.
* The next 4 bytes are two 16-bit unsigned integers representing keytype and valuetype.
* The remaining bytes are the key and the value themselves.

[Figure: the final data format — crc, epoc, keysize, valuesize, keytype, valuetype, key, value]

Number System

Computers represent data as binary digits, with bits grouped into larger units such as bytes. If you look at our data format, the first 20 bytes are all unsigned integers. But by default, when integers are written to a file, they are stored as text, not binary. Let me explain: suppose we run the code below. What would be the size of the data written to the file?
```ruby
File.open('sample.txt', 'w') do |file|
  [1, 12, 123, 1234, 12345, 123456, 1234567, 12345678].each do |num|
    file.write(num)
  end
end
```

It's 36 bytes, because by default each number is written as a string, where every character costs 1 byte.

$${\textsf{ 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 = 36 bytes}}$$

But this could be more efficient. In our data format, we discussed that for any key and value we would be storing 20 bytes of metadata, so we cannot keep those fields as strings: that would make them variable length. The solution is to encode them and store them in binary format. If we encoded each of the numbers above as a 4-byte integer and stored it in binary, the file would be 32 bytes.

$${\textsf{ 4 * 8 = 32 bytes}}$$

Largest Key & Value Size

The largest key or value that can be stored in the file is a function of the integer type of keysize and valuesize respectively. Since both are 32-bit unsigned integers, the largest key or value is:

$${\textsf{max size = }}2^{32} - 1{\textsf{ bytes}} \approx {\textsf{4 GB}}$$

Stitching Everything

Let's stitch everything together, starting with the constants the serializer needs: the pack formats and sizes of the header and the CRC, and lookup tables mapping each supported data type to an integer tag and to its pack directive.

```ruby
module Bitcask
  module Serializer

    HEADER_FORMAT = 'L<L<L<S<S<' # epoc, keysz, valuesz, key_type, value_type
    HEADER_SIZE = 16

    CRC32_FORMAT = 'L<'
    CRC32_SIZE = 4

    DATA_TYPE = {
      Integer: 1,
      Float: 2,
      String: 3
    }.freeze

    DATA_TYPE_LOOK_UP = {
      DATA_TYPE[:Integer] => :Integer,
      DATA_TYPE[:Float] => :Float,
      DATA_TYPE[:String] => :String
    }.freeze

    DATA_TYPE_DIRECTIVE = {
      DATA_TYPE[:Integer] => 'q<',
      DATA_TYPE[:Float] => 'E'
    }.freeze

  end
end
```

Serialize & Deserialize

1. serialize serializes the data into our data format. It identifies the type of the data, generates a header and computes the crc. It returns the length of the record and the binary-encoded data.
2. deserialize does the opposite of serialize. It validates the crc, decodes the binary-encoded data and returns the epoc, key and value.
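Both methods lean on Ruby's `Array#pack`. As a quick sanity check of the directives our format refers to ('L<' is a 32-bit unsigned little-endian integer, 'S<' a 16-bit one, 'q<' a 64-bit signed integer, and 'E' a little-endian double):

```ruby
# Sketch: sizes produced by the pack directives used in our data format.
sizes = {
  'L<' => [7].pack('L<').bytesize,    # 4 bytes (crc, epoc, keysz, valuesz)
  'S<' => [7].pack('S<').bytesize,    # 2 bytes (keytype, valuetype)
  'q<' => [-7].pack('q<').bytesize,   # 8 bytes (Integer values)
  'E'  => [3.14].pack('E').bytesize   # 8 bytes (Float values)
}
p sizes

# pack and unpack1 are inverses for a single value.
p [-7].pack('q<').unpack1('q<')   # -7
p [3.14].pack('E').unpack1('E')   # 3.14
```

Fixed-width directives are what make the header a predictable 20 bytes regardless of the numbers stored in it.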
```ruby
module Bitcask
  module Serializer

    def serialize(epoc:, key:, value:)
      key_type = type(key)
      value_type = type(value)

      key_bytes = pack(key, key_type)
      value_bytes = pack(value, value_type)

      header = serialize_header(epoc: epoc, keysz: key_bytes.length, key_type: key_type, value_type: value_type,
                                valuesz: value_bytes.length)
      data = key_bytes + value_bytes

      [crc32_header_offset + data.length, crc32(header + data) + header + data]
    end

    def deserialize(data)
      return 0, '', '' unless crc32_valid?(desearlize_crc32(data[..crc32_offset - 1]), data[crc32_offset..])

      epoc, keysz, valuesz, key_type, value_type = deserialize_header(data[crc32_offset..crc32_header_offset - 1])
      key_bytes = data[crc32_header_offset..crc32_header_offset + keysz - 1]
      value_bytes = data[crc32_header_offset + keysz..]

      [epoc, unpack(key_bytes, key_type), unpack(value_bytes, value_type)]
    end

    def serialize_header(epoc:, key_type:, keysz:, value_type:, valuesz:)
      [epoc, keysz, valuesz, DATA_TYPE[key_type], DATA_TYPE[value_type]].pack(HEADER_FORMAT)
    end

    def deserialize_header(header_data)
      header = header_data.unpack(HEADER_FORMAT)

      [header[0], header[1], header[2], DATA_TYPE_LOOK_UP[header[3]], DATA_TYPE_LOOK_UP[header[4]]]
    end

  end
end
```

Below are a few helper functions used for packing, unpacking and generating the crc.
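Those helpers appear next. In the meantime, here is a compact, self-contained preview of what serialize and deserialize accomplish together — a sketch with the constants inlined and string keys/values only, not the module's actual API:

```ruby
require 'zlib'

HEADER_FMT = 'L<L<L<S<S<' # epoc, keysz, valuesz, key_type, value_type
STRING_TAG = 3            # the integer tag for String in DATA_TYPE

# Build one record: crc (4 bytes) + header (16 bytes) + key + value.
def build_record(epoc, key, value)
  header = [epoc, key.bytesize, value.bytesize, STRING_TAG, STRING_TAG].pack(HEADER_FMT)
  data = key + value
  [Zlib.crc32(header + data)].pack('L<') + header + data
end

# Read it back: validate the crc, decode the header, slice out key and value.
def read_record(raw)
  raise 'crc mismatch' unless raw[0, 4].unpack1('L<') == Zlib.crc32(raw[4..])

  epoc, keysz, valuesz, = raw[4, 16].unpack(HEADER_FMT)
  [epoc, raw[20, keysz], raw[20 + keysz, valuesz]]
end

record = build_record(1_676_246_400, 'hello', 'world')
p record.bytesize     # 30 = 4 (crc) + 16 (header) + 10 (key + value)
p read_record(record) # [1676246400, "hello", "world"]
```

The real module does the same thing, but dispatches through the DATA_TYPE tables so integers and floats survive the round trip with their types intact.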
```ruby
require 'zlib' # Zlib provides crc32

module Bitcask
  module Serializer

    def crc32_offset
      CRC32_SIZE
    end

    def header_offset
      HEADER_SIZE
    end

    def crc32_header_offset
      crc32_offset + header_offset
    end

    def crc32(data_bytes)
      [Zlib.crc32(data_bytes)].pack(CRC32_FORMAT)
    end

    def desearlize_crc32(crc)
      crc.unpack1(CRC32_FORMAT)
    end

    def crc32_valid?(digest, data_bytes)
      digest == Zlib.crc32(data_bytes)
    end

    def pack(attribute, attribute_type)
      case attribute_type
      when :Integer, :Float
        [attribute].pack(directive(attribute_type))
      when :String
        attribute.encode('utf-8')
      else
        raise StandardError, 'Invalid attribute_type for pack'
      end
    end

    def unpack(attribute, attribute_type)
      case attribute_type
      when :Integer, :Float
        attribute.unpack1(directive(attribute_type))
      when :String
        attribute
      else
        raise StandardError, 'Invalid attribute_type for unpack'
      end
    end

    private

    def directive(attribute_type)
      DATA_TYPE_DIRECTIVE[DATA_TYPE[attribute_type]]
    end

    def type(attribute)
      attribute.class.to_s.to_sym
    end

  end
end
```

DiskStore

Finally, let us create a class called DiskStore, our persistent key-value store. It has the following methods.

1. Get: Returns the value for a given key from the store.

```mermaid
flowchart LR;
    A([get]) --> C{key in key_dir?}
    C -->|yes| D(seek to data loc in file)
    D --> F(read log_size bytes)
    F --> G{crc valid?}
    G -->|no| E
    G -->|yes| H(deserialize)
    H --> J(return value)
    C -->|no| E(return empty string)
```

2. Put: Writes a key-value pair into the store.
```mermaid
flowchart LR;
    A([put]) --> B(generate epoc)
    B --> C(serialize data)
    C --> G(add to key_dir)
    G --> D(write to file)
    D --> E(incr write_pos by data size)
    E --> F(return)
```

```ruby
module Bitcask
  class DiskStore

    include Serializer

    def initialize(db_file = 'bitcask.db')
      @db_fh = File.open(db_file, 'a+b')
      @write_pos = 0
      @key_dir = {}
    end

    def get(key)
      key_struct = @key_dir[key]
      return '' if key_struct.nil?

      @db_fh.seek(key_struct[:write_pos])
      _epoc, _key, value = deserialize(@db_fh.read(key_struct[:log_size]))

      value
    end

    def put(key, value)
      log_size, data = serialize(epoc: Time.now.to_i, key: key, value: value)

      @key_dir[key] = key_struct(@write_pos, log_size, key)
      persist(data)
      incr_write_pos(log_size)

      nil
    end

    def flush
      @db_fh.flush
    end

    private

    def persist(data)
      @db_fh.write(data)
      @db_fh.flush
    end

    def incr_write_pos(pos)
      @write_pos += pos
    end

    def key_struct(write_pos, log_size, key)
      { write_pos: write_pos, log_size: log_size, key: key }
    end

  end
end
```

Closing thoughts

To avoid overloading you with too much material, I have left a few things out of this article. The next improvement you should think about is: how do we initialize the DiskStore from a pre-existing data file? You can implement it in the language of your choice. If you get blocked anywhere, check out the complete working code at https://github.com/dineshgowda24/bitcask-rb. It includes an implementation for initializing the DiskStore from a pre-existing data file, and benchmarks if you are interested in those.

Happy implementing your own KV store!

Updated: 2023-02-14
Tags: key-value · performance · bitcask · diy
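If you want to see the whole put/get cycle in one place before building your own, here is a deliberately tiny, string-only sketch of the flow described above. TinyDiskStore and its simplified 12-byte header are hypothetical simplifications, not the repo's actual code:

```ruby
require 'zlib'
require 'tmpdir'

# A minimal, string-only DiskStore sketch: crc (4) + epoc/keysz/valuesz
# header (12) + key + value, with an in-memory key_dir of [offset, size].
class TinyDiskStore
  HEADER_FMT = 'L<L<L<'.freeze # epoc, keysz, valuesz

  def initialize(path)
    @fh = File.open(path, 'a+b')
    @write_pos = 0
    @key_dir = {}
  end

  def put(key, value)
    header = [Time.now.to_i, key.bytesize, value.bytesize].pack(HEADER_FMT)
    data = header + key + value
    record = [Zlib.crc32(data)].pack('L<') + data
    @key_dir[key] = [@write_pos, record.bytesize] # remember where it landed
    @fh.write(record)
    @fh.flush
    @write_pos += record.bytesize
    nil
  end

  def get(key)
    pos, size = @key_dir[key]
    return '' if pos.nil?

    @fh.seek(pos)
    raw = @fh.read(size)
    return '' unless raw[0, 4].unpack1('L<') == Zlib.crc32(raw[4..])

    _epoc, keysz, valuesz = raw[4, 12].unpack(HEADER_FMT)
    raw[16 + keysz, valuesz]
  end
end

path = File.join(Dir.tmpdir, 'tiny_bitcask_demo.db')
File.delete(path) if File.exist?(path)

store = TinyDiskStore.new(path)
store.put('lang', 'ruby')
store.put('lang', 'crystal') # append-only: key_dir now points at the newest record
p store.get('lang')          # "crystal"
```

Note how an overwrite never touches old bytes — it appends a new record and repoints key_dir, which is exactly why the "initialize from a pre-existing file" step amounts to replaying records to rebuild key_dir.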