Build Your Own Fast, Persistent KV Store

https://dinesh.wiki/posts/build-your-own-persistent-kv-store/
https://github.com/dineshgowda24/bitcask-rb
2023.2.13 · databases

Contents

1. Bitcask
   1. Goals
2. Data Format
   1. What's the essential thing that we need to store?
   2. Given a cursor at the location of the data in the file, can you read the data?
   3. Storing primitive data types
   4. Auditing & Security
3. Number System
4. Largest Key & Value Size
5. Stitching Everything
   1. Serialize & Deserialize
   2. DiskStore
6. Closing thoughts

Databases have always fascinated me, and I have long dreamt of building a simple hobby database for fun. I have read several blog posts about building your own redis, git, compiler, and interpreter, but none about building a database. This post is the outcome of a long and treacherous search for articles on making a database, which eventually led me to the Bitcask paper.

Bitcask

Bitcask is a local key/value store that persists data in a file. Only one operating system process opens the file for writing at any time. When the file reaches a configured size threshold, we close it and open a new one.

Goals

1. Low latency per item read or written.
2. High throughput, especially when writing an incoming stream of random items.
3. Ability to handle datasets much larger than RAM without degradation.
4. Crash friendliness, both in terms of fast recovery and not losing data.
5. Ease of backup and restore.
6. A relatively simple, understandable code structure and data format.

Data Format

Let's start breaking it down.

What's the essential thing that we need to store?

1. Key
2. Value

[Figure: a record holding just the key and the value]

Given a cursor at the location of the data in the file, can you read the data?

The answer is no. Let me explain: since both the key and the value are variable length, we cannot tell how many bytes to read.
So we need to store the size of the key and the size of the value as well.

[Figure: a record holding keysize, valuesize, key and value]

Now that the key and value sizes are written in the file, given a file cursor pointing to the location of the data, we know the first 8 bytes represent the key size and the value size. Once we read those, we know exactly how many bytes the actual key and value occupy.

Storing primitive data types

Our kv database is almost MVP-ready, but suppose we want to store data types like integer, float and string. Once the data is written to the file, the type is lost and everything is read back as a string, because we are not storing any type information. To preserve it, let's store two more fields, keytype and valuetype, representing the type of the data stored.

[Figure: a record holding keysize, valuesize, keytype, valuetype, key and value]

Auditing & Security

For auditing and security, Bitcask suggests storing a 32-bit epoch timestamp and a CRC (Cyclic Redundancy Check), respectively. These values are generated when data is written to the file. The final data format looks like the one below; we store 20 bytes of metadata for every key and value.

* The first 4 bytes are a 32-bit unsigned integer representing the CRC.
* The next 4 bytes are a 32-bit unsigned integer representing the epoch timestamp.
* The next 8 bytes are two 32-bit unsigned integers representing keysize and valuesize.
* The next 4 bytes are two 16-bit unsigned integers representing keytype and valuetype.
* The remaining bytes are the key and the value themselves.

[Figure: the final data format — crc, epoc, keysize, valuesize, keytype, valuetype, key, value]

Number System

Computers represent data as binary digits, with bits grouped into larger units such as bytes. If you look at our data format, the first 20 bytes are all unsigned integers. But by default, when integers are written to a file, they are stored as text, not binary. Let me explain: suppose we run the code below. What would be the size of the data written to the file?
```ruby
File.open('sample.txt', 'w') do |file|
  [1, 12, 123, 1234, 12345, 123456, 1234567, 12345678].each do |num|
    file.write(num)
  end
end
```

It's 36 bytes, because by default each number is written as a string, where every character costs 1 byte.

$${\textsf{ 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 = 36 bytes}}$$

But this could be more efficient. In our data format, we discussed that for any key and value we would be storing 20 bytes of metadata, so we cannot keep those fields as strings: that would make them variable length. The solution is to encode them and store them in binary format. If we encoded each of the numbers above as a 4-byte integer and stored it in binary, the file would be 32 bytes.

$${\textsf{ 4 * 8 = 32 bytes}}$$

Largest Key & Value Size

The largest key or value that can be stored in the file is a function of the integer type of keysize and valuesize respectively. Since both are 32-bit unsigned integers, the largest key or value is:

$${\textsf{max size = }}2^{32} - 1{\textsf{ bytes}} \approx {\textsf{4 GB}}$$

Stitching Everything

Let's stitch everything together, starting with the constants the serializer needs: the pack formats and sizes of the header and the CRC, and lookup tables mapping each supported data type to an integer tag and to its pack directive.

```ruby
module Bitcask
  module Serializer

    HEADER_FORMAT = 'L<L<L<S<S<' # epoc, keysz, valuesz, key_type, value_type
    HEADER_SIZE = 16

    CRC32_FORMAT = 'L<'
    CRC32_SIZE = 4

    DATA_TYPE = {
      Integer: 1,
      Float: 2,
      String: 3
    }.freeze

    DATA_TYPE_LOOK_UP = {
      DATA_TYPE[:Integer] => :Integer,
      DATA_TYPE[:Float] => :Float,
      DATA_TYPE[:String] => :String
    }.freeze

    DATA_TYPE_DIRECTIVE = {
      DATA_TYPE[:Integer] => 'q<',
      DATA_TYPE[:Float] => 'E'
    }.freeze

  end
end
```

Serialize & Deserialize

1. serialize serializes the data into our data format. It identifies the type of the data, generates a header and computes the crc. It returns the length of the record and the binary-encoded data.
2. deserialize does the opposite of serialize. It validates the crc, decodes the binary-encoded data and returns the epoc, key and value.
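Both methods lean on Ruby's `Array#pack`. As a quick sanity check of the directives our format refers to ('L<' is a 32-bit unsigned little-endian integer, 'S<' a 16-bit one, 'q<' a 64-bit signed integer, and 'E' a little-endian double):

```ruby
# Sketch: sizes produced by the pack directives used in our data format.
sizes = {
  'L<' => [7].pack('L<').bytesize,    # 4 bytes (crc, epoc, keysz, valuesz)
  'S<' => [7].pack('S<').bytesize,    # 2 bytes (keytype, valuetype)
  'q<' => [-7].pack('q<').bytesize,   # 8 bytes (Integer values)
  'E'  => [3.14].pack('E').bytesize   # 8 bytes (Float values)
}
p sizes

# pack and unpack1 are inverses for a single value.
p [-7].pack('q<').unpack1('q<')   # -7
p [3.14].pack('E').unpack1('E')   # 3.14
```

Fixed-width directives are what make the header a predictable 20 bytes regardless of the numbers stored in it.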
```ruby
module Bitcask
  module Serializer

    def serialize(epoc:, key:, value:)
      key_type = type(key)
      value_type = type(value)

      key_bytes = pack(key, key_type)
      value_bytes = pack(value, value_type)

      header = serialize_header(epoc: epoc, keysz: key_bytes.length, key_type: key_type, value_type: value_type,
                                valuesz: value_bytes.length)
      data = key_bytes + value_bytes

      [crc32_header_offset + data.length, crc32(header + data) + header + data]
    end

    def deserialize(data)
      return 0, '', '' unless crc32_valid?(desearlize_crc32(data[..crc32_offset - 1]), data[crc32_offset..])

      epoc, keysz, valuesz, key_type, value_type = deserialize_header(data[crc32_offset..crc32_header_offset - 1])
      key_bytes = data[crc32_header_offset..crc32_header_offset + keysz - 1]
      value_bytes = data[crc32_header_offset + keysz..]

      [epoc, unpack(key_bytes, key_type), unpack(value_bytes, value_type)]
    end

    def serialize_header(epoc:, key_type:, keysz:, value_type:, valuesz:)
      [epoc, keysz, valuesz, DATA_TYPE[key_type], DATA_TYPE[value_type]].pack(HEADER_FORMAT)
    end

    def deserialize_header(header_data)
      header = header_data.unpack(HEADER_FORMAT)

      [header[0], header[1], header[2], DATA_TYPE_LOOK_UP[header[3]], DATA_TYPE_LOOK_UP[header[4]]]
    end

  end
end
```

Below are a few helper functions used for packing, unpacking and generating the crc.
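Those helpers appear next. In the meantime, here is a compact, self-contained preview of what serialize and deserialize accomplish together — a sketch with the constants inlined and string keys/values only, not the module's actual API:

```ruby
require 'zlib'

HEADER_FMT = 'L<L<L<S<S<' # epoc, keysz, valuesz, key_type, value_type
STRING_TAG = 3            # the integer tag for String in DATA_TYPE

# Build one record: crc (4 bytes) + header (16 bytes) + key + value.
def build_record(epoc, key, value)
  header = [epoc, key.bytesize, value.bytesize, STRING_TAG, STRING_TAG].pack(HEADER_FMT)
  data = key + value
  [Zlib.crc32(header + data)].pack('L<') + header + data
end

# Read it back: validate the crc, decode the header, slice out key and value.
def read_record(raw)
  raise 'crc mismatch' unless raw[0, 4].unpack1('L<') == Zlib.crc32(raw[4..])

  epoc, keysz, valuesz, = raw[4, 16].unpack(HEADER_FMT)
  [epoc, raw[20, keysz], raw[20 + keysz, valuesz]]
end

record = build_record(1_676_246_400, 'hello', 'world')
p record.bytesize     # 30 = 4 (crc) + 16 (header) + 10 (key + value)
p read_record(record) # [1676246400, "hello", "world"]
```

The real module does the same thing, but dispatches through the DATA_TYPE tables so integers and floats survive the round trip with their types intact.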
```ruby
require 'zlib' # Zlib provides crc32

module Bitcask
  module Serializer

    def crc32_offset
      CRC32_SIZE
    end

    def header_offset
      HEADER_SIZE
    end

    def crc32_header_offset
      crc32_offset + header_offset
    end

    def crc32(data_bytes)
      [Zlib.crc32(data_bytes)].pack(CRC32_FORMAT)
    end

    def desearlize_crc32(crc)
      crc.unpack1(CRC32_FORMAT)
    end

    def crc32_valid?(digest, data_bytes)
      digest == Zlib.crc32(data_bytes)
    end

    def pack(attribute, attribute_type)
      case attribute_type
      when :Integer, :Float
        [attribute].pack(directive(attribute_type))
      when :String
        attribute.encode('utf-8')
      else
        raise StandardError, 'Invalid attribute_type for pack'
      end
    end

    def unpack(attribute, attribute_type)
      case attribute_type
      when :Integer, :Float
        attribute.unpack1(directive(attribute_type))
      when :String
        attribute
      else
        raise StandardError, 'Invalid attribute_type for unpack'
      end
    end

    private

    def directive(attribute_type)
      DATA_TYPE_DIRECTIVE[DATA_TYPE[attribute_type]]
    end

    def type(attribute)
      attribute.class.to_s.to_sym
    end

  end
end
```

DiskStore

Finally, let us create a class called DiskStore, our persistent key-value store. It has the following methods.

1. Get: Returns the value for a given key from the store.

```mermaid
flowchart LR;
    A([get]) --> C{key in key_dir?}
    C -->|yes| D(seek to data loc in file)
    D --> F(read log_size bytes)
    F --> G{crc valid?}
    G -->|no| E
    G -->|yes| H(deserialize)
    H --> J(return value)
    C -->|no| E(return empty string)
```

2. Put: Writes a key-value pair into the store.
```mermaid
flowchart LR;
    A([put]) --> B(generate epoc)
    B --> C(serialize data)
    C --> G(add to key_dir)
    G --> D(write to file)
    D --> E(incr write_pos by data size)
    E --> F(return)
```

```ruby
module Bitcask
  class DiskStore

    include Serializer

    def initialize(db_file = 'bitcask.db')
      @db_fh = File.open(db_file, 'a+b')
      @write_pos = 0
      @key_dir = {}
    end

    def get(key)
      key_struct = @key_dir[key]
      return '' if key_struct.nil?

      @db_fh.seek(key_struct[:write_pos])
      _epoc, _key, value = deserialize(@db_fh.read(key_struct[:log_size]))

      value
    end

    def put(key, value)
      log_size, data = serialize(epoc: Time.now.to_i, key: key, value: value)

      @key_dir[key] = key_struct(@write_pos, log_size, key)
      persist(data)
      incr_write_pos(log_size)

      nil
    end

    def flush
      @db_fh.flush
    end

    private

    def persist(data)
      @db_fh.write(data)
      @db_fh.flush
    end

    def incr_write_pos(pos)
      @write_pos += pos
    end

    def key_struct(write_pos, log_size, key)
      { write_pos: write_pos, log_size: log_size, key: key }
    end

  end
end
```

Closing thoughts

To avoid overloading you with too much material, I have left a few things out of this article. The next improvement you should think about is: how do we initialize the DiskStore from a pre-existing data file? You can implement it in the language of your choice. If you get blocked anywhere, check out the complete working code at https://github.com/dineshgowda24/bitcask-rb. It includes an implementation for initializing the DiskStore from a pre-existing data file, and benchmarks if you are interested in those.

Happy implementing your own KV store!

Updated: 2023-02-14
Tags: key-value · performance · bitcask · diy
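If you want to see the whole put/get cycle in one place before building your own, here is a deliberately tiny, string-only sketch of the flow described above. TinyDiskStore and its simplified 12-byte header are hypothetical simplifications, not the repo's actual code:

```ruby
require 'zlib'
require 'tmpdir'

# A minimal, string-only DiskStore sketch: crc (4) + epoc/keysz/valuesz
# header (12) + key + value, with an in-memory key_dir of [offset, size].
class TinyDiskStore
  HEADER_FMT = 'L<L<L<'.freeze # epoc, keysz, valuesz

  def initialize(path)
    @fh = File.open(path, 'a+b')
    @write_pos = 0
    @key_dir = {}
  end

  def put(key, value)
    header = [Time.now.to_i, key.bytesize, value.bytesize].pack(HEADER_FMT)
    data = header + key + value
    record = [Zlib.crc32(data)].pack('L<') + data
    @key_dir[key] = [@write_pos, record.bytesize] # remember where it landed
    @fh.write(record)
    @fh.flush
    @write_pos += record.bytesize
    nil
  end

  def get(key)
    pos, size = @key_dir[key]
    return '' if pos.nil?

    @fh.seek(pos)
    raw = @fh.read(size)
    return '' unless raw[0, 4].unpack1('L<') == Zlib.crc32(raw[4..])

    _epoc, keysz, valuesz = raw[4, 12].unpack(HEADER_FMT)
    raw[16 + keysz, valuesz]
  end
end

path = File.join(Dir.tmpdir, 'tiny_bitcask_demo.db')
File.delete(path) if File.exist?(path)

store = TinyDiskStore.new(path)
store.put('lang', 'ruby')
store.put('lang', 'crystal') # append-only: key_dir now points at the newest record
p store.get('lang')          # "crystal"
```

Note how an overwrite never touches old bytes — it appends a new record and repoints key_dir, which is exactly why the "initialize from a pre-existing file" step amounts to replaying records to rebuild key_dir.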