https://unum.cloud/post/2021-12-31-dbms-startups/

 

     

2021 in Database Startups: Gold Rush!

The Motifs, the Promises and the Technical Debt of the hottest DBMS
software of 2021

2021-12-31 2703 words 13 mins read

Contents

  * BIIIG Disclaimer
  * Why is there Action?
      + How was the money spent?
      + The Technical Debt
  * The New Cohort, the Wannabees
      + Graphs & Series vs Tables & Docs
  * 2021 for Unum & UnumDB

I recently read an article by Andy Pavlo, one of the most famous
people in the database world, reflecting on the database market in
2021. He calls today "the golden age of Database Management
Software" (DBMS), and he is right! In one year, many of the startups
in the space have raised more than in their entire previous
decade-long history!

Gold Rush in DBMS in 2021

When you see separate news on yet another DBMS startup, their rounds
and their promises - you read it, you agree with it, and you forget
it. But once you stitch everything together, every piece of the
puzzle, every article, the numbers just become absurd! Let's
investigate!

---------------------------------------------------------------------

To skip the disclaimer and cut to the chase, jump here:

  * Biggest past DBMS IPOs and the rise of interest
  * Biggest DBMS private rounds in 2021

To discuss on HackerNews, jump into this thread.

BIIIG Disclaimer

Before we go further, I want to remind readers that we are hardly
unbiased. We have an internal DBMS project - UnumDB, that lets us
analyze Peta-scale datasets without Google-scale budgets. We needed
something truly fast for our analytical workloads, and when
everything else failed, we wrote a new DBMS from scratch! Now it's in
private beta, but it will be a publicly available product next year.
So I will be talking about our future competitors, and you should
probably discount for that.

Second, we are quite critical about modern programming practices,
business and money, in general. We will see some big numbers below,
but this is hardly an indicator of a rapidly advancing technology.
Often the opposite.

Gartner Hype Cycles

People immediately forget about deep fundamental research or
engineering when the hype comes. Building a software business becomes
a buzzword-fitting contest. The more "AI" and "Cloud" you can fit on
your landing page, the better the outcome. We of all people should
know: it's in our domain name. We are yet to see the need and the
ability to digest all of that capital.

Dumpling Warrior

Big money means big responsibility, so let's just hope they will use
it for the best of our industry!

Why is there Action?

Database Management Software was a pretty dull kind of IT, at least
until recently. The last time it was hip, Oracle was built. Then, a
few decades and a ton of open-source software later, new companies
emerged.

Startup    Raised pre-IPO  IPO Date   Raised @ IPO  Current Cap  Focus
Mongo      311M            Nov, 2017  192M          36B          documents
Elastic    162M            Nov, 2018  252M          11B          documents
Snowflake  1400M           Sep, 2020  3360M         104B         tables

Fueled by seemingly infinite VC pockets, all of them do essentially
the same thing - storage infrastructure. They may call it a "data
lake", a "Database Management Software" (DBMS), a "cloud-native
storage solution", or something else, but it's just a piece of code
that saves data to disk and retrieves it. The faster and more
scalable your solution, the better your technology. The metric
is as simple as that.

The fact that Snowflake could raise 1.4 Billion before going
public is already shocking. On its public debut, the company was
valued at 22B. A few hours later, it skyrocketed to 75B,
attracting even wider swarms of VCs to this seemingly forgotten IT
sector. On that day, Snowflake drew 3.36 Billion more, bringing
the total to 4.76B. There are over 100 countries worldwide with
annual government budgets smaller than that, including countries with
a population of over 25 million!

How was the money spent?

Before continuing to the new contenders, it's worth analyzing how the
last cohort spent this year. Here are some takeaways from the FY2021
income statements of MDB, ESTC and SNOW.

Company    Revenue  Cost of Revenue  G&A   M&S   R&D   EBIT
Mongo      590M     177M (30%)       92M   325M  205M  -206M
Elastic    608M     161M (26%)       104M  274M  199M  -129M
Snowflake  592M     243M (41%)       176M  479M  238M  -544M

Everyone closed this year at a net loss. All had revenues of around
600M, with the cost of revenue at around 25%-40% of that.
Generally, DBMS vendors only target the top 3 clouds for managed
deployments, meaning that these 3 DBMS brands paid about 580M combined
to AWS, Azure and GCP. We must also remember that every one of the
cloud brands has its adaptation of MongoDB and Elastic. I can only
imagine how much revenue Amazon DocumentDB has. One part is clear:
none of the companies spent more on R&D than on M&S, and we can feel
that in the accumulated technical debt.
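The figures in this paragraph can be re-derived from the table above. Here is a minimal sketch that only re-types the table into a dictionary. The variable and function names are mine and purely illustrative.

```python
# Figures in millions of USD, copied from the FY2021 table above.
statements = {
    "Mongo":     {"revenue": 590, "cost_of_revenue": 177, "m_and_s": 325, "r_and_d": 205},
    "Elastic":   {"revenue": 608, "cost_of_revenue": 161, "m_and_s": 274, "r_and_d": 199},
    "Snowflake": {"revenue": 592, "cost_of_revenue": 243, "m_and_s": 479, "r_and_d": 238},
}


def cost_share(row):
    """Cost of revenue as a whole-number percentage of revenue."""
    return round(100 * row["cost_of_revenue"] / row["revenue"])


# Combined cost of revenue across the three companies.
total_cost_of_revenue = sum(r["cost_of_revenue"] for r in statements.values())

# Per-company share of revenue that goes into cost of revenue.
shares = {name: cost_share(row) for name, row in statements.items()}

# Did every company spend less on R&D than on Marketing & Sales?
rd_below_ms = all(r["r_and_d"] < r["m_and_s"] for r in statements.values())

print(total_cost_of_revenue, shares, rd_below_ms)
```

Treating the whole cost of revenue as cloud spend is the assumption of the paragraph above, not something the income statements break out.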

The Technical Debt

GAAP and IFRS both have rules regarding what is R&D spending.
However, companies are never forced to elaborate on how that R&D
further splits into different projects within the company. Let's take
Elastic as an example. From their income statements, R&D budgets
were:

  * 2019: 100M
  * 2020: 165M
  * 2021: 199M

They have spent about 464M over three years on R&D alone. As of right
now, I would describe the ELK stack as two parts (even though it's
technically three products): storing data and visualizing it. That
is, the old core part and the new one.

How old, you might ask? The oldest commit in the ElasticSearch
repository is from 2010, but the core is much older. ElasticSearch is
just a wrapper around the open-source Lucene library, and the latter
was started in September 2001. It's still maintained, but early
design decisions invariably affect the outcome. The world was very
different back in 2001. The first dual-core CPUs were just announced
and would take another four years to reach the desktop market in
2005! Things such as multi-threading, a first-class citizen in modern
software, could only be retrofitted there.

I guess, about 20 years ago, people still hoped that high-performance
software could be written in high-level languages. Today we know
that it doesn't work. You need either C, C++, or Assembly. If you have
a long-running application, you must think about the memory allocation
strategies beforehand to reduce the unavoidable memory fragmentation
in automatic reference counting environments and avoid multi-thread
stalls in the garbage-collected ones.

I am not even touching the numerous vulnerabilities that a big &
popular piece of high-level open-source software would collect over
the years. Such projects have so many dependencies that security
issues like the recent Log4j one become unavoidable. Elastic suffered
from that vulnerability just as much as every other data processing
tool written in Java. The world changes, the hardware changes, the
software must change too. Due to the engineering problems above,
Elastic isn't vertically scalable by design.

The same applies to theoretical research. If you were to design a
text search two decades ago, without deep Theoretical Computer
Science (CS) training, you would probably build an "inverted index".
Simply speaking, for every word you keep a list of identifiers of the
documents that contain it. Once fetched, search results are scored
with TF-IDF, an algorithm published in 1972. "Substring Search" is one
of the most studied problems in CS. Dozens of efficient algorithms
have been published in the last 50 years. Of course, a complex
algorithm with the lowest asymptotic curve doesn't always translate
into faster code. Still, it would be nice to have Tech startups
competing in their ingenuity and investing in fundamental research or
advanced engineering, not in the vibrancy of their dashboards.
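To make the "inverted index plus TF-IDF" recipe concrete, here is a minimal sketch. It is not how Lucene or ElasticSearch implement it. The whitespace tokenizer, the sample documents, the function names and the `log(N / df)` flavour of IDF are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace splitting is a good enough tokenizer.
    return text.lower().split()


def build_index(docs):
    """Map every word to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][doc_id] = count
    return index


def search(index, num_docs, query):
    """Score documents by the sum of tf * idf over the query words."""
    scores = defaultdict(float)
    for word in tokenize(query):
        postings = index.get(word, {})
        if not postings:
            continue  # the word appears in no document
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample documents.
docs = {
    1: "fast database for fast analytics",
    2: "database for documents",
    3: "graph analytics engine",
}
index = build_index(docs)
results = search(index, len(docs), "fast analytics")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Document 1 matches both query words and ranks first, document 3 matches only "analytics", and document 2 is never touched because none of its posting lists are fetched. With this IDF variant a word present in every document contributes a score of zero.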

The whole situation may not be so dramatic, but it's how I see most
of the startups below. Still, there must be a place for one more
Snowflake in such a vast market. Who is it going to be?

The New Cohort, the Wannabees

I may have ended the last part with too obvious a question.
Everyone knows who will be the next Snowflake - it's Databricks! They
go after the same customers. They compete in the same benchmarks,
they have raised mind-boggling sums of money, and both were founded
by a long list of names, most of whom held leadership positions in
the most prominent American enterprises. Both are, of course,
headquartered in Silicon Valley.

I took the first 20 DBMS startups I could remember that have raised
over 50M during their lifetime.

Company          Raised in 2021  Total Raised  Valuation  Total Rounds  Share Raised in 2021
Databricks       2600M           3500M         38B        9             74%
CockroachDB      438M            633M          5B         9             69%
Neo4J            392M            582M          3B         10            67%
Clickhouse       300M            300M          2B         2             100%
Yugabyte         188M            291M          2B         5             64%
Redis            111M            356M          5B         10            31%
TigerGraph       105M            171M          1B         6             61%
Starburst        100M            164M          1B         3             60%
PlanetScale      80M             105M          1B         4             76%
SingleStore      80M             318M          1B         10            25%
Materialize      60M             100M          1B         3             60%
TimescaleDB      40M             71M           1B         5             56%
Datastax         37M             227M          1B         9             16%
TiDB             0M              341M          3B         5             0%
MariaDB          0M              123M          1B         11            0%
InfluxData       0M              120M          1B         5             0%
Greenplum (EMC)  0M              96M           0B         6             0%
OmniSci (MapD)   0M              92M           0B         5             0%
Kinetica         0M              77M           0B         4             0%
SQream           0M              77M           0B         4             0%

All the data was gathered from Crunchbase. The valuations aren't
always reported, so I had to make a few guesses here and there,
rounding to the closest number of billions. This list may not be
exhaustive, but it's impressive, nevertheless. Let's do some
arithmetic:

 1. There are a total of 20 famous DBMS startups on that list.
 2. Together they raised a total of 7.7 Billion throughout their
    lifetime.
 3. Out of 20, 13 have raised in 2021.
 4. Those 13 companies have attracted 6.8 Billion over their
    lifetime.
 5. Out of that 6.8 Billion, 4.5 Billion was attracted in 2021
    alone.
 6. This year they raised about 2x as much as in all previous years
    combined!
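The arithmetic in the list above can be reproduced directly from the table. This is a small sketch that only re-types the "Raised in 2021" and "Total Raised" columns. The dictionary layout and variable names are mine.

```python
# (raised in 2021, total raised), in millions of USD, from the table above.
funding = {
    "Databricks": (2600, 3500), "CockroachDB": (438, 633),
    "Neo4J": (392, 582), "Clickhouse": (300, 300),
    "Yugabyte": (188, 291), "Redis": (111, 356),
    "TigerGraph": (105, 171), "Starburst": (100, 164),
    "PlanetScale": (80, 105), "SingleStore": (80, 318),
    "Materialize": (60, 100), "TimescaleDB": (40, 71),
    "Datastax": (37, 227), "TiDB": (0, 341),
    "MariaDB": (0, 123), "InfluxData": (0, 120),
    "Greenplum": (0, 96), "OmniSci": (0, 92),
    "Kinetica": (0, 77), "SQream": (0, 77),
}

# Lifetime total across all 20 companies.
total_lifetime = sum(total for _, total in funding.values())

# Companies that raised anything in 2021.
active = {name: v for name, v in funding.items() if v[0] > 0}
active_lifetime = sum(total for _, total in active.values())
raised_2021 = sum(year for year, _ in active.values())

# 2021 money versus everything those companies raised before 2021.
ratio = raised_2021 / (active_lifetime - raised_2021)

print(len(funding), total_lifetime)           # companies, lifetime total
print(len(active), active_lifetime, raised_2021)
print(round(ratio, 2))
```

The ratio compares 2021 against all earlier years for the 13 active companies only, which is how I read the last item of the list.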

Graphs & Series vs Tables & Docs

Databricks, Redis, Cockroach Labs and Neo4J seem to be the closest
to IPOs, but only Redis has publicly announced the intention. It
is expected to go public in May 2022 at a 5B valuation. Neo4J
rebranded this year and raised three rounds between June and
November, 174M + 152M + 66M, according to Crunchbase.

Investors across all those companies intersect a lot. In alphabetical
order, it's Altimeter, Andreessen Horowitz, Benchmark, Coatue,
Index, Insight, Redpoint, Tiger Global. The usual suspects. This year
they were mostly interested in the following DBMS sectors:

 1. Graphs, like Neo4J, TigerGraph
 2. Time-Series, like ClickHouse
 3. In-Memory, like Redis

The classics are still king, however. Gartner estimates that the DBMS
market will reach 150 Billion by 2026. Tabular (commonly called
relational) databases still represent the most significant, yet
shrinking, part of the market. Snowflake, Databricks, PlanetScale,
CockroachDB, TiDB, Yugabyte, Greenplum are all examples of companies
that solve the same problem - horizontal scalability of relational
DBs. It proved to be a substantial limiting factor of the open-source
Postgres and MySQL, so myriads of startups started patching, forking
and wrapping them, solving one scalability issue at a time. Greenplum
was doing that in 2005. PlanetScale began to do that in 2018. Time
goes by, but very few things change.

DB Rankings

Luckily, non-tabular databases have spiced things up! They are
growing so fast that they are stretching the whole market by an
average of 17%/year! Of those, Mongo is on top! Couchbase was their
rising competitor until it flopped during IPO this year. It is now
valued at 1B, 35x less than Mongo. Out of the big players, MongoDB
seems the most promising to us. Their stack is C++. They don't try to
adapt SQL, which is excellent! They have a JSON-like query language,
which isn't great, but gets most of the work done. Hopefully, it's
just a step towards altogether abandoning text-based communication
protocols and moving to the bright future of Remote Procedure Calls.

Mongo also made a bunch of notable strategic acquisitions over its
history. One of them, Realm, was the most ambitious and auspicious
persistent storage engine for mobile. Great for the Mongo developer
community and great for the future of building apps! Most
importantly, though, MongoDB acquired WiredTiger. It's an open-source
project that implements a persistent key-value store (KVS), the
essential piece of software for building a DBMS company. Only a few
companies have ever made decent persistent data structures:

 1. LevelDB, by Google
 2. RocksDB, by Facebook
 3. WiredTiger, acquired by MongoDB
 4. UnumDB, by Unum 

As you can imagine, it's pretty hard to get right if only the titans
of the industry try. Everyone who wants to be a trillion-dollar
internet company should have a good KVS. Oracle probably doesn't have
one. That's why they are worth only 250B.

A considerable part of the startups in the list above are built by
composing RocksDB and an open-source query evaluation engine. In
comparatively good cases, a startup would choose a piece of a
database to optimize and focus on that tiny part, never acknowledging
the whole picture. It's less risky, more understandable, and VCs are
ready to fund that!

2021 for Unum & UnumDB

On the other side of the globe, we are the upside-down ones! The
geeks, the freaks, the nerds, who have been doing weird risky fun
things for the past 6 years. The ones who assemble their
liquid-cooled servers to design the fastest software on bare-metal!

---------------------------------------------------------------------

This year was huge for us. Here are some of our 2021 achievements and
hopes for 2022:

  * Our generic GraphBLAS implementation outperformed cuBLAS and
    CUTLASS on Nvidia Ampere, Turing & Volta GPUs in classical GEMM
    workloads on some matrix sizes. I guess it is the wet dream of
    every HPC developer - to design a fast matrix multiplication
    kernel. We are not implementing the whole BLAS or GraphBLAS
    specification in any case. We intersect in functionality but
    mostly try to target very large, sparse, low-precision
    matrix-to-matrix operations, like the ones needed for training
    modern Artificial Neural Networks.

  * We progressed from fine-tuning pre-trained Transformer Neural
    Networks, as we did since 2017, to training them from scratch.
    It's exceptionally computationally demanding and requires
    research in numerically-stable optimization algorithms. People
    seem to favour large-batch methods like LAMB and LARS, but the
    evidence of their superiority over AdamW isn't always conclusive.

  * We increased the fleet of AVX-512-enabled x86 machines and ARM
    CPUs, which led to even more exciting optimizations in our
    internal frameworks, mainly compression and search. Still waiting
    for proper SVE2 hardware support in 2022. Data-centers will
    receive ginormous upgrades next year, with budgets sometimes
    exceeding the numbers in this article. We will be upgrading too -
    here is what everyone will be buying! I can't wait to have 1 PB
    of low-latency NVMe storage in a 1U chassis.

  * Linux is increasingly becoming a bottleneck for High-Performance
    Computing, especially High-Performance I/O. The ceiling seems to
    be around 10M operations/second, or about 8x Intel Optane SSDs per
    storage node. We are fully utilizing io_uring and GPUDirect
    Storage and actively working on SPDK to bypass the Linux kernel
    and the file system altogether. The SSD becomes a
    block-addressable asynchronous device. We decide what to write
    and at which addresses. This comes at extreme engineering costs
    but seems wholly justified. It will further amplify what is
    already an order of magnitude speed improvement in randomized
    read operations, critical for analytical BI applications. We have
    recently released our 100 GB/node benchmarks and just finished
    conducting 1 TB/node benchmarks:

Batch-Read speed of UnumDB, RocksDB, LevelDB and WiredTiger

At this scale, the staggering difference in performance translates
into stress tests and benchmarks that can run for days, weeks or even
months, similar to training State-of-the-Art neural networks. Even on
something as small as 1 Billion entries, WiredTiger would crash,
while other DBs would continue. So we had to restart the benchmarks
a bunch of times. UnumDB, on the other hand, spent less time finishing
the 10 TB benchmark than others spent on the 1 TB one. Seems like
work worth continuing, right?
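For readers unfamiliar with the "block-addressable device" idea from the list above, here is a toy sketch of what it means for the application, not the file system, to decide what lands at which address. This is not UnumDB code. It is synchronous, uses an ordinary preallocated file as a stand-in for a raw SSD, and involves no io_uring, SPDK or GPUDirect Storage. The `BlockStore` class, its methods and the 4096-byte block size are all hypothetical.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed block size; real devices report their own


class BlockStore:
    """Hypothetical fixed-size block store over a preallocated file."""

    def __init__(self, path, num_blocks):
        self.num_blocks = num_blocks
        # Preallocate the whole address space up front, like a raw device.
        with open(path, "wb") as f:
            f.truncate(num_blocks * BLOCK_SIZE)
        self.file = open(path, "r+b")

    def _check(self, block_id):
        if not 0 <= block_id < self.num_blocks:
            raise IndexError("block address out of range")

    def write_block(self, block_id, payload):
        """Write one zero-padded block at the address the caller chose."""
        self._check(block_id)
        if len(payload) > BLOCK_SIZE:
            raise ValueError("payload larger than one block")
        self.file.seek(block_id * BLOCK_SIZE)
        self.file.write(payload.ljust(BLOCK_SIZE, b"\0"))

    def read_block(self, block_id):
        """Read back exactly one block from the given address."""
        self._check(block_id)
        self.file.seek(block_id * BLOCK_SIZE)
        return self.file.read(BLOCK_SIZE)

    def close(self):
        self.file.close()


with tempfile.TemporaryDirectory() as tmp:
    store = BlockStore(os.path.join(tmp, "device.bin"), num_blocks=16)
    store.write_block(7, b"hello")  # the caller picks the addresses
    store.write_block(2, b"world")
    block_7 = store.read_block(7)
    block_0 = store.read_block(0)   # never written, still all zeros
    store.close()

print(block_7[:5], len(block_7))
```

A real engine would issue such reads and writes asynchronously and in batches, which is where the kernel-bypass work mentioned above comes in.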

---------------------------------------------------------------------

Together with upcoming hardware upgrades, such software would store
all the data and run all the analytics of a Fortune 500 company on
just a couple of servers, with minimal latency, higher flexibility,
tremendously lower costs and higher security than today!

What a time to be an engineer! No matter where you are, in Africa or
America, in Europe or Asia, or at the crossroads, like us, we all
have equal access to knowledge! Just open up arxiv.org, and let's
start hacking, like in the old days! Good luck and much health to all
of us in 2022!

Subscribe to our newsletter! If you also work with Terabytes of
data and want to try our magical UnumDB, let us know!

Author Ashot Vardanian

LastMod 2021-12-31

(c) 2015 - 2021 Unum