https://unum.cloud/post/2021-12-31-dbms-startups/

2021 in Database Startups: Gold Rush!
The Motifs, the Promises and the Technical Debt of the hottest DBMS software of 2021
2021-12-31

I recently read an article by Andy Pavlo, one of the most famous people in the database world, reflecting on the database market in 2021. He calls today - "the golden age of DataBase Management Software" (DBMS), and he is right! In one year, many of the startups in the space have raised more than in their entire previous decade-long history!

Figure: Gold Rush in DBMS in 2021

When you see separate news on yet another DBMS startup, their rounds and their promises - you read it, you agree with it, and you forget it. But once you stitch everything together, every piece of the puzzle, every article, the numbers just become absurd! Let's investigate!

---------------------------------------------------------------------

BIIIG Disclaimer

Before we go further, I want to remind readers that we are hardly unbiased. We have an internal DBMS project - UnumDB, that lets us analyze Peta-scale datasets without Google-scale budgets. We needed something truly fast for our analytical workloads, and when everything else failed, we wrote a new DBMS from scratch! Now it's in private beta, but it will be a publicly available product next year. So I will be talking about our future competitors, and you should probably discount for that.
Second, we are quite critical of modern programming practices, business and money in general. We will see some big numbers below, but they are hardly an indicator of rapidly advancing technology. Often the opposite.

Figure: Gartner Hype Cycles

People immediately forget about deep fundamental research or engineering when the hype comes. Building a software business turns into a buzzword-fitting contest. The more "AI" and "Cloud" you can fit on your landing page, the better the outcome. We of all people should know: it's in our domain name. We are yet to see the need and the ability to digest all of that capital.

Figure: Dumpling Warrior

Big money means big responsibility, so let's just hope they will use it for the best of our industry!

Why is there Action?

Database Management Software was a pretty dull kind of IT, at least until recently. The last time it was hip - Oracle was built. Then, a few decades and a ton of open-source software later, new companies emerged.

Startup     Raised pre-IPO   IPO Date    Raised @ IPO   Current Cap   Focus
Mongo       311M             Nov, 2017   192M           36B           documents
Elastic     162M             Nov, 2018   252M           11B           documents
Snowflake   1400M            Sep, 2020   3360M          104B          tables

Fueled by seemingly infinite VC pockets, all of them do essentially the same thing - storage infrastructure. They may call it a "data lake", a "Database Management Software" (DBMS), a "cloud-native storage solution", or something else, but it's just a piece of code that saves data to disk and retrieves it. The faster and more scalable your solution, the better your technology. The metric is as simple as that.

The fact that Snowflake could raise 1.4 Billion before going public is already shocking. On its public debut, the company was valued at 22B. A few hours later - it skyrocketed to 75B, attracting even wider swarms of VCs to this seemingly forgotten IT sector. On that day, Snowflake drew 3.36 Billion more, accumulating to a total of 4.76B.
There are over 100 countries worldwide with annual government budgets smaller than that, including countries with a population of over 25 million!

How was the money spent?

Before continuing to the new contenders, it's worth analyzing how the last cohort spent this year. Here are some takeaways from the FY2021 income statements of MDB, ESTC and SNOW.

Company     Revenue   Cost of Revenue   G&A    M&S    R&D    EBIT
Mongo       590M      177M (30%)        92M    325M   205M   -206M
Elastic     608M      161M (26%)        104M   274M   199M   -129M
Snowflake   592M      243M (41%)        176M   479M   238M   -544M

Everyone closed this year at a loss. All had revenues of roughly 600M, with the cost of revenue at around 25%-40%. Generally, DBMS vendors only target the top 3 clouds for managed deployments, meaning that these 3 DBMS brands paid about 580M (177M + 161M + 243M) to AWS, Azure and GCP. We must also remember that every one of the cloud brands has its own adaptation of MongoDB and Elastic. I can only imagine how much revenue Amazon DocumentDB has. One part is clear: none of the companies spent more on R&D than on M&S, and we can feel that in the accumulated technical debt.

The Technical Debt

GAAP and IFRS both have rules regarding what counts as R&D spending. However, companies are never forced to elaborate on how that R&D further splits into different projects within the company. Let's take Elastic as an example. From their income statements, R&D budgets were:

* 2019: 100M
* 2020: 165M
* 2021: 199M

They have spent about 464M over three years on R&D alone. As of right now, I would describe the ELK stack as two parts (even though it's technically three products): storing data and visualizing it. The old core part and the new. How old, might you ask? The oldest commit in the ElasticSearch repository is from 2010, but the core is much older. ElasticSearch is just a wrapper around the open-source Lucene library, and the latter was started in September 2001. It's still maintained, but early design decisions invariably affect the outcome.
The world was very different back in 2001. The first dual-core CPUs had just been announced, and it would take another four years for them to reach the desktop market in 2005! Things such as multi-threading, a first-class citizen in modern software, could only be retrofitted there.

I guess, about 20 years ago, people still hoped that high-performance software could be written in high-level languages. Today we know that doesn't work. You need either C, C++, or Assembly. If you have a long-running application, you must think about the memory allocation strategies beforehand to reduce the unavoidable memory fragmentation in automatic reference counting environments and avoid multi-thread stalls in the garbage-collected ones.

I am not even touching the numerous vulnerabilities that a big and popular piece of high-level open-source software would collect over the years. Such projects have so many dependencies that security issues like the recent Log4j one become unavoidable. Elastic suffered from that vulnerability just as much as every other data processing tool written in Java.

The world changes, the hardware changes, the software must change too. Due to the engineering problems above, Elastic isn't vertically scalable by design.

The same applies to theoretical research. If you were to design a text search two decades ago, without deep Theoretical Computer Science (CS) training, you would probably build an "inverted index". Simply speaking, for every word you keep a list of the identifiers of the documents in which that word is present. Once fetched, search results are scored with TF-IDF, an algorithm published in 1972. "Substring Search" is one of the most studied problems in CS. Dozens of efficient algorithms have been published in the last 50 years. Of course, a complex algorithm with the lowest asymptotic curve doesn't always translate into faster code.
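To make the inverted-index idea above concrete, here is a minimal Python sketch. It is not how Lucene or ElasticSearch implement it: the function names `build_inverted_index` and `search`, the whitespace tokenizer and the toy documents are all assumptions made for illustration, and the score is a plain sum of term frequency times a log-scaled inverse document frequency.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map every word to {doc_id: term count} for the documents containing it."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for word, count in Counter(text.lower().split()).items():
            index[word][doc_id] = count
    return index


def search(index, docs_count, query):
    """Fetch the posting list of every query word, then rank by summed TF-IDF."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        if not postings:
            continue  # the word appears in no document
        # Rare words weigh more: log(total documents / documents with the word).
        idf = math.log(docs_count / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy corpus, purely hypothetical.
docs = [
    "the database stores documents",
    "the search engine ranks documents",
    "graphs and time series",
]
index = build_inverted_index(docs)
results = search(index, len(docs), "search documents")
print(results)  # document 1 ranks first, document 0 second
```

The lookup only touches the posting lists of the query words, which is why this structure was the default choice for text search, even if more advanced substring-search algorithms exist.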
Still, it would be nice to have Tech startups competing in their ingenuity and investing in fundamental research or advanced engineering. Not the vibrancy of their dashboards.

The whole situation may not be so dramatic, but it's how I see most of the startups below. Still, there must be a place for one more Snowflake in such a vast market. Who is it going to be?

The New Cohort, the Wannabes

I may have ended the last part with too obvious a question. Everyone knows who will be the next Snowflake - it's Databricks! They go after the same customers. They compete in the same benchmarks, they have both raised mind-boggling sums of money, and they were both founded by a long list of names, most of whom held leadership positions in the most prominent American enterprises. Both are, of course, headquartered in Silicon Valley.

I took the first 20 DBMS startups I could remember that have raised over 50M during their lifetime.

Company           Raised in 2021   Total Raised   Valuation   Total Rounds   Share Raised Last Year
Databricks        2600M            3500M          38B         9              74%
CockroachDB       438M             633M           5B          9              69%
Neo4J             392M             582M           3B          10             67%
Clickhouse        300M             300M           2B          2              100%
Yugabyte          188M             291M           2B          5              64%
Redis             111M             356M           5B          10             31%
TigerGraph        105M             171M           1B          6              61%
Starburst         100M             164M           1B          3              60%
PlanetScale       80M              105M           1B          4              76%
SingleStore       80M              318M           1B          10             25%
Materialize       60M              100M           1B          3              60%
TimescaleDB       40M              71M            1B          5              56%
Datastax          37M              227M           1B          9              16%
TiDB              0M               341M           3B          5              0%
MariaDB           0M               123M           1B          11             0%
InfluxData        0M               120M           1B          5              0%
Greenplum (EMC)   0M               96M            0B          6              0%
OmniSci (MapD)    0M               92M            0B          5              0%
Kinetica          0M               77M            0B          4              0%
SQream            0M               77M            0B          4              0%

All the data was gathered from Crunchbase. The valuations aren't always reported, so I had to make a few guesses here and there, rounding to the closest number of billions. This list may not be exhaustive, but it's impressive, nevertheless. Let's do some arithmetic:

1. There are a total of 20 famous DBMS startups on that list.
2. Together they raised a total of 7.7 Billion throughout their lifetime.
3.
Out of 20, 13 have raised in 2021.
4. Those 13 companies have attracted 6.8 Billion over their lifetime.
5. Out of that 6.8 Billion, 4.5 Billion was attracted in 2021 alone.
6. This year they raised about twice as much as in all previous years combined!

Graphs & Series vs Tables & Docs

DataBricks, Redis, Cockroach Labs and Neo4J seem to be the closest to IPOs, but only Redis has publicly announced the intention. They are expected to go public in May 2022 at a 5B valuation. Neo4J rebranded this year and raised three rounds between June and November, 174M + 152M + 66M, according to Crunchbase.

Investors across all those companies intersect a lot. In alphabetical order: Altimeter, Andreessen Horowitz, Benchmark, Coatue, Index, Insight, Redpoint, Tiger Global. The usual suspects. This year they were mostly interested in the following DBMS sectors:

1. Graphs, like Neo4J, TigerGraph
2. Time-Series, like ClickHouse
3. In-Memory, like Redis

The classics are still king, however. Gartner estimates that the DBMS market will reach 150 Billion by 2026. Tabular (commonly called relational) databases still represent the most significant yet shrinking part of the market. Snowflake, DataBricks, PlanetScale, CockroachDB, TiDB, Yugabyte and Greenplum are all examples of companies that solve the same problem - horizontal scalability of relational DBs. It proved to be a substantial limiting factor of the open-source Postgres and MySQL, so myriads of startups started patching, forking and wrapping them, solving one scalability issue at a time. Greenplum was doing that in 2005. PlanetScale began to do that in 2018. Time goes by, but very few things change.

Figure: DB Rankings

Luckily, non-tabular databases have spiced things up! They are growing so fast that they are stretching the whole market by an average of 17%/year! Of those, Mongo is on top! Couchbase was their rising competitor until it flopped during its IPO this year. It is now valued at 1B, 35x less than Mongo.
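The funding arithmetic from the numbered list earlier in this section can be re-checked with a few lines of Python. This is only a sketch: the numbers are copied from the Crunchbase-based table of 20 startups, in millions of USD, and the variable names are made up for illustration.

```python
# company: (raised in 2021, raised in total), in millions of USD, from the table.
funding = {
    "Databricks": (2600, 3500), "CockroachDB": (438, 633), "Neo4J": (392, 582),
    "Clickhouse": (300, 300), "Yugabyte": (188, 291), "Redis": (111, 356),
    "TigerGraph": (105, 171), "Starburst": (100, 164), "PlanetScale": (80, 105),
    "SingleStore": (80, 318), "Materialize": (60, 100), "TimescaleDB": (40, 71),
    "Datastax": (37, 227), "TiDB": (0, 341), "MariaDB": (0, 123),
    "InfluxData": (0, 120), "Greenplum": (0, 96), "OmniSci": (0, 92),
    "Kinetica": (0, 77), "SQream": (0, 77),
}

# Lifetime total across all 20 startups.
total_all = sum(total for _, total in funding.values())

# Only the startups that closed a round in 2021.
active = {name: pair for name, pair in funding.items() if pair[0] > 0}
total_active = sum(total for _, total in active.values())
raised_2021 = sum(in_2021 for in_2021, _ in active.values())

# What the same companies raised in all the years before 2021.
before_2021 = total_active - raised_2021

print(len(funding), len(active))             # number of startups, active in 2021
print(total_all, total_active, raised_2021)  # millions of USD
print(round(raised_2021 / before_2021, 2))   # 2021 vs. all previous years
```

The last ratio comes out just under 2, which is where the "twice as much as in all previous years combined" claim comes from.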
Out of the big players, MongoDB seems the most promising to us. Their stack is C++. They don't try to adapt SQL, which is excellent! They have a JSON-like query language, which isn't great, but gets most of the work done. Hopefully, it's just a step towards altogether abandoning text-based communication protocols and moving to the bright future of Remote Procedure Calls.

Mongo also made a bunch of notable strategic acquisitions over its history. One of them - Realm, was the most ambitious and auspicious persistent storage engine for mobile. Great for the Mongo developer community and great for the future of building apps!

Most importantly, though, MongoDB acquired WiredTiger. It's an open-source project that implements persistent key-value stores (KVS), an essential piece of software for building a DBMS company. Only a few companies have ever made decent persistent data structures:

1. LevelDB, by Google
2. RocksDB, by Facebook
3. WiredTiger, acquired by MongoDB
4. UnumDB, by Unum

As you can imagine, it's pretty hard to get right if only the titans of the industry try. Everyone who wants to be a trillion-dollar internet company should have a good KVS. Oracle probably doesn't have one. That's why they are worth only 250B.

A considerable part of the startups in the list above are built by composing RocksDB and an open-source query evaluation engine. In comparatively good cases, a startup would choose a piece of a database to optimize and focus on that tiny part, never acknowledging the whole picture. It's less risky, more understandable, and VCs are ready to fund that!

2021 for Unum & UnumDB

On the other side of the globe, we are the upside-down ones! The geeks, the freaks, the nerds, who have been doing weird risky fun things for the past 6 years. The ones who assemble their own liquid-cooled servers to design the fastest software on bare metal!

---------------------------------------------------------------------

This year was huge for us.
Here are some of our 2021 achievements and hopes for 2022:

Figure: nerd-alert

* Our generic GraphBLAS implementation outperformed cuBLAS and CUTLASS on Nvidia Ampere, Turing & Volta GPUs in classical GEMM workloads on some matrix sizes. I guess it is the wet dream of every HPC developer - to design a fast matrix multiplication kernel. We are not implementing the whole BLAS or GraphBLAS specification in any case. We intersect in functionality but mostly try to target very large, sparse, low-precision matrix-to-matrix operations, like the ones needed for training modern Artificial Neural Networks.
* We progressed from fine-tuning pre-trained Transformer Neural Networks, as we have done since 2017, to training them from scratch. It's exceptionally computationally demanding and requires research in numerically-stable optimization algorithms. People seem to favour large-batch methods like LAMB and LARS, but the evidence of their superiority over AdamW isn't always conclusive.
* We increased the fleet of AVX-512-enabled x86 machines and ARM CPUs, which led to even more exciting optimizations in our internal frameworks, mainly compression and search. Still waiting for proper SVE2 hardware support in 2022. Data-centers will receive ginormous upgrades next year, with budgets sometimes exceeding the numbers in this article. We will be upgrading too - here is what everyone will be buying! I can't wait to have 1 PB of low-latency NVMe storage in a 1U chassis.
* Linux is increasingly becoming a bottleneck for High-Performance Computing, especially High-Performance I/O. The ceiling seems to be around 10M operations/second, or about 8x Intel Optane SSDs per storage node. We are fully utilizing io_uring and GPUDirect Storage and actively working on SPDK to bypass the Linux kernel and the file system altogether. The SSD becomes a block-addressable asynchronous device. We decide what to write and at which addresses. This comes at extreme engineering costs but seems wholly justified.
It will further amplify what is already an order of magnitude speed improvement in randomized read operations, critical for analytical BI applications.

We have recently released our 100 GB/node benchmarks and just finished conducting 1 TB/node benchmarks:

Figure: Batch-Read speed of UnumDB, RocksDB, LevelDB and WiredTiger

At this scale, the staggering difference in performance translates into stress tests and benchmarks that can run for days, weeks or even months, similar to training State-of-the-Art neural networks. Even on something as small as 1 Billion entries, WiredTiger would crash, while other DBs would continue. So we had to restart the benchmarks a bunch of times. UnumDB, on the other hand, spent less time finishing the 10 TB benchmark than others spent on the 1 TB. Seems like work worth continuing, right?

---------------------------------------------------------------------

Together with upcoming hardware upgrades, such software would store all the data and run all the analytics of a Fortune 500 company on just a couple of servers, with minimal latency, higher flexibility, tremendously lower costs and higher security than today!

What a time to be an engineer! No matter where you are, in Africa or America, in Europe or Asia, or at the crossroads, like us, we all have equal access to knowledge! Just open up arxiv.org, and let's start hacking, like in the old days! Good luck and much health to all of us in 2022!

If you also work with Terabytes of data and want to try our magical UnumDB, let us know!

Author: Ashot Vardanian
LastMod: 2021-12-31