[HN Gopher] Building a new database management system in academi...
       ___________________________________________________________________
        
       Building a new database management system in academia (2017)
        
       Author : greghn
       Score  : 67 points
       Date   : 2023-06-16 11:50 UTC (11 hours ago)
        
 (HTM) web link (www.cs.cmu.edu)
 (TXT) w3m dump (www.cs.cmu.edu)
        
       | yingjunwu wrote:
       | Proud to see my name (https://twitter.com/YingjunWu) mentioned in
       | Andy's blog. I was a visiting PhD student of Andy's at CMU and was
       | the top contributor to Peloton (https://github.com/cmu-db/peloton).
       | 
       | Today, building a database from scratch is extremely difficult,
       | for several reasons: 1. it takes a long time regardless; 2. there
       | are already so many successful (open-source) databases; 3. hiring
       | top engineers is very expensive; 4. you won't get enough attention
       | unless your system is drastically better than existing ones.
       | 
       | An interesting observation is that very few databases have been
       | built from scratch since 2020 - almost all the newly built databases
       | were developed on top of existing ones (PostgreSQL, ClickHouse, etc).
       | 
       | I started building RisingWave
       | (https://github.com/risingwavelabs/risingwave) in early 2021. The
       | only reason we built the system from scratch was that none of the
       | existing systems could address the problem we are solving -
       | distributed SQL stream processing at cloud scale. We tried Flink
       | but gave up, as it's too heavy and its architecture was not
       | designed for the cloud environment.
       | 
       | If you want to build a database from scratch, or are simply
       | interested in databases, let's talk.
        
       | esjeon wrote:
       | For unsuspecting readers: this article talks about the
       | _feasibility_ of building a new _practical_ DBMS. The database is
       | the most critical piece of software for businesses, so it has
       | already been thoroughly explored and researched. It's very difficult to
       | find a better solution for existing problems. One should either
       | invent a new paradigm or tackle unsolved problems to justify the
       | cost of development.
       | 
       | Technology-wise, writing a toy DBMS is not difficult. Even
       | undergraduates can do it.
        
       | pcthrowaway wrote:
       | Uh... the link to macrobase redirects to pornhub.
       | 
       | Not a good look for people browsing at work
        
         | apavlo wrote:
         | Yikes! Thanks for the heads up. Peter left Stanford so I guess
         | they took over the domain name :-(
        
           | pcthrowaway wrote:
           | Excellent, well done on the quick turnaround.
           | 
           | I hope no one had to have a hard conversation with their boss
           | just now.
           | 
           | (for the curious, the link was previously *NSFW*
           | http://macrobase dot io *NSFW*)
        
             | pbailis wrote:
             | Macrobase PI here - someone squatted on that io domain a
             | long time ago while the project was active. Once we moved
             | to macrobase.stanford.edu, a fan apparently took interest
             | in our old domain. Thanks Andy for updating the link.
        
               | pcthrowaway wrote:
               | "A fan" hehe. I wonder if some genius at pornhub
               | marketing thought the word had enough innuendo to be
               | worth paying the squatter for redirect rights. It's not
               | exactly "Macrohard"[ware] or its inverse (who we can
               | thank for Github and VS Code)
        
       | onetimeuse92304 wrote:
       | It may seem daunting, but I think many people make it more
       | complex / difficult than it needs to be.
       | 
       | I have rolled out two transactional databases of my own. In both
       | cases I had to provide very specific properties and for some
       | reason I could not find an existing product that would meet all
       | requirements. For example, one of them was an embedded device
       | that was very restricted in memory, all operations needed to run
       | with hard bounds on time and memory and the storage for the data
       | was a flash chip without wear levelling which required the
       | database itself to manage writes to prolong the chip's life.
       | 
       | The key is to notice how your database system is going to be
       | different from others and what properties are not essential.
       | 
       | Also, building a general-purpose DBMS tends to be much more complex
       | than building a niche solution, where you know more about what the
       | uses are going to be and what kinds of loads you can expect.
       | 
       | Creating a custom engine for a given application can be a very
       | simple task because you can easily cross out requirements you
       | don't care about and you only care that it works well for the
       | loads that this particular application can generate.
       | 
       | Also, you are unlikely to beat the fierce competition in the
       | general-purpose "and a kitchen sink" database management system
       | market, but it is much easier to find a niche that is underserved
       | and create a usable, competitive product with relatively little
       | effort. That's how SQLite started.
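
The flash constraint described in the comment above (a chip without wear levelling, so the database has to spread its own writes) can be illustrated with a small sketch. This is not the commenter's implementation; the class name `WearAwareAllocator` and its methods are hypothetical, and a real engine would also have to persist the erase counters and handle bad blocks.

```python
class WearAwareAllocator:
    """Hypothetical allocator that spreads erases evenly across flash blocks."""

    def __init__(self, erase_counts):
        # erase_counts[i] = how many times block i has been erased so far.
        self.erase_counts = list(erase_counts)

    def next_block(self):
        # Pick the least-worn block; ties go to the lowest block index.
        block = min(range(len(self.erase_counts)),
                    key=lambda i: self.erase_counts[i])
        # Reusing a block costs one erase cycle, so record it.
        self.erase_counts[block] += 1
        return block


# Block 0 is already heavily worn, so new writes avoid it.
alloc = WearAwareAllocator([5, 0, 2])
chosen = [alloc.next_block() for _ in range(4)]
print(chosen, alloc.erase_counts)
```

The point of the sketch is only that the storage layer, not the chip, decides where the next write lands, which is the kind of application-specific requirement a general-purpose DBMS would not cover.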
        
       | stakhanov wrote:
       | This is an announcement from 2017 about "the next five years",
       | which time period is now squarely in the past.
       | 
       | Did the DBMS ever come into existence? (If so: link, please). If
       | not: Why should we be interested in this announcement in 2023?
        
         | whoevercares wrote:
         | DeepDive, one of the projects he listed, became SnorkelAI, which
         | has become hot lately
        
         | paddw wrote:
         | This is Andy Pavlo, so he probably got sidetracked with
         | https://ottertune.com/
         | 
         | Not sure what op's intention with this was
        
           | apavlo wrote:
           | Actually, it was a combination of three things:
           | 
           | 1. OtterTune Start-up (https://ottertune.com)
           | 
           | 2. Biological Daughter
           | (https://twitter.com/andy_pavlo/status/1187841279260004355)
           | 
           | 3. Pandemic
           | 
           | When the pandemic first started, I had a bunch of CMU
           | students reach out to me saying that their summer internships
           | were rescinded and that they were looking for a project to
           | work on so that they wouldn't have a gap in their CV. I ended
           | up taking on _any_ student that could program C++ even if
           | they hadn't taken my DB class before. It was my way of trying
           | to help. But our research group grew to about 35 people. That
           | was not sustainable and the code quality suffered greatly.
           | 
           | We ended up killing the project and now all our self-driving
           | work is done in the context of Postgres
           | (https://db.cs.cmu.edu/papers/2023/p27-lim.pdf).
           | 
           | I also now realize that building the DBMS engine first then
           | building the query optimizer second is the wrong order. Our
           | future project is going to start with the optimizer first.
        
             | lifepillar wrote:
             | I see that MVCC is still your preferred way of doing CC,
             | and what academic research is mostly focused on. I am
             | wondering whether that's an advantage for in-memory
             | databases specifically.
             | 
             | I was once discussing MVCC vs 2PL with an experienced
             | Sybase and SQL Server guy, and he claimed that, when
             | transactions are implemented properly and the database is
             | well-designed (no surrogate keys, in particular), 2PL leads
             | to better performance and no deadlocks, while "readers do
             | not block writers" leads to lots of aborted transactions in
             | a heavy OLTP workload. I verified that (I should still have
             | the code around): lots of conflicts in PostgreSQL vs smooth
             | concurrent execution with no retries in Sybase and SQL
             | Server.
             | 
             | I have since heard similar opinions from other SQL Server
             | practitioners: they disable MVCC and rely only on good ol'
             | 2PL.
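
One common way to get "2PL with no deadlocks", as claimed in the comment above, is to have every transaction acquire its locks in a single global order. The sketch below illustrates that idea only; it is an assumption about what "implemented properly" could mean, not the setup the commenter tested. `Account`, `transfer`, and `worker` are hypothetical names, and the code assumes the two accounts in a transfer are distinct.

```python
import threading


class Account:
    def __init__(self, key, balance):
        self.key = key            # natural key, also used as the lock order
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    # Growing phase: take every lock the transaction needs, always in key
    # order, so two concurrent transfers can never wait on each other in a cycle.
    first, second = sorted((src, dst), key=lambda acct: acct.key)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount
    # Shrinking phase: both locks are released only after all updates are done.


a = Account("a", 100)
b = Account("b", 100)


def worker(src, dst):
    for _ in range(1000):
        transfer(src, dst, 1)


# Two threads transfer in opposite directions, the classic deadlock scenario.
threads = [
    threading.Thread(target=worker, args=(a, b)),
    threading.Thread(target=worker, args=(b, a)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a.balance, b.balance)
```

If `transfer` instead locked `src` before `dst`, the two threads could each hold one lock and wait forever for the other; with ordered acquisition there are no aborts and no retries, only waiting.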
        
               | apavlo wrote:
               | See our 2014 paper on evaluating CC protocols on
               | in-memory systems with high contention / core counts:
               | 
               | https://www.vldb.org/pvldb/vol8/p209-yu.pdf
               | 
               | All the protocols regress to the same performance. This evaluation
               | was only with stored procedures though. It would be worth
               | doing a similar investigation with conversational DB
               | protocols (e.g., JDBC, ODBC).
        
             | stakhanov wrote:
             | Awesome, thanks for the quick response.
        
             | erichocean wrote:
             | > _Our future project is going to start with the optimizer
             | first._
             | 
             | What's your opinion of recent attempts like LingoDB, that
             | move the query optimizer into a traditional compiler stack,
             | in this case, MLIR?
        
               | apavlo wrote:
               | LingoDB is an interesting system. Jana has done great
               | work with it. I like projects that take unorthodox
               | approaches to old problems.
               | 
               | The problem with (most) query optimizers is that they
               | take a one shot approach at optimization. I think an
               | optimizer should be built from the ground up to support
               | adaptive query optimization. Something similar to
               | Berkeley's Eddies project from 20 years ago.
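
To make the contrast with one-shot optimization concrete, here is a toy sketch of adaptive routing: the order of filters is re-decided for every row based on the pass rates observed so far. It is loosely inspired by the idea behind Eddies and is not that project's algorithm; `AdaptiveFilterChain` and the predicate names are hypothetical, and the statistics it keeps are deliberately naive.

```python
class AdaptiveFilterChain:
    """Toy adaptive router: tries the filter with the lowest pass rate first."""

    def __init__(self, predicates):
        # One statistics record per (name, function) predicate.
        self.stats = [
            {"name": name, "fn": fn, "seen": 0, "passed": 0}
            for name, fn in predicates
        ]

    @staticmethod
    def _pass_rate(stat):
        # Filters that have seen no rows yet are treated as passing everything.
        return stat["passed"] / stat["seen"] if stat["seen"] else 1.0

    def accepts(self, row):
        # Re-plan per row: the filter that has rejected the most goes first,
        # so most rows are discarded after a single predicate evaluation.
        for stat in sorted(self.stats, key=self._pass_rate):
            stat["seen"] += 1
            if not stat["fn"](row):
                return False
            stat["passed"] += 1
        return True

    def order(self):
        # Current evaluation order, most selective filter first.
        return [s["name"] for s in sorted(self.stats, key=self._pass_rate)]


chain = AdaptiveFilterChain([
    ("is_even", lambda x: x % 2 == 0),
    ("is_multiple_of_10", lambda x: x % 10 == 0),
])
result = [x for x in range(100) if chain.accepts(x)]
print(result, chain.order())
```

The chain starts with the order it was given and moves the more selective filter to the front after a few rows. The query result does not depend on the order, only the amount of work does, which is what lets the plan change mid-execution.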
        
               | zinclozenge wrote:
               | Do you know if there is anybody taking this approach?
               | Alternatively, what would you consider to be the current
               | state of the art when it comes to query optimizers?
        
               | zinclozenge wrote:
               | There's also mutable, which compiles queries to WASM and lets
               | V8 JIT them: https://github.com/mutable-org/mutable
        
             | pcthrowaway wrote:
             | Oh hey, please fix the link to macrobase; it redirects to
             | pornhub
        
               | brazzledazzle wrote:
               | Oh wow
        
             | cmrdporcupine wrote:
             | I'm curious, wondering if you could explicate why you feel
             | starting from the query optimization end is key? I have my
             | (amateur) guesses, but would love to hear your expert
             | opinion.
        
             | zX41ZdbW wrote:
             | I'm doing a similar thing - invite every student who is
             | interested, without interviewing or skill tests:
             | https://github.com/ClickHouse/ClickHouse/issues/42194
             | 
             | It works if you target a ~10% outcome and have a good CI
             | system with decent test coverage and a ton of fuzzing.
        
             | zX41ZdbW wrote:
             | Moreover, getting as many people as possible to learn
             | database engineering and gain production C++ experience is
             | one of my goals with ClickHouse.
        
       | eatonphil wrote:
       | If you're interested in the idea of databases built from scratch
       | since the time this post was written in 2017 (based on GitHub
       | contributions info), here are a few:
       | 
       | - Materialize: 2017
       | 
       | - DuckDB: 2018
       | 
       | - RedPanda: 2019
       | 
       | - TigerBeetle: 2020
        
         | zX41ZdbW wrote:
         | According to my estimation, a new database engine is born every
         | week - mostly key-value and document databases. Only a small
         | subset of them survive after one year. According to a guess by
         | Stonebraker, a DBMS takes around 7 years to become mature
         | enough for general applications.
        
           | eatonphil wrote:
           | > According to my estimation, a new database engine is born
           | every week
           | 
           | Fair. I'm talking about databases with funding backing them
           | (either by universities or otherwise).
        
         | tlarkworthy wrote:
         | and DuckDB came out of academia too and is not based on
         | Postgres either (highly relevant and notably absent in the
         | author's list of academic DBs at the end of the article)
         | 
         | https://duckdb.org/pdf/SIGMOD2019-demo-duckdb.pdf
         | 
         | EDIT: oh the article is old
        
           | lmwnshn wrote:
           | You may be interested in DuckDB's CIDR talk on the little
           | miracles that made it possible. [0]
           | 
           | [0] https://twitter.com/motherduck/status/1615487300523429889
        
         | zachmu wrote:
         | Dolt started in 2018: https://doltdb.com
         | 
         | Yes we have commit history from 2015 but that's from an earlier
         | db project (noms) that we forked and built on top of
        
         | jchrisa wrote:
         | I am building a new immutable cryptographically verified
         | database using IPLD data structures and prolly trees. This
         | allows changes made anywhere to be transparently synced, and
         | for operations to be commuted amongst untrusted peers, for
         | instance allowing for shared index maintenance.
         | 
         | https://use-fireproof.com/docs/architecture
         | 
         | It's also the easiest way to write React apps. Here are some
         | ChatGPT expert builders that I've trained to use the CSS
         | framework of your choice with Fireproof:
         | https://use-fireproof.com/docs/chatgpt-quick-start/#react-ex...
        
         | otoolep wrote:
         | I started rqlite[1] in 2014[2], FWIW. While I didn't build the
         | storage engine, or the consensus system, I've built the entire
         | "management" part of the RDBMS from scratch. I'm almost 10
         | years at it, and there is still plenty to do.
         | 
         | [1] https://www.rqlite.io
         | 
         | [2] https://www.philipotoole.com/9-years-of-open-source-database...
        
       | whoevercares wrote:
       | FWIW, many of his recent students have gone to Databricks
        
       | dang wrote:
       | Discussed at the time:
       | 
       |  _Building a Database System in Academia_ -
       | https://news.ycombinator.com/item?id=13931752 - March 2017 (15
       | comments)
        
       ___________________________________________________________________
       (page generated 2023-06-16 23:01 UTC)