[HN Gopher] Building a new database management system in academi...
___________________________________________________________________
Building a new database management system in academia (2017)
Author : greghn
Score : 67 points
Date : 2023-06-16 11:50 UTC (11 hours ago)
(HTM) web link (www.cs.cmu.edu)
(TXT) w3m dump (www.cs.cmu.edu)
| yingjunwu wrote:
| Proud to see my name (https://twitter.com/YingjunWu) mentioned in
| Andy's blog. I was a visiting PhD student of Andy's at CMU and was
| the top contributor to Peloton (https://github.com/cmu-db/peloton).
|
| Today, building a database from scratch is extremely difficult,
| for several reasons: 1. it takes a long time regardless; 2. there
| are already so many successful (open-source) databases; 3. hiring
| top engineers is so expensive; 4. you won't get enough attention
| unless your system is drastically better than existing ones.
|
| An interesting observation is that very few databases have been
| built from scratch since 2020 - almost all the newly built
| databases were developed on top of existing databases (PostgreSQL,
| ClickHouse, etc).
|
| I started building RisingWave
| (https://github.com/risingwavelabs/risingwave) in early 2021. The
| only reason we built the system from scratch was that none of the
| existing systems could address the problem we are solving -
| distributed SQL stream processing at cloud scale. We tried Flink
| but gave up, as it's too heavy and its architecture was not
| designed for the cloud environment.
|
| If you want to build a database from scratch, or are simply
| interested in databases, let's talk.
| esjeon wrote:
| For unsuspecting readers: this article talks about the
| _feasibility_ of building a new _practical_ DBMS. The database is
| the most critical piece of software for businesses, so it has
| already been thoroughly explored and researched. It's very
| difficult to find a better solution for existing problems. One
| should either invent a new paradigm or tackle unsolved problems to
| justify the cost of development.
|
| Technology-wise, writing a toy DBMS is not difficult. Even
| undergraduates can do it.
| pcthrowaway wrote:
| Uh... the link to macrobase redirects to pornhub.
|
| Not a good look for people browsing at work
| apavlo wrote:
| Yikes! Thanks for the heads up. Peter left Stanford so I guess
| they took over the domain name :-(
| pcthrowaway wrote:
| Excellent, well done on the quick turnaround.
|
| I hope no one had to have a hard conversation with their boss
| just now.
|
| (for the curious, the link was previously *NSFW*
| http://macrobase dot io *NSFW*)
| pbailis wrote:
| Macrobase PI here - someone squatted on that io domain a
| long time ago while the project was active. Once we moved
| to macrobase.stanford.edu, a fan apparently took interest
| in our old domain. Thanks Andy for updating the link.
| pcthrowaway wrote:
| "A fan" hehe. I wonder if some genius at pornhub
| marketing thought the word had enough innuendo to be
| worth paying the squatter for redirect rights. It's not
| exactly "Macrohard"[ware] or its inverse (who we can
| thank for Github and VS Code)
| onetimeuse92304 wrote:
| It may seem daunting, but I think many people make it more
| complex / difficult than it needs to be.
|
| I have rolled out two transactional databases of my own. In both
| cases I had to provide very specific properties and for some
| reason I could not find an existing product that would meet all
| requirements. For example, one of them ran on an embedded device
| that was very restricted in memory: all operations needed to run
| with hard bounds on time and memory, and the data was stored on a
| flash chip without wear levelling, which required the database
| itself to manage writes to prolong the chip's life.
|
| The key is to notice how your database system is going to be
| different from others and what properties are not essential.
|
| Also, making a general-purpose DBMS tends to be much more complex
| than making a more niche solution, where you know a bit more about
| what the uses are going to be and what kinds of loads you can
| expect.
|
| Creating a custom engine for a given application can be a very
| simple task because you can easily cross out requirements you
| don't care about and you only care that it works well for the
| loads that this particular application can generate.
|
| Also, it is unlikely you are going to beat fierce competition in
| the general-purpose "and a kitchen sink" database management
| system market, but it is much easier to find a niche that is
| underserved and create a usable, competitive product with
| relatively little effort. That's how SQLite started.
| stakhanov wrote:
| This is an announcement from 2017 about "the next five years",
| a period that is now squarely in the past.
|
| Did the DBMS ever come into existence? (If so: link, please). If
| not: Why should we be interested in this announcement in 2023?
| whoevercares wrote:
| DeepDive, one of the systems he listed, became SnorkelAI, which
| has become hot lately
| paddw wrote:
| This is Andy Pavlo, so he probably got sidetracked with
| https://ottertune.com/
|
| Not sure what op's intention with this was
| apavlo wrote:
| Actually, it was a combination of three things:
|
| 1. OtterTune Start-up (https://ottertune.com)
|
| 2. Biological Daughter
| (https://twitter.com/andy_pavlo/status/1187841279260004355)
|
| 3. Pandemic
|
| When the pandemic first started, I had a bunch of CMU
| students reach out to me saying that their summer internships
| were rescinded and that they were looking for a project to
| work on so that they wouldn't have a gap in their CV. I ended
| up taking on _any_ student that could program in C++ even if
| they hadn't taken my DB class before. It was my way of trying
| to help. But our research group grew to about 35 people. That
| was not sustainable and the code quality suffered greatly.
|
| We ended up killing the project and now all our self-driving
| work is done in the context of Postgres
| (https://db.cs.cmu.edu/papers/2023/p27-lim.pdf).
|
| I also now realize that building the DBMS engine first then
| building the query optimizer second is the wrong order. Our
| future project is going to start with the optimizer first.
| lifepillar wrote:
| I see that MVCC is still your preferred way of doing CC,
| and what academic research is mostly focused on. I am
| wondering whether that's an advantage for in-memory
| databases specifically.
|
| I was once discussing MVCC vs 2PL with an experienced
| Sybase and SQL Server guy, and he claimed that, when
| transactions are implemented properly and the database is
| well-designed (no surrogate keys, in particular), 2PL leads
| to better performance and no deadlocks, while "readers do
| not block writers" leads to lots of aborted transactions in
| a heavy OLTP workload. I verified that (I should still have
| the code around): lots of conflicts in PostgreSQL vs smooth
| concurrent execution with no retries in Sybase and SQL
| Server.
|
| I have since heard similar opinions from other SQL Server
| practitioners: they disable MVCC and rely only on good ol'
| 2PL.
| apavlo wrote:
| See our 2014 paper on evaluating CC protocols on in-memory
| systems with high contention / core counts:
|
| https://www.vldb.org/pvldb/vol8/p209-yu.pdf
|
| All the protocols regress to the same performance. This
| evaluation was only with stored procedures though. It would be
| worth doing a similar investigation with conversational DB
| protocols (e.g., JDBC, ODBC).
| stakhanov wrote:
| Awesome, thanks for the quick response.
| erichocean wrote:
| > _Our future project is going to start with the optimizer
| first._
|
| What's your opinion of recent attempts like LingoDB, that
| move the query optimizer into a traditional compiler stack,
| in this case, MLIR?
| apavlo wrote:
| LingoDB is an interesting system. Jana has done great
| work with it. I like projects that take unorthodox
| approaches to old problems.
|
| The problem with (most) query optimizers is that they
| take a one shot approach at optimization. I think an
| optimizer should be built from the ground up to support
| adaptive query optimization. Something similar to
| Berkeley's Eddies project from 20 years ago.
| zinclozenge wrote:
| Do you know if there is anybody taking this approach?
| Alternatively, what would you consider to be the current
| state of the art when it comes to query optimizers?
| zinclozenge wrote:
| There's also mutable, which compiles to WASM and lets it get
| JITed by V8: https://github.com/mutable-org/mutable
| pcthrowaway wrote:
| Oh hey, please fix the link to macrobase; it redirects to
| pornhub
| brazzledazzle wrote:
| Oh wow
| cmrdporcupine wrote:
| I'm curious, wondering if you could explicate why you feel
| starting from the query optimization end is key? I have my
| (amateur) guesses, but would love to hear your expert
| opinion.
| zX41ZdbW wrote:
| I'm doing a similar thing - invite every student who is
| interested, without interviewing or skill tests:
| https://github.com/ClickHouse/ClickHouse/issues/42194
|
| It works if you target a ~10% outcome and have a good CI system
| with decent test coverage and a ton of fuzzing.
| zX41ZdbW wrote:
| Moreover, getting as many people as possible to learn database
| engineering and gain production C++ experience is one of my
| goals with ClickHouse.
| eatonphil wrote:
| If you're interested in the idea of databases built from scratch
| since the time this post was written in 2017 (based on GitHub
| contributions info), here are a few:
|
| - Materialize: 2017
|
| - DuckDB: 2018
|
| - RedPanda: 2019
|
| - TigerBeetle: 2020
| zX41ZdbW wrote:
| According to my estimation, a new database engine is born every
| week - mostly key-value and document databases. Only a small
| subset of them survive after one year. According to a guess by
| Stonebraker, a DBMS takes around 7 years to become mature
| enough for general applications.
| eatonphil wrote:
| > According to my estimation, a new database engine is born
| every week
|
| Fair. I'm talking about databases with funding backing them
| (either by universities or otherwise).
| tlarkworthy wrote:
| and DuckDB came out of academia too and is not based on
| Postgres either (highly relevant and notably absent in the
| author's list of academic DBs at the end of the article)
|
| https://duckdb.org/pdf/SIGMOD2019-demo-duckdb.pdf
|
| EDIT: oh the article is old
| lmwnshn wrote:
| You may be interested in DuckDB's CIDR talk on the little
| miracles that made it possible. [0]
|
| [0] https://twitter.com/motherduck/status/1615487300523429889
| zachmu wrote:
| Dolt started in 2018: https://doltdb.com
|
| Yes we have commit history from 2015 but that's from an earlier
| db project (noms) that we forked and built on top of
| jchrisa wrote:
| I am building a new immutable cryptographically verified
| database using IPLD data structures and prolly trees. This
| allows changes made anywhere to be transparently synced, and
| for operations to be commuted amongst untrusted peers, for
| instance allowing for shared index maintenance.
|
| https://use-fireproof.com/docs/architecture
|
| It's also the easiest way to write React apps. Here are some
| ChatGPT expert builders that I've trained to use the CSS
| framework of your choice with Fireproof:
| https://use-fireproof.com/docs/chatgpt-quick-start/#react-ex...
| otoolep wrote:
| I started rqlite[1] in 2014[2], FWIW. While I didn't build the
| storage engine, or the consensus system, I've built the entire
| "management" part of the RDBMS from scratch. I'm almost 10
| years at it, and there is still plenty to do.
|
| [1] https://www.rqlite.io
|
| [2] https://www.philipotoole.com/9-years-of-open-source-database...
| whoevercares wrote:
| FWIW, many of his recent students have gone to Databricks
| dang wrote:
| Discussed at the time:
|
| _Building a Database System in Academia_ -
| https://news.ycombinator.com/item?id=13931752 - March 2017 (15
| comments)
___________________________________________________________________
(page generated 2023-06-16 23:01 UTC)