[HN Gopher] Database Gyms [pdf]
___________________________________________________________________
Database Gyms [pdf]
Author : greghn
Score : 89 points
Date : 2023-06-18 14:09 UTC (1 day ago)
(HTM) web link (www.cidrdb.org)
(TXT) w3m dump (www.cidrdb.org)
| lmwnshn wrote:
| [first author here] I'm not sure why this is on the front page.
| Speaking only on my own behalf, I like to think of this as a
| paper motivated by problems that I kept running into while
| re-implementing papers from self-driving database systems
| research [0].
|
| My TLDR would be: existing research has focused on trying to
| develop better models of database system behavior, but look at
| recent trends in modeling. Transformers, foundation models,
| AutoML -- modeling is increasingly "solved", as long as you have
| the right training data. Training data is the bottleneck now. How
| can we optimize the training data collection pipeline? Can we
| engineer training data that generalizes better? What
| opportunities arise when you control the entire pipeline?
|
| Elaborating on that, I think you can abstract existing training
| data collection pipelines into these four modules:
|
| - [Synthesizer]: The field has standardized on the use of various
| synthetic workloads (e.g., TPC-C, TPC-H, DSB) and common workload
| trace formats for real-world workloads (e.g., postgres_log, MySQL
| general query log). Research on workload forecasting and dataset
| scaling exists. In 2023, why can't I say "assuming trends hold,
| show me what my workload and database state will look like 3
| months from now"?
|
| - [Trainer]: Given a workload and state (e.g., from the
| Synthesizer), existing research executes the workload on the
| state to produce training data. But executing workloads in real
| time kind of sucks. Maybe you have a workload trace that's one
| month long; I don't want to wait one month for training data.
| But I can't just smash all the queries together either, because
| that wouldn't be representative of actual deployment conditions.
| So right now, I'm intrigued by the idea of executing workloads
| faster than real time. Think of the fast-forward button on
| physics simulators, where you can reduce simulation fidelity in
| exchange for speed. Can we do that for databases? (There's a toy
| sketch of this time-compression idea right after this list.) I'm
| also interested in playing tricks to help the training data
| generalize across different hardware, and in general, there
| seems to be a lot of unexplored opportunity here. Actively
| working on this!
|
| - [Planner]: Given the training data (e.g., from the Trainer) and
| an objective function (e.g., latency, throughput), you might
| consider a set of tuning actions that improve the objective
| (e.g., build some indexes, change some knob settings). But how
| should you represent these actions? For example, a number of
| papers one-hot encode the possible set of indexes, but (1) you
| cannot actually do this in practice because there are too many
| candidate indexes, and (2) you lose the notion of "distance"
| between your actions (e.g., indexes on the same table should
| probably be considered "related" in some way); see the second
| sketch after this list for a toy contrast. Our research group is
| currently exploring some ideas here.
|
| - [Decider]: Finally, once you're done applying all this domain-
| specific stuff to encode the states and actions, you're solidly
| in the realm of "learning to pick the best action" and can
| probably hand it off to an ML library. Why reinvent the wheel? :P
| That said, you can still do interesting work here (e.g., UDO is
| intelligent about batched action evaluation), but it's not
| something that I'm currently that interested in (relative to the
| other stuff above, which is more uncharted territory).
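|
| A toy sketch of the Trainer's fast-forward idea mentioned above.
| This is just an illustration, not anything from the paper: the
| trace format (timestamp, SQL) and the function names are made up.
| Replaying a trace with the inter-arrival gaps shrunk by a speedup
| factor preserves query order but distorts think times and
| concurrency, which is exactly the fidelity-for-speed trade-off.
|
|     import time
|     from typing import Callable, Iterable, Tuple
|
|     def replay_trace(trace: Iterable[Tuple[float, str]],
|                      execute: Callable[[str], None],
|                      speedup: float = 60.0) -> None:
|         """Replay (arrival_time_sec, sql) pairs in order, shrinking
|         the gaps between them by `speedup`. speedup=1.0 is real
|         time; larger values trade fidelity for wall-clock time."""
|         prev = None
|         for arrival, sql in trace:
|             if prev is not None:
|                 # Sleep only a fraction of the original gap.
|                 time.sleep(max(arrival - prev, 0.0) / speedup)
|             execute(sql)
|             prev = arrival
|
|     # Replay a tiny demo trace ~60x faster than real time,
|     # printing instead of hitting a real DBMS.
|     demo = [(0.0, "SELECT 1"), (30.0, "SELECT 2"), (90.0, "SELECT 3")]
|     replay_trace(demo, execute=print, speedup=60.0)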
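|
| And a toy contrast for the Planner's encoding point. Again, an
| illustration with made-up names (IndexAction, a Jaccard distance
| over columns), not the paper's encoding: one-hot vectors make
| every pair of distinct indexes equally far apart, while even a
| crude feature-based distance lets a model see that two indexes
| on the same table are related.
|
|     from dataclasses import dataclass
|     from typing import Tuple
|
|     @dataclass(frozen=True)
|     class IndexAction:
|         table: str
|         columns: Tuple[str, ...]
|
|     def one_hot_distance(a: IndexAction, b: IndexAction) -> float:
|         # One-hot encoding: any two distinct indexes are equally
|         # far apart, so everything we know about the schema is lost.
|         return 0.0 if a == b else 1.0
|
|     def featurized_distance(a: IndexAction, b: IndexAction) -> float:
|         # Crude feature-based distance: different tables are far
|         # apart; same-table indexes get closer as columns overlap.
|         if a.table != b.table:
|             return 1.0
|         shared = len(set(a.columns) & set(b.columns))
|         total = len(set(a.columns) | set(b.columns))
|         return 1.0 - shared / total  # Jaccard distance over columns
|
|     x = IndexAction("orders", ("o_custkey",))
|     y = IndexAction("orders", ("o_custkey", "o_orderdate"))
|     z = IndexAction("lineitem", ("l_partkey",))
|     print(one_hot_distance(x, y), one_hot_distance(x, z))        # 1.0 1.0
|     print(featurized_distance(x, y), featurized_distance(x, z))  # 0.5 1.0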
|
| If anyone is at SIGMOD this week, I'm happy to chat! :)
|
| [0] https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
| BrentOzar wrote:
| tl;dr - paper by Andy Pavlo's team. Conclusion:
|
| > Most of the previous work in using ML for DBMS automation has
| focused on designing better ML models of DBMS behavior, but
| recent advances in ML have largely automated model design. The
| challenge now is to obtain good training data for building these
| models. This paper outlined the architecture of the database gym,
| an integrated environment that generates training data by using
| the DBMS to simulate itself at the highest possible fidelity.
| zubiaur wrote:
| For context, Andy leads OtterTune, an AI-powered database
| tuner.
| CaveTech wrote:
| I had a feeling this was OtterTune's product from reading the
| abstract, before I noticed the names. Really interesting, and
| I'm excited to see where they take it.
|
| ps. Brent, you helped kickstart my career when I was exposed to
| your early contributions on StackOverflow and
| MetaStackOverflow. They were by far the largest accelerant I
| ever had on the DBMS front. Really cool seeing you around, so I
| wanted to send a friendly thank you.
| apavlo wrote:
| > I had a feeling this was OtterTune's product from reading
| the abstract, before I noticed the names.
|
| No, this project is separate from OtterTune. I keep our CMU
| research strictly firewalled from OtterTune for legal
| reasons.
|
| The Database Gym project rose up from the ashes of the
| NoisePage self-driving DBMS project. See my comment from last
| week about why it failed:
|
| https://news.ycombinator.com/item?id=36355963
| Dopameaner wrote:
| A random aside: thank you truly for your course. You're the
| teacher I wish I'd had in my bachelor's 8 years back. It
| singlehandedly revived my interest in learning databases.
| :)
|
| Through your course, I reached a stage where I can contribute
| features to OSS projects.
| BrentOzar wrote:
| > so I wanted to send a friendly thank you.
|
| Awww, thanks! That's awesome to hear! My big goal is to make
| other people's journeys through data easier.
| [deleted]
___________________________________________________________________
(page generated 2023-06-19 23:00 UTC)