[HN Gopher] Database Gyms [pdf]
       ___________________________________________________________________
        
       Database Gyms [pdf]
        
       Author : greghn
       Score  : 89 points
       Date   : 2023-06-18 14:09 UTC (1 day ago)
        
 (HTM) web link (www.cidrdb.org)
 (TXT) w3m dump (www.cidrdb.org)
        
       | lmwnshn wrote:
       | [first author here] I'm not sure why this is on the front page.
       | Speaking only on my own behalf, I like to think of this as a
       | paper that's motivated by problems that I kept running into while
       | re-implementing papers from research on self-driving database
       | systems [0].
       | 
       | My TLDR would be: existing research has focused on trying to
       | develop better models of database system behavior, but look at
       | recent trends in modeling. Transformers, foundation models,
       | AutoML -- modeling is increasingly "solved", as long as you have
       | the right training data. Training data is the bottleneck now. How
       | can we optimize the training data collection pipeline? Can we
       | engineer training data that generalizes better? What
       | opportunities arise when you control the entire pipeline?
       | 
       | Elaborating on that, I think you can abstract existing training
       | data collection pipelines into these four modules:
       | 
       | - [Synthesizer]: The field has standardized on the use of various
       | synthetic workloads (e.g., TPC-C, TPC-H, DSB) and common workload
       | trace formats for real-world workloads (e.g., postgres_log, MySQL
       | general query log). Research on workload forecasting and dataset
       | scaling exists. In 2023, why can't I say "assuming trends hold,
       | show me what my workload and database state will look like 3
       | months from now"?
       | 
       | - [Trainer]: Given a workload and state (e.g., from the
       | Synthesizer), existing research executes the workload on the
       | state to produce training data. But executing workloads in real
       | time kind of sucks. Say you have a workload trace that's one
       | month long: I don't want to wait one month for training data.
       | But I can't just smash all the queries together either, because
       | that wouldn't be representative of actual deployment conditions.
       | So right now, I'm intrigued by the idea of executing workloads
       | faster than real time. Think of a fast-forward button on physics
       | simulators, where you can reduce simulation fidelity in exchange
       | for speed. Can we do that for databases? I'm also interested in
       | playing tricks to help the training data generalize across
       | different hardware, and in general, there seems to be a lot of
       | unexplored opportunity here. Actively working on this!
       | 
       | - [Planner]: Given the training data (e.g., from the Trainer) and
       | an objective function (e.g., latency, throughput), you might
       | consider a set of tuning actions that improve the objective
       | (e.g., build some indexes, change some knob settings). But how
       | should you represent these actions? For example, a number of
       | papers one-hot encode the set of possible indexes, but (1) you
       | cannot actually do this in practice because there are too many
       | possible indexes, and (2) you lose the notion of "distance"
       | between your actions
       | (e.g., indexes on the same table should probably be considered
       | "related" in some way). Our research group is currently exploring
       | some ideas here.
       | 
       | - [Decider]: Finally, once you're done applying all this domain-
       | specific stuff to encode the states and actions, you're solidly
       | in the realm of "learning to pick the best action" and can
       | probably hand it off to an ML library. Why reinvent the wheel? :P
       | That said, you can still do interesting work here (e.g., UDO is
       | intelligent about batched action evaluation), but it's not
       | something that I'm currently that interested in (relative to the
       | other stuff above, which is more uncharted territory).
       | 
       | If anyone is at SIGMOD this week, I'm happy to chat! :)
       | 
       | [0] https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf
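
The [Planner] point in the comment above (one-hot encoding loses any notion of "distance" between index actions, and the action set is too large to enumerate) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Index` type, the example table and column names, and the `structured_distance` metric are hypothetical and are not taken from the paper or from the commenter's research group.

```python
from dataclasses import dataclass
from math import perm, sqrt
from typing import Tuple


@dataclass(frozen=True)
class Index:
    """Hypothetical tuning action: build an index on `table` over `columns`."""
    table: str
    columns: Tuple[str, ...]  # ordered key columns


def candidate_count(num_columns: int, max_width: int) -> int:
    """Count ordered column lists of width 1..max_width on a single table.

    This is why one-hot encoding does not scale: the action vector needs
    one slot per candidate index.
    """
    return sum(perm(num_columns, k) for k in range(1, max_width + 1))


def one_hot_distance(a: Index, b: Index) -> float:
    """Euclidean distance between one-hot vectors: all distinct pairs tie."""
    return 0.0 if a == b else sqrt(2.0)


def structured_distance(a: Index, b: Index) -> float:
    """Illustrative metric (an assumption, not a published method).

    Indexes on different tables are maximally far apart (1.0). Indexes on
    the same table are at most 0.5 apart, and closer when they share
    columns. Column order is ignored to keep the sketch short.
    """
    if a.table != b.table:
        return 1.0
    cols_a, cols_b = set(a.columns), set(b.columns)
    jaccard = len(cols_a & cols_b) / len(cols_a | cols_b)
    return 0.5 * (1.0 - jaccard)
```

With a one-hot encoding, an index on `orders(customer_id)` is exactly as far from `orders(customer_id, order_date)` as it is from an index on an unrelated table. The structured metric places the two `orders` indexes closer together, which is the kind of relatedness the comment says is lost.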
        
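The [Trainer] point in lmwnshn's comment (replaying a month-long trace in real time is too slow, but issuing every query back to back is unrepresentative) suggests a middle ground. One naive reading, sketched here purely as an assumption, is to shrink the gaps between query arrivals by a constant factor so that ordering and relative burstiness survive. The `TraceEntry` shape and the `compress_trace` helper are hypothetical; the comment does not say how the authors plan to run workloads faster than real time.

```python
from typing import List, Tuple

# Hypothetical trace entry: (arrival time in seconds, SQL text).
TraceEntry = Tuple[float, str]


def compress_trace(trace: List[TraceEntry], speedup: float) -> List[TraceEntry]:
    """Rescale arrival times so the trace replays `speedup` times faster.

    Arrival order and the ratios between inter-arrival gaps are preserved;
    only the absolute spacing shrinks. Timestamps are rebased so the first
    query arrives at time 0.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not trace:
        return []
    ordered = sorted(trace, key=lambda entry: entry[0])
    start = ordered[0][0]
    return [((arrived - start) / speedup, sql) for arrived, sql in ordered]
```

Note the fidelity cost: compressing arrivals also raises the effective request rate, so contention and queueing will differ from the original deployment. That is the kind of speed-versus-fidelity trade-off the comment compares to a fast-forward button on a physics simulator.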
       | BrentOzar wrote:
       | tl;dr - paper by Andy Pavlo's team. Conclusion:
       | 
       | > Most of the previous work in using ML for DBMS automation has
       | focused on designing better ML models of DBMS behavior, but
       | recent advances in ML have largely automated model design. The
       | challenge now is to obtain good training data for building these
       | models. This paper outlined the architecture of the database gym,
       | an integrated environment that generates training data by using
       | the DBMS to simulate itself at the highest possible fidelity.
        
         | zubiaur wrote:
         | For context, Andy leads OtterTune, an AI-powered database
         | tuner.
        
         | CaveTech wrote:
         | I had a feeling this was OtterTune's product through a reading
         | of the abstract before I noticed the names. Really interesting
         | and excited to see where they take it.
         | 
         | ps. Brent you helped kickstart my career when I was exposed to
         | your early contributions on StackOverflow and
         | MetaStackOverflow. By far the largest accelerant I ever had on
         | the DBMS front, really cool seeing you around so I wanted to
         | send a friendly thank you.
        
           | apavlo wrote:
           | > I had a feeling this was OtterTune's product through a
           | reading of the abstract before I noticed the names.
           | 
           | No, this project is separate from OtterTune. I keep our CMU
           | research strictly firewalled from OtterTune for legal
           | reasons.
           | 
           | The Database Gym project rose up from the ashes of the
           | NoisePage self-driving DBMS project. See my comment from last
           | week about why it failed:
           | 
           | https://news.ycombinator.com/item?id=36355963
        
             | Dopameaner wrote:
             | A random aside: I truly thank you for your course. You're
             | the teacher I wish I had in my bachelor's 8 years back.
             | Singlehandedly revived my interest in learning databases.
             | :).
             | 
             | Through your course, I reached a stage where I can contribute
             | to OSS projects and their features.
        
           | BrentOzar wrote:
           | > so I wanted to send a friendly thank you.
           | 
           | Awww, thanks! That's awesome to hear! My big goal is to make
           | other people's journeys through data easier.
        
       ___________________________________________________________________
       (page generated 2023-06-19 23:00 UTC)