[HN Gopher] What goes around comes around and around [pdf]
       ___________________________________________________________________
        
       What goes around comes around and around [pdf]
        
       Author : craigkerstiens
       Score  : 69 points
       Date   : 2024-07-01 15:41 UTC (7 hours ago)
        
 (HTM) web link (db.cs.cmu.edu)
 (TXT) w3m dump (db.cs.cmu.edu)
        
       | bob1029 wrote:
       | This paper is a really good treatment of the space from my
       | perspective.
       | 
        | I think the greatest power of the relational model is its
        | ability to directly represent cyclical dependencies without
        | forcing weird workarounds. Many real-world domains are ambiguous
        | about which types should be strict dependents of which others,
        | and that ambiguity confounds approaches that rely on hierarchical
        | serialization. As mentioned in the paper, many major providers
        | offer extensions to SQL that let you iterate through the graph
        | implied by these relations with a single logical command.
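        | 
        | For instance (a Postgres-flavored sketch with made-up table and
        | column names; neither the paper nor this comment gives one), a
        | recursive CTE walks such a graph in one statement, even when it
        | contains cycles:
        | 
        |     -- a dependency graph that may contain cycles
        |     CREATE TABLE depends_on (
        |         src integer NOT NULL,
        |         dst integer NOT NULL,
        |         PRIMARY KEY (src, dst)
        |     );
        | 
        |     -- everything reachable from node 42, without looping forever
        |     WITH RECURSIVE reachable (id, path) AS (
        |         SELECT 42, ARRAY[42]
        |       UNION ALL
        |         SELECT d.dst, r.path || d.dst
        |         FROM depends_on d
        |         JOIN reachable r ON d.src = r.id
        |         WHERE d.dst <> ALL (r.path)  -- don't revisit a node
        |     )
        |     SELECT DISTINCT id FROM reachable;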
       | 
       | > The impact of AI/ML on DBMSs will be significant
       | 
       | I agree with this but not in the way the authors may have
        | intended. I think the impact will be mostly negative. The amount
        | of energy being spent on black-box query-generator approaches
        | could be better spent elsewhere. A generated query can get
        | extremely close to correct, but close often isn't good enough.
       | 
       | > Do not ignore the out-of-box experience.
       | 
       | This is why everyone says to start with SQLite now.
        
       | simonz05 wrote:
       | The paper is inspired by a hacker news comment:
       | https://x.com/andy_pavlo/status/1807799839616614856
        
         | neonate wrote:
         | This one, apparently:
         | https://news.ycombinator.com/item?id=28736405 - mentioned here
         | https://x.com/andy_pavlo/status/1807799843693396420
        
       | didgetmaster wrote:
        | When I was building an object store years ago, I needed a way to
       | attach metadata tags to each object. The objects themselves could
        | be files like a picture, a document, or some music, and I wanted
       | to allow tags to denote things like the author, the camera, or
       | the music genre.
       | 
       | Most systems use things like file extended attributes or a
        | separate database to store such metadata, but I wanted something
       | different. It needed to be able to attach tags to hundreds of
       | millions of objects and find things that matched certain tags
       | quickly.
       | 
       | I invented a key-value store to hold the metadata and got it
       | working well. When it started to look like a big columnar store
        | with sparsely populated rows, I decided to see if it could handle
        | queries like a relational database. To my surprise, it not only
        | handled them well, it could outperform many of them.
       | 
       | There are data models besides relational that can work extremely
       | well for certain data sets.
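        | 
        | For a sense of the shape (my own relational rendering for
        | illustration, not the actual engine described above): the tags
        | end up as sparse (object, key, value) rows, and tag lookups
        | become self-joins over an index on (key, value).
        | 
        |     CREATE TABLE tags (
        |         object_id bigint NOT NULL,
        |         tag_key   text   NOT NULL,  -- e.g. 'author', 'camera'
        |         tag_value text   NOT NULL,
        |         PRIMARY KEY (object_id, tag_key)
        |     );
        |     CREATE INDEX tags_by_key_value ON tags (tag_key, tag_value);
        | 
        |     -- objects tagged genre=jazz AND author=Mingus
        |     SELECT a.object_id
        |     FROM tags a
        |     JOIN tags b USING (object_id)
        |     WHERE a.tag_key = 'genre'  AND a.tag_value = 'jazz'
        |       AND b.tag_key = 'author' AND b.tag_value = 'Mingus';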
        
         | whartung wrote:
         | I'm not going to have 100's of millions of objects, so I can't
         | speak to that.
         | 
         | But for my one hobby project, I'm using RDF and a triple store.
         | Even with a "small" dataset, you can get an explosion of
         | properties.
         | 
         | I want to be able to add arbitrary properties to arbitrary
         | things and relate them all together. Build the graph
         | organically.
         | 
          | So far, it's working really well. But underneath, it's (likely)
          | just a couple of B+trees doing all of the heavy lifting.
        
           | enord wrote:
            | Most vendors use three indexes for triples and four or six
            | for quads. All the indexes are covering, which is to say the
            | data is stored in triplicate; in other words, the database
            | consists of nothing but indexes.
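            | 
            | In relational terms the layout is roughly this (a sketch of
            | my own; real triple stores use native storage rather than
            | SQL). With the three rotations SPO, POS and OSP, every
            | bound/unbound lookup pattern is answered from one index
            | alone:
            | 
            |     CREATE TABLE triples (
            |         s text NOT NULL,  -- subject
            |         p text NOT NULL,  -- predicate
            |         o text NOT NULL,  -- object
            |         PRIMARY KEY (s, p, o)              -- SPO
            |     );
            |     CREATE INDEX pos ON triples (p, o, s); -- POS
            |     CREATE INDEX osp ON triples (o, s, p); -- OSP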
           | 
            | Ain't that just neat?
        
       | burcs wrote:
        | What an amazing read; here's hoping they'll both be around for
        | the 2044 edition. 101 is not too old to write another research
        | paper, Dr. Stonebraker!
        
       | paulsutter wrote:
        | This paper has a very concise and easier-to-understand definition
        | of Google's MapReduce:
       | 
       | > To a first approximation, MR runs a single query:
       | 
       | > SELECT map() FROM crawl_table GROUP BY reduce()
       | 
        | Or you could read the entire Google MapReduce paper.
        
         | krackers wrote:
         | Isn't the GROUP BY run before the SELECT though, e.g. "SELECT
         | MAX(t) FROM foo GROUP BY t"? I think to do it the way they
         | suggest you'd probably need to create a temp table like
         | 
          |     WITH mapped AS (SELECT map() FROM crawl_table)
          |     SELECT * FROM mapped GROUP BY reduce()
        
           | Sesse__ wrote:
            | Yes. MapReduce's model is basically:
            | 
            |     1. Map (key, value) -> (new_key, tmp_value)
            |     2. Group by new_key
            |     3. Reduce (new_key, all tmp_values for that key)
            |        -> (new_key, new_values)
            | 
            | In that respect, it's not that far from SQL with custom
            | aggregates. I guess the most precise SQL representation
            | would be
            | 
            |     SELECT REDUCE(MAP(t)) FROM foo GROUP BY KEY(MAP(t))
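            | 
            | Concretely (a Postgres-flavored sketch of my own, assuming a
            | docs(body text) table), the canonical word-count job in that
            | shape is just:
            | 
            |     -- map = tokenize, GROUP BY = shuffle, COUNT(*) = reduce
            |     SELECT word, COUNT(*) AS occurrences
            |     FROM docs,
            |          unnest(string_to_array(body, ' ')) AS word
            |     GROUP BY word
            |     ORDER BY occurrences DESC;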
           | 
           | (I've both been on the MapReduce team, and worked on an SQL
           | database. I don't honestly think they're that comparable.)
        
       | paulsutter wrote:
        | Great article; one erratum: ChatGPT does not actually expose its
        | internal embeddings, so the use of embeddings for RAG is optional
        | or even coincidental. You can also use ordinary search like
        | Elasticsearch (a point that's somehow often lost).
        | 
        | Besides, the internal embedding for ChatGPT is per-token (~word),
        | whereas the embedding used for RAG search is per-document (the
        | retrieval document might be small like a paragraph or page, or as
        | large as the whole source document), so these embeddings wouldn't
        | be usable for this purpose anyway.
       | 
        | > One compelling feature of vector DBMSs is that they provide
        | better integration with AI tools (e.g., ChatGPT [16], LangChain
        | [36]) than RDBMSs. These systems natively support transforming
        | a record's data into an embedding upon insertion using these
        | tools and then uses the same transformation to convert a query's
        | input arguments into an embedding to perform the ANN search;
        | other DBMSs require the application to perform these
        | transformations outside of the database.
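        | 
        | For contrast, the "application does the transformation" case the
        | quote mentions looks roughly like this in SQL with the Postgres
        | pgvector extension (my own illustration, not the paper's; the
        | embeddings are computed outside the database and merely stored
        | and searched here):
        | 
        |     CREATE EXTENSION IF NOT EXISTS vector;
        | 
        |     CREATE TABLE chunks (
        |         id        bigserial PRIMARY KEY,
        |         body      text NOT NULL,
        |         embedding vector(3)  -- real models use far more dims
        |     );
        | 
        |     -- optional ANN index so big tables avoid an exact scan
        |     CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
        | 
        |     -- the application computes the embedding and hands it over
        |     INSERT INTO chunks (body, embedding)
        |     VALUES ('some paragraph of a document', '[0.11, -0.27, 0.63]');
        | 
        |     -- nearest chunks to a query embedding, by cosine distance
        |     SELECT id, body
        |     FROM chunks
        |     ORDER BY embedding <=> '[0.09, -0.30, 0.58]'
        |     LIMIT 5;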
        
       | bitwize wrote:
       | The relational model is to data what Lisp is to code: despite
       | attempts to beat it, nothing really can because all those other
       | models are expressible in terms of it (and, usually, can be made
       | very efficient in practice).
       | 
       | RDBMS and Lisp sit near the tao of their respective domains,
       | which is why I advise people to stick with an RDBMS unless they
       | have a really, really, really good reason not to. Or as Nik
       | Suresh put it, "Just use Postgres. You nerd. You dweeb."
        
         | npalli wrote:
          | LOL, you just outed yourself as a smug Lisp weenie. Utter
          | confidence in the absence of any evidence. The obvious glaring
          | difference is that RDBMSs utterly dominate the database space,
          | something Lisp doesn't even come close to.
        
         | redbar0n wrote:
          | If you like Lisp, I presume you would prefer Datalog over SQL,
          | as that is what the Clojure-related database Datomic uses.
          | Datalog is much more elegant and composable than SQL.
        
       | paulsutter wrote:
       | More specifically, blockchains are designed to avoid double-
       | spending in a low-trust environment. If you're not trying to
       | avoid double-spending, OR you're not in a low-trust environment,
        | you probably don't need a blockchain.
       | 
        | > The ideal use case for blockchain databases is peer-to-peer
        | applications where one cannot trust anybody. There is no
        | centralized authority that controls the ordering of updates to
        | the database. Thus, blockchain implementations use a BFT commit
        | protocol to determine which transaction to apply to the database
        | next.
        
       | joatmon-snoo wrote:
       | I don't know how I feel about this paper: on the one hand, I
       | agree with the sentiment that the relational data model is the
       | natural end state if you keep adding features to a data system
       | (and it perfectly captures my sentiment about vector DBs) and
       | it's silly to not use SQL out of the gate.
       | 
       | On the other hand, the paper is kind of dismissive about
       | engineering nuance and gets some details blatantly wrong.
       | 
        | - MapReduce is alive and well, it just has a different name now
        | (for Googlers, that name is Flume). I'm pretty confident that
        | your cloud bill - whether you use GCP, AWS, or Azure - is powered
        | by a couple hundred, if not a couple thousand, jobs like this.
       | 
       | - Pretty sure anyone running in production has a hard serving
       | dependency on Redis or Memcache _somewhere_ in their stack,
       | because even if you're not using it directly, I would bet that
       | one of your cloud service providers uses a distributed, shared-
       | nothing KV cache under the hood.
       | 
       | - The vast majority of software is not backed by a truly
       | serializable ACID database implementation.
       | 
        | -- MySQL's default isolation level has internal consistency
        | violations[1] and its DDL is non-transactional (see the sketch at
        | the end of this comment).
       | 
       | -- The classic transaction example of a "bank transfer" is
       | hilariously mis-representative - ACH is very obviously not
       | implemented using an inter-bank database that supports
       | serializable transactions.
       | 
       | -- A lot of search applications - I would venture to say most -
       | don't need transactional semantics. Do you think Google Search is
       | transactional? Or GitHub code search?
       | 
       | [1]: https://jepsen.io/analyses/mysql-8.0.34
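        | 
        | On the MySQL point above, for concreteness (my own sketch, not
        | from the paper or the Jepsen report):
        | 
        |     -- MySQL/InnoDB defaults to REPEATABLE READ, not SERIALIZABLE
        |     SELECT @@transaction_isolation;  -- 'REPEATABLE-READ'
        | 
        |     -- opting in to serializability is per session/transaction
        |     SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
        | 
        |     -- DDL is still non-transactional: ALTER TABLE and friends
        |     -- cause an implicit commit instead of joining the open
        |     -- transaction.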
        
         | bob1029 wrote:
         | > The classic transaction example of a "bank transfer" is
         | hilariously mis-representative - ACH is very obviously not
         | implemented using an inter-bank database that supports
         | serializable transactions.
         | 
          | This is meant as a pedagogical tool rather than a literal
          | representation of how the system works. The _intra_-bank
          | aspects of ACH absolutely do rely on serializable transactions.
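          | 
          | The textbook form of that example is just the following (table
          | and amounts made up for illustration):
          | 
          |     BEGIN;
          |     UPDATE accounts SET balance = balance - 100 WHERE id = 1;
          |     UPDATE accounts SET balance = balance + 100 WHERE id = 2;
          |     COMMIT;
          | 
          | The whole point is that no other reader ever observes the state
          | between the two UPDATEs, and a failure rolls both back.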
        
       | SoftTalker wrote:
       | In a technology career that started in the early 1990s, one of
       | the constants has been relational databases and SQL. There is no
       | better general-purpose data storage and query architecture, and
       | it's the first (and usually last) thing I consider for almost any
        | new development project that involves storing and retrieving
       | data.
        
       ___________________________________________________________________
       (page generated 2024-07-01 23:01 UTC)