[HN Gopher] What goes around comes around and around [pdf]
___________________________________________________________________
What goes around comes around and around [pdf]
Author : craigkerstiens
Score : 69 points
Date : 2024-07-01 15:41 UTC (7 hours ago)
(HTM) web link (db.cs.cmu.edu)
(TXT) w3m dump (db.cs.cmu.edu)
| bob1029 wrote:
| This paper is a really good treatment of the space from my
| perspective.
|
| I think the greatest power in the relational model comes from its
| ability to directly represent cyclical dependencies without
| forcing weird workarounds. Many real-world domains have
| ambiguities regarding which types should be strict dependents of
| another. This confounds approaches relying on serialization. As
| mentioned in the paper, many major providers offer extensions to
| SQL which allow you to iterate through the graph implied by these
| relations with a single logical command.
|
| > The impact of AI/ML on DBMSs will be significant
|
| I agree with this but not in the way the authors may have
| intended. I think the impact will be mostly negative. The amount
| of energy being spent on blackbox query generator approaches
| could be better spent elsewhere. You can get extremely close, but
| this often doesn't matter.
|
| > Do not ignore the out-of-box experience.
|
| This is why everyone says to start with SQLite now.
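A minimal sketch of the single-command graph traversal mentioned in the comment above, assuming SQLite's recursive common table expressions are the kind of SQL extension meant. The `edges` table, its contents, and the `reachable` helper are hypothetical and not from the paper. The data contains a cycle (a -> b -> c -> a); `UNION` rather than `UNION ALL` discards rows already seen, which is what lets the recursion terminate.

```python
import sqlite3

# Hypothetical in-memory graph containing a cycle: a -> b -> c -> a, plus c -> d.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")],
)


def reachable(conn, start):
    """Return every node reachable from `start`, including `start` itself."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT ?
            UNION  -- UNION drops duplicate rows, so the cycle cannot loop forever
            SELECT e.dst FROM edges AS e JOIN reach AS r ON e.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]


print(reachable(conn, "a"))
```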
| simonz05 wrote:
| The paper is inspired by a Hacker News comment:
| https://x.com/andy_pavlo/status/1807799839616614856
| neonate wrote:
| This one, apparently:
| https://news.ycombinator.com/item?id=28736405 - mentioned here
| https://x.com/andy_pavlo/status/1807799843693396420
| didgetmaster wrote:
| When I was building an object store years ago, I needed a way to
| attach metadata tags to each object. The objects themselves could
| be files like a picture, a document, or some music; and I wanted
| to allow tags to denote things like the author, the camera, or
| the music genre.
|
| Most systems use things like file extended attributes or a
| separate database to store such metadata; but I wanted something
| different. It needed to be able to attach tags to hundreds of
| millions of objects and find things that matched certain tags
| quickly.
|
| I invented a key-value store to hold the metadata and got it
| working well. When it started to look like a big columnar store
| with sparsely populated rows, I decided to see if it could handle
| queries like a relational database. To my surprise it not only
| did it well, it could outperform many of them.
|
| There are data models besides relational that can work extremely
| well for certain data sets.
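A rough sketch of one way a tag store like the one described above could be laid out. This is an assumption for illustration, not the commenter's actual design: the `TagStore` class and its methods are hypothetical. Each tag name acts as a sparse column (only tagged objects have an entry), and an inverted index maps each (tag, value) pair to the set of matching object ids so lookups avoid scanning every object.

```python
from collections import defaultdict


class TagStore:
    """Hypothetical sparse, column-per-tag metadata store."""

    def __init__(self):
        # One sparse "column" per tag name: object_id -> value.
        self.columns = defaultdict(dict)
        # Inverted index: (tag name, value) -> set of object ids.
        self.inverted = defaultdict(set)

    def tag(self, object_id, name, value):
        column = self.columns[name]
        if object_id in column:
            # Re-tagging: remove the stale inverted-index entry first.
            self.inverted[(name, column[object_id])].discard(object_id)
        column[object_id] = value
        self.inverted[(name, value)].add(object_id)

    def find(self, **criteria):
        """Return ids of objects whose tags match every name=value pair."""
        sets = [self.inverted.get(pair, set()) for pair in criteria.items()]
        return set.intersection(*sets) if sets else set()


store = TagStore()
store.tag(1, "author", "ann")
store.tag(1, "genre", "jazz")
store.tag(2, "author", "ann")
print(store.find(author="ann", genre="jazz"))
```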
| whartung wrote:
| I'm not going to have hundreds of millions of objects, so I can't
| speak to that.
|
| But for my one hobby project, I'm using RDF and a triple store.
| Even with a "small" dataset, you can get an explosion of
| properties.
|
| I want to be able to add arbitrary properties to arbitrary
| things and relate them all together. Build the graph
| organically.
|
| So far, it's working really well. But underneath, it's (likely)
| just a couple of B+trees doing all of the heavy lifting.
| enord wrote:
| Most vendors use three indexes for triples and 4 or 6 for
| quads. All the indexes are covering, which is to say they
| triplicate all data---in other words the database consists
| only of indexes.
|
| Ain't that just neat?
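A minimal sketch of the "database consists only of indexes" idea from the comment above. Sorted Python lists stand in for B+trees, and the choice of SPO, POS, and OSP as the three orderings is an assumption, as are the `TripleStore` class and its methods. Each index holds every triple in full, so any lookup is a prefix scan on one of them.

```python
import bisect

# Three key orderings over (subject, predicate, object) positions.
ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


class TripleStore:
    """Hypothetical triple store made of three covering indexes."""

    def __init__(self):
        # Each index is a sorted list of fully permuted triples.
        self.indexes = {name: [] for name in ORDERS}

    def add(self, s, p, o):
        triple = (s, p, o)
        for name, order in ORDERS.items():
            key = tuple(triple[i] for i in order)
            index = self.indexes[name]
            pos = bisect.bisect_left(index, key)
            if pos == len(index) or index[pos] != key:
                index.insert(pos, key)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the bound terms; None is a wildcard."""
        pattern = (s, p, o)

        def bound_prefix(order):
            count = 0
            for i in order:
                if pattern[i] is None:
                    break
                count += 1
            return count

        # With these three rotations, every combination of bound terms is a
        # prefix of some index, so a prefix scan needs no post-filtering.
        name = max(ORDERS, key=lambda n: bound_prefix(ORDERS[n]))
        order = ORDERS[name]
        n = bound_prefix(order)
        prefix = tuple(pattern[i] for i in order[:n])
        index = self.indexes[name]
        results = []
        for key in index[bisect.bisect_left(index, prefix):]:
            if key[:n] != prefix:
                break
            triple = [None, None, None]
            for slot, i in enumerate(order):
                triple[i] = key[slot]  # undo the permutation
            results.append(tuple(triple))
        return results


store = TripleStore()
store.add("alice", "knows", "bob")
store.add("alice", "likes", "jazz")
store.add("carol", "knows", "bob")
print(store.match(p="knows", o="bob"))
```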
| burcs wrote:
| What an amazing read, here's hoping they'll both be around for
| the 2044 edition. 101 is not too old to write another research
| paper Dr. Stonebraker!
| paulsutter wrote:
| This paper has a very concise and easier-to-understand definition
| of Google's Mapreduce:
|
| > To a first approximation, MR runs a single query:
|
| > SELECT map() FROM crawl_table GROUP BY reduce()
|
| Or you could read the entire Google MapReduce paper.
| krackers wrote:
| Isn't the GROUP BY run before the SELECT though, e.g. "SELECT
| MAX(t) FROM foo GROUP BY t"? I think to do it the way they
| suggest you'd probably need to create a temp table like
|
|     WITH mapped AS (SELECT map() FROM crawl_table)
|     SELECT * FROM mapped GROUP BY reduce()
| Sesse__ wrote:
| Yes. MapReduce's model is basically:
|
|   1. Map (key, value) -> (new_key, tmp_value)
|   2. Group by new_key
|   3. Reduce (new_key, all tmp_values for that key)
|      -> (new_key, new_values)
|
| In that respect, it's not that far from SQL with custom
| aggregates. I guess the most precise SQL representation would
| be
|
|     SELECT REDUCE(MAP(t)) FROM foo GROUP BY KEY(MAP(t))
|
| (I've both been on the MapReduce team, and worked on an SQL
| database. I don't honestly think they're that comparable.)
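The three steps listed in the comment above (map, group by new key, reduce) can be sketched on a single machine as follows. This is a toy illustration only: `map_reduce`, the word-count functions, and the sample documents are hypothetical, and nothing here reflects how Google's distributed implementation works.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Single-process sketch of the map -> group -> reduce pipeline."""
    groups = defaultdict(list)
    # 1. Map: each (key, value) yields zero or more (new_key, tmp_value) pairs.
    for key, value in records:
        for new_key, tmp_value in map_fn(key, value):
            # 2. Group by new_key.
            groups[new_key].append(tmp_value)
    # 3. Reduce: fold all tmp_values for each new_key into a result.
    return {new_key: reduce_fn(new_key, values) for new_key, values in groups.items()}


def word_count_map(doc_id, text):
    for word in text.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    return sum(counts)


docs = [("d1", "the quick fox"), ("d2", "the lazy dog the end")]
counts = map_reduce(docs, word_count_map, word_count_reduce)
print(counts["the"])
```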
| paulsutter wrote:
| Great article, one erratum: actually ChatGPT does not
| expose its internal embedding, so the use of embeddings for RAG
| is just optional or even coincidental. You can also use ordinary
| search like Elasticsearch (a point that's somehow often lost).
|
| Besides, the internal embedding for ChatGPT is per-token (~word),
| whereas the embedding used for RAG search is per-document
| (retrieval document might be small like a paragraph or page, or
| could be as large as the whole source document), so these
| wouldn't be usable for this purpose anyway
|
| > One compelling feature of vector DBMSs is that they provide
| better integration with AI tools (e.g., ChatGPT [16], LangChain
| [36]) than RDBMSs. These systems natively support transforming
| a record's data into an embedding upon insertion using these
| tools and then uses the same transformation to convert a query's
| input arguments into an embedding to perform the ANN search;
| other DBMSs require the application to perform these
| transformations outside of the database.
| bitwize wrote:
| The relational model is to data what Lisp is to code: despite
| attempts to beat it, nothing really can because all those other
| models are expressible in terms of it (and, usually, can be made
| very efficient in practice).
|
| RDBMS and Lisp sit near the tao of their respective domains,
| which is why I advise people to stick with an RDBMS unless they
| have a really, really, really good reason not to. Or as Nik
| Suresh put it, "Just use Postgres. You nerd. You dweeb."
| npalli wrote:
| LOL, you just outed yourself as a smug Lisp weenie. Utter
| confidence in the absence of any evidence. The obvious glaring
| difference is that RDBMSs utterly dominate the database space,
| something Lisp doesn't even come close to.
| redbar0n wrote:
| If you like Lisp I presume you would prefer Datalog over SQL,
| as that is used in the Clojure-related database Datomic.
| Datalog is much more elegant and composable than SQL.
| paulsutter wrote:
| More specifically, blockchains are designed to avoid
| double-spending in a low-trust environment. If you're not trying
| to avoid double-spending, OR you're not in a low-trust
| environment, you probably don't need a blockchain.
|
| > The ideal use case for blockchain databases is peer-to-peer
| applications where one cannot trust anybody. There is no
| centralized authority that controls the ordering of updates to
| the database. Thus, blockchain implementations use a BFT commit
| protocol to determine which transaction to apply to the database
| next.
| joatmon-snoo wrote:
| I don't know how I feel about this paper: on the one hand, I
| agree with the sentiment that the relational data model is the
| natural end state if you keep adding features to a data system
| (and it perfectly captures my sentiment about vector DBs) and
| it's silly to not use SQL out of the gate.
|
| On the other hand, the paper is kind of dismissive about
| engineering nuance and gets some details blatantly wrong.
|
| - MapReduce is alive and well, it just has a different name now
| (for Googlers, that name is Flume). I'm pretty confident that
| your cloud bill, whether you use GCP, AWS, or Azure, is powered
| by a couple hundred, if not a couple thousand, jobs like this.
|
| - Pretty sure anyone running in production has a hard serving
| dependency on Redis or Memcache _somewhere_ in their stack,
| because even if you're not using it directly, I would bet that
| one of your cloud service providers uses a distributed,
| shared-nothing KV cache under the hood.
|
| - The vast majority of software is not backed by a truly
| serializable ACID database implementation.
|
| -- MySQL's default isolation level has internal consistency
| violations[1] and its DDL is non-transactional.
|
| -- The classic transaction example of a "bank transfer" is
| hilariously mis-representative - ACH is very obviously not
| implemented using an inter-bank database that supports
| serializable transactions.
|
| -- A lot of search applications - I would venture to say most -
| don't need transactional semantics. Do you think Google Search is
| transactional? Or GitHub code search?
|
| [1]: https://jepsen.io/analyses/mysql-8.0.34
| bob1029 wrote:
| > The classic transaction example of a "bank transfer" is
| hilariously mis-representative - ACH is very obviously not
| implemented using an inter-bank database that supports
| serializable transactions.
|
| This is meant more as a pedagogical tool than a literal
| representation of how the system works. The _intra_-bank
| aspects of ACH absolutely do rely on serializable transactions.
| SoftTalker wrote:
| In a technology career that started in the early 1990s, one of
| the constants has been relational databases and SQL. There is no
| better general-purpose data storage and query architecture, and
| it's the first (and usually last) thing I consider for almost any
| new development project that involves storing and retrieving
| data.
___________________________________________________________________
(page generated 2024-07-01 23:01 UTC)