[HN Gopher] Things to know about databases
___________________________________________________________________
Things to know about databases
Author : grech
Score : 449 points
Date : 2022-06-27 15:08 UTC (7 hours ago)
(HTM) web link (architecturenotes.co)
(TXT) w3m dump (architecturenotes.co)
| thedougd wrote:
| I have to plug the "Designing Data-Intensive Applications" book.
| It dives deep into the inner workings of various database
| architectures.
|
| https://dataintensive.net/
| SulphurSmell wrote:
| This article is informative. I have found that databases in
| general tend to be less sexy than the front-end apps...especially
| with the recent cohort of devs. As an old bastard, I would pass
| on one thing: Realize that any reasonably used database will
| likely outlast the applications leveraging it. This is especially
| true the bigger it gets, and the longer it stays in production.
| That said, if you are influencing the design of a database,
| imagine what someone looking at it years later will want to
| know when they have to rip all the data out into some other
| store. Having
| migrated many legacy systems, I tend to sleep better when I know
| the data is well-structured and easy to normalize. In those
| cases, I really don't care so much about the apps. If I can sort
| out (haha) the data, I worry less about the new apps I need to
| design. I have been known to bury documentation in
| for-purpose tables...that way I know that info won't be
| lost. Export the
| schema regularly, version it, check it in somewhere. And, if you
| can, _please_, limit the use of anything that can hold a NULL.
| Not every RDBMS handles NULL the same way. Big old databases live
| a looooong time.
| emerongi wrote:
| > Realize that any reasonably used database will likely outlast
| the applications leveraging it.
|
| I love this statement. It's true too, having seen a decades-old
| database that needed to be converted to Postgres. The old
| application was going to be thrown away, but the data was still
| relevant :).
| Yhippa wrote:
| I think this is and will continue to be a common use case.
| Working with these applications, I'm very thankful the data
| was still stuck in a crusty old relational database for me
| to build on top of as I wrote a new application.
|
| It's going to be interesting when this same problem occurs
| years from now, when people are trying to reverse-engineer
| schemas from NoSQL databases, or when that data proves
| difficult to extract.
|
| The only sticking point is when business logic is put into
| stored procedures. On one hand if you're building an app on
| top of it, there's a temptation to extract and optimize that
| logic in your new back-end. On the other hand, it is kind of
| nice to even have it at all should the legacy app go _poof_.
| SulphurSmell wrote:
| >business logic is put into stored procedures
|
| This is a double-edged sword. I have seen _massive_
| business logic baked into stored procedures...so much so,
| that the applications themselves are rather slim. If this
| stored procedure code is properly versioned and otherwise
| managed, this is not entirely bad. If the data model is
| sound, I don't worry that much...stored procs vs 100KLOC of
| Java code? I can tell you which is easier to migrate. The
| other side of it is that stored procedures can also serve
| as an API into the actual database. I built a system once
| (90's) where the developers never accessed the actual
| database tables directly. They always called stored
| procedures. The advantage here was that we could tune the
| snot out of the database without a code
| change/build/deploy. It also allowed the DBA some control
| over poorly written, runaway queries. YMMV. I think today I
| probably would try to keep the database as vanilla as
| possible.
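A rough app-side sketch of the pattern described above, using Python and sqlite3 purely for illustration (sqlite has no stored procedures, and the table and function names here are hypothetical). All access goes through a couple of functions, so the SQL can be tuned in one place without a caller code change:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")

def open_account(initial_balance):
    # The "procedure": callers never write INSERT themselves, so the
    # statement can be rewritten or tuned without touching callers.
    cur = conn.execute("INSERT INTO accounts (balance) VALUES (?)",
                       (initial_balance,))
    return cur.lastrowid

def get_balance(account_id):
    row = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                       (account_id,)).fetchone()
    return row[0] if row else None

acct = open_account(50.0)
assert get_balance(acct) == 50.0
```

In the system described above these functions would live in the database itself, giving the DBA the same single point of control.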
| richardw wrote:
| Also an old timer. I've gone from mostly complex app and
| no DB logic to mostly heavily optimised DB and lots of
| procs. They protect the DB, make carefully crafted
| interfaces available and allow changes to the
| implementation.
|
| Except for, e.g., infinitely scalable cloud data stores like
| Google's, or for ML where it's just massive data and you
| need a dumb-ish store of many GB of Parquet.
| SulphurSmell wrote:
| >protect the DB
|
| I share this sentiment. The apps will come and go, the
| real value is in the data. If the database can cover its
| ass, I am less concerned about ham-fisted developers
| randomly messing it up with ill-conceived DML. It's not
| that they are malicious...it just happens. I have seen
| devs that code in Erlang, Haskell and even Assembly...run
| in terror at doing some SQL. It's weird. Trust but
| verify. And hire persnickety, passionate DBAs.
| evilduck wrote:
| About a decade ago I worked for an insurance company. It was
| an offshoot spun out of another insurance company from
| another state, which itself was decades old. As best as
| I could infer from my vantage point, my expertise at the
| time, and the spare time I was willing to investigate the
| matter, the database schema and a good chunk of the core data
| tables were first created in the late-80s on a mainframe and
| had outlived 4 or 5 application rewrites and (at least) two
| SQL variant migrations. I'm hand-waving exact details because
| _nobody_ from the original company or that time period was
| still around even prior to the corporate split and so there
| was nobody who could answer history questions in detail, but
| that's also a testament to how persistent data can be. There
| was one developer from the parent company, kept on with
| golden handcuffs, who knew where most of the bodies were
| hidden in that software stack, which enabled decent
| productivity, but even she was missing a solid 15 years of
| first-hand experience going back to its inception. To the
| best of my knowledge that
| database is still in use today.
|
| Databases in heavy use will not just outlast your
| application, they have a strong chance of outlasting your
| career and they very well may outlast you as a person.
| SulphurSmell wrote:
| Are you me? LOL
| YetAnotherNick wrote:
| > I have found that databases in general tend to be less sexy
| than the front-end apps
|
| I don't know if there is a single soul who believes this. If
| you are designing a database, it is much cooler than
| building front-end apps.
| SulphurSmell wrote:
| I think they are wonderful (from the Codd and Date days...)
| but mostly everyone else disagrees.
| mmcnl wrote:
| "Show me your flowchart and conceal your tables, and I shall
| continue to be mystified. Show me your tables, and I won't
| usually need your flowchart; it'll be obvious." -- Fred Brooks,
| The Mythical Man Month (1975)
| SulphurSmell wrote:
| This is going on my wall. Thanks so much.
| irrational wrote:
| The NULL issue is so true. We migrated a large database from
| Oracle to Postgres. It took 2 years. Far and away the
| biggest issue was rewriting queries to account for the
| (correct) way Postgres handles NULLs versus how Oracle does it.
|
| Also, in my experience, the database is almost always the main
| cause of any performance issues. I would much rather hire
| someone who is very good at making the database perform well
| than making the front end perform well. If you are seeking to
| be a full stack developer, devote much more time to the
| database layer than anything else.
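A minimal sqlite3 illustration of the three-valued NULL logic that tends to bite during such migrations (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), (None,)])

# NULL = NULL evaluates to NULL, not TRUE: no rows come back.
assert conn.execute("SELECT * FROM t WHERE name = NULL").fetchall() == []

# Three-valued logic also silently drops NULL rows from <> filters.
assert conn.execute(
    "SELECT count(*) FROM t WHERE name <> 'b'").fetchone()[0] == 1

# IS NULL is the only portable way to test for NULL.
assert conn.execute(
    "SELECT count(*) FROM t WHERE name IS NULL").fetchone()[0] == 1
```

Oracle adds its own wrinkle by treating the empty string as NULL, which is one reason queries need rewriting when moving to Postgres.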
| SulphurSmell wrote:
| >the database is almost always the main cause of any
| performance issues
|
| I would be careful with the term "cause". There is a
| symbiotic relationship between the application and the
| database. Or, if talking to a DBA...a database and its
| applications. Most databases can store any sets of arbitrary
| information...but _how_ they are stored (read: structure)
| must take into account how the data is to be used. When the
| database designer is told this up-front (by the app dev
| team), considerations can be made to optimize performance along
| whatever vector is most desired (e.g. read speed, write
| speed, consistency, concurrency, etc). Most database
| performance issues result when these considerations are left
| out. Related: Just because a query works (i.e. returns the
| right data) does not mean it's the _best_ query.
| hodgesrm wrote:
| The article left out one of the most fundamental topics in
| databases--how data is clustered in storage is everything.
| Examples:
|
| 1. If you store data in rows, it's quite fast to
| insert/update/delete individual rows. Moreover, it's easy to
| do so concurrently. However, reads can be very slow: to scan
| a single column you have to read the entire table. That's
| why OLAP databases use column storage.
|
| 2. If you keep table data sorted on insert, reading ranges
| based on the sort key(s) is very fast. On the other hand,
| inserts may spray data over the entire table, (eventually)
| forcing writes to all blocks, which is very slow. That's why
| many OLTP databases use heap (unsorted) row organization.
|
| In small databases you don't notice the differences, but they
| become dominant as volume increases. I believe this fact alone
| explains a lot of the proliferation of DBMS types as enterprise
| datasets have grown larger.
|
| Edit: minor clarification
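Point 1 can be sketched in a few lines of Python: with a row layout, summing one column walks every field of every row, while a column layout only touches the one contiguous list. This is a toy model of the access pattern, not a storage engine:

```python
# Ten thousand rows of (id, name, score), in two layouts.
rows = [(i, "user%d" % i, i % 100) for i in range(10_000)]
cols = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "score": [r[2] for r in rows],
}

# Row store: summing "score" visits every field of every row.
row_sum = sum(r[2] for r in rows)

# Column store: only the contiguous "score" values are touched.
col_sum = sum(cols["score"])

assert row_sum == col_sum
```

On disk the difference is I/O rather than Python iteration, but the shape of the cost is the same.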
| beckingz wrote:
| I heard about Flywaydb today, which appears to be an open
| source database versioning tool. Pretty interesting!
| https://flywaydb.org/
| vladsanchez wrote:
| Pretty open-source, until you need "premium" features like
| "rollback" :/ (headwall)
| mjb wrote:
| Introductory material is always welcome, but I suspect this isn't
| going to hit the target for most people. For example:
|
| > Therefore, if the price isn't an issue, SSDs are a better
| option -- especially since modern SSDs are just about as reliable
| as HDDs
|
| This needs a tiny extra bit of detail: if you're buying random IO
| (IOPS) or throughput (MB/s), SSDs are significantly (orders of
| magnitude!) cheaper than HDDs. HDDs are only cheaper on space,
| and only if your need for throughput or IO doesn't cause you to
| "strand" space.
|
| > Consistency can be understood after a successful write, update,
| or delete of a row. Any read request immediately receives the
| latest value of the row.
|
| This isn't the ACID definition of C, and is closer to the
| distributed systems (CAP) one. I can't fault the article for
| getting this wrong, though - it's super confusing!
| googletron wrote:
| You are absolutely right that the C here is more in line
| with the CAP one.
|
| I have a post in draft on disk trade-offs which digs into
| this aspect; it's impossible to dig into everything in a
| post at this level.
| yla92 wrote:
| Great post. Also highly recommend Designing Data-Intensive
| Applications by Martin Kleppmann
| (https://www.amazon.com/Designing-Data-Intensive-
| Applications...). The sections on "Storage and Retrieval",
| "Replication", "Partitioning" and "Transactions" really opened up
| my eyes!
| lysecret wrote:
| Absolutely loved the book. Can someone recommend similar books?
| wombatpm wrote:
| Database Design for Mere Mortals by Michael J. Hernandez
| avinassh wrote:
| Database Internals is also pretty good.
| skrtskrt wrote:
| Seconding Database Internals - it's not just about
| "Internals of a database", as part 2 gets nitty gritty with
| the general problems of distributed systems, consensus,
| consistency, availability, etc. etc.
| dangets wrote:
| I have not read it personally, but I've seen 'How Query
| Engines Work' highly recommended several times before. I have
| a procrastinatory tab open to check it out some day.
|
| https://leanpub.com/how-query-engines-work
| itsmemattchung wrote:
| Second this.
|
| I really like how Martin Kleppmann starts the book with a
| primitive data structure for constructing a database, then
| slowly evolves the system, describing the various trade-offs
| of building a database from the ground up.
| bironran wrote:
| Nice post, though for an indexing introduction/deep-dive I
| would still recommend newbies look at
| https://use-the-index-luke.com/
| konfusinomicon wrote:
| also check out rick james's mysql documents
| http://mysql.rjweb.org/
|
| I send those 2 links to coworkers all the time
| googletron wrote:
| Great resource! I have it linked as a reference!
| sonofacorner wrote:
| This is great. Thanks for sharing!
| jwr wrote:
| Some of the explanations are questionable: I think they were
| overly simplified, and while I applaud the goal, some things just
| aren't that simple.
|
| I highly recommend reading https://jepsen.io/consistency and
| clicking on each model on the map. This is the best resource I
| found so far for understanding databases, especially distributed
| ones.
| googletron wrote:
| I would love the feedback--what was questionable? Striking
| the balance is tough. Jepsen's content is great.
| gumby wrote:
| Everyone can disagree on what is the precise place to slice
| "this is beginner content" from "this is almost-beginner
| content". I could stick my own oar in in this regard but I
| won't.
|
| I think your level of abstraction is quite good for the
| absolute beginner asking "what on earth are people talking
| about when they use that 'database' word?". With an
| extremely high-level understanding, when they encounter more
| detail they'll have a "place to put it".
| Diggsey wrote:
| One thing that can be surprising is that for "REPEATABLE
| READ", not all "reads" are actually repeatable.
|
| There are at least two ways (that I'm aware of) that this
| can be violated. For example, if you run an update statement
| like this:
|
|     UPDATE foo SET bar = bar + 1
|
| Then the read of "bar" will always use the latest value,
| which may be different from the value other statements in the
| same transaction saw.
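A toy Python model of the surprise: plain SELECTs in the transaction read from its snapshot, but the UPDATE's read of bar is a "current read" of the latest committed value (this matches InnoDB's REPEATABLE READ behavior; the dicts standing in for row versions are purely illustrative):

```python
# Latest committed version of the row, and the snapshot our
# transaction took when it started.
committed = {"bar": 1}
snapshot = dict(committed)

# Another transaction commits UPDATE foo SET bar = 5 meanwhile.
committed["bar"] = 5

# Plain SELECT inside our transaction: repeatable, still sees 1.
select_sees = snapshot["bar"]

# Our UPDATE foo SET bar = bar + 1: the read of bar uses the latest
# committed value, so 5 becomes 6 -- not 1 becoming 2.
committed["bar"] = committed["bar"] + 1

assert select_sees == 1
assert committed["bar"] == 6
```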
| kerblang wrote:
| Not sure what you're claiming here...
|
| Repeatable read isolation creates read locks so that other
| transactions cannot write to those records. Of course our
| own transaction has to first wait for outstanding writes to
| those records to commit before starting.
|
| As best as I know, the goal is not to prevent one's own
| transaction from updating the records we read; the read
| locks will just get upgraded to write locks.
| petergeoghegan wrote:
| > Some of the explanations are questionable: I think they were
| overly simplified, and while I applaud the goal, some things
| just aren't that simple.
|
| I am an expert on the subject matter, and I don't think that
| the overall approach is questionable. The approach that the
| author took seems fine to me.
|
| The definition of certain basic concepts like 'consistency' is
| even confusing to experts at times. This is made all the more
| confusing by introducing concepts from the distributed systems
| world, where consistency is often understood to mean something
| else.
|
| Here's an example of this that I'm familiar with, where an
| expert admits to confusion about the basic definition of
| consistency in the sense that it appears in ACID:
|
| https://queue.acm.org/detail.cfm?id=3469647
|
| This is a person who is a longtime peer of the people who
| invented the concepts!
|
| Not trying to rigorously define these things makes a great deal
| of sense in the context of a high level overview. Getting the
| general idea across is far more important.
| jandrewrogers wrote:
| > "Scale of data often works against you, and balanced trees are
| the first tool in your arsenal against it."
|
| An ironic caveat to this is that balanced trees don't scale well,
| only offering good performance across a relatively narrow range
| of data size. This is a side-effect of being "balanced", which
| necessarily limits both compactness and concurrency.
|
| That said, concurrent B+trees are an absolute classic and provide
| important historical context for the tradeoffs inherent in
| indexing. Modern hardware has evolved to the point where B+trees
| will often offer disappointing results, so their use in indexing
| has dwindled with time.
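For the article's point about balanced trees: the essential property is that keys are kept sorted, which turns a range scan into a binary search plus a sequential walk. A toy stand-in using Python's bisect over a sorted key list (not how a real B+tree is laid out on pages, just the lookup idea):

```python
import bisect

# Sorted keys: a stand-in for the ordered leaves of a B-tree index.
keys = sorted([17, 3, 42, 8, 23, 15, 4])

def range_scan(lo, hi):
    # O(log n) to find the start, then a sequential walk to hi.
    i = bisect.bisect_left(keys, lo)
    j = bisect.bisect_right(keys, hi)
    return keys[i:j]

assert range_scan(5, 20) == [8, 15, 17]
```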
| hashmash wrote:
| What kinds of indexing structures are used instead, and how do
| they differ from B+trees? Do you have examples of which
| relational databases have replaced B+tree indexes?
| jandrewrogers wrote:
| The property being optimized for, relative to B+trees, is
| extreme compactness of representation. In the pantheon of
| possible indexing algorithms, B+trees are pretty far on the
| bloated end of the spectrum in terms of the ratio between
| data space and index space. All indexes have a scaling
| performance cliff due to the index structure filling up and
| eventually overflowing available cache, crowding out the data
| and forcing page faults for almost every index write. In
| B+tree indexes this happens relatively early and often.
|
| Radically improving index compactness is achieved by
| loosening design constraints on B+trees: the indexes
| represent a partial order which only converges on a total
| order at the limit and the search structure is unbalanced. In
| the abstract these appear slightly less efficient but it
| enables the use of selectivity-maximizing succinct
| representations of the key space that can get pretty close to
| the information theoretic limits. Scalability gains result
| from the radical reduction in cache footprint when
| represented this way.
|
| Optimal compressive indexes are not computable (being
| equivalent to AI), so the efficient approximation strategies
| people come up with tend to be diverse, colorful, and
| sometimes impractical. Tangentially, some flavors have
| excellent write performance. It is not a trivial algorithm
| problem but there are a few design families that generalize
| well to real database engines. I wouldn't describe this as a
| fully solved problem but many ordinary cases are covered.
|
| There isn't much incentive to design a relational database
| engine that can use these types of indexes, since the types
| of workloads and data models that recommend them usually
| aren't relational. Someone could, there just isn't much
| incentive. It is more _de rigueur_ for graph, spatiotemporal,
| and some types of analytical databases, where there is no
| other practical option if scalability matters at all.
| xmprt wrote:
| I know Clickhouse uses MergeTrees which are different from
| B+trees. However it can't really be used as an RDBMS. It's
| especially bad at point reads.
|
| https://en.wikipedia.org/wiki/Log-structured_merge-tree
| mikeklaas wrote:
| There are projects that use LSM trees as the storage engine
| for an RDBMS (MyRocks builds on RocksDB, for example); I'm
| not sure it's accurate to say "they can't be used as an
| RDBMS".
| petergeoghegan wrote:
| It's definitely possible, and can make a lot of sense --
| MyRocks/RocksDB for MySQL seems like an interesting and
| well designed system to me. It is fairly natural to
| compare MyRocks to InnoDB, since they're both MySQL
| storage engines. That kind of comparison is usually far
| more useful than an abstract comparison that ignores the
| practicalities of concurrency control and recovery.
|
| The fact that MyRocks doesn't use B+Trees seems like half
| the story. Less than half, even. The really important
| difference between MyRocks and InnoDB is that MyRocks
| uses log-structured storage (one LSM tree for
| everything), while InnoDB uses a traditional write-ahead
| log with checkpoints, and with logical UNDO. There are
| multiple dimensions to optimize here, not just a single
| dimension. Focusing only on time/speed is much too
| reductive. In fact, Facebook themselves have said that
| they didn't set out to improve performance as such by
| adopting MyRocks. The actual goal was price/performance,
| particularly better write amplification and space
| amplification.
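A toy sketch of the log-structured idea discussed here: writes land in an in-memory memtable and are flushed as immutable sorted runs, and reads consult the newest data first. Real LSM engines add compaction, bloom filters, and a WAL; everything below is illustrative only:

```python
class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: write an immutable sorted run ("SSTable").
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # freshest data first
            return self.memtable[key]
        for run in reversed(self.runs):   # then newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(10):
    db.put("k%d" % i, i)
db.put("k0", 99)   # newer value shadows the flushed one
assert db.get("k0") == 99
assert db.get("k5") == 5
```

Sequential flushes are what buy the write throughput; the cost is that a read may have to check several runs, which compaction keeps in check.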
| petergeoghegan wrote:
| > Modern hardware has evolved to the point where B+trees will
| often offer disappointing results, so their use in indexing has
| dwindled with time.
|
| This is pure nonsense. B+Trees are used extensively and by
| default by 5 out of 5 of the top database systems, according to
| db-engines.com.
| jandrewrogers wrote:
| You don't actually address the point.
|
| If your database engine is an old design or your data is
| small by modern standards, then a B+tree will be one of the
| few indexing algorithms available and if the data is small it
| will probably work. Modern database kernels targeting modern
| hardware and storage densities typically aren't using B+trees
| and the reasons why are well-understood. No one with any
| sense is using a B+tree to index e.g. a trillion records,
| which is a pretty ordinary thing to do on a single server in
| 2022.
|
| You can't just swap out indexing architectures due to their
| dependency on storage engine and scheduling behavior, so
| older databases like PostgreSQL will be using B+trees for the
| indefinite future even if suboptimal.
|
| The transition away from B+tree based architectures in new
| databases engines started about 10-15 years ago. Back then I
| used them ubiquitously but I honestly don't remember the last
| time I've seen one in a new design.
| funcDropShadow wrote:
| > so older databases like PostgreSQL will be using B+trees
| for the indefinite future even if suboptimal.
|
| PostgreSQL 14 comes with 6 builtin index types[1]: B-tree,
| Gist, SP-Gist, Gin, Brin, and Hash. More can be plugged in
| as extensions.
|
| [1]: Chapters 63-69 of
| https://www.postgresql.org/docs/14/internals.html
|
| Edited: Fixed the link to version 14.
| petergeoghegan wrote:
| > You don't actually address the point.
|
| You said that B-Trees "use in indexing has dwindled with
| time". This is demonstrably false.
|
| > Back then I used them ubiquitously but I honestly don't
| remember the last time I've seen one in a new design.
|
| Even if that was true (which it definitely isn't), why
| would anybody judge the commercial or scientific relevance
| of B-Trees by looking at what new systems do? There are
| very few new systems that are intended to be competitive as
| general purpose systems, which is where most of the market
| is.
|
| You still haven't actually named a single example of a
| "modern database kernel" that exemplifies what you're
| talking about.
| jandrewrogers wrote:
| You are understating the limitations of B+trees for real
| workloads. A common and growing problem is the lack of
| online indexing that scales; the particular data model
| doesn't matter that much. Index construction throughput and
| scaling have been a serious problem at some pretty boring
| companies I've done work for.
|
| Use of B+trees in _new_ database kernels has definitely
| diminished. I'm not counting the installed base of
| SQLite etc. Ubiquity doesn't make something the pinnacle
| of technology -- just as often it means "legacy installed
| base". I still use PostgreSQL a lot and mod it when I
| need to but I am not under any illusions about its
| limitations.
|
| A "modern" database kernel that can efficiently use
| modern hardware is going to be a thread-per-core
| architecture with all I/O and execution scheduling done
| in user space, and the ability to operate on modern
| storage densities found on database servers, which can
| exceed a petabyte of direct-attached storage. The
| implications of storage density and its interaction with
| indexing drive most of the real changes in the way
| database kernels are designed. You can find elements of
| this in open source, but mostly in big data platforms
| rather than proper database engines.
|
| That said, no one builds new high-end databases for
| retail anymore, the economics don't make sense. All the
| money moved to more specialized implementations that
| cater to smaller audiences where you don't need to
| advertise. The kernels are fully general, and widely
| reused, but the interfaces and surrounding bits are
| purpose-built for particular workloads. Hell, my old
| storage engines are still used under license by that lot.
| The days of database billboards on the 101 are an
| anachronism.
| petergeoghegan wrote:
| > You are understating the limitations of B+trees for
| real workloads.
|
| I never said anything about workloads. All I said was
| that your statements about B+Trees having dwindling usage
| are clearly false.
|
| If you make a claim that is self-evidently bogus, then
| you shouldn't expect anything else that you may have said
| at the same time to be taken seriously.
| hashmash wrote:
| Can you provide an explicitly named "modern" database
| that doesn't use B+tree indexes, and what specifically
| does it use instead?
| jiggawatts wrote:
| You keep talking about how B-Trees are rarely used but
| I've seen relatively new systems deployed that use them
| (or some minor variation). FoundationDB, FASTER, and a
| few others.
|
| Other than in-memory hash indexing as used by SAP HANA,
| I'm not aware of any other data structures anywhere near
| as popular for database engines.
|
| Can you name the data structure(s) that have superseded
| these?
| heisjustsosmart wrote:
| Read this guy's past posts, you will save yourself a lot
| of time. He does a lot of this sort of thing.
| jandrewrogers wrote:
| ...says the person who is confidently oblivious to the
| problems of mixing virtual methods and direct paged
| memory.
| petergeoghegan wrote:
| Thanks for the tip
| trhoad wrote:
| An interesting subject! The article could do with an edit,
| however. There are lots of grammatical errors.
| AtNightWeCode wrote:
| Not sure how to use these recommendations in practice,
| though, even if the info is somewhat correct. SQL is a beast
| of a technology; it is used because it is battle-tested and
| because there is simply no other viable tech replacing it
| when it comes to transactions and aggregated queries.
|
| Indexes are a nightmare to get right. Often performance
| optimizations of SQL databases include removing indexes as much
| as adding indexes.
| vorpalhex wrote:
| It's not that SQL is all that beastly, it's that most tutorials
| fail to explain the internals and basics and so you just see
| all these features and interfaces of the system and can't build
| a mental model of how the system works.
| AtNightWeCode wrote:
| Well, SQL does give you a lot of rope. I worked with
| expensive commercial software that destroyed database
| performance by doing everything from complicated ad hoc
| queries to massive numbers of point reads.
| larrik wrote:
| Indexes aren't a "make my DB faster" magic wand. They have
| benefits and costs.
|
| If you are seeing performance gains from _removing_ indexes,
| then I'm assuming your workload is very heavy on
| writes/updates compared to reads.
| AtNightWeCode wrote:
| Mostly because of overlapping indexes. Then if there are
| include columns it may get out of hand. Not too difficult to
| achieve. Just blindly follow recommendations from a tool or a
| cloud service.
| roflyear wrote:
| Or you're using MySQL ;)
| dspillett wrote:
| Too many indexes can cause significant performance problems
| if RAM is short. If the indexes are actually used (rather
| than sitting idle on disk because other indexes are better
| choices for all your applications' typical queries) then they
| will "compete" for memory potentially causing a cache
| thrashing situation.
|
| But yes, the issue with too many indexes is more often that
| they harm write performance.
|
| A related issue is indexes that are too wide, either covering
| many columns or "including" them. As well as eating disk
| space they also eat extra memory (and potentially cause extra
| IO load) when used (fewer rows per page, so more pages loaded
| into RAM for the same query).
|
| Both problems together, too many indexes many of which are
| too wide, usually comes from blindly accepting
| recommendations from automated tools (particularly when they
| are right that there is a problem, and it is a problem that a
| given index may solve, but fixing the queries so existing
| indexes are useful could have a much greater effect than
| adding the indexes).
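The read/write trade-off can be seen directly in sqlite3: EXPLAIN QUERY PLAN shows a full scan become an index search after CREATE INDEX, while every subsequent write must now maintain that index too (table and index names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY,"
             " customer INTEGER, total REAL)")
conn.executemany("INSERT INTO orders (customer, total) VALUES (?, ?)",
                 [(i % 50, float(i)) for i in range(1000)])

# Without an index, the filter is a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 7").fetchall()
assert any("SCAN" in row[3] for row in plan)

# Adding the index turns the scan into a search...
conn.execute("CREATE INDEX idx_customer ON orders (customer)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 7").fetchall()
assert any("USING INDEX idx_customer" in row[3] for row in plan)
# ...but every INSERT/UPDATE on orders now maintains idx_customer too.
```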
| manish_gill wrote:
| What tool was used to create the visuals?
| donatj wrote:
| I still think about my first job out of college. Shopping cart
| application, we would add indexes exclusively _when there was a
| problem_ rather than proactively based on expected usage
| patterns. It 's genuinely a testament to MySQL that we got as far
| as we did without knowing anything about what we were doing.
|
| One of my most popular StackOverflow questions to this day is
| about how to handle _one million_ rows in a single MySQL
| table (_shudder_).
|
| The product I work on now collects more rows than that _a day_ in
| a number of tables.
| Merad wrote:
| > a dirty read occurs when you perform a read, and another
| transaction updates the same row but doesn't commit the work, you
| perform another read, and you can access the uncommitted (dirty)
| value
|
| It's even worse than this with MS SQL Server. When using the READ
| UNCOMMITTED isolation level it's actually possible to read
| corrupted data, e.g. you might read a string while it's being
| updated, so the result row you get contains a mix of the old
| value and new value of the column. SQL Server essentially does
| the "we got a badass over here" Neil deGrasse Tyson meme and
| throws data at you as fast as it can. Unfortunately I've worked
| on several projects where someone apparently thought that READ
| UNCOMMITTED was a magic "go fast" button for SQL and used it all
| throughout the app.
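A toy model of the dirty read itself (no real database involved; the two dicts stand in for committed state and a transaction's in-flight write):

```python
# Committed state, and transaction A's uncommitted working copy.
store = {"balance": 100}
uncommitted = dict(store)

uncommitted["balance"] = 500    # A writes but has not committed

# READ COMMITTED: transaction B sees only the committed state.
assert store["balance"] == 100
# READ UNCOMMITTED: B sees A's in-flight write -- a dirty read.
assert uncommitted["balance"] == 500

# A rolls back: the value 500 that B may have acted on never existed.
uncommitted = dict(store)
assert uncommitted["balance"] == 100
```

The torn-string case described above is worse still: the reader can observe a value that was never written by *any* transaction, committed or not.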
| jiggawatts wrote:
| I really wish SERIALIZABLE was the default transaction
| isolation level and anything lower was opt in... with warnings.
| hodgesrm wrote:
| SERIALIZABLE is ridiculously slow if you have any level of
| concurrency in your app. READ COMMITTED is a reasonable
| default in general. The behavior GP is describing sounds like
| an out and out bug.
|
| Dirty reads incidentally weren't supported for quite some
| time in the Sybase architecture (which forked to MS SQL
| Server in 1992). There was a Sybase effort to add dirty read
| support around 1995 or so. The project name was "Lolita."
| throwaway787544 wrote:
| Can anyone give me a brief understanding of stored procedures and
| when I should use them?
| tiffanyh wrote:
| #1 thing you should know, RDBMS can solve pretty much every data
| storage/retrieval problem you have.
|
| If you're choosing something other than an RDBMS - you should
| rethink why.
|
| Because unless you're at massive scale (which still doesn't
| justify it), choosing something else is rarely the right
| decision.
| VirusNewbie wrote:
| It is very frustrating to work with engineers who don't
| understand the nuances of RDBMS and assume they can solve all
| the things. The small company I work for has 3B rows. We have a
| high write volume. _Can_ you use an RDBMs database to solve
| this? Sure, but it would be a terrible waste of engineering
| effort.
| qaq wrote:
| As opposed to a buggy reimplementation of a subset of RDBMS
| functionality on the application side?
| googletron wrote:
| Good point. It's often the problem space and other
| constraints that drive these decisions. It's important that
| you deal with problems when you actually have them.
| throwamon wrote:
| Isn't this like saying you can solve every programming problem
| you have with <insert your favorite Turing-complete language
| here>? Of course you can, but aren't there _any_ cases where
| the tradeoffs outweigh the benefits, even if it's about
| something selfish like ergonomics or, dare I say, fun?
| _the_inflator wrote:
| I agree with you. That said, I don't see anything that
| proves him wrong. Databases are not like programming
| languages. There is a reason why we don't use punch cards
| anymore.
| w0m wrote:
| Is "not performance-bound, and not knowing the future shape
| of your data" a valid reason? Less overhead on initial
| rollout to just toss it up there.
|
| > choosing something else is rarely the right decision
|
| I think this is a little bit of a 'We always did it this way'
| statement.
| xmprt wrote:
| I think the tradeoff is similar to using a weakly typed vs
| strongly typed language. Strong typing is more up front
| effort but it will save you down the line because it's more
| predictable. Similarly, an RDBMS will require more up front
| planning and design and regular maintenance but that extra
| planning will save you more time down the line.
| adrianmonk wrote:
| > _future shape of your data_
|
| Contrary to what people seem to assume, you actually can
| change the schema of a database and migrate the existing data
| to the new schema. There's a learning curve, but it's doable.
|
| If you go schema-less, you run into another problem: not
| knowing the _past_ shape of your data. When you try to load
| old records (from previous years), you may find that they
| don't look like the ones you wrote recently. And, if your code
| was changed, it may fail to handle them.
|
| This makes it hard to safely change code that handles stored
| data. You can avoid changing that code, you can accept
| breakage, or you can do a deep-dive research project before
| making a change.
|
| If you have a schema, you have a contract about what the data
| looks like, and you can have guarantees that it follows that
| contract.
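| The kind of migration described above can be sketched as
| follows, using Python's sqlite3 module and a hypothetical
| "users" table (all names here are illustrative):

```python
import sqlite3

# Hypothetical starting schema: a users table with a single "name" field.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace'), ('Alan Turing')")

# Migration: add a "last_name" column and backfill existing rows in one
# transaction, so every record -- old or new -- conforms to the new schema.
with conn:
    conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")
    conn.execute("UPDATE users SET last_name = substr(name, instr(name, ' ') + 1)")

rows = conn.execute("SELECT name, last_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada Lovelace', 'Lovelace'), ('Alan Turing', 'Turing')]
```

| After the backfill there is no "past shape" left to worry
| about: the schema and the stored data agree again.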
| grogers wrote:
| Maintaining backwards compatibility for reading old records
| in code is not hard. You can always rewrite all rows to the
| newer format if you want to remove the compat code, or if
| the structure changes in an incompatible way. It's pretty
| comparable to what you have to do to evolve the code/schema
| safely together.
|
| Having schema is much better for ad-hoc queries though,
| doubly so if your schemaless types aren't JSON (e.g.
| protobufs).
| alecfong wrote:
| I find myself forced to model access pattern when choosing
| non relational dbs. This often results in a much less
| flexible model if you didn't put a lot of thought into it.
| Ymmv
| gryn wrote:
| As long as you're willing to write a whole lot of data
| validation and constraint-checking scripts by hand in the
| future, plus ETL scripts for non-trivial analytical queries
| (depending on which NoSQL store you chose, but if you chose it
| for perf this is usually a price you have to pay). And to keep
| very rigorous track of the conceptual model of your data
| somewhere else, or simply not care about its consistency when
| different parts of your not-schema have contradicting data (at
| that point why are you even storing it?).
|
| And that you ruled out using JSON string column(s) as a dump
| for the uncertain parts, de-normalization and indexing, and
| the EAV schema as potential solutions to your problems.
|
| The point is nothing is free, and you have to be sure it's a
| price you're willing to pay.
|
| Are you ready to give up joins? To have your data be modeled
| after the exact queries you're going to make? For your data to
| be duplicated across many places? Etc.
| emerongi wrote:
| With Postgres, you can always just have a JSONB column for
| data whose shape you're unsure of. Personally, I'd rather
| start with Postgres and dump data into there and retain the
| powers of RDBMS for the future, rather than the other way
| around and end up finding out that I really would like to
| have features that come out of the box with relational
| databases.
|
| I think a valid reason for not choosing a relational database
| is if your business plan requires that you grow to be a
| $100B+ company with hundreds of millions of users. Otherwise,
| you will probably be fine with RDBMS, even if it will require
| some optimizing in the future.
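| The pattern above can be sketched with SQLite's built-in JSON
| functions via Python so it runs anywhere; Postgres would use a
| JSONB column and the ->> operator instead (table and field
| names here are hypothetical):

```python
import sqlite3

# A relational table with one semi-structured column for data whose
# shape is still uncertain. Postgres: JSONB column + ->>; SQLite's
# json_extract behaves similarly enough for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "INSERT INTO events (payload) VALUES (?), (?)",
    ('{"kind": "signup", "plan": "free"}', '{"kind": "purchase", "amount": 42}'),
)

# Ordinary SQL still works over the semi-structured part.
kinds = conn.execute(
    "SELECT json_extract(payload, '$.kind') FROM events ORDER BY id"
).fetchall()
print(kinds)  # [('signup',), ('purchase',)]
```

| If a field later stabilizes, it can be promoted to a real
| column with a normal migration, keeping the relational powers
| intact.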
| packetlost wrote:
| Postgres' JSON implementation perfectly adheres to the JSON
| spec, which actually sucks if you need to support things
| like NaNs, Inf, etc. It's a good option, but it doesn't
| work for all datasets.
| slaymaker1907 wrote:
| Wow, TIL that NaN isn't valid JSON.
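| Python's json module illustrates the strictness: it emits NaN
| only as a nonstandard extension, and rejects it outright in
| spec-compliant mode:

```python
import json

# Per the JSON spec (RFC 8259), NaN and Infinity are not valid tokens.
# Python's json module emits them only as a nonstandard extension:
print(json.dumps(float("nan")))  # NaN  (not actually valid JSON!)

# With the extension disabled, serialization fails outright.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as e:
    print("rejected:", e)
```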
| mslot wrote:
| Most $100B+ companies (e.g. Google, Meta, Amazon) were
| built primarily using relational databases for the first
| 10-15 years.
| qaq wrote:
| No longer an issue with things like Spanner, CockroachDB
| etc
| dspillett wrote:
| There are circumstances where you really don't know the shape
| of the data, especially when prototyping for proof of concept
| purposes, but usually not understanding the shape of your
| data is something that you should fix up-front as it
| indicates you don't actually understand the problem you are
| trying to solve.
|
| More often than not it is worth some time spent thinking and
| planning to work out at least the core requirements in that
| area, to save yourself a lot of refactoring (or throwing away
| and restarting) later, and potentially hitting bugs in
| production that a relational DB with well-defined constraints
| could have saved you from while still in dev.
|
| Programming is brilliant. Many weeks of it sometimes save you
| whole hours of up-front design work.
| spmurrayzzz wrote:
| > usually not understanding the shape of your data is
| something that you should fix up-front as it indicates you
| don't actually understand the problem you are trying to
| solve.
|
| This is a good point and probably correct often enough, but
| I also think not understanding the entire problem you are
| solving is not only common, but in fact necessary to most
| early-stage velocity. There is a need to iterate and adapt
| frequently, sometimes as part of your go-to-market
| strategy, in order to fully understand the problem space.
|
| > a relational DB with well-defined constraints could have
| saved you from while still in dev
|
| This presumes that systems built on top of non-RDBMS are
| incapable of enforcing similar constraints. This has not
| been my experience personally. But it's possible I don't
| understand your meaning of constraints in this context. I
| assumed it to mean, for instance, something like schemas
| which are fairly common now in the nosql world. Curious
| what other constraints were you referencing?
| marcosdumay wrote:
| > This presumes that systems built on top of non-RDBMS
| are incapable of enforcing similar constraints.
|
| Are you kidding? They never can.
|
| The entire point of ditching the relational model is
| discarding data constraints and normalization.
| spmurrayzzz wrote:
| Never? Many NoSQL stores are offering parity in many of
| the feature verticals that were historically the sole
| domain of RDBMS.
|
| Mongo has always had semantics to support normalized data
| modeling, has schema support, and has had distributed
| multi-document ACID transactions since 2019 [1]. You
| don't have to use those features as they're opt-in, but
| they're there.
|
| I know that full parity between the two isn't feasible,
| but to say they never can is a mischaracterization.
|
| [1] Small addendum on this: Jepsen highlighted some
| issues with their implementation of snapshot isolation
| and some rightful gripes about poor default config
| settings and wonky API (you need to specify snapshot read
| concern on all queries in conjunction with majority write
| concern, which isn't highlighted in some docs). But with
| the right config, their only throat clearing was whether
| snapshot isolation was "full ACID", which would apply to
| postgres as well given they use the same model.
| funcDropShadow wrote:
| What is the point of using MongoDB with multi-document
| ACID transactions? Enabling durability in MongoDB is
| usually costly enough that you can't find a performance
| benefit compared to Postgres. With JSONB support in
| PostgreSQL, I dare say, it can express anything that
| MongoDB can express with its data model and query
| language. That leaves scalability as the only possible
| advantage of MongoDB compared to PostgreSQL. And the
| scalability of MongoDB is rather restrictive, compared to
| e.g. Cassandra.
|
| And I would never trust a database that has such a bad
| track record regarding durability as MongoDB, although I
| admit that PostgreSQL had theoretical problems there as
| well in the past.
| spmurrayzzz wrote:
| I actually agree with you on the point about multi-
| document tx's, I wouldn't choose mongo solely _for_ that
| feature. It's nice to have for the niche use case in
| your NoSQL workload where it's beneficial. But the
| point I was originally making was that nosql stacks are
| not fundamentally incompatible with most of the features
| or safety constraints offered by other RDBMS.
|
| > And I would never trust a database that has such a bad
| track record, regarding durability as MongoDB
|
| I can't comment on your own experience obviously, but
| I've been using mongo since 2011 in high throughput
| distributed systems and it's been mostly great (one of my
| current systems averages ~38 million docs per minute,
| operating currently at 5 9s of uptime).
|
| Definitely some pain points initially in the transition
| to WiredTiger, but that largely was a positive move for
| the stack as a whole. Durability fires have not plagued
| my experience thus far, not to say they won't in the
| future of course.
|
| As you noted, Postgres has had its own headaches as well.
| Finding out that all their literature claiming their
| transactions were serializable when they were in fact
| _not_ serializable could be considered a mark on their
| record. But much like mongo, they have been quick to
| address implementation bugs as they are found.
| funcDropShadow wrote:
| > I can't comment on your own experience obviously, but
| I've been using mongo since 2011 in high throughput
| distributed systems and it's been mostly great (one of my
| current systems averages ~38 million docs per minute,
| operating currently at 5 9s of uptime).
|
| > Definitely some pain points initially in the transition
| to WiredTiger, but that largely was a positive move for
| the stack as a whole. Durability fires have not plagued
| my experience thus far, not to say they won't in the
| future of course.
|
| Good to read that some are actually using MongoDB to
| their benefit. Indeed I have encountered problems with
| durability in the wild. Nothing, that I would like to
| repeat. But as always for a specific use case the answer
| is: it depends. As general advice on what to start a
| new project with, I would select PostgreSQL in 10 out of 10
| cases, if a database server is actually required.
| dspillett wrote:
| _> schemas_
|
| The confusion there is due to the fact that non-R-DBMS
| (particular when referred to as noSQL) can mean several
| different things.
|
| In this context I was replying to a comment about not
| knowing the shape of your data which implies that person
| was thinking about solutions that are specifically
| described as schemaless, which is what a lot of people
| assume (in my experience) if you say non-relational or
| noSQL.
|
| That is the sort of constraints I was meaning: primary &
| unique keys and foreign keys for enforcing referential
| integrity and other validity rules enforced at the
| storage level. There are times when you can't enforce
| these things immediately with good performance
| (significantly distributed data stores that need
| concurrent distributed writes for instance - but the need
| for those is less common for most developers than the big
| data hype salespeople might have you believe) so then you
| have to consider letting go of them (I would advise
| considering it very carefully).
| happimess wrote:
| > There is need to iterate and adapt frequently,
| sometimes as part of your go-to-market strategy, in order
| to fully understand the problem space.
|
| If you're pivoting so hard that your SQL schema breaks,
| how is a schemaless system going to help you? You'll
| still have to either throw out your old data (easy in
| both cases) or figure out a way to map old records onto
| new semantics (hard in both cases).
|
| I agree with GP that this is a central problem to solve,
| not something to figure out _after_ you write software.
| Build your house on stone.
| spmurrayzzz wrote:
| >If you're pivoting so hard that your SQL schema breaks,
| how is a schemaless system going to help you? You'll
| still have to either throw out your old data (easy in
| both cases) or figure out a way to map old records onto
| new semantics (hard in both cases).
|
| I agree with your comment that it's a central problem to
| solve, and that both options, throwing out data or mapping
| old records onto new semantics, are choices both
| stacks need to make. I don't agree that it's always
| possible to solve entirely up front though.
|
| In my experience, it has been less so about whether the
| storage engine is schemaless or not, even many modern
| nosql stacks now ship with schemas (e.g. MongoDB). I
| think the differentiation I make between these platforms
| is mostly around APIs. Expressive, flexible semantics
| that (in theory) let you move quickly.
|
| As an aside, I also think the differentiation between all
| these systems is largely unimpactful for most software
| engineers. And the choice often made is one of
| qualitative/subjective analysis of dev ergonomics etc. At
| scale there are certainly implementation details that
| begin to disproportionately impact the way you write
| software, sometimes prohibitively so, but most folks
| aren't in that territory.
| funcDropShadow wrote:
| Admittedly, my experience with MongoDB and Cassandra has
| gained some rust over the last decade, but what makes you
| say such nosql databases have expressive APIs? Compared
| to PostgreSQL they have minuscule query languages and it
| is very hard, if at all possible, to express constraints.
| And constraints, sometimes self-imposed sometimes not,
| are what makes projects successful, even startups. Many
| startups try to find this one little niche they can
| dominate. That is a self-imposed constraint. People tend
| to think freedom makes them creative, productive, and
| inventive, while in fact the opposite is the truth. With
| opposite of freedom I mean carefully selected constraints
| not oppression.
| roflyear wrote:
| No, absolutely not. An hour spent making a schema and a few
| hours of migrations is way less than the headaches you'll have
| by going NoSQL first.
| marcosdumay wrote:
| > Is 'Not performance bound, and not knowing the future shape
| of your data' a valid reason?
|
| That's a very good reason for going with a RDBMS even if
| looks like it's not the clearest winner for your use case.
|
| If you invert any of those conditions, it may become
| interesting to study alternatives.
| jpdb wrote:
| Hard disagree. The operational overhead of RDBMSs, and
| specifically their inherent reliance on a single primary node,
| makes them, in my opinion, a bad place to start your
| architecture.
|
| I want to be able to treat the servers in my database tier as
| cattle instead of pets and RDBMSs don't fit this paradigm well.
| Either NoSQL or NewSQL databases are, in my opinion, a much
| better place to start.
|
| I feel like RDBMSs being the "default" option is because most
| people have worked with them in the past and already understand
| them. It doesn't mean they are the best tool for the job or
| even the tool most likely to solve the unknown problems you'll
| encounter in the future.
| roflyear wrote:
| What a joke.
| imachine1980_ wrote:
| You can have multi-TB Postgres databases that are fast and
| usable with a limited number of cache layers for speed, but you
| probably don't need it. Medium migrated from a single Postgres in
| 2020.
| funcDropShadow wrote:
| I don't understand why this comment is down voted. I've
| been part of a project that uses a dozen different
| PostgreSQL databases from different services, one of those
| databases is multi TB with an OLAP usage patterns. And it
| beats performance-wise all the previous attempts to solve
| that specific problem.
| anonymousDan wrote:
| What is your go to NewSQL database these days (and why) out
| of interest?
| hhjinks wrote:
| Only once have I worked on a project where a document
| database did not completely gimp our ability to deliver the
| data that was required of us, and that was only because that
| data was regularly cloned to a relational database we could
| use for asynchronous tasks. As a project grows, I have,
| without fail, come to find that you need relations to
| efficiently deliver the data that new requirements demand.
| corpMaverick wrote:
| Similarly: people don't use object modeling or entity-relationship
| diagrams anymore.
|
| Every day, I see people struggling with problems that would be
| easy to understand if you had one. You don't even need to have
| an RDBMs. They are good just to model how things are related to
| each other.
| aoms wrote:
| You're speaking my language. After more than 20 years of custom
| software dev, this statement has so much merit.
| jimnotgym wrote:
| The other day I said to a junior dev: when you start planning
| a locking scheme to handle concurrency in your file-based
| system, it is time to swap to a DB.
| roflyear wrote:
| Very frequently polled queues come to mind, but usually I'll
| use a db first anyway as there are benefits to it.
| dobin wrote:
| I just use files
| spmurrayzzz wrote:
| The premise here (I think, correct me if I'm mistaken) is that
| there are net-negative tradeoffs to using nosql/non-rdbms.
|
| If that assumption is true, then it follows that the same
| argument used in the last statement also applies -- that if
| you're not at massive scale, the aforementioned
| tradeoff of not using an RDBMS is likely de minimis.
|
| (This assumes that the tradeoffs are of the magnitude that they
| only manifest impact at scale, hard to address that without
| concrete examples though)
| [deleted]
| [deleted]
| bcrosby95 wrote:
| > (This assumes that the tradeoffs are of the magnitude that
| they only manifest impact at scale, hard to address that
| without concrete examples though)
|
| The tradeoff is usually flexibility. You run into flexibility
| problems anytime requirements change. Scale doesn't factor
| in.
| nlnn wrote:
| I've found it's not just scale, but also down to query patterns
| across the data being stored.
|
| I'm with you on using an RDBMS for almost everything, but
| worked on quite a few projects where alternatives were needed.
|
| One involved a lot of analytics queries (aggregations, filters,
| grouping etc.) on ~100-200GB of data. No matter what we tried,
| we couldn't get enough performance from Postgres (column-based
| DBs / Parquet alternatives gave us 100x speedups for many
| queries).
|
| Another was for storing ~100M rows of data in a table with ~70
| columns or so of largely text based data. Workload was
| predominantly random reads of subsets of 1M rows and ~20
| columns at a time. Performance was also very poor in
| Postgres/MySQL. We ended up using a key/value store, heavily
| compressing everything before storing, and got a 30x speedup
| compared to using an RDBMS using a far smaller instance host
| size.
|
| I wouldn't call either of them massive scale, more just data
| with very specific query needs.
| snarfy wrote:
| It's a pretty old problem, as these are competing ideas:
| OLTP vs. OLAP. Postgres is designed for OLTP.
| giovannibonetti wrote:
| > Another was for storing ~100M rows of data in a table with
| ~70 columns or so of largely text based data. Workload was
| predominantly random reads of subsets of 1M rows and ~20
| columns at a time.
|
| Kimball's dimensional modelling helps a lot in cases like
| this, since probably there is a lot of repeated data in these
| columns.
| randomdata wrote:
| _> RDBMS can solve pretty much every data storage /retrieval
| problem you have._
|
| Except the most important problem: A pleasant API. Which is, no
| doubt, why 95% of those considering something other than an
| RDBMS are making such considerations.
|
| RDBMS _can_ have pleasant APIs. It is not a fundamental
| limitation. We have built layers upon layers upon layers of
| abstraction over popular RDBMSes to provide nice APIs and they
| work well enough. But those additional layers come with a lot
| of added complexity and undesirable dependencies that most
| would prefer to see live in the DBMS itself instead.
|
| At least among the RDBMSes we've heard of, there does not seem
| to be much interest in improving the APIs at the service level
| to make them more compelling to use natively like alternative
| offerings outside of the relational space have done.
| swagasaurus-rex wrote:
| EdgeDB looks promising. Postgres under the hood so you know
| it's stable.
| dgb23 wrote:
| With SQL you kind of have two options/extremes that are
| unpleasant in their own way.
|
| You either model things in a very domain specific and classic
| fashion. Here you get the benefit of being quite declarative
| and ad-hoc queries are natural. Also your schema is stronger,
| as in it can catch more misuse by default. But this kind of
| schema tends to have _logical_ repetition and is so specific
| that change/evolution is quite painful, because every new
| addition or use-case needs a migration.
|
| Or you model things very generically, more data driven than
| schema driven. You lose schema strength and you definitely
| lose sensible ad-hoc queries. But you gain flexibility and
| generality and can cover much more ground.
|
| You can kind of get around this dichotomy with views, perhaps
| triggers and such. In an ideal world you'd want the former to
| be your views and the latter to be your foundational schema.
|
| But now you get into another problem, which is that
| homogeneous tables are just _super_ rigid as result sets.
| There are plenty of very common types you cannot cover. For
| example tagged unions, or any kind of even shallowly nested
| result (extremely common use case), or multiple result groups
| in one query. All of these things usually mean you want
| multiple queries (read transaction) or you use non-SQL stuff
| like building JSONs (super awkward).
|
| If you can afford to use something like SQLite, then some of
| the concerns go away. The DB is right there so it's fine to
| query it repeatedly in small chunks.
|
| I wonder if we're generally doing it wrong though, especially
| in web development. Shouldn't the backend code quite
| literally live on the database? I wish my backend language
| would be a data base query language first and a general
| purpose language second, so to speak. Clojure and its datalog
| flavors come close. But I'm thinking of something even more
| integrated and purpose built.
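| The "generic foundation with domain-specific views on top" idea
| can be sketched roughly like this (SQLite via Python; the
| EAV-style table and the view are hypothetical):

```python
import sqlite3

# Sketch of the dichotomy resolved with views: store attributes
# EAV-style (generic, flexible), then expose a classic strongly
# shaped relation as a view for declarative ad-hoc queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attrs (entity INTEGER, name TEXT, value TEXT);
    INSERT INTO attrs VALUES
        (1, 'title', 'Dune'),     (1, 'year', '1965'),
        (2, 'title', 'Hyperion'), (2, 'year', '1989');

    -- The view pivots the generic rows back into a domain-specific shape.
    CREATE VIEW books AS
        SELECT entity AS id,
               MAX(CASE WHEN name = 'title' THEN value END) AS title,
               MAX(CASE WHEN name = 'year'  THEN value END) AS year
        FROM attrs GROUP BY entity;
""")

books = conn.execute("SELECT title, year FROM books ORDER BY id").fetchall()
print(books)  # [('Dune', '1965'), ('Hyperion', '1989')]
```

| The foundation stays generic and change-tolerant; the views
| carry the strong, query-friendly shape.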
| N_A_T_E wrote:
| Isn't that what the ActiveRecord pattern is supposed to be?
| (Something something n+1's and over fetching data.)
| eurasiantiger wrote:
| If only we had a generic GraphQL resolver for entity-
| based SQL schemas.
|
| Oh wait, we do have Prisma. And it suffers from those
| same issues.
| drpotato wrote:
| Have you looked at PostGraphile? It doesn't have n+1
| or over-fetching issues.
| paulryanrogers wrote:
| What is a pleasant API? For what kind of data?
| Yhippa wrote:
| I wonder if parent post meant something other than SQL?
| brightball wrote:
| I've honestly never understood why people have such a
| distaste for SQL. SQL and Linux/Unix have been the biggest
| constants of my entire programming career to this point
| (20ish years). I always know I can count on them.
| bakuninsbart wrote:
| SQL is great at what it is designed to do, and absolutely
| horrible for anything else. Sometimes I have had to use
| tools that _only_ allow (a subset of) SQL for querying
| data. (Especially BI.) Doing regex date validation on a JSON
| derived from a string, in an SQL dialect without function
| definitions, is horrendous. These kinds of "why the f do I
| have to use SQL for _this_" moments happened surprisingly
| often to me working as a data analyst.
|
| But then I'm trying to do relatively simple things in
| Pandas or R, like changing column values based on multiple
| field value conditions, and it is a struggle requiring at
| least 17 brackets, while in SQL it would be a simple
| statement that anyone can understand after reading it once.
| funcDropShadow wrote:
| The problem you are describing is probably rooted in the
| specific SQL dialect you had to use. Selecting from
| inside JSON strings and matching regular expressions
| should be a no-brainer. And it is straightforward in
| PostgreSQL, e.g.
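| For instance, a runnable sketch (SQLite via Python so it
| executes anywhere; in Postgres the ->> operator plus the ~
| regex match make this a one-liner, with no user-defined
| REGEXP function needed):

```python
import re
import sqlite3

# Regex date validation on a JSON field, in one SQL query.
# Postgres: WHERE payload->>'date' ~ '^\d{4}-\d{2}-\d{2}$'
# SQLite has json_extract built in, but REGEXP must be supplied
# as a user-defined function (it calls regexp(pattern, string)).
conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, lambda pat, s: re.search(pat, s or "") is not None)

conn.execute("CREATE TABLE docs (payload TEXT)")
conn.execute(
    "INSERT INTO docs VALUES (?), (?)",
    ('{"date": "2022-06-27"}', '{"date": "junk"}'),
)

valid = conn.execute(
    r"SELECT payload FROM docs "
    r"WHERE json_extract(payload, '$.date') REGEXP '^\d{4}-\d{2}-\d{2}$'"
).fetchall()
print(valid)  # [('{"date": "2022-06-27"}',)]
```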
| nequo wrote:
| > simple things in Pandas or R, like changing column
| values based on multiple field value conditions, and it
| is a struggle requiring at least 17 brackets
|
| With dplyr, wouldn't this be mutate(x = case_when(... ~
| y, TRUE ~ x)) or the same but with ifelse?
| oarabbus_ wrote:
| I haven't used much R, but how would you do something like:
|
|       CASE WHEN SUM(daily_revenue) OVER (PARTITION BY department,
|                  TRIM(SUBSTR(region, 5)) IN ('North','West','Misc'))
|                > AVG(revenue) OVER (ORDER BY sale_time ASC
|                  ROWS BETWEEN 28 PRECEDING AND CURRENT ROW)
|              AND NOT COALESCE(had_prev_month_party, FALSE)
|            THEN pizza_party_points + 5
|            WHEN <above> AND had_prev_month_party
|            THEN pizza_party_points + 3
|            WHEN MIN(sale_time) OVER (PARTITION BY department)
|                 = DATE_TRUNC('month', current_date) THEN 5
|            ELSE GREATEST(pizza_party_points - 1, 0)
|       END AS pizza_party_performance_points_current
|
| this example may be a bit esoteric but it actually draws
| from lots of real-world cases (comparing a partitioned
| sum with an overall average, checking conditions on other
| columns, messy data parsing and manipulation, implicit
| waterfall nature of CASE WHEN vs. explicit elseif, etc)
| oarabbus_ wrote:
| >Doing regex date validation in a json derived from a
| string in an sql dialect without function-definitions is
| horrendous.
|
| Doing regex is horrendous, but doing it on SQL in a
| modern database is no more difficult than in a full-
| fledged programming language. Most modern DBs have strong
| JSON support and built-in regex functions
|
| 1. JSON_CAST/JSON_PARSE your data
|
| 2. REGEXP_EXTRACT() on the result, here's several date
| validator regex from a SO post
| (https://stackoverflow.com/questions/15491894/regex-to-
| valida...)
|
| And that's it. In fact in many cases it's probably faster
| to do it natively in SQL than to export it to python or R
| and parse there.
| robonerd wrote:
| I blame ORMs. ORMs are promoted by scare-mongering that steers
| novice developers away from learning SQL in the first place. I'm
| ashamed to say I fell for it for a few years. When I
| eventually learned SQL it was like a fog being lifted from
| my mind that I hadn't even noticed before.
| reidjs wrote:
| I tried not to use an ORM for my last project, but I
| ended up rolling my own ORM for the application anyway
| because it was easier to keep the structures in the web
| application tied to the database. How else do you keep
| the two consistent?
| jstarfish wrote:
| I wholly agree with you, but I'll say this:
|
| When it comes to prototyping, I'm not going to fuck with
| something like Java-- I'm going to reach for Python. If I
| don't _know_ what I'm doing to begin with, I don't need
| something telling me all the ways in which I'm doing it
| wrong.
|
| Same goes for SQL/NoSQL. I loosely know what I need to
| model and may revise it arbitrarily. SQL does not lend
| itself to this. NoSQL was _designed_ for it.
|
| NoSQL is the "fuck you dad you can't tell me what to do" of
| the database world. SQL is your annoying dad that tries to
| institute curfews and won't let you go to underage drinking
| parties. In the end, it's the latter you're going to be
| calling from a holding cell, but there's a lot more fun you
| can have with the former.
| funcDropShadow wrote:
| In the case of SQL/NoSQL I think it is the other way
| around. In order to use a NoSQL database to its strength,
| you have to know all access paths up front [1]. Because
| you are designing your table structure with all the
| queries in mind. In SQL on the other hand, you can always
| go for a normalized structure and you are pretty safe to
| be prepared for most requirements that are coming in the
| future.
|
| Additionally, I think drafting a DDL schema is a great
| way to learn about a new application domain. It forces me
| to ask hard questions. And that improves my understanding
| of the domain. I guess that is similar to some people who
| like to prototype using Haskell type signatures.
|
| [1] Unless you mean specifically a document store without
| schemas.
| jrop wrote:
| > If I don't know what I'm doing to begin with, I don't
| need something telling me all the ways in which I'm doing
| it wrong.
|
| This is unrelated to this conversation, but this is my
| main beef with Rust. I love Rust (like a lot), but it's
| just not good for prototyping. End of non-pertinent rant.
| oarabbus_ wrote:
| Honestly, I think it's because a lot of folks go their
| entire careers avoiding set-based thinking. Many of these
| folks are talented programmers so I would assume they'd
| become quite decent at SQL if they devoted time to it. I'm
| speaking more on the DML side than the DDL side here.
|
| In my experience <20% of developers are good enough to be
| dangerous with SQL, and maybe 5% what I'd consider adept.
|
| The rest range from "SELECT * is as far as I'll go; where's
| the ORM" to "why the hell are you using cursors to
| aggregate"
|
| SQL is powerful, elegant, and reliable. With modern DB
| support of JSON, ARRAY, statistical functions, and much
| more, SQL is probably the #1 most underutilized/improperly
| leveraged tool today. SQL-killers have been coming out for
| 40 years now and (for its domain!) SQL's lead is pulling
| farther away if anything.
|
| *yes there are some questionable implementations, so please
| replace "SQL" with "PostgreSQL" if nonstandard SQL
| implementations have caused nightmares for you in the past.
| omginternets wrote:
| Where can I learn about the relationship between set-
| theory and SQL?
| oarabbus_ wrote:
| I'm not too well-versed in the academic/mathematical
| background of SQL, other than to say it's closer to
| relational algebra than Set theory, and the resource to
| read is probably the 1970 Codd paper:
| https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
|
| SQL does implement the set operations of UNION,
| INTERSECT, and EXCEPT, but I meant "thinking in sets"
| more colloquially.
|
| edit: this is more practical than theoretical but the
| author here actually does a nice job of discussing "set-
| based" thinking vs. procedural
| https://www.itprotoday.com/open-source-sql/fangraphs-
| tags-cl...
| randomdata wrote:
| SQL has some annoying quirks but on balance I am not sure
| there is the distaste for SQL that you claim. It's fine,
| even good in many circumstances, but also too low level for
| some tasks which pushes common problems onto application
| developers to solve over and over again or build up layers
| of dependencies to utilize someone else's solution. That's
| not the fault of SQL but the fault of the RDBMS.
|
| Our layers of abstractions atop SQL along with approaches
| taken by DBMSes outside of the common relational names have
| shown that there is room for improvement within those
| tasks, able to be done natively by the DBMS, and it does
| not have to come at the expense of losing SQL when you need
| to be able to describe lower level questions. Different
| tools for different jobs, but little will within the RDBMS
| space to explore those other tools.
| zwkrt wrote:
| I don't think that people have a distaste for SQL, I think
| that large companies have a vested interest in trying to
| sell NoSQL solutions because they more tightly integrate
| into a "cloud" ecosystem. A database solves that problem
| domain too well, it's not marketable.
| eurasiantiger wrote:
| Can you write an SQL query to return arbitrary JSON?
| Returning queried data in a nested form is a must nowadays.
| ElectricalUnion wrote:
| > Can you write an SQL query to return arbitrary JSON?
| Returning queried data in a nested form is a must
| nowadays.
|
| Yes. https://www.sqlite.org/json1.html
| TheaomBen wrote:
| Also https://www.postgresql.org/docs/current/functions-
| json.html
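As a concrete sketch of the SQLite side of this, json_object and json_group_array can nest a join result into a JSON document. The schema and data below are invented for illustration, and the example assumes an SQLite build with the JSON functions enabled (Python's bundled sqlite3 normally has them):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                    title TEXT);
INSERT INTO authors VALUES (1, 'grech');
INSERT INTO posts VALUES (1, 1, 'Things to know about databases'),
                         (2, 1, 'More notes');
""")

# json_object builds the outer object; json_group_array nests each
# author's post titles into a JSON array, one document per author.
row = conn.execute("""
    SELECT json_object('name', a.name,
                       'posts', json_group_array(p.title))
    FROM authors a JOIN posts p ON p.author_id = a.id
    GROUP BY a.id
""").fetchone()

doc = json.loads(row[0])
print(doc["name"], doc["posts"])
```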
| slaymaker1907 wrote:
| Yeah, a lot of RDBMS are adding JSON support, but the support
| is often a bit clunky to use and may be missing important
| features. If you're dealing with a bunch of semistructured
| APIs that return JSON natively, Mongo makes it really easy to
| just dump all that into a collection and then just add
| indices as needed.
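For comparison, the relational JSON support can get part of the way there: dump raw JSON into a text column, then add an expression index over a field after the fact. A sketch with SQLite (table and field names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (body TEXT)")  # raw JSON documents

conn.execute("""INSERT INTO events VALUES
    ('{"user": "alice", "type": "click"}'),
    ('{"user": "bob", "type": "view"}')""")

# Index a field inside the JSON after the fact, roughly analogous
# to adding a Mongo index on an existing collection.
conn.execute(
    "CREATE INDEX idx_user ON events (json_extract(body, '$.user'))")

rows = conn.execute(
    "SELECT body FROM events "
    "WHERE json_extract(body, '$.user') = 'alice'").fetchall()
print(rows)  # only alice's document matches
```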
| jjav wrote:
| > Except the most important problem: A pleasant API
|
| A pleasant API is clearly not the most important business
| problem a database is there to solve.
|
| The data in it is presumably the lifeblood of the business,
| whereas the API is something only developers need to deal
| with.
|
| But that aside, the interface will be SQL which is quite
| powerful, long-lived (most important) and, fortunately, very
| pleasant.
| wrs wrote:
| From the SERIALIZABLE explanation: "The database runs the queries
| one by one ... It is essential to have some retry mechanism since
| queries can fail."
|
| I know they're trying to simplify, but this is confusing. If the
| first part is true, the second part can't be. In reality the
| database does execute the queries concurrently, but will try to
| make it _seem_ like they were done one by one. If it can't manage
| that, a query will fail and have to be retried by the
| application.
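The retry side can be sketched generically. Here SerializationFailure is a stand-in for whatever error the driver raises when the database aborts a transaction it could not serialize (e.g. SQLSTATE 40001 in Postgres), and the flaky transaction is faked for illustration:

```python
import time

class SerializationFailure(Exception):
    """Stand-in for the driver's error when the database aborts a
    transaction it could not serialize (e.g. SQLSTATE 40001)."""

def run_with_retry(txn, attempts=5, base_delay=0.01):
    # Re-run the whole transaction: under SERIALIZABLE the database
    # may abort one of two conflicting transactions, and the
    # application is expected to retry it from the top.
    for attempt in range(attempts):
        try:
            return txn()
        except SerializationFailure:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

calls = {"n": 0}

def flaky_transfer():
    # Fake transaction that conflicts twice before it succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise SerializationFailure()
    return "committed"

result = run_with_retry(flaky_transfer)
print(result, calls["n"])  # committed 3
```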
| googletron wrote:
| I believe there was a caveat around this exact point later
| in the post. It was really tough striking a balance between
| people learning this for the first time and a more
| knowledgeable audience without confusing either further.
|
| I do appreciate the feedback and will look to add some more
| color here! Thank you!
| blupbar123 wrote:
| It's still saying something that isn't true. Ideally one
| would find wording that doesn't confuse beginners but is
| also factual, IMHO.
| r0b05 wrote:
| Nicely written and informative!
| googletron wrote:
| Thank you!
| jrm4 wrote:
| To go big picture; I'm kind of glad databases are largely like
| cars in this respect, in ways that other software tooling isn't.
|
| Which is to say they're frequently good enough such that the
| human working with them on whatever level can safely _not know_ a
| lot of these details and get a LOT done. Kudos to whoever
| deserves them here.
| charcircuit wrote:
| Isn't that true for almost all software? You only need to
| know the implementation of a small subset of parts. I would
| say databases are worse, since you need to know how they
| are implemented or else you will start writing O(rows)
| queries or doing other inefficient things.
| jrm4 wrote:
| Going broadly (which is all I can do, because I teach this
| stuff and don't build in depth) -- "the database" is the
| part I can most easily "abstract" away, as if it were
| walled off.
|
| As opposed to aspirationally discrete classifications that
| end up being porous, e.g. MVC, "object-oriented", etc.
| dennalp wrote:
| Really nice guide.
| otherflavors wrote:
| why is this tagged "MySQL" but not also "SQL"
| googletron wrote:
| Thanks! Added!
| googletron wrote:
| This is a quick rundown of database indexes and
| transactions. Excited to continue sharing these notes with
| the community!
| mgrouchy wrote:
| I have been really enjoying the content so far, any hints
| on what's coming up?
| googletron wrote:
| We have another couple of notes from a few companies like
| Temporal, Sentry, and Gadget.
| itsmemattchung wrote:
| Awesome. Looking forward to additional content.
|
| If possible, would be great to get Marc Brooker (Principal
| at AWS) to provide some notes on bridging the gap between
| CAP theorem and how AWS relaxes constraints for building
| out Elastic Block Storage (EBS).
|
| https://brooker.co.za/blog/2014/07/16/pacelc.html
| googletron wrote:
| I will definitely reach out! Thanks for the suggestion!
___________________________________________________________________
(page generated 2022-06-27 23:00 UTC)