[HN Gopher] DELETEs Are Difficult
       ___________________________________________________________________
        
       DELETEs Are Difficult
        
       Author : radimm
       Score  : 125 points
       Date   : 2024-11-25 06:46 UTC (6 days ago)
        
 (HTM) web link (notso.boringsql.com)
 (TXT) w3m dump (notso.boringsql.com)
        
       | aurareturn wrote:
       | DELETE FROM films;
       | 
        | I'm surprised databases make it so easy to just delete an
        | entire table. I think the command should be
        |                 DELETE FROM films YES-I-KNOW-WHAT-I-AM-DOING;
        
         | magicalhippo wrote:
         | Agreed, I've long been thinking that DELETE should require a
         | WHERE clause.
         | 
         | If you really want to just delete everything, just do WHERE
         | 1=1.
        
           | buro9 wrote:
           | UPDATEs should require a WHERE clause too.
           | 
           | At which point we could just say all SQL should have a WHERE
           | clause.
        
             | jaredsohn wrote:
             | I think SELECTs without WHERE clauses are fine
        
               | marcosdumay wrote:
                | But requiring them is fine too. Even more so on
                | Postgres, which has booleans.
        
             | ahoka wrote:
             | And queries should start with WHERE.
        
               | awestroke wrote:
               | FROM, then WHERE, then SELECT
        
               | hamandcheese wrote:
               | This is the way.
        
               | Jabbles wrote:
               | Pipe Syntax In SQL
               | 
               | https://research.google/pubs/sql-has-problems-we-can-fix-
               | the...
               | 
               | https://news.ycombinator.com/item?id=41338877
        
           | yen223 wrote:
           | It would be nice if there was a psql option that would warn
           | you if you're about to do any kind of update, delete or
           | insert without doing a BEGIN;
           | 
           | (You could always set autocommit=false, which forces you to
           | explicitly commit or rollback every action, but that has its
           | own baggage)
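            | 
            | For the autocommit route, the psql side is just:
            |                 \set AUTOCOMMIT off
            | 
            | after which every statement opens an implicit transaction
            | that you must end with COMMIT or ROLLBACK.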
        
         | jsemrau wrote:
          | MySQL has this as the default, as far as I recall. But then
          | I never delete, I just set "deleted" to yes.
        
           | physicsguy wrote:
            | That just wouldn't fly where you have business customers;
            | many insist on data deletion at the end of contracts. In
            | practice, though, partitioning or using separate databases
            | can be a better strategy for dealing with that, as
            | otherwise dealing with backups is challenging.
        
           | kijin wrote:
           | It might depend on the version, but last time I checked,
           | DELETEing an entire table was much slower than TRUNCATE
           | TABLE.
        
             | eddd-ddde wrote:
             | I'm pretty sure that only applies to Postgres.
        
               | magicalhippo wrote:
               | Sybase SQLAnywhere as well, and not unlikely MSSQL too
               | given its shared ancestry.
               | 
                | Delete with WHERE is sufficiently slow in MSSQL that
                | we have to do batched deletes, but I can't recall
                | offhand if that holds for whole-table deletion as
                | well.
        
           | evanelias wrote:
           | You're probably thinking of the --safe-updates option [1] for
           | the `mysql` CLI, also available as the memorable alias
           | --i-am-a-dummy. This requires UPDATE and DELETE to have
           | either a WHERE clause or a LIMIT clause. Under the hood, the
           | command-line client option just manipulates the
           | sql_safe_updates session variable [2] to enforce the UPDATE
           | and DELETE requirement, as well as a couple other unrelated
           | variables to prevent overly-huge SELECTs.
           | 
            | It's not enabled by default out of the box, but some
            | companies do override their configuration to enable it for
            | new sessions; iirc Facebook did this.
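            | 
            | For example, per session:
            |                 SET SESSION sql_safe_updates = 1;
            |                 -- UPDATE/DELETE without a key-based WHERE
            |                 -- or a LIMIT will now raise an error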
           | 
           | [1] https://dev.mysql.com/doc/refman/8.4/en/mysql-
           | tips.html#safe...
           | 
           | [2] https://dev.mysql.com/doc/refman/8.4/en/server-system-
           | variab...
        
         | RedShift1 wrote:
         | There is pg-safeupdate: https://github.com/eradman/pg-
         | safeupdate
        
         | cruffle_duffle wrote:
         | That is why you always get in the habit of wrapping your stuff
         | in "BEGIN TRANSACTION". Then if and when you fuck up you can
         | issue a rollback and be all good.
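          | 
          | E.g. (Postgres flavour; table name illustrative):
          |                 BEGIN;
          |                 DELETE FROM films WHERE id = 42;
          |                 -- check the reported row count, then
          |                 COMMIT;   -- or ROLLBACK;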
        
         | lowbloodsugar wrote:
         | I've deleted the _internet_??
        
       | orionblastar wrote:
        | Most databases I've used have a Status column we could mark
        | as active, inactive, or deleted. That way, you can see which
        | records were marked as deleted and change them back in case
        | of accidental deletion.
        | 
        | Handle record retention with the Date_Modified column so you
        | can use a SQL DELETE to remove those deleted records that are
        | older than a year or so.
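        | 
        | A sketch of that pattern (names illustrative):
        |                 UPDATE films
        |                    SET status = 'deleted',
        |                        date_modified = now()
        |                  WHERE id = 42;
        | 
        |                 DELETE FROM films
        |                  WHERE status = 'deleted'
        |                    AND date_modified < now() - interval '1 year';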
        
         | arielcostas wrote:
         | I do something similar, but instead keep a "date_deleted"
         | column null by default, and the "active" column as a boolean.
         | 
          | That way, I kill two birds with one stone by having a
          | dedicated column for the deletion time (instead of updating
          | a record that is supposedly deleted) and the status as just
          | a boolean instead of some enum, integer, or string.
        
         | alexanderscott wrote:
          | This is a "soft delete". As the author notes, depending on
          | the nature of the data being stored, a soft delete does not
          | meet the requirements of many data privacy laws and
          | compliance regulations (like GDPR's right to erasure).
        
           | isbvhodnvemrwvn wrote:
            | And in Postgres a soft delete is more expensive than a
            | regular delete: the soft delete is an UPDATE, which writes
            | a new row version (effectively an insert) and marks the
            | old one dead, while a DELETE just marks the row dead.
        
           | kijin wrote:
           | There are different kinds of soft delete.
           | 
           | I've had cases where the rows in question absolutely could
           | not be hard deleted, because of legacy foreign key relations.
           | But the PII in those rows had to go. So we did a kind of
           | "firm delete" by setting all columns (except the PK and a few
           | necessary flags) to their default values and/or null.
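            | 
            | Roughly (columns illustrative):
            |                 UPDATE users
            |                    SET name = NULL, email = NULL,
            |                        phone = NULL, pii_erased = true
            |                  WHERE id = 42;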
        
       | Terr_ wrote:
        | For many of the most painful deletion questions, the root
        | problem is that when the software was first made, the
        | stakeholders/product-org didn't think about use-cases for
        | deleting things. At best, they assume a "do not show" property
        | can be placed onto things, which falls apart when you get to
        | legal issues that compel actual removal.
        
         | buildingcrash7 wrote:
         | >software was first made the stakeholders/product-org
         | 
          | Practically all building, physical or software, is done for
          | a purpose first; the process is mainly an obstacle that
          | needs to be minimized. A piece of software is trying to
          | solve a problem just like a door is. It's driven by
          | economics, where the recipients don't want to pay any more
          | than they need to and someone is always willing to undercut
          | you by cutting corners.
        
         | tempodox wrote:
         | That depends on the context. In some cases you're not allowed
         | to physically delete because you need to provide an audit trail
         | for 10 years or so. You could move those rows into separate
         | audit tables instead. However, that requirement should not come
         | as a surprise.
        
         | BlueTemplar wrote:
          | But, just like this article using <<physically deleted>>
          | when in practice it's not the case (the bits are just freed
          | to be overwritten some unknown amount of time later), does
          | legal compliance just completely ignore this fact of actual
          | physical deletion?*
         | 
         | (AFAIK it takes several passes of overwriting bits with random
         | bits on magneto-mechanical storage to not be able to retrieve
         | any significant fragments of the original data, and things are
         | even worse on transistor storage, which casually makes copies
         | of the data for wear leveling reasons.)
         | 
         | *With the exception of state secrets of course, where we know
         | that storage is mechanically destroyed <<with extreme
         | prejudice>>.
        
       | yen223 wrote:
       | A big asterisk that should be added to the article is that all
       | that applies to Postgres.
       | 
       | Other databases have their own deletion mechanisms (and deletion
       | quirks)
       | 
       | It's a very good article otherwise.
        
       | jandrewrogers wrote:
       | DELETE is expensive at a deep fundamental level that we don't
       | think about much in computer science because we are more worried
       | about losing data. The article is about Postgres but it
       | generalizes. We don't actually have any computer science for
       | DELETE optimized databases. I've idly looked into delete-
       | optimization in databases as thought experiments, since there
       | isn't much in the way of literature on it, and it is far more
       | difficult than I think many people intuit. The article is partly
       | a manifestation of this reality.
       | 
       | I think the nature of DELETE is one of the more interesting open
       | problems in computer science. It is one of those things that,
       | when we require precision, turns out to be very difficult to
       | define.
        
         | redox99 wrote:
          | I think garbage-collected memory management can be thought
          | of as a delete-optimized database.
        
           | tybit wrote:
           | Runtimes with garbage collectors typically optimize for
           | allocation, not deletion.
        
             | mike_hearn wrote:
             | Generational GC optimizes for both. They assume that most
             | objects die young, so choose to relocate live objects and
             | just mark the entire region that was evacuated as empty. So
             | this is a very efficient way to delete data.
        
           | SgtBastard wrote:
           | Only if you profoundly misunderstand what GC is.
        
           | actionfromafar wrote:
            | I have long day-dreamed of what a "use it (soon) or lose
            | it" runtime would mean. Allocated blocks would just expire
            | after a set time.
        
             | okasaki wrote:
             | So kind of like browsers work now with the tab unloading
             | and Android killing apps?
             | 
             | Personally I find it really obnoxious and disruptive.
        
               | sitkack wrote:
               | You would prefer something a little more persistent?
        
               | dotancohen wrote:
               | I would prefer a 48 hour day, so that I could get
               | everything done that needs to be done.
               | 
               | Or maybe a 72 hour day, so I'd have time for things that
               | I'd just enjoy.
        
               | okasaki wrote:
               | I prefer to do my own app lifecycle management.
        
         | mpweiher wrote:
         | Yes...but it goes even deeper.
         | 
          | For example, in physics, the paradox of Maxwell's demon is
          | resolved when you consider the cost of deleting data:
         | 
         | "In 1982, Charles Bennett showed that, however well prepared,
         | eventually the demon will run out of information storage space
         | and must begin to erase the information it has previously
         | gathered.[8][12] Erasing information is a thermodynamically
         | irreversible process that increases the entropy of a system."
         | 
         | https://en.wikipedia.org/wiki/Maxwell's_demon#Recent_progres...
         | 
         | It is also difficult for humans to delete information. In my
         | humble and only a little facetious opinion this is one of the
         | main drivers for ever new "To Do" apps: the existing app gets
         | full because deleting is too hard, so we start fresh with a new
         | app. The app isn't the point, the starting fresh is.
         | 
          | The underlying reason is that the cost of maintaining small
          | (to-do) notes can be greater than the value of the note,
          | which is one of the reasons we still use little scraps of
          | paper and other mechanisms that will effectively
          | auto-delete.
         | 
         |  _Understanding the micronote lifecycle: improving mobile
         | support for informal note taking_
         | 
         | https://dl.acm.org/doi/10.1145/985692.985779
        
           | rpier001 wrote:
           | We need a to-do app following these principles... then
           | another for product management.
        
           | loa_in_ wrote:
           | Just add a __delete all__ button.
        
             | mycall wrote:
              | My solution to most task-list items is to just wait
              | long enough. Eventually the items are often no longer
              | needed.
        
             | mpweiher wrote:
             | Yeah, that _should_ work.
             | 
             | Alas, it doesn't.
        
           | lukevp wrote:
            | Every day I'd have more and more state accumulating on my
            | machine - open apps, unsaved changes, open tabs. I've
            | tried many methods for preventing this from happening
            | over the years, but the only effective solution, which
            | I've been using for the last year, is a script I wrote
            | that closes every browser tab and quits every open app
            | (leaving apps with unsaved changes still running) every
            | evening. I wake up and the machine is new and fresh. It's
            | amazing and has benefited my productivity so much. It
            | changes the state game: if I have a specific task I want
            | to resume tomorrow, I have to make myself a clear note to
            | read the next day, and if I have tabs open that I care
            | about, I have to bookmark them. What I've found is that
            | the actual info that needs to be transferred between days
            | is very small.
           | 
           | I have an endless reminders app and todo list. I wonder if
           | something similar (items expire automatically unless you flag
           | them as permanent or something) would help keep a clearer
           | list. Sometimes ephemerality is best!
        
             | wenc wrote:
             | > Every day I'd have more and more state accumulation on my
             | machine - open apps, unsaved changes, open tabs.
             | 
             | I resonate with your comment.
             | 
             | I grew up during a time when PC state was ephemeral (DOS
             | days). Unsaved changes essentially meant lost data. Open
             | apps were gone once the computer was shut down (and the
             | computer was shut down daily - a 250W power supply had an
             | expensive power draw; in contrast a Mac Mini only sips 1W
             | when sleeping). This helped me develop a habit of keeping
             | only necessary things open, bookmarking what I want to
             | keep, and of habitually pressing Ctrl+S to save data. I
             | never keep any tabs open (my browser loses all tabs upon
             | close anyway), and my inbox is zero.
             | 
             | The cost I pay for this is context recovery -- every day, I
             | have to essentially set everything up again. I do write
             | notes or leave comments in code to remind myself where I
             | left off, but I essentially started fresh. But there is an
             | upside to this: I start each day from an uncluttered slate,
             | which leads to clarity in my head. When context is
             | ephemeral, I'm more likely to not be beholden to an
             | existing state.
             | 
             | This actually helps me increase the quality of my writing
             | and my code. It's akin to the heuristic of "throwing away
             | your first draft and rewriting" to achieve higher quality.
             | Write your code once, delete it, and write it again from
             | scratch. The next version takes a quarter of the time to
             | write but can be twice the quality because you've
             | prototyped it in your head but you are not bound to your
             | crufty first attempt. There was an HN discussion on this a
             | while ago:
             | 
             | https://grantslatton.com/software-pathfinding
             | 
             | If you save state, you can resume more quickly, which is
             | great if you're on a roll, but it's also an accumulation of
             | more clutter that blocks certain more creative thoughts.
        
           | alexwasserman wrote:
           | I like the idea of a todo list that comes with a built in
           | auto-delete.
           | 
           | You either do your to dos, or it auto-deletes them for you.
           | No worry about it getting full, but also some pressure to
           | actually get them done or they'll be wiped.
           | 
           | And if you're happy they're wiped, then you probably didn't
           | need to do it at all.
           | 
           | I wonder if there's something like that already.
        
             | teitoklien wrote:
             | you can already do it, with Apple's native Shortcuts app
             | and Reminders app. You can set an automation to delete all
             | reminders after a fixed time has passed with the 'Remove
             | Reminders' action
             | 
             | Otherwise you can just use cronjobs with a small python
             | snippet to parse and decide what reminders tagged with
             | which labels to delete after X,Y,Z many days and hit the
             | APIs of most major todo apps like todoist, asana, etc to
             | just delete those tasks. Heck works with your local
             | markdown based todo lists too.
        
           | MetaWhirledPeas wrote:
           | > The app isn't the point, the starting fresh is.
           | 
           | This is the crux of what made Google Inbox so good. The UX
           | encouraged "deleting" and starting fresh. This was made
           | possible not just through the controls, but also through the
           | promise that undelete would be possible. People want to start
           | fresh, but they also don't want to lose anything; that's the
           | conundrum.
        
         | akira2501 wrote:
         | > We don't actually have any computer science for DELETE
         | optimized databases.
         | 
          | There is actually a fair amount if you consider databases
          | with fixed-length records, which used to be the dominant
          | form.
        
         | brightball wrote:
         | Yep, it's tough. One of the more unexpectedly complicated
         | aspects of GDPR compliance is the strain that full deletes put
         | on a system.
         | 
         | You have to break it into small batches across the cascade of
         | associated data.
        
         | epistasis wrote:
         | There was a common thread through all of my algorithms and data
         | structures class: Hash maps, b-trees, etc. are all beautiful
         | structures until you add a delete operation and have to start
         | dealing with all those little holes... Removing data
         | complicates everything.
        
         | mike_hearn wrote:
         | What about LSM trees? Something like RocksDB is very efficient
         | at deleting. A delete operation is a tiny write (a tombstone)
         | and then the actual deletion is done via compaction in the
         | background with entire tablets being freed at once when the
         | live data is evacuated. It's actually most efficient when
         | deleting large ranges - when just replacing data it's not so
         | efficient due to the write amplification.
         | 
         | That said, I agree with you in general. An under-used technique
         | is to simply encrypt everything and then use a key store that
         | is guaranteed capable of deleting data. This makes it easy to
         | comply with legal deletion requests even across backups, though
         | of course, you need to manage the keys very carefully and
         | ensure that they are also backed up.
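          | 
          | A sketch of the encrypt-and-forget idea with pgcrypto
          | (schema and key handling illustrative):
          |                 INSERT INTO user_data (user_id, payload)
          |                 VALUES (42, pgp_sym_encrypt('private stuff',
          |                                             'user-42-key'));
          | 
          |                 -- "erasure" = destroying the key; the
          |                 -- ciphertext left in tables and backups
          |                 -- becomes unreadable
          |                 DELETE FROM user_keys WHERE user_id = 42;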
        
           | selecsosi wrote:
           | +1 This was our strategy at TempoIQ for our ts storage engine
           | (built on top of rocks).
           | 
           | Very efficient and effective at managing tons of data
           | ingestion (and deletion) at scale.
           | 
            | It's not an easy tech to build on top of out of the box,
            | though, when you have to build all the
            | analytics/management pieces that something like PG gets
            | you, so I get the lack of public examples.
        
           | latency-guy2 wrote:
            | Well, tombstoning fundamentally punts the operation; the
            | data is still there, taking up space and computation, if
            | the flagged entry doesn't get filtered out at the various
            | levels of query planning.
           | 
            | I agree that it meets the requirements for batched
            | DELETE, and that's likely the best we can make it.
           | 
           | But I wonder if there was a better way. I know there are
           | research DBs out there that experimented with reusing the
           | tombstone entry for new INSERT/UPDATE operations, but these
           | suck when you want to do batched INSERT/UPDATE on a range
           | since they're scattered all about in a table, and you lose
           | ordering + monotonic properties.
        
         | kccqzy wrote:
          | It's absolutely true that we don't think about it much.
          | When I was first taught balanced binary trees, deletion
          | wasn't even a topic that needed to be learned. Same thing
          | later when I was taught B-trees. Then again in hash tables.
          | It's an operation that's overlooked in CS education.
        
         | toast0 wrote:
         | > We don't actually have any computer science for DELETE
         | optimized databases.
         | 
         | Depending on how much deleting and when, there might be
         | engineering if not science for this.
         | 
         | If everything is deleted on a schedule, partitioned databases
         | and dropping whole partitions as they expire is a well worn
         | path. Soft delete and compaction also works pretty well if
         | most, but not all things will be deleted. A generational
         | garbage collection kind of thing.
         | 
         | As others said, fixed sized records are easier to manage
         | deletion/replacement with, too.
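          | 
          | The Postgres flavour of that well-worn path (sketch):
          |                 CREATE TABLE events (ts timestamptz NOT NULL)
          |                     PARTITION BY RANGE (ts);
          |                 CREATE TABLE events_2024_11
          |                     PARTITION OF events
          |                     FOR VALUES FROM ('2024-11-01')
          |                     TO ('2024-12-01');
          |                 -- expiry is a metadata operation, not a
          |                 -- row-by-row DELETE:
          |                 DROP TABLE events_2024_11;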
        
         | hinkley wrote:
          | If someone dumped that project in my lap and said "fix it"
          | (and if I were more used to low-level programming), I'd
          | probably start by refreshing myself on the last 10+ years
          | of GC advances since I stopped reading SIGPLAN,
          | particularly multithreaded sweep. Because essentially you
          | want to decouple delete from free so you don't have to do
          | 100% of the bookkeeping work in the middle of
          | time-sensitive operations, but without falling as far
          | behind as Postgres can with its vacuuming albatross.
         | 
         | In a way, deletion is a form of eventual consistency. The user
         | loses access to the data but the system still knows about it
         | for a while.
         | 
         | Just off the top of my head, I would think for LSM systems, you
         | would resort to snapshotting as the edit history became much
         | larger than the retained row count, and as you delete old
         | snapshots (two GC roots) you could compare the old and the new
         | and drop everything that didn't survive. You only have to
         | finish well before the next snapshot interval, and if you
         | maintain a queue you only have to process them on average as
         | fast as the snapshot interval.
         | 
         | And for BTree systems you can amortize the deletes across every
         | insert, the way some realtime systems clean up a few free
         | pointers on every allocation.
        
       | Svip wrote:
       | > Unlike DELETEs, UPDATEs don't trigger cascaded actions - they
       | only involve triggers that are explicitly defined.
       | 
        | That's not entirely true. ON UPDATE CASCADE is a thing for
        | foreign keys, at least in PostgreSQL (which this article is
        | talking about), meaning that the referencing row's foreign-key
        | column gets updated. Though, personally, I would never use ON
        | UPDATE CASCADE, as it seems kind of funky.
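        | 
        | A minimal sketch (names illustrative):
        |                 CREATE TABLE review (
        |                     film_id int REFERENCES films (id)
        |                         ON UPDATE CASCADE
        |                 );
        |                 UPDATE films SET id = 999 WHERE id = 1;
        |                 -- review.film_id follows automatically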
        
         | arcanemachiner wrote:
         | > Though, personally, I would never use ON UPDATE CASCADE, as
         | it seems kind of funky.
         | 
         | As a relative noob in the world of database administration, I'm
         | glad to hear that someone else feels this way.
        
           | pepelotas wrote:
            | It does if your key is an auto-increment or random unique
            | identifier. But if you had a key that is also data, say a
            | "Genre" column, it starts to make sense that you'd want
            | to cascade updates.
        
         | rrr_oh_man wrote:
         | > Though, personally, I would never use ON UPDATE CASCADE, as
         | it seems kind of funky.
         | 
         | Why?
        
           | Svip wrote:
           | Personally, I like to be explicit and in control. In the
           | application layer, I may be far away (at least mentally
           | speaking) from the constraints in the database, and if I
           | update/delete something, I don't want it to "magically"
           | cascade through the database. For those reasons, I always
           | prefer RESTRICT, both for ON DELETE and ON UPDATE. This
           | forces me to clean up before I make the actual change I'm
            | interested in, and anyone reading the code later can
            | uncover that behaviour quickly.
           | 
            | That being said, I can see the arguments for ON DELETE
            | CASCADE, particularly in a codebase where there is a lot
            | of functionality _in_ the database itself (in the form of
            | stored procedures, and what have you), since you are
            | always mentally closer to the action. But ON UPDATE
            | CASCADE feels weird, because why are you updating the
            | primary key (which is usually what foreign keys
            | reference) of your rows? That feels like something that
            | needs a good explanation.
           | 
            | Though, I do recognise, you need to jump through a lot of
            | hoops to modify your primary key values with ON UPDATE
            | RESTRICT, because you basically need to cover all your
            | bases in a large convoluted common table expression
            | (depending on the number of foreign keys, of course),
            | when an ON UPDATE CASCADE would do that for you. But I'd
            | also rather be blocked from updating primary key values
            | entirely, since it feels like the _wrong_ thing to do.
            | (Yes, I know that foreign keys don't _have_ to reference
            | other primary keys, and there may be niche cases for
            | this, but personally, I'd just avoid it altogether.)
        
             | rrr_oh_man wrote:
             | Thanks for the explanation.
             | 
             | The first case I was thinking about was a merger of two
             | accounts. Second case would be a db-wide update of one or
             | more keys for some reason. That's why I tend to leave it ON
             | UPDATE CASCADE.
             | 
             | Although both of these cases are more diffuse muscle memory
             | than rational foresight.
        
             | SoftTalker wrote:
             | > if I update/delete something, I don't want it to
             | "magically" cascade through the database.
             | 
             | This is the big reason I don't like triggers either. I
             | would use them in only one case: if the database didn't
             | support an auto-incrementing "identity" type, I might use a
             | trigger to simulate that. But just as often I would prefer
             | a stored procedure that got the new ID from a sequence, and
             | then did the insert, especially if there were other things
             | I needed a stored procedure to deal with.
        
         | hamandcheese wrote:
         | > That's not entirely true. ON UPDATE CASCADE is a thing for
         | foreign keys, at least in PostgreSQL (which this article is
         | talking about), meaning that the foreign row referencing the
         | row gets updated.
         | 
         | For this to work with soft-deletes, wouldn't you have to
         | include the soft-delete column in a compound foreign key? That
         | sounds funky indeed.
        
         | SoftTalker wrote:
         | You really shouldn't be updating primary key values though.
        
           | magicalhippo wrote:
           | While I overall agree, there have been cases where I have
           | found it handy.
           | 
           | In one case, we had a table of invoices and a table of goods
           | items, and each goods item should point to an invoice.
           | 
           | If one wants to use the natural key, invoice number, and have
           | a foreign key to ensure the goods items can't point to an
           | invoice that doesn't exist, then ON UPDATE CASCADE was needed
           | in case the user had to change the invoice number (due to
           | mistyping it or similar).
           | 
           | Of course, if one does not use a natural key then this isn't
           | such an issue.
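            | 
            | Sketched (names illustrative):
            |                 CREATE TABLE goods_item (
            |                     id         bigserial PRIMARY KEY,
            |                     invoice_no text NOT NULL
            |                         REFERENCES invoice (invoice_no)
            |                         ON UPDATE CASCADE
            |                 );
            |                 -- correcting a mistyped invoice number
            |                 -- propagates to its goods items:
            |                 UPDATE invoice SET invoice_no = '1005'
            |                  WHERE invoice_no = '1050';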
           | 
           | Another case was where we've had to merge databases after two
           | companies merged. If they had overlapping primary key ranges,
           | then the easiest was to switch to ON UPDATE CASCADE and then
           | renumber the existing rows before inserting the other rows.
           | 
           | We'd change back after the merge though.
        
             | marcosdumay wrote:
             | > If one wants to use the natural key
             | 
             | Yeah, that's why you shouldn't use natural keys as primary
             | ones.
        
               | magicalhippo wrote:
               | Learned that lesson a bit too late for that.
               | 
               | But yea, these days I never use natural keys.
        
               | marcosdumay wrote:
               | I really hate that the educational literature says it's
               | an option. Yet everybody knows it's something you should
               | never do, no exceptions.
        
       | burntcaramel wrote:
        | If data isn't actually removed until vacuuming, then are
        | systems that perform SQL DELETEs actually GDPR compliant?
        | Because technically the private data is still there on disk
        | and could be recovered. "Until the autovacuum process or a
        | manual VACUUM operation reclaims the space, the "deleted"
        | data remains."
        
         | konha wrote:
         | Yes. GDPR allows for delays when complying with deletion
         | requests. You should ideally document it and factor the delay
         | into any deadlines you might be bound to.
         | 
          | You'd need to make sure the process is somewhat
          | predictable, like running the vacuum on a set schedule, so
          | you know for sure the maximum amount of time a deletion
          | request will take.
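          | 
          | E.g., a nightly job that just runs (table name
          | illustrative):
          |                 VACUUM films;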
        
         | lucianbr wrote:
         | If vacuum runs at least once per day, seems pretty GDPR
         | compliant to me. Even if it runs once every two or three days.
         | 
         | Now if your database runs vacuum once every 6 months, yeah,
         | DELETE might not actually be a delete. But is it really a GDPR
         | issue? What's really going on in this system?
         | 
         | I don't think any EU agency is going to fine your company if
         | the data you say you deleted survived 6 or even 60 hours after
         | deletion, if that is the end of it.
        
         | dataflow wrote:
         | Even vacuuming wouldn't actually destroy the data right?
         | Because filesystems don't guarantee they will overwrite or wipe
         | any particular disk blocks. And even if they did, SSDs still
         | wouldn't promise that the blocks aren't remapped instead of
         | being wiped & reused.
        
           | Polizeiposaune wrote:
           | > Because filesystems don't guarantee they will overwrite or
           | wipe any particular disk blocks.
           | 
           | Some filesystems have a richer interface to the underlying
           | storage device, allowing them to invoke commands such as ATA
           | TRIM or SCSI UNMAP - either incrementally as blocks are
           | freed, or on demand - which request that the underlying
           | storage device forget the block contents.
           | 
           | So the necessary interfaces exist and are widely available,
           | and even if imperfect they improve the situation.
        
             | dataflow wrote:
             | > Some filesystems have a richer interface to the
             | underlying storage device, allowing them to invoke commands
             | such as ATA TRIM or SCSI UNMAP
             | 
             | No, that's not a guarantee of data erasure. Not just
             | because it's just a request that the device can disregard,
             | but also because filesystems play tricks (like storing
             | small bits of data inline, or logging data in various
             | places, etc.) and they don't clear all those blocks just
             | because you wanted to clear a couple bytes.
        
           | __turbobrew__ wrote:
           | Yea the only way to be sure that data is gone is through
           | mechanical destruction (shredding) of the drives. Sometimes
           | you can write something to a SSD and then not be able to
           | delete it due to a hardware fault, but the data can still be
           | read.
           | 
           | I wonder if a GDPR nation has made a ruling on the extent of
           | data erasure? Surely you cannot expect a company to shred a
           | SSD every time someone asks for their data to be deleted.
           | 
           | With our current understanding of physics you cannot destroy
           | information outside of maybe throwing something in a black
           | hole -- and even then you may still be able to get the
           | information back from hawking radiation after many eons -- so
           | the question is how much should we scramble information
           | before it is considered "deleted"?
        
         | tzs wrote:
         | It's also likely still somewhere on backups.
         | 
         | GDPR supervisory authorities disagree on what to do about data
         | in backups. France has said you don't have to delete data from
         | backups. The Danish authorities have said that you have to
         | delete from backups where it is technically possible. The UK
         | (which still has GDPR) has said that you must put the data
         | "beyond use" which most have taken to mean that you have to
         | make sure that if you ever restore data from the backup you
         | will omit the data that is supposed to be forgotten.
         | 
          | I don't know what other supervisory authorities have said--
          | those three are just the ones that tend to show up when
          | searching on this topic. I would not be surprised if there
          | are at least a dozen other different opinions from the rest.
        
       | physicsguy wrote:
        | They are difficult, but the cost of trying to keep
        | consistency with the alternatives shouldn't be underestimated
        | either.
       | 
       | I sometimes think that people ask the wrong question on this sort
       | of thing - rather than thinking "what technical solution should I
       | come up with" you should be thinking "what is the business
       | requirement here, and what consistency guarantees are needed?"
       | 
       | In many cases you want to soft delete anyway and mark rows as
       | stale rather than deleting them wholesale. Cascade deletes need
       | to be very carefully thought about, as while they're very handy
       | they can be quite destructive if the relationships are not mapped
       | out.
       | 
       | Personally having spent some time now in the microservices hole,
       | I miss all the power SQL databases give you for this sort of
       | thing. I think everyone should spend some time reading and
       | digesting 'Designing Data Intensive Applications' and evaluating
       | the trade offs in detail.
        
       | xg15 wrote:
       | > _For example, deleting 1 million rows in a single transaction
       | is a textbook case of what not to do. Instead, splitting the
       | operation into smaller batches, such as deleting 10,000 rows
       | across 100 iterations, is far more effective._
       | 
       | Why do I as a user have to do that? Why can't the database
       | implement batching internally and automatically transform my
       | 1-million-rows query into an appropriate list of batched queries?
       | 
       | (Edit: Thanks a lot for the answers, that makes more sense - in
       | particular the point that this would also _lock_ one million rows
       | at once)
        
         | sureglymop wrote:
         | And a follow up question: would the current best way to handle
         | this be to "mark records as deletable" and then do the batched
         | deletion operations when convenient?
        
           | rawgabbit wrote:
            | Create a column called MarkedForDeletion. Create a job
            | that starts in the off hours, detects how many locks are
            | present on the table, and if the count is low deletes X
            | records; else it waits for Y minutes. Put this in a loop.
            | If an error is detected, break out of the loop.
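            | 
            | One iteration of that loop, as a Postgres sketch (names
            | illustrative; Postgres has no DELETE ... LIMIT, hence the
            | ctid subquery):
            |                 DELETE FROM films
            |                  WHERE ctid IN (SELECT ctid
            |                                   FROM films
            |                                  WHERE marked_for_deletion
            |                                  LIMIT 10000);
            |                 -- repeat until 0 rows are deleted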
        
         | Nican wrote:
         | Transactional consistency / ACID guarantees.
         | 
         | Before you execute the query, you should be able to query any
         | of the data, and after you execute the query, none of the data
         | should be available. The mechanisms to make a transactional
         | database is tricky.
         | 
          | Some databases, like CockroachDB, provide some built-in TTL
          | capabilities.
         | 
         | But also- if you are having to delete huge ranges of data and
         | do not care about consistency, you are probably looking at an
         | analytical workload, and there would be better databases suited
         | for that, like Clickhouse.
        
           | tremon wrote:
           | > none of the data should be available
           | 
            | As written, that's not required. The data should not be
            | retrievable by query, but ACID only specifies what
            | happens at the client-server boundary, not what happens
            | at the server-storage boundary (Durability prescribes
            | that the server must persist the data, but not _how_ to
            | persist it). A database that implements DELETEs by only
            | tombstoning the row and doesn't discard the data until
            | the next re-index or vacuum operation would still be
            | ACID-compliant.
        
         | asQuirreL wrote:
         | One reason I can think of is that the database needs to
         | maintain atomicity and isolate effects of any given operation
         | (the A and I in ACID).
         | 
         | By manually batching the deletes, you are telling the database
         | that the whole operation does not need to be atomic and other
         | operations can see partial updates of it as they run. The
         | database wouldn't be able to do that for every large delete
         | without breaking its guarantees.
        
           | wruza wrote:
            | I think that gp's comment can be reinterpreted as: why
            | should this landmine exist when databases could notify a
            | reader of its manual about this issue in an explicit way,
            | for example:
            |                 DELETE FROM t WHERE ... BATCH 100
            | 
            | Which would simulate batched queries when called outside
            | of a transaction. This would remove the need for a client
            | to be connected (or at least active) for the duration of
            | this lengthy operation.
            | 
            | If DELETE is so special, make special ways to manage it.
            | Don't offload what is _your_ competence onto a clueless
            | user; it's a recipe for disaster. Replace DELETE with
            | anything and it's still true.
            |                 ALTER DATABASE d SET UNBATCHED DELETE
            |                 LIMIT 500000
            | 
            | I know a guy (not me) who deleted rows from an OLTP table
            | that served a country-level worth of clients and put it
            | down for two days. That is completely the database's
            | fault. If its engine was designed properly for big data,
            | it should have refused to do so on a table with
            | gazillions of rows and suggested a proper way to do it.
        
             | magicalhippo wrote:
              | You kinda have that already for certain databases[1]
              | with DELETE TOP 100. We have a few cleanup tasks that
              | just run that in a loop until zero rows are affected.
             | 
             | That said, I agree it would be nice to have a DELETE BATCH
             | option to make it even easier.
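              | 
              | E.g., a T-SQL sketch (table and predicate
              | illustrative):
              |                 WHILE 1 = 1
              |                 BEGIN
              |                     DELETE TOP (100) FROM dbo.big_table
              |                      WHERE expired = 1;
              |                     IF @@ROWCOUNT = 0 BREAK;
              |                 END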
             | 
             | [1]: https://learn.microsoft.com/en-
             | us/sql/t-sql/statements/delet...
        
             | cogman10 wrote:
             | Rather than batching, I would want a "NO ROLLBACK DELETE"
             | sort of command. The real expensive part of the delete is
             | rewriting the records into the transaction log so that a
             | cancel or crash can undo the delete.
             | 
             | If you've gone to the effort of batching things, you are
             | still writing out those records, you are just giving the db
             | a chance to delete them from the log.
             | 
             | I'd like to save my ssds that heartache and instead allow
             | the database to just delete.
             | 
             | In MSSQL in some extreme circumstances, we've partitioned
             | our tables specifically so we can use the 'TRUNCATE TABLE'
             | command as delete is just too expensive.
             | 
              | That operation can wipe GBs in seconds.
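              | 
              | On SQL Server 2016+ you can even truncate individual
              | partitions (sketch; names illustrative):
              |                 TRUNCATE TABLE dbo.big_table
              |                     WITH (PARTITIONS (3));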
        
               | pixl97 wrote:
                | I would say it can unlink GBs in seconds. The data is
                | still on the disk until it's trimmed or overwritten.
        
               | immibis wrote:
               | So why does it need to be copied into the WAL log until
               | vacuum runs?
               | 
               | And vacuum is not expected or required to be atomic,
               | since it deletes data that was necessarily unreferenced
               | anyway, so it also shouldn't need to copy the old data
               | into WAL files.
        
               | mattashii wrote:
               | Many DBMSs with index-oriented storage (MySQL, Oracle,
               | MSSQL) use undo logging for a transaction's MVCC, so that
               | for deletion the old version is put into the undo log of
               | that transaction and referred to as an old version of the
                | record (or page, or ...), immediately cleaning up
                | space on the page for new data while the transaction
                | is still going on. This is great for short
                | transactions and record updates, as a page only has
                | to hold one tuple version at a time, but that comes
                | at the cost of having to write the tuples that are
                | being removed into a log, just in case the
                | transaction needs to roll back.
        
               | Kinrany wrote:
               | What happens when two transactions select and then delete
               | two rows in the opposite order while requesting "no
               | rollback"?
        
               | mike_hearn wrote:
                | Yes, the commercial databases make it easier to
                | handle this.
               | 
               | One simple way in Oracle is to take a table lock, copy
               | the data you want to preserve out to a temporary table,
               | truncate the target table, copy the data back in.
        
         | karrieberrie553 wrote:
          | Because they mean different things. Deleting all records in
          | a SINGLE transaction means something different from many
          | smaller transactions. Even if for most use cases you want
          | the latter, it may be the case that you need to lock the DB
          | down to perform the deletions in a single step.
        
         | martinsnow wrote:
         | Admin panels may do that. Say for example deleting a bucket
         | with a million rows.
         | 
         | In that case I'd fire off a job and tell the user, yeah sure
         | the delete is successful. But check the logs to be sure.
        
           | dotancohen wrote:
           | I show these as pending.
           | 
           | No message (other than an INFO log message) when the
           | operation completes successfully, but failures are handled
           | and notified properly.
        
       | cannibalXxx wrote:
        | This content reminds me of this post, where we can get an
        | idea of how to build complex and efficient queries.
        | https://chat-to.dev/post?id=724
        
       | redman25 wrote:
        | One solution for performance degradation with soft deletes is
        | to partition the table by some field like `created`, monthly.
        | The main downside is that queries will then need to include
        | `created`.
        
         | kccqzy wrote:
          | I've only heard of this trick being employed in OLAP
          | scenarios. Is this kind of partitioning also advisable for
          | OLTP workloads?
        
       | pmarreck wrote:
       | My only recommendation would be, no matter which strategy you go
       | with, cover it with tests to make sure the right information
       | stays and the right information gets physically (or marked)
       | deleted, and that data marked for deletion is invisible to the UI
       | except via admin access.
       | 
       | But, indeed, proper deletion is surprisingly difficult,
       | especially when you consider cascades on a complex table
       | containing many defined foreign-key relationships.
        
       | tonymet wrote:
        | Why do DBs perform delete operations online? Wouldn't it be
        | better to soft-delete (at the table-space level) and then run
        | a scheduled task to clean up the table spaces?
        | 
        | Similar to git. When you "delete" files, they are just
        | removed from the tree. It isn't until later, when all refs
        | and reflog references have expired and gc is run, that the
        | objects are actually removed.
        
         | MatmaRex wrote:
         | If you read the linked article, it explains that this is
         | exactly what they do.
        
           | tonymet wrote:
           | thanks i got the summary for free
        
       | junto wrote:
        | Reminds me of this old post about deleting large amounts of
        | data efficiently at MySpace. The page is gone now, but it was
        | archived.
       | 
       | https://web.archive.org/web/20090525233504/http://blogs.msdn...
       | 
        | Brent Ozar talked about this back in 2018, in reference to
        | working with large tables in SQL Server:
       | 
       | https://www.brentozar.com/archive/2018/04/how-to-delete-just...
        
       ___________________________________________________________________
       (page generated 2024-12-01 23:01 UTC)