[HN Gopher] DELETEs Are Difficult
___________________________________________________________________
DELETEs Are Difficult
Author : radimm
Score : 125 points
Date : 2024-11-25 06:46 UTC (6 days ago)
(HTM) web link (notso.boringsql.com)
(TXT) w3m dump (notso.boringsql.com)
| aurareturn wrote:
| DELETE FROM films;
|
| I'm surprised databases make it so easy to just delete an entire
| table. I think the command should be DELETE FROM
| films YES-I-KNOW-WHAT-I-AM-DOING;
| magicalhippo wrote:
| Agreed, I've long been thinking that DELETE should require a
| WHERE clause.
|
| If you really want to just delete everything, just do WHERE
| 1=1.
| buro9 wrote:
| UPDATEs should require a WHERE clause too.
|
| At which point we could just say all SQL should have a WHERE
| clause.
| jaredsohn wrote:
| I think SELECTs without WHERE clauses are fine
| marcosdumay wrote:
| But requiring them is fine too. Even more so on Postgres,
| which has booleans.
| ahoka wrote:
| And queries should start with WHERE.
| awestroke wrote:
| FROM, then WHERE, then SELECT
| hamandcheese wrote:
| This is the way.
| Jabbles wrote:
| Pipe Syntax In SQL
|
| https://research.google/pubs/sql-has-problems-we-can-fix-
| the...
|
| https://news.ycombinator.com/item?id=41338877
| yen223 wrote:
| It would be nice if there was a psql option that would warn
| you if you're about to do any kind of update, delete or
| insert without doing a BEGIN;
|
| (You could always set autocommit=false, which forces you to
| explicitly commit or rollback every action, but that has its
| own baggage)
| jsemrau wrote:
| MySQL has this as default as far as I recall. But then I never
| delete, I just set "deleted" to yes.
| physicsguy wrote:
| That just wouldn't fly where you have business customers;
| many insist on data deletion at the end of contracts. In
| practice, though, partitioning or using separate databases can
| be a better strategy for dealing with that, as otherwise
| dealing with backups is challenging.
| kijin wrote:
| It might depend on the version, but last time I checked,
| DELETEing an entire table was much slower than TRUNCATE
| TABLE.
| eddd-ddde wrote:
| I'm pretty sure that only applies to Postgres.
| magicalhippo wrote:
| Sybase SQLAnywhere as well, and not unlikely MSSQL too
| given its shared ancestry.
|
| Delete with WHERE is sufficiently slow in MSSQL we have
| to do batched deletes, but I can't recall offhand if that
| holds for whole table deletion as well.
| evanelias wrote:
| You're probably thinking of the --safe-updates option [1] for
| the `mysql` CLI, also available as the memorable alias
| --i-am-a-dummy. This requires UPDATE and DELETE to have
| either a WHERE clause or a LIMIT clause. Under the hood, the
| command-line client option just manipulates the
| sql_safe_updates session variable [2] to enforce the UPDATE
| and DELETE requirement, as well as a couple other unrelated
| variables to prevent overly-huge SELECTs.
|
| It's not enabled by default out of the box, but some
| companies do override their configuration to enable it for
| new sessions, iirc Facebook did this.
|
| [1] https://dev.mysql.com/doc/refman/8.4/en/mysql-
| tips.html#safe...
|
| [2] https://dev.mysql.com/doc/refman/8.4/en/server-system-
| variab...
| RedShift1 wrote:
| There is pg-safeupdate: https://github.com/eradman/pg-
| safeupdate
| cruffle_duffle wrote:
| That is why you always get in the habit of wrapping your stuff
| in "BEGIN TRANSACTION". Then if and when you fuck up you can
| issue a rollback and be all good.
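A minimal sketch of this habit, using Python's sqlite3 in place of an interactive psql session (table name and data are made up):

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we can
# manage BEGIN/ROLLBACK explicitly, mimicking an interactive session.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE films (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO films (title) VALUES (?)",
                 [("Alien",), ("Blade Runner",), ("Brazil",)])

conn.execute("BEGIN")
conn.execute("DELETE FROM films")   # oops: forgot the WHERE clause
assert conn.execute("SELECT COUNT(*) FROM films").fetchone()[0] == 0
conn.execute("ROLLBACK")            # undo the damage

count = conn.execute("SELECT COUNT(*) FROM films").fetchone()[0]
```

Because the DELETE ran inside an explicit transaction, the ROLLBACK restores all three rows.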
| lowbloodsugar wrote:
| I've deleted the _internet_??
| orionblastar wrote:
| Most databases I used have a Status column we could mark as
| active, inactive, or deleted. That way, you can see what records
| were marked as deleted and change them back in case of accidental
| deletion.
|
| Keep record retention with the Date_Modified column so you can
| use SQL delete to remove those deleted records that are older
| than a year or so.
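A sketch of the flag-then-purge pattern described above, using sqlite3; the table and column names (records, status, date_modified) are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE records (
    id INTEGER PRIMARY KEY,
    payload TEXT,
    status TEXT DEFAULT 'active',
    date_modified TEXT DEFAULT CURRENT_TIMESTAMP)""")
conn.execute("INSERT INTO records (payload) VALUES ('keep'), ('drop')")

# "Delete" is just a status flip, so it can be reversed if it was a mistake.
conn.execute("""UPDATE records SET status = 'deleted',
                date_modified = CURRENT_TIMESTAMP WHERE payload = 'drop'""")

# A periodic retention job hard-deletes rows that were flagged long enough ago.
conn.execute("""DELETE FROM records
                WHERE status = 'deleted'
                  AND date_modified < datetime('now', '-1 year')""")
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
status = conn.execute(
    "SELECT status FROM records WHERE payload = 'drop'").fetchone()[0]
```

The freshly flagged row survives the purge because its date_modified is recent; only year-old soft-deleted rows would be removed.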
| arielcostas wrote:
| I do something similar, but instead keep a "date_deleted"
| column null by default, and the "active" column as a boolean.
|
| That way, I kill two birds with one stone: a dedicated
| column for the deletion timestamp (instead of updating a record
| that is supposedly deleted) and the status as a plain boolean
| instead of some enum, integer or string.
| alexanderscott wrote:
| this is a "soft delete". as the author notes, depending on the
| nature of the data being stored a soft delete does not meet the
| requirements of many data privacy laws and compliance
| regulations (like GDPR's right to erasure).
| isbvhodnvemrwvn wrote:
| And in postgres soft delete is more expensive than a regular
| delete because it's effectively an insert and update, while
| delete is just an update.
| kijin wrote:
| There are different kinds of soft delete.
|
| I've had cases where the rows in question absolutely could
| not be hard deleted, because of legacy foreign key relations.
| But the PII in those rows had to go. So we did a kind of
| "firm delete" by setting all columns (except the PK and a few
| necessary flags) to their default values and/or null.
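A sketch of that "firm delete": the row survives (so legacy foreign keys keep working) but its PII columns are blanked. The schema and flag name are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT,
    is_erased INTEGER DEFAULT 0)""")
conn.execute("INSERT INTO users (name, email) VALUES ('Ada', 'ada@example.com')")

# Firm delete: keep the PK and a flag, null out everything personal.
conn.execute("""UPDATE users
                SET name = NULL, email = NULL, is_erased = 1
                WHERE id = 1""")
conn.commit()

row = conn.execute(
    "SELECT name, email, is_erased FROM users WHERE id = 1").fetchone()
```

Any table referencing users.id remains valid, but the personal data is gone.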
| Terr_ wrote:
| For many of the most painful deletion questions, the root problem
| is that when the software was first made the
| stakeholders/product-org didn't think about use-cases for
| deleting things. At best, they assume a "do not show" property
| can be placed onto things, which falls apart when you get to
| legal issues that compel actual removal.
| buildingcrash7 wrote:
| >software was first made the stakeholders/product-org
|
| Practically all building, physical and software alike, is done
| for a purpose first; the process is mainly an obstacle that
| needs to be minimized. A piece of software is trying to solve a
| problem just like a door is. It's driven by economics, where the
| recipients don't want to pay any more than they need to and
| someone is always willing to undercut you by cutting corners.
| tempodox wrote:
| That depends on the context. In some cases you're not allowed
| to physically delete because you need to provide an audit trail
| for 10 years or so. You could move those rows into separate
| audit tables instead. However, that requirement should not come
| as a surprise.
| BlueTemplar wrote:
| But, just like this article using <<physically deleted>> when
| in practice it's not the case (the bits are just freed to be
| overwritten some unknown amount of time later), does legal
| compliance just completely ignore this fact of actual physical
| deletion?*
|
| (AFAIK it takes several passes of overwriting bits with random
| bits on magneto-mechanical storage to not be able to retrieve
| any significant fragments of the original data, and things are
| even worse on transistor storage, which casually makes copies
| of the data for wear leveling reasons.)
|
| *With the exception of state secrets of course, where we know
| that storage is mechanically destroyed <<with extreme
| prejudice>>.
| yen223 wrote:
| A big asterisk that should be added to the article is that all
| that applies to Postgres.
|
| Other databases have their own deletion mechanisms (and deletion
| quirks)
|
| It's a very good article otherwise.
| jandrewrogers wrote:
| DELETE is expensive at a deep fundamental level that we don't
| think about much in computer science because we are more worried
| about losing data. The article is about Postgres but it
| generalizes. We don't actually have any computer science for
| DELETE optimized databases. I've idly looked into delete-
| optimization in databases as thought experiments, since there
| isn't much in the way of literature on it, and it is far more
| difficult than I think many people intuit. The article is partly
| a manifestation of this reality.
|
| I think the nature of DELETE is one of the more interesting open
| problems in computer science. It is one of those things that,
| when we require precision, turns out to be very difficult to
| define.
| redox99 wrote:
| I think garbage collection memory management can be thought of
| as a delete-optimized database.
| tybit wrote:
| Runtimes with garbage collectors typically optimize for
| allocation, not deletion.
| mike_hearn wrote:
| Generational GC optimizes for both. They assume that most
| objects die young, so choose to relocate live objects and
| just mark the entire region that was evacuated as empty. So
| this is a very efficient way to delete data.
| SgtBastard wrote:
| Only if you profoundly misunderstand what GC is.
| actionfromafar wrote:
| I have long day-dreamed of what a "use it (soon) or lose it"
| runtime would mean. Allocated blocks would just expire after
| a set time.
| okasaki wrote:
| So kind of like browsers work now with the tab unloading
| and Android killing apps?
|
| Personally I find it really obnoxious and disruptive.
| sitkack wrote:
| You would prefer something a little more persistent?
| dotancohen wrote:
| I would prefer a 48 hour day, so that I could get
| everything done that needs to be done.
|
| Or maybe a 72 hour day, so I'd have time for things that
| I'd just enjoy.
| okasaki wrote:
| I prefer to do my own app lifecycle management.
| mpweiher wrote:
| Yes...but it goes even deeper.
|
| For example, in physics, the paradox of Maxwell's demon is
| resolved when you consider the cost of deleting data:
|
| "In 1982, Charles Bennett showed that, however well prepared,
| eventually the demon will run out of information storage space
| and must begin to erase the information it has previously
| gathered.[8][12] Erasing information is a thermodynamically
| irreversible process that increases the entropy of a system."
|
| https://en.wikipedia.org/wiki/Maxwell's_demon#Recent_progres...
|
| It is also difficult for humans to delete information. In my
| humble and only a little facetious opinion this is one of the
| main drivers for ever new "To Do" apps: the existing app gets
| full because deleting is too hard, so we start fresh with a new
| app. The app isn't the point, the starting fresh is.
|
| The underlying reason there being that the cost of maintaining
| small (to do) notes can be greater than the value of the note,
| which is one of the reasons we still use little scraps of paper
| and other mechanisms that will effectively auto-delete.
|
| _Understanding the micronote lifecycle: improving mobile
| support for informal note taking_
|
| https://dl.acm.org/doi/10.1145/985692.985779
| rpier001 wrote:
| We need a to-do app following these principles... then
| another for product management.
| loa_in_ wrote:
| Just add a __delete all__ button.
| mycall wrote:
| My solution to most task list items is just wait long
| enough. Eventually the items often are no longer needed.
| mpweiher wrote:
| Yeah, that _should_ work.
|
| Alas, it doesn't.
| lukevp wrote:
| Every day I'd have more and more state accumulation on my
| machine - open apps, unsaved changes, open tabs. I've tried
| many methods for preventing this from happening over the
| years, but the only truly effective solution I've found in
| the last year is a script I wrote that just quits every
| browser tab and every open app (leaving apps with unsaved
| changes still running) every evening. I wake up and the machine is
| new and fresh. It's amazing and has benefited my productivity
| so much. It changes the state game to where if I have a
| specific task I want to resume tomorrow I have to make myself
| a clear note to read the next day, and if I have tabs open
| that I care about, I have to bookmark them. What I've found
| is that the actual info that needs to be transferred between
| days is very small.
|
| I have an endless reminders app and todo list. I wonder if
| something similar (items expire automatically unless you flag
| them as permanent or something) would help keep a clearer
| list. Sometimes ephemerality is best!
| wenc wrote:
| > Every day I'd have more and more state accumulation on my
| machine - open apps, unsaved changes, open tabs.
|
| I resonate with your comment.
|
| I grew up during a time when PC state was ephemeral (DOS
| days). Unsaved changes essentially meant lost data. Open
| apps were gone once the computer was shut down (and the
| computer was shut down daily - a 250W power supply had an
| expensive power draw; in contrast a Mac Mini only sips 1W
| when sleeping). This helped me develop a habit of keeping
| only necessary things open, bookmarking what I want to
| keep, and of habitually pressing Ctrl+S to save data. I
| never keep any tabs open (my browser loses all tabs upon
| close anyway), and my inbox is zero.
|
| The cost I pay for this is context recovery -- every day, I
| have to essentially set everything up again. I do write
| notes or leave comments in code to remind myself where I
| left off, but I essentially started fresh. But there is an
| upside to this: I start each day from an uncluttered slate,
| which leads to clarity in my head. When context is
| ephemeral, I'm more likely to not be beholden to an
| existing state.
|
| This actually helps me increase the quality of my writing
| and my code. It's akin to the heuristic of "throwing away
| your first draft and rewriting" to achieve higher quality.
| Write your code once, delete it, and write it again from
| scratch. The next version takes a quarter of the time to
| write but can be twice the quality because you've
| prototyped it in your head but you are not bound to your
| crufty first attempt. There was an HN discussion on this a
| while ago:
|
| https://grantslatton.com/software-pathfinding
|
| If you save state, you can resume more quickly, which is
| great if you're on a roll, but it's also an accumulation of
| more clutter that blocks certain more creative thoughts.
| alexwasserman wrote:
| I like the idea of a todo list that comes with a built in
| auto-delete.
|
| You either do your to dos, or it auto-deletes them for you.
| No worry about it getting full, but also some pressure to
| actually get them done or they'll be wiped.
|
| And if you're happy they're wiped, then you probably didn't
| need to do it at all.
|
| I wonder if there's something like that already.
| teitoklien wrote:
| you can already do it, with Apple's native Shortcuts app
| and Reminders app. You can set an automation to delete all
| reminders after a fixed time has passed with the 'Remove
| Reminders' action
|
| Otherwise you can just use cronjobs with a small python
| snippet to parse and decide what reminders tagged with
| which labels to delete after X,Y,Z many days and hit the
| APIs of most major todo apps like todoist, asana, etc to
| just delete those tasks. Heck works with your local
| markdown based todo lists too.
| MetaWhirledPeas wrote:
| > The app isn't the point, the starting fresh is.
|
| This is the crux of what made Google Inbox so good. The UX
| encouraged "deleting" and starting fresh. This was made
| possible not just through the controls, but also through the
| promise that undelete would be possible. People want to start
| fresh, but they also don't want to lose anything; that's the
| conundrum.
| akira2501 wrote:
| > We don't actually have any computer science for DELETE
| optimized databases.
|
| There is actually a fair amount if you consider databases with
| fixed-length records, which used to be the dominant form.
| brightball wrote:
| Yep, it's tough. One of the more unexpectedly complicated
| aspects of GDPR compliance is the strain that full deletes put
| on a system.
|
| You have to break it into small batches across the cascade of
| associated data.
| epistasis wrote:
| There was a common thread through all of my algorithms and data
| structures class: Hash maps, b-trees, etc. are all beautiful
| structures until you add a delete operation and have to start
| dealing with all those little holes... Removing data
| complicates everything.
| mike_hearn wrote:
| What about LSM trees? Something like RocksDB is very efficient
| at deleting. A delete operation is a tiny write (a tombstone)
| and then the actual deletion is done via compaction in the
| background with entire tablets being freed at once when the
| live data is evacuated. It's actually most efficient when
| deleting large ranges - when just replacing data it's not so
| efficient due to the write amplification.
|
| That said, I agree with you in general. An under-used technique
| is to simply encrypt everything and then use a key store that
| is guaranteed capable of deleting data. This makes it easy to
| comply with legal deletion requests even across backups, though
| of course, you need to manage the keys very carefully and
| ensure that they are also backed up.
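A toy illustration of that encrypt-and-shred pattern: every record is stored encrypted under its own key, and "deletion" means destroying the key. A one-time pad is used purely to keep the sketch dependency-free; a real system would use an authenticated cipher and a proper key-management service, and all names here are made up:

```python
import secrets

key_store = {}  # stands in for a key-management service

def store(record_id: str, plaintext: bytes) -> bytes:
    """Encrypt a record under its own fresh key; return the ciphertext."""
    key = secrets.token_bytes(len(plaintext))
    key_store[record_id] = key
    return bytes(p ^ k for p, k in zip(plaintext, key))

def load(record_id: str, ciphertext: bytes) -> bytes:
    """Decrypt; raises KeyError once the key has been shredded."""
    key = key_store[record_id]
    return bytes(c ^ k for c, k in zip(ciphertext, key))

blob = store("user-42", b"sensitive PII")
assert load("user-42", blob) == b"sensitive PII"

# "Delete" the record: the ciphertext, including every backup copy of it,
# is now unrecoverable noise.
del key_store["user-42"]
```

The appeal is that backups never need to be rewritten; losing the key is equivalent to erasing every copy of the ciphertext at once.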
| selecsosi wrote:
| +1 This was our strategy at TempoIQ for our ts storage engine
| (built on top of rocks).
|
| Very efficient and effective at managing tons of data
| ingestion (and deletion) at scale.
|
| Not an easy out of the box tech to build on top of though
| when you have to build all the analytics/management pieces
| that something like PG gets you so I get the lack of public
| examples
| latency-guy2 wrote:
| Well, tombstoning is fundamentally punting the operation: the
| data is still there, taking up space and computation at various
| levels of the query plan, until the flagged entry actually gets
| removed.
|
| I agree that it meets the requirements for batched DELETE,
| and that's likely as best as we can make it.
|
| But I wonder if there was a better way. I know there are
| research DBs out there that experimented with reusing the
| tombstone entry for new INSERT/UPDATE operations, but these
| suck when you want to do batched INSERT/UPDATE on a range
| since they're scattered all about in a table, and you lose
| ordering + monotonic properties.
| kccqzy wrote:
| It's absolutely true that we don't think about it much. When I
| was first taught balanced binary trees, deletion wasn't even a
| topic that needed to be learned. Same thing later when I was
| taught B-trees. Then again in hash tables. It's
| an operation that's overlooked in CS education.
| toast0 wrote:
| > We don't actually have any computer science for DELETE
| optimized databases.
|
| Depending on how much deleting and when, there might be
| engineering if not science for this.
|
| If everything is deleted on a schedule, partitioned databases
| and dropping whole partitions as they expire is a well worn
| path. Soft delete and compaction also works pretty well if
| most, but not all things will be deleted. A generational
| garbage collection kind of thing.
|
| As others said, fixed sized records are easier to manage
| deletion/replacement with, too.
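A sketch of that schedule-based pattern: expiring data by dropping a whole partition instead of deleting row by row. SQLite has no native partitioning, so per-month tables stand in for partitions here, and the names and retention policy are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One "partition" per month of event data.
for month in ("2024_09", "2024_10", "2024_11"):
    conn.execute(f"CREATE TABLE events_{month} (id INTEGER, payload TEXT)")
    conn.execute(f"INSERT INTO events_{month} VALUES (1, 'x')")

# Expiring September is a metadata operation, not a row-by-row DELETE:
# no tombstones, no vacuum debt, the space comes back immediately.
conn.execute("DROP TABLE events_2024_09")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

In Postgres the equivalent would be DROP TABLE (or DETACH) on a partition of a partitioned table.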
| hinkley wrote:
| If someone dumped that project in my lap and said fix it (and
| it I was more used to low level programming), I'd probably
| start be refreshing myself on the last 10+ years of GC advances
| since I stopped reading SIGPLAN. Particularly multithreaded
| sweep. Because essentially you want to decouple delete from
| free so you can not do 100% of the bookkeeping work in the
| middle of time sensitive operations. But not fall as far behind
| as Postgres can with its vacuuming albatross.
|
| In a way, deletion is a form of eventual consistency. The user
| loses access to the data but the system still knows about it
| for a while.
|
| Just off the top of my head, I would think for LSM systems, you
| would resort to snapshotting as the edit history became much
| larger than the retained row count, and as you delete old
| snapshots (two GC roots) you could compare the old and the new
| and drop everything that didn't survive. You only have to
| finish well before the next snapshot interval, and if you
| maintain a queue you only have to process them on average as
| fast as the snapshot interval.
|
| And for BTree systems you can amortize the deletes across every
| insert, the way some realtime systems clean up a few free
| pointers on every allocation.
| Svip wrote:
| > Unlike DELETEs, UPDATEs don't trigger cascaded actions - they
| only involve triggers that are explicitly defined.
|
| That's not entirely true. ON UPDATE CASCADE is a thing for
| foreign keys, at least in PostgreSQL (which this article is
| talking about), meaning that the foreign row referencing the row
| gets updated. Though, personally, I would never use ON UPDATE
| CASCADE, as it seems kind of funky.
| arcanemachiner wrote:
| > Though, personally, I would never use ON UPDATE CASCADE, as
| it seems kind of funky.
|
| As a relative noob in the world of database administration, I'm
| glad to hear that someone else feels this way.
| pepelotas wrote:
| It does if your key is an auto-increment or random unique
| identifier. But if you had a key that is also data, say a
| "Genre" column, it starts to make sense that you'd want to
| cascade updates.
| rrr_oh_man wrote:
| > Though, personally, I would never use ON UPDATE CASCADE, as
| it seems kind of funky.
|
| Why?
| Svip wrote:
| Personally, I like to be explicit and in control. In the
| application layer, I may be far away (at least mentally
| speaking) from the constraints in the database, and if I
| update/delete something, I don't want it to "magically"
| cascade through the database. For those reasons, I always
| prefer RESTRICT, both for ON DELETE and ON UPDATE. This
| forces me to clean up before I make the actual change I'm
| interested in, and anyone reading the code later, can uncover
| that behaviour quickly.
|
| That being said, I can see the arguments for ON DELETE
| CASCADE, particularly in a codebase, where there is a lot of
| functionality _in_ the database itself (in the form of
| stored procedures, and what have you), since you are always
| mentally closer to the action. But ON UPDATE CASCADE feels
| weird, because why are you updating the primary key (which is
| usually what foreign keys references) of your rows? That
| feels like something that needs a good explanation.
|
| Though, I do recognise, you need to jump through a lot of
| hoops to modify your primary key values with ON UPDATE
| RESTRICT, because you basically need to cover all your bases
| in a large convoluted common table expression (depending on
| the number of foreign keys, of course), when an ON UPDATE
| CASCADE would do that for you. But I'd also rather be blocked
| from updating primary row values entirely, since it feels
| like the _wrong_ thing to do. (Yes, I know that foreign keys
| don't _have_ to reference other primary keys, and there
| may be niche cases for this, but personally, I'd just avoid
| it altogether.)
| rrr_oh_man wrote:
| Thanks for the explanation.
|
| The first case I was thinking about was a merger of two
| accounts. Second case would be a db-wide update of one or
| more keys for some reason. That's why I tend to leave it ON
| UPDATE CASCADE.
|
| Although both of these cases are more diffuse muscle memory
| than rational foresight.
| SoftTalker wrote:
| > if I update/delete something, I don't want it to
| "magically" cascade through the database.
|
| This is the big reason I don't like triggers either. I
| would use them in only one case: if the database didn't
| support an auto-incrementing "identity" type, I might use a
| trigger to simulate that. But just as often I would prefer
| a stored procedure that got the new ID from a sequence, and
| then did the insert, especially if there were other things
| I needed a stored procedure to deal with.
| hamandcheese wrote:
| > That's not entirely true. ON UPDATE CASCADE is a thing for
| foreign keys, at least in PostgreSQL (which this article is
| talking about), meaning that the foreign row referencing the
| row gets updated.
|
| For this to work with soft-deletes, wouldn't you have to
| include the soft-delete column in a compound foreign key? That
| sounds funky indeed.
| SoftTalker wrote:
| You really shouldn't be updating primary key values though.
| magicalhippo wrote:
| While I overall agree, there have been cases where I have
| found it handy.
|
| In one case, we had a table of invoices and a table of goods
| items, and each goods item should point to an invoice.
|
| If one wants to use the natural key, invoice number, and have
| a foreign key to ensure the goods items can't point to an
| invoice that doesn't exist, then ON UPDATE CASCADE was needed
| in case the user had to change the invoice number (due to
| mistyping it or similar).
|
| Of course, if one does not use a natural key then this isn't
| such an issue.
|
| Another case was where we've had to merge databases after two
| companies merged. If they had overlapping primary key ranges,
| then the easiest was to switch to ON UPDATE CASCADE and then
| renumber the existing rows before inserting the other rows.
|
| We'd change back after the merge though.
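The invoice scenario above can be sketched with sqlite3, which also supports ON UPDATE CASCADE (the schema is made up, with the natural key invoice_no as primary key):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only if asked

conn.execute("CREATE TABLE invoices (invoice_no TEXT PRIMARY KEY)")
conn.execute("""CREATE TABLE goods_items (
    id INTEGER PRIMARY KEY,
    invoice_no TEXT REFERENCES invoices(invoice_no) ON UPDATE CASCADE)""")

conn.execute("INSERT INTO invoices VALUES ('INV-100')")
conn.execute("INSERT INTO goods_items (invoice_no) VALUES ('INV-100')")

# The user mistyped the invoice number; the correction cascades
# automatically to every goods item referencing it.
conn.execute("UPDATE invoices SET invoice_no = 'INV-1001'")
conn.commit()

linked = conn.execute("SELECT invoice_no FROM goods_items").fetchone()[0]
```

Without the cascade (e.g. with RESTRICT), the UPDATE would fail until every goods item was re-pointed by hand.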
| marcosdumay wrote:
| > If one wants to use the natural key
|
| Yeah, that's why you shouldn't use natural keys as primary
| ones.
| magicalhippo wrote:
| Learned that lesson a bit too late for that.
|
| But yea, these days I never use natural keys.
| marcosdumay wrote:
| I really hate that the educational literature says it's
| an option. Yet everybody knows it's something you should
| never do, no exceptions.
| burntcaramel wrote:
| If data isn't actually removed until vacuuming, then are systems
| that perform SQL DELETES actually GDPR compliant? Because
| technically the private data is still there on disk and could be
| recovered. "Until the autovacuum process or a manual VACUUM
| operation reclaims the space, the "deleted" data remains."
| konha wrote:
| Yes. GDPR allows for delays when complying with deletion
| requests. You should ideally document it and factor the delay
| into any deadlines you might be bound to.
|
| You'd need to make sure the process is somewhat predictable,
| like running the vacuum on a set schedule so you know for sure
| what maximum amount of time a deletion request will take.
| lucianbr wrote:
| If vacuum runs at least once per day, seems pretty GDPR
| compliant to me. Even if it runs once every two or three days.
|
| Now if your database runs vacuum once every 6 months, yeah,
| DELETE might not actually be a delete. But is it really a GDPR
| issue? What's really going on in this system?
|
| I don't think any EU agency is going to fine your company if
| the data you say you deleted survived 6 or even 60 hours after
| deletion, if that is the end of it.
| dataflow wrote:
| Even vacuuming wouldn't actually destroy the data right?
| Because filesystems don't guarantee they will overwrite or wipe
| any particular disk blocks. And even if they did, SSDs still
| wouldn't promise that the blocks aren't remapped instead of
| being wiped & reused.
| Polizeiposaune wrote:
| > Because filesystems don't guarantee they will overwrite or
| wipe any particular disk blocks.
|
| Some filesystems have a richer interface to the underlying
| storage device, allowing them to invoke commands such as ATA
| TRIM or SCSI UNMAP - either incrementally as blocks are
| freed, or on demand - which request that the underlying
| storage device forget the block contents.
|
| So the necessary interfaces exist and are widely available,
| and even if imperfect they improve the situation.
| dataflow wrote:
| > Some filesystems have a richer interface to the
| underlying storage device, allowing them to invoke commands
| such as ATA TRIM or SCSI UNMAP
|
| No, that's not a guarantee of data erasure. Not just
| because it's just a request that the device can disregard,
| but also because filesystems play tricks (like storing
| small bits of data inline, or logging data in various
| places, etc.) and they don't clear all those blocks just
| because you wanted to clear a couple bytes.
| __turbobrew__ wrote:
| Yea the only way to be sure that data is gone is through
| mechanical destruction (shredding) of the drives. Sometimes
| you can write something to a SSD and then not be able to
| delete it due to a hardware fault, but the data can still be
| read.
|
| I wonder if a GDPR nation has made a ruling on the extent of
| data erasure? Surely you cannot expect a company to shred a
| SSD every time someone asks for their data to be deleted.
|
| With our current understanding of physics you cannot destroy
| information outside of maybe throwing something in a black
| hole -- and even then you may still be able to get the
| information back from hawking radiation after many eons -- so
| the question is how much should we scramble information
| before it is considered "deleted"?
| tzs wrote:
| It's also likely still somewhere on backups.
|
| GDPR supervisory authorities disagree on what to do about data
| in backups. France has said you don't have to delete data from
| backups. The Danish authorities have said that you have to
| delete from backups where it is technically possible. The UK
| (which still has GDPR) has said that you must put the data
| "beyond use" which most have taken to mean that you have to
| make sure that if you ever restore data from the backup you
| will omit the data that is supposed to be forgotten.
|
| I don't know what other supervisor authorities have said--those
| three are just the ones that tend to show up when searching on
| this topic. I would not be surprised if there are at least a
| dozen other different opinions from the rest.
| physicsguy wrote:
| They are difficult, but the cost of trying to keep consistency
| with the alternatives shouldn't be underestimated either.
|
| I sometimes think that people ask the wrong question on this sort
| of thing - rather than thinking "what technical solution should I
| come up with" you should be thinking "what is the business
| requirement here, and what consistency guarantees are needed?"
|
| In many cases you want to soft delete anyway and mark rows as
| stale rather than deleting them wholesale. Cascade deletes need
| to be very carefully thought about, as while they're very handy
| they can be quite destructive if the relationships are not mapped
| out.
|
| Personally having spent some time now in the microservices hole,
| I miss all the power SQL databases give you for this sort of
| thing. I think everyone should spend some time reading and
| digesting 'Designing Data Intensive Applications' and evaluating
| the trade offs in detail.
| xg15 wrote:
| > _For example, deleting 1 million rows in a single transaction
| is a textbook case of what not to do. Instead, splitting the
| operation into smaller batches, such as deleting 10,000 rows
| across 100 iterations, is far more effective._
|
| Why do I as a user have to do that? Why can't the database
| implement batching internally and automatically transform my
| 1-million-rows query into an appropriate list of batched queries?
|
| (Edit: Thanks a lot for the answers, that makes more sense - in
| particular the point that this would also _lock_ one million rows
| at once)
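The batching pattern quoted from the article can be sketched with sqlite3 (sizes and table name are made up; SQLite lacks DELETE ... LIMIT by default, so a subquery selects each batch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, old INTEGER)")
conn.executemany("INSERT INTO logs (old) VALUES (?)", [(1,)] * 1000)
conn.commit()

BATCH = 100
while True:
    cur = conn.execute("""DELETE FROM logs WHERE id IN (
                              SELECT id FROM logs WHERE old = 1 LIMIT ?)""",
                       (BATCH,))
    conn.commit()             # commit between batches so locks stay short-lived
    if cur.rowcount < BATCH:  # a partial (or empty) batch means we're done
        break

left = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
```

Each iteration locks and removes only a small slice, so concurrent readers and writers are never blocked behind one giant million-row transaction.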
| sureglymop wrote:
| And a follow up question: would the current best way to handle
| this be to "mark records as deletable" and then do the batched
| deletion operations when convenient?
| rawgabbit wrote:
| Create a column called MarkedForDeletion. Create a job that
| starts in the off hours to detect how many locks are present
| on the table, if low then delete X records. Else wait for Y
| minutes. Put this in a loop. If error detected, breakout of
| the loop.
| Nican wrote:
| Transactional consistency / ACID guarantees.
|
| Before you execute the query, you should be able to query any
| of the data, and after you execute the query, none of the data
| should be available. The mechanisms to make a transactional
| database is tricky.
|
| Some databases, like CockroachDB, provide built-in TTL
| capabilities.
|
| But also- if you are having to delete huge ranges of data and
| do not care about consistency, you are probably looking at an
| analytical workload, and there would be better databases suited
| for that, like Clickhouse.
| tremon wrote:
| > none of the data should be available
|
| As written, that's not required. The data should not be
| retrievable by query, but ACID only specifies what happens
| at the client-server boundary, not what happens at the
| server-storage boundary (Durability prescribes that the
| server must persist the data, but not _how_ to persist it). A
| database that implements DELETEs by only tombstoning the row
| and doesn't discard the data until the next re-index or
| vacuum operation would still be ACID-compliant.
| asQuirreL wrote:
| One reason I can think of is that the database needs to
| maintain atomicity and isolate effects of any given operation
| (the A and I in ACID).
|
| By manually batching the deletes, you are telling the database
| that the whole operation does not need to be atomic and other
| operations can see partial updates of it as they run. The
| database wouldn't be able to do that for every large delete
| without breaking its guarantees.
| wruza wrote:
| I think that gp's comment can be reinterpreted as: why should
| this landmine exist when databases could notify a reader of
| their manual about this issue in an explicit way, for example:
|
|     DELETE FROM t WHERE ... BATCH 100
|
| which would simulate batched queries when called outside of a
| transaction. This would remove the need for a client to stay
| connected (or at least active) for the duration of this lengthy
| operation.
|
| If DELETE is so special, make special ways to manage it. Don't
| offload what is _your_ competence onto a clueless user; it's a
| recipe for disaster. Replace DELETE with anything and it's
| still true.
|
|     ALTER DATABASE d SET UNBATCHED DELETE LIMIT 500000
|
| I know a guy (not me) who deleted rows from an OLTP table that
| served a country-level worth of clients and put it down for two
| days. That is completely the database's fault. If its engine
| was designed properly for big data, it should have refused to
| run on a table with gazillions of rows and suggested a proper
| way to do it.
| magicalhippo wrote:
| You kinda have that already for certain databases[1] with
| DELETE TOP (100). We have a few cleanup tasks that just run
| that in a loop until zero rows are affected.
|
| That said, I agree it would be nice to have a DELETE BATCH
| option to make it even easier.
|
| [1]: https://learn.microsoft.com/en-
| us/sql/t-sql/statements/delet...
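| For reference, the loop described above looks roughly like this
| (T-SQL sketch; the table name and predicate are illustrative):

```sql
-- Delete expired rows in batches of 100 until nothing is left.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (100) FROM dbo.cleanup_queue
    WHERE expires_at < SYSUTCDATETIME();
    SET @rows = @@ROWCOUNT;
END
```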
| cogman10 wrote:
| Rather than batching, I would want a "NO ROLLBACK DELETE"
| sort of command. The real expensive part of the delete is
| rewriting the records into the transaction log so that a
| cancel or crash can undo the delete.
|
| If you've gone to the effort of batching things, you are
| still writing out those records, you are just giving the db
| a chance to delete them from the log.
|
| I'd like to save my ssds that heartache and instead allow
| the database to just delete.
|
| In MSSQL in some extreme circumstances, we've partitioned
| our tables specifically so we can use the 'TRUNCATE TABLE'
| command as delete is just too expensive.
|
| That operation can wipe gbs in seconds.
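| A sketch of that partition trick (T-SQL; dbo.Events is assumed to
| be partitioned by month, and all names are illustrative):

```sql
-- SQL Server 2016+: truncate individual partitions in place.
TRUNCATE TABLE dbo.Events WITH (PARTITIONS (1 TO 3));

-- Older versions: switch the partition out to an empty table with
-- an identical schema on the same filegroup, then truncate that.
ALTER TABLE dbo.Events SWITCH PARTITION 1 TO dbo.Events_staging;
TRUNCATE TABLE dbo.Events_staging;
```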
| pixl97 wrote:
| I would say it can unlink gbs in seconds. The data is still on
| the disk until it's trimmed or overwritten.
| immibis wrote:
| So why does it need to be copied into the WAL log until
| vacuum runs?
|
| And vacuum is not expected or required to be atomic,
| since it deletes data that was necessarily unreferenced
| anyway, so it also shouldn't need to copy the old data
| into WAL files.
| mattashii wrote:
| Many DBMSs with index-oriented storage (MySQL, Oracle,
| MSSQL) use undo logging for a transaction's MVCC, so that
| for deletion the old version is put into the undo log of
| that transaction and referred to as an old version of the
| record (or page, or ...), immediately cleaning up space
| on the page for new data while the transaction is still
| going on. This is great for short transactions and record
| updates, as a page only has to hold one tuple version at
| a time, but that is at the cost of having to write the
| tuples that are being removed into a log, just in case
| the transaction needs to roll back.
| Kinrany wrote:
| What happens when two transactions select and then delete
| two rows in the opposite order while requesting "no
| rollback"?
| mike_hearn wrote:
| Yes the commercial databases make it easier to handle
| this.
|
| One simple way in Oracle is to take a table lock, copy
| the data you want to preserve out to a temporary table,
| truncate the target table, copy the data back in.
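| The same copy-out / truncate / copy-back shape can be sketched
| in most databases; here with Python's sqlite3, where an
| unqualified DELETE takes the fast truncate path (table and
| column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (id INTEGER PRIMARY KEY, keep INTEGER)")
# 10,000 rows, of which only the first 100 should survive.
conn.executemany(
    "INSERT INTO films (keep) VALUES (?)",
    [(1 if i < 100 else 0,) for i in range(10_000)])
conn.commit()

# Copy the survivors out, wipe the table, copy them back.
with conn:  # one transaction stands in for Oracle's table lock
    conn.execute(
        "CREATE TEMP TABLE films_keep AS SELECT * FROM films WHERE keep = 1")
    conn.execute("DELETE FROM films")  # sqlite's TRUNCATE equivalent
    conn.execute("INSERT INTO films SELECT * FROM films_keep")
    conn.execute("DROP TABLE films_keep")

print(conn.execute("SELECT COUNT(*) FROM films").fetchone()[0])  # 100
```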
| karrieberrie553 wrote:
| Because they mean different things. Deleting all records in a
| SINGLE transaction means something different from deleting them
| in many smaller transactions. Even if for most use cases you
| want the latter, it may be the case that you need to lock the
| DB down to perform the deletions in a single step.
| martinsnow wrote:
| Admin panels may do that. Say for example deleting a bucket
| with a million rows.
|
| In that case I'd fire off a job and tell the user, yeah sure
| the delete is successful. But check the logs to be sure.
| dotancohen wrote:
| I show these as pending.
|
| No message (other than an INFO log message) when the
| operation completes successfully, but failures are handled
| and notified properly.
| cannibalXxx wrote:
| this content reminds me of this post where we can get an idea of
| how to build complex and efficient queries. https://chat-
| to.dev/post?id=724
| redman25 wrote:
| One solution for performance degradation with soft deletes is to
| partition the table by some field like `created` monthly. The
| main downside is that queries will then need to include
| `created`.
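| In Postgres that looks roughly like this (declarative range
| partitioning; table and column names are illustrative):

```sql
CREATE TABLE events (
    id      bigint GENERATED ALWAYS AS IDENTITY,
    created timestamptz NOT NULL,
    payload text
) PARTITION BY RANGE (created);

CREATE TABLE events_2024_11 PARTITION OF events
    FOR VALUES FROM ('2024-11-01') TO ('2024-12-01');

-- Retiring a month of soft-deleted rows becomes a metadata operation:
DROP TABLE events_2024_11;
-- ...or DETACH PARTITION first if the data should be archived.
```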
| kccqzy wrote:
| I've only heard of this trick being employed on OLAP scenarios.
| Is this kind of partitioning also advisable for OLTP workloads?
| pmarreck wrote:
| My only recommendation would be, no matter which strategy you go
| with, cover it with tests to make sure the right information
| stays, the right information gets physically deleted (or marked
| as deleted), and data marked for deletion is invisible to the UI
| except via admin access.
|
| But, indeed, proper deletion is surprisingly difficult,
| especially when you consider cascades on a complex table
| containing many defined foreign-key relationships.
| tonymet wrote:
| Why do DBs perform delete operations online? Wouldn't it be
| better to soft-delete (at the table-space level) and then run a
| scheduled task to clean up the table spaces?
|
| Similar to git: when you "delete" files they are just removed
| from the tree. It isn't until later, after all refs and reflog
| entries have expired and gc is run, that the objects are
| actually removed.
| MatmaRex wrote:
| If you read the linked article, it explains that this is
| exactly what they do.
| tonymet wrote:
| thanks i got the summary for free
| junto wrote:
| Reminds me of this old post about deleting large amounts of data
| efficiently at MySpace. The page is now gone, but it was
| archived.
|
| https://web.archive.org/web/20090525233504/http://blogs.msdn...
|
| Brent Ozar talked about this back in 2018 in reference to working
| with large tables in MSSQL Server:
|
| https://www.brentozar.com/archive/2018/04/how-to-delete-just...
___________________________________________________________________
(page generated 2024-12-01 23:01 UTC)