[HN Gopher] Does Your Test Suite Account for Weak Transaction Isolation?
       ___________________________________________________________________
        
       Does Your Test Suite Account for Weak Transaction Isolation?
        
       Author : todsacerdoti
       Score  : 34 points
       Date   : 2024-01-04 15:06 UTC (7 hours ago)
        
 (HTM) web link (concerningquality.com)
 (TXT) w3m dump (concerningquality.com)
        
       | skyde wrote:
        | I have known of this issue for some time (the default DB
        | isolation is less than SERIALIZABLE on most DBs).
        | 
        | But assuming you do not want to change the default isolation for
        | your transaction or for the whole DB, how do you write a unit
        | test that checks your code works correctly because it uses
        | "select balance from accounts FOR UPDATE"?
        | 
        | It seems to me the only way to really validate this is using a
        | tool like TLA+, or something like
        | https://github.com/microsoft/coyote
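        | 
        | For reference, a minimal sketch of the FOR UPDATE approach
        | (assuming a hypothetical accounts(id, balance) table, Postgres
        | syntax):
        | 
        |     BEGIN;
        |     -- Locks the row until COMMIT/ROLLBACK; a concurrent
        |     -- transaction blocks here instead of reading a stale value.
        |     SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
        |     -- Application code checks the balance, then:
        |     UPDATE accounts SET balance = balance - 10 WHERE id = 1;
        |     COMMIT;
        | 
        | The open question is how a unit test proves the FOR UPDATE is
        | actually necessary, i.e. that removing it reintroduces the race.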
        
         | skyde wrote:
          | Even using a tool like this, it seems to be a combinatorial
          | explosion issue.
          | 
          | If your app contains 10 distinct transactions:
          | 
          | you have to test what happens if 2 instances of tx #1 run
          | concurrently.
          | 
          | But also if tx #1 and tx #2 run concurrently.
          | 
          | And also what if tx #1 and tx #3 run concurrently... That is
          | already 10 + C(10,2) = 55 pairs, before even considering
          | interleavings of three or more transactions.
        
         | amw-zero wrote:
         | The issue is that the race condition exists in the database,
         | not in any application code. So you can either simulate the
         | race, or you need to integration test, and that's outside the
         | scope of TLA+ and model checking.
         | 
         | I haven't heard of Coyote (it's still amazing how many tools
         | are out there). It's good that it operates on the actual code
         | level, but I'm still not sure that would reproduce concurrency
         | non-determinism at the database level.
        
           | skyde wrote:
            | Yes, if your tests use a mock of the DB (e.g.
            | https://github.com/microsoft/coyote/blob/main/Samples/Accoun...)
            | you can simulate the races.
            | 
            | The problem is your mock needs to implement the same
            | isolation level as your real DB and support transactions...
            | 
            | You could use an SQLite in-memory DB to run your tests, but
            | I assume that would make them a lot slower.
        
             | amw-zero wrote:
             | Exactly, it's a tricky problem. Implementing all
             | transaction isolation levels in a mock is quite an
             | ambitious endeavor.
        
         | mrloba wrote:
         | It's not perfect, but I have had success running postgres in
         | Docker and running integration tests against that. Usually you
         | can trigger the problem by running a handful of queries in
         | parallel.
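            | 
            | For example, the classic lost update at Postgres's default
            | READ COMMITTED level can be reproduced with two interleaved
            | sessions (hypothetical accounts(id, balance) table):
            | 
            |     /* A */ BEGIN;
            |     /* A */ SELECT balance FROM accounts WHERE id = 1; -- 100
            |     /* B */ BEGIN;
            |     /* B */ SELECT balance FROM accounts WHERE id = 1; -- 100
            |     /* A */ UPDATE accounts SET balance = 90 WHERE id = 1;
            |     /* B */ UPDATE accounts SET balance = 80 WHERE id = 1;
            |             -- B blocks on A's row lock...
            |     /* A */ COMMIT;  -- ...and now proceeds
            |     /* B */ COMMIT;  -- balance is 80: A's withdrawal is lost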
        
           | amw-zero wrote:
           | Yes exactly. That's the approach taken here. The main
           | downside is that it can require sleeps to properly reproduce
           | race conditions.
        
           | skyde wrote:
            | This is the current setup we use on my team, but I am trying
            | to move away from this to something more systematic.
        
       | nick__m wrote:
        | Why is optimistic locking never considered in these articles?
        | Adding a version column and checking the number of affected rows
        | scales a lot better than a select for update...
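        | 
        | A minimal sketch of that pattern (assuming a version column on
        | a hypothetical accounts table):
        | 
        |     -- Read the row, remembering its version (say, 7).
        |     SELECT balance, version FROM accounts WHERE id = 1;
        | 
        |     -- Write back only if nobody updated it in the meantime.
        |     UPDATE accounts
        |     SET balance = 90, version = version + 1
        |     WHERE id = 1 AND version = 7;
        |     -- If the affected row count is 0, another transaction won
        |     -- the race: re-read and retry.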
        
         | amw-zero wrote:
         | Probably because it's more work, and it's logic that you have
         | to write and get correct. Whereas "for update" pretty much just
         | works.
         | 
         | But there are definitely many valid solutions, including just
         | setting the specific transaction to serializable and handling
         | transaction retries.
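          | 
          | A minimal sketch of the per-transaction variant (Postgres
          | syntax, hypothetical accounts table):
          | 
          |     BEGIN ISOLATION LEVEL SERIALIZABLE;
          |     SELECT balance FROM accounts WHERE id = 1;
          |     UPDATE accounts SET balance = balance - 10 WHERE id = 1;
          |     COMMIT;
          |     -- On conflict this fails with SQLSTATE 40001
          |     -- (serialization_failure); the client must re-run the
          |     -- whole transaction.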
        
           | GauntletWizard wrote:
            | _and handling retries_ is something I almost never see, and
            | even more rarely see done right.
        
           | gav wrote:
           | Optimistic locking is a great solution to avoid holding a
           | lock for an indeterminate period of time as you wait for a
           | user or external system to provide updates.
        
         | hinkley wrote:
          | I don't know about databases, but I've done something similar
          | when testing interactive web apps and it works quite well.
          | When your architecture supports it, at least.
         | 
         | Essentially a hidden field on the page that tracks interactions
         | and outstanding Ajax calls. When you assert that a field on the
         | page hasn't changed, you need to know that it's not still in
         | the process of painting. And just waiting "long enough" makes
         | for very slow testing overall.
        
         | skyde wrote:
          | Because a row version does not make it possible to update more
          | than one row atomically.
        
           | giovannibonetti wrote:
            | Yes it does: UPDATE t SET foo = 'abc' WHERE (id, version) IN
            | ((123, 1), (234, 2))
        
             | skyde wrote:
              | While true, this is not what I mean.
              | 
              | Say your goal is to transfer 10 dollars from account A to
              | account B, but only if account A's balance is larger than
              | 10 dollars.
              | 
              | You have to update account A to be 10 dollars less, and
              | you have to update account B to be 10 dollars more.
              | 
              | You can read both accounts, keep track of the current
              | version number for each, and then do:
              | 
              |     UPDATE Accounts
              |     SET Balance = CASE AccountID
              |                     WHEN 1 THEN Balance - 10
              |                     WHEN 2 THEN Balance + 10
              |                   END
              |     WHERE (AccountID, version) IN ((1, v1), (2, v2))
              | 
              | But if using the Read Uncommitted isolation level, I am
              | not sure this UPDATE would actually lock both rows until
              | commit.
        
               | skyde wrote:
                | You still have to do
                | 
                |     IF @@ROWCOUNT = 2
                |         COMMIT TRANSACTION;
                |     ELSE
                |         ROLLBACK TRANSACTION;
                | 
                | to make sure both rows have been updated, so it's a lot
                | simpler to explicitly take a lock on the rows.
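                | 
                | Putting the two pieces together, a sketch in SQL Server
                | syntax (table and version values are hypothetical; note
                | that SQL Server lacks row-value IN, so the version check
                | is spelled out):
                | 
                |     DECLARE @v1 int = 7, @v2 int = 3; -- versions we read
                | 
                |     BEGIN TRANSACTION;
                |     UPDATE Accounts
                |     SET Balance = CASE AccountID
                |                     WHEN 1 THEN Balance - 10
                |                     WHEN 2 THEN Balance + 10
                |                   END,
                |         Version = Version + 1
                |     WHERE (AccountID = 1 AND Version = @v1)
                |        OR (AccountID = 2 AND Version = @v2);
                | 
                |     IF @@ROWCOUNT = 2
                |         COMMIT TRANSACTION;
                |     ELSE
                |         ROLLBACK TRANSACTION; -- raced: re-read and retry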
        
               | giovannibonetti wrote:
                | You're right. Locking the rows or using serializable
                | isolation would be required to achieve an atomic
                | operation that applies only if both accounts are still
                | in the state we read beforehand.
        
       | trhway wrote:
       | >Hopefully it's clear why this is an issue. If you have an
       | important column value, say a user's account balance, you might
       | query multiple different values in the same transaction
       | 
        | and that is a feature, not a bug. During your transaction the
        | user withdrew money from his account. If you want to see the
        | same value, i.e. block that user from withdrawals while you
        | process his current balance, you take a lock on that row. This
        | is similar in many ways to concurrent programming with threads.
        
         | aphyr wrote:
         | The point of isolation levels is that the database manages
         | those locks (or other concurrency control mechanisms) for you.
        
           | wrs wrote:
           | Bearing in mind that "handling" the locks may mean a deadlock
           | is detected and the transaction is aborted by the DB, which
           | the application has to handle by retrying in a correct
           | manner. (Which is another point often missed in naive
           | application code.)
        
           | trhway wrote:
            | Not exactly - the database doesn't know whether you want to
            | block that account update or not. So, speaking very roughly,
            | it allows you to choose the general level of such blocking
            | behavior. It manages those locks very coarsely at
            | serializable, and some of those locks at repeatable read. At
            | read committed it manages basically only a small set of
            | locks, just enough to avoid dirty reads (plus of course all
            | the WAL/rollback/etc., which are the main features and the
            | point of the database, with the isolation levels being just
            | the icing on that cake). Isolation levels are a tool, and
            | you choose what is needed and suitable.
            | 
            | Serializable actually is a pain in the neck for any
            | meaningfully serious large enterprise application. The
            | decreased concurrency kills overall performance, and I
            | haven't seen any such application running in serializable.
            | For example, at my current job our customers run our
            | application on servers with low tens of TB of RAM (the
            | largest so far was 40TB) per single machine, with high
            | hundreds of cores and thousands of users - in serializable
            | that would slow to a crawl, as it wouldn't achieve the
            | concurrency the database on those machines is otherwise
            | capable of. While not a precise illustration, it still
            | communicates my point: serializable versus read committed
            | is like the global Python lock versus fine-grained locks in
            | normal languages.
        
         | jen20 wrote:
         | If you're storing a balance, rather than a ledger of the
         | transactions which caused that balance to be the case, you're
         | probably already off to a bad start when it comes to design.
         | 
          | Just imagine if you called your bank and asked them why your
          | balance was $500, and they said "I don't know, hit F5 and see
          | if it changes"...
        
           | gav wrote:
            | When it comes to this topic, almost all examples don't match
            | how things work in the real world, where systems are
            | eventually consistent.
           | 
           | For example, my bank lets me go negative and then if I don't
           | settle by the end of the day, overdraft protection will kick
           | in for a little more than that negative amount.
        
           | allknowingfrog wrote:
           | I briefly dabbled in financial ledgers in a previous job. We
           | made plans to do what you seem to be describing, which is to
           | store the transactions themselves and calculate a balance as
           | needed, but we frankly lacked expertise. I always wondered
           | what would happen if we scaled to millions of transactions in
           | that system.
           | 
           | Is there a standard strategy for optimizing balance queries
           | against a series of transactions? I assumed that we would
           | eventually have to cache some kind of snapshot for each
           | account to avoid the overhead of totaling every transaction
           | on every balance check.
        
       | agentultra wrote:
       | Generally, for a lot of application usage patterns, _read
       | committed_ is the default for a reason.
       | 
       | From the Postgres docs:
       | 
       | > _The partial transaction isolation provided by Read Committed
       | level is adequate for many applications, and this level is fast
       | and simple to use. However, for applications that do complex
       | queries and updates, it may be necessary to guarantee a more
       | rigorously consistent view of the database than the Read
       | Committed level provides._
       | 
        | Also, consider storing the transactions that change the balances
        | (credits, debits, etc.) in a ledger and calculating the balance.
        | You can avoid the complex update logic and keep your accountants
        | and auditors happy.
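        | 
        | A minimal sketch of that shape (hypothetical ledger table,
        | Postgres syntax):
        | 
        |     CREATE TABLE ledger (
        |         id         bigserial PRIMARY KEY,
        |         account_id bigint NOT NULL,
        |         amount     numeric NOT NULL, -- credits > 0, debits < 0
        |         created_at timestamptz NOT NULL DEFAULT now()
        |     );
        | 
        |     -- The balance is derived, never stored:
        |     SELECT COALESCE(SUM(amount), 0) AS balance
        |     FROM ledger
        |     WHERE account_id = 1;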
       | 
       | Stricter isolation levels are useful but have a lot of trade-offs
       | to be aware of.
       | 
        | Sometimes restructuring your data is a lot more effective than
        | trying to reason about locks and concurrency. The more locks you
        | have, the more points of contention, deadlocks, etc. you have to
        | be concerned about. Humans and general rules/folklore are alright
        | at avoiding some mistakes, but these systems get complicated
        | quickly and can hide a surprising number of errors.
       | 
       | As for testing these things, I've gotten a lot of good mileage
       | out of model checking. TLA+ and Alloy are better at finding
       | concurrency errors than I am at avoiding them.
        
         | karamanolev wrote:
          | Can you recommend a good article that goes over the ledger
         | pattern (the easy part) and the performant calculation of
         | balances as the ledger grows? I'm going to start working on a
         | warehouse management system in the coming months and that would
         | come in handy.
        
           | SoftTalker wrote:
            | I don't have a link, but when I've implemented systems with a
            | "ledger" of changes I would have a field that indicated
            | whether the ledger item had been "posted" to the affected
            | account(s), and therefore the balance was just the account
            | balance +/- any unposted ledger items.
           | 
           | I've had ledgers that were posted once a day, or more often,
           | up to nearly real-time.
           | 
           | But once a ledger item is posted, it should never be changed.
           | If an error is discovered, make a new ledger entry to correct
           | it.
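            | 
            | A sketch of the resulting balance query (hypothetical
            | schema):
            | 
            |     SELECT a.posted_balance
            |            + COALESCE(SUM(l.amount), 0) AS current_balance
            |     FROM accounts a
            |     LEFT JOIN ledger l
            |       ON l.account_id = a.id AND l.posted = false
            |     WHERE a.id = 1
            |     GROUP BY a.posted_balance;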
        
         | amw-zero wrote:
         | How can you store ledger operations without locking on the
         | balance? The race condition still exists - you can't accept a
         | withdrawal if the user doesn't have enough funds.
        
           | stetrain wrote:
            | You can use some sort of concurrency index on the ledger.
            | 
            | For example, you read the ledger entries for Account Id X.
            | 
            | Those ledger entries are numbered sequentially up to the
            | current index Y.
            | 
            | You calculate the current balance, approve the withdrawal,
            | and try to insert the next ledger entry as Y + 1.
            | 
            | If another operation has already inserted a Y + 1, your
            | insert fails and you throw an error or try again, starting
            | with the new ledger state and balance.
            | 
            | Basically optimistic concurrency, but knowing that you are
            | always appending to the ledger and that the index is always
            | incrementing.
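            | 
            | A sketch of that scheme (hypothetical schema; the primary
            | key on (account_id, seq) is what detects the race):
            | 
            |     CREATE TABLE ledger (
            |         account_id bigint  NOT NULL,
            |         seq        bigint  NOT NULL,
            |         amount     numeric NOT NULL,
            |         PRIMARY KEY (account_id, seq)
            |     );
            | 
            |     -- Read entries 1..Y (say Y = 42), compute the balance,
            |     -- approve the withdrawal, then append entry Y + 1:
            |     INSERT INTO ledger (account_id, seq, amount)
            |     VALUES (1, 43, -10);
            |     -- If a concurrent writer already inserted seq 43, this
            |     -- fails with a unique violation: reload and retry.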
        
       ___________________________________________________________________
       (page generated 2024-01-04 23:01 UTC)