[HN Gopher] Rediscovering Transaction Processing from History an...
___________________________________________________________________
Rediscovering Transaction Processing from History and First
Principles
Author : todsacerdoti
Score : 40 points
Date : 2024-07-23 10:52 UTC (1 day ago)
(HTM) web link (tigerbeetle.com)
(TXT) w3m dump (tigerbeetle.com)
| simonz05 wrote:
| In this X thread, Joran comments on how it all started with an
| HN comment back in 2019.
| https://x.com/jorandirkgreef/status/1815702485190774957
| jorangreef wrote:
| It was pretty surreal to sit next to someone at a dinner in NYC
| two months ago, be introduced, and realize that they're someone
| you had an HN exchange with 5 years ago.
|
| Here's the HN thread:
| https://news.ycombinator.com/item?id=20352439
|
| And the backstory (on how this led to TigerBeetle):
| https://x.com/jorandirkgreef/status/1788930243077853234
| mihaic wrote:
| I honestly don't understand why this isn't a Postgres extension.
| In what case is it better to have two databases?
|
| Realistically, I can't see a scenario where you need something
| like this but are also sure you won't need atomic operations
| that span both the database and the financial ledger.
| twic wrote:
| > We saw that there were greater gains to be had than settling
| for a Postgres extension or stored procedures.
|
| > Today, as we announce our Series A of $24 million
|
| Can't raise a $24 million Series A for a Postgres extension!
| mihaic wrote:
| That's true, but who are the executives that buy this and
| then force developers to create a monstrous architecture that
| has all sorts of race conditions outside of the ledger?
| jorangreef wrote:
| To be clear, TB moves the code to the data, rather than the
| data to the code, precisely so that you don't have "race
| conditions outside the ledger".
|
| Instead, all kinds of complicated debit/credit contracts
| (up to 8k financial transactions at a time, linked together
| atomically) can be expressed in a single request to the
| database, composed in terms of a rich set of debit/credit
| primitives (e.g. two-phase debit/credit with rollback after
| a timeout), to enforce financial consistency directly in
| the database.
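|
| As a concrete flavour of the two-phase primitive, here is a
| toy model of the semantics only (illustrative names; this is
| not TigerBeetle's client API):
|
|   # Toy model: two-phase debit/credit with rollback after
|   # a timeout. Illustrative only, not TigerBeetle's API.
|   import time
|
|   class Ledger:
|       def __init__(self):
|           self.balances = {}  # account id -> posted balance
|           self.pending = {}   # transfer id -> reservation
|
|       def create_pending(self, tid, debit, credit, amount,
|                          timeout_s):
|           # Phase 1: record the reservation with an expiry.
|           deadline = time.monotonic() + timeout_s
|           self.pending[tid] = (debit, credit, amount, deadline)
|
|       def post(self, tid):
|           # Phase 2: commit the reservation, unless expired.
|           debit, credit, amount, deadline = self.pending.pop(tid)
|           if time.monotonic() > deadline:
|               return False  # rolled back by the timeout
|           self.balances[debit] = \
|               self.balances.get(debit, 0) - amount
|           self.balances[credit] = \
|               self.balances.get(credit, 0) + amount
|           return True
|
|       def void(self, tid):
|           # Explicit rollback before the timeout fires.
|           self.pending.pop(tid, None)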
|
| On the other hand, moving the data to the code, to make
| decisions outside the OLTP database, was exactly the anti-
| pattern we wanted to fix in the central bank switch, which
| tried to implement debit/credit primitives on top of a
| general-purpose DBMS. It's really hard to get these things
| right on top of Postgres.
|
| And even if you get the primitives right, the performance
| is fundamentally limited by row locks interacting with RTTs
| and contention. Again, these row locks are not only
| external, but also internal (i.e. how I/O interacts with
| CPU inside the DBMS), which is why stored procedures or
| extensions aren't enough to fix the performance.
| jorangreef wrote:
| Hey mihaic! Thanks for the question.
|
| > I honestly don't understand why this isn't a Postgres
| extension.
|
| We considered a Postgres extension at the time (as well as
| stored procedures or even an embedded in-process DBMS).
|
| However, this wouldn't have moved the needle to where we needed
| it to be. Our internal design requirements (TB started as an
| internal project at Coil, contracting on a central bank switch)
| were literally a three-order-of-magnitude increase in
| performance--to keep up with where transaction workloads were
| going.
|
| While an extension or stored procedures would reduce external
| locking, the general-purpose DBMS design implementing them
| still tends to do far too much internal locking, interleaving
| disk I/O with CPU work and coupling resources. In contrast,
| TigerBeetle explicitly decouples disk I/O and CPU to amortize
| internal locking and so "pipeline in bulk" for mechanical
| sympathy. Think SIMD vectorization, but applied to state
| machine execution.
|
| For example, before TB's state machine executes 1 request of 8k
| transactions, all data dependencies are prefetched in advance
| (typically from L1/2/3 cache) so that the CPU becomes like a
| sprinter running the 100 meters. This suits extreme OLTP
| workloads where a few million debit/credit transactions need to
| be pushed through fewer than 10 accounts/rows (e.g. for a small
| central bank switch with 10 banks around the table). This is
| pathological for a general-purpose DBMS design, but easy for TB
| because hot accounts are hot in cache, and all locking (whether
| external or internal) is amortized across 8k transactions.
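|
| A rough sketch of that prefetch-then-execute shape (toy
| Python with hypothetical helpers, not TB's actual Zig
| internals):
|
|   # Illustrative: split a batch into a prefetch pass (all
|   # I/O up front, in bulk) and an execute pass (pure CPU,
|   # with every dependency already hot in cache).
|   def process_batch(transfers, storage, cache):
|       # Pass 1: gather every account the batch touches,
|       # then fetch the cache misses in one bulk read.
|       ids = {t.debit for t in transfers}
|       ids |= {t.credit for t in transfers}
|       missing = [i for i in ids if i not in cache]
|       cache.update(storage.read_accounts(missing))
|       # Pass 2: execute with zero I/O stalls; a hot
|       # account stays hot across the whole batch.
|       for t in transfers:
|           apply_transfer(cache[t.debit], cache[t.credit], t)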
|
| I spoke at QCon SF on this
| (https://www.youtube.com/watch?v=32LMicc0gRA) and matklad did
| two IronBeetle episodes walking through the code
| (https://www.youtube.com/watch?v=v5ThOoK3OFw&list=PL9eL-
| xg48O...).
|
| But the big problem with extensions or stored procedures is
| that they still tend to have a "one transaction at a time"
| mindset at the network layer. In other words, they don't
| typically amortize network requests beyond a 1:1 ratio of
| logical transaction to physical SQL transaction; they're not
| ergonomic if you want to pack a few thousand logical
| transactions in one physical query.
|
| On the other hand, TB's design is like "stored procedures meets
| group commit on steroids", packing up to 8k logical
| transactions in 1 physical query, and amortizing the costs not
| only of state machine execution (as described above) but also
| syscalls, networking and fsync (it's something roughly like 4
| syscalls, 4 memcopies and 4 network messages to execute 8k
| transactions--really hard for Postgres to match that).
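|
| Back-of-the-envelope, taking those figures at face value:
|
|   # Rough amortization math from the figures above.
|   batch = 8192             # "up to 8k" transfers per request
|   syscalls = 4             # per physical request
|   print(syscalls / batch)  # ~0.0005 syscalls per transfer
|   # vs. a 1:1 design: at least one syscall and one network
|   # round trip per transfer--thousands of times more
|   # per-transfer overhead.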
|
| Postgres is also nearly 30 years old. It's an awesome database,
| but hardware, software and research into how you would design a
| transaction processing database today have advanced
| significantly since then. For example, we wanted more safety
| around things like Fsyncgate by having an explicit storage
| fault model. We also wanted deterministic simulation testing
| and static memory allocation, and to follow NASA's Power of Ten
| Rules for Safety-Critical Code.
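|
| To give a flavour of deterministic simulation testing, here
| is a toy sketch (hypothetical helpers, nothing like the
| scale of TB's actual simulator):
|
|   # Toy deterministic simulation: a single seed drives
|   # message delivery, crashes and disk faults, so any
|   # failing run replays exactly from its seed.
|   import random
|
|   def simulate(seed, steps=10_000):
|       rng = random.Random(seed)
|       cluster = make_cluster(rng)  # hypothetical helper
|       for _ in range(steps):
|           event = rng.choice(
|               ["deliver", "drop", "crash", "corrupt"])
|           cluster.step(event, rng)
|           # e.g. assert that debits == credits still holds
|           cluster.check_invariants()
|
|   for seed in range(1_000):
|       simulate(seed)  # failures reproduce from the seed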
|
| A Postgres extension would have been a showstopper for these
| things, but these were the technical contributions that needed
| to be made.
|
| I also think that some of the most interesting performance
| innovations (static memory allocation, zero-deserialization,
| zero-context switches, zero-syscalls etc.) are coming out of
| HFT these days. For example, Martin Thompson's Evolution of
| Financial Exchange Architectures:
| https://www.youtube.com/watch?v=qDhTjE0XmkE
|
| HFT is a great precursor to see where OLTP is going, because
| the major contention problem of OLTP is mostly solved by HFT
| architectures, and because the arbitrage and volume of HFT is
| now moving into other sectors--as the world becomes more
| transactional.
|
| > In what case is it better to have two databases?
|
| Finally, regarding two databases, this was something we wanted
| to make explicit in the architecture. Not to "mix cash and
| customer records" in one general-purpose mutable filing
| cabinet, but rather to have "separation of concerns": the
| variable-length customer records in the general-purpose DBMS
| (or filing cabinet) in the control plane, and the cash in the
| immutable financial transactions database (or bank vault) in
| the data plane.
|
| See also: https://docs.tigerbeetle.com/coding/system-architecture
|
| It's the same reason you would want Postgres + S3, or Postgres
| + Redpanda. Postgres is perfect as a general-purpose or OLGP
| database, but it's not specialized for OLAP like DuckDB, or
| specialized for OLTP like TigerBeetle.
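|
| In application terms, the split can look like this
| (hypothetical handler; the client calls are placeholders,
| not the real Postgres or TigerBeetle driver APIs):
|
|   # Hypothetical handler: variable-length customer records
|   # go to Postgres (control plane); the money movement goes
|   # to the immutable ledger (data plane).
|   def signup_and_fund(pg, ledger, name, email, amount):
|       # Control plane: mutable, general-purpose records.
|       account_id = pg.insert_customer(name, email)
|       # Data plane: an immutable debit/credit transfer.
|       ledger.create_transfer(
|           debit_account=OPERATOR_ACCOUNT,  # placeholder
|           credit_account=account_id,
|           amount=amount)
|       return account_id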
|
| Again, appreciate the question and happy to answer more!
| mihaic wrote:
| Thanks for taking the time for the explanation and the
| rundown on the architecture. It sounds a bit like an LMAX
| disruptor for a DB, which honestly is quite a natural
| approach to performance. Kudos for the Zig implementation
| as well; I've never seen such a serious project in it.
|
| Personally, I still see challenges in developing on top of a
| system with data in two places unless there's a nice way to
| sync between them, and I would have seen the
| mutable/immutable classification as more of an "unlogged"
| vs. "changes fully logged in the DB" distinction, but I'm
| just doing armchair analysis here.
| jorangreef wrote:
| Huge pleasure! :)
|
| Exactly, the Martin Thompson talk I linked above is about
| the LMAX architecture. He gave this at QCon London I think
| in May 2020 and we were designing TigerBeetle in July 2020,
| pretty much lapping this up (I'd been a fan of Thompson's
| Mechanical Sympathy blog already for a few years by this
| point).
|
| I think the way to see this is not as "two places for the
| same type of data" but rather as "separation of concerns
| for radically different types of data" with different
| compliance/retention/mutability/access/performance/scale
| characteristics.
|
| It's also a natural architecture, and nothing new: it's how
| you would probably want to architect the "core" of a core
| banking system. We literally lifted the design for
| TigerBeetle directly out of the central bank switch's
| internal core, so that it would be dead simple to "heart
| transplant" it back in later.
|
| The surprising thing though, was when small fintech
| startups, energy and gaming companies started reaching out.
| The primitives are easy to build with and unlock
| significantly more scale. Again, like using object storage
| in addition to Postgres is probably a good idea.
| jorangreef wrote:
| Joran from TigerBeetle here! Really stoked to see a bit of Jim
| Gray history on the front page, and happy to dive into how TB's
| consensus and storage engine implement these ideas (starting
| from main!
| https://github.com/tigerbeetle/tigerbeetle/blob/main/src/tig...).
___________________________________________________________________
(page generated 2024-07-24 23:03 UTC)