[HN Gopher] Systems Correctness Practices at AWS: Leveraging For...
___________________________________________________________________
Systems Correctness Practices at AWS: Leveraging Formal and Semi-
Formal Methods
Author : yarapavan
Score : 109 points
Date : 2025-04-01 14:59 UTC (8 hours ago)
(HTM) web link (queue.acm.org)
(TXT) w3m dump (queue.acm.org)
| OhMeadhbh wrote:
| I find this highly unlikely. My first day at Amazon I encountered
| an engineer puzzling over a perfect sine wave in a graph. After
| looking at the scale I made the comment "oh. you're using 5
| minute metrics." Their response was "how could you figure that
| out just by looking at the graph?" When I replied "Queuing theory
| and control theory," their response was "what's that?"
|
| Since then, Amazon's hiring practices have only gotten worse. Can
| you invert a tree? Can you respond "tree" or "hash map" when
| you're asked what is the best data structure for a specific
| situation? Can you solve a riddle or code an ill-explained
| l33tcode problem? Are you sure you can parse HTML with regexes?
| You're Amazon material.
|
| Did you pay attention to the lecture about formal proofs. TLA+ or
| Coq/Kami? That's great, but it won't help you get a job at
| Amazon.
|
| The idea that formal proofs are used anywhere but the most
| obscure corners of AWS is laughable.
|
| Although... it _is_ a nice paper. Props to Amazon for supporting
| Ph.D. 's doing pure research that will never impact AWS' systems
| or processes.
| georgeburdell wrote:
| A graph of what?
| scubbo wrote:
| They specifically mention "the P team at AWS". The two
| following things are perfectly able to be simultaneously true:
|
| * The average Amazon engineer is not expected to have awareness
| of CS fundamentals that go beyond LeetCode-y challenges
|
| * The average Amazon engineer builds using tools which are each
| developed by a small core team who _are_ expected to know those
| fundamentals, and who package up those concepts to be usefully
| usable without full understanding
|
| (I did 10 years on CDO, and my experience matches yours when
| interacting with my peers, but every time I interacted with the
| actual Builder Tools teams I was very aware that lived in
| different worlds)
| nextos wrote:
| > The average Amazon engineer is not expected to have
| awareness of CS fundamentals that go beyond LeetCode-y
| challenges
|
| I find this a bit unsettling. There are dozens of _great_ CS
| schools in the US. Even non-elite BSc programs in EU
| sometimes teach formal methods.
|
| There are also some good introductory books now, e.g. [1].
| Perhaps its time to interview more on concepts and less on
| algorithmic tricks favored by LeetCode?
|
| I doubt current undergrads can't go beyond LeetCode-like
| challenges.
|
| [1] Formal Methods, An Appetizer.
| https://link.springer.com/book/10.1007/978-3-030-05156-3
| retrocryptid wrote:
| Modern CS programs teach to what they perceive to be the
| interview their students will encounter after graduation:
| what is a tree data structure, how to craft a SQL query and
| how to calculate a CRC with a python library. More advanced
| CS/CE departments still teach discrete math and
| compilers/parsing for students intending to go to grad
| schools.
|
| My experience with recent CS grads is it's easier to hire
| Art and Political Science grads and take the time to teach
| them programming and all it's fundamentals. At least they
| won't argue with you when you tell them not to use regexes
| to parse HTML.
| Dichlorodiphen wrote:
| We recently did a proof of concept with P for our system, and
| the reception was warm enough that I expect adoption within the
| year. I wouldn't exactly call us obscure, at least in the sense
| used above--greenfield big data system with a bog-standard mix
| of AWS services.
|
| I will say that time to productivity with P is low, and, in my
| experience, it emphasizes practical applicability more so than
| traditional modeling languages like TLA+ (this point is perhaps
| somewhat unsubstantiated, but the specific tooling we used is
| still internal). That is to say, it is fairly easy to sell to
| management, and I can see momentum already starting to pick up
| internally. No Ph.D. in formal methods required.
|
| And re: the hiring bar, I agree that the bulk of the
| distribution lies a bit further left than one would
| hope/imagine, but there is a long tail, and it only takes a
| handful of engineers to spearhead a project like this. For
| maintainability, we are betting on the approachable learning
| curve, but time will tell.
| Taikonerd wrote:
| The problem (in my mind) with using P is that only a small %
| of problems are formalizable as communicating state machines.
|
| In your greenfield big data system, what were the state
| machines? What states could they be in?
| matu3ba wrote:
| Thanks for the interesting insight. 1. What would you
| recommend to model multi-thread or multi-process problems on
| the same system? Is there a good read to give
| recommendations? 2. Is there a good overview on trade-offs
| between CSP, Actor model and other methods for this? 3. Are
| you aware of any process group models?
| gqgs wrote:
| Cool anecdote but I'm not sure how reasonable it is to expect
| every person to have expert domain knowledge and recall of
| every single computer science field just because they got a job
| to work at Amazon or any other MAANG company.
| nullorempty wrote:
| And what teams use these methods exactly?
| riknos314 wrote:
| The article directly mentions several.
|
| > Teams across AWS that build some of its flagship products--
| from storage (e.g., Amazon S3, EBS), to databases (e.g., Amazon
| DynamoDB, MemoryDB, Aurora), to compute (e.g., EC2, IoT)--have
| been using P to reason about the correctness of their system
| designs.
|
| Later it more specifically mentions these (I probably missed a
| few):
|
| S3 index, S3 ShardStore, Firecracker VMM, Aurora DB engine
| commit protocol, The RSA implementation on Graviton2
|
| (EDIT: formatting)
| yarapavan wrote:
| If you are wondering "why do formal methods at all?" or "how
| is AWS using P to gain confidence in correctness of their
| services?", the following re:Invent 2023 talk answers this
| question, provides an overview of P, and its impact inside
| AWS: (Re:Invent 2023 Talk) Gain confidence in system
| correctness & resilience with Formal Methods (Finding
| Critical Bugs Early!!) -
| https://www.youtube.com/watch?v=FdXZXnkMDxs
|
| Github: https://github.com/p-org/P
| pera wrote:
| > we also sought a language that would allow us to model check
| (and later prove) key aspects of systems designs while being more
| approachable to programmers.
|
| I find it a bit surprising that TLA+ with PlusCal can be
| challenging to learn for your average software engineer, could
| anyone with experience in P show an example of something that can
| be difficult to express in TLA+ which is significantly easier in
| P?
| jlcases wrote:
| I've noticed that the formalization of methods described by AWS
| parallels what we need in technical documentation. Complex
| systems require not just formal verification but also structured
| documentation following MECE principles (Mutually Exclusive,
| Collectively Exhaustive).
|
| In my experience, the interfaces between components (where most
| errors occur) are exactly where fragmented documentation fails. I
| implemented a hierarchical documentation system for my team that
| organizes knowledge as a conceptual tree, and the accuracy of
| code generation with AI assistants improved notably.
|
| Formal verification tools and structured documentation are
| complementary: verification ensures algorithmic correctness while
| MECE documentation guarantees conceptual and contextual
| correctness. I wonder if AWS has experimented with structured
| documentation systems specifically for AI, or if this remains an
| area to explore.
| jjmarr wrote:
| Could you explain how your "hierarchical documentation system"
| works? Does the tree parallel the code? Is there a particular
| tool that you use?
| jkaptur wrote:
| I'm interested in learning more about the benefits of MECE -
| I've never heard that before. In particular, it seems radically
| different from Divio's system [0], which presents the same
| information in many different ways.
|
| 0: https://docs.divio.com/documentation-system/
| zhengyi13 wrote:
| Is this not https://diataxis.fr/ ?
|
| (I'm engaged somewhat in trying to get our team to write
| _any_ documentation; once I 've got that, I'll start trying
| to organize along exactly these principles)
| csbartus wrote:
| Formalization, correctness is about models. [1]
|
| There are formal methods where the underlying model is
| mathematically sound. There are semi-formal methods where the
| underlying model is structured but not proven to be sound.
|
| For example, in your case ("organizes knowledge") a formal
| method is ologs from category theory. That method assures that
| the concepts and their relationship in your knowledge base are
| mathematically correct.
|
| When you want to transfer that knowledge base into a
| documentation system you might want look for a mathematically
| sound model, but I'm afraid there is no such model, so what's
| left is a semi-formal method / a likely-correct model.
|
| Right now I'm building such a likely-correct model for writing,
| contact me for more info.
|
| [1] https://www.osequi.com/slides/formalism-correctness-
| cost/for...
| csbartus wrote:
| I've recently created a likely-correct piece of software based on
| these principles.
|
| https://www.osequi.com/studies/list/list.html
|
| The structure (ontology, taxonomy) is created with ologs, a
| formal method from category theory. The behavior (choreography)
| is created with a semi-formal implementation (XState) of a formal
| method (Finite State Machines)
|
| The user-facing aspect of the software is designed with Concept
| Design, a semi-formal method from MIT CSAIL.
|
| Creating software with these methods is refreshing and fun. Maybe
| one day we can reach Tonsky's "Diagrams are code" vision.
|
| https://tonsky.me/blog/diagrams/
| matu3ba wrote:
| "Diagrams are code" exists as ladder diagrams for PLC, from
| which Structured Text can be derived, which typically are used
| by PICs for simple to estimate time behavior in very time-
| sensitive/critical (real-time) control applications.
|
| Stack + dynamic memory and other system resource modeling would
| need proper models with especially memory life-time
| visualization being an open research problem, let alone
| resource tracking being unsolved (Valgrind offers no flexible
| API, license restrictions, incomplete and other platforms than
| Linux are less complete etc).
|
| Reads and writes conditioned on arithmetic have all the issues
| related to complex arithmetic and I am unaware of (time) models
| for speculation or instruction re-ordering, let alone applied
| compiler optimizations.
| gqgs wrote:
| A key concern I've consistently had regarding formal verification
| systems is: how does one confirm the accuracy of the verifier
| itself?
|
| This issue appears to present an intrinsically unsolvable
| problem, implying that a formally verified system could still
| contain bugs due to potential issues in the verification
| software.
|
| While this perspective doesn't necessarily render formal
| verification impractical, it does introduce certain caveats that,
| in my experience, are not frequently addressed in discussions
| about these systems.
| whattheheckheck wrote:
| Yeah it reminds me of
| https://en.m.wikipedia.org/wiki/Tarski%27s_undefinability_th...
| alexisread wrote:
| I'd guess the best approach is similar to security - minimal
| TCB (trusted computing base) that you can verify by hand, and
| then construct your proof checker on top, and have it verify
| itself. Proof and type checkers have a lot of overlap, so you
| can gain coverage via the language types.
|
| I can't recall exactly, but I think CakeML
| (https://cakeml.org/jfp14.pdf) had a precursor lisp prover,
| written in a verified lisp, so would be amenable to this
| approach.
|
| EDIT: think it was this one
| https://www.schemeworkshop.org/2016/verified-lisp-2016-schem...
| guerrilla wrote:
| This is correct. That's why you have core type theories. Very
| small, easy to understand and implement in a canonocal way.
| Other languages can be compiled to them.
| csbartus wrote:
| Perhaps there is no such thing like absolute truth.
|
| In category theory / ologs, a formal method for knowledge
| representation, the result is always mathematically sound, yet
| ologs are prefixed with "According to the author's world view
| ..."
|
| On the other way truth, even if it's not absolute, it's very
| expensive.
|
| Lately AWS advocates a middle ground, the lightweight formal
| methods, which are much cheaper than formal methods yet deliver
| good enough correctness for their business case.
|
| In the same way MIT CSAIL's Daniel Jackson advocates a semi-
| formal approach to design likely-correct apps (Concept design)
|
| It seems there is a push for better correctness in software,
| without the aim of perfectionism.
___________________________________________________________________
(page generated 2025-04-01 23:00 UTC)