[HN Gopher] Systems Correctness Practices at AWS: Leveraging For...
       ___________________________________________________________________
        
       Systems Correctness Practices at AWS: Leveraging Formal and Semi-
       Formal Methods
        
       Author : yarapavan
       Score  : 109 points
       Date   : 2025-04-01 14:59 UTC (8 hours ago)
        
 (HTM) web link (queue.acm.org)
 (TXT) w3m dump (queue.acm.org)
        
       | OhMeadhbh wrote:
       | I find this highly unlikely. My first day at Amazon I encountered
       | an engineer puzzling over a perfect sine wave in a graph. After
       | looking at the scale I made the comment "oh. you're using 5
       | minute metrics." Their response was "how could you figure that
       | out just by looking at the graph?" When I replied "Queuing theory
       | and control theory," their response was "what's that?"
       | 
       | Since then, Amazon's hiring practices have only gotten worse. Can
       | you invert a tree? Can you respond "tree" or "hash map" when
       | you're asked what is the best data structure for a specific
       | situation? Can you solve a riddle or code an ill-explained
       | l33tcode problem? Are you sure you can parse HTML with regexes?
       | You're Amazon material.
       | 
       | Did you pay attention to the lecture about formal proofs. TLA+ or
       | Coq/Kami? That's great, but it won't help you get a job at
       | Amazon.
       | 
       | The idea that formal proofs are used anywhere but the most
       | obscure corners of AWS is laughable.
       | 
       | Although... it _is_ a nice paper. Props to Amazon for supporting
       | Ph.D. 's doing pure research that will never impact AWS' systems
       | or processes.
        
         | georgeburdell wrote:
         | A graph of what?
        
         | scubbo wrote:
         | They specifically mention "the P team at AWS". The two
         | following things are perfectly able to be simultaneously true:
         | 
         | * The average Amazon engineer is not expected to have awareness
         | of CS fundamentals that go beyond LeetCode-y challenges
         | 
         | * The average Amazon engineer builds using tools which are each
         | developed by a small core team who _are_ expected to know those
         | fundamentals, and who package up those concepts to be usefully
         | usable without full understanding
         | 
         | (I did 10 years on CDO, and my experience matches yours when
         | interacting with my peers, but every time I interacted with the
         | actual Builder Tools teams I was very aware that lived in
         | different worlds)
        
           | nextos wrote:
           | > The average Amazon engineer is not expected to have
           | awareness of CS fundamentals that go beyond LeetCode-y
           | challenges
           | 
           | I find this a bit unsettling. There are dozens of _great_ CS
           | schools in the US. Even non-elite BSc programs in EU
           | sometimes teach formal methods.
           | 
           | There are also some good introductory books now, e.g. [1].
           | Perhaps its time to interview more on concepts and less on
           | algorithmic tricks favored by LeetCode?
           | 
           | I doubt current undergrads can't go beyond LeetCode-like
           | challenges.
           | 
           | [1] Formal Methods, An Appetizer.
           | https://link.springer.com/book/10.1007/978-3-030-05156-3
        
             | retrocryptid wrote:
             | Modern CS programs teach to what they perceive to be the
             | interview their students will encounter after graduation:
             | what is a tree data structure, how to craft a SQL query and
             | how to calculate a CRC with a python library. More advanced
             | CS/CE departments still teach discrete math and
             | compilers/parsing for students intending to go to grad
             | schools.
             | 
             | My experience with recent CS grads is it's easier to hire
             | Art and Political Science grads and take the time to teach
             | them programming and all it's fundamentals. At least they
             | won't argue with you when you tell them not to use regexes
             | to parse HTML.
        
         | Dichlorodiphen wrote:
         | We recently did a proof of concept with P for our system, and
         | the reception was warm enough that I expect adoption within the
         | year. I wouldn't exactly call us obscure, at least in the sense
         | used above--greenfield big data system with a bog-standard mix
         | of AWS services.
         | 
         | I will say that time to productivity with P is low, and, in my
         | experience, it emphasizes practical applicability more so than
         | traditional modeling languages like TLA+ (this point is perhaps
         | somewhat unsubstantiated, but the specific tooling we used is
         | still internal). That is to say, it is fairly easy to sell to
         | management, and I can see momentum already starting to pick up
         | internally. No Ph.D. in formal methods required.
         | 
         | And re: the hiring bar, I agree that the bulk of the
         | distribution lies a bit further left than one would
         | hope/imagine, but there is a long tail, and it only takes a
         | handful of engineers to spearhead a project like this. For
         | maintainability, we are betting on the approachable learning
         | curve, but time will tell.
        
           | Taikonerd wrote:
           | The problem (in my mind) with using P is that only a small %
           | of problems are formalizable as communicating state machines.
           | 
           | In your greenfield big data system, what were the state
           | machines? What states could they be in?
        
           | matu3ba wrote:
           | Thanks for the interesting insight. 1. What would you
           | recommend to model multi-thread or multi-process problems on
           | the same system? Is there a good read to give
           | recommendations? 2. Is there a good overview on trade-offs
           | between CSP, Actor model and other methods for this? 3. Are
           | you aware of any process group models?
        
         | gqgs wrote:
         | Cool anecdote but I'm not sure how reasonable it is to expect
         | every person to have expert domain knowledge and recall of
         | every single computer science field just because they got a job
         | to work at Amazon or any other MAANG company.
        
       | nullorempty wrote:
       | And what teams use these methods exactly?
        
         | riknos314 wrote:
         | The article directly mentions several.
         | 
         | > Teams across AWS that build some of its flagship products--
         | from storage (e.g., Amazon S3, EBS), to databases (e.g., Amazon
         | DynamoDB, MemoryDB, Aurora), to compute (e.g., EC2, IoT)--have
         | been using P to reason about the correctness of their system
         | designs.
         | 
         | Later it more specifically mentions these (I probably missed a
         | few):
         | 
         | S3 index, S3 ShardStore, Firecracker VMM, Aurora DB engine
         | commit protocol, The RSA implementation on Graviton2
         | 
         | (EDIT: formatting)
        
           | yarapavan wrote:
           | If you are wondering "why do formal methods at all?" or "how
           | is AWS using P to gain confidence in correctness of their
           | services?", the following re:Invent 2023 talk answers this
           | question, provides an overview of P, and its impact inside
           | AWS: (Re:Invent 2023 Talk) Gain confidence in system
           | correctness & resilience with Formal Methods (Finding
           | Critical Bugs Early!!) -
           | https://www.youtube.com/watch?v=FdXZXnkMDxs
           | 
           | Github: https://github.com/p-org/P
        
       | pera wrote:
       | > we also sought a language that would allow us to model check
       | (and later prove) key aspects of systems designs while being more
       | approachable to programmers.
       | 
       | I find it a bit surprising that TLA+ with PlusCal can be
       | challenging to learn for your average software engineer, could
       | anyone with experience in P show an example of something that can
       | be difficult to express in TLA+ which is significantly easier in
       | P?
        
       | jlcases wrote:
       | I've noticed that the formalization of methods described by AWS
       | parallels what we need in technical documentation. Complex
       | systems require not just formal verification but also structured
       | documentation following MECE principles (Mutually Exclusive,
       | Collectively Exhaustive).
       | 
       | In my experience, the interfaces between components (where most
       | errors occur) are exactly where fragmented documentation fails. I
       | implemented a hierarchical documentation system for my team that
       | organizes knowledge as a conceptual tree, and the accuracy of
       | code generation with AI assistants improved notably.
       | 
       | Formal verification tools and structured documentation are
       | complementary: verification ensures algorithmic correctness while
       | MECE documentation guarantees conceptual and contextual
       | correctness. I wonder if AWS has experimented with structured
       | documentation systems specifically for AI, or if this remains an
       | area to explore.
        
         | jjmarr wrote:
         | Could you explain how your "hierarchical documentation system"
         | works? Does the tree parallel the code? Is there a particular
         | tool that you use?
        
         | jkaptur wrote:
         | I'm interested in learning more about the benefits of MECE -
         | I've never heard that before. In particular, it seems radically
         | different from Divio's system [0], which presents the same
         | information in many different ways.
         | 
         | 0: https://docs.divio.com/documentation-system/
        
           | zhengyi13 wrote:
           | Is this not https://diataxis.fr/ ?
           | 
           | (I'm engaged somewhat in trying to get our team to write
           | _any_ documentation; once I 've got that, I'll start trying
           | to organize along exactly these principles)
        
         | csbartus wrote:
         | Formalization, correctness is about models. [1]
         | 
         | There are formal methods where the underlying model is
         | mathematically sound. There are semi-formal methods where the
         | underlying model is structured but not proven to be sound.
         | 
         | For example, in your case ("organizes knowledge") a formal
         | method is ologs from category theory. That method assures that
         | the concepts and their relationship in your knowledge base are
         | mathematically correct.
         | 
         | When you want to transfer that knowledge base into a
         | documentation system you might want look for a mathematically
         | sound model, but I'm afraid there is no such model, so what's
         | left is a semi-formal method / a likely-correct model.
         | 
         | Right now I'm building such a likely-correct model for writing,
         | contact me for more info.
         | 
         | [1] https://www.osequi.com/slides/formalism-correctness-
         | cost/for...
        
       | csbartus wrote:
       | I've recently created a likely-correct piece of software based on
       | these principles.
       | 
       | https://www.osequi.com/studies/list/list.html
       | 
       | The structure (ontology, taxonomy) is created with ologs, a
       | formal method from category theory. The behavior (choreography)
       | is created with a semi-formal implementation (XState) of a formal
       | method (Finite State Machines)
       | 
       | The user-facing aspect of the software is designed with Concept
       | Design, a semi-formal method from MIT CSAIL.
       | 
       | Creating software with these methods is refreshing and fun. Maybe
       | one day we can reach Tonsky's "Diagrams are code" vision.
       | 
       | https://tonsky.me/blog/diagrams/
        
         | matu3ba wrote:
         | "Diagrams are code" exists as ladder diagrams for PLC, from
         | which Structured Text can be derived, which typically are used
         | by PICs for simple to estimate time behavior in very time-
         | sensitive/critical (real-time) control applications.
         | 
         | Stack + dynamic memory and other system resource modeling would
         | need proper models with especially memory life-time
         | visualization being an open research problem, let alone
         | resource tracking being unsolved (Valgrind offers no flexible
         | API, license restrictions, incomplete and other platforms than
         | Linux are less complete etc).
         | 
         | Reads and writes conditioned on arithmetic have all the issues
         | related to complex arithmetic and I am unaware of (time) models
         | for speculation or instruction re-ordering, let alone applied
         | compiler optimizations.
        
       | gqgs wrote:
       | A key concern I've consistently had regarding formal verification
       | systems is: how does one confirm the accuracy of the verifier
       | itself?
       | 
       | This issue appears to present an intrinsically unsolvable
       | problem, implying that a formally verified system could still
       | contain bugs due to potential issues in the verification
       | software.
       | 
       | While this perspective doesn't necessarily render formal
       | verification impractical, it does introduce certain caveats that,
       | in my experience, are not frequently addressed in discussions
       | about these systems.
        
         | whattheheckheck wrote:
         | Yeah it reminds me of
         | https://en.m.wikipedia.org/wiki/Tarski%27s_undefinability_th...
        
         | alexisread wrote:
         | I'd guess the best approach is similar to security - minimal
         | TCB (trusted computing base) that you can verify by hand, and
         | then construct your proof checker on top, and have it verify
         | itself. Proof and type checkers have a lot of overlap, so you
         | can gain coverage via the language types.
         | 
         | I can't recall exactly, but I think CakeML
         | (https://cakeml.org/jfp14.pdf) had a precursor lisp prover,
         | written in a verified lisp, so would be amenable to this
         | approach.
         | 
         | EDIT: think it was this one
         | https://www.schemeworkshop.org/2016/verified-lisp-2016-schem...
        
           | guerrilla wrote:
           | This is correct. That's why you have core type theories. Very
           | small, easy to understand and implement in a canonocal way.
           | Other languages can be compiled to them.
        
         | csbartus wrote:
         | Perhaps there is no such thing like absolute truth.
         | 
         | In category theory / ologs, a formal method for knowledge
         | representation, the result is always mathematically sound, yet
         | ologs are prefixed with "According to the author's world view
         | ..."
         | 
         | On the other way truth, even if it's not absolute, it's very
         | expensive.
         | 
         | Lately AWS advocates a middle ground, the lightweight formal
         | methods, which are much cheaper than formal methods yet deliver
         | good enough correctness for their business case.
         | 
         | In the same way MIT CSAIL's Daniel Jackson advocates a semi-
         | formal approach to design likely-correct apps (Concept design)
         | 
         | It seems there is a push for better correctness in software,
         | without the aim of perfectionism.
        
       ___________________________________________________________________
       (page generated 2025-04-01 23:00 UTC)