[HN Gopher] Why the OpenSSL punycode vulnerability was not detec...
       ___________________________________________________________________
        
       Why the OpenSSL punycode vulnerability was not detected by fuzz
       testing (2022)
        
       Author : lemper
       Score  : 86 points
       Date   : 2024-04-17 07:14 UTC (1 day ago)
        
 (HTM) web link (allsoftwaresucks.blogspot.com)
 (TXT) w3m dump (allsoftwaresucks.blogspot.com)
        
       | _flux wrote:
       | I suppose it would be nice to require 100% line coverage for
       | encryption-related functionality when it is tested by a
       | fuzzer. Deviations from this could easily be detected in CI.
       | 
       | Testing cannot be used to prove that a flaw doesn't exist, only
       | that it does.
        
         | Retr0id wrote:
         | > Testing cannot be used to prove that a flaw doesn't exist
         | 
         | This is not universally true, formal methods can take you a
         | long way, depending on the problem domain.
         | 
         | And sometimes you can test for every possible input. This
         | usually isn't feasible, but it's nice when it is.
        
           | dllthomas wrote:
           | Formal methods aren't "testing".
           | 
           | As you say, though, exhaustive testing is sometimes possible.
        
             | bluGill wrote:
             | It is normally safe to assume that exhaustive testing isn't
             | possible, because the total number of states to test
             | exceeds the number of atoms in the universe. There are a
             | few exceptions, but in general we should assume it is
             | impossible to exhaustively test programs. Which means we
             | need to use something else to find what the limits are and
             | test only those, or use formal methods (both would be my
             | recommendation - though I'll admit I have yet to figure out
             | how to use formal methods).
        
               | dllthomas wrote:
               | > in general we should assume it is always impossible to
               | exhaustively test programs
               | 
               | The whole program? Yes.
               | 
               | For testing individual components, there's no need to
               | assume. The answer is very likely either clearly yes (the
               | input is a handful of booleans) or clearly no (the input
               | is several integers or even unbounded) and for those
               | cases where it's not immediately clear (one i16? Three
               | u8s?) it's probably still not hard to think through for
               | your particular case in your particular context.
        
               | mananaysiempre wrote:
               | Brute forcing a single 32-bit input has also been
               | usefully performed in some cases, such as for single-
               | precision implementations of special functions[1].
               | 
               | [1] https://randomascii.wordpress.com/2014/01/27/theres-
               | only-fou...
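               | 
               | As a concrete sketch of that approach (just the general
               | idea, not the article's exact code; link with -lm):
               | compare a single-precision function against a double-
               | precision reference for every one of the 2^32 bit
               | patterns.
               | 
               |   /* Exhaustively check ceilf() against a double
               |      reference for every 32-bit pattern. */
               |   #include <math.h>
               |   #include <stdint.h>
               |   #include <stdio.h>
               |   #include <string.h>
               | 
               |   int main(void)
               |   {
               |       uint64_t bad = 0;
               |       uint32_t bits = 0;
               |       do {
               |           float x, got, want;
               |           memcpy(&x, &bits, sizeof x); /* reinterpret */
               |           got = ceilf(x);
               |           want = (float)ceil((double)x);
               |           /* NaN == NaN counts as a match */
               |           if (!(isnan(got) && isnan(want)) && got != want)
               |               bad++;
               |       } while (bits++ != UINT32_MAX);
               |       printf("%llu mismatches\n",
               |              (unsigned long long)bad);
               |       return 0;
               |   }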
        
               | bluGill wrote:
               | 32 bits is a bit over 4 billion states, so that is
               | doable. However, a lot of algorithms have more than 32
               | bits of state and so quickly get beyond what is doable.
        
             | Retr0id wrote:
             | I agree that it's not testing in the strict sense, but
             | fuzzing isn't really testing either.
             | 
             | The work done by a theorem prover and a fuzzer isn't all
             | that different (and sufficiently advanced fuzzers use SMT
             | solvers etc.)
        
               | NovemberWhiskey wrote:
               | This seems to neglect the fundamental difference between
               | static and dynamic analysis. Formal methods approaches
               | don't usually need me to run my code.
        
               | _flux wrote:
               | I guess I need to disagree :). Why isn't fuzzing testing?
               | Ultimately it ends up running the code and seeing if it
               | works--for some definition of "working", that is, it
               | doesn't crash--even if it may pick its inputs by
               | considering how the code behaves while running it.
               | 
               | Also, how is even "smart fuzzing" similar to theorem
               | provers in any way? Fuzzers still think in terms of runs,
               | but a theorem prover considers all cases. The best fuzzer
               | wouldn't be able to tell if a sort algorithm works for
               | all inputs, presuming the input size is unbounded.
               | Actually, my understanding is that a fuzzer wouldn't be
               | able to prove that an identity function works for all
               | kinds of objects it may be given.
        
           | bluGill wrote:
           | Formal methods catch a different class of problem from other
           | tests and so are worth it. However they cannot catch
           | everything. The classic example: "midpoint = (high + low) /
           | 2" is easy to formally prove correct - but in fact it isn't
           | correct because high + low could overflow.
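           | 
           | A quick illustration in C (a hypothetical binary-search-
           | style midpoint; the standard fix is the
           | low + (high - low) / 2 rewrite):
           | 
           |   #include <limits.h>
           |   #include <stdio.h>
           | 
           |   int main(void)
           |   {
           |       int low = INT_MAX - 2, high = INT_MAX;
           |       /* correct over the unbounded integers, but overflows
           |          int here (undefined behaviour):
           |          int mid = (high + low) / 2;                    */
           |       int mid = low + (high - low) / 2;  /* safe rewrite */
           |       printf("%d\n", mid);
           |       return 0;
           |   }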
        
             | dinosaurdynasty wrote:
             | Any formal system worth its salt is going to model
             | overflow.
        
               | Jtsummers wrote:
               | And with SPARK/Ada it will. When I was learning it that
               | was actually a very common thing that it caught in my
               | code (various forms of over/underflow errors). In
               | practice most of them would not occur, but a few could
               | have under real-world use-cases.
        
             | thfuran wrote:
             | If you've formally proved that correct, then either it
             | can't overflow or there's a massive bug in the prover.
        
             | mathgradthrow wrote:
             | I don't mean to be rude, but why are you commenting on
             | formal methods when you clearly have no idea what they are?
        
               | bluGill wrote:
               | The specific example I gave was from a 1982 paper (I
               | forget who wrote it, but a big name in computer science)
               | where it was used to prove an algorithm correct and thus
               | argue that it didn't need to be tested.
               | 
               | The famous Donald Knuth has also stated "Beware of bugs
               | in the above code; I have only proved it correct, not
               | tried it.", which again hints that there are limits to
               | proved programs.
               | 
               | I'll admit that I don't know much about formal methods;
               | what I know intrigues me - but I can't figure out how to
               | use them in the real world. However, I know enough to
               | know that they are not perfect.
        
               | Jtsummers wrote:
               | You use them judiciously; you learn their capabilities
               | and limitations and the scope of their applicability.
               | 
               | TLA+ is very useful, but it's not going to prove out your
               | _software_ itself; that will require testing or other
               | methods. It will help prove out your _specification_.
               | I've used it to good effect both for designing
               | distributed and concurrent systems and for identifying
               | issues in existing systems (without an existing formal
               | spec, by creating it and then playing with the invariants
               | and other aspects of the TLA+ spec). SPARK/Ada, as I
               | mentioned
               | in my other comment, will detect many problems, and help
               | you prove the absence of many others. But it's also not
               | suitable (yet) for all projects, or more than a subset of
               | a project, since it has specific limitations on what it
               | can prove and is only compatible with a subset of Ada
               | (this is being expanded each year). The subset of Ada it
               | does apply to is actually quite significant and useful,
               | but you have to be capable of understanding that it is
               | limited and then apply it within that limited scope.
               | 
               | This is true for every other formal method approach I've
               | seen. Pay attention, read, ask questions, and don't judge
               | them all by one paper you half remember from 42 years
               | ago.
        
             | _flux wrote:
             | There is a point here: if your implementation doesn't match
             | the specification, then the proven property can still fail
             | in practice. And your example presumes the specification
             | assumed unbounded numbers, while the implementation did not
             | have them. Well, in that case the specification just needs
             | to be revised to include that information.
             | 
             | But the trick is to remember to make the specification
             | model the real world in sufficient detail: not so detailed
             | that the specification becomes unwieldy, not so relaxed
             | that it skips real-world issues that need to be modeled.
             | 
             | I don't think there are too many good solutions to that,
             | except perhaps theorem provers from which you can also
             | extract code (e.g. Coq), but it seems that either these
             | tools are still difficult for engineers to use, or there is
             | an expectation that models and actual implementations are
             | going to be different anyway, making it difficult to see if
             | the implementation actually implements the spec.
        
           | _flux wrote:
           | True, I should have remembered to express it as _Non-
           | exhaustive testing cannot.._.
           | 
           | I also consider formal methods not to be testing. But model
           | checking (e.g. with TLC) is something in between, and sort of
           | takes advantage of the "This usually isn't feasible, but it's
           | nice when it is.": when you limit your domain enough, it
           | becomes feasible, yet exhaustively checking each small-world
           | scenario gives pretty good confidence that the model is
           | correct.
        
         | taviso wrote:
         | > Testing cannot be used to prove that a flaw doesn't exist,
         | only that it does.
         | 
         | FWIW, I wrote a similar blog post about a different encryption
         | bug that really seemed like it should have been found by
         | fuzzing, _and_ had 100% coverage.
         | 
         | https://googleprojectzero.blogspot.com/2021/12/this-shouldnt...
         | 
         | Not that I disagree with you, just a practical example.
        
           | jcims wrote:
           | It's pretty remarkable how effective fuzzing is despite the
           | layers upon layers of optimizations/assumptions that it
           | requires in order to be feasible at all (e.g. max_len =
           | 10000). I haven't tinkered with fuzzing since afl was a
           | toddler, but its mechanism for pruning the test space seemed
           | so brilliant and tweakable at the time.
           | 
           | It would be interesting to find a way to create excursions
           | into these various parameter spaces but some of them are
           | baked into the infrastructure in such a way that it makes it
           | difficult.
        
       | TacticalCoder wrote:
       | Why is OpenSSL using punycode? To do internationalized domain
       | name parsing?
        
         | amiga386 wrote:
         | I would assume it's needed for hostname validation. Is one
         | hostname equivalent to another? Does this wildcard match this
         | hostname?
         | 
         | EDIT: I looked it up. It is used to implement RFC 8398
         | (Internationalized Email Addresses in X.509 Certificates)
         | 
         | https://www.rfc-editor.org/rfc/rfc8398#section-5
        
         | cryptonector wrote:
         | Yes.
        
       | stcredzero wrote:
       | Would it be possible to attach an LLM to a debugger session
       | executing all of the fuzzer seeds, and ask it to figure out how
       | to expand coverage?
        
         | swatcoder wrote:
         | Not to dismiss the novelty and power of LLMs, but why would
         | you turn to a black box language interface for that?
         | 
         | Wouldn't you expect a designed system to be more efficient,
         | complete, reliable, and auditable? Most of those
         | characteristics are critical to security applications, and the
         | nature of LLMs largely runs counter to them.
        
           | agilob wrote:
           | Turns out LLMs are really good at writing fuzzers
           | 
           | https://verse.systems/blog/post/2024-03-09-using-llms-to-
           | gen...
           | 
           | https://google.github.io/oss-
           | fuzz/research/llms/target_gener...
        
             | cjbprime wrote:
             | These links are a little different to the GP comment,
             | though. Both of these cases (which I agree show LLMs being
             | an excellent choice for improving fuzzing coverage) are
             | static analysis, going from the project source code to a
             | new harness.
             | 
             | Some issues with that are that the model probably doesn't
             | have enough context to be given _all_ the project source
             | code, so you have to work out which subset to share,
             | including definitions of all relevant symbols (but not too
             | many).
             | 
             | It helps a bit that the foundation models were already pre-
             | trained on these large open source projects in oss-fuzz, so
             | they already know something about the project's symbols and
             | definitions from their original training sets -- and even
             | from public discussions about the code! -- but that
             | wouldn't work for a private codebase or for recent changes
             | to a large public one.
             | 
             | Then the harness source that the LLM writes might have
             | syntax errors or fail to compile, and you have to deal with
             | that somehow; or it might be valid but not generate any
             | coverage improvement, and you have to deal with that too,
             | and so on.
             | 
             | GP seems to be instead talking about some form of LLM-aided
             | dynamic analysis, where you are probably using some kind of
             | symbolic execution to generate new _seeds_, not new
             | _harnesses_.
             | 
             | That's important work too, because I think in this case
             | (disagreeing with the blog post author) the vulnerable
             | function was actually reachable by existing harnesses, just
             | not through the seed corpora (at least the public ones).
             | 
             | One approach could be for the LLM to become a kind of
             | symbolic execution constraint solver, using the debugger as
             | a form of path instrumentation and producing new seeds by
             | predicting what a new input would look like when you invert
             | each interesting constraint that the fuzzing coverage is
             | blocked by, as the debugger hits the test for that
             | constraint (which is difficult because it can require
             | actual computation, not pattern matching, and because of
             | path explosion).
             | 
             | Or perhaps more plausibly, the LLM could generate Z3 or
             | other SAT-solver code to define and solve for those
             | constraints to generate new seeds, replacing what is
             | currently extremely tedious and time-consuming work when
             | done by humans.
        
             | swatcoder wrote:
             | Those demonstrate that they're _capable_ of generating
             | _capable_ ones, which is really cool but also not
             | surprising.
             | 
             | What matters for engineering is how that technique compares
             | to others in the context of specific requirements.
             | 
             | A big part of "falling for hype" is mistaking a new and
             | capable tool for being the right or optimal tool.
        
               | cjbprime wrote:
               | It's fine to have LLM skepticism as a default, but here
               | it's not justified. Google is showing here that the LLM-
               | written harnesses improve massively on the harnesses in
               | oss-fuzz that were written over many years by the
               | combined efforts of open source security researchers. Most
               | dramatically, they improved tinyxml2 fuzzing coverage by
               | 31% compared to the existing oss-fuzz harnesses, through
               | an entirely automated flow for harness generation by
               | LLMs.
               | 
               | Whatever engineering technique you are imagining would be
               | better is not one that humanity actually applied to the
               | problem before the automated LLM harnesses were written.
               | In general, writing and improving fuzzing harnesses is
               | extremely tedious work that is not being done (or paid
               | for) by nearly enough people to adequately protect
               | critical open source software. The LLM approach is a
               | legitimate breakthrough in the field of open source
               | fuzzing.
        
               | swatcoder wrote:
               | Fair enough, interesting, and plausible! I looked at the
               | first link and saw it as more of a capabilities demo, but
               | didn't dig into the Google one. I'm mostly just
               | encouraging thoughtful reflection on tool choice by
               | raising questions, not making a case against. Very cool.
        
       | arp242 wrote:
       | (2022)
       | 
       | Previous:
       | 
       |  _Why CVE-2022-3602 was not detected by fuzz testing_ -
       | https://news.ycombinator.com/item?id=33693873 - Nov 2022 (171
       | comments)
        
         | dang wrote:
         | Year added above. Thanks!
        
       | cjbprime wrote:
       | I'm not sure the post (from 2022) is/was correct. I've looked
       | into it too, and I expect this _was_ reachable by the existing
       | x509 fuzzer. There's a fallacy in assuming that a fuzzer will
       | solve for all reachable code paths in a reasonable time, and that
       | if it doesn't then there must be a problem with the harness. The
       | harness is a reasonable top-level x509 parsing harness, but
       | starting all the way from the network input makes solving those
       | deep constraints unlikely to happen by (feedback-driven) chance,
       | which is what I think happened here.
       | 
       | Of course, a harness that started from the punycode parsing --
       | instead of the top-level X509 parsing -- finds this vulnerability
       | immediately.
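       | 
       | For reference, such a narrow harness is only a few lines.
       | decode_label() below is a hypothetical stand-in for OpenSSL's
       | internal punycode/A-label routine, which isn't part of the
       | public API:
       | 
       |   #include <stddef.h>
       |   #include <stdint.h>
       |   #include <stdlib.h>
       |   #include <string.h>
       | 
       |   /* hypothetical stand-in for the internal decoder */
       |   extern int decode_label(const char *in, char *out,
       |                           size_t outlen);
       | 
       |   int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
       |   {
       |       char out[256];
       |       char *in = malloc(size + 1);
       |       if (in == NULL)
       |           return 0;
       |       memcpy(in, data, size);
       |       in[size] = '\0';   /* decoder expects a C string */
       |       decode_label(in, out, sizeof out);
       |       free(in);
       |       return 0;
       |   }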
        
         | hedora wrote:
         | I've found that writing randomized unit tests for each small
         | part of a system finds this sort of stuff immediately.
         | 
         | In this case, a test that just generated 1,000,000 random
         | strings and passed them to punycode would probably have found
         | the problem (maybe not in the first run, but after a week or
         | two in continuous integration).
         | 
         | I try to structure the tests so they can run with dumb random
         | input or with coverage-guided input from a fuzzer. The former
         | usually finds >90% of the bugs the fuzzer would, and does so
         | 100x faster, so it's better than fuzzing during development,
         | but worse during nightly testing.
         | 
         | One other advantage of dumb random input is that it works with
         | distributed systems and things written in multiple languages
         | (where coverage information isn't readily available).
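         | 
         | A sketch of what I mean by a dumb random-input test, with a
         | hypothetical decode_label() standing in for the punycode
         | routine; the seed is printed so a failing run can be replayed:
         | 
         |   #include <stdio.h>
         |   #include <stdlib.h>
         |   #include <time.h>
         | 
         |   /* hypothetical stand-in for the punycode decoder */
         |   extern int decode_label(const char *in, char *out,
         |                           size_t outlen);
         | 
         |   int main(void)
         |   {
         |       static const char alpha[] =
         |           "abcdefghijklmnopqrstuvwxyz0123456789-";
         |       unsigned seed = (unsigned)time(NULL);
         |       printf("seed=%u\n", seed);  /* for reproducing failures */
         |       srand(seed);
         |       for (int i = 0; i < 1000000; i++) {
         |           char in[64], out[256];
         |           int len = rand() % (int)(sizeof in - 1);
         |           for (int j = 0; j < len; j++)
         |               in[j] = alpha[rand() % (int)(sizeof alpha - 1)];
         |           in[len] = '\0';
         |           /* a crash or sanitizer report here is a bug */
         |           decode_label(in, out, sizeof out);
         |       }
         |       return 0;
         |   }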
        
           | mattgreenrocks wrote:
           | I really like this idea because it avoids the issue of
           | fuzzers needing to burn tons of CPU just to get down to the
           | actual domain logic, which can have really thorny bugs. Per
           | usual, the idea of "unit" gets contentious quickly, but with
           | appropriate tooling I could foresee adding annotations to
           | code that leverage random input, property-based testing, and
           | a user-defined dictionary of known weird inputs.
        
           | paulddraper wrote:
           | > maybe not in the first run, but after a week or two in
           | continuous integration
           | 
           | You'd use a different seed for each CI run??
           | 
           | That sounds like a nightmare of non-determinism, and loss of
           | trust in the CI system in general.
        
             | forty wrote:
             | Not if you log the seed of the failing runs
        
           | Arnavion wrote:
           | aka property testing.
        
         | cryptonector wrote:
         | > Of course, a harness that started from the punycode parsing
         | -- instead of the top-level X509 parsing -- finds this
         | vulnerability immediately.
         | 
         | Yes, it's nice to have a fuzzer that can start from a very
         | high-level entry point, but it's even nicer to fuzz lots of
         | entry points (one for every parser).
        
         | capitainenemo wrote:
         | If finding it required generating a valid signed certificate
         | from random string generation, that seems essentially
         | equivalent to "unreachable" to me...
        
       ___________________________________________________________________
       (page generated 2024-04-18 23:00 UTC)