[HN Gopher] Why the OpenSSL punycode vulnerability was not detec...
___________________________________________________________________
Why the OpenSSL punycode vulnerability was not detected by fuzz
testing (2022)
Author : lemper
Score : 86 points
Date : 2024-04-17 07:14 UTC (1 day ago)
(HTM) web link (allsoftwaresucks.blogspot.com)
(TXT) w3m dump (allsoftwaresucks.blogspot.com)
| _flux wrote:
| I suppose it would be nice to require at least 100% line coverage
| of encryption-related functionality when it is exercised by a
| fuzzer. Deviations from this could easily be detected in CI.
|
| Testing cannot be used to prove that a flaw doesn't exist, only
| that it does.
| Retr0id wrote:
| > Testing cannot be used to prove that a flaw doesn't exist
|
| This is not universally true, formal methods can take you a
| long way, depending on the problem domain.
|
| And sometimes you can test for every possible input. This
| usually isn't feasible, but it's nice when it is.
| dllthomas wrote:
| Formal methods aren't "testing".
|
| As you say, though, exhaustive testing is sometimes possible.
| bluGill wrote:
| It is normally safe to assume that exhaustive testing isn't
| possible, because the total number of states to test
| exhaustively exceeds the number of atoms in the universe.
| There are a few exceptions, but in general we should assume
| it is always impossible to exhaustively test programs. Which
| means we need to use something else to find what the limits
| are and test only those, or use formal methods (both would
| be my recommendation - though I'll admit I have yet to
| figure out how to use formal methods).
| dllthomas wrote:
| > in general we should assume it is always impossible to
| exhaustively test programs
|
| The whole program? Yes.
|
| For testing individual components, there's no need to
| assume. The answer is very likely either clearly yes (the
| input is a handful of booleans) or clearly no (the input
| is several integers or even unbounded) and for those
| cases where it's not immediately clear (one i16? Three
| u8s?) it's probably still not hard to think through for
| your particular case in your particular context.
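|
| For example, a rough sketch of what that looks like when the
| domain really is small (sat_add_u8 here is just a made-up
| stand-in for a component under test):
|
|   #include <assert.h>
|   #include <stdint.h>
|
|   /* hypothetical component under test: saturating 8-bit add */
|   static uint8_t sat_add_u8(uint8_t a, uint8_t b) {
|       unsigned s = (unsigned)a + (unsigned)b;
|       return (uint8_t)(s > 0xFF ? 0xFF : s);
|   }
|
|   int main(void) {
|       /* two u8 inputs: only 65,536 cases, so just run them all */
|       for (unsigned a = 0; a <= 0xFF; a++) {
|           for (unsigned b = 0; b <= 0xFF; b++) {
|               unsigned expect = (a + b > 0xFF) ? 0xFF : a + b;
|               assert(sat_add_u8((uint8_t)a, (uint8_t)b) == expect);
|           }
|       }
|       return 0;
|   }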
| mananaysiempre wrote:
| Brute forcing a single 32-bit input has also been
| usefully performed in some cases, such as for single-
| precision implementations of special functions[1].
|
| [1] https://randomascii.wordpress.com/2014/01/27/theres-
| only-fou...
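|
| That kind of sweep fits in a few lines of C; a rough sketch
| (my_expf is a stand-in for whatever single-precision routine is
| being checked, compared here against the double-precision libm
| result):
|
|   #include <math.h>
|   #include <stdint.h>
|   #include <stdio.h>
|   #include <string.h>
|
|   /* stand-in for the single-precision function under test */
|   static float my_expf(float x) { return expf(x); }
|
|   int main(void) {
|       unsigned long long mismatches = 0;
|
|       /* walk every one of the ~4.3 billion float bit patterns */
|       for (uint64_t bits = 0; bits <= UINT32_MAX; bits++) {
|           uint32_t b = (uint32_t)bits;
|           float x;
|           memcpy(&x, &b, sizeof x);
|
|           float got = my_expf(x);
|           float want = (float)exp((double)x);
|           if (got != want && !(isnan(got) && isnan(want)))
|               mismatches++;
|       }
|       printf("%llu inputs differ from the reference\n", mismatches);
|       return 0;
|   }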
| bluGill wrote:
| 32 bits is a bit over 4 billion states, so that is doable.
| However, a lot of algorithms have more than 32 bits of state
| and so quickly get beyond what is doable.
| Retr0id wrote:
| I agree that it's not testing in the strict sense, but
| fuzzing isn't really testing either.
|
| The work done by a theorem prover and a fuzzer isn't all
| that different (and sufficiently advanced fuzzers use SMT
| solvers etc.)
| NovemberWhiskey wrote:
| This seems to neglect the fundamental difference between
| static and dynamic analysis. Formal methods approaches
| don't usually need me to run my code.
| _flux wrote:
| I guess I need to disagree :). Why isn't fuzzing testing?
| Ultimately it ends up running the code and seeing whether it
| works--for some definition of "working", that is, it doesn't
| crash--even if it picks its inputs by considering how the
| code behaves while running.
|
| Also, how is even "smart fuzzing" similar to theorem provers
| in any way? Fuzzers still think in terms of individual runs,
| but a theorem prover considers all cases. The best fuzzer
| wouldn't be able to tell whether a sort algorithm works for
| all inputs, presuming the input size is unbounded. Actually,
| my understanding is that a fuzzer wouldn't even be able to
| prove that the identity function works for all kinds of
| objects it may be given.
| bluGill wrote:
| Formal methods catch a different class of problems from other
| tests and so are worth it. However, they cannot catch
| everything. The classic example: "midpoint = (high + low) / 2"
| is easy to formally prove correct - but in fact it isn't
| correct, because high + low could overflow.
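|
| (The well-known fix is to compute the offset from low instead,
| so the intermediate value stays in range; a minimal sketch,
| assuming low <= high:)
|
|   /* (high + low) / 2 can overflow when both values are large;
|    * low + (high - low) / 2 never exceeds high. */
|   int midpoint(int low, int high) {
|       return low + (high - low) / 2;
|   }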
| dinosaurdynasty wrote:
| Any formal system worth its salt is going to model
| overflow.
| Jtsummers wrote:
| And with SPARK/Ada it will. When I was learning it that
| was actually a very common thing that it caught in my
| code (various forms of over/underflow errors). In
| practice most of them would not occur, but a few could
| have under real-world use-cases.
| thfuran wrote:
| If you've formally proved that correct, then either it
| can't overflow or there's a massive bug in the prover.
| mathgradthrow wrote:
| I don't mean to be rude, but why are you commenting on
| formal methods when you clearly have no idea what they are?
| bluGill wrote:
| The specific example I gave was from a 1982 paper (I forget
| who wrote it, but a big name in computer science) where it
| was used to prove an algorithm and thus argue that the author
| didn't need to test it, as it was proved correct.
|
| The famous Donald Knuth has also stated "Beware of bugs
| in the above code; I have only proved it correct, not
| tried it.", which again hints that there are limits to
| proved programs.
|
| I'll admit that I don't know much about them; what I know
| intrigues me - but I can't figure out how to use them in the
| real world. However, I know enough to know that they are not
| perfect.
| Jtsummers wrote:
| You use them judiciously, you learn their capabilities
| and limitations and the scope of their applicability.
|
| TLA+ is very useful, but it's not going to prove out your
| _software_ itself; that will require testing or other
| methods. It will help prove out your _specification_. I've
| used it to good effect both for designing distributed and
| concurrent systems and for identifying issues in existing
| systems (without an existing formal spec, by creating one and
| then playing with the invariants and other aspects of the
| TLA+ spec). SPARK/Ada, which I mentioned in my other comment,
| will detect many problems and help you prove the absence of
| many others. But it's also not (yet) suitable for all
| projects, or for more than a subset of a project, since it
| has specific limitations on what it can prove and is only
| compatible with a subset of Ada (this is being expanded each
| year). The subset of Ada it does apply to is actually quite
| significant and useful, but you have to be capable of
| understanding that it is limited and then apply it within
| that limited scope.
|
| This is true for every other formal method approach I've
| seen. Pay attention, read, ask questions, and don't judge
| them all by one paper you half remember from 42 years
| ago.
| _flux wrote:
| There is a point here: if your implementation doesn't match
| the specification, then that can indeed happen. Your example
| presumes the specification assumed unbounded numbers while
| the implementation did not have them; well, in that case the
| specification just needs to be revised to include that
| information.
|
| But the trick is to remember to make the specification model
| the real world in sufficient detail: not so detailed that the
| specification becomes unwieldy, not so relaxed that it skips
| real-world issues that need to be modeled.
|
| I don't think there are too many good solutions to that,
| except perhaps theorem provers from which you can also
| extract code (e.g. Coq), but it seems these tools are either
| still difficult for engineers to use, or there is an
| expectation that models and actual implementations are going
| to be different anyway, making it difficult to see whether
| the implementation actually implements the spec.
| _flux wrote:
| True, I should have remembered to express it as _Non-
| exhaustive testing cannot.._.
|
| I also consider formal methods not to be testing. But model
| checking (e.g. with TLC) is something in between, and it sort
| of takes advantage of the "This usually isn't feasible, but
| it's nice when it is.": when you limit your domain enough, it
| becomes feasible, and exhaustively checking each small-world
| scenario gives pretty good confidence that the model is
| correct.
| taviso wrote:
| > Testing cannot be used to prove that a flaw doesn't exist,
| only that it does.
|
| FWIW, I wrote a similar blog post about a different encryption
| bug that really seemed like it should have been found by
| fuzzing, _and_ had 100% coverage.
|
| https://googleprojectzero.blogspot.com/2021/12/this-shouldnt...
|
| Not that I disagree with you, just a practical example.
| jcims wrote:
| It's pretty remarkable how effective fuzzing is despite the
| layers upon layers of optimizations/assumptions that it
| requires in order to be feasible at all (eg max_len = 10000).
| I haven't tinkered with fuzzing since afl was a toddler but
| its mechanism for pruning the test space seemed so brilliant
| and tweakable at the time.
|
| It would be interesting to find a way to create excursions
| into these various parameter spaces, but some of them are
| baked into the infrastructure in a way that makes that
| difficult.
| TacticalCoder wrote:
| Why is OpenSSL using punycode? To do internationalized domain
| name parsing?
| amiga386 wrote:
| I would assume it's needed for hostname validation. Is one
| hostname equivalent to another? Does this wildcard match this
| hostname?
|
| EDIT: I looked it up. It is used to implement RFC 8398
| (Internationalized Email Addresses in X.509 Certificates)
|
| https://www.rfc-editor.org/rfc/rfc8398#section-5
| cryptonector wrote:
| Yes.
| stcredzero wrote:
| Would it be possible to attach an LLM to a debugger session
| executing all of the fuzzer seeds, and ask it to figure out how
| to expand coverage?
| swatcoder wrote:
| Not to dismiss the novelty and power of LLMs, but why would you
| turn to a black-box language interface for that?
|
| Wouldn't you expect a designed system to be more efficient,
| complete, reliable, and auditable? Most of those
| characteristics are critical to security applications, and the
| nature of LLMs largely runs counter to them.
| agilob wrote:
| Turns out LLMs are really good at writing fuzzers
|
| https://verse.systems/blog/post/2024-03-09-using-llms-to-
| gen...
|
| https://google.github.io/oss-
| fuzz/research/llms/target_gener...
| cjbprime wrote:
| These links are a little different to the GP comment,
| though. Both of these cases (which I agree show LLMs being
| an excellent choice for improving fuzzing coverage) are
| static analysis, going from the project source code to a
| new harness.
|
| Some issues with that are that the model probably doesn't
| have enough context to be given _all_ the project source
| code, so you have to work out which subset to share,
| including definitions of all relevant symbols (but not too
| many).
|
| It helps a bit that the foundation models were already pre-
| trained on these large open source projects in oss-fuzz, so
| they already know something about the project's symbols and
| definitions from their original training sets -- and even
| from public discussions about the code! -- but that
| wouldn't work for a private codebase or for recent changes
| to a large public one.
|
| Then the harness source that the LLM writes might have
| syntax errors/fail to compile, and you have to deal with
| that somehow, and the harness source that the LLM writes
| might be valid but not generate any coverage improvement
| and you have to deal with that, and so on.
|
| GP seems instead to be talking about some form of LLM-aided
| dynamic analysis, where you are probably using some kind of
| symbolic execution to generate new _seeds_, not new
| _harnesses_.
|
| That's important work too, because I think in this case
| (disagreeing with the blog post author) the vulnerable
| function was actually reachable by existing harnesses, just
| not through the seed corpora (at least the public ones).
|
| One approach could be for the LLM to become a kind of a
| symbolic execution constraint solver, using the debugger as
| a form of path instrumentation and producing new seeds by
| predicting what a new input would look like when you invert
| each interesting constraint that the fuzzing coverage is
| blocked by, as the debugger hits the test for that
| constraint (which is difficult because it can require
| actual computation, not pattern matching, and because of
| path explosion).
|
| Or perhaps more plausibly, the LLM could generate Z3 or other
| SMT-solver code to define and solve for those constraints to
| generate new seeds, replacing what is
| currently extremely tedious and time-consuming work when
| done by humans.
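|
| Roughly this kind of thing, as a sketch against Z3's C API
| (the two-byte branch condition here is invented purely for
| illustration):
|
|   #include <stdio.h>
|   #include <z3.h>
|
|   int main(void) {
|       Z3_config cfg = Z3_mk_config();
|       Z3_context ctx = Z3_mk_context(cfg);
|       Z3_del_config(cfg);
|       Z3_solver s = Z3_mk_solver(ctx);
|       Z3_solver_inc_ref(ctx, s);
|
|       /* model two input bytes as 8-bit bitvectors */
|       Z3_sort bv8 = Z3_mk_bv_sort(ctx, 8);
|       Z3_ast b0 = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "b0"), bv8);
|       Z3_ast b1 = Z3_mk_const(ctx, Z3_mk_string_symbol(ctx, "b1"), bv8);
|
|       /* assert the branch condition the harness is blocked on,
|        * e.g. "if (b0 + b1 == 0x7f && b0 > b1)" from coverage */
|       Z3_ast sum = Z3_mk_bvadd(ctx, b0, b1);
|       Z3_ast k = Z3_mk_unsigned_int(ctx, 0x7f, bv8);
|       Z3_solver_assert(ctx, s, Z3_mk_eq(ctx, sum, k));
|       Z3_solver_assert(ctx, s, Z3_mk_bvugt(ctx, b0, b1));
|
|       if (Z3_solver_check(ctx, s) == Z3_L_TRUE) {
|           Z3_model m = Z3_solver_get_model(ctx, s);
|           Z3_model_inc_ref(ctx, m);
|           /* the satisfying assignment is the new seed's bytes */
|           printf("%s\n", Z3_model_to_string(ctx, m));
|           Z3_model_dec_ref(ctx, m);
|       }
|       Z3_solver_dec_ref(ctx, s);
|       Z3_del_context(ctx);
|       return 0;
|   }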
| swatcoder wrote:
| Those demonstrate that they're _capable_ of generating
| _capable_ ones, which is really cool but also not surprising.
|
| What matters for engineering is how that technique compares
| to others in the context of specific requirements.
|
| A big part of "falling for hype" is mistaking a new and
| capable tool for being the right or optimal tool.
| cjbprime wrote:
| It's fine to have LLM skepticism as a default, but here
| it's not justified. Google is showing here that the LLM-
| written harnesses improve massively on the harnesses in
| oss-fuzz that were written over many years by the
| combined sum of open source security researchers. Most
| dramatically, they improved tinyxml2 fuzzing coverage by
| 31% compared to the existing oss-fuzz harnesses, through
| an entirely automated flow for harness generation by
| LLMs.
|
| Whatever engineering technique you are imagining would be
| better is not one that humanity actually applied to the
| problem before the automated LLM harnesses were written.
| In general, writing and improving fuzzing harnesses is
| extremely tedious work that is not being done (or paid
| for) by nearly enough people to adequately protect
| critical open source software. The LLM approach is a
| legitimate breakthrough in the field of open source
| fuzzing.
| swatcoder wrote:
| Fair enough, interesting, and plausible! I looked at the
| first link and saw it as more of a capabilities demo, but
| didn't dig into the Google one. I'm mostly just
| encouraging thoughtful reflection on tool choice by
| raising questions, not making a case against. Very cool.
| arp242 wrote:
| (2022)
|
| Previous:
|
| _Why CVE-2022-3602 was not detected by fuzz testing_ -
| https://news.ycombinator.com/item?id=33693873 - Nov 2022 (171
| comments)
| dang wrote:
| Year added above. Thanks!
| cjbprime wrote:
| I'm not sure the post (from 2022) is/was correct. I've looked
| into it too, and I expect this _was_ reachable by the existing
| x509 fuzzer. There 's a fallacy in assuming that a fuzzer will
| solve for all reachable code paths in a reasonable time, and that
| if it doesn't then there must be a problem with the harness. The
| harness is a reasonable top-level x509 parsing harness, but
| starting all the way from the network input makes solving those
| deep constraints unlikely to happen by (feedback-driven) chance,
| which is what I think happened here.
|
| Of course, a harness that started from the punycode parsing --
| instead of the top-level X509 parsing -- finds this vulnerability
| immediately.
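|
| Something along these lines, using the libFuzzer entry point
| (the ossl_punycode_decode prototype is my recollection of the
| internal one in crypto/punycode.c, so treat it as illustrative):
|
|   #include <stddef.h>
|   #include <stdint.h>
|
|   /* internal OpenSSL function, not part of the public headers */
|   int ossl_punycode_decode(const char *pEncoded, const size_t enc_len,
|                            unsigned int *pDecoded,
|                            unsigned int *pout_length);
|
|   int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
|       unsigned int out[64];
|       unsigned int outlen = sizeof(out) / sizeof(out[0]);
|
|       /* feed the raw fuzzer input straight into the decoder */
|       ossl_punycode_decode((const char *)data, size, out, &outlen);
|       return 0;
|   }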
| hedora wrote:
| I've found that writing randomized unit tests for each small
| part of a system finds this sort of stuff immediately.
|
| In this case, a test that just generated 1,000,000 random
| strings and passed them to punycode would probably have found
| the problem (maybe not in the first run, but after a week or
| two in continuous integration).
|
| I try to structure the tests so they can run with dumb random
| input or with coverage-guided input from a fuzzer. The former
| usually finds >90% of the bugs the fuzzer would, and does so
| 100x faster, so it's better than fuzzing during development,
| but worse during nightly testing.
|
| One other advantage for dumb random input is that it works with
| distributed systems and things written in multiple languages
| (where coverage information isn't readily available).
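|
| The shape I use is roughly this (decode_foo and USE_LIBFUZZER
| are made-up names for illustration; the idea is one check
| function driven either by a million random inputs or by a
| coverage-guided fuzzer):
|
|   #include <stdint.h>
|   #include <stdlib.h>
|
|   /* stand-in for the small component under test */
|   static void decode_foo(const uint8_t *data, size_t len) {
|       (void)data; (void)len;   /* real parsing code goes here */
|   }
|
|   /* the single property both modes exercise: must not crash */
|   static void check_one(const uint8_t *data, size_t len) {
|       decode_foo(data, len);
|   }
|
|   #ifdef USE_LIBFUZZER
|   /* coverage-guided mode, for nightly fuzzing */
|   int LLVMFuzzerTestOneInput(const uint8_t *data, size_t len) {
|       check_one(data, len);
|       return 0;
|   }
|   #else
|   /* dumb random mode, cheap enough for every CI run */
|   int main(void) {
|       uint8_t buf[256];
|       srand(12345);            /* log the seed so failures reproduce */
|       for (int i = 0; i < 1000000; i++) {
|           size_t len = (size_t)(rand() % (int)sizeof(buf));
|           for (size_t j = 0; j < len; j++)
|               buf[j] = (uint8_t)rand();
|           check_one(buf, len);
|       }
|       return 0;
|   }
|   #endif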
| mattgreenrocks wrote:
| I really like this idea because it avoids the issue of
| fuzzers needing to burn tons of CPU just to get down to the
| actual domain logic, which can have really thorny bugs. Per
| usual, the idea of "unit" gets contentious quickly, but with
| appropriate tooling I could foresee adding annotations to
| code that leverage random input, property-based testing, and
| a user-defined dictionary of known weird inputs.
| paulddraper wrote:
| > maybe not in the first run, but after a week or two in
| continuous integration
|
| You'd use a different seed for each CI run??
|
| That sounds like a nightmare of non-determinism, and a loss of
| trust in the CI system in general.
| forty wrote:
| Not if you log the seed of the failing runs
| Arnavion wrote:
| aka property testing.
| cryptonector wrote:
| > Of course, a harness that started from the punycode parsing
| -- instead of the top-level X509 parsing -- finds this
| vulnerability immediately.
|
| Yes, it's nice to have a fuzzer that can start from a very
| high-level entry point, but it's even nicer to fuzz lots of
| entry points (one for every parser).
| capitainenemo wrote:
| If finding it required generating a valid signed certificate
| from random string generation, that seems essentially
| equivalent to "unreachable" to me....
___________________________________________________________________
(page generated 2024-04-18 23:00 UTC)