[HN Gopher] How we applied fuzzing techniques to cURL
       ___________________________________________________________________
        
       How we applied fuzzing techniques to cURL
        
       Author : ingve
       Score  : 291 points
        Date   : 2024-03-01 15:09 UTC (1 day ago)
        
 (HTM) web link (blog.trailofbits.com)
 (TXT) w3m dump (blog.trailofbits.com)
        
       | SamuelAdams wrote:
       | I am curious how much effort goes into creating and maintaining
       | unit tests and fuzzing tests. Sometimes it takes longer / more
       | lines of code to write thorough tests than it does to implement
       | the core feature.
       | 
       | At that point, is it worth the time invested? Every new feature
       | can take 2-3 times longer to deliver due to adding tests.
        
         | epistasis wrote:
         | There's a tradeoff for sure, but to fully evaluate that
         | tradeoff you must also take into account the future time spent
         | on the code base, including the amount of time needed to debug
         | future features, adopt a new build environment, allow new team
         | members to muck around and see what does and does not work...
         | 
          | If it's a small project that won't be extended much, then
          | perhaps re-runnable unit tests don't clear the bar as a net-
          | positive tradeoff. But some time must still be spent
         | testing the code, even if it's not tests that are written in
         | code.
        
         | avgcorrection wrote:
          | The best case, IMO, is a test suite where writing a test is
          | almost as easy as (or maybe easier than) doing the equivalent
          | manual test. The test code is maybe 2-4x longer than the
          | change.
         | 
         | The worst case is... arbitrarily bad, to the point of being
         | impossible. Test setup is hell because there are too many
         | dependencies. You have to mock the "real world" since (again)
         | things depend on other things too much and you can't really do
         | a simple string-manipulating change without setting up a whole
         | environment. Also you introduce bugs in your tests that take
         | longer to debug than the real change you are making.
         | 
          | The trap I feel we (current project) have fallen into is that
          | there wasn't a test suite from the start. Just manual testing.
          | Then, when you get to the stage where you feel you need one,
          | the code base is not structured to make automated testing
          | simple.
        
         | bmcniel wrote:
          | Don't tests protect against future unintended regressions?
          | 
          | If all code were write-once, then testing would probably be a
          | waste of time, but code changes constantly.
        
         | yjftsjthsd-h wrote:
         | In the case of curl, the cost/benefit analysis is probably
         | skewed by it being deployed on a _massive_ scale, such that any
          | bugs in curl have an unusually large impact. If your company's
          | in-house CRM has an exploitable bug, that's _bad_ but the
          | impact is just your company. If libcurl has an exploitable bug,
          | that's millions of devices affected.
        
           | DylanSp wrote:
           | Probably _billions_ of devices at this point, honestly.
        
             | saagarjha wrote:
             | Curl claims 20 billion installations, FWIW.
        
         | Attummm wrote:
          | For libraries, tools, and frameworks, testing is crucial: it
          | ensures that the code relying on them has a solid foundation,
          | and code can only be as reliable as what it's leaning on. So,
          | to answer your question: a lot of time.
         | 
          | In a business-oriented project (i.e., at most jobs), code may
          | undergo frequent changes due to requests from the business, so
          | extensive testing for every change could slow down development
          | speed. However, regression tests can still provide valuable
          | insights and allow for faster development later in the life of
          | the project.
         | 
          | Many projects focus only on happy-path testing, and the value
          | of such tests is limited. Coupling them with negative testing,
          | and better yet boundary testing, compels developers to consider
          | both valid and invalid inputs, helping to identify and address
          | potential edge cases before they become bugs or security issues
          | in production (toy sketch below).
         | 
         | For instance, this [0] codebase has more tests than actual
         | code, including fuzzing tests.
         | 
          | [0] https://github.com/Attumm/redis-dict/blob/main/tests.py
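          | 
          | As a toy illustration of the boundary-testing idea (the
          | parse_port() below is hypothetical, just to show the
          | pattern):
          | 
          |     #include <cassert>
          |     #include <cstdlib>
          | 
          |     // Hypothetical validator: returns the port number, or
          |     // -1 for invalid input.
          |     static int parse_port(const char *s) {
          |       char *end;
          |       long v = strtol(s, &end, 10);
          |       if (end == s || *end != '\0') return -1;
          |       if (v < 1 || v > 65535) return -1;
          |       return (int)v;
          |     }
          | 
          |     int main() {
          |       assert(parse_port("8080") == 8080);    // happy path
          |       assert(parse_port("1") == 1);          // lower bound
          |       assert(parse_port("65535") == 65535);  // upper bound
          |       assert(parse_port("0") == -1);         // just below
          |       assert(parse_port("65536") == -1);     // just above
          |       assert(parse_port("") == -1);          // negative
          |       assert(parse_port("80x") == -1);       // negative
          |       return 0;
          |     }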
        
         | hgs3 wrote:
          | Code that isn't tested isn't done. Tests not only verify
          | expectations but also prevent future regressions. Fuzzing is
          | essential for code that accepts external input. Heartbleed was
          | discoverable with a fuzzer.
        
           | jnwatson wrote:
            | In most situations, we don't get paid to make code "done".
            | We get paid to get it close enough to work most of the time.
        
             | pdimitar wrote:
             | True but cURL is in a very different situation compared to
              | your average run-of-the-mill program. It's one of the most
              | massively deployed pieces of open-source software out there.
              | When you are not beholden to capricious shareholders and
              | incompetent managers, you can actually focus on quality.
        
           | kgeist wrote:
           | >Code that isn't tested, isn't done. Tests not only verify
           | the expectations, but also prevent future regression
           | 
           | A few weeks ago someone at the company implemented a feature
           | without tests. It worked perfectly, everyone was happy. A
           | release of another new feature by a different team a few days
           | later broke the previous feature :)
        
         | 1over137 wrote:
         | >At that point, is it worth the time invested? Every new
         | feature can take 2-3 times longer to deliver due to adding
         | tests.
         | 
          | Depends on whether you're writing software to control a
          | pacemaker or software for some silly smartphone game.
        
         | pests wrote:
          | Everyone always praises SQLite:
         | 
         | > As of version 3.42.0 (2023-05-16), the SQLite library
         | consists of approximately 155.8 KSLOC of C code [...] the
         | project has 590 times as much test code and test scripts -
         | 92053.1 KSLOC.
         | 
         | https://www.sqlite.org/testing.html
        
           | guerrilla wrote:
            | What the fuck. If they're investing that much, then why don't
            | they just go straight to formal verification? This is what
            | things like Frama-C (or whatever's popular now) are for.
        
             | gavinhoward wrote:
              | The problem is that the effort required for formal
              | verification grows exponentially beyond a certain point.
             | 
             | seL4 is around 10-12 KLoC, and it took a decade of effort
             | from multiple people to make it happen.
             | 
             | At the size of SQLite, especially where they have to
             | operate on platforms with different behavior (as an OS,
             | seL4 _is_ the platform), formal verification is just too
             | much effort.
             | 
             | All that said, your reaction is _totally_ understandable.
        
               | tonyarkles wrote:
               | Link to how SQLite is tested, for anyone who's curious:
               | https://www.sqlite.org/testing.html
               | 
                | There's also the wrinkle that formal verification
                | requires a formal specification, and afaik there isn't
                | one for SQLite. One of the toughest problems
               | that someone would run into trying to put together a
               | formal specification for code as widely deployed as
               | SQLite boils down to Hyrum's Law[1]: on a long enough
               | time scale, all observable behaviours of your system
               | become interfaces that someone, somewhere depends on.
               | 
                | That massive suite of test cases isn't a formal
                | specification, but given that it achieves 100% branch
                | coverage, it seems to me that it:
               | 
               | - pretty tightly bounds the interface without formally
               | specifying it
               | 
               | - also pretty tightly constrains the implementation to
               | match the current implementation
               | 
               | Which, when you have as many users as you do with SQLite,
               | is probably a fair way of providing guarantees to your
               | users that upgrading from 3.44.0 to 3.45.1 isn't going to
               | break anything you're using unless you were relying on
               | explicitly-identified buggy behaviour (you'd be able to
               | see the delta in the test cases if you looked at the
               | Fossil diffs).
               | 
               | [1] https://www.hyrumslaw.com
        
               | er4hn wrote:
               | https://corecursive.com/066-sqlite-with-richard-hipp/ is
               | an interesting interview with the lead developer for
               | SQLite about the inspiration and work that went into that
               | testing as well.
        
               | SAI_Peregrinus wrote:
                | Also, a formal specification can have bugs. Formal
                | verification checks that the code matches the spec, not
                | that the spec captures all desired behaviors and excludes
                | undesired ones.
        
               | IshKebab wrote:
               | You're right, though it's worth noting that you can also
               | use formal verification to verify properties about the
               | specification! For example you can verify that a security
               | system doesn't allow privilege escalation.
        
               | SAI_Peregrinus wrote:
               | Yep, though of course it's still on you to remember to
               | verify everything you care about. Actually writing out
               | all your requirements in a formal (or just programming)
               | language is the hard part of programming.
        
               | pests wrote:
               | That's the same link I posted in the original comment,
               | JSYK.
        
             | summerlight wrote:
              | Because formal verification doesn't scale at the moment.
              | There are relatively few experts on formal methods around
              | the globe, and some of them would need to work on SQLite
              | indefinitely unless you're okay with just a one-off
              | verification. Frama-C has a different background because
              | it's from Inria, an institution with many formal-methods
              | experts.
        
             | datadeft wrote:
             | You can do many things other than testing:
             | 
             | - use a memory safe language
             | 
             | - formal verification (multiple implementations even)
             | 
             | - build a simulator like FoundationDB did
        
           | dkubb wrote:
            | I once found a confirmed bug in SQLite, and it's the
            | highlight of all my OSS bug reports.
           | 
            | Interestingly enough, I found it while fuzz-testing a query
            | builder and cross-testing it against PostgreSQL, MySQL, and
            | an in-memory filtering engine I wrote.
        
         | burnished wrote:
          | Generally speaking, yes. Most code gets changed as a
          | consequence of writing those tests, so the practice has an
          | immediate benefit, and future changes can be made swiftly and
          | securely thanks to the confidence those tests should be giving
          | you.
          | 
          | Also, your time estimate sounds wonky; 2-3x sounds extreme.
          | Maybe you need to improve your test-writing process?
        
         | dogcomplex wrote:
         | I would generally double whatever your expectations are for the
         | initial feature development. Tests are essentially a second
         | implementation from a different angle running in parallel,
         | hoping the results match. Every feature change means changing 2
         | systems now. You save a bit of subsequent time with easier
         | debugging when other features break tests, but that's somewhat
         | eaten up by maintaining a system twice the size.
         | 
          | There are reasons many MVP developers and small teams, whose
          | focus is more on rapid feature implementation than large-team
          | coordination or code stability, forgo writing tests. It doesn't
          | make sense in all circumstances. Generally, it's more complex,
          | less grokable, more large-team-oriented, or public-library code
          | that needs testing.
        
           | vlovich123 wrote:
            | Tests are only a second implementation if you use test
            | doubles incorrectly. Test doubles should only be used for I/O
            | outside the program under test that you can't run locally /
            | that is a network dependency (e.g. mocking a SaaS service),
            | or for performance (mocking database responses vs. spinning
            | up a test database instance). If you do it right, most of
            | your tests just test each layer and everything below it.
           | 
           | I have yet to see a case where omitting tests actually helps
           | you move meaningfully faster - you're probably generating
           | more heat than light and that makes you feel like you're
           | moving faster.
        
         | summerlight wrote:
          | It's an economic decision. If you're developing a new indie
          | game with $10k of expected lifetime revenue, then it's probably
          | not worth your time. But if it's the core infrastructure of a
          | multi-billion-dollar business, then yes, it's worth your time,
          | since any non-trivial security incident may cost more than your
          | annual salary.
        
         | tylerhou wrote:
         | It depends on how risk tolerant you are. If you are at a
         | startup, the growth of your startup is often largely dependent
         | on how quickly you can add features / onboard customers. In
         | that context, writing tests not only slows you down from adding
         | new features, it might make it harder to modify existing
         | features. In addition, early customers also tend to be
         | accepting of small bugs -- they themselves already have "taken
         | a risk" in trusting an early startup. So testing is not really
         | valuable -- you want to remain agile, and you won't lose much
         | money because of a bug.
         | 
         | On the other hand, if you are Google, you already have found a
         | money-printing firehose, and you /don't/ want to take on any
         | additional unnecessary risk. Any new code needs to /not/ break
          | existing functionality -- if it does, you might lose out on
          | millions in revenue. In addition, your product becomes so large
          | that it is impossible to manually test every feature. In this
          | case, tests actually help you move /faster/ because they
          | automatically ensure, at scale, that a change does not break
          | anything.
         | 
         | While cURL does not make any money, it is solidly on the
         | mature/Google end of the testing spectrum. It has found a
         | footing in the open source tooling "market" and people rely on
         | it to maintain its existing functionality. In addition, it has
         | accumulated a fairly large surface area, so manual testing is
         | not really feasible for every feature. So testing similarly
         | helps cURL developers move faster (in the long run), not
         | slower.
        
         | evil-olive wrote:
         | > Every new feature can take 2-3 times longer to deliver due to
         | adding tests.
         | 
         | alternate framing: you can have a feature done in half or a
         | third of the time, if you don't care whether it actually works
         | or not
         | 
         | ...or whether it will _continue_ working, as other new features
         | are added to the system and the codebase evolves and grows.
         | that sort of maintainability is one of the key things you get
         | from investing in writing tests - you can refactor without fear
         | of breaking some existing feature.
         | 
         | it also allows you to hire engineers and give them a "safety
         | net" where they can make changes and be confident they're not
         | breaking some crucial functionality of the unfamiliar codebase.
         | 
         | done correctly (an important caveat, because lots of people do
         | testing poorly and then conclude that all testing is bad) that
         | time spent writing tests is not wasted effort. cutting corners
         | by skipping tests is very much a false economy in the long
         | term.
        
         | rwmj wrote:
         | Fuzzing isn't really a "2-3 times longer" thing. You have to
         | set it up once (which for basic fuzzing is quite easy), you
         | throw compute power at it, and it turns up bugs for you. As you
         | make changes to the project the fuzzer will explore new paths
         | mostly automatically.
         | 
         | You may want to spend some time looking at code coverage and
         | doing some of the advanced things outlined in this article,
         | especially for very high risk / reward projects like Curl, but
         | even that is not a lot of work.
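          | 
          | For a sense of scale, a minimal libFuzzer harness is just a
          | few lines. This is a generic sketch, not curl's actual
          | harness (that lives in the separate curl-fuzzer repo), and
          | parse_url() is a toy stand-in for whatever parses untrusted
          | bytes:
          | 
          |     #include <cstddef>
          |     #include <cstdint>
          |     #include <cstring>
          | 
          |     // Toy stand-in for the code under test.
          |     static int parse_url(const uint8_t *d, size_t n) {
          |       return (n >= 7 && !memcmp(d, "http://", 7)) ? 0 : -1;
          |     }
          | 
          |     // libFuzzer entry point; build with:
          |     //   clang++ -g -fsanitize=fuzzer,address harness.cc
          |     extern "C" int LLVMFuzzerTestOneInput(
          |         const uint8_t *data, size_t size) {
          |       parse_url(data, size);
          |       return 0;  // crashes/sanitizer hits are the signal
          |     }
          | 
          | From there, the sanitizers do the bug detection and the
          | corpus grows mostly on its own as you throw CPU time at it.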
        
         | williamdclt wrote:
          | It's testing 101; it's been discussed to death. As usual,
          | "it's a trade-off".
         | 
         | You're right that writing tests can make shipping a feature
         | slower. However, already having tests makes shipping a feature
         | faster (higher confidence it works, less manual testing of the
          | feature and the adjacent ones). It also lowers the number of
          | bugs, which would otherwise slow down delivery of new features.
          | Tests do increase maintenance cost, though (tests are code that
          | needs maintenance too).
         | 
          | Is it worth it? It depends, although it's quite rare that no
          | tests at all are the right trade-off.
        
       | PoignardAzur wrote:
       | I don't get the part about custom mutators:
       | 
       | > _If the data can't be parsed into a valid TLV, instead of
       | throwing it away, return a syntactically correct dummy TLV. This
       | can be anything, as long as it can be successfully unpacked._
       | 
       | If you're creating a dummy value, how is that better than
       | failing? How does that give your fuzzer better coverage?
        
         | pstrateman wrote:
          | The file format they chose is difficult for a fuzzer to
          | produce valid examples of by random chance.
         | 
         | The file format isn't what's being fuzzed, so trying to accept
         | as many things as possible as valid is useful.
         | 
         | It's a trick to make the fuzzer faster.
        
         | gavinhoward wrote:
         | Not an expert, but I am a power user of fuzzers.
         | 
         | The problem is that the space of invalid inputs is far larger
         | than the space of valid inputs. Sometimes orders of magnitude
         | larger, say billions or more invalid inputs to one valid input.
         | 
         | Naive fuzzing will hit so many error cases that it will hardly
         | produce a valid input. For the ratio that I mentioned, you
         | might run a fuzzer for a billion runs and only get one valid
         | input in the bunch.
         | 
          | Using a custom mutator and returning a dummy value gives the
          | fuzzer a valid input as a starting point and makes generating
          | other valid inputs more likely (sketch below).
         | 
         | For my part, I prefer to use custom mutators to generate valid
         | test cases most of the time, but I want some invalid inputs
         | because error handling is where most bugs are.
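          | 
          | Concretely, with libFuzzer this lives in the custom-mutator
          | hook. A sketch, assuming a made-up 1-byte-type/1-byte-length
          | framing (the real curl-fuzzer TLV format differs):
          | 
          |     #include <cstddef>
          |     #include <cstdint>
          | 
          |     // Provided by libFuzzer: its built-in byte mutations.
          |     extern "C" size_t LLVMFuzzerMutate(
          |         uint8_t *data, size_t size, size_t max_size);
          | 
          |     // Assumed framing: [type][len][len bytes of value].
          |     static bool valid_tlv(const uint8_t *d, size_t n) {
          |       return n >= 2 && n == 2u + d[1];
          |     }
          | 
          |     extern "C" size_t LLVMFuzzerCustomMutator(
          |         uint8_t *data, size_t size, size_t max_size,
          |         unsigned seed) {
          |       size_t n = LLVMFuzzerMutate(data, size, max_size);
          |       // If the mutation broke the framing, repair it into
          |       // *some* valid TLV instead of throwing it away, so
          |       // the parser proper is always reached.
          |       if (!valid_tlv(data, n) && max_size >= 3) {
          |         data[0] = (uint8_t)seed;  // arbitrary type
          |         data[1] = 1;              // payload length
          |         data[2] = 0x41;           // dummy payload byte
          |         n = 3;
          |       }
          |       return n;
          |     }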
        
           | vlovich123 wrote:
            | I personally would split that into two separate fuzz cases:
            | one that generates only valid inputs and one that generates
            | only invalid inputs, and spend more resources on the latter,
            | since validating the handling of invalid inputs matters more. I
           | didn't read the article but I like property testing for this
           | where your mutator takes random values and uses that to
           | generate a valid input somehow rather than stubbing in a
           | static value. Where I can use a static value is where I
           | wouldn't be fuzzing the validation of that value. Of course
           | certs are complicated beasts so I'm sure the cURL people did
           | what made sense to them.
        
           | PoignardAzur wrote:
            | Your post doesn't address my question (and neither do any of
            | the other replies). The article says:
           | 
           | > _If the data can't be parsed into a valid TLV, instead of
           | throwing it away, return a syntactically correct dummy TLV.
           | This can be anything, as long as it can be successfully
           | unpacked._
           | 
           | Is the syntactically correct dummy value the same each time?
           | If so, how does that lead to new coverage?
           | 
            | If it's a _different_ valid TLV value each time, then it's
            | not really a dummy value, is it? In any case, why bother with
           | this "if invalid replace with dummy" step? Why not
           | generate/mutate a valid TLV value from the start?
        
             | MajesticHobo2 wrote:
             | > Is the syntactically correct dummy value the same each
             | time? If so, how does that lead to new coverage?
             | 
             | Per my understanding, the dummy value is constant but is
             | only used once (to create the first valid TLV). Everything
             | after that is a mutation of the original value that, due to
             | the custom mutator logic, is a valid TLV. The mutations are
             | where new coverage comes from.
             | 
             | > In any case, why bother with this "if invalid replace
             | with dummy" step? Why not generate/mutate a valid TLV value
             | from the start?
             | 
             | I'm guessing to handle both when there are seed files that
             | are already valid (which you'll want to use instead of a
             | dummy value), and when there aren't any valid seed files.
        
           | ackbar03 wrote:
            | Do you think generative AI / deep learning tools could help
            | with this?
        
         | stefan_ wrote:
          | Remember, TLV stands for "Type, Length, Value". It's better to
          | take your fuzzer output as the value (or possibly the type and
          | value) but generate the length and the final TLV yourself than
          | to have tons of fuzzer-generated sequences fail at the very
          | basic type/length check, which is unlikely to be vulnerable in
          | software like cURL (but can be in many others...).
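          | 
          | Roughly like this sketch (parse_tlv() is hypothetical, with
          | the same assumed 1-byte-type/1-byte-length framing as above):
          | 
          |     #include <cstddef>
          |     #include <cstdint>
          |     #include <vector>
          | 
          |     // Hypothetical parser; its framing check now always
          |     // passes because the harness computes the length.
          |     static void parse_tlv(const uint8_t *d, size_t n) {
          |       if (n < 2 || n != 2u + d[1]) return;
          |       // ... the type/value handling actually under test ...
          |     }
          | 
          |     extern "C" int LLVMFuzzerTestOneInput(
          |         const uint8_t *data, size_t size) {
          |       if (size < 1 || size > 256) return 0;
          |       std::vector<uint8_t> tlv;
          |       tlv.push_back(data[0]);              // type: 1st byte
          |       tlv.push_back((uint8_t)(size - 1));  // length: derived
          |       tlv.insert(tlv.end(), data + 1, data + size);  // value
          |       parse_tlv(tlv.data(), tlv.size());
          |       return 0;
          |     }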
        
         | mkj wrote:
          | You want the failures to occur deeper in the program so you
          | cover more code paths.
        
       | throwaway81523 wrote:
        | Lots of good info here about how they added more fuzz tests, but
        | no mention of whether the fuzzing actually found bugs?
        
         | userbinator wrote:
         | If they did, I'm sure they would mention it.
        
       | costco wrote:
        | OSS-Fuzz is great in many cases, but as this post hints, it
        | seems like a lot of the time the fuzzers miss things because no
        | one actually reviews the coverage info. Not in this case, but
        | sometimes it looks like someone just found the functions with
        | signatures like parse(unsigned char *data, size_t size) to get
        | the money from Google for integrating fuzzing.
       | 
       | Then again some things are just difficult to fuzz properly. I
       | tried writing a libpurple IRC harness by doing it the "right way"
       | with the 3000 libpurple event callbacks you are supposed to set
       | up and structures you are supposed to allocate, which worked, but
       | it was very slow. I ended up being lazy and only calling
       | irc_parse_msg after modifying the code to remove error logging
       | that required proper libpurple setup.
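        | 
        | The lazy version can end up surprisingly small. A sketch of
        | the shape only (irc_conn and irc_parse_msg() here are stubs
        | standing in for libpurple's; the real harness links the
        | library and initializes just the state the parser touches):
        | 
        |     #include <cstddef>
        |     #include <cstdint>
        |     #include <string>
        | 
        |     // Stub connection state; fill in only the fields the
        |     // parser actually dereferences.
        |     struct irc_conn { const char *server; };
        |     static void irc_parse_msg(irc_conn *, char *) {}
        | 
        |     extern "C" int LLVMFuzzerTestOneInput(
        |         const uint8_t *data, size_t size) {
        |       static irc_conn conn = { "irc.example.org" };
        |       std::string line((const char *)data, size);  // NUL-safe
        |       irc_parse_msg(&conn, line.data());  // skip full setup
        |       return 0;
        |     }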
        
       ___________________________________________________________________
       (page generated 2024-03-02 23:01 UTC)