[HN Gopher] How we applied fuzzing techniques to cURL
___________________________________________________________________
How we applied fuzzing techniques to cURL
Author : ingve
Score : 291 points
Date : 2024-03-01 15:09 UTC (1 day ago)
(HTM) web link (blog.trailofbits.com)
(TXT) w3m dump (blog.trailofbits.com)
| SamuelAdams wrote:
| I am curious how much effort goes into creating and maintaining
| unit tests and fuzzing tests. Sometimes it takes longer / more
| lines of code to write thorough tests than it does to implement
| the core feature.
|
| At that point, is it worth the time invested? Every new feature
| can take 2-3 times longer to deliver due to adding tests.
| epistasis wrote:
| There's a tradeoff for sure, but to fully evaluate that
| tradeoff you must also take into account the future time spent
| on the code base, including the amount of time needed to debug
| future features, adopt a new build environment, allow new team
| members to muck around and see what does and does not work...
|
| If it's a small project that won't be extended much, then
| perhaps re-runnable unit tests may not make the bar as a net-
| positive tradeoff. But some time must still be spent testing
| the code, even if those tests aren't written as code.
| avgcorrection wrote:
| The best case IMO is a test suite where writing a test is
| almost as easy as (or maybe easier than) doing the equivalent
| manual test. The test code is maybe 2-4x longer than the
| change.
|
| The worst case is... arbitrarily bad, to the point of being
| impossible. Test setup is hell because there are too many
| dependencies. You have to mock the "real world" since (again)
| things depend on other things too much and you can't really do
| a simple string-manipulating change without setting up a whole
| environment. Also you introduce bugs in your tests that take
| longer to debug than the real change you are making.
|
| The trap I feel we (current project) have fallen into is that
| there wasn't a test suite from the start, just manual testing.
| Then, by the time you feel you need one, the code base is not
| structured to make automated testing simple.
| bmcniel wrote:
| Don't tests protect against future unintended regressions?
|
| If all code were write-only, then testing would probably be a
| waste of time, but code changes constantly.
| yjftsjthsd-h wrote:
| In the case of curl, the cost/benefit analysis is probably
| skewed by it being deployed on a _massive_ scale, such that any
| bugs in curl have an unusually large impact. If your company's
| in-house CRM has an exploitable bug, that's _bad_, but the
| impact is limited to your company. If libcurl has an
| exploitable bug, that's millions of devices affected.
| DylanSp wrote:
| Probably _billions_ of devices at this point, honestly.
| saagarjha wrote:
| Curl claims 20 billion installations, FWIW.
| Attummm wrote:
| For libraries, tools, and frameworks, testing is crucial: it
| ensures that the code relying on them can trust them to do
| their job. Code can only be as reliable as what it's leaning
| on. So, to answer your question: a lot of time.
|
| In a business-oriented project (at most jobs), code may undergo
| frequent changes due to requests from the business, so
| extensive testing of every change can slow development down.
| However, regression tests can still provide valuable insights
| and allow for faster development later in the life of the
| project.
|
| Many projects focus only on happy-path testing, but the value
| of such tests on their own is limited. Coupling them with
| negative testing, and better yet boundary testing, compels
| developers to consider both valid and invalid inputs, helping
| to identify and address potential edge cases before they
| become bugs or security issues in production.
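|
| As a minimal sketch of the difference in C (parse_port() here
| is hypothetical, not from any real codebase), happy-path,
| negative, and boundary cases side by side:
|
|     #include <assert.h>
|     #include <stdlib.h>
|
|     /* Hypothetical: returns a port in [1, 65535], or -1
|        on any invalid input. */
|     static int parse_port(const char *s) {
|         char *end;
|         long v = strtol(s, &end, 10);
|         if (end == s || *end != '\0') return -1;
|         if (v < 1 || v > 65535) return -1;
|         return (int)v;
|     }
|
|     int main(void) {
|         assert(parse_port("8080") == 8080);   /* happy path */
|         assert(parse_port("abc") == -1);      /* negative */
|         assert(parse_port("0") == -1);        /* boundary */
|         assert(parse_port("65535") == 65535); /* boundary */
|         assert(parse_port("65536") == -1);    /* boundary */
|         return 0;
|     }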
|
| For instance, this [0] codebase has more tests than actual
| code, including fuzzing tests.
|
| [0] https://github.com/Attumm/redis-dict/blob/main/tests.py
| hgs3 wrote:
| Code that isn't tested, isn't done. Tests not only verify the
| expectations, but also prevent future regression. Fuzzing is
| essential for code that accepts external inputs. Heartbleed was
| discoverable with a fuzzer.
| jnwatson wrote:
| In most situations, we don't get paid to make code "done". We
| get paid to get it close enough to work most of the time.
| pdimitar wrote:
| True, but cURL is in a very different situation compared to
| your average run-of-the-mill program. It's one of the most
| widely deployed pieces of open-source software out there. When
| you are not beholden to capricious shareholders and
| incompetent managers, you can actually focus on quality.
| kgeist wrote:
| >Code that isn't tested, isn't done. Tests not only verify
| the expectations, but also prevent future regression
|
| A few weeks ago someone at the company implemented a feature
| without tests. It worked perfectly, everyone was happy. A
| release of another new feature by a different team a few days
| later broke the previous feature :)
| 1over137 wrote:
| >At that point, is it worth the time invested? Every new
| feature can take 2-3 times longer to deliver due to adding
| tests.
|
| Depends on whether you are writing software to control a
| pacemaker or software for some silly smartphone game.
| pests wrote:
| Everyone always praises SQLite:
|
| > As of version 3.42.0 (2023-05-16), the SQLite library
| consists of approximately 155.8 KSLOC of C code [...] the
| project has 590 times as much test code and test scripts -
| 92053.1 KSLOC.
|
| https://www.sqlite.org/testing.html
| guerrilla wrote:
| What the fuck. If they're investing that much, then why don't
| they just go straight to formal verification? This is what
| things like Frama-C (or whatever's popular now) are for.
| gavinhoward wrote:
| The problem is that the effort needed for formal verification
| grows exponentially beyond a certain point.
|
| seL4 is around 10-12 KLoC, and it took a decade of effort
| from multiple people to make it happen.
|
| At the size of SQLite, especially where they have to
| operate on platforms with different behavior (as an OS,
| seL4 _is_ the platform), formal verification is just too
| much effort.
|
| All that said, your reaction is _totally_ understandable.
| tonyarkles wrote:
| Link to how SQLite is tested, for anyone who's curious:
| https://www.sqlite.org/testing.html
|
| There's also an interesting wrinkle: formal verification
| requires a formal specification, which afaik SQLite doesn't
| have. One of the toughest problems anyone would run into
| trying to put together a formal specification for code as
| widely deployed as SQLite boils down to Hyrum's Law[1]: on a
| long enough time scale, all observable behaviours of your
| system become interfaces that someone, somewhere depends on.
|
| That massive suite of test cases isn't a formal specification,
| but given that it achieves 100% branch coverage, it seems to
| me that it:
|
| - pretty tightly bounds the interface without formally
| specifying it
|
| - also pretty tightly constrains the implementation to
| match the current implementation
|
| Which, when you have as many users as you do with SQLite,
| is probably a fair way of providing guarantees to your
| users that upgrading from 3.44.0 to 3.45.1 isn't going to
| break anything you're using unless you were relying on
| explicitly-identified buggy behaviour (you'd be able to
| see the delta in the test cases if you looked at the
| Fossil diffs).
|
| [1] https://www.hyrumslaw.com
| er4hn wrote:
| https://corecursive.com/066-sqlite-with-richard-hipp/ is
| an interesting interview with the lead developer for
| SQLite about the inspiration and work that went into that
| testing as well.
| SAI_Peregrinus wrote:
| Also, a formal specification can have bugs. Formal
| verification checks that the code matches the spec, not that
| the spec captures all desired behaviors and excludes undesired
| ones.
| IshKebab wrote:
| You're right, though it's worth noting that you can also
| use formal verification to verify properties about the
| specification! For example, you can verify that a security
| system doesn't allow privilege escalation.
| SAI_Peregrinus wrote:
| Yep, though of course it's still on you to remember to
| verify everything you care about. Actually writing out
| all your requirements in a formal (or just programming)
| language is the hard part of programming.
| pests wrote:
| That's the same link I posted in the original comment,
| JSYK.
| summerlight wrote:
| Because formal verification doesn't scale at the moment. There
| are relatively few experts on formal methods around the globe,
| and some of them would need to work on SQLite indefinitely
| unless you're okay with a one-off verification. Frama-C has a
| different background because it's from Inria, an institution
| with many formal-methods experts.
| datadeft wrote:
| You can do many things other than testing:
|
| - use a memory safe language
|
| - formal verification (multiple implementations even)
|
| - build a simulator like FoundationDB did
| dkubb wrote:
| I once found a confirmed bug in SQLite, and it's the highlight
| of all my OSS bug reports.
|
| Interestingly enough I found it while fuzz testing a query
| builder and cross testing it against PostgreSQL, MySQL and an
| in-memory filtering engine I wrote.
| burnished wrote:
| Generally speaking, yes. It's not like most code isn't changed
| as a consequence of writing those tests, so the practice has
| immediate benefit; and future changes can be made swiftly and
| securely thanks to the confidence those tests should be giving
| you.
|
| But also, your time estimate does sound wonky; 2-3x sounds
| extreme. Maybe you need to improve your test-writing process?
| dogcomplex wrote:
| I would generally double whatever your expectations are for the
| initial feature development. Tests are essentially a second
| implementation from a different angle running in parallel,
| hoping the results match. Every feature change means changing 2
| systems now. You save a bit of subsequent time with easier
| debugging when other features break tests, but that's somewhat
| eaten up by maintaining a system twice the size.
|
| There are reasons many MVP developers and small teams, whose
| focus is more on rapid feature implementation than on large-
| team coordination or code stability, forgo writing tests. It
| doesn't make sense in all circumstances. Generally, it's the
| more complex, less grokkable, more large-team-oriented, or
| public library code that needs testing.
| vlovich123 wrote:
| Tests are only a second implementation if you use test doubles
| incorrectly. Test doubles should only be used for I/O outside
| the program under test that you can't really run locally or
| that is a network dependency (e.g. mocking a SaaS service), or
| for performance (mocking database responses vs spinning up a
| test database instance). If you do it right, most of your
| tests are just testing each layer and everything below it.
|
| I have yet to see a case where omitting tests actually helps
| you move meaningfully faster - you're probably generating
| more heat than light and that makes you feel like you're
| moving faster.
| summerlight wrote:
| It's an economic decision. If you're developing a new indie
| game with a $10k revenue expectation over its lifetime, then
| it's probably not worth your time. But if it's core
| infrastructure for a multi-billion-dollar business, then yes,
| it's worth your time, since any non-trivial security incident
| may cost more than your annual salary.
| tylerhou wrote:
| It depends on how risk tolerant you are. If you are at a
| startup, the growth of your startup is often largely dependent
| on how quickly you can add features / onboard customers. In
| that context, writing tests not only slows you down from adding
| new features, it might make it harder to modify existing
| features. In addition, early customers also tend to be
| accepting of small bugs -- they themselves already have "taken
| a risk" in trusting an early startup. So testing is not really
| valuable -- you want to remain agile, and you won't lose much
| money because of a bug.
|
| On the other hand, if you are Google, you already have found a
| money-printing firehose, and you /don't/ want to take on any
| additional unnecessary risk. Any new code needs to /not/ break
| existing functionality -- if it does, you might lose out on
| millions in revenue. In addition, your product becomes so large
| that it is impossible to manually test every feature. In this
| case, tests actually help you move /faster/ because they help
| you at scale automatically ensure that a change does not break
| anything.
|
| While cURL does not make any money, it is solidly on the
| mature/Google end of the testing spectrum. It has found a
| footing in the open source tooling "market" and people rely on
| it to maintain its existing functionality. In addition, it has
| accumulated a fairly large surface area, so manual testing is
| not really feasible for every feature. So testing similarly
| helps cURL developers move faster (in the long run), not
| slower.
| evil-olive wrote:
| > Every new feature can take 2-3 times longer to deliver due to
| adding tests.
|
| alternate framing: you can have a feature done in half or a
| third of the time, if you don't care whether it actually works
| or not
|
| ...or whether it will _continue_ working, as other new features
| are added to the system and the codebase evolves and grows.
| that sort of maintainability is one of the key things you get
| from investing in writing tests - you can refactor without fear
| of breaking some existing feature.
|
| it also allows you to hire engineers and give them a "safety
| net" where they can make changes and be confident they're not
| breaking some crucial functionality of the unfamiliar codebase.
|
| done correctly (an important caveat, because lots of people do
| testing poorly and then conclude that all testing is bad) that
| time spent writing tests is not wasted effort. cutting corners
| by skipping tests is very much a false economy in the long
| term.
| rwmj wrote:
| Fuzzing isn't really a "2-3 times longer" thing. You have to
| set it up once (which for basic fuzzing is quite easy), you
| throw compute power at it, and it turns up bugs for you. As you
| make changes to the project the fuzzer will explore new paths
| mostly automatically.
|
| You may want to spend some time looking at code coverage and
| doing some of the advanced things outlined in this article,
| especially for very high risk / reward projects like Curl, but
| even that is not a lot of work.
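|
| To illustrate how little setup basic fuzzing needs, here's a
| bare-bones libFuzzer harness, where parse() is a stand-in for
| whatever function you want to fuzz:
|
|     #include <stdint.h>
|     #include <stddef.h>
|
|     /* The function under test; a stand-in here. */
|     int parse(const uint8_t *data, size_t size);
|
|     /* libFuzzer entry point. Build with:
|        clang -fsanitize=fuzzer,address harness.c parse.c */
|     int LLVMFuzzerTestOneInput(const uint8_t *data,
|                                size_t size) {
|         parse(data, size);
|         return 0;
|     }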
| williamdclt wrote:
| It's the 101 of talking about testing; it's been discussed to
| death. As usual, "it's a trade-off".
|
| You're right that writing tests can make shipping a feature
| slower. However, already having tests makes shipping a feature
| faster (higher confidence it works, less manual testing of the
| feature and the adjacent ones). It also lowers the number of
| bugs, which are going to slow down delivery of new features.
| They also increase maintenance cost though (tests are code that
| needs maintenance too).
|
| Is it worth it? Depends, although it's quite rare that no
| tests at all are the right trade-off.
| PoignardAzur wrote:
| I don't get the part about custom mutators:
|
| > _If the data can't be parsed into a valid TLV, instead of
| throwing it away, return a syntactically correct dummy TLV. This
| can be anything, as long as it can be successfully unpacked._
|
| If you're creating a dummy value, how is that better than
| failing? How does that give your fuzzer better coverage?
| pstrateman wrote:
| The file format they chose is one where a fuzzer struggles to
| produce valid examples by random chance.
|
| The file format isn't what's being fuzzed, so trying to accept
| as many things as possible as valid is useful.
|
| It's a trick to make the fuzzer faster.
| gavinhoward wrote:
| Not an expert, but I am a power user of fuzzers.
|
| The problem is that the space of invalid inputs is far larger
| than the space of valid inputs. Sometimes orders of magnitude
| larger, say billions or more invalid inputs to one valid input.
|
| Naive fuzzing will hit so many error cases that it will hardly
| produce a valid input. For the ratio that I mentioned, you
| might run a fuzzer for a billion runs and only get one valid
| input in the bunch.
|
| Using a custom mutator and returning a dummy value gives the
| fuzzer a valid starting point and makes generating other valid
| inputs more likely.
|
| For my part, I prefer to use custom mutators to generate valid
| test cases most of the time, but I want some invalid inputs
| because error handling is where most bugs are.
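|
| As a rough sketch of that pattern (a libFuzzer custom mutator
| over an assumed toy [type:1][len:1][value] encoding, not the
| article's actual format):
|
|     #include <stdint.h>
|     #include <stddef.h>
|     #include <string.h>
|
|     /* Provided by the libFuzzer runtime. */
|     size_t LLVMFuzzerMutate(uint8_t *data, size_t size,
|                             size_t max_size);
|
|     /* Toy validity check: length byte matches payload. */
|     static int is_valid_tlv(const uint8_t *d, size_t n) {
|         return n >= 2 && (size_t)d[1] + 2 == n;
|     }
|
|     size_t LLVMFuzzerCustomMutator(uint8_t *data, size_t size,
|                                    size_t max_size,
|                                    unsigned int seed) {
|         (void)seed;
|         if (!is_valid_tlv(data, size)) {
|             /* Instead of throwing the input away, swap in a
|                dummy valid TLV as a starting point. */
|             static const uint8_t dummy[] = {0x01, 0x01, 0x00};
|             if (max_size < sizeof(dummy)) return 0;
|             memcpy(data, dummy, sizeof(dummy));
|             return sizeof(dummy);
|         }
|         /* Mutate only the value bytes, then patch the length
|            byte so the result is still a valid TLV. */
|         size_t vmax = max_size - 2;
|         size_t vlen = LLVMFuzzerMutate(data + 2, size - 2,
|                                        vmax > 255 ? 255 : vmax);
|         data[1] = (uint8_t)vlen;
|         return vlen + 2;
|     }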
| vlovich123 wrote:
| I personally would split that into two separate fuzz targets:
| one that generates only valid inputs and one that generates
| only invalid inputs, and spend more resources on the latter,
| because handling of invalid inputs is where it matters more. I
| didn't read the article, but I like property testing for this,
| where your mutator takes random values and uses them to
| generate a valid input, rather than stubbing in a static
| value. A static value only makes sense where I wouldn't be
| fuzzing the validation of that value. Of course, certs are
| complicated beasts, so I'm sure the cURL people did what made
| sense to them.
| PoignardAzur wrote:
| Your post doesn't address my question (and neither do any of
| the other replies). The article says:
|
| > _If the data can't be parsed into a valid TLV, instead of
| throwing it away, return a syntactically correct dummy TLV.
| This can be anything, as long as it can be successfully
| unpacked._
|
| Is the syntactically correct dummy value the same each time?
| If so, how does that lead to new coverage?
|
| If it's a _different_ valid TLV value each time, then it's not
| really a dummy value, is it? In any case, why bother with this
| "if invalid, replace with dummy" step? Why not generate/mutate
| a valid TLV value from the start?
| MajesticHobo2 wrote:
| > Is the syntactically correct dummy value the same each
| time? If so, how does that lead to new coverage?
|
| Per my understanding, the dummy value is constant but is
| only used once (to create the first valid TLV). Everything
| after that is a mutation of the original value that, due to
| the custom mutator logic, is a valid TLV. The mutations are
| where new coverage comes from.
|
| > In any case, why bother with this "if invalid replace
| with dummy" step? Why not generate/mutate a valid TLV value
| from the start?
|
| I'm guessing to handle both when there are seed files that
| are already valid (which you'll want to use instead of a
| dummy value), and when there aren't any valid seed files.
| ackbar03 wrote:
| Do you think generative AI/deep learning tools could help with
| this?
| stefan_ wrote:
| Remember, TLV stands for "Type, Length, Value". It's better to
| take your fuzzer's output for the value (or possibly the type
| and value) but generate the length and the final TLV yourself,
| than to have tons of fuzzer-generated sequences fail at the
| very basic type/length check, which is unlikely to be
| vulnerable in software like cURL (but can be in many
| others...). A sketch of that harness-side trick follows.
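|
| A sketch in C, again assuming a toy [type:1][len:1][value]
| layout rather than cURL's real format, with parse_tlv() as a
| stand-in for the code under test:
|
|     #include <stdint.h>
|     #include <stddef.h>
|     #include <string.h>
|
|     /* Stand-in for the parser under test. */
|     void parse_tlv(const uint8_t *data, size_t size);
|
|     int LLVMFuzzerTestOneInput(const uint8_t *data,
|                                size_t size) {
|         /* The first fuzz byte picks the type; the rest is
|            the value. We compute the length field ourselves,
|            so every input survives the basic type/length
|            check and fails deeper in the parser, if at all. */
|         if (size < 1 || size > 256) return 0;
|         uint8_t buf[257];
|         buf[0] = data[0];                    /* type   */
|         buf[1] = (uint8_t)(size - 1);        /* length */
|         memcpy(buf + 2, data + 1, size - 1); /* value  */
|         parse_tlv(buf, size + 1);
|         return 0;
|     }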
| mkj wrote:
| You want the failures to occur deeper in the program so you
| cover more code paths.
| throwaway81523 wrote:
| Lots of good info here about how they added more fuzz tests,
| but no mention of whether the fuzzing actually found any bugs?
| userbinator wrote:
| If they did, I'm sure they would mention it.
| costco wrote:
| OSS-Fuzz is great in many cases, but as this post hints, it
| seems like a lot of the time the fuzzers miss things because
| no one actually reviews the coverage info. Not in this case,
| but sometimes it looks like someone just found the functions
| with signatures like parse(unsigned char *data, size_t size)
| just to get the money from Google for integrating fuzzing.
|
| Then again some things are just difficult to fuzz properly. I
| tried writing a libpurple IRC harness by doing it the "right way"
| with the 3000 libpurple event callbacks you are supposed to set
| up and structures you are supposed to allocate, which worked, but
| it was very slow. I ended up being lazy and only calling
| irc_parse_msg after modifying the code to remove error logging
| that required proper libpurple setup.
___________________________________________________________________
(page generated 2024-03-02 23:01 UTC)