[HN Gopher] Everything Is Broken: Shipping Rust-Minidump at Mozilla
___________________________________________________________________
Everything Is Broken: Shipping Rust-Minidump at Mozilla
Author : mthermidor
Score : 229 points
Date : 2022-06-14 15:19 UTC (7 hours ago)
(HTM) web link (hacks.mozilla.org)
(TXT) w3m dump (hacks.mozilla.org)
| js2 wrote:
| Thank you for this work!
|
| I've been involved with minidumps in one way or another since
| around 2010. Was at a startup at the time that had a browser
| based on Chromium and we needed crash reporting for our own app.
| So I wrote a pretty simple backend that received minidumps, ran
| them through the breakpad processor and shoved the output into
| Splunk. That was our crash-reporting system.
|
| Circa 2013 the company gets acquired by Yahoo which at the time
| was using Crittercism for its mobile apps but Yahoo wasn't happy
| with it. Somehow I was now the mobile app crash reporting expert
| at the company though so I built a whole new in-house crash
| reporting solution.
|
| For iOS I wrote an SDK around PLCrashReporter because unwinding
| stacks on the client works out way better on iOS than dealing
| with a minidump.
|
| For Android I had to deal with both JVM (er, Dalvik, er ART)
| stack traces, easy enough, but also native code crashes. For the
| latter I used breakpad's crash handler and minidumps. But it
| turns out that minidumps from Android devices are almost useless
| for two reasons:
|
| 1) If the crashes originate in managed code or calls into managed
| code you can't trace back through the managed code frames from a
| minidump. Especially if you don't have frame pointers.
|
| 2) You basically cannot get the symbols for all the different
| flavors of Android. Without symbols any stack trace that breakpad
| reconstructs is pretty useless.
|
| Eventually I abandoned minidumps on Android and instead unwound
| on the phone using corkscrew, wait no, libbacktrace, wait no,
| libunwind. But that still doesn't give useful stack traces very
| often. In the end, I ended up capturing logcat output when
| restarting after a crash which actually tends to have the most
| useful stack traces.
|
| Which is all to say, both Apple and Google make it really hard
| for a mobile app to find out why it crashed. Both Android and iOS
| create a crash report for any app which crashes, but the app
| can't access those. So we're all shipping apps with third-party
| crash handlers built-in that try to capture a stack or minidump
| in-process and make sense of it later.
| avgcorrection wrote:
| Gankra is the most entertaining Rust author (Rust programmer who
| writes about Rust). Easily.
| robby_w_g wrote:
| There's an unreasonably grumpy commenter below that disagrees,
| but I personally agree with you and found this to be a fun
| read.
|
| I was interested in the topic before reading, but it could have
| easily been a slog of technical minutiae. I'm glad that wasn't
| the case!
|
| Edit: the comment I referenced was deleted in the time I took
| to post this. It's probably for the best
| tialaramex wrote:
| Mmm. I think @m_ou_se is probably the most _entertaining_ at
| least if we consider that both Saturday Night Live and
| Nightmare On Elm Street are entertainment.
|
| For example, Rust deliberately doesn't have the ternary
| operator, and random other types don't get silently coerced as
| booleans - so you can't write a = x ? 1 : -1; however you can
| write a = if x != 0 { 1 } else { -1 }; with the same effect.
| But Mara isn't satisfied with this verbose yet sensible answer,
| and proposes you could instead, for example:
|
| a = x.count_ones().count_ones().count_ones().count_ones() as i32 * 2 - 1;
|
| Hilarious? Or maybe terrifying? Entertaining certainly.
| https://twitter.com/m_ou_se/status/1404034056405368833?lang=...
|
| Aria is _more informative_ but I'm not going to end up choking
| and spilling my beverage all over the desk.
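|
| A small check of the count_ones trick above, assuming x is a u32
| (the comment does not name a type, and the two function names are
| made up for illustration). Four popcounts in a row collapse any
| nonzero u32 to 1 and leave 0 at 0, so the expression yields 1 or -1:

```rust
/// The readable form: 1 for nonzero, -1 for zero.
fn sign_if(x: u32) -> i32 {
    if x != 0 { 1 } else { -1 }
}

/// The popcount form from the comment above, written for u32.
fn sign_popcount(x: u32) -> i32 {
    // `as` binds tighter than `*`, so this parses as (... as i32) * 2 - 1.
    x.count_ones().count_ones().count_ones().count_ones() as i32 * 2 - 1
}

fn main() {
    // Spot-check a few values, including both extremes.
    for x in [0u32, 1, 2, 3, 7, 255, 0x8000_0000, u32::MAX] {
        assert_eq!(sign_if(x), sign_popcount(x));
    }
    println!("both forms agree");
}
```

| For wider integer types four popcounts may not be enough, so the
| sketch should not be read as holding for every integer width.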
| singhrac wrote:
| > Rust is a really good language for writing parsers. C++ really
| isn't.
|
| One thing I appreciate about writing Rust is that ADT support
| implies writing parsers is simpler under the "parse don't
| validate" mindset (which was clarified for me I think in this [0]
| article).
|
| [0]: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-
| va...
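|
| A minimal sketch of "parse, don't validate" with a data-holding
| enum, assuming a made-up tagged byte format. The Record and
| ParseError types and the tag values are hypothetical and are not
| taken from rust-minidump. The bytes are checked once at the
| boundary, and everything downstream takes a Record:

```rust
use std::convert::{TryFrom, TryInto};

/// Hypothetical record kinds for a made-up tagged binary format.
#[derive(Debug, PartialEq)]
enum Record {
    ThreadCount(u32),
    ModuleName(String),
}

#[derive(Debug, PartialEq)]
enum ParseError {
    TooShort,
    UnknownTag(u8),
    BadUtf8,
}

impl TryFrom<&[u8]> for Record {
    type Error = ParseError;

    fn try_from(bytes: &[u8]) -> Result<Self, Self::Error> {
        // First byte is the tag, the rest is the payload.
        let (&tag, payload) = bytes.split_first().ok_or(ParseError::TooShort)?;
        match tag {
            0 => {
                let raw: [u8; 4] = payload
                    .get(..4)
                    .ok_or(ParseError::TooShort)?
                    .try_into()
                    .map_err(|_| ParseError::TooShort)?;
                Ok(Record::ThreadCount(u32::from_le_bytes(raw)))
            }
            1 => String::from_utf8(payload.to_vec())
                .map(Record::ModuleName)
                .map_err(|_| ParseError::BadUtf8),
            other => Err(ParseError::UnknownTag(other)),
        }
    }
}

/// Downstream code only ever sees an already-parsed Record.
fn describe(record: &Record) -> String {
    match record {
        Record::ThreadCount(n) => format!("{} threads", n),
        Record::ModuleName(name) => format!("module {}", name),
    }
}

fn main() {
    let parsed = Record::try_from(&[0u8, 2, 0, 0, 0][..]);
    assert_eq!(parsed, Ok(Record::ThreadCount(2)));
    assert_eq!(describe(&Record::ModuleName("xul".into())), "module xul");
    assert_eq!(Record::try_from(&[9u8][..]), Err(ParseError::UnknownTag(9)));
}
```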
| marcosdumay wrote:
| It's not ADT, it's strict types. The reason you can't do
| "parse, don't validate" in C++ is because you can't assume
| anything is valid at the point you use the data.
|
| What ADT does is give you enough flexibility so that a strict
| typing system doesn't suck.
| singhrac wrote:
| > you can't assume anything is valid at the point you use the
| data.
|
| Just to double check my understanding: are you talking about
| raw pointers (i.e. void*) being common in C++ and not in
| Rust? You're right that I was using ADT a bit loosely; to be
| honest the main value add for me has been the first class
| data-holding enums/sum types. C++ has std::variant, but the
| syntax support in Rust feels nicer.
| marcosdumay wrote:
| C++ has a series of issues.
|
| You can't trust pointers have a value, or that the value is
| valid, you can't trust that your enums have a value inside
| their interval, or in fact you can't trust that any value
| from any type is inside its interval at all.
|
| You also can't really trust that your values have the
| correct size.
|
| We choose some of those to ignore, otherwise we wouldn't be
| able to program at all, but C++ gives you no guarantees at
| all about anything. The point is that if you do a parsing
| run in C++ and encode your value, you will still get many
| of the above problems because of bugs in your code.
| swolchok wrote:
| > you can't trust that your enums have a value inside
| their interval
|
| If you don't set the underlying type, assigning a value
| that doesn't match an enumerator via `static_cast` is
| undefined behavior. See
| https://en.cppreference.com/w/cpp/language/enum . (Doing
| weird pointer casting things is also undefined behavior
| per the strict aliasing rule, though, come to think of
| it, I'm not sure whether memcpying an out-of-range value
| into an enum through the "reinterpret_cast to `char*`"
| loophole is undefined behavior.)
| marcosdumay wrote:
| Even if that covered all the problem space (instead of
| replacing it with a much larger one), if your code is
| flawless, parsing and validating are equivalent.
|
| Choosing one just makes a difference because code has
| problems.
| nemetroid wrote:
| I'm assuming you are referring to this part:
|
| > If the underlying type is not fixed and the source
| value is out of range, the behavior is undefined.
|
| Note the fine print about the meaning of "out of range":
|
| > (The source value, as converted to the enumeration's
| underlying type if floating-point, is in range if it
| would fit in the smallest bit field large enough to hold
| all enumerators of the target enumeration.)
|
| So this is _not_ undefined:
|
|     enum E { A = 0, B = 1, C = 2 };
|     E valid = static_cast<E>(3);
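|
| The arithmetic behind that fine print can be checked with a short
| sketch, written in Rust only to do the calculation. It assumes all
| enumerators are non-negative, and the function name is made up.
| For enum E { A = 0, B = 1, C = 2 } the largest enumerator is 2,
| which needs a 2-bit field, so the in-range values are 0 through 3:

```rust
/// Largest value storable in the smallest unsigned bit-field that can
/// hold `emax`, the largest enumerator (non-negative enumerators only).
fn largest_in_range(emax: u32) -> u32 {
    // Number of bits needed to represent emax.
    let bits = u32::BITS - emax.leading_zeros();
    if bits == u32::BITS {
        u32::MAX
    } else {
        (1u32 << bits) - 1
    }
}

fn main() {
    // enum E { A = 0, B = 1, C = 2 }: the largest enumerator is 2.
    let bmax = largest_in_range(2);
    assert_eq!(bmax, 3);
    assert!(3 <= bmax); // static_cast<E>(3) is in range
    assert!(4 > bmax); // static_cast<E>(4) would be out of range
    println!("in-range values: 0..={}", bmax);
}
```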
| danShumway wrote:
| It's linked at the bottom of the article, but reminder that
| Gankra's blog (https://gankra.github.io/blah/) has a ton of other
| great writing like this.
|
| In particular, I always recommend "Text Rendering Hates You."
| nindalf wrote:
| I link that article every time I see someone on the internet
| say "that sounds easy, why don't you just"
|
| And the answer is always well, things are more complicated than
| they look. Even something as _trivial_ as rendering text on a
| screen.
| secondcoming wrote:
| Maybe I'm missing something, but they ported from C++ (because
| 'C++ is bad donchaknow') to Rust and still ran into problems
| parsing crash dumps?
|
| If the dump is corrupt then just stop trying to parse/make sense
| of it; it's garbage.
| Gankra wrote:
| No, we removed many random crashes that the C++ code had. You
| cannot "simply" discard a crash report if something is slightly
| off because then you would discard most crash reports. And most
| debuginfo too.
|
| You can't expect "thing that runs when a process may have just
| experienced memory corruption" and "all builds of your
| application for all eternity" and "every toolchain you ever
| built your program with for all eternity" to be even vaguely
| reliable, because those things are in the past and we're trying
| to figure out how to fix the bugs people are experiencing in
| production today.
|
| It is a horribly miserable answer to tell your coworkers "yeah
| sorry I know users are getting thousands of crashes this
| morning but the crash-dumper didn't sign its name in cursive so
| I'm gonna refuse to let you read the letter it sent at all".
|
| And just an incoherent answer to say "yeah I know this is a
| stack overflow but it left the stack in a mildly corrupt state
| so I absolutely refuse to try to even look at the stack and
| figure anything out about it". Like, that is the entire purpose
| of a crashreporter, to investigate a program in an invalid
| state!
| rockdoe wrote:
| Reminds me of "'Your program shouldn't have bugs in it' isn't
| an acceptable position to take for a debugger", from the rr
| folks. Unfortunately I can't find the source of the quote any
| more, but it stuck in my mind.
| Gankra wrote:
| Yeah computing backtraces in a crashreporter is extremely
| similar to a debugger in that you need a lot of fudge-
| factor heuristics and fallback modes for known toolchain
| bugs or common corruptions.
| khuey wrote:
| You're probably remembering
| https://pernos.co/blog/tzcnt-portability/
| gpm wrote:
| Speaking of the rr folk, they also had the fascinating
| point that you can reliably generate a "stack trace" by
| figuring out which `call` instructions were executed with
| what values (also other jump instructions I suppose),
| instead of walking the stack. Thereby skipping the whole
| "parsing the stack is insanely difficult and unreliable"
| issue.
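|
| A toy sketch of that idea, assuming a recorded list of call and
| return events is already available (as a record-and-replay tool
| could provide). The Event type and function name are hypothetical,
| and this is not how any particular debugger implements it. It
| ignores tail calls, longjmp, and other non-standard control flow:

```rust
/// One control-flow event from a hypothetical execution trace.
#[derive(Debug, Clone, Copy)]
enum Event {
    /// A `call` instruction was executed, jumping to `target`.
    Call { target: u64 },
    /// A matching return was executed.
    Ret,
}

/// Replays the events and returns the active call targets, innermost first.
fn backtrace_from_events(events: &[Event]) -> Vec<u64> {
    let mut stack = Vec::new();
    for event in events {
        match *event {
            Event::Call { target } => stack.push(target),
            Event::Ret => {
                // A return with an empty stack is ignored in this sketch.
                stack.pop();
            }
        }
    }
    stack.reverse();
    stack
}

fn main() {
    let events = [
        Event::Call { target: 0x1000 },
        Event::Call { target: 0x2000 },
        Event::Ret,
        Event::Call { target: 0x3000 },
        Event::Call { target: 0x4000 },
    ];
    let trace = backtrace_from_events(&events);
    assert_eq!(trace, vec![0x4000, 0x3000, 0x1000]);
    for (depth, addr) in trace.iter().enumerate() {
        println!("#{} {:#x}", depth, addr);
    }
}
```

| No stack memory is read at any point, which is why a corrupted
| stack would not affect a trace built this way.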
| glandium wrote:
| FWIW, that's from pernosco, not rr.
| gpm wrote:
| I think it's the same people?
| structural wrote:
| This is the excessively fun part of dealing with crash dumps in
| general. Many of them are going to be 1% corrupt, 99% fine, and
| somewhere in them likely has vital information about what
| caused the corruption.
|
| So the entire reason for being for things like rust-minidump
| is to make enough sense out of files that are known to be
| corrupt garbage to be able to find bugs.
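|
| A minimal sketch of that "keep the 99%" approach, assuming a
| made-up format of fixed 5-byte records (one kind byte followed by
| a little-endian u32). The types and the format are hypothetical
| and are not rust-minidump's API. Bad records are reported and
| skipped instead of aborting the whole parse:

```rust
#[derive(Debug, PartialEq)]
struct Entry {
    kind: u8,
    value: u32,
}

#[derive(Debug, PartialEq)]
enum Problem {
    Truncated { index: usize },
    UnknownKind { index: usize, kind: u8 },
}

/// Parses every record it can and collects a problem for each one it cannot.
fn parse_lenient(data: &[u8]) -> (Vec<Entry>, Vec<Problem>) {
    let mut entries = Vec::new();
    let mut problems = Vec::new();
    for (index, chunk) in data.chunks(5).enumerate() {
        if chunk.len() < 5 {
            problems.push(Problem::Truncated { index });
            continue;
        }
        let kind = chunk[0];
        if kind > 2 {
            // Unknown kinds are recorded, and parsing carries on.
            problems.push(Problem::UnknownKind { index, kind });
            continue;
        }
        let value = u32::from_le_bytes([chunk[1], chunk[2], chunk[3], chunk[4]]);
        entries.push(Entry { kind, value });
    }
    (entries, problems)
}

fn main() {
    // Two good records, one with a bogus kind, and a truncated tail.
    let data = [1, 7, 0, 0, 0, 9, 0, 0, 0, 0, 2, 1, 0, 0, 0, 1, 5];
    let (entries, problems) = parse_lenient(&data);
    assert_eq!(entries.len(), 2);
    assert_eq!(problems.len(), 2);
    println!("recovered {:?}, skipped {:?}", entries, problems);
}
```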
| mrlonglong wrote:
| Do tell us more, don't leave us hanging! Loved it.
| ComputerGuru wrote:
| A better/more technical article on the same tech, from Mozilla's
| collaborators on this project:
| https://jake-shadle.github.io/crash-reporting/
| Gankra wrote:
| That article is about the client-side (generating the minidump
| for a crashed process) to this article's server-side
| (processing/analyzing the minidump).
| nindalf wrote:
| This is a fantastic article, thank you for writing it. Looking
| forward to part 2!
| pierrebai wrote:
| If the follow-up post does not make it to HN front page, I'll
| have a hole in my life.
| Gankra wrote:
| Extra shoutouts to the folks at Sentry who also flipped
| rust-minidump on as their default backend and had to deal with way
| more exotic issues than I did (and fixed them!) because although
| Firefox sees some horrendous stuff and gets a bajillion crash
| reports, it's still one application with one basically stable
| minidump writing configuration.
|
| They have to deal with basically random apps doing whatever they
| want and it sounds like hell.
| yjftsjthsd-h wrote:
| > how we got absolutely owned by simple fuzzing
|
| > You are reading part 1, wherein we build up our hubris.
|
| Props to anyone willing to own their faults this readily :)
| j3s wrote:
| What a fun read! :3 I really like your writing style. Deploying
| stuff to production is always so nerve-wracking, I related to
| that very hard. I recently developed a golang alternative to an
| old erlang-ruby-hodgepodge, and when it worked in production I
| found myself constantly not believing that nothing went wrong.
| tclancy wrote:
| Ha, weeks and months of thinking, "Please just work" and then
| it does and it's always a shock.
___________________________________________________________________
(page generated 2022-06-14 23:00 UTC)