[HN Gopher] Flame Graphs: Making the opaque obvious (2017)
       ___________________________________________________________________
        
       Flame Graphs: Making the opaque obvious (2017)
        
       Author : davikr
       Score  : 22 points
       Date   : 2024-06-26 11:43 UTC (1 days ago)
        
 (HTM) web link (tech.popdata.org)
 (TXT) w3m dump (tech.popdata.org)
        
       | PreInternet01 wrote:
       | OK, shameful confession time here: I just _cannot_ grasp flame
        | charts, no matter how hard I try. And yes: that's just me, I'm
       | dumb, etc. etc. (and I freely admit all of that, including the
       | et-ceteras!)
       | 
       | I tried to follow along with things that are relevant to my job,
        | like https://randomascii.wordpress.com/2016/09/05/etw-flame-graph...
        | ...And totally failed? I look at the reddest part of the
       | chart, I look at the peaks, and none of that matches the
       | conclusion reached in the blog post?
       | 
       | And then I tried to grok all the knowledge conveyed in
       | https://www.brendangregg.com/flamegraphs.html and... came away
       | similarly confused? Sure, I see patterns, but when asked to
       | identify which of those patterns indicate problems, I would still
       | score <50%?
       | 
       | And, I mean, I do memory/CPU traces like all day every day, and I
       | fix code all the time based on that, but that's all just based on
       | hierarchical bar charts, which are NOT the same as inverted flame
       | graphs, as far as I can tell?
       | 
       | Anyway, thanks for coming to my therapy session, and feel free to
       | leave your helpful-comment-that-will-finally-enlighten-me and/or
       | downvote at the door.
        
         | eterm wrote:
         | I think I'm the same as you, I much prefer the style where you
         | get an interactive profile and can hit "show me the hot path"
         | and it'll expand down the hot path.
         | 
         | The overall flame charts just don't provide the immediate
         | "Here's what's wrong" you need from a chart.
         | 
          | Charts should be simple enough that you can quickly get an
          | idea of whether the thing you're looking at is "good" or
          | "bad" and, if "bad", quickly see where the problem is.
        
           | PreInternet01 wrote:
           | Well, thank you for pointing out that hierarchical bar charts
           | do a pretty good job at showing the hot path.
           | 
           | So my (and I guess your) workflow is then to fix the top-X
           | hot paths, and then re-profile, etc.
           | 
           | The attraction of flame graphs, as I understand it, is that
           | you can get the same information _without_ the need to
            | iterate. Which is obviously very valuable if you're
           | operating on single-shot bug reports, but... _how_ exactly?
           | (Not a question directed at you, obviously, but more towards
           | the general discussion)
        
             | hobs wrote:
             | I recommend writing your own flame graph generator, the
             | syntax is super simple and it will help you understand the
             | span/subspan nesting.
             | 
              | In some cases it's not that helpful if you don't have the
              | right type of problem. E.g. I was troubleshooting some
              | higher-level stuff (SQL), looking at stored procedure
              | invocations, and a call stack only two levels deep doesn't
              | tell you very much.
              | 
              | For me a FG is more about sounding out the problem and
              | understanding the scope of what is possible, more than
              | fixing an immediate and obvious issue.
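              | 
              | A toy sketch of what such a generator's input stage might
              | look like (my own illustration, with made-up frame names;
              | the "folded" line format is what flamegraph.pl consumes):
              | 
              |     from collections import Counter
              |     
              |     # Each sample is one call stack, root first.
              |     samples = [
              |         ["main", "parse", "alloc"],
              |         ["main", "parse", "alloc"],
              |         ["main", "render"],
              |     ]
              |     
              |     # Folded format: "main;parse;alloc 2" -- one line per
              |     # unique stack; the count becomes the bar's width.
              |     folded = Counter(";".join(stack) for stack in samples)
              |     for stack, count in folded.most_common():
              |         print(stack, count)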
        
             | eterm wrote:
             | The jetbrains profiler lets you quickly "Scope to X", so
             | you can scope to the second or third highest ranked
             | functions and see their breakdown too, so you don't need to
             | fix the highest to get a full understanding.
        
         | throw0101b wrote:
          | > _And then I tried to grok all the knowledge conveyed in
          | https://www.brendangregg.com/flamegraphs.html_
         | 
         | Have you watched the LISA 2013 presentation video?
         | 
         | * https://www.youtube.com/watch?v=nZfNehCzGdw
         | 
          | * https://www.usenix.org/conference/lisa13/technical-sessions/...
         | 
         | Some follow-on presentations from him, I found the first one
         | useful:
         | 
         | * https://www.youtube.com/watch?v=D53T1Ejig1Q&t=12m50s
         | 
         | * https://www.youtube.com/watch?v=VMpTU15rIZY&t=8m9s
        
         | rkallos wrote:
         | > I look at the reddest part of the chart, I look at the peaks
         | 
          | Neither of these is really where I look when examining flame
          | graphs. I tend to look at the bottom and work my way up.
          | 
          | The key thing (imo) to look for is wide pieces that are not a
          | core part of what the code you're profiling is supposed to do.
         | 
         | In the first example of your first link, you have a flame graph
         | of code that seems to draw an image. There are a bunch of
         | functions like 'CopyPixels' and 'RasterizeEdges'. Those seem to
         | be a core part of drawing an image. In contrast, there's a
         | 'KiPageFault' function that's pretty wide that does not seem to
         | be important to the code's overall goal. I think a bit more
         | context is needed to arrive at the author's conclusion, though.
         | 
         | I think flame graphs, like all graphs, are more helpful when
         | the reader has a lot of context about what's supposed to
         | happen, or some intuition about how the chart is supposed to
         | look.
        
           | PreInternet01 wrote:
           | > I think flame graphs, like all graphs, are more helpful
           | when the reader has a lot of context about what's supposed to
           | happen, or some intuition about how the chart is supposed to
           | look.
           | 
           | Yeah, and then your comment... just ends? So, what I get here
           | is that, in a flame graph, the reddest part isn't the most
           | interesting, and neither is the widest part, nor the part
           | with the most peaks.
           | 
            | So, what, _exactly_, am I looking for? "You know what this
           | graph should look like in case all was OK" is not exactly
           | helpful, right, because I _might_ just lack that baseline?
        
             | spencerflem wrote:
             | As I understand it, flame graphs add to normal x-axis-as-
             | time ones by merging repeated calls to the same function,
             | so if one function is called a lot it shows up as one wide
             | chunk and not many small chunks spread out.
             | 
             | So yes, height doesn't matter much, you read bottom to top
             | to get context, and width is what you're looking for.
             | 
             | But like any chart, it can't tell you what to optimise, or
              | what can be optimized. It just tells you what parts of the
             | code are taking up the most time, in a hierarchical way.
             | Asking for a complete description of what to see here is
             | like asking for a complete description of what to look for
             | in a histogram. Like, ofc it matters what the histogram is
             | of!
        
             | SassyBird wrote:
             | _> So, what, exactly, am I looking for?_
             | 
              | This can't be answered in general. Flamegraphs are
              | measurements of what happened. But just like a ruler
              | doesn't tell you whether a given human is atypically short
              | or tall for their species, a flamegraph can't tell you
              | which portion of the program takes too long. You need
              | prior knowledge about data structures, algorithms, memory
              | bandwidth etc. in order to compare your justified
              | expectations with reality and be surprised by something.
              | And it will all depend on the particular program you
              | profile.
        
         | simscitizen wrote:
         | > I do memory/CPU traces like all day every day, and I fix code
         | all the time based on that
         | 
         | So I take it you understand one of the standard visualizations
         | produced by a CPU profiling tool, e.g. it takes a call stack
         | sample every millisecond. The x-axis is time (one sample per
         | ms, if you have only one CPU), and the y-axis is the call
         | stack.
         | 
         | Now for a flamegraph, you basically have the same
         | visualization, but you sort the samples across the x-axis so
         | that callstacks that start with the same prefixes are grouped
         | together.
         | 
         | Incidentally, the sorting of samples across the x-axis destroys
         | the time information which is often critical. I constantly see
         | engineers who don't understand flamegraphs looking at the
         | entire flamegraph of a say 3 second trace, and then trying to
         | use that whole flamegraph to optimize a 100 ms critical path
         | within the trace, which is totally nonsensical.
        
           | simscitizen wrote:
            | As an example, imagine you sampled a program and got 5 CPU
            | call stack samples:
            | 
            |     c       c
            |     b       b       d
            |     a       a       a       a       a
            |     main    main    main    main    main
            | 
            | In a flamegraph, you would see:
            | 
            |     [c             ]
            |     [b             ][d     ]
            |     [a                                     ]
            |     [main                                  ]
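            | 
            | Roughly, the grouping could be computed like this (a toy
            | sketch of the idea, not any real profiler's code): bucket
            | the samples by their frame at each depth, and each bucket's
            | sample count becomes the width of that bar.
            | 
            |     from itertools import groupby
            |     
            |     samples = [["main", "a", "b", "c"],
            |                ["main", "a", "b", "c"],
            |                ["main", "a", "d"],
            |                ["main", "a"],
            |                ["main", "a"]]
            |     
            |     def frames(stacks, depth=0):
            |         # Sort so equal prefixes sit next to each other,
            |         # then group by the frame at this depth; the group
            |         # size is the width of the resulting bar.
            |         tails = sorted(s for s in stacks if len(s) > depth)
            |         by_frame = groupby(tails, key=lambda s: s[depth])
            |         for name, group in by_frame:
            |             group = list(group)
            |             print("  " * depth + name, len(group), "samples")
            |             frames(group, depth + 1)
            |     
            |     frames(samples)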
        
             | PreInternet01 wrote:
             | Yeah, I imagine it, and still don't see how the flame graph
             | would help?
             | 
             | Shown as a hierarchical bar chart, this would suggest 'b'
             | is problematic.
             | 
             | Where, color-wise (because peak-wise, 'c' would be the
             | culprit here) do I see this issue in a flame graph? Because
             | I fear that either 'main' or 'a' would have the most
             | dominant shade of red here?
        
               | simscitizen wrote:
               | Peaks don't matter, they just correspond to the depth of
               | the call stack.
               | 
                | Probably the simplest way to use the flame graph is to
                | work from the bottom of the flamegraph and walk upwards
                | until you find something interesting to optimize.
                | Ideally you find something wide to optimize that makes
                | sense. (The widest thing here is "main", which is
                | probably not the interesting thing to optimize, so you
                | would work upwards from there.) The basic idea is that
                | things that are wide in the flamegraph are expensive
                | and potential things to optimize.
               | 
               | Where I work, we have tools that can produce diffed
               | flamegraphs which can be really useful in figuring out
               | why one trace uses so much more/less CPU than another.
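                | 
                | The diff idea is easy to sketch (a toy version with
                | made-up numbers, not our actual tooling): subtract one
                | folded profile from the other and look at the signed
                | deltas per stack.
                | 
                |     from collections import Counter
                |     
                |     # Two profiles as {folded stack: sample count}.
                |     before = Counter({"main;a;b": 120, "main;a;d": 30})
                |     after  = Counter({"main;a;b": 200, "main;a;d": 25})
                |     
                |     # Positive delta = this stack got more expensive.
                |     for stack in sorted(set(before) | set(after)):
                |         delta = after[stack] - before[stack]
                |         print(f"{stack:10s} {delta:+d}")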
        
               | PreInternet01 wrote:
               | > Probably the simplest way to use the flame graph is
               | work from the bottom of the flamegraph and walk upwards
               | until you find something interesting you optimize
               | 
               | OK, so going by what is apparently the 'simple example'
                | in the linked article:
                | https://tech.popdata.org/images/cps1970_before_fix_dwarf_gcc...
               | 
               | I work my way up. First thing that is really red is
               | Conversion::Process::Run, but that probably wraps a lot
               | of things, so I keep going up.
               | 
               | Next is Cps::Editor::relate_edits, or possibly
               | EditingAPI::Rules::countPeopleMatching, because it's a
               | darker red?
               | 
               | And then there is another red-ish function, followed(?)
               | by some yellow-colored (and thus unimportant?) stack
               | entries, and then the apparent culprit:
               | Record::hasVariable.
               | 
               | So, and I'm truly not trying to be difficult or
               | argumentative here: how was I supposed to pick out
                | 'Record::hasVariable' right away from
                | 'https://tech.popdata.org/images/cps1970_before_fix_dwarf_gcc...'?
               | 
               | The first function that is red being called from yellow-
               | colored functions with about the same duration (width)?
               | And if so, why is Metadata::Cache::getVarsByName not a
               | more likely optimization target?
        
               | fwip wrote:
               | Other user here: Confession - I don't actually know what,
               | if anything, the colors mean in a flamegraph. They seem
               | random to me.
               | 
               | The way I'd personally hone in on Record::hasVariable is
               | that it's a relatively-simple sounding function (from the
               | name) that is taking a large portion of the X-axis.
               | Starting at the bottom, I'd go "main -> editInOrder ->
               | relateEdits -> countPeopleMatching -> getSourceDataAsLong
               | -> hasVariable." Then I'd be like "we really spend 47% of
               | our time in this simple-sounding function? What's it
               | doing?"
               | 
               | Basically, I look for the functions that have an outsized
               | complexity/time ratio. A function with a simple task is
               | usually easier to optimize, and a function that only runs
               | for 2% of your program isn't worth spending the time to
               | optimize.
        
               | eterm wrote:
               | Indeed, the flame chart can't tell you that.
               | 
               | The solution provided in the article seems to rip out
               | `Metadata::Cache::getVarsByName` entirely. If it were
               | easy to optimise `Metadata::Cache::getVarsByName`
               | instead, then that would also have been a suitable
               | optimisation.
               | 
               | I guess domain knowledge and experience let them know
               | which optimisation was more suitable here.
        
         | nemetroid wrote:
         | Did you read the linked article? It starts with a very simple
         | example and then builds up to a real-world case, unlike the
         | articles you linked, which both start off with an advanced
         | case.
        
           | PreInternet01 wrote:
           | > Did you read the linked article?
           | 
           | From https://news.ycombinator.com/newsguidelines.html:
           | "Please don't comment on whether someone read an article"
           | 
           | And yes, I did, and I truly don't grasp even the 'simple'
           | explanation. I already admitted I'm dumb.
           | 
           | Do you have _anything_ to add to the discussion?
        
             | nemetroid wrote:
             | > Do you have anything to add to the discussion?
             | 
             | Do _you_? Can you point out where it (edit: the simple
              | example) breaks down, instead of just repeating that you're
              | dumb?
        
               | PreInternet01 wrote:
                | By the 'simple example', I assume you mean
                | https://tech.popdata.org/images/cps1970_before_fix_dwarf_gcc...?
               | 
               | If so, the widest, reddest, and highest-magnitude
               | function I see there is __int__malloc. Right?
               | 
               | Yet, reading on, it seems that has nothing to do with
               | anything, and the actual culprit is Record::hasVariable,
               | which is somewhere in the middle of the graph, and not
               | any redder or less red, or wider or less wide, than
               | _many_ other functions.
               | 
               | So, looking at just the first graph, how am I supposed to
               | immediately spot the culprit?
        
               | vessenes wrote:
               | I haven't read the article. However, here is how I read
               | the chart:
               | 
               | Record::hasVariable takes a long time. You can see that
               | because it's wide and red. The immediate ones below it on
               | the chart don't do anything but call it; you can see that
               | because they are basically the same width. hasVariable
               | splits into two calls (the row above). I'm going from
               | memory here because I don't have the chart here, but I
               | think it should be clear from the trace:
               | 
               | A) hasVariable takes a lot of compute time
               | 
               | B) from the name, this seems surprising / non optimal.
               | 
               | C) Digging into the functions above it will yield chances
               | to optimize.
               | 
               | I do agree that the malloc is sort of a surprising amount
               | of the total percentage of hasVariable. Again, just from
               | the flame graph and function names, I'd bet that some
               | memory allocations are being done in a loop inside and
               | would be much faster to allocate outside the hasVariable
               | call, and get reused.
        
               | PreInternet01 wrote:
               | > Record::hasVariable takes a long time
               | 
               | As does Memdata::Cache::getVarsByName[blah] right above
               | it, and many functions below (unsurprisingly, but
               | still...), and they all have pretty much the same width
               | and color.
               | 
               | The point of flame graph proponents is that "you see
               | where the problem is right away."
               | 
               | My question is: "how, exactly?" And the answers so far
                | seem to be mostly... lacking, to the point that I now
               | officially declare flame graphs a cargo cult that is in
               | no way superior to my "hierarchical bar charts"
               | religion...
        
               | nemetroid wrote:
               | By simple example, I mean this:
               | https://tech.popdata.org/images/flamegraph-example.svg
               | 
               | But looking at the one you're referencing, slightly out
               | of order:
               | 
               | > highest-magnitude function I see there is
               | __int__malloc.
               | 
               | Height shows stack depth, not magnitude.
               | 
               | > the actual culprit is Record::hasVariable, which is
               | somewhere in the middle of the graph, and not any redder
               | or less red, or wider or less wide, than many other
               | functions.
               | 
               | > not any redder or less red
               | 
               | The colors are randomized to help with contrast
               | (Brendan's website mentions this practice), so they
               | aren't conveying any information.
               | 
               | > or wider or less wide
               | 
                | That's not really true. Hovering over Record::hasVariable
                | tells you that this bar covers 45.6% of the runtime. The only
               | bars wider than that are the callers of
               | Record::hasVariable (edit: rather, the stack through
               | which Record::hasVariable is being called), i.e. the bars
               | on which Record::hasVariable is resting.
               | 
               | > somewhere in the middle of the graph
               | 
               | Sure - being in the middle height-wise means that it's
               | somewhere in the middle of the call stack. But there are
               | some clues to its relevance:
               | 
               | 1. It's close to the boundary between application code
               | and standard library code. It does call
               | MetaData::Cache::getVarsByName, which (going by the name)
               | also is part of the application, but everything deeper in
               | the stack (i.e. on top of those bars) is purely std::
               | stuff.
               | 
               | 2. Domain knowledge. The text alludes to this:
               | Record::hasVariable is a conceptually simple operation
               | that's not expected to be a major part of the runtime.
               | 
               | This does not mean that Record::hasVariable must be the
               | culprit. Maybe some function higher in the call stack
               | (e.g. EditingAPI::Rules::getSourceDataAsLong) is calling
               | Record::hasVariable way too many times? But it's a good
               | place to start looking.
        
               | PreInternet01 wrote:
               | Ah, OK, so there is a graph, where magnitude is
               | meaningless, colors are meaningless, runtime is relative,
               | yet, with "enough domain knowledge" you can "see where
               | the problem is right away"...
               | 
               | I'm pretty sure we're done here.
        
               | nemetroid wrote:
               | > I'm pretty sure we're done here.
               | 
               | Sorry to hear.
        
         | starkrights wrote:
         | I don't know if it helps, but this is how I read through the
         | graph from the first article after reading the authors
         | conclusion. (I'm an idiot)
         | 
          | - Okay, we're analyzing DvrDrawImage.
          | 
          | - Let's analyze the function on the left,
          | GpBitmap::PipeLockBits.
          | 
          | - Up through ::Decode, it takes up 100% of the time from
          | PipeLockBits, so we're just racing down the call stack to get
          | to that point, and _then_ other stuff starts taking up time.
          | 
          | - Lots of WindowsCodecs calls; I guess we kind of expect those
          | to be valid since image processing needs codecs and stuff and
          | things.
          | 
          | - Okay, now there's a bar taking up ~45-50% of the ::Decode
          | call that's not a codec- or drawing-related thing: it's an
          | OS-level page fault. That's fucky.
          | 
          | - Since the ::Decode call is ~55-65% of the ::DvrDrawImage
          | call, that page fault is (~40% of ~60% =) ~25% of this draw
          | image call.
         | 
         | So essentially I looked and saw a stack of bars altogether and
         | went 'okay, the call at the top of this is the only reason the
          | call 6 bars up the stack (i.e., 6 bars down on the flamegraph)
          | is taking time, why?', and started applying the same process
          | recursively from there.
         | 
         | That is, until $why == 'due to codec calls and an ntkernel
         | call', and the question becomes 'is that valid?' Then the
         | question becomes "are these codec calls taking the amount of
         | time I would expect?", and "
         | 
         | iiuc peaks and colors don't matter. Tbf I think the
          | colorscheming on this chart is shit; afaik flamegraph coloring
         | is just a way of marking the same (or from the same library?
         | Implementation dependent?) function for easy spotting. Like why
         | not make the functions wildly different colors. Redness !=
         | hotness of a certain codepath.
         | 
         | The only thing peaks indicate is calls that have deeply nested
         | function calls, which I guess would be important if you're
         | having recursion issues.
        
       | zlurkerz wrote:
       | How do flame graphs handle the case where most of the time is
       | spent in some leaf function that is called from all over the
       | program? In this case, each individual stack would not take much
       | time but in aggregate, a lot of time is spent in the function at
       | the top of all of the call stacks. This should not be that
       | uncommon to have hotspots in things like copying routines,
       | compression, encryption etc that are not associated with any
       | particular stack.
       | 
       | pprof from https://github.com/google/pprof can produce a DAG view
       | of a profile where nodes are sized proportional to their
       | cumulative time, e.g.,
       | https://slcjordan.github.io/images/pprof/graphviz.png and such a
       | view would seem to cover the case above and subsume the usual use
       | cases for a flame graph, would it not?
       | 
       | Although I guess a flat text profile of functions sorted by time
       | would also highlight these kinds of hot spots. Still, if we want
       | a single graphical view as a go-to, it's not clear that flame
       | graphs are all that much better than pprof DAGs.
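        | 
        | For what it's worth, that aggregation is easy to compute from
        | the same folded-stack data (a toy sketch with made-up names and
        | numbers); it's essentially what a bottom-up/inverted view or a
        | flat profile gives you:
        | 
        |     from collections import Counter
        |     
        |     # Folded stacks with sample counts.
        |     folded = {
        |         "main;parse;memcpy":   40,
        |         "main;render;memcpy":  35,
        |         "main;network;memcpy": 30,
        |         "main;render;draw":    20,
        |     }
        |     
        |     # Total samples per function, summed across every stack
        |     # it appears in.
        |     totals = Counter()
        |     for stack, count in folded.items():
        |         for func in set(stack.split(";")):
        |             totals[func] += count
        |     
        |     print(totals.most_common(3))
        |     # memcpy ranks right behind main even though no single
        |     # stack is wide.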
        
         | mandarax8 wrote:
         | My flame graph tool (KDAB hotspot) has a bottom-up flame graph
         | for that purpose.
        
           | zlurkerz wrote:
           | Right, but then you have to know to consult both top-down and
           | bottom-up. Seems like a DAG is a union of these views.
        
         | nemetroid wrote:
         | > How do flame graphs handle the case where most of the time is
         | spent in some leaf function that is called from all over the
         | program?
         | 
         | In my experience, they don't really. They're very good for
         | finding easy wins where a high/medium-level operation takes
         | much longer than you'd expect it to, not so much for finding
         | low-level hotspots in the code.
         | 
         | > such a view would [...] subsume the usual use cases for a
         | flame graph, would it not?
         | 
         | I don't think it does. A flame graph gives you a good idea of
         | the hierarchical structure (and quickly). Sometimes you can
         | work that out from a DAG, but not necessarily:
         | 
         | Let's say that A and B call M, and M calls X and Y. All four
         | edges have roughly the same value. A flame graph can show you
         | that 90% of the time spent in X started with A and that 90% of
         | the time spent in Y started with B, but the DAG can't.
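          | 
          | Concretely (made-up numbers), the folded stacks behind that
          | scenario might be:
          | 
          |     A;M;X 90    B;M;X 10
          |     A;M;Y 10    B;M;Y 90
          | 
          | The flame graph keeps the full paths, so the A->X and B->Y
          | affinity is visible. The DAG only keeps per-edge totals (A->M
          | 100, B->M 100, M->X 100, M->Y 100), which would look exactly
          | the same if the affinities were swapped, so the correlation
          | is lost.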
        
         | tanelpoder wrote:
          | I had the same experience/question at some point when I noticed
          | (and investigated) why a new Ubuntu Linux install didn't break
          | out interrupt usage in vmstat anymore. [1]
          | 
          | Perf+flamegraphs were able to catch in-interrupt-handling
          | samples, as I was on a bare metal machine with PMU counters
          | accessible. But the various NVMe I/O completion interrupts
          | happened all over the place, even when the CPUs were in random
          | userspace code sections. Neither the bottom-up nor the top-down
          | FlameGraph approach made it visually clear in this case.
         | 
         | But since interrupt time _did_ show up and since the FlameGraph
         | JS tool had a text search box, I searched for  "interrupt" or
         | "irq" in the search box - and the JS code highlighted all
         | matching sections with purple color (completely distinct from
         | the classic color range of flamegraphs). And seeing that purple
         | color all over the place made me smile and gave me a strong
         | visual "a-ha" moment. Probably with some extra JS tinkering,
         | these kinds of "events scattered all over the place" scenarios
         | could be made even clearer.
         | 
         | [1] https://news.ycombinator.com/item?id=26139611 (2021)
        
       | irogers wrote:
        | Just to advertise: the perf tool has inbuilt flamegraph
        | generation code these days (well, leaning on D3.js). So `perf
        | script report flamegraph` will convert a perf.data file into a
        | flamegraph.html. Similarly, there is `perf script report gecko`
        | to write out the Firefox Profiler's JSON format.
        
       ___________________________________________________________________
       (page generated 2024-06-27 23:01 UTC)