[HN Gopher] Strobelight: A profiling service built on open sourc...
       ___________________________________________________________________
        
       Strobelight: A profiling service built on open source technology
        
       Author : birdculture
       Score  : 126 points
       Date   : 2025-03-07 14:43 UTC (8 hours ago)
        
 (HTM) web link (engineering.fb.com)
        
       | saganus wrote:
       | Ah, this is performance profiling.
       | 
       | Seeing the title and the domain I thought this was user profiling
       | and I was wondering why Meta would be publishing this.
        
         | hunter2_ wrote:
         | > the domain
         | 
         | Perhaps a contributing factor is how HN shows only the final
         | non-eTLD [0] label of the domain. If it showed all labels,
         | you'd have seen "engineering.fb.com" which, while not a dead
         | giveaway, implies that the problem space is technical.
         | 
         | It would be nice if this aggressive truncation were applied
         | only above a certain threshold of length.
         | 
         | [0] https://en.wikipedia.org/wiki/Public_Suffix_List
        
       | doctorhandshake wrote:
       | I would assume the name is a reference to the use of strobes in
       | examining high speed periodic motion, like that in motors or on
       | production lines, eg:
       | https://www.checkline.com/inspection_stroboscope
        
       | arnath wrote:
       | This is really cool! I've always thought that one thing
       | preventing major competitors to AWS/Azure/GCP is the lack of
       | easy-to-use tooling for machine level monitoring like this. When
       | I was at Microsoft, we built a tool like this that used Windows
       | Firewall filters to track all the network traffic between our
       | services and it was incredibly useful for debugging.
       | 
       | That said, as with anything from Meta, I approach this with a
       | grain of salt and the fact that I can't tell what they stand to
       | gain from this makes me suspicious.
        
         | theptip wrote:
         | > the fact that I can't tell what they stand to gain from this
         | makes me suspicious.
         | 
         | Meta is one of the biggest contributors to FOSS in the world.
         | (React, PyTorch, Llama, ...). They stand to gain what every big
         | company does, a community contributing to their infra.
         | 
         | You'll note that nobody is open sourcing their ad recommender;
         | that is the one you should be skeptical about if you ever see
         | it. You don't share your secret sauce.
        
           | schmorptron wrote:
           | ByteDance shared the TikTok content recommender, which I'd
           | argue is somewhat close to an ad recommender :)
        
             | ipsum2 wrote:
             | You mean the paper, not the source code?
        
           | paxys wrote:
           | Plus it helps them recruit engineers who are already familiar
           | with their tech stack.
        
           | ipsum2 wrote:
           | > You'll note that nobody is open sourcing their ad
           | recommender
           | 
           | Actually... (2019)
           | https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-l...
           | 
           | Source code:
           | 
           | https://github.com/facebookresearch/dlrm
           | 
           | Paper:
           | 
           | https://arxiv.org/abs/1906.00091
           | 
           | Updated 2023 blog post, solely about content recommendation,
           | but ads recommendation is ~90% the same:
           | 
           | https://engineering.fb.com/2023/08/09/ml-applications/scalin...
           | 
           | It's a little out of date, but the internal one is built with
           | the same concepts, just more advanced modeling techniques and
           | data.
        
         | mhlakhani wrote:
         | As a sibling commenter said, it helps brand and recruiting,
         | which Meta cares about.
        
           | bigtimesink wrote:
           | Maybe, but the gold chain, million dollar watch wearing CEO
           | talking about masculine energy doesn't help the brand.
        
             | jay-barronville wrote:
             | > Maybe, but the gold chain, million dollar watch wearing
             | CEO talking about masculine energy doesn't help the brand.
             | 
             | Why not exactly? Between Meta's great contributions to the
             | open-source ecosystem and Mark behaving more like a normal
             | man nowadays, right now is the only time in a long time
             | that I've considered applying to go work at Meta. I've
             | heard several of my colleagues and friends say the same
             | thing in recent months.
        
               | quesera wrote:
               | Imagining that there's _anything_ "normal" about that
               | knucklehead is why "masculinity" is such an easy target
               | for parody.
        
               | martinsnow wrote:
               | What's unattractive about "how do you do, fellow humans"?
        
               | jay-barronville wrote:
               | > Imagining that there's anything "normal" about that
               | knucklehead is why "masculinity" is such an easy target
               | for parody.
               | 
               | You're certainly entitled to your opinions and ad
               | hominems. Many folks, including myself, disagree with
               | you, so there's that.
        
               | quesera wrote:
               | Yep, and you yours of course.
               | 
               | But man is that dude a bad example of how to be a human.
               | 
               | I'll cut him some slack for growing up in public with
               | stupid money and no one to regulate his impulses, but uff
               | da.
               | 
               | Wake me up when he's old enough for his lagging
               | prefrontal cortex to catch up with the rest of him.
        
       | Thaxll wrote:
       | I recommend https://grafana.com/oss/pyroscope/ for continuous
       | profiling; I use it in Go and it works well.
       | 
       | They support many languages:
       | https://grafana.com/docs/pyroscope/latest/configure-client/l...
       | (also based on eBPF).
        
         | outerspace wrote:
         | Good to know there's an OSS alternative.
        
           | hassleblad23 wrote:
           | Strobelight is open source as well.
        
           | tdullien wrote:
           | The Otel profiling agent (formerly prodfiler, then Elastic
           | profiler) is the underlying OSS.
        
       | varunneal wrote:
       | Cool anecdote from inside the article:
       | 
       | > A seasoned performance engineer was looking through Strobelight
       | data and discovered that by filtering on a particular std::vector
       | function call (using the symbolized file and line number) he
       | could identify computationally expensive array copies that happen
       | unintentionally with the 'auto' keyword in C++.
       | 
       | > The engineer turned a few knobs, adjusted his Scuba query, and
       | happened to notice one of these copies in a particularly hot call
       | path in one of Meta's largest ads services. He then cracked open
       | his code editor to investigate whether this particular vector
       | copy was intentional... it wasn't.
       | 
       | > It was a simple mistake that any engineer working in C++ has
       | made a hundred times.
       | 
       | > So, the engineer typed an "&" after the auto keyword to
       | indicate we want a reference instead of a copy. It was a one-
       | character commit, which, after it was shipped to production,
       | equated to an estimated 15,000 servers in capacity savings per
       | year!
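
A minimal sketch of the kind of mistake the anecdote describes. This is not Meta's code: `Catalog` and `get_items` are hypothetical names, and the only point is that plain `auto` drops the reference returned by the getter and copies the whole vector, while `auto&` binds to the original.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a service object that owns a large vector.
struct Catalog {
    std::vector<int> items;
    const std::vector<int>& get_items() const { return items; }
};

// Plain `auto` deduces std::vector<int>: the reference is dropped and the
// whole vector is copied into a new buffer.
bool copies_with_plain_auto(const Catalog& c) {
    auto v = c.get_items();
    return v.data() != c.items.data();  // different buffer, so a copy happened
}

// `auto&` deduces const std::vector<int>&: no copy, v aliases c.items.
bool aliases_with_auto_ref(const Catalog& c) {
    auto& v = c.get_items();
    return v.data() == c.items.data();  // same buffer, so no copy happened
}
```

Calling both functions on a `Catalog` holding a non-empty vector returns `true` in each case: the first confirms the silent copy, the second confirms that the one-character change removes it.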
        
         | JoshTriplett wrote:
         | It's a cool anecdote. It's also a case study in heavyweight
         | copies being something that _shouldn't_ happen by default, and
         | should require explicit annotation indicating that the engineer
         | _expects_ a heavyweight copy of the entire structure.
        
           | mhlakhani wrote:
           | I don't know if that would have helped here, if memory serves
           | me right:
           | 
           | 1. The copy was needed initially
           | 2. This structure wasn't as heavy back then
           | 
           | ... over time the code evolved so it became heavy and the
           | copy became unnecessary. That's harder to find without
           | profiling to guide things
        
           | ehsankia wrote:
           | If it's safety/correctness versus performance, I think the
           | default should be the former. Copying, while inefficient, is
           | generally more correct and avoids hard-to-debug errors. It's
           | the whole discussion about premature optimization. I'd rather
           | make a copy than make sure the array is not mutated anywhere
           | ever.
        
             | ltbarcly3 wrote:
             | Yes, everyone agrees with you. The claim you responded to
             | was that you should have to be explicit, because it is very
             | easy to _unintentionally_ copy. For example, it is easy to
             | copy when there is never more than one live pointer to a
             | data structure. It's easy to copy when you allocate a
             | resource in a function and return it, which makes the
             | original an orphan which is then immediately freed. It's
             | extremely easy to make a mistake which prevents move from
             | working and you have to go back and carefully check if you
             | want to be sure. It should be trivial to just say "move
             | this" and if something isn't right it's an error at compile
             | time, rather than just falling back to silently being
             | wasteful.
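
One concrete way a move can silently degrade into a copy in C++, as a hedged illustration of the point above: `std::move` on a `const` object produces a `const` rvalue, which cannot bind to the move constructor, so the copy constructor is chosen with no diagnostic. `Buffer` and the two counters are hypothetical names used only to make the effect observable.

```cpp
#include <cassert>
#include <utility>

// Global counters so the sketch can observe which constructor ran.
int g_copies = 0;
int g_moves = 0;

// Hypothetical type that records how it was constructed.
struct Buffer {
    Buffer() = default;
    Buffer(const Buffer&) { ++g_copies; }
    Buffer(Buffer&&) noexcept { ++g_moves; }
};

// Moving from a non-const object selects the move constructor.
void move_from_mutable() {
    Buffer src;
    Buffer dst = std::move(src);
    (void)dst;
}

// std::move on a const object yields const Buffer&&. That cannot bind to
// Buffer&&, so overload resolution falls back to the copy constructor.
void move_from_const() {
    const Buffer src{};
    Buffer dst = std::move(src);
    (void)dst;
}
```

After calling `move_from_mutable()` and then `move_from_const()`, the counters read one move and one copy, even though both functions spell out `std::move`.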
        
             | umanwizard wrote:
             | This exact problem is basically why Rust exists.
        
             | JoshTriplett wrote:
             | I'm not saying it should silently alias any more than it
             | should silently copy. It should _give an error_, and
             | require the developer to explicitly copy or explicitly
             | alias.
        
         | mhlakhani wrote:
         | That one diff blew my mind when I saw it. It's a prime example
         | of that story about "you paid me a lot of money to know _where_
         | to fix that pipe"
        
         | howlallday wrote:
         | >The best minds of my generation are thinking about how to make
         | people click ads. -- Jeff Hammerbacher
        
       | brancz wrote:
       | We're working hard to bring a lot of Strobelight to everyone
       | through Parca[0] as OSS and Polar Signals[1] as the commercial
       | version. Some parts already exist, with much more to come this
       | year! :)
       | 
       | [0] https://www.parca.dev/
       | 
       | [1] https://www.polarsignals.com/
       | 
       | (Disclaimer: founder of Polar Signals)
        
       | quelup wrote:
       | Strobelight is a lifesaver. Especially with high-QPS services -
       | makes it much easier to see where it's worth spending time trying
       | to optimize.
        
       | samstave wrote:
       | DOPE
       | 
       | Fractal compute expense modeling is hard.
       | 
       | One may do well in applying fluid dynamics (such that we cannot
       | maintain in head)
       | 
       | to compute requirements; it will be funny once we realize that
       | everything is micro (pico) fluid dynamics in general.
        
       | Starlord2048 wrote:
       | Between LLVM's optimization passes, static analysis, and modern
       | LLM-powered tools, couldn't we build systems that not only
       | identify but automatically fix these performance issues? GitHub
       | Copilot already suggests code - why not have "Copilot
       | Performance" that refactors inefficient patterns?
       | 
       | I'm curious if anyone is working on "self-healing" systems where
       | the optimization feedback loop is closed automatically rather
       | than requiring human engineers to parse complex profiling data.
        
       | maknee wrote:
       | "All of this is made possible with the inclusion of frame
       | pointers in all of Meta's user space binaries, otherwise we
       | couldn't walk the stack to get all these addresses (or we'd have
       | to do some other complicated/expensive thing which wouldn't be as
       | efficient)"
       | 
       | This makes things so, so, so much easier. Otherwise, a lot of
       | effort has to be put into creating an unwinder in eBPF code,
       | essentially porting .eh_frame CFA/RA/BP calculations.
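
To make the contrast concrete, here is a rough sketch of why frame pointers make unwinding cheap. It assumes the common x86-64 layout where the word at the frame pointer holds the caller's saved frame pointer and the next word holds the return address. Memory is simulated with a map so the example stays self-contained; `Memory` and `walk_frame_pointers` are illustrative names, not Strobelight or Parca APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Simulated target memory: address -> 8-byte word. A real profiler would
// read the sampled thread's stack instead.
using Memory = std::map<std::uint64_t, std::uint64_t>;

// Follow the chain of saved frame pointers and collect return addresses.
// Assumed layout: [fp] = caller's saved fp, [fp + 8] = return address.
std::vector<std::uint64_t> walk_frame_pointers(const Memory& mem,
                                               std::uint64_t fp,
                                               std::size_t max_depth = 64) {
    std::vector<std::uint64_t> return_addresses;
    while (fp != 0 && return_addresses.size() < max_depth) {
        auto saved_fp = mem.find(fp);
        auto ret = mem.find(fp + 8);
        if (saved_fp == mem.end() || ret == mem.end()) break;  // unreadable
        return_addresses.push_back(ret->second);
        // The stack grows down, so a caller's frame must sit at a higher
        // address. Anything else is treated as corruption to avoid loops.
        if (saved_fp->second != 0 && saved_fp->second <= fp) break;
        fp = saved_fp->second;
    }
    return return_addresses;
}
```

Each step is two reads and a comparison. Without frame pointers, every step would instead need a lookup in unwind tables derived from .eh_frame to work out where the caller's frame and return address live.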
       | 
       | They claim to have event profilers for non-native languages (e.g.
       | Python). Does this mean that they use something similar to
       | https://github.com/benfred/py-spy ? Otherwise, it's not obvious
       | to me how they can read Python state.
       | 
       | Lastly, the GitHub repo
       | https://github.com/facebookincubator/strobelight is pretty
       | barebones. I wonder when they'll update it.
        
         | brancz wrote:
         | Already been done:
         | 
         | 1) native unwinding:
         | https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...
         | 
         | 2) python:
         | https://www.polarsignals.com/blog/posts/2023/10/04/profiling...
         | 
         | Both available as part of the Parca open source project.
         | 
         | https://www.parca.dev/
         | 
         | (Disclaimer: I work on Parca and am the founder of Polar
         | Signals)
        
           | maknee wrote:
           | Thanks! Those blogs are incredibly useful. Nice work on the
           | profiler. :)
           | 
           | I have multiple questions if you don't mind answering them:
           | 
           | Is there significant overhead to native unwinding and Python
           | in eBPF? eBPF needs to constantly read & copy from user space
           | to read data structures.
           | 
           | I ask this because unwinding with frame pointers can be done
           | by reading without copying in userland.
           | 
           | Python can be run with different engines (CPython, PyPy, etc.)
           | and versions (3.7, 3.8,...) and compilers can reorganize
           | offsets. Reading from offsets seems hand-wavy to me.
           | Does this work well in practice? When did it fail?
        
             | brancz wrote:
             | Thank you!
             | 
             | Overhead ultimately depends on the frequency. It defaults
             | to 19 Hz per core, at which it's less than 1%, which is
             | tried and tested with all sorts of super heavy Python, JVM,
             | Rust, etc. workloads. Since it's per core it tends to be
             | plenty of stacks to build statistical significance quickly.
             | The profiler is essentially a thread-per-core model, which
             | certainly helps for perf.
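
A back-of-the-envelope sketch of the "plenty of stacks" claim, using the 19 Hz default quoted above. The core count and time window are made-up inputs, and the calculation assumes every core is busy enough to produce a sample on each tick.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on stack samples collected: one per tick, per core.
constexpr std::uint64_t expected_samples(std::uint64_t hz_per_core,
                                         std::uint64_t cores,
                                         std::uint64_t seconds) {
    return hz_per_core * cores * seconds;
}

// Time between two samples on a single core, in milliseconds.
constexpr double sample_interval_ms(double hz_per_core) {
    return 1000.0 / hz_per_core;
}
```

For a hypothetical 64-core host, `expected_samples(19, 64, 10)` gives up to 12160 stacks in ten seconds, while each individual core is only interrupted about once every 52.6 ms.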
             | 
             | The offset approach has evolved a bit: it's mixed with some
             | disassembling today, and with that combination it's rock
             | solid. It is dependent on the engine, and in the case of
             | Python we only support CPython today.
        
           | tdullien wrote:
           | Short note: Also available as the standard Otel profiling
           | agent ;)
        
       | flakiness wrote:
       | And open sourcing:
       | https://github.com/facebookincubator/strobelight
       | 
       | C++ from Meta/FB is much more pleasant to read than ones from ...
       | other older big techs. I appreciate that.
        
       | BigRedEye wrote:
       | At Yandex we have a similar profiler that supports native
       | languages seamlessly, in addition to Python/Java:
       | https://github.com/yandex/perforator. It's exciting to see new
       | profilers from big players!
        
       ___________________________________________________________________
       (page generated 2025-03-07 23:00 UTC)