[HN Gopher] Strobelight: A profiling service built on open sourc...
___________________________________________________________________
Strobelight: A profiling service built on open source technology
Author : birdculture
Score : 126 points
Date : 2025-03-07 14:43 UTC (8 hours ago)
(HTM) web link (engineering.fb.com)
(TXT) w3m dump (engineering.fb.com)
| saganus wrote:
| Ah, this is performance profiling.
|
| Seeing the title and the domain I thought this was user profiling
| and I was wondering why would Meta be publishing this.
| hunter2_ wrote:
| > the domain
|
| Perhaps a contributing factor is how HN shows only the final
| non-eTLD [0] label of the domain. If it showed all labels,
| you'd have seen "engineering.fb.com" which, while not a dead
| giveaway, implies that the problem space is technical.
|
| It would be nice if this aggressive truncation were applied
| only above a certain threshold of length.
|
| [0] https://en.wikipedia.org/wiki/Public_Suffix_List
| doctorhandshake wrote:
| I would assume the name is a reference to the use of strobes in
| examining high speed periodic motion, like that in motors or on
| production lines, eg:
| https://www.checkline.com/inspection_stroboscope
| arnath wrote:
| This is really cool! I've always thought that one thing
| preventing major competitors to AWS/Azure/GCP is the lack of
| easy-to-use tooling for machine level monitoring like this. When
| I was at Microsoft, we built a tool like this that used Windows
| Firewall filters to track all the network traffic between our
| services and it was incredibly useful for debugging.
|
| That said, as with anything from Meta, I approach this with a
| grain of salt and the fact that I can't tell what they stand to
| gain from this makes me suspicious.
| theptip wrote:
| > the fact that I can't tell what they stand to gain from this
| makes me suspicious.
|
| Meta is one of the biggest contributors to FOSS in the world.
| (React, PyTorch, Llama, ...). They stand to gain what every big
| company does, a community contributing to their infra.
|
| You'll note that nobody is open sourcing their ad recommender;
| that is the one you should be skeptical about if you ever see it.
| You don't share your secret sauce.
| schmorptron wrote:
| ByteDance shared the TikTok content recommender, which I'd
| argue is somewhat close to an ad recommender :)
| ipsum2 wrote:
| You mean the paper, not the source code?
| paxys wrote:
| Plus it helps them recruit engineers who are already familiar
| with their tech stack.
| ipsum2 wrote:
| > You'll note that nobody is open sourcing their ad
| recommender
|
| Actually... (2019)
| https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-l...
|
| Source code:
|
| https://github.com/facebookresearch/dlrm
|
| Paper:
|
| https://arxiv.org/abs/1906.00091
|
| An updated 2023 blog post covers only content recommendation,
| but ads recommendation is ~90% the same:
|
| https://engineering.fb.com/2023/08/09/ml-applications/scalin...
|
| It's a little out of date, but the internal one is built with
| the same concepts, just more advanced modeling techniques and
| data.
| mhlakhani wrote:
| As a sibling commenter said, it helps with branding and
| recruiting, which Meta cares about.
| bigtimesink wrote:
| Maybe, but the gold chain, million dollar watch wearing CEO
| talking about masculine energy doesn't help the brand.
| jay-barronville wrote:
| > Maybe, but the gold chain, million dollar watch wearing
| CEO talking about masculine energy doesn't help the brand.
|
| Why not exactly? Between Meta's great contributions to the
| open-source ecosystem and Mark behaving more like a normal
| man nowadays, right now is the only time in a long time
| that I've considered applying to go work at Meta. I've
| heard several of my colleagues and friends say the same
| thing in recent months.
| quesera wrote:
| Imagining that there's _anything_ "normal" about that
| knucklehead is why "masculinity" is such an easy target
| for parody.
| martinsnow wrote:
| What's unattractive about how do you do fellow humans?
| jay-barronville wrote:
| > Imagining that there's anything "normal" about that
| knucklehead is why "masculinity" is such an easy target
| for parody.
|
| You're certainly entitled to your opinions and ad
| hominems. Many folks, including myself, disagree with
| you, so there's that.
| quesera wrote:
| Yep, and you yours of course.
|
| But man is that dude a bad example of how to be a human.
|
| I'll cut him some slack for growing up in public with
| stupid money and no one to regulate his impulses, but uff
| da.
|
| Wake me up when he's old enough for his lagging
| prefrontal cortex to catch up with the rest of him.
| Thaxll wrote:
| I recommend https://grafana.com/oss/pyroscope/ for continuous
| profiling. I use it in Go and it works well.
|
| They have support for many languages:
| https://grafana.com/docs/pyroscope/latest/configure-client/l...
| (also based on eBPF).
| outerspace wrote:
| Good to know there's an OSS alternative.
| hassleblad23 wrote:
| Strobelight is open source as well.
| tdullien wrote:
| The Otel profiling agent (formerly prodfiler, then Elastic
| profiler) is the underlying OSS.
| varunneal wrote:
| Cool anecdote from inside the article:
|
| > A seasoned performance engineer was looking through Strobelight
| data and discovered that by filtering on a particular std::vector
| function call (using the symbolized file and line number) he
| could identify computationally expensive array copies that happen
| unintentionally with the 'auto' keyword in C++.
|
| > The engineer turned a few knobs, adjusted his Scuba query, and
| happened to notice one of these copies in a particularly hot call
| path in one of Meta's largest ads services. He then cracked open
| his code editor to investigate whether this particular vector
| copy was intentional... it wasn't.
|
| > It was a simple mistake that any engineer working in C++ has
| made a hundred times.
|
| > So, the engineer typed an "&" after the auto keyword to
| indicate we want a reference instead of a copy. It was a one-
| character commit, which, after it was shipped to production,
| equated to an estimated 15,000 servers in capacity savings per
| year!
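|
| A minimal sketch of the kind of mistake described above, not
| Meta's actual code: `Payload`, `Service`, and `copy_count` are
| hypothetical names. It shows `auto` deducing a value type (and
| copying the vector) where `auto&` binds a reference instead.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical global counter so that copies are observable.
static std::size_t copy_count = 0;

// Hypothetical heavyweight structure that records each copy.
struct Payload {
    std::vector<int> values;
    explicit Payload(std::size_t n) : values(n, 1) {}
    Payload(const Payload& other) : values(other.values) { ++copy_count; }
};

// Hypothetical service that hands out its payload by const reference.
struct Service {
    Payload payload;
    explicit Service(std::size_t n) : payload(n) {}
    const Payload& get_payload() const { return payload; }
};

long sum_with_copy(const Service& s) {
    auto p = s.get_payload();   // deduces Payload: copies the whole vector
    return std::accumulate(p.values.begin(), p.values.end(), 0L);
}

long sum_with_reference(const Service& s) {
    auto& p = s.get_payload();  // deduces const Payload&: no copy
    return std::accumulate(p.values.begin(), p.values.end(), 0L);
}

// Usage example: both functions return the same sum, but only one copies.
void demo_auto_copy() {
    Service s(1000);
    copy_count = 0;
    assert(sum_with_copy(s) == 1000 && copy_count == 1);
    assert(sum_with_reference(s) == 1000 && copy_count == 1);
}
```

| Both versions compile without a warning by default, which is why
| a profiler is often the first thing to notice the difference.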
| JoshTriplett wrote:
| It's a cool anecdote. It's also a case study in heavyweight
| copies being something that _shouldn't_ happen by default, and
| should require explicit annotation indicating that the engineer
| _expects_ a heavyweight copy of the entire structure.
| mhlakhani wrote:
| I don't know if that would have helped here, if memory serves
| me right:
|
| 1. The copy was needed initially
| 2. This structure wasn't as heavy back then
|
| ... over time the code evolved so it became heavy and the
| copy became unnecessary. That's harder to find without
| profiling to guide things
| ehsankia wrote:
| If it's safety/correctness versus performance, I think the
| default should be the former. Copying, while inefficient, is
| generally more correct and avoids hard-to-debug errors. It's
| the whole discussion about premature optimization. I'd rather
| make a copy than make sure the array is not mutated anywhere
| ever.
| ltbarcly3 wrote:
| Yes, everyone agrees with you. The claim you responded to
| was that you should have to be explicit, because it is very
| easy to _unintentionally_ copy. For example, it is easy to
| copy when there is never more than one live pointer to a
| data structure. It's easy to copy when you allocate a
| resource in a function and return it, which makes the
| original an orphan which is then immediately freed. It's
| extremely easy to make a mistake which prevents move from
| working and you have to go back and carefully check if you
| want to be sure. It should be trivial to just say "move
| this" and if something isn't right it's an error at compile
| time, rather than just falling back to silently being
| wasteful.
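|
| One concrete instance of a move that silently degrades to a
| copy, as a sketch with hypothetical names (`Blob`, `Origin`):
| calling std::move on a const object yields a const rvalue, which
| cannot bind to the move constructor, so overload resolution picks
| the copy constructor and nothing fails to compile.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class Origin { Fresh, Copied, Moved };

// Hypothetical buffer that remembers how it was constructed.
struct Blob {
    std::vector<int> bytes;
    Origin origin = Origin::Fresh;
    explicit Blob(std::size_t n) : bytes(n) {}
    Blob(const Blob& o) : bytes(o.bytes), origin(Origin::Copied) {}
    Blob(Blob&& o) noexcept : bytes(std::move(o.bytes)), origin(Origin::Moved) {}
};

Origin transfer_from_mutable() {
    Blob src(1024);
    Blob dst(std::move(src));  // Blob&& binds to the move constructor
    return dst.origin;
}

Origin transfer_from_const() {
    const Blob src(1024);
    Blob dst(std::move(src));  // const Blob&& only matches the copy constructor
    return dst.origin;
}

// Usage example: the same std::move call has two different outcomes.
void demo_silent_copy() {
    assert(transfer_from_mutable() == Origin::Moved);
    assert(transfer_from_const() == Origin::Copied);
}
```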
| umanwizard wrote:
| This exact problem is basically why Rust exists.
| JoshTriplett wrote:
| I'm not saying it should silently alias any more than it
| should silently copy. It should _give an error_, and
| require the developer to explicitly copy or explicitly
| alias.
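|
| C++ can approximate this per type today. A minimal sketch,
| assuming a hypothetical wrapper named `ExplicitCopyVec`: the copy
| operations are deleted, so an accidental copy is a compile error,
| and the caller has to write either clone() or std::move.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical wrapper: moves are allowed, implicit copies do not compile.
class ExplicitCopyVec {
public:
    explicit ExplicitCopyVec(std::vector<int> items) : items_(std::move(items)) {}

    ExplicitCopyVec(const ExplicitCopyVec&) = delete;
    ExplicitCopyVec& operator=(const ExplicitCopyVec&) = delete;
    ExplicitCopyVec(ExplicitCopyVec&&) = default;
    ExplicitCopyVec& operator=(ExplicitCopyVec&&) = default;

    // The only way to duplicate the data is to ask for it by name.
    ExplicitCopyVec clone() const { return ExplicitCopyVec(items_); }

    const std::vector<int>& items() const { return items_; }

private:
    std::vector<int> items_;
};

static_assert(!std::is_copy_constructible<ExplicitCopyVec>::value,
              "implicit copies must not compile");

// Usage example: every duplication or transfer is spelled out at the call site.
void demo_explicit_copy() {
    ExplicitCopyVec a(std::vector<int>{1, 2, 3});
    // ExplicitCopyVec b = a;          // would not compile: copy constructor is deleted
    ExplicitCopyVec b = a.clone();     // explicit heavyweight copy
    ExplicitCopyVec c = std::move(a);  // explicit transfer of ownership
    assert(b.items() == c.items());
}
```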
| mhlakhani wrote:
| That one diff blew my mind when I saw it. It's a prime example
| of that story about "you paid me a lot of money to know _where_
| to fix that pipe"
| howlallday wrote:
| >The best minds of my generation are thinking about how to make
| people click ads. -- Jeff Hammerbacher
| brancz wrote:
| We're working hard to bring a lot of Strobelight to everyone
| through Parca[0] as OSS and Polar Signals[1] as the commercial
| version. Some parts already exist, with much more to come this year! :)
|
| [0] https://www.parca.dev/
|
| [1] https://www.polarsignals.com/
|
| (Disclaimer: founder of Polar Signals)
| quelup wrote:
| Strobelight is a lifesaver. Especially with high qps services -
| makes it much easier to see where it's worth spending time trying
| to optimize.
| samstave wrote:
| DOPE
|
| Fractal compute expense modeling is hard.
|
| One may do well in applying fluid dynamics (such that we cannot
| maintain in head)
|
| into compute requirements. It will be funny once we realize that
| everything is micro (pico) fluid dynamics in general.
| Starlord2048 wrote:
| Between LLVM's optimization passes, static analysis, and modern
| LLM-powered tools, couldn't we build systems that not only
| identify but automatically fix these performance issues? GitHub
| Copilot already suggests code - why not have "Copilot
| Performance" that refactors inefficient patterns?
|
| I'm curious if anyone is working on "self-healing" systems where
| the optimization feedback loop is closed automatically rather
| than requiring human engineers to parse complex profiling data.
| maknee wrote:
| "All of this is made possible with the inclusion of frame
| pointers in all of Meta's user space binaries, otherwise we
| couldn't walk the stack to get all these addresses (or we'd have
| to do some other complicated/expensive thing which wouldn't be as
| efficient)"
|
| This makes things so, so, so much easier. Otherwise, a lot of
| effort has to be put into creating an unwinder in eBPF code,
| essentially porting .eh_frame cfa/ra/bp calculations.
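|
| To illustrate why frame pointers make unwinding cheap, here is a
| sketch over a simulated stack rather than real process memory.
| It assumes the usual x86-64 layout when frame pointers are kept:
| the saved caller frame pointer sits at [fp] and the return
| address in the next word. The function name and the word-indexed
| vector are illustrative assumptions, not Strobelight code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulated stack of 64-bit words. For a frame whose frame pointer is `fp`:
//   stack[fp]     holds the caller's saved frame pointer (0 marks the outermost frame)
//   stack[fp + 1] holds the return address into the caller
std::vector<std::uint64_t> walk_frame_pointers(const std::vector<std::uint64_t>& stack,
                                               std::uint64_t fp,
                                               std::size_t max_depth) {
    std::vector<std::uint64_t> return_addresses;
    while (fp != 0 && fp < stack.size() && fp + 1 < stack.size() &&
           return_addresses.size() < max_depth) {
        return_addresses.push_back(stack[fp + 1]);
        const std::uint64_t caller_fp = stack[fp];
        if (caller_fp != 0 && caller_fp <= fp) {
            break;  // callers live at higher addresses; anything else is corrupt
        }
        fp = caller_fp;
    }
    return return_addresses;
}

// Usage example: three nested frames yield three return addresses.
void demo_frame_walk() {
    std::vector<std::uint64_t> stack(10, 0);
    stack[2] = 5; stack[3] = 0x1111;  // innermost frame
    stack[5] = 8; stack[6] = 0x2222;  // its caller
    stack[8] = 0; stack[9] = 0x3333;  // outermost frame
    const std::vector<std::uint64_t> expected = {0x1111, 0x2222, 0x3333};
    assert(walk_frame_pointers(stack, 2, 16) == expected);
}
```

| Each step is two loads and a bounds check. Without frame
| pointers, each step instead needs a lookup in unwind tables
| derived from .eh_frame.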
|
| They claim to have event profilers for non-native languages (e.g.
| python). Does this mean that they use something similar to
| https://github.com/benfred/py-spy ? Otherwise, it's not obvious
| to me how they can read python state.
|
| Lastly, the github repo
| https://github.com/facebookincubator/strobelight is pretty
| barebones. Wonder when they'll update it
| brancz wrote:
| Already been done:
|
| 1) native unwinding:
| https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...
|
| 2) python:
| https://www.polarsignals.com/blog/posts/2023/10/04/profiling...
|
| Both available as part of the Parca open source project.
|
| https://www.parca.dev/
|
| (Disclaimer: I work on Parca and am the founder of Polar
| Signals)
| maknee wrote:
| Thanks! Those blogs are incredibly useful. Nice work on the
| profiler. :)
|
| I have multiple questions if you don't mind answering them:
|
| Is there significant overhead to native unwinding and Python
| in eBPF? eBPF needs to constantly read & copy from user space
| to read data structures.
|
| I ask this because unwinding with frame pointers can be done
| by reading without copying in userland.
|
| Python can be run with different engines (CPython, PyPy, etc.)
| and versions (3.7, 3.8, ...), and compilers can reorganize
| offsets. Reading from offsets seems handwavy to me.
| Does this work well in practice? When did it fail?
| brancz wrote:
| Thank you!
|
| Overhead ultimately depends on the frequency. It defaults
| to 19 Hz per core, at which it's less than 1%, which is
| tried and tested with all sorts of super heavy python, JVM,
| rust, etc. workloads. Since it's per core it tends to be
| plenty of stacks to build statistical significance quickly.
| The profiler is essentially a thread-per-core model, which
| certainly helps for perf.
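|
| As a back-of-the-envelope sketch of what those numbers imply
| (the 64-core host and 60-second window are made-up inputs, and
| the budget assumes the overhead is dominated by per-sample cost):

```cpp
#include <cassert>
#include <cstdint>

// Total stack samples collected across all cores.
std::uint64_t total_samples(std::uint64_t hz_per_core, std::uint64_t cores,
                            std::uint64_t seconds) {
    return hz_per_core * cores * seconds;
}

// CPU time each sample may cost, in microseconds, to stay within
// `overhead_fraction` of one core (e.g. 0.01 for 1%).
double per_sample_budget_us(double hz_per_core, double overhead_fraction) {
    return overhead_fraction / hz_per_core * 1e6;
}

// Usage example with a hypothetical 64-core host profiled for 60 seconds.
void demo_sampling_budget() {
    assert(total_samples(19, 64, 60) == 72960);
    const double budget = per_sample_budget_us(19.0, 0.01);
    assert(budget > 526.0 && budget < 527.0);  // roughly 526 microseconds per sample
}
```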
|
| The offset approach has evolved a bit. It's mixed with some
| disassembling today, and with that combination it's rock solid.
| It is dependent on the engine, and in the case of Python we
| only support CPython today.
| tdullien wrote:
| Short note: Also available as the standard Otel profiling
| agent ;)
| flakiness wrote:
| And open sourcing:
| https://github.com/facebookincubator/strobelight
|
| C++ from Meta/FB is much more pleasant to read than ones from ...
| other older big techs. I appreciate that.
| BigRedEye wrote:
| At Yandex we have a similar profiler that supports native
| languages seamlessly, in addition to Python/Java:
| https://github.com/yandex/perforator. It's exciting to see new
| profilers from big players!
___________________________________________________________________
(page generated 2025-03-07 23:00 UTC)