[HN Gopher] Tracy: A real time, nanosecond resolution frame prof...
       ___________________________________________________________________
        
       Tracy: A real time, nanosecond resolution frame profiler
        
       Author : Flex247A
       Score  : 176 points
       Date   : 2024-09-24 02:57 UTC (20 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Flex247A wrote:
       | I am a beginner in graphics programming, and I came across this
       | amazing frame profiler.
       | 
       | Web demo of Tracy: https://tracy.nereid.pl/
       | 
       | This blows my mind. It's faster and more responsive than I ever
       | expected a WebAssembly application to be!
        
       | vblanco wrote:
       | A truly incredible profiler for the great price of free. Nothing
       | else comes close to this level of features and performance, even
       | among paid software. Tracy could cost thousands of dollars a year
       | and would still be the best profiler.
       | 
       | Tracy requires you to add macros to your codebase to log
       | functions/scopes, so it's not an automatic sampling profiler like
       | Superluminal, Very Sleepy, the VS profiler, or others. Each of
       | those macros has around 50 nanoseconds of overhead, so you can
       | liberally use them in the millions. In the UI, it has a stats
       | window that records the average, deviation, and min/max of those
       | profiler zones, which can be used to profile functions at the
       | level of single nanoseconds.
       | 
       | It's the main thing I use for all my profiling and optimization
       | work. I combine it with Superluminal (a sampling profiler) to get
       | a high-level overview of the program, then I put Tracy zones in
       | the important places to get the detailed information.
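       | 
       | A minimal sketch of the zone idea described above: a scope marker
       | that timestamps entry and exit and folds each duration into
       | per-zone statistics (count, mean, deviation, min/max). This is
       | not Tracy's implementation or API. Tracy zones are C++ macros;
       | Python is used here only for brevity, and the names `ZoneStats`,
       | `zone`, and `stats` are hypothetical. A Python context manager
       | also costs far more than the ~50 ns quoted above.

```python
import math
import time
from contextlib import contextmanager


class ZoneStats:
    """Running statistics for one named zone (durations in nanoseconds)."""

    def __init__(self):
        self.count = 0
        self.total_ns = 0
        self.total_sq = 0
        self.min_ns = None
        self.max_ns = None

    def add(self, duration_ns):
        """Fold one measured duration into the running totals."""
        self.count += 1
        self.total_ns += duration_ns
        self.total_sq += duration_ns * duration_ns
        if self.min_ns is None or duration_ns < self.min_ns:
            self.min_ns = duration_ns
        if self.max_ns is None or duration_ns > self.max_ns:
            self.max_ns = duration_ns

    @property
    def mean_ns(self):
        return self.total_ns / self.count

    @property
    def stddev_ns(self):
        # Population standard deviation: sqrt(E[x^2] - E[x]^2).
        variance = self.total_sq / self.count - self.mean_ns ** 2
        return math.sqrt(max(variance, 0.0))


stats = {}  # zone name -> ZoneStats (hypothetical global registry)


@contextmanager
def zone(name):
    """Time the enclosed block and record the duration under `name`."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        elapsed = time.perf_counter_ns() - start
        stats.setdefault(name, ZoneStats()).add(elapsed)
```

       | Wrapping a block in `with zone("update"):` inside a loop then
       | accumulates one `ZoneStats` entry per zone name, which is
       | roughly the kind of summary a stats window can display.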
        
         | Flex247A wrote:
         | Hello! Going through your tutorial and it's been a great ride!
         | 
         | Thanks for the good work.
        
         | eagle2com wrote:
         | Doesn't Tracy have the capability to do sampling as well? I
         | remember using it at some point, even if it was finicky to
         | set up because of Windows.
        
           | vblanco wrote:
           | It does, but I don't use it much because it is too slow and
           | heavy on memory on my Ryzen 5950X (32 threads) on Windows. A
           | couple of seconds of tracing takes tens of gigabytes of RAM.
        
           | forrestthewoods wrote:
           | Yeah I had issues with the Tracy sampler. It didn't "just
           | work" the way Superluminal did.
           | 
           | My only issue with Superluminal is I can't get proper
           | callstacks for interpreted languages like Python. It treats
           | all the CPP callstacks as the same. Not sure if Tracy can
           | handle that nicely or not...
        
         | forrestthewoods wrote:
         | Tracy and Superluminal are the way. Both are so good.
        
       | cwbaker400 wrote:
       | Tracy is brilliant. @wolfpld I hope you're enjoying reading this
       | and all of the other great comments in this thread. Great work
       | and thank you very very much!
        
       | drpossum wrote:
       | Can someone explain how this achieves nanosecond resolution?
       | That's an extremely difficult target to reach on computing
       | hardware due to inherent clock resolutions and interrupt timing.
        
         | vardump wrote:
         | On x86/AMD64 it uses the CPU's TSC (time stamp counter).
         | 
         | https://github.com/wolfpld/tracy/blob/master/public/client/T...
        
         | simonask wrote:
         | There are several sources of timing information, and I think in
         | this context "nanosecond precision" just means that Tracy is
         | able to accurately represent and handle input in nanoseconds.
         | 
         | The resolution of the actual measurements depends on the kind
         | of measurement:
         | 
         | 1. If the measurement is based on high resolution timers on the
         | CPU, the resolution depends on the hardware and the OS. On
         | Windows, `QueryPerformanceFrequency()` returns the counter
         | frequency, whose reciprocal is the resolution, and I believe it
         | is often on the order of tens or hundreds of nanoseconds.
         | 
         | 2. If the measurement is based on GPU-side performance
         | counters, it depends on the driver and the hardware. Graphics
         | APIs allow you to query the "time-per-tick" value to translate
         | from performance counters to nanoseconds. Performance counters
         | can be down to "number of instructions executed", and since a
         | single instruction can be on the order of 1-2 nanoseconds in
         | some cases, translating a performance counter value to a time
         | period requires nanosecond precision.
         | 
         | 3. Modern GPUs also include their own high-precision timers for
         | profiling things that are not necessarily easy to capture with
         | performance counters (like barriers, contention, and
         | complicated cache interactions).
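         | 
         | The arithmetic behind point 1 can be sketched as follows. The
         | function names are hypothetical, and the 10 MHz frequency in the
         | usage note is only an assumed example value, not something
         | measured in this thread.

```python
NS_PER_SECOND = 1_000_000_000


def tick_resolution_ns(frequency_hz):
    """Length of one counter tick in nanoseconds (1 / frequency)."""
    return NS_PER_SECOND / frequency_hz


def ticks_to_ns(ticks, frequency_hz):
    """Convert a tick delta to whole nanoseconds using integer math.

    Multiplying before dividing avoids losing precision to rounding.
    """
    return ticks * NS_PER_SECOND // frequency_hz
```

         | With an assumed counter frequency of 10,000,000 Hz,
         | `tick_resolution_ns` gives 100 ns per tick, so timestamps can
         | be stored in nanoseconds even though the clock itself cannot
         | tell apart events closer together than one tick.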
        
           | drpossum wrote:
           | Yes, that's my understanding and why I asked. I disagree
           | about "in this context", though, which is a pitch. If I was
           | going to buy hardware that claimed ns resolution for
           | something I was building I would expect 1ns resolution, not
           | "something around a few ns" and not qualified "only on
           | particular hardware". If such a product were presenting
           | itself in a straightforward way to be compared to similar
           | products and respecting the potential user it would say
           | "resolutions down to a few ns" or something more specific but
           | accurate.
           | 
           | There was even a discussion not long ago on how to market to
           | technical folks and what not to do (this is one of the things
           | not to do).
           | 
           | https://www.bly.com/Pages/documents/STIKFS.html
           | 
           | https://news.ycombinator.com/item?id=41368583
        
         | Galanwe wrote:
         | It reaches nanoseconds only in the sense that its sampling
         | profiler can report nanosecond resolution. I've tried the event
         | profiler for microsecond-sensitive projects though, and it blows
         | up the timings and latency even at low event frequency.
        
           | vardump wrote:
           | I think it's mostly due to Tracy's poor timing calibration
           | code. The TSC is good in terms of both accuracy and latency.
        
             | Galanwe wrote:
             | The problem is not the timestamping, it's the queue used to
             | push profiling events, which is not fast enough.
        
               | vardump wrote:
               | That queue is about as fast as it gets, <10 ns. Timestamp
               | is taken before queueing.
               | 
               | Again, due to bad calibration code the measured
               | timestamps have quite a bit of jitter.
               | 
               | Edit: TSC might not be synchronized in multi-socket
               | systems (multiple physical CPU sockets). That can
               | generate a large error.
        
       | boywitharupee wrote:
       | Can someone explain how profiling tools like this are written for
       | GPU applications? Wouldn't you need access to an internal runtime
       | API?
       | 
       | For example, Apple wraps Metal buffers as "Debug" buffers to
       | record allocations/deallocations.
        
         | ossobuco wrote:
         | I don't know about Tracy, but I've seen a couple of WebGPU JS
         | debugging tools that simply intercept calls to the various
         | WebGPU functions like writeBuffer, draw, etc., by modifying the
         | prototypes of Device, Queue and so on [0].
         | 
         | - 0: https://github.com/brendan-duncan/webgpu_inspector/blob/main...
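         | 
         | The interception trick can be sketched in a few lines. The tools
         | mentioned above patch JavaScript prototypes; this is only a
         | Python analog of the same idea (replacing a method on the class
         | with a logging wrapper), and `FakeQueue`, `write_buffer`, and
         | `intercept` are hypothetical names, not part of any real
         | graphics API.

```python
import functools


class FakeQueue:
    """Hypothetical stand-in for a graphics queue object."""

    def write_buffer(self, buffer_id, data):
        # Pretend to upload `data` and report how many bytes were written.
        return len(data)


def intercept(cls, method_name, log):
    """Replace cls.method_name with a wrapper that logs every call.

    Returns the original function so the caller can restore it later.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        log.append((method_name, args, kwargs))
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return original
```

         | Because the wrapper is installed on the class, every existing
         | and future instance is observed without changing the calling
         | code, which is why a debugging tool can attach this way.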
        
         | MindSpunk wrote:
         | Some graphics APIs support commands that tell the GPU to record
         | a timestamp when it gets to processing the command. This is
         | oversimplified, but is essentially what you ask the GPU to do.
         | There are lots of gotchas in hardware that make this more
         | difficult in practice, as a GPU won't always execute and
         | complete work exactly as you specify at the API level when it's
         | safe to do otherwise. And the GPU's timestamp domain isn't
         | always the same as the CPU's.
         | 
         | But in principle it's not that different from how you just grab
         | timestamps on the CPU. In Vulkan the API used is called
         | "timestamp queries".
         | 
         | It's quite tricky on tiled renderers like Arm/Qualcomm/Apple as
         | they can't provide meaningful timestamps at much tighter
         | granularity than a whole renderpass. I believe Metal only
         | allows you to query timestamps at the encoder level, which
         | roughly maps to a render pass in Vulkan (at the hardware level
         | anyway)
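         | 
         | A rough sketch of turning two GPU timestamp values into a
         | duration. Vulkan reports `timestampPeriod` (nanoseconds per
         | tick) in the device limits and `timestampValidBits` per queue
         | family; the function below is a hypothetical helper that only
         | does the arithmetic, and the values in the usage note are made
         | up for illustration.

```python
def gpu_elapsed_ns(begin_ticks, end_ticks, period_ns, valid_bits=64):
    """Elapsed GPU time in nanoseconds between two timestamp values.

    Only the low `valid_bits` bits of each timestamp are meaningful, so
    the difference is masked to survive a single counter wraparound.
    """
    mask = (1 << valid_bits) - 1
    delta_ticks = (end_ticks - begin_ticks) & mask
    return delta_ticks * period_ns
```

         | For example, timestamps 1000 and 1500 with a 2.0 ns period give
         | 1000.0 ns. The result is in the GPU's own time domain, so lining
         | it up with CPU timestamps needs a separate calibration step
         | that this sketch does not cover.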
        
       | gcr wrote:
       | My favorite FOSS video game, Dr. Robotnik's Ring Racers
       | (http://kartkrew.org), has Tracy support! It's not compiled into
       | the default build. I've learned a lot reading the code.
        
       | mastax wrote:
       | I just started using this yesterday, it looks really good.
       | Haven't really dug into it.
       | 
       | Is the latest Windows build broken for anyone else? It doesn't
       | start. In WinDbg it looks like it dereferences a null pointer. I
       | built it myself and it works fine.
        
         | mastax wrote:
         | Of course I come into work the next day and now I can't run my
         | custom built one either...
        
           | mastax wrote:
           | https://github.com/wolfpld/tracy/issues/887
           | 
           | MSVC changed the mutex constructor to constexpr, breaking
           | binary backward compatibility. They say WONTFIX, you must use
           | the latest MSVCRT with the latest MSVC. But I have the latest
           | MSVCRT installed? Whatever - a workaround was pushed to
           | master yesterday.
        
       | mastax wrote:
       | This article is a good quick introduction to Tracy:
       | https://luxeengine.com/integrating-tracy-profiler-in-cpp/
        
       | Green-Man wrote:
       | Does anybody have an opinion or comparison with respect to
       | easy_profiler?
       | 
       | https://github.com/yse/easy_profiler
       | 
       | Especially interesting if based on real practical experience.
        
       | throwawaymaths wrote:
       | Can anyone with experience suggest pointers on arguing Tracy vs
       | Perfetto? My team uses Perfetto and I strongly suspect we are
       | running into artefacts due to that.
        
       ___________________________________________________________________
       (page generated 2024-09-24 23:01 UTC)