Post B55hvOW0gsCP99MPdg by azonenberg@ioc.exchange
Post #B54o2dYLaHkqNZKXz6 by azonenberg@ioc.exchange
2026-04-08T05:18:43Z
0 likes, 0 repeats
The 100baseT1 decode benchmark is dropping below real time when I have a lot of traffic on the link. Do not want.

Sooo, now to figure out how to either double the speed of the decode, or steal some more time from something else along the critical path, like the edge detector or CDR filters that I've already spent a ton of time optimizing.

It looks like the workload scheduler might be a weak point here: the greedy algorithm runs the eye pattern as soon as data is available, which makes the I/Q demux (on the critical path for the decode) compete with the eye pattern for GPU time. Hypothetically, if the eye pattern ran later in the filter graph, it could execute during the CPU-bound part of the baseT1 decode; the demux would start earlier, and thus the overall filter graph would finish sooner. But fixing this is... decidedly nontrivial.
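For reference, one classic alternative to a greedy "run as soon as ready" policy is list scheduling by critical-path rank: score each filter by the longest path from it to a sink, and when several filters are ready, launch the highest-ranked one first. A leaf like the eye pattern would rank low and naturally yield to the demux chain. A minimal sketch of the ranking step, with hypothetical structures rather than the actual scopehal scheduler:

    // Sketch only: rank[v] = cost of the longest path from filter v to any
    // sink, computed by memoized DFS over the filter graph's successor lists.
    // succ[v] = downstream consumers of filter v; cost[v] = estimated runtime.
    #include <algorithm>
    #include <vector>

    int Rank(int v, const std::vector<std::vector<int>>& succ,
             const std::vector<int>& cost, std::vector<int>& memo)
    {
        if(memo[v] >= 0)            // memo entries initialized to -1
            return memo[v];
        int best = 0;
        for(int s : succ[v])
            best = std::max(best, Rank(s, succ, cost, memo));
        return memo[v] = cost[v] + best;
    }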
Post #B54o2eJqjfkgktYU1g by azonenberg@ioc.exchange
2026-04-08T05:26:22Z
0 likes, 0 repeats
As a test, to verify the hypothesis that the eye pattern is stealing GPU time from the demux, I deleted the eye pattern from the filter graph.

Refresh rate went from about 9.8 WFM/s to 11.5. Not *quite* real time (that would be 12.5), but pretty close and certainly a big improvement. But I don't want to delete it, and I have no easy way to delay or de-prioritize it: while Vulkan queues have priorities associated with them, the assignment of worker threads (and thus queues) to filter blocks is essentially random (whichever thread catches the condition variable first), so it's not usable as a precedence mechanism.

Messing with the scheduler also seems like a bad idea here: any changes that tune for this particular workload, unless very well tested, might cause problems for other filter graphs. So I don't want any hacky special-casing.
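For the record, Vulkan queue priorities are fixed at device creation time, are normalized floats per queue within a family, and are only a hint to the driver. A minimal sketch of what setting them looks like (assuming a single compute family with two queues; hypothetical function, not the actual scopehal device setup):

    // Sketch: create one high- and one low-priority queue in the same family.
    // This only helps if filters are pinned to specific queues, which the
    // current random worker-to-filter assignment defeats.
    #include <vulkan/vulkan.h>

    VkDevice CreateDeviceWithPrioritizedQueues(VkPhysicalDevice phys, uint32_t computeFamily)
    {
        float priorities[2] = { 1.0f, 0.0f };   // critical path, then cosmetic

        VkDeviceQueueCreateInfo queueInfo = {};
        queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
        queueInfo.queueFamilyIndex = computeFamily;
        queueInfo.queueCount = 2;
        queueInfo.pQueuePriorities = priorities;

        VkDeviceCreateInfo deviceInfo = {};
        deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
        deviceInfo.queueCreateInfoCount = 1;
        deviceInfo.pQueueCreateInfos = &queueInfo;

        VkDevice device = VK_NULL_HANDLE;
        vkCreateDevice(phys, &deviceInfo, nullptr, &device);
        return device;
    }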
Post #B54o2fI77bxRlnkbxY by ignaloidas@not.acu.lt
2026-04-08T07:13:02.046Z
0 likes, 0 repeats
@azonenberg@ioc.exchange wrt scheduling - considering that it's running the same workload very often, could you just have it try a couple of different options and then choose the best? It would have a "warmup period", but it should be able to find a close-to-optimal schedule for each graph.
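The idea in code, very roughly: during warmup, time one run of the graph per candidate ordering and keep the fastest. Schedule and RunGraph are hypothetical stand-ins, not real scopehal API:

    // Sketch of the warmup autotune: execute the graph once per candidate
    // schedule, time it with a monotonic clock, keep the fastest ordering.
    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <vector>

    struct Schedule { /* some dependency-respecting ordering of filter blocks */ };

    Schedule PickBestSchedule(const std::vector<Schedule>& candidates,
                              const std::function<void(const Schedule&)>& RunGraph)
    {
        size_t best = 0;
        auto bestTime = std::chrono::nanoseconds::max();
        for(size_t i = 0; i < candidates.size(); i++)
        {
            auto start = std::chrono::steady_clock::now();
            RunGraph(candidates[i]);    // one timed warmup run per candidate
            auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start);
            if(elapsed < bestTime)
            {
                bestTime = elapsed;
                best = i;
            }
        }
        return candidates[best];        // lock this ordering in afterwards
    }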
Post #B54o2lSyAACqvCsMiG by azonenberg@ioc.exchange
2026-04-08T05:37:27Z
0 likes, 0 repeats
Also, I have more margin than I thought... I left validation layers on during the previous test, oops. It's still sub-real-time, but by a lot less: averaging around 11.5 WFM/s with the eye pattern active, and pretty comfortably real time with it removed.

That explains some of the weirdness I was seeing in the NSight trace, with long gaps on submits and such.

So the actual overall graph run time is on the order of 76 ms, much closer to where we want to be, but I still do want to speed it up just a bit.
Post #B55hvOW0gsCP99MPdg by azonenberg@ioc.exchange
2026-04-08T13:54:48Z
0 likes, 0 repeats
@ignaloidas That is a possibility, although "the same" workload may not be strictly true due to noise bursts, different amounts of packet activity, etc. Also consider GPU and CPU power management if you're doing single-shot trigger vs continuous (single-shot trigger often makes the filter graph run slower than continuous, because the GPU isn't saturated long enough to spin up to a higher p-state).

The other question is how to even come up with candidate schedules. The goal is to have the batch finish as quickly as possible, and normally you would not expect starting a filter late to be a net benefit for that. Having a CPU-heavy filter at the end of a long linear path, with GPU-heavy stuff before it, is kind of a special case, and I'm wondering if I might be better off just making it more GPU-heavy instead lol
Post #B55hvPcQaUvWZLN3HU by ignaloidas@not.acu.lt
2026-04-08T17:39:14.343Z
0 likes, 0 repeats
@azonenberg@ioc.exchange FWIW my first idea was to just go through all possible permutations - there is a partial ordering implied by the work graph and its dependencies, so you try every permutation that satisfies that partial ordering, launching tasks in that order and waiting for all of a task's dependencies to finish before moving on to the next one. A sketch of the enumeration is below.

Though, as you correctly point out, this might not work that well when the workload isn't "stable".
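Concretely, "every permutation that satisfies the partial ordering" means enumerating all topological orderings of the dependency DAG. A minimal sketch, with a hypothetical adjacency-list representation (and note the count can grow factorially, so this is only viable for small graphs):

    // Sketch: print every launch order that respects the filter graph's
    // dependency edges, by recursively picking any node whose
    // prerequisites are already done.
    #include <cstdio>
    #include <vector>

    // deps[i] = list of nodes that must run before node i
    void AllOrders(const std::vector<std::vector<int>>& deps,
                   std::vector<bool>& done, std::vector<int>& order)
    {
        const int n = (int)deps.size();
        if((int)order.size() == n)
        {
            for(int v : order)
                printf("%d ", v);       // one complete valid schedule
            printf("\n");
            return;
        }
        for(int v = 0; v < n; v++)
        {
            if(done[v])
                continue;
            bool ready = true;
            for(int d : deps[v])
                if(!done[d]) { ready = false; break; }
            if(!ready)
                continue;
            done[v] = true;             // tentatively schedule v next
            order.push_back(v);
            AllOrders(deps, done, order);
            order.pop_back();           // backtrack and try other choices
            done[v] = false;
        }
    }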