(DIR) Post #AS8vRImdiB20JJiWno by simon@fedi.simonwillison.net
2023-01-29T18:01:13Z
0 likes, 0 repeats
I've always been cautious about running automated performance and load tests in a CI environment such as GitHub Actions because of the chance that my tests will end up running on a shared instance that's busy doing something else, so the results will be inconsistent. Am I over-thinking this? Context is that I'm considering adding performance tests to Datasette that get run in CI and used to plot performance trends over time as the code changes.
(DIR) Post #AS8wU1RSk9XvLjfmRE by miek@mastodon.cloud
2023-01-29T18:07:06Z
0 likes, 0 repeats
@simon load tests might work, but performance tests on these shared environments won't. I contemplated setting up a runner on hardware I own and then just doing the tests there. Export Prometheus metrics to get fancy graphs. But I never implemented it. Maybe something that does this exists by now.
(DIR) Post #AS8wfbP9NKJs4U74W8 by linusgroh@chaos.social
2023-01-29T18:07:27Z
0 likes, 0 repeats
@simon You can add a self-hosted CI runner that uses dedicated hardware: https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners We did this for SerenityOS's JS performance tests due to similar concerns about getting reliable measurements over time :) https://github.com/SerenityOS/serenity/blob/d11baf48ae3a9998a0527836e73e9b1cc597715f/.github/workflows/libjs-test262.yml#L10
(DIR) Post #AS8wwFqSmnMc82pSMq by drewbrew@mastodon.cloud
2023-01-29T18:09:29Z
0 likes, 0 repeats
@simon I would think that as long as you don't immediately mark it as a regression after one slowdown, it would do more good than harm
(DIR) Post #AS8xKxa7pIaAj0y2b2 by brian@graphics.social
2023-01-29T18:10:23Z
0 likes, 0 repeats
@simon I don’t think you are overthinking it. In the past, I have run that class of tests on dedicated agents to try to mitigate that problem. The alternative is to run the tests for the same changelist of the source on multiple agents and then capture statistics like the average and standard deviation. But of course, compute-wise, that can get expensive.
(DIR) Post #AS8xW1IoKiLFRy8tqi by zzzeek@fosstodon.org
2023-01-29T18:10:37Z
0 likes, 0 repeats
@simon our perf tests run Python's cProfile and just look at call counts, not time. Measuring or assuming some value of "speed" in terms of actual time on CI is completely impossible.
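A minimal sketch of that call-count approach (the profiled function is a hypothetical stand-in and the threshold is illustrative): profile the code path with cProfile and assert on the total number of function calls, which is deterministic for a given code path, rather than on elapsed time.

    import cProfile
    import pstats

    def work():
        # Hypothetical stand-in for the code path under test.
        return sum(i * i for i in range(10_000))

    profiler = cProfile.Profile()
    profiler.enable()
    work()
    profiler.disable()

    stats = pstats.Stats(profiler)
    # total_calls is stable across runs, unlike wall-clock time on a shared runner.
    print("total function calls:", stats.total_calls)
    assert stats.total_calls < 11_000, "call count regression?"  # illustrative threshold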
(DIR) Post #AS8xhmVRAgbJ2EemEy by xek@hachyderm.io
2023-01-29T18:11:22Z
0 likes, 0 repeats
@simon I'd worry more about running on different platforms over the longer span of time. I'd definitely include architecture info in the output, and would focus on the ratio between tests within a run, versus a single test over time. Even then, I'd add a few control measurements to suss out per-arch wonkiness. A simple busy loop for CPU performance and maybe a few different memory sweeps to get a rough idea of cache size.
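Roughly what such control measurements could look like (a sketch, not code from the post): a fixed busy loop to get a per-runner CPU baseline, plus strided passes over buffers of different sizes to get a feel for cache behaviour on whatever hardware the job landed on.

    import time

    def cpu_control(iterations=5_000_000):
        # Fixed busy loop; the elapsed time gives a rough per-runner CPU baseline.
        start = time.perf_counter()
        total = 0
        for i in range(iterations):
            total += i
        return time.perf_counter() - start

    def memory_control(size_bytes):
        # Sweep a buffer at cache-line stride; timings across sizes hint at cache boundaries.
        data = bytearray(size_bytes)
        start = time.perf_counter()
        for i in range(0, size_bytes, 64):
            data[i] = 1
        return time.perf_counter() - start

    print("cpu control:", cpu_control())
    for size in (32 * 1024, 256 * 1024, 8 * 1024 * 1024):
        print("memory control", size, "bytes:", memory_control(size))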
(DIR) Post #AS8xtIjdgWCuB2nCVM by jcristharif@hachyderm.io
2023-01-29T18:11:38Z
0 likes, 1 repeats
@simon I've been researching this a bunch this last week, as I'm trying to do the same for msgspec. It seems like the common solution is to do relative benchmarking, where you compare the main branch with the proposed change, running the benchmarks on both within the same CI run (and then doing some stats to determine if a regression is likely). A few resources I've found useful:
- https://labs.quansight.org/blog/2021/08/github-actions-benchmarks
- https://asv.readthedocs.io/en/stable/commands.html#asv-continuous
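A rough sketch of that relative-benchmarking idea (the run_benchmark.py script and its JSON output are hypothetical, and it assumes origin/main has been fetched): within a single CI job, run the same benchmark against the base branch and against the proposed change, then compare the two numbers.

    import json
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout.strip()

    def bench_at(ref):
        # Check out a git ref and run a hypothetical benchmark script that prints {"seconds": ...}.
        run(["git", "checkout", "--quiet", ref])
        return json.loads(run(["python", "run_benchmark.py"]))["seconds"]

    pr_ref = run(["git", "rev-parse", "HEAD"])  # the proposed change, as checked out by CI
    base = bench_at("origin/main")              # baseline, measured on the same runner
    candidate = bench_at(pr_ref)                # back to the PR commit
    print(f"main: {base:.3f}s  PR: {candidate:.3f}s  ratio: {candidate / base:.2f}")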
(DIR) Post #AS8y5ye9gUqddeGv56 by simon@fedi.simonwillison.net
2023-01-29T18:12:43Z
0 likes, 0 repeats
@jcristharif Oh I love that! Comparing the main branch and the new code in the same run would seem to mostly solve that problem. There's a chance someone else on the shared instance fires up an expensive process that coincides with switching from one branch to the other, but I think that's a risk that's worth accepting.
(DIR) Post #AS8ySMIsPSYJNazWcK by webology@mastodon.social
2023-01-29T18:14:30Z
0 likes, 0 repeats
@simon I don't think you are overthinking it. It does feel like you might really be tracking the GitHub Actions runners' performance and load over time, but that might still be useful vs. not tracking it at all.
(DIR) Post #AS8ydteCmHSBynT9d2 by ezarowny@file-explorers.club
2023-01-29T18:15:00Z
0 likes, 0 repeats
@simon you could run multiple tests to account for variance.
(DIR) Post #AS8ynRGb9SKIr0yf1U by simon@fedi.simonwillison.net
2023-01-29T18:15:46Z
0 likes, 0 repeats
OK, I think I'm convinced that the key trick to this is to run comparison tests in the same CI run - for example, for each PR you run the performance tests against both the main branch and the PR branch and report any deltas between the two
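One way the "report any deltas" step could look (a sketch; the example timing lists are made up and would come from whatever benchmark harness is used): repeat each measurement several times and only flag a regression when the medians differ by more than a chosen noise threshold.

    import statistics

    def compare(main_times, pr_times, threshold=0.10):
        # threshold is the tolerated relative slowdown before calling it a regression.
        main_med = statistics.median(main_times)
        pr_med = statistics.median(pr_times)
        delta = (pr_med - main_med) / main_med
        verdict = "possible regression" if delta > threshold else "within noise"
        return f"main {main_med:.3f}s, PR {pr_med:.3f}s, delta {delta:+.1%} ({verdict})"

    # Illustrative numbers only.
    print(compare([0.52, 0.50, 0.51, 0.53, 0.50], [0.61, 0.59, 0.60, 0.62, 0.60]))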
(DIR) Post #AS8yyfC3iEFLEUCrZo by migurski@mastodon.social
2023-01-29T18:15:59Z
0 likes, 0 repeats
@simon No, this seems legit. We have an office Mac Mini configured as a GH Actions runner so we can keep accurate-ish client performance numbers over time.
(DIR) Post #AS8zFpz6qPgzcjL3ke by tie@infosec.exchange
2023-01-29T18:18:32Z
0 likes, 0 repeats
@simon, if you are serious about tracking small performance deltas (e.g. 5-10%), you're not overthinking it. Wasting time chasing the reasons behind a fake performance regression is bad. Giving up on finding the root cause of a real regression because you chalked it up to environment noise is even worse. Get a dedicated CPU and network for your load generator and another for your app in single-threaded tests. Get several if you are hoping to test saturation of something approximating a real server (but the tuning there can be devilish). Keep in mind that there are dragons along this way, even if you do everything right. A new kernel, new Python, new OpenSSL, CPU microcode or a new type of CPU - all will affect the results. Being able to reproduce is a serious undertaking of its own. OTOH, if you are only interested in catching *gross* performance differences (are we twice as fast? are we twice as slow?), then yeah, you're overthinking it.
(DIR) Post #AS8zSVwSfqbNE892kS by hywan@fosstodon.org
2023-01-29T18:25:16Z
0 likes, 0 repeats
@simon performance results on GitHub Actions are pretty unstable and inconsistent. I would advise not using this platform for such computations :-).
(DIR) Post #AS8ziB5DzfYxAUiFiS by waldoj@mastodon.social
2023-01-29T18:26:11Z
0 likes, 0 repeats
@simon No, you're absolutely right. It's important to pick carefully which tests get run on shared instances and which require something "dedicated" (your definition of which may vary, depending on whether you're concerned about e.g. a noisy neighbor on EC2).
(DIR) Post #AS8zvjsBxPqSpc5Df6 by mathias@social.fussenegger.pro
2023-01-29T18:27:46Z
0 likes, 0 repeats
@simon Shared instances are definitely a problem for performance regression testing. On VMs you control you could do some tweaks to reduce variability (see https://www.mongodb.com/blog/post/reducing-variability-performance-tests-ec2-setup-key-results). But the best option is tuned bare metal
(DIR) Post #AS90Hem9adVdOQi5QW by simon@fedi.simonwillison.net
2023-01-29T18:29:44Z
0 likes, 0 repeats
Consensus from the replies is that I'm definitely not over-thinking this, and running any kind of performance tests on shared instances does indeed need to be done with caution
(DIR) Post #AS90kv3E9p0uGgeKsy by masukomi@connectified.com
2023-01-29T18:43:14Z
0 likes, 0 repeats
@simon My thinking re performance tests is not "did this get slower" but "have I exceeded some lower bound that's 'definitely too slow'", because frankly I don't even trust "did it get slower" on _my_ machine. Too much unpredictability around load and stuff. Some background task may kick in between tests and suddenly suck up CPU or RAM without my awareness.
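That "definitely too slow" bound could look something like this as a test (a sketch; the operation and the 2-second budget are made up): assert against a deliberately generous absolute budget rather than against a previous run.

    import time

    def slow_operation():
        # Hypothetical stand-in for the code path under test.
        return sorted(range(100_000), key=lambda n: -n)

    def test_within_time_budget():
        start = time.perf_counter()
        slow_operation()
        elapsed = time.perf_counter() - start
        # The budget is generous on purpose; it should only trip on "definitely too slow".
        assert elapsed < 2.0, f"took {elapsed:.2f}s, over the 2s budget"

    test_within_time_budget()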
(DIR) Post #AS91GJH09MJKMw3oWm by jedbrown@hachyderm.io
2023-01-29T18:57:36Z
0 likes, 0 repeats
@simon This may reduce variability, but is still prone to interference if your workloads are memory bandwidth or IO intensive, since those are shared resources on a shared node and other jobs on the node can have a lot of temporal variability. This is a tough problem because dedicated nodes are expensive and CI infrastructure is rarely set up to do all the correctness testing and then spend 1 minute on a dedicated node running only performance regression tests.
(DIR) Post #AS9239tfncQfTjRSCG by mikemol@pony.social
2023-01-29T19:08:54Z
0 likes, 0 repeats
@simon @jcristharif Perhaps the problem is the nature of performance? If possible, perhaps you can instrument something like "number of operations", rather than "time to completion". Then you can have directly-comparable results regardless of the execution environment.
(DIR) Post #AS92IHvCFKzUi5MunY by coderanger@cloudisland.nz
2023-01-29T19:12:33Z
0 likes, 0 repeats
@simon if you ask nicely there is a chance you can use speed.python.org?
(DIR) Post #AS92ZGOfAVdou7FImW by grantgochnauer@mas.to
2023-01-29T19:17:16Z
0 likes, 0 repeats
@simon one option is to set up a remote GH Actions build agent in your own hosting environment. We haven’t done it yet but I believe it’s possible.
(DIR) Post #AS92tjrbJzo45LXpZI by frank@frankwiles.social
2023-01-29T19:21:04Z
0 likes, 0 repeats
@simon I think you would need to set up a private runner in order to even come close to consistent results. Luckily setting one up isn’t too hard.
(DIR) Post #AS95H1oUwxOoKCKIZU by percederberg@mastodonsweden.se
2023-01-29T19:51:05Z
0 likes, 0 repeats
@simon One way to get more consistent results is to schedule such testing once daily, preferably outside US work hours. Perhaps even before western Europe wakes up (depending on the cloud, of course). Build times vary greatly, in my estimation.
(DIR) Post #AS966RpnxohLnRSMDI by programmylife@fosstodon.org
2023-01-29T20:00:27Z
0 likes, 0 repeats
@simon https://tomaugspurger.github.io/posts/performance-regressions/ your post reminded me of this post about pandas performance benchmarks (they use a dedicated machine with nightly jobs). We ran into issues at my last job trying to detect performance regressions on CI due to inconsistent performance, so we were considering doing something similar but didn't make it that far.
(DIR) Post #AS97IhBaoU9Gpy3FjM by rcrowley@mastodon.rcrowley.org
2023-01-29T20:13:31Z
0 likes, 0 repeats
@simon I don’t think you’re overthinking this at all. Even within the same run, the degree of sharedness and oversubscribedness can fluctuate. A couple of alternative techniques to consider: You could teach your CI to launch and terminate an EC2 instance which has at least a better chance of giving you repeatable results, especially on larger instance sizes. Or you could count CPU cycles or instructions using perf-record(1) and perf-report(1) to decouple from wall-clock time.
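A sketch of that counter-based idea, using perf stat rather than perf record/report since stat prints the counts directly (Linux only; needs the perf tool installed and permission to read performance counters; the benchmarked command is just a placeholder):

    import subprocess

    # Counting instructions and cycles decouples the measurement from wall-clock
    # time on a busy shared runner.
    cmd = ["perf", "stat", "-e", "instructions,cycles", "--",
           "python", "-c", "sum(range(10**6))"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stderr)  # perf stat writes its counter summary to stderr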
(DIR) Post #AS99bW4gY76pmnvGqW by zroot@hachyderm.io
2023-01-29T20:39:29Z
0 likes, 0 repeats
@simon that's interesting, since you're using a moving baseline. Keep in mind to run the two cases inside the same job, or there's no assurance they will be executed on the same runner.
(DIR) Post #AS9BgOqp298GjrtXAO by bromanko@hachyderm.io
2023-01-29T21:02:04Z
0 likes, 0 repeats
@simon you aren’t overthinking it. One option to consider is measuring proxy metrics rather than wall time. Something deterministic like instruction counts, library loads, etc.
(DIR) Post #AS9Ea2Ovt8pPEkVX2O by dirk@nona.social
2023-01-29T21:33:11Z
0 likes, 0 repeats
@simon could you add some kind of reference job, so you can measure your actual times relative to this reference time instead of in seconds or minutes?
(DIR) Post #AS9ywhbTyWQ5Ho7gUC by graynorton@mastodon.social
2023-01-30T06:14:35Z
0 likes, 0 repeats
@simon Check out Tachometer: https://github.com/google/tachometer
(DIR) Post #ASAGgwqfKFix6orWGe by mislav@hachyderm.io
2023-01-30T09:33:47Z
0 likes, 0 repeats
@simon Person from GitHub here 👋 As much as I love Actions for CI and builds, I also wouldn’t use our runners for comparison of performance results over time. Not just because of shared computing instances, but also because the runner image itself keeps getting upgraded on a weekly basis. Custom runners of your own would be a way to go.