I talked about this a little in the meetup talk I linked, and I intend to write more about it, but I'll try to summarize.
There are kind of three prongs here:
First, using criterion.rs does a lot to give us more stable metrics. It handles warmups, accounts for obvious statistical outliers in the sample runs, postprocesses the raw data into more meaningful statistics, etc. I'm currently using a fork of the library which additionally records and processes a variety of metrics we get from `perf_event_open` on Linux, but which I assume you could get through ETW or Intel/AMD's userspace PMC libraries.
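To illustrate the kind of postprocessing described above, here's a minimal sketch (not criterion.rs's or lolbench's actual code, and `tukey_filter` is a hypothetical name) of discarding obvious outliers from raw timing samples using Tukey's fences before summarizing:

```rust
// Hypothetical sketch: drop samples outside Tukey's fences (1.5 IQRs
// beyond the quartiles) so summary statistics aren't skewed by noise
// spikes like a scheduler preemption mid-measurement.
fn tukey_filter(mut samples: Vec<f64>) -> Vec<f64> {
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank quantile over the sorted samples.
    let q = |p: f64| samples[((samples.len() - 1) as f64 * p) as usize];
    let (q1, q3) = (q(0.25), q(0.75));
    let iqr = q3 - q1;
    let (lo, hi) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    samples.into_iter().filter(|&s| s >= lo && s <= hi).collect()
}

fn main() {
    // 1000.0 is an obvious outlier relative to the tight cluster near 10.
    let raw = vec![9.8, 10.0, 10.1, 10.2, 9.9, 1000.0];
    let kept = tukey_filter(raw);
    println!("{} samples kept", kept.len()); // the outlier is dropped
}
```

Criterion's real analysis is more sophisticated (bootstrapped confidence intervals, regression over iteration counts), but the basic shape is the same: clean the raw samples, then summarize.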
Second, I try to provide a stable environment so that results over long time deltas are comparable, and so we can store the data for offline analysis rather than having to check out recent prior commits and compare the current PR/nightly/etc. against them. Prior to the current deployment I was using cgroups to move potentially competing processes off of the benchmark cores, which produced some nice results. However, I had some issues with the version of the cpuset utility I installed on the Debian machines and haven't sorted that out yet.
Third, we do a few things with the time-series-esque data we get from measuring multiple toolchains to try to surface only relevant results. Those are mostly in src/analysis.rs if you want to poke around. It basically boils down to using a Kernel Density Estimate to gauge how likely the current toolchain's value is to come from the same population (I hope these terms are halfway correct) as all prior toolchains' values.
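As a rough illustration of that idea (this is a hand-rolled sketch, not src/analysis.rs itself, and the fixed bandwidth is an assumption for the example): build a Gaussian KDE from prior toolchains' measurements and evaluate its density at the current toolchain's value, treating a very low density as evidence the new value is from a different population.

```rust
// Hypothetical sketch: Gaussian KDE over prior toolchains' values,
// evaluated at the current toolchain's value. Low density at `x`
// suggests `x` doesn't belong to the same population.
fn gaussian_kde(priors: &[f64], bandwidth: f64, x: f64) -> f64 {
    let norm = 1.0
        / ((2.0 * std::f64::consts::PI).sqrt() * bandwidth * priors.len() as f64);
    priors
        .iter()
        .map(|&p| (-((x - p) / bandwidth).powi(2) / 2.0).exp())
        .sum::<f64>()
        * norm
}

fn main() {
    // Prior toolchains' values for one benchmark metric (e.g. ns/iter).
    let priors = [100.0, 101.0, 99.0, 100.5, 99.5];
    // Bandwidth would normally be derived from the data (e.g. Silverman's
    // rule of thumb); 1.0 is just a placeholder here.
    let bw = 1.0;
    let steady = gaussian_kde(&priors, bw, 100.2);
    let regressed = gaussian_kde(&priors, bw, 130.0);
    // Density at a typical value dwarfs density at a large regression,
    // which is what lets a tool flag the latter and stay quiet otherwise.
    assert!(steady > regressed);
    println!("steady density: {:.4}, regressed density: {:.6}", steady, regressed);
}
```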
I hope that with a few extensions to the above we can get close to being reliable enough to include in early PR feedback, but I think the likely best case is a manually invoked bot on PRs, followed by me and a few other people triaging the regressions the tool surfaces after something merges.
Here are a few issues that I think will help improve this too:
https://github.com/anp/lolbench/issues/20
https://github.com/anp/lolbench/issues/17
https://github.com/anp/lolbench/issues/14