[HN Gopher] Bubble sort slower with -O3 than -O2 with GCC
___________________________________________________________________
Bubble sort slower with -O3 than -O2 with GCC
Author : asicsp
Score : 110 points
Date : 2021-10-17 12:13 UTC (10 hours ago)
(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)
| codr7 wrote:
| I've written a lot of interpreters in C/C++ lately, and spent
| a lot of time profiling and optimizing them.
|
| I can't remember -O3 ever being faster than -O2, it's usually
| slightly slower for me.
| DannyBee wrote:
| For interpreters, that's highly likely to be true, especially
| if they are mainly spending time in switch tables or executing
| basic opcodes.
| gnufx wrote:
| Python, specifically, is supposed to get the most benefit
| from -fno-semantic-interposition
| infberg wrote:
| Care to explain? -O3 generates larger code than -O2?
| pclmulqdq wrote:
| Yes, -O3 tends to include a lot of features that increase
| code size, like aggressive loop unrolling. If you are
| jumping around a large amount of code, -O3 generally
| performs more poorly than -O2, but if you are running a
| tight loop (like HPC code), -O3 is better.
|
| In the past, at a time when I worked on a very performance
| sensitive codebase that was also limited in scope, we
| compiled with -Os and did all the loop optimizations we
| wanted manually (and with pragmas). That produced faster
| code than -O2 or -O3.
| gnufx wrote:
| Regarding unrolling, -O3 contains -funroll-and-jam but
| not -funroll-loops. You may want one or the other, maybe
| both, depending on circumstances. I don't see much
| benefit from the available pragmas on HPC-type code
| unless for OpenMP, and "omp simd" isn't necessary to get
| vectorization in the places I've seen people say it is.
| Mileage always varies somewhat, of course. (Before
| second-guessing anything, use -fopt-info.)
| jleahy wrote:
| It's a shame that -Os can sometimes produce truly awful
| code. There are a few optimisations in there that trade a
| byte for a massive slowdown.
| userbinator wrote:
| You asked for minimum size, and that's what you got. I'd
| say that's working as it should.
|
| More granular control over optimisation would be good,
| however.
| jhgb wrote:
| Surely a profile-guided build should be able to only
| apply -Os to those functions where it doesn't cause a lot
| of problems.
| nly wrote:
| What about when you factor in PGO? (-fprofile-generate &
| -fprofile-use)
| mhh__ wrote:
| This is the rub. The compiler is flying completely blind
| without PGO.
|
| I turned on PGO for the D compiler recently and it compiled
| its tests 15% faster. Boom.
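| The usual GCC PGO workflow is a two-pass build; a sketch,
| where the file and workload names are placeholders:

```shell
# Sketch of a GCC profile-guided build (names are placeholders).
# 1. Build with instrumentation:
gcc -O3 -fprofile-generate prog.c -o prog
# 2. Run a representative workload; this writes .gcda profile data:
./prog typical-input
# 3. Rebuild using the collected profile:
gcc -O3 -fprofile-use prog.c -o prog
```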
| codr7 wrote:
| Haven't really been down that rabbit hole yet, thanks for
| reminding me.
| mgraczyk wrote:
| To me the more interesting meta-point here is that some
| algorithms are less microarchitecture friendly than others.
| Algorithms with larger distances between data dependencies are
| less likely to stall.
|
| For example I compared an insertion sort and see the same
| performance between O2 and O3:
|
|     void insertionsort(int *buf) {
|         for (int i = 1; i < n; ++i) {
|             const int x = buf[i];
|             int j = i - 1;
|             for (; j >= 0 && buf[j] > x; --j)
|                 buf[j + 1] = buf[j];
|             buf[j + 1] = x;
|         }
|     }
|
| bubble O2: 1.208s
|
| bubble O3: 1.469s
|
| insertion O2: 0.126s
|
| insertion O3: 0.126s
| gnufx wrote:
| I see something rather different with that example, using GCC 10
| on SKX with that code and program argument, bound to a core, +/-
| ~2%:
|
| -O2 -g: 1.07s
|
| -O3 -g: 1.17s
|
| -O3 -g -fprofile-generate/use: 0.84s
|
| (maqao cqa from https://maqao.org would give comprehensive info
| on the generated code.)
|
| -- /* Tom, don't let me ever catch you using
| bubblesort again! -- Don */
|
| -- Knuth in dvips/afm2tfm.c
| jeffbee wrote:
| Lots of things are slower at higher -O levels, many things get
| slower if you compile them for your native instruction
| extensions, and there are other surprises lurking all over the
| compiler. The only way to be sure is to sweep out the parameter
| space of your compiler, using benchmark results as your objective
| function.
| glandium wrote:
| Relevant talk by Emery Berger: Performance matters
| https://www.youtube.com/watch?v=r-TLSBdHe1A I'd recommend it
| to everyone interested in the subject.
| rkagerer wrote:
| _this is pretty clearly an anti-optimization you should
| report on https://gcc.gnu.org/bugzilla/_
|
| Was it ever reported?
|
| I did a quick search on https://gcc.gnu.org/bugzilla for any
| issues, of any status, tagged with _missed-optimization_ and
| containing the term "store-forwarding" or "bubble sort" and
| didn't see it (maybe I'm not searching right?)
| mhh__ wrote:
| How to keep the O3 goodness without pessimizations: Use
| likely/unlikely liberally, unreachable(), and PGO.
|
| If the code slows down, it's usually because the compiler has
| generated a bunch of code because it doesn't know what hot path
| to schedule for.
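| With GCC, likely/unlikely annotations conventionally wrap
| __builtin_expect; a minimal sketch (the macro names follow
| the usual kernel-style convention, not anything from this
| thread):

```c
/* Kernel-style likely/unlikely macros built on GCC's
 * __builtin_expect: they tell the compiler which branch to lay
 * out as the hot path. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse_byte(int c) {
    if (unlikely(c < 0))   /* error path, moved out of line */
        return -1;
    if (likely(c < 128))   /* common case, laid out as hot path */
        return c;
    return c & 0x7f;
}
```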
|
| Compilers will generate hundreds to thousands of
| instructions for factorial if you let them, because they turn
| the recursion into a loop, then vectorize the loop.
|
| Undefined Behaviour pub quiz question in there ^^
| unwind wrote:
| Just to clarify, in this context PGO = Profile-Guided
| Optimization [1].
|
| [1]: https://en.m.wikipedia.org/wiki/Profile-guided_optimization
| jhgb wrote:
| ...as opposed to? Off the top of my head I can't seem to
| remember a single acronym I could possibly confuse it with,
| and now you got me wondering which common acronym I
| completely failed to learn.
| toxik wrote:
| Pose graph optimization
| mhh__ wrote:
| That seems like a bit of a reach given that the thread is
| obviously about GCC.
| MauranKilom wrote:
| Not sure if that would have an impact here. GCC is just unaware
| of the latency implications of store forwarding. I mean, it's
| definitely worth a shot, but you'd just be more or less hoping
| that your mentioned techniques disable the right optimization
| pass.
| gnufx wrote:
| Yes to -fprofile-use. That brought gfortran to essentially the
| same geometric mean as ifort with reasonable "good" flags for
| me, running a benchmark set which I think is supposed to show
| off proprietary compilers. ifort got little benefit from its
| equivalent.
| [deleted]
| userbinator wrote:
| My experience with MSVC and ICC is similar - O3 is often worse
| than O2, and a lot of the time Os is actually the fastest (and
| smallest).
| StephanTLavavej wrote:
| MSVC has /O1 and /O2 (and /Os), but not /O3. See
| https://docs.microsoft.com/en-us/cpp/build/reference/compile...
| . (I work on MSVC's STL.)
| flatiron wrote:
| Dude spent more time on the explanation than I spend on my "good"
| projects at work!
| mcraiha wrote:
| Mandatory xkcd https://xkcd.com/664/
| MauranKilom wrote:
| If you've spent any amount of time on StackOverflow around
| C/C++/assembly, you could probably tell within the first three
| paragraphs of the answer who authored it. :)
| digitallyfree wrote:
| From my experience, -O2 includes optimizations that improve
| performance in the majority of cases, whereas -O3 contains
| additional optimizations whose results are less clear-cut -
| hence why they're in -O3 rather than -O2. I've seen cases
| where -O3 is faster than -O2, and vice versa.
| amelius wrote:
| Can you turn on optimizations on a per-function basis?
| gnufx wrote:
| Yes, and that's sometimes a good idea. GCC has a pragma and
| function attribute (unfortunately not for Fortran, at least).
| phkahler wrote:
| That's why it's infuriating that vectorization was not
| enabled at -O2. Since x86-64 supports SSE2 as a baseline,
| every program has been underperforming at -O2 for over 15
| years. Without vectorization at -O2, the recent
| -march=x86-64-v2 and v3 options don't benefit either, since
| the AVX2 that v3 adds goes largely unused. See pathetic test
| results here:
| https://github.com/solvespace/solvespace/issues/972
|
| Notice that v3 is no better and may even be a hair slower.
| gnufx wrote:
| See https://gcc.gnu.org/gcc-12/changes.html (GCC 12 enables
| some vectorization at -O2). It's still not necessarily a
| guaranteed win, I think.
| jleahy wrote:
| Vectorisation is not a clear win, and so doesn't belong in
| O2. Certainly it can deliver massive improvements in some
| cases (as loop unrolling could in Pentium III days), yet can
| cause significant binary size increases (think icache costs)
| and makes loops with small iteration counts potentially much
| slower.
|
| If for some reason you want O2 but with vectorisation you
| can enable it, and if you want O3 without the extra inlining
| or loop unrolling, you can specify that the same way.
| userbinator wrote:
| Did you mean the P4? An extremely long pipeline meant loop
| unrolling helps it far more than it does x86 processors
| before or after. PIII is more similar to Core/2/i in
| performance characteristics.
|
| ...and then there's Atom, which is also an odd one - I
| believe it's more like a classic Pentium (P54).
| jleahy wrote:
| Well I did want to say the P4, but I never actually had
| one so didn't want to say something that turned out to be
| wrong.
|
| Certainly on earlier microarchitectures it did help a lot
| more than it does now, and there were some nice wins on
| the PIII for it.
| phkahler wrote:
| The fact that they are changing to include some
| vectorization options by default for -O2 suggests
| otherwise. In particular SLP vectorization should have been
| there all along. Every graphics library 2d and 3d defines
| vector classes that will easily go faster with the simplest
| vectorization.
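| The kind of code SLP vectorization targets is straight-line
| arithmetic on small fixed-size vectors; a sketch of a
| hypothetical vec4 along the lines of what graphics libraries
| define:

```c
/* A small fixed-size vector type of the kind 3d graphics code
 * defines everywhere. The four independent adds in vec4_add are
 * exactly what SLP (superword-level parallelism) vectorization
 * can fuse into a single SIMD instruction. */
typedef struct { float x, y, z, w; } vec4;

vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r;
    r.x = a.x + b.x;
    r.y = a.y + b.y;
    r.z = a.z + b.z;
    r.w = a.w + b.w;
    return r;
}
```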
|
| Arguments around code size have been largely irrelevant for
| quite some time, so long as performance increases.
|
| If I put on a tinfoil hat, I'd suspect Intel was at least
| an influence. They're the ones adding AVX, AVX2, 512 and
| providing their own compiler to take advantage of them,
| while crippling AMD parts. That is until they switched to
| LLVM.
| kzrdude wrote:
| Maybe it could use a different heuristic in O2? Depending
| on the function it's vital to have autovectorization for
| performance. Neither -O3 nor -O2 are perfect as a blanket
| option for all the functions in a big program.
| torginus wrote:
| I think bubble sort is kind of an optimization nightmare - a loop
| where subsequent iterations depend on each other - and the
| dependency is through memory. I'm pretty sure the compiler tries
| to extract some level of parallelism from the loop and fails
| miserably.
| kiryin wrote:
| I've always been under the impression that -O3, -Ofast and
| related "performance" tweaks are not recommended, and
| shouldn't be used unless the software in question is
| specifically written with them in mind.
___________________________________________________________________
(page generated 2021-10-17 23:00 UTC)