[HN Gopher] Converting integers to decimal strings faster with A...
___________________________________________________________________
Converting integers to decimal strings faster with AVX-512
Author : ibobev
Score : 70 points
Date : 2022-03-29 16:58 UTC (6 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| Aardwolf wrote:
| And meanwhile, after 9 years of AVX-512 existing, Intel is
| still not reliably adding it to flagship desktop/gaming CPUs:
| first because of 5+ years of reusing the same old architecture
| and process, and now because they disabled it even where it
| was physically present, for reasons I don't know. Why was it
| not a problem for Intel to add MMX, several SSE iterations,
| AVX, and AVX2, but now 9 years of nothing, even though
| parallel computing is annoying to program for and depending
| on GPUs is only gaining in importance?
| Findecanor wrote:
| AVX-512 still isn't _one_ instruction set either. It is several
| that are overlapping, with different feature sets.
|
| I have seen several accounts of them being inefficient (drawing
| much power, large chip area, dragging down the clock of the
| entire chip). They have also been criticised for making context
| switches longer.
|
| There are some good ideas in there though. What I think Intel
| should do is to redesign it as a single modern instruction set
| for 128 and 256-bit vectors (or for scalable vector lengths
| such as where ARM and RISC-V are going). Then express older
| SSE-AVX2 instructions in the new instructions' microcode, with
| the goal of replacing SSE-AVX2 in the long term. Otherwise, if
| Intel doesn't have an instruction set that is modern and
| feature-complete, I think it will fall behind.
| aseipp wrote:
| You can already use the AVX-512 instructions with vectors
| smaller than 512 bits; that, in fact, is one of the
| extensions to the base microarchitecture, and is present in
| every desktop/server CPU that has it available (but not the
| original Xeon Phis, IIRC.) This is called the "Vector Length"
| extension.
|
| Edit: I noted you said "redesign" which is an important
| point. I don't disagree in principle, I just point this out
| to be clear you can use the new ISA with smaller vectors
| already, for anyone who doesn't know this. No need to throw
| out the baby with the bathwater. Maybe their "redesign" could
| just be ratifying this as a separate ISA feature...
|
| > drawing much power, large chip area, dragging down the
| clock of the entire chip
|
| The chip area thing has mostly been overblown IMO, the vector
| unit itself is rather large but would be anyway (you can't
| get away from this if you want fat vectors), and the register
| file is shared anyway so it's "only" 2x bigger relative to a
| 256b one, but still pales in comparison to the L2 or L3. The
| power considerations are significantly better at least as of
| Ice Lake/Tiger Lake, having pretty minimal impact on clock
| speeds/power licensing, but it's worth noting these have
| only 1 FMA unit in consumer SKUs and are traditionally low
| core count designs. Seeing AVX-512 performance/power profiles
| on Sapphire Rapids or Ice Lake Xeon (dual FMAs, for one)
| would be more interesting but again it wouldn't match desktop
| SKUs in any case.
|
| It's also worth noting AVX2 has similar problems e.g. on
| Haswell machines where it first appeared, and would similarly
| cause downclocking due to thermal constraints; but at the
| time the blogosphere wasn't as advanced and efficient at
| regurgitating old wives' tales with zero context or wider
| understanding, so you don't see it talked about as much.
|
| In any case I suspect Intel is probably thinking about moving
| in the direction of variable-width vector sizes. But they
| probably think about a lot of things. It's unclear how this
| will all pan out, though I do admit AVX-512 overall counts as
| a bit of a "fumbled bag" for them, so to speak. I just want
| the instructions! No bigger vectors needed.
| moonchild wrote:
| > "only" 2x bigger relative to a 256b one
|
| 4x bigger; avx512 has twice as many registers.
| gpderetta wrote:
| The number of microarchitectural registers is only
| loosely related to the number of architectural registers.
| needusername wrote:
| > now because I don't know why they disabled it when it was
| there
|
| E-cores (Gracemont) don't support AVX-512, only P-cores (Golden
| Cove) support AVX-512
| Aardwolf wrote:
| Sure, but they then deliberately released a BIOS version that
| prevented people from enabling it even when disabling E-cores.
| matja wrote:
| _which_ AVX-512? :) -
| https://twitter.com/InstLatX64/status/1114141314441011200/ph...
| PragmaticPulp wrote:
| > Why was it not a problem for Intel to add mmx, several sse
| iterations, avx, avx 2, but now 9 years of nothing, even though
| parallel computing is annoying to program for and depending on
| gpu's is only gaining in importance?
|
| AVX512 is significantly more intensive than previous generation
| SIMD instruction sets.
|
| It takes up a lot of die space and it uses a massive amount of
| power.
|
| It can use so much power and is so complex that the clock speed
| of the CPU is reduced slightly when AVX512 instructions are
| running. This led to an exaggerated outrage from
| enthusiast/gaming users who didn't want their clock speed
| reduced temporarily just because some program somewhere on
| their machine executed some AVX512 instructions. To this day
| you can still find articles about how to disable AVX512 for
| this reason.
|
| AVX512 is also incompatible with having a heterogeneous
| architecture with smaller efficiency cores. The AVX512 part of
| the die is massive and power-hungry, so it's not really an
| option for making efficiency cores.
|
| I think Intel is making the right choice to move away from
| AVX512 for consumer use cases. Very few, if any, consumer-
| target applications would even benefit from AVX512
| optimizations. Very few companies would actually want to do it,
| given that the optimizations wouldn't run on AMD CPUs or many
| Intel CPUs.
|
| It's best left as a specific optimization technique for very
| specific use cases where you know you're controlling the server
| hardware.
| Aardwolf wrote:
| MMX was 64-bit wide, SSE 128-bit wide and AVX 256-bit wide,
| and they succeeded each other rapidly (when compared to the
| situation now). AVX512 is another doubling, so what's the
| hold up with this one compared to the previous two doublings?
|
| Maybe it's time for AMD to do something about this, like when
| they took the initiative to create x86-64
| arcticbull wrote:
| As the third doubling, that's an 8X increase over MMX's
| 64-bit base.
|
| 512-bit registers, ALUs, data paths, etc, are all really
| really physically big.
| monocasa wrote:
| My sense is that it's because it happened about when
| Intel's litho progress began stalling out. It was designed
| for a world the uarch folks were expecting with smaller
| transistors than they ended up getting.
| mcronce wrote:
| Zen 4 is reported to have AVX-512 support. I'm not sure if
| that'll be included on Ryzen or not, though; only Epyc is
| "confirmed" as far as I know.
| adrian_b wrote:
| The die space and the power draw have very little to do with
| AVX-512 itself.
|
| AVX-512 is a better instruction set: many tasks can be done
| with fewer instructions than when using AVX or SSE, resulting
| in lower energy consumption even at the same data width.
|
| The increase in size and power consumption is due almost
| entirely to the fact that AVX-512 has both twice the number
| of registers and double-width registers in comparison with
| AVX. Moreover, the current implementations widen the
| execution units and datapaths correspondingly.
|
| If SSE or AVX had been widened for higher performance, they
| would have had the same increases in size and power, but they
| would have remained less efficient instruction sets.
|
| Even in the worst AVX-512 implementation, in Skylake Server,
| doing any computation in AVX-512 mode reduces energy
| consumption considerably.
|
| The problem with AVX-512 in Skylake Server and derived CPUs,
| e.g. Cascade Lake, is that those Intel CPUs had worse methods
| of limiting power consumption than the contemporaneous AMD
| Zen. Whatever method Intel used, it reacted too slowly to
| consumption peaks. Because of that, the Intel CPUs had to
| reduce the clock frequency preemptively whenever a large
| power draw seemed possible, e.g. on seeing a sequence of
| AVX-512 instructions and fearing that more would follow.
|
| While this does not matter for programs that do long
| computations with AVX-512, when the clock frequency really
| needs to go down, it handicaps the programs that execute only
| a few AVX-512 instructions, but enough to trigger the
| decrease in clock frequency, which slows down the non-AVX-512
| instructions that follow.
|
| This was a serious problem for all Intel CPUs derived from
| Skylake, where you must take care not to use AVX-512
| instructions unless you intend to use many of them.
|
| However it was not really a problem of AVX-512 but of Intel's
| methods for power and die temperature control. Those can be
| improved and Intel did improve them in later CPUs.
|
| AVX-512 is not the only feature that caused such undesirable
| behavior. Even on much older Intel CPUs, the same kind of
| problem appears when you want maximum single-thread
| performance but some random background process starts on
| another, previously idle core. Even if that background
| process consumes negligible power, the CPU fears it might
| start to consume a lot and drastically reduces the maximum
| turbo frequency compared with the single-active-core case,
| slowing down the program you actually care about.
|
| This is exactly the same kind of problem, and it is visible
| especially on Windows, which has a huge number of enabled
| system services that may start executing unexpectedly, even
| when you believe the computer should be idle. Nevertheless,
| people got used to this behavior, especially because there
| was little they could do about it, so it was much less
| discussed than the AVX-512 slowdown.
| zdw wrote:
| The power/clockspeed hit for AVX512 is less on the most
| recent Intel CPUs, per previous discussion:
| https://news.ycombinator.com/item?id=24215022
| cl0ckt0wer wrote:
| It looks like they're shifting to efficiency as a target. In
| their latest 12th gen desktop CPUs, you could re-enable avx512
| if you disabled the "efficiency" cores. Then they released a
| BIOS update with that feature removed. Has there been an
| instruction set that failed? Itanium maybe?
| ceeplusplus wrote:
| It costs a lot of area for something that doesn't move the
| needle for the majority of desktop applications (browsers,
| games, office apps). The AVX512 units on Skylake-SP for example
| are a substantial portion of the area of each core. At some
| point you have to consider how many cores you could fit in the
| area used for vector extensions and make a trade-off.
| aseipp wrote:
| This argument IMO is rather overblown. A quick die shot of
| Skylake-X indicates to me that even if you completely nerfed
| every AVX-512 unit appropriately (including the register
| files) on say, an 8-core or 10-core processor, you're going
| to get what? 1 or 2 more cores at maximum?[1] People make it
| sound like 70% of every Skylake-X die just sits there and you
| could have 50x more cores, when it's not that simple. You can
| delete tons of features from a modern microprocessor to save
| space but that isn't always the best way forward.
|
| In any case, I think this whole argument really misses
| another important thing, which is that the ISA and the vector
| width are separate matters here. If Intel would just give us
| AVX-512 with 256b or 128b vectors, it would lead to a very
| big increase in use IMO, without massive register files and
| vector data paths complicating things. The instruction set
| improvements are good enough to justify this IMO. (Alder Lake
| makes it a more complex case since they'd need to do a lot of
| work to unify Gracemont/Golden Cove before they could even
| think about that, but I still think it would be great.) They
| could even just halve the width of the vector load/store
| units relative to the vector size like AMD did with Zen 2
| (e.g. 256b vecs are split into 2x128b ops.) I'll still take
| it.
|
| [1] https://twitter.com/GPUsAreMagic/status/12568664655773941
| 81/...
| gpderetta wrote:
| Exactly. Even just one AVX512 ALU instead of the typical
| two would be an option (which in fact Intel has used on
| some *lake variants).
| Aardwolf wrote:
| > It costs a lot of area for something that doesn't move
| the needle for the majority of desktop applications
|
| But that could be a bit of a chicken and egg situation,
| right?
| jeffbee wrote:
| I don't know where you came up with 9 years. The very first CPU
| that had these features came out in May 2018. "Tiger Lake" has
| these features and it is/was Intel's mainstream laptop CPU for
| the past year or so. Alder Lake, their current generation,
| lacks these features but I think it's understandable because
| they had to add AVX, AVX2, BMI, BMI2, FMA, VAES, and a bunch of
| other junk to the Tremont design in order to make a big/little
| CPU that works seamlessly. Whether you think they should
| instead have made a heterogeneous design that is harder to use
| is another question.
| gnabgib wrote:
| Intel proposed AVX512 in 2013[0], with first appearance on
| Xeon Phi x200 announced in 2013, launched in 2016[1], and
| then on Skylake-X released in 2017[2]
|
| [0]: https://en.wikipedia.org/wiki/AVX-512 [1]:
| https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing [2]: h
| ttps://en.wikipedia.org/wiki/Skylake_(microarchitecture)#Hi..
| .
| adrian_b wrote:
| Actually AVX-512 is considerably older than AVX.
|
| Initially AVX-512 was known as "Larrabee New Instructions".
|
| This instruction set, which included essential features
| missing from both earlier and later Intel ISAs, e.g. mask
| registers and scatter-gather instructions, was developed a
| few years before 2009, by a team into which many people had
| been brought from outside Intel.
|
| The "Larrabee New Instructions" have been disclosed
| publicly in 2009, then the first hardware implementation
| available outside Intel, "Knights Ferry" was released in
| May 2010. Due to poor performance against GPUs, it was
| available only in development systems.
|
| A year later, in 2011, Sandy Bridge was launched, the first
| Intel product with AVX. Even though AVX had significant
| improvements over SSE, it was seriously crippled in
| comparison with the older AVX-512 a.k.a. "Larrabee New
| Instructions".
|
| It would have been much better for Intel customers if Sandy
| Bridge had implemented a 256-bit version of AVX-512 instead
| of AVX. However, Intel has
| always attempted to implement as few improvements as
| possible in each CPU generation, in order to minimize their
| production costs and maximize their profits. This worked
| very well for them as long as they did not have serious
| competition.
|
| The next implementation of AVX-512 (under the name "Intel
| Many Integrated Core Instructions") was in Knights Corner,
| the first Xeon Phi, launched in Q4 2012. This
| version made some changes in the encoding of the
| instructions and it also removed some instructions intended
| for GPU applications.
|
| The next implementation of AVX-512, which again changed the
| instruction encoding to the one used today and adopted the
| name AVX-512, was in Knights Landing, launched in Q2 2016.
|
| With the launch of Skylake Server, in Q3 2017, AVX-512
| appeared for the first time in mainstream Intel CPUs, but
| after removing some sets of instructions previously
| available on Xeon Phi.
|
| AVX-512 is a much more pleasant ISA than AVX; e.g. with the
| mask registers it is much easier to write loops where the
| length and alignment of the data are arbitrary.
| Unfortunately, support for it is unpredictable, so it is
| usually not worthwhile to optimize for it.
|
| Hopefully the rumors that Zen 4 supports AVX-512 are true,
| so its launch might be the real start of widespread use of
| AVX-512.
| wtallis wrote:
| Also consumer Skylake started shipping in 2015 with some
| non-functional silicon area reserved for AVX-512 register
| files.
| jeffbee wrote:
| Skylake did not have IFMA or VBMI. The first
| microarchitecture with both of those was Cannon Lake, Q2
| 2018, which practically did not exist in the market, and
| the first mainstream CPU with both of these was Ice Lake,
| Q3 2019.
| thebiss wrote:
| If you're curious what gets generated, Godbolt maps the C
| code to asm and explains the instructions on hover. The link
| below compiles the source with Clang, which gives a cleaner
| result than GCC.
|
| [0] https://godbolt.org/z/78KodqxaP
| nitwit005 wrote:
| > The code is a bit technical, but remarkably, it does not
| require a table.
|
| Those constants being used look a lot like a table.
| stncls wrote:
| Especially given the fact that the table in the baseline code
| is only 200 bytes. That's less than four AVX-512 registers!
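| (For reference, here is a minimal sketch — my own, not the
| article's exact code — of the kind of ~200-byte table the
| baseline approach uses: 100 two-character entries covering
| "00" through "99", consumed two digits per lookup.)

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Two-digit lookup table: 100 entries * 2 chars = 200 bytes. */
static char pairs[200];

static void init_pairs(void) {
    for (int i = 0; i < 100; i++) {
        pairs[2 * i]     = (char)('0' + i / 10);
        pairs[2 * i + 1] = (char)('0' + i % 10);
    }
}

/* Write n in decimal into buf, two digits per table lookup;
 * returns the length of the resulting string. */
static int u32_to_dec(uint32_t n, char *buf) {
    char tmp[10];
    int pos = 10;
    while (n >= 100) {
        uint32_t r = n % 100;        /* bottom two digits */
        n /= 100;
        pos -= 2;
        memcpy(&tmp[pos], &pairs[2 * r], 2);
    }
    if (n >= 10) {                   /* two leading digits left */
        pos -= 2;
        memcpy(&tmp[pos], &pairs[2 * n], 2);
    } else {                         /* one leading digit left */
        tmp[--pos] = (char)('0' + n);
    }
    int len = 10 - pos;
    memcpy(buf, &tmp[pos], (size_t)len);
    buf[len] = '\0';
    return len;
}
```

| Emitting two digits per table hit halves the number of
| divisions compared to producing one digit at a time.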
| adrian_b wrote:
| I assume that he referred to the fact that the code does not
| need an array in memory, with constant values that must be
| accessed with indexed loads. Using such a table in memory can
| be relatively slow, as shown by the benchmarks for this
| method.
|
| There are constants, but they are used directly as arguments
| for the AVX-512 intrinsics. They must still be loaded into the
| AVX-512 registers from memory, but they are loaded from
| locations already known at compile-time and the loads can be
| scheduled optimally by the compiler, because they are not
| dependent on any other instructions.
|
| For a table stored in an array in memory, the loads can be
| done only after the indices are computed, and they may land
| in a random order rather than the sequential order of the
| memory locations. When the load latencies cannot be hidden,
| they can slow any algorithm considerably.
| nitwit005 wrote:
| The code itself sits in memory, which means the "table" is
| still in memory.
| orlp wrote:
| If your program is sufficiently large and diverse the
| constants embedded in the instructions are no better than a
| table load/lookup. The jump to the conversion routine is
| unpredictable, and the routine will not be in the instruction
| cache, causing the CPU to stall while waiting for the
| instructions to load from RAM.
|
| This will never show up in a microbenchmark where the
| function's instructions are always hot. In fact, a lot of
| microbenchmarking software "warms up" the code by calling it
| a bunch of times before starting to time it, to maximize the
| chances of them being able to ignore this reality.
| dzaima wrote:
| The AVX-512 constants aren't embedded in the instructions,
| but because the load addresses are static, the CPU can start
| fetching them into cache as soon as the decoder reaches the
| loads, possibly even before it knows what the input to the
| function is.
|
| In contrast, for the scalar code, the CPU must complete the
| divisions (which'll become multiplications, but even those
| still have relatively high latency) before it can even
| begin to look for what to cache.
| dzaima wrote:
| The list of arguments to _mm512_setr_epi64 are just reciprocals
| of powers of 10, multiplied by 2^52. The scalar code uses
| division by powers of 10, which'd compile down to similar
| multiplication constants; you just don't see them because the
| compiler does that for the scalar code, but you have to do so
| manually for AVX-512.
|
| And permb_const is a list of indices for the result characters
| within the vector register - the algorithm works on 64-bit
| integers, but the result must be a list of bytes, so it picks
| out every 8th one.
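| (A portable illustration — my own sketch, using the classic
| 32-bit magic constant rather than the article's 2^52-scaled
| ones — of the same multiply-by-reciprocal trick: every
| division by 10 becomes a multiply and a shift.)

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* For 32-bit x, x/10 == (x * 0xCCCCCCCD) >> 35 for all inputs;
 * 0xCCCCCCCD is ceil(2^35 / 10), the same family of fixed-point
 * reciprocals as the _mm512_setr_epi64 arguments discussed above. */
static uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* Emit n in decimal using only multiply-and-shift, no divides. */
static int to_decimal(uint32_t n, char *buf) {
    char tmp[10];
    int len = 0;
    do {
        uint32_t q = div10(n);
        tmp[len++] = (char)('0' + (n - q * 10)); /* n % 10, divide-free */
        n = q;
    } while (n != 0);
    for (int i = 0; i < len; i++)   /* digits were produced in reverse */
        buf[i] = tmp[len - 1 - i];
    buf[len] = '\0';
    return len;
}
```

| A compiler performs exactly this strength reduction for the
| scalar code's divisions by powers of 10; the vector version
| just has to spell the constants out.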
| jeffbee wrote:
| I'd be curious about use cases where this does or does not make
| sense. On the one hand, you saved a few nanoseconds on the
| integer-to-string encoder. But on the other hand you're committed
| to storing and transferring up to 15 leading zeros, and there
| must be some cost on the decoder side to consume the leading
| zeros.
| So this clearly makes sense on a write-once-read-never
| application but there must also be a point at which the full
| lifecycle cost crosses over and this approach is worse.
| aqrit wrote:
| related to IEEE 754 double-precision floating-point round-
| trips?
| dzaima wrote:
| The article compares padded length 15 output for both cases;
| removing leading zeroes would have a cost for both AVX-512 and
| regular code.
| jeffbee wrote:
| Yeah, that's the "this approach" to which I refer though.
| This is a micro-optimization of an approach that I'm not sure
| has many beneficial applications.
| dzaima wrote:
| Ah. The article, as I see it, is primarily about AVX-512 vs
| scalar code, not the 15 digits though. The fixed-length
| restriction is purely to simplify the challenge.
|
| To remove leading zeroes, you'd need to use one of
| bsf/lzcnt/pcmpistri, plus a masked store, which has some
| cost, but it still stays branchless and will probably be more
| than compensated for by the smaller cache/storage/network
| usage & shorter decoding.
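| (A scalar analogue — my own sketch, not the article's code —
| of that idea: find the run of leading zero digits and move
| the tail forward; a vector version would compute the run
| length from a compare mask and finish with one masked store.)

```c
#include <string.h>
#include <assert.h>

/* Trim leading zeroes from a fixed-width decimal field in place,
 * keeping a single '0' when the value is zero; returns the new
 * length. */
static size_t trim_leading_zeroes(char *buf) {
    size_t z = strspn(buf, "0");     /* length of the leading '0' run */
    if (z > 0 && buf[z] == '\0')     /* all zeroes: keep one digit */
        z--;
    size_t len = strlen(buf) - z;
    memmove(buf, buf + z, len + 1);  /* +1 carries the terminator */
    return len;
}
```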
___________________________________________________________________
(page generated 2022-03-29 23:00 UTC)