[HN Gopher] Popcount CPU instruction (2019)
___________________________________________________________________
Popcount CPU instruction (2019)
Author : softwaredoug
Score : 181 points
Date : 2023-08-13 14:39 UTC (8 hours ago)
(HTM) web link (vaibhavsagar.com)
(TXT) w3m dump (vaibhavsagar.com)
| angiosperm wrote:
| I still don't know of _any_ RISC-V implementation that supports
| the POPC (and related) instructions. Would like to be pointed at
| any.
|
| It is generally a bad idea to field an instruction-set
| architecture that turns common, brilliant optimizations into
| gross pessimizations.
| camel-cdr wrote:
| The bit manipulation [0] extension has been ratified for a
| while now and is part of the RVA22 application extension
| profile [1].
|
| You can already buy SoCs that support it, e.g. the VisionFive 2 and
| the Star64.
|
| Interestingly, the RISC-V vector extension [2] has its own popcount
| instruction for vector register masks. This is needed because the
| scalable architecture doesn't guarantee that a vector mask fits into
| a 64-bit register, so vector masks are stored in a single LMUL=1
| register. This works really well, because with LMUL=8 and SEW=8 you
| get 100% utilization of that single LMUL=1 vector register.
|
| Another interesting thing is that the vector crypto extension will
| likely introduce an element-wise popcount instruction [3].
|
| [0] https://github.com/riscv/riscv-bitmanip
|
| [1] https://github.com/riscv/riscv-profiles
|
| [2] https://github.com/riscv/riscv-v-spec
|
| [3] https://github.com/riscv/riscv-
| crypto/blob/master/doc/vector...
| SushiHippie wrote:
| Reading the title, I thought it was about Debian's popularity
| contest "popcon": https://popcon.debian.org/
| adrian_b wrote:
| > it has in fact been present in CPU architectures since at least
| 1961
|
| Actually, it was first present in an electronic computer in February
| 1951, in the Ferranti Mark I, which was the first commercially
| available electronic computer, slightly ahead of the Univac I.
|
| "Population count" is a name that was first used in Cray-1
| (1976).
|
| In the Ferranti Mark I, this instruction was named "sideways add",
| and it was one of a few instructions included in the instruction set
| at the request of Alan Turing, so he is the original author of this
| "weird" instruction.
|
| In CDC 6600 (August 1963), the name of this instruction was "sum
| of 1's".
| svat wrote:
| Indeed, Knuth writes in TAOCP Vol 4A (in section "7.1.3.
| Bitwise Tricks and Techniques", previously pre-fascicle 1a:
| https://cs.stanford.edu/~knuth/fasc1a.ps.gz):
|
| > Early machine designers provided fullword bitwise operations
| in their computers primarily because such instructions could be
| included in a machine's repertoire almost for free. Binary
| logic seemed to be potentially useful, although only a few
| applications were originally foreseen. [...] The Manchester
| Mark I computer, built at about the same time, included not
| only bitwise AND, but also OR and XOR. When Alan Turing wrote
| the first programming manual for the Mark I in 1950, he
| remarked that bitwise NOT can be obtained by using XOR (denoted
| '[?]') in combination with a row of 1s. R. A. Brooker, who
| extended Turing's manual in 1952 when the Mark II computer was
| being designed, remarked further that OR could be used "to
| round off a number by forcing 1 into its least significant
| digit position." By this time the Mark II, which was to become
| the prototype of the Ferranti Mercury, had also acquired new
| instructions for sideways addition and for the position of the
| most significant 1.
|
| The "sideways add" operation (aka population count, sum of
| binary digits, number of 1s, ...) finds mention in all five (so
| far) volumes of TAOCP, and he even uses a special notation νx
| for the sum of the bits of x. It has always been of interest
| (from later in the same volume):
|
| > The first textbook on programming, _The Preparation of
| Programs for an Electronic Digital Computer_ by Wilkes,
| Wheeler, and Gill, second edition (Reading, Mass.: Addison-
| Wesley, 1957), 155, 191-193, presented an interesting
| subroutine for sideways addition due to D. B. Gillies and J. C.
| P. Miller.
| jon_richards wrote:
| And in 1955 you had Arithmetic Operations in Digital
| Computers. https://archive.org/details/arithmetic_operations_
| in_digital...
| dalke wrote:
| In case you are curious, here is the algorithm from that 1957
| textbook (2nd edition): https://archive.org/details/programs
| forelect00wilk/page/190/...
|
| I never did sit down and figure out the notation.
|
| It isn't in the 1951 (1st edition), at
| https://archive.org/details/preparationofpro0000maur
| vanderZwan wrote:
| > _at the request of Alan Turing_
|
| Well, I can imagine there might actually have been a connection
| to cryptography then. Not NSA related though.
| Sharlin wrote:
| > this instruction was named "sideways add"
|
| An early instance of what's now more commonly called a
| "horizontal" operation in the context of SIMD.
| stabbles wrote:
| which are slow.
|
| If you wanna count the number of occurrences of a certain character
| in a string, you ultimately have to use popcount on an ordinary
| register; there's no corresponding SIMD instruction (at least in <=
| AVX2). Something along the lines of:
|
|     __builtin_popcount(_mm256_movemask_epi8(
|         _mm256_cmpeq_epi8(haystack, needle)));
|
| So, if you wanna avoid horizontal reductions, you'd better _sum_ the
| results of _mm256_cmpeq_epi8 and effectively keep track of 32
| counters of 1-byte integers.
|
| Meaning that you have an outer loop over the string, and a
| fixed-length inner loop of 255 iterations in which you
| accumulate counts, and after the inner loop a horizontal sum.
|
| Then you go from 1 horizontal operation per iteration to 1
| horizontal operation per 255 iterations.
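|
| Something like this minimal sketch of that structure (assuming AVX2
| and a length that's a multiple of 32 bytes; the helper name and
| layout are illustrative only):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     static uint64_t count_byte_avx2(const uint8_t *s, size_t n, uint8_t c) {
|         const __m256i needle = _mm256_set1_epi8((char)c);
|         uint64_t total = 0;
|         size_t i = 0;
|         while (i < n) {
|             /* Inner loop: at most 255 blocks, so each unsigned byte
|                counter (incremented by 1 per match) cannot overflow. */
|             size_t blocks = (n - i) / 32;
|             if (blocks > 255) blocks = 255;
|             __m256i counts = _mm256_setzero_si256();
|             for (size_t b = 0; b < blocks; ++b, i += 32) {
|                 __m256i chunk = _mm256_loadu_si256((const __m256i *)(s + i));
|                 __m256i eq = _mm256_cmpeq_epi8(chunk, needle); /* 0xFF on match */
|                 counts = _mm256_sub_epi8(counts, eq);          /* -(-1) == +1 */
|             }
|             /* One horizontal reduction per inner loop. */
|             uint8_t tmp[32];
|             _mm256_storeu_si256((__m256i *)tmp, counts);
|             for (int j = 0; j < 32; ++j) total += tmp[j];
|         }
|         return total;
|     }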
| ack_complete wrote:
| There is actually such an instruction, psadbw
| (_mm_sad_epu8). It's not noticeably faster for the scenario you
| describe, since it only affects the outer loop, but it does avoid
| needing the popcnt instruction, since psadbw only requires SSE2.
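|
| A hedged sketch of that reduction, using the 256-bit analogue
| _mm256_sad_epu8 against zero to sum the 32 byte counters from the
| inner loop above (assumes AVX2; the helper name is made up):
|
|     #include <immintrin.h>
|     #include <stdint.h>
|
|     static uint64_t hsum_bytes_avx2(__m256i counts) {
|         /* vpsadbw against zero sums each group of 8 bytes into a
|            64-bit lane. */
|         __m256i sums = _mm256_sad_epu8(counts, _mm256_setzero_si256());
|         return (uint64_t)_mm256_extract_epi64(sums, 0)
|              + (uint64_t)_mm256_extract_epi64(sums, 1)
|              + (uint64_t)_mm256_extract_epi64(sums, 2)
|              + (uint64_t)_mm256_extract_epi64(sums, 3);
|     }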
| 10000truths wrote:
| With AVX512, you can use masking and the vpopcntb
| instruction to avoid moving values between GPRs and vector
| registers until the very end: uint64_t
| count_occurrences(const __m512i* vec_string, size_t
| num_vecs, char target) { __m512i count_vec =
| _mm512_set1_epi64(0); __m512i target_vec =
| _mm512_set1_epi8(target); __m512i increment_vec =
| _mm512_set1_epi8(1); for(size_t i = 0; i <
| num_vecs; ++i) { __mmask64 cmp_mask =
| _mm512_cmpeq_epu8_mask(target_vec, vec_string[i]);
| count_vec += _mm512_maskz_popcnt_epi8(cmp_mask,
| increment_vec); } return
| _mm512_reduce_add_epi64(count_vec); }
| Aardwolf wrote:
| Why was there no standard operator for this in C since the 1970s?
| You got various bit shifts and bitwise logic operators, so why not
| this? std::popcount has existed in C++ since C++20, but that took a
| while given how long this instruction has been in CPUs.
| Someone wrote:
| Because almost nobody had a computer that supported it when C was
| created. That article mentions three CPU architectures that had the
| instruction by the mid-1970s:
|
|     1961: IBM Stretch
|     1964: CDC 6000
|     1975: Cray-1
|
| After that, it mentions further support in 2005, with SPARC.
|
| Reading https://en.wikipedia.org/wiki/IBM_7030_Stretch, a grand
| total of 9 IBM Stretch machines were sold.
|
| In comparison, the CDC 6600 was a wild success, with "100+" sales
| (https://en.wikipedia.org/wiki/CDC_6600).
|
| The Cray-1 scores similarly with "over 100"
| (https://en.wikipedia.org/wiki/Cray-1)
|
| C is from 1972 and started life on the PDP-7, and was soon ported to
| the PDP-11, of which "around 600,000" were sold
| (https://en.wikipedia.org/wiki/PDP-11; I guess most of these were
| sold in the late 1970s and early 1980s, but I think it's a safe bet
| they were almost immediately more common than those 3 machines).
| matthews2 wrote:
| At least GCC and Clang can recognise most sane implementations
| of popcount and replace it with an instruction.
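|
| For example, a plain-C loop like this sketch; whether it actually
| becomes a single instruction depends on compiler version and target
| flags such as -mpopcnt or -march=native:
|
|     #include <stdint.h>
|
|     /* Kernighan-style popcount loop that recent GCC/Clang can
|        usually pattern-match into a single popcount instruction. */
|     static unsigned popcount64(uint64_t x) {
|         unsigned n = 0;
|         while (x) {
|             x &= x - 1;   /* clear the lowest set bit */
|             ++n;
|         }
|         return n;
|     }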
| pantalaimon wrote:
| We do have `stdc_popcount()` now with C23
| sigsev_251 wrote:
| Isn't it stdc_count_ones?
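|
| A minimal C23 sketch, assuming a libc that already ships <stdbit.h>
| (the type-generic macro is stdc_count_ones, with suffixed forms such
| as stdc_count_ones_ull):
|
|     #include <stdbit.h>   /* C23 */
|     #include <stdio.h>
|
|     int main(void) {
|         unsigned long long x = 0xF0F0F0F0F0F0F0F0ull;
|         printf("%u\n", (unsigned)stdc_count_ones(x));  /* prints 32 */
|         return 0;
|     }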
| dataflow wrote:
| Maybe because it hasn't been in x86 CPUs that long. Like Intel
| added it in 2009 or so. And there were intrinsics for it so I'm
| guessing it wasn't a high priority.
| ralferoo wrote:
| On the PS3 SPU cores, the best "does everything" instruction was the
| "rotate and mask" series of instructions, which was the underlying
| implementation of all the rotate, shift, and bitfield-extraction
| instructions. That architecture is still my favourite years on -
| every register was vectorised by default, and the instruction set
| was a minimalistic one that achieved a lot of unusual operations
| that were incredibly useful for graphics or signal-processing work.
| Often you couldn't find an exact instruction to
| do what you wanted, but if you studied the documentation long
| enough, you'd find one that generalised it and actually did do
| what you wanted, as well as gracefully handling loads of other
| edge cases you hadn't thought of - for instance the various
| instructions to create shuffle masks so that lots of
| byte/word/long re-ordering operations could be done with the
| single shuffle instruction, which could be used to emulate
| unaligned memory access using aligned-only memory accesses.
|
| Not exactly weird, but on the PowerPC side, "eieio" is my favourite
| name for an instruction on any architecture - without fail, every
| time I come across it, I start humming Old MacDonald!
| Findecanor wrote:
| Continuing the tradition of having _ridiculously_ named
| instructions: when PowerPC went 64-bit, it got a rotate-and-mask
| instruction named `rldicl` - Rotate Left Doubleword Immediate then
| Clear Left.
| [deleted]
| Everdred2dx wrote:
| Not a whole lot of asm experience here- what does it mean for a
| register to be vectorized?
| nwallin wrote:
| SIMD. Instead of having one value in one register, you have
| several values per register. (usually a small power of 2. SSE
| is 4 floats per register. AVX512 is 16 floats per register.)
| corysama wrote:
| And, when you perform an operation on a register, like an
| add, the operation works on all of the values in that
| register simultaneously in a single instruction.
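|
| As a tiny illustration (AVX intrinsics in C; purely for exposition):
|
|     #include <immintrin.h>
|
|     /* One __m256 register holds 8 floats; a single vaddps adds all
|        8 lanes at once. */
|     __m256 add8(__m256 a, __m256 b) {
|         return _mm256_add_ps(a, b);
|     }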
| icelusxl wrote:
| > "rotate and mask" series of instructions
|
| Sounds a lot like the PowerPC's rlwinm instruction.
|
| The PowerPC 600 series, part 5: Rotates and shifts
|
| https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99...
| Vogtinator wrote:
| There are similar instructions on PowerPC (rlwinm) and bfi/ubfx
| instructions on ARMv8.
| ktm5j wrote:
| There's also a gnu error code called EIEIO:
| https://www.gnu.org/software/libc/manual/html_node/Error-Cod...
| cosmojg wrote:
| Macro: int ED
|     "?." The experienced user will know what is wrong.
|
| Macro: int EGREGIOUS
|     "You really blew it this time." You did what?
|
| Macro: int EIEIO
|     "Computer bought the farm." Go home and have a glass of warm,
|     dairy-fresh milk.
|
| Macro: int EGRATUITOUS
|     "Gratuitous error." This error code has no purpose.
| moritzwarhier wrote:
| And apparently also
|
| > EGREGIOUS
|
| and
|
| > EGRATUITOUS
|
| and... lots of stuff ^^
| adamnemecek wrote:
| Do you happen to have a link to said docs?
| jjtheblunt wrote:
| I'd start here
|
| https://www.psdevwiki.com/ps3/Cell_Programming_IBM
| ralferoo wrote:
| It looks like my memory is playing tricks and the combined
| instruction that does everything was just on the PPU (the
| PowerPC part) and not the SPU.
|
| But the SPU ISA docs are available in many places, just
| googled and found this mirror:
| http://www.ece.stonybrook.edu/~midor/ESE545/SPU_ISA_v12.pdf -
| the various shifts start on page 117, but they're simpler
| than I remember - shift left in various forms, rotate (right)
| and mask (i.e. shift right) and rotate (right).
|
| But, even looking at the instructions, it's not immediately obvious
| that every instruction is operating on a 128-bit vector - e.g. shl
| operates on 4 32-bit fields at once, and shlh operates on 8 16-bit
| fields at once, so e.g. a single instruction can even shift each
| field by a different amount. Almost all instructions operate this
| way; only a few use just the first field in a vector, e.g.
| conditional branches (brz, brnz, brhz, brhnz). The other rotate
| instructions are interesting - you can do a bitwise rotate across
| the entire 128-bit vector, rotate by bytes (very useful for memory
| accesses), and there's the wonder instruction shufb (shuffle bytes,
| see page 116), which allows each byte of the result to be pulled
| from a specific byte of one of 2 vectors, or from a constant
| (0, 0x80, 0xFF), which allows you to do all manner of things, for
| example byte-order swapping or merging different values together.
|
| So, you have this really powerful instruction set that's
| vectorised on almost every instruction, and then you have a
| massive 128 registers per core coupled with 32KB local cache
| memory which is single cycle access. It's an absolute joy to
| program for on anything which will fit into the 32KB local
| cache (which includes your program). The only downside is
| that anything that doesn't fit into 32KB requires you to
| start a DMA request to copy a chunk to or from main memory,
| and wait for it to complete and/or continue doing other work
| in the background. With suitable workloads operating on 10KB
| or less, many people used this in a true double-buffered
| fashion, so the next set of data was transferring while you
| were processing the previous set, but this made lots of tasks
| quite complicated and many developers found it too hard to
| make good use of the SPU, and so it was often sadly under-
| utilised. One interesting effect though, is that DMA'ing your
| buffers took a similar time to a cache miss on the PowerPC
| cores, so actually even a naive approach that always
| triggered a small DMA from main memory in parallel with other
| operations often wouldn't see much latency and might still
| even be quicker than running the same task on the PowerPC
| cores where there was less opportunity for doing other work
| during a cache miss.
|
| Another interesting feature of this CPU is that you can do a
| pre-emptive branch prediction, so that the instruction cache is
| already full when a branch is taken, meaning that a branch can
| execute in a single cycle. Coupled with indirect branches (i.e.
| using a register value), you can give the CPU a branch hint in
| advance of the end of a loop as to whether it'll terminate or not,
| and also avoid stalls. In some cases this could outperform the x86
| branch prediction that was common at the time, which just maintained
| a cache of the last target of recent branches.
|
| Overall, SPU programming was tough to master, but once you
| got something working the performance was phenomenal. I think
| that's why I liked it so much - because the sense of
| accomplishment was so tangible once you'd battled through it
| and emerged the other side with some amazing code. But
| nowadays, the power it provided, that was unheard of at the
| time (7 cores at 3GHz), isn't anything special over a mid-
| range i9.
| jb1991 wrote:
| Title checks out. I don't believe a word of it.
| fear91 wrote:
| It's a very useful instruction. Go uses it extensively in its
| core runtime and the compiler itself. I assume other languages do
| too if they care about performance.
|
| Examples:
|
| 1. Growing a slice.
|
| 2. Runtime's heap bitmap.
|
| 3. Page allocation.
|
| 4. malloc.
|
| 5. Register allocation.
| cpach wrote:
| If you want an example where popcount is very useful, check out
| this challenge from Cryptopals:
| https://cryptopals.com/sets/1/challenges/6
| fegu wrote:
| Haskell added a function popCount in
| https://hackage.haskell.org/package/base-4.18.0.0/docs/Data-...
| some years ago to allow easy use of this operation where present.
| inetknght wrote:
| > You might be wondering, like I was, if there's more to this
| instruction, but that's all it does! This doesn't seem very
| useful, right?
|
| It seems _very_ useful. I've used it for several algorithms
| including DNA analysis and vectorized math.
| nwoli wrote:
| It's of course useful; the question is whether it's useful enough to
| warrant its own instruction.
| chaboud wrote:
| Whether it's shift, mask, multiply, Kernighan's method, or
| something else, it's going to be multiple instructions and
| tens or hundreds of cycles to do this in software.
|
| Popcount instructions take a handful of cycles to run (~10 on ARM,
| ~3 on Intel?). It's one of those things that silicon can do very
| well.
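|
| For reference, here is a sketch of the classic shift/mask/multiply
| (SWAR) fallback for a 64-bit word, one of the software options
| mentioned above:
|
|     #include <stdint.h>
|
|     static unsigned popcount64_swar(uint64_t x) {
|         x = x - ((x >> 1) & 0x5555555555555555ull);           /* 2-bit sums */
|         x = (x & 0x3333333333333333ull)
|           + ((x >> 2) & 0x3333333333333333ull);               /* 4-bit sums */
|         x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;           /* 8-bit sums */
|         return (unsigned)((x * 0x0101010101010101ull) >> 56); /* add the bytes */
|     }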
| fear91 wrote:
| New Intel/AMD CPUs do a register-based popcount in a single clock
| cycle.
| angiosperm wrote:
| Used to be three cycles.
|
| Unfortunately, the original AMD64 back in 200x lacked a popcount, so
| most software built for PCs even today lacks any instances of the
| instruction. The means of getting the instruction generated are
| finicky, non-portable, and often unexpectedly result in a function
| call instead. E.g., without a "-m" option, GCC and Clang will turn
| "__builtin_popcount()" into a function call; likewise
| "std::popcount()" and "std::bitset<>::count()". Always use at least
| "-mavx".
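|
| A small illustration of that point (behaviour depends on compiler
| and flags): without an -m option that enables POPCNT, GCC and Clang
| will typically expand this to a call to a libgcc/compiler-rt helper
| such as __popcountdi2 rather than a single instruction.
|
|     #include <stdint.h>
|
|     unsigned bits_set(uint64_t x) {
|         return (unsigned)__builtin_popcountll(x);
|     }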
| RetroTechie wrote:
| The question is whether _performance critical_ code that would
| benefit from a popcount instruction is common enough to warrant
| inclusion.
|
| As long as it isn't a bottleneck in common software, a few
| shifts/masks/adds/integer multiplies or whatever are _very_ quick on
| modern CPUs. Often 1 cycle each, if not more than one such
| instruction in parallel per clock.
| ilyt wrote:
| I'd wager also yes.
|
| > As long as it isn't a bottleneck in common software, a few
| > shifts/masks/adds/integer multiplies or whatever are _very_ quick
| > on modern CPUs. Often 1 cycle each, if not more than one such
| > instruction in parallel per clock.
|
| One instruction takes less cache space than multiple instructions.
| At worst it might mean the difference between a hot loop fitting in
| cache or not.
| dalke wrote:
| I work in cheminformatics, and wrote one of the documents
| cited by Sagar.
|
| The answer is "yes and no".
|
| My area of focus is a part of molecular similarity. The overall idea
| is that molecules which are similar tend to have similar
| functionality. There are many approximate ways to estimate molecular
| similarity. The one I focus on maps chemical substructures
| ("features" or "descriptors") to one or several bits in a bitstring
| of length ~1024 bits, called a fingerprint.
|
| We use the Jaccard similarity (called Tanimoto similarity in my
| field) of two fingerprints as a proxy for molecular similarity,
| computed as popcount(A & B) / popcount(A | B). Since popcount(A) and
| popcount(B) can be pre-computed, this ends up being popcount(A & B)
| / (popcount(A) + popcount(B) - popcount(A & B)). If the fingerprints
| are grouped by popcount, then this boils down to computing
| popcount(A & B), plus a tiny amount of integer math.
|
| This can be used to find the k-nearest matches to a
| single query, to cluster the fingerprints, to identify
| diverse fingerprints, and so on.
|
| These methods are bounded by two things: 1. the time needed to
| compute the popcount of (A & B), and 2. the memory bandwidth.
|
| The CPU bottleneck really is the popcount calculation. At
| https://jcheminf.biomedcentral.com/articles/10.1186/s1332
| 1-0... I compared different popcount implementations and found
| POPCNT about 2-3x faster than the fastest pure-C popcount.
|
| However, POPCNT on Intel processors isn't all that fast. Rather,
| when I was really focused on this 5+ years ago, Intel processors
| only had one execution port that could handle POPCNT, so I could
| only process 8 bytes per cycle. (Some AMD processors have several
| such ports, but I never tried one of those CPUs.)
|
| Instead, Wojciech Mula, Nathan Kurz and Daniel Lemire
| showed that AVX2 instructions were even faster than
| sequential POPCNT instructions because AVX2 could handle
| more things in parallel. See my table for numbers.
|
| For small bitstrings (I think 512 bits was the boundary)
| POPCNT is still the fastest.
|
| With AVX2 it's fast enough that memory bandwidth becomes
| the limiting issue, and I need to start worrying about
| cache locality.
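|
| A minimal sketch of that calculation, assuming 1024-bit fingerprints
| stored as 16 x 64-bit words and pre-computed popcount(A) and
| popcount(B); names are illustrative, not from my actual code:
|
|     #include <stdint.h>
|
|     #define FP_WORDS 16
|
|     static double tanimoto(const uint64_t *a, const uint64_t *b,
|                            int popcnt_a, int popcnt_b) {
|         int intersect = 0;
|         for (int i = 0; i < FP_WORDS; ++i)
|             intersect += __builtin_popcountll(a[i] & b[i]);
|         return (double)intersect / (popcnt_a + popcnt_b - intersect);
|     }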
| inetknght wrote:
| > _It's of course useful, the question is it useful enough to
| warrant its own instruction_
|
| I argue: yes absolutely.
| [deleted]
| spacechild1 wrote:
| What's weird about popcount?
| Agingcoder wrote:
| That was my initial reaction as well. Counting bits arises
| quite naturally in many places.
| pwdisswordfishc wrote:
| Nothing. It's clickbait.
| tialaramex wrote:
| I'd guess to the "Surely my computer is a 1:1 mapping of the C
| programming language" people the expectation is that their CPU
| will have two increment instructions, two decrement
| instructions, and nothing like popcount. After all, C doesn't
| have a popcount operator.
|
| I have this terrible feeling that somewhere down the line we're
| going to get a generation of people who assumed the computer is
| a 1:1 mapping of Python -- a friend who teaches CS for a major
| university explained that they're going to teach Python as
| first language for Computer Science undergraduates starting
| 2024.
| SeanLuke wrote:
| > a friend who teaches CS for a major university explained
| that they're going to teach Python as first language for
| Computer Science undergraduates starting 2024.
|
| This is extremely widespread. My institution does it, as do
| many others I know of. I would not be surprised to find that
| Python is the most common initial programming language right
| now.
| sgerenser wrote:
| When I was in college (2001-2006) it was Java. I'm pretty
| sure the school already switched to Python for intro CS
| within the past few years.
| RetroTechie wrote:
| Well, there have been Java-bytecode- and Forth-optimized (if not
| directly executing) CPUs. So you never know. ;-)
| ilyt wrote:
| > I have this terrible feeling that somewhere down the line
| we're going to get a generation of people who assumed the
| computer is a 1:1 mapping of Python -- a friend who teaches
| CS for a major university explained that they're going to
| teach Python as first language for Computer Science
| undergraduates starting 2024.
|
| Probably better than starting with C, with all its footguns and its
| "optimistic" approach to guessing what the programmer meant...
| automatic6131 wrote:
| Having Python as a first programming language isn't a big
| disadvantage. The core elements of program optimization are still
| there: making sure you don't accidentally write O(n^2) code, etc.
| Even if the memory management is opaque, you end up with an idea of
| values vs. references, and of stacks and heaps, just from some
| surprising edge cases in the language.
| mhh__ wrote:
| I'm more worried about Python being assumed to be the same as every
| other language than being assumed to be the same as every CPU.
| sharikous wrote:
| What would be so terrible?
|
| Some people will be naturally curious, as always, and they
| will go deeper.
|
| And even languages like Javascript influenced the design of
| CPUs and even specific instructions. Nothing is "pure" in the
| computing field
| jpc0 wrote:
| > a friend who teaches CS for a major university explained
| that they're going to teach Python as first language for
| Computer Science undergraduates starting 2024
|
| I think this is significantly more common than you think it
| is.
|
| Also, Python has an extremely rich type system, can be completely
| duck-typed as well, and covers effectively everything you would need
| to know about in a CS degree, at least in the first semester or two,
| other than manual memory management, which is significantly easier
| to learn when you have a firm grasp of the other aspects of
| programming.
|
| There are multiple approaches to teaching; starting with Python and
| then moving lower down later isn't necessarily a bad move. Who says
| starting with assembly/C and moving up is a good move?
| ponderings wrote:
| We could make a slow human friendly machine code, show how
| it abstracts electronic components and then get into
| physics and chemistry.
|
| It would be as fascinating as unproductive.
| tialaramex wrote:
| Oxford and Cambridge both begin with an ML, the same way I
| was taught last century (although of course it's a
| different ML today, I believe one of them uses Ocaml).
|
| Unavoidably I am biased, but I nevertheless feel confident
| that this is the correct way to teach programming to CS
| students in particular. One reason to use an ML, and
| especially to use a less common ML, is to create a level
| playing field. Teaching Python, Java (as they do today at
| his institution), or C (yes that happens) means some of the
| fresh students have months, maybe years, even occasionally
| decades of prior practical experience in the "first
| language" before day one. These students will find most, or
| even all, of the exercises trivial and so they won't
| actually learn what you were teaching, which may surprise
| them too, they're also likely to be disruptive, after all
| they're bored because this is easy.
|
| Now, for just a "level playing field" you could choose
| almost anything unpopular these days. Pascal? Scheme?
| Fortran? But that's not the only reason. MLs have a nice
| type system, which is great because probably simultaneously
| on the theory side of their course your students are
| learning about types, and Hindley-Milner gets you from a
| theory to a real world, from why this is a good idea, to
| huh, the machine did exactly what I meant, that's neat. If
| your colleagues are teaching them a sane approach to types,
| but you're teaching a language which says YOLO, everything
| is just a bag of bits don't worry about it, that's not a
| cohesive message. Likewise for function composition. It'd
| be nice if the first language you're teaching has actual
| bona fide mandatory TCO, so that when the nice elegant
| explanation in the theory class shows tail calls, the
| result in your practical class isn't a room full of stack
| overflow errors (or worse).
|
| This is reasonable for a First Language because you're
| expecting to teach these students several languages. The
| same doesn't apply to say, an optional module for
| Geographers that gets them enough Python to use a GIS
| competently. But we are (hopefully) not relying on
| Geographers to invent the next Python.
| Philpax wrote:
| > I think this is significantly more common than you think
| it is.
|
| For context: it was the first language taught in my CS
| degree in 2014. They switched to Python from Java due to
| its pedagogical advantages - namely, students being able to
| focus more on the high-level algorithmic semantics than on
| language design minutiae (Java) or memory management / poor
| diagnostics (C).
|
| I wasn't a fan of the decision at the time, but I certainly
| see why they did it in hindsight!
| larodi wrote:
| Heh, you underestimate (or perhaps understate) the predictability of
| stupidity... somewhere down the road you'll have kids running
| for-loops on top of GB-large LLM models :D
|
| Now, seriously - it gets me thinking a lot that a certain generation
| is going to be the last to be interested in assembly, but actually
| there are always people fascinated by the beauty of assembly
| language.
| praptak wrote:
| > 1:1 mapping of the C programming language
|
| I was pretty surprised the first time I discovered the C
| compiler could "reverse engineer" a "logical or of two
| shifts" to generate a single rotate instruction.
| tialaramex wrote:
| Idiom recognition. Some compilers had idiom recognition
| before C existed, because in some of the early languages
| the literal interpretation of the common idioms is
| incredibly expensive to perform, but a compiler can
| recognise that actually it can take massive shortcuts.
|
| For example, sort this list of numbers into ascending
| order, then, take that sorted list, get the last number
| from it. With idiom recognition the compiler can say oh,
| I'll walk the list once, tracking the maximum as I go,
| that's much faster. In C this seems like, well, if I meant that I'd
| write that; but I think APL is an example where the list expression
| is idiomatic, and so that's what programmers will write even though
| they _want_ the fast thing.
| eska wrote:
| Maybe you'll feel better knowing that popcount has been
| standardized in C23
| tialaramex wrote:
| Thanks, I didn't know about that, so I just read a summary. It's
| kind of sad that even the C standard library is now obliged to put
| prefixes on names for namespace reasons, but yeah, it is good.
| eska wrote:
| To be fair, it's no different in C++, even though C++ has
| namespaces: one still has to write "std::foo()" in headers (and
| therefore in template code) in order not to pollute the global
| namespace anyway.
|
| As long as they stop adding locale and global/thread-
| state dependent functions I'll be quiet :)
| sigsev_251 wrote:
| Regarding locale, let's hope that the text conversion API makes it
| into the next version so we can all work in Unicode and be happy.
| adrian_b wrote:
| The only weird thing about it is that, even though it was already
| implemented in the first commercial computer made with vacuum tubes,
| and even though its hardware implementation is cheap and can be much
| faster than a software implementation, it was omitted from most
| later computers (with the exception of the most expensive ones,
| which were typically used mainly by government agencies) until AMD
| decided to implement it in AMD Barcelona (2007).
|
| Once AMD had it, Intel followed soon after in Nehalem (2009), and
| now most CPUs have it.
|
| The article is somewhat misleading about population count being
| available in ARM NEON since 2005 (i.e. since Cortex-A8).
|
| The 64-bit Armv8-A ISA, which was introduced in 2012, has the
| CNT instruction, which is indeed a population count
| instruction.
|
| The 32-bit Armv7-A ISA has the VCNT instruction, which is a
| variant of population count, but which is much less useful than
| the 64-bit POPCNT provided by Cray 1, AMD Barcelona, Intel
| Nehalem or Armv8-A, because it works only on 8-bit words, so
| executing the equivalent of a 64-bit POPCNT requires a sequence
| of several instructions.
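|
| For illustration, a 64-bit popcount built from the per-byte count
| plus a horizontal add on AArch64 (a sketch using the vcnt_u8 and
| vaddv_u8 intrinsics):
|
|     #include <arm_neon.h>
|     #include <stdint.h>
|
|     static unsigned popcount64_neon(uint64_t x) {
|         uint8x8_t bytes = vcnt_u8(vcreate_u8(x)); /* popcount of each byte */
|         return vaddv_u8(bytes);                   /* sum the 8 byte counts */
|     }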
| stassats wrote:
| CNT on arm64 counts 8 or 16-bit elements too.
| zzo38computer wrote:
| Bit population count is useful for many things. MMIX has SADD
| ("sideways add", which is a bit population count), and also MOR
| ("multiple OR") and MXOR ("multiple XOR"), which are also useful
| (for many purposes, including endianness conversion (even
| PDP-endian), reversing the bits in a byte, hashing, and others) but
| which I have not seen in other instruction sets.
|
| In GNU C, you can use the built-in function __builtin_popcount to
| compute the bit population count.
| [deleted]
| [deleted]
| jagrsw wrote:
| Hamming distance (and popcnt) is also used in gradient descent
| scenarios.
|
| Eg. here's part of fuzzing engine code, where it tries to produce
| an input which matches some form of variable check inside the
| fuzzed code:
|
| https://github.com/google/honggfuzz/blob/1f09e3bfb0785d9b311...
|     register uint8_t v = ((sizeof(Arg1) * 8) -
|                           __builtin_popcount(Arg1 ^ Arg2));
|     uint8_t prev = ATOMIC_GET(globalCovFeedback->bbMapCmp[pos]);
|     if (prev < v) { ...
| dbcurtis wrote:
| Ah yes, pop count. Back in my days as a CPU logic designer on
| refrigerated, walk-in, scientific mainframes, I was working on a
| vector processor that was intended to be a competitor to Cray-1,
| but sank due to poor marketing and a collapse in oil
| prospecting... but I digress. A few of us on the team had come
| from previous CPUs at other companies, and argued with the
| architecture committee that we should have a vector pop count
| instruction in order to get some interest from No Such Agency.
| Shot down, of course.
|
| When serial No. 1 was in initial benchmarking, two guys named Frank
| and George with No Last Names, who lived somewhere near Fort Meade,
| said they would buy at least one if we added vector pop count.
| angiosperm wrote:
| Popcount eventually gets added to every architecture that lacks
| it, always at great expense. You would think people would have
| learned by now.
|
| And, no, not typically just because the spooks want it.
| NotYourLawyer wrote:
| Why, then?
| bruce343434 wrote:
| I've read this story before but can't remember where
| dbcurtis wrote:
| I suspect it played out multiple times at different
| companies.
| [deleted]
| agubelu wrote:
| To add to some of the uses of popcount, it is also used in chess
| engines. Since a chess board has 64 squares, many engines use 64-bit
| unsigned integers to represent the whole board, mapping each square
| to a bit; these are called bitboards. Each combination of piece type
| and color is then given its own bitboard, for example a bitboard for
| the positions of the white rooks. The squares attacked by a piece
| can similarly be represented by a bitboard.
|
| This in turn allows for the use of something called magic bitboards,
| which are essentially compressed lookup tables of the squares to
| which a rook or a bishop can move for every possible configuration
| of the board. Having all these possibilities precomputed speeds up
| move generation significantly.
|
| Popcount is used, for example, to know the number of pieces of a
| certain type on the board, the number of squares to which a piece
| can move, or the number of pieces attacking a given square.
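|
| A small illustrative sketch (types and names made up) of the kinds
| of popcount queries described above:
|
|     #include <stdint.h>
|
|     typedef uint64_t Bitboard;   /* one bit per square */
|
|     /* How many pieces of this type are on the board? */
|     static int piece_count(Bitboard pieces) {
|         return __builtin_popcountll(pieces);
|     }
|
|     /* How many pieces of one side attack a given square? */
|     static int attacker_count(Bitboard attacks_to_square,
|                               Bitboard side_pieces) {
|         return __builtin_popcountll(attacks_to_square & side_pieces);
|     }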
| WaffleIronMaker wrote:
| If you want to learn more about Hamming Codes from a mathematical
| perspective, I recommend 3Blue1Brown's YouTube series:
|
| > How to send a self-correcting message (Hamming codes) -
| https://youtu.be/X8jsijhllIA
|
| > Hamming codes part 2, the elegance of it all -
| https://youtu.be/b3NxrZOu_CE
|
| Associated with Ben Eater's video, if you want a perspective
| regarding low level computer hardware:
|
| > What is error correction? Hamming codes in hardware -
| https://youtu.be/h0jloehRKas
| ks2048 wrote:
| I recently used popcount for an audio-fingerprinting project [0]
| to compute the difference in binary "audio fingerprints". With
| AVX2 and 4 cores, my desktop managed 448 billion bit-diff calcs
| per second! [0] https://kenschutte.com/phingerprint/
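|
| The core of the bit-diff is just XOR plus popcount per word; a
| hedged sketch (the fingerprint layout here is an assumption, not the
| project's actual format):
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     static unsigned bit_diff(const uint64_t *a, const uint64_t *b,
|                              size_t words) {
|         unsigned d = 0;
|         for (size_t i = 0; i < words; ++i)
|             d += (unsigned)__builtin_popcountll(a[i] ^ b[i]);
|         return d;
|     }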
___________________________________________________________________
(page generated 2023-08-13 23:00 UTC)