[HN Gopher] Popcount CPU instruction (2019)
___________________________________________________________________
Popcount CPU instruction (2019)
Author : softwaredoug
Score : 181 points
Date : 2023-08-13 14:39 UTC (8 hours ago)
(HTM) web link (vaibhavsagar.com)
(TXT) w3m dump (vaibhavsagar.com)
| angiosperm wrote:
| I still don't know of _any_ RISC-V implementation that supports
| the POPC (and related) instructions. Would like to be pointed at
| any.
|
| It is generally a bad idea to field an instruction-set
| architecture that turns common, brilliant optimizations into
| gross pessimizations.
| camel-cdr wrote:
| The bit manipulation [0] extension has been ratified for a
| while now and is part of the RVA22 application extension
| profile [1].
|
| You can already buy SoCs that support it, e.g. the VisionFive 2 and
| the Star64.
|
| Interestingly, the RISC-V vector extension [2] has its own popcount
| instruction for vector register masks. This is needed because the
| scalable architecture doesn't guarantee that a vector mask fits into
| a 64-bit register, so vector masks are stored in a single LMUL=1
| register. This works really well, because with LMUL=8 and SEW=8 you
| get 100% utilization of that single LMUL=1 vector register.
|
| Another interesting thing is that the vector crypto extension will
| likely introduce an element-wise popcount instruction [3].
|
| [0] https://github.com/riscv/riscv-bitmanip
|
| [1] https://github.com/riscv/riscv-profiles
|
| [2] https://github.com/riscv/riscv-v-spec
|
| [3] https://github.com/riscv/riscv-
| crypto/blob/master/doc/vector...
| SushiHippie wrote:
| Reading the title, I thought it was about Debian's popularity
| contest "popcon": https://popcon.debian.org/
| adrian_b wrote:
| > it has in fact been present in CPU architectures since at least
| 1961
|
| Actually, it was first present in an electronic computer in February
| 1951, in the Ferranti Mark I, which was the first commercially
| available electronic computer, slightly ahead of the Univac I.
|
| "Population count" is a name that was first used in Cray-1
| (1976).
|
| In the Ferranti Mark I, this instruction was named "sideways add",
| and it was one of a few instructions included in the instruction set
| at the request of Alan Turing, so he is the original author of this
| "weird" instruction.
|
| In CDC 6600 (August 1963), the name of this instruction was "sum
| of 1's".
| svat wrote:
| Indeed, Knuth writes in TAOCP Vol 4A (in section "7.1.3.
| Bitwise Tricks and Techniques", previously pre-fascicle 1a:
| https://cs.stanford.edu/~knuth/fasc1a.ps.gz):
|
| > Early machine designers provided fullword bitwise operations
| in their computers primarily because such instructions could be
| included in a machine's repertoire almost for free. Binary
| logic seemed to be potentially useful, although only a few
| applications were originally foreseen. [...] The Manchester
| Mark I computer, built at about the same time, included not
| only bitwise AND, but also OR and XOR. When Alan Turing wrote
| the first programming manual for the Mark I in 1950, he
| remarked that bitwise NOT can be obtained by using XOR (denoted
| '[?]') in combination with a row of 1s. R. A. Brooker, who
| extended Turing's manual in 1952 when the Mark II computer was
| being designed, remarked further that OR could be used "to
| round off a number by forcing 1 into its least significant
| digit position." By this time the Mark II, which was to become
| the prototype of the Ferranti Mercury, had also acquired new
| instructions for sideways addition and for the position of the
| most significant 1.
|
| The "sideways add" operation (aka population count, sum of
| binary digits, number of 1s, ...) finds mention in all five (so
| far) volumes of TAOCP, and he even uses a special notation νx
| for the sum of the bits of x. It has always been of interest
| (from later in the same volume):
|
| > The first textbook on programming, _The Preparation of
| Programs for an Electronic Digital Computer_ by Wilkes,
| Wheeler, and Gill, second edition (Reading, Mass.: Addison-
| Wesley, 1957), 155, 191-193, presented an interesting
| subroutine for sideways addition due to D. B. Gillies and J. C.
| P. Miller.
| jon_richards wrote:
| And in 1955 you had Arithmetic Operations in Digital
| Computers. https://archive.org/details/arithmetic_operations_
| in_digital...
| dalke wrote:
| In case you are curious, here is the algorithm from that 1957
| textbook (2nd edition): https://archive.org/details/programs
| forelect00wilk/page/190/...
|
| I never did sit down and figure out the notation.
|
| It isn't in the 1951 (1st edition), at
| https://archive.org/details/preparationofpro0000maur
| vanderZwan wrote:
| > _at the request of Alan Turing_
|
| Well, I can imagine there might actually have been a connection
| to cryptography then. Not NSA related though.
| Sharlin wrote:
| > this instruction was named "sideways add"
|
| An early instance of what's now more commonly called a
| "horizontal" operation in the context of SIMD.
| stabbles wrote:
| which are slow.
|
| If you wanna count the number of occurrences of a certain character
| in a string, you ultimately have to use popcount on an ordinary
| register; there's no corresponding SIMD instruction (at least in <=
| AVX2). Something along the lines of:
|
|     __builtin_popcount(_mm256_movemask_epi8(
|         _mm256_cmpeq_epi8(haystack, needle)));
|
| So, if you wanna avoid horizontal reductions, you'd better _sum_ the
| results of _mm256_cmpeq_epi8 and effectively keep track of 32
| counters of 1-byte integers.
|
| Meaning that you have an outer loop over the string, and a
| fixed-length inner loop of 255 iterations in which you
| accumulate counts, and after the inner loop a horizontal sum.
|
| Then you go from 1 horizontal operation per iteration to 1
| horizontal operation per 255 iterations.
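|
| Something like this minimal sketch of that structure (assuming AVX2
| and a length that's a multiple of 32 bytes; the helper name and
| layout are illustrative only):
|
|     #include <immintrin.h>
|     #include <stddef.h>
|     #include <stdint.h>
|
|     static uint64_t count_byte_avx2(const uint8_t *s, size_t n, uint8_t c) {
|         const __m256i needle = _mm256_set1_epi8((char)c);
|         uint64_t total = 0;
|         size_t i = 0;
|         while (i < n) {
|             /* Inner loop: at most 255 blocks, so each unsigned byte
|                counter (incremented by 1 per match) cannot overflow. */
|             size_t blocks = (n - i) / 32;
|             if (blocks > 255) blocks = 255;
|             __m256i counts = _mm256_setzero_si256();
|             for (size_t b = 0; b < blocks; ++b, i += 32) {
|                 __m256i chunk = _mm256_loadu_si256((const __m256i *)(s + i));
|                 __m256i eq = _mm256_cmpeq_epi8(chunk, needle); /* 0xFF on match */
|                 counts = _mm256_sub_epi8(counts, eq);          /* -(-1) == +1 */
|             }
|             /* One horizontal reduction per inner loop. */
|             uint8_t tmp[32];
|             _mm256_storeu_si256((__m256i *)tmp, counts);
|             for (int j = 0; j < 32; ++j) total += tmp[j];
|         }
|         return total;
|     }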
| ack_complete wrote:
| There is actually such an instruction, psadbw
| (_mm_sad_epu8). It's not noticeably faster for the scenario you
| describe, since it only affects the outer loop, but it does avoid
| needing the popcnt instruction, since psadbw only requires SSE2.
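|
| A hedged sketch of that reduction, using the 256-bit analogue
| _mm256_sad_epu8 against zero to sum the 32 byte counters from the
| inner loop above (assumes AVX2; the helper name is made up):
|
|     #include <immintrin.h>
|     #include <stdint.h>
|
|     static uint64_t hsum_bytes_avx2(__m256i counts) {
|         /* vpsadbw against zero sums each group of 8 bytes into a
|            64-bit lane. */
|         __m256i sums = _mm256_sad_epu8(counts, _mm256_setzero_si256());
|         return (uint64_t)_mm256_extract_epi64(sums, 0)
|              + (uint64_t)_mm256_extract_epi64(sums, 1)
|              + (uint64_t)_mm256_extract_epi64(sums, 2)
|              + (uint64_t)_mm256_extract_epi64(sums, 3);
|     }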
| 10000truths wrote:
| With AVX512, you can use masking and the vpopcntb
| instruction to avoid moving values between GPRs and vector
| registers until the very end: uint64_t
| count_occurrences(const __m512i* vec_string, size_t
| num_vecs, char target) { __m512i count_vec =
| _mm512_set1_epi64(0); __m512i target_vec =
| _mm512_set1_epi8(target); __m512i increment_vec =
| _mm512_set1_epi8(1); for(size_t i = 0; i <
| num_vecs; ++i) { __mmask64 cmp_mask =
| _mm512_cmpeq_epu8_mask(target_vec, vec_string[i]);
| count_vec += _mm512_maskz_popcnt_epi8(cmp_mask,
| increment_vec); } return
| _mm512_reduce_add_epi64(count_vec); }
| Aardwolf wrote:
| Why was there no standard operator for this in C since the 1970s?
| You got various bit shifts and bitwise logic operators, so why not
| this? std::popcount has existed in C++ since C++20, but that took a
| while given how long this instruction has been in CPUs.
| Someone wrote:
| Because almost nobody had a computer that supported it when C was
| created. That article mentions three CPU architectures that had the
| instruction by the mid-1970s:
|
|     1961: IBM Stretch
|     1964: CDC 6000
|     1975: Cray-1
|
| After that, it mentions further support in 2005, with SPARC.
|
| Reading https://en.wikipedia.org/wiki/IBM_7030_Stretch, a grand
| total of 9 IBM Stretch machines were sold.
|
| In comparison, the CDC 6600 was a wild success, with "100+" sales
| (https://en.wikipedia.org/wiki/CDC_6600).
|
| The Cray-1 scores similarly with "over 100"
| (https://en.wikipedia.org/wiki/Cray-1)
|
| C is from 1972 and started life on the PDP-7, and was soon ported to
| the PDP-11, of which "around 600,000" were sold
| (https://en.wikipedia.org/wiki/PDP-11; I guess most of these were
| sold in the late 1970s and early 1980s, but I think it's a safe bet
| they were almost immediately more common than those 3 machines).
| matthews2 wrote:
| At least GCC and Clang can recognise most sane implementations
| of popcount and replace it with an instruction.
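|
| For example, a plain-C loop like this sketch; whether it actually
| becomes a single instruction depends on compiler version and target
| flags such as -mpopcnt or -march=native:
|
|     #include <stdint.h>
|
|     /* Kernighan-style popcount loop that recent GCC/Clang can
|        usually pattern-match into a single popcount instruction. */
|     static unsigned popcount64(uint64_t x) {
|         unsigned n = 0;
|         while (x) {
|             x &= x - 1;   /* clear the lowest set bit */
|             ++n;
|         }
|         return n;
|     }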
| pantalaimon wrote:
| We do have `stdc_popcount()` now with C23
| sigsev_251 wrote:
| Isn't it stdc_count_ones?
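|
| A minimal C23 sketch, assuming a libc that already ships <stdbit.h>
| (the type-generic macro is stdc_count_ones, with suffixed forms such
| as stdc_count_ones_ull):
|
|     #include <stdbit.h>   /* C23 */
|     #include <stdio.h>
|
|     int main(void) {
|         unsigned long long x = 0xF0F0F0F0F0F0F0F0ull;
|         printf("%u\n", (unsigned)stdc_count_ones(x));  /* prints 32 */
|         return 0;
|     }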
| dataflow wrote:
| Maybe because it hasn't been in x86 CPUs that long. Like Intel
| added it in 2009 or so. And there were intrinsics for it so I'm
| guessing it wasn't a high priority.
| ralferoo wrote:
| On the PS3 SPU cores, the best "does everything" instruction was the
| "rotate and mask" series of instructions, which was the underlying
| implementation of all the rotate, shift, and bitfield-extraction
| instructions. That architecture is still my favourite years on -
| every register was vectorised by default, and the instruction set
| was a minimalistic one that achieved a lot of unusual operations
| that were incredibly useful for graphics or signal-processing work.
| Often you couldn't find an exact instruction to
| do what you wanted, but if you studied the documentation long
| enough, you'd find one that generalised it and actually did do
| what you wanted, as well as gracefully handling loads of other
| edge cases you hadn't thought of - for instance the various
| instructions to create shuffle masks so that lots of
| byte/word/long re-ordering operations could be done with the
| single shuffle instruction, which could be used to emulate
| unaligned memory access using aligned-only memory accesses.
|
| Not exactly weird, but on the PowerPC side, "eieio" is my favourite
| name for an instruction on any architecture - without fail, every
| time I come across it, I start humming Old MacDonald!
| Findecanor wrote:
| Continuing the tradition of having _ridiculously_ named
| instructions: when PowerPC went 64-bit, it got a rotate-and-mask
| instruction named `rldicl` - Rotate Left Doubleword Immediate then
| Clear Left.
| [deleted]
| Everdred2dx wrote:
| Not a whole lot of asm experience here- what does it mean for a
| register to be vectorized?
| nwallin wrote:
| SIMD. Instead of having one value in one register, you have
| several values per register. (usually a small power of 2. SSE
| is 4 floats per register. AVX512 is 16 floats per register.)
| corysama wrote:
| And, when you perform an operation on a register, like an
| add, the operation works on all of the values in that
| register simultaneously in a single instruction.
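|
| As a tiny illustration (AVX intrinsics in C; purely for exposition):
|
|     #include <immintrin.h>
|
|     /* One __m256 register holds 8 floats; a single vaddps adds all
|        8 lanes at once. */
|     __m256 add8(__m256 a, __m256 b) {
|         return _mm256_add_ps(a, b);
|     }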
| icelusxl wrote:
| > "rotate and mask" series of instructions
|
| Sounds a lot like the PowerPC's rlwinm instruction.
|
| The PowerPC 600 series, part 5: Rotates and shifts
|
| https://devblogs.microsoft.com/oldnewthing/20180810-00/?p=99...
| Vogtinator wrote:
| There are similar instructions on PowerPC (rlwinm) and bfi/ubfx
| instructions on ARMv8.
| ktm5j wrote:
| There's also a gnu error code called EIEIO:
| https://www.gnu.org/software/libc/manual/html_node/Error-Cod...
| cosmojg wrote:
| Macro: int ED
|     "?." The experienced user will know what is wrong.
|
| Macro: int EGREGIOUS
|     "You really blew it this time." You did what?
|
| Macro: int EIEIO
|     "Computer bought the farm." Go home and have a glass of warm,
|     dairy-fresh milk.
|
| Macro: int EGRATUITOUS
|     "Gratuitous error." This error code has no purpose.
| moritzwarhier wrote:
| And apparently also
|
| > EGREGIOUS
|
| and
|
| > EGRATUITOUS
|
| and... lots of stuff ^^
| adamnemecek wrote:
| Do you happen to have a link to said docs?
| jjtheblunt wrote:
| I'd start here
|
| https://www.psdevwiki.com/ps3/Cell_Programming_IBM
| ralferoo wrote:
| It looks like my memory is playing tricks and the combined
| instruction that does everything was just on the PPU (the
| PowerPC part) and not the SPU.
|
| But the SPU ISA docs are available in many places, just
| googled and found this mirror:
| http://www.ece.stonybrook.edu/~midor/ESE545/SPU_ISA_v12.pdf -
| the various shifts start on page 117, but they're simpler
| than I remember - shift left in various forms, rotate (right)
| and mask (i.e. shift right) and rotate (right).
|
| But, even looking at the instructions, it's not immediately obvious
| that every instruction is operating on a 128-bit vector - e.g. shl
| operates on 4 32-bit fields at once, and shlh operates on 8 16-bit
| fields at once, so e.g. a single instruction can even shift each
| field by a different amount. Almost all instructions operate this
| way; only a few use just the first field in a vector, e.g.
| conditional branches (brz, brnz, brhz, brhnz). The other rotate
| instructions are interesting - you can do a bitwise rotate across
| the entire 128-bit vector, rotate by bytes (very useful for memory
| accesses), and there's the wonder instruction shufb (shuffle bytes,
| see page 116), which allows each byte of the result to be pulled
| from a specific byte of one of 2 vectors, or from a constant
| (0, 0x80, 0xFF), which allows you to do all manner of things, for
| example byte-order swapping or merging different values together.
|
| So, you have this really powerful instruction set that's
| vectorised on almost every instruction, and then you have a
| massive 128 registers per core coupled with 32KB local cache
| memory which is single cycle access. It's an absolute joy to
| program for on anything which will fit into the 32KB local
| cache (which includes your program). The only downside is
| that anything that doesn't fit into 32KB requires you to
| start a DMA request to copy a chunk to or from main memory,
| and wait for it to complete and/or continue doing other work
| in the background. With suitable workloads operating on 10KB
| or less, many people used this in a true double-buffered
| fashion, so the next set of data was transferring while you
| were processing the previous set, but this made lots of tasks
| quite complicated and many developers found it too hard to
| make good use of the SPU, and so it was often sadly under-
| utilised. One interesting effect though, is that DMA'ing your
| buffers took a similar time to a cache miss on the PowerPC
| cores, so actually even a naive approach that always
| triggered a small DMA from main memory in parallel with other
| operations often wouldn't see much latency and might still
| even be quicker than running the same task on the PowerPC
| cores where there was less opportunity for doing other work
| during a cache miss.
|
| Another interesting feature of this CPU is that you can do a
| pre-emptive branch prediction, so that the instruction cache is
| already full when a branch is taken, meaning that a branch can
| execute in a single cycle. Coupled with indirect branches (i.e.
| using a register value), you can give the CPU a branch hint in
| advance of the end of a loop as to whether it'll terminate or not,
| and also avoid stalls. In some cases this could outperform the x86
| branch prediction that was common at the time, which just maintained
| a cache of the last target of recent branches.
|
| Overall, SPU programming was tough to master, but once you
| got something working the performance was phenomenal. I think
| that's why I liked it so much - because the sense of
| accomplishment was so tangible once you'd battled through it
| and emerged the other side with some amazing code. But
| nowadays, the power it provided, that was unheard of at the
| time (7 cores at 3GHz), isn't anything special over a mid-
| range i9.
| jb1991 wrote:
| Title checks out. I don't believe a word of it.
| fear91 wrote:
| It's a very useful instruction. Go uses it extensively in its
| core runtime and the compiler itself. I assume other languages do
| too if they care about performance.
|
| Examples:
|
| 1. Growing a slice.
|
| 2. Runtime's heap bitmap.
|
| 3. Page allocation.
|
| 4. malloc.
|
| 5. Register allocation.
| cpach wrote:
| If you want an example where popcount is very useful, check out
| this challenge from Cryptopals:
| https://cryptopals.com/sets/1/challenges/6
| fegu wrote:
| Haskell added a function popCount in
| https://hackage.haskell.org/package/base-4.18.0.0/docs/Data-...
| some years ago to allow easy use of this operation where present.
| inetknght wrote:
| > You might be wondering, like I was, if there's more to this
| instruction, but that's all it does! This doesn't seem very
| useful, right?
|
| It seems _very_ useful. I've used it for several algorithms
| including DNA analysis and vectorized math.
| nwoli wrote:
| It's of course useful; the question is whether it's useful enough to
| warrant its own instruction.
| chaboud wrote:
| Whether it's shift, mask, multiply, Kernighan's method, or
| something else, it's going to be multiple instructions and
| tens or hundreds of cycles to do this in software.
|
| Popcount instructions take a handful of cycles to run (~10 on ARM,
| ~3 on Intel?). It's one of those things that silicon can do very
| well.
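|
| For reference, here is a sketch of the classic shift/mask/multiply
| (SWAR) fallback for a 64-bit word, one of the software options
| mentioned above:
|
|     #include <stdint.h>
|
|     static unsigned popcount64_swar(uint64_t x) {
|         x = x - ((x >> 1) & 0x5555555555555555ull);           /* 2-bit sums */
|         x = (x & 0x3333333333333333ull)
|           + ((x >> 2) & 0x3333333333333333ull);               /* 4-bit sums */
|         x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;           /* 8-bit sums */
|         return (unsigned)((x * 0x0101010101010101ull) >> 56); /* add the bytes */
|     }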
| fear91 wrote:
| New Intel/AMD CPUs do a register-based popcount in a single clock
| cycle.
| angiosperm wrote:
| Used to be three cycles.
|
| Unfortunately, the original AMD64 back in 200x lacked a popcount, so
| most software built for PCs even today lacks any instances of the
| instruction. The means of getting the instruction generated are
| finicky, non-portable, and often unexpectedly result in a function
| call instead. E.g., without a "-m" option, GCC and Clang will turn
| "__builtin_popcount()" into a function call; likewise
| "std::popcount()" and "std::bitset<>::count()". Always use at least
| "-mavx".
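|
| A small illustration of that point (behaviour depends on compiler
| and flags): without an -m option that enables POPCNT, GCC and Clang
| will typically expand this to a call to a libgcc/compiler-rt helper
| such as __popcountdi2 rather than a single instruction.
|
|     #include <stdint.h>
|
|     unsigned bits_set(uint64_t x) {
|         return (unsigned)__builtin_popcountll(x);
|     }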
| RetroTechie wrote:
| The question is whether _performance critical_ code that would
| benefit from a popcount instruction is common enough to warrant
| inclusion.
|
| As long as it isn't a bottleneck in common software, a few
| shifts/masks/adds/integer multiplies or whatever are _very_ quick on
| modern CPUs. Often 1 cycle each, if not more than one such
| instruction in parallel per clock.
| ilyt wrote:
| I'd wager also yes.
|
| > As long as it isn't a bottleneck in common software, a few
| > shifts/masks/adds/integer multiplies or whatever are _very_ quick
| > on modern CPUs. Often 1 cycle each, if not more than one such
| > instruction in parallel per clock.
|
| One instruction takes less cache space than multiple instructions.
| At worst it might mean the difference between a hot loop fitting in
| cache or not.
| dalke wrote:
| I work in cheminformatics, and wrote one of the documents
| cited by Sagar.
|
| The answer is "yes and no".
|
| My area of focus is a part of molecular similarity. The overall idea
| is that molecules which are similar tend to have similar
| functionality. There are many approximate ways to estimate molecular
| similarity. The one I focus on maps chemical substructures
| ("features" or "descriptors") to one or several bits in a bitstring
| of length ~1024 bits, called a fingerprint.
|
| We use the Jaccard similarity (called Tanimoto similarity in my
| field) of two fingerprints as a proxy for molecular similarity,
| computed as popcount(A & B) / popcount(A | B). Since popcount(A) and
| popcount(B) can be pre-computed, this ends up being popcount(A & B)
| / (popcount(A) + popcount(B) - popcount(A & B)). If the fingerprints
| are grouped by popcount, then this boils down to computing
| popcount(A & B), plus a tiny amount of integer math.
|
| This can be used to find the k-nearest matches to a
| single query, to cluster the fingerprints, to identify
| diverse fingerprints, and so on.
|
| These methods are bounded by two things: 1. the time needed to
| compute the popcount of (A & B), and 2. the memory bandwidth.
|
| The CPU bottleneck really is the popcount calculation. At
| https://jcheminf.biomedcentral.com/articles/10.1186/s1332
| 1-0... I compared different popcount implementations and found
| POPCNT about 2-3x faster than the fastest pure-C popcount.
|
| However, POPCNT on Intel processors isn't all that fast. Rather,
| when I was really focused on this 5+ years ago, Intel processors
| only had one execution port that could handle POPCNT, so I could
| only process 8 bytes per cycle. (Some AMD processors have several
| such ports, but I never tried one of those CPUs.)
|
| Instead, Wojciech Mula, Nathan Kurz and Daniel Lemire
| showed that AVX2 instructions were even faster than
| sequential POPCNT instructions because AVX2 could handle
| more things in parallel. See my table for numbers.
|
| For small bitstrings (I think 512 bits was the boundary)
| POPCNT is still the fastest.
|
| With AVX2 it's fast enough that memory bandwidth becomes
| the limiting issue, and I need to start worrying about
| cache locality.
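|
| A minimal sketch of that calculation, assuming 1024-bit fingerprints
| stored as 16 x 64-bit words and pre-computed popcount(A) and
| popcount(B); names are illustrative, not from my actual code:
|
|     #include <stdint.h>
|
|     #define FP_WORDS 16
|
|     static double tanimoto(const uint64_t *a, const uint64_t *b,
|                            int popcnt_a, int popcnt_b) {
|         int intersect = 0;
|         for (int i = 0; i < FP_WORDS; ++i)
|             intersect += __builtin_popcountll(a[i] & b[i]);
|         return (double)intersect / (popcnt_a + popcnt_b - intersect);
|     }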
| inetknght wrote:
| > _It's of course useful, the question is it useful enough to
| warrant its own instruction_
|
| I argue: yes absolutely.
| [deleted]
| spacechild1 wrote:
| What's weird about popcount?
| Agingcoder wrote:
| That was my initial reaction as well. Counting bits arises
| quite naturally in many places.
| pwdisswordfishc wrote:
| Nothing. It's clickbait.
| tialaramex wrote:
| I'd guess to the "Surely my computer is a 1:1 mapping of the C
| programming language" people the expectation is that their CPU
| will have two increment instructions, two decrement
| instructions, and nothing like popcount. After all, C doesn't
| have a popcount operator.
|
| I have this terrible feeling that somewhere down the line we're
| going to get a generation of people who assumed the computer is
| a 1:1 mapping of Python -- a friend who teaches CS for a major
| university explained that they're going to teach Python as
| first language for Computer Science undergraduates starting
| 2024.
| SeanLuke wrote:
| > a friend who teaches CS for a major university explained
| that they're going to teach Python as first language for
| Computer Science undergraduates starting 2024.
|
| This is extremely widespread. My institution does it, as do
| many others I know of. I would not be surprised to find that
| Python is the most common initial programming language right
| now.
| sgerenser wrote:
| When I was in college (2001-2006) it was Java. I'm pretty
| sure the school already switched to Python for intro CS
| within the past few years.
| RetroTechie wrote:
| Well, there have been Java-bytecode- and Forth-optimized (if not
| directly executing) CPUs. So you never know. ;-)
| ilyt wrote:
| > I have this terrible feeling that somewhere down the line
| we're going to get a generation of people who assumed the
| computer is a 1:1 mapping of Python -- a friend who teaches
| CS for a major university explained that they're going to
| teach Python as first language for Computer Science
| undergraduates starting 2024.
|
| Probably better than starting with C, with all its footguns and its
| "optimistic" approach to guessing what the programmer meant...
| automatic6131 wrote:
| Having Python as a first programming language isn't a big
| disadvantage. The core elements of program optimization are still
| there: making sure you don't accidentally write O(n^2) code, etc.
| Even if the memory management is opaque, you end up with an idea of
| values vs. references, and of stacks and heaps, just from some
| surprising edge cases in the language.
| mhh__ wrote:
| I'm more worried about Python being assumed to be the same as every
| other language than being assumed to be the same as every CPU.
| sharikous wrote:
| What would be so terrible?
|
| Some people will be naturally curious, as always, and they
| will go deeper.
|
| And even languages like Javascript influenced the design of
| CPUs and even specific instructions. Nothing is "pure" in the
| computing field
| jpc0 wrote:
| > a friend who teaches CS for a major university explained
| that they're going to teach Python as first language for
| Computer Science undergraduates starting 2024
|
| I think this is significantly more common than you think it
| is.
|
| Also, Python has an extremely rich type system, can be completely
| duck-typed as well, and covers effectively everything you would need
| to know about in a CS degree, at least in the first semester or two,
| other than manual memory management, which is significantly easier
| to learn when you have a firm grasp of the other aspects of
| programming.
|
| There are multiple approaches to teaching; starting with Python and
| then moving lower down later isn't necessarily a bad move. Who says
| starting with assembly/C and moving up is a good move?
| ponderings wrote:
| We could make a slow human friendly machine code, show how
| it abstracts electronic components and then get into
| physics and chemistry.
|
| It would be as fascinating as unproductive.
| tialaramex wrote:
| Oxford and Cambridge both begin with an ML, the same way I
| was taught last century (although of course it's a
| different ML today, I believe one of them uses Ocaml).
|
| Unavoidably I am biased, but I nevertheless feel confident
| that this is the correct way to teach programming to CS
| students in particular. One reason to use an ML, and
| especially to use a less common ML, is to create a level
| playing field. Teaching Python, Java (as they do today at
| his institution), or C (yes that happens) means some of the
| fresh students have months, maybe years, even occasionally
| decades of prior practical experience in the "first
| language" before day one. These students will find most, or
| even all, of the exercises trivial and so they won't
| actually learn what you were teaching, which may surprise
| them too, they're also likely to be disruptive, after all
| they're bored because this is easy.
|
| Now, for just a "level playing field" you could choose
| almost anything unpopular these days. Pascal? Scheme?
| Fortran? But that's not the only reason. MLs have a nice
| type system, which is great because probably simultaneously
| on the theory side of their course your students are
| learning about types, and Hindley-Milner gets you from a
| theory to a real world, from why this is a good idea, to
| huh, the machine did exactly what I meant, that's neat. If
| your colleagues are teaching them a sane approach to types,
| but you're teaching a language which says YOLO, everything
| is just a bag of bits don't worry about it, that's not a
| cohesive message. Likewise for function composition. It'd
| be nice if the first language you're teaching has actual
| bona fide mandatory TCO, so that when the nice elegant
| explanation in the theory class shows tail calls, the
| result in your practical class isn't a room full of stack
| overflow errors (or worse).
|
| This is reasonable for a First Language because you're
| expecting to teach these students several languages. The
| same doesn't apply to say, an optional module for
| Geographers that gets them enough Python to use a GIS
| competently. But we are (hopefully) not relying on
| Geographers to invent the next Python.
| Philpax wrote:
| > I think this is significantly more common than you think
| it is.
|
| For context: it was the first language taught in my CS
| degree in 2014. They switched to Python from Java due to
| its pedagogical advantages - namely, students being able to
| focus more on the high-level algorithmic semantics than on
| language design minutiae (Java) or memory management / poor
| diagnostics (C).
|
| I wasn't a fan of the decision at the time, but I certainly
| see why they did it in hindsight!
| larodi wrote:
| Heh, you underestimate (or perhaps understate) the predictability of
| stupidity... somewhere down the road you'll have kids running
| for-loops on top of GB-large LLM models :D
|
| Now, seriously - it gets me thinking a lot that a certain generation
| is going to be the last to be interested in assembly, but actually
| there are always people fascinated by the beauty of assembly
| language.
| praptak wrote:
| > 1:1 mapping of the C programming language
|
| I was pretty surprised the first time I discovered the C
| compiler could "reverse engineer" a "logical or of two
| shifts" to generate a single rotate instruction.
| tialaramex wrote:
| Idiom recognition. Some compilers had idiom recognition
| before C existed, because in some of the early languages
| the literal interpretation of the common idioms is
| incredibly expensive to perform, but a compiler can
| recognise that actually it can take massive shortcuts.
|
| For example, sort this list of numbers into ascending
| order, then, take that sorted list, get the last number
| from it. With idiom recognition the compiler can say oh,
| I'll walk the list once, tracking the maximum as I go,
| that's much faster. In C this seems like, well, if I meant that I'd
| write that; but I think APL is an example where the list expression
| is idiomatic, and so that's what programmers will write even though
| they _want_ the fast thing.
| eska wrote:
| Maybe you'll feel better knowing that popcount has been
| standardized in C23
| tialaramex wrote:
| Thanks, I didn't know about that, so I just read a summary. It's
| kind of sad that even the C standard library is now obliged to put
| prefixes on names for namespace reasons, but yeah, it is good.
| eska wrote:
| To be fair, it's no different in C++, even though C++ has
| namespaces: one still has to write "std::foo()" in headers (and
| therefore in template code) in order not to pollute the global
| namespace anyway.
|
| As long as they stop adding locale and global/thread-
| state dependent functions I'll be quiet :)
| sigsev_251 wrote:
| Regarding locale, let's hope that the text conversion API makes it
| into the next version so we can all work in Unicode and be happy.
| adrian_b wrote:
| The only weird thing about it is that, even though it was already
| implemented in the first commercial computer made with vacuum tubes,
| and even though its hardware implementation is cheap and can be much
| faster than a software implementation, it was omitted from most
| later computers (with the exception of the most expensive ones,
| which were typically used mainly by government agencies) until AMD
| decided to implement it in AMD Barcelona (2007).
|
| Once AMD had it, Intel followed soon after in Nehalem (2009), and
| now most CPUs have it.
|
| The article is somewhat misleading about population count being
| available in ARM NEON since 2005 (i.e. since Cortex-A8).
|
| The 64-bit Armv8-A ISA, which was introduced in 2012, has the
| CNT instruction, which is indeed a population count
| instruction.
|
| The 32-bit Armv7-A ISA has the VCNT instruction, which is a
| variant of population count, but which is much less useful than
| the 64-bit POPCNT provided by Cray 1, AMD Barcelona, Intel
| Nehalem or Armv8-A, because it works only on 8-bit words, so
| executing the equivalent of a 64-bit POPCNT requires a sequence
| of several instructions.
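|
| For illustration, a 64-bit popcount built from the per-byte count
| plus a horizontal add on AArch64 (a sketch using the vcnt_u8 and
| vaddv_u8 intrinsics):
|
|     #include <arm_neon.h>
|     #include <stdint.h>
|
|     static unsigned popcount64_neon(uint64_t x) {
|         uint8x8_t bytes = vcnt_u8(vcreate_u8(x)); /* popcount of each byte */
|         return vaddv_u8(bytes);                   /* sum the 8 byte counts */
|     }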
| stassats wrote:
| CNT on arm64 counts 8 or 16-bit elements too.
| zzo38computer wrote:
| Bit population count is useful for many things. MMIX has SADD
| ("sideways add", which is a bit population count), and also MOR
| ("multiple OR") and MXOR ("multiple XOR"), which are also useful
| (for many purposes, including endianness conversion (even
| PDP-endian), reversing the bits in a byte, hashing, and others) but
| which I have not seen in other instruction sets.
|
| In GNU C, you can use the built-in function __builtin_popcount to
| compute the bit population count.
| [deleted]
| [deleted]
| jagrsw wrote:
| Hamming distance (and popcnt) is also used in gradient descent
| scenarios.
|
| Eg. here's part of fuzzing engine code, where it tries to produce
| an input which matches some form of variable check inside the
| fuzzed code:
|
| https://github.com/google/honggfuzz/blob/1f09e3bfb0785d9b311...
|     register uint8_t v = ((sizeof(Arg1) * 8) -
|                           __builtin_popcount(Arg1 ^ Arg2));
|     uint8_t prev = ATOMIC_GET(globalCovFeedback->bbMapCmp[pos]);
|     if (prev < v) { ...
| dbcurtis wrote:
| Ah yes, pop count. Back in my days as a CPU logic designer on
| refrigerated, walk-in, scientific mainframes, I was working on a
| vector processor that was intended to be a competitor to Cray-1,
| but sank due to poor marketing and a collapse in oil
| prospecting... but I digress. A few of us on the team had come
| from previous CPUs at other companies, and argued with the
| architecture committee that we should have a vector pop count
| instruction in order to get some interest from No Such Agency.
| Shot down, of course.
|
| When serial No. 1 was in initial benchmarking, two guys named Frank
| and George with No Last Names, who lived somewhere near Fort Meade,
| said they would buy at least one if we added vector pop count.
| angiosperm wrote:
| Popcount eventually gets added to every architecture that lacks
| it, always at great expense. You would think people would have
| learned by now.
|
| And, no, not typically just because the spooks want it.
| NotYourLawyer wrote:
| Why, then?
| bruce343434 wrote:
| I've read this story before but can't remember where
| dbcurtis wrote:
| I suspect it played out multiple times at different
| companies.
| [deleted]
| agubelu wrote:
| To add to some of the uses of popcount, it is also used in chess
| engines. Since a chess board has 64 squares, many engines use 64-bit
| unsigned integers to represent the whole board, mapping each square
| to a bit; these are called bitboards. Each combination of piece type
| and color is then given its own bitboard, for example a bitboard for
| the positions of the white rooks. The squares attacked by a piece
| can similarly be represented by a bitboard.
|
| This in turn allows for the use of something called magic bitboards,
| which are essentially compressed lookup tables of the squares to
| which a rook or a bishop can move for every possible configuration
| of the board. Having all these possibilities precomputed speeds up
| move generation significantly.
|
| Popcount is used, for example, to know the number of pieces of a
| certain type on the board, the number of squares to which a piece
| can move, or the number of pieces attacking a given square.
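|
| A small illustrative sketch (types and names made up) of the kinds
| of popcount queries described above:
|
|     #include <stdint.h>
|
|     typedef uint64_t Bitboard;   /* one bit per square */
|
|     /* How many pieces of this type are on the board? */
|     static int piece_count(Bitboard pieces) {
|         return __builtin_popcountll(pieces);
|     }
|
|     /* How many pieces of one side attack a given square? */
|     static int attacker_count(Bitboard attacks_to_square,
|                               Bitboard side_pieces) {
|         return __builtin_popcountll(attacks_to_square & side_pieces);
|     }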
| WaffleIronMaker wrote:
| If you want to learn more about Hamming Codes from a mathematical
| perspective, I recommend 3Blue1Brown's YouTube series:
|
| > How to send a self-correcting message (Hamming codes) -
| https://youtu.be/X8jsijhllIA
|
| > Hamming codes part 2, the elegance of it all -
| https://youtu.be/b3NxrZOu_CE
|
| Associated with Ben Eater's video, if you want a perspective
| regarding low level computer hardware:
|
| > What is error correction? Hamming codes in hardware -
| https://youtu.be/h0jloehRKas
| ks2048 wrote:
| I recently used popcount for an audio-fingerprinting project [0]
| to compute the difference in binary "audio fingerprints". With
| AVX2 and 4 cores, my desktop managed 448 billion bit-diff calcs
| per second! [0] https://kenschutte.com/phingerprint/
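|
| The core of the bit-diff is just XOR plus popcount per word; a
| hedged sketch (the fingerprint layout here is an assumption, not the
| project's actual format):
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     static unsigned bit_diff(const uint64_t *a, const uint64_t *b,
|                              size_t words) {
|         unsigned d = 0;
|         for (size_t i = 0; i < words; ++i)
|             d += (unsigned)__builtin_popcountll(a[i] ^ b[i]);
|         return d;
|     }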
___________________________________________________________________
(page generated 2023-08-13 23:00 UTC)