[HN Gopher] The NSA Instruction (2019)
       ___________________________________________________________________
        
       The NSA Instruction (2019)
        
       Author : cjg
       Score  : 230 points
       Date   : 2021-06-11 08:06 UTC (14 hours ago)
        
 (HTM) web link (vaibhavsagar.com)
 (TXT) w3m dump (vaibhavsagar.com)
        
       | st_goliath wrote:
       | Some years back, I got myself a copy of Andrew Hodges "Alan
       | Turing - The Enigma", a biography and IMO generally a good read,
       | but also with some gems regarding very early computing history in
       | it.
       | 
       | Specifically, after WWII, Turing worked on the ACE 1 (later
       | reduced to Pilot ACE) project to build an electronic computer,
       | which didn't really progress due to management and bureaucracy
       | overhead. He eventually went to Manchester, once they got their
       | Manchester Mark 1 off the ground, which they tried to
       | commercialize as "Ferranti Mark 1"
       | (https://en.wikipedia.org/wiki/Ferranti_Mark_1).
       | 
       | While employed for the University, Turing IIRC continued to work
       | as an external consultant for whatever became of G.C. & C.S. on
       | the side. According to the book, he convinced them to buy such a
       | machine (presumably for crypt-analysis?) and, on the Manchester
       | side of things, insisted on some modifications to be made,
       | including a "horizontal adder", so it could count the number of
       | bits set in a word with a single instruction, i.e. a popcount
       | instruction. This would pre-date the IBM Stretch mentioned in the
       | article.
        
       | dwheeler wrote:
       | Obviously using a dedicated instruction is fastest in normal
       | cases.
       | 
       | But if you need to implement popcount or many other bit
       | manipulation algorithms in software, a good book to look at is
       | "Hacker's Delight" by Henry S. Warren, Jr, 2003.
       | 
       | "Hacker's Delight' page 65+ discuss "Counting 1-bits" (population
       | counts). There are a lot of software algorithms to do this.
       | 
       | One approach is to set each 2-bit field to the count of 2 1-bit
       | fields, then each 4-bit field to the count of 2 2-bit fields,
       | etc., like this:                   x = (x & 0x55555555) + ((x >>
       | 1) & 0x55555555);         x = (x & 0x33333333) + ((x >> 2) &
       | 0x33333333);         x = (x & 0x0f0f0f0f) + ((x >> 4) &
       | 0x0f0f0f0f);         x = (x & 0x00ff00ff) + ((x >> 8) &
       | 0x00ff00ff);         x = (x & 0x0000ffff) + (x >> 16);
       | 
       | assuming x is 32 bits.
       | 
       | I think this approach is a classic divide-and-conquer solution.
        
         | dragontamer wrote:
         | > But if you need to implement popcount or many other bit
         | manipulation algorithms in software
         | 
         | Power9, ARM, x86 BMI, Nvidia PTX, AMD GCN, and AMD RDNA all
         | have a popcount instruction.
         | 
         | Yeah, all mainstream CPUs and GPUs made in the past decade...
         | 
         | Unfortunately, there's no system I can think of where you'd
         | need the software solution anymore... Maybe if you wanted
         | popcount on an Arduino??
        
           | ncmncm wrote:
           | Yet, practically all software running on 64-bit x86 machines
           | is compiled without, because the original amd64 released in
           | 2003 lacked it, and distributions still target that.
           | Likewise, MSVC. There would be good reasons for Apple XCode
           | not to, but that doesn't mean they don't.
           | 
           | If you tell MSVC to issue a popcount instruction with
           | "__popcnt64()" (etc.), it will. If you ask Gcc to issue a
           | popcount instruction with "__builtin_popcount()", it will
           | only do it if you have also told it to target an ISA that has
           | one; otherwise it emulates.
           | 
           | The only _portable_ way to get a popcount instruction, thus
           | far, is to use C++ 's std::bitset::count() in circumstances
           | where the compiler believes the instruction would work.
           | Pleasingly, Gcc and Clang are both happy to hold an integer
           | type and its std::bitset representation in the same register
           | at the same time, so there is no runtime penalty for a round-
           | trip through std::bitset.
           | 
           | MSVC's standard library implementation of std::bitset does
           | not use the popcount instruction.
        
         | dlemire wrote:
         | You can also go faster using SIMD instructions if you need to
         | compute wider population counts (beyond 64 bits):
         | 
         | Faster Population Counts Using AVX2 Instructions, Computer
         | Journal, Volume 61, Issue 1, 2018
         | https://arxiv.org/abs/1611.07612
        
         | amalcon wrote:
         | A neat one for large, sparse integers is:
         | for(i=0; !x; ++i) {         x = (x-1)&x       }       return i;
         | 
         | Which runs only a number of iterations equal to the number of 1
         | bits. This works because (x-1) actually just flips the
         | rightmost 1 bit and all zeroes to its right, then the & zeroes
         | all of those.
         | 
         | It's not that fast unless your integer is really sparse (since
         | it has a branch), but I've always liked the bit hack.
        
           | pbsd wrote:
           | When the integer is expected to be dense, you have the
           | corresponding trick                   size_t count =
           | sizeof(x) * 8;         while(x != -1) {             x |= x+1;
           | --count;         }         return count;
        
             | throw5away wrote:
             | This is essentially equivalent to feeding the input through
             | bitwise-NOT first. Unfortunately, there are far more
             | integers that are neither sparse nor dense than integers
             | that are sparse or dense.
        
       | graderjs wrote:
       | This is a great piece of writing. I wish more blogs tied together
       | so many technical perspectives like this. Bravo auteur!
        
       | torgoguys wrote:
       | >You might be wondering, like I was, if there's more to this
       | instruction, but that's all it does! This doesn't seem very
       | useful, right?
       | 
       | I have a hard time understanding how anybody who has done most
       | any non-trivial amount of bit maniplation couldn't think of
       | plenty of uses.
        
         | mrobot wrote:
         | Why did you write this comment?
        
           | fuzzer37 wrote:
           | Why did you write this one?
        
           | torgoguys wrote:
           | Simply because I found what I quoted from the article to be a
           | weird comment. What I mean is that there probably isn't very
           | much overlap in the two-circle Venn diagram of 1) people who
           | already are low-level enough to care about native CPU
           | instructions and 2) those who can't think of good uses for a
           | population count.
           | 
           | I'm wasn't trying to be a jerk to the author. I just found it
           | strange and would be happy to find out that I'm an idiot and
           | forgot about ____ programmers who fall into that category. My
           | comment was inviting such comments. (After all, if the author
           | meant what they wrote there, apparently they fall into that
           | camp...)
        
           | l33t2328 wrote:
           | It's immediately relevant to the very beginning of the
           | article.
        
             | Alekhine wrote:
             | But taken in the context of the whole article, it adds
             | nothing of any value. The author literally spends the rest
             | of the post describing how it _is_ useful. Saying  'doesn't
             | seem useful, does it?' at the beginning is a rhetorical
             | device. The author is assuming that the reader probably
             | doesn't have experience using bit manipulations for complex
             | problems.
             | 
             | I am so tired of HN pedantry.
        
               | pc86 wrote:
               | How else are we supposed to feel intellectually superior
               | to literally everyone if we're not pedantic assholes to
               | each other?
        
               | tbrake wrote:
               | Put a little more blunt than I would have but there's
               | nothing wrong there. Rarely a submission goes by where
               | the first comments aren't people racing in to argue with
               | the author.
               | 
               | There's an adversarial air that exists around dissecting,
               | criticizing, nit-picking etc ideas presented, both here
               | and in the tech world at large. As if one's most valuable
               | contribution to a conversation is assuming the role of
               | smug contrarian.
               | 
               | It's frankly tiring and obnoxious. I used to think non-
               | geeks were just bad at communicating with us; that the
               | fault was on their side somehow. But now I think we're
               | just dicks.
        
               | na85 wrote:
               | Careful, pointing out smug asshole commenters is against
               | the contribution guidelines.
        
               | lurquer wrote:
               | That's not the way 'literally' should be used.
        
       | tptacek wrote:
       | The consensus on the 1992 thread (including a really great
       | comment from 'Animats) seems to be that `popcount` was generally
       | not added to architectures at NSA's request --- that people
       | familiar with those archs knew the actual reason `popcount` wound
       | up in the ISA, and it preceded NSA purchases.
       | 
       | https://groups.google.com/g/comp.arch/c/UXEi7G6WHuU/m/Z2z7fC...
        
         | Animats wrote:
         | The striking thing is that the IBM System/360 didn't have it.
         | Nor does the System/370. Those were the standard mainframes for
         | a generation.
         | 
         | IBM Z-series machines do have population count, finally.
        
       | ludamad wrote:
       | My first thought "How else do you quickly count pieces on a
       | bitboard?". Definitely chess programming caused me to never
       | second guess the usefulness of `popcount`
        
       | drichel wrote:
       | Counting bits was the bottleneck in the genomic scan I co-
       | authored (Kanoungi et al. 2020). popcnt resulted in insane
       | perfomance gains comared to all other methods.
       | 
       | However, we re-discovered the fact that some Intel CPUs,
       | including the Nehalem mentioned in the article, have a bug that
       | severly affects popcnt's performance, see for example here:
       | https://github.com/komrad36/LATCH/issues/3#issuecomment-2671...
        
       | 4gotunameagain wrote:
       | Another interesting application of popcount is in computer
       | vision, namely in matching keypoints that use binary descriptors
       | for 3D reconstruction in SLAM/TRN etc
        
         | jonatron wrote:
         | Yep, I've used __builtin_popcountll for ORB from OpenCV (256
         | bit binary descriptors).
        
           | 4gotunameagain wrote:
           | Looks like we've done similar things :)
           | 
           | Horror story: I was once developing a TRN system for a
           | spacecraft instrument which uses an ancient x86 processor
           | that does not have popcnt, ended up using a look-up table
           | instead...
        
             | solarexplorer wrote:
             | Did you know about HAKMEM 169? I guess it was/is not widely
             | known since many people mention lookup tables as the only
             | fast alternative to the popcnt instruction.
             | 
             | http://www.hakmem.org/#item169
        
       | rcgorton wrote:
       | It is also incredibly useful for doing string scanning - look at
       | strlen/strchr in various libc imp lementations
        
         | chrchang523 wrote:
         | Nitpick: it's the related "count trailing zeros" operation that
         | is useful (in combination with movemask) there, not popcount
         | itself.
        
       | bsmith0 wrote:
       | Here's a dumb question. If someone asked me to do it I'd probably
       | write code like:
       | 
       | while(x != 0) { c += x&1; x >>= 1; }
       | 
       | Is this something that should be added to LLVM?
       | 
       | Edit: flip the order
        
         | secondcoming wrote:
         | Both clang and gcc have __builtin_popcnt variants.
        
         | sumtechguy wrote:
         | I came across this long ago. But it shows some very nice ways
         | to fiddle bits. It has a few different ways to do it. Which
         | would be handy on systems that do not have a popcount.
         | 
         | https://graphics.stanford.edu/~seander/bithacks.html
        
         | [deleted]
        
         | nickysielicki wrote:
         | Popcount is easily recognized by llvm (and it's actually
         | mentioned in the article...)
         | 
         | In the case of the code you've posted, you're shifting out the
         | LSB before you check the bit, so it's not quite right, but (in
         | general) popcount is recognized and used when possible.
        
           | bsmith0 wrote:
           | Yep my bad! I think flipping the order should work still
           | though.
           | 
           | The two links in the article:
           | 
           | https://lemire.me/blog/2016/05/23/the-surprising-
           | cleverness-...
           | 
           | And the LLVM source indicate to me it only picks up on
           | x&(x-1) pattern, which would miss the popcount optimization
           | on code like mine.
        
             | nickysielicki wrote:
             | Flipping the order works, except if the LSB on x is set.
             | 
             | https://godbolt.org/z/qdWhxMPsf
             | 
             | Note the run output under clang.
             | 
             | edit:
             | 
             | > And the LLVM source indicate to me it only picks up on
             | x&(x-1) pattern, which would miss the popcount optimization
             | on code like mine.
             | 
             | Thanks for teaching me something this morning. That's
             | annoying.
             | 
             | I think the portable solution is std::popcount in C++ (or
             | equivalent in Rust).
        
               | vardump wrote:
               | While it seems to be true gcc and clang don't recognize
               | this pattern even when implemented correctly, your
               | program becomes an infinite loop if the highest bit is
               | set (negative), because 'i' will never become 0.
               | 
               | Example with int8_t:                 int8_t i = -127; //
               | 0b10000001       i >>= 1; // 0b11000000       i >>= 1; //
               | 0b11100000       i >>= 1; // 0b11110000       i >>= 1; //
               | 0b11111000       i >>= 1; // 0b11111100       i >>= 1; //
               | 0b11111110       i >>= 1; // 0b11111111       i >>= 1; //
               | 0b11111111 ad infinitum
               | 
               | One needs to be careful when using >> (shift right) with
               | signed integers.
               | 
               | So your program is not equivalent to popcount.
        
               | tialaramex wrote:
               | > or equivalent in Rust
               | 
               | https://doc.rust-lang.org/std/?search=count_ones
               | 
               | Internally Rust actually just staples LLVM's
               | implementation into your code, via an intrinsic - but if
               | that were ever to change the standard library
               | count_ones() methods will do whatever happens instead so
               | you should use that.
        
       | notacoward wrote:
       | Back in the mid-2000s I worked at a company that made their own
       | (MIPS-based) chips. NSA was one of our customers - supposedly the
       | "defense" who could be considered the good side of NSA compared
       | to the 10x larger "offense" but still. As we were planning for
       | our second generation, they offered quite a bit of money if we'd
       | implement a "sheep and goats" instruction. It would take two
       | operands: an input and a mask. The masked-in bits of the input
       | (the "sheep") would be packed toward the MSB of the output, while
       | the masked-out bits (the "goats") would be packed toward the LSB.
       | We had a lot of people on staff with serious chops in all sorts
       | of math including cryptography, but none of them could identify
       | an algorithm that would benefit from having such an instruction
       | (as distinct from more conventional range-based bitfield
       | instructions). Since the company went under shortly afterward, it
       | remained a mystery. I still wonder about it.
        
         | coolspot wrote:
         | Is it possible that by some mistake the NSA your company was
         | working with was National Sheepfarmers Association?
         | 
         | Did representatives of the "NSA" have a New Zealand accent?
        
         | dboreham wrote:
         | I have also been around in a company that made CPUs that
         | initially had no bit count instruction. Then at some point the
         | instruction was added. At the time I heard that "men in black
         | with mirrored sunglasses" had shown up and demanded that the
         | instruction be added. Whether or not this was an accurate
         | description of events, you can see the note on page 74 in this
         | document (section 8.2) :
         | http://www.transputer.net/iset/pdf/tis-acwg.pdf recording the
         | instruction having been added.
         | 
         | Edit, I see Roger Shepherd (one of the people in the know at
         | above mentioned company) commented in the comp.arch thread
         | (which I vaguely remember reading at the time) but no mention
         | of MIB...
        
         | pbsd wrote:
         | That sounds like a decent primitive to accelerate arbitrary bit
         | permutations in software. It's known as GRP in, e.g., [1].
         | 
         | [1] http://palms.ee.princeton.edu/PALMSopen/shi00bit.pdf
        
           | notacoward wrote:
           | Very interesting. This was published in 2000, and the people
           | we worked with were near Princeton, so this result - the
           | specific utility of such an instruction, if not the semantics
           | themselves - might very well be something that was known to
           | some relevant people but not yet widely enough for any of our
           | people to recognize it. Thanks!
        
         | devit wrote:
         | That's equivalent to !0 << popcount(!(x & mask)) (where left
         | shift must saturate and not truncate the shift count, otherwise
         | you need to special case x = 0) and seems much less useful than
         | popcount.
        
           | notacoward wrote:
           | I think you're misunderstanding what the instruction (or
           | similar ones that others have mentioned) would do. It's a
           | specialized permutation function; every bit in the input is
           | preserved, just in a different position. Your version doesn't
           | have that property at all, and would indeed not be very
           | useful.
        
         | less_less wrote:
         | Half of this instruction is present in AMD64's BMI2 extension
         | as PEXT, and the reverse operation as PDEP. Unlike "sheep and
         | goats", PEXT just extracts the sheep into the LSB and ignores
         | the goats.
         | 
         | If I recall the Knuth lecture correctly, given a "sheep and
         | goats" instruction where one of the sets is packed in reverse
         | order, you can implement any n-bit permutation in something
         | like log2(n) instructions. I don't remember if this is true if
         | they're both packed in forward order. But it would be nice for
         | some hardware crypto designs, like DES or more recently GIFT.
         | 
         | PEXT has at least two additional use cases I know of:
         | manipulating bit indices for databases, and binary (GF2) matrix
         | manipulation. I've used it in a (non-crypto) project to select
         | a subset of columns from a binary matrix, to convert it to
         | systematic form. This subroutine also used popcount.
         | 
         | What I really wanted in that project was another "NSA
         | instruction": bit matrix multiply. Cray supercomputers can
         | multiply two 64x64 binary matrices in one instruction, though I
         | have no idea how many cycles it takes. With AVX2, the best I
         | could do is 6 instructions plus precomputation for 8x8 x 8x32,
         | which is 1/128'th the work.
        
           | Enginerrrd wrote:
           | It's out of my depth, but my guess is on sething DES related.
           | Here's a link to some possibly relevant discussion about it.
           | 
           | http://www.icodeguru.com/Embedded/Hacker's-Delight/050.htm
        
           | someguydave wrote:
           | indeed, Cray famously said "If you were plowing a field,
           | which would you rather use: two strong oxen or 1024
           | chickens?"
           | 
           | Unfortunately we only have 1024 chickens in modern computers.
        
             | lowbloodsugar wrote:
             | If you had to digest a million grains, which would you
             | rather use?
        
             | CalChris wrote:
             | Yes, but those chickens now are as powerful as Cray's oxen
             | were then.
        
               | flavius29663 wrote:
               | so, do you want 2 modern oxen or 1024 modern chickens?
        
               | CalChris wrote:
               | Gimme dem modern wide supercalar OOO cached chickens,
               | please. Cray was right back then but he is no longer
               | right now. If he were, the market would say so.
        
               | lowbloodsugar wrote:
               | These days, all the oxen are made up of chickens. The
               | biggest one is 7,630,848 chickens.
               | 
               | https://www.top500.org/lists/top500/2020/11/
        
           | thomasmg wrote:
           | Succinct (space-saving) data structures often need "rank" and
           | "select" operations. Rank(n) is the number of 1 bits up to
           | position n. Select(n) is the reverse: at which position is
           | the n-th 1 bit.
           | 
           | For "rank", the "popcount" instruction can be used.
           | Interestingly, for "select", the "PDEP" instruction can be
           | used: you can put the data array in the PDEP mask, and 1 << n
           | in the value; basically flip the operands. I found this quite
           | fascinating. For details, there is a short paper on this: "A
           | Fast x86 Implementation of Select".
           | 
           | I wonder if those succinct data structures are in any way
           | related to what NSA is doing. I think not, but who knows.
        
             | Bayart wrote:
             | The paper, if anyone wants to save some clicks :
             | https://arxiv.org/abs/1706.00990
        
           | Sniffnoy wrote:
           | Heh, so it's an instruction for INTERCAL's "select" (~)
           | operator...
        
         | mjevans wrote:
         | The 'and goats' part leads me to conceptualizing the
         | instruction more like:
         | 
         | Bit Scrambler / Chutes - shuffle bits around in a way that
         | divides a stream.
         | 
         | This might also be useful in pre-filters for compression
         | (entropy reduction) if you knew the content of the message.
         | E.G. for ASCII text the upper 2-3 bits of each letter could be
         | ranked to the side for better compression and a reduction of
         | message size.
         | 
         | As others have pointed out, modern CPUs ended up with 'half'
         | that instruction, so I wonder if there were any other reasons
         | for the full instruction.
        
         | jhallenworld wrote:
         | Compress and expand (from Hackers Delight) are like this, but
         | only the selected bits are kept. These are quite useful
         | instructions. One use is the hash function for perfect hash
         | tables. The hash table includes a mask which picks the bits of
         | the keys which actually change (between all keys) and
         | compresses them all to the right as the hash index.
         | 
         | Disclaimer: I contributed the "expand" algorithm shown in
         | Hacker's Delight.
        
       | dragontamer wrote:
       | GPU-programmers use popcount-based programming all the time these
       | days, but the abstractions are built on top and are hardware
       | accelerated.
       | 
       | CUDA's __activemask(); returns the 32-bit value of your current
       | 32-wide EXEC mask. That is to say, if your current warp is:
       | int foo = 0;         if(threadIdx.x %= 2){           foo =
       | __activemask();          }
       | 
       | foo will be "0b01010101...." or 0x55555555. This __activemask()
       | has a number of useful properties should you use __popc with it.
       | 
       | popcount(__activemask()); returns the number of threads
       | executing.
       | 
       | lanemask_lt() returns "0b0000000000000001" for the 0th lane.
       | 0b0000000000000011 for the 1st lane. 0b0000000000000111... for
       | the 2nd lane... and 111111111...111 for the last 31st lane.
       | 
       | popcount(__activemask() & lanemask_lt()); returns the "active
       | lane count". All together now, we can make a parallel SIMD-stack
       | that can push/pop together in parallel.                   int
       | head = 0;         char buffer[0x1000];
       | while(fooBar()){ // Dynamic! We don't know who is, or is not
       | active anymore             int localPrefix =
       | __popc(__activemask() & __lanemask_lt());             int
       | totalWarpActive = __popc(__activemask());
       | buffer[head + localPrefix] = generateValueThisThread();
       | if(localPrefix == 0){                 head += totalWarpActive; //
       | Move the head forward, much like a "push" operation in single-
       | thread land                 // Only one thread should move the
       | head             }              __syncthreads(); // Thread
       | barrier, make sure everyone is waiting on activeThread#0 before
       | continuing.         }
       | 
       | ------------
       | 
       | As such, you can dynamically load-balance between GPU threads
       | (!!!) from a shared stack with minimal overheads.
       | 
       | If you want to extend this larger than one 32-wide CUDA-warp,
       | you'll need to use __shared __ memory to share the prefix with
       | the rest of the block.
       | 
       | It is a bad idea (too much overhead) to extend this much larger
       | than a block, as there's no quick way to communicate outside of
       | your block. Still though, having chunks of up to 1024 threads
       | synchronized through a shared data-structure that only has
       | nanoseconds of overhead is a nifty trick.
       | 
       | -----------
       | 
       | EDIT: Oh right, and this concept is now replicated very, very
       | quickly in the dedicated __ballot_sync(...) function (which
       | compiles down to just a few assembly instructions).
       | 
       | Playing with the "Exec-mask" is a hugely efficient way to
       | synchronously, and dynamically gather information across your
       | warp. So lots of little tricks have been built around this.
        
       | ncmncm wrote:
       | It is appalling that, after _every_ other general-computing
       | architecture in common use either started out with a popcount
       | instruction, or had one added later at substantial expense,
       | RISC-V came out without one.
       | 
       | It still doesn't have any. The proposed B, "bitmanip" extension
       | has it (along with a raft of trivial variations: count leading
       | zeroes, count trailing ones, yada yada) but that is not ratified
       | and not implemented in any chip I know of. Since B is a huge
       | extension, we can expect it will be routinely omitted even after
       | it's ratified, and compilers will need special prodding to
       | produce any such instructions.
       | 
       | It should have been in the base instruction set. We probably can
       | blame its lack on the academic origins of the design. CS
       | professors probably think of it as a thing not needed to
       | implement Lisp, therefore not worth class time.
       | 
       | (Some people say, "Oh, but you can trap and emulate it", which
       | adds insult to injury. Trapping and emulating eliminates all the
       | value the instruction offers.)
        
       | carapace wrote:
       | (I just want to add that this is the best thread on HN i've read
       | in a while. Y'all bringing a little nerdy tear to my eye. <3 )
        
       | FridayoLeary wrote:
       | A bit off- topic but i want to know; is binary code (01etc) still
       | used today in programming/coding? And for what applications?
        
         | fortyrod wrote:
         | Maybe not for general-purpose computing. I've used it for on-
         | the-fly code generation (hacking display rotation into the
         | Windows 3x BitBlt engine) and programming special-purpose media
         | accelerators. In both cases you end up creating a bunch of
         | convenience #defines or macros that generate the bits, which
         | immediately takes you back into tiny language territory rather
         | than pure machine code. The relative ease of creating new
         | programmable hardware in FPGAs is another place this might
         | occur.
        
       | adrian_b wrote:
       | It is possible that the "population count" instruction has been
       | included in the instruction sets of most American supercomputers
       | at the request of NSA, which was an important customer for them.
       | 
       | Nevertheless, the first computer having this instruction was a
       | British computer, the Ferranti Mark I (February 1951).
       | 
       | The name used by Ferranti Mark I for this instruction was
       | "sideways add".
       | 
       | Also notable was that Ferranti Mark I had the equivalent of LZCNT
       | (count leading zeroes) too.
       | 
       | Both instructions are very useful and they are standard now for
       | modern instruction sets, but they were omitted in most computers
       | after Ferranti Mark I, except in expensive supercomputers.
        
         | adrian_b wrote:
         | Moreover, Ferranti Mark I included a hardware random number
         | generator, another feature useful for cryptography, which was
         | reintroduced only recently in modern CPUs.
        
           | iib wrote:
           | Hardware random number generators do have some security
           | issues though. Linux devs were opposed to solely relying on
           | them, because they can be compromised by the vendor [1]. So
           | they are at best used in algorithms that they can not
           | compromise (still in [1], but lower, in the comments).
           | 
           | [1] https://web.archive.org/web/20180611180213/https://plus.g
           | oog...
        
             | adrian_b wrote:
             | The security issues are not with hardware random number
             | generators in general, but with those that are included
             | inside complex devices like monolithic CPUs or TPMs, so
             | that the owners of those devices cannot verify that the
             | RNG's really do what they are claimed to do.
             | 
             | Discrete hardware RNG's, like that of the Ferranti Mark I,
             | are perfectly secure.
             | 
             | For a modern device, the best way to implement a hardware
             | RNG is to just include an ADC (analog-digital converter)
             | input. Then you may connect externally on the PCB some
             | noisy analog amplifier, e.g. one which has a noisy resistor
             | or diode at its input. Digitizing the noise with the ADC
             | will provide the random numbers and the ADC input can be
             | isolated and tested separately at any time, so the user can
             | verify that there is no hidden functionality.
             | 
             | Most microcontrollers have ADC inputs, so it is easy to add
             | a secure hardware RNG for them. The same could be done for
             | a personal computer by making a noisy amplifier that can be
             | plugged in the microphone input, or by making a USB device
             | with a microcontroller.
        
               | iib wrote:
               | I remembered something, and I want to say as an aside,
               | for anybody reading that at one point has to design a toy
               | RNG from an ADC, as I had to some years ago, you should
               | not take the last bits as they are--as was my first
               | thought--, you should pass them through something like
               | the von Neumann corrector [1].
               | 
               | [1] https://everything2.com/title/von+Neumann+corrector
        
             | ncmncm wrote:
             | Indeed, AMD has more than once shipped CPUs in which the
             | random-number instruction would always yield the same
             | value, that had to be monkey-patched to yield apparently
             | random numbers. A valuable hint.
        
         | st_goliath wrote:
         | I commented on that earlier, including that it probably also
         | has a cryptanalysis background:
         | https://news.ycombinator.com/item?id=27472900
         | 
         | But yes, it definitely pre-dates the 1961 IBM machine in the
         | article.
        
       | pklausler wrote:
       | Surely the Cray BMM (bit matrix multiplication) instructions have
       | a better claim to that nickname.
        
       | [deleted]
        
       | oefrha wrote:
       | Discussed at the time:
       | https://news.ycombinator.com/item?id=20914479
        
         | implements wrote:
         | In this discussion someone offers:
         | 
         | "I remember in one interview I was asked to write a function
         | that returns true iff x is a power of two, so I wrote return 1
         | == __builtin_popcount(x). They liked that."
         | 
         | I'm no longer a programmer, but I wondered why it wasn't
         | "return (__builtin_popcount(x) == 1)" - just out of interest.
        
           | throw5away wrote:
           | If you accidentally write "return x = 1" when x is a
           | variable, you always return true. If you return "1 = x", you
           | cause a syntax error. So some people have gotten into the
           | habit of writing constants on the left, even if the return
           | value of __builtin_popcount is not assignable.
        
             | implements wrote:
             | Thanks for explaining - that makes sense.
        
       | NobodyReport wrote:
       | I have always been a lurker and I am trying to add my account to
       | my Keybase but it requires 2 karma. Could a few kind sirs give me
       | the juice?
        
         | NobodyReport wrote:
         | Not gonna spam. Just this one comment to do and hope it works
         | out. Edit- Thank you to the two who did but sadly someone down
         | voted so my total is 1.
         | 
         | I respect the community. I don't wish to spam just trying to
         | take control of all online identities.
        
       ___________________________________________________________________
       (page generated 2021-06-11 23:01 UTC)