[HN Gopher] Counting the number of matching characters in two AS...
       ___________________________________________________________________
        
       Counting the number of matching characters in two ASCII strings
        
       Author : ingve
       Score  : 72 points
       Date   : 2021-05-21 14:54 UTC (8 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | sm4sh wrote:
       | Well okay, but guess which one I'd rather have to maintain in a
       | big codebase? Clearly written by a theoretical engineer.
        
         | JKCalhoun wrote:
         | I've worked with engineers that must read these blogs and try
         | to out-clever everyone.
         | 
         | They succeed, and that's the problem.
        
         | humblepie wrote:
         | For a large codebase, this would be encapsulated and heavily
         | commented, etc. It's something that you pull out when you need
         | to make something "go to eleven" kind of thing.
        
           | raldi wrote:
           | Exactly. You could have some exhaustive unit tests as well,
           | and after that there would be nothing to maintain.
        
         | Forge36 wrote:
         | It's about use cases.
         | 
         | There was an old lady 6 about unoptimized behavior in VB6 and
         | why is wasn't necessary to do so. I had to write a wrapper for
         | it which was optimized in my code. Was it harder to read? Yes.
         | Did it produce a significant performance improvement? Also yes.
         | 
         | Software development is about tradeoffs. Demoing one way to
         | improve a simple function helps others trunk how they write
         | code, and more importantly why they wrote specific code
        
         | rocqua wrote:
         | So if your big codebase were profiled, and the 99th percentile
         | case happened to spend 80% of its time in this function, you
         | would still stick to the old function unwaveringly? Even if
         | your system was latency sensitive, and your 99 percentile
         | spikes were far from acceptable?
         | 
         | Sure, there are many cases where this is a premature
         | optimization. This blog never said otherwise. But that doesn't
         | mean there is never a time for optimization. And if profiling
         | points at this, then this speedup can be valuable.
        
         | teachingassist wrote:
         | Plenty of library functions are already going to be optimised
         | in this type of way - just because they're not visible in your
         | codebase doesn't mean they are not there.
        
         | [deleted]
        
         | commandlinefan wrote:
         | > rather have to maintain
         | 
         | Well, as long as it works, what difference does it make? As
         | long as it's prefixed with a
         | 
         | // This compares the matching number of bytes
         | 
         | comment, you should never have to mess with it.
        
           | rocqua wrote:
           | With a compiler or architecture change, I guess this could
           | create awkward to handle performance regressions.
        
       | draw_down wrote:
       | Hmm, I kinda liked the original code.
        
       | kens wrote:
       | If you like this sort of weird bit operations, you should check
       | out HAKMEM, a huge list of hacks created at MIT in 1972.
       | 
       | "Here is some little known data which may be of interest to
       | computer hackers. The items and examples are so sketchy that to
       | decipher them may require more sincerity and curiosity than a
       | non-hacker can muster. Doubtless, little of this is new, but
       | nowadays it's hard to tell."
       | 
       | https://w3.pppl.gov/~hammett/work/2009/AIM-239-ocr.pdf - see e.g.
       | page 79
        
         | codezero wrote:
         | I love the curiosity/hacking of the early days of computing,
         | hooking up your registers to an audio amplifier? sure!
         | 
         | """
         | 
         |  _This is a display hack (that is, it makes pretty patterns)
         | with the low 9 bits = Y and the 9 next higher a X; also, it
         | makes interesting, related noises with a stereo amplifier
         | hooked to the X and Y signals._
         | 
         | """
        
         | P-ala-din wrote:
         | I would also recommend Bit Twiddling Hacks By Sean Eron
         | Anderson [1] and Hacker's Delight.
         | 
         | [1] - https://graphics.stanford.edu/~seander/bithacks.html
        
       | superjan wrote:
       | I did not know the trick of using memcpy to get the compiler to
       | load a register from an arbitrary char* pointer. Given that the
       | architecture supports it, can you assume it will be optimized
       | like that? I assume this is done to stay away from undefined
       | behavior.
        
         | bombcar wrote:
         | Poking around godbot seems to indicate ARM and x86 does it, but
         | web assembly doesn't: https://www.godbolt.org/z/r86f9nr1q which
         | I guess makes sense (but web assembly has to become actual code
         | at some point, so maybe the optimization is lost entirely?).
        
           | superjan wrote:
           | Thanks, good idea to share this link in the comments.
        
       | johnghanks wrote:
       | Original Title: How to use weird hacks to make your code
       | unreadable.
        
       | ComputerGuru wrote:
       | I love Daniel's optimization posts, but this one seems a little
       | too esoteric: it's for when autovectorization isn't fast enough
       | (edit: or isn't available) but you can't or don't want to use
       | SIMD; it gives you a solution solidly better than the former but
       | worse than the latter, and more opaque than either.
       | 
       | In all cases, obviously the real magic is in the 64-bit word
       | comparison routine (which doesn't get more than a passing
       | mention); converting the former problem into the latter is
       | extremely straightforward and obvious - if you know that
       | optimized solutions for the latter exist (I personally didn't
       | know of this bit twiddling hack (aside from doing it via SIMD,
       | obviously) and am grateful to learn about the approach in Mula's
       | code).
        
         | f00zz wrote:
         | > I personally didn't know of this bit twiddling hack and am
         | grateful to learn about the approach in Mula's code
         | 
         | Yeah the last line had me scratching my head for a bit. Cool
         | stuff.
        
         | tyingq wrote:
         | Are there many cases where plain ASCII is okay anymore, anyway?
         | Wouldn't proper UTF-whatever comparisons mandate doing it the
         | slow way, if you wanted counts versus just "same/not the same"?
        
           | masklinn wrote:
           | Proper comparison would have to be even slower as you'd need
           | to deal with normalisation & equivalence.
           | 
           | The << slow way >> of the post would also count matching code
           | units which is utterly useless, really.
        
           | rocqua wrote:
           | For UTF-8 I believe the question is ill-posed. The "position"
           | of a "character" in unicode is very hard to determine and
           | depends largely on what you take to be a character.
        
             | bmn__ wrote:
             | Then UTF-8 should not be operated on directly. When reading
             | from a file etc., first decode the input into an internal
             | string format that has straight-forward linear access to
             | each gc boundary.
        
               | eesmith wrote:
               | Even then,
               | https://en.wikipedia.org/wiki/Unicode_equivalence . What
               | does "matching character" mean given combining
               | characters?
        
               | rocqua wrote:
               | Probably pick a canonical encoding and compare there.
        
           | joawowiij wrote:
           | hex, base64 encoded strings, etc
        
             | tyingq wrote:
             | That makes sense for "same/not-same" comparisons to me. I
             | don't know why you would want counts of different
             | characters though.
        
               | bombcar wrote:
               | "Password must be different by at least 6 characters"
               | perhaps, though that's the inverse, really.
               | 
               | Fuzzy matching is another aspect that comes to mind.
        
           | nonameiguess wrote:
           | The actual killer application of character count matching is
           | bioinformatics. You only need four distinct characters to
           | encode DNA, which of course, means in reality you only need
           | two bits, but I don't know if anyone has bothered to develop
           | a two-bit character encoding system and instructions that can
           | operate on them. They have come up with extremely optimized
           | implementations that compare byte streams, though, utilizing
           | ASCII GCTA.
        
             | jakobnissen wrote:
             | Bioinformatician here. Yes, people have come up with many
             | encoding systems: 2-bit encoding systems, 3-bit encoding
             | systems (for N's), various encoding schemes that allows
             | long runs of N's to be stored with little memory costs, and
             | even reduced amino acid alphabets for e.g. 4-bit amino acid
             | sequences.
             | 
             | In my experience, the performance bottlenecks in
             | bioinformatics is usually IO, parsing, de/compression and
             | needlessly slow algorithms. You rarely get meaningful speed
             | improvements by internally encoding the sequences one way
             | or another (storing small fixed-size sequences in machine
             | integers is an exception). There can be other advantages,
             | though, like reduced memory consumption.
             | 
             | Partly for that reason, most DNA data is just stored as
             | gzipped text files.
        
             | TchoBeer wrote:
             | except amino acids are encoded for with three bases,
             | meaning that the actual character encoding system is 6
             | bits.
        
           | bombcar wrote:
           | That would be more interesting, something that could handle a
           | UTF-8 string in a similar situation (I guess you'd compare
           | the bytes and then do a walk to see if it was part of a
           | multi-byte character? Maybe the high-bit would help?).
        
           | Thaxll wrote:
           | HTTP headers.
        
           | sfink wrote:
           | Yes! Plain ASCII is great, even when you need to properly
           | handle non-ASCII. When >90% of your strings are ASCII in
           | practice, then having an "is ASCII" bit in your string
           | headers can save a ton of memory, and gain more performance
           | than you lose in flag checks and extra code.
           | 
           | Heck, if your usual encoding is UTF-8, the flag bit can be
           | purely advisory and ignored when it doesn't matter.
        
         | bombcar wrote:
         | If the optimizing compiler can figure out to use SIMD shouldn't
         | it also be able to degrade to this when SIMD isn't available?
         | 
         | And if we're already down the "write it understandable and
         | trust the compiler" path isn't that the better way to go?
         | 
         | It's interesting to me that C is now effectively a "higher
         | level" language than it may appear on the surface because of
         | optimization by compilers.
        
       | MontagFTB wrote:
       | I'd still take the traditional solution
       | (https://godbolt.org/z/K5hWqezMP) over the latter one
       | (https://godbolt.org/z/jaWMM491M). The latter solution requires a
       | traditional check at the tail end of it, so the compiler appears
       | to include the traditional implementation in the latter's
       | solution anyways.
       | 
       | Unless this is a significant bottleneck for your application, the
       | latter does not seem practical, and a lot of profiling would have
       | to go into justifying the latter code (which would only be
       | worthwhile if, as I said, this were a significant bottleneck.)
        
       | [deleted]
        
       | KETpXDDzR wrote:
       | Where are the benchmarks?
        
       | codezero wrote:
       | Can you do this with an xor and popcount? I suppose that's what
       | the author is implying when they say the compiler takes care of
       | this.
        
         | danbruc wrote:
         | You can, the return statement is essentially a popcount.
         | c1 = 0x8080808080808080       c2 = 0x7F7F7F7F7F7F7F7F       c3
         | = 0x0101010101010101            e = ~(x ^ y)
         | return popcount((e & c1) & ((e & c2) + c3))
         | 
         | For those wondering how this works. If two bytes are identical,
         | then the XOR will yield 0x00 and the negation will yield 0xFF.
         | Masking out the seven least significant bits with 0x80 will
         | just yield 0x80 again. Masking out the most significant bit
         | with 0x7F will just yield 0x7F again and adding 0x01 will yield
         | 0x80. Finally combing 0x80 and 0x80 with AND just yields 0x80,
         | i.e. exactly the most significant bit set.
         | 
         | On the other hand if two bytes are not identical, then at least
         | one of the two following things happens. The two bytes differ
         | in the most significant bit and therefore the most significant
         | bit of the XNOR is zero. Then masking out the seven least
         | significant bits with 0x80 yields 0x00 and therefore the final
         | AND yields 0x00.
         | 
         | Or the two bytes differ in at least one of the seven least
         | significant bits and therefore at least one of the seven least
         | significant bits of the XNOR is zero. Then masking out the most
         | significant bit with 0x7F yields something smaller than 0x7F
         | and adding 0x01 yields something smaller than 0x80, i.e. the
         | most significant bit is zero, and therefore the final AND again
         | yields 0x00 because the first operand is either 0x00 or 0x80.
         | 
         | The remaining work is to count the number of set bit, either
         | with popcount or some other construction like the one in the
         | article.
        
           | sfink wrote:
           | Except this is suboptimal, as is the code in the original
           | post: it specifies ASCII, so (e & c1) is guaranteed to be
           | zero. In your code, that means you could just do
           | return popcount((e & c2) + c3);
           | 
           | (Thanks for writing this out, as otherwise I wouldn't have
           | spotted it.)
        
             | danbruc wrote:
             | The author certainly meant extended ASCII and therefore you
             | can not make this change. Sure, it may be technical
             | incorrect but ASCII is so commonly used to refer to
             | extended ASCII that this should probably be the default
             | interpretation unless it is clear from the context, say a
             | discussion of the history of character encodings, that it
             | is meant in its original sense.
        
       | b20000 wrote:
       | but the real question is, can you write the most optimal version
       | on the spot in under a minute, while your interviewer is babbling
       | and distracting you, and use a binary search?
        
       | f00zz wrote:
       | Cool trick, but if I'm reading this correctly the test fails for
       | characters with codes larger than 127
        
         | maxnoe wrote:
         | Which is probably why the title says "ASCII"
        
           | f00zz wrote:
           | I wasn't reading it correctly, it does work for characters >
           | 127. For pure ASCII you wouldn't even need t1
        
       ___________________________________________________________________
       (page generated 2021-05-21 23:01 UTC)