[HN Gopher] Counting the number of matching characters in two AS...
___________________________________________________________________
Counting the number of matching characters in two ASCII strings
Author : ingve
Score : 72 points
Date : 2021-05-21 14:54 UTC (8 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| sm4sh wrote:
| Well okay, but guess which one I'd rather have to maintain in a
| big codebase? Clearly written by a theoretical engineer.
| JKCalhoun wrote:
| I've worked with engineers that must read these blogs and try
| to out-clever everyone.
|
| They succeed, and that's the problem.
| humblepie wrote:
| For a large codebase, this would be encapsulated and heavily
| commented, etc. It's something that you pull out when you need
| to make something "go to eleven" kind of thing.
| raldi wrote:
| Exactly. You could have some exhaustive unit tests as well,
| and after that there would be nothing to maintain.
| Forge36 wrote:
| It's about use cases.
|
| There was an old lady 6 about unoptimized behavior in VB6 and
| why is wasn't necessary to do so. I had to write a wrapper for
| it which was optimized in my code. Was it harder to read? Yes.
| Did it produce a significant performance improvement? Also yes.
|
| Software development is about tradeoffs. Demoing one way to
| improve a simple function helps others trunk how they write
| code, and more importantly why they wrote specific code
| rocqua wrote:
| So if your big codebase were profiled, and the 99th percentile
| case happened to spend 80% of its time in this function, you
| would still stick to the old function unwaveringly? Even if
| your system was latency sensitive, and your 99 percentile
| spikes were far from acceptable?
|
| Sure, there are many cases where this is a premature
| optimization. This blog never said otherwise. But that doesn't
| mean there is never a time for optimization. And if profiling
| points at this, then this speedup can be valuable.
| teachingassist wrote:
| Plenty of library functions are already going to be optimised
| in this type of way - just because they're not visible in your
| codebase doesn't mean they are not there.
| [deleted]
| commandlinefan wrote:
| > rather have to maintain
|
| Well, as long as it works, what difference does it make? As
| long as it's prefixed with a
|
| // This compares the matching number of bytes
|
| comment, you should never have to mess with it.
| rocqua wrote:
| With a compiler or architecture change, I guess this could
| create awkward to handle performance regressions.
| draw_down wrote:
| Hmm, I kinda liked the original code.
| kens wrote:
| If you like this sort of weird bit operations, you should check
| out HAKMEM, a huge list of hacks created at MIT in 1972.
|
| "Here is some little known data which may be of interest to
| computer hackers. The items and examples are so sketchy that to
| decipher them may require more sincerity and curiosity than a
| non-hacker can muster. Doubtless, little of this is new, but
| nowadays it's hard to tell."
|
| https://w3.pppl.gov/~hammett/work/2009/AIM-239-ocr.pdf - see e.g.
| page 79
| codezero wrote:
| I love the curiosity/hacking of the early days of computing,
| hooking up your registers to an audio amplifier? sure!
|
| """
|
| _This is a display hack (that is, it makes pretty patterns)
| with the low 9 bits = Y and the 9 next higher a X; also, it
| makes interesting, related noises with a stereo amplifier
| hooked to the X and Y signals._
|
| """
| P-ala-din wrote:
| I would also recommend Bit Twiddling Hacks By Sean Eron
| Anderson [1] and Hacker's Delight.
|
| [1] - https://graphics.stanford.edu/~seander/bithacks.html
| superjan wrote:
| I did not know the trick of using memcpy to get the compiler to
| load a register from an arbitrary char* pointer. Given that the
| architecture supports it, can you assume it will be optimized
| like that? I assume this is done to stay away from undefined
| behavior.
| bombcar wrote:
| Poking around godbot seems to indicate ARM and x86 does it, but
| web assembly doesn't: https://www.godbolt.org/z/r86f9nr1q which
| I guess makes sense (but web assembly has to become actual code
| at some point, so maybe the optimization is lost entirely?).
| superjan wrote:
| Thanks, good idea to share this link in the comments.
| johnghanks wrote:
| Original Title: How to use weird hacks to make your code
| unreadable.
| ComputerGuru wrote:
| I love Daniel's optimization posts, but this one seems a little
| too esoteric: it's for when autovectorization isn't fast enough
| (edit: or isn't available) but you can't or don't want to use
| SIMD; it gives you a solution solidly better than the former but
| worse than the latter, and more opaque than either.
|
| In all cases, obviously the real magic is in the 64-bit word
| comparison routine (which doesn't get more than a passing
| mention); converting the former problem into the latter is
| extremely straightforward and obvious - if you know that
| optimized solutions for the latter exist (I personally didn't
| know of this bit twiddling hack (aside from doing it via SIMD,
| obviously) and am grateful to learn about the approach in Mula's
| code).
| f00zz wrote:
| > I personally didn't know of this bit twiddling hack and am
| grateful to learn about the approach in Mula's code
|
| Yeah the last line had me scratching my head for a bit. Cool
| stuff.
| tyingq wrote:
| Are there many cases where plain ASCII is okay anymore, anyway?
| Wouldn't proper UTF-whatever comparisons mandate doing it the
| slow way, if you wanted counts versus just "same/not the same"?
| masklinn wrote:
| Proper comparison would have to be even slower as you'd need
| to deal with normalisation & equivalence.
|
| The << slow way >> of the post would also count matching code
| units which is utterly useless, really.
| rocqua wrote:
| For UTF-8 I believe the question is ill-posed. The "position"
| of a "character" in unicode is very hard to determine and
| depends largely on what you take to be a character.
| bmn__ wrote:
| Then UTF-8 should not be operated on directly. When reading
| from a file etc., first decode the input into an internal
| string format that has straight-forward linear access to
| each gc boundary.
| eesmith wrote:
| Even then,
| https://en.wikipedia.org/wiki/Unicode_equivalence . What
| does "matching character" mean given combining
| characters?
| rocqua wrote:
| Probably pick a canonical encoding and compare there.
| joawowiij wrote:
| hex, base64 encoded strings, etc
| tyingq wrote:
| That makes sense for "same/not-same" comparisons to me. I
| don't know why you would want counts of different
| characters though.
| bombcar wrote:
| "Password must be different by at least 6 characters"
| perhaps, though that's the inverse, really.
|
| Fuzzy matching is another aspect that comes to mind.
| nonameiguess wrote:
| The actual killer application of character count matching is
| bioinformatics. You only need four distinct characters to
| encode DNA, which of course, means in reality you only need
| two bits, but I don't know if anyone has bothered to develop
| a two-bit character encoding system and instructions that can
| operate on them. They have come up with extremely optimized
| implementations that compare byte streams, though, utilizing
| ASCII GCTA.
| jakobnissen wrote:
| Bioinformatician here. Yes, people have come up with many
| encoding systems: 2-bit encoding systems, 3-bit encoding
| systems (for N's), various encoding schemes that allows
| long runs of N's to be stored with little memory costs, and
| even reduced amino acid alphabets for e.g. 4-bit amino acid
| sequences.
|
| In my experience, the performance bottlenecks in
| bioinformatics is usually IO, parsing, de/compression and
| needlessly slow algorithms. You rarely get meaningful speed
| improvements by internally encoding the sequences one way
| or another (storing small fixed-size sequences in machine
| integers is an exception). There can be other advantages,
| though, like reduced memory consumption.
|
| Partly for that reason, most DNA data is just stored as
| gzipped text files.
| TchoBeer wrote:
| except amino acids are encoded for with three bases,
| meaning that the actual character encoding system is 6
| bits.
| bombcar wrote:
| That would be more interesting, something that could handle a
| UTF-8 string in a similar situation (I guess you'd compare
| the bytes and then do a walk to see if it was part of a
| multi-byte character? Maybe the high-bit would help?).
| Thaxll wrote:
| HTTP headers.
| sfink wrote:
| Yes! Plain ASCII is great, even when you need to properly
| handle non-ASCII. When >90% of your strings are ASCII in
| practice, then having an "is ASCII" bit in your string
| headers can save a ton of memory, and gain more performance
| than you lose in flag checks and extra code.
|
| Heck, if your usual encoding is UTF-8, the flag bit can be
| purely advisory and ignored when it doesn't matter.
| bombcar wrote:
| If the optimizing compiler can figure out to use SIMD shouldn't
| it also be able to degrade to this when SIMD isn't available?
|
| And if we're already down the "write it understandable and
| trust the compiler" path isn't that the better way to go?
|
| It's interesting to me that C is now effectively a "higher
| level" language than it may appear on the surface because of
| optimization by compilers.
| MontagFTB wrote:
| I'd still take the traditional solution
| (https://godbolt.org/z/K5hWqezMP) over the latter one
| (https://godbolt.org/z/jaWMM491M). The latter solution requires a
| traditional check at the tail end of it, so the compiler appears
| to include the traditional implementation in the latter's
| solution anyways.
|
| Unless this is a significant bottleneck for your application, the
| latter does not seem practical, and a lot of profiling would have
| to go into justifying the latter code (which would only be
| worthwhile if, as I said, this were a significant bottleneck.)
| [deleted]
| KETpXDDzR wrote:
| Where are the benchmarks?
| codezero wrote:
| Can you do this with an xor and popcount? I suppose that's what
| the author is implying when they say the compiler takes care of
| this.
| danbruc wrote:
| You can, the return statement is essentially a popcount.
| c1 = 0x8080808080808080 c2 = 0x7F7F7F7F7F7F7F7F c3
| = 0x0101010101010101 e = ~(x ^ y)
| return popcount((e & c1) & ((e & c2) + c3))
|
| For those wondering how this works. If two bytes are identical,
| then the XOR will yield 0x00 and the negation will yield 0xFF.
| Masking out the seven least significant bits with 0x80 will
| just yield 0x80 again. Masking out the most significant bit
| with 0x7F will just yield 0x7F again and adding 0x01 will yield
| 0x80. Finally combing 0x80 and 0x80 with AND just yields 0x80,
| i.e. exactly the most significant bit set.
|
| On the other hand if two bytes are not identical, then at least
| one of the two following things happens. The two bytes differ
| in the most significant bit and therefore the most significant
| bit of the XNOR is zero. Then masking out the seven least
| significant bits with 0x80 yields 0x00 and therefore the final
| AND yields 0x00.
|
| Or the two bytes differ in at least one of the seven least
| significant bits and therefore at least one of the seven least
| significant bits of the XNOR is zero. Then masking out the most
| significant bit with 0x7F yields something smaller than 0x7F
| and adding 0x01 yields something smaller than 0x80, i.e. the
| most significant bit is zero, and therefore the final AND again
| yields 0x00 because the first operand is either 0x00 or 0x80.
|
| The remaining work is to count the number of set bit, either
| with popcount or some other construction like the one in the
| article.
| sfink wrote:
| Except this is suboptimal, as is the code in the original
| post: it specifies ASCII, so (e & c1) is guaranteed to be
| zero. In your code, that means you could just do
| return popcount((e & c2) + c3);
|
| (Thanks for writing this out, as otherwise I wouldn't have
| spotted it.)
| danbruc wrote:
| The author certainly meant extended ASCII and therefore you
| can not make this change. Sure, it may be technical
| incorrect but ASCII is so commonly used to refer to
| extended ASCII that this should probably be the default
| interpretation unless it is clear from the context, say a
| discussion of the history of character encodings, that it
| is meant in its original sense.
| b20000 wrote:
| but the real question is, can you write the most optimal version
| on the spot in under a minute, while your interviewer is babbling
| and distracting you, and use a binary search?
| f00zz wrote:
| Cool trick, but if I'm reading this correctly the test fails for
| characters with codes larger than 127
| maxnoe wrote:
| Which is probably why the title says "ASCII"
| f00zz wrote:
| I wasn't reading it correctly, it does work for characters >
| 127. For pure ASCII you wouldn't even need t1
___________________________________________________________________
(page generated 2021-05-21 23:01 UTC)