[HN Gopher] A neat XOR trick
___________________________________________________________________
A neat XOR trick
Author : jmillikin
Score : 312 points
Date : 2022-12-11 22:03 UTC (2 days ago)
(HTM) web link (www.mattkeeter.com)
(TXT) w3m dump (www.mattkeeter.com)
| random_coder wrote:
| A neat trick indeed, using bitsets in a sliding window. Contest
| programming makes heavy use of some such tricks.
| jvanderbot wrote:
| This is also essentially the trick applied in Kadane's Algorithm
| for the "maximum contiguous sum" search, albeit using masks to
| "undo" the set-of-characters check whereas Kadane's uses
| subtraction to "undo" the addition for maximum sum check.
|
| https://www.geeksforgeeks.org/largest-sum-contiguous-subarra...
| ribit wrote:
| One can improve this even further I think. First, using an
| iterator instead of indexing into a UTF-8 string. Second, using
| the codepoint value directly as a bitmask (why is the author
| realigning the bit mask to "a" in the first place?)
| dahfizz wrote:
| The point is that each character represents a single bit in the
| mask, so it's easy to count the number of characters in the
| window using popcount. If you XOR in the codepoints directly you
| can no longer easily tell how many characters are in the set.
| grogers wrote:
| The author passed the input as &[char] not &str, so indexing is
| no problem. That's weird in its own right, but obviously not
| the core of the post.
| mytailorisrich wrote:
| Looking at ASCII decimal values, a-z correspond to 97-122 so if
| you want to use a 32 bit bitmap you remap 97-122 to 0-25 by
| subtracting 97 ('a').
|
| If you can use a 128 bit bitmap for the same cost then you
| could indeed directly index by ASCII values.
|
| You can also get rid of the 'if' on window size in the main
| loop by partially unrolling it and taking those cases (start
| and end of string) outside.
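|
| A minimal sketch of that 128-bit variant (assuming plain ASCII
| input; the function name is made up for illustration):
|
|     fn first_unique_window(s: &[u8], w: usize) -> Option<usize> {
|         let mut set: u128 = 0;
|         for i in 0..s.len() {
|             // Toggle the bit for the byte entering the window,
|             // indexed directly by its ASCII value.
|             set ^= 1u128 << s[i];
|             if i >= w {
|                 // Toggle the bit for the byte leaving the window.
|                 set ^= 1u128 << s[i - w];
|             }
|             if set.count_ones() as usize == w {
|                 return Some(i + 1);
|             }
|         }
|         None
|     }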
| opk wrote:
| I assume the author is realigning to get only one bit for each
| letter. This relies on having an alphabet of 26 characters
| where 26 is less than the 32 available in a u32. In most real-
| world problems, you need to allow for the full ASCII or Unicode
| range and this wouldn't be possible.
|
| Using codepoints directly, there is overlap: 'f' ^ 'd' gives the
| same value as 'c' ^ 'a'. You could keep an xor value for each
| window size smaller than the full window, but that effectively
| brings back the inner loop that using xor is avoiding, and you
| could just use equality. With codepoints, there may be a solution
| similar to a bloom filter to efficiently determine whether a
| duplicate is possible, but I've not thought through that fully.
| SaddledBounding wrote:
| 'e' ^ 'f' ^ 'g' == 0x64 == 'd'. Uh-oh.
|
| Edit: I misread the post; my bad. The post effectively uses a bit
| vector to store the last N chars in a window, and the bit vector
| happens to fit in a single machine word. Also, XOR happens to be
| a good way to update the bit vector, because it turns out it's
| sufficient to store how many times each character appears in the
| window mod 2.
|
| So to be clear, my "demonstration" above only works because the
| ASCII representations of e, f, and g are not linearly independent
| with respect to xor. However in the article, the representations
| of all of the characters are chosen to be a linearly independent
| set.
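|
| A small Rust demonstration of that point (the letters here are
| just an example):
|
|     fn main() {
|         let raw = |c: char| c as u32;
|         let onehot = |c: char| 1u32 << (c as u32 - 'a' as u32);
|
|         // Raw codepoints: three distinct letters XOR down to 'd'.
|         assert_eq!(raw('e') ^ raw('f') ^ raw('g'), raw('d'));
|
|         // One-hot encodings: XOR keeps one bit per distinct letter.
|         let m = onehot('e') ^ onehot('f') ^ onehot('g');
|         assert_eq!(m.count_ones(), 3);
|         println!("raw: {:#x}, one-hot popcount: {}",
|                  raw('e') ^ raw('f') ^ raw('g'), m.count_ones());
|     }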
| jandrese wrote:
| The example code in the article left out the lookup table to
| convert each character into a single bit representation. All
| the tricky xor and popcount stuff could have been a 26 byte
| array just as easily and been O(n).
| robinhouston wrote:
| There's no lookup table. The example code in the article does
| the conversion using this expression:
|
|     1 << (s[i + j] as u32 - 'a' as u32)
| WorldMaker wrote:
| Yes, for some additional explanation for those that might
| use it, that's "ASCII math": it assumes all lowercase ASCII
| Latin letters (which most Unicode encodings including UTF-8
| and UTF-32 inherit). ASCII was intentionally designed so
| that the lowercase letters a-z are in English alphabet
| order and so subtracting 'a' from the letter gives you an
| alphabet position number from 0-25. (The same applies in
| ASCII for the upper case range A-Z and the numerical range
| 0-9, though doing math _between_ ranges is less fun.)
|
| Then you've just got a standard single 1 left shifted 0-25
| positions. (So 'a' is 1 and 'z' is 1*2^25.)
| quietbritishjim wrote:
| Ah OK, the article is actually suggesting making a list of
| Booleans whose length equals the number of possible characters.
| It just happens that it's assuming 26 allowed characters, which
| fit in the bits of a 32-bit integer.
|
| The running time does not depend on the window length, but
| _does_ depend on the number of possible characters. If it's all
| of Unicode, for example, you'd be stuffed: you could fit the
| vector of values in memory (there are currently about 150,000
| assigned Unicode code points), even if you used a byte per
| value, but counting the number of true values would require
| iterating over the whole vector.
|
| Even just going from 26 Latin letters to 256 byte values makes
| this trick quite messy unless your language has a really nice
| bit vector type (admittedly many do).
|
| This comment was helpful for me to understand what was going
| on, even though it's a mistake followed by a correction.
| xnorswap wrote:
| I had considered this approach, but the nature of AoC means it
| rewards real-world time to solution rather than any kind of
| computation time, and a naive solution still runs fast enough
| that the time saved writing it more than makes up for the slower
| runtime.
|
| There are actually whole classes of problems which would be more
| interesting if the naive solution wasn't fast enough; for example
| even on day 7 (or was it 8?) naively exploring every edge from
| every square would be O(N^3) but still execute just fine.
|
| I'm a couple of days behind but so far this year hasn't yet had
| lanternfish style problems where the naive solution blows up
| completely, but hopefully they will come as they are a lot more
| interesting.
| ladberg wrote:
| If you aren't already familiar with it, you might find
| projecteuler.net fun!
| kadoban wrote:
| codeforces.com or one of those is slightly more likely to be
| fun for programmers.
|
| projecteuler.net is great, but it _very_ quickly becomes deep
| math.
| n4r9 wrote:
| Arguably day 11 part 2 is such a "lanternfish" problem,
| although it essentially tells you to watch out for it.
| WJW wrote:
| I definitely ran into that. I read "impractically large
| number" and thought "well the language I'm using has
| arbitrary size integers so I should be fine". So, I changed
| the code from 20 iterations to 10000 iterations and suddenly
| all 16 GB of my RAM was filled with Santa's worry counters.
| nraynaud wrote:
| I love the Sunday quarterbacking of AoC, more than the
| competition itself (I'm not even doing it this year). And I agree
| with you that a quick reminder that some problems are "naively
| intractable" is a good wake up call once in a while.
| hairtuq wrote:
| On a related note, almost any task around shuffling or shifting
| bits can be improved with xor. For example, the straightforward
| way to remove bit i is
|
|     mask = -1 << i;
|     return (x & ~mask) | ((x >> 1) & mask);
|
| but with xor we can do it in one less instruction:
|
|     mask = -1 << i;
|     return ((x ^ (x >> 1)) & mask) ^ x;
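|
| A quick sanity check of the two forms in Rust (using u32 and
| u32::MAX in place of -1):
|
|     fn remove_bit_plain(x: u32, i: u32) -> u32 {
|         let mask = u32::MAX << i;          // bits i and above
|         (x & !mask) | ((x >> 1) & mask)    // keep low bits, shift high bits down
|     }
|
|     fn remove_bit_xor(x: u32, i: u32) -> u32 {
|         let mask = u32::MAX << i;
|         ((x ^ (x >> 1)) & mask) ^ x        // same result, one instruction fewer
|     }
|
|     fn main() {
|         for x in [0u32, 0b1011_0110, 0xDEAD_BEEF, u32::MAX] {
|             for i in 0..32 {
|                 assert_eq!(remove_bit_plain(x, i), remove_bit_xor(x, i));
|             }
|         }
|         println!("both forms agree");
|     }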
| [deleted]
| DannyBee wrote:
| You don't need to count ones at all, because you can stop looking
| at a window as soon as you discover it's not unique. As a result,
| you only have to detect if the bitmask is changed from adding a
| new character, not how many one bits there are in it. (This is
| true in the hash-set version as well - if insert returned false
| you can move on).
|
|     old bit mask = current bit mask
|     current bit mask = current bit mask OR new character
|     if old bit mask == current bit mask:
|         window is not unique, move to the next window
|     (otherwise the window is so far unique)
| meindnoch wrote:
| And how do you remove the character leaving the window on the
| left? I don't think this solution works the way you think it
| does...
| DannyBee wrote:
| Yes, I wrote it too quickly; you would have to xor characters
| in and out instead. I doubt it is worth it vs popcnt at that
| point.
|
| It doesn't change the fact that problem is incremental, and
| most importantly you can early exit windows as soon as you
| discover a single non-unique character. You don't have to
| wait till the end of the window. They don't implement this
| for hash set, for example.
|
| Since most (n>1) windows are not unique (well, more
| accurately the probability of the window being unique
| decreases as the size of the window approaches the number of
| possible characters), early exit from windows will likely
| beat bit munging except for small windows, until you get into
| SIMD solutions.
| rep_lodsb wrote:
| Bit operations are fast, branches are slow.
| DannyBee wrote:
| Well, yes, but most of these branches can be if-converted
| into conditional moves anyway if you want.
|
| So they are data dependencies and not control ones. There
| are still some control ones.
|
| Beyond that, let me be super clear: Imagine the following
| six versions:
|
| 1. One window at a time processing. No early exit, no bit
| munging.
|
| 2. One window at a time processing. No early exit, bit
| munging.
|
| 3. One window at a time processing. You don't use direct
| bit munging, but you early exit the window when you hit
| the non-unique character, and skip the window forward to
| the first instance of that non-unique character.
|
| IE given "hmma<something>", n=4, you early exit at the
| second m, and skip processing windows forward to right
| after the first m (since no window prior to that can be
| unique, as they will all hit the double m)
|
| 4. One window at a time processing, bit munging, same
| otherwise as #3
|
| 5. Sliding window processing, no bit munging
|
| 6. Sliding window processing, bit munging.
|
| The speedup going from 1-2 to 3-6 is much greater than that
| from 5 to 6 or from 3 to 4.
|
| You could probably make 3 pretty darn competitive with 6,
| and 4 could probably beat 6 with SIMD (IE processing
| multiple windows in parallel really fast, and maybe
| wasting work, vs guaranteeing you only do the minimum
| work to find a unique window)
| dahfizz wrote:
| I don't think you are correct. Certainly not with a window
| size of 4. For very large window sizes, the early return
| will be a big win. But you are really under-rating how
| expensive the hashmap will be vs "bit munging".
|
| You could roll your own perfect hash, but you would end
| up with a solution that looks almost identical to TFA. If
| you use a stdlib hash, you are going to be chasing
| pointers all over memory to do a single insert / lookup.
| De-referencing a single pointer not in cache costs 50 -
| 100 clock cycles on a modern system. By the time you do
| one insert, TFA will have XOR'd at least 32 chars into
| its bit mask. And your cache won't be as nice as in TFA,
| slowing you down even more.
|
| Given the two scenarios: 1) bit munging, no early return
| 2) hash lookup, early return
|
| I would guess scenario 1 wins until the window size gets above
| ~256. And obviously, we can add an early return to
| scenario 1 to make it unquestionably the fastest.
| rep_lodsb wrote:
| OR wouldn't work with a sliding window since it can't be
| inverted?
|
| This should work:
|
|     init bit mask and count of bits to 0
|     for each new char:
|         old = bit mask
|         bit mask = bit mask XOR old char
|         if bit mask > old then count++ else count--
|         old = bit mask
|         bit mask = bit mask XOR new char
|         if bit mask > old then count++ else count--
|         if count == window length: return match
|
| The idea is that each XOR will always set or clear exactly one
| bit, so we can maintain a count of set bits.
|
| But if there is a POPCNT instruction it would probably be
| faster to use it.
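|
| A sketch of that bookkeeping in Rust (names invented here,
| assuming lowercase ASCII input): toggling one letter at a time
| means the numeric comparison tells us whether a bit was set or
| cleared, so the set-bit count can be maintained without popcount.
|
|     fn first_unique_window(s: &[u8], w: usize) -> Option<usize> {
|         fn toggle(mask: &mut u32, count: &mut usize, c: u8) {
|             let old = *mask;
|             *mask ^= 1 << (c - b'a');
|             if *mask > old { *count += 1 } else { *count -= 1 }
|         }
|
|         let (mut mask, mut count) = (0u32, 0usize);
|         for i in 0..s.len() {
|             toggle(&mut mask, &mut count, s[i]);         // char entering
|             if i >= w {
|                 toggle(&mut mask, &mut count, s[i - w]); // char leaving
|             }
|             if count == w {
|                 return Some(i + 1);
|             }
|         }
|         None
|     }
|
|     fn main() {
|         assert_eq!(first_unique_window(b"aabcd", 4), Some(5));
|         println!("ok");
|     }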
| DannyBee wrote:
| My guess is, it's still faster to just check whether
| hashset.insert returns false. Since most windows are not
| unique early exit will likely beat faster processing.
| gpderetta wrote:
| For small windows it is possible that an early exit might
| be counterproductive as it exercises the branch predictor.
|
| One would have to test to see where the cutoff is, but
| often doing extra work is faster.
|
| Edit: but this is discussed elsethread.
|
| A fixed size run also helps with vectorization.
| dahfizz wrote:
| I doubt the hashmap would beat the XOR method from the
| article. Hashmaps means allocations, and it means hashing.
| Hashing is going to be at least as much work as XORing a
| couple values in the mask. Hashmaps also means chasing
| pointers all over memory and ruining your cache.
|
| You can add an early return to the OPs XOR method by just
| checking if the char is already in the bit mask. This will
| be faster if the window size is large.
| DannyBee wrote:
| the domain/range is fixed, so you don't have to allocate,
| actually, because you can have a fixed sized table and
| perfect hash :)
|
| That said, I agree you can make the bit munging faster in
| the end, i'm just saying i don't think the speed up is
| anywhere near the improvement from early-exiting.
| rep_lodsb wrote:
| Even if the hashmap library knew that the only valid keys
| were 'a' .. 'z', it wouldn't be magically faster. The
| best it could do is use basically the same code as a
| hand-rolled implementation.
|
| Bit operations and shifts take a single clock cycle, and
| the mask can be stored in a register throughout the
| entire loop.
|
| If "early exit" brings any improvement, I don't see why
| the best wouldn't be to combine the two solutions instead
| of choosing one or the other.
| dahfizz wrote:
| > Even if the hashmap library knew that the only valid
| keys were 'a' .. 'z', it wouldn't be magically faster.
| The best it could do is use basically the same code as a
| hand-rolled implementation.
|
| This is known as a perfect hash[1]. Knowing that you will
| never have collisions does allow for a faster
| implementation. The hash map can be backed by an array
| which will never need to be resized, and you don't have
| to fiddle with linked lists to chain collisions.
|
| You're correct though, that this is something you will
| have to implement yourself. Library hashmaps are going to
| trade performance for general usefulness.
|
| [1] https://en.wikipedia.org/wiki/Perfect_hash_function
| meindnoch wrote:
| You can combine the two XORs:
|
|     new_mask = old_mask XOR old_char XOR new_char
|     count += signum(new_mask - old_mask)
|
| Where signum(x) = (x > 0) - (x < 0)
| rep_lodsb wrote:
| If the old and new char are different, two bits will have
| changed state and comparing the old and new mask will have
| unreliable results.
|
| e.g.
|
| old mask = 011100
| new mask = 001111, numerically less even though it has one more
| bit set
| meindnoch wrote:
| Dangit! I concede...
| meindnoch wrote:
| This is basically the prefix sum technique in disguise: the sum
| over an interval is the difference between the prefix sums to the
| endpoints. The summation can be substituted with any group
| operation, in this case modulo-2 addition of 26-long vectors.
|
| Many leetcode-type problems are amenable to O(n) speedups using
| this technique.
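|
| A small sketch of that view in Rust (string and window chosen
| here just for illustration): precompute prefix XORs of the
| one-bit-per-letter masks, and the parity vector of any window is
| a single XOR of two prefixes, just as a range sum is a difference
| of two prefix sums.
|
|     fn main() {
|         let s = b"abacd";
|         let bit = |c: u8| 1u32 << (c - b'a');
|
|         // prefix[i] = XOR of the masks of s[0..i]
|         let mut prefix = vec![0u32; s.len() + 1];
|         for i in 0..s.len() {
|             prefix[i + 1] = prefix[i] ^ bit(s[i]);
|         }
|
|         // Parity mask of the window s[1..4] = "bac" in one operation.
|         let window = prefix[4] ^ prefix[1];
|         assert_eq!(window.count_ones(), 3); // a, b, c each appear once
|         println!("{:#034b}", window);
|     }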
| hanemile wrote:
| gamozolabs benchmarked this kind of stuff in a recent stream:
| https://www.youtube.com/watch?v=ru97E9xA0s4
| twic wrote:
| One small thing:
|
|     // Turn on bits as they enter the window
|     set ^= 1 << (s[i] as u32 - 'a' as u32);
|     // Turn off bits as they leave the window
|     if i >= window_size {
|         set ^= 1 << (s[i - window_size] as u32 - 'a' as u32);
|     }
|
| "turn on" and "turn off" actually mean "flip", right? Each bit is
| not a present/not-present flag, it's the parity of the number
| of occurrences. Which still works, because of the
| "counting how many characters appear an odd number of times in a
| window" thing.
| piersj225 wrote:
| You can also swap two variables without using a temp variable
| thanks to XOR
|
| https://stackoverflow.com/a/19618810
| rep_lodsb wrote:
| Unless they are the same memory location!
|
| http://underhanded-c.org/_page_id_16.html
|
| x86 CPUs have a dedicated instruction to swap two registers, or
| a register with memory.
| eru wrote:
| A note of caution: using that trick on modern hardware and/or
| languages is likely to be slower than just using the temporary
| variable. Because compilers and processors can see through the
| latter better than the XOR.
| rytis wrote:
| I wonder how long before this question starts popping up in
| interviews with an immediate reject if one does not come up with
| the magic XOR solution in 5 minutes on the whiteboard. EDIT: neat
| solution nonetheless!!
| mafuy wrote:
| This was basically part of a public set of Google interview
| questions around 2009 already ;)
| senderista wrote:
| Note that you could also solve this with AND/OR, with just a
| little more complexity. To "turn on" the bit for the new leading
| character as you slide the window forward, you of course OR it
| with the bitmask for the window, and to "turn off" the bit for
| the old trailing character, you AND its complemented bit-value
| with the bitmask. Not as elegant as XOR, but still works, and
| shows that XOR properties aren't essential to the solution.
| rep_lodsb wrote:
| If the same character occurs twice in the window, the second OR
| will have no effect but the AND will remove it.
|
| The trick is that XOR will leave the bit set if there is an odd
| number of occurrences, and the only way to have the bit count
| equal the window size is if every char is unique.
| DonHopkins wrote:
| Of all the combinational logic functions, XOR is the coolest.
| Even its name is cool! It sounds like an evil super hero, or an
| ancient god.
|
| ChatGPT did offer a useful suggestion I'd never heard of when I
| asked it to describe The Mighty XOR:
|
| >As I mentioned in my previous responses, XOR is a logical
| operation and does not have any physical form or abilities, so it
| cannot be a superhero. Therefore, it is not possible for a
| character named "The Mighty XOR" to be part of the Marvel
| universe or any other fictional universe, as they would not exist
| in reality. If you are interested in characters from the Marvel
| universe with powers related to digital logic or computing, you
| might consider the character of The Calculator from DC Comics,
| who has the ability to use his super-genius intellect to perform
| complex calculations and hack into computer systems. However,
| this character is not part of the Marvel universe and does not
| have the name "The Mighty XOR".
|
| https://en.wikipedia.org/wiki/Calculator_(character)
|
| >Calculator (Noah Kuttler) is a supervillain appearing in
| American comic books published by DC Comics. Originally
| introduced as one of many villains in Batman's rogues' gallery,
| the character was later redeveloped in the 2000s as a master
| information broker, hacker, and tactical supervisor to other
| supervillains, and foil to Batman's partner Oracle.
|
| >[...] Calculator suffers from severe obsessive-compulsive
| disorder, unbeknownst to his peers (even though this was hinted
| at when he was in charge of monitoring Supergirl), and initially
| controlled this with medication.
| nerder92 wrote:
| I ran the unoptimized version of the code through ChatGPT and it
| was able to optimize it similarly to the OP's solution.
|
| I was quite impressed; here is the ChatGPT version:
|
|     fn run(s: &[char], window_size: usize) -> usize {
|         let mut unique_chars = 0;
|         for i in 0..window_size {
|             unique_chars |= 1 << (s[i] as u32 - 'a' as u32);
|         }
|         if unique_chars.count_ones() as usize == window_size {
|             return window_size;
|         }
|         for i in 1..s.len() - window_size {
|             let prev = s[i - 1] as u32 - 'a' as u32;
|             let next = s[i + window_size - 1] as u32 - 'a' as u32;
|             unique_chars ^= 1 << prev;
|             unique_chars |= 1 << next;
|             if unique_chars.count_ones() as usize == window_size {
|                 return i + window_size;
|             }
|         }
|         panic!("No unique window found");
|     }
|
| // NOTE: my prompt was to make the code O(N)
| ajnin wrote:
| It's also overlooking that there is an actual hero named X-OR,
| which is the French name of Space Sheriff Gavan, a somewhat
| popular (at least in France) Japanese show of the prolific
| "heroes in shiny suits and rubbery monsters from space" genre.
|
| https://fr.m.wikipedia.org/wiki/X-Or
| NathanaelRea wrote:
| Isn't the memory just O(Unique) since the window size is bounded
| by the max number of unique characters (in this case 26)? In the
| general case, the hash would need to store one bit for each
| unique element.
|
| If you use a bit mask, you allocate all the memory up front. With
| a set you allocate at runtime, but you could just use set.add(char
| - 'a') for a similar memory bound. But both need to be able to
| store every unique element. They are both O(Unique), it just
| happens that 26 <= num_bits(u32).
| qsort wrote:
| Yes. It's a neat trick, but from an algorithmic point of view
| it's exactly the same as the obvious solution, with the hash-set
| replaced by a bitset. I did this with a counter in Python:
|
|     from collections import Counter
|
|     def first_diff(s, n):
|         cnt = Counter()
|         for i, c in enumerate(s, 1):
|             cnt[c] += 1
|             if i > n:
|                 top = s[i - n - 1]
|                 cnt[top] -= 1
|                 if cnt[top] == 0:
|                     del cnt[top]
|             if len(cnt) == n:
|                 return i
| brilee wrote:
| This trick is called zobrist hashing in chess/go programming. It
| allows for incremental calculation of the board hash, which is
| particularly useful because all of the game variations you're
| exploring are all nearly identical to one another, saving a lot
| of compute.
| FartyMcFarter wrote:
| It's reminiscent of Zobrist hashing in that it's an incremental
| hash calculated with xor. But Zobrist hashing has the crucial
| difference of using a random k-bit number to be xor'ed into the
| hash (k is a constant, typically 64 these days) for each member
| of the set that is present, instead of an N-bit string where N
| is the number of possible elements in the set.
|
| For example, in a chess engine: if the hash value for "white
| knight on f3 square" is 0x8BADF00D and the hash value for
| "white king on e4 square" is 0x1BADB002, then a chess board
| containing only "white knight on f3 and white king on e4" would
| be hashed as (0x8BADF00D xor 0x1BADB002) = 0x9000400F.
|
| The upside of this trick is that if your set can contain N
| different objects (e.g. 768 combinations of 12 different chess
| pieces on each of 64 squares), you don't need to use an N-bit
| hash value, which would be pretty big. In the particular
| problem described in this blog, this isn't an advantage as it's
| using a set containing only up to 26 letters, which neatly fits
| in a u32 even if you dedicate a bit to each letter.
|
| However there are downsides to Zobrist hashing - it's very hard
| to guarantee that you don't get hash collisions in this manner,
| as AFAIK you'd have to try all combinations of valid sets,
| which is prohibitively expensive. So every algorithm relying on
| this hash value either has to be robust to hash collisions, or
| it accepts a small probability of failure.
|
| Most importantly in this case, Zobrist hashing doesn't let you
| test whether a particular element is present in the set, nor
| does it let you count the number of elements in a set. So it
| wouldn't work as a solution to the problem in this blog, which
| requires counting how many unique letters are present in the
| hash value.
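|
| A small sketch of that incremental update in Rust, reusing the
| hash values from the example above (all the keys are made-up
| constants, as in the example):
|
|     fn main() {
|         let knight_f3: u64 = 0x8BAD_F00D;
|         let king_e4: u64 = 0x1BAD_B002;
|
|         // The board hash is the XOR of the keys of everything on the board.
|         let mut board_hash = knight_f3 ^ king_e4;
|         assert_eq!(board_hash, 0x9000_400F);
|
|         // A move updates the hash incrementally: XOR out the piece on its
|         // old square, XOR in the piece on its new square.
|         let king_e5: u64 = 0x0123_4567; // another made-up key
|         board_hash ^= king_e4;
|         board_hash ^= king_e5;
|         assert_eq!(board_hash, knight_f3 ^ king_e5);
|         println!("{:#x}", board_hash);
|     }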
| thomasahle wrote:
| Zobrist hashing is different. With Zobrist you have a table of
| random numbers. You look each key up in the table and xor the
| values.
| retrac wrote:
| It's pretty well-known but my favourite XOR trick is XOR linked
| lists. [1]
|
| It's possible to store a doubly-linked list using only a single
| pointer per node. For entry B, &B = &A ^ &C. You can walk the
| list in either direction as long as you have pointers to two
| neighbouring nodes in the list. Half as much memory overhead
| compared to an implementation where each node stores two
| pointers.
|
| [1] https://en.wikipedia.org/wiki/XOR_linked_list
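|
| A toy version of the idea in Rust, using indices into a Vec
| rather than raw pointers (layout and values invented here): each
| node stores its value and link = prev ^ next, with usize::MAX as
| the "no neighbour" marker.
|
|     const NIL: usize = usize::MAX;
|
|     struct Node { value: i32, link: usize }
|
|     // Walk from either end: the current index and the one we came
|     // from are enough to recover the other neighbour.
|     fn walk(nodes: &[Node], start: usize) -> Vec<i32> {
|         let (mut prev, mut cur) = (NIL, start);
|         let mut out = Vec::new();
|         while cur != NIL {
|             out.push(nodes[cur].value);
|             let next = nodes[cur].link ^ prev;
|             prev = cur;
|             cur = next;
|         }
|         out
|     }
|
|     fn main() {
|         // List 10 <-> 20 <-> 30 stored at indices 0, 1, 2.
|         let nodes = vec![
|             Node { value: 10, link: NIL ^ 1 }, // prev = NIL, next = 1
|             Node { value: 20, link: 0 ^ 2 },   // prev = 0,   next = 2
|             Node { value: 30, link: 1 ^ NIL }, // prev = 1,   next = NIL
|         ];
|         assert_eq!(walk(&nodes, 0), vec![10, 20, 30]);
|         assert_eq!(walk(&nodes, 2), vec![30, 20, 10]);
|     }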
| someweirdperson wrote:
| count_ones() is a challenge on its own.
|
| https://graphics.stanford.edu/~seander/bithacks.html#CountBi...
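|
| One of the classic branch-free counts from that page, transcribed
| to Rust as a sketch (count_ones() or a hardware popcnt is of
| course the easy way out):
|
|     fn popcount32(mut v: u32) -> u32 {
|         v = v - ((v >> 1) & 0x5555_5555);                 // count bits in pairs
|         v = (v & 0x3333_3333) + ((v >> 2) & 0x3333_3333); // sums per nibble
|         v = (v + (v >> 4)) & 0x0F0F_0F0F;                 // sums per byte
|         v.wrapping_mul(0x0101_0101) >> 24                 // add the four bytes
|     }
|
|     fn main() {
|         for x in [0u32, 1, 0b1011_0110, 0xDEAD_BEEF, u32::MAX] {
|             assert_eq!(popcount32(x), x.count_ones());
|         }
|         println!("matches count_ones()");
|     }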
| averne_ wrote:
| You can just use __builtin_popcount or equivalent, which maps
| to a single instruction on most platforms.
| jhoechtl wrote:
| > There's no moral to the story, other than "xor is cool"
|
| The moral is that knowledge is king: no matter how much iron you
| throw at a problem, a well-designed algorithm will beat it.
| deathanatos wrote:
| ...a simple histogram (the second proposed solution in TFA)
| runs "instantly", for both parts.
|
| Part of the meta-game to AoC is knowing that you can limit your
| answer to _only_ the requirements in the combination of the
| question and input given. If the naive solution runs fast
| enough, and is vastly quicker to implement, that's the one you
| want.
| mangamadaiyan wrote:
| Cache misses, TLB flushes, and pipeline stalls would beg to
| differ with you.
| toxik wrote:
| The true insight here is not using XOR, it's sliding the window
| efficiently. Here is pseudocode not using XOR.
|
|     members = [0] * len(alphabet)
|     for i in 0...input length:
|         members[input[i - window_size]] = 0 if bounds ok
|         members[input[i]] = 1
|         if sum(members) == window_size: return i
|
| If `members` is a bit vector and `sum()` is a popcount, this code
| is equivalent.
| [deleted]
| ErikCorry wrote:
| I had a very similar solution for my AoC in Toit, but I used a
| Deque instead of just having a trailing index into the input
| string. https://github.com/erikcorry/advent-of-
| code-2022/blob/main/d...
|
| Then a friend solved it with regular expressions, but I found a
| sicker regex. After all, that's what regexes are about.
|
| https://mobile.twitter.com/erikcorry/status/1600524753596456...
| peeters wrote:
| Note that for enumerable domains of fewer than 64 values, using a
| long plus bit ops has been a standard Set implementation for quite a
| while. For example, Java's EnumSet uses a long if the enum has
| less than 64 values:
|
| https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/c...
|
| Where add() uses `|= bitmask`, remove() uses `&= ~bitmask`, and
| size() uses a count of the 1's in the long.
|
| Adding XOR as an efficient toggle would be interesting, but
| unnecessary to keep this O(n), if I understand correctly. It's
| just toggling the value, so (albeit with an extra branch), you
| could implement it as:
|
|     if (!set.contains(val)) {
|         set.add(val);
|     } else {
|         set.remove(val);
|     }
| dentalperson wrote:
| I've seen this use of constants inside big O notation in leetcode
| and 'informal' discussions. Is it pedantic to say that O(NM) ==
| O(N) if M is a constant (in this case since it's bound by 26)? Or
| is this the current and expected usage?
| tkzed49 wrote:
| It's correct to omit constants from big O, but in the article W
| is a distinct input variable so the usage is valid. The size of
| the window W is bound by N, not a constant 26.
| ErikCorry wrote:
| But it's impossible for W to be bigger than 26 since you
| can't have 27 letters in a row that are all different.
|
| Normally for a search algorithm the size of the alphabet is
| assumed to be constant.
| bsza wrote:
| W cannot be larger than the size of the character set, so even
| the first algorithm runs in O(N) time.
|
| It's a cool trick, but it didn't make the algorithm
| asymptotically more efficient as the OP suggests.
| [deleted]
| TreeRingCounter wrote:
| thomasahle wrote:
| You could imagine a much larger character set though. For
| example you might be interested in a sliding window of unique
| words instead.
| bsza wrote:
| Yes, but then the last algorithm runs in the same time as the
| one before it. Using xor here really only gives you a
| constant factor improvement (which is still nice!).
|
| Reminds me of the problem "find the only missing number in a
| scrambled list of 1..n". You can get an "O(n)" solution using
| xor, but once bit size becomes a concern, there are less
| hacky solutions that are just as fast.
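|
| A sketch of that classic in Rust: the XOR of 1..=n cancels
| against the XOR of the scrambled list, leaving the missing
| number.
|
|     fn missing(n: u64, xs: &[u64]) -> u64 {
|         let all = (1..=n).fold(0, |acc, v| acc ^ v);
|         let seen = xs.iter().fold(0, |acc, &v| acc ^ v);
|         all ^ seen
|     }
|
|     fn main() {
|         assert_eq!(missing(5, &[4, 1, 5, 2]), 3);
|         println!("{}", missing(5, &[4, 1, 5, 2]));
|     }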
| thomasahle wrote:
| "Bit tricks" are worth a factor `log n`. Let me explain: If
| you have an array of `n` integers in {0,1}, and you want to
| count the number of 1s, you need `n` time. But if you have
| an array of `n` bits, with bit level operations you can do
| `n/W` time, where `W` is your word size. We often have
| W=32, 64 or 256 on some architectures, but you may still
| argue it's just a constant.
|
| The thing is that we usually say you have random access to
| all of your input data in constant time. But if your input
| size is `n`, your pointer size must be `log n`. So for
| standard algorithm analysis to work, you need to allow a
| word size of at least `log n`.
| bsza wrote:
| You're using n as a parameter of the machine's specs,
| which is a big no-no. You can't just come up with a
| bigger machine for each consecutive value of n. If that
| were allowed, all algorithms would trivially be O(1).
|
| We usually assume the pointer size to be infinite (e.g.
| RAM machine), to avoid having to deal with access time.
| But if we don't, then it must be a constant size (and we
| have to adjust the big O formulas accordingly). In
| neither of those cases do you get free arbitrary length
| xor operations. Just like you don't get e.g. free
| arbitrary length multiplications either.
|
| One nice property of xor, however, is that you can
| parallelize the heck out of it. So you can get arbitrary
| constant factor improvements by adding a bunch of CPUs.
| moloch-hai wrote:
| The moral is that POPCNT is critical to zillions of massive
| optimizations, and omitting it from base instruction sets, as has
| been done over and over and then later corrected, each time at
| great expense, is extremely foolish. The latest offender was
| RISC-V, but the overwhelming majority of x86 code is still
| compiled to a target version lacking it.
| [deleted]
| [deleted]
| adrian_b wrote:
| It is good to remember that the POPCNT instruction was first
| introduced into a computer's instruction set by Alan Turing, back
| in 1949, in the Ferranti Mark I (as "sideways add").
|
| Unfortunately few other computers included it before the Cray-1,
| which made it well known under its current name.
| js2 wrote:
| Sideways add makes a bit of sense for the name of this
| operation. But I can't figure out why Cray named it
| "population count." It just seems odd to refer to the set
| bits as part of a population that you're counting.
| pklausler wrote:
| We had it on the CDC 6600 in 1963 as 'Bi NXj' if I remember
| rightly.
| DannyBee wrote:
| Well, actually, in this case, no, POPCNT is unnecessary here,
| since you don't actually have to count anything. you only need
| to know if the bitmask changed when you added a new character
| ;)
|
| But otherwise, yes.
|
| I spent quite a while optimizing GCC's bitmap operations years
| ago, including implementing some new sparse bitmap types, and
| the sparse bitmap types i implemented ended up dependent on the
| speed of popcnt and friends, so yes.
| nebulous1 wrote:
| it's not sufficient that the newest character in the window
| is unique? the older characters need to be unique too.
| DannyBee wrote:
| I posted how to do it in another comment on the thread.
| nebulous1 wrote:
| If you're using OR (unlike the OP) how are you removing
| the left side of the window? And still, even if the
| newest character is unique, previous characters might not
| be
| rep_lodsb wrote:
| Input: "abccdef"
|
| Your algorithm will report a match at "bccd" since the
| "a" wasn't removed from the bitmask.
| DannyBee wrote:
| Yes, you are correct you would have to xor them back out
| actionfromafar wrote:
| I swear I have seen this conversation before. A glitch?
| DannyBee wrote:
| No - he responded on both parts of the thread where i
| commented. I started to respond here as well, then
| updated it to redirect to the other comment so we don't
| end up with two long threads.
| nerpderp82 wrote:
| If you XOR them together, you get back to the original
| conversation.
| syrrim wrote:
| In practice if you want the popcount of anything larger than
| a couple registers you'll want to use vector instructions
| anyways though. There are a lot of operations that will speed
| up certain applications if made into their own instruction,
| not all of them need to be.
| gpderetta wrote:
| Legend says that popcnt was the "NSA" instruction, that it was
| heavily used for cryptographic analysis, and that it was kept out
| of common instruction sets for a long time to give the NSA an
| advantage.
|
| It is probably just a legend though.
| kjs3 wrote:
| The usefulness of popcnt, et al, in cryptography was known at
| least as far back as Alan Turing. It wasn't 'kept out of'
| ISAs, if for no other reason than not all computer
| manufacturers were (are) American, so the NSA wouldn't have
| had much leverage to keep, say, Ferranti or Hitachi from
| including it in their computers.
|
| The legend you're _probably_ misremembering is the one where
| the NSA approached Seymour Cray at CDC while he was designing
| the 6600 super and 'suggested' that if he included a popcnt
| instruction in the ISA, the NSA would certainly look
| favorably on purchasing some. He did and they did (quite a
| few). This story is also possibly apocryphal.
| _a_a_a_ wrote:
| > The usefulness of popcnt, et al, in cryptography was
| known at least as far back as Alan Turing
|
| Ok. Why??????????????
|
| @gpderetta is correct, at least in that Hacker's Delight also
| says it was rumoured the NSA wanted popcount, though it was
| unclear to HD's author why they wanted it.
| rep_lodsb wrote:
| IIRC the Colossus machine built during WW2 was basically
| counting the number of bits resulting from doing various
| boolean operations on the input. It was used to crack the
| Lorenz cipher, which XORed the plaintext with a pseudo-
| random keystream (generated using mechanical rotors!).
|
| Cryptography has advanced since then - and I'm not an
| expert - but there may still be statistical weaknesses
| that could be found by counting bits?
| moloch-hai wrote:
| I think it is generally taken as useful in, specifically,
| cryptanalysis, presumably for estimating Hamming
| distance.
| snvzz wrote:
| >RISC-V
|
| Doesn't B have a POPCNT?
|
| B is mandatory in RVA22.
| moloch-hai wrote:
| Can you buy any RISC-V embodiment that implements such an
| instruction? It seems to be extremely optional.
| Vt71fcAqt7 wrote:
| B is pretty recent, but Sifive p650 will have it
|
| https://www.sifive.com/cores/performance-p650
|
| Here's another:
|
| https://www.andestech.com/en/2022/12/08/andes-technology-
| unv...
|
| I expect all future Linux cores to have it
| cbm-vic-20 wrote:
| These are the cpop and cpopw instructions.
|
| https://github.com/riscv/riscv-
| bitmanip/blob/main/bitmanip/i...
| iamevn wrote:
| If ASCII characters aren't guaranteed, would this solution still
| be able to scale? I wonder how the overhead from scanning the
| input to find the domain to use for the bitset compares to
| overhead of allocating a bunch of sets.
|
| Edit: More generally, I feel like these sorts of tricks are
| getting less relevant over time. With the prevalence of Unicode
| text how many clever interview question optimizations that assume
| characters are just another way of saying u8 are no longer
| relevant?
| [deleted]
| polynomial wrote:
| I am wondering why the bit masks are Big (not Little) Endian? I'm
| sure there's a reason the example is done that way, I just don't
| know what it is.
| mkeeter wrote:
| I wanted to put the 1 visually close to the equal sign, so you
| didn't have to scan all the way to the end of
| 0000000000000000000000000000000 to find it. Perhaps this is
| misleading, because the code places the set bit at the LSB!
| ivanstojic wrote:
| Isn't the real performance gained by eliminating the triple
| nested for loops and memoizing over a sliding window, rather than
| the use of XOR? You would get the same order-of-magnitude
| improvement if you used a map combined with iterate-only-once
| approach.
|
| I.e., it's undeniable that the final program works much faster, but
| the article should really be called "a smarter memoization
| trick."
| boredumb wrote:
| Something about seeing fun bit tricks solving higher level
| problems is exciting. As someone who writes higher level software
| in general using golang/rust/python to solve "boring business"
| problems, seeing someone using XORs to accomplish something feels
| like leaving my house and going camping for the weekend.
| foverzar wrote:
| Right? Things like fast inverse square root feel like total
| black magic.
| boredumb wrote:
| Very curious what you're up to on camping trips that involves
| black magic.
| timerol wrote:
| One fun repository of logic operations like this is
| https://graphics.stanford.edu/~seander/bithacks.html
| chasil wrote:
| This is vaguely reminiscent of the xor swap procedure, although
| it seems more practical.
|
| https://en.wikipedia.org/wiki/XOR_swap_algorithm
| [deleted]
| fifilura wrote:
| I think my intuition tells me that this is analogous to the
| convolutions you do in neural network image processing. Just that
| this is one dimensional.
| timvisee wrote:
| I did something similar and counted from the end, to make much
| bigger jumps!
|
| https://github.com/timvisee/advent-of-code-2022/blob/master/...
|
| Maybe a bit less neat in the binary sense, but definitely very
| fast.
| scrapheap wrote:
| That's a nice low-level solution to the problem - I have to confess
| that I just wrote a regular expression for it :)
| commandlinefan wrote:
| > I just wrote a regular expression
|
| Running time of O(MG)
| notorandit wrote:
| Very cool! ^
| dukoid wrote:
| FTR: This can be done with "only" two loops (opposed to the
| "naive" 3 in the article) -- without any extra data structures.
| It's sufficient to keep track of how many characters have been
| unique so far. For each "new" character, check whether it's
| different from all of them. If so, increase count. Otherwise,
| reset count to the distance to the match.
|
|     def main():
|         count = 1
|         for i in range(1, len(SIGNAL)):
|             for j in range(0, count):
|                 if SIGNAL[i - j - 1] == SIGNAL[i]:
|                     count = j
|                     break
|             count = count + 1
|             if count == 4:
|                 print(i + 1)
|                 break
|
| If we consider the word length a constant this should be O(n).
| flatline wrote:
| That was my conclusion too, this is a super easy problem with a
| fixed window length. The bit hashing is neat but I was too
| confused by what seemed to be an over-complication of the
| problem to really appreciate it. Did you and I both miss
| something here?
| dukoid wrote:
| I don't think so. As much as I love using bit operations: in
| this case I'd actually prefer a table of character counts for
| a "true" O(n) solution, as bit counting isn't guaranteed to
| be a "native" operation.
| segmondy wrote:
| The article started with 3 loops, took it down to 2 then one.
| Why are you talking about 2 loops?
| dukoid wrote:
| The article took it down to 2 by adding an extra hashtable.
| It's possible to go down to 2 without using an extra data
| structure.
| gigatexal wrote:
| Can someone explain the 1 << stuff? I'm very new to bit fiddling
| but I did follow the blog.
|
| " fn run(s: &[char], window_size: usize) -> usize { let mut set =
| 0u32; for i in 0..s.len() { // Turn on bits as they enter the
| window set ^= 1 << (s[i] as u32 - 'a' as u32);
| // Turn off bits as they leave the window if i >=
| window_size { set ^= 1 << (s[i - window_size] as
| u32 - 'a' as u32); } // Check the
| current window and see if we're done if
| set.count_ones() as usize == window_size { return
| i + 1; } } panic!("No unique window
| found");
|
| } "
| dahfizz wrote:
| The "stuff" assigns an integer value to the char. 'b' - 'a' =
| 1, for example. This assigns each char a unique integer value.
| By shifting 1 up that many times, you assign a unique bit
| position to each char.
| alanning wrote:
| For those new to bitshifting, here are the results of the
| "stuff" operation that @dahfizz describes:
|     'a' - 'a' = 0, so take 1 and shift it left 0 times:
|     00000000000000000000000000000001 = 'a'
|     'b' - 'a' = 1, so take 1 and shift it left 1 time:
|     00000000000000000000000000000010 = 'b'
|     'c' - 'a' = 2, so take 1 and shift it left 2 times:
|     00000000000000000000000000000100 = 'c'
|
| (I'm not sure why the post's author chose to use
| 10000000000000000000000000000000 as their example for 'a'
| rather than the above which IIUC is how the code actually
| works.)
| OJFord wrote:
| '1 << stuff' = 'bit shift left 1 by stuff many positions'. Say
| you have a byte, 8 bits, '1' is 00000001 - if we shift it left
| 3 bits we have 00001000 (8).
|
| 'Stuff' there is just any expression (to understand separately)
| that evaluates to the number of positions to shift; `<<` is the
| operator for bit-shifting (left, cf. `>>`).
|
| In brief `(s[i - window_size] as u32 - 'a' as u32)` is finding
| the character that just left the window on the left,
| represented as a number, starting from 'a' as 0 so that all 26
| fit inside 32 bit positions.
| gigatexal wrote:
| Thank you all who kindly took to answering my question. It
| makes sense now!
| jsd1982 wrote:
| 1 << n sets a 1 bit at the nth bit position. << is the left
| shift operator.
| twic wrote:
| This seems a bit overcomplicated to me.
|
| Using a bitmap to track letters you've seen is a great idea. But
| the parity and counting gives me a headache. How about you start
| with a zero-size window at the start of the string, then
| iteratively try to grow it by moving the end forward until it's
| long enough, and if the character added would be a duplicate,
| moving the start forward until it's no longer a duplicate? Sort
| of inchworm-style movement.
|
| I am not smart enough to write Rust, so here it is in Java:
|     private static int run(String input, int windowSize) {
|         int windowStart = 0; // start at the start
|         int windowEnd = 0;   // the window is initially zero-size
|         int lettersSeen = 0; // a bitmap of letters we have seen
|         while (windowEnd - windowStart < windowSize) {
|             if (windowEnd >= input.length())
|                 throw new IllegalArgumentException("No unique window found");
|             int letterToAdd = 1 << input.charAt(windowEnd) - 'a';
|             while ((lettersSeen & letterToAdd) != 0) {
|                 // duplicate, slide the start until we drop the original
|                 int letterToRemove = 1 << input.charAt(windowStart) - 'a';
|                 // sanity check that we have seen this letter
|                 assert (lettersSeen & letterToRemove) == letterToRemove;
|                 lettersSeen ^= letterToRemove;
|                 ++windowStart;
|             }
|             // add the new letter
|             // sanity check that we have not yet seen this letter
|             assert (lettersSeen & letterToAdd) == 0;
|             lettersSeen ^= letterToAdd;
|             ++windowEnd;
|         }
|         return windowStart;
|     }
|
| This is not quite optimal in terms of bit operations - I mask the
| bitmap in the inner loop condition, but you could just test
| letterToRemove against letterToAdd and break if they're the same.
| I think that's a bit less readable though.
|
| Note that although this has two nested loops, they don't both
| independently range over the whole string, so this is not O(n^2).
| The outer loop moves windowEnd over the whole string, and the
| inner one moves windowStart over the whole string bit by bit.
| Each index visits each position in the string at most once.
| nerpderp82 wrote:
| I asked ChatGPT for a Rust port, https://play.rust-
| lang.org/?version=stable&mode=debug&editio...
|
| I had to make zero changes, it appears to work.
| feoren wrote:
| This is clever: you let the window size vary such that you
| never allow two copies of the same character in your window,
| and just stop when your window size reaches the target size.
| Seems like it should work well. But:
|
| > i mask the bitmap in the inner loop condition, but you could
| just test letterToRemove against letterToAdd and break if
| they're the same
|
| You still need to check ((lettersSeen & letterToAdd) != 0) once
| to know whether you need to enter that loop at all. Worst case
| (a long string of the same letter) you do this twice per letter
| in the input anyway.
| n4r9 wrote:
| I think the claim at the end is supposed to say `O(D)` running
| time rather than `O(N)` (or I can't see where `N` is defined).
|
| Does this technique catch letters duplicated more than once?
|
| In a Python 3 shell:
|
|     >> 2 ^ 4
|     6
|     >> 2 ^ 4 ^ 2
|     4
|     >> 2 ^ 4 ^ 2 ^ 2
|     6
|
| Yes I guess it does, because inputs are used up in order to turn
| bits on and off. Nice!
| Robin_Message wrote:
| The insight is that the only way to get W bits set is for the
| last W letters to all have been unique. Duplicate letters
| toggle bits on and off, but never get you any extra 1 bits set.
|
| If you wanted to detect V unique characters from a window of W
| characters where V < W, this trick wouldn't work. But you could
| still have a rolling window, it would just be a map counting
| "how many of this character in the window", which you can
| increment the relevant character for as it enters the window,
| and decrement as it leaves.
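|
| A rough sketch of that rolling count map in Rust (assuming
| lowercase ASCII and V <= W; all names invented here):
|
|     fn window_with_v_unique(s: &[u8], w: usize, v: usize) -> Option<usize> {
|         let mut counts = [0u32; 26];
|         let mut distinct = 0;
|         for (i, &c) in s.iter().enumerate() {
|             let idx = (c - b'a') as usize;
|             counts[idx] += 1;
|             if counts[idx] == 1 { distinct += 1; }     // letter entered the window
|             if i >= w {
|                 let out = (s[i - w] - b'a') as usize;
|                 counts[out] -= 1;
|                 if counts[out] == 0 { distinct -= 1; } // letter left the window
|             }
|             if i + 1 >= w && distinct >= v {
|                 return Some(i + 1); // end position of the first such window
|             }
|         }
|         None
|     }
|
|     fn main() {
|         // "aabc" is the first window of 4 with at least 3 distinct letters.
|         assert_eq!(window_with_v_unique(b"aabcaabb", 4, 3), Some(4));
|     }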
| mkeeter wrote:
| > I think the claim at the end is supposed to say `O(D)`
| running time rather than `O(N)` (or I can't see where `N` is
| defined).
|
| Good catch; this should be fixed (using N for input length and
| W for window size consistently in the writeup).
| [deleted]
| thomasahle wrote:
| One issue with this technique is when the set of characters grow
| above 64, so you can no longer give each one a unique bit.
|
| This could happen if we are instead interested in a window of
| unique words in a long document.
|
| Amazingly there is still a solution based on XOR!
|
| It is a variant of Bloom Filters, which I know everyone loves,
| but it uses XOR instead of OR to combine things.
|
| Basically each word is hashed to three locations and you flip
| those bits. Then you have to do some probabilistic analysis.
|
| https://arxiv.org/abs/2211.03683
|
| Edit: Actually the technique in this paper would take linear time
| to determine uniqueness. I wonder if there's a way to do it in
| constant time...
|
| I guess you could just implement linear probing or cuckoo hashing
| in bits. But then update time would be amortized constant only.
| nynx wrote:
| This article's technique can definitely expand into dynamically
| allocated bitsets.
| FartyMcFarter wrote:
| In that case the complexity becomes O(N * A) where A is the
| size of the alphabet, since at every step you have to go
| through the whole bitset in order to count how many bits are
| set.
|
| Edit - actually it should be possible to update the count
| incrementally, so it should still be O(N) I think.
| thomasahle wrote:
| Say we don't want to allocate memory dynamically though. We
| just want to allocate O(W) bits at the start of the program.
| Can you still do it?
| at_compile_time wrote:
| >One issue with this technique is when the set of characters
| grow above 64, so you can no longer give each one a unique bit.
|
| You can xor arrays of integers.
| thomasahle wrote:
| Right, but if the universe of possible characters grow much
| larger than the window size (as it does in many
| applications), we want a method that only pays in terms of
| window size, not universe size.
| twic wrote:
| So you use a sparse bitmap, ie a hash set. The size of the
| hash set is bounded to the size of the window, so this
| should be pretty efficient.
|
| If you have a huge string, a huge alphabet, and also a huge
| window, then probabilistic techniques start to make sense.
| But let's not run while we can successfully walk!
| thomasahle wrote:
| Sure. We can always use a hash set, but then you need to
| use chaining or some other collision resolution policy.
| The whole point of this exercise is to avoid that and get
| something really fast and truly constant time :-)
| mjburgess wrote:
| I may be missing something, but isn't the obvious answer just to
| scan-and-compare 4 chars at a time? (Noting that != is XOR.)
|
|     #include <stdio.h>
|
|     int main(void) {
|         char *c = 0, *data = "nznrnfrfntjfmvfwmzdfjlvtqnbhcprsg";
|         for (c = data;
|              !((c[1] != c[0]) &&
|                ((c[2] != c[1]) && (c[2] != c[0])) &&
|                ((c[3] != c[2]) && (c[3] != c[1]) && (c[3] != c[0])));
|              c++);
|         printf("%ld", c - data + 4);
|     }
| leipert wrote:
| 2nd part is a window of length 14.
| aarestad wrote:
| Sure, but now do 14 at a time with a similar algorithm (which
| was part 2 of the problem).
| mjburgess wrote:
| I've just submitted in order to see the second part, which
| just says "now look for 14 unique".
|
| I don't see the problem here. If you want a notation to
| express it, a macro for something like
|
|     0 != 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
|     1 != 0, 2, 3, 4, 5, 6, 7, ...
|
| is straightforward, and could be parameterised on the window
| size.
| jstanley wrote:
| That's fine, but it's O(n) in the window size whereas the
| xor trick is O(1).
| mjburgess wrote:
| The window size is always constant in the problem, so
| it's O(1)
|
| There shouldnt be a loop over the window size in
| solutions to this (particular) problem
| deathanatos wrote:
| TFA's implementation is using a loop over the window size
| to compare each character to all the other characters, to
| see if they're unique. So, the article's O(N * W^2) is
| correct, but I agree, the window size is fixed and small;
| it's fair to boil it down to just O(N).
|
| (I never even considered doing a for loop like that to
| determine if the window is unique; I just histogram the
| window, and then check if all the values in the histogram
| are 1. If yes, unique. If I had been cleverer, I wouldn't
| rebuild the histogram as the window shifts, but there was
| no need for that. This is essentially the next solution
| TFA presents.)
| deathanatos wrote:
| Part 1 and part 2 are the same, only the window size
| varies. But within a part, the window size is constant.
| For most, it should have just meant that, if you hard-
| coded the window size during part 1, that you need to
| make it a parameter.
|
| But for both parts, the naive solution should be O(N *
| W), where N is the input length, and W is the window
| size. Like a sibling says, since within a part the window
| size is fixed, it is acceptable to call it O(N).
|
| Edit: ah, I see what his solution is doing. It is
| correct. I need to adjust my definition of naive, I
| guess. His is O(N * W^2). You can still reasonably
| consider W^2 constant, though, I think. And he gets to
| what I would call the "naive" solution.
|
| (Previously.) ~The "naive solution" in the OP does not
| appear to be correct. Or at least, if it does work,~ it
| appears to do far more computation than it needs to.
|
| The "xor trick" is O(N), too. (There cannot be a more
| efficient solution, as the entire input must be
| considered in the worst case. TFA is correct that its
| final form omits W, though.)
| deathanatos wrote:
| If you consider W the window size, and N the input size, this
| is still O(N * W^2), as TFA states. (TFA's implementation is
| generic over W, yours is constant and essentially -funroll'd in
| place. But it's the same, for big-O; as you mention in your
| reply later, you could even make a macro, generic over W.)
| mihaic wrote:
| Am I the only one bothered that he didn't first optimize the
| HashSet solution to O(N)? When sliding the window, you increase
| the counter for the new element, decrease the counter for the
| leaving element, and keep track of how many elements have their
| counter set to 1.
|
| That makes the case for using popcount a bit weaker, since popcount
| is actually O(W / word size), which happens to be O(1) in this case.
| DannyBee wrote:
| You don't even have to do this.
|
| If HashSet.insert ever returns false, the character was already
| in the set, and the window is not unique, so you can early exit
| and move on. If you get through the window without insert
| returning false, it is unique. You don't need to do anything
| else (like check the length of the hash set at the end)
| TYMorningCoffee wrote:
| I think he meant to use a hash map plus a numOfOnes counter
| instead of hash set. With a hash map, you don't need to build
| the entire window every loop. It's only a constant amount of
| work per iteration, achieving O(N)
| DannyBee wrote:
| Yes. You can slide a hash map and a counter, as we went
| through below. You can also replace the hash map with a
| bitmask to store the set.
|
| You actually don't need any extra anything at all if you
| want, as you can just keep a single count because you don't
| need to look back at old windows, and can keep moving
| forward to the last non-unique point, making it O(N).
|
| I'm sure this being HN, someone will come along and post
| how to do this.
|
| I already got caught out once by trying to code it too fast
| so not gonna do it ;)
| kevinwang wrote:
| Thank you. I was wondering the same thing and I think the
| article is overfocusing on "cool bit tricks" while this simple
| approach works just fine.
| SquareWheel wrote:
| Without the cool bit trick, would it be worth publishing the
| article at all? I found it a novel approach to explore.
___________________________________________________________________
(page generated 2022-12-13 23:01 UTC)