* * * * *
                                        
                        Yet even more stupid benchmarks
                                        
Yet another silly optimization problem [1]. This time, from a silly coding
challenge [2] to find the number of integers expressible with unique digits
(that is, no single digit repeats) in a base-10 representation up to the
value 10,000,000,000 (there are 8,877,690 such numbers, by the way).

The neatest and fastest solution was final program on this page [3], written
in C#. It generates only such numbers; it doesn't try to test each number.
Since I don't use C#, I decided to translate the code in C to play around
with it. Wasn't all that hard:

> #include <stdio.h>
> #include <stdlib.h>
> 
> int total = 0;
> const int pre[(1 << 10) + 1] /* = { ... } */ ;
> 
> void generate2(
>         int maxlen,
>         int currentlen,
>         int availabledigits,
>         int currentvalue
> )
> {
>   int last = (currentlen == maxlen - 1);
>   int x    = availabledigits;
>  
>   while(x != 0)
>   {
>     int digit = pre[x ^ (x & (x - 1))];
>     x &= (x - 1);
>    
>     if (digit == 0 && currentvalue == 0)
>       continue;
>    
>     if (last)
>       ++total;
>     else
>       generate2(
>         maxlen,
>         currentlen + 1,
>         availabledigits & ~(1 << digit),
>         (currentvalue * 10) + digit
>       );
>   }
> }
> 
> int main(int argc,char *argv[])
> {
>   int len;
>  
>   for (len = 1 ; len <= 10 ; len++)
>     generate2(len,0,0xFFF >> 2,0);
> 
>   printf("total: %d\n",total);
>   return EXIT_SUCCESS;
> }
> 

I pregenerated the pre[] array since I wanted this to run as fast as
possible. The code used to generate the array:

> for (i = 0 ; i <= 10 ; i++)
>   pre[1 << i] = i;
> 

Anyway, once written and compiled (gcc -O4 -fomit-frame-pointer f.c) it ran
in about 0.2 seconds (average run) on a 2.6GHz (gigaHertz) machine. Fast, but
I could go faster by running it across the two CPU (Central Processing Unit)s
in the box. I was expecting about half the runtime, since this is easily
parallelizable.

It ran in about 0.16 seconds, a rather disappointing ¾ time. I commented out
the code in generate2() just to test the overhead of threading and
syncronization and that isn't a factor (program ran in 0.001 seconds).

Undaunted, I decided to try one of the quad-core boxes at The Office.
Reworked the code a bit to split the load between four CPUs as evenly as
possible, and ran some tests.

0.13 seconds on average. Still not quite half the speed.

Hmmm … 

[1] gopher://gopher.conman.org/0Phlog:2008/08/07.2
[2] http://beust.com/weblog/archives/000491.html
[3] http://www.indiangeek.net/2008/08/29/a-case-study-in-micro-optimization/

Email Sean Conner at sean@conman.org

.