Subj : Re: any function to handle this kind of counting?
To   : comp.programming
From : Jon Harrop
Date : Sun Jul 31 2005 04:21 pm

Randy wrote:
> Jon Harrop wrote:
>> Randy wrote:
>>>...
>>>But just because there's no pretty solution doesn't mean there's no ugly
>>>solution.
>> 
>> I posted an elegant 10-line solution...
> 
> You know, I'll bet you could get that down to one line if you wrote it
> in APL.  Of course, the OP would find APL equally as opaque as OCaml.
> 
> If the OP's goal is a solution, an implementation in a familiar PL
> would be good.  If his goal is a generic function, then pseudocode
> would be good.

I have no idea what PL the OP would find "familiar". Most of the people that
I know can understand OCaml or SML code (they were taught it as undergrads)
better than C.

>>>Either way, rebuilding the dynamic data structure as you go is gonna be
>>>slow.
>> 
>> No, that is completely wrong. What you are calling "dynamic data
>> structures" can be asymptotically faster than the array-based approach.
>> Performance of my list-based implementation could be improved upon by
>> using sets or hash tables instead of association lists.
> 
> Slow is a relative term.  When running on a multi-GHz CPU with a GB or
> RAM using two arrays of only 2000 elements, slow is still going to be
> blazingly fast.  I too am speaking asymptotically.  In fact, if the
> OP's volume of data were much larger, it'd be painful to build a
> dynamic data structure that has to grow each time a new data exemplar
> is encountered.  If that's what the language's runtime or its
> supporting library requires, given enough data, it will eat you alive.

I don't understand what you mean but your original statement is wrong (see
below).

> That's why dynamic data structures are not popular among folks who
> program for high performance.

No. My background is in high-performance scientific computing and I've moved
much of my work to OCaml specifically because it can be so much faster than
C++ for non-trivial problems.

> There is no way that a generic dynamic data structure is going to
> outperform a specific static data structure, unless the latter uses
> too much memory.  Likewise, there's no way the former is going to
> outperform a *specific* dynamic data structure, like the kind that can
> easily be written in C, or probably better yet, C++.

or OCaml or many other languages. It really doesn't make any difference.

> BUT.  I'll grant that it may be preferable to build solutions to
> simple problems like these using higher level languages than C.  And
> the resulting program probably will be more elegant.  And it might
> even be fast enough to solve the user's problem within an acceptable
> amount of time.

Typically, OCaml code is often much faster than C code because it is easier
to use a more appropriate algorithm.

> Perhaps that language should be OCaml.  But that's another discussion.

Perhaps. I'm interested in learning about all languages.

> That's interesting.  I just wrote an array-based version in C and ran
> it on my 2.2 GHz P4, and my timings were:
> 
> % time ./a.out
> 0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
> 
> For fun, I enlarged the arrays 10 fold (to 20,000), and got:
> 
> 0.004u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w
> 
> Recompiling with -03, I enlarged the data 4 fold more, and got:
> 
> % time ./a.out
> 0.009u 0.001s 0:00.01 0.0%      0+0k 0+0io 0pf+0w
> 
> So a brute force array-based implementation computes 40 times more
> data in the same amount of time as OCaml (assuming equivalent
> hardware). Or perversely, OCaml ran 40X slower.

I've done a correct performance comparison below (after correcting your C
code).

> Clearly, crude array-based implementations are fast.  And legible (see
> below). 

No. As I have already shown, the array-based implementation can be slow for
this problem.

>> With a larger range [0..10^5), giving sparse arrays, the array-based
>> method requires a lot more memory (150Mb vs 5Mb) and is 500x slower than
>> the tree-based method:
> ...
> 
> Yes.  We agree.  Doing this would be nutty.

But that is exactly what your program does?

> My problems with your reply are the following:
> 
> 1) You solved a problem that's probably much more general than the
> user needs.  It's easy to use a simple array based approach that can
> be resized, recompiled, and rerun until it *does* solve the user's
> problem.  Or size the array using two passes.  It's not elegant.  But
> it's simple and fast and it gets the job done, given that this is
> probably a one time task anyway (such as  tallying homework grades
> using student IDs).

Are you saying that you don't like general solutions that are fast and
robust?

> 2) You solved the problem using a tool that is not useful to almost
> anyone.  Virtually nobody uses OCaml -- surely less than 1% of the
> professional software world.  (Perhaps that's because it supports
> arrays so poorly.  ;-))  Regardless, the OP is not going to purchase,
> download, and learn a new programming language in order to solve so
> simple a problem.  In this case, OCaml's superior elegance is
> overwhelmed by its inferior popularity (and its unusual semantics).

We obviously don't walk in the same circles...

> Had your solution been written in a common PL like C, C++, Java, or
> even Python, Ruby or Perl, which the user might possibly know or might
> get assistance with from coworkers or friends, the code would have
> been of more than academic interest.

I have no idea what the OPs background is. For me, I know the following
numbers of people who are at least somewhat familiar with the following
languages:

10 C++
10 C
7 OCaml
3 Java
2 Perl
1 Python
1 Ruby

Had I written my solution in C it would have been incomprehensible. With
some luck, it could have been correct though...

> 3) The OP asked for a function.  I doubt he could readily discern the
> algorithm beneath your implementation.  I can't.

Ok. I'm not sure what I can do about that.

> 4) You overstate the cost of using C, especially for small tasks,

It took me 15 minutes just to find out why your C code was segfaulting.

> Here's the C program that I timed, BTW:
> 
> "
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> 
> #define VECLEN 2000
> #define NUMNUM 10

If I change this magic number to 1000 then your C code segfaults for a
non-trivial and undocumented reason. Note that the OP didn't specify the
range of NUMNUM.

Had the code not happened to segfault, it would silently produce corrupted
results.

> #define NUMCHARS 95
> 
> void main()

That should be "int main()".

> {
> char vec1[VECLEN], vec2[VECLEN];

You have used limited-precision types here. We don't know the range held in
vec1. Strictly, we don't even know if vec2 should be a string array.

The use of "signed char" for vec1 is the cause of the segfault for
NUMNUM=1000. It causes later code to read off the end of an array.

> int counts[NUMNUM][NUMCHARS];

Like mine, this array-based implementation is bad because it unnecessarily
requires O(NUMNUM) space and we don't know how big NUMNUM is.

> int c, d;
> 
>    // fill the integer vector with 10 random numerals (0 - 9)
>    for (c = 0; c < VECLEN; c++)
>      vec1[c] = (int) (((double) rand() / (double) RAND_MAX) * (double)
>        NUMNUM);

This could be:

  vec1[c] = rand() % NUMNUM;

>    // fill the char vector with 95 random letters (' ' - Z)
>    for (c = 0; c < VECLEN; c++)
>      vec2[c] = (int) ' ' + (int) (((double) rand() / (double) RAND_MAX)
>        * (double) NUMCHARS);

Similarly.

>    // zero out the count array
>    for (c = 0; c < NUMNUM; c++)
>      for (d = 0; d < NUMCHARS; d++)
>        counts[c][d] = 0;

If the arrays were allocated dynamically then you could have used calloc as
a shorter and faster replacement for this code.

>    // increment the entry in the 2D count array that corresponds
>    //   to pair of values from vec1 and vec2 at offset 1:VECLEN
>    for (c = 0; c < VECLEN; c++)
>      counts[vec1[c]][vec2[c] - (int) ' ']++;
> 
>    /* printout elided, to allow for comparable timing
>    for (c = 0; c < NUMNUM; c++)
>      for (d = 0; d < NUMCHARS; d++)
>        if (counts[vec1[c]][vec2[c]])
>          printf( "%d - %d\n", c, d + (int) ' ', counts[c][c]);
>    */
> }
> "

Replacing the "char" arrays with "int" arrays, I get:

$ gcc -O3 collate.c -o collate_c
$ time ./collate_c 10 2000

real    0m0.003s
user    0m0.000s
sys     0m0.000s
$ time ./collate_c 100000 2000

real    0m0.462s
user    0m0.130s
sys     0m0.050s

So your C code is 4x longer and 15x slower than the map or hash table
implementations in OCaml when NUMNUM=10^5.

Incidentally, if I were to use a more mainstream language then I'd use C++
and the STL. The code would be much longer than the OCaml but equally
legible/illegible.

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com

.