Subj : Re: any function to handle this kind of counting?
To   : comp.programming
From : Randy
Date : Sun Jul 31 2005 06:57 am

Jon Harrop wrote:
> Randy wrote:
>>...
>>But just because there's no pretty solution doesn't mean there's no ugly
>>solution.
> 
> I posted an elegant 10-line solution...

You know, I'll bet you could get that down to one line if you wrote it 
in APL.  Of course, the OP would find APL equally as opaque as OCaml.

If the OP's goal is a solution, an implementation in a familiar PL 
would be good.  If his goal is a generic function, then pseudocode 
would be good.

I was trying to cover either contingency.  However, IMHO, an 
implementation in OCaml is unlikely to be what the OP had in mind. 
Pretty though it is...

....
>>Either way, rebuilding the dynamic data structure as you go is gonna be
>>slow.
> 
> No, that is completely wrong. What you are calling "dynamic data structures"
> can be asymptotically faster than the array-based approach. Performance of
> my list-based implementation could be improved upon by using sets or hash
> tables instead of association lists.

Slow is a relative term.  When running on a multi-GHz CPU with a GB or 
RAM using two arrays of only 2000 elements, slow is still going to be 
blazingly fast.  I too am speaking asymptotically.  In fact, if the 
OP's volume of data were much larger, it'd be painful to build a 
dynamic data structure that has to grow each time a new data exemplar 
is encountered.  If that's what the language's runtime or its 
supporting library requires, given enough data, it will eat you alive.

That's why dynamic data structures are not popular among folks who 
program for high performance.  Perhaps that's also why folks seldom 
use OCaml?  Probably not, since numerous PLs with noncompetitive 
performance have become increasingly popular in the past decade (e.g. 
Python, Ruby, Perl, Java, etc).

There is no way that a generic dynamic data structure is going to 
outperform a specific static data structure, unless the latter uses 
too much memory.  Likewise, there's no way the former is going to 
outperform a *specific* dynamic data structure, like the kind that can 
easily be written in C, or probably better yet, C++.

BUT.  I'll grant that it may be preferable to build solutions to 
simple problems like these using higher level languages than C.  And 
the resulting program probably will be more elegant.  And it might 
even be fast enough to solve the user's problem within an acceptable 
amount of time.  Perhaps that language should be OCaml.  But that's 
another discussion.

Regardless, I get the strong impression that you and I are embarking 
on yet another religious war, in which A advocates a superior way to 
see the world and B advocates a simpler less elegant way that works 
very nicely thank you and potentially runs a lot faster.  If so, I 
resign right now.

....
> 
> Here is an array-based implementation:
> 
> I get the following times with first elements in the range [0..10) and
> second elements in the range [32..127):
> 
> $ time ./collate 10 2000    List based
> 
> real    0m0.018s
> user    0m0.010s
> sys     0m0.000s
> $ time ./collate2 10 2000   Tree based
> 
> real    0m0.011s
> user    0m0.010s
> sys     0m0.000s
> $ time ./collate3 10 2000   Hash table based
> 
> real    0m0.008s
> user    0m0.000s
> sys     0m0.010s
> $ time ./collate4 10 2000   Array based
> 
> real    0m0.007s
> user    0m0.010s
> sys     0m0.000s
> 

That's interesting.  I just wrote an array-based version in C and ran 
it on my 2.2 GHz P4, and my timings were:

% time ./a.out
0.000u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w

For fun, I enlarged the arrays 10 fold (to 20,000), and got:

0.004u 0.000s 0:00.00 0.0%      0+0k 0+0io 0pf+0w

Recompiling with -03, I enlarged the data 4 fold more, and got:

% time ./a.out
0.009u 0.001s 0:00.01 0.0%      0+0k 0+0io 0pf+0w

So a brute force array-based implementation computes 40 times more 
data in the same amount of time as OCaml (assuming equivalent 
hardware).  Or perversely, OCaml ran 40X slower.  Clearly, crude 
array-based implementations are fast.  And legible (see below).

> With a larger range [0..10^5), giving sparse arrays, the array-based method
> requires a lot more memory (150Mb vs 5Mb) and is 500x slower than the
> tree-based method:
....

Yes.  We agree.  Doing this would be nutty.

> 
> Well-written C equivalents using the non-trivial data structures would
> require hundreds of lines of code. C is a very poor choice of language for
> this kind of work...

Just because I'm enjoying this exchange so much, I'll continue.

My problems with your reply are the following:

1) You solved a problem that's probably much more general than the 
user needs.  It's easy to use a simple array based approach that can 
be resized, recompiled, and rerun until it *does* solve the user's 
problem.  Or size the array using two passes.  It's not elegant.  But 
it's simple and fast and it gets the job done, given that this is 
probably a one time task anyway (such as  tallying homework grades 
using student IDs).

2) You solved the problem using a tool that is not useful to almost 
anyone.  Virtually nobody uses OCaml -- surely less than 1% of the 
professional software world.  (Perhaps that's because it supports 
arrays so poorly.  ;-))  Regardless, the OP is not going to purchase, 
download, and learn a new programming language in order to solve so 
simple a problem.  In this case, OCaml's superior elegance is 
overwhelmed by its inferior popularity (and its unusual semantics).

Had your solution been written in a common PL like C, C++, Java, or 
even Python, Ruby or Perl, which the user might possibly know or might 
get assistance with from coworkers or friends, the code would have 
been of more than academic interest.

3) The OP asked for a function.  I doubt he could readily discern the 
algorithm beneath your implementation.  I can't.

4) You overstate the cost of using C, especially for small tasks,

     Randy


Here's the C program that I timed, BTW:

"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define VECLEN 2000
#define NUMNUM 10
#define NUMCHARS 95

void main()
{
char vec1[VECLEN], vec2[VECLEN];
int counts[NUMNUM][NUMCHARS];
int c, d;

   // fill the integer vector with 10 random numerals (0 - 9)
   for (c = 0; c < VECLEN; c++)
     vec1[c] = (int) (((double) rand() / (double) RAND_MAX) * (double)
       NUMNUM);

   // fill the char vector with 95 random letters (' ' - Z)
   for (c = 0; c < VECLEN; c++)
     vec2[c] = (int) ' ' + (int) (((double) rand() / (double) RAND_MAX)
       * (double) NUMCHARS);

   // zero out the count array
   for (c = 0; c < NUMNUM; c++)
     for (d = 0; d < NUMCHARS; d++)
       counts[c][d] = 0;

   // increment the entry in the 2D count array that corresponds
   //   to pair of values from vec1 and vec2 at offset 1:VECLEN
   for (c = 0; c < VECLEN; c++)
     counts[vec1[c]][vec2[c] - (int) ' ']++;

   /* printout elided, to allow for comparable timing
   for (c = 0; c < NUMNUM; c++)
     for (d = 0; d < NUMCHARS; d++)
       if (counts[vec1[c]][vec2[c]])
         printf( "%d - %d\n", c, d + (int) ' ', counts[c][c]);
   */
}
"

-- 
Randy Crawford   http://www.ruf.rice.edu/~rand   rand AT rice DOT edu

.