Subj : Re: any function to handle this kind of counting? To : comp.programming From : Randy Date : Sun Jul 31 2005 06:57 am Jon Harrop wrote: > Randy wrote: >>... >>But just because there's no pretty solution doesn't mean there's no ugly >>solution. > > I posted an elegant 10-line solution... You know, I'll bet you could get that down to one line if you wrote it in APL. Of course, the OP would find APL equally as opaque as OCaml. If the OP's goal is a solution, an implementation in a familiar PL would be good. If his goal is a generic function, then pseudocode would be good. I was trying to cover either contingency. However, IMHO, an implementation in OCaml is unlikely to be what the OP had in mind. Pretty though it is... .... >>Either way, rebuilding the dynamic data structure as you go is gonna be >>slow. > > No, that is completely wrong. What you are calling "dynamic data structures" > can be asymptotically faster than the array-based approach. Performance of > my list-based implementation could be improved upon by using sets or hash > tables instead of association lists. Slow is a relative term. When running on a multi-GHz CPU with a GB or RAM using two arrays of only 2000 elements, slow is still going to be blazingly fast. I too am speaking asymptotically. In fact, if the OP's volume of data were much larger, it'd be painful to build a dynamic data structure that has to grow each time a new data exemplar is encountered. If that's what the language's runtime or its supporting library requires, given enough data, it will eat you alive. That's why dynamic data structures are not popular among folks who program for high performance. Perhaps that's also why folks seldom use OCaml? Probably not, since numerous PLs with noncompetitive performance have become increasingly popular in the past decade (e.g. Python, Ruby, Perl, Java, etc). There is no way that a generic dynamic data structure is going to outperform a specific static data structure, unless the latter uses too much memory. Likewise, there's no way the former is going to outperform a *specific* dynamic data structure, like the kind that can easily be written in C, or probably better yet, C++. BUT. I'll grant that it may be preferable to build solutions to simple problems like these using higher level languages than C. And the resulting program probably will be more elegant. And it might even be fast enough to solve the user's problem within an acceptable amount of time. Perhaps that language should be OCaml. But that's another discussion. Regardless, I get the strong impression that you and I are embarking on yet another religious war, in which A advocates a superior way to see the world and B advocates a simpler less elegant way that works very nicely thank you and potentially runs a lot faster. If so, I resign right now. .... > > Here is an array-based implementation: > > I get the following times with first elements in the range [0..10) and > second elements in the range [32..127): > > $ time ./collate 10 2000 List based > > real 0m0.018s > user 0m0.010s > sys 0m0.000s > $ time ./collate2 10 2000 Tree based > > real 0m0.011s > user 0m0.010s > sys 0m0.000s > $ time ./collate3 10 2000 Hash table based > > real 0m0.008s > user 0m0.000s > sys 0m0.010s > $ time ./collate4 10 2000 Array based > > real 0m0.007s > user 0m0.010s > sys 0m0.000s > That's interesting. I just wrote an array-based version in C and ran it on my 2.2 GHz P4, and my timings were: % time ./a.out 0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w For fun, I enlarged the arrays 10 fold (to 20,000), and got: 0.004u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w Recompiling with -03, I enlarged the data 4 fold more, and got: % time ./a.out 0.009u 0.001s 0:00.01 0.0% 0+0k 0+0io 0pf+0w So a brute force array-based implementation computes 40 times more data in the same amount of time as OCaml (assuming equivalent hardware). Or perversely, OCaml ran 40X slower. Clearly, crude array-based implementations are fast. And legible (see below). > With a larger range [0..10^5), giving sparse arrays, the array-based method > requires a lot more memory (150Mb vs 5Mb) and is 500x slower than the > tree-based method: .... Yes. We agree. Doing this would be nutty. > > Well-written C equivalents using the non-trivial data structures would > require hundreds of lines of code. C is a very poor choice of language for > this kind of work... Just because I'm enjoying this exchange so much, I'll continue. My problems with your reply are the following: 1) You solved a problem that's probably much more general than the user needs. It's easy to use a simple array based approach that can be resized, recompiled, and rerun until it *does* solve the user's problem. Or size the array using two passes. It's not elegant. But it's simple and fast and it gets the job done, given that this is probably a one time task anyway (such as tallying homework grades using student IDs). 2) You solved the problem using a tool that is not useful to almost anyone. Virtually nobody uses OCaml -- surely less than 1% of the professional software world. (Perhaps that's because it supports arrays so poorly. ;-)) Regardless, the OP is not going to purchase, download, and learn a new programming language in order to solve so simple a problem. In this case, OCaml's superior elegance is overwhelmed by its inferior popularity (and its unusual semantics). Had your solution been written in a common PL like C, C++, Java, or even Python, Ruby or Perl, which the user might possibly know or might get assistance with from coworkers or friends, the code would have been of more than academic interest. 3) The OP asked for a function. I doubt he could readily discern the algorithm beneath your implementation. I can't. 4) You overstate the cost of using C, especially for small tasks, Randy Here's the C program that I timed, BTW: " #include #include #include #define VECLEN 2000 #define NUMNUM 10 #define NUMCHARS 95 void main() { char vec1[VECLEN], vec2[VECLEN]; int counts[NUMNUM][NUMCHARS]; int c, d; // fill the integer vector with 10 random numerals (0 - 9) for (c = 0; c < VECLEN; c++) vec1[c] = (int) (((double) rand() / (double) RAND_MAX) * (double) NUMNUM); // fill the char vector with 95 random letters (' ' - Z) for (c = 0; c < VECLEN; c++) vec2[c] = (int) ' ' + (int) (((double) rand() / (double) RAND_MAX) * (double) NUMCHARS); // zero out the count array for (c = 0; c < NUMNUM; c++) for (d = 0; d < NUMCHARS; d++) counts[c][d] = 0; // increment the entry in the 2D count array that corresponds // to pair of values from vec1 and vec2 at offset 1:VECLEN for (c = 0; c < VECLEN; c++) counts[vec1[c]][vec2[c] - (int) ' ']++; /* printout elided, to allow for comparable timing for (c = 0; c < NUMNUM; c++) for (d = 0; d < NUMCHARS; d++) if (counts[vec1[c]][vec2[c]]) printf( "%d - %d\n", c, d + (int) ' ', counts[c][c]); */ } " -- Randy Crawford http://www.ruf.rice.edu/~rand rand AT rice DOT edu .