Subj : Re: any function to handle this kind of counting? To : comp.programming From : Jon Harrop Date : Sun Jul 31 2005 04:21 pm Randy wrote: > Jon Harrop wrote: >> Randy wrote: >>>... >>>But just because there's no pretty solution doesn't mean there's no ugly >>>solution. >> >> I posted an elegant 10-line solution... > > You know, I'll bet you could get that down to one line if you wrote it > in APL. Of course, the OP would find APL equally as opaque as OCaml. > > If the OP's goal is a solution, an implementation in a familiar PL > would be good. If his goal is a generic function, then pseudocode > would be good. I have no idea what PL the OP would find "familiar". Most of the people that I know can understand OCaml or SML code (they were taught it as undergrads) better than C. >>>Either way, rebuilding the dynamic data structure as you go is gonna be >>>slow. >> >> No, that is completely wrong. What you are calling "dynamic data >> structures" can be asymptotically faster than the array-based approach. >> Performance of my list-based implementation could be improved upon by >> using sets or hash tables instead of association lists. > > Slow is a relative term. When running on a multi-GHz CPU with a GB or > RAM using two arrays of only 2000 elements, slow is still going to be > blazingly fast. I too am speaking asymptotically. In fact, if the > OP's volume of data were much larger, it'd be painful to build a > dynamic data structure that has to grow each time a new data exemplar > is encountered. If that's what the language's runtime or its > supporting library requires, given enough data, it will eat you alive. I don't understand what you mean but your original statement is wrong (see below). > That's why dynamic data structures are not popular among folks who > program for high performance. No. My background is in high-performance scientific computing and I've moved much of my work to OCaml specifically because it can be so much faster than C++ for non-trivial problems. > There is no way that a generic dynamic data structure is going to > outperform a specific static data structure, unless the latter uses > too much memory. Likewise, there's no way the former is going to > outperform a *specific* dynamic data structure, like the kind that can > easily be written in C, or probably better yet, C++. or OCaml or many other languages. It really doesn't make any difference. > BUT. I'll grant that it may be preferable to build solutions to > simple problems like these using higher level languages than C. And > the resulting program probably will be more elegant. And it might > even be fast enough to solve the user's problem within an acceptable > amount of time. Typically, OCaml code is often much faster than C code because it is easier to use a more appropriate algorithm. > Perhaps that language should be OCaml. But that's another discussion. Perhaps. I'm interested in learning about all languages. > That's interesting. I just wrote an array-based version in C and ran > it on my 2.2 GHz P4, and my timings were: > > % time ./a.out > 0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w > > For fun, I enlarged the arrays 10 fold (to 20,000), and got: > > 0.004u 0.000s 0:00.00 0.0% 0+0k 0+0io 0pf+0w > > Recompiling with -03, I enlarged the data 4 fold more, and got: > > % time ./a.out > 0.009u 0.001s 0:00.01 0.0% 0+0k 0+0io 0pf+0w > > So a brute force array-based implementation computes 40 times more > data in the same amount of time as OCaml (assuming equivalent > hardware). Or perversely, OCaml ran 40X slower. I've done a correct performance comparison below (after correcting your C code). > Clearly, crude array-based implementations are fast. And legible (see > below). No. As I have already shown, the array-based implementation can be slow for this problem. >> With a larger range [0..10^5), giving sparse arrays, the array-based >> method requires a lot more memory (150Mb vs 5Mb) and is 500x slower than >> the tree-based method: > ... > > Yes. We agree. Doing this would be nutty. But that is exactly what your program does? > My problems with your reply are the following: > > 1) You solved a problem that's probably much more general than the > user needs. It's easy to use a simple array based approach that can > be resized, recompiled, and rerun until it *does* solve the user's > problem. Or size the array using two passes. It's not elegant. But > it's simple and fast and it gets the job done, given that this is > probably a one time task anyway (such as tallying homework grades > using student IDs). Are you saying that you don't like general solutions that are fast and robust? > 2) You solved the problem using a tool that is not useful to almost > anyone. Virtually nobody uses OCaml -- surely less than 1% of the > professional software world. (Perhaps that's because it supports > arrays so poorly. ;-)) Regardless, the OP is not going to purchase, > download, and learn a new programming language in order to solve so > simple a problem. In this case, OCaml's superior elegance is > overwhelmed by its inferior popularity (and its unusual semantics). We obviously don't walk in the same circles... > Had your solution been written in a common PL like C, C++, Java, or > even Python, Ruby or Perl, which the user might possibly know or might > get assistance with from coworkers or friends, the code would have > been of more than academic interest. I have no idea what the OPs background is. For me, I know the following numbers of people who are at least somewhat familiar with the following languages: 10 C++ 10 C 7 OCaml 3 Java 2 Perl 1 Python 1 Ruby Had I written my solution in C it would have been incomprehensible. With some luck, it could have been correct though... > 3) The OP asked for a function. I doubt he could readily discern the > algorithm beneath your implementation. I can't. Ok. I'm not sure what I can do about that. > 4) You overstate the cost of using C, especially for small tasks, It took me 15 minutes just to find out why your C code was segfaulting. > Here's the C program that I timed, BTW: > > " > #include > #include > #include > > #define VECLEN 2000 > #define NUMNUM 10 If I change this magic number to 1000 then your C code segfaults for a non-trivial and undocumented reason. Note that the OP didn't specify the range of NUMNUM. Had the code not happened to segfault, it would silently produce corrupted results. > #define NUMCHARS 95 > > void main() That should be "int main()". > { > char vec1[VECLEN], vec2[VECLEN]; You have used limited-precision types here. We don't know the range held in vec1. Strictly, we don't even know if vec2 should be a string array. The use of "signed char" for vec1 is the cause of the segfault for NUMNUM=1000. It causes later code to read off the end of an array. > int counts[NUMNUM][NUMCHARS]; Like mine, this array-based implementation is bad because it unnecessarily requires O(NUMNUM) space and we don't know how big NUMNUM is. > int c, d; > > // fill the integer vector with 10 random numerals (0 - 9) > for (c = 0; c < VECLEN; c++) > vec1[c] = (int) (((double) rand() / (double) RAND_MAX) * (double) > NUMNUM); This could be: vec1[c] = rand() % NUMNUM; > // fill the char vector with 95 random letters (' ' - Z) > for (c = 0; c < VECLEN; c++) > vec2[c] = (int) ' ' + (int) (((double) rand() / (double) RAND_MAX) > * (double) NUMCHARS); Similarly. > // zero out the count array > for (c = 0; c < NUMNUM; c++) > for (d = 0; d < NUMCHARS; d++) > counts[c][d] = 0; If the arrays were allocated dynamically then you could have used calloc as a shorter and faster replacement for this code. > // increment the entry in the 2D count array that corresponds > // to pair of values from vec1 and vec2 at offset 1:VECLEN > for (c = 0; c < VECLEN; c++) > counts[vec1[c]][vec2[c] - (int) ' ']++; > > /* printout elided, to allow for comparable timing > for (c = 0; c < NUMNUM; c++) > for (d = 0; d < NUMCHARS; d++) > if (counts[vec1[c]][vec2[c]]) > printf( "%d - %d\n", c, d + (int) ' ', counts[c][c]); > */ > } > " Replacing the "char" arrays with "int" arrays, I get: $ gcc -O3 collate.c -o collate_c $ time ./collate_c 10 2000 real 0m0.003s user 0m0.000s sys 0m0.000s $ time ./collate_c 100000 2000 real 0m0.462s user 0m0.130s sys 0m0.050s So your C code is 4x longer and 15x slower than the map or hash table implementations in OCaml when NUMNUM=10^5. Incidentally, if I were to use a more mainstream language then I'd use C++ and the STL. The code would be much longer than the OCaml but equally legible/illegible. -- Dr Jon D Harrop, Flying Frog Consultancy http://www.ffconsultancy.com .