Subj : Re: any function to handle this kind of counting?
To   : comp.programming
From : Jon Harrop
Date : Mon Aug 01 2005 09:26 pm

Peter Ammon wrote:
>>>A data structure impedance mismatch is easier to deal with than a
>>>language mismatch.
>> 
>> You're probably going to end up copying the data structures in both
>> cases.
> 
> That's so.  But it's still substantially more difficult to modify your
> OCaml code to operate with ObscureCLibrary than it would be if you were
> all C.  Going from C->OCaml is probably even harder, but less likely to
> come up.

The easiest way to call C code from OCaml is to write an IDL interface and
compile it into bindings using camlidl. The main difference in complexity
is that you would have to create a formal definition of the C library's
interface. However, you would want to know the API in order to use the C
library from C code anyway, so it really isn't that different. For example,
you'd have to supplement the C declarations with "in", "out" or "in/out".

There are pathological cases (like the GLU tesselator) where the C interface
is not easily represented by OCaml's type system. However, it took me much
longer to get working C++ code that it did to write OCaml bindings to the
GLU tesselator.

>>>>3. Efficiency (a "home grown" set implementation in C is unlikely to
>>>>match the performance of OCaml's set implementation).
>>>
>>>Doubtful.  OCaml is very fast, but it still targets the same instruction
>>>set as C, and there are very fast C libraries as well.
>> 
>> No, my statement is true. Instruction sets are basically irrelevant.
> 
> The point was that there's no optimizations that are possible in OCaml
> that are not possible in C, at least in principle.

If you mean that you can write a C equivalent of any OCaml code then I think
that is incorrect. I believe there are parts of FPL compilers (probably
ocamlopt) that must be written in assembler. For example, some GCs.

> Or if there are, I 
> don't know about them :>  My C compiler does optimizations typically
> associated with functional languages, like transforming recursive
> functions to tail-recursive form and then applying tail-recursion
> optimization.

No, there are many optimisations that are performed by FPL compilers but not
by C compilers. Typically, either the optimisations don't make sense in C
(e.g. unboxing) or C code doesn't facilitate the optimisation by making the
necessary information available (e.g. referential transparency).

Languages like OCaml are at a huge disadvantage in terms of performance. It
is precisely these optimisations that allow the implementation to regain
much of the lost performance.

Out of curiosity, can your C compiler in-line a function when it is passed
by pointer to another function?

>> Specifically in the context of sets, the OCaml stdlib implementation will
>> return the union of two non-overlapping sets in O(1) time. That is
>> substantially faster than the O(n) of the STL set_union and all other
>> imperative set implementations that I have heard of.
> 
> That's really cool, if true! Can you point me at more information about 
> it?  All the references I can find say that OCaml's module Set uses
> red-black trees, and I don't see how you can get O(1) unions with those.

I think OCaml's set module uses AVT not RB trees. OCaml's implementation is
fast not because of the balancing, but because it is a purely functional
data structure and, therefore, it can exploit referential transparency.

Essentially, the STL can't reuse any of the previous sets (i.e. referencing
them) when creating the union/intersection/difference because you might
mutate the data in an old set and that would affect the new set. Instead,
the STL must explicitly copy all of the elements that it uses.

In contrast, OCaml "knows" that your set is immutable (a functional data
structure) so it is free to refer back to any old bits of tree that it can
find when constructing the result. Consequently, OCaml's implementation is
much faster whenever it can reuse old bits of tree, such as when computing
the union of two equal-sized non-overlapping trees.

I wrote a program in C++ and OCaml to compute the set of "n"th-nearest
neighbours of a given atom in a conventional condensed matter physics
simulation of an amorphous material (it also applies to proteins, etc.).
The OCaml code is less than half the length and 100x faster than the C++
implementation. Both implementations can be optimised, of course. To
optimise the C++ you must compute set unions without the STL's set_union
function. The optimisations are all rather complicated.

Anyway, the point is that OCaml is much closer to the sweet spot on the
writing-time vs running-time for a wide variety of programs.

>>>>4. Capability (OCaml's built-in hash function can traverse any data
>>>>structure).
>>>
>>>I don't understand what you mean by this.
>> 
>> The GC must be able to traverse any data structure using only information
>> available at run-time (OCaml strips out as much run-time type information
>> as possible in order to evade run-time type checking). OCaml also comes
>> with several functions (e.g. polymorphic comparison operators) that use
>> this capability to traverse any data structure in order to compare, hash
>> and so forth.
> 
> Interesting.  But wouldn't that make hashing a large data structure
> expensive?

Yes. OCaml's built-in hash function stops after 100 data, IIRC. You can
supply your own hash function, of course.

Ideally, you don't want to repeatedly hash big data structures in any
language. So you'd normally compute the hash once and store it in the data
structure ready to be looked up.

>>>For the record, here's how I'd do it in Cocoa/Objective-C, using a hash
>>>table:
>>>
>>>NSMutableDictionary* dict = [NSMutableDictionary dictionary];
>>>NSEnumerator* enumer = [input objectEnumerator];
>>>NSString* string;
>>>while ((string = [enumer nextObject]))
>>>[dict setObject:[NSNumber numberWithInt:1 + [[dict objectForKey:string]
>>>intValue]] forKey:string];
>>>
>>>Certainly nothing approaching "hundreds of lines!"  If anything, it's
>>>shorter than the OCaml approach.
>> 
>> That's also very nice. I can't understand it and I don't know if it is
>> correct (e.g. does it compute the same result as my OCaml?).
> 
> If I understand your code correctly, I think so.  It computes a
> dictionary of string->number of appearances of that string.  While
> Objective-C shares OCaml's nasty syntax and lacks OCaml's elegance, it
> makes up for that in sheer practicality:
> 
> 1) Dictionaries can work with any type that can hash itself and compare
> itself to another object, so I don't have to write the "setup"
> type-declaring code.

This is the same in OCaml.

> 2) Messages sent to nil objects return 0, and a missing key in a
> dictionary gives you nil, so I don't have to special-case looking up an
> object that is not in the dictionary.

OCaml doesn't have a "nil" value. You usually have a "find" function that
returns the result of the map or raises an exception to say that the key
wasn't found.

> Just wait 'til the Perl guy gets here, though!

Yes, any language with built-in dictionaries will be great for that task. C
is an awful choice...

> (Seriously, tell more about these sets.)

:-)

Check out the freebies on our company website if you want more OCaml
examples. Also, check out the first chapter of my book and the free code
from it. Feel free to buy a few copies too. ;-)

-- 
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com

.