https://www.unison-lang.org/articles/distributed-datasets/core-idea/

Unison Logo Unison Docs Community Share Blog

Spark-like datasets in Unison

How to make any immutable data structure distributed

by

  * Rebecca Mark
  * Paul Chiusano

In this article

Spark-like distributed datasets in under 100 lines of Unison How to
make any immutable data structure distributed Distributed and
parallel reductions Incremental evaluation via memoization Additional
topics and FAQs

This article is part of a series of compositional distributed systems

Vote on or propose topics

In this post we'll build our own distributed data structure from
first principles. The version developed here is a little simpler than
theSeqused in the "real" version of the library. We'll get to the
realSeqin Part 5 of this article.

First, let's look at a simple in-memory tree data structure and
understand how it can be modified to represent data spread across
multiple nodes. Here's the in-memory version:

structural type SimpleTree a

structural type SimpleTree a
  = Empty
  | Two (SimpleTree a) (SimpleTree a)
  | One a

ASimpleTreeis eitherSimpleTree.Empty,a leaf:SimpleTree.One,or a
branch:SimpleTree.Two.All the leaves and subtrees in aSimpleTreeare
references to values which are in local memory on the same machine.
If instead we want it to contain references to values that may exist
at another location, we can write the following instead:

structural type Tree a

structural type Tree a
  = Two (distributed.Value (Tree a)) (distributed.Value (Tree a))
  | Empty
  | One (distributed.Value a)

All we've done here is wrap each branch and leaf of the tree in a
data type,distributed.Value,which represents alazyvalue at a possibly
remote location.



Making a distributed data type in Unison can be as simple as wrapping
the fields indistributed.Value.

There are some choices you can make here. For instance you can put
distributed.Valuearound all the fields in the data constructor or
only some, and you can also have laziness "start at the root" or only
when looking at subtrees. We'll discuss these subtleties in the last
part of the series.

With this simple type definition, we can now implement functions like
a distributedmap,reduce,and so on without needing to write any
networking or serializationcode.

All distributed programs in Unison eventually get translated into
calls into theRemoteability, and it's the handler of that ability
that does serialization and networking. Your actual code stays nice
and clean!

Let's try it for ourTree.This will be instructive!

stub.Tree.map : (a ->{Remote} b) -> Tree a ->{Remote} Tree b

stub.Tree.map : (a ->{Remote} b) -> Tree a ->{Remote} Tree b
stub.Tree.map f = cases
  Tree.Empty                    -> Tree.Empty
  Tree.One valueA               -> todo " hmmm, apply f to Value"
  Tree.Two leftValue rightValue ->
    todo "something recursive goes here "

We know we'll need to pattern match on the data constructors ofTree
and perform a recursive call of some kind, but now that the leaf and
branch are both wrapped indistributed.Value,what should we do? There
are two ways to do this that typecheck.

The first is not what quite what we want but it's instructive
nonetheless. It makes ample use ofValue.get : distributed.Value a ->{
Remote} ato force the lazydistributed.ValueandValue.pure : a -> 
distributed.Value afor creating a remote value from an in-memory
value:

eager.Tree.map : (a ->{Remote} b) -> Tree a ->{Remote} Tree b

eager.Tree.map : (a ->{Remote} b) -> Tree a ->{Remote} Tree b
eager.Tree.map f = cases
  Tree.Empty                    -> Tree.Empty
  Tree.One valueA               ->
    Tree.One (Value.pure (f (Value.get valueA)))
  Tree.Two leftValue rightValue ->
    use Value get pure
    use eager.Tree map
    Tree.Two
      (pure (map f (get leftValue))) (pure (map f (get rightValue)))

To see why this isn't what we want, let's look at the documentation
ofValue.getandValue.pure:

    Value.get

    Obtain theadenoted by thisdistributed.Value.

    A few properties:

      + Value.get (Value.pure a) === a
      + Value.get (Value.at loc a) === a
      + f (Value.get v) === Value.get (Value.map f v)

    Value.pure

    Create an in-memorydistributed.Value.

    Satisfies the property:Value.get (Value.pure a) === a

We'resummoningthe value from a potentially remote location withValue.
get,and thenValue.purecreates a value in memory at the location where
it's called. Our implementation ofeager.Tree.mapwill thus end up
sending the entire tree to the original caller of the function,
applying the function there, and storing the resulting tree in memory
at that location. That's no good--presumably the whole reason we're
using a distributed data structure is the data is too big to fit in
memory on a single machine.

The version ofmapwe want will do its work lazily, when the resulting
Treeis forced, and it will do the mapping in parallel at the
locations where the data lives rather shipping everything to the
caller.

It's often more efficient to "move the computation to the data" than
it is to "move the data to the computation".

A good rule of thumb when implementing functions likeTree.mapis to
never callValue.get.Instead we will useValue.mapto lazily apply a
functionat the location where the remote value lives.

The documentationforValue.maphas more details on when and how to use
Value.mapvsValue.get.

src.Tree.map : (a ->{Remote} b) -> Tree a -> Tree b

src.Tree.map : (a ->{Remote} b) -> Tree a -> Tree b
src.Tree.map f = cases
  Tree.Empty   -> Tree.Empty
  Tree.One a   -> Tree.One (Value.map f a)
  Tree.Two l r ->
    use src Tree.map
    l' = Value.map (Tree.map f) l
    r' = Value.map (Tree.map f) r
    Tree.Two l' r'

While you can perhaps see how this code typechecks, what this code
does is a bit brain-bending. First, because oursrc.Tree.mapfunction
does not callValue.get,src.Tree.mapis just the blueprint for the data
transformation, not the transformed data itself. Not until the tree
is evaluated (by say, thereducefunction we'll cover in Part 3) does
any mapping actually happen. This laziness gives us fusion

Fusion

Fusion, or map-fusion, is a property of the evaluation of a data
structure where multiple transformations over the data structure do
not require building up intermediate representations of the
structure. Instead, the desired transformations can be "fused"
together, resulting in a more efficient overall computation.

Strict data structures, likeListdo not support the fusion
optimization, so the expression

[1, 2, 3] |> List.map (x -> x Nat.+ 1) |> List.map (y -> y Nat.+ 2)

[4, 5, 6]

will build an intermediateListas a result of applyingx -> x Nat.+ 1in
theList.mapfunction.

"for free", so if wesrc.Tree.mapmultiple times, the functions will
get composed together and applied in the same pass over the data as
thereduce.

Morever, because we are usingValue.map,the function will be applied
at the location where the data lives, rather than shipping the entire
tree to the caller and applying the transformation function there.

Optional but fun exercises
Optional but fun exercises

See if you can write more functions forTree.We've provided a codebase
with the necessary dependencies and stubbed functions. To pull in the
codebase, run the following in the UCM:

.> pull git@github.com:unisonweb/website:.articles.distributedDatasets .article1

TheTreefunction stubs are undersrc.Tree.



Exercise:Another basicTreetransformation

src.Tree.mapis a structure preserving transformation that doesn't
force the distributed data to be evaluated. In a similar vein write
Tree.reverse : Tree a -> Tree awhich swaps the left and right
branches of the tree but doesn't force the evaluation of the tree.

Show me the answer
Show me the answer

ex1.Tree.reverse : Tree a -> Tree a

ex1.Tree.reverse : Tree a -> Tree a
ex1.Tree.reverse = cases
  Tree.Two left right ->
    use Value map
    use ex1.Tree reverse
    l = map reverse left
    r = map reverse right
    Tree.Two r l
  remainder           -> remainder



Exercise:Building aTreefrom a list

For testing purposes, it would be great if we had an easy way to make
an in-memoryTreefrom aList.

ImplementTree.fromList : [a] -> Tree a.It might be nice if the Tree
was roughly balanced. 

Show me the answer
Show me the answer

ex3.Tree.fromList : [a] -> Tree a

ex3.Tree.fromList : [a] -> Tree a
ex3.Tree.fromList = cases
  []  -> Tree.Empty
  [a] -> Tree.One (Value.pure a)
  as  ->
    use Value pure
    use ex3.Tree fromList
    (left, right) = halve as
    Tree.Two (pure (fromList left)) (pure (fromList right))

Takeaways

We learned a few things in this part:

  * To make a data structure distributed, wrapdistributed.Valuearound
    the fields it defines in its data constructors. This lets the
    data structure represent the data being "elsewhere" with a
    minimum amount of ceremony.
  * UseValue.mapjudiciously to build lazy computations likesrc.Tree.
    mapwhere the function is brought to the data.

We also saw how to obtain different runtime behaviors for your
distributed programs through implementations of theTree.mapfunction
usingValue.maporValue.get.

As a library author, you do have to be explicit when describing how
your program should behave, and we consider this a good thing: you
can achieve exactly what you want with a tiny amount of code, and you
can wrap up common patterns in reusable functions likesrc.Tree.map
that anyone can use without being a distributed systems expert!

If you've been wondering "how do I evaluate this in some way?" we'll
cover that next, in Part 3. Again there will be some instructive
decisions to make: how we implement functions likereducewill
determine whether the work happens in parallel close to the data or
whether the data gets sent to the caller and reduced there.

 
Prev
Spark-like distributed datasets in under 100 lines of Unison
 
Next
Distributed and parallel reductions

If you're interested in using Unison at work for this sort of thing,
we'd love to chat with you.

Get in touch
(c) 2021
Unison Computing, a public benefit corp
and contributors.