Skip navigation.

exploreopera

| Help

Sign up | Help

The whole Enchilada

A sequence of thought

STICKY POST

Documentation

Go to http://enchilada.javaforge.com to find the latest documentation and instructions for download.

Read more...

Enchilada.Sequence v0.1

A first release of Enchilada.Sequence v0.1: a Haskell library that implements an immutable sequence.

You can find the code here:
sequence.zip.
Feel free to comment!

Order, probability and fast equality

Checking for exact/deep equality is pretty fast for sequences that are very different. But when two sequences (of equal size n) differ only by one element, equality takes O(n * m), where O(m) is the time complexity for element comparison.
Now if elements are big data structures themselves, equality can become very expensive in practice.

Of course, comparing (equally constructed) values could be done much faster: you just have compare their pointers. But what if the same value has been constructed at different locations, by different processes? In that case, pointer equality isn't possible and we are back to expensive (deep) equality.

An interesting alternative is to compare the hashes of two values, instead of their pointers. This way it doesn't matter where the values were constructed: equal hashes === equally constructed values (with a small chance of a false positive).

Of course, the hashing operation should be faster than O(n), otherwise we wouldn't gain anything. So for probabilistic equality to work, every single value has to also carry a hash.

But how big should such a hash be? 32 bits? 1024 bits?

Interestingly enough any number of bits would do, but with the additional constraint that complex values can be easily cut into smaller ones. When the hashes of two values are the same, you just have to cut the values into smaller parts, and recursively compare the hashes of the parts. Then, 1024 bit probabilistic equality will take 32 integer comparisons, using 32 bit hashes.

However, when hashes turn out different, that doesn't mean that the values are different too. Equal values could have different hashes because their inner structure can be different. The following example shows two binary trees that encode the same sequence [1;2;3]:

    2     1
   /\      \
  1  3      2
             \
              3

All the values in Enchilada are build on (balanced) binary trees with each node carrying a hash of its left and right sub-tree. Next to hashes, ordering information is also kept at every node, with the following cases:

  • UndefinedOrder
    -- examples: [3;2;3], [2;4;1]

  • AnyOrder
    -- examples: [], [1], [2], [-2]

  • StrictAscending
    -- examples: [1;2;3], [4;5;6]

  • Ascending
    -- examples: [1;1;3], [2,3,3]

  • Flat
    -- examples: [1;1;1], [3;3;3]

  • Descending
    -- examples: [3;2;2], [3;3;2]

  • StrictDescending
    -- examples: [3;2;1], [6;5;4]

With ordering information at every node, the sorting of a (Strict)Ascending list takes O(1) in Enchilada. Also, sorting nearly sorted lists in Enchilada is much faster than ordinary sorting algorithms.

Combining probabilistic equality and the ordering of (balanced) binary trees yields many interesting results. For instance, the multi-set difference and union of two nearly identical multi-sets becomes a magnitude faster than any algorithm that I know of.

Currently I'm writing a reusable Haskell library that implements all the above. Hopefully, this library will be useful outside the realm of Enchilada.

Enchilada in Haskell

For the past three weeks i've been porting Enchilada to yet another language, this time Haskell. The reason? I didn't like the Factor code I was writing: it turned out to be too Object Oriented. I also experienced a lot of runtime errors at tricky corner cases. But most importantly, I couldn't understand the Factor code I wrote months ago.

This isn't the fault of Factor of course: it's just that my coding style currently doesn't go well with idiomatic Factor code. Nevertheless, I can say that manipulating the stack without using named parameters was painful at times. Probably that's because I didn't factor my code enough. Anyway, I wanted to give Haskell a try, just to see if I my coding style could benefit (=cleaned up) from strong typing.

And I must say that I'm very impressed with the result. Besided strong typing, type classes, higher order functions, etc, Haskell's laziness has given me new ways of expressing Enchilada's algorithms more succinctly. Another big plus is that Haskell's pureness matches that of Enchilada, which probably gives GHC an opportunity to optimize certain algorithms and data structures more agressively. And not surprisingly: my Haskell code turned out to be very easy to (re)read (at least for me).

But the most interesting result is that the Haskell implementation of Enchilada runs MUCH faster than the previous Java and Factor implementations I did. The Glasgow Haskell Compiler is a very amazing piece of technology indeed.

Fingerprinting distributed computation

We can imagine a situation where we want certain proof that a program has been executed correctly.

Unfortunately, such proof is not easy to obtain in distributed environments. In such environment, network outages, benign hackers or even memory corruption can severely alter the outcome of lengthy computations.

Moreover, in the real world, absolute certainty can't be obtained. The statistical method, however, can get us as much certainty we want.

Imagine that we want to compute factorial (10^100) but instead of computing it ourselves, we browse the internet to find services that can do it for us.

After some searching, we end up with roughly 10 cheap computational nodes. We instruct each service to compute factorial (10^100) and gather the results to compare for equality.

The more equal the results, the more confidence we will have.

Of course, this statistical approach only works with purely functional programs. In contrast, impure programs can't always be run multiple times to produce the same result (because they depend on referenced mutable data).

Now let us assume that databases, file-systems etc are pure functions too. Then programs manipulating such functions (=databases) are just higher-order functions that produce new pure functions (=databases): i.e. 'inserting' a record into a database should always produce a complete new version, without altering the old.

For instance, purely functional programs can safely combine distributed CVS's iff such programs explicitly point to all the (immutable) versions they depend on. This way it becomes possible to run such programs multiple times, at any place (and in any order) to always produce the same result.

The only remaining issue to solve is how to compare the results efficiently? What if the results are huge distributed data structures themselves? Fetching and comparing them would be just too costly.

Again the statistical method saves the day. Instead of comparing results, we compare their fingerprints. In fact, we could use fingerprints to reference any piece of remote data: arguments, results or programs, there shouldn’t be a distinction.

Of course, fingerprints do have a unit cost involved. For fingerprints to scale in a distributed setup, every computational step must also calculate the fingerprint of its arguments. To the extreme, the computational step itself also has to be fingerprinted.

Enchilada has fingerprinted computation build-in. In my next blog entry I will demonstrate some of its uses (and pitfalls).

New Enchilada version in Factor

After three months of learning and coding in Factor I have fully ported Enchilada from Java to Factor.

Nevertheless, the new Factor version of Enchilada probably still has a number of bugs. Also, I'm sure that my factor code could be much improved.
However, I'm convident that the next version will be in a much better shape. I will be keen on aggressive refactoring steps - fully in spirit of the Factor language.

While I was doing the port I noticed a subtle bug in the Java version. The following expression threw an exception in Java:{a aa==[a;aa] {a aa c=a aa c}}here is the correct result in the Factor version:{a aa=={aaaa aaaaaaaa=aaaa aaaaaaaa [a;aa]}}The new Enchilada version will be releaed when Factor 0.90 is released. However, if you are interested you can download the latest version from the http://www.factorcode.org repository and see for yourself.

Factor as Enchilada's backend

For the past four months i've been implementing Enchilada's immutable sequence in various languages.
Until now, I didn't consider Slava Pestov's Factor as a possible platform.

Guess what? The Factor version of the immutable sequence is very short in lines of code. Next to that, implementing the various arithmetical Enchilada operators is a breeze in Factor.

Factor's ability to compile quotations could also prove very fruitful to be able to speed up certain Enchilada expressions. That is, Enchilada expressions could in principle be compiled to Factor quotations.

Although I'm still learning Factor I am already convinced that the next version of Enchilada will be build on top of Factor instead of Java.
So for the next two months I'll be busy "porting" Enchilada's Java version to Factor.

Performance & runtime

This blog has suffered from some low traffic lately but for a good reason.

I've have done some eleborate micro-benchmarking in various languages / runtimes on Enchilada's most important data structure: a purely functional binary tree. Next to that, I've benchmarked another important feature: the efficient implementation of infinite precision integers.

In the end I will choose a language / runtime that can serve as the most efficient backend for Enchilada.

I've benchmarked the binary tree implementation (5.000.000 concats) in java (JVM), ocaml and scheme and got the following results on my iBook G4:

- Java 1.5: 70 seconds

- Ocaml (functional style): 40 seconds

- Ocaml (OO style): 60 seconds

- Scheme (macros + gambitc): 92 seconds

So the JVM isn't so bad after all, which is surprising because I'm only using immutable objects.

So I think I going to opt for the JVM, despite of ocaml being the real winner. Java's enormous amount of libraries, support and very advanced garbage collector implementations can only become better in the near future. I don't think ocaml is able to keep up with the JVM performance and research.

Nevertheless, I'm starting to like Ocaml - it is very clean and mature language.

Here is the java code:

public abstract class Node
{
  public abstract Node left();
  public abstract Node right();
  public abstract Object size();
  
  public final Node concat(Node n)
  {
    Object ts = size();
    Object ns = n.size();
    
    if (Number.cmp(Number.add(ts, ts), ns) <= 0)
    {
      Node nl = n.left();
      Node nr = n.right();
      if (Number.cmp(nl.size(), (nr.size())) > 0)
      {
        // right rotation
        return new TreeNode(concat(nl.left()), new TreeNode(nl.right(), nr));
      }
      return new TreeNode(concat(nl), nr);
    }
    
    if (Number.cmp(Number.add(ns, ns), ts) <= 0)
    {
      Node tl = left();
      Node tr = right();
      if (Number.cmp(tr.size(), tl.size()) > 0)
      {
        // left rotation
        return new TreeNode(new TreeNode(tl, tr.left()), tr.right().concat(n));
      }
      return new TreeNode(tl, tr.concat(n));
    }
    return new TreeNode(this, n);
  }
  
  public static void main(String[] s)
  {
    Node n = LeafNode.create("singleton");
    Node t = n;
    
    for (int i = 1; i < 5000000; i++)
    {
      t = t.concat(n);
      if ((i % 100000) == 0)
      {
        System.out.println("i: " + i);
      }
    }
  }
}

final class NullNode extends Node
{
  public static final NullNode inst = new NullNode();
  
  private NullNode() {}
  
  public final Node left() { return null; }
  public final Node right()  { return null; }
  public final Object size() {return Number.zero; }
}

final class LeafNode extends Node
{
  private final Object value;
  
  private LeafNode(Object v) { value = v; }
  
  public static final Node create(Object v) { return new LeafNode(v); }
  
  public final Node left() { return NullNode.inst; }
  public final Node right() { return NullNode.inst; }
  public final Object size() { return Number.one; }
}

final class TreeNode extends Node
{
  private final Object size;
  private final Node left;
  private final Node right;
  
  public TreeNode(Node l, Node r)
  {
    left = l;
    right = r;
    size = Number.add(l.size(), r.size());
  }
  
  public final Object size() { return size; }
  public final Node left() { return left; }
  public final Node right() { return right; }
}

final class Number
{
  public static final int c_s = 65535;
  public static final int _c_s = -c_s;
  
  public static final Object[] cache = new Object[(int)c_s * 2];
  
  static
  {
    for (int i = _c_s; i < c_s; i++)
    {
      int[] v = {i};
      cache[(int)(i + c_s)] = v;
    }
  }
  
  public static final int[] one = {1};
  public static final int[] zero = {0};
  
  public final static int[] create(int i)
  {
    if (_c_s < i && i < c_s)
    {
      return (int[])cache[(int)(i + c_s)];
    }
    int[] n = {i};
    return n;
  }
  
  public static final Object add(Object n1, Object n2)
  {
    int[] nn1 = (int[])n1;
    int[] nn2 = (int[])n2;
    if (nn1.length > 1 || nn2.length > 1)
    {
      // bignum
      throw new RuntimeException("BIGNUM not yet supported");
    }
    return create(nn1[0] + nn2[0]);
  }
  
  public static final Object sub(Object n1, Object n2)
  {
    int[] nn1 = (int[])n1;
    int[] nn2 = (int[])n2;
    if (nn1.length > 1 || nn2.length > 1)
    {
      // bignum
      throw new RuntimeException("BIGNUM not yet supported");
    }
    return create(nn1[0] - nn2[0]);
  }
  
  public static final int cmp(Object n1, Object n2)
  {
    int[] nn1 = (int[])n1;
    int[] nn2 = (int[])n2;
    if (nn1.length > 1 || nn2.length > 1)
    {
      // bignum
      throw new RuntimeException("BIGNUM not yet supported");
    }
    if (nn1[0] == nn2[0])
    {
      return 0;
    }
    if (nn1[0] > nn2[0])
    {
      return 1;
    }
    return -1;
  }
  
  public static final String toString(Object n1)
  {
    int[] nn1 = (int[])n1;
    return "" + nn1[0];
  }
}

I dare anyone to prove that their favourite language can improve the above code, both in performance and in succintness! (but remember: your solution needs to have dynamic dispatch on left(), right() and size()).

Read more...

Document oriented programming

I've changed the sort(<) semantics to allow for (multi)set operations.

The new sort operator takes two operands L1 and L2, sorts them, and then takes their (multiset) union and difference. The sorting step is ommited if L1 and L2 are already sorted (because Enchilada caches the sort property for every list).

Here is an example of the new sort operator(<):

[0;4;2;2;2;3;5;1;5;7] [2;5;5;1;9] <

gives:

[0;1;1;2;2;2;2;3;4;5;5;5;5;7;9] [0;2;2;3;4;7]

If L1 and L2 share a lot of structure, the union and difference can be computed very efficiently. For instance, if L1 and L2 are equal, their difference will be the empty list.

The new sort operator opens a doorway to an efficient document management system. Consider the following 'document':

D1=[{H};{e};{l};{l};{o};{""};{w};{o};{r};{l};{d}]

Let's say we want to insert 'sunny' in between 'Hello' and 'world'.

[{H};{e};{l};{l};{o};{""};{w};{o};{r};{l};{d}]
[{s};{u};{n};{n};{y};{""}] 6 {L1 L2 i=L1 i & L2 i - L1 + + +}

gives:

D2=[{H};{e};{l};{l};{o};{""};{s};{u};{n};{n};{y};{""};{w};{o};{r};{l};{d}]

Now if we want to take the difference between D1 and D2 we could use the unix-style diff command. However, unix-diffs don't scale to terabyte lists because every list element has to be considered.

Another approach to this is to uniquely label each element in a document, with all labels in ascending order. Here's where dual lists come in handy:

D3=[0={H};1={e};2={l};3={l};4={o};5={""};6={w};7={o};8={r};9={l};10={d}]

we insert 'sunny':

D4=[0={s};1={u};2={n};3={n};4={y};5={""}]

.. in between 'Hello' and 'world'. We do this by prefixing all the labels of D4 with a label of D3, after the point of insertion. This way we maintain the necessary order:

D5=D3 D4 6 {L1 L2 i=L1 i & {p=p p ` 1 & : # :} L2 * | i - L1 + + +}

D5=
[0={H};1={e};2={l};3={l};4={o};5={""}
 5 0={s};5 1={u};5 2={n};5 3={n};5 4={y}
 5 5={""};6={w};7={o};8={r};9={l};10={d}]

Note that D5's labels are still unique and sorted. Also note the usage of the max(|) operator to prefix the labels of 'sunny'. Because everything is ordered and unique we use the sort(<) operator the take the difference between D3 (the original document) and D5 (the modified document):

D5 D3 < {union diff=diff}

 [0={H};1={e};2={l};3={l};4={o};5={""};5 0={s};5 1={u};5 2={n}
 5 3={n};5 4={y};5 5={""};6={w};7={o};8={r};9={l};10={d}] 
 [0={H};1={e};2={l};3={l};4={o};5={""}
  6={w};7={o};8={r};9={l};10={d}] < {union diff=diff}

gives the difference:

[5 0={s};5 1={u};5 2={n};5 3={n};5 4={y};5 5={""}]

Because D5 shares a lot of structure with D3, the difference between them can be computed much faster than in the case of mutable lists.

Now Imagine millions of (distributed) edits on a document. How do you determine all the edits and differences in the most efficient way? Enchilada's dual lists and operators are carefully crafted to give an answer to that difficult question.

You can find the lastest release (0.5.6) by following the documentation link.

Immutable papers

July 2008
SMTWTFS
June 2008August 2008
12345
6789101112
13141516171819
20212223242526
2728293031