Order, probability and fast equality
Friday, 24. August 2007, 07:40:38
Now if elements are big data structures themselves, equality can become very expensive in practice.
Of course, comparing (equally constructed) values could be done much faster: you just have compare their pointers. But what if the same value has been constructed at different locations, by different processes? In that case, pointer equality isn't possible and we are back to expensive (deep) equality.
An interesting alternative is to compare the hashes of two values, instead of their pointers. This way it doesn't matter where the values were constructed: equal hashes === equally constructed values (with a small chance of a false positive).
Of course, the hashing operation should be faster than O(n), otherwise we wouldn't gain anything. So for probabilistic equality to work, every single value has to also carry a hash.
But how big should such a hash be? 32 bits? 1024 bits?
Interestingly enough any number of bits would do, but with the additional constraint that complex values can be easily cut into smaller ones. When the hashes of two values are the same, you just have to cut the values into smaller parts, and recursively compare the hashes of the parts. Then, 1024 bit probabilistic equality will take 32 integer comparisons, using 32 bit hashes.
However, when hashes turn out different, that doesn't mean that the values are different too. Equal values could have different hashes because their inner structure can be different. The following example shows two binary trees that encode the same sequence [1;2;3]:
2 1
/\ \
1 3 2
\
3
All the values in Enchilada are build on (balanced) binary trees with each node carrying a hash of its left and right sub-tree. Next to hashes, ordering information is also kept at every node, with the following cases:
- UndefinedOrder
-- examples: [3;2;3], [2;4;1] - AnyOrder
-- examples: [], [1], [2], [-2] - StrictAscending
-- examples: [1;2;3], [4;5;6] - Ascending
-- examples: [1;1;3], [2,3,3] - Flat
-- examples: [1;1;1], [3;3;3] - Descending
-- examples: [3;2;2], [3;3;2] - StrictDescending
-- examples: [3;2;1], [6;5;4]
With ordering information at every node, the sorting of a (Strict)Ascending list takes O(1) in Enchilada. Also, sorting nearly sorted lists in Enchilada is much faster than ordinary sorting algorithms.
Combining probabilistic equality and the ordering of (balanced) binary trees yields many interesting results. For instance, the multi-set difference and union of two nearly identical multi-sets becomes a magnitude faster than any algorithm that I know of.
Currently I'm writing a reusable Haskell library that implements all the above. Hopefully, this library will be useful outside the realm of Enchilada.