Cheap equality
Monday, 4. September 2006, 20:26:56
This post is about equality and what it could possibly mean for Enchilada.
One major goal of Enchilada is to achieve massive scaleability. Interestingly, this goal can be achieved with immutability - i.e. not to allow mutable data. Why is this immutable property so important? The key is referential transparency. With referential transparency, the relationship between value and reference will persist and never change in time.
With immutability, checking for equality comes pretty cheap - just compare references instead of huge complex nested values.
But this doesn't work out if two values are created from two different reductions. For instance, the expressions (1 3 +) and (2 2 +) reduce to the same answer (4), but produce different references. The result? Two distinct references can point to the same value. So that leaves us with two options:
1) the references are equal => the values are definitely equal.
2) the references are not equal => we don't know and may be have to compare the complex nested values (costly).In case 1, that's all nice and well, but how are references created in the first place?
In C or Java, references are typically created by the memory manager: free heap space is allocated for new objects. In C, references are equal to memory locations. In Java, the memory addresses are hidden, but behind the scenes, Java references eventually map to memory locations nevertheless.
For one single Java virtual machine such references work pretty satisfactory, but what if we want to have global references, spanning over millions of heterogeneous machines? One solution would be to use a separate global reference service with the sole purpose to create unique IDs (kind of similar to a memory manager) for every request.
However, such service would be likely to choke on the number of requests coming from so many machines. Moreover, it would have a hard time coming up with unique references. For instance, a sequential number scheme (1, 2, 3, etc.) would kill concurrency because of the need of serialisation.
For maximum concurrency, it would be better if every virtual machine would create its own unique references (this also applies for ordinary UNIX processes). But how can we achieve this? By hashing the value to a reference locally. And with hashing we can further strengthen the relationship between reference and value.
But there is one snag that we have to deal with: Two different values could map to the same hash reference (collision). Interestingly, with hash references, the options turn the other way around:
1) The hashes are the same => there is a high probability that the complex values are the same but we may have to compare the values to be absolutely sure (expensive).
2) The hashed are not equal => then we are absolutely sure that the values are not equal too.Because all values in Enchilada are final, their hashes are final too. With hashes we have a very cheap non-equality test. Conversely, we have an equality test with a very low probability of being false.







