Fingerprinting distributed computation
Tuesday, 3. July 2007, 11:19:04
We can imagine a situation where we want certain proof that a program has been executed correctly.
Unfortunately, such proof is not easy to obtain in distributed environments. In such environment, network outages, benign hackers or even memory corruption can severely alter the outcome of lengthy computations.
Moreover, in the real world, absolute certainty can't be obtained. The statistical method, however, can get us as much certainty we want.
Imagine that we want to compute factorial (10^100) but instead of computing it ourselves, we browse the internet to find services that can do it for us.
After some searching, we end up with roughly 10 cheap computational nodes. We instruct each service to compute factorial (10^100) and gather the results to compare for equality.
The more equal the results, the more confidence we will have.
Of course, this statistical approach only works with purely functional programs. In contrast, impure programs can't always be run multiple times to produce the same result (because they depend on referenced mutable data).
Now let us assume that databases, file-systems etc are pure functions too. Then programs manipulating such functions (=databases) are just higher-order functions that produce new pure functions (=databases): i.e. 'inserting' a record into a database should always produce a complete new version, without altering the old.
For instance, purely functional programs can safely combine distributed CVS's iff such programs explicitly point to all the (immutable) versions they depend on. This way it becomes possible to run such programs multiple times, at any place (and in any order) to always produce the same result.
The only remaining issue to solve is how to compare the results efficiently? What if the results are huge distributed data structures themselves? Fetching and comparing them would be just too costly.
Again the statistical method saves the day. Instead of comparing results, we compare their fingerprints. In fact, we could use fingerprints to reference any piece of remote data: arguments, results or programs, there shouldn’t be a distinction.
Of course, fingerprints do have a unit cost involved. For fingerprints to scale in a distributed setup, every computational step must also calculate the fingerprint of its arguments. To the extreme, the computational step itself also has to be fingerprinted.
Enchilada has fingerprinted computation build-in. In my next blog entry I will demonstrate some of its uses (and pitfalls).
Unfortunately, such proof is not easy to obtain in distributed environments. In such environment, network outages, benign hackers or even memory corruption can severely alter the outcome of lengthy computations.
Moreover, in the real world, absolute certainty can't be obtained. The statistical method, however, can get us as much certainty we want.
Imagine that we want to compute factorial (10^100) but instead of computing it ourselves, we browse the internet to find services that can do it for us.
After some searching, we end up with roughly 10 cheap computational nodes. We instruct each service to compute factorial (10^100) and gather the results to compare for equality.
The more equal the results, the more confidence we will have.
Of course, this statistical approach only works with purely functional programs. In contrast, impure programs can't always be run multiple times to produce the same result (because they depend on referenced mutable data).
Now let us assume that databases, file-systems etc are pure functions too. Then programs manipulating such functions (=databases) are just higher-order functions that produce new pure functions (=databases): i.e. 'inserting' a record into a database should always produce a complete new version, without altering the old.
For instance, purely functional programs can safely combine distributed CVS's iff such programs explicitly point to all the (immutable) versions they depend on. This way it becomes possible to run such programs multiple times, at any place (and in any order) to always produce the same result.
The only remaining issue to solve is how to compare the results efficiently? What if the results are huge distributed data structures themselves? Fetching and comparing them would be just too costly.
Again the statistical method saves the day. Instead of comparing results, we compare their fingerprints. In fact, we could use fingerprints to reference any piece of remote data: arguments, results or programs, there shouldn’t be a distinction.
Of course, fingerprints do have a unit cost involved. For fingerprints to scale in a distributed setup, every computational step must also calculate the fingerprint of its arguments. To the extreme, the computational step itself also has to be fingerprinted.
Enchilada has fingerprinted computation build-in. In my next blog entry I will demonstrate some of its uses (and pitfalls).








