Software Development

Correcting The Future

Project V: Universal ID's

I need a way for a type (the code behind this type), to be able to recognise itself. Let's say I send data to another computer, that computer needs the ability to either A) recognise the data or B) figure out that the data is alien to it.

How do I do this? An easy way is to use GUID's. All types can have a GUID associated with it and problem solved. Or is it? Since there are millions (maybe billions) of computers and devices out there, how can we be assured that no two GUID's are the same? An often, and incorrect argument, is that there are 2128 possible GUID's. It would take forever to create that many GUID's. But things are not as they seem.

Let's take a simpler example. Let's say you are in a class of X students. How many students do you need in your class so that you have a better than 50/50 chance that two students have the same birthday? It's not 365/2. The answer is 23. You need 23 people in your class to have a better than 50% chance that two of them will have the same birthday. This is quite a bit less than the total sample range of 365. In fact, it's closer to the square root.

Why is this? Let's flip it around. Let's look at all the possible pairs possible with any two students. We can pick any of the 23 students for the first birthday. And we can pick any of the remaining 22 students for the second birthday. This makes 23*22 = 506. But we have duplicated pairs. (Tom,Jerry) is the same as (Jerry,Tom). So we have twice as many pairs. Divide 506 by 2 and we have 253 combinations. Perhaps it is now easier to see why we only need 23 students. The odd that all dates are different is (365/365) * (364/365) * (363/365) ... * (343/365) = 365!/(36523 342!) = 49%. The odds that they are the same is the inverse. 100%-49% = 51%.

But in computers, 50% chance is astronomical. We want the odds near zero. Let's check what it would be for a 1% chance. We won't bother with calculations, but it's around 3. Out of a sample space of 365, we only need 3 tries to get a 1% chance of duplication. The scenario doesn't look so optimistic anymore.

Let's consider what happens with a sample space of 2128. With numbers this large, the square root is about the number of tries you need before we get a 50% chance of collision. So 264 random GUID's will get you over 50% chance of collision. For a 1% chance it's something like 261. You may think this is still quite large. But consider what happens with just 1 million computers around the world. This means that each of these computers can only generate on average 241 GUID's. 241 is nothing in computer terms. This number would not be for a day or a year. It would be forever. You would not be allowed to generate more. But a computer is able generate more than that in under a day even on old hardware.

So thinking that GUID's are adequate is not very realistic. It may be OK today. It may even last quite a few years. But the more computers that start generating GUID's, the less GUID's are useful as unique identifiers. The original draft says that GUID's should be adequate until at least 3400 A.D. I don't buy it. Give me the power that SETI has with its distributed computing where people share their computing time and I bet I could generate a collision before the day is out.

It'd be nice if we could have a central repository where we could generate as many GUID's as we wanted. That way, we could truly use the full sampling space. But who controls the repository? And this GUID system is monolithic. There is no way to be able to use multiple ID systems together. I consider these minor issues though. We can always designate a certain GUID to be a container. Then we can use other ID systems within. Repositories can be assigned a certain range as well.

All of this doesn't help resolving the main issue of having a component recognise itself. For now, I'm leaning towards a compromise. I'm thinking of generating GUID's based on the source of the component itself. GUID's can be based on text or other input. That may be enough. But I hear that these too have their own sets of problems and recent news came out that the hashes used actually have a much lower sample space than expected. I could use a 256 GUID based on SHA-256. I'm not sure the extra space is needed.

One possible scenario is to use a large enough GUID based on something like SHA-512. Then, instead of using that GUID every time you transmit data, you would instead use handshaking and allocate a common ID between the two machines that would then be used on subsequent transfers. This generated ID would only be for these two computers and would be MUCH smaller. A cache could be used where after a certain amount of transfers, the lowest used component ID's would get removed from the cache. Any use of these ID's at a later time would cause an error as that ID is no longer in the cache and a handshaking session would begin.

I think this would work fine. The only problem would be servers that don't talk to other computers for long, but does serve many of them. The bandwidth could get excessive. I intend to use predefined allocation lists for this. Most computers would theoretically be sending similar kinds of information, so it would make sense to use a predefined set of ID's. I'm worried this would create fragmentation sort of like the mess that Unicode is going through.

I don't see an easy solution to this problem. Still, I keep thinking there is an easier way. The only thing I know for sure is that I despise namespaces and I hate even more the hierarchical library system of Java. I'm hoping for a viable alternative.

How Far Will Programmers Go?Stacks

Comments

Robbert van Dalenrapido Wednesday, October 11, 2006 3:34:17 PM

Vorlath, you may want to have a look at http://www.hpl.hp.com/techreports/2002/HPL-2002-32.pdf and http://www.hpl.hp.com/techreports/2004/HPL-2004-95.pdf#search=%22hp%20content%20based%20digest%20%22

I have thought of a similar approach with Enchilada (a computer language I'm developing). In Enchilada, hashes are build-in to provide means to distribute Enchilada data-structures. It can also be used to compare huge structures very fast (with a very small probability of a false positive).

Indeed, GUID's are the ideal references for distributed computing - but collisions will always be a problem. And yes, scaling up to 1024 bits hashes could solve potential collisions for a long time, but as you mentioned, this comes at a cost.

But not all is lost. Exactly because a GUID refers to data that *cannot* change, there is also a lot of opportunity to *share* data efficiently.

Just look at pure functional (immutable) implementation of binary trees to see what I mean. If you want to add a node to such a tree, the path to that node is rebuild - but most of the old nodes can be shared in the new tree. All this can be done in O(log(n)) time.

If you consider a tree with a trillion elements this means that you can share a trillion - O(log(trillion)) nodes after 'modification'. Now replace all internal node pointers with 1024 bit GUID's that are digests of their descendent nodes, et voila, you have a distributed binary tree.

Vorlath Wednesday, October 11, 2006 11:27:54 PM

I haven't yet read the paper, but I intend to because this issue is VERY important to what I'm doing. But just to get this straight, do you mean that when you add a node, you build the new tree using only the head nodes (and their descendents), but by referencing their GUID's? Only the inserted node would contain actual data, since the other nodes are already available. Is that right? I was thinking of something like that for custom, modified components. I wasn't sure how to implement it, but as far as component composition is concerned, it's basically a tree also. I wanted to use this more for custom components more than changing the definition of the component itself.

Thanks for everything. The more info I get on this, the better.

Unregistered user Thursday, October 19, 2006 2:26:03 AM

kyle@arcavia.com writes: I have also considered this problem with little luck, and I have since given up. I am looking forward to your solution. GUID management always has a problem with isolated machines requiring/generating their own keys that collide with the general populations. Reintegrating these isolates into the general population requires pain. My last thought towards a solution was to name the set of IDs generated by a machine as its own "language" (a language of only nouns really). As with human language; words in one can mean something different in the other, much like the GUID collision problem. Mapping the GUID problem to a language problem allows me to inspect how humans deal with the language barrier; and maybe find a solution. The questions on GUID management can be transformed into questions like: How do isolated tribes of people integrate with newly discovered others? How do nations based on different languages still work together? What is the best way for me to quickly conduct business in the country I do not know the language? Do I use a translation book, or do I acquire the services of another human translator? What if I plan to stay longer, as an immigrant?

Vladasvladas Thursday, October 19, 2006 8:29:04 AM

What if you use a time (in milliseconds since some Epoch) of unit creation as a part of the GUID hash. This would dramatically decrease the GUID set range.

This technique is used f.e. in Personal ID numbering. First there goes the date, then the male/female separation, and then the ordering number within that date. For 3.5 million citizens for each birth date this system uses only 3 decimal digits to hold collision number (it's enough for our country).

Unregistered user Thursday, October 19, 2006 10:49:53 AM

kyle@arcavia.com writes: There are two problems with that. First, the slow machine will not need many GUIDs and this wastes some 27bits (milliseconds per day) per GUID. Second, the fast machine may not have enough available GUIDs with only millisecond precision. If one wants a GUID source stream, like time, it should be adjustable to meet the extremes in demands.

Vladasvladas Thursday, October 19, 2006 12:00:01 PM

Yes, in some aspects you are right. But woudn't 16bits per one millisecond be quite enough? The ever fastest processors aren't to exhaust that maximum.

Ok, let the first 64 bits contain time/number hash, then next 64 bits may contain unique (and not so unique between objects in time) object hash. I suggest this solution only for minimizing the duplication rate of hashes.

Regarding slow processors - Dosen't full SHA 128 provide the same waste?

Vorlath Friday, October 20, 2006 6:29:43 AM

Hierarchical ID's were a thought. Each computer has an ID and then you just have to worry about the computer's ID colliding. This would mean a much smaller risk of collision and each computer can use sequential ID's.

What it comes down to is how to identify an entity that is identical to another. I like to look at existing examples that weren't built by humans. This technology was either built by aliens or God. Either one of these is theoretically smarter than me. So let's see what they came up with. Our genome is one such example of an ID stamp. Each DNA would be identical to a clone's DNA. Each person would be a separate entity, but we're not worried about uniquely identifying them. That's not necessary. Its existance is enough for that. You can always tag them if you need to keep track of them with your own custom system. No, the problem is knowing if two clones are indeed clones.

A first criteria is obviously appearance. I'm going to use the structure of the entity as part of the hash. This will tell me with great accuracy if any complex structure is the same or not. This is not fullproof, but will eleminate quite a few cases. We do this already with CRC checks. I'll just have a sort of CRC on the structure instead.

The bigger problem is how do we differentiate from lookalike clones? These would look identical in appearance, but are actually different. There's also the off chance that different structures give the same hash. How do we tell the difference?

The cool thing about DNA is that you can compare them. Too bad there wasn't something similar for computing. But even with DNA, if we were to manipulate them directly, we would also run the risk of duplication, although very unlikely.

Since all components are built from other more primitive components, it'd be cool if there was an algorithm that would always produce a different result based on the internal components' DNA so to speak.

If I could find such an algorithm, I would be set. The length of the 'DNA' could even grow the more complex the component was, I don't really care. I'd also like for commonly used DNA to shrink over time. Where these common components would be as necessary as primitive components and as such would be better served with a smaller ID. The only real problem is adding more primitive types. I don't see that happening too often, so that may not be a problem at all. DNA only has 4 primitive types after all. I could use a similar system where I can actually reconstruct the data structure from the ID. That would reduce my problem to one of differentiating similar structures for different usage. This is quite manageable, I think.

Those are my thoughts so far.

So I think vladas was on the right track using the building blocks of the entity itself for constructing the hash, but I think kyle was right about wasting bits because once they're gone, they're gone.

Vladasvladas Friday, October 20, 2006 7:24:42 AM

Vorlath, the amazing thihg we miss, that the object (or its code) is a beautiful hash/ID on itself! And when we get collision here (when codes are the same) it makes no problem, because it doesn't matter they are different physically - the matter is - they provide the same code.

The only problem with such hash is that it is quite big (long). So the very problem of making IDs - is a proper compression/decompression algorithm. This is what hash algorithms try to do.

From an Infinite Power Theory prospect - it should be no difference how long the hash is. If someone needs to compress it - it would be another meaningless waste of resources (algorithm, man power, support, versioning...). smile Do people with their DNA need to compress it in order to self-identify? We may think about it.

Write a comment

New comments have been disabled for this post.

June 2012
S M T W T F S
May 2012July 2012
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30