Interning
Sunday, October 2, 2011 2:32:29 AM
So far, so good. One problem that comes up is that sometimes you need to use third party software or whatnot where they take raw pointers. There are some cases where they allow callbacks as well and you no longer have a shared pointer when this is really what you need. Whatever the situation, what I'm trying to get at is that sometimes you are given the raw pointer when you really need the shared pointer. Sometimes you can leave the object itself with a reference to an existing shared pointer. This way, you can always copy the shared pointer. But what if you can't do that either? Also, what if you only ever want one copy of a certain object, but you don't know if it already exists?
Enter interning. This is where you have a repository of shared objects that are associated with their reference counters. Interning involves more than just keeping unique references to pointers. It actually makes sure that only one copy of any given object exists. So if you create a new object and pass it to an interned pointer, that object will get deleted and replaced with a different pointer if the object in question already exists in the repository (IOW, if the objects' equality operator returns true). So any time you want to share an object, you can look it up in a collection. If it's there, the counter is incremented, otherwise the object is added to the collection. I ended up writing an interned_pointer class that did all of this automatically including the creation of the shared collection. The class works exactly like a shared pointer except that when you use a raw pointer, it looks up the object in the repository. It is wise to copy from an existing interned pointer whenever possible because it can update the reference count right directly without searching.
The great thing about interning is that you can obtain a reference count to any pointer anywhere anytime. The problem with this is that it requires searching every time unless you copy from a pre-existing interned pointer. That searching can be VERY costly if you do it too often (ie. use raw pointers to create an interned pointer).
Binary search is logn. That's not bad. But there's gotta be something that can make it faster. With cheaper and faster memory, I'm wondering if we won't find ourselves one day using interned pointers instead of shared pointers. The only roadblock that I can see at the present is search time in two specific cases. Searching for an object when we think it may already be shared. And when we create a new interned pointer. In both cases, searches must be done to make sure it doesn't already exist. The first case can be alleviated by making sure to copy interned pointers from other pre-existing interned pointers when possible. But when creating a new interned pointer, there is no choice. A search must happen if only to store the pointer in the repository. And that can be costly when used with temporary objects.
Interning seems like an underused, yet extremely powerful technique. One specific advantage of interning over other techniques is that you can compare pointers for equality or inequality once they've been created. For example, when comparing strings, you can compare their pointers if they match instead of doing a character by character comparison since uniqueness is guaranteed by the interning process. I suppose another kind of weak interning could be done on pointers only where the pointer is guaranteed to be unique instead of the object itself.
If anyone has ideas about this, specifically about speeding up searches and inserts, please post them here.


Unregistered user # Monday, October 10, 2011 12:35:32 AM
Vorlath # Monday, October 10, 2011 5:29:52 PM
About your second point, it's a cool idea. But I'm using a map (associative array) to store the nodes. So even if I already have a sorted list of nodes, I don't think it's possible to enter them all at once without going to a different data structure.
One thing I am doing right now is inserting strings into the repository that I know will be used throughout the execution of the software. But for you idea of a resource identifier, I'd really like to use this. I see a few issues though. While it would work for pre-existing data, a new search would be difficult to do since I would need to also search the static list unless I duplicated the entries.
Some really good ideas here. Thank you for that.
Unregistered user # Monday, October 10, 2011 7:17:56 PM
Vorlath # Monday, October 10, 2011 11:37:01 PM
Ends up I'm using my skiplist library, not a map. It stores a node that is sorted both on pointer and on the value. I do searches on pointers first, but now I'm thinking that I should only search on the value. The only way I can have the same pointer is if I copied from an existing interned pointer and I don't touch the repository at all in that case. So there's no point in searching pointers. I must fix this. Well, people could create a new interned pointer from the object itself instead of passing the other interned pointer. However, I don't think it's worth optimizing situations where the user is doing it wrong.
How do you provide a pointer as a template argument? I thought only ints and classes were allowed. I'm still unsure of the static optimization. Seems like you're saying to have two storage spaces. One is dynamic and the other is static. The static version would certainly be fast if I only copy interned pointer to interned pointer. But I don't do any searches in those cases anyways. Well, if the reference count goes to zero, then it would need to be removed, but my static objects never get deleted (there's always one in memory until the program terminates).
Suppose I create a new intered pointer with a new object I created. How will the static storage be any faster?
If I copy from another interned pointer, there is never any search. However, there can be a search upon deletion when the counter reaches zero. This is why I keep a single instance of objects that will be used a lot in the repository at all times. The eliminates searches for deletes for those objects.
My problem is when I create a new interned pointer from a newly created object that happens to be the same as an object that is already in the repository. Whether or not I also have static storage doesn't help me find that pointer in the repository any faster, does it?
How do I convert the pointer into an index in the static array?
Unregistered user # Tuesday, October 11, 2011 3:41:00 AM
Unregistered user # Tuesday, October 11, 2011 10:29:06 PM
Vorlath # Wednesday, October 12, 2011 1:22:36 AM
I think I see what you're getting at now. If you know beforehand what objects will be in the repository, then you can optimize this because you have extra information.
The thing is that I often have situations where I create a new object. It ends up being identical to one of the static objects, but there's no way to know that until a comparison is done. So that's the issue. And that's where the pointer comes from. When you create a new object, you have a pointer to it (unless you build it on the stack, but my interning functionality only works on pointers to objects so that they can be stored without copying).