Software Development

Correcting The Future

Interning

If you're a C++ programmer, then you should be well aware of smart pointers or reference counting. Instead of dealing directly with a pointer, you use a container that also has a shared reference counter. When the counter reaches zero, then it deletes the object that the pointer is referencing.

So far, so good. One problem that comes up is that sometimes you need to use third party software or whatnot where they take raw pointers. There are some cases where they allow callbacks as well and you no longer have a shared pointer when this is really what you need. Whatever the situation, what I'm trying to get at is that sometimes you are given the raw pointer when you really need the shared pointer. Sometimes you can leave the object itself with a reference to an existing shared pointer. This way, you can always copy the shared pointer. But what if you can't do that either? Also, what if you only ever want one copy of a certain object, but you don't know if it already exists?

Enter interning. This is where you have a repository of shared objects that are associated with their reference counters. Interning involves more than just keeping unique references to pointers. It actually makes sure that only one copy of any given object exists. So if you create a new object and pass it to an interned pointer, that object will get deleted and replaced with a different pointer if the object in question already exists in the repository (IOW, if the objects' equality operator returns true). So any time you want to share an object, you can look it up in a collection. If it's there, the counter is incremented, otherwise the object is added to the collection. I ended up writing an interned_pointer class that did all of this automatically including the creation of the shared collection. The class works exactly like a shared pointer except that when you use a raw pointer, it looks up the object in the repository. It is wise to copy from an existing interned pointer whenever possible because it can update the reference count right directly without searching.

The great thing about interning is that you can obtain a reference count to any pointer anywhere anytime. The problem with this is that it requires searching every time unless you copy from a pre-existing interned pointer. That searching can be VERY costly if you do it too often (ie. use raw pointers to create an interned pointer).

Binary search is logn. That's not bad. But there's gotta be something that can make it faster. With cheaper and faster memory, I'm wondering if we won't find ourselves one day using interned pointers instead of shared pointers. The only roadblock that I can see at the present is search time in two specific cases. Searching for an object when we think it may already be shared. And when we create a new interned pointer. In both cases, searches must be done to make sure it doesn't already exist. The first case can be alleviated by making sure to copy interned pointers from other pre-existing interned pointers when possible. But when creating a new interned pointer, there is no choice. A search must happen if only to store the pointer in the repository. And that can be costly when used with temporary objects.

Interning seems like an underused, yet extremely powerful technique. One specific advantage of interning over other techniques is that you can compare pointers for equality or inequality once they've been created. For example, when comparing strings, you can compare their pointers if they match instead of doing a character by character comparison since uniqueness is guaranteed by the interning process. I suppose another kind of weak interning could be done on pointers only where the pointer is guaranteed to be unique instead of the object itself.

If anyone has ideas about this, specifically about speeding up searches and inserts, please post them here.

Wachowski Brothers used the Halting Problem in the MatrixFinding least distance from equidistant points on a circle to random points on circle.

Comments

Unregistered user Monday, October 10, 2011 12:35:32 AM

Simon Schlee writes: Probably you already have thought of this yourself, but I'll mention it anyway ;). The only way I can think of to speed up search and insert, is by specializing the type of the repository depending on the type the pointer points to. So you could e.g. use some kind of highly optimized cache aware trie/prefix tree to do string-key lookups. The downside is, that by sharing prefixes in the lookup implementation, the string value either has to be reconstructed which would be a very non-local operation, or has to be saved twice (one time implicitly within the trie, another associated to the corresponding node). This particular problem is one of the reasons I imagine non linear/programmable memory-devices to be cool, would be nice if one could see a path in a prefix tree as string without having to do excessive computations on the software side. So I guess my main point is to just use one default-repository and implement specialized ones where it hurts performance. Another point would be to control the amount of interning by controlling when it is done, if you have some predefined chunk of data which is known before runtime you can intern them statically and make the implementation to load this part of the repository. Then you could invent some form of a resource identifier which could be compile time translated to some kind of address/index/position within the repository. Of course this has to be supported by the repository in some way. I would do it (if I weren't too lazy ;) by indexing into the node-array/deque/whatever which is used to implement the repository.

Vorlath Monday, October 10, 2011 5:29:52 PM

If I understand the first part correctly, you're thinking that I have a global repository. What I'm doing is that I have a separate repository for each type that I want to intern. However, you may be onto something. Suppose two different areas used string interning, I could have two repositories. Even if there's the odd chance of having duplicate strings, there would be fewer elements in each repository since they are used in distinct parts of the software. That's a good idea. BTW, I have templates that automatically create the repositories and links them to interned pointers.

About your second point, it's a cool idea. But I'm using a map (associative array) to store the nodes. So even if I already have a sorted list of nodes, I don't think it's possible to enter them all at once without going to a different data structure.

One thing I am doing right now is inserting strings into the repository that I know will be used throughout the execution of the software. But for you idea of a resource identifier, I'd really like to use this. I see a few issues though. While it would work for pre-existing data, a new search would be difficult to do since I would need to also search the static list unless I duplicated the entries.

Some really good ideas here. Thank you for that.

Unregistered user Monday, October 10, 2011 7:17:56 PM

Simon Schlee writes: I guess the part about the default repository is a little bit vague. I really meant that I would separate the key-lookup-implementation of the repository into a template with one suitable default implementation and then specialize that template for e.g. string. Whether only the key lookup is specialized or you use whole different containers within the specialization is a design question. I wrote something like this a while back and there I used boost::multi_index_container with a hashed string key, this container never moves his nodes because it supports multiple keys associated with this node. At last I used a string-set which didn't move the strings and had the value directly in the container except a pointer to it. No, I thought to make one repository per type too ;). But having the possibility to create more then one repository per type to make the search having to process fewer elements is a good idea too, probably only useful when you have some sets of strings which don't have to be compared to one another. I am curious why you are storing the nodes in an associative array, did you mean the values? The normal std::map implementations all use sorted nodes I think. If you don't want to make your repository nodes to stay at their address you maybe could just delegate prototype-style to a static repository if it fails in the dynamic one, or even make it some kind of policy of your implementation so that by default all is dynamic. And if you provide some pointer to it as template argument, it assumes some kind of static-repository to be stored there. This would be actually quite similar to Qt`s resource system, you could do it like them and generate a c-string which represents the bytes of the static-repository. So you wouldn't have to load the static repository because its already within the loaded executable.

Vorlath Monday, October 10, 2011 11:37:01 PM

Yeah, I have templates that implement everything about this. I don't use template specialization though because I don't know if I could speed up any specific type more than another.

Ends up I'm using my skiplist library, not a map. It stores a node that is sorted both on pointer and on the value. I do searches on pointers first, but now I'm thinking that I should only search on the value. The only way I can have the same pointer is if I copied from an existing interned pointer and I don't touch the repository at all in that case. So there's no point in searching pointers. I must fix this. Well, people could create a new interned pointer from the object itself instead of passing the other interned pointer. However, I don't think it's worth optimizing situations where the user is doing it wrong.

How do you provide a pointer as a template argument? I thought only ints and classes were allowed. I'm still unsure of the static optimization. Seems like you're saying to have two storage spaces. One is dynamic and the other is static. The static version would certainly be fast if I only copy interned pointer to interned pointer. But I don't do any searches in those cases anyways. Well, if the reference count goes to zero, then it would need to be removed, but my static objects never get deleted (there's always one in memory until the program terminates).

Suppose I create a new intered pointer with a new object I created. How will the static storage be any faster?

If I copy from another interned pointer, there is never any search. However, there can be a search upon deletion when the counter reaches zero. This is why I keep a single instance of objects that will be used a lot in the repository at all times. The eliminates searches for deletes for those objects.

My problem is when I create a new interned pointer from a newly created object that happens to be the same as an object that is already in the repository. Whether or not I also have static storage doesn't help me find that pointer in the repository any faster, does it?

How do I convert the pointer into an index in the static array?

Unregistered user Tuesday, October 11, 2011 3:41:00 AM

Kyle Lahnakoski writes: If you have a staticly defined set of interned objects, you may try testing a few prime numbers to find a (almost) perfect hash for O(1) conversion from value to array index. In practice, I have found reasonably sparse hashes are close enough to O(1), given almost any big prime. I intern many of my strings, and like Simon Schlee suggested, I am careful about what I intern. There are many cases where strings are just data; data that is parsed, copied, and transformed. I save my interning for strings that will be compared.

Unregistered user Tuesday, October 11, 2011 10:29:06 PM

Simon Schlee writes: You only can provide certain kinds of pointers as template arguments. I used function-pointers from time to time, char* doesn't work but it seems like char** works:(http://codepad.org/2T6NeYR2). This guy has a nice article about it, but I only took a glance:(http://blog.asymptotic.co.uk/2011/02/c-pointer-template-parameters-are-weird/). "Suppose I create a new intered pointer with a new object I created. How will the static storage be any faster?" Not in any way, unless you do what Kyle Lahnakoski suggests. The static repository is only useful combined with compile-time interning, or if you use the information you have about the static objects, to select a more efficient search algorithm to compile into the implementation of the static repository. "However, there can be a search upon deletion when the counter reaches zero." Why? Is this a detail of your skiplist? If your reference count is working (you don't have uncounted references), you just can delete the object from the repository. If the object gets recreated the repository has to be searched anyway so it doesn't matter whether its already in there, only if delete+insert is a time consuming operation I would think about letting some of the objects in the repository. But I think often you can not do good estimations as to what objects should stay. Maybe some generational-collection like stuff would work but it seems to me like its not worth the effort. "How do I convert the pointer into an index in the static array?" What pointer? I do not get your last question. If you use a static repository you have to generate its content before compile time and convert resource identifiers to offsets/addresses by having some kind of (res-id -> object) mapping which can be meta information which is not within the repository. Would be cool if at least line breaks/paragraphs were allowed in the comments.

Vorlath Wednesday, October 12, 2011 1:22:36 AM

I don't know why paragraphs aren't appearing. They did in the past. Unfortunately, Opera doesn't allow very much formatting from non-members. I used to be able to edit comments to make them show up correctly, but that's gone too.

I think I see what you're getting at now. If you know beforehand what objects will be in the repository, then you can optimize this because you have extra information.

The thing is that I often have situations where I create a new object. It ends up being identical to one of the static objects, but there's no way to know that until a comparison is done. So that's the issue. And that's where the pointer comes from. When you create a new object, you have a pointer to it (unless you build it on the stack, but my interning functionality only works on pointers to objects so that they can be stored without copying).

Write a comment

New comments have been disabled for this post.

June 2012
S M T W T F S
May 2012July 2012
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30