Skip navigation.

Software Development

Correcting The Future

Generic List

I don't know if this is possible, but it'd be cool if it was. In Project V, I use a generic list with the objective that no operation takes more than O(log n). However, it has one drawback which causes it to be just short of that objective. Before going into that, a description of the list is required.

I will simplify the scenario since Project V has functionality that would only serve to complicate what I need.

Suppose we have a generic list and in it, you can only insert entries. An entry has the following three members.

1. ID (Guaranteed to be globally unique)
2. Name (not necessarily unique)
3. Value (not used within context of the generic list).

An entry has a fourth "member" which is a list of named generic lists. These named lists are actually used as named sets.

We can retrieve an entry from a set by ID or by Name. This is no problem. Just use a map, hash, or whatever else that is sorted.

I use three internal lists. One is an associative array (otherwise known as STL::map in C++) for sorting by ID, another associative array where they key need not be unique (otherwise known as STL::multimap in C++) for sorting by name and an indexed list. That last one is a custom list I've built by using a skiplist except that I use the index as the key. The index isn't stored. Instead, I store the amount of items that is skipped along with each forward pointer. When inserting and deleting, I simply update the amounts when I update the pointers. So it comes at no extra cost in runtime complexity.

Inserting at the beginning of this custom skip list is O(1) which is just fine by me (details about inserting at the end at the bottom of this article). Inserting in at a random location requires O(log n). Also fine by me since a vector takes O(n) by comparison. Yes, this is the worst case for a skip list as well, but expected runtime is O(log n) for the skip list. If you get O(n) for large sets, something went horribly wrong. Searching is O(log n) which is also fine because it has the same runtime as an associative array (map) though perhaps a little slower in some cases, but hardly noticeable. And deletion has expected runtime of O(log n) as well.

An associative array takes O(log n) for each of insertion, deletion and search. The data structure I'm using for my numerically ordered list (an indexed skip list) likewise takes O(log n) insertion, deletion and search.

Separately, everything looks great. But when used together and without an index, deletion takes O(n) as we have to scan all entries. Is there a way around this? Can we do O(log n) or better?

Something tells me it's possible, but I don't see it right now. If anyone has any suggestions, or knows that it's impossible, please let me know.

        Insert  Delete Search
map     log(n)  log(n) log(n)  by name or ID
indexed log(n)  log(n) log(n)  by index
map     N/A     N/A    N/A     by index
indexed 1       n      N/A     by name or ID
----------------------------------------------
overall log(n)  n      log(n)


Note how insert, delete and search is never done by index on the map. Since we must do the operation on the indexed list first, we will obtain the entry and this will give us the name and ID needed by the the map (so the index is not used at all with the map). But the opposite is not true. If we insert by name or ID, there is NO index stored at all. This helps in the case of inserting by index since we default to inserting it at the end of the list. This results in O(1) (if we tweak the skip list). But it has disastrous effects when trying to find the item in the list so as to delete it. Although N/A is listed under Search for indexing by name or ID, it is actually used when deleting. In all other cases, searching by name or ID is always done with the map instead.

I had once tried to associate an index with each element, but when you delete an item, you need to update the rest. This is O(n).

I could create another associative array where I map the entry to the data structure that wraps said entry to its forward pointers used by the skip list. I could locate the element in O(log n) inside my skip list, but the problem is that I need all of the previous pointers. Currently, I'm only using forward pointers and not a doubly linked list. If I modify the skip list to use a doubly linked list, I think I can achieve O(log n) for deletion without having the index (only the element). This would also allow me to insert at the end in O(1).

This generic list is quite the memory hog needing 4 times the required memory for references on top of the memory used for the keys. I'll have to look if there isn't a way to use an STL map where the key is in the value. That would save a huge amount of storage. Right now, I'm using a pointer. This takes less space than using the key proper, but it's still taking up space for no reason.

Oh, and if anyone is wondering, I'm not going to implement undo in Project V right now as it's too much of a pain. I can implement it later anyhow since the framework is there to support it. So I'm going to concentrate on getting something up and running and then worry about adding other features later.

P =/!= NP? Travelling Salesman Problem (Updated Nov 7, 2009)

The travelling salesman problem (TSP) is where you try to find the shortest route when visiting N number of towns. Apparently, the decision version of this problem, if solved in polynomial time, would be a solution to an NP-complete problem and would also solve all other NP-complete problems in polynomial time. Not only that, but since you can use binary search in O(log(n)) time when used with the decision version of TSP, finding the shortest distance up to X digits of precision would likewise be done in polynomial time.

My first attempt was to first build a convex hull around all the cities (points). A nice property of this is that if all points are part of the convex hull, then you have the shortest path.

Read more...

Easy Convex Hull Construction

If you've ever tried to look at P=NP and related problems, you've probably encountered the traveling salesman problem. And if you've ever tried to find a solution to this, even if just for fun, you may have thought about building a convex hull to start off. Well, I thought of building one. But I couldn't find an easy algorithm that I could understand. Sure, there are plenty of them out there and I know how they work, but they all seemed a little convoluted. So I created my own algorithm only to find out it's a variation of the Graham Scan (you can see the one liner at the bottom of the article describing it under "Notes"). Still, the version described here is super simple to understand and I thought it deserves a mention.

A Convex Hull is essentially a list of points that completely contains all other points where all angles between any three points of this convex hull is convex (hence its name). You can think of it as an elastic band surrounding all points.

Read more...

What If... C++ Pointer Notation Was Different?

A long time ago... on this very blog, I was trying to create a new programming language that fit a certain set of requirements. We all know this ended up with Project V. But one thing from back then keeps me wondering "what if?". In my theoretical programming language that never was, I had a different notation for pointers.

(Note: I had touched on the subject a while back, but I can't find it. So I'm posting about it again.)

Currently in C and C++, you use the star notation. Each star you use is a dereference. And each ampersand is getting the address of the variable in question. But each of those operators works on whatever type that has been declared for the variable. So getting the address of a pointer variable will get you an address to that pointer (which then points to a variable). In essence, you have to keep track of where you are with dereferences.

What if it worked slightly differently? Declaring pointer variables would still be the same. But what if no star always meant accessing the variable directly regardless of the pointer type. Using one star would get the address of the variable. Two stars would get the address of a pointer and so on.

What this would mean is that you would not need two operators. You would need only one. Dereferencing will be done as required by the compiler all the way down to the variable itself. Getting an address can be done up to one level higher than the declared pointer type.

int a,b;
int *var; // pointer variable.
int **ptr2; // pointer to pointer variable.

// Note the same amount of *'s in the line below.
*var = *a; // set var pointer to address of a.
var = 5;
a = 10; // Overwrites 5.

*b = *a; // error

**ptr2 = **var;
ptr2 = 15; // overwrites 10 in a.


Since a and b aren't pointers, you can only go one level up. IOW, you can use one star, but not two. With var, you can use zero, one or two stars (one higher than it was declared). With ptr2, you can go up to three stars. Also, when you use the maximum allowed stars, it is read only. That's why writing to "*b" would cause an error.

I think this kind of code would be much easier to deal with. When a function needs a pointer, you know that the variable needs one star. No questions asked. Something else that would happen is that arrays would simply be accessed by index into the list of variables. The compiler probably shouldn't allow a[1] for example. It would technically be valid syntax. But the declaration would tell the compiler that the extra items haven't been allocated. To get access to a pointer in an array of pointers, you would do *ptr2[index]. Again, no star means data and one star means a pointer, etc.

The real reason behind all this is to keep things consistent. With pointers, you would now essentially be creating aliases. And you don't need to know all the weird details of pointer arithmetic for most tasks. "*var = *a" would become the standard way to set an alias. Now you can use 'var' as if it was 'a' itself, except it's done through a pointer. In fact, you can create aliases at any level such as is the case with "**ptr2 = **var". Since ptr2 aliases to the pointer of 'var', it also aliases to the variable of 'a'. This is why you can do "ptr2 = 15;" which is the same as "a = 15" by the compiler doing a double dereference.

What's more, the ampersand isn't used to get the address of a variable anymore. So the current way that C++ handles aliasing could be kept as is. In fact, the ampersand would only be used for this style of aliasing. When combined with the new star notation, you would have to use the same amount of stars as found in the declaration.

int& *var2 = *var; // OK
int& var3 = var; // Error, 'var' was declared as '*var' with one star.


Passing arguments by reference would use this same restriction though temporaries can be created by the compiler as is currently the case.

An example using arrays.

int array[1024];

int *ptr = *array[0];
int total=0;
for(int i=0;i<1024;i++,(*ptr)++)
{
  total+=ptr;
}
cout << total;


You could also do funky stuff like set up a triangular array using a vertical pointer to pointer array and then allocate each horizontal array independently. When the triangular (or irregular) array is set up, you access it like a normal array "total+=triarray[row][column]". This is presently impossible without using pointer syntax such as "(*triarray[row])[column]".

If someone were to implement everything described here, would there be any glaring disadvantages or perhaps something that wouldn't work? I wonder what the C and C++ world would be like if this syntax were used instead. Maybe it would be minor. I don't know. But I can't help but believe that it would have made pointers much easier to deal with.

(edit: Wanted to mention that declarations would also make more sense. The way it is now in C and C++, you can assign a pointer in the declaration even though there is a star which normally indicates a dereference.)

(edit2: One problem I've noticed is that you wouldn't be able to dereference a pointer returned from a function. You'd have to store it and then use it. Unless the programmer can decide what pointer level he wants to use. So func() would try to access the variable no matter what. *func() would try to access the pointer. That would work I guess.)

Writing Multithreaded Programs Can't Be Done In C - Spolsky

Found this via reddit.

I'll quote it here for those that don't want to go to the link.

Spolsky: Quite probably. I mean C is a car, it’s very dangerous – it doesn’t have seatbelts, but it’s very powerful because it goes very fast. In fact, I’d go so far as to say, unless some people are gonna punch me about this one, but you cannot write multi-threaded code in the C programming language. You just can’t. Although technically all the capabilities are there, it is beyond the capability of mortal humans – and I know, some of you out there are very smart and you think that you have the capability of writing multi-threaded code in the C programming language, because you’re hot shit. Well, let me tell you, it is beyond the capability of humans on this planet, for their brains are just not adequate to the task of writing multi-threaded code in most languages, least of all low-level languages like C. It’s just not gonna work, it’s just not gonna make you happy. So there.



Emphasis mine. Those sections in bold leave me scratching my head. Spolsky is known for saying outrageous things, but I never thought he was on par with Richard Stallman and Linus Torvalds.

Is it hard to get multithreaded programming right? Sure. But Spolsky is saying that it's humanly impossible. That "it's just not gonna work, it's just not gonna make you happy. So there." Give up now.

Personally, I'm not a big fan of threads. I find too many people use them when there's no need. But those needs do arise from time to time. Blocking calls are the biggest problem. Other times, it just makes sense to separate code into their respective units that have their own execution point. For example, I once wrote a backgammon server and separated the database part from the rest of the server. I eventually put it in its own process so that it could be run on a different machine, but the principle is the same. The server and database handling run in parallel (or concurrently if you want the proper term). Why would database access need its own thread? Because when you make a request, you need to wait for the actual database program to return you a result. During that time, delays can be caused by different things like network latency if on a different machine, disk latency, processing (if using multicore), etc. So during that time, you have free processing power that is going to waste. By having another thread(s) for the main server, you can process other requests while waiting for the database to return a response.

Are there other times you need to use threads? Well, this is probably what Spolsky actually meant (though I can't be sure). That when we go to multicore, Spolsky would prefer to use functionality that is simpler to use for using all the cores such as special "parallel" commands found in many languages. These will send statements to different processors to be computed and then the "parallel" command will put the results back together in the correct order. A simple (and not very good example) might be to increment numbers in a list. No locking needed and each processor can increment their respective ranges within the list.

Are there ways to simplify your life when programming multithreaded programs in C or C++? Yeah. I've talked about this many times. Over and over again, I've promoted the idea of having systems or global modules that only talk to other modules through data messages (without a function call). In effect, each module essentially acts like a separate process. But you have the advantage that you have shared memory and message passing can happen so much easier and simpler. So if you ever need to extract the module into it's own process, it's no big deal. Also, any globally shared data would simply be put into a module of its own. And then I've often talked about helper classes in C++ to help insert and extract data from those messages. But that's another topic.

The biggest problem is taking something that isn't multithreaded and then trying to turn it that way after that fact. That'll cause a lot of problems because the foundation isn't built for it. Hence my promotion of self executing modules because you can expand and add more at any time.

So why would someone say multithreading is impossible? I don't know. But it made me laugh. A lot! At first, I thought he was just saying it was hard. But no. He's saying flat out impossible. Strange stuff.

edit: I wonder what Spolsky thinks of P2P software like uTorrent (written in C++). Or browsers and servers (Apache for example uses many processes and was written in C), Operating Systems, or anything distributed. Anyways, I just don't get what he was going for.

Parallel vs. Concurrent

I'm seeing more of these debates on Reddit, especially as functional programmers try to redefine what these words mean. I'm reading comments from instructors who are teaching the incorrect meaning. Having been blasted time and again when I started this blog about terminology, I made sure to use the correct words, but parallel and concurrent have never been a problem.

When speaking in general terms, both can be used interchangeably in many cases. The expression "running in parallel" is especially troublesome because it simply means that multiple things are executing at the same time. There's a reason for this use, but we can back up a little and start simple.

Parallel usually means several instances of the same thing executing at the same time (usually independently of each other).
Concurrency usually means several things executing at the same time and working together toward a common goal.

Read more...

Busy

Will be busy on finishing other projects for probably the next two months. So there won't be too many posts during that time.

Thank you

Connections and Usage

I just got done removing all spam trackbacks from my blog. Obviously, SPAM isn't what I meant this blog to be used for. While thinking about this, it reminded me of how things used to be done in some cases.

Let's create an imaginary scenario. Suppose we have a very fast machine with all the hardware and resources one would need except for ONE thing. It would only have a CRT monitor. You can use an LCD if it supports the right connection, but that's not the point. The scenario I'm setting up has one more oddity. Your video card can only display ONE color. You can select from 16 million colors, but it only has ONE register where you can set the color that will be sent out to the monitor.

Some people would have a problem with this. For others, it wouldn't phase them one bit.

Suppose our CRT runs at 60hz with 60 frames per second, we can change the screen color each frame. With a normal machine, we can change the screen 60 times a second no problem. So that much stays the same. But having a single color isn't very interesting. If we can change screen colors once per frame, what's stopping us from changing it more often? Nothing.

Read more...

Moon Machines

If you like rockets, computers and the space program, you'll want to see this. Talks about creating the computer that went into the Apollo program. 72K of hand weaved memory if you can believe that. Found it on reddit.

http://www.youtube.com/view_play_list?p=7F223B0B5228CB54

I'm reading that it's from a 6 part show called Moon Machines aired on Discovery.

Kryptos Topics Moved to Other Blog

All future Kryptos discussion will be moved to another blog called Kryptos Solutions.
November 2009
S M T W T F S
October 2009December 2009
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30