Update on Parallel QuickSort
Saturday, 27. October 2007, 00:59:20
Here's the new Parallel QuickSort.
Here, I have a few different components. Filter on Count is a component that can redirect the stream based on how many items are in that stream. So if the number you're checking is high, it can stall the network unless the stream is a known fixed size. Here, I'm checking for a length of 1, so it'll go through immediately. But you could also check for lists of 40 items and if it's less, you could insert an Insertion Sort component before sending it to the FCFS component.
Dup duplicates the stream. One is used to choose a pivot and the other is sent to the Split Stream component. The split stream component is another UInt32 component that will redirect items based on the condition. There will be one of these components for all comparisons possible. Later, I'll have a generic one where you can define your own comparison operators for your custom types.
The two internal quicksorts are a reuse of the container itself. Concatenate will take two streams and concatenate them into one stream. It will read the entire stream from the first input before reading the second one. Although not shown here, I'll also have a sequencer component. It'll do the exact same thing as concatenate, but will output two streams instead of one. IOW, the streams won't be merged. Internally, this component does nothing except to give sequence numbers to the streams. This allows both streams to be processed in parallel until a component specifically requires the correct order. In that case, the sequence numbers come into effect to put the streams back in their proper order.
FCFS is First Come First Serve component. It takes any non-empty stream that appears on any of its inputs and outputs it. Once data arrives on one input, the other input is closed. If both inputs receive an empty stream, it'll output an empty stream as well. There will be components that can act on begin and end markers for streams as well. These are actually quite important because streams can be nested.
I think I'll keep the old components I've built in addition to the new ones. Some of these new ones can be built from existing ones like the Split Stream component, but I think this is generic enough to warrant it. And for a parallel QuickSort, I think this is awfully simple. What I like is that it'll use as many cores as it can. Eventually, I can even use idle nodes to process multiple branches at once and only use the one that ended up being correct. I've mentioned the Filter on Count component. Well, I can output the stream on both outputs until I know the answer. When I do, I can kill one of the paths and that's already built in functionality. It's like lighting a wick for bombs in cartoons. It follows all paths until everything it touches is destroyed. Other streams can't be touched because they create their own paths (and aren't merged with existing ones until the answer is known).
Building real applications shows what components are actually needed. I'm at the point where I have to choose which ones are best. I've been reading JPM's book again and I noticed I'll have to include an Assign component where you can update specific elements within a stream. What I like even more about all this is that while most components are about the manipulation of streams, they deal with the data within them too. So these components aren't just there for the sake of the network. They actually manipulate the data in some fashion.
I'll be looking at different kinds of components and implementing them. I can't wait to start building the runtime engine. At the same time, this little exercise has given me a lot of good ideas.



Vorlath # 27. October 2007, 01:38
Vorlath # 27. October 2007, 07:48
spc476 # 29. October 2007, 05:45
I also don't see how Project V would be of any benefit to me. Over the past two months I've been implementing a greylist daemon for use at work (it's an anti-spam measure). Superficially, I can see how a "dataflow-centric" type development might naturally work with it, but the devil is in the details.
Simplified, the greylist daemon is given a tuple [IP address, sender email, recipient email]. It then checks a list of IP addresses to either accept the email (since there are known servers that can't deal with greylisting) or rejecting (known spam sources—if I'm checking exceptions here, I might as well support accepting, rejecting or greylisting). The server then checks to see if the tuple has been seen before, and if not, keep a copy for future reference. Periodically, tuples are removed from the list as they expire (I only keep tuples for a period of time). I should also note that the IP list can change from an outside source (a control program that allows me to add and remove IP addresses while the daemon is running).
What I can't see (or simply fail to see) is how such a program can work in your system. There's a lot of data that is stored (IP list, the tuples), some read-mostly (the IP list), and some that's write-mostly (the tuple list). It's also data that can be updated at unpredictable times (I'm not even pretending to do this multithreaded, mainly because of the write-mostly issue with the tuples; coodinating multiple units of execution (avoiding the terms "threads" and "processes" here) will wipe out any performance gains, and keeping multiple lists of tuples around (to keep the performance) bloats the memory footprint.
As it is, I can handle 130 requests per second; even on an old 120MHz 486 I'm able to get a respectable 77 requests per second, so it leads me to believe that the limit I'm seeing is purely communication overhead and not necessarily programming overhead.
spc476 # 29. October 2007, 05:51
Vorlath # 29. October 2007, 08:52
The Filter On Count only outputs on one channel. Split Stream outputs on both no matter what. Pretend there are at least 3 outputs. The extra one is for equal and is sent directly to the concatenate component.
I'd really like to build this in Project V. I think it'd be extremely easy. But I couldn't figure out what you do with the tuple after it's stored other than delete it after a certain amount of time. Anyways, what you need is a storage component. What you describe happens a lot. I haven't talked about that component in a very long time.
It keeps one copy of your data and only it can operate on it. What you do is modify it to handle the requests you want to make. You build each request independantly. Once you're done, Project V will handle all the parallelism and concurrency for you. Your final network will basically have two main components along with a few external ones. One main component for tuples and one for IP's. Each of these will have inputs for whatever you want to do to it. If you want to check IP's, you can check an entire stream of them at once in parallel (assuming you have more than one core or more than one machine). If you want to delete IP's, same thing.
There is a limit to how much parallelism you can get because there's only one copy of the data. But it can be partitioned quite well over a few nodes.
Vorlath # 29. October 2007, 09:00
Vorlath # 29. October 2007, 09:03
HAHA! That was my reaction too when I heard about it back when.
spc476 # 29. October 2007, 19:29
Later, the greylist daemon gets the tuple. It checks, and if it exists (in this example, it does), the timestamp is checked. If less than 25 minutes (the setting I'm currently using), it sends back "try again later" to the SMTP server. If over 25 minutes, the tuple is marked as being "whitelisted" and the greylist daemon returns "okay". It will then return "okay" for subsequent requests for this tuple.
If a tuple is older than six hours, and hasn't been whitelisted, it is removed (this check is made every five minutes, by the way).
If the tuple is whitelisted, it won't expire until 36 days of inactivity.
spc476 # 29. October 2007, 19:35
But it's interesting to see "no time accumulated" when I removed the two routines responsible for all the CPU utilization (that's in the post as well).
spc476 # 29. October 2007, 19:40
Vorlath # 29. October 2007, 21:11
Vorlath # 29. October 2007, 21:29
The total time in your software is under 0.02 seconds. And your lowest sampling unit is 0.01. You can't tell anything with that. Either use smaller sampling times or run the software for longer. Otherwise, this is all garbage. Trust me on this one. You got 100% useless data there. Also, software cannot run at absolute zero seconds no matter how small it is.
Now, it could be that all the time is spend in system calls. If so, threading might be worth it, but it shouldn't be necessary.
Just take one of your functions. It's called 661477 times. I know a call takes more than 1 cycle. But let's assume that's all it takes. At 120Mhz, your code takes at least 0.0055 seconds to run. But calls usually take a few cycles. So maybe it'll run three or four times that. But that's still way too close to your sampling rate. It's impossible for you to gather anything useful or at all.
Rule #1 of any profiling. Make sure your code runs for at least 10 seconds.
Rule #2, make sure all the timings in the output add up to 10 seconds.
Those rules aren't being followed.
spc476 # 29. October 2007, 21:47
spc476 # 29. October 2007, 21:49
spc476 # 29. October 2007, 21:54
Vorlath # 30. October 2007, 00:30
So yeah, there is an end marker for Filter on Count.
In my first and second comment, I did say there was a bug. I thought you had seen this comment already and there was something else. I said I need to do something with equal values. So if the pivot is 5, the first 5 would be equal and passed directly to the concatenate component (would be another input there in the middle).
The 4<5 so the 4 would go to the upper QuickSort component. Split Stream always closes all outputs when it gets an end marker. So the three inputs to Concatenate would be [[4],[5],[]] and the output of that component would be [4,5] where brackets are start and end markers.
Vorlath # 30. October 2007, 00:32
spc476 # 30. October 2007, 02:34
It'll be a day or two before I get to test on the 2.6GHz machine.
Vorlath # 30. October 2007, 23:05
spc476 # 31. October 2007, 00:24
Thus the "rewrite" (which was actually more of a "write" than a "rewrite").
But I did manage to get better profiling data on a 2.6GHz machine.
Vorlath # 31. October 2007, 02:21
spc476 # 31. October 2007, 04:32
As for data structures, the IP addresses are stored in a trie structure (specialized tree) where the average number of comparisons is equal to the average size of the netmask (on the data set, 25.3 comparisons on average) regardless of the number of actual entries (114), so that's more like O(c), but given that particular problem, not bad.
The rest are all binary searches on a sorted array, which is O(lgn). A hash table would consume way too much memory for my tastes for possibly little gain (I'd rather use memory to store tuples than for overhead).
The CRC calculation I could take or leave. I put it in for two reasons: one, it's UDP, and two, for open network access I wanted a bit more assurance of getting a valid packet (I was more concerned about packet sniffers throwing random garbage to random ports). Since our particular setup has everything on a single machine, the CRC could go if required, but for now, I don't see a reason to remove it.
Remember, I'm profiling on a 2.6GHz machine, and it's likely that most of the time, the sampling happens when the process is in kernelspace (recvfrom() or sendto() calls).
spc476 # 31. October 2007, 04:36
Vorlath # 1. November 2007, 14:11