Software Development

Correcting The Future

Optimizations

There have been quite a few articles written about Knuth's famous quote warning against premature optimizations, both for and against. In this article, I will not only argue that Knuth was 97% wrong and 3% right, but also that most people who agree with Knuth tend to be 100% wrong. I will also be mentioning different places and strategies that can be done at a very high level to get acceptable performance out of your application. Note that performance isn't something that can be an afterthought. If your application isn't responsive, users won't be inclined to continue using it. If they have to wait an eternity for an operation to complete that they believe should be relatively quick, they will have a negative opinion of not only the software, but of the people responsible for writing it, and correctly so. In today's world of multicores and distributed computing, it's all too easy to write code that unacceptably lacks performance.

Knuth's quote is as follows:

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance a considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.



So why is this 97% wrong and 3% right?

It is obvious that the 97% is the part that Knuth is saying we should forget about small efficiencies. The problem is both that Knuth's quote is wrong and ambiguous. No matter what he's trying to say, he's wrong. Now, I do understand what he's trying to say and where it comes from. Back in the day, optimizations use to involve tricky code that was difficult to read and unmaintainable. It often meant going to assembly. And there were indeed people who optimized everything they could regardless of the final impact on the software. I think the last extreme example of this was Roller Coaster Tycoon that was written completely in assembly.

The simple fact is that it's impossible to optimize prematurely. IMPOSSIBLE!

I'm leaving that on its own line for dramatic effect. Optimization isn't just about the act of writing code that is faster than what was there before. Optimization also involves the act of analyzing performance. This is where the ambiguous part of Knuth's statement comes in. Far too many people have used his quote to defer analysis to after the code has been written. In fact, there is something in Knuth's quote that indicates this very meaning.

Programmers waste enormous amounts of time thinking about or worrying about, the speed of noncritical parts of their programs...



and...

A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.



He's saying AFTER THAT CODE HAS BEEN IDENTIFIED. Sorry, but NO, NO and HELLZ NO.

What he's saying here is exactly what those who object to Knuth's statement have been saying all along. Unfortunately, you cannot wait until the code is written to optimize. Ok, you can, but it's incredibly difficult and not the way to go. You will need experts that know how to optimize, refactor, and are able to redo your architecture's design from the ground up. The people with this kind of expertise are getting more difficult to find. Also, this kind of optimization will take time to complete and can often kill a project because it was an afterthought.

Back to the notion of waiting until after the code has been written, this is not a misunderstanding of Knuth's quote. It is exactly where people who disagree with Knuth have an objection, myself included. If you only identify critical code at the end, that means you don't worry about identifying performance issues at all until the end. Otherwise, you're worrying about it prematurely according to Knuth. BUBKIS!

Analyze Performance BEFORE Writing Code

When you design the architecture of your software, not doing premature worrying of performance will be disastrous. Not maybe. Not likely. 100% guaranteed. Now, "premature" is doing it too early. No such a thing. This is why the quote is so ambiguous. When is it premature to look at performance? What does it mean to worry about something too early when you can never worry about too early?

Well, just to confirm my interpretation, or rather my interpretation of how others read it, is correct... here's wikipedia's entry:

Optimization can reduce readability and add code that is used only to improve the performance. This may complicate programs or systems, making them harder to maintain and debug. As a result, optimization or performance tuning is often performed at the end of the development stage.



It follows up with Knuth's quote to give it justification. The sad reality is that this is not only naive, but dangerous. It costs real time and money.

Invariably, when you profile after the code has been written, the profiler will tend to show you where there is slow performance, usually some kind of loop. So you go in and fix that and perhaps you make your code significantly faster and all is well. This does happen. And you should consider yourself very lucky if it does. A more typical scenario is that the bottleneck can be fixed if the method gets called once. But now you have several passes that all have to be put through the same operation. Oops. You forgot about the high level iteration. Only problem is that the profiler is giving you about 2 to 5% on 4 or 5 methods and the rest are all exponentially lower. Those top 4 or 5 methods have already been optimized as much as they can.

What do you do when it's still too slow by several orders of magnitude?

Go to threading? Dump some of it to the video card? Concurrent programming? Or redesign the algorithm?

These are all possible solutions. You can come up with you own. Problem is this is generally not the right time to think about doing any of these things. For example, threading is generally difficult to get right. Adding it as an afterthought is what many experts in the field say is nigh impossible to do. Not my words. This is a generally accepted viewpoint, as far as those go in the programming field.

There are several points, all interrelated, that have changed over the years that now make it rather difficult to think about performance after the fact.

1. The free lunch is over.

Click on the link. It's a very nice article from 2004 with a great graph updated in 2009 showing that transistor count are still growing exponentially, but performance per core has flatlined.

The take away quote from that article is this:

Efficiency and performance optimization will get more, not less, important.



Very true. Since the invention of the CPU, speed has gotten exponentially faster. There was actually a time in the 80's and 90's when, if you knew how long your project was going to take, you could not worry about critical parts of your software having the appropriate performance because by the time you released your software, the common desktop processor would be fast enough. Note that they still had to think about performance up front. But the idea that people took away from this is that you didn't need to think about performance at all. A lot of people were getting away with it. Even if their software was slow upon release, it would only get faster as new machines came out.

Those were really unusual times that became the norm. To consider an analogy, think of Ferrari releasing a car with a 4 cylinder engine at design time that automatically upgraded itself to a V6 by the time it was shipped, and would continue to upgrade itself the following year to a V8 all the way to a V12 8 years down the road.

Since about 2003 and certainly by 2005, if 3 to 4 Ghz isn't enough, then you can't really wait for processors to get faster. Recent advances are no longer concentrated on speed, but rather more cores, faster cache, and more registers that your compiler may or may not use.

2. Multicore

The 80's and 90's did not have much for multicore desktops. There were a few multi processor machines that I have seen. A friend of a friend had one. When I asked how it worked, he said it was great for the most part, but that many programs that used threading did not work. So he had to force those programs to work on a single processor. Why would a program that used threading not work on multiple processors? Because on a single core, you never have to worry about anything accessing the same data at the same time. Yes, you need to synchronize, but that's not what I'm talking about. What I'm saying is that you could get away with less than optimal synchronization because there would only ever be one thread active at a time. When both processors would run at the same time, you could get into trouble where both threads would update values at the same time and cause a crash. For example, I've actually seen situations where once a thread obtained a lock, it would be free to update everything it wanted within its context. But if you somehow forgot to restrict creating an identical thread when another core was present, you would get into trouble if it had access to the same context. IOW, there was plenty of multithreaded code out there that assumed only one thread could execute at a time. What worked an a single core would crash on multiple cores. Note that this wasn't done intentionally. It was just difficult to think about true multicore programming when you didn't actually have it.

3. Concurrent programming

Networks were far and few between in the past unless you worked at a large company or had access to University or college resources. Luckily, by the 90's everyone had ethernet and the Internet started being installed in homes. But even today, concurrent programming is something that is not taking off as fast as I thought it would. Likely due to the complexities involved. RPC doesn't work well over a network. You can't just wait. Not only is the current processor sitting idle unless you also implement threading, but if the other machine can't return anything or something goes wrong with the network, or the packet is lost, etc., then you'll be waiting forever. So how long do you wait? Do you have pingbacks? No matter what you do, there will always be a point of failure.

So if you're going to write scalable software like this, it can't be an afterthought what happens when multiple machines work together and failures happen. And you certainly can't add concurrency after the software has been written.

It's actually impossible for both sides to know that a message and its confirmation has been received. I think the original problem goes that there are two allied armies that will arrive on either side of the castle to attack it. For this to be successful, they must arrive at the same time. But until then, they can only send a messenger an horseback lest they be discovered by patrols. Either one can send the message with the time, but both must agree on the same time. If someone has the exact problem, please let me know and I'll update it.

4. Caching

My old C64 did not have a cache. At least, I don't think it did. So if you wrote quicksort and you tried both traversals for the partitions, you would get identical results.

One traversal is where you process and subdivide all partitions at the same level before going to the next level. The other is to always process the highest and leftmost partition. This last method is the normal recursive way of doing things where you process each smaller partition as soon as possible. When you do this and you have a cache, then the same data remains in the cache and access to that data is that much faster. So much so that insertion sort becomes faster once you get to Level 1 cache which is O(n2). If you tried the level by level traversal on a machine with cache, then it'd be ridiculously slow because every piece of data has to be reloaded through all levels of cache every time. And on a machine without cache, the traversal order simply does not matter.

So today, there are some that believe QuickSort is simply faster because of O(nlogn). Unfortunately, that's not the entire picture. There are many algorithms that will benefit from thinking a few seconds about the cache. It doesn't take much time. It won't make your code any less readable. And it won't make maintainability any more difficult.

The idiom can be stated as such: When you have the decision of doing multiple passes over the same data or operating multiple times on the same data before outputting a result, ALWAYS go for the second option. Do as many operations on the data as you can before outputting it. The other way will not effectively use the cache. This means you have to think about how your iterators work. If you have methods that only implement one algorithm per call, you need to rethink that strategy.

What I find funny about this is that the running time with O(n), O(logn), etc. can go out the window. Academics will object to that last statement on a technicality. That cache is limited while running times are meant for n that can grow to infinity. I'll leave it up to the reader to decide how useful big OH notation is relative to the cache.

One last thing about the cache that you can't leave to the last minute is how you deal with the combination of multicore and data access. It is easy to thrash the cache. To understand this, one must realize that Level 3 cache is the only cache that is shared amongst all the cores. So if you have one core do some processing and it passes the data to another core to do some more processing, you will trash the cache. This is because the first core's level 1 and level 2 cache have the data in question. In order for the other core to use said data, the other core must wait for the level 3 cache to be updated and to propagate up to its own Level 1 cache before being able to use said data. This can be so detrimental to performance as to make threading slower than single core processing.

5. Lantency and throughput

This is related to the last item about caching. Specifically, we are now talking about the delay between a request for data and the first data item coming in as well as how much data per second we get once the data starts coming in.

Normally, we have the following levels of access in order of latency from best to worst.

- Registers
- Level 1 cache.
- Level 2 cache.
- Level 3 cache.
- Memory
- HDD

As for HDD, we must include other external resources such as networks, SSD, DVD, BluRay, etc. The main point I want to make here is that one must be certain at the outset, before any code has been written, as to where your data will be stored. You can't just use a certain arrangement and assume that because it works fine in Level 1 cache that it'll be fine if you store it on disk. Random access on disk is nothing like random access in Level 1 cache. This is where latency really has an effect. But throughput also has a dramatic effect even when stored sequentially.

Then you need to consider things like using a video card. Latency is a very real problem when using video cards. This is why double buffering and triple buffering have existed since storage space made these techniques possible. An old rule with video cards is that what goes on the video card stays on the video card. Don't ping pong if you can avoid it. Though with newer cards, this isn't a big a deal as it used to be. But for optimal performance, do all processing on the video card once the data is on said video card.

This rule goes for all levels of latency and throughput. Don't go back and forth between levels. This rule should explain why cache trashing in the last section is bad. The latency and throughput of slower levels will eat up any performance gained at the faster levels of access.

Simple Things to Consider

Here are a few things not mentioned earlier that you can do that will always help with performance.

- For UI, make sure everything is responsive. If there is a long operation, do something with less quality or a bounding box until the final results are ready to be displayed.

- Try and keep things sequential. Memory will automatically fetch the next block of memory before you need it if you access it sequentially. Disk access is also much faster when done sequentially.

- Only use threading if one or more cores are at less than 100%. Otherwise, you need a different algorithm and threading alone won't help you.

This last one needs some explaining. Some people use threading because if a large request comes in and takes up most of the CPU time, then the other smaller requests must wait. So they use threading even if all CPUs are at 100% because this means that the smaller requests can get done. This may work for that one particular case, but you really need to change your algorithm and here's why. You're making your system MUCH MUCH slower than it needs to be.

I've mentioned this case countless times by now, but it's worth repeating. If you have 4 requests coming in and it takes 1 second per request and you process them in parallel, everyone will wait on average 4 seconds for a total wait time of 16 seconds. Note that this is for a single core. But we're talking about maxed out cores. So if you try to run this on maxed out cores, the wait time will be even longer. OTOH, if you allocate an entire core for a single request, the first user waits 1 second, the second user 2 seconds and so on with only the last user waiting 4 seconds. You get no worse performance from then on as the threaded model, but for short bursts of requests, the total wait time is 1+2+3+4 = 10 seconds. And if requests come in after one or two requests have been processed, then you still gain on user wait times.

So don't think it's just about overall processing time. You also need to think about user response times. And sometimes these wait times apply to concurrent processing as well because this will determine how long other machine will stay idle waiting for data.

Conclusion

These are all contemporary issues that have to be dealt with. To worry about these things after the code has been written isn't the way to go. Now, having said that, if you have to make a choice between having something working on a single machine in X amount of time or having multiple machines working together in X2 amount of time, then get something working first. Especially if you can make money on the first version. There's an old adage to work on both and use the money of the first to update the correct version. But don't upgrade the original version.

This means that what Knuth was talking about, changing code in inner loops and making it less readable etc., is a laughing matter today. If that's all you need to worry about (or shouldn't worry about), then you're in pretty good shape. Those kinds of optimizations are child's play compared to what we deal with today.

The ultimate point is that we cannot afford to to not worry about optimizations until problematic "code has been identified". It's no longer just about simple things like moving stuff out of the inner loop or using machine specific code in critical areas. Where optimizations are needed today is about threading, multicore, concurrency, UI responsiveness, latency, etc. Identifying critical code after it's been written and then implementing threading is not good advice. Scalability cannot be an afterthought. Today's software simply cannot be designed that way.

So when someone says that premature optimization is the root of all evil, tell them it's good to be bad.

What I Learned About Programming or Computers From Watching MoviesMoving Targets

Write a comment

New comments have been disabled for this post.

June 2012
S M T W T F S
May 2012July 2012
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30