Software Development

Correcting The Future

Wasted Power

I want to talk about a couple of simple features found on old x86 compatible hardware. Mostly after PII or PIII. You'll have to check your hardware to make sure. In any case, all hardware has these features today. I often think about how software sucks or how we use software that doesn't always have the user's best interest in mind. But one area that is a polar opposite is hardware. Hardware always has more features than we need these days. And that's a huge statement right there. Could we say this 20 years ago? Probably not. Could we say it 10 years ago? I think we could begin to say it.

Today, there is TONS of computing power going to waste. Forget multi-core: most processors today come with a ridiculous number of registers that go unused. First, there is the fixed bank of registers that you can name in your software. From the Pentium with MMX onwards, we were given 8 64bit registers that can operate on data in parallel. Most compilers never touch these. And it's not really hard to see why. When do we ever tell the machine to operate on 4 16bit values at once? Or 2 32bit values at once? There are some operations that will work on single 64bit values too. Unfortunately, if you do use these registers directly, you have to think in a different way. For example, errors aren't generated in the usual fashion. Errors are usually stored in another register and/or error values are stored in the result. So what you do is apply boolean operations to the values beforehand to check that they are valid, and then apply another boolean operation to the result to weed out the entries you know are invalid and set them to something you can use. Basically, anything that has to do with comparisons works quite differently, though you can work around that. While there is extra functionality, there are also some things that are lacking. One drawback is that the MMX registers are aliased onto the floating point registers, so you can't freely mix MMX and floating point code. This makes MMX a non-starter for a lot of applications.

The power is there though. It can and should be used. When the PIII came along, it added another completely new and independent bank of 8 registers just for parallel floating point numbers. This was called SSE. Each of these registers is 128 bits wide, so it can work on 4 32-bit values at a time. Later processors (SSE2) can also work on 2 64-bit values at a time and allow the MMX integer operations to work on these registers. That allows for much more parallelism and relieves any contention with the floating point unit, as there is no longer any need to use the MMX registers. Again, the SSE operations are limited, but they do have quite a lot of power available. SSE doesn't have all the special functions of the floating point unit, but it does come with square root and reciprocal, along with all the other basic operations and a few extras. In short, it has no sine, cosine and those kinds of functions. But if basic math is your thing, there's tons of power to be had here.
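To make the "4 values at a time" idea concrete, here is a minimal sketch in C using the SSE intrinsics that most x86 compilers ship in xmmintrin.h. The function name add_floats is made up for this example, and the code assumes the compiler defines __SSE__ when SSE is enabled (GCC and Clang do). Where that is not the case it simply falls back to a plain loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for n floats. */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four 32-bit floats fit in one 128-bit SSE register.
       Unaligned loads and stores are used so the caller does not
       have to align the arrays. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (and the whole loop when SSE is not available). */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Calling add_floats(a, b, out, 6) on six-element arrays runs one 4-wide addition and then finishes the last two elements in the scalar loop.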

Newer 64bit processors have added another 8 registers to the general purpose and SSE register banks. MMX didn't get any extra, and I think it will be phased out eventually. Anyways, all the power is still available through the normal and SSE registers. These each have 16 of them in 64bit mode. Oh, and the normal general purpose registers have been expanded to 64 bit.

When we code, we have expectations of the register size. We really do. If we didn't, we'd all use big integers that can grow and shrink. Luckily, now that we've gotten to 64bit registers, that should be adequate for most needs. The numbers that these registers can hold are staggering. While the largest number a 32bit register can hold is about 4 billion, a 64bit register can hold about 18 billion billion. That's a 20 digit number.
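For the exact figures: an unsigned 32bit register tops out at 2^32 - 1 = 4,294,967,295 and an unsigned 64bit register at 2^64 - 1 = 18,446,744,073,709,551,615. A small sketch in C that computes this for any width (max_unsigned is a made-up helper name, not something from a library):

```c
#include <stdint.h>

/* Largest unsigned value that fits in a register of the given width.
   Hypothetical helper; bits is assumed to be between 1 and 64. */
uint64_t max_unsigned(unsigned bits)
{
    /* Shifting a 64-bit value by 64 is undefined in C, so the full
       width is handled separately. */
    return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}
```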

What most people don't realise is that these aren't the only registers available. The others can't be accessed directly, but are used by the processor to speed things up. For example, if an operation uses EAX and EBX and then in the next operation you set EBX to another value right afterwards, the CPU will not pause until the previous operation is done. Instead, it will assign EBX to a new internal register and continue on its way. The old operations will still point to the old register and those will complete as normal. The new operations will point to the new value of EBX. This allows many operations to execute in parallel even though they appear to use the same registers. Internal banks of registers can be in the hundreds. This kind of power cannot be used by high level languages. Most compilers are getting good at this, but I can still write assembly code that beats compilers just by using this technique to its fullest.

So far, we have SSE registers going to waste. We have 8 extra general purpose registers going to waste. We have the high end of the general purpose registers going to waste (unless you use a 64bit OS). And although compilers do take advantage of register renaming, I have to wonder how much. These are just the obvious wasted power in your machine. The biggest wastes are yet to come.

Let's say you are copying an array or you're taking as input two arrays and outputting the result in another array. I'll just write something simple:

void myfunc(int arr1[], int arr2[], int results[], int length)
{
  for(int i=0;i<length;i++)
  {
    results[i] = arr1[i] + arr2[i];
  }
}


The weird thing is that some compilers will know that the indexes are always incrementing and will produce code accordingly (as if I used pointers that were incremented). Other compilers will use base+index addressing, as in "MOV EAX, [EBX+ECX*4]" where EBX is arr1 and ECX is the index. Nothing wrong with it. I think it's almost as fast as incrementing pointers. Except for one thing. You can only read two permanent register names per clock cycle. This limitation may be less severe on newer processors. For reading, it's not so bad, but for writing, you'll be wasting an extra cycle for no reason. For a list of 1 million items, that makes 1 million wasted cycles. It may not be much on a 1GHz processor. 1 millisecond. But this is just one function and ONE operation. You can start wasting seconds really quickly.
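Here are the two shapes side by side in C, as a rough sketch. The names add_indexed and add_pointers are made up for this example. Which machine code you actually get depends on the compiler and its settings, so treat the comments as the intent rather than a guarantee.

```c
/* Indexed form: each access is naturally expressed as base + index*4. */
void add_indexed(const int *arr1, const int *arr2, int *results, int length)
{
    for (int i = 0; i < length; i++)
        results[i] = arr1[i] + arr2[i];
}

/* Pointer form: three pointers advance together and no separate
   index is needed inside the loop body. */
void add_pointers(const int *arr1, const int *arr2, int *results, int length)
{
    if (length <= 0)
        return;
    const int *end = arr1 + length;   /* one past the last input item */
    while (arr1 < end)
        *results++ = *arr1++ + *arr2++;
}
```

Both functions compute the same results. The only way to know what your compiler does with each is to look at the generated assembly.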

Unfortunately, some compilers won't keep these values in registers even in release mode. So it ends up being terribly slow even though the algorithm is fine. The main flaw that I could tell is that arr1, arr2 and results are already stored on the stack, so it'll dump those every chance it gets if you do anything that requires the use of extra registers. Sometimes it'll dump them for no reason. So it's reloading values when there's no need.

That's all minor though. Well, if it's done incorrectly, your code is basically crap at the native level. Even if done correctly, there's still plenty of power going to waste. Notice that arr1 and arr2 are only being read and that results is only being written to? This is VERY important. And now I have to explain a little something about how hardware reads and writes memory.

The absolute fastest access is with registers. If your code can be done entirely by using registers and no memory, assuming your algorithm is good, this is the best that you can achieve. This is why having extra registers is so important. In most code today, we don't even use all 8. This is because one is used for the stack pointer (ESP). Another is used for a pointer to our local variables (EBP). That leaves 6. So imagine that. Try writing software where you can only ever use 6 variables at a time. If you need to use more, you must swap it out to some permanent storage in the meantime. Well, hardware manufacturers knew this was slow too. This is where caching comes in.

In most machines today, there are two levels of cache. So we have registers, Level 1 cache, Level 2 cache, and finally RAM. Some processors also have a Level 3 cache. Each level is much slower than the level immediately above it, and a trip all the way to RAM can cost hundreds of cycles. This applies to code as well. Although code isn't stored in registers, it is stored in the cache. So the tighter your loops and other code, the better.

On most processors, Level 1 cache is about 32 to 128K. And Level 2 cache is 256K to 1MB. These just keep a copy of the most recently used memory locations. Sometimes it will have some internal logic to control when the cache will get overwritten because the processor can often predict what memory is used more often.

Why does this concern us? Well, this is where you can get the most gains in speed other than writing all your code not to use memory. Most of the time, we will be using memory. Software is meant to transform data. So let's go back to the code above. We'll look at the writing part first. Notice that we don't use the values that are written to the results array in our function. This means there's no need to put it in the cache. It never gets used again. The cache is meant to keep values that will be reused. If we were to just output the values in the usual way, this would pollute our cache.

Is there a way around this? Well, actually there is. You can use what are called non-temporal stores. But again, this isn't part of any high level language. Not standard C. Not standard C++. Nowhere. The closest you get is compiler-specific intrinsics that map straight onto the instruction. This is one case where hand written assembly code will beat any compiler hands down. There's no competition. Compilers come nowhere near the speed that can be accomplished here by hand. What a non-temporal store does is bypass all the caches and write directly to RAM. You must make sure that any memory you are writing to isn't already in the cache though. This will incur a severe penalty in speed if it is. Overall, it's a safe bet. This won't be just a little faster. It'll be faster by incredible amounts. Just ONE operation changed for the writing and you get this huge speed increase.

Things aren't all rosy though. I think only newer processors allow non-temporal stores on general purpose registers. And even then, I think you have to check if that option is available. What all processors have these days is non-temporal stores on MMX or SSE registers. So it makes even more sense that the function I wrote above should be written in a more parallel fashion. Not only can you operate on 2 or 4 items at once, but non-temporal stores will give another level of speedup beyond that. And then we look back at C, something that we think is high level assembly, and it fails miserably at this. It cannot do any of these things without outside help.
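As a sketch of what that "outside help" can look like, here is the array addition written with the SSE2 intrinsics from emmintrin.h, processing 4 integers at a time and writing the results with a non-temporal store. This is not standard C: it relies on a compiler extension, and the function name add_stream is made up for this example. It assumes the compiler defines __SSE2__ when SSE2 is enabled, and it only takes the streaming path when results is 16-byte aligned because _mm_stream_si128 requires that. Whether it is actually faster depends on the machine and the array sizes, so measure before trusting it.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: __m128i, _mm_add_epi32, _mm_stream_si128 */
#endif

/* Hypothetical helper: results[i] = arr1[i] + arr2[i], writing the
   output with non-temporal stores where possible. */
void add_stream(const int32_t *arr1, const int32_t *arr2,
                int32_t *results, size_t length)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* The streaming store needs a 16-byte aligned destination. */
    if (((uintptr_t)results & 15u) == 0) {
        for (; i + 4 <= length; i += 4) {
            __m128i a = _mm_loadu_si128((const __m128i *)(arr1 + i));
            __m128i b = _mm_loadu_si128((const __m128i *)(arr2 + i));
            /* Write 4 sums without pulling the destination into cache. */
            _mm_stream_si128((__m128i *)(results + i), _mm_add_epi32(a, b));
        }
        /* Make the streamed data visible before any later stores. */
        _mm_sfence();
    }
#endif
    /* Scalar tail, unaligned output, or no SSE2 at all. */
    for (; i < length; i++)
        results[i] = arr1[i] + arr2[i];
}
```

The inputs use unaligned loads so only the output alignment matters. If the caller controls allocation, aligning results to 16 bytes (or to a full cache line) makes sure the streaming path is the one that runs.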

So I already mentioned that we could take these values 4 at a time in SSE registers (or two at a time in MMX). But we have 8 (or 16) SSE registers, so we can process 32 to 64 items at once. Why would we do this though? What difference does it make if we process a few at a time or a whole bunch at a time? Obviously using one entire register will enable our code to run 2 (MMX) or 4 (SSE) times faster. But using extra registers will require that much more time. What's the point of using these extra registers? Whether we do them one register at a time or a whole bunch at a time, it should take the same time, no? Well, no. Sometimes, yes if we have no choice and are running out of registers. But in the example above, we have plenty of registers to use. There is a real reason to use them.

Here, I have to mention what a cache line is. It's one section of your Level 1 cache. On the PIII, it's 32 bytes. That's only 8 32bit words. On PIV, it's 64 bytes. So your entire L1 cache is divided into 32 or 64 byte sections. Here's the trick. If you read data in sequential order one after the other in increasing order, the processor will anticipate that you will want to read more and will preload this memory into L2 cache waiting to be placed into L1 cache. That's why you would want to read a lot of data in one go. Normally, the first read at the start of a cache line would be slow while the processor loads up that cache line and then the other reads within that same cache line would be extremely fast. With sequential loads, the processor preloads subsequent cache lines and you don't incur the load penalty. By using non-temporal stores, you can also interleave these within your code without a penalty because you're not affecting the cache.

This isn't the only speedup. And frankly, you may not notice anything at all because your code may be too fast for the cache to keep up. Think of your code and the cache as being in a race. If the code can execute faster than the cache can fill up with data, your code will be slowed down. If only there were a way to tell the cache ahead of time what memory you will be using. Well, guess what? You can do this too. Processors have prefetch instructions. You can suggest to the processor which level of cache (L1 or L2) a certain cache line should be loaded into. So while in one iteration of your loop, you can tell it to prefetch the data you'll need two or three iterations down the road. I've found that 64 to 96 bytes ahead for PIII and 128 to 192 bytes ahead for PIV are best. It depends on the machine, but it's usually around two to three times the cache line size. One thing to watch out for is to check that you don't prefetch both input arrays if they are at the same offset modulo the size of your L1 cache. For example, if your L1 cache is 64K and both arr1 and arr2 pointers are exactly at the same offset from a 64K boundary, you'll get a cache collision. They'll both be fighting for the same spot in your cache. I don't think anything can slow down your code more than this.
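A rough sketch of both ideas in C. The prefetch uses __builtin_prefetch, which is a GCC/Clang extension rather than standard C, and it compiles to nothing on other compilers here. The names add_prefetch and same_cache_offset, the 64 byte line size and the 128 byte distance are all assumptions for this example (they match the PIV numbers above); tune them for your own machine.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)  /* read, keep in cache */
#else
#define PREFETCH_READ(p) ((void)0)                      /* no-op elsewhere */
#endif

/* Assumed values: 64-byte cache lines, prefetch two lines ahead. */
enum { CACHE_LINE_BYTES = 64, PREFETCH_AHEAD_BYTES = 128 };

/* Hypothetical helper: array addition with prefetch hints for the inputs. */
void add_prefetch(const int *arr1, const int *arr2, int *results, size_t length)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(int);
    const size_t ahead = PREFETCH_AHEAD_BYTES / sizeof(int);

    for (size_t i = 0; i < length; i++) {
        /* One hint per cache line, and only for addresses inside the arrays. */
        if (i % per_line == 0 && i + ahead < length) {
            PREFETCH_READ(&arr1[i + ahead]);
            PREFETCH_READ(&arr2[i + ahead]);
        }
        results[i] = arr1[i] + arr2[i];
    }
}

/* Returns nonzero when p and q sit at the same offset from a
   cache_bytes boundary, which is the collision case described above. */
int same_cache_offset(const void *p, const void *q, size_t cache_bytes)
{
    return ((uintptr_t)p % cache_bytes) == ((uintptr_t)q % cache_bytes);
}
```

If same_cache_offset(arr1, arr2, 65536) comes back nonzero on a machine with a 64K L1 cache, one cheap fix is to allocate one of the arrays with a little padding in front so the offsets differ.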

The last obvious wasted power is the best usage of your L1 cache. Or all cache. The best example I can give is quicksort. It's probably one of the few algorithms that properly uses the cache. I think it was accidental. Most people think that quicksort is the fastest sort algorithm. And for most cases it is. But it's not the algorithm. If it wasn't for the cache, there would be other algorithms that would have a faster running time. Just as an example, we don't use quicksort for less than about 40 items. We use insertion sort. But even for larger lists, quicksort isn't what we should be using if all things were equal. All things not being equal, I want to show why quicksort actually performs best on von Neumann machines.

I said earlier that each level of cache is much slower than the one above it. So what you want to do is spend as little time as possible in the slowest levels. Say we have a list of 1 million one-byte items (1MB to even it out). In the first pass of quicksort, we must stay at the lowest level because we must read and write to any part of the array. Let's say that RAM access costs 1000 units of time for one item. Each level above will be 10 times faster. This is MUCH less of a difference than reality. But even these conservative estimates will bring home the point.

So we have 1MB of data to sort. We have a L2 cache of 512K and a L1 cache of 64K.

Memory: 1000 units
L2 cache (512K): 100 units
L1 cache (64K): 10 units
registers: 1 unit

Ok, so going through 1MB in RAM will cost us 1048576 * 1000 = 1048576000 units.

Assuming that our data is split evenly (unrealistic, but this is just an example), this means that when we deal with the first partition (512K), it'll fit within our L2 cache. So we will now have all the speedups afforded by the L2 cache. We only deal with one partition at a time, so even though we go through an entire MB when going through both partitions, the speedup is still there. This is because we FULLY sort the first partition before going to the second partition. We use and reuse this data over and over as we create smaller partitions. During the entire time that we are sorting this first 512K partition, we will NEVER go to RAM because the L2 cache contains all the data. OTOH, if we went through sequentially, we would wipe out the cache every time and we would get no speedup. That's what happened when we went through the entire MB the very first time.

So going through each of these 512K partitions will cost us 512*1024 * 100 * 2 = 104857600. As you can see, this second iteration was 10 times faster already and we've only used L2 cache.

The third and fourth iterations will be 256 and 128K. These are still more than the L1 cache, so this will be the same as the last iteration. There will be SOME speedup, but only at the beginning. We'll leave that out for now. We now have two more 104857600 units of time.

The fifth iteration will be 64K. This fits in the L1 cache. So we now get 64*1024 * 10 * 16 = 10485760 units of time. All other iterations will be the same. 32, 16, 8, 4, 2, 1K = 6 iterations. And then 9 more below 1K makes a total of 16 iterations. 10485760 * 16 = 167772160. All these iterations put together still cost about 6 times less than the first iteration alone. That should show the importance of the cache. Real-life examples are WAY more drastic than this.

Putting it all together:
             TIME
RAM:   1048576000 (1 iteration)
L2:     314572800 (3 iterations)
L1:     167772160 (16 iterations)
Total: 1530920960 

Note that each single iteration processes the EXACT same amount of data.

If everything was done in RAM, it would be: 20971520000 (20 iterations)
That's 92.7% less time, or about 13.7 times faster.  That's insane!
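If you want to play with the numbers, the arithmetic above can be written as a few lines of C. This is only the toy cost model from this post (every pass touches the whole array once, partitions halve evenly, and a partition is charged at the rate of the smallest cache it fits in), not a simulation of a real cache. The function name quicksort_model_cost is made up for this example.

```c
#include <stdint.h>

/* Per-item cost of each level, taken from the table above. */
enum { RAM_COST = 1000, L2_COST = 100, L1_COST = 10 };

/* Total cost of sorting total_bytes one-byte items under the toy model:
   one pass over the whole array per partition size, from the full
   array down to partitions of 2 items. */
uint64_t quicksort_model_cost(uint64_t total_bytes,
                              uint64_t l2_bytes, uint64_t l1_bytes)
{
    uint64_t cost = 0;
    for (uint64_t part = total_bytes; part >= 2; part /= 2) {
        uint64_t per_item = part <= l1_bytes ? L1_COST
                          : part <= l2_bytes ? L2_COST
                          : RAM_COST;
        cost += total_bytes * per_item;   /* every pass touches everything */
    }
    return cost;
}
```

quicksort_model_cost(1048576, 512 * 1024, 64 * 1024) gives the 1530920960 total from the table, and passing 0 for both cache sizes gives the all-RAM figure of 20971520000.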


If you want to prove this to yourself, write two versions of quicksort. Write one that is recursive (or sorts the frontmost partitions first all the way through before moving on to the next data items) and then write one where you partition the entire list at every iteration. Then compare your results. The big OH notation is the same for both, but the runtimes are worlds apart. So when you deal with big OH notation, remember that it can fool you into wrong assumptions. At other times, you can luck out.
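Here is one way the two versions could look in C, as a starting point for that experiment. Everything here is a sketch: the names are made up, the pivot choice (last element) is the simplest one rather than a good one, and the breadth-first version keeps its work list in two heap arrays. Both do the same partitioning work. Only the order in which partitions are visited differs.

```c
#include <stddef.h>
#include <stdlib.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Lomuto partition of v[0..n-1], n >= 2, using the last item as pivot.
   Returns the final index of the pivot. */
static size_t partition(int *v, size_t n)
{
    int pivot = v[n - 1];
    size_t store = 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (v[i] < pivot)
            swap_int(&v[i], &v[store++]);
    swap_int(&v[store], &v[n - 1]);
    return store;
}

/* Depth-first: fully sort the front partition before touching the rest,
   so small partitions are reused while they are still in cache. */
void quicksort_depth_first(int *v, size_t n)
{
    if (n < 2)
        return;
    size_t p = partition(v, n);
    quicksort_depth_first(v, p);
    quicksort_depth_first(v + p + 1, n - p - 1);
}

typedef struct { size_t start, len; } range;

/* Breadth-first: partition every range of the current level across the
   whole array before going one level deeper.
   Returns 0 on success, -1 if the work lists cannot be allocated. */
int quicksort_breadth_first(int *v, size_t n)
{
    if (n < 2)
        return 0;
    range *cur = malloc(n * sizeof *cur);
    range *next = malloc(n * sizeof *next);
    if (!cur || !next) { free(cur); free(next); return -1; }

    size_t ncur = 1;
    cur[0] = (range){ 0, n };
    while (ncur > 0) {
        size_t nnext = 0;
        for (size_t k = 0; k < ncur; k++) {
            size_t p = partition(v + cur[k].start, cur[k].len);
            size_t right = cur[k].len - p - 1;
            /* Only ranges with at least 2 items need more work. */
            if (p >= 2)
                next[nnext++] = (range){ cur[k].start, p };
            if (right >= 2)
                next[nnext++] = (range){ cur[k].start + p + 1, right };
        }
        range *tmp = cur; cur = next; next = tmp;
        ncur = nnext;
    }
    free(cur);
    free(next);
    return 0;
}
```

To see the effect, time both on the same large random array (well beyond your L2 size). Note that the depth-first version recurses once per level, so already-sorted input with this pivot choice will go very deep.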

With large amounts of data, you want to partition it to fit within your cache as quickly as possible. Quicksort has the optimal algorithm for doing this and so it wins out. But be clear that it has nothing to do with the sorting. It has to do with how it fits in with the cache. Other algorithms like combsort would kick quicksort's ass if it weren't for the cache, and it does on random lists.

One thing that's been driven into our brains is that these kinds of details are bad. We shouldn't put them in our code. We need abstractions that take care of these things. Or, more often than not, we just leave them aside to rot. If you're writing business applications, maybe you don't care about speed. Though I would guess that there are some CEOs who would have quite a different opinion from what programmers have been taught. Adding more machines is fine... if that's doable. But what about problems that can't be partitioned that well? You're back to having the programmer deal with it. Currently, the best we can do is hope for the best or choose more appropriate data structures. Sorry, but I have to say it. That's fucking pathetic.

If you, as a programmer, believe that you are not responsible for understanding how the machine works, then you do not deserve to call yourself a programmer. We create software to make the machine process our data. It is our job to know how this machine works. At least in some rudimentary way. The above is not rocket science. It's not complicated. It shouldn't be. When we hear that we shouldn't touch the hardware, what we should really be hearing is that we're too lazy to come up with better solutions. What about having a repository of different kinds of code for specific machines? The originals stay intact, so if anything needs to be changed, you work with that. But if you need more speed, you can configure it. And it's not about having multiple versions. Software that needs this... well, needs it. Usually, you end up with complicated code. You could keep the initial code clean and then configure how the algorithm should work on the hardware.

Today, we have WAY too many people writing code. If you look at the vast majority of software, almost everything in it is identical to what other software does. Finding something new in software is extremely rare. No matter how clever you think you are, chances are someone else has either already done it or will come up with the same solution given the same problem. The point is that there should be repositories of available functionality that work specifically for all sorts of machines. That we can't even write a pixel on screen without jumping through hoops is ridiculous in 2007.

There is no reason why any of the power I've mentioned here should go to waste. While it'd be impossible to make 100% optimal use of it, we can do much better than today, which is basically 0%. We don't have far to climb to make huge leaps forward. I can also say that computers today, while computationally impressive, aren't more responsive than computers 20 years ago. What does that say about where we are going? And multicore? Good luck. We've been told over and over again that hardware is the root of all evil.

I don't know if it's just me, but I get a sense that programming is hitting a brick wall these days. Maybe I've just gotten tired of seeing the same things over and over. I don't understand what would cause one to become a programmer anymore. In the real world, people pay for the best stuff and the best quality. In the programming world, the best cannot be bought because it's simply not available. No matter what you think of your software, no one cares. Really! No one gives a shit about any of your code. Ask a CEO if he cares about the code. He doesn't care. What he cares about is that it works. But be sure that if there was a cheaper way that didn't require code, he'd go for it. At the end of the day, the things that people care about are not there. So we live in a world where code "just works", but no better. There are no Ferraris in the world of software. Who needs Ferraris? Well, it's not so much about needing Ferraris as it is that they're going to become commonplace and hardly any programmer knows how to drive one.


Comments

Sean Conner 8. October 2007, 06:25

What's happening in this case is that C and C++ operate at too low a level to effectively use multicore CPUs. What I mean by that is, why can't I have a systems level language where I can do something like:

double a[1000000];
double b[1000000];
double c[1000000];

c = a + b;

And have the compiler generate appropriate code depending upon the system? At worst, the compiler will translate that to:

for (tmp = 0 ; tmp < 1000000 ; tmp++)
{
c[tmp] = a[tmp] + b[tmp];
}

(or a pointer equivalent version). And a system with multiple cores, the system can break it up as:

static inline void array_add(double *restrict d,const double *restrict a,const double *restrict b,size_t start,size_t end)
{
size_t i;

/* index from start so that each call only touches its own slice */
for (i = start ; i <= end ; i++)
d[i] = a[i] + b[i];
}

spawn(array_add(c,a,b, 0,249999));
spawn(array_add(c,a,b,250000,499999));
spawn(array_add(c,a,b,500000,749999));
spawn(array_add(c,a,b,750000,999999));

for a four-core CPU system. Let the compiler take care of the thread creation. Make threading invisible. I can no longer keep track of what optimizations to use for what CPU anymore (in college, I learned not to, as I was writing code for the x86, 68k, MIPS and SPARC architectures---I let the compiler handle the details for the most part (and believe me, I couldn't *beat* the MIPS compiler at its highest setting, as it could optimize code across scores of source files---sure, it took a hideous amount of time, but I've yet to come across another compiler that could do that)).

Vorlath 8. October 2007, 19:33

You mean C is too high level. And I agree. I've often wondered why you couldn't do things like add arrays, but my issue isn't with specific features. If we leave things to the compiler, then we're stuck with the way the compiler does it, though that's usually better than handling it ourselves. Like your hardcoded example of spawning threads. You definitely don't want to be doing that.

This deals with design patterns. Depending on what you want to do, there will be different best ways to do things based on the hardware. So you use the proper design pattern for what you want to do and you'd hope that the system in question already has such an implementation. But if we put the implementation within our code like your spawn thing, it's not portable when we get more than 4 cores for example. And if it's within a compiler, we're dependent on the compiler knowing all possible variations of a particular system. At some point, this will become unrealistic.

However, if you are going to use these machines, you need to know what's going on so that you can configure your code. This still seems alien to people.

Anonymous 11. October 2007, 16:25

Anonymous writes:

No, he clearly meant C is too LOW level. If you write at a higher level, the compiler can make more intelligent optimizations. The more the compiler knows, the better.

On another note, an interesting idea just occurred to me while I was reading this, or perhaps just before I read this while I was reading Lambda the Ultimate. We have this problem of proliferation of alternative designs, but all of it can be implemented with the same components. What if computers were created without any specific processor architecture? What? How could that be possible?

Imagine a computer that consists of a large array of FPGAs. Software could then consist of not only instructions that are executed by circuits, but also consist of the circuits themselves -- the optimum layout for computing the problem at hand. Software could even dynamically reconstruct the circuitry it runs on. I'm not sure what the current technical limits are on the speed of FPGA programming, but it's conceivable to me to imagine a system where reprogramming FPGAs is analogous to what task switching is now.

If the problem at hand was searching a database of photos, then circuitry could be conjured to compute convolution filters, or if the task at hand was playing a game, then circuitry could be programmed for matrix multiplication. All of this would obviously be on a vastly parallel scale. Differences in architecture would now be largely superficial, and could be easily abstracted away with software tools.

What do you think about that?

Anonymous 11. October 2007, 16:49

Anonymous writes:

"Differences in architecture would now be largely superficial, and could be easily abstracted away with software tools."

I realized, even as I wrote it, that that was a ridiculous statement to make. There are just as many ways to design the sort of flexible system I was imagining as there are ways to design a rigid processor. Probably more in fact.

I also realized as I wrote the above that this couldn't possibly be a new idea. It's not. It was conceived in the 60s, but only now are we developing the necessary technology to make it practical. Here's a Wikipedia article on the concept:

http://en.wikipedia.org/wiki/Reconfigurable_computing

Vorlath 12. October 2007, 05:25

"What's happening in this case is that C and C++ operate at too low a level to effectively use multicore CPUs."

That's the statement I was referring to. You can't achieve better usage of multicore at a higher level. That's pure fantasy. You can never be at too low a level to use the hardware. His examples were nonsensical even if he meant to have better high level tools because he just ends up hardcoding everything.

Threading, multicore, concurrency and all that are low level concepts. They should not be found in high level languages. So we must differentiate between usage and implementation of these tools. If you say we need better high level tools that will handle all of this for us, I'd agree. If you say that we can more effectively use multicore with high level languages, I'd have to slap you upside the head. Easier? Maybe. But more effectively? Not necessarily. Again, you'd be at the mercy of whoever built the high level tool. So "effective" is not an absolute. And again, the examples given were not high level. They were hardcoded examples. If you see things like "Parallel.execute(A,B,C)" where A, B and C are executed in parallel, that's hardcoding. That's hitting the hardware directly. If you see "new Thread()", that's hardcoding. You're basically creating a new hardware program counter (virtual or not). If you see "Spawn(...)", same thing with more baggage.

About reconfigurable machines, I'd love to have one. I could actually use it too. Project V could easily take advantage of it if there would be some people willing to configure different designs for different tasks. For Project V, this is just another set of inputs and outputs with dynamic implementation (all features that Project V supports already via its component model).

Sean Conner 12. October 2007, 21:47

Perhaps I did not make myself clear, but did you miss where I wrote “At worst, the compiler will translate that to …” and “And a system with multiple cores, the system can break it up as …”? Those were examples of what the compiler will produce, not what a programmer would write.

Vorlath 12. October 2007, 22:17

Oh, you're right. I thought you meant that someone could code that up. My apologies. It still leaves open the issue of how and what will split that code up. You need to go low level to make that happen. I've seen many attempts to do just that with less than stellar results, though most times it does help.

Anonymous 5. December 2007, 16:40

Anonymous writes:

I know that this has been dead for a couple of months, but I stumbled across it today from a link. This fascinates me - not because I did not know the hardware, I am taught that as part of my computer science degree - but because of the fact that there is apparently no compiler to do this.

Would it not make sense to write a compiler which optimises for the cpu of the machine it is running on, and have this compiler know the threading and the cache of the cpu in order to feed the right information to cache in order to relieve pressure from the disk and RAM?

Last year I took a course in Computer Graphics and Visualisation. In this course we were given a project where we were to write a simulation in C/C++ using GLUT/mesa for our graphical processing. I was appalled at the performance of C++ when performing simple collision detection.

My base collision detection algorithm consisted of a simple check of where a pseudo sensor was in relation to an arbitrary surface; a collision would be detected when the distance was 0 or the sign of the number changed (the mathematical operation produced a signed floating point number giving the displacement according to an arbitrary right-hand-positive convention). This ran fine as a simple operation in mathematical computation.

The abstraction of this which made the base algorithm practical, was to find the three closest surfaces. This ran through every object of subclass 'collisionSurface' to find the three closest using an unsigned double floating point operation. This slowed things down as it was suddenly necessary to access many (read 'thousands of' ) arrays and perform math on them, then the algorithm to see if the sensor had collided with any of them would be performed.

There were then 16 of these sensors placed strategically on the only object that was moving in the simulation, in such positions that there was a >95% chance that one of these sensors would be the first point to collide with any surface.

I quickly discovered that the reason this was running so slowly was the mathematical calculations in the base algorithm. The base algorithm worked with matrices of 3 values to store the Cartesian coordinates of the objects to be worked on, and this involved many vector additions and matrix multiplications. However, one requirement of the project was to make it usable on the department computers, which were a metaphorical black box, so we were not given any information about what optimizations we could make. Consequently, many of these arrays were fetched from memory, or even from disk, causing the whole thing to fall over and producing an eventual frame rate of under 5 fps.

Amazingly I was still given a high grade for this project, however I find it appalling that there is no way to optimize this at compile time, and as C programmers we must surrender performance if we want any usability.

NTiOzymandias 5. December 2007, 20:09

The best way to take advantage of a particular system's capabilities is JIT compilation. Ironically, the runtimes that are most thoroughly enabled to promote such fine-tuned optimizations (Java and .NET) are so encumbered by their heinous APIs that using them for high-performance computing amounts to playing Russian roulette with a bazooka.

Vorlath 5. December 2007, 23:47

To anonymous: C and C++ are meant for the programmer to know everything. You have to take care of most details. Unfortunately, the lack of features to handle the cache and non-temporal moves makes it rather restrictive in that department. If C++ fails or is slower than anything else, it's usually because of the programmer. Caching features would simply make it 2 or 3 times faster than anything else out there for specific problems that deal with lots of data.

To NTiOzymandias: The main problem with JIT with Java and .NET is that you are hardcoding at the programming level. Once you hardcode, there is nothing that the JIT can do. It cannot change the specifications that the programmer put in. Ironically, the programmer is unable to NOT give hardcoded specification with the use of commands that are meant for the SINGLE processor. It's been said before, but it's a good quote worth repeating. Until we start treating the CPU as a resource and not as a fixed and absolute entity, we will never be able to get the most out of our systems.
