Manual Code Optimizations – Loop edition

June 6, 2008

This is one of those optimizations that a compiler can do perfectly well for itself. Still, I want to explain something about a famous optimization that is often more assumed then understood.

So the optimization I wanted to talk about is the countdown-to-zero loop. I’m sure you’ve heard somewhere sometime that looping down to 0 is faster then normal loops, but why exactly is that?

It actually goes back down to the low level assembly code. If you read a typical x86 CPU opcode manual, you’ll notice that a processor has a number of flags; e.g. Overflow-flag, Sign-flag and the Zero-flag. It’s the Zero-flag we’re interested in.

Let me first explain a bit how a regular loop works. There are quite a few options for loops, but some of them aren’t that useful or useable in most circumstances.

I’ll use the loop in the previous post for this example:

   for ( unsigned int i = 0; i < length; i++ ) {
      image[start] = color;
      start += width;
   }

The best thing about this loop is that the code inside the loop doesn’t actually use the loop variable i. So the compiler can do anything with it, and it will if you got the proper compiler optimization flags on.

The way MingW compiles it without optimization is, and this is a pretty standard way:
1. CMP EAX, [length]
2. JAE end_of_our_loop // if EAX is Above or Equal to [length], jump to the end of our loop
3. // stuff inside the loop
4. INC EAX // increment
5. JMP comparison // jump to the comparison at the beginning

So, if I take a look at my old printout from a random Cyrix x86 CPU, it indicates that even then the CMP opcode takes just 1 clock count, and the same goes for any kind of Jump opcode.

So here goes, if you look at the list of Jump opcodes, you’ll hit opcodes called JECXNZ, JZ, JNZ. These are opcodes that look at the zero flags, the first one looks at ECX specifically, and if the flag says something is zero, it will jump to wherever you want to go. The twist about these opcodes, is that by looking at the flags, you won’t have to use the CMP opcode.

Flags get updated to whatever register you either looked at or changed last. So when you do that increment of i, it will unset the Zero-flag to false and you’re ready to use a jump opcode that looks at that. However, incrementing something will not get you to zero. Decrementing something will. And since we don’t need i in the example given, we can turn things around.

   for ( unsigned int i = length - 1; i != 0; i-- ) {
      image[start] = color;
      start += width;
   }

My MingW compiler will compile that as something like this:

1. CMP EAX, 0   // compares EAX (i) to 0
2. JE end_of_loop   // if they were equal to eachother, go to the end
3. // stuff inside the loop
4. DEC EAX   // i--
5. JMP comparison

The reason this gets generated by default, is just because it’s probably a standardized way to compile it. If you did this by hand, and here comes the actual optimization, it will look something like this:

1. CMP EAX, 0   // check the precondition != 0 before we go into the loop
2. JE end_of_loop   // if already 0, exit immediately
3. // stuff inside the loop
4. DEC EAX   // i--
5. JNZ stuff_inside loop  // if i != 0 continue loop

As you can see, this loop is 2 opcodes shorter than the regular way, because it can jump right to the contents instead of the comparison. And it’s all because of the zero-flag.

Now, for some reason MingW doesn’t want to compile something like that with any of the optimization flags. It uses a different approach that does makes sense. I’ll c/p the disassembly:

0040130E	test   %edx,%edx
00401310	jmp    0x401318
00401312	mov    %esi,(%eax,%ecx,4)
00401315	add    %ebx,%ecx
00401317	dec    %edx
00401318	jne    0x401312

The ‘test’ opcode, just basically sets flags and below/above/equal hints. And then it jumps to the JNE statement, magically testing if EDX is not 0, and jumps to the MOV statement, that’s the contents of the loop (the “image[start] = color;” line, followed by “start += width;”). Then it decrements i, and then it executes JNE again.

It basically does the same thing, and it really is the optimized version. Instead of JNZ, they’re using JNE. And as an extra, they check for 0 before they actually do an iteration, which is something I forgot about, but it’s only executed once anyway.

So this makes one wonder. Why add the Zero-flag opcodes to the instruction set, if the cpu automatically knows you’re talking about 0 when you do a JNE without a CMP… Probably something historic, and I’m sure they JNE and JNZ had different implementations at some point.

It doesn’t really surprise you if you think about how old the x86 architecture is by now, and how many upgrades and additions exist while remaining x86 compatible…

Anyway, so that’s the Zero-loop-optimization, and your compiler knows all about it…

Entry Filed under: optimization. Tags: , , , , , , , , , , , , .

1 Comment Add your own

  • 1. twigleaf  |  March 14, 2009 at 4:19 pm

    Testing comments… 123…

    Reply

Leave a Comment

Required

Required, hidden

Some HTML allowed:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <pre> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Trackback this post  |  Subscribe to the comments via RSS Feed


RSS Twitter

 

June 2008
S M T W T F S
« May   Jul »
1234567
891011121314
15161718192021
22232425262728
2930  

Categories

Blogroll

Meta