Manual Code Optimizations – part 1
June 3, 2008
During the many years I spent programming, I’ve had a lot of fun in code optimization (for both speed and space). Not by compiler optimization, but just doing it by hand. Specifically, optimization that’s going to work on any kind of x86 offspring, without the fancy new technologies.
I hope to do a series of these things, but here’s the first one.
The area where code optimization is used at great length, is the world of computer graphics. No stone is left unturned when it comes to graphics for 3D applications and 3D games.
There are a lot of optimizations available for 2D graphics. Early optimizations in the days when I for example set eye on the Quake 1 sourcecode, I remember seeing a lot of assembly code that was tuned to the max. Later on came MMX technology, FPU’s, more registries, and the number of ways to optimize increases everytime – yet they remain untouched as companies prefer to write ‘clean readable code’ in favor of optimized code. They all trust the compilers to optimize, and they can up to a certain level.
The following manual optimization example does not use any kind of special technology, but just utilizes common sense and plain code.
The function that I’m going to optimize is a drawing function called drawVerticalLine(). It’s purpose is to draw a line from given x-y coordinate, vertically, with a given length, on a 2D rgb-plane.
A horizontal line wouldn’t be much of a problem, since all the pixels are adjacent to eachother. However a vertical line – despite visual appearance – is not something you could easily spot from memory.
A 2D plane is made up from a large collection of pixels, where every horizontal line is pasted next to eachother. If you had a plane with a width of 4 pixels, and a height of 2 pixels, the plane would look in memory something like this: {pixel1_1, pixel1_2, pixel1_3, pixel1_4, pixel2_1, pixel2_2, pixel2_3, pixel2_4}
The way the horizontal lines are seperated from eachother by code is to multiply by the width of the plane.
Let me give the entire test-program (it’s in C++, but can be easily translated into C).
#include <cstdlib>
#include <cstdio>
#include <mem.h>
#include <time.h>
typedef struct RGB {
unsigned char r;
unsigned char g;
unsigned char b;
unsigned char a;
};
typedef struct PICTURE {
RGB *image;
unsigned int width;
unsigned int height;
};
void drawVerticalLine( PICTURE *picture, unsigned int x, unsigned int y, unsigned int length, struct RGB color ) {
for ( unsigned int i = 0; i < length; i++ ) {
picture->image[((y + i) * picture->width) + x] = color;
}
}
int main() {
// create a picture with given sizes 100x75
struct PICTURE myPicture;
myPicture.width = 100;
myPicture.height = 75;
// allocates memory space to hold all of the pixels,
// the number of total pixels is calculated by width * height,
// the number of bytes is calculated by multiplying
// the pixelcount with the size of the RGBA structure (which is 4 bytes or 32bits)
myPicture.image = reinterpret_cast<RGB *>( malloc( myPicture.width * myPicture.height * sizeof(struct RGB) ) );
// set every byte in the memory block to 0
memset( myPicture.image, 0, myPicture.width * myPicture.height * sizeof(struct RGB) );
// we want the line to be white
struct RGB white;
white.r = 255;
white.g = 255;
white.b = 255;
// we're not using alpha, but if the RGB structure is aligned to 32 bits, processing the pixels is much easier
white.a = 0;
// nothing can be optimized properly without actually measuring the results
time_t tStart, tEnd;
time( &tStart );
// draw the line loads of times to get a measurable result (the time() function works with seconds)
for ( unsigned int i = 0; i < 0xFFFFFF; i++ ) {
// draw a vertical line on myPicture with length 50 and colour white, starting at (10,10)
drawVerticalLine( &myPicture, 10, 10, 50, white );
}
time( &tEnd );
printf( "time: %ds\n", tEnd - tStart );
// free our allocated image
free( myPicture.image );
return 0;
}
This program takes 14 seconds to run on my machine (not compiler optimized). So we want to optimize this code, but where to start?
There are a few things that cost a lot of time, and you can take this to just about any kind of program and code. The main issue in this example is the time spent inside a loop, specifically the loop inside the function drawVerticalLine. Depending on the length of the line to draw, the time spent while looping through all the pixels is enormous compared to things that are just done once.
void drawVerticalLine( PICTURE *picture, unsigned int x, unsigned int y, unsigned int length, struct RGB color ) {
for ( unsigned int i = 0; i < length; i++ ) {
picture->image[((y + i) * picture->width) + x] = color;
}
}
What the function does is copy the pixel ‘color’ to the destination, which is at a constant X-coordinate, and a incremental Y-coordinate until the given length is reached.
There are a few things here that cost a lot of time, and the obvious one is that we calculate the current position everytime we loop. The thing that is less obvious, is the simple technique of using a pointer inside a pointer to get to the right pixel (the -> assignment). Now, there is generally nothing wrong with those assignments, but it takes an extra opcode that can’t be optimized by the compiler since the pointer can be accessed from outside our function as well.
Since we know we’re not going to use and alter our picture structure from outside the function, we can speed up the code a little by writing:
void drawVerticalLine( PICTURE *picture, unsigned int x, unsigned int y, unsigned int length, struct RGB color ) {
RGB *image = picture->image;
for ( unsigned int i = 0; i < length; i++ ) {
image[((y + i) * picture->width) + x] = color;
}
}
This is the smallest optimization ever, since it doesn’t even show any difference in time. Back in the days when you did wrote MOV memory to registry, it could take 4 clockticks, but these days it’s gotten pretty fast and we could’ve just left it the original way.
So the biggest issue still is the calculation. The slowest part of the calculation is the multiplication. This is not only because it’s an intensive operation, but also because before you actually call the MUL or IMUL opcode you need to prepare at least 2 registers for the operation.
To optimize this code, we needn’t really find a fancy technology, because it has a mathematical solution.
If you really go back into your memory you’ll remember that multiplication is just another fancy word adding the same number to your result for a number of times. 4*4 is the same as 4+4+4+4, and lucky enough for us, addition is something a cpu can do at really fast rates.
What the multiplication in our function basically does, is go 1 pixel down every time, which in memory is the same as going 1 line forward. So we can add the length of 1 line to our current location, but we need a starting point. And that starting point, which does contain a multiplication, can be calculated Before we go into the loop.
The optimized function will look something like this:
void drawVerticalLine( PICTURE *picture, unsigned int x, unsigned int y, unsigned int length, struct RGB color ) {
unsigned int width = picture->width;
unsigned int start = (y * width) + x;
RGB *image = picture->image;
for ( unsigned int i = 0; i < length; i++ ) {
image[start] = color;
start += width;
}
}
As an extra, you’ll see that I assigned an extra variable for width upfront of everything. This is because we’re going to reuse that variable, that we hope will be optimized by the compiler to translate directly into a reusable cpu register.
And there we have it, the application runs now in 6 seconds instead of that initial 14 seconds.
Now, if you actually use the compiler optimize function – which is in my case done with -O3 since I use MingW – the compiler cheats and speeds it up to 2 seconds. Somehow it notices that calling the function a zillion time doesn’t really do anything new.
Entry Filed under: optimization. Tags: 2d, c++, draw, image, line, loop, optimization, rgb.
Trackback this post | Subscribe to the comments via RSS Feed