In my experience with GCC you're not going to see revolutionary improvements by turning your existing code into assembler. So, you are wise to avoid that until you've taken the C code as far as you can.
Without code to look at, it's hard to say how it could be improved. Using 32-bit ints adds up quickly, because you're shuffling around twice as much memory and 32-bit instructions are slower. So it pays to constrain your data to 16 bits everywhere. Shifts are terribly slow, so it helps to keep them to a minimum, or to always shift by exactly 16 bits so the compiler can use a swap instruction instead.
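To make the 16-bit advice concrete, here is a minimal sketch in C. The function names (sum16, high_word, make_long) are made up for illustration and are not from your code. Whether GCC really turns the 16-bit shift into a swap depends on the target, compiler version and optimization flags, so check the generated assembly (gcc -S) rather than taking it on faith.

```c
#include <stdint.h>

/* Hypothetical example: data, loop counter and accumulator are all
   16 bits wide, so nothing in the loop needs 32-bit arithmetic. */
uint16_t sum16(const uint16_t *values, uint16_t count)
{
    uint16_t total = 0;
    while (count--) {
        /* The cast keeps the running total at 16 bits (wraps on overflow). */
        total = (uint16_t)(total + *values++);
    }
    return total;
}

/* Shift by exactly 16: the compiler has the option of swapping the two
   halves of the register instead of doing a multi-bit shift. */
uint16_t high_word(uint32_t v)
{
    return (uint16_t)(v >> 16);
}

/* Same idea in the other direction: build a 32-bit value from two halves. */
uint32_t make_long(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}
```

If a shift count in your code is something like 12 or 15, it may be worth rearranging the data layout so that the value you want sits in the upper or lower half and a 16-bit shift is enough.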
Maybe you can share one part of the code so you can get feedback that might lead to ideas you can propagate through the rest of the code.
The only way to make progress is to chip away at each part of the code, making it slightly faster, until all that work starts to add up.
But at the moment you probably don't know what to do with any one piece of code, so pick one to share and maybe people will have good suggestions.