If I might interject here...
I'm writing a BASIC interpreter, and naturally speed is a very important concern for me - by their nature, interpreted languages are slow (which is why things like JIT compilers and dynarecs were created). Normally I'd jump straight into assembly to get the best speed, but unfortunately I can't - it needs to build for both x86 and ARM, so asm is completely ruled out.
To that end, I have to rely on the compiler to produce good code - which is where algorithm choice comes in. Beyond that, I've made great progress by disassembling the output and tweaking the C source to avoid bottlenecks and compiler idiocy - though the cases where I could have done better than the compiler by hand are few and far between.
Of course, there comes a point where I just don't care about getting more speed - dropping into asm quickly becomes a non-starter when it comes to getting the code ported.
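To give a rough idea of the kind of code this applies to, here is a minimal sketch in portable C. It is not my actual interpreter: the opcodes, `STACK_MAX` and `run` are all made up for the example. The point is only that the hot dispatch loop stays in plain C, with its state held in local variables, which typically lets the compiler keep them in registers and turn the dense switch into a jump table on both x86 and ARM. Whether it actually does is exactly what you check in the disassembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (illustration only). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_MAX 64

/* Runs a program and returns the value on top of the stack at OP_HALT.
   Assumes a well-formed program; only the push is checked, and only in
   debug builds (the assert disappears when NDEBUG is defined). */
long run(const long *code)
{
    long stack[STACK_MAX];
    size_t sp = 0;          /* index of the next free stack slot */
    const long *pc = code;  /* local program counter */

    for (;;) {
        switch (*pc++) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = *pc++;   /* operand follows the opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            return sp ? stack[sp - 1] : 0;
        default:
            return 0;              /* unknown opcode: bail out */
        }
    }
}
```

Feeding it the program `OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT` computes (2 + 3) * 4 and returns 20. The same source builds unchanged for either target, which is the whole reason for staying out of asm.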