31 October 2016, 22:00   #13
Registered User

Join Date: May 2016
Location: Rostock/Germany
Posts: 44
Originally Posted by meynaf
Interpolation implies some kind of upsampling, and here your code writes as much data as it reads. If you've done a box filter before, you'd be better off by integrating the computation there.
Actually, I've done plenty of filtering, and btw I tend to avoid box filters whenever possible. For anything from MPEG-1 to MPEG-4, however, there are rules to obey. My code implements exactly what's needed to interpolate the subpixels for these standards. This is nothing but the classic polyphase approach, where you calculate (and keep) only what you really need.
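To illustrate the "calculate only what you need" point, here is a minimal scalar C sketch of one interpolation phase: the horizontal half-pel position as used in MPEG-1/2 motion compensation, a rounded average of two neighbouring samples. It is not the AMMX code discussed in this thread; the name `halfpel_h` and its signature are made up for the example. Note that it writes exactly as many samples as the block needs, because the full-pel phase already exists in the reference frame and is never recomputed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: horizontal half-pel phase only.
 * dst[i] is the sample halfway between src[i] and src[i + 1],
 * rounded to nearest with halves rounded up.
 * src must provide n + 1 readable samples. */
void halfpel_h(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Operands promote to int, so the sum cannot overflow. */
        dst[i] = (uint8_t)((src[i] + src[i + 1] + 1) >> 1);
    }
}
```

A SIMD version does the same thing on several samples per instruction; the arithmetic per sample stays identical.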

Btw 1.
Upsampling isn't very important in the final timing.
The most important code is the DCT. As it's supposed to be done with this data parallelism stuff, well, it's that i want to see. Good luck without SIMD multiply.
You know, people have been implementing ISO/IEC 23002-2 compliant DCT/iDCT algorithms with just shifts and adds. But apart from that, my recent AMMX iDCT (which sparked TuKo's post) performs parallel multiplies at full throughput just fine.
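As a small illustration of the shifts-and-adds idea, the sketch below multiplies by a fixed constant without a multiply instruction. It is only the general technique, not the factorization from ISO/IEC 23002-2 and not the AMMX iDCT. The constant (181/256, roughly cos(pi/4)), the Q8 scaling and the name `mul_cos_pi4_q8` are assumptions chosen for the example, and the input is kept unsigned to avoid the pitfalls of shifting negative values in C.

```c
#include <stdint.h>

/* Hypothetical helper: x * 181 / 256, rounded to nearest,
 * where 181/256 approximates cos(pi/4) = 0.7071...
 * 181 = 128 + 32 + 16 + 4 + 1, so the product is five shifted
 * copies of x added together.
 * Assumes x is small enough that x * 181 + 128 fits in 32 bits. */
uint32_t mul_cos_pi4_q8(uint32_t x)
{
    uint32_t p = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x; /* x * 181 */
    return (p + 128u) >> 8; /* add half of 256, then drop the 8 fraction bits */
}
```

Real multiplier-less transforms pick constants and share intermediate sums so that far fewer additions are needed per coefficient than in this naive expansion.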

Btw 2.
The parallel instructions you use here are not documented anywhere.
They just come out of nowhere and i'm supposed to trust this...
They are documented, but as a work in progress and not in public. Besides, I don't see much point in arguing about the necessity of trust. Take it or leave it, your choice. I just provided the code example you asked for, and I'm not a marketing department.

Btw 3.
You have to understand that this SIMD stuff will only work for very simple tasks. As soon as it becomes relatively complex, it starts to fail miserably - or you'll have to create new instructions for almost everything you do.
I'm not against doing things in parallel, i'm against creating new big fat registers for the sake of speed.
Let me put it this way: I've been coding SIMD for 20 years and tend to think that I know quite well where it applies, where it doesn't, and which engineering compromises led to the trend. AMMX is not the first SIMD ISA to which I've contributed a thought or two.

Btw 4.
This example can be rewritten by creating a simple longword parallel byte average instruction. One instruction added instead of a whole block, same timing if executed on two pipes.
Yes, it can. But it wasn't the only functionality we wanted to have.
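For reference, a longword parallel byte average of the kind suggested in the quote can be emulated in plain C with the well-known SWAR trick below. This is a sketch of the arithmetic only, not the semantics of any existing or proposed instruction; the name `avg_bytes_u32` is made up. It relies on the identity a + b = 2 * (a | b) - (a ^ b), so the rounded-up average of each byte pair is (a | b) - ((a ^ b) >> 1), with a mask to stop bits leaking across byte lanes.

```c
#include <stdint.h>

/* Hypothetical helper: four independent unsigned byte averages
 * in one 32-bit word, each computed as (a + b + 1) >> 1. */
uint32_t avg_bytes_u32(uint32_t a, uint32_t b)
{
    /* The mask clears the bit that the shift drags in from the
     * neighbouring byte, so every lane holds floor((a ^ b) / 2). */
    uint32_t half_diff = ((a ^ b) >> 1) & 0x7F7F7F7Fu;
    /* Per lane half_diff <= (a | b), so no borrow crosses lanes. */
    return (a | b) - half_diff;
}
```

For example, `avg_bytes_u32(0x00FF10FEu, 0x01FF20FFu)` yields `0x01FF18FF`.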
buggs