31 October 2016, 20:25   #12
meynaf
Quote:
Originally Posted by buggs
Well Meynaf, you'd like to see some code? Here you go. Core loop in horizontal interpolation as an example. Hope the post ain't too long.
Interpolation implies some kind of upsampling, and here your code writes as much data as it reads. If you've done a box filter before, you'd be better off integrating the computation there.

This is horizontal interpolation (from a JPEG decoder). Try to rewrite it with SIMD if you wish:
Code:
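; a0 = input, a1 = output
; d6 = loop count - 1 for dbf (assumed set by the caller, not shown)
; 1st row
 moveq #0,d1
 moveq #0,d0
 move.b (a0)+,d1        ; d1 = first sample
 move.b (a0)+,d0        ; d0 = second sample
 move.b d1,(a1)+        ; leftmost output = first sample, copied as is
 move.l d1,d2
 add.l d2,d2
 add.l d1,d2            ; d2 = 3*first
 add.l d0,d2
 addq.l #2,d2
 lsr.l #2,d2            ; d2 = (3*first + second + 2) >> 2
 move.b d2,(a1)+
; general case
 moveq #0,d2
.xloop                  ; on entry: d1 = previous, d0 = current
 move.l d0,d3
 add.l d3,d3
 add.l d0,d3            ; d3 = 3*current
 add.l d3,d1
 addq.l #1,d1           ; d1 = previous + 3*current + 1
 lsl.l #6,d1            ; bits 15-8 now hold that sum >> 2 (left output)
 move.b (a0)+,d2        ; d2 = next sample
 add.l d2,d3
 addq.l #2,d3
 lsr.l #2,d3            ; d3 = (3*current + next + 2) >> 2 (right output)
 move.b d3,d1           ; low byte = right output
 move.w d1,(a1)+        ; store both output bytes at once
 move.l d0,d1           ; current becomes previous
 move.l d2,d0           ; next becomes current
 dbf d6,.xloop
; last row
 move.l d0,d2
 add.l d2,d2
 add.l d0,d2            ; d2 = 3*last
 add.l d1,d2
 addq.l #1,d2
 lsl.l #6,d2            ; bits 15-8 = (previous + 3*last + 1) >> 2
 move.b d0,d2           ; low byte = last sample, copied as is
 move.w d2,(a1)+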
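For anyone who wants to check a SIMD rewrite against it, here is a minimal C sketch of what the loop above computes, as far as can be read from the assembly. The function name, signature and the `width` parameter are hypothetical (the assembly takes its count in d6), so treat it as a reference model and not as the decoder's actual source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the 68k loop (name and signature are assumptions).
 * Reads `width` samples from in and writes 2*width samples to out.
 * This model requires width >= 2. */
void h2v1_upsample_model(const uint8_t *in, uint8_t *out, size_t width)
{
    assert(width >= 2);

    /* left edge: copy the first sample, then blend 3:1 with its right neighbour */
    out[0] = in[0];
    out[1] = (uint8_t)((3 * in[0] + in[1] + 2) >> 2);

    /* general case: every inner sample yields a left and a right output */
    for (size_t i = 1; i + 1 < width; i++) {
        out[2 * i]     = (uint8_t)((3 * in[i] + in[i - 1] + 1) >> 2);
        out[2 * i + 1] = (uint8_t)((3 * in[i] + in[i + 1] + 2) >> 2);
    }

    /* right edge: blend 3:1 with the left neighbour, then copy the last sample */
    out[2 * width - 2] = (uint8_t)((3 * in[width - 1] + in[width - 2] + 1) >> 2);
    out[2 * width - 1] = in[width - 1];
}
```

Note the two different rounding constants (+1 for the left output, +2 for the right one); a parallel version has to reproduce both to give identical results.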
Btw 1.
Upsampling isn't very important in the final timing.
The most important code is the DCT. Since it's supposed to be done with this data parallelism stuff, that's what I want to see. Good luck without a SIMD multiply.

Btw 2.
The parallel instructions you use here are not documented anywhere.
They just come out of nowhere and I'm supposed to trust this...

Btw 3.
You have to understand that this SIMD stuff will only work for very simple tasks. As soon as it becomes relatively complex, it starts to fail miserably - or you'll have to create new instructions for almost everything you do.
I'm not against doing things in parallel; I'm against creating new big fat registers for the sake of speed.

Btw 4.
This example can be rewritten by adding one simple instruction that averages the four bytes of a longword in parallel. That's one instruction added instead of a whole block, with the same timing if executed on two pipes.
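To make the idea concrete, here is a rough C emulation of what such a longword byte-average instruction could compute. The function names and the two rounding variants are assumptions for illustration; no such instruction is specified here, and how it would be chained to get the 3:1 weighting is not spelled out.

```c
#include <stdint.h>

/* Per-byte average of two packed 32-bit words, rounding down.
 * For each byte: a + b = 2*(a & b) + (a ^ b), so the average is
 * (a & b) + ((a ^ b) >> 1). The 0x7F7F7F7F mask drops the bit that
 * the shift would otherwise carry into the neighbouring byte lane. */
uint32_t avg4_floor(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) >> 1) & 0x7F7F7F7Fu);
}

/* Same, rounding up: a + b = 2*(a | b) - (a ^ b). */
uint32_t avg4_ceil(uint32_t a, uint32_t b)
{
    return (a | b) - (((a ^ b) >> 1) & 0x7F7F7F7Fu);
}
```

Applying the average twice, avg(a, avg(a, b)), gives the 3:1 weights, but its rounding can differ by one from the exact (3*a + b + 2) >> 2 of the scalar code, so a bit-exact decoder would need some care there.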