13 July 2016, 11:02   #205
buggs
Registered User
 
Join Date: May 2016
Location: Rostock/Germany
Posts: 132
Faster nearest neighbor resampling

My apologies for hijacking a concluded thread, but I thought I might have something worth sharing on nearest-neighbor resampling. Of course, with fast hardware one might consider filtered resampling. On the other hand, to reproduce the "original" Amiga sound on Paula hardware (for better or worse, depending on your point of view), nearest neighbor is the way to go.

I recently got hooked on 68k asm again, started tinkering with my old code and shortly thereafter stumbled upon this thread. As far as I remember, I first saw the "addx/add" combo for the position update in Jarno Paananen's PS3M. That one is already quite efficient at runtime (though not as accurate as Bresenham), even on 68000. From my own experience, I wouldn't recommend step tables. One might save registers, but that comes at the cost of a move from memory, which is more expensive than two simple adds.

That being said, my tinkering led to a routine that doesn't need the second add of the well-known "addx/add" combo.

My first-stage 8-bit mixing main loop looks like this (with 68000 cycle annotations):
A0 = input array
D0 = fractional position (upper 16 bits), byte position (lower 16 bits)
D4 = fractional increment (upper 16 bits), byte increment (lower 16 bits)
D2 = pointer to the current volume table (upper 24 bits); the lower 8 bits come from each sample and are addressed as "unsigned"
A2 = output
A5 = remaining bytes in the output, minus 16
D6 = remaining bytes in the input

Code:
.mix_fastloop 
                rept    8
                 move.b (a0,d0.w),d2                               ;14
                 addx.l d4,d0                                           ;8
                 move.l d2,a3                                           ;4
                 move.b (a0,d0.w),d2                                    ;14
                 move.b (a3),d3                                         ;8
                 addx.l d4,d0                                           ;8
                 move.l d2,a3                                           ;4
                 swap   d3                                              ;4
                 move.b (a3),d3                                         ;8
                 move.l d3,(a2)+                                        ;12
                                                                        ;=84 cycles for two bytes (42 cyc per byte w/o check)
                endr
                lea     -16(a5),a5      ; no condition codes changed    ;8
                move.l  a5,d1                                           ;
                swap    d1      ; if( remaining_output_bytes < 16 ) 0xffff else 0x0000
                or.w    d6,d1   ; if( remaining_output_bytes < 16 ) -1 else
                                ;                                   remaining_input_bytes
                lea     (A1,D0.l),A3    ; input: processed bytes
                cmp.w   A3,d1   ; if( D1 < A3 ) -> stop
                bgt.w   .mix_fastloop                              ;48 cycles
                                            ; total (8*84+48) / 16 = 45 cycles/byte
The trick is that the code is crafted to avoid every operation that changes the "x" bit. The carry out of the fractional part left by one "addx" therefore survives until the next "addx", which adds it to the byte position, so I can leave out the second "add" of the usual "addx/add" combo. The prerequisite is the pair "sub.w d4,d0" / "add.l d4,d0" before entering the main loop, which puts the "x" bit into the proper state for the first "addx".
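
For readers without a 68000 at hand, the deferred carry can be checked with a small C model. This is only a sketch under stated assumptions: the struct `Stepper` and the functions `addx_l`, `prime`, `deferred_indices` and `reference_indices` are made-up names, the step is passed as a 16.16 fixed-point value, and the start position is assumed to be at least as large as the integer step so that the `sub.w` in the prologue does not wrap. It compares the indices read by an addx-only loop against a plain 16.16 accumulator.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the two data registers and the 68000 X flag. */
typedef struct {
    uint32_t d0; /* fraction in the upper word, byte position in the lower word */
    uint32_t d4; /* fractional step in the upper word, integer step in the lower word */
    unsigned x;  /* extend flag, 0 or 1 */
} Stepper;

/* addx.l d4,d0: d0 = d0 + d4 + X, X = carry out of bit 31. */
static void addx_l(Stepper *s) {
    uint64_t sum = (uint64_t)s->d0 + s->d4 + s->x;
    s->d0 = (uint32_t)sum;
    s->x = (unsigned)(sum >> 32);
}

/* Prologue from the post: sub.w d4,d0 followed by add.l d4,d0. */
static void prime(Stepper *s) {
    uint16_t lo = (uint16_t)(s->d0 - s->d4);   /* sub.w changes the low word only */
    s->d0 = (s->d0 & 0xFFFF0000u) | lo;
    uint64_t sum = (uint64_t)s->d0 + s->d4;    /* add.l leaves the fraction carry in X */
    s->d0 = (uint32_t)sum;
    s->x = (unsigned)(sum >> 32);
}

/* Indices read by the addx-only loop; step is 16.16 fixed point. */
static void deferred_indices(uint16_t start, uint32_t step, uint16_t *idx, int n) {
    Stepper s;
    assert(start >= (step >> 16));             /* keep sub.w from wrapping in this toy model */
    s.d0 = start;                              /* fraction starts at 0 */
    s.d4 = (step << 16) | (step >> 16);        /* swap halves to match the register layout */
    s.x = 0;
    prime(&s);
    for (int i = 0; i < n; i++) {
        idx[i] = (uint16_t)s.d0;               /* move.b (a0,d0.w),d2 reads here */
        addx_l(&s);                            /* adds the step plus the previous carry */
    }
}

/* Reference: plain 16.16 accumulator, index = integer part. */
static void reference_indices(uint16_t start, uint32_t step, uint16_t *idx, int n) {
    uint32_t acc = (uint32_t)start << 16;
    for (int i = 0; i < n; i++) {
        idx[i] = (uint16_t)(acc >> 16);
        acc += step;
    }
}
```

With a start position of 10 and a step of 0x00018000 (1.5), both functions yield the indices 10, 11, 13, 14, 16 and so on. The fraction in the model runs one step ahead of the byte position, and the X flag carries its overflow into the following add.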

This part of my mixing loop shown above is the "copy" loop, called for the first mixed channel. It doesn't require clearing of the output array (A2). For additional channels to be mixed with said first channel, just two more instructions are added in my routine instead of move.l d3,(a2)+:
Code:
move.l (a2),a6
adda.l  d3,a6
move.l a6,(a2)+
This way, one can mix an arbitrary number of channels into an intermediate 16-bit representation. Please note that I mix to an "unsigned" output format in my intermediate representation, which shifts the "zero point" with each added channel. I keep track of the number of mixed channels and perform the compensation in the output-to-ChipRAM stage.
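
The copy/add scheme and the zero-point compensation can be sketched in C as follows. Several things here are assumptions rather than details from the post: the names `build_volume_table`, `mix_copy`, `mix_add` and `unpack_signed` are invented, the volume table is assumed to scale a signed sample by vol/64 and store it with an offset of 128, and the input is taken as already resampled so that only the mixing is shown. Two 16-bit lanes share one 32-bit word, as in the `move.l` / `adda.l` sequence, which works because each lane only ever accumulates unsigned values and stays far below 65536 for any realistic channel count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ZERO_POINT = 128 };  /* assumption: per-channel offset of a table entry */

/* Hypothetical volume table: signed sample * vol / 64, stored unsigned. */
static void build_volume_table(uint8_t table[256], int vol) {
    assert(vol >= 0 && vol <= 64);
    for (int i = 0; i < 256; i++) {
        int s = i < 128 ? i : i - 256;          /* index byte read as a signed sample */
        table[i] = (uint8_t)(s * vol / 64 + ZERO_POINT);
    }
}

/* First channel: overwrite the buffer, so it needs no clearing. */
static void mix_copy(uint32_t *out, const uint8_t *in, const uint8_t *table, size_t pairs) {
    for (size_t i = 0; i < pairs; i++) {
        uint32_t first = table[in[2 * i]];
        uint32_t second = table[in[2 * i + 1]];
        out[i] = (first << 16) | second;        /* swap + move.b builds the same pair */
    }
}

/* Further channels: one 32-bit add updates both 16-bit lanes. */
static void mix_add(uint32_t *out, const uint8_t *in, const uint8_t *table, size_t pairs) {
    for (size_t i = 0; i < pairs; i++) {
        uint32_t first = table[in[2 * i]];
        uint32_t second = table[in[2 * i + 1]];
        out[i] += (first << 16) | second;
    }
}

/* Output stage: extract one lane and remove the accumulated offset. */
static int unpack_signed(uint32_t pair, int lane, int channels) {
    int v = lane == 0 ? (int)(pair >> 16) : (int)(pair & 0xFFFFu);
    return v - channels * ZERO_POINT;
}
```

Mixing a full-volume channel holding the signed samples 0, 127, -128, -1 with a half-volume channel holding 64, -64, 0, 16 gives 32, 95, -128, 7 after `unpack_signed` with a channel count of 2.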

This main loop is quite a bit faster than the old code I used (derived from PS3M) on 68000 and also performs nicely on 68060.

Downsides: The volume table shown in this snippet is 8 bits deep. For low-volume channels, this method introduces quantization noise in the mixing stage, which would be an annoyance when 14- or 16-bit output of the mixed channels is desired.
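
To get a feeling for the size of that effect, the following C sketch counts how many distinct output values an 8-bit table can produce at a given volume. The table formula (signed sample times vol/64, offset by 128) and the names `table_entry` and `count_levels` are assumptions for illustration, not the actual table from the routine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-bit table entry: signed sample * vol / 64, stored with offset 128. */
static uint8_t table_entry(int index, int vol) {
    assert(index >= 0 && index < 256 && vol >= 0 && vol <= 64);
    int s = index < 128 ? index : index - 256;  /* index byte read as a signed sample */
    return (uint8_t)(s * vol / 64 + 128);
}

/* Number of distinct output values the table can produce at this volume. */
static int count_levels(int vol) {
    uint8_t seen[256] = {0};
    int levels = 0;
    for (int i = 0; i < 256; i++) {
        uint8_t v = table_entry(i, vol);
        if (!seen[v]) {
            seen[v] = 1;
            levels++;
        }
    }
    return levels;
}
```

Under these assumptions `count_levels(64)` returns 256, `count_levels(4)` returns 16 and `count_levels(1)` returns 4, so a quiet channel is left with only a handful of amplitude steps no matter how wide the final output is.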

Maybe that stuff is of some use to someone. At least, I had fun coding on 68000 again.