English Amiga Board


Old 03 June 2018, 17:06   #21
Thorham
Computer Nerd

Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 42
Posts: 3,085
Quote:
Originally Posted by ross View Post
But surely pure code, like Thorham suggested, is a great deal!
I sure hope so


How about these (NOT tested!!):
Code:
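   clr.l    d1              ; d1 = 0, used as zero-extended table index
   move.b   d0,d1           ; index = byte 0 (bits 7..0)
   move.b   (a0,d1.w),d0    ; replace it with its mirrored value
   rol.w    #8,d0           ; swap the two bytes of the low word
   move.b   d0,d1           ; index = original byte 1
   move.b   (a0,d1.w),d0
   swap     d0              ; exchange high and low words
   move.b   d0,d1           ; index = original byte 2
   move.b   (a0,d1.w),d0
   rol.w    #8,d0
   move.b   d0,d1           ; index = original byte 3
   move.b   (a0,d1.w),d0    ; d0 = input with all 32 bits mirrored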
Code:
   bfextu   d0{0:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{8:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{16:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{24:8},d2
   move.b   (a0,d2.w),d1
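For checking routines like these off-target, the first listing can be modelled in C. This is only a sketch: `rev8_table`, `init_rev8_table` and the helper names are made up for illustration, and the table is assumed to hold every byte value with its bits mirrored (the post does not spell out the table contents).

```c
#include <stdint.h>

/* Hypothetical table: rev8_table[b] holds b with its 8 bits mirrored. */
static uint8_t rev8_table[256];

/* Fill the table once; it plays the role of the table a0 points to. */
void init_rev8_table(void)
{
    for (unsigned b = 0; b < 256; b++) {
        uint8_t r = 0;
        for (unsigned bit = 0; bit < 8; bit++)
            if (b & (1u << bit))
                r |= (uint8_t)(0x80u >> bit);
        rev8_table[b] = r;
    }
}

/* rol.w #8,d0: swap the two bytes of the low word, keep the high word. */
uint32_t rol_w_8(uint32_t d0)
{
    uint32_t lo = d0 & 0xFFFFu;
    lo = ((lo << 8) | (lo >> 8)) & 0xFFFFu;
    return (d0 & 0xFFFF0000u) | lo;
}

/* swap d0: exchange the high and low words. */
uint32_t swap_words(uint32_t d0)
{
    return (d0 << 16) | (d0 >> 16);
}

/* move.b d0,d1 / move.b (a0,d1.w),d0: mirror the low byte in place. */
uint32_t flip_low_byte(uint32_t d0)
{
    return (d0 & 0xFFFFFF00u) | rev8_table[d0 & 0xFFu];
}

/* Same step order as the first listing. */
uint32_t flip32_rotate(uint32_t d0)
{
    d0 = flip_low_byte(d0);
    d0 = rol_w_8(d0);
    d0 = flip_low_byte(d0);
    d0 = swap_words(d0);
    d0 = flip_low_byte(d0);
    d0 = rol_w_8(d0);
    d0 = flip_low_byte(d0);
    return d0;
}
```

After calling `init_rev8_table()`, `flip32_rotate(0x12345678)` gives `0x1E6A2C48`, the input with all 32 bits mirrored.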
Old 03 June 2018, 18:03   #22
ross
Omnia fert aetas

Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 48
Posts: 1,249
Quote:
Originally Posted by Thorham View Post
How about these (NOT tested!!):
They work, but they are in the same league as the pure code (the same speed or slightly slower).
(Actually the second one needs a small change, but the concept is right.)

However, the difference is too small to tell whether there is any real gain (tests on a real machine are absolutely needed).

Old 03 June 2018, 18:23   #23
Thorham
Computer Nerd

Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 42
Posts: 3,085
Quote:
Originally Posted by ross View Post
Actually the second one needs a small change, but the concept is right.
What exactly? I never really use those bit field instructions.

Here's another one. Uses four 1kb tables:
Code:
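   moveq    #0,d2           ; d2 = scratch index, so d1 is free to collect the result
   move.b   d0,d2           ; index = byte 0
   move.l   (a0,d2.w*4),d1
   lsr.w    #8,d0           ; d0.w = byte 1
   or.l     (a1,d0.w*4),d1
   swap     d0
   move.b   d0,d2           ; index = byte 2; the partial result in d1 stays intact
   or.l     (a2,d2.w*4),d1
   lsr.w    #8,d0           ; d0.w = byte 3
   or.l     (a3,d0.w*4),d1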
This will most certainly be faster than all code with fastmem.
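A C sketch of the four-table idea, for anyone who wants to generate the tables or check the routine on a host machine. The table layout is an assumption (the post does not give it): each of the hypothetical tables `tab0`..`tab3` stores the mirrored byte already shifted to its final position, so the four lookups only need to be ORed together.

```c
#include <stdint.h>

/* Hypothetical names: four 256-entry longword tables (1 KB each),
   standing in for the tables addressed through a0..a3. */
static uint32_t tab0[256], tab1[256], tab2[256], tab3[256];

/* Mirror the 8 bits of one byte. */
uint8_t mirror8(uint8_t b)
{
    uint8_t r = 0;
    for (int bit = 0; bit < 8; bit++)
        if (b & (1u << bit))
            r |= (uint8_t)(0x80u >> bit);
    return r;
}

/* Each table stores the mirrored byte pre-shifted to where it must land. */
void init_tables(void)
{
    for (unsigned b = 0; b < 256; b++) {
        uint32_t r = mirror8((uint8_t)b);
        tab0[b] = r << 24;  /* source bits 7..0   land in bits 31..24 */
        tab1[b] = r << 16;  /* source bits 15..8  land in bits 23..16 */
        tab2[b] = r << 8;   /* source bits 23..16 land in bits 15..8  */
        tab3[b] = r;        /* source bits 31..24 land in bits 7..0   */
    }
}

/* One lookup per source byte, combined with OR as in the listing. */
uint32_t flip32_four_tables(uint32_t d0)
{
    return tab0[d0 & 0xFFu]
         | tab1[(d0 >> 8) & 0xFFu]
         | tab2[(d0 >> 16) & 0xFFu]
         | tab3[d0 >> 24];
}
```

The trade-off is 4 KB of tables against the shifts and rotates of the single-table versions.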

Last edited by Thorham; 03 June 2018 at 19:51. Reason: Changed move to or.
Old 03 June 2018, 19:45   #24
ross
Omnia fert aetas

Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 48
Posts: 1,249
Quote:
Originally Posted by Thorham View Post
What exactly? I never really use those bit field instructions.
You have the bit positions reversed: for bit field instructions, offset 0 is the MSB.
This version is correct, but in the end it is practically the same as mine:
Code:
_lut8flip3:
   lea      _8lut(pc),a0
   bfextu   d0{24:8},d2     ; offset 24 = least significant byte
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{16:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{8:8},d2
   move.b   (a0,d2.w),d1
   lsl.l    #8,d1
   bfextu   d0{0:8},d2      ; offset 0 = most significant byte
   move.b   (a0,d2.w),d1
   move.l   d1,d0           ; return the flipped longword in d0
   rts
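To make the offset convention concrete, here is a small C model of BFEXTU on a data register. It is a sketch only: the function names are illustrative, `mirror8` stands in for the `_8lut` lookup, and the model covers just the no-wrap case used here. It shows why extracting at offsets 0, 8, 16, 24 mirrors each byte but leaves the byte order unchanged, while 24, 16, 8, 0 gives the full flip.

```c
#include <stdint.h>

/* Model of BFEXTU Dn{offset:width},Dm for a data register.
   Offset 0 is the most significant bit (bit 31). Valid here for
   1 <= width <= 32 and offset + width <= 32 (no wrap-around). */
uint32_t bfextu(uint32_t dn, unsigned offset, unsigned width)
{
    uint32_t shifted = dn >> (32u - offset - width);
    return width == 32u ? shifted : shifted & ((1u << width) - 1u);
}

/* Stand-in for the byte table lookup: mirror the 8 bits of one byte. */
uint8_t mirror8(uint8_t b)
{
    uint8_t r = 0;
    for (int bit = 0; bit < 8; bit++)
        if (b & (1u << bit))
            r |= (uint8_t)(0x80u >> bit);
    return r;
}

/* Offsets 0, 8, 16, 24: the most significant byte is handled first. */
uint32_t lut_msb_first(uint32_t d0)
{
    uint32_t d1 = 0;
    for (unsigned off = 0; off < 32; off += 8)
        d1 = (d1 << 8) | mirror8((uint8_t)bfextu(d0, off, 8));
    return d1;
}

/* Offsets 24, 16, 8, 0: the least significant byte is handled first. */
uint32_t lut_lsb_first(uint32_t d0)
{
    uint32_t d1 = 0;
    for (int off = 24; off >= 0; off -= 8)
        d1 = (d1 << 8) | mirror8((uint8_t)bfextu(d0, (unsigned)off, 8));
    return d1;
}
```

For the input `0x12345678`, `lut_msb_first` returns `0x482C6A1E` (bytes mirrored, order kept) and `lut_lsb_first` returns `0x1E6A2C48` (all 32 bits mirrored).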
Quote:
Here's another one. Uses four 1kb tables:
Using four address registers can be a handicap (saving/restoring them on the stack...), but hey, the more the better
Quote:
This will most certainly be faster than all code with fastmem.
Yes, in fact you need a different version depending on the conditions (pure 68k, chipmem only or 16-bit memory only, 020+, real fastmem available, ...).

Old 03 June 2018, 19:50   #25
Thorham
Computer Nerd

Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 42
Posts: 3,085
Quote:
Originally Posted by ross View Post
You have reversed bit positions (bit 0 is MSB).
Thanks

Quote:
Originally Posted by ross View Post
Using four address registers can be a handicap (saving/restoring them on the stack...)
Depends on how many tiles you have to flip.

Quote:
Originally Posted by ross View Post
Yes, in fact you need a different version depending on the conditions (pure 68k, chipmem only or 16-bit memory only, 020+, real fastmem available, ...).
Yes, one size really doesn't fit all in this case. And after all that, there's also the 68060...
Old 03 June 2018, 20:28   #26
Photon
Moderator
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
ross seems capable, so I think he could both code and compare to give us the answer

Here's an untested variant from the obvious LUT. It trades 3 memory writes for 2 decode cycles. It was mostly to find out if one could do the same for the memory reads, but from toying with it a few minutes I don't think that's possible.

Code:
moveq #0,d0
moveq #0,d1
move.b (a0)+,d0
move.b (a0)+,d1
move.w (a2,d1.w),d1
move.b (a2,d0.w),d1
swap d1
move.b (a0)+,d0
move.b (a0)+,d1
move.w (a2,d1.w),d1
move.b (a2,d0.w),d1
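A host-side C sketch of the trick in this listing, as I read it: the word read from the byte table puts the mirrored second byte into the high half and a junk neighbour into the low half, and the following byte read overwrites the junk. The names are illustrative, and the one-byte pad at the end of the table is an assumption (the post does not mention it) so that the word read at index 255 stays inside the table.

```c
#include <stdint.h>

/* 256 mirrored bytes plus one pad byte, so a 16-bit read at index 255
   does not run past the end (the padding is an assumption). */
static uint8_t rev8_padded[257];

void init_rev8_padded(void)
{
    for (unsigned b = 0; b < 256; b++) {
        uint8_t r = 0;
        for (unsigned bit = 0; bit < 8; bit++)
            if (b & (1u << bit))
                r |= (uint8_t)(0x80u >> bit);
        rev8_padded[b] = r;
    }
    rev8_padded[256] = 0;
}

/* Big-endian 16-bit read at any byte offset, like move.w (a2,d1.w),d1. */
uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Mirror one 16-bit word given as its two bytes in memory order. */
uint16_t flip16_overlap(uint8_t first, uint8_t second)
{
    /* High byte = mirrored second byte, low byte = junk neighbour. */
    uint16_t d1 = read_be16(&rev8_padded[second]);
    /* The byte move overwrites the junk with the mirrored first byte. */
    d1 = (uint16_t)((d1 & 0xFF00u) | rev8_padded[first]);
    return d1;
}
```

With the memory word `0x1234` (bytes `0x12`, `0x34`), `flip16_overlap(0x12, 0x34)` yields `0x2C48`.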
Old 03 June 2018, 22:09   #27
ross
Omnia fert aetas

Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 48
Posts: 1,249
Quote:
Originally Posted by Photon View Post
Here's an untested variant from the obvious LUT. It trades 3 memory writes for 2 decode cycles.
Hi Photon, the big problem here is the 020's limits.
There is no data cache to absorb all the (a0)+ accesses.
If the tile data is properly 32-bit aligned in chipmem, a single .l read is a big win.
So having d0.l already filled is good!

Another bottleneck in your code is the move.w (a2,d1.w),d1, which can span two memory longwords (actually resulting in two separate reads).
All the rol/lsl and even the register bfextu instructions are relatively cheap compared to a chipmem access; this is what makes the pure (instruction-cached) code so fast.

Your code can maybe be adapted for a pure 68k version (avoiding access to odd addresses).
Quote:
It was mostly to find out if one could do the same for the memory reads, but from toying with it a few minutes I don't think that's possible.
I think you are right
Anyway not a good idea in this context.

Old 04 June 2018, 10:17   #28
meynaf
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 45
Posts: 3,245
Quote:
Originally Posted by ross View Post
Your code can maybe be adapted for a pure 68k version (avoiding access to odd addresses).
Word accesses at odd addresses run at normal speed if they do not cross a longword boundary (I think). That would make only 25% of accesses slower than normal. But with chipmem and no data cache I'm not sure.
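The 25% figure can be checked by counting alignments. A tiny C sketch (function names are illustrative), assuming the table index is spread evenly over the four byte offsets within a longword:

```c
#include <stdint.h>

/* 1 if a 16-bit access at this byte address touches two longwords. */
int word_crosses_longword(uint32_t addr)
{
    return (addr >> 2) != ((addr + 1u) >> 2);
}

/* Count how many of the four alignments within a longword cross. */
int crossing_alignments(void)
{
    int n = 0;
    for (uint32_t addr = 0; addr < 4; addr++)
        n += word_crosses_longword(addr);
    return n;
}
```

Only offset 3 crosses, so 1 alignment in 4 pays for the second read.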

Anyway for pure 68000 it may (?) be faster to use few ops in memory rather than many ops in registers (inner loop here for 2x 8-bit steps) :
Code:
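 move.b (a0)+,d3         ; next source byte from the left (upper byte of d3.w must be 0)
 move.b -(a2),d4         ; next source byte from the right (upper byte of d4.w must be 0)
 move.b (a4,d4.w),(a1)+  ; mirrored right byte goes to the left of the destination
 move.b (a4,d3.w),-(a3)  ; mirrored left byte goes to the right of the destination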
Here you have pointers to the start and the end of the line, for both reading and writing, plus a 256-byte table.
Could be the fastest solution if allowed to use 64k table.
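A C sketch of the two-ended loop, for the case where source and destination are the same line, so the flip happens in place. The names are illustrative, and the handling of an odd byte count is my addition rather than something from the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 256-byte table: rev8_lut[b] is b with its bits mirrored. */
static uint8_t rev8_lut[256];

void init_rev8_lut(void)
{
    for (unsigned b = 0; b < 256; b++) {
        uint8_t r = 0;
        for (unsigned bit = 0; bit < 8; bit++)
            if (b & (1u << bit))
                r |= (uint8_t)(0x80u >> bit);
        rev8_lut[b] = r;
    }
}

/* Mirror a line of len bytes in place, working inwards from both ends. */
void flip_line_in_place(uint8_t *line, size_t len)
{
    uint8_t *left = line;         /* like a0/a1: post-increment side */
    uint8_t *right = line + len;  /* like a2/a3: pre-decrement side  */

    while (right - left >= 2) {
        uint8_t l = *left;        /* read both bytes before writing */
        uint8_t r = *--right;
        *left++ = rev8_lut[r];    /* mirrored right byte to the left  */
        *right = rev8_lut[l];     /* mirrored left byte to the right  */
    }
    if (right - left == 1)        /* odd length: middle byte stays put */
        *left = rev8_lut[*left];
}
```

Because both bytes are read before either is written, no temporary line buffer is needed.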
Old 05 June 2018, 00:16   #29
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 104
If you optimize for 68k, you could also try to use the blitter. I was theorizing about it some time ago here: http://eab.abime.net/showpost.php?p=...&postcount=248 (didn't know it has a fancy name, lol)
It's basically the same thing Thorham proposed (flipping pairwise). It should be 4x6 = 24 clock cycles (and 12 memory cycles) per word, plus overhead for blitter setup and the extra word.
The byte table approach given above should be 8cc/2ma + 10/2 + 18/4 + 18/4 (clock cycles / memory accesses per instruction), so 54 clock cycles and 12 memory accesses per word, plus a small overhead for the loop (negligible for a sufficiently unrolled loop).

So the blitter approach is probably only useful for bigger tiles and if it can use memory cycles not available to the CPU (e.g. borders).

I do not understand btw. the idea behind using pointers to start and end of the line - pre-decrement for source is slightly slower than post-increment (same for destination), so why not flip the line from left to right simply? Or use one 256-byte and one 256-word table to do word writes, like proposed here: http://eab.abime.net/showpost.php?p=...7&postcount=53

EDIT: Ah, think I got it - you can do the flip in-place without a buffer? Very clever!

Last edited by chb; 05 June 2018 at 15:14.
Old 05 June 2018, 01:17   #30
Photon
Moderator
Join Date: Nov 2004
Location: Hult / Sweden
Posts: 4,589
chb - he's not optimizing for 68000, and I don't like LUTs, it's hit and miss if you can get a good gain. Certainly more miss the higher up the Motorola family you travel.

ross, I was thinking you'd just run them and report? (Including chipmem r/w for the data words for all variants.) It might be that the fastest one is the one that can do the largest MOVEM. But the 68020 isn't infinitely faster than the 68000, so obviously instruction count and heft matter.

From this, I end up with

Code:
	move.w (a6,Rn.w*2),Rn
	swap Rn
	move.w (a6,Rn.w*2),Rn
which I see matches one you posted. Probably a replacement for the SWAP for A-regs would still pay off, so that you could do a MOVEM.L d0-a5, but aligning and just doing d0-d7 might be an even better idea.

Obviously the fastest would be to run the tile conversion in batch long beforehand rather than stream them.
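For reference, a C model of the three-instruction word-table flip above. It is a sketch under assumptions: `rev16_table` is a hypothetical 65536-entry word table (128 KB) holding each 16-bit value with its bits mirrored. One point the C version hides: on the 68k a word-sized index register is sign-extended, so in the assembly version a6 would have to point at the entry for index 0 in the middle of the table, with the entries for $8000..$FFFF stored below it.

```c
#include <stdint.h>

/* Hypothetical table: rev16_table[w] is w with all 16 bits mirrored. */
static uint16_t rev16_table[65536];

void init_rev16_table(void)
{
    for (uint32_t w = 0; w < 65536; w++) {
        uint16_t r = 0;
        for (unsigned bit = 0; bit < 16; bit++)
            if (w & (1u << bit))
                r |= (uint16_t)(0x8000u >> bit);
        rev16_table[w] = r;
    }
}

/* move.w (a6,Rn.w*2),Rn / swap Rn / move.w (a6,Rn.w*2),Rn */
uint32_t flip32_word_table(uint32_t rn)
{
    /* Mirror the low word; the high word is untouched by move.w. */
    rn = (rn & 0xFFFF0000u) | rev16_table[rn & 0xFFFFu];
    /* swap: exchange the high and low words. */
    rn = (rn << 16) | (rn >> 16);
    /* Mirror what used to be the high word. */
    rn = (rn & 0xFFFF0000u) | rev16_table[rn & 0xFFFFu];
    return rn;
}
```

Two lookups and one swap per longword, paid for with a 128 KB table.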
Old 05 June 2018, 11:41   #31
ross
Omnia fert aetas

Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 48
Posts: 1,249
Quote:
Originally Posted by meynaf View Post
Word accesses at odd addresses run at normal speed if they do not cross a longword boundary (I think). That would make only 25% of accesses slower than normal.
Yep, on average 25% slower.
Quote:
Originally Posted by meynaf View Post
Anyway for pure 68000 it may (?) be faster to use few ops in memory rather than many ops in registers
You are right: on 68k you can certainly trade some instructions for operations in memory.


Quote:
Originally Posted by chb View Post
If you optimize for 68k, you could also try to use the blitter.
---
So blitter approach is probably only useful for bigger tiles and if it can use memory cycles not available for the cpu (e.g. borders).
We should run some tests here too, but I doubt it would be faster than the CPU, for two reasons:
- on pure 68k, the ops that need more cycles (lsx, rox) don't have to compete with video DMA, unlike the blitter;
- on 020+, instruction cycle counts are reduced, and memory access and the ALU are full 32-bit, so a 16-bit blitter can be a limit...

Quote:
Originally Posted by Photon View Post
ross, I was thinking you'd just run them and report?
Uh, sorry, my days are absurdly busy and I could not even turn on my (virtual) Amiga...
But I've simply adapted my http://eab.abime.net/showpost.php?p=1199574&postcount=1 code; it gives steady, rock-solid results.
Quote:
Originally Posted by Photon View Post
It might be that the fastest one is the one that can do the largest MOVEM. But the 68020 isn't infinitely faster than the 68000, so obviously instruction count and heft matter.
This is a good idea.
Quote:
Originally Posted by Photon View Post
From this, I end up with

Code:
	move.w (a6,Rn.w*2),Rn
	swap Rn
	move.w (a6,Rn.w*2),Rn
which I see matches one you posted.
Yes, this is the fastest!

Quote:
Originally Posted by Photon View Post
Probably a replacement for the SWAP for A-regs would still pay off, so that you could do a MOVEM.L d0-a5, but aligning and just doing d0-d7 might be an even better idea.
Yes, the previous code with registers prefilled by MOVEM could be the absolute winner.

Quote:
Originally Posted by Photon View Post
Obviously the fastest would be to run the tile conversion in batch long beforehand rather than stream them.
Sure
Old 05 June 2018, 15:24   #32
chb
Registered User

 
Join Date: Dec 2014
Location: germany
Posts: 104
Quote:
Originally Posted by ross View Post
We should run some tests here too, but I doubt it would be faster than the CPU, for two reasons:
- on pure 68k, the ops that need more cycles (lsx, rox) don't have to compete with video DMA, unlike the blitter;
- on 020+, instruction cycle counts are reduced, and memory access and the ALU are full 32-bit, so a 16-bit blitter can be a limit...
Yep, it's probably useful only in a quite restricted number of scenarios: no competing DMA (borders), CPU busy with other tasks like muls and shifts that do not require a lot of memory cycles, or if the CPU can work in fastmem. And on 020+ without fastmem maybe not at all. So likely not worth the hassle.