03 June 2018, 16:06 | #21 |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
I sure hope so
How about these (NOT tested!!): Code:
    clr.l   d1
    move.b  d0,d1
    move.b  (a0,d1.w),d0
    rol.w   #8,d0
    move.b  d0,d1
    move.b  (a0,d1.w),d0
    swap    d0
    move.b  d0,d1
    move.b  (a0,d1.w),d0
    rol.w   #8,d0
    move.b  d0,d1
    move.b  (a0,d1.w),d0
Code:
    bfextu  d0{0:8},d2
    move.b  (a0,d2.w),d1
    lsl.l   #8,d1
    bfextu  d0{8:8},d2
    move.b  (a0,d2.w),d1
    lsl.l   #8,d1
    bfextu  d0{16:8},d2
    move.b  (a0,d2.w),d1
    lsl.l   #8,d1
    bfextu  d0{24:8},d2
    move.b  (a0,d2.w),d1
|
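For anyone who wants to check a table or a result off-target, here is a minimal C sketch of what a byte-LUT flip is meant to compute. It assumes the goal is a full 32-bit mirror and that a0 points at a 256-byte table of bit-reversed bytes. The names `flip8`, `build_flip8` and `flip32` are made up for illustration; this is a reference model, not a translation of either listing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the 256-byte table that a0 points at. */
uint8_t flip8[256];

/* flip8[b] = b with its eight bits mirrored (bit 0 <-> bit 7, and so on). */
void build_flip8(void)
{
    for (unsigned b = 0; b < 256; b++) {
        unsigned r = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (b & (1u << bit))
                r |= 0x80u >> bit;
        }
        flip8[b] = (uint8_t)r;
    }
}

/* Full 32-bit mirror: the lowest source byte becomes the highest result
   byte, and so on, each byte passing through the table exactly once. */
uint32_t flip32(uint32_t x)
{
    assert(flip8[1] == 0x80);   /* the table must have been built first */
    return ((uint32_t)flip8[x & 0xFF] << 24)
         | ((uint32_t)flip8[(x >> 8) & 0xFF] << 16)
         | ((uint32_t)flip8[(x >> 16) & 0xFF] << 8)
         |  (uint32_t)flip8[x >> 24];
}
```

Call `build_flip8()` once, then compare `flip32(value)` against whatever the assembly leaves in the result register for the same input.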
03 June 2018, 17:03 | #22 |
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
They work, but they are in the same league as the pure code (the same speed or slightly slower)
(the second one actually needs a small change, but the concept is right). However, the difference is too small to tell whether there is any real gain (tests on a real machine are absolutely needed). |
03 June 2018, 17:23 | #23 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Quote:
Here's another one. Uses four 1kb tables: Code:
    moveq   #0,d2               ; d2 is the byte index, so the result in d1 is never overwritten
    move.b  d0,d2
    move.l  (a0,d2.w*4),d1
    lsr.w   #8,d0
    or.l    (a1,d0.w*4),d1
    swap    d0
    move.b  d0,d2
    or.l    (a2,d2.w*4),d1
    lsr.w   #8,d0
    or.l    (a3,d0.w*4),d1
Last edited by Thorham; 03 June 2018 at 18:51. Reason: Changed move to or. |
|
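The four-table idea can be modelled in C as follows. This is only a sketch under assumptions: each table has 256 longword entries (1 KB), and table k already holds the mirrored byte shifted to its final position, so the four lookups only need to be ORed together. The names `flip_tab`, `mirror8`, `build_flip_tabs` and `flip32_4tab` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: flip_tab[0..3] stand for the tables behind a0..a3. */
uint32_t flip_tab[4][256];

/* Mirror the eight bits of one byte. */
uint8_t mirror8(uint8_t b)
{
    unsigned r = 0;
    for (unsigned bit = 0; bit < 8; bit++) {
        if (b & (1u << bit))
            r |= 0x80u >> bit;
    }
    return (uint8_t)r;
}

/* Table k holds the mirrored byte pre-shifted to where it ends up:
   source byte 0 (lowest) lands in bits 31..24, source byte 3 in bits 7..0. */
void build_flip_tabs(void)
{
    for (unsigned k = 0; k < 4; k++) {
        for (unsigned b = 0; b < 256; b++)
            flip_tab[k][b] = (uint32_t)mirror8((uint8_t)b) << (24 - 8 * k);
    }
}

/* One lookup per source byte; no shifting of the results is needed. */
uint32_t flip32_4tab(uint32_t x)
{
    assert(flip_tab[3][1] == 0x80);   /* tables must have been built first */
    return flip_tab[0][x & 0xFF]
         | flip_tab[1][(x >> 8) & 0xFF]
         | flip_tab[2][(x >> 16) & 0xFF]
         | flip_tab[3][x >> 24];
}
```

The trade-off is 4 KB of tables instead of 256 bytes in exchange for dropping the shifts between lookups.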
03 June 2018, 18:45 | #24 | ||
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
You have the bit positions reversed (in bit-field instructions, bit 0 is the MSB).
This one is right, but in the end it is practically the same as mine: Code:
_lut8flip3:
    lea     _8lut(pc),a0
    bfextu  d0{24:8},d2
    move.b  (a0,d2.w),d1
    lsl.l   #8,d1
    bfextu  d0{16:8},d2
    move.b  (a0,d2.w),d1
    lsl.l   #8,d1
    bfextu  d0{8:8},d2
    move.b  (a0,d2.w),d1
    lsl.l   #8,d1
    bfextu  d0{0:8},d2
    move.b  (a0,d2.w),d1
    move.l  d1,d0
    rts
Quote:
Quote:
|
||
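The offset numbering is easy to get backwards, so here is a small C model of `bfextu` on a data register. It is a sketch that only covers fields that do not wrap around the register; the function name `bfextu_reg` is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of BFEXTU Dn{offset:width},Dm for a register operand.
   Offset 0 is the most significant bit, so {0:8} is the top byte
   and {24:8} is the bottom byte. Wrapping fields are not modelled. */
uint32_t bfextu_reg(uint32_t value, unsigned offset, unsigned width)
{
    assert(width >= 1 && width <= 32 && offset + width <= 32);
    uint32_t shifted = value >> (32u - offset - width);
    if (width == 32)
        return shifted;
    return shifted & ((1u << width) - 1u);
}
```

With `0x12345678` in the register, `{0:8}` yields `0x12` and `{24:8}` yields `0x78`, which is why the listing starting at `{24:8}` puts the lowest source byte into the highest result byte.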
03 June 2018, 18:50 | #25 | |
Computer Nerd
Join Date: Sep 2007
Location: Rotterdam/Netherlands
Age: 47
Posts: 3,751
|
Thanks
Quote:
Yes, one size really doesn't fit all in this case. And after all that, there's also the 68060... |
|
03 June 2018, 19:28 | #26 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
|
ross seems capable, so I think he could both code and compare to give us the answer
Here's an untested variant from the obvious LUT. It trades 3 memory writes for 2 decode cycles. It was mostly to find out if one could do the same for the memory reads, but from toying with it a few minutes I don't think that's possible. Code:
    moveq   #0,d0
    moveq   #0,d1
    move.b  (a0)+,d0
    move.b  (a0)+,d1
    move.w  (a2,d1.w),d1
    move.b  (a2,d0.w),d1
    swap    d1
    move.b  (a0)+,d0
    move.b  (a0)+,d1
    move.w  (a2,d1.w),d1
    move.b  (a2,d0.w),d1
|
03 June 2018, 21:09 | #27 | ||
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
There is no data cache to help with all the (a0)+ accesses. If the tile data is properly 32-bit aligned in chip mem, a single .l read is a big win. So having d0.l filled in one go is good! Another bottleneck in your code is the move.w (a2,d1.w),d1, which can span two memory lines (actually resulting in two separate reads)! All the rol/lsl and even register bfextu operations are relatively cheap compared to chip mem access; this is what makes the pure (instruction-cached) code so fast. Your code could perhaps be adapted for a pure 68k version (avoiding accesses to odd addresses). Quote:
Anyway not a good idea in this context. |
||
04 June 2018, 09:17 | #28 | |
son of 68k
Join Date: Nov 2007
Location: Lyon / France
Age: 51
Posts: 5,322
|
Quote:
Anyway, for a pure 68000 it may (?) be faster to use a few memory operations rather than many register operations (inner loop here for 2x 8-bit steps): Code:
    move.b  (a0)+,d3            ; upper byte of d3.w/d4.w must already be zero
    move.b  -(a2),d4
    move.b  (a4,d4.w),(a1)+
    move.b  (a4,d3.w),-(a3)
Could be the fastest solution if a 64k table is allowed.
|
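Read as an in-place flip, the two-pointer loop can be sketched in C like this. It assumes the read and write pointers coincide (a0 = a1 at the start of the line, a2 = a3 at the end) and that the table holds bit-reversed bytes; `build_mirror_table` and `flip_line_in_place` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a 256-byte table with bit-reversed bytes. */
void build_mirror_table(uint8_t table[256])
{
    for (unsigned b = 0; b < 256; b++) {
        unsigned r = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (b & (1u << bit))
                r |= 0x80u >> bit;
        }
        table[b] = (uint8_t)r;
    }
}

/* Mirror a line of len bytes without a second buffer: read one byte from
   each end, then store each one, bit-reversed, at the opposite end. */
void flip_line_in_place(uint8_t *line, size_t len, const uint8_t *table)
{
    assert(table != NULL && (line != NULL || len == 0));
    uint8_t *left = line;
    uint8_t *right = line + len;
    while (left < right) {
        uint8_t l = *left;          /* move.b (a0)+,d3        */
        uint8_t r = *--right;       /* move.b -(a2),d4        */
        *left++ = table[r];         /* move.b (a4,d4.w),(a1)+ */
        *right = table[l];          /* move.b (a4,d3.w),-(a3) */
    }
}
```

For an odd length the middle byte is read through both pointers and written twice with the same value, so no special case is needed.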
04 June 2018, 23:16 | #29 |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
If you optimize for 68k, you could also try to use the blitter. I was theorizing about it some time ago here: http://eab.abime.net/showpost.php?p=...&postcount=248 (didn't know it has a fancy name, lol)
It's basically the same thing Thorham proposed (flipping pairwise). It should be 4x6 = 24 clock cycles (and 12 memory cycles) per word, plus overhead for blitter setup and the extra word. The byte table approach given above should be 8cc/2ma + 10/2 + 18/4 + 18/4, so 54 clock cycles and 12 memory accesses per word, plus a small overhead for the loop (negligible for a sufficiently unrolled loop). So the blitter approach is probably only useful for bigger tiles, and if it can use memory cycles not available to the CPU (e.g. borders).
By the way, I do not understand the idea behind using pointers to the start and end of the line: pre-decrement for the source is slightly slower than post-increment (same for the destination), so why not simply flip the line from left to right? Or use one 256-byte and one 256-word table to do word writes, as proposed here: http://eab.abime.net/showpost.php?p=...7&postcount=53
EDIT: Ah, think I got it - you can do the flip in-place without a buffer? Very clever!
Last edited by chb; 05 June 2018 at 14:14. |
05 June 2018, 00:17 | #30 |
Moderator
Join Date: Nov 2004
Location: Eksjö / Sweden
Posts: 5,602
|
chb - he's not optimizing for 68000, and I don't like LUTs, it's hit and miss if you can get a good gain. Certainly more miss the higher up the Motorola family you travel.
ross, I was thinking you'd just run them and report? (Including chipmem r/w for the data words for all variants.) It might be that the fastest one is the one that can do the largest MOVEM. But the 68020 isn't infinitely faster than the 68000, so obviously instruction count and heft matter. From this, I end up with Code:
    move.w  (a6,Rn.w*2),Rn
    swap    Rn
    move.w  (a6,Rn.w*2),Rn
Obviously the fastest would be to run the tile conversion in a batch long beforehand rather than stream them. |
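A C sketch of the three-instruction version, assuming a6 addresses a table of 65,536 word entries (128 KB) in which entry v is v with its 16 bits mirrored. The names `flip16`, `build_flip16` and `flip32_wordtab` are made up. One caveat for the assembly: a word-sized index register is sign-extended on the 68020, so a6 would presumably have to point at the middle of the table; the sketch sidesteps that by indexing with unsigned values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64K-entry word table: flip16[v] = v with 16 bits mirrored. */
uint16_t flip16[65536];

void build_flip16(void)
{
    for (uint32_t v = 0; v < 65536; v++) {
        uint32_t r = 0;
        for (unsigned bit = 0; bit < 16; bit++) {
            if (v & (1u << bit))
                r |= 0x8000u >> bit;
        }
        flip16[v] = (uint16_t)r;
    }
}

/* Two lookups and a half swap: the mirrored low word becomes the high
   word of the result, and the mirrored high word becomes the low word. */
uint32_t flip32_wordtab(uint32_t x)
{
    assert(flip16[1] == 0x8000);    /* the table must have been built first */
    return ((uint32_t)flip16[x & 0xFFFF] << 16) | flip16[x >> 16];
}
```

This makes the size trade-off explicit: two table reads per longword, paid for with a table that is 512 times larger than the 256-byte one.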
05 June 2018, 10:41 | #31 | ||||||
Defendit numerus
Join Date: Mar 2017
Location: Crossing the Rubicon
Age: 53
Posts: 4,468
|
Quote:
Quote:
Quote:
- on a pure 68k, the ops that need more cycles (lsx, rox) don't have to compete with video DMA, unlike the blitter;
- on 020+, instruction cycles are reduced, and memory access and the ALU are full 32-bit, so a 16-bit blitter can be a limit...
Uh, sorry, I have absurdly busy days and I could not even turn on my (virtual) Amiga.. But I've simply adapted my http://eab.abime.net/showpost.php?p=1199574&postcount=1 code; it gives steady and rock-solid results. Quote:
Quote:
Quote:
Sure |
||||||
05 June 2018, 14:24 | #32 | |
Registered User
Join Date: Dec 2014
Location: germany
Posts: 439
|
Quote:
|
|
07 August 2019, 16:42 | #33 |
Registered User
Join Date: Oct 2015
Location: Landsberg / Germany
Posts: 526
|
@Thorham: Thanks for your flipping code. Works great! I'd love to use it in my next game if you don't mind.
|