Most optimized Atari ST to Amiga real time screen converter

Galahad/FLT · 12 January 2014, 17:00

Right, so I've been tackling this two different ways. The first routine I did was fast enough, but it was optimized more for size. For the second, I gave up on size and reduced the routine to only the most essential instructions, with no counter to decrement and no bne.

So here's my first routine, the one that's optimized for size:

Process_game_screen:

movem.l d0/a0-a4,-(a7)
move.l videobase(pc),a0 ;Base address of Atari ST screen
lea Amiga_screen,a1 ;Base address of Amiga Screen
move.l #$1f40,d0 ; Size of bitplane
move.l a1,a2
add.l d0,a2
move.l a2,a3
add.l d0,a3
move.l a3,a4
add.l d0,a4
loop_until_copied:
move.w (a0)+,(a1)+
move.w (a0)+,(a2)+
move.w (a0)+,(a3)+
move.w (a0)+,(a4)+
subq.l #2,d0
bne.s loop_until_copied
movem.l (a7)+,d0/a0-a4
rts

However, it's not a great routine, because that tight loop of moves through address registers is repeated 4000 times!

So I thought that if I removed the subq.l #2,d0 and the bne, that would make it slightly quicker. Obviously, removing those means I have to write out the four moves 4000 times instead, but if I do that, I'm also not executing the subq.l and the bne 4000 times either.

Clearly, that leads to a massive routine, but I have the memory I need in extra memory, so that's not an issue.

So, can anyone else see any better ways of doing this which will lead to a faster routine?

Please note, I'm not looking for coding elegance. I'm looking to see if my routine can be significantly, or even slightly, sped up. I have only tested Where Time Stood Still on an emulated A500, so I have no clue whether it will run exactly the same on a physical machine.

If it is the same, then it runs at an acceptable speed, but any improvements would be welcome.

Don_Adan · 12 January 2014, 17:55

add.w and subq.w are faster on the 68000 than add.l and subq.l. Anyway, a simple dbf is faster still than subq.w plus bne.b.
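
As a rough sketch of that suggestion (not tested on real hardware), the loop from the first post could count in d0 with dbf instead of subq.l and bne.s. With nominal 68000 timings, the loop control drops from 18 cycles per pass (8 for subq.l #2,d0 plus 10 for a taken bne.s) to 10 cycles for a taken dbf. The label .copy and the use of lea for the plane pointers are illustrative choices; the other names come from the original routine.

```asm
Process_game_screen:
        movem.l d0/a0-a4,-(a7)
        move.l  videobase(pc),a0    ; base address of Atari ST screen
        lea     Amiga_screen,a1     ; Amiga plane 0
        lea     $1f40(a1),a2        ; Amiga plane 1
        lea     $1f40(a2),a3        ; Amiga plane 2
        lea     $1f40(a3),a4        ; Amiga plane 3
        move.w  #$1f40/2-1,d0       ; 4000 passes, dbf needs count-1
.copy
        move.w  (a0)+,(a1)+         ; ST word 0 -> plane 0
        move.w  (a0)+,(a2)+         ; ST word 1 -> plane 1
        move.w  (a0)+,(a3)+         ; ST word 2 -> plane 2
        move.w  (a0)+,(a4)+         ; ST word 3 -> plane 3
        dbf     d0,.copy            ; decrement d0.w, loop until it reaches -1
        movem.l (a7)+,d0/a0-a4
        rts
```

dbf only looks at the low word of d0 and runs the body count+1 times, which is why the counter starts at 4000-1.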

Galahad/FLT · 12 January 2014, 18:00

But I'm thinking that doing away with the sub, bne or dbf altogether and repeating the body of that tight loop instead is going to be quicker still, although that is 32K of instructions it has to run through to build the screen.

Just a shame the blitter can't be used to any degree here.
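
A middle ground between the two, sketched here on the assumption that the assembler supports REPT/ENDR (Devpac and vasm do, for example): unroll the body a few times and keep a single dbf. Using nominal 68000 timings (12 cycles per move.w (a0)+,(an)+, 10 per taken dbf) and ignoring chip RAM contention, the original loop costs about 4000 x 66 = 264,000 cycles, the fully unrolled version about 4000 x 48 = 192,000, and an 8x unroll about 500 x 394 = 197,000 in 68 bytes of code instead of 32K. UNROLL and .copy are hypothetical names; a0-a4 are assumed to be set up as in the first post.

```asm
UNROLL  =       8                   ; hypothetical unroll factor, must divide 4000

        move.w  #4000/UNROLL-1,d0   ; 500 passes of 8 groups each
.copy
        REPT    UNROLL
        move.w  (a0)+,(a1)+         ; ST word 0 -> plane 0
        move.w  (a0)+,(a2)+         ; ST word 1 -> plane 1
        move.w  (a0)+,(a3)+         ; ST word 2 -> plane 2
        move.w  (a0)+,(a4)+         ; ST word 3 -> plane 3
        ENDR
        dbf     d0,.copy            ; one branch per 8 groups instead of per group
```

Raising UNROLL shrinks the remaining dbf overhead further, with full unrolling as the limit.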

Don_Adan · 12 January 2014, 19:59

You can check the game Lethal Xcess. It is a dual-format game and uses ST graphics on the Amiga. Perhaps Mad Max used something interesting there.

Asman · 12 January 2014, 20:29

@Galahad/FLT

You can check this routine. I haven't tested it yet, but it should work. I will do some tests this evening.

Code:

    move.l  #amount,A5  ;Damn, I'm a mathematician and I will calc this this evening :)

.loop
    movem.l (A0)+,D0-D7
    
    ;D0 - 0 and 1    ;D1 - 2 and 3
    ;D2 - 0 and 1    ;D3 - 2 and 3
    ;D4 - 0 and 1    ;D5 - 2 and 3
    ;D6 - 0 and 1    ;d7 - 2 and 3
    
    movem.w D0/D2/d4/D6,(A2)
    swap    D0
    swap    D2
    swap    D4
    swap    D6
    movem.w D0/D2/D4/D6,(A1)
    addq.l  #8,A1
    addq.l  #8,A2
    
    movem.w D1/D3/D5/D7,(A4)
    swap    D1
    swap    D3
    swap    D5
    swap    D7
    movem.w D1/D3/D5/D7,(A3)
    addq.l  #8,A3
    addq.l  #8,A4

    subq.w  #1,A5
    bne .loop

Don_Adan · 12 January 2014, 20:42

This code can't work, because subq on an address register does not set the condition codes:
subq.w #1,A5
bne .loop

Galahad/FLT · 12 January 2014, 21:11

Quote:

Originally Posted by Asman

@Galahad/FLT

You can check this routine. I haven't tested it yet, but it should work. I will do some tests this evening.

Code:

amount = $3e8 ; added
    move.l  #amount,A5  ;Damn, I'm a mathematician and I will calc this this evening :)
    suba.l  a6,a6       ;Added
.loop
    movem.l (A0)+,D0-D7
    
    ;D0 - 0 and 1    ;D1 - 2 and 3
    ;D2 - 0 and 1    ;D3 - 2 and 3
    ;D4 - 0 and 1    ;D5 - 2 and 3
    ;D6 - 0 and 1    ;d7 - 2 and 3
    
    movem.w D0/D2/d4/D6,(A2)
    swap    D0
    swap    D2
    swap    D4
    swap    D6
    movem.w D0/D2/D4/D6,(A1)
    addq.l  #8,A1
    addq.l  #8,A2
    
    movem.w D1/D3/D5/D7,(A4)
    swap    D1
    swap    D3
    swap    D5
    swap    D7
    movem.w D1/D3/D5/D7,(A3)
    addq.l  #8,A3
    addq.l  #8,A4

    subq.w  #1,A5
    cmp.l    a5,a6    ;added
    bne .loop

I've added some stuff to get it to work. I'm not entirely sure it's faster, but it does in fact work, so nice one.
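
One possible way to tidy the loop control, as an untested sketch: since subq on A5 cannot drive the branch by itself, A5 could instead hold the end address of the ST screen and be compared with A0, which removes both the subq and the zeroed A6. The 32000-byte size assumes four planes of $1f40 bytes as in the first post; A0 and A1-A4 are assumed to be set up as before.

```asm
    lea     32000(a0),a5        ; a5 = end of ST screen (4 x $1f40 bytes, assumed)
.loop
    movem.l (a0)+,d0-d7         ; four 16-pixel groups, 32 bytes

    movem.w d0/d2/d4/d6,(a2)    ; low words = plane 1
    swap    d0
    swap    d2
    swap    d4
    swap    d6
    movem.w d0/d2/d4/d6,(a1)    ; former high words = plane 0
    addq.l  #8,a1
    addq.l  #8,a2

    movem.w d1/d3/d5/d7,(a4)    ; low words = plane 3
    swap    d1
    swap    d3
    swap    d5
    swap    d7
    movem.w d1/d3/d5/d7,(a3)    ; former high words = plane 2
    addq.l  #8,a3
    addq.l  #8,a4

    cmpa.l  a0,a5               ; source pointer reached the end?
    bne.s   .loop
```

cmpa does set the condition codes, so the bne works here without a separate counter.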

Asman · 12 January 2014, 21:21

@Don_Adan

Right. Thanks.

@Galahad/FLT
I have another idea: use the blitter to copy one bitplane. But I will check it first to be sure.

Galahad/FLT · 12 January 2014, 21:36

Good man, enthusiasm, I love it

mr.spiv · 12 January 2014, 22:02

Quote:

Originally Posted by Galahad/FLT

But I'm thinking that doing away with the sub, bne or dbf altogether and repeating the body of that tight loop instead is going to be quicker still, although that is 32K of instructions it has to run through to build the screen.

Just a shame the blitter can't be used to any degree here.

Multiple blitter passes do the job. And when using blitter interrupts, your code does not need to wait between passes, so you can use the CPU to do other stuff in the meantime.

Example off the top of my head, no warranties, as I did not think too much about this:

Use the A and D channels. Set the A modulo to 6 and the D modulo to 0, point A at word 0 of the ST framebuffer and D at Amiga plane 0, and start with a blit size of width 1, height 1024 for plane 0. In the blitter interrupt, just restart the blitter until plane 0 has been copied. Then move on to plane 1, etc. Just a thought.
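
A minimal, untested sketch of that chaining, under several assumptions: the system has already been taken over (so the level 3 autovector at $6c can be replaced), the ST screen lies in chip RAM where the blitter can reach it, and BLIT is the only level 3 source enabled. One plane is 4000 words, so at width 1 it takes three blits of 1024 lines plus one of 928. The blitter pointers carry on from where the previous blit stopped, so the handler only has to write the next BLTSIZE. BlitIrq, BlitQueue, BlitQueuePtr and StartPlaneCopy are hypothetical names; the register offsets are the standard custom chip ones.

```asm
CUSTOM  =       $dff000
INTREQR =       $01e
BLTCON0 =       $040
BLTAFWM =       $044
BLTAPT  =       $050
BLTDPT  =       $054
BLTSIZE =       $058
BLTAMOD =       $064
BLTDMOD =       $066
INTENA  =       $09a
INTREQ  =       $09c

; a0 = first ST word of the plane, a1 = Amiga plane. Blitter must be idle.
StartPlaneCopy:
        movem.l a2/a5,-(sp)
        lea     CUSTOM,a5
        move.l  #BlitIrq,$6c.w          ; level 3 autovector (assumes VBR = 0)
        move.w  #$c040,INTENA(a5)       ; SET + INTEN + BLIT
        move.l  #$09f00000,BLTCON0(a5)  ; D = A, BLTCON1 = 0
        move.l  #$ffffffff,BLTAFWM(a5)  ; no first/last word masking
        move.w  #6,BLTAMOD(a5)          ; skip the other three ST words
        move.w  #0,BLTDMOD(a5)
        move.l  a0,BLTAPT(a5)
        move.l  a1,BLTDPT(a5)
        lea     BlitQueuePtr(pc),a2
        move.l  #BlitQueue+2,(a2)       ; handler continues with the 2nd entry
        move.w  BlitQueue(pc),BLTSIZE(a5) ; writing BLTSIZE starts the 1st blit
        movem.l (sp)+,a2/a5
        rts

BlitIrq:
        movem.l d0/a0-a2,-(sp)
        lea     CUSTOM,a1
        move.w  INTREQR(a1),d0
        btst    #6,d0                   ; bit 6 = BLIT finished
        beq.s   .exit
        move.w  #$0040,INTREQ(a1)       ; acknowledge the request
        lea     BlitQueuePtr(pc),a0
        move.l  (a0),a2
        move.w  (a2)+,d0                ; next BLTSIZE, 0 = queue finished
        beq.s   .exit
        move.l  a2,(a0)                 ; remember position for the next interrupt
        move.w  d0,BLTSIZE(a1)          ; pointers continue from the last blit
.exit
        movem.l (sp)+,d0/a0-a2
        rte

BlitQueuePtr:
        dc.l    0
BlitQueue:
        dc.w    0*64+1,0*64+1,0*64+1    ; 3 x 1024 lines, 1 word wide
        dc.w    928*64+1                ; remaining 928 lines
        dc.w    0                       ; end marker (convention of this sketch)
```

Moving on to the next plane would mean calling StartPlaneCopy again with a0 advanced by 2 and a1 by $1f40 once the queue has run out. How the main code learns that (a flag set in the handler, for example) is left out here.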

kamelito · 13 January 2014, 08:06

Might contain interesting ideas.
http://www.looksgoodworkswell.com/el...macpaint-code/

Kamelito

phx · 13 January 2014, 11:36

Quote:

Originally Posted by Don_Adan

add.w and subq.w are faster on the 68000 than add.l and subq.l.

There is no difference between subq.w and subq.l when the destination is an address register (same for addq, of course).

Galahad/FLT · 13 January 2014, 21:42

Quote:

Originally Posted by kamelito

Might contain interesting ideas.
http://www.looksgoodworkswell.com/el...macpaint-code/

Kamelito

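Yes, Asman's example uses a genesis of that idea.

Unfortunately, because of the weird way the Atari ST lays out its graphics (the four bitplane words of each 16-pixel group are interleaved, whereas the Amiga keeps each bitplane in its own block), it's not possible to simply do a straight copy, which is what that MacPaint example uses.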

Quote:

Originally Posted by mr.spiv

Multiple blitter passes do the job. And when using blitter interrupts, your code does not need to wait between passes, so you can use the CPU to do other stuff in the meantime.

Example off the top of my head, no warranties, as I did not think too much about this:

Use the A and D channels. Set the A modulo to 6 and the D modulo to 0, point A at word 0 of the ST framebuffer and D at Amiga plane 0, and start with a blit size of width 1, height 1024 for plane 0. In the blitter interrupt, just restart the blitter until plane 0 has been copied. Then move on to plane 1, etc. Just a thought.

Care to elaborate with an example?

mr.spiv · 13 January 2014, 22:59

Check your PM.

Asman · 13 January 2014, 23:10

@Galahad/FLT

I did some tests and here it is. It uses mr.spiv's method (thanks a lot, mr.spiv) plus a CPU copy. It must be called twice, or you can copy and paste the code. (For tests I used a Degas picture from Rolling Thunder, LOADER.PI1. I still hope that some day I will get angry enough to convert this game as it should be.) So you will certainly have to adapt some things, and some things can be optimized.

Code:

;use WAITBLITTER somewhere at the beginning of the program
		move.w	#6,bltamod(a5)
		move.w	#0,bltdmod(a5)
		move.l	#$09f00000,bltcon0(a5)
		move.l	#$ffffffff,bltafwm(a5)

		lea	degas+34,a0
		move.l	screen(a6),a1 
		bsr	CopySt
		lea	degas+34+4,a0
		move.l	screen(a6),a1
		add.l	#$1f40*2,a1
		bsr	CopySt

Everything should be clear. If not, just ask.

Code:

CopySt:
		move.l	a0,bltapt(a5)
		move.l	a1,bltdpt(a5)
		move.w	#0*64+1,bltsize(a5)	;1024 height

		move.l	#$1f40,D0
		move.l	a1,a2
		add.l	d0,a2
		lea	2(a0),a3
		
		move.w	#$1f40/8-1,D1
.1		move.w	(a3),(a2)+
		addq.l	#8,a3
		dbf	D1,.1

		WAITBLITTER
		move.w	#0*64+1,bltsize(a5)	;1024 height

		move.w	#$1f40/8-1,D1
.2		move.w	(a3),(a2)+
		addq.l	#8,a3
		dbf	d1,.2

		WAITBLITTER
		move.w	#0*64+1,bltsize(a5)	;1024 height

		move.w	#$1f40/8-1,D1
.3		move.w	(a3),(a2)+
		addq.l	#8,a3
		dbf	d1,.3

		WAITBLITTER
		move.w	#928*64+1,bltsize(a5)	;remaining 928 height (3*1024+928 = 4000)

		move.w	#$1f40/8-1,D1
.4		move.w	(a3),(a2)+
		addq.l	#8,a3
		dbf	d1,.4
		rts

It's faster and I tested it on my A1200. I have another idea but I'm not sure if it works - so I will check it first.
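
WAITBLITTER is not defined in the post. A common form, sketched here on the assumption that a5 holds $dff000 as in the code above, that dmaconr ($002) comes from the hardware include, and that the assembler accepts Devpac-style macros with \@ for unique labels, might be:

```asm
WAITBLITTER     MACRO
        tst.w   dmaconr(a5)         ; dummy read, commonly recommended for early Agnus chips
.wait\@ btst    #6,dmaconr(a5)      ; bit 6 of the high byte = BBUSY (bit 14 of DMACONR)
        bne.s   .wait\@             ; spin while the blitter is busy
        ENDM
```

Each use expands to its own busy-wait loop, so it can be dropped in before every write to bltsize.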

TCD · 13 January 2014, 23:25

for this thread

Galahad/FLT · 13 January 2014, 23:37

@Asman, great work dude, it's definitely faster, but the last few lines are missing from the bottom of the screen, as if a couple of bitplanes haven't been written properly. Will check that I've actually copied your code properly!

EDIT: Right, a typo on my part.

I've got a feeling that the CPU routine you wrote before, with the movem.w instructions, was quicker, because I'm getting a flicker when moving which I didn't have before, and I'm not so sure it's quicker.

Will have to do more testing to see.

Asman · 14 January 2014, 18:36

Hm... my next idea was to use the blitter again, but this attempt is slower than the previous one. I use the blitter with four channels to speed up the previous blitter copy (a longword at a time instead of a word). It uses the operation D = A + BC, with A masked by $ffff0000 and C containing the mask $0000ffff.

Code:

	lea	maskC,a3
	lea	degas+34,a0
	lea	6(a0),a1
	
	move.l	screen(a6),a2
	
	WAITBLITTER
	move.w	#12,bltamod(a5)
	move.w	#12,bltbmod(a5)
	move.w	#-4,bltcmod(a5)
	move.w	#0,bltdmod(a5)
	move.l	#$0ff80000,bltcon0(a5)	;A, B, C and D enabled, minterm $f8 (D = A + BC)
	move.l	#$ffff0000,bltafwm(a5)
	move.l	a0,bltapt(a5)
	move.l	a1,bltbpt(a5)
	move.l	a3,bltcpt(a5)
	move.l	a2,bltdpt(a5)
	move.w	#0*64+2,bltsize(a5) ;1024 longwords
	
	rts

	;must be located in CHIP
maskC:	dc.w	0,-1

So I think the best approach is the previous one, perhaps with a different CPU routine.

mr.spiv · 14 January 2014, 20:07

I would, as originally hinted, chain the blits using the blitter interrupt. Since we are only using two channels, I would blit two planes with the blitter and, once that has started, do the other two using the CPU. Then you do not need blitter waits between CPU passes.

copse · 14 January 2014, 21:18

Has anyone been measuring the timings for these, and can they give them?

Should also note that this thread is some top shit.
