Optimized Alpha-Blending in asm? - Page 5

AGS · 02 September 2014, 14:14

Ah! And is this new method as accurate as the divs-version?

Mrs Beanbag · 02 September 2014, 14:23

with the multiplication by 258 it should be accurate to 2 parts in 65535 (255*255*258/256 = 65533.0078...)

that should be good enough for anyone
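
To put a number on that, here is a minimal Python sketch (not code from this thread) that compares the multiply-by-258-and-keep-the-high-word approximation with an exact division by 255 for every 8-bit value and alpha. The function names are made up for illustration, and it only covers non-negative products, which is a simplifying assumption; the asm discussed here multiplies a signed difference.

```python
def scale_exact(value: int, alpha: int) -> int:
    """Reference: value * alpha / 255, truncated."""
    return value * alpha // 255


def scale_258(value: int, alpha: int) -> int:
    """Approximation: multiply by alpha * 258, keep the high 16 bits."""
    return (value * alpha * 258) >> 16


def max_error_258() -> int:
    """Largest absolute difference over all 8-bit value/alpha pairs."""
    return max(
        abs(scale_exact(v, a) - scale_258(v, a))
        for a in range(256)
        for v in range(256)
    )


if __name__ == "__main__":
    print("max error:", max_error_258())
```

Under these assumptions the approximation never drifts more than one step away from the exact quotient, and full alpha on a full-intensity value still yields 255.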

AGS · 02 September 2014, 14:27

But it's more accurate and faster than the method with the lsr #8 (instead of the div by 255), right? An lsr takes twice as many cycles as the number of bits shifted.

Thorham · 02 September 2014, 15:11

Quote:

Originally Posted by AGS

But it's more accurate and faster than the method with the lsr #8 (instead of the div by 255), right? An lsr takes twice as many cycles as the number of bits shifted.

On the 68020 and 68030, eight-bit shifts and swaps all take four cycles when executed from the cache. For shifts of more than eight bits you have to add two cycles (swap is faster when you can use it, but you can't always, because it's essentially a cheap rotate).

On the 68000 shifts take up more cycles per bit.

Mrs Beanbag · 02 September 2014, 15:13

on 68000, lsr takes 6+2n cycles, swap takes 4 cycles, so yes it is both more accurate and faster.

on 68020+, however, lsr and swap take the same time. But then 68000 doesn't have muls.l anyway. muls.l is much slower than muls.w, too. We could instead multiply alpha by 129 and use a muls.w followed by a left shift.

AGS · 02 September 2014, 15:20

Quote:

Originally Posted by Mrs Beanbag

We could instead multiply alpha by 129 and use a muls.w followed by a left shift.

Could you give me an example? Will it still work with swap then, or ...?

Mrs Beanbag · 02 September 2014, 15:31

well it would be either lsl #1 followed by swap, or lsr #7 and lsr #8. We'd need to shift right 15 bits instead of 16.

AGS · 02 September 2014, 15:40

Ok, I tried it this way:

Code:

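	moveq	#0,d0		; clear scratch register
	moveq	#0,d2		; clear d2

	move.b	d2,d0		; d0 = alpha (low byte of d2)
	lsl.w	#7,d2		; d2 = alpha * 128
	add.w	d0,d2		; d2 = alpha * 129

	moveq	#0,d0		; clear d0 before loading the source byte

	move.b	(a0)+,d0	; d0 = source pixel
	move.b	(a1),d1		; d1 = destination pixel (upper bits assumed clear)
	sub.l	d1,d0		; d0 = source - destination
	muls.w	d2,d0		; d0 = difference * alpha * 129 (signed 16x16)
	lsl.l	#1,d0		; double it, same as multiplying by 258
	swap	d0		; low word = high word of the product
	add.w	d1,d0		; add the destination back
	move.b	d0,(a1)+	; store the blended pixel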

Doesn't work though.

AGS · 02 September 2014, 15:57

Quote:

Originally Posted by Mrs Beanbag

well it would be either lsl #1 followed by swap, or lsr #7 and lsr #8. We'd need to shift right 15 bits instead of 16.

Doesn't the highest bit disappear in an lsl #1,d0?

Mrs Beanbag · 02 September 2014, 15:59

can't see why it wouldn't work. it would help to have a more detailed description of what isn't working.

those first two moveq #0s are inside the loop, right?

Quote:

Originally Posted by AGS

Doesn't the highest bit disappear in an lsl #1,d0?

Multiplying by 258 or multiplying by 129 followed by a left shift should be the same. I hope that seems obvious. Lsl produces the same result as Asl (only the overflow flag differs); the sign only matters in right shifts.

254*129 = 32766 < 32768 so that shouldn't overflow. 255 is handled elsewhere. hmm...
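
Both claims can be checked with a short Python sketch (an illustration, not code from this thread): alpha * 129 stays inside the signed 16-bit range that muls.w needs for every alpha up to 254, and multiplying by 129 then shifting left once gives the same high word as multiplying by 258 directly. The helper names are hypothetical, and Python's `>>` on negative integers (which floors) is assumed to stand in for taking the high word of the signed 32-bit product.

```python
def fits_signed_word(x: int) -> bool:
    """True if x can be a signed 16-bit operand for muls.w."""
    return -32768 <= x <= 32767


def high_word_258(diff: int, alpha: int) -> int:
    """Single-multiply form: (diff * alpha * 258) >> 16."""
    return (diff * alpha * 258) >> 16


def high_word_129(diff: int, alpha: int) -> int:
    """Multiply by alpha * 129, shift left once, then take the high word."""
    product = diff * (alpha * 129)
    return (product << 1) >> 16


def variants_agree() -> bool:
    """Compare both forms for every pixel difference and alpha 0..254."""
    return all(
        high_word_258(diff, alpha) == high_word_129(diff, alpha)
        for alpha in range(255)
        for diff in range(-255, 256)
    )


if __name__ == "__main__":
    print("129 * 254 =", 129 * 254, "fits:", fits_signed_word(129 * 254))
    print("variants agree:", variants_agree())
```

Alpha 255 is left out on purpose: 255 * 129 no longer fits a signed word, which is why it has to be handled elsewhere.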

AGS · 02 September 2014, 16:09

Yes they are inside. Here's what it looks like. None of the transparent pixels are drawn.

Mrs Beanbag · 02 September 2014, 16:13

and the correct result?

AGS · 02 September 2014, 16:23

Here:

JimDrew · 02 September 2014, 16:35

If you guys don't mind burning 64K of memory, this would be way faster using a simple lookup table.
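
One way such a table could look, as a Python sketch rather than anything posted in this thread: 256 x 256 one-byte entries (65,536 bytes, matching the 64K figure), indexed by alpha and pixel value and holding value * alpha / 255. The layout, the names, and the way a blend is assembled from two lookups are all assumptions.

```python
def build_table() -> bytes:
    """table[alpha * 256 + value] = value * alpha // 255."""
    return bytes(
        value * alpha // 255
        for alpha in range(256)
        for value in range(256)
    )


def blend_lut(table: bytes, src: int, dst: int, alpha: int) -> int:
    """Source weighted by alpha plus destination weighted by 255 - alpha."""
    return table[alpha * 256 + src] + table[(255 - alpha) * 256 + dst]


if __name__ == "__main__":
    table = build_table()
    print(len(table), blend_lut(table, 200, 50, 128))
```

Because each lookup truncates separately, the sum in this sketch can come out one below the exact result; the per-pixel work is two table reads and an add, with no multiply or divide.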

AGS · 02 September 2014, 17:07

How can I measure the speed? It's running in an OS app. Reading the seconds and micros fields in the intbase did not reveal much. The routine seems to finish too quickly for Intuition to update those fields.

alkis · 02 September 2014, 17:19

call it a gazillion times
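
A sketch of that idea in Python (purely illustrative; the code in this thread is 68k asm, and `blend_row` is a hypothetical stand-in for the routine being timed): run the routine many times, time the whole batch with a coarse clock, and divide by the number of iterations.

```python
import time


def blend_row(src: bytes, dst: bytearray, alpha: int) -> None:
    """Hypothetical stand-in for the routine under test."""
    for i, s in enumerate(src):
        d = dst[i]
        dst[i] = (d + (((s - d) * alpha * 258) >> 16)) & 0xFF


def time_per_call(iterations: int = 3000) -> float:
    """Average seconds per call, measured over the whole batch."""
    src = bytes(range(256))
    start = time.perf_counter()
    for _ in range(iterations):
        dst = bytearray(256)
        blend_row(src, dst, 128)
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    print("seconds per call:", time_per_call())
```

With enough iterations the total run time becomes long enough that even a clock with one-second resolution gives a usable average.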

Mrs Beanbag · 02 September 2014, 17:38

so where is D2 being read from memory?

AGS · 02 September 2014, 17:41

AH! It was cleared after the read! Thus all partially transparent pixels became totally transparent. I am now measuring the speed.

AGS · 02 September 2014, 18:01

It works, but it turns out to be 5 seconds slower (66 vs 71) at 3000 iterations, at least here in FS-UAE.

ps: I think this is because the emulation makes no difference between muls.w and muls.l, so the additional lsl.l just takes more time. @Toni Wilen: What do you think?

AGS · 02 September 2014, 18:38

@Mrs Beanbag

I compared speed with the original muls/divs version and found that the muls/divs version is faster than the new optimized variant by 3 seconds. And additionally, when we draw the same alpha picture onto the screen over and over again, the result is surprising. Left is the variant with muls and divs, and right the new optimized variant. Seems like something is going wrong?
