Optimizing polygonfill bitcopy - Page 5

deimos · 27 November 2019, 20:20

Quote:

Originally Posted by TCH

Edit: If a program with the old algorithm runs 100 secs and with the new one it runs 95, and everything else is unchanged, then it's a 5% speedup. What's your problem?

Because if you're spending 90s in both cases in start up code, then it's not.

You always have a constant bit.

Then a bit that grows with the number of objects you draw, or whatever.

It's the second bit that's important.

Ideally you also get a growth that doesn't become exponential, and stuff that's outside this conversation.

But to measure improvement you have to remove that constant bit.
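
The arithmetic behind this can be sketched in a few lines of C. This is only an illustration using the numbers quoted in the two posts (100 s, 95 s, and a hypothetical 90 s of start-up code); the function names are made up for the example.

```c
#include <stdio.h>

/* Percentage saved when a run takes old_time before and new_time after. */
double percent_saved(double old_time, double new_time)
{
    return 100.0 * (old_time - new_time) / old_time;
}

/* Same saving, but with a constant part (start-up, set-up) removed from
   both timings first, so only the part that actually changed is compared. */
double percent_saved_without_constant(double old_total, double new_total,
                                      double constant)
{
    return percent_saved(old_total - constant, new_total - constant);
}

/* Prints both views for the 100 s / 95 s example with 90 s of start-up. */
void print_example(void)
{
    printf("whole program: %.1f%%\n", percent_saved(100.0, 95.0));
    printf("changed part:  %.1f%%\n",
           percent_saved_without_constant(100.0, 95.0, 90.0));
}
```

With these numbers the whole program gets 5% faster, while the part that changed went from 10 s to 5 s. How large the constant part really is in the program under discussion is exactly what is not known at this point in the thread.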

a/b · 27 November 2019, 20:21

To go back to the original discussion...
One way of speeding it up further would probably be to switch loop order (1..100 outer and 1..3 inner is slower than 1..3 outer and 1..100 inner), since there are 3 nested loops and the inner 2 have small repetition count.
Also, I usually hardcode the depth (=> unrolling), which should be totally reasonable for a game/demo, and that would free another register and eliminate swaps.
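
A rough way to see the loop order argument, as a sketch rather than a cycle-exact model: the loop body runs the same number of times either way, but the inner loop has to be set up again once per outer iteration. The helper below is hypothetical and only counts those set-ups.

```c
#include <stdio.h>

/* Counts how often the inner loop has to be set up again (counter reload,
   pointer fix-up, falling out of the loop) for a given nesting order.
   The body itself runs outer * inner times either way. */
unsigned long inner_loop_setups(unsigned long outer, unsigned long inner,
                                unsigned long *body_runs)
{
    unsigned long setups = 0;
    unsigned long i, j;

    *body_runs = 0;
    for (i = 0; i < outer; i++) {
        setups++;                 /* inner counter is reloaded here */
        for (j = 0; j < inner; j++)
            (*body_runs)++;       /* the real work */
    }
    return setups;
}
```

Calling it with (100, 3) gives 100 set-ups for 300 body runs, while (3, 100) gives 3 set-ups for the same 300 body runs. How much each set-up costs on a real 68k depends on the code around the loop, so this says nothing about the size of the gain.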

TCH · 27 November 2019, 20:26

@deimos:
Okay, now i understand what you wanted: we're back at the start again. You're saying that the direct approach is better for measurements, and that measuring the change by comparing the whole runtime of two versions is flawed and will not show the true performance gain. And again you do without actual countermeasures. You may be right, but during the time you stated this for the umpteenth time, you could change the code and measure it by the direct way.
But now i am really curious: what would have such an impact on the precision, if it is measured by comparing total running times?

Quote:

Originally Posted by deimos

Because if you're spending 90s in both cases in start up code, then it's not.

You always have a constant bit.

Then a bit that grows with the number of objects you draw, or whatever.

It's the second bit that's important.

Ideally you also get a growth that doesn't become exponential, and stuff that's outside this conversation.

But to measure improvement you have to remove that constant bit.

You're right, it's a constant bit. And if x + c < y + c, then x < y. I cannot imagine why the rest of the program would run slower, if i use the asm routines here. The number of drawn objects is always 64. Code is public, see for yourself.

@a/b:
This is a three loop copy. We could unroll the depth by cloning the routine for each depth from 1 to 8, but how would you unroll the innermost loop? I've tried your trick with the other routine, that i unroll it for 8 and do the rest, but it did not get faster. Or did you mean something else?

deimos · 27 November 2019, 20:43

Quote:

Originally Posted by TCH

And again you do without actual countermeasures.

What? You want me to give you a written example?

Quote:

Originally Posted by TCH

but during the time you stated this for the umpteenth time, you could change the code and measure it by the direct way.

I'm here to help, not to do it for you.

Quote:

But now i am really curious: what would have such an impact on the precision, if it is measured by comparing total running times?

Ah well, you see, that's the thing you don't seem to be understanding. It was never about precision, it was about accuracy. If I could draw you a graph here I would. But I urge you to just invest the effort on getting rid of this thing that's measuring from the outside and get yourself into the position where you know the constant time bit vs the bit that's driven by algorithm / implementation choices. Because you've not shared analysable numbers I have to go by gut, but I don't believe the percentage numbers you've been sharing really represent the improvement that's been made.

TCH · 27 November 2019, 21:14

Quote:

Originally Posted by deimos

What? You want me to give you a written example?

No. If i could write a program which can measure the runtime of an external program, then i could've done this myself too. I just wanted to point out that, since you are so obsessed with convincing me in this matter, the amount of time you invested into convincing would have been enough to directly provide countermeasures.

Quote:

Originally Posted by deimos

I'm here to help, not to do it for you.

Help? By futilely trolling me with useless "advice" about switching to windows and calling me things if i don't? Since i've asked about the availability of GCC 8.3 on other platforms (without any backward intentions; i've just asked if there is a version i could use), you're really triggered by this, at least you're giving this impression by your remarks...

Quote:

Originally Posted by deimos

Ah well, you see, that's the thing you don't seem to be understanding. It was never about precision, it was about accuracy. If I could draw you a graph here I would.

You can. You draw an image and link it here.

Quote:

Originally Posted by deimos

But I urge you to just invest the effort on getting rid of this thing that's measuring from the outside and get yourself into the position where you know the constant time bit vs the bit that's driven by algorithm / implementation choices. Because you've not shared analysable numbers I have to go by gut, but I don't believe the percentage numbers you've been sharing really represent the improvement that's been made.

I already conceded that the percentages may not be the closest to reality, but still: what if we measure it directly and it turns out that the gain was not 3.4%, but 3.6 or 3.2? You claim that the gain may be much more. Please name one sane reason: what would cause such a big hiccup outside the loop, and how, that would only occur when i use the assembly routine posted here?

deimos · 27 November 2019, 21:36

Quote:

Originally Posted by TCH

measure the runtime of an external program

But this was one of the key points, you're measuring from outside, I asked you to measure from inside.
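
For readers following along, "measuring from inside" could look roughly like the sketch below. It uses the standard C `clock()` function as a stand-in for whatever timer is actually available (its resolution depends on the C library), and `routine_under_test` is a hypothetical placeholder, not the fill routine from this thread.

```c
#include <stdio.h>
#include <time.h>

/* Stand-in for the routine under test; hypothetical, not from the thread. */
static volatile unsigned long sink;

void routine_under_test(void)
{
    unsigned long i;

    for (i = 0; i < 100000UL; i++)
        sink += i;
}

/* Runs the routine `runs` times and returns only the time spent inside
   those calls, in seconds. Start-up and shutdown are not included. */
double time_routine(void (*routine)(void), unsigned long runs)
{
    clock_t start, end;
    unsigned long i;

    start = clock();
    for (i = 0; i < runs; i++)
        routine();
    end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

The program would call `time_routine` once for the C version and once for the asm version and print both numbers, which is the "output two numbers (C vs asm)" form asked for earlier in the thread.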

Quote:

Originally Posted by TCH

Help? By futilely trolling me with useless "advice" about switching to windows and calling me things if i don't? Since i've asked about the availability of GCC 8.3 on other platforms (without any backward intentions; i've just asked if there is a version i could use), you're really triggered by this, at least you're giving this impression by your remarks...

You were triggered, not me:

Quote:

Originally Posted by TCH

I do not own any copy of windows, but even if i would, i'd rather tidy the flat.

And I remind you, I offered to help by building and running your code "If you don't do Windows, and if your code can run and output two numbers (C vs asm) for valid comparison". But you wouldn't do that. And then you didn't even provide a makefile (the most basic of courtesies), instead you expected me to find this 'amtime' thing, do some waffley handwavey thing to compile the sources, then modify your code to run the test?

Quote:

Originally Posted by TCH

the percentages may not be the closest to reality

I'm not convinced you've even tried to understand.

I did try to help. Really.

hooverphonique · 27 November 2019, 21:44

Quote:

Originally Posted by ross

https://franke.ms/cex/

Interesting - I actually earlier pondered asking the compiler explorer team to include an m68k gcc on godbolt.org

Now we just need it to generate motorola syntax

TCH · 27 November 2019, 22:11

Quote:

Originally Posted by deimos

But this was one of the key points, you're measuring from outside, I asked you to measure from inside.

I get it, but this part was the answer to your question about whether i expect you to write it instead of me.

Quote:

Originally Posted by deimos

You were triggered, not me:

No. If i were triggered, i would already be out of here without a word, because that is better than unnecessarily engaging in a flamewar. I just wanted to deflect your stabs with jokes. Evidently i failed. Either you did not get the joke, or my jokes suck. Nobody is perfect.

Quote:

Originally Posted by deimos

And I remind you, I offered to help by building and running your code "If you don't do Windows, and if your code can run and output two numbers (C vs asm) for valid comparison". But you wouldn't do that.

Now, that is terribly wrong. In this post, i have provided all the sources you needed for that.

Quote:

Originally Posted by deimos

And then you didn't even provide a makefile (the most basic of courtesies), instead you expected me to find this 'amtime' thing, do some waffley handwavey thing to compile the sources, then modify your code to run the test?

Because you do not need a makefile. You just need to assemble the two 68k files (

vasmm68k_mot ClearBlock32.68k -spaces -Fhunk -o ClearBlock32.o && vasmm68k_mot PolygonBitmapToPlanes32.3.68k -spaces -Fhunk -o PolygonBitmapToPlanes32.3.o

) and then compile the C file together with the object files (

m68k-amigaos-gcc -O2 polygon1.c *.o -o polygon1

). It is true that i did not say this explicitly, but i thought this was trivial.

As for the 'amtime', i've released it in this topic a few days ago. It was also on coders general, so i thought you've seen it, but i admit, this was not trivial, so i apologize for this mistake.

Quote:

Originally Posted by deimos

I'm not convinced you've even tried to understand.

Wrong. I've tried.

Quote:

Originally Posted by deimos

I did try to help. Really.

I did try to listen to your arguments. No sarcasm, you've got me wrong. I remind you that when you told me about splitting concave polygons into convex ones, and by that escaping the ordering or deduplicating of the horizontal point collector arrays, i immediately admitted that you are right, because it was obvious that this way i only have two points per row; it's just that this solution did not occur to me. I told you before, i can be convinced, but you did not give me any believable reason for what would cause that vast hiccup. For me it currently seems that we have C for the constant part, X for algorithm one, Y for algorithm two, and if X + C < Y + C several times over, then X < Y. This is plain math and i've told you this above, but you ignored it.

So, i tried to cooperate. Really.

deimos · 27 November 2019, 22:47

Quote:

Originally Posted by TCH

you do not need a makefile. You just need to assemble the two 68k files (

vasmm68k_mot ClearBlock32.68k -spaces -Fhunk -o ClearBlock32.o && vasmm68k_mot PolygonBitmapToPlanes32.3.68k -spaces -Fhunk -o PolygonBitmapToPlanes32.3.o

) and then compile the C file together with the object files (

m68k-amigaos-gcc -O2 polygon1.c *.o -o polygon1

). It is true that i did not say this explicitly, but i thought this was trivial.

I'm not at a PC anymore, so I'll let this rest, except to ask why you do all this instead of typing 'make'?

TCH · 27 November 2019, 23:02

Of course i don't type this by hand all the time, i have a small buildscript. (With some additional parts, for switching OS or CPU and other stuff, but those parts are irrelevant here.)

But based on your comments, you use windows, so a POSIX shell script would be useless for you, and i have no idea how i would have to do these calls and any related stuff under windows, so i could not have provided you a batch file. As for providing a makefile, i also have no idea whether a UNIX makefile even works under windows or not.

It's just that it's only three lines for putting the C together with the two assembly files, and i thought you could write your own batch file or makefile.

Antiriad_UK · 27 November 2019, 23:26

Quote:

Originally Posted by hooverphonique

Interesting - I actually earlier pondered asking the compiler explorer team to include an m68k gcc on godbolt.org

Now we just need it to generate motorola syntax

I just tried this and it blew my mind what code it gave for a for loop. Step one: break out the link instruction, wtf

roondar · 28 November 2019, 10:53

About the time measuring... Measuring the total execution time of the program rather than just the part of the code that you've changed will obviously report a lower percentage improvement than measuring just the code that changed. After all, you're adding in (effectively) static overhead for each run.

This may be a valid way to measure overall program performance (which may indeed be what you want to know), but won't measure the performance gain of the new code itself fairly.

However, this can be mostly fixed - simply make sure that the part you wish to know about (the fill in this case) takes up a fairly large chunk of time. Rather than having the code & program running a few seconds, have it run a minute (or perhaps even more if it's a large or floppy disk based program) or so.

The goal is to effectively make any overhead of the OS/init trivial in comparison to the code you want to test. This should give a fairly accurate result. Perhaps this is a better approach if you don't want to measure the run time inside of the program itself?
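
To illustrate the effect, here is a small model in C. It assumes the run is a fixed overhead plus a number of identical passes of the code being tested; the function name and all numbers in the note below are made up for the example and are not measurements from this thread.

```c
/* Percentage improvement an outside timer reports when the run consists of
   a fixed overhead plus `iterations` passes of the code being tested.
   old_pass and new_pass are the times of one pass before and after. */
double reported_gain(double overhead, double old_pass, double new_pass,
                     unsigned long iterations)
{
    double old_total = overhead + old_pass * (double)iterations;
    double new_total = overhead + new_pass * (double)iterations;

    return 100.0 * (old_total - new_total) / old_total;
}
```

With 2 s of overhead and a pass that drops from 10 ms to 9 ms (a true gain of 10%), 100 passes report about 3.3%, while 10000 passes report about 9.8%. The reported figure approaches the true one as the tested code dominates the run, but never quite reaches it.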

a/b · 28 November 2019, 15:49

New try: using d/y/x loops instead of y/d/x.

Modulo must be calculated *differently*!! Instead of
(bitmap_width*bitmap_height-blit_width)>>2
you have to pass this in d2
(bitmap_width*(bitmap_height-blit_height))>>2

Code:

	movem.l	d2-d7/a2-a6,-(a7)
	asl.l	#2,d2		; longwords to bytes

	move.w	d1,d7
	asl.w	#2,d7		; longwords to bytes
	sub.w	d7,d3
	move.w	d3,a6		; a6 = Rowsize-Width<<2;

	subq.w	#1,d0		; Height--;
	subq.w	#1,d1		; Width--;
	subq.w	#1,d4		; Depth--;
	move.w	d0,a5

c_p:	move.l	8<<2(a2),a4	; NextPattern
	move.l	a1,a3		; SrcPtr = TempArea;
	move.l	(a2)+,d6	; CurrentPattern

	move.w	a5,d3		; HeightCounter
c_h:	move.w	d1,d5		; WidthCounter
c_w:	move.l	(a0),d7
	move.l	d6,d0
	eor.l	d7,d0
	and.l	(a3)+,d0
	eor.l	d7,d0
	move.l	d0,(a0)+	; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);
	dbf	d5,c_w		; if (--WidthCounter >= 0) goto c_w;

	adda.l	a6,a0		; DestPtr += RowSize-Width<<2;
	adda.l	a6,a3		; SrcPtr  += RowSize-Width<<2;
	exg	d6,a4		; swap patterns
	dbf	d3,c_h		; if (--HeightCounter >= 0) goto c_h;

	adda.l	d2,a0		; DestPtr += Modulo<<2;
	dbf	d4,c_p		; if (--Depth >= 0) goto c_p;

	movem.l	(a7)+,d2-d7/a2-a6
	rts
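
For readers who do not read 68k assembly fluently, a C rendering of the same plane/row/longword loop order might look like the sketch below. It is a reading of the listing, not the author's C code: sizes are in longwords rather than bytes, the names are invented, and the layout of the pattern table (two row patterns per plane, eight entries apart) is an assumption taken from the `8<<2(a2)` offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Take pattern bits where mask is 1, keep dest bits where mask is 0.
   Equal to (dest & ~mask) | (pattern & mask); this is the eor/and/eor
   sequence of the listing. */
uint32_t masked_merge(uint32_t dest, uint32_t pattern, uint32_t mask)
{
    return ((pattern ^ dest) & mask) ^ dest;
}

/* Plane / row / longword loop order, as in the listing.
   patterns[p] and patterns[p + 8] are assumed to be the two row patterns
   for plane p. */
void copy_planes(uint32_t *dest, const uint32_t *temp,
                 const uint32_t *patterns,
                 size_t width, size_t height, size_t depth,
                 size_t row_longs, size_t modulo_longs)
{
    size_t p, y, x;

    for (p = 0; p < depth; p++) {
        uint32_t current = patterns[p];
        uint32_t next = patterns[p + 8];
        const uint32_t *src = temp;       /* mask restarts for every plane */

        for (y = 0; y < height; y++) {
            uint32_t swap;

            for (x = 0; x < width; x++) {
                *dest = masked_merge(*dest, current, *src++);
                dest++;
            }
            dest += row_longs - width;    /* skip to the next row */
            src += row_longs - width;
            swap = current;               /* alternate the two patterns */
            current = next;
            next = swap;
        }
        dest += modulo_longs;             /* skip to the next plane */
    }
}
```

The two skips after the loops are why the modulo has to change with the loop order: once a plane's rows are done, the destination pointer has to be advanced over the rest of that plane.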

a/b · 28 November 2019, 20:27

Getting annoyed? Good, there is more ;P.

Code:

	movem.l	d2-d7/a2-a6,-(a7)
	asl.l	#2,d2		; longwords to bytes

	move.w	d1,d7
	asl.w	#2,d7		; longwords to bytes
	sub.w	d7,d3
	move.w	d3,a6		; a6 = Rowsize-Width<<2;

	subq.w	#1,d0		; Height--;
	subq.w	#1,d1		; Width--;
	subq.w	#1,d4		; Depth--;
	move.w	d0,a5
	move.w	d1,a4

c_p:	move.l	8<<2(a2),d1	; NextPattern
	move.l	a1,a3		; SrcPtr = TempArea;
	move.l	(a2)+,d6	; CurrentPattern

	move.w	a5,d3		; HeightCounter
	lsr.w	#1,d3
	bcc.s	odd_h
c_h:
	move.w	a4,d5		; WidthCounter
c_w1:	move.l	(a0),d7
	move.l	d6,d0
	eor.l	d7,d0
	and.l	(a3)+,d0
	eor.l	d7,d0
	move.l	d0,(a0)+	; *DestPtr++ = (*DestPtr&~Temp)|(CurrentPattern&Temp);
	dbf	d5,c_w1	; if (--WidthCounter >= 0) goto c_w1;

	adda.l	a6,a0		; DestPtr += RowSize-Width<<2;
	adda.l	a6,a3		; SrcPtr  += RowSize-Width<<2;

odd_h:	move.w	a4,d5		; WidthCounter
c_w2:	move.l	(a0),d7
	move.l	d1,d0
	eor.l	d7,d0
	and.l	(a3)+,d0
	eor.l	d7,d0
	move.l	d0,(a0)+	; *DestPtr++ = (*DestPtr&~Temp)|(NextPattern&Temp);
	dbf	d5,c_w2	; if (--WidthCounter >= 0) goto c_w2;

	adda.l	a6,a0		; DestPtr += RowSize-Width<<2;
	adda.l	a6,a3		; SrcPtr  += RowSize-Width<<2;
	dbf	d3,c_h		; if (--HeightCounter >= 0) goto c_h;

	adda.l	d2,a0		; DestPtr += Modulo<<2;
	dbf	d4,c_p		; if (--Depth >= 0) goto c_p;

	movem.l	(a7)+,d2-d7/a2-a6
	rts
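
The idea in this version, handling rows in pairs so that each pattern stays in its own register and the swap disappears, can be sketched in C for a single plane as follows. This is not a line-by-line translation: names are invented, and the sketch handles the leftover row of an odd height at the end, whereas the listing enters at `odd_h` at the start.

```c
#include <stddef.h>
#include <stdint.h>

/* Merge one row: pattern bits where the mask is 1, old bits elsewhere. */
static void merge_row(uint32_t *dest, const uint32_t *mask,
                      uint32_t pattern, size_t width)
{
    size_t x;

    for (x = 0; x < width; x++)
        dest[x] = ((pattern ^ dest[x]) & mask[x]) ^ dest[x];
}

/* Fill one plane two rows per pass. Each pattern stays in its own
   variable, so nothing has to be swapped between rows. */
void fill_plane_paired(uint32_t *dest, const uint32_t *mask,
                       uint32_t even_pattern, uint32_t odd_pattern,
                       size_t width, size_t height, size_t row_longs)
{
    size_t y = 0;

    for (; y + 1 < height; y += 2) {
        merge_row(dest + y * row_longs, mask + y * row_longs,
                  even_pattern, width);
        merge_row(dest + (y + 1) * row_longs, mask + (y + 1) * row_longs,
                  odd_pattern, width);
    }
    if (y < height)               /* leftover row when height is odd */
        merge_row(dest + y * row_longs, mask + y * row_longs,
                  even_pattern, width);
}
```

The price is a second copy of the inner loop, which is the usual trade of code size for fewer instructions per row.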

TCH · 30 November 2019, 00:23

@roondar:
The thing with the overhead the OS adds is that it exists in both measurements and it's always almost the same, so there should not be any big difference.

But, okay, i gave in. In a neighbouring topic i just released two timer units, here is the internal measuring by public demand.

First however, i measured the difference with the external tool. I've run the program ten times in both cases.

Here are the times of the C algorithm:

And here is the ASM (not the newest two that a/b provided yesterday, but the one before them):

Calculating an average from the times and then calculating the difference gives 3.7466% as the result.

Now, deimos claimed:

Quote:

Originally Posted by deimos

you gave only the time for the entire executable, taken with an outside tool, giving the impression that the assembly optimisations where giving minute 3-4% improvements, when the reality was that they were probably 5 or 10 times that.

So, according to deimos, we should get results showing that the speed gain is between 18.733% and 37.466% (5 to 10 times 3.7466%).

Here are the internally measured runs (modified C file here: http://oscomp.hu/depot/polygon1.c):

After calculating the averages, the difference here is 5.3153%. Which means, if we assume that this is really the more accurate measurement, that the gain of the assembly routine was actually 1.5687 percentage points higher than the external measurements have shown. This is nowhere near the 5 to 10 times bigger numbers deimos predicted. This entire discussion about the measuring was a big waste of time...
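
The two percentages also say something about how the run time is split. If the same absolute saving shows up as 3.7466% of the whole run and as 5.3153% of the internally timed part, then the timed part is the ratio of the two, as the sketch below computes. This assumes that all of the saving happens inside the timed part and that both sets of runs are comparable; the function names are made up.

```c
/* Share of the total run time taken by the internally timed part, given
   the gain seen from outside and the gain seen from inside (both in %).
   Assumes both measurements saw the same absolute saving. */
double timed_share(double external_gain, double internal_gain)
{
    return external_gain / internal_gain;
}

/* Gain an inside measurement would show if the timed part were only
   `share` (0..1) of the total run and the outside gain is external_gain. */
double internal_gain_for_share(double external_gain, double share)
{
    return external_gain / share;
}
```

For these numbers the timed part comes out at roughly 70% of the run. The 18.733% figure would have needed the timed part to be only 20% of the run.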

@a/b:
Nope, i am not annoyed, i am always grateful for any help, thank you. However, this version crashed. Did i understand correctly that

Modulo

should be

(320*(256-h))>>2

?

a/b · 30 November 2019, 00:41

Oh, crap... My bad, it's supposed to be /32 (or (/8)>>2), which is pixels to longwords. I have it written correctly in my test code, but I only copy/pasted the actual function.
Old/new modulo, blit is 256x128px:

Code:

;	move.l	#Width*Height/32-256/32,d2
	move.l	#Width*(Height-Height/2)/32,d2
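
As a worked example of the corrected modulo, here is the same calculation in C. It assumes the 320x256 bitmap from the question above, the 128 pixel high blit mentioned here, and bitplanes stored one after another; the function names are invented for the sketch.

```c
/* Longwords to skip after the last blitted row of one plane to reach the
   first blitted row of the next plane. Width is in pixels, so /32 turns
   pixels into longwords. */
unsigned long plane_modulo_longs(unsigned long bitmap_width_px,
                                 unsigned long bitmap_height,
                                 unsigned long blit_height)
{
    return bitmap_width_px * (bitmap_height - blit_height) / 32;
}

/* The same value in bytes, which is what the routine adds to the
   destination pointer after its own asl.l #2. */
unsigned long plane_modulo_bytes(unsigned long bitmap_width_px,
                                 unsigned long bitmap_height,
                                 unsigned long blit_height)
{
    return plane_modulo_longs(bitmap_width_px, bitmap_height,
                              blit_height) * 4;
}
```

For 320, 256 and 128 this gives 1280 longwords, or 5120 bytes, which is the 128 remaining rows of 40 bytes each. Using >>2 instead of /32 gives 10240, eight times too much, so the destination pointer runs far past the intended plane.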

re: annoyed. I was just joking, it's actually how I felt at that point. Instead of taking my time and doing it properly, I was optimizing it incrementally and making posts. It's not how I like to do things, but it's a very busy period for me (the last two months actually) and I was just trying to have some fun doing things I like (m68k asm) while taking breaks from work (writing "serious" code).

TCH · 30 November 2019, 00:59

Got it, now works. And wow, 4% gain, now the assembly routine is almost 10% faster than the C one.

Yeah, i hear you. Quite interesting though, that originally the planes were the outer cycle, but as i said in the opening post, it was 2x slower. I really have to analyse this code, but time is scarce here too... Anyways, thanks again.

roondar · 30 November 2019, 15:45

Quote:

Originally Posted by TCH

@roondar:
The thing with the overhead the OS adds is that it exists in both measurements and it's always almost the same, so there should not be any big difference.

Please understand I'm not trying to make a big deal out of this. But I respectfully disagree here. In my view your figures show a pretty big margin of error for measuring using amtime:

Quote:

Calculating an average from the times and then calculating the difference gives 3.7466% as the result.
<...>
After calculating the averages, the difference here is 5.3153%.

That's a difference in measured percentages of over 40%. Which would seem to me to prove rather conclusively that measuring internally gives a much more accurate result.

It all boils down to perspective - you're looking purely at the level of improvement over the original code. In that case, 5.3% vs 3.7% isn't a big deal and you're correct in pointing that out. I was looking at how accurate the measurement was for seeing the improvement in the algorithm you changed, in which case reporting 5.3% instead of 3.7% is actually a pretty big 'error'.

Quote:

Which means - if we assume, that this is really the more accurate measuring

It is more accurate. No one was questioning that, were they?
I thought the difference of opinion was not about what was more accurate but rather whether the 'amtime' option was 'good enough'.

Personally, I think your example showed it wasn't really. But your mileage might vary and I respect your opinion in the matter. I was only trying to offer an alternative, not start an argument.

Quote:

This is nowhere near the 5 to 10 times bigger numbers deimos predicted. This entire discussion about the measuring was a big waste of time...

I'd say it (and indeed the whole thread) actually has been rather illuminating. It has shown that adding assembly to C might not always make all that much difference or sense and that modern C compilers are quite viable for all sorts of Amiga development that might previously be considered 'impossible' or 'low performance'.

TCH · 30 November 2019, 17:16

Quote:

Originally Posted by roondar

It is more accurate. No one was questioning that, were they?

No, at least i did not. But maybe i was in error. Now i am not sure if this approach is really that reliable:

It shows ~3-~3% differences for the very same routines. With the external measure, the difference between runs was under 0.1%.

Quote:

Originally Posted by roondar

I'd say it (and indeed the whole thread) actually has been rather illuminating. It has shown that adding assembly to C might not always make all that much difference or sense and that modern C compilers are quite viable for all sorts of Amiga development that might previously be considered 'impossible' or 'low performance'.

I agree with this. But this experience is entirely independent of the discussion about the measure methods.

roondar · 30 November 2019, 22:12

Quote:

Originally Posted by TCH

No, at least i did not. But maybe i was in error. Now i am not sure if this approach is really that reliable:

It shows ~3-~3% differences for the very same routines. With the external measure, the difference between runs was under 0.1%.

I don't know the cause for this discrepancy, but I am certain it has nothing to do with the difference between external and internal timing itself. Some of my own programs have had performance measuring done in the program itself and I only ever saw minute differences between runs (usually below 0.1%). Now, I can think of several reasons why you'd see such differences. But...

This discussion is getting fairly far off topic. I think everyone has said what they wanted on the issue, and if not: I suggest we either discuss the pros and cons of internal/external timing in the thread you made for the timing unit you wrote, or make a separate thread about it.

Quote:

I agree with this. But this experience is entirely independent of the discussion about the measure methods.

Well, not entirely - the method used could have caused an under reporting of the gains made. So IMHO it was useful to see if this was the case. As is, it turns out to be only a minor factor at best. But until doing the experiment this was not certain.

Anyway, back to seeing if better code can be figured out yet
