22 August 2018, 09:25 | #61 |
Inviyya Dude!
Join Date: Sep 2016
Location: Amiga Island
Posts: 2,774
|
I only want to drop by to say that I think it's nice someone is doing new 3D game stuff on OCS Amigas. We have a lot of 2D game dev going on again, but 3D was missing so far...
|
22 August 2018, 19:59 | #62 |
Registered User
Join Date: Oct 2009
Location: Salem, OR
Posts: 1,770
|
|
23 August 2018, 10:23 | #63 |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
It's surprising how many pixels you can push with the CPU.
A 320x200 pixel display area is 32k per frame in 16 colours, or 1.6 MB/s at 50 fps. At the lower end, say 24k for 8 colours and 600k/s at 25 fps. That's more realistic for an A500 that also needs to do some geometry processing. Games like NSP and Fighter Duel must be getting close to the bandwidth limits of the RAM. If I find the time I'd like to do a frame rate analysis of NSP, but my estimate would be 15-20 fps. |
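The figures in this post follow from one bit per pixel per bitplane. A minimal C sketch of the arithmetic, with helper names made up for illustration (they are not part of any Amiga API):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for one frame of a planar display: one bit per pixel per
 * bitplane. Hypothetical helper, for illustration only. */
uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bitplanes)
{
    assert(width % 8 == 0);            /* planar rows are whole bytes */
    return (width / 8) * height * bitplanes;
}

/* Bytes per second that must be written to redraw every pixel of every
 * frame at the given frame rate. */
uint32_t fill_bandwidth(uint32_t width, uint32_t height,
                        uint32_t bitplanes, uint32_t fps)
{
    return frame_bytes(width, height, bitplanes) * fps;
}
```

With 4 bitplanes at 50 fps this gives 1,600,000 bytes per second, and with 3 bitplanes at 25 fps it gives 600,000, matching the numbers quoted.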
23 August 2018, 13:29 | #64 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,414
|
|
23 August 2018, 15:17 | #65 |
Registered User
Join Date: Jun 2016
Location: UK
Posts: 428
|
The horizon on Fighter Duel is interesting. It looks like it uses several colours, which would be relatively expensive to draw. Considering the rest of the 3D display only uses a few colours, it would slow that drawing down a lot unless some trick was used.
|
24 August 2018, 09:46 | #66 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Update
Just so that anyone who's interested knows, I just got back to this after a bit of a forced break. I've done some work on the assembly version of the horizontal line routine and have knocked 5s off the 52s processing time. That's about 25% off that bit of code.
Most of the improvement came from changing a lot of .l instructions to .w. I think that, as C constants are always at least int sized, the compiler is forced to treat them as 32 bits. This causes a lot of things to be extended to longs before long operations are applied, and then the top half of the long result is discarded as the result is assigned to a word. I'm not sure how I can fix that in C, but if I could, I'd see a small improvement straight away, and the generated code would be a better starting point for an assembly re-implementation.
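One thing that may be worth trying in C is to keep every operand and intermediate in a 16-bit type and narrow each result explicitly. C still promotes the operands to int, but a compiler can then see that only the low 16 bits are ever used. Whether a given compiler actually emits .w instructions for this is an assumption that needs checking in the generated code. The sketch below uses a stand-in `WORD` typedef and made-up function names:

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t WORD;        /* stand-in for the Amiga WORD type */
#define WORDS_PER_ROW 20     /* 320 pixels / 16 pixels per word */

/* Word offset of the word holding pixel (x, y), written the natural way.
 * The constant and the promoted operands are all int here. */
WORD word_offset_plain(WORD x, WORD y)
{
    assert(x >= 0 && y >= 0);
    return y * WORDS_PER_ROW + (x >> 4);
}

/* Same calculation with each intermediate narrowed to 16 bits, so that
 * no step needs more than a word of result. */
WORD word_offset_narrow(WORD x, WORD y)
{
    assert(x >= 0 && y >= 0);
    WORD row = (WORD)(y * (WORD)WORDS_PER_ROW);
    WORD col = (WORD)(x >> 4);
    return (WORD)(row + col);
}
```

Both functions return the same value for on-screen coordinates, so the narrowed form can be swapped in and the compiler output compared.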
|
|
24 August 2018, 11:13 | #67 | |
Registered User
Join Date: Aug 2008
Location: Sintra, Portugal
Posts: 12
|
Quote:
This. Although the objects are simple, it's the fastest and smoothest I've ever seen on Amiga OCS, and manages full frame rate most of the time. Very playable too... |
|
24 August 2018, 13:46 | #68 | ||
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,414
|
Quote:
You see, if I understand what you wrote correctly, your drawing code takes 38*0.75=28.5% of the total time and your frame rate is now 100/47=2.13fps. Of this time, 28.5% is spent drawing. This would lead to a 'no-calculations frame rate' of (1/0.285)*2.13=7.5fps. And that seems like it's rather slow, even when accounting for the fact the CPU is used to draw.
Edit: reading this back, I'm fairly sure my calculated new percentage is not correct, but my question still stands even if I am a bit off with the numbers.
This all makes me wonder: what is your effective fill rate in bytes (or pixels if you prefer) per second?
Last edited by roondar; 24 August 2018 at 13:53. |
||
24 August 2018, 14:07 | #69 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Quote:
15 / 47 = 32%. But, your point stands, if the 15 + 17 weren't there, I'd have 15 seconds for 100 frames, or 6.67 fps, which isn't fast enough. What's my effective fill rate? Quick maths, looking at the image in the very first post in this thread and estimating the ball at 80px diameter, leads me to say approximately 10,000 pixels per second (at 6 bit planes). I'm not sure if that's what you meant. Edit: I don't think that maths is correct, the ball is about 5000 pixels, at 6.67 fps that's 33,000 pixels per second. I'll experiment with casting down to 16 bits when I get to the next step. And hopefully I can post my line draw code later today or tomorrow. |
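The corrected estimate in the edit can be reproduced with a few lines of C. This is only a sketch of the arithmetic, treating the ball as a filled disc of 80 pixels diameter; the function names are made up:

```c
#include <assert.h>

/* Frames per second given a frame count and the seconds taken. */
double frames_per_second(double frames, double seconds)
{
    assert(seconds > 0.0);
    return frames / seconds;
}

/* Area in pixels of a filled disc of the given diameter. */
double disc_area(double diameter)
{
    const double pi = 3.14159265358979323846;
    double r = diameter / 2.0;
    return pi * r * r;
}

/* Pixels filled per second when one such disc is drawn every frame. */
double pixels_per_second(double diameter, double frames, double seconds)
{
    return disc_area(diameter) * frames_per_second(frames, seconds);
}
```

For 100 frames in 15 seconds this gives about 6.67 fps and roughly 33,500 pixels per second, in line with the "about 5000 pixels" and "33,000 pixels per second" figures.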
|
24 August 2018, 14:42 | #70 | |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,414
|
Quote:
Without the code it's of course tricky to be certain how you implemented it, but assuming you've optimised the scanline algorithm to draw as many full words per step as possible (rather than separate pixels which would be very slow) that would lead to an optimal case of moving 248 words of data per frame. Something isn't right there. I understand much more is happening than just moving the data, but this result would indicate the code spends in the region of 500-600 cycles for every word drawn (@ 1bpl). Which feels like it's a very high amount. Last edited by roondar; 24 August 2018 at 14:46. Reason: Forgot a '3' somewhere ;) |
|
24 August 2018, 16:52 | #71 |
Registered User
Join Date: Jun 2015
Location: Germany
Posts: 1,918
|
Not sure anyone has mentioned it yet: you can define a bounding box for all your objects, project that onto the screen and determine the (word-aligned) regions where the updated bounding box and the preceding bounding box do not overlap. Then you just need to clear the non-overlapping segments of the bounding boxes for the next frame.
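A rough C sketch of that idea, assuming half-open pixel boxes and names invented here (`Box`, `word_align`, `stale_regions`). It splits the previous frame's box into at most four strips that the new box does not cover, which are the only areas that need clearing:

```c
#include <assert.h>

typedef struct { int x0, y0, x1, y1; } Box;  /* half-open: [x0,x1) x [y0,y1) */

Box make_box(int x0, int y0, int x1, int y1)
{
    Box b = { x0, y0, x1, y1 };
    return b;
}

/* Widen a box so both vertical edges fall on 16-pixel word boundaries. */
Box word_align(Box b)
{
    assert(b.x0 >= 0 && b.x0 <= b.x1 && b.y0 <= b.y1);
    b.x0 &= ~15;
    b.x1 = (b.x1 + 15) & ~15;
    return b;
}

/* Write the parts of prev not covered by cur into out[] (at most 4 boxes)
 * and return how many were written. */
int stale_regions(Box prev, Box cur, Box out[4])
{
    int n = 0;
    int ix0 = cur.x0 > prev.x0 ? cur.x0 : prev.x0;   /* intersection */
    int iy0 = cur.y0 > prev.y0 ? cur.y0 : prev.y0;
    int ix1 = cur.x1 < prev.x1 ? cur.x1 : prev.x1;
    int iy1 = cur.y1 < prev.y1 ? cur.y1 : prev.y1;

    if (ix0 >= ix1 || iy0 >= iy1) {                  /* no overlap at all */
        out[n++] = prev;
        return n;
    }
    if (prev.y0 < iy0) out[n++] = make_box(prev.x0, prev.y0, prev.x1, iy0); /* above */
    if (iy1 < prev.y1) out[n++] = make_box(prev.x0, iy1, prev.x1, prev.y1); /* below */
    if (prev.x0 < ix0) out[n++] = make_box(prev.x0, iy0, ix0, iy1);         /* left  */
    if (ix1 < prev.x1) out[n++] = make_box(ix1, iy0, prev.x1, iy1);         /* right */
    return n;
}

/* Total pixel area that still needs clearing. */
int stale_area(Box prev, Box cur)
{
    Box r[4];
    int n = stale_regions(prev, cur, r), area = 0;
    for (int i = 0; i < n; ++i)
        area += (r[i].x1 - r[i].x0) * (r[i].y1 - r[i].y0);
    return area;
}
```

If an object moves 16 pixels to the right between frames, only a one-word-wide strip of the old box remains to clear, rather than the whole box.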
|
24 August 2018, 19:17 | #72 |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Code:
        section "CODE", code

        public  _FastDrawHorizontalLine

; void FastDrawHorizontalLine(const PLANEPTR planes [], const WORD x1, const WORD x2, const WORD y, const WORD colour)
;
; a0 planes
; d0 x1
; d1 x2
; d2 y
; d3 colour

_FastDrawHorizontalLine
        movem.l d2-d7/a2-a5, -(a7)

        mulu.w  #20, d2         ; 20 words per 320 pixel row - aim to replace this multiply with an add

        move.w  d0, d4
        asr.w   #4, d4
        add.w   d2, d4

        move.w  d1, d5
        asr.w   #4, d5
        add.w   d2, d5

        and.w   #$000f, d0
        lsl.w   #1, d0
        lea     left_bits, a2
        move.w  (a2, d0.w), d0
        move.w  d0, a3

        and.w   #$000f, d1
        lsl.w   #1, d1
        lea     right_bits, a2
        move.w  (a2, d1.w), d1
        move.w  d1, a4

        and.w   d1, d0
        move.w  d0, a5

        moveq   #6, d6          ; 6 bitplanes
        bra     loop_over_planes

loop_over_planes_start
        move.l  (a0)+, a1
        move.w  d4, d7
        lsl.w   #1, d7
        lea     (a1, d7.w), a1

        lsr.b   d3
        bcc     1$
        moveq   #-1, d7
        bra     2$
1$      moveq   #0, d7
2$      cmp.w   d4, d5
        beq     do_overlap

do_left
        move.w  a3, d1
        move.w  (a1), d0
        tst.w   d7
        beq     1$
        or.w    d1, d0
        bra     2$
1$      not.w   d1
        and.w   d1, d0
2$      move.w  d0, (a1)+

do_middle
        move.w  d5, d0
        sub.w   d4, d0
        sub.w   #2, d0
        cmp.w   #17, d0
        bhi     do_right
        lsl.w   #2, d0
        move.l  duff(pc, d0.w), a2
        jmp     (a2)

        cnop    0, 4
duff    dc.l    2$
        dc.l    3$
        dc.l    4$
        dc.l    5$
        dc.l    6$
        dc.l    7$
        dc.l    8$
        dc.l    9$
        dc.l    10$
        dc.l    11$
        dc.l    12$
        dc.l    13$
        dc.l    14$
        dc.l    15$
        dc.l    16$
        dc.l    17$
        dc.l    18$
        dc.l    19$

19$     move.w  d7, (a1)+
18$     move.w  d7, (a1)+
17$     move.w  d7, (a1)+
16$     move.w  d7, (a1)+
15$     move.w  d7, (a1)+
14$     move.w  d7, (a1)+
13$     move.w  d7, (a1)+
12$     move.w  d7, (a1)+
11$     move.w  d7, (a1)+
10$     move.w  d7, (a1)+
9$      move.w  d7, (a1)+
8$      move.w  d7, (a1)+
7$      move.w  d7, (a1)+
6$      move.w  d7, (a1)+
5$      move.w  d7, (a1)+
4$      move.w  d7, (a1)+
3$      move.w  d7, (a1)+
2$      move.w  d7, (a1)+

do_right
        move.w  a4, d1
        move.w  (a1), d0
        tst.w   d7
        beq     1$
        or.w    d1, d0
        bra     2$
1$      not.w   d1
        and.w   d1, d0
2$      move.w  d0, (a1)+
        bra     loop_over_planes

do_overlap
        move.w  a5, d1
        move.w  (a1), d0
        tst.w   d7
        beq     1$
        or.w    d1, d0
        bra     2$
1$      not.w   d1
        and.w   d1, d0
2$      move.w  d0, (a1)+

loop_over_planes
        dbra    d6, loop_over_planes_start

        movem.l (a7)+, d2-d7/a2-a5
        rts

        section "DATA", data

        cnop    0, 4
left_bits
        dc.w    $ffff, $7fff, $3fff, $1fff, $0fff, $07ff, $03ff, $01ff
        dc.w    $00ff, $007f, $003f, $001f, $000f, $0007, $0003, $0001
right_bits
        dc.w    $8000, $c000, $e000, $f000, $f800, $fc00, $fe00, $ff00
        dc.w    $ff80, $ffc0, $ffe0, $fff0, $fff8, $fffc, $fffe, $ffff
Last edited by deimos; 24 August 2018 at 19:49. |
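A C reference model can be handy for checking the assembly against known inputs. The sketch below mirrors the left / middle / right / overlap structure of the routine and the two mask tables. The names `left_mask`, `right_mask`, `hline_plane` and `hline` are invented here, and it treats each bitplane as a plain array of 16-bit words rather than real chip RAM:

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_ROW 20   /* 320 pixels / 16 pixels per word */

/* Pixels from bit position p to the right edge of a word (left_bits[p]). */
uint16_t left_mask(int p)  { return (uint16_t)(0xFFFFu >> p); }

/* Pixels from the left edge of a word up to position p (right_bits[p]). */
uint16_t right_mask(int p) { return (uint16_t)(0xFFFFu << (15 - p)); }

/* Set (bit = 1) or clear (bit = 0) pixels x1..x2 inclusive on row y. */
void hline_plane(uint16_t *plane, int x1, int x2, int y, int bit)
{
    assert(0 <= x1 && x1 <= x2 && x2 < WORDS_PER_ROW * 16 && y >= 0);
    uint16_t *first = plane + y * WORDS_PER_ROW + (x1 >> 4);
    uint16_t *last  = plane + y * WORDS_PER_ROW + (x2 >> 4);
    uint16_t lm = left_mask(x1 & 15);
    uint16_t rm = right_mask(x2 & 15);

    if (first == last) {                 /* do_overlap: one shared word */
        uint16_t m = lm & rm;
        *first = bit ? (uint16_t)(*first | m) : (uint16_t)(*first & ~m);
        return;
    }
    *first = bit ? (uint16_t)(*first | lm) : (uint16_t)(*first & ~lm); /* do_left */
    for (uint16_t *p = first + 1; p < last; ++p)                       /* do_middle */
        *p = bit ? 0xFFFF : 0x0000;
    *last = bit ? (uint16_t)(*last | rm) : (uint16_t)(*last & ~rm);    /* do_right */
}

/* Draw the line in every bitplane, using one bit of colour per plane. */
void hline(uint16_t *planes[], int nplanes, int x1, int x2, int y, int colour)
{
    for (int i = 0; i < nplanes; ++i)
        hline_plane(planes[i], x1, x2, y, (colour >> i) & 1);
}
```

Running the same coordinates through this model and through the assembly, then comparing the resulting words, is one way to catch mask or offset mistakes without a debugger.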
24 August 2018, 22:19 | #73 |
Registered User
Join Date: May 2017
Location: Munich/Bavaria
Posts: 2,295
|
ok .. not sure about the rest, but for the multiplication you want to replace:
move it to registers and do two shifts (lsl.w #4, d3 and lsl.w #2, d2) and add both results... but still: I am sure we are missing something obvious, as it is way too slow |
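The two shifts work because 20 = 16 + 4. A small C sketch of the same decomposition, with a made-up function name, that can be checked against a plain multiply:

```c
#include <assert.h>
#include <stdint.h>

/* y * 20 computed as (y << 4) + (y << 2), i.e. 16y + 4y, mirroring the
 * copy / shift by 4 / shift by 2 / add sequence suggested above. */
uint16_t times20_shift(uint16_t y)
{
    assert(y <= 3276);                 /* keep the result within 16 bits */
    uint16_t y16 = (uint16_t)(y << 4);
    uint16_t y4  = (uint16_t)(y << 2);
    return (uint16_t)(y16 + y4);
}
```

Whether this beats the multiply on a real 68000 is worth measuring rather than assuming, since the shifts themselves are not free.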
24 August 2018, 23:13 | #74 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,414
|
Looking at your code and new times quoted, there are a few things to note. As before, I'm going to just look at the drawing part and pretend that the rest is not being done
1) Your new code runs at 10fps/6 bitplanes. It should draw at 15fps if you used 4 bitplanes/20 fps if you used 3 bitplanes. Probably even faster, as 6 bitplanes slows down the CPU by something on the order of 30% per frame (it loses 50% during the period where a six bitplane screen is shown, 0% where no screen is shown). Most, if not all, OCS Amiga 3D games run using a maximum of 4 bitplanes - the machine is just not fast enough to do 3D well in more than that. I'd suggest you try that.
2) Your indirect jump can be optimized. Each of the move statements take 2 bytes, so you don't need a lookup table. You should be able to do something like
Code:
        add.w   d0,d0               ; Adding is slightly faster on 68000 than a shift up to a shift of 2. Shifting three or more is faster using the shift commands.
        jmp     movetable(pc,d0.w)  ; Jump can be indirect
movetable
        move.w  d7,(a1)+
        move.w  d7,(a1)+
        ; etc
3) There is one other thing that might make it faster for bigger objects (as in, wider than 32 pixels on average). Right now, you are using move.w d7,(a1)+. If you draw mostly those larger objects, using move.l d7,(a1)+ could be quite a bit faster.
Code:
        move.w  dn,(an)+    ; 8 cycles per word => 16 cycles per long
        move.l  dn,(an)+    ; 12 cycles per long =>25% faster
Other than that, the middle part of the loop (which is where you should spend the most time and so I concentrated on it) is looking fine.
--
For the other parts, it might be possible to change part of the code so that you could drop a few tst.w and other branches. I haven't looked at it in great detail, but it feels you might be able to drop a few of those by reordering things. However, I feel the screen depth you used is probably the biggest hurdle right now. Maybe others can chip in and get us to see more optimizations than me though
The other code I might not be able to help with as much, as I haven't done any 3D stuff. I do know about 2D stuff and 68k, but not 3D.
Last edited by roondar; 24 August 2018 at 23:19. |
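To get a feel for how much the move.l change could save, the cycle counts quoted in this post (8 per move.w, 12 per move.l) can be plugged into a small C model. This is a back-of-envelope sketch only: the function names are made up, and it ignores bitplane DMA contention and the cost of choosing between the two paths:

```c
#include <assert.h>

enum { CYCLES_MOVE_W = 8, CYCLES_MOVE_L = 12 };  /* figures quoted in the post */

/* Cycles to fill `words` whole words using only move.w. */
int cycles_words_only(int words)
{
    assert(words >= 0);
    return words * CYCLES_MOVE_W;
}

/* Cycles using one move.l per pair of words, plus a single move.w when
 * the number of words is odd. */
int cycles_longs_first(int words)
{
    assert(words >= 0);
    return (words / 2) * CYCLES_MOVE_L + (words % 2) * CYCLES_MOVE_W;
}
```

For the widest middle run of 18 words this is 144 cycles against 108, while a one-word run costs the same either way, which fits the remark that the change mainly pays off for larger polygons.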
25 August 2018, 14:03 | #75 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Regarding 1, yes, I know, but I'm still hoping to make it work. I'd really like to keep the extra colours as I get some nice subtle shading effects and more realistic shadows, and I hope that if I have the top half of the screen in 6 bitplane mode and the bottom half for dashboard / controls in 3 or 4 bitplanes then the CPU will still have enough access to memory. If it doesn't work out then I can change this.
Regarding 2 & 3, I can do 2 right now, but if I decide to do 3, or to use movem.l for longer lines, then I guess I'll need the lookup table back, so I might leave it for now. At the moment all my polygons are small, and only around half of them have a middle part to fill. I don't yet know what the final mix will be, and I suspect this is something where I'll have to implement all the options and let them fight it out amongst themselves. I can imagine all sorts of complex combinations, but considering there's only 10 longs across the screen, perhaps a simple split into even and odd runs, where the odd runs get one extra move.w before falling into a duff's device of move.l's.
Regarding 4, these sound like easy things that I should do right now, and start to build up my list of similar quick fixes.
Regarding the other parts, I think I may be able to separate the do_overlap part (where the line begins and ends within the same word) and give it its own bitplane loop. That would eliminate one comparison and branch per loop, and maybe free up a register (of which I currently have none spare). I'll see what I can do today.
|
|
25 August 2018, 14:45 | #76 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,414
|
Right, so I misunderstood the drawing routine originally. I thought you called this routine once per line of the object being displayed. But you are in fact calling it once per line per polygon being displayed.
With this and your desire to keep running at 6 bitplanes in mind, I've thought up a few other things you could try, though one of them isn't going to help for readability much. As for the move.l idea, yeah that's only worthwhile when plotting big polygons.
1) This is the most involved suggestion. You could try to change the way you deal with drawing a polygon in the colour of your choice. Right now, you basically draw the entire polygon, plane by plane (alternating between drawing and erasing by choosing what value to use as source for the move.w's based on the colour per plane) using one routine. An option could be to do this only once, effectively drawing to a one bitplane polygon 'buffer'. Then combine this pre-drawn one bitplane version with the actual image. This will cause overhead in the form of drawing one additional bitplane of pixels, but will remove a lot of overhead by no longer needing a fairly involved drawing process for each plane but just for one plane. The remaining planes will lose all that overhead because they essentially become copy or mask operations.
2) This is easier to implement and might still help quite a bit. Currently, you calculate the Y offset using a mulu for every line of every polygon. This is not required. You could move this calculation out into whatever routine calls the line drawing routine. Because you draw line by line, all you'd need to do is multiply once before the start of drawing the polygon and then, for every line, add the modulo to Y, instead of adding one to Y. This way, a ten line polygon will save out on nine multiply commands. You can go one further and actually implement this all the way in your projection (I'm assuming the 3D coordinates you use are projected onto the screen and that they are not on a one-to-one scale here). This is more involved, but could remove the need to multiply from the drawing process entirely and keep it there where it's probably being used already.
3) A bit of a micro optimisation this, but you could opt to turn the routine into a macro instead of a called routine. A bsr (or jsr)/rts combo takes up quite a few cycles, especially given how often you call the routine (once per line of each polygon adds up). This will definitely reduce readability of code and isn't the neatest of things but it can help. Do note that this will save more cycles than my 4th suggestion in my earlier post.
4) Another micro optimisation, but consider stack usage. Each register pushed to & later pulled from the stack also costs something like 16 cycles. This is not a lot, but when you call routines very often, it does become worthwhile to check if the registers all need to be put onto the stack and if all of those that really do need to be on the stack actually need to be put there as long values.
Last edited by roondar; 25 August 2018 at 14:57. |
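Suggestion 1 can be modelled in C as follows. This is a sketch under assumptions: the polygon has already been drawn once into a one-bitplane `mask` buffer, the name `apply_mask` is invented here, and the post does not say whether the combine step would be done by the CPU or some other way:

```c
#include <assert.h>
#include <stdint.h>

/* Combine a pre-drawn one-bitplane mask with every destination plane.
 * Planes whose colour bit is 1 get the mask ORed in; planes whose colour
 * bit is 0 get the mask's pixels cleared. */
void apply_mask(uint16_t *planes[], int nplanes, const uint16_t *mask,
                int nwords, int colour)
{
    assert(nplanes > 0 && nwords >= 0);
    for (int p = 0; p < nplanes; ++p) {
        uint16_t *dst = planes[p];
        if ((colour >> p) & 1) {
            for (int i = 0; i < nwords; ++i)
                dst[i] |= mask[i];
        } else {
            for (int i = 0; i < nwords; ++i)
                dst[i] &= (uint16_t)~mask[i];
        }
    }
}
```

The edge masking and span logic then run once per polygon instead of once per plane, and each plane only needs a simple OR or AND pass over the affected words.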
25 August 2018, 15:11 | #77 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Yes, this is just a polygon fill, not a multiple polygon fill, which I looked at and decided (rightly or wrongly) were only suited to chunky displays where you could write a pixel at a time. Also, I can limit myself to convex, non-intersecting polygons, so one polygon at a time works for me.
1) Might actually be a really smart idea, I will give it proper consideration.
2) Already considered and planned, this is what I meant by the comment next to it "aim to replace this multiply with an add". Coherence.
3 & 4) I can't see any purpose of this code apart from supporting the scanline polygon fill, so once it's right (if point 1 doesn't change things) I'll bung it in place and make its register saving specific to the one use case instead of making it a general purpose subroutine.
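The "replace this multiply with an add" idea amounts to carrying the row offset from one scanline to the next. A minimal C sketch, with a made-up function that just records the offsets so they can be inspected:

```c
#include <assert.h>
#include <stdint.h>

#define WORDS_PER_ROW 20   /* 320 pixels / 16 pixels per word */

/* Store in out[i] the word offset of row (y_top + i), using a single
 * multiply for the first row and an add for every row after it. */
void row_offsets(int y_top, int rows, uint16_t out[])
{
    assert(y_top >= 0 && rows >= 0);
    uint16_t offset = (uint16_t)(y_top * WORDS_PER_ROW);  /* the only multiply */
    for (int i = 0; i < rows; ++i) {
        out[i] = offset;
        offset = (uint16_t)(offset + WORDS_PER_ROW);      /* next row: just add */
    }
}
```

In the fill loop itself the offset would be passed straight to the line routine in place of y, so the per-line mulu disappears.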
|
|
25 August 2018, 16:25 | #78 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
@roondar, I'm just trying this out, as a last micro-optimisation before I step back and evaluate everything...
Quote:
My current code is as follows: Code:
do_middle
        move.w  d5, d0
        sub.w   d4, d0
        sub.w   #2, d0
        cmp.w   #17, d0
        bhi     do_right
        add.w   d0, d0
        jmp     movetable(pc, d0.w)
movetable
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
        move.w  d7, (a1)+
do_right
|
|
25 August 2018, 16:32 | #79 |
Registered User
Join Date: Jul 2015
Location: The Netherlands
Posts: 3,414
|
That's quite odd. I use code very similarly in a sprite update routine I once wrote (for very small sprites, the CPU beats the blitter for filling). And that works just fine.
Code:
        add.w   d1,d1               ; *2 to get the offset in words.
        jmp     .unrolled(pc,d1)    ; Jump to the right address here.
.unrolled
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        move.l  (a0)+,(a1)+
        ...etc
Edit: wait... I understand, you call into the jump table using the value of D0, which is numbered precisely opposite to what it would need to be. For 0 pixels, you would need a D0 of (<table entries>+1)*2, for max width it'd have to be 0. Might not work as an optimisation then.
Last edited by roondar; 25 August 2018 at 16:41. Reason: So yeah, the 68000 doesn't like odd addresses :P |
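The inversion can be modelled in C. The sketch below assumes a run of 18 two-byte move.w instructions that fall through to the end, as in the code posted earlier in the thread, and the function names are made up. Jumping further into the run executes fewer moves, so the offset has to be derived from the table size minus the wanted count:

```c
#include <assert.h>

enum { TABLE_ENTRIES = 18, BYTES_PER_MOVE = 2 };

/* Number of moves that execute when entering the run at a byte offset
 * and falling through to the bottom. */
int moves_executed(int offset)
{
    assert(offset >= 0 && offset <= TABLE_ENTRIES * BYTES_PER_MOVE);
    assert(offset % BYTES_PER_MOVE == 0);
    return TABLE_ENTRIES - offset / BYTES_PER_MOVE;
}

/* Byte offset to jump to so that exactly `count` moves execute. */
int jump_offset(int count)
{
    assert(count >= 0 && count <= TABLE_ENTRIES);
    return (TABLE_ENTRIES - count) * BYTES_PER_MOVE;
}
```

In the posted code d0 holds the number of middle words minus one, so using d0*2 directly as the offset runs all 18 moves when only one is wanted. Under this model the offset would be (17 - d0) * 2, though that is a reading of the posts rather than something confirmed in the thread.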
25 August 2018, 16:43 | #80 | |
It's coming back!
Join Date: Jul 2018
Location: comp.sys.amiga
Posts: 762
|
Edit: just saw your edit. Yes, higher numbers are at the start of the table. Maybe I could alter the maths to make it work, but considering everything I may be better off leaving it how it was.
Right now, I have no idea how to debug something like this. With my small polygons this bit of code doesn’t get called all the time, so if I had a debugger I’d set a breakpoint. Maybe in ‘87 I’d know where to start, but today? No chance. Quote:
Last edited by deimos; 25 August 2018 at 16:56. Reason: Too slow |
|